MyriMatch 2.1

For MS-MS Database Search

 

For MyriMatch 1.6 documentation, click here.

 

Table of Contents:

I.             Introduction

II.           Usage

a.    Basic

b.    Specifying static or dynamic mass modifications

c.    Configuration parameters guide

III.          Interpreting results

a.    Search-time output

b.    PepXML format guide

c.    Validation

 

 

Introduction

MyriMatch is a tool designed to take experimental data from shotgun proteomics experiments and compare those spectra against sequences in a known database of proteins. Whether the program is being run in a single-computer environment or across an entire cluster of processing nodes, it is able to optimally divide work in a much more efficient way than many other database search programs. This is because it only generates candidate sequences from the known database once for the entire set of spectra instead of once for every spectrum. Thus, for each candidate sequence generated, it is compared against every spectrum. The spectra keep a certain (user-defined) number of candidate sequences that had the highest scores.

 

Usage

a)  The basic usage of MyriMatch is quite simple:

myrimatch [flags] -ProteinDatabase <FASTA database filepath> <MS/MS data filepath in a supported file format> <another MS/MS data filepath>

 

When running it from the command line, the command line parser first determines what flags you have specified in the command. The flags can be anywhere on the command line. The following basic flags are supported:

            -cfg <file>                                                                  specifies a runtime configuration [default: myrimatch.cfg]

            -workdir <path>                                                      specifies a path to use as the working directory during execution [default: current working directory]

            -cpus <integer>                                                       specifies the number of worker threads to use during search [default: all available processors]

 

If a flag is specified that expects an argument but no argument is provided, it might be treated as a spectrum data file which probably undesirable. If you do not specify a runtime configuration file with -cfg and the default configuration file is not found, then default runtime values are used (and a warning that no configuration file was found will be shown).

 

There is another type of flag that is supported that has a unique pattern: the override flags. Instead of having a name like cfg, the override flags have the same name as the variable that they override. Overriding a variable is specifying a different value on the command line than the one that is in the configuration file (just like the configuration file overrides the built-in values). For example, to override the variable DynamicMods to have the value “M @ 16”, use the override flag:

      -DynamicMods "M @ 16"

 

The double quotes are necessary on the command line because the value of the variable has spaces in it.

 

After the flags are parsed, the file arguments are processed. The first argument is usually “-ProteinDatabase” followed by the relative or absolute path to the FASTA protein database you want to search against. Every file argument after that is a relative or absolute path to a MS/MS spectra data file from which to extract experimental spectra. The FASTA database filepath must be a valid filepath, it does not support wildcards. The MS/MS spectra data filepaths, however, do support wildcards. The provided spectra can be in any of the formats that ProteoWizard MSData supports. Click here for a list.

 

b)   A static mass modification is something like carboxymethylation of cysteines, where all cysteines should be treated as about +57 in MyriMatch and all subsequent downstream analysis. Refer to the StaticMods variable in the configuration parameters guide. A dynamic mass modification is something like a potential oxidation of methionine, where each methionine may be occur as either its natural mass or about +16. Refer to the DynamicMods variable in the configuration parameters guide.


 

c)   Configuration parameters guide

Name

(Type, Default Value)

Description

NumChargeStates

(integer, 3)

Controls the number of charge states that MyriMatch will handle during all stages of the program. It is especially important during determination of charge state (see DuplicateSpectra for more information).

OutputSuffix

(string, none)

The output of a MyriMatch job will be an identification file for each input file. The string specified by this parameter will be appended to each output filename. It is useful for differentiating jobs within a single directory.

OutputFormat

(string, “pepXML”)

MyriMatch can write identifications in either “mzIdentML” or “pepXML” format.

ProteinDatabase

(string, none)

Specifies the FASTA protein database to be searched.

DecoyPrefix

(string, “rev_”)

Specifying a decoy prefix enables MyriMatch to know if it is making a target or a decoy comparison for each PSM. If the protein database has proteins that begin with DecoyPrefix, then those proteins are decoys. If not, decoy proteins are created on-the-fly by reversing each target protein in the database (so there is one decoy protein per target protein). The automatic reversal (as well as the ability to distinguish between target and decoy comparisons) can be disabled by setting the DecoyPrefix to the empty string (“”).

SpectrumListFilters

(string, “peakPicking false 2-“)

A semicolon-delimited list of filters applied to spectra as it is read in. Supported filters are defined by ProteoWizard:

 

Filter Name

Definition

Arguments

index

Filters spectra by position in the spectrum list

Int_set

msLevel

Filters spectra by MS level

Int_set

scanNumber

Filters spectra by scan number or by index+1

Int_set

scanEvent

Filters spectra by scan event

Int_set

scanTime

Filters spectra by scan start time

[scanTimeLow,scanTimeHigh]

mzPrecursors

Filters spectra by precursor m/z

[mz1, mz2, … mzN]

defaultArrayLength

Filters spectra by number of primary data points

Int_set

activation

Filters spectra by activation type

<ETD|CID|SA|HCD|BIRD|ECD|

IRMPD|PD|PSD|PQD|SID|SORI>

analyzer

Filters spectra by analyzer type

<quad|orbi|FT|IT|TOF>

polarity

Filters spectra by scan polarity

<positive|negative|+|->

peakPicking

Replaces profile peaks with centroided peaks

<prefer_vendor>:boolean(true) <msLevels>:int_set

threshold

Filters spectrum data points by intensity

<count|count-after-ties|absolute|

bpi-relative|tic-relative|tic-cutoff> <threshold>

<most-intense|least-intense> [<msLevels>:int_set]

mzWindow

Filters spectrum data points by m/z

[mzLow, mzHigh]

MS2Denoise

Applies a moving window filter to MS2 spectra

<window peak count>:int(6)

<window m/z width>:int(30)

<multicharge relaxation>:bool(true)

MS2Deisotope

Deisotopes MS2 spectra using Markey method

 

ETDFilter

Filters ETD MSn spectrum data points, removing unreacted precursors, charge-reduced precursors, and neutral losses

<remove precursor>:bool(true)

<remove charge-reduced>:bool(true)

<remove neutral losses>:bool(true)

<blanket removal>:bool(false)

<matching tolerance>:real(3) <PPM|MZ>

chargeStatePredictor

Predicts MSn spectrum precursors to be singly or multiply charged depending on the ratio of intensity above and below the precursor m/z

<override existing charge>bool(false)

<max. multiple charge>:int(3)

<min. multiple charge>:int(2)

<TIC fraction threshold>:real(0.9)

 

'int_set' means that a set of integers must be specified, as a list of intervals of the form [a,b] or a[-][b]

 

If no chargeStatePredictor is specified, a default one will be added like: “chargeStatePredictor false <NumChargeStates> 2 0.9”

TicCutoffPercentage

(real, 0.98)

In order to maximize the effectiveness of the MVH scoring algorithm, an important step in preprocessing the experimental spectra is filtering out noise peaks. Noise peaks are filtered out by sorting the original peaks in descending order of intensity, and then picking peaks from that list until the cumulative ion current of the picked peaks divided by the total ion current (TIC) is greater than or equal to this parameter. Lower percentages mean that less of the spectrums total intensity will be allowed to pass through preprocessing. See the section on Advanced Usage for tips on how to use this parameter optimally.

MaxPeakCount

Filters out all peaks except the MaxPeakCount most intense peaks.

AvgPrecursorMzTolerance

(real, 1.5 m/z)

A generated sequence candidate is only compared to an experimental spectrum if the candidates mass is within this tolerance of the experimental spectrum’s precursor mass. The units (“daltons” or “ppm”) must be provided as well as the magnitude. The actual tolerance used for the search is calculated by multiplying the tolerance by the charge state, so this parameter should be set to the tolerance that is desired for +1 spectra. At the default value, the precursor mass tolerances are 1.5, 3, and 4.5 Da for the first three charge states, respectively. 

MonoPrecursorMzTolerance

(real, 10 ppm)

A generated sequence candidate is only compared to an experimental spectrum if the candidates mass is within this tolerance of the experimental spectrum’s precursor mass. The units (“daltons” or “ppm”) must be provided as well as the magnitude. The actual tolerance used for the search is calculated by multiplying the tolerance by the charge state, so this parameter should be set to the tolerance that is desired for +1 spectra. At the default value, the precursor mass tolerances are 10, 20, and 30 ppm for the first three charge states, respectively. 

MonoisotopeAdjustmentSet

(integer set, [-1,2])

Sometimes a mass spectrometer will pick the wrong isotope as the monoisotope of an eluting peptide. When using narrow tolerances for monoisotopic precursors, this can cause identifiable spectra to be missed. This parameter defines a set of isotopes (0 being the instrument-called monoisotope) to try as the monoisotopic precursor m/z. To disable this technique, set the value to “0”.

PrecursorMzToleranceRule

(string, “auto”)

This parameter controls the automatic selection of precursor mass type. For data from Thermo instruments, using the “auto” setting on a RAW, mzML, or mz5 file will automatically choose monoisotopic or average mass values (and the corresponding precursor tolerance). For other instruments or older data formats, the “mono” or “avg” tolerance should be set explicitly.

FragmentMzTolerance

(real, 0.5 m/z)

This parameter controls how much tolerance there is on each side of the calculated m/z when looking for an ion fragment peak during candidate scoring. The units (“daltons” or “ppm”) must be provided as well as the magnitude.

FragmentationRule

(string, “CID”)

This parameter determines which ion series are used to build the theoretical spectrum for each candidate peptide. Possible values are:

CID: b, y

ETD: c, z*

manual: user-defined (a comma-separated list of [abcxyz] or z* (z+1), e.g. manual:b,y,z

FragmentationAutoRule

(boolean, true)

If true, MyriMatch will automatically choose the fragmentation rule based on the activation type of each MSn spectrum. This allows a single search to handle CID and ETD spectra (i.e. an interleaved or decision tree run). If false or if the input format does not specify the input format then FragmentationRule is used (see above).

StaticMods

(string, none)

If a residue (or multiple residues) should always be treated as having a modification on their natural mass, set this parameter to inform the search engine which residues are modified. Residues are entered into this string as a space-delimited list of pairs. Each pair is of the form:

<AA residue character> <mod mass>

 

Thus, to treat cysteine as always being carboxymethylated, this parameter would be set to something like the string:

“C 57”

DynamicMods

(string, none)

Note: avoid using the “#” symbol in a configuration file since it begins a comment section. Using the “#” symbol in a command-line override works fine.

 

In order to search a database for potential post-translational modifications of candidate sequences, the user must configure this parameter to inform the search engine which residues may be modified. Residues that are modifiable are entered into this string in a space-delimited list of triplets. Each triplet is of the form:

<AA motif> <character to represent mod> <mod mass>

 

Thus, to search for potentially oxidized methionines and phosphorylated serines, this parameter would be set to something like the string:

“M * 15.995 S $ 79.966”

 

The AA motif can include multiple residues, and the peptide termini are represented by opening “(“ and closing “)” parentheses for the N and C termini, respectively. For example, an N terminal deamidation of glutamine may be specified by the string:

“(Q ^ -17”

 

If the last residue in the motif is not the residue intended to be modifiable, then use an exclamation mark to indicate that the residue preceding the mark is the modifiable residue. Using the previous example, “(Q! ^ -17” is an equivalent way to specify it. Another example would be specifying the demidation of asparagine when it is N terminal to a glycine, which might look like:

“N!G @ -17”

 

Another possibility is to specify a block of interchangeable residues in the motif, which is supported by the “[“ and “]” brackets. For example, to specify a potential phosphorylation on any serine, threonine, or tyrosine, use the string:

“[STY] * 79.966”

 

The “{“ and “}” brackets work in the opposite way as the “[“ and “]” brackets, i.e. “{STY} * 79.966” specifies a potential phosphorylation on every residue EXCEPT serine, threonine, or tyrosine. Both kinds of brackets can be combined with the exclamation mark, in which case the exclamation mark should come after the block (because the block counts as a single residue). Using the previous example, “[STY]! * 79.966” is an equivalent way to specify it.

 

Using the negative multi-residue brackets is the best way to indicate the “any residue except” concept, and it works on single residues as well. For example, to specify a mod on lysine except when it is at the C terminus of a peptide, use something like the string “K!{)} # 144”. Another example is specifying the cleavage-blocking homoserine mod in a CnBr digest when a serine or threonine is C terminal to a methionine:

“M![ST] * -29.99”

 

Note that it is not currently possible to specify (for example) the non-cleavage-blocking homoserine lactone mod in a CnBr digest, because the motif would extend outside of the peptide sequence itself. In the future a string like “M!){ST} * -17” might work for that, but for now, if “(“ is used it must be the first character in the motif, and likewise if “)” is used it must be the last character in a motif.

MaxDynamicMods

(integer, 2)

This parameter sets the maximum number of modified residues that may be in any candidate sequence.

MaxResultRank

(integer, 5)

This parameter sets the maximum rank of peptide-spectrum-matches to report for each spectrum. A rank is all PSMs that score the same (common for isobaric residues and ambiguous modification localization). MyriMatch may report extra ranks in order to ensure that the top target match and top decoy match from each digestion specificity (full, semi, non) is reported.

CleavageRules

(string, “Trypsin/P”)

This important parameter allows the user to control the way peptides are generated from the protein database. It can be used to configure the search on tryptic peptides only, on non-tryptics, or anything in between. It can even be used to test multiple residue motifs at a potential cleavage site. This parameter describes which amino acids are valid on the N and C termini of a digestion site. The parameter is specified in PSI-MS regular expression syntax (a limited Perl regular expression syntax). MyriMatch can recognize the following protease names and automatically use the corresponding regular expression for this parameter.

 

         Protease names:

-       “Trypsin” (allows for cut after K or R)

-       “Trypsin/P” (normal trypsin cut, disallows cutting when the site is before a proline)

-       "Chymotrypsin” (allows cut after F,Y,W,L. Disallows cutting before proline)

-       "TrypChymo” (combines “Trypsin/P” and “Chymotrypsin” cleavage rules)

-       “Lys-C”

-       “Lys-C/P” (Lys-C, disallowing cutting before proline)

-       “Asp-N”

-       PepsinA” (Cuts right after F, L)

-       CNBr” (Cyanogen bromide)

-       Formic_acid” (Formic acid)

-       NoEnzyme” (not supported; use the proper enzyme and set MinTerminiCleavages to 0)

 

A complete list of supported protease names can be found here.

 

Note: CleavageRules can also work with an earlier but deprecated regular expression syntax. We highly discourage users from using the old syntax. Briefly, the old syntax is a space-delimited list of cleavage rules, where each cleavage rule itself is a space-delimited pair of strings. The first string of the cleavage rule specifies the residue or residues that must be N-terminal to a potential cleavage site. The second string specifies the residue or residues that must be C-terminal to the site. Either string in the pair can contain multiple sequences of one or more residues, separated by the ‘|’ character. A ‘.’ character is a wildcard that will accept anything. Additionally, the ‘[‘ and ‘]’ characters refer to the N and C termini of a protein.

 

Now that you are thoroughly confused, here are examples of single cleavage rules:

R .                               a site is valid for cleavage if the N-terminal residue is R

R|K .                           a site is valid for cleavage if the N-terminal residue is R or K

[ .                                 a site is valid for cleavage at the N terminus of a protein

. ]                                 a site is valid for cleavage at the C terminus of a protein

 

The “.” wildcard is important because it allows the cleavage routine to work very quickly. However, if you wanted to leave out proline from a tryptic digest, you would have to explicitly declare the valid residues for both sides of a cleavage site:

R|K A|C|D|E|F|G|H|I|K|L|M|N|Q|R|S|T|V|W|Y

 

Remember that this parameter is a list of cleavage rules; a real tryptic digest can be declared with two cleavage rules:

[|R|K . . ]                    a site is valid for cleavage if it is at the N terminus of a protein,

                                    or if the N-terminal residue is R or K; a site is valid for cleavage

                                    if it is at the C terminus of a protein

 

Also note that a cleavage rule can have a residue string of more than one residue, allowing for multiple-residue cleavage motifs:

[M|[ .                           a site is valid for cleavage if it is at the N terminus of a protein,

                                    or if the N-terminal sequence of residues is [M (i.e. the M must

                                    be at the N terminus of a protein)

MinTerminiCleavages

(integer, 2)

By default, when generating peptides from the protein database, a peptide must start and end at a valid cleavage site. Setting this parameter to 0 or 1 will reduce that requirement, so that neither terminus or only one terminus of the peptide must match one of the cleavage rules specified in the CleavageRules parameter. This parameter is useful to turn a tryptic digest into a semi-tryptic digest.

MaxMissedCleavages

(integer, -1)

By default, when generating peptides from the protein database, a peptide may contain any number of missed cleavages. A missed cleavage is a site within the peptide that matches one of the cleavage rules (refer to CleavageRules). Settings this parameter to some other number will stop generating peptides from a sequence if it contains more than the specified number of missed cleavages.

MinPeptideMass

(real, 0 Da)

When preprocessing the experimental spectra, any spectrum with a precursor mass that is less than the specified mass will be disqualified. This parameter is useful to eliminate inherently unidentifiable spectra from an input data set. A setting of 500 for example, will eliminate most 3-residue matches and clean up the output file quite a lot.

MaxPeptideMass

(real, 10000 Da)

When preprocessing the experimental spectra, any spectrum with a precursor mass that exceeds the specified mass will be disqualified.

MinPeptideLength

(integer, 5)

When digesting proteins, any peptide which does not meet or exceed the specified length will be disqualified.

MaxPeptideLength

(integer, 75)

When digesting proteins, any peptide which exceeds this specified length will be disqualified.

StatusUpdateFrequency

(real, 5 seconds)

Preprocessing spectra and scoring candidates may take a long time. A measure of progress through the protein database will be given on intervals that are specified by this parameter, measured in seconds.

UseSmartPlusThreeModel

(boolean, true)

Once a candidate sequence has been generated from the protein database, MyriMatch determines which spectra will be compared to the sequence. For each unique charge state of those spectra, a set of theoretical fragment ions is generated by one of several different algorithms.

 

For +1 and +2 precursors, a +1 b and y ion is always predicted at each peptide bond.

 

For +3 and higher precursors, the fragment ions predicted depend on the way this parameter is set. When this parameter is true, then for each peptide bond, an internal calculation is done to estimate the basicity of the b and y fragment sequence. The precursors protons are distributed to those ions based on that calculation, with the more basic sequence generally getting more of the protons. For example, when this parameter is true, each peptide bond of a +3 precursor will either generate a +2 bi and a +1 yi ion, or a +1 bi and a +2 yi ion. For a +4 precursor, depending on basicity, a peptide bond breakage may result in a +1 bi and a +3 yi ion, a +2 bi and a +2 yi ion, or a +3 bi and a +1 yi ion. When this parameter is false, however, ALL possible charge distributions for the fragment ions are generated for every peptide bond. So a +3 sequence of length 10 will always have theoretical +1 y5, +2 y5, +1 b5, and +2 b5 ions.

ProteinSampleSize

(integer, 100)

Before beginning sequence candidate generation and scoring, MyriMatch will do a random sampling of the protein database to get an estimate of the number of comparisons that will be done by the job. The bigger the sample size, the longer this estimate will take and the more accurate it will be. Of course, if there are fewer proteins in the database than the sample size, all proteins will be used in the sampling and the number of comparisons will be exact.

ComputeXCorr

(boolean, true)

If true, a Sequest-like cross correlation (xcorr) score will be calculated for the top ranking hits in each spectrum’s result set.

NumIntensityClasses

(integer, 3)

Before scoring any candidates, experimental spectra have their peaks stratified into the number of intensity classes specified by this parameter. Spectra that are very dense in peaks will likely benefit from more intensity classes in order to best take advantage of the variation in peak intensities. Spectra that are very sparse will not see much benefit from using many intensity classes.

ClassSizeMultiplier

(real, 2)

When stratifying peaks into a specified, fixed number of intensity classes, this parameter controls the size of each class relative to the class above it (where the peaks are more intense). At default values, if the best class, A, has 1 peak in it, then class B will have 2 peaks in it and class C will have 4 peaks.

NumBatches

(integer, 50)

This parameter sets a number of batches per node to strive for when using the MPI-based parallelization features. Setting this too low means that some nodes will finish before others (idle processor time), while setting it too high means more overhead in network transmission as each batch is smaller.

ThreadCountMultiplier

(integer, 10)

MyriMatch is designed to take advantage of (symmetric) multiprocessor systems by multithreading the database search. A search process on an SMP system will spawn one worker thread for each processing unit (where a processing unit can be either a core on a multi-core CPU or a separate CPU entirely). The main thread then generates a list of worker numbers which is equal to the number of worker threads multiplied by this parameter. The worker threads then take a worker number from the list and use that number to iterate through the protein list. It is possible that one thread will be assigned all the proteins that generate a few candidates while another thread is assigned all the proteins that generate many candidates, resulting in one thread finishing its searching early. By having each thread use multiple worker numbers, the chance of one thread being penalized for picking all the easy proteins is reduced because if it finishes early it can just pick a new number. The only disadvantage to this system is that picking the new number incurs some overhead because of synchronizing with the other worker threads that might be trying to pick a worker number at the same time. The default value is a nice compromise between incurring that overhead and minimizing wasted time.

UseMultipleProcessors

(boolean, true)

If true, each process will use all the processing units available on the system it is running on.

 

 

Interpreting results

a)   Search-time output of MyriMatch serves several purposes. The majority of the output will usually be progress information, telling the user which part of the job that MyriMatch is currently working on, and in some cases how far along into that part the job is. There will be periodic updates when MyriMatch is preprocessing spectra and when it is generating candidates from the database and comparing those candidates against the spectra. In a multi-process (MPI) job, there will also be progress information on bulk transfers of data over the network. Additionally, MyriMatch will display statistics on the spectra that remain after preprocessing, specifically the average number of peaks in a spectrum before and after preprocessing. Also provided is the average number of the percentage of peaks that were filtered out by the preprocessing step. Finally, in the case of an MPI job, when the database search is complete each node that took part in the search will display statistics detailing the work that node did. The lines will be like:

Process #1 (foohost) stats: <numBatches> / <numProteins> / <numCandidatesGenerated> / <numCandidatesSearched> / <numComparisonsDone>

 

b)   One pepXML (a database search output format originally developed at the SPC) file is produced for every input spectra file that a MyriMatch job searches. The file contains an entry for each spectrum kept during the search (i.e. only the spectra that were not obviously junk) that was compared to any candidate. Only the best MaxResults (a config parameter, defaulting to 5) matches for each spectrum are kept and output in the file. The pepXML format is quite versatile, but does have a few limitations. It is not possible to fully encode MyriMatch’s CleavageRules variable if multi-residue motifs are used (in the rule “[|[M|K|R .” the “[M” part is a multi-residue motif), and it is not possible to fully encode Static/DynamicMods if multi-residue motifs are used (in the mod “(Q ^ -17” the “(Q” part is a multi-residue motif). We have proposed extensions to pepXML to support these motif-oriented capabilities.

 

c)    To validate results generated by MyriMatch (and also several other popular search engines), users can pass pepXML files to IDPickerQonvert (in preparation for analyzing the results in the IDPicker suite). This approach only works when the protein database that was searched included distracter (decoy) proteins with a common prefix in their name that is not a prefix in any of the valid proteins. Most often, the database that was searched will include proteins in their forward and reverse orders, with the reversed protein having a prefix added to the name like “rev_”.