TagRecon

Reconciles MS-MS Sequence Tags Against a Database

 

Table of Contents:

I.             Introduction

II.           Usage

a.    Basic

b.    Specifying static or dynamic mass modifications

c.    Configuration parameters guide

III.          Interpreting results

a.    Search-time output

b.    PepXML format guide

c.    Validation

 

 

Introduction

TagRecon is a mutation-tolerant search engine. It reconciles peptide sequence tags derived by DirecTag against a protein database, and it can be configured to make allowances for unanticipated mutations and posttranslational modifications during the tag reconciliation process. The software has been validated on multiple mass spectrometer platforms. It can distribute its operations across multi-core CPUs and multi-node computer clusters.

 

Usage

a)  The basic usage of TagRecon is quite simple:

TagRecon [flags] -ProteinDatabase <FASTA database filepath> <sequence tags filepath in a supported file format> <another sequence tags filepath>

 

When running it from the command line, the command line parser first determines what flags you have specified in the command. The flags can be anywhere on the command line. The following basic flags are supported:

            -cfg <file>                                                                  specifies a runtime configuration [default: tagrecon.cfg]

            -workdir <path>                                                      specifies an absolute path to use as the working directory during execution [default: current working directory]

            -cpus <integer>                                                       specifies the number of worker threads to use during search [default: all available processors]

 

If a flag is specified that expects an argument but no argument is provided, the flag might be treated as a sequence tags file, which is probably undesirable. If you do not specify a runtime configuration file with -cfg and the default configuration file is not found, then default runtime values are used (and a warning that no configuration file was found will be shown).

 

There is another type of flag that is supported that has a unique pattern: the override flags. Instead of having a name like cfg, the override flags have the same name as the variable that they override. Overriding a variable is specifying a different value on the command line than the one that is in the configuration file (just like the configuration file overrides the built-in values). For example, to override the variable DynamicMods to have the value “M @ 16”, use the override flag:

      -DynamicMods "M @ 16"

 

The double quotes are necessary on the command line because the value of the variable has spaces in it.
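
As a rough sketch of how such a command line could be assembled from a script, the Python below builds an argument list and prints a shell-quoted version of it. The executable name "tagrecon" and all file names are placeholders assumed for illustration; only the flags themselves come from the text above.

```python
import shlex

def build_command(database, tag_files, cfg=None, cpus=None, overrides=None):
    """Assemble a TagRecon argument list (the executable name is an assumption)."""
    cmd = ["tagrecon"]  # hypothetical executable name; adjust to your install
    if cfg is not None:
        cmd += ["-cfg", cfg]
    if cpus is not None:
        cmd += ["-cpus", str(cpus)]
    # Override flags carry the name of the variable they override.
    for name, value in (overrides or {}).items():
        cmd += ["-" + name, str(value)]
    cmd += ["-ProteinDatabase", database]
    cmd += list(tag_files)
    return cmd

cmd = build_command(
    "proteins.fasta",      # placeholder FASTA path
    ["sample1.tags"],      # placeholder tags file
    cpus=4,
    overrides={"DynamicMods": "M @ 16"},
)
# shlex.join adds the quoting a POSIX shell needs for values containing spaces.
print(shlex.join(cmd))
```

Passing the list itself to subprocess.run sidesteps shell quoting entirely, because each element is delivered to the program as one argument.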

 

After the flags are parsed, the file arguments are processed. The first argument is usually “-ProteinDatabase” followed by the relative or absolute path to the FASTA protein database you want to search against. Every file argument after that is a relative or absolute path to a sequence tags file from which to extract short sequence tags. The FASTA database filepath must be a valid filepath; it does not support wildcards. The sequence tags filepaths, however, do support wildcards. The provided tags must be in .tags format (DirecTag). The software expects the corresponding raw data to be present in the same folder as the tags file. The software supports the following MS/MS data formats:

            mzML 1.x (version 1.1 is recommended)

            mzXML 3.x

            Mascot Generic (MGF)

            Bruker BioTools Data Exchange (btdx.xml)

            Bruker BAF, FID, Compass .d

            Agilent/Bruker YEP

            Agilent MassHunter .d

            Thermo RAW

            Waters RAW

            ABI WIFF

Note: Reading instrument native formats requires installation of the appropriate vendor readers on the computer (follow the link and see the section “Vendor-specific requirements”). Most vendor readers are available for the Windows platform only. Linux/Unix/Mac users are encouraged to use the mzML 1.1 format.

 

b)   A static mass modification is something like carbamidomethylation of cysteines, where all cysteines should be treated as about +57 in TagRecon and all subsequent downstream analysis. Refer to the StaticMods variable in the configuration parameters guide. A dynamic mass modification is something like a potential oxidation of methionine, where each methionine may occur as either its natural mass or about +16. Refer to the DynamicMods variable in the configuration parameters guide.


 

c)   Configuration parameters guide

Name

(Type, Default Value)

Description

AdjustPrecursorMass

(boolean, false)

If true, the preprocessing step will correct the precursor mass by adjusting it through a specified range in steps of a specified length, finally choosing the optimal adjustment. The optimal adjustment is the one that maximizes the sum of products of all complementary peaks in the spectrum.

BlindPTMResidues

(string, “”)

A comma-separated list of the amino acid residues to consider in blind PTM searches. The N-terminal and C-terminal symbols ([ and ]) may also be included.

BlosumMatrix

(string, “blosum62.fas”)

When a mismatch occurs during mass matching, the software computes the delta mass (ΔM) between the corresponding database sequence and spectral flanking masses. The ΔM and the amino acids in the mass mismatch region are used to identify potential substitutions from a BLOSUM62 log-odds substitution matrix. This parameter tells TagRecon where to find the corresponding BLOSUM matrix.

BlosumThreshold

(real, 0)

This parameter filters the potential substitutions from the BLOSUM matrix based on the log-odds scores.

ClassSizeMultiplier

(real, 2)

When stratifying peaks into a specified, fixed number of intensity classes, this parameter controls the size of each class relative to the class above it (where the peaks are more intense). At default values, if the best class, A, has 1 peak in it, then class B will have 2 peaks in it and class C will have 4 peaks.
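
The geometric growth of class sizes can be sketched in a few lines of Python. This only illustrates the relative sizes described above; the function name and the "first_class_size" argument are assumptions, and how TagRecon actually assigns peaks to classes is not shown.

```python
def class_sizes(num_classes=3, multiplier=2.0, first_class_size=1):
    """Relative size of each intensity class, most intense class first."""
    sizes = []
    size = float(first_class_size)
    for _ in range(num_classes):
        sizes.append(size)
        size *= multiplier  # each class is `multiplier` times the one above it
    return sizes

print(class_sizes())  # [1.0, 2.0, 4.0] at the default ClassSizeMultiplier of 2
```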

CleavageRules

(string, “Trypsin/P”)

This important parameter allows the user to control the way peptides are generated from the protein database. It can be used to configure the search on tryptic peptides only, on non-tryptics, or anything in between. It can even be used to test multiple residue motifs at a potential cleavage site. This parameter describes which amino acids are valid on the N and C termini of a digestion site. The parameter is specified in PSI-MS regular expression syntax (a limited Perl regular expression syntax). TagRecon can recognize the following protease names and automatically use the corresponding regular expression for this parameter.

 

         Protease names:

-       “Trypsin” (allows for cut after K or R and disallows cutting when the site is before a proline)

-       “Trypsin/P” (overrides the proline exception of “Trypsin”)

-       “Chymotrypsin” (allows cutting after F, Y, W, L; disallows cutting before proline)

-       “TrypChymo” (combines “Trypsin/P” and “Chymotrypsin” cleavage rules)

-       “Lys-C” (Lys-C, disallowing cutting before proline)

-       “Lys-C/P” (overrides the proline exception of “Lys-C”)

-       “Asp-N”

-       “PepsinA” (Cuts right after F, L)

-       “CNBr” (Cyanogen bromide)

-       “Formic_acid” (Formic acid)

-       “NoEnzyme” (Cuts everywhere)

 

A complete list of supported protease names can be found here.
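
To make the regular-expression form of a cleavage rule concrete, here is a minimal Python sketch of an in-silico digest. The two patterns are assumed to match the PSI-MS definitions of “Trypsin” and “Trypsin/P”; check them against the vocabulary before relying on them. The function names, the example sequence, and the default length limits are illustrative only, and this is not TagRecon's own digestion code.

```python
import re

TRYPSIN = r"(?<=[KR])(?!P)"   # assumed PSI-MS pattern: after K/R, not before P
TRYPSIN_P = r"(?<=[KR])"      # assumed PSI-MS pattern: after K/R, no proline rule

def cleavage_sites(seq, rule):
    """Positions inside seq where the zero-width rule matches."""
    return [m.start() for m in re.finditer(rule, seq) if 0 < m.start() < len(seq)]

def digest(seq, rule, max_missed=0, min_len=1, max_len=75):
    """Fully specific peptides with at most max_missed missed cleavages.

    A negative max_missed means "no limit", mirroring the -1 default
    described for MaxMissedCleavages.
    """
    bounds = [0] + cleavage_sites(seq, rule) + [len(seq)]
    limit = len(bounds) if max_missed < 0 else max_missed
    peptides = []
    for i in range(len(bounds) - 1):
        last = min(i + 1 + limit, len(bounds) - 1)
        for j in range(i + 1, last + 1):
            peptide = seq[bounds[i]:bounds[j]]
            if min_len <= len(peptide) <= max_len:
                peptides.append(peptide)
    return peptides

print(digest("MKAARPLKG", TRYPSIN))    # ['MK', 'AARPLK', 'G']
print(digest("MKAARPLKG", TRYPSIN_P))  # ['MK', 'AAR', 'PLK', 'G']
```

The only difference between the two rules is the negative lookahead, which is why “Trypsin” keeps AARPLK intact while “Trypsin/P” cuts between R and P.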

 

Note: CleavageRules can also work with an earlier but deprecated regular expression syntax. We highly discourage users from using the old syntax. Briefly, the old syntax is a space-delimited list of cleavage rules, where each cleavage rule itself is a space-delimited pair of strings. The first string of the cleavage rule specifies the residue or residues that must be N-terminal to a potential cleavage site. The second string specifies the residue or residues that must be C-terminal to the site. Either string in the pair can contain multiple sequences of one or more residues, separated by the ‘|’ character. A ‘.’ character is a wildcard that will accept anything. Additionally, the ‘[‘ and ‘]’ characters refer to the N and C termini of a protein.

 

Now that you are thoroughly confused, here are examples of single cleavage rules:

R .                               a site is valid for cleavage if the N-terminal residue is R

R|K .                           a site is valid for cleavage if the N-terminal residue is R or K

[ .                                 a site is valid for cleavage at the N terminus of a protein

. ]                                 a site is valid for cleavage at the C terminus of a protein

 

The “.” wildcard is important because it allows the cleavage routine to work very quickly. However, if you wanted to leave out proline from a tryptic digest, you would have to explicitly declare the valid residues for both sides of a cleavage site:

R|K A|C|D|E|F|G|H|I|K|L|M|N|Q|R|S|T|V|W|Y

 

Remember that this parameter is a list of cleavage rules; a real tryptic digest can be declared with two cleavage rules:

[|R|K . . ]                    a site is valid for cleavage if it is at the N terminus of a protein,

                                    or if the N-terminal residue is R or K; a site is valid for cleavage

                                    if it is at the C terminus of a protein

 

Also note that a cleavage rule can have a residue string of more than one residue, allowing for multiple-residue cleavage motifs:

[M|[ .                           a site is valid for cleavage if it is at the N terminus of a protein,

                                    or if the N-terminal sequence of residues is [M (i.e. the M must

                                    be at the N terminus of a protein)

ComputeXCorr

(boolean, false)

If true, a Sequest-like cross correlation (xcorr) score will be calculated for the top ranking hits in each spectrum’s result set.

CTerminusMzTolerance

(real, 0.5 m/z)

TagRecon rapidly scans a database for candidate peptides using short sequence tags derived by DirecTag from MS/MS spectra. Candidates are matched to spectra by comparing prefix and suffix masses on either side of the tag match. This parameter defines the mass tolerance used for suffix matching.

DecoyPrefix

(string, “rev_”)

Specifying a decoy prefix enables TagRecon to know if it is making a target or a decoy comparison for each PSM. If the protein database has proteins that begin with DecoyPrefix, then those proteins are decoys. If not, decoy proteins are created on-the-fly by reversing each target protein in the database (so there is one decoy protein per target protein). The automatic reversal (as well as the ability to distinguish between target and decoy comparisons) can be disabled by setting the DecoyPrefix to the empty string (“”).
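
The decoy behaviour described above can be sketched as follows. This is an illustration under simple assumptions (proteins held in a dictionary of name to sequence, hypothetical function name), not the engine's implementation.

```python
def add_decoys(proteins, decoy_prefix="rev_"):
    """Return the database plus reversed decoys, following the DecoyPrefix rules."""
    if decoy_prefix == "":
        # An empty prefix disables decoy handling altogether.
        return dict(proteins)
    if any(name.startswith(decoy_prefix) for name in proteins):
        # The database already contains decoys; use it as-is.
        return dict(proteins)
    combined = dict(proteins)
    for name, sequence in proteins.items():
        # One decoy per target, made by reversing the target sequence.
        combined[decoy_prefix + name] = sequence[::-1]
    return combined

print(add_decoys({"P1": "MKAAR"}))  # {'P1': 'MKAAR', 'rev_P1': 'RAAKM'}
```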

DuplicateSpectra

(boolean, true)

If TagRecon determines a spectrum to be multiply charged and this parameter is true, the spectrum will be copied and treated as if it was all possible charge states from +2 to +<NumChargeStates>. If this parameter is false, the spectrum will simply be treated as a +2.

DynamicMods

(string, none)

Note: avoid using the “#” symbol in a configuration file since it begins a comment section. Using the “#” symbol in a command-line override works fine.

 

In order to search a database for potential post-translational modifications of candidate sequences, the user must configure this parameter to inform the search engine which residues may be modified. Residues that are modifiable are entered into this string in a space-delimited list of triplets. Each triplet is of the form:

<AA motif> <character to represent mod> <mod mass>

 

Thus, to search for potentially oxidized methionines and phosphorylated serines, this parameter would be set to something like the string:

“M * 15.995 S $ 79.966”

 

The AA motif can include multiple residues, and the peptide termini are represented by opening “(“ and closing “)” parentheses for the N and C termini, respectively. For example, an N terminal deamidation of glutamine may be specified by the string:

“(Q ^ -17”

 

If the last residue in the motif is not the residue intended to be modifiable, then use an exclamation mark to indicate that the residue preceding the mark is the modifiable residue. Using the previous example, “(Q! ^ -17” is an equivalent way to specify it. Another example would be specifying the deamidation of asparagine when it is N-terminal to a glycine, which might look like:

“N!G @ -17”

 

Another possibility is to specify a block of interchangeable residues in the motif, which is supported by the “[“ and “]” brackets. For example, to specify a potential phosphorylation on any serine, threonine, or tyrosine, use the string:

“[STY] * 79.966”

 

The “{“ and “}” brackets work in the opposite way as the “[“ and “]” brackets, i.e. “{STY} * 79.966” specifies a potential phosphorylation on every residue EXCEPT serine, threonine, or tyrosine. Both kinds of brackets can be combined with the exclamation mark, in which case the exclamation mark should come after the block (because the block counts as a single residue). Using the previous example, “[STY]! * 79.966” is an equivalent way to specify it.

 

Using the negative multi-residue brackets is the best way to indicate the “any residue except” concept, and it works on single residues as well. For example, to specify a mod on lysine except when it is at the C terminus of a peptide, use something like the string “K!{)} # 144”. Another example is specifying the cleavage-blocking homoserine mod in a CNBr digest when a serine or threonine is C-terminal to a methionine:

“M![ST] * -29.99”

 

Note that it is not currently possible to specify (for example) the non-cleavage-blocking homoserine lactone mod in a CNBr digest, because the motif would extend outside of the peptide sequence itself. In the future a string like “M!){ST} * -17” might work for that, but for now, if “(“ is used it must be the first character in the motif, and likewise if “)” is used it must be the last character in a motif.
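
As a small illustration of the triplet layout, the Python sketch below splits a DynamicMods string into (motif, symbol, mass) records. It does not interpret the motif syntax (brackets, exclamation marks, termini); the class and function names are assumptions made for this example.

```python
from typing import List, NamedTuple

class DynamicMod(NamedTuple):
    motif: str     # e.g. "M", "(Q", "[STY]", "N!G"
    symbol: str    # single character that represents the mod
    mass: float    # mass shift

def parse_dynamic_mods(value: str) -> List[DynamicMod]:
    """Split a space-delimited DynamicMods string into triplets."""
    tokens = value.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected <AA motif> <symbol> <mod mass> triplets")
    return [
        DynamicMod(tokens[i], tokens[i + 1], float(tokens[i + 2]))
        for i in range(0, len(tokens), 3)
    ]

for mod in parse_dynamic_mods("M * 15.995 S $ 79.966"):
    print(mod.motif, mod.symbol, mod.mass)
```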

ExplainUnknownMassShiftsAs

(string, none)

This variable controls how TagRecon interprets the mass differences between the precursor and database peptide. This variable can take on three possible values:

1.    “blindptms”: Mass shifts are interpreted as unknown modifications. A mass shift is localized to the residue that maximizes the peptide identification score.

2.    “mutations”: Mass shifts are snapped to the BLOSUM amino acid mutation matrix. Candidate mutations are filtered with the user-specified BlosumThreshold and interpretations are generated. Interpretations are scored against the spectrum and the highest-scoring interpretation is chosen as the best match.

3.    “preferredptms”: Mass shifts are snapped to a list of modifications specified by the variable “PreferredDeltaMasses”.

 

An empty string turns off the delta mass interpretation in TagRecon.

FragmentationAutoRule

(boolean, true)

If true, TagRecon will automatically choose the fragmentation rule based on the activation type of each MSn spectrum. This allows a single search to handle CID and ETD spectra (i.e. an interleaved or decision tree run). If false, or if the input format does not specify the activation type, then FragmentationRule is used (see the FragmentationRule entry).

FragmentationRule

(string, “CID”)

This parameter determines which ion series are used to build the theoretical spectrum for each candidate peptide. Possible values are:

CID: b, y

ETD: c, z*

manual: user-defined (a comma-separated list of [abcxyz] or z* (z+1)), e.g. manual:b,y,z
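
The mapping from a FragmentationRule value to ion series can be sketched like this. The function name is hypothetical, and treating the value case-insensitively is an assumption of the sketch rather than documented behaviour.

```python
ALLOWED_SERIES = {"a", "b", "c", "x", "y", "z", "z*"}
PRESETS = {"cid": ["b", "y"], "etd": ["c", "z*"]}

def ion_series(rule):
    """Translate a FragmentationRule value into a list of ion series."""
    key = rule.strip().lower()  # case-insensitive handling is an assumption
    if key in PRESETS:
        return list(PRESETS[key])
    if key.startswith("manual:"):
        series = [s.strip() for s in key[len("manual:"):].split(",") if s.strip()]
        unknown = [s for s in series if s not in ALLOWED_SERIES]
        if unknown or not series:
            raise ValueError("unsupported ion series: %r" % (unknown,))
        return series
    raise ValueError("unknown FragmentationRule: %r" % (rule,))

print(ion_series("CID"))           # ['b', 'y']
print(ion_series("manual:b,y,z"))  # ['b', 'y', 'z']
```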

FragmentMzTolerance

(real, 0.5 m/z)

This parameter controls how much tolerance there is on each side of the calculated m/z when looking for an ion fragment peak during candidate scoring.

MassReconMode

(boolean, false)

If set to true, TagRecon will ignore tag matching, and match each spectrum to all peptide sequences culled from a protein database while allowing for an unknown modification.

MaxDynamicMods

(integer, 2)

This parameter sets the maximum number of modified residues that may be in any candidate sequence.

MaxMissedCleavages

(integer, -1)

By default, when generating peptides from the protein database, a peptide may contain any number of missed cleavages. A missed cleavage is a site within the peptide that matches one of the cleavage rules (refer to CleavageRules). Setting this parameter to some other number will stop generating peptides from a sequence if it contains more than the specified number of missed cleavages.

MaxModificationMassPlus

(real, 300.0)

This parameter defines the largest positive mass mismatch that will be interpreted. Mass mismatches larger than this value are ignored and not interpreted as mutations or unknown modifications. By default, any mass mismatches bigger than +300 Da are ignored.

MaxModificationMassMinus

(real, 150.0)

This parameter defines the largest negative mass mismatch that will be interpreted. Mass mismatches more negative than this value are ignored and not interpreted as mutations or unknown modifications. By default, any mass mismatches smaller than -150 Da are ignored. Together with MaxModificationMassPlus, this parameter defines the window for the size of the modifications allowed in the mutation and unknown modification search. By default, TagRecon can search for mutations or modifications that are within a mass range of -150 Da ≤ modMass ≤ 300 Da.
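
The window check amounts to a single comparison. The sketch below uses a hypothetical function name and the default values of MaxModificationMassPlus and MaxModificationMassMinus; note that the minus parameter is given as a positive number.

```python
def is_interpretable(delta_mass, max_plus=300.0, max_minus=150.0):
    """True if a mass mismatch falls inside the allowed modification window."""
    return -max_minus <= delta_mass <= max_plus

print(is_interpretable(15.995))   # True: inside -150 Da .. +300 Da
print(is_interpretable(-200.0))   # False: below the negative limit
```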

MaxNumPreferredDeltaMasses

(integer, 1)

This variable tells TagRecon how many preferred delta masses should be invoked to explain a delta mass between precursor and database peptide.

MaxPeakCount

(integer, 200)

Filters out all peaks except the MaxPeakCount most intense peaks.

MaxPeptideLength

(integer, 75)

When digesting proteins, any peptide which exceeds this specified length will be disqualified.

MaxPrecursorAdjustment

(real, 2.5 Da)

When adjusting the precursor mass, this parameter sets the upper mass limit of adjustment allowable from the original precursor mass, measured in Daltons.

MinPrecursorAdjustment

(real, -2.5 Da)

When adjusting the precursor mass, this parameter sets the lower mass limit of adjustment allowable from the original precursor mass, measured in Daltons.

MaxResultRank

(integer, 2)

This parameter sets the maximum rank of peptide-spectrum-matches to report for each spectrum. A rank is all PSMs that score the same (common for isobaric residues and ambiguous modification localization). TagRecon may report extra ranks in order to ensure that the top target match and top decoy match from each digestion specificity (full, semi, non) is reported.

MaxSequenceMass

(real, 10000 Da)

When preprocessing the experimental spectra, any spectrum with a precursor mass that exceeds the specified mass will be disqualified.

MinPeptideLength

(integer, 5)

When digesting proteins, any peptide which does not meet or exceed the specified length will be disqualified.

MinSequenceMass

(real, 0 Da)

When preprocessing the experimental spectra, any spectrum with a precursor mass that is less than the specified mass will be disqualified. This parameter is useful to eliminate inherently unidentifiable spectra from an input data set. A setting of 500 for example, will eliminate most 3-residue matches and clean up the output file quite a lot.

MinTerminiCleavages

(integer, 2)

By default, when generating peptides from the protein database, a peptide must start after a cleavage and end before a cleavage. Setting this parameter to 0 or 1 will reduce that requirement, so that neither terminus or only one terminus of the peptide must match one of the cleavage rules specified in the CleavageRules parameter. This parameter is useful to turn a tryptic digest into a semi-tryptic digest.

NTerminusMzTolerance

(real, 0.75 m/z)

TagRecon rapidly scans a database for candidate peptides using short sequence tags derived by DirecTag from MS/MS spectra. Candidates are matched to spectra by comparing prefix and suffix masses on either side of tag match. This parameter defines the mass tolerance used for prefix matching.

NumBatches

(integer, 50)

This parameter sets a number of batches per node to strive for when using the MPI-based parallelization features. Setting this too low means that some nodes will finish before others (idle processor time), while setting it too high means more overhead in network transmission as each batch is smaller.

NumChargeStates

(integer, 3)

Controls the number of charge states that TagRecon will handle during all stages of the program. It is especially important during determination of charge state (see DuplicateSpectra for more information).

NumIntensityClasses

(integer, 3)

Before scoring any candidates, experimental spectra have their peaks stratified into the number of intensity classes specified by this parameter. Spectra that are very dense in peaks will likely benefit from more intensity classes in order to best take advantage of the variation in peak intensities. Spectra that are very sparse will not see much benefit from using many intensity classes.

OutputFormat

(string, “pepXML”)

TagRecon can write identifications in either “mzIdentML” or “pepXML” format.

OutputSuffix

(string, none)

The output of a TagRecon job will be a pepXML file for each input file. The string specified by this parameter will be appended to each pepXML filename. It is useful for differentiating jobs within a single directory.

PreferredDeltaMasses

(string, none)

The syntax of this variable is the same as DynamicMods, except that the variable does NOT expect the user to supply a symbol for each modification.

PrecursorAdjustmentStep

(real, 0.1 Da)

When adjusting the precursor mass, this parameter sets the size of the steps between adjustments, measured in Daltons.
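
To show what the three precursor adjustment parameters imply together, the sketch below enumerates the candidate adjustments at the default values. The function name is an assumption, and the scoring that picks the optimal adjustment is not shown.

```python
def adjustment_steps(min_adj=-2.5, max_adj=2.5, step=0.1):
    """Candidate precursor mass adjustments, in Daltons, from min to max."""
    count = int(round((max_adj - min_adj) / step))
    # Index-based stepping avoids accumulating floating-point error.
    return [round(min_adj + i * step, 6) for i in range(count + 1)]

steps = adjustment_steps()
print(len(steps), steps[0], steps[-1])  # 51 -2.5 2.5
```

At the defaults this means 51 trial masses per spectrum, so widening the range or shrinking the step increases preprocessing work proportionally.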

PrecursorMzTolerance

(real, 1.25 m/z)

A generated sequence candidate is only compared to an experimental spectrum if the candidate's mass is within this tolerance of the experimental spectrum's precursor mass. This value is given in Daltons/z units, but the actual tolerance is calculated by multiplying by the charge state. This parameter should be set to the tolerance that is desired for +1 spectra. At the default value, the precursor mass tolerances are 1.25, 2.5, and 3.75 Da for the first three charge states, respectively.
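
The charge scaling of the tolerance can be written out as a short sketch. The function names are hypothetical; the arithmetic follows the description above.

```python
def precursor_tolerances(tolerance=1.25, num_charge_states=3):
    """Mass tolerance in Da for each charge state: tolerance times charge."""
    return {z: tolerance * z for z in range(1, num_charge_states + 1)}

def within_tolerance(candidate_mass, precursor_mass, charge, tolerance=1.25):
    """True if a candidate is close enough to be compared to the spectrum."""
    return abs(candidate_mass - precursor_mass) <= tolerance * charge

print(precursor_tolerances())  # {1: 1.25, 2: 2.5, 3: 3.75}
```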

ProteinDatabase

(string, none)

Specifies the FASTA protein database to be searched.

ProteinSampleSize

(integer, 100)

Before beginning sequence candidate generation and scoring, TagRecon will do a random sampling of the protein database to get an estimate of the number of comparisons that will be done by the job. The bigger the sample size, the longer this estimate will take and the more accurate it will be. Of course, if there are fewer proteins in the database than the sample size, all proteins will be used in the sampling and the number of comparisons will be exact.

SearchUntaggedSpectra

(boolean, false)

When true, TagRecon searches untagged spectra as in a conventional database search.

SpectrumListFilters

(string, “peakPicking false 2-”)

A semicolon-delimited list of filters applied to spectra as they are read in. Supported filters are defined by ProteoWizard:

 

Filter Name

Definition

Arguments

index

Filters spectra by position in the spectrum list

Int_set

msLevel

Filters spectra by MS level

Int_set

scanNumber

Filters spectra by scan number or by index+1

Int_set

scanEvent

Filters spectra by scan event

Int_set

scanTime

Filters spectra by scan start time

[scanTimeLow,scanTimeHigh]

mzPrecursors

Filters spectra by precursor m/z

[mz1, mz2, … mzN]

defaultArrayLength

Filters spectra by number of primary data points

Int_set

activation

Filters spectra by activation type

<ETD|CID|SA|HCD|BIRD|ECD|IRMPD|PD|PSD|PQD|SID|SORI>

analyzer

Filters spectra by analyzer type

<quad|orbi|FT|IT|TOF>

polarity

Filters spectra by scan polarity

<positive|negative|+|->

peakPicking

Replaces profile peaks with centroided peaks

<prefer_vendor>:boolean(true) <msLevels>:int_set

threshold

Filters spectrum data points by intensity

<count|count-after-ties|absolute|bpi-relative|tic-relative|tic-cutoff> <threshold> <most-intense|least-intense> [<msLevels>:int_set]

mzWindow

Filters spectrum data points by m/z

[mzLow, mzHigh]

MS2Denoise

Applies a moving window filter to MS2 spectra

<window peak count>:int(6) <window m/z width>:int(30) <multicharge relaxation>:bool(true)

MS2Deisotope

Deisotopes MS2 spectra using Markey method

 

ETDFilter

Filters ETD MSn spectrum data points, removing unreacted precursors, charge-reduced precursors, and neutral losses

<remove precursor>:bool(true) <remove charge-reduced>:bool(true) <remove neutral losses>:bool(true) <blanket removal>:bool(false) <matching tolerance>:real(3) <PPM|MZ>

chargeStatePredictor

Predicts MSn spectrum precursors to be singly or multiply charged depending on the ratio of intensity above and below the precursor m/z

<override existing charge>:bool(false) <max. multiple charge>:int(3) <min. multiple charge>:int(2) <TIC fraction threshold>:real(0.9)

 

'int_set' means that a set of integers must be specified, as a list of intervals of the form [a,b] or a[-][b]

 

If no chargeStatePredictor is specified, a default one will be added like: “chargeStatePredictor false <NumChargeStates> 2 0.9”
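
The default described above can be mimicked with a few lines of string handling. This is a sketch with a hypothetical function name, useful mainly for seeing what the effective filter list looks like.

```python
def with_default_charge_predictor(filters, num_charge_states=3):
    """Append the default chargeStatePredictor if the list has none."""
    parts = [f.strip() for f in filters.split(";") if f.strip()]
    has_predictor = any(p.split()[0] == "chargeStatePredictor" for p in parts)
    if not has_predictor:
        parts.append("chargeStatePredictor false %d 2 0.9" % num_charge_states)
    return "; ".join(parts)

print(with_default_charge_predictor("peakPicking false 2-"))
# peakPicking false 2-; chargeStatePredictor false 3 2 0.9
```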

StaticMods

(string, none)

If a residue (or multiple residues) should always be treated as having a modification on their natural mass, set this parameter to inform the search engine which residues are modified. Residues are entered into this string as a space-delimited list of pairs. Each pair is of the form:

<AA residue character> <mod mass>

 

Thus, to treat cysteine as always being carbamidomethylated, this parameter would be set to something like the string:

“C 57”

StartSpectraScanNum

(integer, 0)

 

EndSpectraScanNum

(integer, -1)

A useful feature to focus a search on a subset of spectra in a particular data file, these two parameters can be set in order to limit the possible range of scan numbers that TagRecon will read from the input data files. By default, all tandem mass spectra in the input files are read in for processing.

StartProteinIndex

(integer, 0)

 

EndProteinIndex

(integer, -1)

A useful feature to focus a search on a subset of proteins in the protein database, these two parameters can be set in order to limit the range of proteins that TagRecon will read from the protein database. By default, all proteins in the protein database are read in for processing.

StatusUpdateFrequency

(real, 5 seconds)

Preprocessing spectra and scoring candidates may take a long time. A measure of progress through the protein database will be given on intervals that are specified by this parameter, measured in seconds.

TicCutoffPercentage

(real, 0.98)

In order to maximize the effectiveness of the MVH scoring algorithm, an important step in preprocessing the experimental spectra is filtering out noise peaks. Noise peaks are filtered out by sorting the original peaks in descending order of intensity, and then picking peaks from that list until the cumulative ion current of the picked peaks divided by the total ion current (TIC) is greater than or equal to this parameter. Lower percentages mean that less of the spectrum's total intensity will be allowed to pass through preprocessing. See the section on Advanced Usage for tips on how to use this parameter optimally.
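
The TIC cutoff procedure described above can be sketched directly. Peaks are assumed to be (m/z, intensity) pairs and the function name is hypothetical; this follows the description rather than TagRecon's source code.

```python
def tic_filter(peaks, cutoff=0.98):
    """Keep the most intense peaks until their share of the TIC reaches cutoff."""
    total = sum(intensity for _, intensity in peaks)
    if total <= 0:
        return []
    kept = []
    running = 0.0
    # Walk the peaks from most to least intense.
    for mz, intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
        kept.append((mz, intensity))
        running += intensity
        if running / total >= cutoff:
            break
    return sorted(kept)  # restore m/z order

peaks = [(100.0, 50), (200.0, 30), (300.0, 15), (400.0, 4), (500.0, 1)]
print(tic_filter(peaks, 0.95))  # [(100.0, 50), (200.0, 30), (300.0, 15)]
```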

ThreadCountMultiplier

(integer, 10)

TagRecon is designed to take advantage of (symmetric) multiprocessor systems by multithreading the database search. A search process on an SMP system will spawn one worker thread for each processing unit (where a processing unit can be either a core on a multi-core CPU or a separate CPU entirely). The main thread then generates a list of worker numbers which is equal to the number of worker threads multiplied by this parameter. The worker threads then take a worker number from the list and use that number to iterate through the protein list. It is possible that one thread will be assigned all the proteins that generate a few candidates while another thread is assigned all the proteins that generate many candidates, resulting in one thread finishing its searching early. By having each thread use multiple worker numbers, the chance of one thread being penalized for picking all the easy proteins is reduced because if it finishes early it can just pick a new number. The only disadvantage to this system is that picking the new number incurs some overhead because of synchronizing with the other worker threads that might be trying to pick a worker number at the same time. The default value is a nice compromise between incurring that overhead and minimizing wasted time.
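
The worker-number scheme can be illustrated with a small threaded sketch. How a worker number maps onto proteins is not spelled out above, so the strided mapping used here (worker number n visits proteins n, n + total, n + 2 × total, ...) is an assumption, as are all names in the code.

```python
import queue
import threading

def search(proteins, num_threads=2, multiplier=10):
    """Process every protein once using a shared pool of worker numbers."""
    total_numbers = num_threads * multiplier
    numbers = queue.Queue()
    for n in range(total_numbers):
        numbers.put(n)

    processed = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                n = numbers.get_nowait()  # pick a new worker number
            except queue.Empty:
                return                    # no numbers left: thread is done
            # Assumed mapping: stride through the protein list from offset n.
            for index in range(n, len(proteins), total_numbers):
                with lock:
                    processed.append(proteins[index])

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed

done = search(list(range(95)))
print(len(done))  # 95: every protein is visited exactly once
```

A thread that draws only cheap worker numbers simply returns to the queue sooner, which is the load-balancing effect the parameter is meant to provide.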

UnimodXML

(string, “unimod.xml”)

This parameter tells TagRecon where to find the unimod XML file.

UseAvgMassOfSequences

(boolean, true)

If true, the mass of candidate sequences will be calculated using the average masses of its amino acid residues. This parameter should be set based on whether the experimental data has precursor masses that are monoisotopic. For example, LCQ/LTQ-derived precursors are generally measured by average masses and FT-ICR/Orbitrap-derived precursors are generally measured by monoisotopic masses.

UseChargeStateFromMS

(boolean, false)

If true, TagRecon will use the charge state from the input data if it is available. If false, or if charge state is not available from a particular spectrum, TagRecon will use its internal algorithm to determine charge state. If, for a given spectrum, TagRecon uses its internal algorithm to determine charge state and the result is multiply charged, that spectrum may be duplicated to other charge states (see DuplicateSpectra for more information).

UseMultipleProcessors

(boolean, true)

If true, each process will use all the processing units available on the system it is running on.

UseNETAdjustment

(boolean, false)

If true, TagRecon adds a probabilistic bonus to peptide scores depending on whether the peptides are fully-enzymatic, semi-enzymatic, or non-enzymatic.

UseSmartPlusThreeModel

(boolean, true)

Once a candidate sequence has been generated from the protein database, TagRecon determines which spectra will be compared to the sequence. For each unique charge state of those spectra, a set of theoretical fragment ions is generated by one of several different algorithms.

 

For +1 and +2 precursors, a +1 b and y ion is always predicted at each peptide bond.

 

For +3 and higher precursors, the fragment ions predicted depend on the way this parameter is set. When this parameter is true, then for each peptide bond, an internal calculation is done to estimate the basicity of the b and y fragment sequence. The precursor's protons are distributed to those ions based on that calculation, with the more basic sequence generally getting more of the protons. For example, when this parameter is true, each peptide bond of a +3 precursor will either generate a +2 bi and a +1 yi ion, or a +1 bi and a +2 yi ion. For a +4 precursor, depending on basicity, a peptide bond breakage may result in a +1 bi and a +3 yi ion, a +2 bi and a +2 yi ion, or a +3 bi and a +1 yi ion. When this parameter is false, however, ALL possible charge distributions for the fragment ions are generated for every peptide bond. So a +3 sequence of length 10 will always have theoretical +1 y5, +2 y5, +1 b5, and +2 b5 ions.
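
The charge bookkeeping in this description can be enumerated with a short sketch. It lists the possible (b charge, y charge) splits; the basicity calculation that chooses one split per peptide bond is not reproduced here, and the function names are hypothetical.

```python
def fragment_charge_pairs(precursor_charge):
    """Possible (b charge, y charge) splits of the precursor's protons."""
    if precursor_charge <= 2:
        return [(1, 1)]  # +1 b and +1 y ions are always predicted
    return [(b, precursor_charge - b) for b in range(1, precursor_charge)]

def all_fragment_charges(precursor_charge):
    """Charges generated for every b and y ion when the smart model is off."""
    return list(range(1, max(precursor_charge - 1, 1) + 1))

print(fragment_charge_pairs(3))  # [(1, 2), (2, 1)]
print(fragment_charge_pairs(4))  # [(1, 3), (2, 2), (3, 1)]
print(all_fragment_charges(3))   # [1, 2]
```

With the smart model on, one pair is chosen per peptide bond; with it off, every b and y ion is predicted at each charge in the second list.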

UntaggedSpectraPrecMZTol

(real, 1.25 m/z)

TagRecon uses this variable when “SearchUntaggedSpectra” is set to true. A generated sequence candidate is only compared to an experimental spectrum if the candidate's mass is within this tolerance of the experimental spectrum’s precursor mass. The units (“daltons” or “ppm”) must be provided as well as the magnitude. The actual tolerance used for the search is calculated by multiplying the tolerance by the charge state, so this parameter should be set to the tolerance that is desired for +1 spectra. At the default value, the precursor mass tolerances are 1.25, 2.5, and 3.75 Da for the first three charge states, respectively.

 

 

Interpreting results

a)   Search-time output of TagRecon serves several purposes. Most of the output will usually be progress information, telling the user which part of the job TagRecon is currently working on and, in some cases, how far along that part is. There will be periodic updates when TagRecon is preprocessing spectra and when it is generating candidates from the database and comparing those candidates against the spectra. In a multi-process (MPI) job, there will also be progress information on bulk transfers of data over the network. Additionally, TagRecon will display statistics on the spectra that remain after preprocessing, specifically the average number of peaks in a spectrum before and after preprocessing. Also provided is the average percentage of peaks that were filtered out by the preprocessing step. Finally, in the case of an MPI job, when the database search is complete each node that took part in the search will display statistics detailing the work that node did. The lines will be like:

Process #1 (foohost) stats: <numBatches> / <numProteins> / <numCandidatesGenerated> / <numCandidatesSearched> / <numComparisonsDone>
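
If these per-node lines need to be collected from a log, a regular expression along the following lines may be enough. The pattern assumes the five fields are printed as plain integers, and the sample numbers are made up for illustration.

```python
import re

STATS_LINE = re.compile(
    r"Process #(\d+) \((.+?)\) stats: (\d+) / (\d+) / (\d+) / (\d+) / (\d+)"
)
FIELDS = ("numBatches", "numProteins", "numCandidatesGenerated",
          "numCandidatesSearched", "numComparisonsDone")

def parse_stats(line):
    """Return a dict of the per-node statistics, or None if the line differs."""
    match = STATS_LINE.search(line)
    if match is None:
        return None
    stats = {"process": int(match.group(1)), "host": match.group(2)}
    for name, value in zip(FIELDS, match.groups()[2:]):
        stats[name] = int(value)
    return stats

# Made-up example values, not real TagRecon output.
sample = "Process #1 (foohost) stats: 50 / 1000 / 250000 / 240000 / 900000"
print(parse_stats(sample))
```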

 

b)   One pepXML (a database search output format originally developed at the SPC) file is produced for every input spectra file that a TagRecon job searches. The file contains an entry for each spectrum kept during the search (i.e. only the spectra that were not obviously junk) that was compared to any candidate. Only the best MaxResultRank (a config parameter, defaulting to 2) ranks of matches for each spectrum are kept and output in the file. The pepXML format is quite versatile, but does have a few limitations. It is not possible to fully encode TagRecon’s CleavageRules variable if multi-residue motifs are used (in the rule “[|[M|K|R .” the “[M” part is a multi-residue motif), and it is not possible to fully encode Static/DynamicMods if multi-residue motifs are used (in the mod “(Q ^ -17” the “(Q” part is a multi-residue motif). We have proposed extensions to pepXML to support these motif-oriented capabilities.

 

c)    To validate results generated by TagRecon (and also several other popular search engines), users can pass pepXML files to IDPickerQonvert (in preparation for analyzing the results in the IDPicker suite). This approach only works when the protein database that was searched included distracter (decoy) proteins with a common prefix in their name that is not a prefix in any of the valid proteins. Most often, the database that was searched will include proteins in their forward and reverse orders, with the reversed protein having a prefix added to the name like “rev_”.