Pepitome 1.0

For MS-MS Spectral Library Search

 

 

Table of Contents:

I.             Introduction

II.           Usage

a.    Basic

b.    Specifying static or dynamic mass modifications

c.    Configuration parameters guide

III.          Interpreting results

a.    Search-time output

b.    PepXML format guide

c.    Validation

 

 

Introduction

Pepitome is a tool that compares experimental spectra from shotgun proteomics experiments against a library of pre-identified spectra. The program is multi-threaded and distributes the work among its worker threads. It produces probabilistic scores for each spectrum-spectrum match rather than traditional dot product scores. For each spectrum, Pepitome keeps a user-defined number of the highest-scoring candidate sequences. The results are written to files in standard identification formats.

 

Usage

a)  The basic usage of Pepitome is quite simple:

pepitome [flags] -SpectralLibrary <SPTXT formatted spectral library> [-ProteinDatabase <FASTA protein database>] <MS/MS data filepath in a supported file format> <another MS/MS data filepath>

 

When running it from the command line, the command line parser first determines what flags you have specified in the command. The flags can be anywhere on the command line. The following basic flags are supported:

            -cfg <file>          specifies a runtime configuration file [default: Pepitome.cfg]

            -workdir <path>      specifies a path to use as the working directory during execution [default: current working directory]

            -cpus <integer>      specifies the number of worker threads to use during search [default: all available processors]

 

If a flag that expects an argument is specified without one, it might be treated as a spectrum data file, which is probably undesirable. If you do not specify a runtime configuration file with -cfg and the default configuration file is not found, the built-in default values are used (and a warning that no configuration file was found is shown).

 

A second type of flag follows a different pattern: the override flags. Instead of having a name like cfg, an override flag has the same name as the variable that it overrides. Overriding a variable means specifying a different value on the command line than the one in the configuration file (just as the configuration file overrides the built-in values). For example, to override the variable DynamicMods to have the value “M @ 16”, use the override flag:

      -DynamicMods "M @ 16"

 

The double quotes are necessary on the command line because the value of the variable has spaces in it.
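
The precedence described above (built-in values, then the configuration file, then override flags) can be pictured as a layered lookup. The following Python sketch is only an illustration of that ordering: the function name and the three dictionaries are hypothetical, and it does not read a real Pepitome configuration file.

```python
def resolve_parameters(builtin, from_cfg, from_cli):
    """Merge three parameter layers; later layers win."""
    resolved = dict(builtin)      # lowest priority: built-in values
    resolved.update(from_cfg)     # values read from the configuration file
    resolved.update(from_cli)     # highest priority: override flags
    return resolved


# Illustrative values only (defaults taken from the parameter guide).
builtin = {"MaxResultRank": 5, "MaxDynamicMods": 2}
from_cfg = {"MaxResultRank": 3}
from_cli = {"DynamicMods": "M @ 16"}

params = resolve_parameters(builtin, from_cfg, from_cli)
print(params)
```

Here MaxResultRank comes from the configuration file, MaxDynamicMods keeps its built-in value, and DynamicMods comes from the command line.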

 

After the flags are parsed, the file arguments are processed. The first argument is usually “-SpectralLibrary” followed by the relative or absolute path to the library file you want to search against. The next argument can be “-ProteinDatabase” followed by the relative or absolute path to the FASTA-formatted protein database. Every file argument after that is a relative or absolute path to an MS/MS spectra data file from which to extract experimental spectra. The FASTA database and spectral library filepaths must be valid filepaths; they do not support wildcards. The MS/MS spectra data filepaths, however, do support wildcards. The provided spectra can be in any of the formats that ProteoWizard MSData supports. See the ProteoWizard documentation for a list of supported formats.
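
When Pepitome is launched from a script, the argument order described above can be assembled programmatically. The sketch below only builds and prints an argument list; it runs nothing. The executable name "pepitome", the helper name, and all file names are assumptions made for illustration.

```python
import shlex


def build_pepitome_command(library, spectra, protein_db=None, overrides=None,
                           executable="pepitome"):
    """Return the argument list for one Pepitome run (nothing is executed)."""
    args = [executable]
    # Override flags share the name of the variable they override.
    for name, value in (overrides or {}).items():
        args += ["-" + name, str(value)]
    args += ["-SpectralLibrary", library]
    if protein_db is not None:
        args += ["-ProteinDatabase", protein_db]
    args += list(spectra)
    return args


cmd = build_pepitome_command(
    "library.sptxt",                      # hypothetical file names
    ["run1.mzML", "run2.mzML"],
    protein_db="proteins.fasta",
    overrides={"DynamicMods": "M @ 16"},
)
# shlex.join quotes the value that contains spaces for a POSIX shell.
print(shlex.join(cmd))
```

Passing the list itself to a process launcher avoids the quoting problem entirely, because each list element is delivered as one argument.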

 

b)   A static mass modification is something like carbamidomethylation of cysteines, where all cysteines should be treated as about +57 in Pepitome and all subsequent downstream analysis. Refer to the StaticMods variable in the configuration parameters guide. A dynamic mass modification is something like a potential oxidation of methionine, where each methionine may occur with either its natural mass or about +16. Refer to the DynamicMods variable in the configuration parameters guide.


 

c)   Configuration parameters guide

Name

(Type, Default Value)

Description

NumChargeStates

(integer, 3)

Controls the number of charge states that Pepitome will handle during all stages of the program. It is especially important during determination of charge state (see DuplicateSpectra for more information).

OutputSuffix

(string, none)

The output of a Pepitome job will be an identification file for each input file. The string specified by this parameter will be appended to each output filename. It is useful for differentiating jobs within a single directory.

OutputFormat

(string, “pepXML”)

Pepitome can write identifications in either “mzIdentML” or “pepXML” format.

SpectralLibrary

(string, none)

Specifies the SPTXT formatted spectral library to be searched. The software indexes the library upon its first use. The indexed library has “.index” appended to the original filename. The program can accept the .index library for subsequent use.

ProteinDatabase

(string, none)

Specifies the FASTA protein database to be searched.

DecoyPrefix

(string, “DECOY_”)

Specifying a decoy prefix enables Pepitome to know if it is making a target or a decoy comparison for each PSM. If the spectral library has peptides that begin with DecoyPrefix, then those peptides are decoys. These decoys are generally created with the SpectraST software from ISB. Follow this link for information about how to create a decoy library from NIST libraries. Pepitome expects both target and decoy peptides to be in the same library.

SpectrumListFilters

(string, “peakPicking false 2-”)

A semicolon-delimited list of filters applied to spectra as they are read in. Supported filters are defined by ProteoWizard:

 

Filter Name

Definition

Arguments

index

Filters spectra by position in the spectrum list

int_set

msLevel

Filters spectra by MS level

int_set

scanNumber

Filters spectra by scan number or by index+1

int_set

scanEvent

Filters spectra by scan event

int_set

scanTime

Filters spectra by scan start time

[scanTimeLow,scanTimeHigh]

mzPrecursors

Filters spectra by precursor m/z

[mz1, mz2, … mzN]

defaultArrayLength

Filters spectra by number of primary data points

int_set

activation

Filters spectra by activation type

<ETD|CID|SA|HCD|BIRD|ECD|IRMPD|PD|PSD|PQD|SID|SORI>

analyzer

Filters spectra by analyzer type

<quad|orbi|FT|IT|TOF>

polarity

Filters spectra by scan polarity

<positive|negative|+|->

peakPicking

Replaces profile peaks with centroided peaks

<prefer_vendor>:boolean(true) <msLevels>:int_set

threshold

Filters spectrum data points by intensity

<count|count-after-ties|absolute|bpi-relative|tic-relative|tic-cutoff> <threshold> <most-intense|least-intense> [<msLevels>:int_set]

mzWindow

Filters spectrum data points by m/z

[mzLow, mzHigh]

MS2Denoise

Applies a moving window filter to MS2 spectra

<window peak count>:int(6)

<window m/z width>:int(30)

<multicharge relaxation>:bool(true)

MS2Deisotope

Deisotopes MS2 spectra using Markey method

 

ETDFilter

Filters ETD MSn spectrum data points, removing unreacted precursors, charge-reduced precursors, and neutral losses

<remove precursor>:bool(true)

<remove charge-reduced>:bool(true)

<remove neutral losses>:bool(true)

<blanket removal>:bool(false)

<matching tolerance>:real(3) <PPM|MZ>

chargeStatePredictor

Predicts MSn spectrum precursors to be singly or multiply charged depending on the ratio of intensity above and below the precursor m/z

<override existing charge>:bool(false)

<max. multiple charge>:int(3)

<min. multiple charge>:int(2)

<TIC fraction threshold>:real(0.9)

 

'int_set' means that a set of integers must be specified, as a list of intervals of the form [a,b] or a[-][b]

 

If no chargeStatePredictor is specified, a default one will be added like: “chargeStatePredictor false <NumChargeStates> 2 0.9”
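
The 'int_set' notation described above can be read mechanically. The following Python sketch parses the two interval forms, [a,b] and a[-][b]. It is an illustration and not ProteoWizard's own parser: the function name is hypothetical, and it assumes that several intervals are separated by spaces and that no spaces appear inside the brackets.

```python
import re


def parse_int_set(text):
    """Parse an int_set string into a list of (low, high) intervals.

    high is None for an open-ended interval such as "2-".
    """
    intervals = []
    for token in text.split():
        bracket = re.fullmatch(r"\[(-?\d+),(-?\d+)\]", token)
        dashed = re.fullmatch(r"(\d+)(?:(-)(\d*))?", token)
        if bracket:                          # "[a,b]"
            low, high = int(bracket.group(1)), int(bracket.group(2))
        elif dashed:
            low = int(dashed.group(1))
            if dashed.group(2) is None:      # "a": a single value
                high = low
            elif dashed.group(3):            # "a-b": a closed range
                high = int(dashed.group(3))
            else:                            # "a-": open-ended
                high = None
        else:
            raise ValueError(f"not an int_set interval: {token!r}")
        intervals.append((low, high))
    return intervals


print(parse_int_set("2-"))
print(parse_int_set("[-1,2]"))
```

Under this reading, the “2-” in the default SpectrumListFilters value selects MS level 2 and above.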

TicCutoffPercentage

(real, 0.98)

In order to maximize the effectiveness of the scoring algorithms, an important step in preprocessing the experimental spectra is filtering out noise peaks. Noise peaks are filtered out by sorting the original peaks in descending order of intensity, and then picking peaks from that list until the cumulative ion current of the picked peaks divided by the total ion current (TIC) is greater than or equal to this parameter. Lower percentages mean that less of the spectrum’s total intensity will be allowed to pass through preprocessing. See the section on Advanced Usage for tips on how to use this parameter optimally.
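
The TIC filtering step described above can be sketched in a few lines of Python. This is a minimal illustration of the stated procedure, not Pepitome's source code; the function name and the peak representation (a list of (m/z, intensity) pairs) are assumptions.

```python
def filter_by_tic(peaks, tic_cutoff=0.98):
    """Keep the most intense peaks until they reach tic_cutoff of the TIC.

    peaks is a list of (mz, intensity) pairs.
    """
    total = sum(intensity for _, intensity in peaks)
    if total <= 0:
        return []
    kept, running = [], 0.0
    # Walk the peaks from most to least intense.
    for mz, intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
        kept.append((mz, intensity))
        running += intensity
        if running / total >= tic_cutoff:
            break
    return sorted(kept)  # restore m/z order


peaks = [(100.0, 50.0), (200.0, 30.0), (300.0, 15.0), (400.0, 4.0), (500.0, 1.0)]
print(filter_by_tic(peaks, 0.9))
```

With a cutoff of 0.9 the three most intense peaks carry 95% of the total ion current, so the two weakest peaks are dropped. At the default of 0.98 only the weakest peak is removed.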

LibTicCutoffPercentage

(real, 0.98)

This parameter is the same as TicCutoffPercentage, but it is applied to the library spectra.

MaxPeakCount

(integer, 150)

Filters out all peaks except the MaxPeakCount most intense peaks from the experimental spectra.

LibMaxPeakCount

(integer, 100)

Filters out all peaks except the LibMaxPeakCount most intense peaks from the library spectrum.
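
Both peak count limits amount to keeping the N most intense peaks. A minimal Python sketch follows, assuming peaks are (m/z, intensity) pairs; the function name is hypothetical and tie handling in Pepitome itself may differ.

```python
import heapq


def keep_most_intense(peaks, max_peak_count):
    """Return the max_peak_count most intense peaks, in m/z order."""
    top = heapq.nlargest(max_peak_count, peaks, key=lambda p: p[1])
    return sorted(top)


spectrum = [(100.0, 5.0), (150.0, 40.0), (200.0, 12.0), (250.0, 90.0)]
print(keep_most_intense(spectrum, 2))
```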

AvgPrecursorMzTolerance

(real, 1.5 m/z)

A library spectrum is only compared to an experimental spectrum if the candidate’s mass is within this tolerance of the experimental spectrum’s precursor mass. The units (“daltons” or “ppm”) must be provided as well as the magnitude. The actual tolerance used for the search is calculated by multiplying the tolerance by the charge state, so this parameter should be set to the tolerance that is desired for +1 spectra. At the default value, the precursor mass tolerances are 1.5, 3, and 4.5 Da for the first three charge states, respectively.

MonoPrecursorMzTolerance

(real, 10 ppm)

A library spectrum is only compared to an experimental spectrum if the candidate’s mass is within this tolerance of the experimental spectrum’s precursor mass. The units (“daltons” or “ppm”) must be provided as well as the magnitude. The actual tolerance used for the search is calculated by multiplying the tolerance by the charge state, so this parameter should be set to the tolerance that is desired for +1 spectra. At the default value, the precursor mass tolerances are 10, 20, and 30 ppm for the first three charge states, respectively.
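
The charge scaling described for both precursor tolerances is simple arithmetic, sketched below in Python. The function names are hypothetical, and treating a ppm tolerance as relative to the observed value is an assumption of this sketch rather than something this guide specifies.

```python
def precursor_tolerance(magnitude, charge):
    """Scale the +1 tolerance by the charge state, as described above."""
    return magnitude * charge


def within_tolerance(observed, candidate, magnitude, units, charge):
    """True if candidate lies within the charge-scaled tolerance of observed."""
    tolerance = precursor_tolerance(magnitude, charge)
    if units == "ppm":
        # Assumption: parts per million are taken relative to the observed value.
        tolerance = observed * tolerance / 1e6
    return abs(observed - candidate) <= tolerance


# Default average-mass tolerance for charge states 1 to 3.
print([precursor_tolerance(1.5, z) for z in (1, 2, 3)])
```

For the default monoisotopic setting of 10 ppm, a +2 precursor observed at 1000.0 would accept a candidate at 1000.015 (20 ppm window), while the same pair at +1 (10 ppm window) would be rejected.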

MonoisotopeAdjustmentSet

(integer set, [-1,2])

Sometimes a mass spectrometer will pick the wrong isotope as the monoisotope of an eluting peptide. When using narrow tolerances for monoisotopic precursors, this can cause identifiable spectra to be missed. This parameter defines a set of isotopes (0 being the instrument-called monoisotope) to try as the monoisotopic precursor m/z. To disable this technique, set the value to “0”.
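
One way to picture the adjustment set is as a list of alternative precursor m/z values, one per isotope offset. The Python sketch below is a guess at the arithmetic, not Pepitome's implementation: the isotope spacing constant, the sign convention (offset k shifts the m/z by k spacings divided by the charge), and the function name are all assumptions.

```python
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference in Da (assumption)


def candidate_precursor_mzs(precursor_mz, charge, adjustments=(-1, 0, 1, 2)):
    """Return one candidate monoisotopic m/z per isotope adjustment.

    The default adjustments expand the interval [-1,2] from the guide.
    """
    return [precursor_mz + k * ISOTOPE_SPACING / charge for k in adjustments]


for mz in candidate_precursor_mzs(500.0, 2):
    print(round(mz, 4))
```

Setting the parameter to “0” corresponds to adjustments=(0,), which leaves the instrument-called value as the only candidate.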

PrecursorMzToleranceRule

(string, “auto”)

This parameter controls the automatic selection of precursor mass type. For data from Thermo instruments, using the “auto” setting on a RAW, mzML, or mz5 file will automatically choose monoisotopic or average mass values (and the corresponding precursor tolerance). For other instruments or older data formats, the “mono” or “avg” tolerance should be set explicitly.

FragmentMzTolerance

(real, 0.5 m/z)

This parameter controls how much tolerance there is on each side of the library m/z when looking for an ion fragment peak during candidate scoring. The units (“daltons” or “ppm”) must be provided as well as the magnitude.
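
The tolerance window around each library m/z can be illustrated with a small peak lookup. This Python sketch only counts library peaks that have an experimental peak within the window, using an absolute m/z tolerance; it is not Pepitome's scoring function, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def count_matched_peaks(library_mzs, experimental_mzs, tolerance=0.5):
    """Count library peaks with an experimental peak within +/- tolerance m/z."""
    experimental = sorted(experimental_mzs)
    matched = 0
    for mz in library_mzs:
        # Indices bounding the experimental peaks inside the window.
        lo = bisect_left(experimental, mz - tolerance)
        hi = bisect_right(experimental, mz + tolerance)
        if hi > lo:
            matched += 1
    return matched


print(count_matched_peaks([100.0, 200.0, 300.0], [100.3, 250.0, 300.6]))
```

Only the library peak at 100.0 finds a partner here: 100.3 lies inside its window, while 300.6 falls just outside the window around 300.0.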

CleanLibSpectra

(boolean, “true”)

This parameter removes precursors, precursor neutral losses, and isotopic peaks from the library spectrum.

StaticMods

(string, none)

If a residue (or multiple residues) should always be treated as having a modification on their natural mass, set this parameter to inform the search engine which residues are modified. Residues are entered into this string as a space-delimited list of pairs. Each pair is of the form:

<AA residue character> <mod mass>

 

Thus, to treat cysteine as always being carbamidomethylated, this parameter would be set to something like the string “C 57”. The search engine uses this to filter out candidates that violate the static modification rule.
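
The pair format of StaticMods is easy to read programmatically. The following Python sketch splits the string into residue and mass pairs; the function name and the second pair in the example (“K 8”) are hypothetical, and the sketch does no validation beyond checking that the tokens come in pairs.

```python
def parse_static_mods(text):
    """Parse a string such as "C 57 K 8" into {residue: mass delta}."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("StaticMods must be a list of residue/mass pairs")
    return {residue: float(mass)
            for residue, mass in zip(tokens[::2], tokens[1::2])}


print(parse_static_mods("C 57"))
print(parse_static_mods("C 57 K 8"))
```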

MaxDynamicMods

(integer, 2)

This parameter sets the maximum number of modified residues that may be in any candidate match.

MaxResultRank

(integer, 5)

This parameter sets the maximum rank of spectrum-spectrum matches to report for each spectrum. A rank comprises all PSMs that score the same (common for isobaric residues and ambiguous modification localization). Pepitome may report extra ranks in order to ensure that the top target match and top decoy match from each digestion specificity (full, semi, non) are reported.

CleavageRules

(string, “Trypsin/P”)

This important parameter allows the user to control the peptides that are reported by a Pepitome search. Library identifications that do not conform to the CleavageRules will be discarded. It can be used to configure the search for tryptic peptides only, non-tryptics, or anything in between. It can even be used to test multiple residue motifs at a potential cleavage site. This parameter describes which amino acids are valid on the N and C termini of a digestion site. The parameter is specified in PSI-MS regular expression syntax (a limited Perl regular expression syntax). Pepitome can recognize the following protease names and automatically use the corresponding regular expression for this parameter.

 

         Protease names:

-       “Trypsin” (allows for cut after K or R)

-       “Trypsin/P” (normal trypsin cut, disallows cutting when the site is before a proline)

-       “Chymotrypsin” (allows cut after F, Y, W, L; disallows cutting before proline)

-       “TrypChymo” (combines “Trypsin/P” and “Chymotrypsin” cleavage rules)

-       “Lys-C”

-       “Lys-C/P” (Lys-C, disallowing cutting before proline)

-       “Asp-N”

-       “PepsinA” (cuts right after F, L)

-       “CNBr” (Cyanogen bromide)

-       “Formic_acid” (Formic acid)

-       “NoEnzyme” (not supported; use the proper enzyme and set MinTerminiCleavages to 0)

 

A complete list of supported protease names can be found here.

 

Note: CleavageRules can also work with an earlier but deprecated regular expression syntax. We highly discourage users from using the old syntax. Briefly, the old syntax is a space-delimited list of cleavage rules, where each cleavage rule itself is a space-delimited pair of strings. The first string of the cleavage rule specifies the residue or residues that must be N-terminal to a potential cleavage site. The second string specifies the residue or residues that must be C-terminal to the site. Either string in the pair can contain multiple sequences of one or more residues, separated by the ‘|’ character. A ‘.’ character is a wildcard that will accept anything. Additionally, the ‘[’ and ‘]’ characters refer to the N and C termini of a protein.

 

Now that you are thoroughly confused, here are examples of single cleavage rules:

R .                               a site is valid for cleavage if the N-terminal residue is R

R|K .                           a site is valid for cleavage if the N-terminal residue is R or K

[ .                                 a site is valid for cleavage at the N terminus of a protein

. ]                                 a site is valid for cleavage at the C terminus of a protein

 

The “.” wildcard is important because it allows the cleavage routine to work very quickly. However, if you wanted to leave out proline from a tryptic digest, you would have to explicitly declare the valid residues for both sides of a cleavage site:

R|K A|C|D|E|F|G|H|I|K|L|M|N|Q|R|S|T|V|W|Y

 

Remember that this parameter is a list of cleavage rules; a real tryptic digest can be declared with two cleavage rules:

[|R|K . . ]                    a site is valid for cleavage if it is at the N terminus of a protein,

                                    or if the N-terminal residue is R or K; a site is valid for cleavage

                                    if it is at the C terminus of a protein

 

Also note that a cleavage rule can have a residue string of more than one residue, allowing for multiple-residue cleavage motifs:

[M|[ .                           a site is valid for cleavage if it is at the N terminus of a protein,

                                    or if the N-terminal sequence of residues is [M (i.e. the M must

                                    be at the N terminus of a protein)

MinTerminiCleavages

(integer, 2)

By default, when reporting peptides from a library search, a peptide must start and end at a valid cleavage site. Setting this parameter to 0 or 1 relaxes that requirement, so that neither terminus (0) or only one terminus (1) of the peptide must match one of the cleavage rules specified in the CleavageRules parameter. This parameter is useful for reporting semi-tryptic peptides.

MaxMissedCleavages

(integer, -1)

By default, when reporting peptides from the library search, a peptide may contain any number of missed cleavages. A missed cleavage is a site within the peptide that matches one of the cleavage rules (refer to CleavageRules). Setting this parameter to some other number prevents a peptide from being reported if it contains more than the specified number of missed cleavages.
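
The interplay of CleavageRules, MinTerminiCleavages, and MaxMissedCleavages can be sketched with a zero-width regular expression that marks cleavage sites. The Python code below is an illustration under stated assumptions: the rule shown (cut after K or R unless the next residue is P) is an example expression and is not claimed to be what Pepitome maps to any particular protease name, and the function names and the protein sequence are hypothetical.

```python
import re

# Illustrative rule: cut after K or R unless the next residue is P.
CLEAVAGE_RULE = re.compile(r"(?<=[KR])(?!P)")


def cleavage_sites(protein, rule=CLEAVAGE_RULE):
    """Offsets where the rule allows a cut, plus both protein termini."""
    internal = {m.start() for m in rule.finditer(protein)}
    return internal | {0, len(protein)}


def describe_peptide(protein, start, end, rule=CLEAVAGE_RULE):
    """Return (termini_cleavages, missed_cleavages) for protein[start:end]."""
    sites = cleavage_sites(protein, rule)
    termini = (start in sites) + (end in sites)
    missed = sum(1 for s in sites if start < s < end)
    return termini, missed


def is_reportable(protein, start, end, min_termini=2, max_missed=-1):
    """Apply the two thresholds; max_missed of -1 means no limit."""
    termini, missed = describe_peptide(protein, start, end)
    if termini < min_termini:
        return False
    return max_missed < 0 or missed <= max_missed


protein = "MKAPRPSTKLLR"
print(sorted(cleavage_sites(protein)))
print(describe_peptide(protein, 2, 12))   # peptide APRPSTKLLR
```

In this example the R at position 5 is followed by P, so no site is placed after it. The peptide APRPSTKLLR has two valid termini and one missed cleavage, so it passes the defaults but would be rejected with max_missed=0.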

MinPeptideMass

(real, 0 Da)

When preprocessing the experimental spectra, any spectrum with a precursor mass that is less than the specified mass will be disqualified. This parameter is useful to eliminate inherently unidentifiable spectra from an input data set. A setting of 500, for example, will eliminate most 3-residue matches and clean up the output file quite a lot.

MaxPeptideMass

(real, 10000 Da)

When preprocessing the experimental spectra, any spectrum with a precursor mass that exceeds the specified mass will be disqualified.

MinPeptideLength

(integer, 5)

When digesting proteins, any peptide which does not meet or exceed the specified length will be disqualified.

MaxPeptideLength

(integer, 75)

When digesting proteins, any peptide which exceeds this specified length will be disqualified.

StatusUpdateFrequency

(real, 5 seconds)

Preprocessing spectra and scoring candidates may take a long time. A measure of progress through the protein database will be given on intervals that are specified by this parameter, measured in seconds.

RecalculateLibPepMasses

(boolean, true)

If true, all the library peptide masses are recalculated.

NumIntensityClasses

(integer, 3)

Before scoring any candidates, experimental spectra have their peaks stratified into the number of intensity classes specified by this parameter. Spectra that are very dense in peaks will likely benefit from more intensity classes in order to best take advantage of the variation in peak intensities. Spectra that are very sparse will not see much benefit from using many intensity classes.

ClassSizeMultiplier

(real, 2)

When stratifying peaks into a specified, fixed number of intensity classes, this parameter controls the size of each class relative to the class above it (where the peaks are more intense). At default values, if the best class, A, has 1 peak in it, then class B will have 2 peaks in it and class C will have 4 peaks.
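
The class sizes described above follow a geometric progression. The Python sketch below shows one way to turn NumIntensityClasses and ClassSizeMultiplier into class sizes and to assign peaks to classes. It is an illustration only: the function names are hypothetical, and the rounding used when the peak count does not divide evenly is an assumption, since this guide does not say how Pepitome handles that case.

```python
def class_sizes(peak_count, num_classes=3, multiplier=2.0):
    """Split peak_count into classes whose sizes grow by multiplier."""
    weights = [multiplier ** i for i in range(num_classes)]
    total = sum(weights)
    sizes, assigned, cumulative = [], 0, 0.0
    for weight in weights:
        cumulative += weight
        # Assumption: class boundaries are rounded to whole peaks.
        boundary = round(peak_count * cumulative / total)
        sizes.append(boundary - assigned)
        assigned = boundary
    return sizes


def stratify(peaks, num_classes=3, multiplier=2.0):
    """Assign (mz, intensity) peaks to classes; class 0 is the most intense."""
    ranked = sorted(peaks, key=lambda p: p[1], reverse=True)
    classes, start = [], 0
    for size in class_sizes(len(ranked), num_classes, multiplier):
        classes.append(ranked[start:start + size])
        start += size
    return classes


print(class_sizes(7))
```

For 7 peaks at the default values, the sizes reproduce the 1, 2, 4 pattern given in the description.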

FASTARefreshResults

(boolean, true)

This parameter will refresh all the peptide-protein associations in the library against the given protein database (provided with the -ProteinDatabase option).

UseMultipleProcessors

(boolean, true)

If true, each process will use all the processing units available on the system it is running on.

 

 

Interpreting results

a)   Search-time output of Pepitome serves several purposes. The majority of the output will usually be progress information, telling the user which part of the job Pepitome is currently working on, and in some cases how far along that part of the job is. There will be periodic updates when Pepitome is preprocessing spectra and when it is processing library spectra and comparing those library candidates against the experimental spectra. Additionally, Pepitome will display statistics on the spectra that remain after preprocessing, specifically the average number of peaks in a spectrum before and after preprocessing. Also provided is the average percentage of peaks that were filtered out by the preprocessing step.

 

b)   One pepXML file (pepXML is a database search output format originally developed at the SPC) is produced for every input spectra file that a Pepitome job searches. The file contains an entry for each spectrum kept during the search (i.e. only the spectra that were not obviously junk) that was compared to any candidate. Only the best matches for each spectrum (up to MaxResultRank, a configuration parameter that defaults to 5) are kept and written to the file. The pepXML format is quite versatile, but does have a few limitations. It is not possible to fully encode Pepitome’s CleavageRules variable if multi-residue motifs are used (in the rule “[|[M|K|R .” the “[M” part is a multi-residue motif).

 

c)    To validate results generated by Pepitome (and also several other popular search engines), users can pass pepXML files to IDPickerQonvert (in preparation for analyzing the results in the IDPicker suite). This approach only works when the spectral library contains distractor (decoy) spectra with a common prefix in their names that is not a prefix of any of the valid spectra. Most often, the library that was searched will include normal spectra and shuffled spectra, with a decoy prefix such as “DECOY_” added to the protein names of the shuffled spectra.