Annotating Genomes Using Proteomics Data
Andy Jones, Department of Preclinical Veterinary Science
[Figure labels: Genomic DNA, mRNA]
• Genome annotation:
– Find start codons / transcriptional initiation
– Recognise splice acceptor and donor sequences
– Identify stop codons
– Predict alternative splicing...
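The start and stop codon steps above can be illustrated with a minimal open reading frame (ORF) scan. This is only a sketch under simplifying assumptions: it looks at the forward strand, ignores splicing entirely, and the function name `find_orfs` and the `min_codons` parameter are hypothetical, not part of any gene-finding package.

```python
# Minimal sketch: scan the forward strand for ATG...stop open reading frames.
# Assumption: no introns, forward strand only. Real gene finders also model
# splice signals and use trained statistical models.

START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the 0-based index of the ATG, end is the index just past the
    stop codon. Only the first ATG in each open stretch is reported.
    min_codons counts codons from the ATG up to (not including) the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # candidate translation initiation site
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG"))
```

In practice the candidate ORFs from a scan like this would only be a starting point, since eukaryotic genes need the splice donor and acceptor sites resolved as well.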
Computational gene prediction
• De novo prediction – single genome
– Trained on “typical” gene structures to learn exon-intron
signals and translation initiation and termination signals, e.g.
with Markov models
– Many candidate predictions are scored against a training set of
known genes
• Multiple genomes
– Compare with confirmed gene sequences from other species
– Coding regions are more highly conserved, so conservation
indicates gene position
– Pattern searching: higher mutation rate at bases separated
by multiples of three (mutations in the 3rd position of codons are
often silent)
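The period-three pattern described above can be checked on a pair of aligned sequences by counting mismatches at each codon position. The sketch below assumes an ungapped alignment of equal-length sequences and a known reading frame; the function name `mismatches_by_codon_position` is hypothetical.

```python
# Minimal sketch: count mismatches by codon position between two aligned,
# ungapped sequences. An excess at position 3 is consistent with a coding
# region, because many third-position changes are silent.


def mismatches_by_codon_position(seq_a, seq_b, frame=0):
    """Return [n1, n2, n3]: mismatch counts at codon positions 1, 2 and 3."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    counts = [0, 0, 0]
    for i in range(frame, len(seq_a)):
        if seq_a[i] != seq_b[i]:
            counts[(i - frame) % 3] += 1
    return counts


if __name__ == "__main__":
    # GCT->GCC (Ala), AAA->AAG (Lys), GGT->GGA (Gly): all silent changes
    print(mismatches_by_codon_position("ATGGCTAAAGGT", "ATGGCCAAGGGA"))
```

A real comparative gene finder would work on gapped multi-species alignments and test all three frames on both strands; this only shows the signal being looked for.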
Study aims:
• Identify as many components of the
proteome as possible
• Relate peptide sequence data back to
genome to confirm genes
• Relate protein expression data to
transcriptional data (EST / microarray)
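Relating a peptide sequence back to the genome can be sketched as a six-frame translation followed by a substring search. This is an illustration only: it assumes the standard genetic code, an unspliced region (a peptide spanning a splice junction would not be found), and the helper names `translate` and `locate_peptide` are hypothetical.

```python
# Minimal sketch: locate a peptide in the six-frame translation of a DNA
# sequence. Assumes the standard genetic code and no introns.
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codons enumerated as TTT, TTC, TTA, TTG, TCT, ... to match AMINO.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate from the first base; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def locate_peptide(genome, peptide):
    """Return (strand, frame, residue_offset) for every match of peptide."""
    genome = genome.upper()
    hits = []
    for strand, dna in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(dna[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                hits.append((strand, frame, pos))
                pos = protein.find(peptide, pos + 1)
    return hits


if __name__ == "__main__":
    print(locate_peptide("CCATGGCTAAATAGG", "MAK"))
```

Each hit could then be compared with the coordinates of annotated gene models to decide whether the peptide supports an existing model or falls outside one.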
[Figure: sample processing workflow. Labels: 1D gel electrophoresis, cut bands, trypsin digestion; 2D gel electrophoresis; fractions, trypsin digestion; mass spectrometry]
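The trypsin digestion step has a simple computational counterpart, used when predicted proteins are turned into candidate peptides for matching against spectra. The sketch below applies the commonly used rule (cleave after K or R unless the next residue is P) and assumes no missed cleavages; the function name `trypsin_digest` is hypothetical.

```python
# Minimal sketch: in silico trypsin digestion.
# Rule assumed: cleave C-terminal to K or R, except when followed by P.
# Missed cleavages are not modelled here.


def trypsin_digest(protein, min_length=1):
    """Return the list of fully cleaved tryptic peptides, in order."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return [p for p in peptides if len(p) >= min_length]


if __name__ == "__main__":
    print(trypsin_digest("MKRPAKGR"))
```

Search engines typically also allow one or two missed cleavages and filter peptides by mass range, which this sketch leaves out.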
ToxoDB
1. Re-querying pipeline
– Each time gene models change, all mass spectra are automatically re-queried
2. Integrate peptide evidence directly into gene finding
software
3. Maximising the number of informative mass spectra
4. Attempt to optimise algorithms for de novo sequencing of
peptides
5. N-terminal proteomics
– Could be used to confirm the gene initiation point
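[Figure: three-stage pipeline relating spectra to gene models
Stage 1: spectra searched against the official gene set with multiple database search engines; confirms official gene models
Stage 2: genome sequence passed to a gene finder to give alternative gene models, searched with multiple database search engines; promotes alternative models
Stage 3: modified de novo algorithms; novel ORFs and splice junctions]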
Proteomic evidence
Peptide identifications
[Figure: Omssa, X!Tandem and Mascot each produce a peptide list; a rescoring algorithm (FDR) merges them into a combined peptide list]
• Each search engine produces a different, non-standard score for the quality of a match
• Developed a search-engine-independent score, based on analysis of false discovery rate
• Identifications made by more search engines are scored more highly
• Can generate 35% more peptide identifications than the best single search engine
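The idea of putting different search engine scores on a common FDR-based scale can be sketched as follows. This is not the rescoring algorithm from the slides, whose details are not given here. It is a generic target-decoy illustration that assumes each engine was run against a target plus decoy database and that higher raw scores are better. The names `q_values` and `combine`, and the ranking rule (more supporting engines first, then lower q-value), are assumptions for the example.

```python
# Minimal sketch: convert per-engine scores to target-decoy q-values, then
# merge engines into one ranked peptide list.
# Assumptions: higher raw score is better; decoy hits are flagged.


def q_values(psms):
    """psms: list of (peptide, score, is_decoy). Returns {peptide: q_value}."""
    ranked = sorted(psms, key=lambda p: p[1], reverse=True)
    fdrs, targets, decoys = [], 0, 0
    for _, _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))  # estimated FDR at this cutoff
    # q-value: lowest FDR at this score threshold or any more permissive one
    running_min = float("inf")
    qvals = [0.0] * len(ranked)
    for i in range(len(ranked) - 1, -1, -1):
        running_min = min(running_min, fdrs[i])
        qvals[i] = running_min
    result = {}
    for (peptide, _, is_decoy), q in zip(ranked, qvals):
        if not is_decoy:
            result[peptide] = min(q, result.get(peptide, 1.0))
    return result


def combine(engine_results):
    """engine_results: {engine: psms}. Returns [(peptide, best_q, engines)]."""
    combined = {}
    for engine, psms in engine_results.items():
        for peptide, q in q_values(psms).items():
            best_q, engines = combined.get(peptide, (1.0, []))
            combined[peptide] = (min(best_q, q), engines + [engine])
    return sorted(
        ((pep, q, engs) for pep, (q, engs) in combined.items()),
        key=lambda row: (-len(row[2]), row[1], row[0]),
    )


if __name__ == "__main__":
    example = {
        "engine_a": [("PEPTIDEK", 50, False), ("DECOYR", 30, True), ("LOWK", 10, False)],
        "engine_b": [("PEPTIDEK", 5.0, False), ("OTHERR", 4.0, False), ("DECOYK", 1.0, True)],
    }
    for row in combine(example):
        print(row)
```

Because q-values are comparable across engines while raw scores are not, peptides found by only one engine can still enter the combined list, which is one way a merged result can exceed the best single engine.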
Email: [email protected]