Genome Annotation
AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA
CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT
GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG
GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT
AGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGT
GCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATG
AGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTGCTTGAGT
TCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTG
GGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGC
CGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGCTAGTAT
ATCTCTTTCTCTGTCGTGCTGCTTGAGATCGTTCGTTTTTTTATG
GATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTC
AGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGA
AGTGAGATGTACGCGCTAGCTAGTATATCTCTTTCTCTGTCGTGC
What is Annotation?
• dictionary definition of “to annotate”:
– “to make or furnish critical or explanatory notes or comment”
• some of what this includes for genomics
– gene product names
– functional characteristics of gene products
– physical characteristics of gene/protein/genome
– overall metabolic profile of the organism
• elements of the annotation process
– gene finding
– homology searches
– functional assignment
– ORF management
– data availability
• manual vs. automatic
– automatic = computer makes the decisions
    • good on easy ones
    • bad on hard ones
– manual = human makes the decisions
    • highest quality
**Because of the sheer volume of genome data today, most genome projects are
annotated primarily by automated methods, with limited manual annotation.
Annotation pipeline
Homology Searches
Putative ID
Frameshift Detection
Ambiguity Report
Role Assignment
Metabolic Pathways
Gene Families
DNA Motifs
Regulatory Elements
Repetitive Sequences
Comparative Genomics
Genome Structure
Prokaryote
Eukaryote
http://pps00.cryst.bbk.ac.uk/course/section6/henryb/genestrp.htm
Eukaryotic Gene Structure and Transcript Processing
Structural Annotation: Finding the Genes in Genomic DNA
Two main types of data used in defining gene structure:
Running a Gene-finder is a two-part process
Figure: reading frames of a DNA sequence (frames labeled +2, +1, -1, -2, -3).
Possible translations are represented by arrows running from start to stop; the
dotted line represents an ORF with no start site.
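The idea in the figure, reading each strand in three frames and looking for start-to-stop stretches, can be sketched in a few lines of Python. This is only an illustration and not how any particular gene-finder works: the function names, the `min_codons` parameter, and the frame labels (`+1` to `+3`, `-1` to `-3`, numbered by offset from the 5' end of the strand being read) are assumptions made for this example. Stretches that reach a stop without an upstream ATG, like the dotted ORF in the figure, are deliberately skipped.

```python
# Minimal six-frame ORF scan (illustrative sketch, standard stop codons assumed).
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (frame, start, end) for every ATG...stop stretch in six frames.

    start and end are 0-based offsets on the strand being read (the reverse
    complement for negative frames); end is exclusive and includes the stop.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            start = None
            for i in range(offset, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((f"{strand}{offset + 1}", start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATAGCC"))  # [('+3', 2, 11)]
```

A real gene-finder would go further, for example by scoring coding potential, but the frame-by-frame scan is the common first step.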
AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG
AAAGC ATG CAT TTA ACG A GT GCATC AG GA CTC CAT ACG TAA TGCCG
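The spacing in the second line appears to mark a start codon (ATG), an intron that begins with GT and ends with AG, and a stop codon (TAA). Under that reading, which is an assumption about what the slide intended, the short Python sketch below removes the intron and translates the spliced sequence with the standard genetic code. The intron coordinates and all function names are chosen for this example only.

```python
# Splice out one intron and translate the result (illustrative sketch).
import itertools

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a stop.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO)
}


def splice(genomic, introns):
    """Remove introns given as 0-based, half-open (start, end) intervals."""
    pieces, pos = [], 0
    for start, end in sorted(introns):
        pieces.append(genomic[pos:start])
        pos = end
    pieces.append(genomic[pos:])
    return "".join(pieces)


def translate_from_first_atg(mrna):
    """Translate from the first ATG up to (not including) the first stop."""
    start = mrna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]  # raises KeyError on non-ACGT bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


genomic = "AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG"
intron = (18, 27)  # assumed from the slide spacing: GT GCATC AG
mrna = splice(genomic, [intron])
print(mrna)                            # AAAGCATGCATTTAACGAGACTCCATACGTAATGCCG
print(translate_from_first_atg(mrna))  # MHLTRLHT
```

Note that the codon AGA is split across the intron (A before it, GA after it), which is why the exon boundaries cannot be read off the genomic sequence by codon position alone.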
Transcript evidence:
Name
Descriptive common name for the protein, with as much
specificity as the evidence supports; gene symbol.
Role
Describe what the protein is doing in the cell and why.
Associated information:
Supporting evidence: domains and motifs.
EC number if protein is an enzyme.
Paralogous family membership.
Evidence for Gene Function
• PROSITE Motifs
– collection of protein motifs associated with active
sites, binding sites, etc.
– help in classifying genes into functional families
when HMMs for that family have not been built
• InterPro
– Brings together HMMs (both TIGR and Pfam), PROSITE motifs, and other
forms of motif/domain clustering
– Results in motif “signatures” for families or functions
– GO terms have been assigned to many of these
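To make the idea of a PROSITE motif concrete, the sketch below converts a pattern written in PROSITE syntax into a Python regular expression and scans a protein sequence with it. It is a simplified illustration, not the PROSITE scanning software: it handles only the common syntax elements (`x`, `[...]`, `{...}`, repeat counts, and the `<` / `>` anchors), it reports non-overlapping matches only, and the function name and the example protein are made up for this example. The pattern used, `N-{P}-[ST]-{P}`, is the N-glycosylation site pattern.

```python
# Convert a PROSITE-style pattern to a regex and scan a sequence (sketch).
import re


def prosite_to_regex(pattern):
    """Translate e.g. 'N-{P}-[ST]-{P}' into the regex 'N[^P][ST][^P]'."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = match.group(1), match.group(2)
        if core == "x":               # x = any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # {P} = any residue except P
        repeat = "{" + count + "}" if count else ""
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
protein = "MKNGSAANPSQ"  # made-up sequence for illustration
hits = [(m.start(), m.group()) for m in re.finditer(regex, protein)]
print(regex)  # N[^P][ST][^P]
print(hits)   # [(2, 'NGSA')]
```

Short patterns like this one match by chance quite often, which is one reason motif hits are treated as supporting evidence rather than as a functional assignment on their own.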
Sequence Alignments
Compare the sequence against other databases
Gene function evidence
Functional Annotation: Gene Product Names
Gene Name Assignment: based on similarity to known proteins in the nraa database
Categories:
Gene Predictions
Annotated Gene
Top: editing panel
Bottom: final curation
Splice site predictions: red = acceptor sites, blue = donor sites
– Annotation is iterative and never perfect; it can always be improved with new
evidence and improved algorithms.