Genes
Overview
Michelle Gwinn
August, 2005
Annotation
• Gene
– requires translation “start” codon
• bacterial starts = ATG, GTG, TTG
• genes go from “start” to “stop”
– has biological significance
• catalytic or structural RNAs
• protein coding regions
[Figure: six-frame translation with reading frames +3, +2, +1, -1, -2, -3; start and stop codons are marked]
These are some of the many ORFs in this graphic.
Gene Finding with Glimmer
[Figure: Glimmer workflow. Two options to get additional training genes; training, in which the IMM is built by the computer algorithm in Glimmer]
Gene Finding with Glimmer
What happens during training.
ATGCGTAAGGCTTTCACAGTATGCGAGTAAGCTGCGTCGTAAGG
[Figure: the sequence above shown three times, above an ORF map with frame labels +2, +1, -1, -2, -3]
The coding sequences resulting from the candidate ORFs are represented
by arrows going from start to stop. The dotted line represents an ORF
with no start site, which therefore cannot be a gene. A long ORF does
not necessarily result in a long putative gene.
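The distinction between an ORF and a candidate coding sequence can be sketched in a few lines of Python. This is a minimal illustration, not Glimmer's code: the function name `orfs_in_frame` and the tuple layout are assumptions made here, the stop codons are the standard TAA, TAG and TGA, and each call scans a single forward frame. The example string is the 44-base sequence from the training slide with its wrapped line joined.

```python
STOPS = {"TAA", "TAG", "TGA"}    # standard stop codons
STARTS = {"ATG", "GTG", "TTG"}   # bacterial start codons listed earlier

def orfs_in_frame(seq, offset):
    """Yield (orf_begin, first_start, stop) for one forward reading frame.

    orf_begin   index of the first codon after the previous stop
    first_start index of the first start codon in the ORF, or None
    stop        index of the stop codon that closes the ORF
    All indices are 0-based positions in seq.
    """
    orf_begin, first_start = offset, None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOPS:
            if i > orf_begin:  # skip empty ORFs (two stops in a row)
                yield orf_begin, first_start, i
            orf_begin, first_start = i + 3, None
        elif first_start is None and codon in STARTS:
            first_start = i

example = "ATGCGTAAGGCTTTCACAGTATGCGAGTAAGCTGCGTCGTAAGG"
for orf_begin, first_start, stop in orfs_in_frame(example, 0):
    if first_start is None:
        print(f"ORF {orf_begin}-{stop + 3}: no start codon, cannot be a gene")
    else:
        print(f"ORF {orf_begin}-{stop + 3}: candidate gene {first_start}-{stop + 3}")
```

Taking the first start codon gives the longest possible gene in each ORF. When the first start lies far from the beginning of the ORF, a long ORF yields only a short candidate gene.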
Green ORFs scored well against the model; red ORFs scored less well. The
green ORFs are chosen by Glimmer as the set of likely genes and
numbered sequentially from the beginning of the DNA molecule on which
they reside. ORFs in areas of lateral transfer, although real genes,
often will not be chosen since they don’t match the model built from the
patterns of the genome as a whole. Often when viewing a 6-frame
translation, the genes are represented as arrows drawn above (or, as in
this slide, below) the 6-frame translation.
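The idea of scoring ORFs against a model built from training genes can be illustrated with a much simpler model than the one Glimmer uses. The sketch below trains a fixed-order Markov model (order `k`, chosen arbitrarily as 2) and scores a sequence as a log-odds ratio against a uniform background. It is a simplification for illustration only: Glimmer's IMM interpolates between several orders and is considerably more elaborate, and the function names and the toy training sequence are assumptions.

```python
import math
from collections import defaultdict

def train_markov(training_genes, k=2):
    """Count how often each base follows each k-base context."""
    counts = defaultdict(lambda: defaultdict(int))
    for gene in training_genes:
        for i in range(k, len(gene)):
            counts[gene[i - k:i]][gene[i]] += 1
    return counts

def log_odds(seq, counts, k=2):
    """Sum of log2(P(base | context) / 0.25), with add-one smoothing.

    Positive totals mean the sequence resembles the training genes more
    than a uniform random sequence does.
    """
    score = 0.0
    for i in range(k, len(seq)):
        context = counts.get(seq[i - k:i], {})
        total = sum(context.values())
        p = (context.get(seq[i], 0) + 1) / (total + 4)
        score += math.log2(p / 0.25)
    return score

# Toy training set; real training uses long, confidently identified genes.
model = train_markov(["ATGATGATGATGATG"])
print(log_odds("ATGATGATG", model))   # resembles the training pattern
print(log_odds("CCCCCCCCC", model))   # contexts never seen in training
```

A sequence from a laterally transferred region behaves like the second call: its patterns were not seen during training, so it scores poorly even if it is a real gene.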
[Figure: genes ORF00001 to ORF00004 drawn as arrows along a coordinate axis running from 0 to 10000]
• Homology searching
– comparing sequences of unknown
function to those of known function
Homology searching
Pairwise
-two rows of amino acids compared to each other, the top row is the
search protein and the bottom row is the match protein, numbers
indicate amino acid position in the sequence
-solid lines between amino acids indicate identity
-dashed lines (colons) between amino acids indicate similarity
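The display conventions above can be reproduced with a short Python sketch. It assumes the two sequences are already aligned to equal length with "-" for gaps. The similarity groups are a hypothetical stand-in chosen for illustration; real tools derive similarity from a substitution matrix.

```python
# Hypothetical similarity groups, for illustration only.
SIMILAR_GROUPS = ["ILMV", "FWY", "KRH", "DE", "NQ", "ST", "AG"]

def similar(a, b):
    """True if both residues fall in the same illustrative group."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)

def match_line(query, match):
    """Build the line drawn between two aligned sequences:
    '|' for identity, ':' for similarity, ' ' otherwise (including gaps)."""
    if len(query) != len(match):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for a, b in zip(query, match):
        if a == "-" or b == "-":
            marks.append(" ")
        elif a == b:
            marks.append("|")
        elif similar(a, b):
            marks.append(":")
        else:
            marks.append(" ")
    return "".join(marks)

def percent_identity(query, match):
    """Identical positions as a percentage of alignment columns."""
    line = match_line(query, match)
    return 100.0 * line.count("|") / len(line)

query, match = "MKVLA", "MRILA"
print(query)
print(match_line(query, match))
print(match)
```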
Multiple
All of these are collapsed into one entry in NIAA that is linked
to all three accessions.
Experimentally
Characterized Proteins
• It is important to know what proteins in
our search database are characterized.
– We store the accessions of proteins known
or suspected to be characterized in the
“characterized table” in our database
– A confidence status is assigned to each
entry in the characterized table.
• Annotators see this information in the
search results as color coded output:
– green = full experimental characterization
– red = automated process (Swiss-Prot parse)
– sky blue = partial characterization
– olive = trusted, used when multiple extremely good
lines of evidence exist for function but no experiment
has been done (rarely used)
– blue-green = fragment/domain has been characterized
– fuzzy gray = void, used to indicate that something that
was originally thought to be characterized really is not
(rarely used)
– gray = accession exists in the omnium only - therefore
represents automated annotation
• Our table does not contain all
characterized proteins, not even close.
– Any time additional characterized proteins
are found it is important that they be entered
into the table
BLAST-extend-repraze (BER)
TIGR’s pairwise protein search tool
• Initial BLAST search
– against NIAA
– finds local alignments
– stores good hits in mini-database for each protein
• The gene’s nucleotide sequence is extended by 300
nucleotides on each end and translated
(see later slide)
• A modified Smith-Waterman alignment is
generated with each sequence in the mini-
database
– extends local alignments as far as homology
continues over lengths of extended proteins
– produces a file of alignments between the query
protein and the match protein for each match
protein in the mini-database
– as the alignment generating algorithm builds the
alignment, if the level of similarity falls below the
necessary threshold, the program looks in
different frames and through stop codons for
homology to continue - this similarity can continue
into the upstream and/or downstream extensions
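The extension step can be sketched as simple coordinate arithmetic. This is an assumption-laden illustration, not BER's code: it uses 1-based inclusive coordinates on a linear molecule with start before end, and it clips the extension at the ends of the molecule, a detail the slides do not specify.

```python
EXTENSION = 300  # nucleotides added on each end, as stated above

def extended_coords(start, end, molecule_length, extension=EXTENSION):
    """Return the gene coordinates widened by `extension` on both ends.

    Coordinates are 1-based and inclusive, with start < end.
    The result is clipped to the molecule (assumed linear here).
    """
    return max(1, start - extension), min(molecule_length, end + extension)

print(extended_coords(1000, 2000, 5000))
print(extended_coords(100, 4900, 5000))   # clipped at both ends
```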
BLAST-extend-repraze (BER)
[Figure: BER workflow. genome.pep is searched with BLAST against niaa (non-identical amino acid); significant hits are put into mini-dbs for each protein; each gene is extended 300 nucleotides on both ends; a modified Smith-Waterman alignment is run between the extended protein and the mini-database from the BLAST search.]
BER Alignment
An alignment like this will be generated for every match
protein in the mini-database. In the next slides we will
look closely at the types of information displayed here.
BER Alignment detail:
Boxed Header
-The top bar lists the percent identity/similarity and the organism
from which the protein comes (if available).
-The bottom section lists all of the accession numbers and names
for all the instances of the match protein from the source
databases (used in building NIAA for the searches.)
-The accession numbers are links to pages for the match protein
in the source databases.
-A particular entry in the list will have colored text (the color
corresponding to its characterized status) if that is the accession
that is entered into the characterized table - this tells the
annotators which link they should follow to find experimental
characterization information. Only one accession for the match
protein need be in the characterized table for the header to turn
gold.
-There are links at the end of each line to enter the accession into
the characterized table or to edit an already existing entry in the
characterized table.
BER Alignment detail:
alignment header
-It is most important to look at the range over which the alignment
stretches and the percent identity
-The top line shows the amino acid coordinates over which the match
extends for our protein
-The second line shows the amino acid coordinates over which the
match extends for the match protein, along with the name and
accession of the match protein
-The last line indicates the number of amino acids in the alignment
found in each forward frame for the sequence as defined by the
coordinates of the gene. The primary frame is the one starting with
nucleotide one of the gene. If all is well with the protein, all of the
matching amino acids should be in frame 1.
-If there is a frameshift in the alignment (see later slide) the phrase
“Frame Shifts = #” will flash and indicate how many frameshifts there
are.
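The per-frame counts in the last header line can be understood with a small sketch. It assumes each aligned codon is described by the 0-based offset of its first nucleotide relative to nucleotide one of the gene, and it counts a frameshift whenever consecutive aligned codons fall in different frames. Both the input format and the function name are assumptions made for illustration; BER's internal bookkeeping is not described in the slides.

```python
from collections import Counter

def frame_summary(codon_offsets):
    """Summarize which forward frame each aligned codon falls in.

    codon_offsets: 0-based nucleotide offsets, relative to the first
    nucleotide of the gene, of each aligned codon in alignment order.
    Returns ({frame: codon_count}, number_of_frame_changes).
    """
    frames = [offset % 3 + 1 for offset in codon_offsets]
    counts = Counter(frames)
    per_frame = {frame: counts.get(frame, 0) for frame in (1, 2, 3)}
    shifts = sum(1 for a, b in zip(frames, frames[1:]) if a != b)
    return per_frame, shifts

# Three codons in frame 1, then a one-base slip puts the rest in frame 2.
print(frame_summary([0, 3, 6, 10, 13]))
```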
BER Alignment detail:
alignment of amino acids
-In these alignments the codons of the DNA sequence read down in
columns with the corresponding amino acid underneath.
-Vertical lines between amino acids of our protein and the match
protein (bottom line) indicate exact matches, dotted lines (colons)
indicate similar amino acids.
-Start sites are color coded: ATG is green, GTG is blue, TTG is
red/orange
[Figure: alignment patterns between the search protein and the match protein]
-normal full length match
-! = FS (frameshift)
-* = PM (point mutation)
-similarity extending in the same frame through a stop codon
-? = FS or PM?
-two functionally unrelated genes from other species matching
one of our proteins could indicate incorrectly fused ORFs
Frameshifted alignment
Hidden Markov Models - HMMs
• HMMs are statistical models of the
patterns of amino acids in a group of
functionally related proteins found
across species.
– this group is called the “seed”
– HMMs are built from multiple alignments of
the seed members.
• Proteins searched against an HMM
receive a score indicating how well
they match the model.
– Proteins scoring well to the model can be
expected to share the function that the HMM
represents.
• HMMs can be built at varying levels of
functional relationship.
– The most powerful level of relationship is
one representing the exact same function.
– It is important to know the kind of
relationship an HMM models to be able to
draw the correct conclusions from it
• Annotation can be attached to HMMs
– protein name
– gene symbol
– EC number
– role information
TIGR’s HMM Isology Types
Equivalog: The supreme HMM, designed so that all members
and all proteins scoring above the trusted cutoff share the same function.
[Figure: each protein in genome.pep is searched against the HMM database (TIGRFAMS + Pfams); example results per protein: TIGR01234, TIGR01004, PF00012, NONE, TIGR00005, etc.]
Each protein in the genome is searched against all HMMs
in our db. Some will not have significant hits to any HMM,
some will have significant hits to several HMMs. Multiple
HMM hits can arise in many ways, for example: the same
protein could hit an equivalog model, a superfamily model
to which the equivalog function belongs, and a domain
model representing the catalytic domain for the particular
equivalog function. There is also overlap between TIGR
and Pfam HMMs.
Evaluating HMM scores
[Figure: three score scales running from -50 to 100, with the noise (N) and trusted (T) cutoffs marked]
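Evaluating a score against the two cutoffs can be sketched as follows. The three-way classification and the handling of scores exactly equal to a cutoff are assumptions made for this illustration, as are the function name and the example cutoff values.

```python
def evaluate_hmm_score(score, noise_cutoff, trusted_cutoff):
    """Place a protein's HMM score relative to the noise (N) and
    trusted (T) cutoffs of that HMM."""
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff should not exceed trusted cutoff")
    if score >= trusted_cutoff:
        return "above trusted"
    if score > noise_cutoff:
        return "between noise and trusted"
    return "below noise"

# Hypothetical cutoffs for one HMM: noise = 20, trusted = 60.
for score in (85.0, 40.0, -12.5):
    print(score, evaluate_hmm_score(score, 20.0, 60.0))
```

Scores between the two cutoffs are the ones that need a closer look by an annotator.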
HMM Output in Manatee
Genome Properties
• Used to get “the big picture” of an organism. Specifically to
record and/or predict the presence/absence of:
– metabolic pathways
• biotin biosynthesis
– cellular structures
• outer membrane
– traits
• anaerobic vs. aerobic
• optimal growth temperature
• Particular property has a given “state” in each organism, for
example:
– YES - the property is definitely present
– NO - the property is definitely not present
– Some evidence - the property may be present and more
investigation is required to make a determination
• The state of some properties can be determined
computationally
– metabolic pathway
• the property is defined by several reaction steps which are
modeled by HMMs
• HMM matches to steps in pathway indicate that the organism
has the property
• The states of other properties must be entered manually (growth
temp, anaerobic/aerobic, etc.)
• data for a particular genome viewable in Manatee
– links from HMM section
– links from gene list for role category
– entire list of properties and states can be viewed
• Searchable across genomes on the CMR site
– covered in the CMR segment of the course
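The computational determination of a pathway property's state, described above, can be sketched as a check of which steps have HMM hits. The decision rule is an assumption made for illustration: all steps found gives YES, some steps found gives "Some evidence", and no steps found gives NO. The actual rules used for each property may differ. The step names and HMM names are hypothetical placeholders.

```python
def property_state(steps, hmm_hits):
    """Derive a property state from HMM hits.

    steps:    dict mapping each pathway step to the set of HMMs that
              can satisfy that step
    hmm_hits: set of HMMs the genome's proteins hit
    """
    if not steps:
        raise ValueError("a property needs at least one step")
    found = sum(1 for models in steps.values() if models & hmm_hits)
    if found == len(steps):
        return "YES"
    if found:
        return "Some evidence"
    return "NO"  # assumption: no hits at all is treated as absence

# Hypothetical three-step pathway with placeholder HMM names.
pathway = {
    "step 1": {"HMM_A"},
    "step 2": {"HMM_B", "HMM_C"},   # either model satisfies this step
    "step 3": {"HMM_D"},
}
print(property_state(pathway, {"HMM_A", "HMM_C", "HMM_D"}))
print(property_state(pathway, {"HMM_A"}))
```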
“Biotin Biosynthesis”
Genome Property
“Cell Shape” Genome Property
Paralogous Families
[Figure: genome.pep searched against itself (genome.pep vs. genome.pep)]
• Molecular Weight/pI
• DNA repeats
• RNAs
– tRNAs are found using tRNAscan (Sean Eddy)
– structural RNAs are found using BLAST
searches
– We are starting to implement Rfam, a set of
HMMs modeling non-coding RNAs (Sanger,
WashU)
• GC content
– for the whole genome and individual genes
• terminators
• operons
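Of the analyses listed above, GC content is the simplest to show in code. The sketch below is a generic calculation, not the pipeline's own implementation; ignoring ambiguous bases such as N is a choice made here.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted

# The same function serves a whole genome or an individual gene.
print(gc_content("ATGC"))
print(gc_content("GGCCNN"))   # N is ignored
```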
Making the annotations:
Assigning names and roles to the proteins
Functional Assignments:
What we want to accomplish.
Role
Both TIGR and Gene Ontology, to describe
what the protein is doing in the cell and why.
Supporting evidence:
HMMs, Prosite, InterPro
Characterized match from BER search
Paralogous family membership.
Functional Assignments:
What we want to avoid!
Genome Rot!
Transitive Annotation: A is like
B, B is like C, C is like D, but A
is not like D
Other HMMs
Etc.
Manual Annotation:
Assigning Names to Proteins
Functional Assignments:
High Confidence in Precise Function
Criteria:
-At least one good alignment (minimum 35% identity,
over the full lengths of both proteins) to a protein from
another organism that has been experimentally
characterized, preferably multiple such alignments.
-Above trusted cutoff hits to any HMMs for this gene.
-Conservation of active sites, binding sites, appropriate
number of membrane spans, etc.
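The first two criteria lend themselves to a quick check in code. The sketch below is one possible reading, not the annotation team's tooling. "Full lengths of both proteins" is approximated by a hypothetical 90% coverage threshold. The HMM criterion is read as "every HMM hit for this protein is above its trusted cutoff". The third criterion (active sites, membrane spans) still needs human judgment.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    percent_identity: float
    query_coverage: float    # fraction of our protein in the alignment
    match_coverage: float    # fraction of the match protein in the alignment
    characterized: bool      # match is experimentally characterized
    other_organism: bool = True

def good_alignments(hits, min_identity=35.0, min_coverage=0.9):
    """Hits that satisfy the alignment criterion (coverage is an assumption)."""
    return [
        h for h in hits
        if h.characterized and h.other_organism
        and h.percent_identity >= min_identity
        and h.query_coverage >= min_coverage
        and h.match_coverage >= min_coverage
    ]

def meets_sequence_criteria(hits, hmm_scores):
    """hmm_scores: list of (score, trusted_cutoff) pairs for this protein."""
    hmms_ok = all(score >= trusted for score, trusted in hmm_scores)
    return bool(good_alignments(hits)) and hmms_ok

hits = [Hit(42.0, 0.97, 0.95, True), Hit(80.0, 0.40, 0.98, True)]
print(meets_sequence_criteria(hits, [(310.5, 250.0)]))
```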
“hypothetical protein”
-many FS/PM (! ! ! ** *): “degenerate”
-“truncation”
-“deletion”
-“insertion” (~20-50aa)
-interruption (some genes): “interruption-N” (N-term), “interruption-C” (C-term)
-“fusion”
-“fragment”
Available evidence for 3 genes
#1
-HMM for “ribokinase”
-characterized match to ribokinase
#2
-HMM for “kinase”
-characterized matches to a “glucokinase” AND a “fructokinase”
#3
-HMM for “kinase”
Function
catalytic activity
  kinase activity
    carbohydrate kinase activity
      ribokinase activity
      glucokinase activity
      fructokinase activity
Process
metabolism
  carbohydrate metabolism
    monosaccharide metabolism
      hexose metabolism
        glucose metabolism
        fructose metabolism
      pentose metabolism
        ribose metabolism
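One way to read this slide is that each gene gets the most specific term that all of its evidence supports. The sketch below illustrates that reading on the function branch of the hierarchy shown on the slide. It is an interpretation, not a documented rule, and the `PARENT` table only covers the handful of terms on the slide rather than the real Gene Ontology.

```python
# Child -> parent links for the function terms shown on the slide.
PARENT = {
    "kinase activity": "catalytic activity",
    "carbohydrate kinase activity": "kinase activity",
    "ribokinase activity": "carbohydrate kinase activity",
    "glucokinase activity": "carbohydrate kinase activity",
    "fructokinase activity": "carbohydrate kinase activity",
}

def lineage(term):
    """Return the term followed by its ancestors, most specific first."""
    path = [term]
    while term in PARENT:
        term = PARENT[term]
        path.append(term)
    return path

def deepest_common_term(terms):
    """Most specific term that is an ancestor of (or equal to) every input."""
    paths = [lineage(term) for term in terms]
    common = set(paths[0]).intersection(*paths[1:])
    for term in paths[0]:          # walk from most to least specific
        if term in common:
            return term
    return None

print(deepest_common_term(["ribokinase activity"]))                            # gene 1
print(deepest_common_term(["glucokinase activity", "fructokinase activity"]))  # gene 2
print(deepest_common_term(["kinase activity"]))                                # gene 3
```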
GO Evidence
• Just as we store evidence for our annotation, we must
store evidence for GO term assignments:
– Assign Evidence Code
• Ev Codes tell users what kind of evidence was used
– sequence similarity (99% of our work) - ISS
– experimental characterization - IMP, IDA, etc.
• IEA - code for electronic annotation - immediately allows
users to tell manual curation from automatic
– Assign “Reference”
• PMID of paper describing characterization or method used
for annotation
• database reference (GO standards)
– Assign “with” (when appropriate)
• Used with ISS to store the accession of the thing the
sequence similarity is with
– Modifier column
• “contributes to” - use this modifier when you assign a
function term representing the function of a complex to
proteins that are part of the complex but can not
themselves carry out the function of the complex
• All accessions used as evidence must be represented
according to GO’s format – “database:accession”
(where “database” is the abbreviation defined at GO).
Manatee knows these rules and automatically puts
the accessions in the correct format.
– Examples
• TIGR_TIGRFAMS:TIGR01234
• Swiss-Prot:P12345
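The “database:accession” format is simple enough to sketch. The helper below is a hypothetical illustration of the formatting rule only; it is not Manatee's code and does not check the database abbreviation against GO's list.

```python
def go_reference(database, accession):
    """Join a database abbreviation and an accession as 'database:accession'."""
    database, accession = database.strip(), accession.strip()
    if not database or not accession:
        raise ValueError("both database and accession are required")
    if ":" in database:
        raise ValueError("database abbreviation must not contain ':'")
    return f"{database}:{accession}"

print(go_reference("TIGR_TIGRFAMS", "TIGR01234"))
print(go_reference("Swiss-Prot", "P12345"))
```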
GO Evidence codes
http://www.geneontology.org/GO.evidence.html
http://www.geneontology.org/GO.annotation.html
Association files at GO
GO is a work in progress
What to consider:
- Start site frequency: ATG >> GTG >> TTG
- Ribosome Binding Site (RBS): an AG-rich stretch of sequence
located 5-11 bp upstream of the start codon
- Similarity to match protein, both in BER and Paralogous
Family - probably the most important factor.
(Remember that the DNA sequence reads down in
columns for each codon.)
-In the example below (showing just the beginning of one
BER alignment), homology starts exactly at the first atg (the
current chosen start, aa #1), there is a very favorable RBS
beginning 9bp upstream of this atg (gagggaga). There is no
reason to consider the ttg, and no justification for moving to
the second atg (this would cut off some similarity and it does
not have an RBS.)
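The RBS consideration can be turned into a rough numeric hint. The sketch below measures how AG-rich the bases 5 to 11 bp upstream of a candidate start are. Reading "located 5-11 bp upstream" as the positions of individual bases is an assumption, and so is using a plain A/G fraction rather than a motif search. The toy sequence is invented, and similarity to match proteins remains the more important factor.

```python
def ag_fraction_upstream(seq, start, nearest=5, farthest=11):
    """Fraction of A/G among the bases `nearest` to `farthest` bp upstream
    of the start codon whose first base is at 0-based index `start`."""
    lo = max(0, start - farthest)
    hi = max(0, start - nearest + 1)
    window = seq[lo:hi].upper()
    if not window:
        return 0.0
    return sum(base in "AG" for base in window) / len(window)

# Toy sequence: an AG-rich stretch upstream of the ATG at index 15.
seq = "CCCC" + "AGGAGGA" + "CCCC" + "ATGAAA"
print(ag_fraction_upstream(seq, 15))
```

Candidate starts in the same ORF can be compared by this value, but a high fraction alone is no reason to move a start away from where the homology begins.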
Overlap analysis
When two ORFs overlap (boxed areas), the one without similarity
to anything (another protein, an HMM, etc.) is removed. If neither
matches anything, other considerations, such as presence in a
putative operon and potential start codon quality, are weighed.
This process has both automated (for the easy ones) and manual
(for the hard ones) components. Small regions of overlap are
allowed (circle).
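The automated part of this decision can be sketched as below. The 30 bp allowance for small overlaps is a hypothetical value, since the slides do not give one. Sending the case where both ORFs have evidence to manual review is likewise an assumption.

```python
from dataclasses import dataclass

@dataclass
class Orf:
    name: str
    start: int            # 1-based, inclusive, start <= end
    end: int
    has_evidence: bool    # similarity to a protein, an HMM, etc.

def overlap_length(a, b):
    """Number of positions shared by two ORFs (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)

def resolve_overlap(a, b, max_allowed=30):
    """Return (action, orf_name) for a pair of ORFs."""
    if overlap_length(a, b) <= max_allowed:
        return ("keep both", None)        # small overlaps are allowed
    if a.has_evidence and not b.has_evidence:
        return ("remove", b.name)
    if b.has_evidence and not a.has_evidence:
        return ("remove", a.name)
    return ("manual review", None)        # the hard ones

a = Orf("ORF00001", 100, 500, True)
b = Orf("ORF00002", 400, 900, False)
print(resolve_overlap(a, b))
```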
InterEvidence regions
Two kinds of regions are translated in all 6 frames and searched
against niaa: areas of the genome with no genes, and areas within genes
without any kind of evidence (no match to another protein, HMM, etc.).
The latter may include an entire gene in the case of “hypothetical
proteins”. Results are evaluated by the annotation team.
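Finding the gene-free areas is a matter of walking the sorted gene coordinates. The sketch below covers only that part, on a linear molecule with 1-based inclusive coordinates. The function name and the `min_length` parameter are assumptions, and regions inside genes that lack evidence would need the evidence coordinates as additional input.

```python
def intergenic_regions(genes, molecule_length, min_length=1):
    """Return (start, end) for every stretch not covered by any gene.

    genes: iterable of (start, end) pairs, 1-based and inclusive.
    """
    regions = []
    covered_to = 0   # last position covered by the genes seen so far
    for start, end in sorted(genes):
        if start - covered_to - 1 >= min_length:
            regions.append((covered_to + 1, start - 1))
        covered_to = max(covered_to, end)
    if molecule_length - covered_to >= min_length:
        regions.append((covered_to + 1, molecule_length))
    return regions

genes = [(100, 200), (150, 300), (500, 600)]
print(intergenic_regions(genes, 1000))
```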
Data Availability
• Publication
– Analysis of genome data by TIGR
staff and collaborators
• GenBank
– Sequence and annotation submitted
to GenBank at the time of publication
– Updates sent as needed
• Comprehensive Microbial
Resource (CMR)
– Data available for downloading
– extensive analyses within and
between genomes
Useful links
• CMR Home
– http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl
• SIB web site (Swiss-Prot, Prosite, etc.)
– http://www.expasy.org
• PIR
– http://pir.georgetown.edu
• NCBI
– http://www.ncbi.nlm.nih.gov
• BLAST
– http://blast.wustl.edu
• GO
– http://www.geneontology.org
• TIGRFAM HMMs
– http://www.tigr.org/tigr-scripts/CMR2/find_hmm.spl?db=CMR
– OR
– http://tigrblast.tigr.org/web-hmm/
Acknowledgements
And the many other TIGR employees, present and past, who have
contributed to the development of these tools and to the annotation
protocols I have described.
Also thanks to the funding agencies that make our work possible
including NIH, NSF, and DOE.