Gene Prediction
Srivani Narra
Indian Institute of Technology Kanpur
Email: srivani@iitk.ac.in
Abstract:
The gene identification problem is the problem of interpreting nucleotide
sequences by computer, in order to provide tentative annotation on the location, structure,
and functional class of protein-coding genes. This problem is of self-evident importance,
and is far from being fully solved, particularly for higher eukaryotes. With the advent of
whole-genome sequencing projects, there is considerable use for programs that scan
genomic DNA sequences to find genes, particularly those that encode proteins. Even
though there is no substitute for experimentation in determining the exact locations of
genes in the genome sequence, a prior knowledge of the approximate location of genes
will speed up the process considerably, apart from saving a large amount of laboratory
time and resources. Hidden Markov Models have been developed to find protein-coding
genes in DNA. Three models are developed with increasing complexity. The first model
includes states that model the codons and their frequencies in the genes. The second model
includes states for amino acids, with the corresponding codons as the observation events. The
third model, in addition, includes states that model the Shine-Dalgarno motif and Repetitive
Extragenic Palindromic sequences (REPs). In order to compare the performance, an ANN
is also developed to classify a set of nucleotide sequences into genes and non-coding
regions.
1 A primer on molecular biology
Cells are the fundamental working units of every living system. All the instructions
needed to direct their activities are contained within the chemical DNA
(deoxyribonucleic acid). DNA from all organisms is made up of the same chemical and
physical components. The DNA sequence is the particular side-by-side arrangement of
bases along the DNA strand (e.g., ATTCCGGA). This order spells out the exact
instructions required to create a particular organism with its own unique traits.
The genome is an organism's complete set of DNA. However, only fragments of the
genome are responsible for the functioning of the cell. These fragments, called genes, are
the basic physical and functional units of heredity. Genes are made up of a contiguous set
of codons, each of which specifies an amino acid. (Three consecutive nucleotide bases in
a DNA sequence constitute a 'codon'; for example, 'AGT' and 'ACG' are two consecutive
codons in the DNA fragment AGTACGT. Of the 64 possible different arrangements of
the four nucleotides (A, T, G, C) in sets of three, three (TAA, TAG, TGA, read as UAA,
UAG, UGA on the mRNA) functionally act as periods to the translating ribosome, in that
they cause translation to stop. These three codons are therefore termed 'stop codons'.
Similarly, one codon of the genetic code, namely ATG, is reserved as the start codon,
though GTG, CTG and TTG are also rarely observed.)
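The following minimal Python sketch illustrates how a nucleotide string is read in non-overlapping triplets and how start and stop codons are recognized (the example sequence and codon sets below are illustrative assumptions, not data from this work):

```python
# Minimal sketch of reading a DNA string as codons (illustrative only).
START_CODONS = {"ATG", "GTG", "CTG", "TTG"}   # ATG is canonical; the others are rare
STOP_CODONS = {"TAA", "TAG", "TGA"}           # UAA, UAG, UGA on the mRNA

def codons(dna, frame=0):
    """Yield consecutive, non-overlapping triplets starting at the given frame."""
    for i in range(frame, len(dna) - 2, 3):
        yield dna[i:i + 3]

for codon in codons("ATGGCTTTAAGTACGTGA"):
    kind = "start" if codon in START_CODONS else "stop" if codon in STOP_CODONS else ""
    print(codon, kind)
```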
Genes are translated into proteins, and these proteins perform most life functions and
even make up the majority of cellular structures. [reference]
2 Introduction
With the advent of whole-genome sequencing projects, there is considerable use
for programs that scan genomic DNA sequences to find genes. Even though there is no
substitute for experimentation in determining the exact locations of genes in the genome
sequence, a prior knowledge of the approximate location of genes will speed up the
process to a great extent, thereby saving a huge amount of laboratory time and resources.
The simplest method of finding DNA sequences that encode proteins is to search
for open reading frames, or ORFs. An ORF is a length of DNA sequence that contains a
contiguous set of codons, each of which specifies an amino acid [reference], end-marked
by a start codon and a stop codon which mark the beginning and the end of a putative gene
respectively. However, biologically, not every ORF is a coding region. Only a few ORFs
with a certain minimum length and with a specific composition translate into proteins.
Hence, this method fails when there are a large number of flanking stop codons, resulting
in many small ORFs.
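A naive ORF scanner of the kind described above might be sketched as follows (a simplified illustration assuming the forward strand only and the canonical ATG start; the minimum-length threshold is an arbitrary choice):

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=30):
    """Return (start, end) index pairs of ORFs on the forward strand.

    An ORF here runs from an in-frame ATG to the first in-frame stop codon,
    and is reported only if it is at least min_codons codons long.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```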
DNA sequences that encode proteins are not a random chain of the available codons for
an amino acid, but rather an ordered list of specific codons that reflects the evolutionary
origin of the gene and the constraints associated with gene expression. This nonrandom
property of coding sequences can be used to advantage for finding regions in DNA
sequences that encode proteins (Fickett and Tung, 1992). Each species also has a
characteristic pattern of synonymous codon usage (Wada et al., 1992). Also, there is a
strong preference for certain codon pairs within a coding region.
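One way to quantify this bias is simply to tabulate relative codon frequencies over a set of known coding sequences (a hedged sketch; the input is assumed to be a list of in-frame coding sequences supplied as plain strings):

```python
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # the 64 codons

def codon_usage(coding_seqs):
    """Relative frequency of each codon over a collection of coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values()) or 1
    return {codon: counts[codon] / total for codon in ALL_CODONS}
```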
The various methods for finding genes may be classified into the following
categories:
Sequence similarity search: One of the oldest methods of gene identification, based on
sequence conservation due to functional constraint, is to search for regions of similarity
between the sequence under study (or its conceptual translation) and the sequences of
known genes (or their protein products) (Robinson et al., 1994). The obvious
disadvantage of this method is that when no homologues of the new gene are found
in the databases, similarity search will yield little or no useful information.
Based on statistical regularities: The following measures have been used to encapsulate
the features of genes: the codon usage measure, the hexamer-n measure, the hexamer
measure, the open reading frame measure, the amino acid usage measure, the diamino
acid usage measure, and many more. Combining several measures does improve accuracy.
Using signals: Any portion of the DNA whose binding by another biochemical plays a
key role in transcription is called a signal. The collection of all specific instances of some
particular kind of signal will normally tend to be recognizably similar.
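As a toy illustration of a signal sensor, the sketch below scans a short window upstream of a candidate start codon for the best match to a Shine-Dalgarno-like consensus (the consensus string and window size are assumptions of this sketch; real sensors typically use position weight matrices):

```python
SD_CONSENSUS = "AGGAGG"  # assumed Shine-Dalgarno-like consensus

def best_sd_match(dna, start_pos, window=20):
    """Return (offset_upstream, matching_bases) of the best consensus alignment
    found within `window` bases upstream of the start codon at start_pos."""
    best = (None, -1)
    for i in range(max(0, start_pos - window), start_pos - len(SD_CONSENSUS) + 1):
        matches = sum(a == b for a, b in zip(dna[i:i + len(SD_CONSENSUS)], SD_CONSENSUS))
        if matches > best[1]:
            best = (start_pos - i, matches)
    return best
```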
Our aim in this project is to use machine learning techniques like Hidden Markov Models
and Neural Networks to capture the above mentioned properties for gene recognition and
prediction.
3 Issues in Gene Prediction
Three types of posttranscriptional events influence the translation of mRNA into
protein and the accuracy of gene prediction. First, the genetic code of a given genome
may vary from the universal code [36, 37]. Second, one tissue may splice a given mRNA
differently from another, thus creating two similar but also partially different mRNAs
encoding two related but different proteins. Third, mRNAs may be edited, changing the
sequence of the mRNA and, as a result, of the encoded protein. Such changes also depend
on interaction of RNA with RNA-binding proteins. Then there are issues of frame-shifts,
insertions and deletions of bases, overlapping genes, genes on the complementary strand
etc. Straightforward solutions therefore do not work when we need to take all these
issues into consideration.
6 Training set
The FASTA files for the complete genome sequence, genes and their conceptual
translations into proteins are downloaded from Entrez (Genome) database at NCBI
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome). The fasta file for genes
contains around 4113 genes, of which 1000 are used for training the HMM. This departs
from the traditional division into one half as the learning set and the other half as the test
set, and is done to avoid overfitting (considering the enormous number of parameters that
are adjusted during the training phase). These genes are used without any preprocessing to
train the above models to distinguish between coding and non-coding regions. However,
in the case of the extended models, there is a need to train the model not only on coding
regions but also on intergenic regions. Therefore, the complete genome sequence and the
gene listing are used to extract the intergenic regions, which are in turn used to train the
extended models.
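The preprocessing just described might look roughly as follows (a hedged sketch: the FASTA parser is generic, and gene coordinates are assumed to be available as 0-based, end-exclusive (start, end) pairs on the forward strand, which is not necessarily the format of the NCBI listing used here):

```python
def read_fasta(path):
    """Parse a FASTA file into a dict of {header: sequence}."""
    records, header, chunks = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records[header] = "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records

def intergenic_regions(genome, gene_coords):
    """Extract the sequences lying between consecutive genes.

    gene_coords: list of (start, end) pairs, 0-based, end-exclusive.
    """
    regions, prev_end = [], 0
    for start, end in sorted(gene_coords):
        if start > prev_end:
            regions.append(genome[prev_end:start])
        prev_end = max(prev_end, end)
    if prev_end < len(genome):
        regions.append(genome[prev_end:])
    return regions
```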
7 Results
In the case of the HMMs, a logarithmic probability is calculated on the basis
of the Viterbi path, and this is used to determine whether a given sequence is a gene or not.
As mentioned earlier, a set of 1000 coding sequences is used as the learning set, and the
trained models are then run over a test set containing 2000 genes and 500 random
sequences to calculate the accuracy. The accuracies of the two models over the test set are
as high as 97% and 98% respectively. However, the number of false positives increases
when the random sequences are long and flanked by a start and a stop codon. The number
of false negatives is still 0.
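The classification step might be sketched as follows (illustrative only: viterbi_log_prob stands for any routine returning the log-probability of the Viterbi path, as in the Appendix sketch, and the per-codon normalization and threshold value are assumptions, not the exact scoring used in this work):

```python
def classify(seq, viterbi_log_prob, threshold=-1.5):
    """Label a sequence as coding or non-coding from its Viterbi log-probability.

    Normalizing by the number of codons makes scores of sequences of different
    lengths comparable (an assumption of this sketch, not of the report).
    """
    score = viterbi_log_prob(seq) / max(len(seq) // 3, 1)
    return "gene" if score > threshold else "non-coding"
```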
With the Extended HMMs, the genome is annotated by calculating the Viterbi
path through the sequence and then using the state sequence to determine coding and
non-coding regions in the genome. For example, in the case of Extended Model 1, given a
genome (or part of it) as input, the Viterbi path through the sequence is determined, and
the part of the genome between state 0 (the start state) and state 60 (the stop state) is
annotated as a gene, while the part of the genome sequence that loops in state 61 (the
intergenic state) is annotated as an intergenic region. The case is similar for Extended
Model 2. These models locate the exact position of the genes more than 88% of the time,
and approximate positions (off by four or five codons) around 5% of the time.
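The mapping from a Viterbi state path to an annotation can be sketched as below (state indices 0 and 60 follow the description of Extended Model 1 above; the exact bookkeeping is an assumption of this sketch):

```python
def annotate(state_path, start_state=0, stop_state=60):
    """Turn a Viterbi state path into (start, end) gene coordinates.

    Positions from an occurrence of the start state up to the next stop state
    are labelled as a gene; everything else (e.g. runs of the intergenic
    state 61) is left as intergenic region.
    """
    genes, gene_start = [], None
    for pos, state in enumerate(state_path):
        if state == start_state and gene_start is None:
            gene_start = pos
        elif state == stop_state and gene_start is not None:
            genes.append((gene_start, pos))
            gene_start = None
    return genes
```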
The ANN with only 64 parameters (and consequently 64 input nodes) gives an
accuracy of 83%. However, when training is done by calculating the MSE (mean-squared
error) over the entire set rather than over individual sequences, and 6 additional
parameters are added, the accuracy increases considerably, to as much as 90%.
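The report does not spell out the 64 input features; assuming they are the relative frequencies of the 64 codons in a sequence, the ANN input vector could be built as follows (a hedged sketch; the 6 additional parameters are not reconstructed here):

```python
import numpy as np
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(ALL_CODONS)}

def feature_vector(seq):
    """64-dimensional vector of relative codon frequencies for one sequence."""
    v = np.zeros(64)
    for i in range(0, len(seq) - 2, 3):
        idx = CODON_INDEX.get(seq[i:i + 3])
        if idx is not None:      # skip triplets containing ambiguous bases such as N
            v[idx] += 1
    total = v.sum()
    return v / total if total else v
```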
8 Appendix
Generally, the learning problem is how to adjust the HMM parameters, so that the given
set of observations (called the training set) is represented by the model in the best way
for the intended application. Thus it would be clear that the "quantity" we wish to
optimize during the learning process can be different from application to application. In
other words there may be several optimization criteria for learning, out of which a
suitable one is selected depending on the application.
There are two main optimization criteria found in the ASR (automatic speech recognition)
literature: Maximum Likelihood (ML) and Maximum Mutual Information (MMI). The
solutions to the learning problem under each of those criteria are described below.
8.2.1 Maximum Likelihood (ML) Criterion
In the ML criterion we try to maximize the probability of a given sequence of observations
$O^w$, belonging to a given class $w$, given the HMM $\lambda_w$ of the class $w$, with
respect to the parameters of the model $\lambda_w$. This probability is the total likelihood
of the observations and can be expressed mathematically as
$$L_{tot} = P(O^w \mid \lambda_w).$$
However, since we consider only one class $w$ at a time, we can drop the subscript and
superscript $w$. Then the ML criterion can be given as
$$L_{tot} = \max_{\lambda} P(O \mid \lambda).$$
There is no known way to analytically solve for the model $\lambda = (A, B, \pi)$ which
maximizes the quantity $L_{tot}$. But we can choose model parameters such that it is
locally maximized, using an iterative procedure, like the Baum-Welch method or a gradient
based method, which are described below.
8.2.2 Baum-Welch Algorithm
This method can be derived using simple "occurrence counting" arguments or using
calculus to maximize the auxiliary quantity
$$Q(\lambda, \bar{\lambda}) = \sum_{q} P(q \mid O, \lambda) \log P(O, q \mid \bar{\lambda})$$
over $\bar{\lambda}$ [reference, pp. 344-346]. A special feature of the algorithm is its
guaranteed convergence.
To describe the Baum-Welch algorithm (also known as the Forward-Backward algorithm),
we need to define two more auxiliary variables, in addition to the forward and backward
variables defined in a previous section. These variables can, however, be expressed in
terms of the forward and backward variables.
The first of these variables is defined as the probability of being in state i at time t and in
state j at time t+1, given the model and the observation sequence. Formally,
$$\xi_t(i,j) = P(q_t = i,\ q_{t+1} = j \mid O, \lambda),$$
which in terms of the forward and backward variables is
$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}.$$
The second variable is the a posteriori probability
$$\gamma_t(i) = P(q_t = i \mid O, \lambda),$$
that is, the probability of being in state i at time t, given the observation sequence and the
model. In terms of the forward and backward variables this can be expressed by
$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}.$$
One can see that the relationship between $\gamma_t(i)$ and $\xi_t(i,j)$ is given by
$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j), \qquad 1 \le i \le N,\ 1 \le t \le T-1.$$
Now it is possible to describe the Baum-Welch learning process. Starting with initial
parameters of the model $\lambda$, we calculate the $\alpha$'s and $\beta$'s using the
recursions 1.5 and 1.2, and then the $\xi$'s and $\gamma$'s using 1.12 and 1.15. The next
step is to update the HMM parameters according to eqns 1.16 to 1.18, known as the
re-estimation formulas:
$$\bar{\pi}_i = \gamma_1(i),$$
$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},$$
$$\bar{b}_j(k) = \frac{\sum_{t=1,\, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}.$$
These re-estimation formulas can easily be modified to deal with the continuous density
case too.
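To make the re-estimation step concrete, the following is a minimal NumPy sketch of one Baum-Welch iteration for a discrete HMM (unscaled probabilities for readability; a practical implementation for long DNA sequences would use scaling or log-space arithmetic to avoid numerical underflow):

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation step for a discrete HMM.

    A: (N, N) transition matrix, B: (N, M) emission matrix,
    pi: (N,) initial distribution, obs: length-T sequence of symbol indices.
    Returns the updated (A, B, pi).
    """
    obs = np.asarray(obs)
    N, T = A.shape[0], len(obs)

    # Forward pass: alpha[t, i] = P(o_1..o_t, q_t = i | model)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # Backward pass: beta[t, i] = P(o_{t+1}..o_T | q_t = i, model)
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    likelihood = alpha[-1].sum()          # P(O | model)

    # xi[t, i, j] and gamma[t, i] as defined in the text above
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1]) / likelihood
    gamma = alpha * beta / likelihood

    # Re-estimation formulas
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi
```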
8.2.3 The Decoding Problem and the Viterbi Algorithm
In this case we want to find the most likely state sequence for a given sequence of
observations. The solution to this problem depends upon the way the "most likely state
sequence" is defined. One approach is to find the most likely state at each time t and to
concatenate all such states. But sometimes this method does not give a physically
meaningful state sequence. Therefore we go for another method which has no such problems.
In this method, commonly known as the Viterbi algorithm, the whole state sequence with
the maximum likelihood is found. In order to facilitate the computation we define an
auxiliary variable,
$$\delta_t(i) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1},\ q_t = i,\ o_1, \ldots, o_t \mid \lambda),$$
which gives the highest probability that the partial observation sequence and state sequence
up to time t can have, when the current state is i.
It is easy to observe that the following recursive relationship holds:
$$\delta_{t+1}(j) = b_j(o_{t+1}) \max_{1 \le i \le N} \delta_t(i)\, a_{ij}, \qquad 1 \le j \le N,\ 1 \le t \le T-1,$$
where
$$\delta_1(j) = \pi_j\, b_j(o_1), \qquad 1 \le j \le N.$$
So the procedure to find the most likely state sequence starts from the calculation of
$\delta_T(j)$, $1 \le j \le N$, using the recursion above, while always keeping a pointer
$\psi_t(j)$ to the "winning state" in the maximum-finding operation. Finally the state
$j^*$ is found, where
$$j^* = \arg\max_{1 \le j \le N} \delta_T(j),$$
and, starting from this state, the sequence of states is back-tracked as the pointer in each
state indicates. This gives the required set of states.
This whole algorithm can be interpreted as a search in a graph whose nodes are formed by
the states of the HMM at each of the time instants $t$, $1 \le t \le T$.
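A minimal log-space NumPy sketch of this procedure follows (same parameter conventions as the Baum-Welch sketch above; the array psi stores the back-pointers to the winning states):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely state path and its log-probability for a discrete HMM.

    A: (N, N) transitions, B: (N, M) emissions, pi: (N,) initial distribution,
    obs: length-T sequence of symbol indices.
    """
    obs = np.asarray(obs)
    N, T = A.shape[0], len(obs)
    with np.errstate(divide="ignore"):          # log(0) -> -inf for forbidden moves
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)

    delta = np.zeros((T, N))                    # delta[t, j]: best log-prob ending in j at t
    psi = np.zeros((T, N), dtype=int)           # psi[t, j]: back-pointer to the winning state
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # scores[i, j] = delta[t-1, i] + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]

    # Back-track from the best final state, following the stored pointers
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()
```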
9 References
Mount, D. W. (2000). Bioinformatics: Sequence and Genome Analysis. Cold Spring
Harbor Laboratory Press, 337-380.
Baldi, P. and Brunak, S. (1998). Bioinformatics: The Machine Learning Approach. MIT
Press, Cambridge, Massachusetts.
Krogh, A., Mian, I. S. and Haussler, D. (1994). A model that finds genes in E. coli.
Technical report UCSC-CRL-93-33, revised May 1994.