Gene Prediction
Srivani Narra
Indian Institute of Technology Kanpur
Email: srivani@iitk.ac.in
Abstract:
The gene identification problem is the problem of interpreting nucleotide
sequences by computer, in order to provide tentative annotation on the location, structure,
and functional class of protein-coding genes. This problem is of self-evident importance,
and is far from being fully solved, particularly for higher eukaryotes. With the advent of
whole-genome sequencing projects, there is considerable use for programs that scan
genomic DNA sequences to find genes, particularly those that encode proteins. Even
though there is no substitute for experimentation in determining the exact locations of
genes in the genome sequence, a prior knowledge of the approximate location of genes
will speed up the process considerably, apart from saving a large amount of laboratory
time and resources. Hidden Markov Models have been developed to find protein-coding
genes in DNA. Three models are developed with increasing complexity. The first model
includes states that model the codons and their frequencies in the genes. The second model
includes states for amino acids, with the corresponding codons as the observation events. The
third model, in addition, includes states that model the Shine-Dalgarno motif and Repetitive
Extragenic Palindromic sequences (REPs). In order to compare the performance, an ANN
is also developed to classify a set of nucleotide sequences into genes and non-coding
regions.
1 A primer on molecular biology
Cells are the fundamental working units of every living system. All the instructions
needed to direct their activities are contained within the chemical DNA
(deoxyribonucleic acid). DNA from all organisms is made up of the same chemical and
physical components. The DNA sequence is the particular side-by-side arrangement of
bases along the DNA strand (e.g., ATTCCGGA). This order spells out the exact
instructions required to create a particular organism with its own unique traits.
The genome is an organism's complete set of DNA. However, only fragments of the
genome are responsible for the functioning of the cell. These fragments, called genes, are
the basic physical and functional units of heredity. Genes are made up of a contiguous set
of codons, each of which specifies an amino acid. (Three consecutive nucleotide bases in
a DNA sequence constitute a 'codon'; for example, 'AGT' and 'ACG' are two consecutive
codons in the DNA fragment AGTACGT. Of the 64 possible different arrangements of
the four nucleotides (A, T, G, C) in sets of three, three (TAA, TAG, TGA, read as UAA,
UAG, UGA on the mRNA) functionally act as periods to the translating ribosome, in that
they cause translation to stop. These three codons are therefore termed 'stop codons'.
Similarly, one codon of the genetic code, namely ATG, is reserved as the start codon,
though GTG, CTG and TTG are also rarely observed.)
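The following minimal Python sketch illustrates how a nucleotide string is read in non-overlapping triplets and how start and stop codons are recognized (the example sequence and codon sets below are illustrative assumptions, not data from this work):

```python
# Minimal sketch of reading a DNA string as codons (illustrative only).
START_CODONS = {"ATG", "GTG", "CTG", "TTG"}   # ATG is canonical; the others are rare
STOP_CODONS = {"TAA", "TAG", "TGA"}           # UAA, UAG, UGA on the mRNA

def codons(dna, frame=0):
    """Yield consecutive, non-overlapping triplets starting at the given frame."""
    for i in range(frame, len(dna) - 2, 3):
        yield dna[i:i + 3]

for codon in codons("ATGGCTTTAAGTACGTGA"):
    kind = "start" if codon in START_CODONS else "stop" if codon in STOP_CODONS else ""
    print(codon, kind)
```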
Genes are translated into proteins, and these proteins perform most life functions and
even make up the majority of cellular structures. [reference]
2 Introduction
With the advent of whole-genome sequencing projects, there is considerable use
for programs that scan genomic DNA sequences to find genes. Even though there is no
substitute for experimentation in determining the exact locations of genes in the genome
sequence, a prior knowledge of the approximate location of genes will speed up the
process to a great extent, thereby saving a huge amount of laboratory time and resources.
The simplest method of finding DNA sequences that encode proteins is to search
for open reading frames, or ORFs. An ORF is a length of DNA sequence that contains a
contiguous set of codons, each of which specifies an amino acid [reference], end-marked
by a start codon and a stop codon which mark the beginning and the end of a putative gene
respectively. However, biologically, not every ORF is a coding region. Only a few ORFs
with a certain minimum length and with a specific composition translate into proteins.
Hence, this method fails when there are a large number of flanking stop codons, resulting
in many small ORFs.
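A naive ORF scanner of the kind described above might be sketched as follows (a simplified illustration assuming the forward strand only and the canonical ATG start; the minimum-length threshold is an arbitrary choice):

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=30):
    """Return (start, end) index pairs of ORFs on the forward strand.

    An ORF here runs from an in-frame ATG to the first in-frame stop codon,
    and is reported only if it is at least min_codons codons long.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```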
DNA sequences that encode proteins are not a random chain of the available codons for
an amino acid, but rather an ordered list of specific codons that reflects the evolutionary
origin of the gene and the constraints associated with gene expression. This nonrandom
property of coding sequences can be used to advantage for finding regions in DNA
sequences that encode proteins (Fickett and Tung, 1992). Each species also has a
characteristic pattern of synonymous codon usage (Wada et al., 1992). Also, there is a
strong preference for certain codon pairs within a coding region.
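One way to quantify this bias is simply to tabulate relative codon frequencies over a set of known coding sequences (a hedged sketch; the input is assumed to be a list of in-frame coding sequences supplied as plain strings):

```python
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # the 64 codons

def codon_usage(coding_seqs):
    """Relative frequency of each codon over a collection of coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values()) or 1
    return {codon: counts[codon] / total for codon in ALL_CODONS}
```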
The various methods for finding genes may be classified into the following
categories:
Sequence similarity search: One of the oldest methods of gene identification, based on
sequence conservation due to functional constraint, is to search for regions of similarity
between the sequence under study (or its conceptual translation) and the sequences of
known genes (or their protein products) (Robinson et al., 1994). The obvious
disadvantage of this method is that when no homologues of the new gene are found
in the databases, similarity search will yield little or no useful information.
Based on statistical regularities: The following measures have been used to encapsulate
the features of genes: the codon usage measure, the hexamer-n measure, the hexamer
measure, the open reading frame measure, the amino acid usage measure, the diamino
acid usage measure, and many more. Combining several measures does improve accuracy.
Using signals: Any portion of the DNA whose binding by another biochemical plays a
key role in transcription is called a signal. The collection of all specific instances of some
particular kind of signal will normally tend to be recognizably similar.
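As a toy illustration of a signal sensor, the sketch below scans a short window upstream of a candidate start codon for the best match to a Shine-Dalgarno-like consensus (the consensus string and window size are assumptions of this sketch; real sensors typically use position weight matrices):

```python
SD_CONSENSUS = "AGGAGG"  # assumed Shine-Dalgarno-like consensus

def best_sd_match(dna, start_pos, window=20):
    """Return (offset_upstream, matching_bases) of the best consensus alignment
    found within `window` bases upstream of the start codon at start_pos."""
    best = (None, -1)
    for i in range(max(0, start_pos - window), start_pos - len(SD_CONSENSUS) + 1):
        matches = sum(a == b for a, b in zip(dna[i:i + len(SD_CONSENSUS)], SD_CONSENSUS))
        if matches > best[1]:
            best = (start_pos - i, matches)
    return best
```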
Our aim in this project is to use machine learning techniques like Hidden Markov Models
and Neural Networks to capture the above mentioned properties for gene recognition and
prediction.
3 Issues in Gene Prediction
Three types of posttranscriptional events influence the translation of mRNA into
protein and the accuracy of gene prediction. First, the genetic code of a given genome
may vary from the universal code [36, 37]. Second, one tissue may splice a given mRNA
differently from another, thus creating two similar but also partially different mRNAs
encoding two related but different proteins. Third, mRNAs may be edited, changing the
sequence of the mRNA and, as a result, of the encoded protein. Such changes also depend
on interaction of RNA with RNA-binding proteins. Then there are issues of frame-shifts,
insertions and deletions of bases, overlapping genes, genes on the complementary strand
etc. Straightforward solutions therefore do not work when we need to take all these
issues into consideration.
6 Training set
The FASTA files for the complete genome sequence, genes and their conceptual
translations into proteins are downloaded from Entrez (Genome) database at NCBI
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome). The fasta file for genes
contains around 4113 genes, of which 1000 are used for training the HMM. This departs
from the traditional division into one half as the learning set and the other half as the test
set, and is done to avoid overfitting (considering the enormous number of parameters that
are adjusted during the training phase). These genes are used without any preprocessing to
train the above models to distinguish between coding and non-coding regions. However,
in the case of the extended models, there is a need to train the model not only on coding
regions but also on intergenic regions. Therefore, the complete genome sequence and the
gene listing are used to extract the intergenic regions, which are in turn used to train the
extended models.
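The preprocessing just described might look roughly as follows (a hedged sketch: the FASTA parser is generic, and gene coordinates are assumed to be available as 0-based, end-exclusive (start, end) pairs on the forward strand, which is not necessarily the format of the NCBI listing used here):

```python
def read_fasta(path):
    """Parse a FASTA file into a dict of {header: sequence}."""
    records, header, chunks = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records[header] = "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records

def intergenic_regions(genome, gene_coords):
    """Extract the sequences lying between consecutive genes.

    gene_coords: list of (start, end) pairs, 0-based, end-exclusive.
    """
    regions, prev_end = [], 0
    for start, end in sorted(gene_coords):
        if start > prev_end:
            regions.append(genome[prev_end:start])
        prev_end = max(prev_end, end)
    if prev_end < len(genome):
        regions.append(genome[prev_end:])
    return regions
```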
7 Results
In the case of the HMMs, a logarithmic probability is calculated on the basis
of the Viterbi path, and this is used to determine whether a given sequence is a gene or not.
As mentioned earlier, a set of 1000 coding sequences is used as the learning set, and the
trained models are then run over a test set containing 2000 genes and 500 random
sequences to calculate the accuracy. The accuracies of the two models over the test set are
as high as 97% and 98% respectively. However, the number of false positives increases
when the random sequences are long and flanked by a start and a stop codon. The number
of false negatives is still 0.
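The classification step might be sketched as follows (illustrative only: viterbi_log_prob stands for any routine returning the log-probability of the Viterbi path, as in the Appendix sketch, and the per-codon normalization and threshold value are assumptions, not the exact scoring used in this work):

```python
def classify(seq, viterbi_log_prob, threshold=-1.5):
    """Label a sequence as coding or non-coding from its Viterbi log-probability.

    Normalizing by the number of codons makes scores of sequences of different
    lengths comparable (an assumption of this sketch, not of the report).
    """
    score = viterbi_log_prob(seq) / max(len(seq) // 3, 1)
    return "gene" if score > threshold else "non-coding"
```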
With the Extended HMMs, the genome is annotated by calculating the Viterbi
path through the sequence and then using the state sequence to determine coding and
non-coding regions in the genome. For example, in the case of Extended Model 1, given a
genome (or part of it) as input, the Viterbi path through the sequence is determined, and
the part of the genome between state 0 (the start state) and state 60 (the stop state) is
annotated as a gene, while the part of the genome sequence that loops in state 61 (the
intergenic state) is annotated as an intergenic region. The case is similar for Extended
Model 2. These models locate the exact position of the genes more than 88% of the time,
and approximate positions (off by four or five codons) around 5% of the time.
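The mapping from a Viterbi state path to an annotation can be sketched as below (state indices 0 and 60 follow the description of Extended Model 1 above; the exact bookkeeping is an assumption of this sketch):

```python
def annotate(state_path, start_state=0, stop_state=60):
    """Turn a Viterbi state path into (start, end) gene coordinates.

    Positions from an occurrence of the start state up to the next stop state
    are labelled as a gene; everything else (e.g. runs of the intergenic
    state 61) is left as intergenic region.
    """
    genes, gene_start = [], None
    for pos, state in enumerate(state_path):
        if state == start_state and gene_start is None:
            gene_start = pos
        elif state == stop_state and gene_start is not None:
            genes.append((gene_start, pos))
            gene_start = None
    return genes
```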
The ANN with only 64 parameters (and consequently 64 input nodes) gives an
accuracy of 83%. However, when training is done by calculating the MSE (mean-squared
error) over the entire set rather than over individual sequences, and 6 additional
parameters are added, the accuracy increases considerably, to as much as 90%.
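The report does not spell out the 64 input features; assuming they are the relative frequencies of the 64 codons in a sequence, the ANN input vector could be built as follows (a hedged sketch; the 6 additional parameters are not reconstructed here):

```python
import numpy as np
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(ALL_CODONS)}

def feature_vector(seq):
    """64-dimensional vector of relative codon frequencies for one sequence."""
    v = np.zeros(64)
    for i in range(0, len(seq) - 2, 3):
        idx = CODON_INDEX.get(seq[i:i + 3])
        if idx is not None:      # skip triplets containing ambiguous bases such as N
            v[idx] += 1
    total = v.sum()
    return v / total if total else v
```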
8 Appendix
Generally, the learning problem is how to adjust the HMM parameters, so that the given
set of observations (called the training set) is represented by the model in the best way
for the intended application. Thus it would be clear that the "quantity" we wish to
optimize during the learning process can be different from application to application. In
other words there may be several optimization criteria for learning, out of which a
suitable one is selected depending on the application.
There are two main optimization criteria found in the ASR (automatic speech recognition)
literature: Maximum Likelihood (ML) and Maximum Mutual Information (MMI). The
solutions to the learning problem under each of those criteria are described below.
8.2.1 Maximum Likelihood (ML) Criterion
In the ML criterion we try to maximize the probability of a given sequence of observations
$O^w$, belonging to a given class $w$, given the HMM $\lambda_w$ of the class $w$, with
respect to the parameters of the model $\lambda_w$. This probability is the total likelihood
of the observations and can be expressed mathematically as
$$L_{tot} = P(O^w \mid \lambda_w).$$
However, since we consider only one class $w$ at a time, we can drop the subscript and
superscript $w$. Then the ML criterion can be given as
$$L_{tot} = \max_{\lambda} P(O \mid \lambda).$$
There is no known way to analytically solve for the model $\lambda = (A, B, \pi)$ which
maximizes the quantity $L_{tot}$. But we can choose model parameters such that it is
locally maximized, using an iterative procedure, like the Baum-Welch method or a gradient
based method, which are described below.
8.2.2 Baum-Welch Algorithm
This method can be derived using simple "occurrence counting" arguments or using
calculus to maximize the auxiliary quantity
$$Q(\lambda, \bar{\lambda}) = \sum_{q} P(q \mid O, \lambda) \log P(O, q \mid \bar{\lambda})$$
over $\bar{\lambda}$ [reference, pp. 344-346]. A special feature of the algorithm is its
guaranteed convergence.
To describe the Baum-Welch algorithm (also known as the Forward-Backward algorithm),
we need to define two more auxiliary variables, in addition to the forward and backward
variables defined in a previous section. These variables can, however, be expressed in
terms of the forward and backward variables.
The first of these variables is defined as the probability of being in state i at time t and in
state j at time t+1, given the model and the observation sequence. Formally,
$$\xi_t(i,j) = P(q_t = i,\ q_{t+1} = j \mid O, \lambda),$$
which in terms of the forward and backward variables is
$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}.$$
The second variable is the a posteriori probability
$$\gamma_t(i) = P(q_t = i \mid O, \lambda),$$
that is, the probability of being in state i at time t, given the observation sequence and the
model. In terms of the forward and backward variables this can be expressed by
$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}.$$
One can see that the relationship between $\gamma_t(i)$ and $\xi_t(i,j)$ is given by
$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j), \qquad 1 \le i \le N,\ 1 \le t \le T-1.$$
Now it is possible to describe the Baum-Welch learning process. Starting with initial
parameters of the model $\lambda$, we calculate the $\alpha$'s and $\beta$'s using the
recursions 1.5 and 1.2, and then the $\xi$'s and $\gamma$'s using 1.12 and 1.15. The next
step is to update the HMM parameters according to eqns 1.16 to 1.18, known as the
re-estimation formulas:
$$\bar{\pi}_i = \gamma_1(i),$$
$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},$$
$$\bar{b}_j(k) = \frac{\sum_{t=1,\, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}.$$
These re-estimation formulas can easily be modified to deal with the continuous density
case too.
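To make the re-estimation step concrete, the following is a minimal NumPy sketch of one Baum-Welch iteration for a discrete HMM (unscaled probabilities for readability; a practical implementation for long DNA sequences would use scaling or log-space arithmetic to avoid numerical underflow):

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation step for a discrete HMM.

    A: (N, N) transition matrix, B: (N, M) emission matrix,
    pi: (N,) initial distribution, obs: length-T sequence of symbol indices.
    Returns the updated (A, B, pi).
    """
    obs = np.asarray(obs)
    N, T = A.shape[0], len(obs)

    # Forward pass: alpha[t, i] = P(o_1..o_t, q_t = i | model)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # Backward pass: beta[t, i] = P(o_{t+1}..o_T | q_t = i, model)
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    likelihood = alpha[-1].sum()          # P(O | model)

    # xi[t, i, j] and gamma[t, i] as defined in the text above
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1]) / likelihood
    gamma = alpha * beta / likelihood

    # Re-estimation formulas
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi
```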
8.2.3 The Decoding Problem and the Viterbi Algorithm
In this case we want to find the most likely state sequence for a given sequence of
observations. The solution to this problem depends upon the way the "most likely state
sequence" is defined. One approach is to find the most likely state at each time t and to
concatenate all such states. But sometimes this method does not give a physically
meaningful state sequence. Therefore we go for another method which has no such problems.
In this method, commonly known as the Viterbi algorithm, the whole state sequence with
the maximum likelihood is found. In order to facilitate the computation we define an
auxiliary variable,
$$\delta_t(i) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_{t-1},\ q_t = i,\ o_1, \ldots, o_t \mid \lambda),$$
which gives the highest probability that the partial observation sequence and state sequence
up to time t can have, when the current state is i.
It is easy to observe that the following recursive relationship holds:
$$\delta_{t+1}(j) = b_j(o_{t+1}) \max_{1 \le i \le N} \delta_t(i)\, a_{ij}, \qquad 1 \le j \le N,\ 1 \le t \le T-1,$$
where
$$\delta_1(j) = \pi_j\, b_j(o_1), \qquad 1 \le j \le N.$$
So the procedure to find the most likely state sequence starts from the calculation of
$\delta_T(j)$, $1 \le j \le N$, using the recursion above, while always keeping a pointer
$\psi_t(j)$ to the "winning state" in the maximum-finding operation. Finally the state
$j^*$ is found, where
$$j^* = \arg\max_{1 \le j \le N} \delta_T(j),$$
and, starting from this state, the sequence of states is back-tracked as the pointer in each
state indicates. This gives the required set of states.
This whole algorithm can be interpreted as a search in a graph whose nodes are formed by
the states of the HMM at each of the time instants $t$, $1 \le t \le T$.
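A minimal log-space NumPy sketch of this procedure follows (same parameter conventions as the Baum-Welch sketch above; the array psi stores the back-pointers to the winning states):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely state path and its log-probability for a discrete HMM.

    A: (N, N) transitions, B: (N, M) emissions, pi: (N,) initial distribution,
    obs: length-T sequence of symbol indices.
    """
    obs = np.asarray(obs)
    N, T = A.shape[0], len(obs)
    with np.errstate(divide="ignore"):          # log(0) -> -inf for forbidden moves
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)

    delta = np.zeros((T, N))                    # delta[t, j]: best log-prob ending in j at t
    psi = np.zeros((T, N), dtype=int)           # psi[t, j]: back-pointer to the winning state
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # scores[i, j] = delta[t-1, i] + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]

    # Back-track from the best final state, following the stored pointers
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()
```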
9 References
Mount, D. W. (2000). Bioinformatics: Sequence and Genome Analysis. Cold Spring
Harbor Laboratory Press, 337-380.
Baldi, P. and Brunak, S. (1998). Bioinformatics: The Machine Learning Approach. MIT
Press, Cambridge, Massachusetts.
Krogh, A., Mian, I. S. and Haussler, D. (1994). A model that finds genes in E. coli.
Technical report UCSC-CRL-93-33, revised May 1994.