Bioinformatics-Lesson 07 - Hidden Markov Model

This document provides an introduction to Markov chain models and hidden Markov models. It discusses using Markov chains to model nucleotide frequencies in DNA and discriminate between CpG islands and other genomic regions. It then introduces hidden Markov models and the key questions they aim to answer: the likelihood of a sequence, the most probable path that generated a sequence, and learning model parameters from example sequences. An example of an occasionally dishonest casino is used to illustrate hidden Markov models and the forward algorithm for calculating sequence likelihoods.


Introduction to Bioinformatics

Lecture 9
A Markov Chain Model
• Nucleotide frequencies in the human genome (%):

    A      C      T      G
    29.5   20.4   20.5   29.6
A Markov Chain Model
• Traditionally the end of a sequence is not modelled
• We can also have an explicit end state; this allows the model to represent
  – a distribution over sequences of different lengths
  – preferences for ending sequences with certain symbols
Markov Chain Model: Definition
• a Markov chain model is defined by
  – a set of states
    • some states emit symbols
    • other states are silent (e.g. the begin and end states)
  – a set of transitions with associated probabilities
    • the transitions emanating from a given state define a distribution over the possible next states
Markov Chain Model: Property
• given some sequence x of length L, we can ask how probable the sequence is given our model
• for any probabilistic model of sequences, we can write this probability as

  $\Pr(x) = \Pr(x_L, x_{L-1}, \dots, x_1) = \Pr(x_L \mid x_{L-1}, \dots, x_1)\,\Pr(x_{L-1} \mid x_{L-2}, \dots, x_1) \cdots \Pr(x_1)$

• key property of a (1st-order) Markov chain: the probability of each $x_i$ depends only on the value of $x_{i-1}$

  $\Pr(x) = \Pr(x_L \mid x_{L-1})\,\Pr(x_{L-1} \mid x_{L-2}) \cdots \Pr(x_2 \mid x_1)\,\Pr(x_1) = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})$
The Probability of a Sequence for a Given Markov Chain Model

  $\Pr(\mathrm{cggt}) = \Pr(c)\,\Pr(g \mid c)\,\Pr(g \mid g)\,\Pr(t \mid g)\,\Pr(\mathrm{end} \mid t)$
Markov Chain Model: Notation
• the transition parameters can be denoted by $a_{x_{i-1} x_i}$, where

  $a_{x_{i-1} x_i} = \Pr(x_i \mid x_{i-1})$

• similarly, we can denote the probability of a sequence x as

  $\Pr(x) = a_{B x_1} \prod_{i=2}^{L} a_{x_{i-1} x_i} = \Pr(x_1) \prod_{i=2}^{L} \Pr(x_i \mid x_{i-1})$

  where $a_{B x_1}$ represents the transition from the begin state to $x_1$ (a small code sketch of this calculation follows)
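To make the notation concrete, here is a minimal Python sketch of this calculation; the transition and initial probabilities below are illustrative placeholder values, not estimates from real sequence data.

```python
# Minimal sketch: Pr(x) under a first-order Markov chain,
# Pr(x) = Pr(x1) * prod_{i=2..L} Pr(x_i | x_{i-1}).
# The probabilities below are illustrative placeholders only.

initial = {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}      # Pr(x1), i.e. a_{B x1}
transition = {('c', 'g'): 0.25, ('g', 'g'): 0.30, ('g', 't'): 0.20}
# ... a full model would list all 16 dinucleotide transitions

def markov_chain_prob(x, initial, transition):
    """Probability of sequence x under the chain defined by the two tables."""
    p = initial[x[0]]
    for prev, cur in zip(x, x[1:]):
        p *= transition[(prev, cur)]
    return p

print(markov_chain_prob("cggt", initial, transition))        # 0.25 * 0.25 * 0.30 * 0.20
```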
CpG Islands
(Written "CpG" to distinguish the dinucleotide from a C≡G base pair)

• CpG dinucleotides are rarer than would be expected from the independent probabilities of C and G.
– Reason: When CpG occurs, C is typically chemically
modified by methylation and there is a relatively high
chance of methyl-C mutating into T
• A CpG island is a region where CpG dinucleotides
are much more abundant than elsewhere.
• High CpG frequency may be biologically significant;
e.g., may signal promoter region (“start” of a gene).
Markov Chain for Discrimination
• suppose we want to distinguish CpG islands from
other sequence regions
• given sequences from CpG islands, and sequences
from other regions, we can construct
– a model to represent CpG islands (model +)
– a null model to represent other regions (model -)
• can then score a test sequence by:

  $\mathrm{score}(x) = \log \frac{\Pr(x \mid \mathrm{model}+)}{\Pr(x \mid \mathrm{model}-)}$
Markov Chain for Discrimination
• parameters estimated for + and - models
– human sequences containing 48 CpG islands
– 60,000 nucleotides

• Transition probabilities were calculated for both models


Markov Chain for Discrimination
• Calculated the log-odds ratio:

  $\mathrm{score}(x) = \log \frac{\Pr(x \mid \mathrm{model}+)}{\Pr(x \mid \mathrm{model}-)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$

• the $\beta_{x_{i-1} x_i}$ are the log-likelihood ratios of the corresponding transition probabilities:

  β      A        C        G        T
  A    -0.740    0.419    0.580   -0.803
  C    -0.913    0.302    1.812   -0.685
  G    -0.624    0.461    0.331   -0.730
  T    -1.169    0.573    0.393   -0.679
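As an illustration, a short Python sketch of this scoring scheme using the β values from the table above; begin-state transitions are ignored here for simplicity.

```python
# Sketch: log-odds score of a sequence as the sum of beta values over
# adjacent nucleotide pairs (values copied from the table above).
# Positive scores favour the CpG-island (+) model, negative the background (-) model.

BETA = {
    'A': {'A': -0.740, 'C': 0.419, 'G': 0.580, 'T': -0.803},
    'C': {'A': -0.913, 'C': 0.302, 'G': 1.812, 'T': -0.685},
    'G': {'A': -0.624, 'C': 0.461, 'G': 0.331, 'T': -0.730},
    'T': {'A': -1.169, 'C': 0.573, 'G': 0.393, 'T': -0.679},
}

def log_odds_score(x):
    """score(x) = sum_i beta[x_{i-1}][x_i] over consecutive pairs of x."""
    return sum(BETA[prev][cur] for prev, cur in zip(x, x[1:]))

print(log_odds_score("CGCG"))   # positive: CpG-island-like
print(log_odds_score("TATA"))   # negative: background-like
```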


Markov Chain for Discrimination
• Solid bars represent non-CpG regions
• Dotted bars represent CpG islands
• Errors could be due to inadequate modelling or mislabelling
A simple Hidden Markov Model (HMM)
• Given, say, a T in our input sequence, which state emitted it?
Why Hidden?
• we'll distinguish between the observed parts of a problem and the hidden parts
• in the Markov chain models it is clear which state accounts for each part of the observed sequence
• in the model above, there are multiple states that could account for each part of the observed sequence
  – this is the hidden part of the problem
  – this is the essential difference between a Markov chain and a hidden Markov model
Hidden Markov Models
• Components:
  – Observed variables
    • Emitted symbols
  – Hidden variables
  – Relationships between them
    • Represented by a graph with transition probabilities
• Goal: Find the most likely explanation for the observed variables
Notations in HMM
• States are decoupled from symbols
• x is the sequence of symbols emitted by the model
  – $x_i$ is the symbol emitted at time i
• A path, $\pi$, is a sequence of states
  – The i-th state in $\pi$ is $\pi_i$
• $a_{kr}$ is the probability of making a transition from state k to state r:

  $a_{kr} = \Pr(\pi_i = r \mid \pi_{i-1} = k)$

• $e_k(b)$ is the probability that symbol b is emitted when in state k:

  $e_k(b) = \Pr(x_i = b \mid \pi_i = k)$
The occasionally dishonest casino
• A casino uses a fair die most of the time, but occasionally switches to a loaded one
  – Fair die: Prob(1) = Prob(2) = . . . = Prob(6) = 1/6
  – Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10, Prob(6) = 1/2
  – These are the emission probabilities
• Transition probabilities
  – Prob(Fair → Loaded) = 0.01
  – Prob(Loaded → Fair) = 0.2
  – Transitions between states obey a Markov process
An HMM for the occasionally dishonest casino

  Emission probabilities $e_k(b)$:

    b     Fair    Loaded
    1     1/6     1/10
    2     1/6     1/10
    3     1/6     1/10
    4     1/6     1/10
    5     1/6     1/10
    6     1/6     1/2

  Transition probabilities $a_{kl}$:  $a_{FF} = 0.99$, $a_{FL} = 0.01$, $a_{LF} = 0.2$, $a_{LL} = 0.8$

The occasionally dishonest casino
• Known:
  – The structure of the model
  – The transition probabilities
• Hidden: What the casino did
  – FFFFFLLLLLLLFFFF...
• Observable: The series of die tosses
  – 3415256664666153...
• What we must infer:
  – When was a fair die used?
  – When was a loaded one used?
  – The answer is a sequence: FFFFFFFLLLLLLFFF...
Three Important Questions
• How likely is a given sequence?
– the Forward algorithm
• What is the most probable “path” for
generating a given sequence?
– the Viterbi algorithm
• How can we learn the HMM parameters
given a set of sequences?
– the Baum-Welch (Forward-Backward)
algorithm
How Likely is a Given Sequence?
The probability that the path $\pi_1, \pi_2, \dots, \pi_L$ is taken and the sequence $x_1, x_2, \dots, x_L$ is generated:

  $\Pr(x_1, \dots, x_L, \pi_1, \dots, \pi_L) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$

(assuming begin/end are the only silent states on the path)
The occasionally dishonest casino

  $x = x_1, x_2, x_3 = 6, 2, 6$

  $\pi^{(1)} = FFF$:
  $\Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \approx 0.00227$

  $\pi^{(2)} = LLL$:
  $\Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \times 0.5 \times 0.8 \times 0.1 \times 0.8 \times 0.5 = 0.008$

  $\pi^{(3)} = LFL$:
  $\Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6) = 0.5 \times 0.5 \times 0.2 \times \tfrac{1}{6} \times 0.01 \times 0.5 \approx 0.0000417$
Making the inference
• The model assigns a probability to each explanation of the observation:
  e.g. P(326 | FFL)
• Maximum Likelihood: Determine which explanation is most likely
  – Find the path most likely to have produced the observed sequence
• Total probability: Determine the probability that the observed sequence was produced by the HMM
  – Consider all paths that could have produced the observed sequence
How Likely is a Given Sequence?
• for a single path $\pi$, the probability that the sequence x is generated:

  $\Pr(x_1, \dots, x_L, \pi_1, \dots, \pi_L) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$

• the probability over all such paths is:

  $\Pr(x_1, \dots, x_L) = \sum_{\pi} \Pr(x_1, \dots, x_L, \pi)$

• but the number of paths can be exponential in the length of the sequence...
• the Forward algorithm enables us to compute this efficiently (see the sketch below)
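As a sketch (not part of the original slides), the Forward algorithm for the casino model might look like this in Python; since the casino HMM has no explicit end state, termination is simply a sum over the final column.

```python
# Sketch of the Forward algorithm for the casino HMM.
# f_k(i) = Pr(x_1..x_i, pi_i = k); Pr(x) is obtained by summing the last column.

a0 = {'F': 0.5, 'L': 0.5}                              # begin-state transitions
a  = {'F': {'F': 0.99, 'L': 0.01},
      'L': {'F': 0.2,  'L': 0.8}}
e  = {'F': {r: 1 / 6 for r in range(1, 7)},
      'L': {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}
states = ('F', 'L')

def forward(x):
    # initialisation: emit the first symbol straight out of the begin state
    f = [{k: a0[k] * e[k][x[0]] for k in states}]
    # recursion: f_k(i) = e_k(x_i) * sum_r f_r(i-1) * a_{r k}
    for xi in x[1:]:
        prev = f[-1]
        f.append({k: e[k][xi] * sum(prev[r] * a[r][k] for r in states)
                  for k in states})
    # termination: Pr(x) = sum_k f_k(L)   (no explicit end state here)
    return sum(f[-1].values())

print(forward([6, 2, 6]))   # total probability of the rolls, summed over all paths
```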


The most probable path
The most likely path $\pi^*$ satisfies

  $\pi^* = \arg\max_{\pi} \Pr(x, \pi)$

To find $\pi^*$, consider all possible ways the last symbol of x could have been emitted.

Let $v_k(i)$ = probability of the path $\pi_1, \dots, \pi_i$ most likely to emit $x_1, \dots, x_i$ such that $\pi_i = k$

Then

  $v_k(i) = e_k(x_i) \max_{r} \{ v_r(i-1)\, a_{rk} \}$
The Viterbi Algorithm
• Initialization (i = 0):

  $v_0(0) = 1$, $v_k(0) = 0$ for $k > 0$

• Recursion (i = 1, . . . , L): for each state k

  $v_k(i) = e_k(x_i) \max_{r} \{ v_r(i-1)\, a_{rk} \}$

• Termination:

  $\Pr(x, \pi^*) = \max_{k} \{ v_k(L)\, a_{k0} \}$

To find $\pi^*$, use trace-back, as in dynamic programming (a sketch follows below)
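A minimal Python sketch of this recursion with trace-back for the casino HMM (again omitting an explicit end state, so termination just takes the best score in the final column).

```python
# Sketch of the Viterbi algorithm for the casino HMM, with trace-back.
# v_k(i) = prob. of the most likely path emitting x_1..x_i and ending in state k.

a0 = {'F': 0.5, 'L': 0.5}
a  = {'F': {'F': 0.99, 'L': 0.01},
      'L': {'F': 0.2,  'L': 0.8}}
e  = {'F': {r: 1 / 6 for r in range(1, 7)},
      'L': {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}
states = ('F', 'L')

def viterbi(x):
    v = [{k: a0[k] * e[k][x[0]] for k in states}]      # initialisation
    back = []
    for xi in x[1:]:                                    # recursion
        prev, col, ptr = v[-1], {}, {}
        for k in states:
            best_r = max(states, key=lambda r: prev[r] * a[r][k])
            col[k] = e[k][xi] * prev[best_r] * a[best_r][k]
            ptr[k] = best_r
        v.append(col)
        back.append(ptr)
    last = max(states, key=lambda k: v[-1][k])          # termination (no end state)
    path = [last]                                       # trace-back
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return ''.join(reversed(path)), v[-1][last]

print(viterbi([6, 2, 6]))   # ('LLL', 0.008), matching the worked example below
```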


Viterbi: Example

x = 6, 2, 6

        i = 0   x1 = 6               x2 = 2                                        x3 = 6
  B     1       0                    0                                             0
  F     0       (1/6)(1/2) = 1/12    (1/6)·max{(1/12)·0.99, (1/4)·0.2} = 0.01375   (1/6)·max{0.01375·0.99, 0.02·0.2} = 0.00226875
  L     0       (1/2)(1/2) = 1/4     (1/10)·max{(1/12)·0.01, (1/4)·0.8} = 0.02     (1/2)·max{0.01375·0.01, 0.02·0.8} = 0.008

using the recursion $v_k(i) = e_k(x_i) \max_{r} \{ v_r(i-1)\, a_{rk} \}$ with the casino HMM's emission and transition probabilities given earlier (Fair: all 1/6; Loaded: 1/10 for faces 1–5 and 1/2 for 6; $a_{FF} = 0.99$, $a_{FL} = 0.01$, $a_{LF} = 0.2$, $a_{LL} = 0.8$).

Trace-back from the largest final entry ($v_L(3) = 0.008$) gives the most probable path $\pi^* = LLL$.
Viterbi gets it right more often than not

The numbers in the first rows show 300 rolls of a die. The second rows show which die was actually used for each roll. The third rows show the Viterbi algorithm's prediction.
