Bioinformatics-Lesson 07 - Hidden Markov Model
A Markov Chain Model
• Nucleotide frequencies in the human genome (%):

     A      C      T      G
   29.5   20.4   20.5   29.6
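Frequencies like these are obtained by simple counting. The short sketch below is my own illustration, not part of the lecture; the function name and the example sequence are made up.

```python
# A minimal sketch (not from the slides): estimate base frequencies (%) of a
# DNA sequence by counting.
from collections import Counter

def base_frequencies(seq: str) -> dict:
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: 100 * counts[b] / total for b in "ACGT"}

print(base_frequencies("ACCTGAATTGACGT"))   # percentages of A, C, G, T
```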
A Markov Chain Model
• Traditionally, the end of a sequence is not modelled
• we can also add an explicit end state; this allows the model
  to represent
  – a distribution over sequences of different lengths (see the worked
    example below)
  – preferences for ending sequences with certain symbols
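For instance (a worked example of my own, not from the slides): if every state can also move to the explicit end state with the same probability τ, then emitting exactly ℓ symbols requires ℓ − 1 "continue" steps followed by one "end" step, so

  Pr(length = ℓ) = (1 − τ)^(ℓ−1) τ,   ℓ = 1, 2, ...

i.e. the end state gives the chain a proper (geometric) distribution over sequence lengths, summing to 1 over all lengths.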
Markov Chain Model: Definition
• the probability of symbol x_i given the preceding symbol x_{i-1} is the
  transition probability

  a_{x_{i-1} x_i} = Pr(x_i | x_{i-1})

• similarly, we can denote the probability of a sequence x = x_1 ... x_L as

  Pr(x) = a_{B x_1} ∏_{i=2}^{L} a_{x_{i-1} x_i}

  where B denotes the begin state
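As a small illustration (my own, not from the lecture), the sequence probability can be computed in log space from a transition matrix. The transition values below are invented for the example; only the formula matches the slide.

```python
# A minimal sketch, assuming a hypothetical first-order Markov chain over DNA.
from math import log

# illustrative transition matrix a[s][t] = Pr(next = t | current = s)
a = {
    "A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
    "C": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "T": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
a_B = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # begin-state probabilities

def log_prob(x: str) -> float:
    """log Pr(x) = log a_{B x_1} + sum_{i=2}^{L} log a_{x_{i-1} x_i}"""
    return log(a_B[x[0]]) + sum(log(a[s][t]) for s, t in zip(x, x[1:]))

print(log_prob("ACGCGT"))
```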
Markov Chain for Discrimination
• parameters estimated for the + (CpG island) and - (background) models from
  – human sequences containing 48 CpG islands
  – 60,000 nucleotides
• a candidate sequence x is scored by the log-likelihood ratio

  score(x) = log [ Pr(x | model+) / Pr(x | model-) ]
           = Σ_{i=1}^{L} log ( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} )
           = Σ_{i=1}^{L} β_{x_{i-1} x_i}

• the β_{x_{i-1} x_i} are the log-likelihood ratios of the corresponding
  transition probabilities, e.g. the A row of the β table:

        A        C        G        T
  A  -0.740    0.419    0.580   -0.803
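A sequence is then scored by summing β over consecutive pairs. The sketch below is my own, not the lecture's code; only the A row of the table is taken from the slide, and the remaining rows would have to be filled in from the full table (e.g. Durbin et al., Biological Sequence Analysis).

```python
# A minimal sketch: CpG-island score from the log-likelihood ratio table
# beta[s][t] = log(a+_{st} / a-_{st}). Only the A row comes from the slide.
beta = {
    "A": {"A": -0.740, "C": 0.419, "G": 0.580, "T": -0.803},
    # "C": {...}, "G": {...}, "T": {...}   # remaining rows to be filled in
}

def cpg_score(x: str) -> float:
    """Sum of log-likelihood ratios over consecutive pairs; > 0 favours the
    CpG-island (+) model, < 0 the background (-) model."""
    return sum(beta[a][b] for a, b in zip(x, x[1:]))

print(cpg_score("AG"))   # 0.580 > 0: an A->G step looks CpG-island-like
```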
• emission probability: e_k(b) = Pr(x_i = b | π_i = k), the probability of
  observing symbol b when the hidden state is k
The occasionally dishonest casino
• A casino uses a fair die most of the time, but
occasionally switches to a loaded one
– Fair die: Prob(1) = Prob(2) = ... = Prob(6) = 1/6
– Loaded die: Prob(1) = Prob(2) = ... = Prob(5) = 1/10,
  Prob(6) = 1/2
– These are the emission probabilities
• Transition probabilities
– Prob(Fair → Loaded) = 0.01
– Prob(Loaded → Fair) = 0.2
– Transitions between states obey a Markov process
An HMM for the occasionally dishonest casino

[Figure: two-state HMM. The Fair state emits 1-6 each with probability 1/6; the
Loaded state emits 1-5 with probability 1/10 each and 6 with probability 1/2.
Transitions: Fair→Fair 0.99, Fair→Loaded 0.01, Loaded→Fair 0.2, Loaded→Loaded 0.8.]

• the joint probability of a roll sequence x and a state path π is

  Pr(x_1, ..., x_L, π_1, ..., π_L) = a_{0 π_1} ∏_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}}

  (the final factor a_{π_L π_{L+1}} is taken as 1 below, since no end state is
  modelled)

• for the observed rolls x = 6, 2, 6, three possible state paths:

  π^(1) = FFF:  Pr(x, π^(1)) = a_{0F} e_F(6) a_{FF} e_F(2) a_{FF} e_F(6)
                             = 0.5 × 1/6 × 0.99 × 1/6 × 0.99 × 1/6 ≈ 0.00227

  π^(2) = LLL:  Pr(x, π^(2)) = a_{0L} e_L(6) a_{LL} e_L(2) a_{LL} e_L(6)
                             = 0.5 × 0.5 × 0.8 × 0.1 × 0.8 × 0.5 = 0.008

  π^(3) = LFL:  Pr(x, π^(3)) = a_{0L} e_L(6) a_{LF} e_F(2) a_{FL} e_L(6)
                             = 0.5 × 0.5 × 0.2 × 1/6 × 0.01 × 0.5 ≈ 0.0000417
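The three numbers above can be checked with a few lines of arithmetic. This is my own sanity-check sketch, not part of the lecture; the variable names are made up.

```python
# Reproducing the three path probabilities from the slide.
e_F = 1 / 6                      # fair die: any face
e_L6, e_L2 = 1 / 2, 1 / 10       # loaded die: a six vs. any other face

p_FFF = 0.5 * e_F  * 0.99 * e_F  * 0.99 * e_F    # ≈ 0.00227
p_LLL = 0.5 * e_L6 * 0.8  * e_L2 * 0.8  * e_L6   # = 0.008
p_LFL = 0.5 * e_L6 * 0.2  * e_F  * 0.01 * e_L6   # ≈ 0.0000417
print(p_FFF, p_LLL, p_LFL)
```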
Making the inference
• the model assigns a probability to each explanation (state path) of the
  observation, e.g. P(326|FFL), via

  Pr(x_1, ..., x_L, π_1, ..., π_L) = a_{0 π_1} ∏_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}}

• the total probability of the observation sums over all state paths,

  Pr(x_1, ..., x_L) = Σ_π Pr(x_1, ..., x_L, π)

  but the number of paths can be exponential in the length of the
  sequence... (see the brute-force sketch below)
• the most probable path is

  π* = argmax_π Pr(x, π)

• to find π*, consider all possible ways the last symbol of x could have
  been emitted
• let v_k(i) = probability of the most likely path π_1, ..., π_i emitting
  x_1, ..., x_i such that π_i = k
• then

  v_k(i) = e_k(x_i) max_r v_r(i-1) a_{rk}
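To make the "exponentially many paths" point concrete, here is a brute-force sketch of my own (not the lecture's code) that enumerates all 2^L state paths of the casino HMM. It is feasible only for tiny L, which is exactly why the recursion above, implemented by the Viterbi algorithm on the next slide, is needed.

```python
# Brute-force enumeration over all 2^L paths of the casino HMM.
from itertools import product

a0 = {"F": 0.5, "L": 0.5}                                  # begin -> state
a  = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.20, "L": 0.80}}
e  = {"F": {x: 1/6 for x in range(1, 7)},                  # fair die
      "L": {**{x: 1/10 for x in range(1, 6)}, 6: 1/2}}     # loaded die

def joint_prob(x, pi):
    """Pr(x, pi), with the final end transition ignored as on the slide."""
    p = a0[pi[0]] * e[pi[0]][x[0]]
    for prev, k, xi in zip(pi, pi[1:], x[1:]):
        p *= a[prev][k] * e[k][xi]
    return p

x = [6, 2, 6]
paths = list(product("FL", repeat=len(x)))             # 2^L candidate paths
print(sum(joint_prob(x, pi) for pi in paths))          # Pr(x), summed over paths
print(max(paths, key=lambda pi: joint_prob(x, pi)))    # ('L', 'L', 'L')
```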
The Viterbi Algorithm
• Initialization (i = 0):

  v_0(0) = 1,   v_k(0) = 0 for k > 0

• Recursion (i = 1, ..., L): for each state k

  v_k(i) = e_k(x_i) max_r v_r(i-1) a_{rk}

• Termination:

  Pr(x, π*) = max_k v_k(L) a_{k0}
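Below is a minimal sketch of the algorithm in Python, not the lecture's code: the parameter dictionaries encode the casino HMM from the earlier slide, while the function name, the log-space bookkeeping, and the traceback structure are my own choices (the end transition a_{k0} is taken as 1, since no end state is modelled).

```python
# Viterbi decoding for the casino HMM, in log space to avoid underflow.
from math import log

states = ["F", "L"]                       # Fair, Loaded
a0 = {"F": 0.5, "L": 0.5}                 # begin -> state
a  = {"F": {"F": 0.99, "L": 0.01},        # transition probabilities a_{rk}
      "L": {"F": 0.20, "L": 0.80}}
e  = {"F": {x: 1/6 for x in range(1, 7)},                  # fair die
      "L": {**{x: 1/10 for x in range(1, 6)}, 6: 1/2}}     # loaded die

def viterbi(x):
    """Return the most probable state path for the roll sequence x."""
    # v[k] holds log v_k(i); ptr[i][k] remembers the best predecessor r
    v = {k: log(a0[k]) + log(e[k][x[0]]) for k in states}
    ptr = []
    for xi in x[1:]:
        ptr.append({})
        new_v = {}
        for k in states:
            # recursion: v_k(i) = e_k(x_i) * max_r v_r(i-1) * a_{rk}
            best_r = max(states, key=lambda r: v[r] + log(a[r][k]))
            ptr[-1][k] = best_r
            new_v[k] = log(e[k][xi]) + v[best_r] + log(a[best_r][k])
        v = new_v
    # termination: pick the best final state, then trace back
    last = max(states, key=lambda k: v[k])
    path = [last]
    for back in reversed(ptr):
        path.append(back[path[-1]])
    return list(reversed(path))

print(viterbi([6, 2, 6]))   # ['L', 'L', 'L'], the LLL path from the worked example
```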
[Figure: the Fair/Loaded casino HMM again, annotated with the Viterbi recursion
v_k(i) = e_k(x_i) max_r v_r(i-1) a_{rk}.]
Viterbi gets it right more often than not
The numbers in the first rows show 300 rolls of the die; the second rows show
which die was actually used for each roll; the third rows show the die
predicted by the Viterbi algorithm.
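The experiment can be reproduced with a short simulation. The sketch below is my own, not the lecture's code: it samples rolls from the casino HMM while remembering which die was really used; decoding those rolls with the Viterbi sketch above then recovers the true die for most positions, which is the point of this slide.

```python
# Simulate 300 rolls from the casino HMM, keeping the true (hidden) die.
import random

random.seed(1)

def simulate(n_rolls=300):
    rolls, dice = [], []
    state = random.choice(["F", "L"])           # a_{0F} = a_{0L} = 0.5
    for _ in range(n_rolls):
        if state == "F":
            rolls.append(random.randint(1, 6))  # fair die: 1..6 uniform
            dice.append("F")
            if random.random() < 0.01:          # Fair -> Loaded
                state = "L"
        else:
            # loaded die: 6 with probability 1/2, otherwise 1..5 uniform
            rolls.append(6 if random.random() < 0.5 else random.randint(1, 5))
            dice.append("L")
            if random.random() < 0.2:           # Loaded -> Fair
                state = "F"
    return rolls, dice

rolls, dice = simulate()
# dec = viterbi(rolls)   # using the Viterbi sketch from the previous slide
# print(sum(d == t for d, t in zip(dec, dice)) / len(dice))   # fraction of rolls decoded correctly
```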