Hidden Markov Models: CH 3.1, 3.2 of DEKM
CpG islands
The dinucleotide CG (called CpG) is rare
C in a CG often gets methylated and the resulting C then mutates to T
Methylation is suppressed in some regions of the genome, called CpG islands. Such CpG islands are often found around genes. Problem: find CpG islands in a whole genome.
Markov Chains
Given a sequence, we can calculate its probability under a CpG Markov chain and its probability under a normal-genome Markov chain, and then compute the log-odds score
$$S(x) = \log \frac{\Pr(x \mid \text{CpG Markov chain})}{\Pr(x \mid \text{normal Markov chain})}$$
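As a concrete illustration, here is a minimal Python sketch of this score. The two transition tables are illustrative placeholder values, not the counts estimated in DEKM, and the initial-base term is ignored.

```python
import math

# Hypothetical first-order transition probabilities P(next base | current base).
# These numbers are illustrative placeholders, not the tables estimated in DEKM.
cpg_trans = {
    'A': {'A': 0.18, 'C': 0.27, 'G': 0.43, 'T': 0.12},
    'C': {'A': 0.17, 'C': 0.37, 'G': 0.27, 'T': 0.19},
    'G': {'A': 0.16, 'C': 0.34, 'G': 0.38, 'T': 0.12},
    'T': {'A': 0.08, 'C': 0.36, 'G': 0.38, 'T': 0.18},
}
normal_trans = {
    'A': {'A': 0.30, 'C': 0.21, 'G': 0.28, 'T': 0.21},
    'C': {'A': 0.32, 'C': 0.30, 'G': 0.08, 'T': 0.30},
    'G': {'A': 0.25, 'C': 0.24, 'G': 0.30, 'T': 0.21},
    'T': {'A': 0.18, 'C': 0.24, 'G': 0.29, 'T': 0.29},
}

def log_odds_score(x, model=cpg_trans, null=normal_trans):
    """S(x) = log P(x | CpG chain) - log P(x | normal chain),
    ignoring the initial-base term for simplicity."""
    s = 0.0
    for prev, cur in zip(x, x[1:]):
        s += math.log(model[prev][cur]) - math.log(null[prev][cur])
    return s

print(log_odds_score("CGCGCGTACG"))  # positive suggests a CpG-island-like sequence
```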
Markov Chains
Slide a fixed-length window, say of length 100 bp. If any window x has a positive (or strongly positive) S(x), mark it as (part of) a CpG island. A sketch of this scan follows below.
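Continuing the sketch above, a sliding-window scan might look like this; the window length, step, and threshold are arbitrary choices, and `log_odds_score` is the helper defined in the previous sketch.

```python
def window_scan(genome, window=100, step=1, threshold=0.0):
    """Slide a fixed-length window over the genome and report windows whose
    log-odds score S(x) exceeds the threshold (candidate CpG islands)."""
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        x = genome[start:start + window]
        if log_odds_score(x) > threshold:
            hits.append((start, start + window))
    return hits
```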
Issues
Why fixed-length windows? Why 100 bp? A more satisfactory approach is to build one model for the whole sequence that incorporates both Markov chains.
[Figure: two-state model with a "CpG island" state (high CG) and a "normal" state (rare CG)]
Two states. Each state emits sequence. The sequence emitted by the CpG-island state is high in CG frequency. The concatenation of the emitted sequences is the genome.
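For concreteness, a two-state model of this kind could be written down as plain Python dictionaries; all numbers below are illustrative placeholders, not fitted parameters.

```python
# A minimal two-state HMM for the CpG-island problem.
# States: '+' (CpG island, CG-rich emissions) and '-' (normal genome).
# All numbers are illustrative placeholders, not fitted parameters.
states = ['+', '-']

transitions = {   # a_kl: P(next state = l | current state = k)
    '+': {'+': 0.99, '-': 0.01},
    '-': {'+': 0.0001, '-': 0.9999},
}
emissions = {     # e_k(b): P(symbol = b | state = k)
    '+': {'A': 0.15, 'C': 0.33, 'G': 0.36, 'T': 0.16},
    '-': {'A': 0.27, 'C': 0.24, 'G': 0.23, 'T': 0.26},
}
```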
HMM vs MC
The main difference between an HMM and a Markov chain is that in an HMM there is not a one-to-one correspondence between the states visited and the symbols observed.
HMM: Formal
The path or parse $\pi$ is the sequence of states visited by the process. It is itself a simple Markov chain, given by the following transition probabilities:
$$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$$
HMM: Formal
For technical reasons, assume a begin state (state 0) from which the process transitions to any state $k$ with probability $a_{0k}$. Similarly, assume an end state (also labeled 0), with transition probabilities $a_{k0}$ into it.
HMM: Formal
In each state other than 0, the model emits a symbol according to some distribution, the emission probability distribution:
$$e_k(b) = P(x_i = b \mid \pi_i = k)$$
$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$
where $\pi_{L+1} = 0$. We may ignore $a_{0\pi_1}$ and $a_{\pi_L 0}$ for simplicity.
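A direct translation of this formula into Python, ignoring the begin and end transitions as stated above; `transitions` and `emissions` are the placeholder dictionaries from the earlier two-state sketch.

```python
def joint_probability(x, path, transitions, emissions):
    """P(x, pi) = prod_i e_{pi_i}(x_i) * a_{pi_{i-1}, pi_i},
    ignoring the begin and end transitions as above."""
    p = emissions[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= transitions[path[i - 1]][path[i]] * emissions[path[i]][x[i]]
    return p

# e.g. joint_probability("CGCG", "++++", transitions, emissions)
```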
Decoding
In an HMM-generated sequence, we can't say for sure whether a particular symbol was generated by a particular state. Yet it is often the underlying states that we are most interested in finding out. This is called decoding (a term from speech recognition).
$$\pi^{*} = \operatorname*{argmax}_{\pi} P(x, \pi)$$
Viterbi algorithm
Suppose $v_k(i)$ is the probability of the most probable path ending in state $k$ with observation $x_i$. Suppose $v_k(i)$ is known for all $k$. Then:
$$v_l(i+1) = e_l(x_{i+1}) \max_k \big( v_k(i)\, a_{kl} \big)$$
Viterbi algorithm
Would you implement this as a plain recursion? No: use dynamic programming.
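A minimal dynamic-programming implementation might look like the following sketch; it works in log space to avoid underflow, and `init` is an assumed initial distribution playing the role of $a_{0k}$. The other parameters are the dictionaries from the earlier two-state sketch.

```python
import math

def viterbi(x, states, transitions, emissions, init):
    """Dynamic-programming Viterbi decoding in log space.
    init[k] plays the role of a_0k; states are one-character labels."""
    v = [{k: math.log(init[k]) + math.log(emissions[k][x[0]]) for k in states}]
    ptr = [{}]
    for i in range(1, len(x)):
        v.append({})
        ptr.append({})
        for l in states:
            best_k = max(states, key=lambda k: v[i - 1][k] + math.log(transitions[k][l]))
            v[i][l] = (v[i - 1][best_k] + math.log(transitions[best_k][l])
                       + math.log(emissions[l][x[i]]))
            ptr[i][l] = best_k
    # Backtrack from the best final state to recover pi*.
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return ''.join(reversed(path))

# e.g. viterbi("CGCGCGTATATA", states, transitions, emissions, {'+': 0.5, '-': 0.5})
```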
Probability of sequence
Calculate the probability of the sequence by summing over all paths:
$$P(x) = \sum_{\pi} P(x, \pi)$$
The forward variable is
$$f_k(i) = P(x_1 \ldots x_i,\ \pi_i = k)$$
For any position $i$, the sequence probability can be decomposed as
$$P(x) = \sum_k P(x_1 \ldots x_i,\ \pi_i = k)\, P(x_{i+1} \ldots x_L \mid \pi_i = k)$$
The first term on the RHS is $f_k(i)$; call the second term $b_k(i)$:
$$b_k(i) = P(x_{i+1} \ldots x_L \mid \pi_i = k)$$
Recurrence:
$$b_k(i) = \sum_{l} a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$$
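The forward and backward tables can be filled in the same dynamic-programming style. The sketch below uses the dictionary-based parameters from the earlier examples, ignores the end transitions $a_{k0}$, and keeps raw probabilities for readability; a real implementation over genome-length sequences would rescale or work in log space.

```python
def forward(x, states, transitions, emissions, init):
    """f_k(i) = P(x_1..x_i, pi_i = k); returns the full table.
    End transitions a_k0 are ignored, so P(x) = sum_k f_k(L)."""
    f = [{k: init[k] * emissions[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({l: emissions[l][x[i]] * sum(f[i - 1][k] * transitions[k][l] for k in states)
                  for l in states})
    return f

def backward(x, states, transitions, emissions):
    """b_k(i) = P(x_{i+1}..x_L | pi_i = k), filled by the recurrence above."""
    L = len(x)
    b = [{} for _ in range(L)]
    b[L - 1] = {k: 1.0 for k in states}   # b_k(L) = 1 when a_k0 is ignored
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(transitions[k][l] * emissions[l][x[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    return b

# Sanity check: sum_k f[i][k] * b[i][k] equals P(x) at every position i.
```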
The transition probabilities are re-estimated from expected transition counts $A_{kl}$:
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$$
where
$$A_{kl} = \sum_{\pi} A_{kl}(\pi)\, P(\pi \mid x)$$
and $A_{kl}(\pi)$ is the number of $k \to l$ transitions in the path $\pi$.
Expectation step
Note that $A_{kl}$ is $E_{P(\pi \mid x)}\big(A_{kl}(\pi)\big)$. Note also that:
$$A_{kl}(\pi) = \sum_{i=1}^{L} A^{i}_{kl}(\pi)$$
where $A^{i}_{kl}(\pi)$ is 1 if $\pi_i = k$ and $\pi_{i+1} = l$, and 0 otherwise.
$$A_{kl} = E\big(A_{kl}(\pi)\big) = \sum_{i} E\big(A^{i}_{kl}(\pi)\big)$$
Expectation step
But:
$$E\big(A^{i}_{kl}(\pi)\big) = P(\pi_i = k,\ \pi_{i+1} = l \mid x) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}$$
Thus, having already run the forward and backward algorithms, we can calculate $A_{kl}$.
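Putting the pieces together, here is a sketch of this E-step count, reusing the `forward` and `backward` helpers from the previous sketch.

```python
def expected_transition_counts(x, states, transitions, emissions, init):
    """E-step: A_kl = sum_i P(pi_i = k, pi_{i+1} = l | x)
                    = sum_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)."""
    f = forward(x, states, transitions, emissions, init)
    b = backward(x, states, transitions, emissions)
    px = sum(f[-1][k] for k in states)        # P(x), end transitions ignored
    A = {k: {l: 0.0 for l in states} for k in states}
    for i in range(len(x) - 1):
        for k in states:
            for l in states:
                A[k][l] += (f[i][k] * transitions[k][l]
                            * emissions[l][x[i + 1]] * b[i + 1][l]) / px
    return A

# M-step: a_kl = A[k][l] / sum(A[k].values()), the normalization shown earlier.
```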