HMMs
CpG islands: regions of more than 500 bp with G+C content > 55%
Sequence: s = t t a c g g t …  (length N)

0th-order: P0(s) = p(t)·p(t)·p(a)·p(c)·p(g)·… = ∏i=1..N p(si)
1st-order: P1(s) = p(t)·p(t|t)·p(a|t)·p(c|a)·… = p(s1) · ∏i=2..N p(si | si−1)
2nd-order: P2(s) = p(tt)·p(a|tt)·p(c|ta)·p(g|ac)·… = p(s1 s2) · ∏i=3..N p(si | si−2 si−1)
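To make the model orders concrete, here is a minimal Python sketch (not from the slides; the probability tables p0 and p1 are illustrative values only) that evaluates a DNA sequence under a 0th-order and a 1st-order chain:

# Minimal sketch: probability of a DNA sequence under 0th- and 1st-order
# Markov chains. The tables p0 and p1 below are illustrative values only.
p0 = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}          # p(s_i)
p1 = {s: {t: 0.25 for t in "acgt"} for s in "acgt"}        # p(s_i | s_{i-1})
p1["c"]["g"] = 0.05                                        # e.g. CpG depleted outside islands
p1["c"]["a"] = p1["c"]["c"] = p1["c"]["t"] = (1 - 0.05) / 3

def prob_order0(s):
    """P0(s) = prod_i p(s_i)"""
    prob = 1.0
    for base in s:
        prob *= p0[base]
    return prob

def prob_order1(s):
    """P1(s) = p(s_1) * prod_{i>=2} p(s_i | s_{i-1})"""
    prob = p0[s[0]]
    for prev, cur in zip(s, s[1:]):
        prob *= p1[prev][cur]
    return prob

print(prob_order0("ttacggt"), prob_order1("ttacggt"))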
Application of Markov chains:
CpG islands (cntd)
[Diagram: a four-state Markov chain over A, C, G, T, with a transition probability on each edge (e.g. aAT, aAC, aGC, aGT)]

Training set: a set of DNA sequences w/ known CpG islands
Derive two Markov chain models:
• '+' model: from the CpG islands
• '-' model: from the remainder of the sequence
Transition probabilities for each model:
• A state for each of the four letters A, C, G, and T in the DNA alphabet
• a+st: probability of residue t following residue s

a+st = c+st / Σt' c+st'

where c+st is the number of times letter t followed letter s in the CpG islands.
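A minimal counting sketch of this estimate (the two training sequences below are made up; a real '+' model would be trained on annotated CpG islands):

# Sketch of the count-based estimate a+_st = c+_st / sum_t' c+_st',
# using a couple of made-up "CpG island" training sequences.
from collections import defaultdict

def estimate_transitions(seqs):
    counts = defaultdict(lambda: defaultdict(int))   # counts[s][t] = c_st
    for seq in seqs:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    probs = {}
    for s in "ACGT":
        total = sum(counts[s][t] for t in "ACGT")
        probs[s] = {t: counts[s][t] / total if total else 0.25 for t in "ACGT"}
    return probs

plus_model = estimate_transitions(["CGCGCGTACG", "GCGCGCGCAT"])  # toy '+' training set
print(plus_model["C"]["G"])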
Transition probabilities of the '+' (CpG island) model (the '-' model is estimated in the same way from the non-island sequence):

       A     C     G     T
 A   .180  .274  .426  .120
 C   .171  .368  .274  .188
 G   .161  .339  .375  .125
 T   .079  .355  .384  .182

Log-odds score of a sequence x:
S(x) = log [ P(x | model +) / P(x | model −) ] = Σi=1..L log ( a+xi−1,xi / a−xi−1,xi )
Application of Markov chains:
CpG islands (cntd)
Per-transition log-odds ratios: log2( P(t | s, +) / P(t | s, −) ) for every pair of residues s, t.

Over a whole sequence x:
log2 [ P(x | +) / P(x | −) ] = Σi=1..L−1 log2 [ P(xi+1 | xi, +) / P(xi+1 | xi, −) ]
[Figure: distribution of log-odds scores for CpG-island sequences vs. other sequences]
Q1: Given a short sequence x, does it come from a CpG island? (yes/no question)
• Evaluate S(x)
Q2: Given a long sequence x, how do we find the CpG islands in it? (where question)
• Calculate the log-odds score for a window of, say, 100 nucleotides around every nucleotide, plot it, and predict CpG islands as the windows with positive score (a sketch of both follows below)
• Drawback: how do we choose the window size?
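A sketch of both computations (the '+' table is the one shown above; the '-' table here is a uniform placeholder, since the full '-' values are not in this excerpt; the window size w is the 100 nt suggested above):

# Sketch of the log-odds score S(x) and the sliding-window scan described above.
from math import log2

bases = "ACGT"
a_plus = {"A": [.180, .274, .426, .120],
          "C": [.171, .368, .274, .188],
          "G": [.161, .339, .375, .125],
          "T": [.079, .355, .384, .182]}
a_minus = {s: [0.25, 0.25, 0.25, 0.25] for s in bases}   # placeholder '-' values

def score(x):
    """S(x) = sum_i log2( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} )"""
    return sum(log2(a_plus[s][bases.index(t)] / a_minus[s][bases.index(t)])
               for s, t in zip(x, x[1:]))

def window_scores(x, w=100):
    """Q2: log-odds score of a w-nucleotide window around every position."""
    return [score(x[max(0, i - w // 2): i + w // 2]) for i in range(len(x))]

print(score("CGCGCGCG"))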
HMM: A parse of a sequence
Given a sequence x = x1…xL and an HMM with K states,
a parse of x is a sequence of states π = π1, …, πL

[Trellis diagram: the K states (rows 1…K) at each of the L positions, with the observations x1, x2, x3, …, xL emitted below]
The Fair/Loaded die HMM:
• States: Fair (F) and Loaded (L)
• Transitions: aFL = 0.05, aLF = 0.1 (hence aFF = 0.95, aLL = 0.9)
• Emissions: eF(b) = 1/6 (∀ b ∈ Ω); eL(6) = 1/2, eL(b) = 1/10 (if b ≠ 6)
E.g.: Given the following sequence, is it more likely that it comes from a Loaded or a Fair die?
123412316261636461623411221341
E.g.: Given the following sequence, is it more likely that the 3rd observed 6 comes from a Loaded or a Fair die?
123412316261636461623411221341
(The model is written out as data structures in the sketch below.)
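Written out in Python (the uniform start probabilities are an assumption, not given on the slide), the model looks like this; the decoding algorithms that follow answer the two questions:

# The Fair/Loaded die HMM from the slide, written out as plain data structures.
# States: F (fair), L (loaded). a_FL = 0.05, a_LF = 0.1 as given above.
states = ["F", "L"]
trans  = {"F": {"F": 0.95, "L": 0.05},
          "L": {"F": 0.10, "L": 0.90}}
emit   = {"F": {b: 1/6 for b in "123456"},
          "L": {**{b: 1/10 for b in "12345"}, "6": 1/2}}
start  = {"F": 0.5, "L": 0.5}          # assumed uniform start (not given on the slide)

rolls = "123412316261636461623411221341"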
The Forward Algorithm – derivation
(cntd)
• Then we need to write fk(i) as a function of the forward values at the previous position, fl(i−1):
fk(i) = P(x1,…,xi, πi = k)
      = Σl [ Σπ1,…,πi−2 P(x1,…,xi−1, π1,…,πi−2, πi−1 = l) · al,k ] · ek(xi)
      = Σl P(x1,…,xi−1, πi−1 = l) · al,k · ek(xi)
      = ek(xi) · Σl fl(i−1) · al,k

Chain rule: P(A,B,C) = P(C|A,B) P(B|A) P(A)
The Forward Algorithm
We can compute fk(i) for all k, i, using dynamic programming
Initialization: f0(0) = 1; fk(0) = 0 for all k > 0
Iteration: fk(i) = ek(xi) · Σl fl(i−1) · al,k
Termination: P(x) = Σk fk(N) · ak,0
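A minimal forward implementation for the Fair/Loaded model (uniform start assumed; the end-state term ak,0 is dropped, so P(x) is just the sum of the final forward values):

# Minimal forward algorithm, shown on the Fair/Loaded die HMM.
states = ["F", "L"]
start  = {"F": 0.5, "L": 0.5}
trans  = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
emit   = {"F": {b: 1/6 for b in "123456"},
          "L": {**{b: 0.1 for b in "12345"}, "6": 0.5}}

def forward(x):
    f = [{k: start[k] * emit[k][x[0]] for k in states}]               # f_k(1)
    for sym in x[1:]:
        prev = f[-1]
        f.append({k: emit[k][sym] * sum(prev[l] * trans[l][k] for l in states)
                  for k in states})                                    # f_k(i) = e_k(x_i) Σ_l f_l(i-1) a_l,k
    return f, sum(f[-1].values())                                      # P(x), no explicit end state

fwd, px = forward("123412316261636461623411221341")
print(px)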
The Backward Algorithm
• The forward algorithm gives the probability of being in state k at position i using only the preceding observations x1,…,xi; the backward algorithm accounts for the observations that come after position i.
123412316261636461623411221341
bk(i) = P(xi+1,…,xN | πi = k)
      = Σl Σπi+2,…,πN el(xi+1) · ak,l · P(xi+2,…,xN, πi+2,…,πN | πi+1 = l)
      = Σl el(xi+1) · ak,l · bl(i+1)

Chain rule: P(A,B,C) = P(C|A,B) P(B|A) P(A)
The Backward Algorithm
We can compute bk(i) for all k, i, using dynamic programming
Initialization: bk(N) = ak,0 for all k
Iteration: bk(i) = Σl el(xi+1) · ak,l · bl(i+1)
Termination: P(x) = Σk a0,k · ek(x1) · bk(1)
• P(πi = k | x) = fk(i) · bk(i) / P(x)
• Posterior decoding builds a path that explains the data: for each emitted symbol xi it picks the most likely state to have produced it, based on the forward and backward probabilities (a sketch follows below).
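A sketch of the backward pass and posterior decoding for the same model (again with a uniform start and no end state; both are simplifying assumptions, not part of the slides):

# Backward algorithm and posterior decoding P(pi_i = k | x) = f_k(i) b_k(i) / P(x).
states = ["F", "L"]
start  = {"F": 0.5, "L": 0.5}
trans  = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
emit   = {"F": {b: 1/6 for b in "123456"},
          "L": {**{b: 0.1 for b in "12345"}, "6": 0.5}}

def forward(x):
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for sym in x[1:]:
        f.append({k: emit[k][sym] * sum(f[-1][l] * trans[l][k] for l in states)
                  for k in states})
    return f

def backward(x):
    b = [{k: 1.0 for k in states}]                                    # b_k(N) = 1 (no end state)
    for sym in reversed(x[1:]):
        b.insert(0, {k: sum(trans[k][l] * emit[l][sym] * b[0][l] for l in states)
                     for k in states})                                # b_k(i) = Σ_l a_k,l e_l(x_i+1) b_l(i+1)
    return b

def posterior(x):
    f, b = forward(x), backward(x)
    px = sum(f[-1].values())
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]

post = posterior("123412316261636461623411221341")
most_likely = [max(p, key=p.get) for p in post]    # most likely state at each position
print("".join(most_likely))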
The Viterbi Algorithm – derivation
• We define:
Vj(i) = probability of the most likely path for the prefix x1,…,xi that ends in state j, i.e. Vj(i) = maxπ1,…,πi−1 P(x1,…,xi, π1,…,πi−1, πi = j)
• The same argument as for the forward algorithm gives the recurrence Vk(i) = ek(xi) · maxl { Vl(i−1) · al,k }.
Termination:
• Viterbi: P(x, π*) = maxk Vk(N)
• Forward: P(x) = Σk fk(N) ak,0
• Backward: P(x) = Σk a0,k ek(x1) bk(1)
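A matching Viterbi sketch for the Fair/Loaded model (same simplifying assumptions: uniform start, no end state); it returns the most probable path π* and its probability:

# Minimal Viterbi decoding for the Fair/Loaded die HMM.
states = ["F", "L"]
start  = {"F": 0.5, "L": 0.5}
trans  = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
emit   = {"F": {b: 1/6 for b in "123456"},
          "L": {**{b: 0.1 for b in "12345"}, "6": 0.5}}

def viterbi(x):
    V = [{k: start[k] * emit[k][x[0]] for k in states}]              # V_k(1)
    ptr = []
    for sym in x[1:]:
        row, back = {}, {}
        for k in states:
            best_l = max(states, key=lambda l: V[-1][l] * trans[l][k])
            back[k] = best_l
            row[k] = emit[k][sym] * V[-1][best_l] * trans[best_l][k]  # V_k(i) = e_k(x_i) max_l V_l(i-1) a_l,k
        V.append(row)
        ptr.append(back)
    last = max(states, key=lambda k: V[-1][k])                        # termination: max_k V_k(N)
    path = [last]
    for back in reversed(ptr):
        path.insert(0, back[path[0]])
    return "".join(path), V[-1][last]

path, p_best = viterbi("123412316261636461623411221341")
print(path)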
3. Learning
GIVEN: an HMM M with unknown probability parameters, and a sequence x
FIND: parameters θ (initial-state, emission ek(b), and transition ak,l probabilities) that maximize P(x | θ, M)
ALGORITHMS: maximum likelihood (ML), Baum-Welch (EM)
Two scenarios:
• Supervised learning — labeled data: the state path of the training sequences is known (e.g., sequences with annotated CpG islands)
• Unsupervised learning — unlabeled data: e.g., a newly sequenced genome; we don't know how frequent the CpG islands are there, nor do we know their composition
ML estimates from the labeled data:

aMLk,l = Ak,l / Σi Ak,i        eMLk(b) = Ek(b) / Σc Ek(c)

where Ak,l is the number of k→l transitions and Ek(b) the number of times symbol b is emitted from state k in the training data.

• Problem: overfitting (when the training set is small for the model)
• Then (ML estimates from the observed counts):
aFF = 10/10 = 1.00; aFL = 0/10 = 0
eF(1) = eF(3) = 2/10 = 0.2; eF(2) = 3/10 = 0.3; eF(4) = 0/10 = 0; eF(5) = eF(6) = 1/10 = 0.1
• Then (adding a pseudocount of 1 to every count):
aFF = 11/12 ≈ 0.92; aFL = 1/12 ≈ 0.08
eF(1) = eF(3) = 3/16 = 0.1875; eF(2) = 4/16 = 0.25; eF(4) = 1/16 = 0.0625; eF(5) = eF(6) = 2/16 = 0.125
(A counting sketch follows below.)
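The sketch below illustrates this supervised estimation (the labeled rolls/path pair is made up; pseudo=1 is the pseudocount used in the second set of estimates above):

# Sketch of supervised ML estimation with pseudocounts.
from collections import Counter

rolls = "13234651423"          # hypothetical observed rolls
path  = "FFFFFFFFFFF"          # hypothetical known state path (all Fair here)

def estimate(rolls, path, pseudo=1):
    A = Counter()                                   # A_kl: transition counts
    E = Counter()                                   # E_k(b): emission counts
    for k, l in zip(path, path[1:]):
        A[(k, l)] += 1
    for k, b in zip(path, rolls):
        E[(k, b)] += 1
    a = {(k, l): (A[(k, l)] + pseudo) / sum(A[(k, m)] + pseudo for m in "FL")
         for k in "FL" for l in "FL"}
    e = {(k, b): (E[(k, b)] + pseudo) / sum(E[(k, c)] + pseudo for c in "123456")
         for k in "FL" for b in "123456"}
    return a, e

a, e = estimate(rolls, path)
print(a[("F", "F")], e[("F", "1")])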
3. Repeat steps #1 and #2 with the new parameters ak,l and ek(b), until convergence.

The Baum-Welch algorithm
• Initialization: set A and E to pseudocounts (or priors)
• Iteration (for each training sequence):
  1.–2. E-step: compute the forward and backward values and accumulate the expected counts Ak,l and Ek(b)
  3. M-step: estimate new model parameters ak,l and ek(b) using ML across all training sequences
  4. Estimate the new model's (log)likelihood to assess convergence
The Baum-Welch algorithm (cntd)
• Initialization: pick arbitrary model parameters; set A and E to pseudocounts (or priors)
• Baum-Welch:
  - guarantees convergence (to a local maximum of the likelihood)
  - is a special case of EM
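A compact one-sequence Baum-Welch sketch for a two-state model (uniform start and no end state are simplifying assumptions; pseudocounts keep the expected counts away from zero, as suggested above):

# One Baum-Welch iteration: E-step accumulates expected counts A and E from
# the forward/backward values, M-step re-normalizes them.
states, alphabet = ["F", "L"], "123456"

def forward(x, start, trans, emit):
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for sym in x[1:]:
        f.append({k: emit[k][sym] * sum(f[-1][l] * trans[l][k] for l in states)
                  for k in states})
    return f

def backward(x, trans, emit):
    b = [{k: 1.0 for k in states}]
    for sym in reversed(x[1:]):
        b.insert(0, {k: sum(trans[k][l] * emit[l][sym] * b[0][l] for l in states)
                     for k in states})
    return b

def baum_welch_step(x, start, trans, emit, pseudo=1e-3):
    f, b = forward(x, start, trans, emit), backward(x, trans, emit)
    px = sum(f[-1].values())
    A = {k: {l: pseudo for l in states} for k in states}       # expected transition counts
    E = {k: {s: pseudo for s in alphabet} for k in states}     # expected emission counts
    for i, sym in enumerate(x):
        for k in states:
            E[k][sym] += f[i][k] * b[i][k] / px                # gamma_i(k)
    for i in range(len(x) - 1):
        for k in states:
            for l in states:
                A[k][l] += f[i][k] * trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] / px
    new_trans = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_emit  = {k: {s: E[k][s] / sum(E[k].values()) for s in alphabet} for k in states}
    return new_trans, new_emit, px

# Usage: start from arbitrary parameters and iterate until P(x) stops improving.
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit  = {"F": {s: 1/6 for s in alphabet},
         "L": {**{s: 0.13 for s in "12345"}, "6": 0.35}}
x = "123412316261636461623411221341"
for _ in range(10):
    trans, emit, px = baum_welch_step(x, start, trans, emit)
print(px)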
• Example guess: if the initial I-I came from S1-S2, then the probability is:
0.3 × 0.2 × 0.5 × 0.9 = 0.027
[Table residue from the example above: states S1, S2, IN, OUT at day k and day k+1]
A model of CpG islands: eight states A+, C+, G+, T+, A-, C-, G-, T-; each state emits only its own letter (emission probability 1 for that letter, 0 for the rest).

If the path is known, e.g. C+, G-, C-, G+ for the sequence CGCG:
P(CGCG) = a0,C+ × 1 × aC+,G− × 1 × aG−,C− × 1 × aC−,G+ × 1 × aG+,0

In general, we DO NOT know the path. How do we estimate the path?

Note: each set ('+' or '-') has an additional set of transitions as in the previous Markov chain.
What we have..

Transition probabilities within the '+' set (the '-' set has its own table; the full matrix covers all eight states A+ … T-):

        A+    C+    G+    T+
 A+   .180  .274  .426  .120
 C+   .171  .368  .274  .188
 G+   .161  .339  .375  .125
 T+   .079  .355  .384  .182

Note: these transitions out of each state add up to one — no room for transitions between (+) and (-) states.
A model of CpG Islands –
Transitions
• What about transitions between (+) and (-) states?
• They affect:
  - the avg. length of a CpG island
  - the avg. separation between two CpG islands

[Diagram: two states '+' and '-', with self-transition probabilities p++ and p--, and switching probabilities 1−p++ and 1−p--]

Length distribution of region '+':
P(L=1) = P(+−) = 1 − p++
P(L=2) = P(++−) = p++ (1 − p++)
…
P(L= l) = p++^(l−1) (1 − p++)

This is a geometric distribution with mean 1 / (1 − p++): the expected length of a run that continues in that state.
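A quick numerical check of that claim (p = 0.98 is an illustrative value, not from the slides):

# Geometric length distribution: P(L = l) = p**(l-1) * (1-p) has mean 1/(1-p).
p = 0.98                                   # illustrative within-island self-transition
mean = sum(l * p**(l - 1) * (1 - p) for l in range(1, 100000))
print(mean, 1 / (1 - p))                   # both ≈ 50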
What we have..

Completed transition matrix over all eight states A+ … T- (e.g., row A+: .180 .274 .426 .120 within the '+' block; row C-: .233 .298 .078 .302 within the '-' block). The within-'+' transitions are weighted by λ+, and transitions from each '+' state into the '-' states get probability (1−λ+) · freq(bi); likewise, within-'-' transitions are weighted by λ-, and transitions from '-' into the '+' states get (1−λ-) · freq(bi).
Another application: Profile HMMs
[Profile HMM diagram: Begin → M1 → M2 → M3 → M4 → End]

We know it should look like this in the end:
LE--VK
LE--IR
LE--IK
LD--VE
LEKKVK
[Profile HMM diagram: Begin → M1 → M2 → M3 → M4 → End]
Introducing “delete” states to the
previous HMM
We want to know whether (for instance) the sequence
LEK is a good match to the HMM
We know it should look like this in the end:
LEVK
LEIR
LEIK
LDVE
LE-K

[Profile HMM diagram: Begin → M1 → M2 → M3 → M4 → End]
Three main applications for
profile HMMs
1. Find sequence homologs
• i.e., we represent a sequence family by an HMM and use it to identify ("evaluate") other related sequences
[Diagram: Convert the family LEVK, LEIR, LEIK, LDVE into a profile HMM, then Search a sequence database (KKKKKK, IKNGTTT, LEAK, ……, GGIAAEEIK, IIGGGAVVS) for other members]

[Diagram: Convert the family LEVK, LEIR, LEIK, LDVE into a profile HMM, then Align sequences to it, yielding the multiple alignment LEVK, LEIR, LEIK, LDVE, LE-K]

[Diagram: Align the sequences LEVK, LEK, LEIR, LEIK, LDVE to the profile HMM, yielding the multiple alignment LEVK, LEIR, LEIK, LDVE, LE-K]
Acknowledgements
Some of the slides used in this lecture are adapted or modified from lectures by:
• Serafim Batzoglou, Stanford University
• Bino John, Dow Agrosciences