HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
Hidden Markov Model & Viterbi
Lecturer: Dr. Bùi Thanh Hùng
Data Science Department
Faculty of Information Technology
Industrial University of Ho Chi Minh City
Email: [email protected]
Website: https://sites.google.com/site/hungthanhbui1980/
Hidden Markov Model
Markov Properties of State Sequences
HMM Formalism
(Trellis diagram: a chain of hidden states S, each emitting an observation K; arcs A are state transitions, arcs B are emissions.)
An HMM is specified by a tuple $\{S, K, \Pi, A, B\}$:
• $S : \{s_1 \ldots s_N\}$ are the values for the hidden states
• $K : \{k_1 \ldots k_M\}$ are the values for the observations
• $\Pi = \{\pi_i\}$ are the initial state probabilities
• $A = \{a_{ij}\}$ are the state transition probabilities
• $B = \{b_{ik}\}$ are the observation (emission) probabilities
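As a concrete illustration, the tuple $\{S, K, \Pi, A, B\}$ can be stored directly in Python containers. A minimal sketch with a hypothetical two-state, two-symbol model (the names and numbers below are invented purely for illustration):

# Hypothetical two-state, two-symbol HMM, stored as plain Python containers.
S = ('s1', 's2')                       # hidden state values
K = ('k1', 'k2')                       # observation symbols
Pi = {'s1': 0.6, 's2': 0.4}            # initial state probabilities, pi_i
A = {'s1': {'s1': 0.7, 's2': 0.3},     # transition probabilities, a_ij
     's2': {'s1': 0.4, 's2': 0.6}}
B = {'s1': {'k1': 0.5, 'k2': 0.5},     # emission probabilities, b_ik
     's2': {'k1': 0.1, 'k2': 0.9}}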
Inference in an HMM
• Compute the probability of a given observation
sequence
• Given an observation sequence, compute the most
likely hidden state sequence
• Given an observation sequence and set of possible
models, which model most closely fits the data?
Decoding
Given an observation sequence $O = (o_1 \ldots o_T)$ and a model $\lambda = (A, B, \Pi)$, compute the probability of the observation sequence, $P(O \mid \lambda)$, where
• $\Pi = \{\pi_i\}$ are the initial state probabilities
• $A = \{a_{ij}\}$ are the state transition probabilities
• $B = \{b_{ik}\}$ are the observation (emission) probabilities
Decoding
$P(O \mid X, \lambda) = b_{x_1 o_1} b_{x_2 o_2} \cdots b_{x_T o_T}$
$P(X \mid \lambda) = \pi_{x_1} a_{x_1 x_2} a_{x_2 x_3} \cdots a_{x_{T-1} x_T}$
$P(O, X \mid \lambda) = P(O \mid X, \lambda)\, P(X \mid \lambda)$
$P(O \mid \lambda) = \sum_X P(O \mid X, \lambda)\, P(X \mid \lambda)$
Decoding
$P(O \mid \lambda) = \sum_{\{x_1 \ldots x_T\}} \pi_{x_1} b_{x_1 o_1} \prod_{t=1}^{T-1} a_{x_t x_{t+1}} b_{x_{t+1} o_{t+1}}$
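Evaluating this sum directly enumerates all $N^T$ state sequences. A brute-force sketch (reusing the toy S, Pi, A, B containers from the sketch above) shows exactly what the forward procedure below avoids:

from itertools import product

def brute_force_likelihood(obs):
    # P(O | lambda): sum P(O, X | lambda) over every possible state sequence X.
    total = 0.0
    for X in product(S, repeat=len(obs)):               # N**T candidate sequences
        p = Pi[X[0]] * B[X[0]][obs[0]]                   # pi_{x1} * b_{x1, o1}
        for t in range(1, len(obs)):
            p *= A[X[t - 1]][X[t]] * B[X[t]][obs[t]]     # a_{x_{t-1} x_t} * b_{x_t, o_t}
        total += p
    return total

print(brute_force_likelihood(('k1', 'k2', 'k2')))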
Forward Procedure
• Special structure gives us an efficient solution using dynamic programming.
• Intuition: the probability of the first t observations is the same for all possible t+1 length state sequences.
• Define: $\alpha_i(t) = P(o_1 \ldots o_t, x_t = i \mid \lambda)$
Forward Procedure
$\alpha_j(t+1)$
$= P(o_1 \ldots o_{t+1}, x_{t+1} = j)$
$= P(o_1 \ldots o_{t+1} \mid x_{t+1} = j)\, P(x_{t+1} = j)$
$= P(o_1 \ldots o_t \mid x_{t+1} = j)\, P(o_{t+1} \mid x_{t+1} = j)\, P(x_{t+1} = j)$
$= P(o_1 \ldots o_t, x_{t+1} = j)\, P(o_{t+1} \mid x_{t+1} = j)$
Forward Procedure
$= \sum_{i=1 \ldots N} P(o_1 \ldots o_t, x_t = i, x_{t+1} = j)\, P(o_{t+1} \mid x_{t+1} = j)$
$= \sum_{i=1 \ldots N} P(o_1 \ldots o_t, x_{t+1} = j \mid x_t = i)\, P(x_t = i)\, P(o_{t+1} \mid x_{t+1} = j)$
$= \sum_{i=1 \ldots N} P(o_1 \ldots o_t, x_t = i)\, P(x_{t+1} = j \mid x_t = i)\, P(o_{t+1} \mid x_{t+1} = j)$
$= \sum_{i=1 \ldots N} \alpha_i(t)\, a_{ij}\, b_{j o_{t+1}}$
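The recursion translates directly into code. A minimal sketch of the forward procedure, assuming the model is stored in dictionaries start_p, trans_p, emit_p keyed by state name (the same layout as the Viterbi example later in this deck); an illustrative sketch, not the lecture's own implementation:

def forward(obs, states, start_p, trans_p, emit_p):
    # alpha[t][j] corresponds to alpha_j(t+1) in the slides (0-based list index).
    alpha = [{j: start_p[j] * emit_p[j][obs[0]] for j in states}]        # alpha_j(1) = pi_j * b_{j, o1}
    for t in range(1, len(obs)):
        alpha.append({
            j: sum(alpha[t - 1][i] * trans_p[i][j] for i in states)      # sum_i alpha_i(t) * a_ij
               * emit_p[j][obs[t]]                                       # times b_{j, o_{t+1}}
            for j in states
        })
    return alpha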
Backward Procedure
$\beta_i(T+1) = 1$
$\beta_i(t) = P(o_t \ldots o_T \mid x_t = i)$   (the probability of the rest of the observations, given the state at time t)
$\beta_i(t) = \sum_{j=1 \ldots N} a_{ij}\, b_{i o_t}\, \beta_j(t+1)$
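A matching sketch of the backward recursion with the same dictionary layout; note that under the convention above, $\beta_i(t)$ already includes the emission of $o_t$:

def backward(obs, states, trans_p, emit_p):
    # beta[t][i] corresponds to beta_i(t+1) in the slides (0-based list index).
    T = len(obs)
    beta = [None] * T
    nxt = {j: 1.0 for j in states}                                        # beta_j(T+1) = 1
    for t in range(T - 1, -1, -1):
        beta[t] = {
            i: emit_p[i][obs[t]]                                          # b_{i, o_t}
               * sum(trans_p[i][j] * nxt[j] for j in states)              # sum_j a_ij * beta_j(t+1)
            for i in states
        }
        nxt = beta[t]
    return beta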
Decoding Solution
$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(T)$   (Forward Procedure)
$P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)$   (Backward Procedure)
$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t)$   (Combination)
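A usage example tying the two procedures together, assuming the forward and backward sketches above are in scope and using the Healthy/Fever model that appears later in this deck; both expressions should give the same likelihood:

obs = ('normal', 'cold', 'dizzy')
states = ('Healthy', 'Fever')
start_p = {'Healthy': 0.6, 'Fever': 0.4}
trans_p = {'Healthy': {'Healthy': 0.7, 'Fever': 0.3},
           'Fever':   {'Healthy': 0.4, 'Fever': 0.6}}
emit_p = {'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
          'Fever':   {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6}}

alpha = forward(obs, states, start_p, trans_p, emit_p)
beta = backward(obs, states, trans_p, emit_p)
p_fwd = sum(alpha[-1].values())                           # sum_i alpha_i(T)
p_bwd = sum(start_p[i] * beta[0][i] for i in states)      # sum_i pi_i * beta_i(1)
print(p_fwd, p_bwd)                                       # both approximately 0.03628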
Best State Sequence
• Find the state sequence that best explains the observations
• Viterbi algorithm: $\arg\max_X P(X \mid O)$
Viterbi
The Viterbi algorithm is named after Andrew Viterbi,
who proposed it in 1967 as a decoding algorithm
for convolutional codes over noisy digital
communication links. It has, however, a history
of multiple invention, with at least seven
independent discoveries, including those by
Viterbi, Needleman and Wunsch, and Wagner and
Fischer.
Viterbi
The Viterbi algorithm is a dynamic
programming algorithm for finding the
most likely sequence of hidden states—called
the Viterbi path—that results in a sequence of
observed events, especially in the context
of Markov information sources and hidden Markov
models (HMM).
Viterbi
The algorithm has found universal application in decoding
the convolutional codes used in both CDMA and GSM digital
cellular, dial-up modems, satellite, deep-space communications,
and 802.11 wireless LANs. It is now also commonly used in speech
recognition, speech synthesis, diarization, keyword
spotting, computational linguistics, and bioinformatics.
For example, in speech-to-text (speech recognition), the acoustic
signal is treated as the observed sequence of events, and a string of
text is considered to be the "hidden cause" of the acoustic signal. The
Viterbi algorithm finds the most likely string of text given the acoustic
signal.
Viterbi Algorithm
$\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1}, o_1 \ldots o_{t-1}, x_t = j, o_t)$
The state sequence which maximizes the probability of seeing the observations to time t−1, landing in state j, and seeing the observation at time t.
Viterbi Algorithm
Recursive computation:
$\delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$
$\psi_j(t+1) = \arg\max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$
Viterbi Algorithm
Compute the most likely state sequence by working backwards:
$\hat{X}_T = \arg\max_i \delta_i(T)$
$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$
$P(\hat{X}) = \max_i \delta_i(T)$
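A sketch of a single step of this recursion, using the dictionary layout from the example later in the deck; delta maps each state to $\delta_i(t)$, and the function returns $\delta_j(t+1)$ together with the backpointer $\psi_j(t+1)$ (illustrative only):

def viterbi_step(delta, j, obs_next, states, trans_p, emit_p):
    # delta_j(t+1) = max_i delta_i(t) * a_ij * b_{j, o_{t+1}}; psi_j(t+1) is the maximizing i.
    best_i = max(states, key=lambda i: delta[i] * trans_p[i][j])
    best_delta = delta[best_i] * trans_p[best_i][j] * emit_p[j][obs_next]
    return best_delta, best_i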
Parameter Estimation
• Given an observation sequence, find the model that is most likely to produce that sequence.
• No analytic method.
• Given a model and observation sequence, update the model parameters to better fit the observations.
Parameter Estimation
Probability of traversing an arc from state i to state j at time t:
$p_t(i, j) = \dfrac{\alpha_i(t)\, a_{ij}\, b_{j o_{t+1}}\, \beta_j(t+1)}{\sum_{m=1 \ldots N} \alpha_m(t)\, \beta_m(t)}$
Probability of being in state i at time t:
$\gamma_i(t) = \sum_{j=1 \ldots N} p_t(i, j)$
Parameter Estimation
Now we can compute the new estimates of the model parameters:
$\hat{\pi}_i = \gamma_i(1)$
$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \gamma_i(t)}$
$\hat{b}_{ik} = \dfrac{\sum_{t : o_t = k} \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}$
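A minimal sketch transcribing these updates into Python, assuming the arc probabilities p[t][i][j] and state probabilities gamma[t][i] have already been computed for the observation sequence obs (dictionary layout as elsewhere in this deck); here the transition sums run over t = 1 … T−1, the steps for which an arc quantity is defined. This is one re-estimation step only, not a full Baum-Welch implementation:

def reestimate(p, gamma, obs, states, symbols):
    # Returns (pi_hat, a_hat, b_hat) computed from p_t(i, j) and gamma_i(t).
    T = len(obs)
    pi_hat = {i: gamma[0][i] for i in states}                          # pi_hat_i = gamma_i(1)
    a_hat = {i: {j: sum(p[t][i][j] for t in range(T - 1)) /
                    sum(gamma[t][i] for t in range(T - 1))
                 for j in states} for i in states}
    b_hat = {i: {k: sum(gamma[t][i] for t in range(T) if obs[t] == k) /
                    sum(gamma[t][i] for t in range(T))
                 for k in symbols} for i in states}
    return pi_hat, a_hat, b_hat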
Viterbi
obs = ('normal', 'cold', 'dizzy')                      # observation sequence
states = ('Healthy', 'Fever')                          # hidden states
start_p = {'Healthy': 0.6, 'Fever': 0.4}               # initial state probabilities
trans_p = {                                            # state transition probabilities
    'Healthy': {'Healthy': 0.7, 'Fever': 0.3},
    'Fever':   {'Healthy': 0.4, 'Fever': 0.6}
}
emit_p = {                                             # emission (observation) probabilities
    'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
    'Fever':   {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6}
}
Viterbi
Observations:
day 1: normal
day 2: cold
day 3: dizzy
Which sequence of states has the highest probability? Describe in detail.
(Trellis diagram: from Start, the hidden state on each day is either Healthy or Fever; day 1 observes "normal", day 2 "cold", day 3 "dizzy".)
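Worked out from the parameters above, the Viterbi recursion gives:
Day 1 (normal): $\delta_{\text{Healthy}}(1) = 0.6 \times 0.5 = 0.3$; $\delta_{\text{Fever}}(1) = 0.4 \times 0.1 = 0.04$
Day 2 (cold): $\delta_{\text{Healthy}}(2) = \max(0.3 \times 0.7,\ 0.04 \times 0.4) \times 0.4 = 0.084$ (best predecessor: Healthy); $\delta_{\text{Fever}}(2) = \max(0.3 \times 0.3,\ 0.04 \times 0.6) \times 0.3 = 0.027$ (best predecessor: Healthy)
Day 3 (dizzy): $\delta_{\text{Healthy}}(3) = \max(0.084 \times 0.7,\ 0.027 \times 0.4) \times 0.1 = 0.00588$ (best predecessor: Healthy); $\delta_{\text{Fever}}(3) = \max(0.084 \times 0.3,\ 0.027 \times 0.6) \times 0.6 = 0.01512$ (best predecessor: Healthy)
The largest final value is $\delta_{\text{Fever}}(3) = 0.01512$, so backtracking gives the most likely state sequence Healthy → Healthy → Fever.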
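Putting it together, a minimal sketch of a Viterbi decoder for this model, written against the dictionary layout above (an illustrative sketch, not the lecture's reference implementation):

def viterbi(obs, states, start_p, trans_p, emit_p):
    # Returns (probability, path) of the most likely hidden state sequence.
    # delta[t][s]: highest probability of any state sequence ending in s at time t
    # psi[t][s]:   predecessor state on that best sequence (backpointer)
    delta = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    psi = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for s in states:
            best_prev = max(states, key=lambda r: delta[t - 1][r] * trans_p[r][s])
            delta[t][s] = delta[t - 1][best_prev] * trans_p[best_prev][s] * emit_p[s][obs[t]]
            psi[t][s] = best_prev
    last = max(states, key=lambda s: delta[-1][s])        # termination: argmax_i delta_i(T)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):                  # backtracking through psi
        path.insert(0, psi[t][path[0]])
    return delta[-1][last], path

obs = ('normal', 'cold', 'dizzy')
states = ('Healthy', 'Fever')
start_p = {'Healthy': 0.6, 'Fever': 0.4}
trans_p = {'Healthy': {'Healthy': 0.7, 'Fever': 0.3},
           'Fever':   {'Healthy': 0.4, 'Fever': 0.6}}
emit_p = {'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
          'Fever':   {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6}}

print(viterbi(obs, states, start_p, trans_p, emit_p))
# Expected: probability approximately 0.01512 with path ['Healthy', 'Healthy', 'Fever']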
Hidden Markov Model Toolkit (HTK)
http://htk.eng.cam.ac.uk/