Hidden Markov Models: Dr. Nguyễn Văn Vinh, Department of Computer Science, University of Engineering and Technology, Vietnam National University, Hanoi
What We Talk About When We Talk About "Learning"
Learning general models from data of particular examples
Data is cheap and abundant (data warehouses, data marts);
knowledge is expensive and scarce.
Example in retail: Customer transactions to consumer behavior:
People who bought “Da Vinci Code” also bought “The Five
People You Meet in Heaven” (www.amazon.com)
Build a model that is a good and useful approximation to the
data.
What is Machine Learning?
Optimize a performance criterion using example data or
past experience.
Role of Statistics: Inference from a sample
Role of Computer Science: Efficient algorithms to solve the optimization problem
Applications
Association
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Part-of-Speech Tagging
Input:
Profits soared at Boeing Co., easily topping forecasts on Wall
Street, as their CEO Alan Mulally announced first quarter results.
Output:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V
forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N
Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.
N = Noun
V = Verb
P = Preposition
ADV = Adverb
ADJ = Adjective
...
Face Recognition
Training examples of a person
Test images
Introduction
Modeling dependencies in input
Sequences:
Temporal: In speech; phonemes in a word
(dictionary), words in a sentence (syntax, semantics
of the language).
In handwriting, pen movements
Spatial: In a DNA sequence; base pairs
Discrete Markov Process
N states: S1, S2, ..., SN
State at "time" t: qt = Si
First-order Markov
P(qt+1=Sj | qt=Si, qt-1=Sk ,...) = P(qt+1=Sj | qt=Si)
Transition probabilities
aij ≡ P(qt+1=Sj | qt=Si) aij ≥ 0 and Σj=1N aij=1
Initial probabilities
πi ≡ P(q1=Si) and Σi=1N πi=1
Stochastic Automaton
P(O=Q | A,π) = P(q1) Πt=2T P(qt | qt-1) = πq1 aq1q2 ⋯ aqT-1qT
Example: Balls and Urns
Three urns each full of balls of one color
S1: red, S2: blue, S3: green
π = [0.5, 0.2, 0.3]T
A = | 0.4 0.3 0.3 |
    | 0.2 0.6 0.2 |
    | 0.1 0.1 0.8 |
O = {S1, S1, S3, S3}
P(O | A,π) = P(S1) P(S1|S1) P(S3|S1) P(S3|S3)
           = π1 · a11 · a13 · a33
           = 0.5 · 0.4 · 0.3 · 0.8 = 0.048
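As a quick check, the computation above can be reproduced in a few lines of Python (a sketch; states are represented as 0-based indices, so S1 = 0 and S3 = 2):

```python
# Probability of an observed state sequence in the fully observable
# Markov chain from the balls-and-urns example (numbers from the slide).
pi = [0.5, 0.2, 0.3]                  # initial probabilities pi_i
A = [[0.4, 0.3, 0.3],                 # transition matrix, A[i][j] = a_ij
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def sequence_prob(seq, pi, A):
    """P(O | A, pi) for a sequence of 0-based state indices."""
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

O = [0, 0, 2, 2]                      # S1, S1, S3, S3
print(sequence_prob(O, pi, A))        # 0.5 * 0.4 * 0.3 * 0.8 = 0.048
```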
Balls and Urns: Learning
Given K example sequences of length T
The maximum-likelihood estimate of each transition probability counts transitions:
âij = #{transitions from Si to Sj} / #{transitions from Si}
    = Σk Σt=1T-1 1(qtk = Si and qt+1k = Sj) / Σk Σt=1T-1 1(qtk = Si)
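This counting estimate can be sketched in Python; the two example sequences of 0-based state indices below are made up for illustration:

```python
def estimate_transitions(sequences, n_states):
    """MLE of a_ij: #(S_i -> S_j transitions) / #(transitions out of S_i),
    pooled over K example sequences of 0-based state indices."""
    counts = [[0] * n_states for _ in range(n_states)]
    for seq in sequences:
        for i, j in zip(seq, seq[1:]):
            counts[i][j] += 1
    A = []
    for i in range(n_states):
        total = sum(counts[i])
        A.append([c / total if total else 0.0 for c in counts[i]])
    return A

seqs = [[0, 0, 1, 2, 2], [0, 1, 1, 2]]  # made-up example sequences
A_hat = estimate_transitions(seqs, 3)
```

For instance, state 0 is left 3 times (once to 0, twice to 1), so the first row becomes [1/3, 2/3, 0].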
Hidden Markov Models
States are not observable
Discrete observations {v1,v2,...,vM} are recorded; a
probabilistic function of the state
Emission probabilities
bj(m) ≡ P(Ot=vm | qt=Sj)
Example: In each urn, there are balls of different
colors, but with different probabilities.
For each observation sequence, there are multiple
state sequences
Hidden Markov Model (HMM)
HMMs allow you to estimate probabilities
of unobserved events
Given plain text, which underlying
parameters generated the surface form?
E.g., in speech recognition, the observed
data is the acoustic signal and the words
are the hidden parameters
HMMs and their Usage
HMMs are very common in Computational
Linguistics:
Speech recognition (observed: acoustic signal,
hidden: words)
Handwriting recognition (observed: image, hidden:
words)
Part-of-speech tagging (observed: words, hidden:
part-of-speech tags)
Machine translation (observed: foreign words,
hidden: words in target language)
Noisy Channel Model
In speech recognition you observe an
acoustic signal (A=a1,…,an) and you want
to determine the most likely sequence of
words (W=w1,…,wn): P(W | A)
Problem: A and W are too specific for
reliable counts on observed data, and are
very unlikely to occur in unseen data
Noisy Channel Model
Assume that the acoustic signal (A) is already
segmented wrt word boundaries
P(W | A) could be computed with Bayes' rule:
P(W | A) = P(A | W) P(W) / P(A)
Since P(A) is constant for a given signal, Ŵ = argmaxW P(A | W) P(W)
Decoding
The decoder combines evidence from two sources:
The likelihood: P(A | W)
This can be approximated as:
P(A | W) ≈ Πi=1n P(ai | wi)
The prior: P(W)
Under the Markov assumption:
P(w1,…,wn) ≈ Πi=2n P(wi | wi-1)
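A toy sketch of noisy-channel scoring under these approximations; every word and probability below is made up for illustration (real systems use trained acoustic and language models):

```python
lik = {                         # P(a_i | w_i): made-up acoustic likelihoods
    ("a1", "recognize"): 0.6, ("a1", "wreck"): 0.3,
    ("a2", "speech"): 0.7, ("a2", "beach"): 0.2,
}
bigram = {                      # P(w_i | w_{i-1}): made-up language-model prior
    ("recognize", "speech"): 0.5, ("recognize", "beach"): 0.01,
    ("wreck", "speech"): 0.05, ("wreck", "beach"): 0.3,
}

def score(words, acoustics):
    """P(A | W) * P(W) with the independence/Markov approximations
    (the prior on the first word is omitted here for brevity)."""
    p = 1.0
    for a, w in zip(acoustics, words):
        p *= lik.get((a, w), 0.0)
    for prev, cur in zip(words, words[1:]):
        p *= bigram.get((prev, cur), 0.0)
    return p

A_obs = ["a1", "a2"]
candidates = [("recognize", "speech"), ("wreck", "beach")]
best = max(candidates, key=lambda W: score(W, A_obs))
```

With these numbers, ("recognize", "speech") scores 0.6 · 0.7 · 0.5 = 0.21 and wins.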
The Trellis
Parameters of an HMM
States: A set of states S=s1,…,sn
Transition probabilities: A= a1,1,a1,2,…,an,n Each
ai,j represents the probability of transitioning
from state si to sj.
Emission probabilities: A set B of functions of
the form bi(ot) which is the probability of
observation ot being emitted by si
Initial state distribution: πi is the probability that
si is a start state
The Three Basic HMM Problems
Problem 1 (Evaluation): Given the observation
sequence O=o1,…,oT and an HMM model
λ = (A, B, π), how do we compute the
probability of O given the model?
Problem 2 (Decoding): Given the observation
sequence O=o1,…,oT and an HMM model
λ = (A, B, π), how do we find the state
sequence that best explains the observations?
The Three Basic HMM Problems
Problem 3 (Learning): How do we adjust
the model parameters λ = (A, B, π) to
maximize P(O | λ)?
Problem 1: Probability of an Observation
Sequence
What is P(O | λ)?
The probability of an observation sequence is the sum
of the probabilities of all possible state sequences in
the HMM.
Naïve computation is very expensive. Given T
observations and N states, there are N^T possible state
sequences.
Even small HMMs, e.g. T=10 and N=10, contain 10
billion different paths
Solution to this and problem 2 is to use dynamic
programming
Forward Probabilities
What is the probability that, given an
HMM , at time t the state is i and the
partial observation o1 … ot has been
generated?
αt(i) = P(o1 … ot, qt = si | λ)
Forward Probabilities
αt(i) = P(o1 … ot, qt = si | λ)
αt(j) = [Σi=1N αt-1(i) aij] bj(ot)
Forward Algorithm
Initialization: α1(i) = πi bi(o1), 1 ≤ i ≤ N
Induction:
αt(j) = [Σi=1N αt-1(i) aij] bj(ot), 2 ≤ t ≤ T, 1 ≤ j ≤ N
Termination:
P(O | λ) = Σi=1N αT(i)
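The forward algorithm can be sketched directly from the recursion; the two-state model below is made up for illustration:

```python
def forward(obs, pi, A, B):
    """Forward algorithm: alpha[t][i] = P(o_1..o_t, q_t = s_i | lambda).
    obs is a list of 0-based observation-symbol indices."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]        # initialization
    for t in range(1, len(obs)):                              # induction
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return sum(alpha[-1])                                     # termination: P(O | lambda)

# made-up two-state model for illustration
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p = forward([0, 1, 0], pi, A, B)
```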
Backward Algorithm
Initialization: βT(i) = 1, 1 ≤ i ≤ N
Induction:
βt(i) = Σj=1N aij bj(ot+1) βt+1(j), t = T-1,…,1, 1 ≤ i ≤ N
Termination:
P(O | λ) = Σi=1N πi bi(o1) β1(i)
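The backward recursion can be sketched the same way (again with a made-up two-state model); its termination value equals the P(O | λ) the forward algorithm computes:

```python
def backward(obs, pi, A, B):
    """Backward algorithm: beta[t][i] = P(o_{t+1}..o_T | q_t = s_i, lambda)."""
    N, T = len(pi), len(obs)
    beta = [[1.0] * N]                                        # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                            # induction, t = T-1 .. 1
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * nxt[j] for j in range(N))
                        for i in range(N)])
    # termination: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(N))

# made-up two-state model for illustration
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p = backward([0, 1, 0], pi, A, B)
```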
Problem 2: Decoding
The solution to Problem 1 (Evaluation) gives us
the sum of all paths through an HMM efficiently.
For Problem 2, we want to find the path with the
highest probability.
We want to find the state sequence Q = q1…qT
such that
Q* = argmaxQ' P(Q' | O, λ)
Viterbi Algorithm
Similar to computing the forward
probabilities, but instead of summing over
transitions from incoming states, compute
the maximum
Forward:
αt(j) = [Σi=1N αt-1(i) aij] bj(ot)
Viterbi recursion:
δt(j) = [max1≤i≤N δt-1(i) aij] bj(ot)
Viterbi Algorithm
Initialization: δ1(i) = πi bi(o1), 1 ≤ i ≤ N
Induction:
δt(j) = [max1≤i≤N δt-1(i) aij] bj(ot)
ψt(j) = argmax1≤i≤N δt-1(i) aij, 2 ≤ t ≤ T, 1 ≤ j ≤ N
Termination:
p* = max1≤i≤N δT(i)
qT* = argmax1≤i≤N δT(i)
Read out path:
qt* = ψt+1(qt+1*), t = T-1,…,1
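A compact Python sketch of Viterbi decoding with backpointers (made-up two-state model; states and symbols are 0-based indices):

```python
def viterbi(obs, pi, A, B):
    """Viterbi algorithm: returns (best path probability, best state sequence)."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]          # initialization
    psi = []                                                  # backpointers psi_t(j)
    for t in range(1, len(obs)):                              # induction
        new_delta, back = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            back.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][obs[t]])
        delta, psi = new_delta, psi + [back]
    # termination and path read-out
    q = [max(range(N), key=lambda i: delta[i])]
    for back in reversed(psi):
        q.insert(0, back[q[0]])
    return max(delta), q

# made-up two-state model for illustration
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p_star, path = viterbi([0, 1, 0], pi, A, B)
```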
Problem 3: Learning
Up to now we’ve assumed that we know the
underlying model λ = (A, B, π)
Often these parameters are estimated on
annotated training data, which has two
drawbacks:
Annotation is difficult and/or expensive
Training data is different from the current data
We want to maximize the parameters with
respect to the current data, i.e., we’re looking
for a model λ', such that λ' = argmaxλ P(O | λ)
Problem 3: Learning
Unfortunately, there is no known way to
analytically find a global maximum, i.e., a model
λ', such that λ' = argmaxλ P(O | λ)
But it is possible to find a local maximum
Given an initial model λ, we can always find a
model λ', such that P(O | λ') ≥ P(O | λ)
Parameter Re-estimation
Use the forward-backward (or Baum-
Welch) algorithm, which is a hill-climbing
algorithm
Using an initial parameter instantiation,
the forward-backward algorithm iteratively
re-estimates the parameters, improving
the probability that the given observations
are generated by the new parameters
Parameter Re-estimation
Three parameters need to be re-estimated:
Initial state distribution: πi
Transition probabilities: ai,j
Emission probabilities: bi(k)
Re-estimating Transition Probabilities
What’s the probability of being in state si
at time t and going to state sj, given the
current model and parameters?
ξt(i, j) = P(qt = si, qt+1 = sj | O, λ)
Re-estimating Transition Probabilities
ξt(i, j) = P(qt = si, qt+1 = sj | O, λ)
         = αt(i) aij bj(ot+1) βt+1(j) / Σi=1N Σj=1N αt(i) aij bj(ot+1) βt+1(j)
Re-estimating Transition Probabilities
The intuition behind the re-estimation
equation for transition probabilities is
âi,j = expected number of transitions from state si to state sj
       / expected number of transitions from state si
Formally:
âi,j = Σt=1T-1 ξt(i, j) / Σt=1T-1 Σj'=1N ξt(i, j')
Re-estimating Transition Probabilities
Defining γt(i) = Σj=1N ξt(i, j), we can rewrite the estimate as
âi,j = Σt=1T-1 ξt(i, j) / Σt=1T-1 γt(i)
Review of Probabilities
Forward probability: αt(i)
The probability of being in state si, given the partial observation
o1,…,ot
Backward probability: βt(i)
The probability of being in state si, given the partial observation
ot+1,…,oT
Transition probability: ξt(i, j)
The probability of going from state si to state sj, given the
complete observation o1,…,oT
State probability: γt(i)
The probability of being in state si, given the complete
observation o1,…,oT
Re-estimating Initial State Probabilities
Initial state distribution: πi is the
probability that si is a start state
Re-estimation is easy:
π̂i = expected number of times in state si at time 1
Formally: π̂i = γ1(i)
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
b̂i(k) = expected number of times in state si observing symbol vk
        / expected number of times in state si
Formally:
b̂i(k) = Σt=1T δ(ot, vk) γt(i) / Σt=1T γt(i)
where δ(ot, vk) = 1 if ot = vk, and 0 otherwise
Note that δ here is the Kronecker delta function and is not
related to the δ in the discussion of the Viterbi algorithm!
The Updated Model
Coming from λ = (A, B, π) we get to λ' = (Â, B̂, π̂)
by the following update rules:
âi,j = Σt=1T-1 ξt(i, j) / Σt=1T-1 γt(i)
b̂i(k) = Σt=1T δ(ot, vk) γt(i) / Σt=1T γt(i)
π̂i = γ1(i)
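One re-estimation step can be sketched end to end; this is a minimal single-sequence version (the model and observation sequence below are made up, and real implementations scale or work in log space to avoid underflow on long sequences):

```python
def baum_welch_step(obs, pi, A, B):
    """One forward-backward (Baum-Welch) re-estimation step for a single
    observation sequence; returns updated (pi, A, B)."""
    N, T, M = len(pi), len(obs), len(B[0])
    # forward lattice: alpha[t][i] = P(o_1..o_t, q_t = s_i | lambda)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    # backward lattice: beta[t][i] = P(o_{t+1}..o_T | q_t = s_i, lambda)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j] for j in range(N))
                   for i in range(N)]
    p_obs = sum(alpha[T-1])
    # xi_t(i,j) and gamma_t(i)
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t+1]] * beta[t+1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)] for t in range(T)]
    # update rules from the slide
    new_pi = gamma[0]
    new_A = [[sum(xi[t][i][j] for t in range(T-1)) /
              sum(gamma[t][i] for t in range(T-1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T)) for k in range(M)]
             for i in range(N)]
    return new_pi, new_A, new_B

# made-up two-state, two-symbol model for illustration
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
new_pi, new_A, new_B = baum_welch_step([0, 1, 0, 0], pi, A, B)
```

Each re-estimated distribution still sums to one, and iterating this step is guaranteed not to decrease P(O | λ).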
Expectation Maximization
The forward-backward algorithm is an
instance of the more general EM
algorithm
The E Step: Compute the forward and
backward probabilities for a given model
The M Step: Re-estimate the model
parameters
Exercise
Programming with Viterbi Algorithm
Apply HMM for Part-of-Speech Tagging