Hidden Markov Models

Dr. Nguyễn Văn Vinh

Department of Computer Science (KHMT), University of Engineering and Technology (ĐHCN), Vietnam National University, Hanoi
Why “Learn” ?
 Machine learning is programming computers to optimize a
performance criterion using example data or past
experience.
 There is no need to “learn” to calculate payroll
 Learning is used when:
 Human expertise does not exist (navigating on Mars),
 Humans are unable to explain their expertise (speech
recognition)
 Solution changes in time (routing on a computer network)
 Solution needs to be adapted to particular cases (user
biometrics)

What We Talk About When We
Talk About “Learning”
 Learning general models from data of particular examples
 Data is cheap and abundant (data warehouses, data marts);
knowledge is expensive and scarce.
 Example in retail: Customer transactions to consumer behavior:
People who bought “Da Vinci Code” also bought “The Five
People You Meet in Heaven” (www.amazon.com)
 Build a model that is a good and useful approximation to the
data.

What is Machine Learning?
 Optimize a performance criterion using example data or
past experience.
 Role of Statistics: Inference from a sample
 Role of Computer science: Efficient algorithms to
 Solve the optimization problem

 Represent and evaluate the model for inference

Applications
 Association
 Supervised Learning
 Unsupervised Learning
 Reinforcement Learning

Part-of-Speech Tagging
 Input:
Profits soared at Boeing Co., easily topping forecasts on Wall
Street, as their CEO Alan Mulally announced first quarter results.
 Output:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V
forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N
Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.

N = Noun
V = Verb
P = Preposition
ADV = Adverb
ADJ = Adjective
…
Face Recognition
Training examples of a person

Test images

AT&T Laboratories, Cambridge UK


http://www.uk.research.att.com/facedatabase.html

Introduction
 Modeling dependencies in input
 Sequences:
 Temporal: In speech; phonemes in a word
(dictionary), words in a sentence (syntax, semantics
of the language).
In handwriting, pen movements
 Spatial: In a DNA sequence; base pairs

Discrete Markov Process
 N states: S1, S2, ..., SN; the state at “time” t is qt = Si
 First-order Markov
P(qt+1=Sj | qt=Si, qt-1=Sk ,...) = P(qt+1=Sj | qt=Si)

 Transition probabilities
aij ≡ P(qt+1=Sj | qt=Si) aij ≥ 0 and Σj=1N aij=1

 Initial probabilities
πi ≡ P(q1=Si), Σi=1N πi=1

Stochastic Automaton

P(O = Q \mid A, \pi) = P(q_1) \prod_{t=2}^{T} P(q_t \mid q_{t-1}) = \pi_{q_1} a_{q_1 q_2} \cdots a_{q_{T-1} q_T}

Example: Balls and Urns
 Three urns each full of balls of one color
S1: red, S2: blue, S3: green

0 .4 0 .3 0 .3 
   0 .5 ,0 .2 ,0 .3  A  0 .2 0 .6 0 .2 
T

0 .1 0 .1 0 .8 
O  S 1 , S 1 , S 3 , S 3 
P O | A ,    P  S 1   P  S 1 | S 1   P  S 3 | S 1   P  S 3 | S 3 
 1  a 11  a 13  a 33
 0 .5  0 .4  0 .3  0 .8  0 .048
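A minimal sketch of this computation in Python (the 0-based encoding of the states in O is my own choice for illustration):

```python
# Sketch: probability of an observed state sequence in a fully observable
# Markov chain, using the urn example above (0 = S1 red, 1 = S2 blue, 2 = S3 green).
pi = [0.5, 0.2, 0.3]
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

O = [0, 0, 2, 2]  # S1, S1, S3, S3

p = pi[O[0]]
for t in range(1, len(O)):
    p *= A[O[t - 1]][O[t]]

print(p)  # 0.5 * 0.4 * 0.3 * 0.8 = 0.048
```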
Balls and Urns: Learning
 Given K example sequences of length T

\hat{\pi}_i = \frac{\#\{\text{sequences starting with } S_i\}}{\#\{\text{sequences}\}} = \frac{\sum_{k} 1(q_1^{k} = S_i)}{K}

\hat{a}_{ij} = \frac{\#\{\text{transitions from } S_i \text{ to } S_j\}}{\#\{\text{transitions from } S_i\}} = \frac{\sum_{k} \sum_{t=1}^{T-1} 1(q_t^{k} = S_i \text{ and } q_{t+1}^{k} = S_j)}{\sum_{k} \sum_{t=1}^{T-1} 1(q_t^{k} = S_i)}
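A small Python sketch of these counts (the toy sequences below are invented for illustration; states are 0-indexed):

```python
# Sketch: maximum-likelihood estimates of pi and A from K observed state sequences.
N = 3
sequences = [[0, 0, 2, 2],   # invented example data
             [1, 1, 1, 2],
             [0, 2, 2, 2]]
K = len(sequences)

# pi_hat[i] = (# sequences starting with S_i) / K
pi_hat = [sum(seq[0] == i for seq in sequences) / K for i in range(N)]

# a_hat[i][j] = (# transitions S_i -> S_j) / (# transitions from S_i)
a_hat = [[0.0] * N for _ in range(N)]
for seq in sequences:
    for t in range(len(seq) - 1):
        a_hat[seq[t]][seq[t + 1]] += 1
for i in range(N):
    total = sum(a_hat[i])
    if total > 0:
        a_hat[i] = [c / total for c in a_hat[i]]

print(pi_hat)
print(a_hat)
```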
Hidden Markov Models
 States are not observable
 Discrete observations {v1,v2,...,vM} are recorded; a
probabilistic function of the state
 Emission probabilities
bj(m) ≡ P(Ot=vm | qt=Sj)
 Example: In each urn, there are balls of different
colors, but with different probabilities.
 For each observation sequence, there are multiple
state sequences

Hidden Markov Model (HMM)
 HMMs allow you to estimate probabilities
of unobserved events
 Given the observed surface form (e.g., plain text), which underlying parameters generated it?
 E.g., in speech recognition, the observed
data is the acoustic signal and the words
are the hidden parameters
HMMs and their Usage
 HMMs are very common in Computational
Linguistics:
 Speech recognition (observed: acoustic signal,
hidden: words)
 Handwriting recognition (observed: image, hidden:
words)
 Part-of-speech tagging (observed: words, hidden:
part-of-speech tags)
 Machine translation (observed: foreign words,
hidden: words in target language)
Noisy Channel Model
 In speech recognition you observe an
acoustic signal (A=a1,…,an) and you want
to determine the most likely sequence of
words (W=w1,…,wn): P(W | A)
 Problem: A and W are too specific for
reliable counts on observed data, and are
very unlikely to occur in unseen data
Noisy Channel Model
 Assume that the acoustic signal (A) is already
segmented wrt word boundaries
 P(W | A) could be computed as

P(W \mid A) \approx \prod_{a_i} \max_{w_i} P(w_i \mid a_i)
 Problem: Finding the most likely word corresponding to an acoustic representation depends on the context
 E.g., /'pre-z&ns/ could mean “presents” or “presence” depending on the context
Noisy Channel Model
 Given a candidate sequence W we need
to compute P(W) and combine it with P(W
| A)
 Applying Bayes’ rule:
\arg\max_{W} P(W \mid A) = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
 The denominator P(A) can be dropped,
because it is constant for all W

Noisy Channel in a Picture

Decoding
The decoder combines evidence from
 The likelihood: P(A | W), which can be approximated as
P(A \mid W) \approx \prod_{i=1}^{n} P(a_i \mid w_i)
 The prior: P(W), which can be approximated as
P(W) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})
Search Space
 Given a word-segmented acoustic sequence list all
candidates

'bot          ik-'spen-siv    'pre-z&ns
boat          excessive       presidents
bald          expensive       presence
bold          expressive      presents
bought        inactive        press

(In the original trellis figure, edges carry probabilities such as P('bot | bald) and P(inactive | bald).)
 Compute the most likely path
Markov Assumption
 The Markov assumption states that the probability of the occurrence of word wi at time t depends only on the occurrence of word wi-1 at time t-1
 Chain rule:
P(w_1, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
 Markov assumption:
P(w_1, \ldots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})
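For example, with the first three words of the earlier tagging sentence, the bigram approximation reads:

P(\text{Profits}, \text{soared}, \text{at}) \approx P(\text{Profits}) \, P(\text{soared} \mid \text{Profits}) \, P(\text{at} \mid \text{soared})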
The Trellis
Parameters of an HMM
 States: A set of states S=s1,…,sn
 Transition probabilities: A= a1,1,a1,2,…,an,n Each
ai,j represents the probability of transitioning
from state si to sj.
 Emission probabilities: A set B of functions of
the form bi(ot) which is the probability of
observation ot being emitted by si
 Initial state distribution: πi is the probability that si is a start state
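A minimal sketch of how these parameters can be held in Python for the algorithms that follow (the numbers and variable names are my own illustration, not from the slides):

```python
import numpy as np

# Sketch: HMM parameters lambda = (A, B, pi) for N states and M observation symbols.
# Each row of A and each row of B sums to 1.
pi = np.array([0.5, 0.2, 0.3])           # pi[i]   = P(q_1 = s_i)
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])          # A[i, j] = P(q_{t+1} = s_j | q_t = s_i)
B = np.array([[0.9, 0.1],
              [0.3, 0.7],
              [0.5, 0.5]])               # B[j, m] = b_j(v_m) = P(o_t = v_m | q_t = s_j)
```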


The Three Basic HMM Problems
 Problem 1 (Evaluation): Given the observation sequence O = o1,…,oT and an HMM model λ = (A, B, π), how do we compute the probability of O given the model?
 Problem 2 (Decoding): Given the observation sequence O = o1,…,oT and an HMM model λ = (A, B, π), how do we find the state sequence that best explains the observations?


The Three Basic HMM Problems
 Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?



Problem 1: Probability of an Observation
Sequence
 What is P(O | λ)?
 The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
 Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences.
 Even small HMMs, e.g. T=10 and N=10, contain 10
billion different paths
 Solution to this and problem 2 is to use dynamic
programming
Forward Probabilities
 What is the probability that, given an HMM λ, at time t the state is i and the partial observation o1 … ot has been generated?
\alpha_t(i) = P(o_1 \ldots o_t,\; q_t = s_i \mid \lambda)
Forward Probabilities
\alpha_t(i) = P(o_1 \ldots o_t,\; q_t = s_i \mid \lambda)

\alpha_t(j) = \Bigl[ \sum_{i=1}^{N} \alpha_{t-1}(i) \, a_{ij} \Bigr] b_j(o_t)
Forward Algorithm
 Initialization:
\alpha_1(i) = \pi_i \, b_i(o_1), \qquad 1 \le i \le N
 Induction:
\alpha_t(j) = \Bigl[ \sum_{i=1}^{N} \alpha_{t-1}(i) \, a_{ij} \Bigr] b_j(o_t), \qquad 2 \le t \le T,\; 1 \le j \le N
 Termination:
P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)
Forward Algorithm Complexity
 In the naïve approach to solving Problem 1 it takes on the order of 2T·N^T computations
 The forward algorithm takes on the order of N^2·T computations
Backward Probabilities
 Analogous to the forward probability, just
in the other direction
 What is the probability that, given an HMM and given that the state at time t is i, the partial observation ot+1 … oT is generated?

\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i, \lambda)


Backward Probabilities
\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i, \lambda)

\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)
Backward Algorithm
 Initialization:
\beta_T(i) = 1, \qquad 1 \le i \le N
 Induction:
\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j), \qquad t = T-1, \ldots, 1,\; 1 \le i \le N
 Termination:
P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i \, b_i(o_1) \, \beta_1(i)
Problem 2: Decoding
 The solution to Problem 1 (Evaluation) gives us the total probability of the observations, summed over all paths through the HMM, efficiently.
 For Problem 2, we want to find the single path with the highest probability.
 We want to find the state sequence Q = q1…qT such that
Q^{*} = \arg\max_{Q'} P(Q' \mid O, \lambda)
Viterbi Algorithm
 Similar to computing the forward
probabilities, but instead of summing over
transitions from incoming states, compute
the maximum
 Forward:
\alpha_t(j) = \Bigl[ \sum_{i=1}^{N} \alpha_{t-1}(i) \, a_{ij} \Bigr] b_j(o_t)
 Viterbi recursion:
\delta_t(j) = \Bigl[ \max_{1 \le i \le N} \delta_{t-1}(i) \, a_{ij} \Bigr] b_j(o_t)
Viterbi Algorithm
 Initialization:
\delta_1(i) = \pi_i \, b_i(o_1), \qquad 1 \le i \le N
 Induction:
\delta_t(j) = \Bigl[ \max_{1 \le i \le N} \delta_{t-1}(i) \, a_{ij} \Bigr] b_j(o_t)
\psi_t(j) = \arg\max_{1 \le i \le N} \delta_{t-1}(i) \, a_{ij}, \qquad 2 \le t \le T,\; 1 \le j \le N
 Termination:
p^{*} = \max_{1 \le i \le N} \delta_T(i), \qquad q_T^{*} = \arg\max_{1 \le i \le N} \delta_T(i)
 Read out path:
q_t^{*} = \psi_{t+1}(q_{t+1}^{*}), \qquad t = T-1, \ldots, 1
Problem 3: Learning
 Up to now we've assumed that we know the underlying model λ = (A, B, π)
 Often these parameters are estimated on annotated training data, which has two drawbacks:
 Annotation is difficult and/or expensive
 Training data is different from the current data
 We want to adjust the parameters to maximize the likelihood of the current data, i.e., we're looking for a model λ' such that λ' = argmaxλ P(O | λ)

Problem 3: Learning
 Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ' such that λ' = argmaxλ P(O | λ)
 But it is possible to find a local maximum
 Given an initial model λ, we can always find a model λ' such that P(O | λ') ≥ P(O | λ)
Parameter Re-estimation
 Use the forward-backward (or Baum-
Welch) algorithm, which is a hill-climbing
algorithm
 Using an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters and improves the probability that the given observations are generated by the new parameters
Parameter Re-estimation
 Three parameters need to be re-
estimated:
 Initial state distribution: πi
 Transition probabilities: ai,j

 Emission probabilities: bi(ot)


Re-estimating Transition Probabilities
 What’s the probability of being in state si
at time t and going to state sj, given the
current model and parameters?
\xi_t(i, j) = P(q_t = s_i,\; q_{t+1} = s_j \mid O, \lambda)


Re-estimating Transition Probabilities
\xi_t(i, j) = P(q_t = s_i,\; q_{t+1} = s_j \mid O, \lambda)

\xi_t(i, j) = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}
Re-estimating Transition Probabilities
 The intuition behind the re-estimation
equation for transition probabilities is
\hat{a}_{ij} = \frac{\text{expected number of transitions from state } s_i \text{ to state } s_j}{\text{expected number of transitions from state } s_i}
 Formally:
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{N} \xi_t(i, j')}
Re-estimating Transition Probabilities
 Defining
\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)
as the probability of being in state si, given the complete observation O
 We can say:
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
Review of Probabilities
 Forward probability αt(i): the probability of being in state si, given the partial observation o1,…,ot
 Backward probability βt(i): the probability of being in state si, given the partial observation ot+1,…,oT
 Transition probability ξt(i, j): the probability of going from state si to state sj, given the complete observation o1,…,oT
 State probability γt(i): the probability of being in state si, given the complete observation o1,…,oT

Re-estimating Initial State Probabilities
 Initial state distribution: πi is the probability that si is a start state
 Re-estimation is easy:
\hat{\pi}_i = \text{expected number of times in state } s_i \text{ at time } 1
 Formally:
\hat{\pi}_i = \gamma_1(i)
Re-estimation of Emission Probabilities
 Emission probabilities are re-estimated as
\hat{b}_i(k) = \frac{\text{expected number of times in state } s_i \text{ observing symbol } v_k}{\text{expected number of times in state } s_i}
 Formally:
\hat{b}_i(k) = \frac{\sum_{t=1}^{T} \delta(o_t, v_k) \, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}
where \delta(o_t, v_k) = 1 if o_t = v_k, and 0 otherwise
 Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm!

The Updated Model
 Coming from λ = (A, B, π) we get to λ' = (\hat{A}, \hat{B}, \hat{\pi}) by the following update rules:

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
\qquad
\hat{b}_i(k) = \frac{\sum_{t=1}^{T} \delta(o_t, v_k) \, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}
\qquad
\hat{\pi}_i = \gamma_1(i)



Expectation Maximization
 The forward-backward algorithm is an
instance of the more general EM
algorithm
 The E Step: Compute the forward and backward probabilities for a given model
 The M Step: Re-estimate the model
parameters
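A sketch of one such EM iteration for a single observation sequence, built on the forward and backward sketches above (the helper structure and names are my own; a production version would work in log space or use scaling to avoid underflow):

```python
import numpy as np

def baum_welch_step(pi, A, B, O):
    """Sketch of one re-estimation pass lambda -> lambda' (E step + M step)."""
    N, M, T = len(pi), B.shape[1], len(O)
    alpha, p_obs = forward(pi, A, B, O)       # E step: forward probabilities
    beta, _ = backward(pi, A, B, O)           # E step: backward probabilities

    xi = np.zeros((T - 1, N, N))              # xi[t, i, j] = P(q_t=s_i, q_{t+1}=s_j | O)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, O[t + 1]] * beta[t + 1]
        xi[t] /= xi[t].sum()
    gamma = alpha * beta / p_obs              # gamma[t, i] = P(q_t = s_i | O)

    pi_new = gamma[0]                         # M step: pi_i' = gamma_1(i)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros((N, M))
    O = np.asarray(O)
    for k in range(M):
        B_new[:, k] = gamma[O == k].sum(axis=0) / gamma.sum(axis=0)
    return pi_new, A_new, B_new
```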
Exercise
 Implement the Viterbi algorithm
 Apply an HMM to part-of-speech tagging
