Chapter 5 - Graphical Models
Graphical Models
(Figure: a two-node directed graphical model A → B, i.e., B depends on A.)
• Conditional independence:
p(D|C, E, A, B) = p(D|C)
• Factorization: the joint distribution factorizes into small conditional probabilities over (conditionally) independent parts, which are much easier to compute.
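As a minimal illustration (assuming, purely for this example, the structure A → B → C with C → D and C → E, which is consistent with p(D|C, E, A, B) = p(D|C)), the joint distribution factorizes as

p(A, B, C, D, E) = p(A) p(B|A) p(C|B) p(D|C) p(E|C),

so each factor involves at most two variables instead of all five, which is much cheaper to store and compute.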
• Introduction
• Example
• Independence assumptions
• Forward algorithm
• Viterbi algorithm
• Training
• Application to NER
Example:
• Your possible appearance prior to the exam: (tired, hungover, scared, fine), i.e., the observation.
• Your possible activity last night: (TV, pub, party, study), i.e., the hidden state to guess.
• Given the sequence of observations of your appearance, guess what you did on the previous nights.
A model:
p(y|x) = p(x|y) p(y) / p(x), where y is the hidden state and x is the observation.
• Joint distribution:
• It can generate any distribution on Y and X (the hidden top layer, e.g., the word or phoneme sequence in speech recognition).
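Written out for a hidden Markov model with states y_1, ..., y_T and observations x_1, ..., x_T, this joint distribution factorizes in the standard way as

p(y_1, ..., y_T, x_1, ..., x_T) = p(y_1) p(x_1|y_1) ∏_{t=2}^{T} p(y_t|y_{t−1}) p(x_t|y_t).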
Forward algorithm:
• To compute the joint probability of the state at time t being y_t and the sequence of observations in the first t steps being {x_1, x_2, ..., x_t}, define α_t(y_t) = p(y_t, x_1, x_2, ..., x_t).
• The y_t with the highest α_t(y_t) is the most likely state at time t given the same {x_1, x_2, ..., x_t}.
• The recursion:
α_t(y_t) = p(y_t, x_1, x_2, ..., x_t) = Σ_{y_{t−1}} p(y_t, y_{t−1}, x_1, x_2, ..., x_t)
         = Σ_{y_{t−1}} p(x_t|y_t, y_{t−1}, x_1, x_2, ..., x_{t−1}) p(y_t, y_{t−1}, x_1, x_2, ..., x_{t−1})
         = Σ_{y_{t−1}} p(x_t|y_t) p(y_t|y_{t−1}, x_1, x_2, ..., x_{t−1}) p(y_{t−1}, x_1, x_2, ..., x_{t−1})
         = Σ_{y_{t−1}} p(x_t|y_t) p(y_t|y_{t−1}) p(y_{t−1}, x_1, x_2, ..., x_{t−1})
         = p(x_t|y_t) Σ_{y_{t−1}} p(y_t|y_{t−1}) α_{t−1}(y_{t−1})
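As an illustration, a minimal Python sketch of this recursion (the parameter names pi, A, B and the integer encoding of states and observations are hypothetical, not from the lecture):

import numpy as np

def forward(pi, A, B, obs):
    """Compute alpha[t, i] = p(y_t = i, x_1, ..., x_t) for a discrete HMM.

    pi  : (S,)   initial state probabilities p(y_1)
    A   : (S, S) transition probabilities, A[i, j] = p(y_t = j | y_{t-1} = i)
    B   : (S, O) emission probabilities,   B[i, k] = p(x_t = k | y_t = i)
    obs : sequence of observation indices x_1, ..., x_T
    """
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]                  # alpha_1(y_1) = p(y_1) p(x_1 | y_1)
    for t in range(1, T):
        # alpha_t(y_t) = p(x_t | y_t) * sum_{y_{t-1}} p(y_t | y_{t-1}) alpha_{t-1}(y_{t-1})
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
    return alpha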
Viterbi algorithm:
• To compute the most probable sequence of states {y_1, y_2, ..., y_T} (the best sequence of y) given a sequence of observations {x_1, x_2, ..., x_T}:
Y* = arg max_Y p(Y|X) = arg max_Y p(Y, X)
• Viterbi algorithm:
max_{y_{1:T}} p(y_1, y_2, ..., y_T, x_1, x_2, ..., x_T) = max_{y_T} max_{y_{1:T−1}} p(y_1, y_2, ..., y_T, x_1, x_2, ..., x_T)
• Dynamic programming:
• Compute arg max_{y_1} p(y_1, x_1) = arg max_{y_1} p(x_1|y_1) p(y_1)
• Recursion (the forward recursion with the sum replaced by a max): define δ_t(y_t) = max_{y_{1:t−1}} p(y_1, ..., y_t, x_1, ..., x_t); then δ_t(y_t) = p(x_t|y_t) max_{y_{t−1}} p(y_t|y_{t−1}) δ_{t−1}(y_{t−1}), keeping a back-pointer to the maximizing y_{t−1}.
• Select arg max_{y_{1:T}} p(y_1, y_2, ..., y_T, x_1, x_2, ..., x_T) by taking the best final state and following the back-pointers. (See the sketch below.)
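A minimal Python sketch of this dynamic program, using the same hypothetical pi, A, B parameters as the forward-algorithm sketch above:

import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable state sequence y_1..y_T for observations x_1..x_T."""
    T, S = len(obs), len(pi)
    delta = np.zeros((T, S))            # delta[t, i] = max_{y_1..y_{t-1}} p(y_1..y_t = i, x_1..x_t)
    back = np.zeros((T, S), dtype=int)  # back-pointer to the maximizing previous state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta_{t-1}(i) p(y_t = j | y_{t-1} = i)
        back[t] = scores.argmax(axis=0)
        delta[t] = B[:, obs[t]] * scores.max(axis=0)
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))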
• Could the results from the forward algorithm be used for the Viterbi algorithm?
Training HMMs:
Supervised learning:
• Training data: paired sequences of states and observations (y_1, y_2, ..., y_T, x_1, x_2, ..., x_T).
• p(y_i) = number of sequences starting with y_i / number of all sequences.
• p(y_j|y_i) = number of (y_i, y_j) pairs / number of all (y_i, y) pairs.
• p(x_j|y_i) = number of (y_i, x_j) pairs / number of all (y_i, x) pairs.
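A rough Python sketch of this counting procedure (function and variable names are hypothetical; smoothing of zero counts is omitted):

from collections import Counter

def supervised_hmm(sequences, n_states, n_obs):
    """Estimate HMM parameters by counting over paired (state, observation) sequences.

    sequences: list of (states, observations) pairs, each a list of integer indices.
    """
    init, trans, emit = Counter(), Counter(), Counter()
    for states, observations in sequences:
        init[states[0]] += 1                    # sequence starts with state y_1
        for prev, cur in zip(states, states[1:]):
            trans[(prev, cur)] += 1             # (y_{t-1}, y_t) pairs
        for y, x in zip(states, observations):
            emit[(y, x)] += 1                   # (y_t, x_t) pairs
    pi = [init[i] / len(sequences) for i in range(n_states)]
    A = [[trans[(i, j)] / max(1, sum(trans[(i, k)] for k in range(n_states)))
          for j in range(n_states)] for i in range(n_states)]
    B = [[emit[(i, k)] / max(1, sum(emit[(i, m)] for m in range(n_obs)))
          for k in range(n_obs)] for i in range(n_states)]
    return pi, A, B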
Unsupervised learning:
• For each observation sequence, compute the most probable state sequence using the Viterbi algorithm.
• Update the parameters using supervised learning on the obtained paired state-observation sequences.
• Repeat until convergence. (A sketch of the loop follows.)
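This loop is often called Viterbi training (hard EM). A rough sketch, reusing the hypothetical viterbi and supervised_hmm functions from the earlier sketches:

import random
import numpy as np

def viterbi_training(observation_seqs, n_states, n_obs, n_iters=20):
    """Alternate Viterbi decoding and supervised re-estimation until the decoded
    state sequences stop changing (a simple convergence check)."""
    # Start from an arbitrary random labelling of the hidden states.
    labelled = [([random.randrange(n_states) for _ in obs], obs)
                for obs in observation_seqs]
    for _ in range(n_iters):
        pi, A, B = supervised_hmm(labelled, n_states, n_obs)            # re-estimate by counting
        decoded = [(viterbi(np.array(pi), np.array(A), np.array(B), obs), obs)
                   for obs in observation_seqs]                          # most probable state sequences
        if decoded == labelled:                                          # unchanged -> converged
            break
        labelled = decoded
    return pi, A, B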
Hidden Markov Models
Application to NER:
• Readings
• Marsland, S. (2009) Machine Learning: An Algorithmic Perspective. Chapter 15 (graphical models).
• Bikel, D. M. et al. (1997) Nymble: a high-performance learning name-finder.
• HW
• Apply the Viterbi algorithm to find the most probable 3-state sequence in the appearance-activity example in the lecture.
• Write a program to carry out the unsupervised learning example for HMMs in the lecture. Discuss the result, in particular the convergence of the process.