Chapter 5 - Graphical Models

This document discusses machine learning graphical models including Bayesian networks, naive Bayes classifiers, and hidden Markov models. It provides examples and explanations of key concepts for each model. Bayesian networks represent conditional independence relationships between variables using a directed graph. Naive Bayes classifiers make a strong independence assumption to classify instances based on attribute values. Hidden Markov models add a hidden state layer to model sequential data, where observations depend on the hidden state.


Machine Learning

Graphical Models

Lecturer: Duc Dung Nguyen, PhD.


Contact: [email protected]

Faculty of Computer Science and Engineering


Ho Chi Minh City University of Technology
Contents

1. Bayesian Networks (revisited)

2. Naive Bayes Classifier (revisited)

3. Hidden Markov Models



Bayesian Networks (revisited)
Bayesian Networks

A Bayesian network represents the relationships between events as a directed graph.

A → B

An edge A → B means that B depends on A.


Bayesian Networks

Advantages of graphical modeling:

• Conditional independence:
  p(D|C, E, A, B) = p(D|C)
• Factorization:
  p(A, B, C, D, E) = p(D|C) p(E|C) p(C|A, B) p(A) p(B)

The joint distribution factorizes into small local probabilities, which are easier to compute (see the sketch after this slide).
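As a concrete illustration of the factorization, here is a minimal Python sketch that multiplies the local factors together. The conditional probability tables are hypothetical values chosen only for illustration; they are not taken from the lecture.

# Factorization p(A,B,C,D,E) = p(D|C) p(E|C) p(C|A,B) p(A) p(B) over binary variables.
# All table values below are hypothetical.
p_A = {True: 0.3, False: 0.7}
p_B = {True: 0.6, False: 0.4}
p_C_given_AB = {(True, True): 0.9, (True, False): 0.5, (False, True): 0.4, (False, False): 0.1}
p_D_given_C = {True: 0.8, False: 0.2}
p_E_given_C = {True: 0.7, False: 0.3}

def joint(a, b, c, d, e):
    """Joint probability via the factorization; each factor is a small local table."""
    pc = p_C_given_AB[(a, b)] if c else 1 - p_C_given_AB[(a, b)]
    pd = p_D_given_C[c] if d else 1 - p_D_given_C[c]
    pe = p_E_given_C[c] if e else 1 - p_E_given_C[c]
    return pd * pe * pc * p_A[a] * p_B[b]

print(joint(True, False, True, True, False))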


Naive Bayes Classifier (revisited)
Naive Bayes Classifier

• Each instance x is described by a conjunction of attribute values <a1, a2, ..., an>.

• The goal is to assign the most probable class c to an instance (see the sketch below):

  c_NB = arg max_{c ∈ C} p(a1, a2, ..., an | c) p(c)
       = arg max_{c ∈ C} ∏_{i=1..n} p(ai | c) p(c)
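The decision rule above can be sketched in a few lines of Python. This is a minimal illustration assuming the priors p(c) and likelihoods p(ai | c) have already been estimated; the tables and the class names ("spam", "ham") are hypothetical.

import math

# Hypothetical priors p(c) and per-attribute likelihoods p(a_i | c) for two binary attributes.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.7, "no": 0.3}],  # p(a_i | spam)
    "ham":  [{"yes": 0.1, "no": 0.9}, {"yes": 0.4, "no": 0.6}],  # p(a_i | ham)
}

def classify(attributes):
    """Return arg max_c p(c) * prod_i p(a_i | c), computed in log space for stability."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior) + sum(math.log(likelihoods[c][i][a]) for i, a in enumerate(attributes))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["yes", "no"]))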


Naive Bayes Classifier

(Figure: the naive Bayes network structure; the attributes are conditionally independent given the class.)


Naive Bayes Classifier

Joint distribution: p(C, A1, A2, ..., An)


Naive Bayes Classifier

Naive Bayes is a generative model:

• It models a joint distribution: p(C, A)

• It can generate any distribution on C and A: if we know the joint distribution of the data well enough, we can generate data from it.

In contrast to a discriminative model (e.g., CRF):

• Conditional distribution: p(C|A)

• It discriminates C given A.


Hidden Markov Models
Hidden Markov Models

• Introduction
• Example
• Independence assumptions
• Forward algorithm
• Viterbi algorithm
• Training
• Application to NER



Hidden Markov Models

Sequential data have relationships between successive observations; HMMs model them with a hidden state layer.

• One of the most popular graphical models.

• Dynamic extension of Bayesian networks.
• Sequential extension of the Naive Bayes classifier.


Hidden Markov Models

Example:

• Your possible looking prior to the exam: (tired, hungover, scared, fine)
• Your possible activity last night: (TV, pub, party, study); this is the hidden state to be guessed.
• Given the sequence of observations of your looking, guess what you did in the previous nights.

A model:

• Your looking depends on what you did the night before.

• Your activity in a night depends on what you did in some previous nights.


Hidden Markov Models

• A finite set of possible observations.

• A finite set of possible hidden states.
• Goal: predict the most probable sequence of underlying states {y1, y2, ..., yT} for a given
  sequence of observations {x1, x2, ..., xT}:

  p(y|x) = p(x|y) p(y) / p(x)

  where y is the state sequence and x is the observation sequence.


Hidden Markov Models

(Figure from Marsland, S. (2009) Machine Learning: An Algorithmic Perspective.)


Hidden Markov Models

HMM conditional independence assumptions:


• State at time t depends only on the state at time t − 1:
  p(yt | yt−1, Z) = p(yt | yt−1)
• Observation at time t depends only on the state at time t:
  p(xt | yt, Z) = p(xt | yt)


Hidden Markov Models

HMM is a generative model:

• Joint distribution (see the sketch below):

  p(Y, X) = p(y1, y2, ..., yT, x1, x2, ..., xT)
          = ∏_{t=1..T} p(yt | yt−1) p(xt | yt)

  with the convention p(y1 | y0) = p(y1).

• It can generate any distribution on Y and X (HMMs are used, for example, as the top layer in speech recognition systems).

In contrast to a discriminative model (e.g., CRF):

• Conditional distribution: p(Y|X)

• It discriminates Y given X.
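Here is a minimal sketch of the joint probability under this factorization, assuming the model is specified by an initial distribution, a transition matrix, and an emission matrix; the numeric values are hypothetical.

import numpy as np

# p(Y, X) = prod_t p(y_t | y_{t-1}) p(x_t | y_t), with p(y_1 | y_0) = p(y_1).
# The matrices below are hypothetical, only for illustration.
pi = np.array([0.6, 0.4])                # p(y_1): initial state distribution
A = np.array([[0.7, 0.3], [0.2, 0.8]])   # A[i, j] = p(y_t = j | y_{t-1} = i): transitions
B = np.array([[0.9, 0.1], [0.3, 0.7]])   # B[i, k] = p(x_t = k | y_t = i): emissions

def joint_probability(states, observations):
    """p(y_1..y_T, x_1..x_T) for integer-coded state and observation sequences."""
    p = pi[states[0]] * B[states[0], observations[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1], states[t]] * B[states[t], observations[t]]
    return p

print(joint_probability([0, 0, 1], [0, 1, 1]))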
Hidden Markov Models

Forward algorithm:

• To compute the joint probability of the state at time t being yt and the sequence of
  observations in the first t steps being {x1, x2, ..., xt}:

  αt(yt) = p(yt, x1, x2, ..., xt)

• Bayes' theorem gives:

  p(yt | x1, x2, ..., xt) = p(yt, x1, x2, ..., xt) / p(x1, x2, ..., xt)
                          = αt(yt) / p(x1, x2, ..., xt)

• The higher αt(yt) is, the more likely yt is given the same {x1, x2, ..., xt}; the most probable
  current state is therefore arg max_{yt} αt(yt).


Hidden Markov Models

Forward algorithm:

  αt(yt) = p(yt, x1, x2, ..., xt)
         = Σ_{yt−1} p(yt, yt−1, x1, x2, ..., xt)
         = Σ_{yt−1} p(xt | yt, yt−1, x1, x2, ..., xt−1) p(yt, yt−1, x1, x2, ..., xt−1)
         = Σ_{yt−1} p(xt | yt) p(yt | yt−1, x1, x2, ..., xt−1) p(yt−1, x1, x2, ..., xt−1)
         = Σ_{yt−1} p(xt | yt) p(yt | yt−1) p(yt−1, x1, x2, ..., xt−1)
         = p(xt | yt) Σ_{yt−1} p(yt | yt−1) αt−1(yt−1)

  with the base case α1(y1) = p(y1, x1) = p(x1 | y1) p(y1)


Hidden Markov Models

Forward algorithm (recursion; a NumPy sketch follows):

  αt(yt) = p(xt | yt) Σ_{yt−1} p(yt | yt−1) αt−1(yt−1)
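The recursion translates directly into code. Below is a minimal NumPy sketch of the forward pass; the pi, A, B matrices are the same hypothetical ones used in the joint-probability sketch above.

import numpy as np

# Forward recursion alpha_t(y) = p(x_t | y) * sum_{y'} p(y | y') alpha_{t-1}(y').
def forward(observations, pi, A, B):
    """Return alpha as a (T, n_states) array; alpha[t, i] = p(y_t = i, x_1..x_t)."""
    T, n = len(observations), len(pi)
    alpha = np.zeros((T, n))
    alpha[0] = pi * B[:, observations[0]]                  # alpha_1(y) = p(y) p(x_1 | y)
    for t in range(1, T):
        alpha[t] = B[:, observations[t]] * (alpha[t - 1] @ A)
    return alpha

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
alpha = forward([0, 1, 1], pi, A, B)
print(alpha[-1].sum())   # p(x_1, ..., x_T): total probability of the observation sequence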


Hidden Markov Models

Viterbi algorithm:

• To compute the most probable sequence of states {y1, y2, ..., yT} (i.e., the best sequence of y)
  given a sequence of observations {x1, x2, ..., xT}:

  Y* = arg max_Y p(Y|X) = arg max_Y p(Y, X)


Hidden Markov Models

• Viterbi algorithm:

  max_{y1:T} p(y1, y2, ..., yT, x1, x2, ..., xT)
    = max_{yT} max_{y1:T−1} p(y1, y2, ..., yT, x1, x2, ..., xT)
    = max_{yT} max_{y1:T−1} { p(xT | yT) p(yT | yT−1) p(y1, ..., yT−1, x1, x2, ..., xT−1) }
    = max_{yT} max_{yT−1} { p(xT | yT) p(yT | yT−1) max_{y1:T−2} p(y1, ..., yT−1, x1, x2, ..., xT−1) }

• Dynamic programming (see the sketch below):
  • Compute
      arg max_{y1} p(y1, x1) = arg max_{y1} p(x1 | y1) p(y1)
  • For each t from 2 to T, and for each state yt, compute:
      arg max_{y1:t−1} p(y1, y2, ..., yt, x1, x2, ..., xt)
  • Select
      arg max_{y1:T} p(y1, y2, ..., yT, x1, x2, ..., xT)


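The dynamic programming above can be sketched as follows: delta plays the role of the running maximum and backpointers recover the best path. The pi, A, B matrices are again the hypothetical ones from the earlier sketches.

import numpy as np

def viterbi(observations, pi, A, B):
    """Return the most probable state sequence for the given observation sequence."""
    T, n = len(observations), len(pi)
    delta = np.zeros((T, n))
    backpointer = np.zeros((T, n), dtype=int)
    delta[0] = pi * B[:, observations[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[i, j] = delta_{t-1}(i) p(j | i)
        backpointer[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, observations[t]]
    # Trace back the best path from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointer[t, path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
print(viterbi([0, 1, 1], pi, A, B))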


Hidden Markov Models

• Could the results from the forward algorithm be used for the Viterbi algorithm?


Hidden Markov Models

Training HMMs:

• Topology is designed beforehand.


• Parameters to be learned: emission and transition probabilities.
• Supervised or unsupervised training.



Hidden Markov Models

Supervised learning (a counting sketch follows):

• Training data: paired sequences of states and observations (y1, y2, ..., yT, x1, x2, ..., xT).
• p(yi) = (number of sequences starting with yi) / (number of all sequences).
• p(yj | yi) = (number of transitions (yi, yj)) / (number of all transitions (yi, y)).
• p(xj | yi) = (number of emissions (yi, xj)) / (number of all emissions (yi, x)).
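A minimal sketch of these counting estimates, on a hypothetical two-sequence training set that reuses the activities and lookings from the earlier example.

from collections import Counter

# Each training example is a pair (states, observations) of equal-length sequences.
data = [
    (["pub", "study"], ["hungover", "fine"]),
    (["study", "study"], ["fine", "scared"]),
]

start_counts, trans_counts, emit_counts, state_counts = Counter(), Counter(), Counter(), Counter()
for states, observations in data:
    start_counts[states[0]] += 1
    for y_prev, y in zip(states, states[1:]):
        trans_counts[(y_prev, y)] += 1
    for y, x in zip(states, observations):
        emit_counts[(y, x)] += 1
        state_counts[y] += 1

def p_start(y):           # p(y_i) = sequences starting with y_i / all sequences
    return start_counts[y] / len(data)

def p_trans(y_prev, y):   # p(y_j | y_i) = count(y_i, y_j) / count of all transitions out of y_i
    total = sum(c for (a, _), c in trans_counts.items() if a == y_prev)
    return trans_counts[(y_prev, y)] / total if total else 0.0

def p_emit(y, x):         # p(x_j | y_i) = count(y_i, x_j) / count of all emissions from y_i
    return emit_counts[(y, x)] / state_counts[y] if state_counts[y] else 0.0

print(p_start("pub"), p_trans("pub", "study"), p_emit("study", "fine"))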


Hidden Markov Models

Supervised learning example: (figure not reproduced here)


Hidden Markov Models

Unsupervised learning:

• Only observation sequences are available.

• Iterative improvement of model parameters.

• How?


Hidden Markov Models

Unsupervised learning (see the sketch below):

• Initialize estimated parameters.

• For each observation sequence, compute the most probable state sequence using the Viterbi
  algorithm.
• Update the parameters using supervised learning on the obtained paired state-observation
  sequences.
• Repeat until convergence.
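A minimal sketch of this loop: decode with Viterbi, then re-estimate by counting. The observation sequences are hypothetical, the add-one smoothing is an assumption added to avoid zero probabilities, and the fixed iteration count stands in for a proper convergence check.

import numpy as np

def viterbi(obs, pi, A, B):
    """Same Viterbi recursion as in the earlier sketch, repeated here to keep this block runnable."""
    T, n = len(obs), len(pi)
    delta = np.zeros((T, n)); back = np.zeros((T, n), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def unsupervised_training(sequences, n_states, n_symbols, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize estimated parameters with random row-normalized values.
    pi = rng.random(n_states); pi /= pi.sum()
    A = rng.random((n_states, n_states)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(axis=1, keepdims=True)
    for _ in range(n_iters):   # a convergence check on the decoded paths could replace this
        decoded = [viterbi(x, pi, A, B) for x in sequences]     # most probable state sequences
        pi_c = np.ones(n_states); A_c = np.ones((n_states, n_states)); B_c = np.ones((n_states, n_symbols))
        for x, y in zip(sequences, decoded):                    # supervised counting update
            pi_c[y[0]] += 1
            for t in range(1, len(y)):
                A_c[y[t - 1], y[t]] += 1
            for t in range(len(y)):
                B_c[y[t], x[t]] += 1
        pi = pi_c / pi_c.sum()
        A = A_c / A_c.sum(axis=1, keepdims=True)
        B = B_c / B_c.sum(axis=1, keepdims=True)
    return pi, A, B

sequences = [[0, 1, 1, 0], [1, 1, 0, 1], [0, 0, 1, 1]]
pi, A, B = unsupervised_training(sequences, n_states=2, n_symbols=2)
print(np.round(A, 2))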
Hidden Markov Models

Application to NER (Named Entity Recognition):

• Example: "Facebook CEO Zuckerberg visited Vietnam".

  ORG = "Facebook"
  PER = "Zuckerberg"
  LOC = "Vietnam"
  NIL = "CEO", "visited"

• States = class labels
• Observations = words + features


Hidden Markov Models

Application to NER:

(Figure from Bikel, D. M. et al. (1997) Nymble: a high-performance learning name-finder.)
Hidden Markov Models

Application to NER:

• What if a name is a multi-word phrase?

• Example: "... John von Neumann is ..."
  B-PER = "John"
  I-PER = "von", "Neumann"
  O = "is"
• BIO notation: {B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC, O} (see the sketch below)
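A minimal sketch of how BIO tags can be read back into entity spans, using the example above; the helper function bio_to_spans and the trailing tokens are illustrative only.

tokens = ["John", "von", "Neumann", "is", "smart"]
tags   = ["B-PER", "I-PER", "I-PER", "O", "O"]

def bio_to_spans(tokens, tags):
    """Group consecutive B-X / I-X tags into (label, phrase) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

print(bio_to_spans(tokens, tags))   # [('PER', 'John von Neumann')]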


Homework

• Readings
  • Marsland, S. (2009) Machine Learning: An Algorithmic Perspective. Chapter 15 (Graphical Models).
  • Bikel, D. M. et al. (1997) Nymble: a high-performance learning name-finder.
• HW
  • Apply the Viterbi algorithm to find the most probable 3-state sequence in the looking-activity
    example in the lecture.
  • Write a program to carry out the unsupervised learning example for the HMM in the lecture.
    Discuss the results, in particular the convergence of the process.