
CS 461: Artificial Intelligence

Hidden Markov Models

Reasoning over Time or Space
– Often, we want to reason about a sequence of observations where the
state of the underlying system is changing
● Speech recognition
● Robot localization
● User attention
● Medical monitoring

– Need to introduce time into our models

Markov Models (aka Markov chain/process)

X0 → X1 → X2 → X3 → ⋯
Quiz: are Markov models a special case of Bayes nets?
– Yes and no!

– Yes:
● Directed acyclic graph, joint = product of conditionals

– No:
● Infinitely many variables (unless we truncate)

● Repetition of transition model not part of standard Bayes net syntax

Example: Random walk in one dimension

States: …, -4, -3, -2, -1, 0, 1, 2, 3, 4, …
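– A minimal simulation sketch (assuming, as in the classic version of this example, moves of ±1 with equal probability):

```python
import random

# Simulate the 1-D random walk chain: at each step move left or right
# with probability 0.5 each (an assumed transition model for illustration).
random.seed(42)
state = 0
for t in range(1, 11):
    state += random.choice([-1, +1])
    print(f"t={t}: X = {state}")
```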
Example: n-gram models
We call ourselves Homo sapiens—man the wise—because our intelligence is so important to us.
For thousands of years, we have tried to understand how we think; that is, how a mere handful of matter can
perceive, understand, predict, and manipulate a world far larger and more complicated than itself. ….

– State: word at position t in text (can also build letter n-grams)


– Transition model (probabilities come from empirical frequencies):
● Unigram (zero-order): P(Wordt = i)
■ “logical are as are confusion a may right tries agent goal the was . . .”

● Bigram (first-order): P(Wordt = i | Wordt-1= j)


■ “systems are very similar computational approach would be represented . . .”

● Trigram (second-order): P(Wordt = i | Wordt-1= j, Wordt-2= k)


■ “planning and scheduling are integrated the success of naive bayes model is . . .”

– Applications: text classification, spam detection, author identification, language classification, speech recognition
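– A minimal sketch of a bigram model whose transition probabilities are empirical pair frequencies (the tiny corpus is a stand-in; any text works):

```python
import random
from collections import defaultdict

# Bigram (first-order) model: P(Word_t = i | Word_t-1 = j) estimated from
# empirical word-pair frequencies, then sampled to generate text.
corpus = ("we call ourselves homo sapiens man the wise because our "
          "intelligence is so important to us").split()

counts = defaultdict(lambda: defaultdict(int))
for prev, word in zip(corpus, corpus[1:]):
    counts[prev][word] += 1

def sample_next(prev):
    words = list(counts[prev])
    weights = [counts[prev][w] for w in words]
    return random.choices(words, weights=weights)[0]

random.seed(0)
word, out = "we", ["we"]
for _ in range(8):
    if word not in counts:      # no observed successor: stop
        break
    word = sample_next(word)
    out.append(word)
print(" ".join(out))
```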
Example: Web browsing
– State: URL visited at step t
– Transition model:
● With probability p, choose an outgoing link at random
● With probability (1-p), choose an arbitrary new page

– Question: What is the stationary distribution over pages?
  ● i.e., if the process runs forever, what fraction of its time does it spend on any given page?

– Application: Google PageRank
  ● Google 1.0 returned the set of pages containing all your keywords, in decreasing rank; now all search engines use link analysis along with many other factors (rank is actually getting less important over time)
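– A power-iteration sketch of that stationary distribution (the 4-page link graph and p = 0.85 are made-up illustration values):

```python
import numpy as np

# Random-surfer chain: with prob. p follow a random outgoing link,
# with prob. 1-p jump to a page chosen uniformly at random.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}   # page -> outgoing links
n, p = 4, 0.85

T = np.full((n, n), (1 - p) / n)                 # random-jump part
for i, outs in links.items():
    for j in outs:
        T[i, j] += p / len(outs)                 # link-following part

dist = np.full(n, 1 / n)
for _ in range(100):
    dist = dist @ T                              # run the chain forward
print(np.round(dist, 4))                         # ≈ stationary distribution (PageRank)
```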

Example: Weather
– States: {rain, sun}

▪ Initial distribution P(X0):

    sun 0.5 | rain 0.5

▪ Transition model P(Xt | Xt-1):

    Xt-1   P(Xt = sun | Xt-1)   P(Xt = rain | Xt-1)
    sun    0.9                  0.1
    rain   0.3                  0.7

  (Two ways of representing the same CPT: the table above, or a state diagram with arcs sun→sun 0.9, sun→rain 0.1, rain→rain 0.7, rain→sun 0.3.)
Weather prediction
– Time 0: <0.5, 0.5>

– What is the weather like at time 1? (using the transition model above)
  ● P(X1) = ∑x0 P(X1, X0=x0)
  ●       = ∑x0 P(X0=x0) P(X1 | X0=x0)
  ●       = 0.5<0.9,0.1> + 0.5<0.3,0.7> = <0.6,0.4>
Weather prediction
– Time 1: <0.6, 0.4>

– What is the weather like at time 2?
  ● P(X2) = ∑x1 P(X2, X1=x1)
  ●       = ∑x1 P(X1=x1) P(X2 | X1=x1)
  ●       = 0.6<0.9,0.1> + 0.4<0.3,0.7> = <0.66,0.34>
Weather prediction
– Time 2: <0.66, 0.34>

– What is the weather like at time 3?
  ● P(X3) = ∑x2 P(X3, X2=x2)
  ●       = ∑x2 P(X2=x2) P(X3 | X2=x2)
  ●       = 0.66<0.9,0.1> + 0.34<0.3,0.7> = <0.696,0.304>
Forward algorithm (simple form)

    P(Xt) = ∑xt-1 P(Xt | xt-1) P(xt-1)
                  ↑ transition   ↑ probability from
                    model          previous iteration
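– A NumPy sketch of this update on the weather chain above (state order [sun, rain]); it reproduces <0.6,0.4>, <0.66,0.34>, <0.696,0.304>:

```python
import numpy as np

# Mini-forward algorithm: push P(X_t-1) through the transition model.
T = np.array([[0.9, 0.1],        # row = X_t-1, col = X_t
              [0.3, 0.7]])
p = np.array([0.5, 0.5])         # P(X0)

for t in range(1, 4):
    p = T.T @ p                  # P(X_t) = sum_x P(X_t | x) P(X_t-1 = x)
    print(f"P(X{t}) = {np.round(p, 3)}")
```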
And the same thing in linear algebra form
● What is the weather like at time 2?
  ○ P(X2) = 0.6<0.9,0.1> + 0.4<0.3,0.7> = <0.66,0.34>

● In matrix-vector form:

  ○ P(X2) = ( 0.9  0.3 ) ( 0.6 )  =  ( 0.66 )
            ( 0.1  0.7 ) ( 0.4 )     ( 0.34 )

● i.e., multiply by Tᵀ, the transpose of the transition matrix
Stationary Distributions
– The limiting distribution is called the stationary distribution P∞ of the chain
– It satisfies P∞ = P∞+1 = Tᵀ P∞
– Solving for P∞ in the example:

    ( 0.9  0.3 ) (  p  )  =  (  p  )
    ( 0.1  0.7 ) ( 1-p )     ( 1-p )

    0.9p + 0.3(1-p) = p  ⟹  p = 0.75

  Stationary distribution is <0.75, 0.25> regardless of the starting distribution
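– The same answer via linear algebra (a sketch using NumPy):

```python
import numpy as np

# The stationary distribution satisfies T^T p = p, i.e., it is the
# eigenvector of T^T for eigenvalue 1, normalized to sum to 1.
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])
vals, vecs = np.linalg.eig(T.T)
v = np.real(vecs[:, np.argmax(np.isclose(vals, 1.0))])
print(v / v.sum())               # -> [0.75 0.25]
```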
Example Run of Mini-Forward Algorithm
▪ From an initial observation of sun: P(X1), P(X2), P(X3), P(X4), … → P(X∞) = <0.75, 0.25>

▪ From an initial observation of rain: the sequence converges to the same P(X∞)

▪ From yet another initial distribution P(X1): again P(X∞) = <0.75, 0.25>

[Demo: Ghostbusters – Basic Dynamics]
[Demo: Ghostbusters – Circular Dynamics]
[Demo: Ghostbusters – Whirlpool Dynamics]
Application of Stationary Distributions: Gibbs Sampling*
– Each joint instantiation over all hidden and query variables is a state:
  {X1, …, Xn} = H ∪ Q

– Transitions:
  ● With probability 1/n, resample variable Xj according to
    P(Xj | x1, …, xj-1, xj+1, …, xn, e1, …, em)

– Stationary distribution:
  ● The conditional distribution P(X1, …, Xn | e1, …, em)
  ● This means that if we run Gibbs sampling long enough, we get a sample from the desired distribution
  ● Requires some proof to show this is true!
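– A toy sketch of the idea (the 2-variable joint below is made up for illustration; the sampler's long-run state frequencies should match it):

```python
import random

joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def p_one(var, other):
    """P(X_var = 1 | X_other = other), read off the joint table."""
    if var == 0:
        p1, p0 = joint[(1, other)], joint[(0, other)]
    else:
        p1, p0 = joint[(other, 1)], joint[(other, 0)]
    return p1 / (p0 + p1)

random.seed(0)
x, N = [0, 0], 100_000
counts = {k: 0 for k in joint}
for _ in range(N):
    j = random.randrange(2)                         # pick a variable w.p. 1/n
    x[j] = 1 if random.random() < p_one(j, x[1 - j]) else 0
    counts[tuple(x)] += 1                           # record the chain's state

for k in sorted(joint):
    print(k, "empirical:", round(counts[k] / N, 3), "true:", joint[k])
```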
MC Example
– By forward simulation (Monte Carlo): 100K steps
– By linear algebra (“Pizza day!”)
– Monte Carlo estimate
Hidden Markov Models

X0 → X1 → X2 → X3 → ⋯ → X5
      ↓    ↓    ↓         ↓
      E1   E2   E3   ⋯   E5
Example: Weather HMM
– An HMM is defined by:
  ● Initial distribution: P(X0)
  ● Transition model: P(Xt | Xt-1)
  ● Sensor model: P(Et | Xt)

    Wt-1   P(Wt = sun | Wt-1)   P(Wt = rain | Wt-1)
    sun    0.9                  0.1
    rain   0.3                  0.7

    Wt     P(Ut = true | Wt)    P(Ut = false | Wt)
    sun    0.2                  0.8
    rain   0.9                  0.1

  Weathert-1 → Weathert → Weathert+1, with Umbrellat observed at each step
HMM as probability model

X0 → X1 → X2 → X3 → ⋯ → X5, with evidence variables E1, E2, E3, …, E5

Useful notation: Xa:b = Xa, Xa+1, …, Xb

Joint model: P(X0:t, E1:t) = P(X0) ∏t P(Xt | Xt-1) P(Et | Xt)
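– A sampling sketch of this joint, using the umbrella HMM numbers from the Weather HMM slide above (the uniform P(X0) is an assumption):

```python
import random

P0 = {"sun": 0.5, "rain": 0.5}                   # assumed initial distribution
T  = {"sun":  {"sun": 0.9, "rain": 0.1},
      "rain": {"sun": 0.3, "rain": 0.7}}
P_umbrella = {"sun": 0.2, "rain": 0.9}           # P(Et = true | Xt)

def draw(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

random.seed(1)
x = draw(P0)                                     # X0 ~ P(X0)
for t in range(1, 6):
    x = draw(T[x])                               # Xt ~ P(Xt | Xt-1)
    e = random.random() < P_umbrella[x]          # Et ~ P(Et | Xt)
    print(f"t={t}: weather={x}, umbrella={e}")
```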
HMMs: Some Relevant Problems

Real HMM Examples
– Speech recognition HMMs:
● Observations are acoustic signals (continuous valued)
● States are specific positions in specific words (so, tens of thousands)

– Machine translation HMMs:


● Observations are words (tens of thousands)
● States are translation options

– Robot tracking:
● Observations are range readings (continuous)
● States are positions on a map (continuous)

– Molecular biology:
  ● Observations are nucleotides (A, C, G, T)
  ● States are coding/non-coding/start/stop/splice-site etc.
Inference tasks
– Filtering: P(Xt | e1:t)
  ● belief state—input to the decision process of a rational agent

– Prediction: P(Xt+k | e1:t) for k > 0
  ● evaluation of possible action sequences; like filtering without the evidence

– Smoothing: P(Xk | e1:t) for 0 ≤ k < t
  ● better estimate of past states, essential for learning

– Most likely explanation: arg maxx1:t P(x1:t | e1:t)
  ● speech recognition, decoding with a noisy channel
Inference tasks

Filtering: P(Xt | e1:t) — evidence e1…e4, query the current state X4
Prediction: P(Xt+k | e1:t) — evidence e1…e3, query the future state X4
Smoothing: P(Xk | e1:t), k < t — evidence e1…e4, query an earlier state Xk
Explanation: P(X1:t | e1:t) — evidence e1…e4, query the whole sequence X1…X4
Example: Ghostbusters HMM
– P(X1) = uniform over the 3×3 grid:

      1/9  1/9  1/9
      1/9  1/9  1/9
      1/9  1/9  1/9

– P(X | X') = usually move clockwise, but sometimes move in a random direction or stay in place; e.g., P(X | X' = <1,2>):

      1/6  1/6  1/2
      0    1/6  0
      0    0    0

– P(Rij | X) = same sensor model as before: red means close, green means far away

X1 → X2 → X3 → X4 → ⋯ → X5, with readings Ri,j at each step

[Demo: Ghostbusters – Circular Dynamics – HMM (L14D2)]
Video of Demo Ghostbusters – Circular Dynamics – HMM
Example 1: Weather-Mood (states observed)
– Using left eigenvectors
Example 2: Best Explanation (HMM)
– What is the most likely weather sequence for the observed mood sequence?
Example 3: Likelihood of Evidence (HMM)
– What is the probability of an observed mood sequence given an HMM model?
– One way to compute it: the forward algorithm
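– A sketch of that computation: run the forward update without normalizing and sum the final message. The weather-mood parameters are not recoverable from this export, so the umbrella HMM numbers are reused for illustration:

```python
import numpy as np

prior = np.array([0.5, 0.5])     # [sun, rain] (assumed uniform)
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])
P_u = np.array([0.2, 0.9])       # P(+u | sun), P(+u | rain)

def evidence_likelihood(observations):
    """P(e1:t): unnormalized forward messages, summed at the end."""
    f = prior
    for u in observations:
        e = P_u if u else 1 - P_u
        f = e * (T.T @ f)        # forward update, no normalization
    return f.sum()

print(evidence_likelihood([True, True, False]))
```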
Filtering / Monitoring
– Filtering, or monitoring, or state estimation, is the task of tracking/maintaining the distribution Bt(X) = f1:t = P(Xt | e1:t) (the belief state) over time

– We start with f0 in an initial setting, usually uniform

– As time passes, or as we get observations, we update f

– The Kalman filter was invented in the 1960s and first implemented as a method of trajectory estimation for the Apollo program; 1,120,000 papers on Google Scholar
Example: Robot Localization
(Example from Michael Pfeiffer; each frame shows a belief grid on a probability scale from 0 to 1)

– Sensor model: four bits for wall/no-wall in each direction, never more than 1 mistake
– Transition model: action may fail with small probability

t=0: [belief grid]
t=1: [belief grid] lighter grey: it was possible to get the reading, but less likely (required 1 mistake)
t=2 through t=5: [belief grids]
Inference: Base Cases

– Observation: a single state X1 with evidence E1
    P(x1 | e1) = P(e1 | x1) P(x1) / P(e1) ∝ P(e1 | x1) P(x1)

– Passage of time: X1 → X2 with no evidence
    P(x2) = ∑x1 P(x2 | x1) P(x1)
Passage of Time
– Aim: devise a recursive filtering algorithm of the form P(Xt+1 | e1:t+1) = g(et+1, P(Xt | e1:t))

– Assume we have the current belief P(Xt | evidence to date)

– Then, after one time step passes:
    P(Xt+1 | e1:t) = ∑xt P(Xt+1 | xt) P(xt | e1:t)
  (be careful about what time step t the belief is about, and what evidence it includes)

▪ Or compactly: B'(Xt+1) = ∑xt P(Xt+1 | xt) B(xt)

– Basic idea: beliefs get “pushed” through the transitions
Filtering
– Filtering allows us to update our belief (a probability distribution over the true state) with observations
  ● We need a prior distribution for the belief as well

– Let our belief probability for state s at time t be denoted Bt(s)

– We would like a recursive, Markov filtering algorithm
  ● Otherwise computations would become more difficult as time goes on
Passage of Time

– As time passes, uncertainty “accumulates” (transition model: ghosts usually go clockwise)

T=1, T=2, T=5: [belief grids]
Bayesian Filtering (Forward Algorithm)
– We can directly derive a belief update (the so-called forward algorithm):

    B(Xt+1) = α P(et+1 | Xt+1) ∑xt P(Xt+1 | xt) B(xt)
Observation
– Assume we have the current belief P(X1 | previous evidence)

– Then, after evidence E1 = e1 comes in:
    P(X1 | e1) ∝ P(e1 | X1) P(X1 | previous evidence)

– Or, compactly: B(Xt) ∝ P(et | Xt) B'(Xt)

▪ Basic idea: beliefs are “reweighted” by the likelihood of the evidence
▪ Unlike the passage of time, we must renormalize
Example: Observation

– As we get observations, beliefs get reweighted, uncertainty “decreases”

Before observation After observation


Example: Weather HMM

– Start at Rain0:              B(+r) = 0.5,   B(-r) = 0.5
– Passage of time to Rain1:    B'(+r) = 0.5,   B'(-r) = 0.5
– Observe Umbrella1 = true:    B(+r) = 0.818,  B(-r) = 0.182
– Passage of time to Rain2:    B'(+r) = 0.627, B'(-r) = 0.373
– Observe Umbrella2 = true:    B(+r) = 0.883,  B(-r) = 0.117

  Transition model P(Rt+1 | Rt):           Sensor model P(Ut | Rt):
    +r → +r : 0.7    +r → -r : 0.3           +r : P(+u) = 0.9,  P(-u) = 0.1
    -r → +r : 0.3    -r → -r : 0.7           -r : P(+u) = 0.2,  P(-u) = 0.8
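– A filtering sketch that reproduces the numbers above (state order [+r, -r]):

```python
import numpy as np

prior = np.array([0.5, 0.5])                 # B(X0)
T = np.array([[0.7, 0.3],                    # row = R_t, col = R_t+1
              [0.3, 0.7]])
P_u = np.array([0.9, 0.2])                   # P(+u | +r), P(+u | -r)

b = prior
for t, umbrella in enumerate([True, True], start=1):
    b_pred = T.T @ b                         # passage of time: B'(X_t)
    e = P_u if umbrella else 1 - P_u         # sensor model for the observation
    b = e * b_pred                           # reweight by evidence likelihood
    b /= b.sum()                             # renormalize
    print(f"t={t}: B' = {np.round(b_pred, 3)}, B = {np.round(b, 3)}")
# t=1: B' = [0.5 0.5],     B = [0.818 0.182]
# t=2: B' = [0.627 0.373], B = [0.883 0.117]
```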
The Forward Algorithm
– We are given evidence at each time step and want to know P(Xt | e1:t)

– We can derive the following update:

    f1:t+1 = α P(et+1 | Xt+1) ∑xt P(Xt+1 | xt) f1:t

  We can normalize as we go if we want to have P(x | e) at each time step, or just once at the end…

– Base case: the prior P(X0) and the transition and sensor models are known
Online Belief Updates
– Every time step, we start with the current P(X | evidence)

– We update for time:
    P(X2 | e1) = ∑x1 P(X2 | x1) P(x1 | e1)

– We update for evidence:
    P(X2 | e1:2) ∝ P(e2 | X2) P(X2 | e1)

– The forward algorithm does both at once (and doesn’t normalize)
Pacman – Sonar (P4)

[Demo: Pacman – Sonar – No Beliefs (L14D1)]
Video of Demo Pacman – Sonar (with beliefs)
Most Likely Explanation
Inference tasks

– Filtering: P(Xt | e1:t)
  ● belief state—input to the decision process of a rational agent

– Prediction: P(Xt+k | e1:t) for k > 0
  ● evaluation of possible action sequences; like filtering without the evidence

– Smoothing: P(Xk | e1:t) for 0 ≤ k < t
  ● better estimate of past states, essential for learning

– Most likely explanation: arg maxx1:t P(x1:t | e1:t)
  ● speech recognition, decoding with a noisy channel
Most likely explanation = most probable path
– State trellis: graph of states and transitions over time

    sun   sun   sun   sun
    rain  rain  rain  rain
    X0    X1    …     XT

  arg maxx1:t P(x1:t | e1:t)
    = arg maxx1:t α P(x1:t, e1:t)
    = arg maxx1:t P(x1:t, e1:t)
    = arg maxx1:t P(x0) ∏t P(xt | xt-1) P(et | xt)

– Each arc represents some transition xt-1 → xt
– Each arc has weight P(xt | xt-1) P(et | xt) (arcs to initial states have weight P(x0))
– The product of weights on a path is proportional to that state sequence’s probability
– The forward algorithm computes sums over paths; the Viterbi algorithm computes best paths
Forward / Viterbi algorithms

    sun   sun   sun   sun
    rain  rain  rain  rain
    X0    X1    …     XT

Forward Algorithm (sum)
  For each state at time t, keep track of the total probability of all paths to it:
    f1:t+1 = FORWARD(f1:t, et+1) = α P(et+1 | Xt+1) ∑xt P(Xt+1 | xt) f1:t

Viterbi Algorithm (max)
  For each state at time t, keep track of the maximum probability of any path to it:
    m1:t+1 = VITERBI(m1:t, et+1) = P(et+1 | Xt+1) maxxt P(Xt+1 | xt) m1:t
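– A Viterbi sketch on the weather/umbrella HMM from the Weather HMM slide (state order [sun, rain]; a uniform P(X0) is assumed):

```python
import numpy as np

prior = np.array([0.5, 0.5])
T = np.array([[0.9, 0.1],            # row = X_t-1, col = X_t
              [0.3, 0.7]])
P_u = np.array([0.2, 0.9])           # P(+u | sun), P(+u | rain)

def viterbi(observations):
    """Most likely state sequence x0:T for a list of umbrella observations."""
    m = prior.copy()                 # m[x] = max prob. of any path ending in x
    backptr = []
    for u in observations:
        e = P_u if u else 1 - P_u
        scores = m[:, None] * T * e[None, :]   # m[i] P(x_j | x_i) P(e | x_j)
        backptr.append(scores.argmax(axis=0))  # best predecessor of each x_j
        m = scores.max(axis=0)
    path = [int(m.argmax())]                   # best final state...
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))         # ...then follow back-pointers
    path.reverse()
    return [["sun", "rain"][s] for s in path]  # includes X0

print(viterbi([True, True, False]))
```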
Next Time: Particle Filtering and Applications of HMMs
