
CS 188 Introduction to Artificial Intelligence

Fall 2022 Note 17


These lecture notes are heavily based on notes originally written by Nikhil Sharma.
Last updated: October 25, 2022

Markov Models
In previous notes, we talked about Bayes’ nets and how they are a wonderful structure for compactly
representing relationships between random variables. We’ll now cover a closely related structure called a
Markov model, which for the purposes of this course can be thought of as analogous to a chain-like,
infinite-length Bayes’ net. The running example we’ll be working with in this section is the day-to-day
fluctuations in weather patterns. Our weather model will be time-dependent (as are Markov models in
general), meaning we’ll have a separate random variable for the weather on each day. If we define Wi as the
random variable representing the weather on day i, the Markov model for our weather example is the chain
W0 → W1 → W2 → · · ·, in which each variable’s only parent is the variable for the previous day.

What information should we store about the random variables involved in our Markov model? To track
how our quantity under consideration (in this case, the weather) changes over time, we need to know both
its initial distribution at time t = 0 and some sort of transition model that characterizes the probability
of moving from one state to another between timesteps. The initial distribution of a Markov model is
enumerated by the probability table given by P(W0 ) and the transition model of transitioning from state i to
i + 1 is given by P(Wi+1 |Wi ). Note that this transition model implies that the value of Wi+1 is conditionally
dependent only on the value of Wi . In other words, the weather at time t = i + 1 satisfies the Markov
property or memoryless property, and is independent of the weather at all other timesteps besides t = i.
Using our Markov model for weather, if we wanted to reconstruct the joint distribution over W0 , W1 , and
W2 using the chain rule, we would write:

P(W0 ,W1 ,W2 ) = P(W0 )P(W1 |W0 )P(W2 |W1 ,W0 )

However, with our assumption that the Markov property holds true and W0 ⊥⊥ W2 |W1 , the joint simplifies to:

P(W0 ,W1 ,W2 ) = P(W0 )P(W1 |W0 )P(W2 |W1 )

And we have everything we need to calculate this from the Markov model. More generally, Markov models
make the following independence assumption at each timestep: Wi+1 ⊥⊥ {W0 , ...,Wi−1 }|Wi . This allows us
to reconstruct the joint distribution for the first n + 1 variables via the chain rule as follows:
P(W0 , W1 , . . . , Wn ) = P(W0 ) P(W1 |W0 ) P(W2 |W1 ) · · · P(Wn |Wn−1 ) = P(W0 ) ∏_{i=0}^{n−1} P(Wi+1 |Wi )



A final assumption that’s typically made in Markov models is that the transition model is stationary. In
other words, for all values of i (all timesteps), P(Wi+1 |Wi ) is identical. This allows us to represent a Markov
model with only two tables: one for P(W0 ) and one for P(Wi+1 |Wi ).

The Mini-Forward Algorithm


We now know how to compute the joint distribution across timesteps of a Markov model. However, this
doesn’t explicitly help us answer the question of the distribution of the weather on some given day t. Nat-
urally, we can compute the joint then marginalize (sum out) over all other variables, but this is typically
extremely inefficient, since if we have j variables each of which can take on d values, the size of the joint
distribution is O(d j ). Instead, we’ll present a more efficient technique called the mini-forward algorithm.
Here’s how it works. By properties of marginalization, we know that

P(Wi+1 ) = ∑_{wi} P(wi , Wi+1 )

By the chain rule we can re-express this as follows:

P(Wi+1 ) = ∑_{wi} P(Wi+1 |wi ) P(wi )

This equation should make some intuitive sense — to compute the distribution of the weather at timestep
i + 1, we look at the probability distribution at timestep i given by P(Wi ) and "advance" this model a timestep
with our transition model P(Wi+1 |Wi ). With this equation, we can iteratively compute the distribution of the
weather at any timestep of our choice by starting with our initial distribution P(W0 ) and using it to compute
P(W1 ), then in turn using P(W1 ) to compute P(W2 ), and so on. Let’s walk through an example, using the
following initial distribution and transition model:

W0      P(W0)
sun     0.8
rain    0.2

Wi+1    Wi      P(Wi+1 |Wi )
sun     sun     0.6
rain    sun     0.4
sun     rain    0.1
rain    rain    0.9

Using the mini-forward algorithm we can compute P(W1 ) as follows:

P(W1 = sun) = ∑_{w0} P(W1 = sun|w0 ) P(w0 )
            = P(W1 = sun|W0 = sun) P(W0 = sun) + P(W1 = sun|W0 = rain) P(W0 = rain)
            = 0.6 · 0.8 + 0.1 · 0.2 = 0.5

P(W1 = rain) = ∑_{w0} P(W1 = rain|w0 ) P(w0 )
             = P(W1 = rain|W0 = sun) P(W0 = sun) + P(W1 = rain|W0 = rain) P(W0 = rain)
             = 0.4 · 0.8 + 0.9 · 0.2 = 0.5

Hence our distribution for P(W1 ) is

W1 P(W1 )
sun 0.5
rain 0.5
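
To make this concrete, here is a minimal Python sketch of the mini-forward algorithm run on the tables
above; the dictionary encoding and the function name are our own illustrative choices, not part of the
original notes.

    # Tables from the weather example above.
    initial = {"sun": 0.8, "rain": 0.2}                       # P(W0)
    transition = {("sun", "sun"): 0.6, ("rain", "sun"): 0.4,  # P(W_{i+1} | W_i),
                  ("sun", "rain"): 0.1, ("rain", "rain"): 0.9}  # keyed (next, current)

    def mini_forward(dist, transition):
        """One update: P(W_{i+1}) = sum_{w_i} P(W_{i+1} | w_i) P(w_i)."""
        return {w_next: sum(transition[(w_next, w)] * dist[w] for w in dist)
                for w_next in dist}

    dist = initial
    for t in range(1, 4):
        dist = mini_forward(dist, transition)
        print(f"P(W{t}) =", dist)   # P(W1) = {'sun': 0.5, 'rain': 0.5}, as above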



Notably, the probability that it will be sunny has decreased from 80% at time t = 0 to only 50% at time t = 1.
This is a direct result of our transition model, which favors transitioning to rainy days over sunny days. This
gives rise to a natural follow-up question: does the probability of being in a state at a given timestep ever
converge? We’ll address this question in the following section.

Stationary Distribution
To solve the problem stated above, we must compute the stationary distribution of the weather. As the
name suggests, the stationary distribution is one that remains the same after the passage of time, i.e.
P(Wt+1 ) = P(Wt )
We can compute these converged probabilities of being in a given state by combining the above equivalence
with the same equation used by the mini-forward algorithm:
P(Wt+1 ) = P(Wt ) = ∑_{wt} P(Wt+1 |wt ) P(wt )

For our weather example, this gives us the following two equations:

P(Wt = sun) = P(Wt+1 = sun|Wt = sun)P(Wt = sun) + P(Wt+1 = sun|Wt = rain)P(Wt = rain)
= 0.6 · P(Wt = sun) + 0.1 · P(Wt = rain)
P(Wt = rain) = P(Wt+1 = rain|Wt = sun)P(Wt = sun) + P(Wt+1 = rain|Wt = rain)P(Wt = rain)
= 0.4 · P(Wt = sun) + 0.9 · P(Wt = rain)

We now have exactly what we need to solve for the stationary distribution: a system of 2 equations in 2
unknowns. However, these two equations are linearly dependent, so we add a third equation using the fact
that P(Wt ) is a probability distribution and so must sum to 1:

P(Wt = sun) = 0.6 · P(Wt = sun) + 0.1 · P(Wt = rain)


P(Wt = rain) = 0.4 · P(Wt = sun) + 0.9 · P(Wt = rain)
1 = P(Wt = sun) + P(Wt = rain)

The first equation rearranges to 0.4 · P(Wt = sun) = 0.1 · P(Wt = rain), i.e. P(Wt = rain) = 4 · P(Wt = sun);
substituting this into the sum-to-1 constraint yields P(Wt = sun) = 0.2 and P(Wt = rain) = 0.8. Hence the
table for our stationary distribution, which we’ll henceforth denote as P(W∞ ), is the following:

W∞ P(W∞ )
sun 0.2
rain 0.8

To verify this result, let’s apply the transition model to the stationary distribution:

P(W∞+1 = sun) = P(W∞+1 = sun|W∞ = sun)P(W∞ = sun) + P(W∞+1 = sun|W∞ = rain)P(W∞ = rain)
= 0.6 · 0.2 + 0.1 · 0.8 = 0.2
P(W∞+1 = rain) = P(W∞+1 = rain|W∞ = sun)P(W∞ = sun) + P(W∞+1 = rain|W∞ = rain)P(W∞ = rain)
= 0.4 · 0.2 + 0.9 · 0.8 = 0.8

As expected, P(W∞+1 ) = P(W∞ ). In general, if Wt had a domain of size k, the equivalence


P(Wt ) = ∑_{wt} P(Wt+1 |wt ) P(wt )

yields a system of k equations, which we can use to solve for the stationary distribution.
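
One way to find this fixed point numerically is simply to run the mini-forward update until the distribution
stops changing. The sketch below does exactly that (an illustrative choice; one could equally solve the
linear system directly), and it agrees with the algebraic solution above.

    # Iterate the mini-forward update to convergence to find the
    # stationary distribution of the weather chain.
    transition = {("sun", "sun"): 0.6, ("rain", "sun"): 0.4,
                  ("sun", "rain"): 0.1, ("rain", "rain"): 0.9}

    def mini_forward(dist, transition):
        return {w_next: sum(transition[(w_next, w)] * dist[w] for w in dist)
                for w_next in dist}

    dist = {"sun": 0.8, "rain": 0.2}     # any starting distribution works here
    while True:
        new_dist = mini_forward(dist, transition)
        if all(abs(new_dist[w] - dist[w]) < 1e-10 for w in dist):
            break
        dist = new_dist
    print(dist)   # ~{'sun': 0.2, 'rain': 0.8}, the stationary distribution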



Hidden Markov Models
With Markov models, we saw how we could incorporate change over time through a chain of random vari-
ables. For example, if we want to know the weather on day 10 with our standard Markov model from above,
we can begin with the initial distribution P(W0 ) and use the mini-forward algorithm with our transition
model to compute P(W10 ). However, between time t = 0 and time t = 10, we may collect new meteoro-
logical evidence that might affect our belief of the probability distribution over the weather at any given
timestep. In simpler terms, if the weather forecast predicts an 80% chance of rain on day 10, but there are
clear skies on the night of day 9, that 80% probability might drop drastically. This is exactly what the Hidden
Markov Model helps us with: it allows us to observe some evidence at each timestep, which can potentially
affect the belief distribution at each of the states. The Hidden Markov Model for our weather example can
be described using a Bayes’ net structure in which each state variable Wi in the chain has a corresponding
observed evidence variable Fi as its child.

Unlike vanilla Markov models, we now have two different types of nodes. To make this distinction, we’ll
call each Wi a state variable and each weather forecast Fi an evidence variable. Since Wi encodes our belief
of the probability distribution for the weather on day i, it should be a natural result that the weather forecast
for day i is conditionally dependent upon this belief. The model implies similar conditional independence
relationships as standard Markov models, with an additional set of relationships for the evidence variables:

F1 ⊥⊥ W0 |W1
∀i = 2, . . . , n; Wi ⊥⊥ {W0 , . . . ,Wi−2 , F1 , . . . , Fi−1 }|Wi−1
∀i = 2, . . . , n; Fi ⊥⊥ {W0 , . . . ,Wi−1 , F1 , . . . , Fi−1 }|Wi

Just like Markov models, Hidden Markov Models make the assumption that the transition model P(Wi+1 |Wi )
is stationary. Hidden Markov Models make the additional simplifying assumption that the sensor model
P(Fi |Wi ) is stationary as well. Hence any Hidden Markov Model can be represented compactly with just
three probability tables: the initial distribution, the transition model, and the sensor model.
As a final point on notation, we’ll define the belief distribution at time i, given all evidence f1 , . . . , fi
observed so far:
B(Wi ) = P(Wi | f1 , . . . , fi )
Similarly, we’ll define B′ (Wi ) as the belief distribution at time i with evidence f1 , . . . , fi−1 observed:
B′ (Wi ) = P(Wi | f1 , . . . , fi−1 )
Defining ei as evidence observed at timestep i, you might sometimes see the aggregated evidence from
timesteps 1 ≤ i ≤ t reexpressed in the following form:
e1:t = e1 , . . . , et
Under this notation, P(Wi | f1 , . . . , fi−1 ) can be written as P(Wi | f1:(i−1) ). This notation will become relevant
in the upcoming sections, where we’ll discuss time elapse updates that iteratively incorporate new evidence
into our weather model.



The Forward Algorithm
Using the conditional probability assumptions stated above and marginalization properties of conditional
probability tables, we can derive a relationship between B(Wi ) and B′ (Wi+1 ) that’s of the same form as the
update rule for the mini-forward algorithm. We begin by using marginalization:
B′ (Wi+1 ) = P(Wi+1 | f1 , . . . , fi ) = ∑_{wi} P(Wi+1 , wi | f1 , . . . , fi )

This can then be re-expressed using the chain rule as follows:


B′ (Wi+1 ) = P(Wi+1 | f1 , . . . , fi ) = ∑_{wi} P(Wi+1 |wi , f1 , . . . , fi ) P(wi | f1 , . . . , fi )

Noting that P(wi | f1 , . . . , fi ) is simply B(wi ) and that Wi+1 ⊥⊥ { f1 , . . . , fi }|Wi , this simplifies to our final
relationship between B(Wi ) and B′ (Wi+1 ):

B′ (Wi+1 ) = ∑_{wi} P(Wi+1 |wi ) B(wi )

Now let’s consider how we can derive a relationship between B′ (Wi+1 ) and B(Wi+1 ). By application of the
definition of conditional probability (with extra conditioning), we can see that
B(Wi+1 ) = P(Wi+1 | f1 , . . . , fi+1 ) = P(Wi+1 , fi+1 | f1 , . . . , fi ) / P( fi+1 | f1 , . . . , fi )
When dealing with conditional probabilities, a commonly used trick is to delay normalization until we require
the normalized probabilities, a trick we’ll now employ. More specifically, since the denominator in the
above expansion of B(Wi+1 ) is common to every term in the probability table represented by B(Wi+1 ), we
can omit actually dividing by P( fi+1 | f1 , . . . , fi ). Instead, we can simply note that B(Wi+1 ) is proportional to
P(Wi+1 , fi+1 | f1 , . . . , fi ):
B(Wi+1 ) ∝ P(Wi+1 , fi+1 | f1 , . . . , fi )
with a constant of proportionality equal to P( fi+1 | f1 , . . . , fi ). Whenever we decide we want to recover the
belief distribution B(Wi+1 ), we can divide each computed value by this constant of proportionality. Now,
using the chain rule we can observe the following:
B(Wi+1 ) ∝ P(Wi+1 , fi+1 | f1 , . . . , fi ) = P( fi+1 |Wi+1 , f1 , . . . , fi )P(Wi+1 | f1 , . . . , fi )
By the conditional independence assumptions associated with Hidden Markov Models stated previously,
P( fi+1 |Wi+1 , f1 , . . . , fi ) is equivalent to simply P( fi+1 |Wi+1 ) and by definition P(Wi+1 | f1 , . . . , fi ) = B′ (Wi+1 ).
This allows us to express the relationship between B′ (Wi+1 ) and B(Wi+1 ) in its final form:

B(Wi+1 ) ∝ P( fi+1 |Wi+1 )B′ (Wi+1 )


Combining the two relationships we’ve just derived yields an iterative algorithm known as the forward
algorithm, the Hidden Markov Model analog of the mini-forward algorithm from earlier:

B(Wi+1 ) ∝ P( fi+1 |Wi+1 ) ∑_{wi} P(Wi+1 |wi ) B(wi )

The forward algorithm can be thought of as consisting of two distinctive steps: the time elapse update
which corresponds to determining B′ (Wi+1 ) from B(Wi ) and the observation update which corresponds to
determining B(Wi+1 ) from B′ (Wi+1 ). Hence, in order to advance our belief distribution by one timestep
(i.e. compute B(Wi+1 ) from B(Wi )), we must first advance our model’s state by one timestep with the time
elapse update, then incorporate new evidence from that timestep with the observation update. Consider the
following initial distribution, transition model, and sensor model:



W0      B(W0)
sun     0.8
rain    0.2

Wi+1    Wi      P(Wi+1 |Wi )
sun     sun     0.6
rain    sun     0.4
sun     rain    0.1
rain    rain    0.9

Fi      Wi      P(Fi |Wi )
good    sun     0.8
bad     sun     0.2
good    rain    0.3
bad     rain    0.7

To compute B(W1 ), we begin by performing a time elapse update to get B′ (W1 ):

B′ (W1 = sun) = ∑_{w0} P(W1 = sun|w0 ) B(w0 )
              = P(W1 = sun|W0 = sun) B(W0 = sun) + P(W1 = sun|W0 = rain) B(W0 = rain)
              = 0.6 · 0.8 + 0.1 · 0.2 = 0.5

B′ (W1 = rain) = ∑_{w0} P(W1 = rain|w0 ) B(w0 )
               = P(W1 = rain|W0 = sun) B(W0 = sun) + P(W1 = rain|W0 = rain) B(W0 = rain)
               = 0.4 · 0.8 + 0.9 · 0.2 = 0.5

Hence:
W1 B′ (W1 )
sun 0.5
rain 0.5

Next, we’ll assume that the weather forecast for day 1 was good (i.e. F1 = good), and perform an observation
update to get B(W1 ):

B(W1 = sun) ∝ P(F1 = good|W1 = sun)B′ (W1 = sun) = 0.8 · 0.5 = 0.4
B(W1 = rain) ∝ P(F1 = good|W1 = rain)B′ (W1 = rain) = 0.3 · 0.5 = 0.15

The last step is to normalize B(W1 ), noting that the entries in the table for B(W1 ) sum to 0.4 + 0.15 = 0.55:

B(W1 = sun) = 0.4/0.55 = 8/11
B(W1 = rain) = 0.15/0.55 = 3/11
Our final table for B(W1 ) is thus the following:

W1      B(W1)
sun     8/11
rain    3/11

Note the result of observing the weather forecast. Because the weatherman predicted good weather, our
belief that it would be sunny increased from 1/2 after the time elapse update to 8/11 after the observation
update.

As a parting note, the normalization trick discussed above can actually simplify computation significantly
when working with Hidden Markov Models. If we began with some initial distribution and were interested
in computing the belief distribution at time t, we could use the forward algorithm to iteratively compute
B(W1 ), . . . , B(Wt ) and normalize only once at the end by dividing each entry in the table for B(Wt ) by the
sum of its entries.
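
Putting the two updates together, here is a minimal Python sketch of the forward algorithm; the helper
names time_elapse and observe are our own, and the run below reproduces the B(W1 ) = (8/11, 3/11) result
computed above.

    # Forward algorithm: alternate time elapse and observation updates.
    initial = {"sun": 0.8, "rain": 0.2}                       # B(W0) = P(W0)
    transition = {("sun", "sun"): 0.6, ("rain", "sun"): 0.4,  # P(W_{i+1} | W_i)
                  ("sun", "rain"): 0.1, ("rain", "rain"): 0.9}
    sensor = {("good", "sun"): 0.8, ("bad", "sun"): 0.2,      # P(F_i | W_i)
              ("good", "rain"): 0.3, ("bad", "rain"): 0.7}

    def time_elapse(belief):
        """B'(W_{i+1}) = sum_{w_i} P(W_{i+1} | w_i) B(w_i)."""
        return {w_next: sum(transition[(w_next, w)] * belief[w] for w in belief)
                for w_next in belief}

    def observe(belief_prime, evidence):
        """B(W_{i+1}) proportional to P(f_{i+1} | W_{i+1}) B'(W_{i+1}); normalize."""
        unnormalized = {w: sensor[(evidence, w)] * b for w, b in belief_prime.items()}
        total = sum(unnormalized.values())
        return {w: p / total for w, p in unnormalized.items()}

    belief = initial
    for f in ["good"]:              # forecasts for days 1, 2, ... (extend as needed)
        belief = observe(time_elapse(belief), f)
    print(belief)   # {'sun': 0.7272..., 'rain': 0.2727...} = (8/11, 3/11)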



Viterbi Algorithm
In the Forward Algorithm, we used recursion to solve for P(XN |e1:N ), the probability distribution over states
the system could inhabit given the evidence variables observed so far. Another important question related to
Hidden Markov Models is: what is the most likely sequence of hidden states the system followed, given the
observed evidence variables so far? In other words, we would like to solve for argmax_{x1:N} P(x1:N |e1:N ) =
argmax_{x1:N} P(x1:N , e1:N ). This trajectory can also be solved for using dynamic programming with the
Viterbi algorithm.
The algorithm consists of two passes: the first runs forward in time and computes the probability of the best
path to each (state, time) tuple given the evidence observed so far. The second pass runs backwards in time:
first it finds the terminal state that lies on the path with the highest probability, and then traverses backward
through time along the path that leads into this state (which must be the best path).
To visualize the algorithm, consider a state trellis: a graph with one node per (state, timestep) pair and an
edge for each possible transition between consecutive timesteps.

In this HMM with two possible hidden states, sun or rain, we would like to compute the highest-probability
path (an assignment of a state to every timestep) from X1 to XN . The weight of an edge from Xt−1 to Xt is
equal to P(Xt |Xt−1 )P(Et |Xt ), and the probability of a path is computed by taking the product of its edge
weights. The first term in the weight formula represents how likely a particular transition is, and the second
term represents how well the observed evidence fits the resulting state.
Recall that:

P(X1:N , e1:N ) = P(X1 ) P(e1 |X1 ) ∏_{t=2}^{N} P(Xt |Xt−1 ) P(et |Xt )
The Forward Algorithm computes (up to normalization)

P(XN , e1:N ) = ∑_{x1 ,...,xN−1} P(XN , x1:N−1 , e1:N )

In the Viterbi Algorithm, we want to compute

argmax_{x1 ,...,xN} P(x1:N , e1:N )

to find the maximum likelihood estimate of the sequence of hidden states. Notice that each term in the
product is exactly the expression for the edge weight between layer t − 1 and layer t. So, the product of
weights along a path through the trellis gives us the joint probability of that path and the evidence, which
(for fixed evidence) is proportional to the probability of the path given the evidence.
We could solve for a joint probability table over all of the possible hidden states, but this results in an
exponential space cost. Given such a table, we could use dynamic programming to compute the best path in



polynomial time. However, because we can use dynamic programming to compute the best path, we don’t
necessarily need the whole table at any given time.
Define mt [xt ] = max_{x1:t−1} P(x1:t , e1:t ), the maximum probability, over all paths ending in a given xt at
time t, of that path occurring together with the evidence seen so far. This is the same as the weight of the
highest-weight path through the trellis from step 1 to t. Also note that

mt [xt ] = max_{x1:t−1} P(et |xt ) P(xt |xt−1 ) P(x1:t−1 , e1:t−1 )
         = P(et |xt ) max_{xt−1} P(xt |xt−1 ) max_{x1:t−2} P(x1:t−1 , e1:t−1 )
         = P(et |xt ) max_{xt−1} P(xt |xt−1 ) mt−1 [xt−1 ]

This suggests that we can compute mt for all t recursively via dynamic programming. This makes it possible
to determine the last state xN of the most likely path, but we still need a way to backtrack to reconstruct the
entire path. To that end, let’s define at [xt ] = argmax_{xt−1} P(xt |xt−1 ) mt−1 [xt−1 ] to keep track of the last
transition along the best path to xt (the factor P(et |xt ) may be dropped from the argmax, since it does not
depend on xt−1 ). We can now outline the algorithm.
Result: Most likely sequence of hidden states x*_{1:N}

/* Forward pass */
for t = 1 to N do
    for xt ∈ X do
        if t = 1 then
            mt [xt ] = P(xt ) P(e1 |xt )
        else
            at [xt ] = argmax_{xt−1} P(xt |xt−1 ) mt−1 [xt−1 ];
            mt [xt ] = P(et |xt ) P(xt |at [xt ]) mt−1 [at [xt ]];
        end
    end
end
/* Find the most likely path’s ending point */
x*_N = argmax_{xN} mN [xN ];
/* Work backwards through our most likely path and find the hidden states */
for t = N to 2 do
    x*_{t−1} = at [x*_t ];
end
Notice that our a arrays define a set of sequences, one for each possible ending state xN , each of which is
the most likely sequence arriving at that particular end state. Once we finish the forward pass, we compare
the likelihoods of these sequences, pick the best one, and reconstruct it in the backwards pass. We have thus
computed the most likely explanation for our evidence in polynomial space and time.
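
As a sanity check on the pseudocode, here is a minimal Python sketch of the Viterbi algorithm run on the
weather HMM from the forward algorithm section; the three-day evidence sequence and function name are
our own illustrative choices.

    # Viterbi: forward pass builds m_t and a_t, backward pass reconstructs
    # the most likely hidden-state sequence.
    states = ["sun", "rain"]
    initial = {"sun": 0.8, "rain": 0.2}                       # P(X1)
    transition = {("sun", "sun"): 0.6, ("rain", "sun"): 0.4,  # P(X_t | X_{t-1})
                  ("sun", "rain"): 0.1, ("rain", "rain"): 0.9}
    sensor = {("good", "sun"): 0.8, ("bad", "sun"): 0.2,      # P(E_t | X_t)
              ("good", "rain"): 0.3, ("bad", "rain"): 0.7}

    def viterbi(evidence):
        N = len(evidence)
        m = [None] * (N + 1)   # m[t][x]: probability of the best path to x at time t
        a = [None] * (N + 1)   # a[t][x]: best predecessor of x at time t
        m[1] = {x: initial[x] * sensor[(evidence[0], x)] for x in states}
        for t in range(2, N + 1):
            a[t], m[t] = {}, {}
            for x in states:
                a[t][x] = max(states, key=lambda xp: transition[(x, xp)] * m[t - 1][xp])
                m[t][x] = (sensor[(evidence[t - 1], x)]
                           * transition[(x, a[t][x])] * m[t - 1][a[t][x]])
        # Pick the best ending state, then follow the a pointers backwards.
        path = [max(states, key=lambda x: m[N][x])]
        for t in range(N, 1, -1):
            path.append(a[t][path[-1]])
        return list(reversed(path))

    print(viterbi(["good", "bad", "bad"]))   # ['sun', 'rain', 'rain']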

