cs188 Fa22 Note17
Markov Models
In previous notes, we talked about Bayes’ nets and how they are a wonderful structure used for compactly
representing relationships between random variables. We’ll now cover a closely related structure
called a Markov model, which for the purposes of this course can be thought of as analogous to a chain-
like, infinite-length Bayes’ net. The running example we’ll be working with in this section is the day-to-day
fluctuation in weather patterns. Our weather model will be time-dependent (as are Markov models in
general), meaning we’ll have a separate random variable for the weather on each day. If we define Wi as the
random variable representing the weather on day i, the Markov model for our weather example is a chain-structured
Bayes’ net with a node for each Wi and an arrow from each Wi to Wi+1:

[Figure: the Markov chain W0 → W1 → W2 → · · · → Wn → · · ·]
What information should we store about the random variables involved in our Markov model? To track
how our quantity under consideration (in this case, the weather) changes over time, we need to know both
its initial distribution at time t = 0 and some sort of transition model that characterizes the probability
of moving from one state to another between timesteps. The initial distribution of a Markov model is
enumerated by the probability table P(W0), and the transition model from timestep i to timestep
i + 1 is given by P(Wi+1|Wi). Note that this transition model implies that the value of Wi+1 depends
only on the value of Wi. In other words, the weather at time t = i + 1 satisfies the Markov
property, or memoryless property: given the weather at t = i, it is independent of the weather at all earlier timesteps.
Using our Markov model for weather, if we wanted to reconstruct the joint between W0, W1, and W2 using
the chain rule, we would want:

P(W0, W1, W2) = P(W0)P(W1|W0)P(W2|W0, W1)

However, with our assumption that the Markov property holds and W0 ⊥⊥ W2 | W1, the joint simplifies to:

P(W0, W1, W2) = P(W0)P(W1|W0)P(W2|W1)
And we have everything we need to calculate this from the Markov model. More generally, Markov models
make the following independence assumption at each timestep: Wi+1 ⊥⊥ {W0 , ...,Wi−1 }|Wi . This allows us
to reconstruct the joint distribution for the first n + 1 variables via the chain rule as follows:
P(W0, W1, ..., Wn) = P(W0)P(W1|W0)P(W2|W1) · · · P(Wn|Wn−1) = P(W0) ∏_{i=0}^{n−1} P(Wi+1|Wi)
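To make this concrete, here is a minimal Python sketch of this product, with the probability tables stored as nested dictionaries (an implementation choice, not something the model prescribes); the numbers are those of the weather example worked through below:

# Initial distribution P(W0) and transition model P(Wi+1 | Wi) for the
# weather example: states are "sun" and "rain".
P_init = {"sun": 0.8, "rain": 0.2}
P_trans = {"sun":  {"sun": 0.6, "rain": 0.4},
           "rain": {"sun": 0.1, "rain": 0.9}}

def joint_probability(weather):
    """P(W0 = w0, ..., Wn = wn) = P(w0) * prod over i of P(wi+1 | wi)."""
    p = P_init[weather[0]]
    for prev, curr in zip(weather, weather[1:]):
        p *= P_trans[prev][curr]
    return p

print(joint_probability(["sun", "sun", "rain"]))  # 0.8 * 0.6 * 0.4 = 0.192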
To compute the distribution of the weather at any single timestep, we can marginalize out the previous timestep:

P(Wi+1) = ∑_{wi} P(Wi+1, wi) = ∑_{wi} P(Wi+1|wi)P(wi)

This equation, known as the mini-forward algorithm, should make some intuitive sense: to compute the distribution of the weather at timestep
i + 1, we look at the probability distribution at timestep i given by P(Wi) and "advance" this model a timestep
with our transition model P(Wi+1|Wi). With this equation, we can iteratively compute the distribution of the
weather at any timestep of our choice by starting with our initial distribution P(W0) and using it to compute
P(W1), then in turn using P(W1) to compute P(W2), and so on. Let’s walk through an example, using the
following initial distribution and transition model:
W0      P(W0)
sun     0.8
rain    0.2

Wi      Wi+1    P(Wi+1|Wi)
sun     sun     0.6
sun     rain    0.4
rain    sun     0.1
rain    rain    0.9

Applying the mini-forward algorithm once:

P(W1 = sun) = P(W1 = sun|W0 = sun)P(W0 = sun) + P(W1 = sun|W0 = rain)P(W0 = rain)
            = 0.6 · 0.8 + 0.1 · 0.2 = 0.5
P(W1 = rain) = P(W1 = rain|W0 = sun)P(W0 = sun) + P(W1 = rain|W0 = rain)P(W0 = rain)
             = 0.4 · 0.8 + 0.9 · 0.2 = 0.5

Hence:

W1      P(W1)
sun     0.5
rain    0.5
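Each iteration of the mini-forward algorithm is just one weighted sum per state. A minimal Python sketch of the iteration, using the same tables as above:

P_trans = {"sun":  {"sun": 0.6, "rain": 0.4},
           "rain": {"sun": 0.1, "rain": 0.9}}

def mini_forward_step(P_W):
    """One timestep: P(Wi+1) = sum over wi of P(Wi+1 | wi) P(wi)."""
    return {w2: sum(P_trans[w][w2] * P_W[w] for w in P_W) for w2 in P_W}

P_W = {"sun": 0.8, "rain": 0.2}   # P(W0)
for i in range(1, 4):
    P_W = mini_forward_step(P_W)
    print(f"P(W{i}) = {P_W}")     # P(W1) = {'sun': 0.5, 'rain': 0.5}, ...

Running the loop longer shows the distribution settling toward a fixed point, which motivates the stationary distribution discussed next.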
Stationary Distribution
What happens to P(Wt) as t → ∞? To answer this question, we must compute the stationary distribution of the weather. As the
name suggests, the stationary distribution is one that remains the same after the passage of time, i.e.
P(Wt+1 ) = P(Wt )
We can compute these converged probabilities of being in a given state by combining the above equivalence
with the same equation used by the mini-forward algorithm:
P(Wt+1) = P(Wt) = ∑_{wt} P(Wt+1|wt)P(wt)
For our weather example, this gives us the following two equations:
P(Wt = sun) = P(Wt+1 = sun|Wt = sun)P(Wt = sun) + P(Wt+1 = sun|Wt = rain)P(Wt = rain)
= 0.6 · P(Wt = sun) + 0.1 · P(Wt = rain)
P(Wt = rain) = P(Wt+1 = rain|Wt = sun)P(Wt = sun) + P(Wt+1 = rain|Wt = rain)P(Wt = rain)
= 0.4 · P(Wt = sun) + 0.9 · P(Wt = rain)
We now have exactly what we need to solve for the stationary distribution, a system of 2 equations in 2
unknowns! We can get a third equation by using the fact that P(Wt) is a probability distribution and so must
sum to 1:

P(Wt = sun) + P(Wt = rain) = 1
Solving this system of equations yields P(Wt = sun) = 0.2 and P(Wt = rain) = 0.8. Hence the table for our
stationary distribution, which we’ll henceforth denote as P(W∞ ), is the following:
W∞      P(W∞)
sun     0.2
rain    0.8
To verify this result, let’s apply the transition model to the stationary distribution:
P(W∞+1 = sun) = P(W∞+1 = sun|W∞ = sun)P(W∞ = sun) + P(W∞+1 = sun|W∞ = rain)P(W∞ = rain)
= 0.6 · 0.2 + 0.1 · 0.8 = 0.2
P(W∞+1 = rain) = P(W∞+1 = rain|W∞ = sun)P(W∞ = sun) + P(W∞+1 = rain|W∞ = rain)P(W∞ = rain)
= 0.4 · 0.2 + 0.9 · 0.8 = 0.8
The distribution is unchanged by the transition model, as expected of the stationary distribution. In general, for a variable with k possible states, writing out the stationarity condition for each state (together with the normalization constraint) yields a system of k equations, which we can use to solve for the stationary distribution.
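Rather than solving the linear system by hand, we can also reach the stationary distribution numerically by applying the mini-forward update until the distribution stops changing; for a transition model like this one, repeated updates converge to the stationary distribution from any starting point. A minimal Python sketch:

P_trans = {"sun":  {"sun": 0.6, "rain": 0.4},
           "rain": {"sun": 0.1, "rain": 0.9}}

P_W = {"sun": 0.5, "rain": 0.5}   # any starting distribution works here

while True:
    # Mini-forward update: P(Wt+1) = sum over wt of P(Wt+1 | wt) P(wt)
    P_next = {w2: sum(P_trans[w][w2] * P_W[w] for w in P_W) for w2 in P_W}
    if all(abs(P_next[w] - P_W[w]) < 1e-12 for w in P_W):
        break                      # converged: P(Wt+1) = P(Wt)
    P_W = P_next

print(P_W)                         # ≈ {'sun': 0.2, 'rain': 0.8}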
Hidden Markov Models

A Hidden Markov Model (HMM) extends the Markov model by letting us observe some evidence at each timestep; in our weather example, this evidence might be a weather forecast Fi received on each day i. Unlike vanilla Markov models, we now have two different types of nodes. To make this distinction, we’ll
call each Wi a state variable and each weather forecast Fi an evidence variable. Since Wi encodes our belief
of the probability distribution for the weather on day i, it should be a natural result that the weather forecast
for day i is conditionally dependent upon this belief. The model implies similar conditional independence
relationships as standard Markov models, with an additional set of relationships for the evidence variables:

F1 ⊥⊥ W0 | W1
∀ i = 2, ..., n:  Wi ⊥⊥ {W0, ..., Wi−2, F1, ..., Fi−1} | Wi−1
∀ i = 2, ..., n:  Fi ⊥⊥ {W0, ..., Wi−1, F1, ..., Fi−1} | Wi
Just like Markov models, Hidden Markov Models make the assumption that the transition model P(Wi+1 |Wi )
is stationary. Hidden Markov Models make the additional simplifying assumption that the sensor model
P(Fi |Wi ) is stationary as well. Hence any Hidden Markov Model can be represented compactly with just
three probability tables: the initial distribution, the transition model, and the sensor model.
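Concretely, the weather HMM used in the forward-algorithm example below can be written down as three small Python tables (the probabilities for a "bad" forecast are simply the complements of those for a "good" one):

# Initial distribution P(W0)
P_init = {"sun": 0.8, "rain": 0.2}
# Transition model P(Wi+1 | Wi), the same at every timestep
P_trans = {"sun":  {"sun": 0.6, "rain": 0.4},
           "rain": {"sun": 0.1, "rain": 0.9}}
# Sensor model P(Fi | Wi), also the same at every timestep
P_sensor = {"sun":  {"good": 0.8, "bad": 0.2},
            "rain": {"good": 0.3, "bad": 0.7}}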
As a final point on notation, we’ll define the belief distribution at time i with all evidence F1 , . . . , Fi observed
up to date:
B(Wi ) = P(Wi | f1 , . . . , fi )
Similarly, we’ll define B′ (Wi ) as the belief distribution at time i with evidence f1 , . . . , fi−1 observed:
B′ (Wi ) = P(Wi | f1 , . . . , fi−1 )
Defining ei as evidence observed at timestep i, you might sometimes see the aggregated evidence from
timesteps 1 ≤ i ≤ t reexpressed in the following form:
e1:t = e1 , . . . , et
Under this notation, P(Wi | f1 , . . . , fi−1 ) can be written as P(Wi | f1:(i−1) ). This notation will become relevant
in the upcoming sections, where we’ll discuss time elapse updates that iteratively incorporate new evidence
into our weather model.
Forward Algorithm

Let’s first derive the time elapse update, which relates B(Wi) to B′(Wi+1). Marginalizing over Wi (with the conditioning on f1, ..., fi carried along) gives

B′(Wi+1) = P(Wi+1| f1, ..., fi) = ∑_{wi} P(Wi+1|wi, f1, ..., fi)P(wi| f1, ..., fi)

Noting that P(wi| f1, ..., fi) is simply B(wi) and that Wi+1 ⊥⊥ { f1, ..., fi}|Wi, this simplifies to our final
relationship between B(Wi) and B′(Wi+1):

B′(Wi+1) = ∑_{wi} P(Wi+1|wi)B(wi)
Now let’s consider how we can derive a relationship between B′ (Wi+1 ) and B(Wi+1 ). By application of the
definition of conditional probability (with extra conditioning), we can see that
B(Wi+1) = P(Wi+1| f1, ..., fi+1) = P(Wi+1, fi+1| f1, ..., fi) / P( fi+1| f1, ..., fi)
When dealing with conditional probabilities, a commonly used trick is to delay normalization until we require
the normalized probabilities, a trick we’ll now employ. More specifically, since the denominator in the
above expansion of B(Wi+1 ) is common to every term in the probability table represented by B(Wi+1 ), we
can omit actually dividing by P( fi+1 | f1 , . . . , fi ). Instead, we can simply note that B(Wi+1 ) is proportional to
P(Wi+1 , fi+1 | f1 , . . . , fi ):
B(Wi+1 ) ∝ P(Wi+1 , fi+1 | f1 , . . . , fi )
with a constant of proportionality equal to P( fi+1 | f1 , . . . , fi ). Whenever we decide we want to recover the
belief distribution B(Wi+1 ), we can divide each computed value by this constant of proportionality. Now,
using the chain rule we can observe the following:
B(Wi+1 ) ∝ P(Wi+1 , fi+1 | f1 , . . . , fi ) = P( fi+1 |Wi+1 , f1 , . . . , fi )P(Wi+1 | f1 , . . . , fi )
By the conditional independence assumptions associated with Hidden Markov Models stated previously,
P( fi+1|Wi+1, f1, ..., fi) is equivalent to simply P( fi+1|Wi+1), and by definition P(Wi+1| f1, ..., fi) = B′(Wi+1).
This allows us to express the relationship between B′(Wi+1) and B(Wi+1) in its final form:

B(Wi+1) ∝ P( fi+1|Wi+1)B′(Wi+1)
The forward algorithm can be thought of as consisting of two distinct steps: the time elapse update,
which corresponds to determining B′(Wi+1) from B(Wi), and the observation update, which corresponds to
determining B(Wi+1) from B′(Wi+1). Hence, in order to advance our belief distribution by one timestep
(i.e. compute B(Wi+1 ) from B(Wi )), we must first advance our model’s state by one timestep with the time
elapse update, then incorporate new evidence from that timestep with the observation update. Consider the
following initial distribution, transition model, and sensor model:

W0      P(W0)
sun     0.8
rain    0.2

Wi      Wi+1    P(Wi+1|Wi)
sun     sun     0.6
sun     rain    0.4
rain    sun     0.1
rain    rain    0.9

Wi      Fi      P(Fi|Wi)
sun     good    0.8
sun     bad     0.2
rain    good    0.3
rain    bad     0.7

Starting from B(W0) = P(W0), the time elapse update gives

B′(W1 = sun) = P(W1 = sun|W0 = sun)B(W0 = sun) + P(W1 = sun|W0 = rain)B(W0 = rain)
             = 0.6 · 0.8 + 0.1 · 0.2 = 0.5
B′(W1 = rain) = P(W1 = rain|W0 = sun)B(W0 = sun) + P(W1 = rain|W0 = rain)B(W0 = rain)
              = 0.4 · 0.8 + 0.9 · 0.2 = 0.5

Hence:

W1      B′(W1)
sun     0.5
rain    0.5
Next, we’ll assume that the weather forecast for day 1 was good (i.e. F1 = good), and perform an observation
update to get B(W1 ):
B(W1 = sun) ∝ P(F1 = good|W1 = sun)B′ (W1 = sun) = 0.8 · 0.5 = 0.4
B(W1 = rain) ∝ P(F1 = good|W1 = rain)B′ (W1 = rain) = 0.3 · 0.5 = 0.15
The last step is to normalize B(W1), noting that the entries in the table for B(W1) sum to 0.4 + 0.15 = 0.55:

B(W1 = sun) = 0.4/0.55 = 8/11
B(W1 = rain) = 0.15/0.55 = 3/11
Our final table for B(W1) is thus the following:

W1      B(W1)
sun     8/11
rain    3/11
Note the result of observing the weather forecast. Because the weatherman predicted good weather, our
belief that it would be sunny increased from 1/2 after the time elapse update to 8/11 after the observation update.
As a parting note, the normalization trick discussed above can actually simplify computation significantly
when working with Hidden Markov Models. If we began with some initial distribution and were interested
in computing the belief distribution at time t, we could use the forward algorithm to iteratively compute
B(W1 ), . . . , B(Wt ) and normalize only once at the end by dividing each entry in the table for B(Wt ) by the
sum of its entries.
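Putting the two updates together with this delayed-normalization trick yields a minimal Python sketch of the forward algorithm (reusing the three tables defined earlier); running it on the single observation F1 = good reproduces B(W1) = (8/11, 3/11):

P_init = {"sun": 0.8, "rain": 0.2}
P_trans = {"sun":  {"sun": 0.6, "rain": 0.4},
           "rain": {"sun": 0.1, "rain": 0.9}}
P_sensor = {"sun":  {"good": 0.8, "bad": 0.2},
            "rain": {"good": 0.3, "bad": 0.7}}

def forward(observations):
    """Compute B(Wt) given evidence f1, ..., ft, normalizing once at the end."""
    B = dict(P_init)                 # B(W0) = P(W0)
    for f in observations:
        # Time elapse update: B'(Wi+1) = sum over wi of P(Wi+1 | wi) B(wi)
        Bp = {w2: sum(P_trans[w][w2] * B[w] for w in B) for w2 in B}
        # Observation update, unnormalized: B(Wi+1) ∝ P(fi+1 | Wi+1) B'(Wi+1)
        B = {w: P_sensor[w][f] * Bp[w] for w in Bp}
    total = sum(B.values())          # normalize only once, at the very end
    return {w: p / total for w, p in B.items()}

print(forward(["good"]))   # {'sun': 0.727..., 'rain': 0.272...}, i.e. 8/11 and 3/11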
Viterbi Algorithm

In this HMM with two possible hidden states, sun or rain, we would like to compute the highest probability
path (assignment of a state for every timestep) from X1 to XN. The search can be visualized on a trellis: a
graph with one layer per timestep, one node per state in each layer, and edges connecting consecutive layers.
The weight on an edge from Xt−1 to Xt is equal to P(Xt|Xt−1)P(Et|Xt), and the probability of a path is
computed by taking the product of its edge weights. The first term in the weight formula represents how
likely a particular transition is, and the second term represents how well the observed evidence fits the
resulting state.
Recall that:

P(X1:N, e1:N) = P(X1)P(e1|X1) ∏_{t=2}^{N} P(Xt|Xt−1)P(et|Xt)
The Forward Algorithm computes (up to normalization)

P(XN, e1:N) = ∑_{x1:N−1} P(XN, x1:N−1, e1:N)

The Viterbi algorithm instead computes

x*1:N = arg max_{x1:N} P(x1:N, e1:N)

to find the maximum likelihood estimate of the sequence of hidden states. Notice that each term in the
product is exactly the expression for the edge weight between layer t − 1 and layer t. So, the product of
weights along a path through the trellis gives us the probability of the path given the evidence.
We could solve for a joint probability table over all of the possible hidden states, but this results in an
exponential space cost. Given such a table, we could use dynamic programming to compute the best path in
polynomial time. Instead, we define

mt[xt] = max_{x1:t−1} P(x1:t, e1:t)

the highest probability of any length-t state sequence ending in xt that is consistent with the evidence e1:t.
Expanding the joint one timestep at a time yields the recurrence

mt[xt] = P(et|xt) max_{xt−1} P(xt|xt−1)mt−1[xt−1]

This suggests that we can compute mt for all t recursively via dynamic programming. This makes it possible
to determine the last state xN for the most likely path, but we still need a way to backtrack to reconstruct the
entire path. Let’s define

at[xt] = arg max_{xt−1} P(et|xt)P(xt|xt−1)mt−1[xt−1] = arg max_{xt−1} P(xt|xt−1)mt−1[xt−1]

to keep track of the last transition along the best path to xt. We can now outline the algorithm.
Result: Most likely sequence of hidden states x*1:N
/* Forward pass */
for t = 1 to N do
    for xt ∈ X do
        if t = 1 then
            mt[xt] = P(xt)P(e1|xt);
        else
            at[xt] = arg max_{xt−1} P(xt|xt−1)mt−1[xt−1];
            mt[xt] = P(et|xt)P(xt|at[xt])mt−1[at[xt]];
        end
    end
end
/* Find the most likely path’s ending point */
x*N = arg max_{xN} mN[xN];
/* Work backwards through our most likely path and find the hidden states */
for t = N to 2 do
    x*t−1 = at[x*t];
end
Notice that our a arrays define a set of sequences, one ending at each possible end state xN; each is the most
likely sequence reaching that particular end state. Once we finish the forward pass, we compare the
likelihoods mN[xN] of these sequences, pick the best one, and reconstruct it in the backwards pass. We have
thus computed the most likely explanation for our evidence in polynomial space and time.
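As a sketch, here is the same algorithm in Python; the weather HMM tables from earlier stand in for whatever transition and sensor models are at hand, and the observation sequence is just an illustrative choice:

P_init = {"sun": 0.8, "rain": 0.2}
P_trans = {"sun":  {"sun": 0.6, "rain": 0.4},
           "rain": {"sun": 0.1, "rain": 0.9}}
P_sensor = {"sun":  {"good": 0.8, "bad": 0.2},
            "rain": {"good": 0.3, "bad": 0.7}}

def viterbi(observations):
    """Most likely hidden state sequence x*1:N given evidence e1:N."""
    states = list(P_init)
    # Forward pass, t = 1: m1[x1] = P(x1) P(e1 | x1)
    m = {x: P_init[x] * P_sensor[x][observations[0]] for x in states}
    backpointers = []                # backpointers[t-2] plays the role of a_t
    # Forward pass, t = 2 .. N
    for e in observations[1:]:
        # a_t[x_t]: best predecessor of x_t (P(e_t|x_t) drops out of the argmax)
        a = {x: max(states, key=lambda xp: P_trans[xp][x] * m[xp]) for x in states}
        # m_t[x_t] = P(e_t|x_t) P(x_t|a_t[x_t]) m_{t-1}[a_t[x_t]]
        m = {x: P_sensor[x][e] * P_trans[a[x]][x] * m[a[x]] for x in states}
        backpointers.append(a)
    # Pick the most likely ending point, then follow backpointers in reverse
    x = max(states, key=lambda s: m[s])
    path = [x]
    for a in reversed(backpointers):
        x = a[x]
        path.append(x)
    return path[::-1]

print(viterbi(["good", "bad", "bad"]))   # ['sun', 'rain', 'rain']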