11 Probabilistic Temporal Models
AIMA Chapter 15
[Markov chain: X0 → X1 → X2 → X3, with prior P(X0) and transitions P(Xt | Xt-1)]
▪ The transition model P(Xt | Xt-1) specifies how the state evolves over time
▪ Stationarity assumption: same transition probabilities at all time steps
▪ Joint distribution: P(X0, …, XT) = P(X0) ∏t=1:T P(Xt | Xt-1)
Quiz: are Markov models a special case of Bayes nets?
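As a minimal sketch, the joint distribution above is just a product of the prior and the transition probabilities. The model values below are the sun/rain example that appears later in these slides (uniform prior, sun/rain transition table):

```python
# Joint probability of a state sequence under a stationary Markov chain.
P0 = {"sun": 0.5, "rain": 0.5}              # prior P(X0), assumed uniform
T = {"sun": {"sun": 0.9, "rain": 0.1},
     "rain": {"sun": 0.3, "rain": 0.7}}     # transition model P(Xt | Xt-1)

def joint(seq):
    """P(X0, ..., XT) = P(X0) * product over t of P(Xt | Xt-1)."""
    p = P0[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= T[prev][cur]
    return p

print(joint(["sun", "sun", "rain"]))  # 0.5 * 0.9 * 0.1 = 0.045
```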
[Mini-forward update diagram: P(xt) = Σxt-1 P(xt | xt-1) P(xt-1), i.e., transition model × probability from previous iteration, over the chain X0 → X1 → X2 → X3 → …]
Stationary Distributions
Xt-1 Xt P(Xt|Xt-1)
sun sun 0.9
sun rain 0.1
rain sun 0.3
rain rain 0.7
Also: a stationary distribution P∞ satisfies the fixed-point equation P∞(X) = Σx P(X | x) P∞(x)
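As a quick sketch, the stationary distribution of the sun/rain chain above can be found by repeatedly applying the transition model until the distribution stops changing (power iteration):

```python
# Find the stationary distribution of the sun/rain chain by power iteration.
T = {"sun": {"sun": 0.9, "rain": 0.1},
     "rain": {"sun": 0.3, "rain": 0.7}}

p = {"sun": 0.5, "rain": 0.5}  # any initial distribution works
for _ in range(100):
    # P(Xt) = sum over x of P(Xt | Xt-1 = x) * P(Xt-1 = x)
    p = {s: sum(T[x][s] * p[x] for x in p) for s in T}

print(p)  # converges to P(sun) = 0.75, P(rain) = 0.25
```

One can check the fixed point by hand: 0.75 = 0.9 · 0.75 + 0.3 · 0.25.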
Application of Stationary Distribution: Web Link Analysis
▪ Web browsing
▪ Each web page is a state
▪ Initial distribution: uniform over pages
▪ Transitions:
▪ With prob. c, uniform jump to a random page
▪ With prob. 1-c, follow a random outlink
▪ Stationary distribution: PageRank
▪ Will spend more time on highly reachable pages
▪ Google 1.0 returned the set of pages containing all
your keywords in decreasing rank
▪ Now: use link analysis along with many other factors
(rank actually getting less important)
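The random-surfer chain above can be sketched in a few lines. The tiny three-page link graph is a made-up example; the structure (probability c of a uniform jump, otherwise a random outlink) follows the bullets:

```python
# PageRank as the stationary distribution of the random-surfer Markov chain.
# The 3-page link graph is a hypothetical example for illustration.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
c = 0.15  # probability of a uniform random jump

rank = {p: 1 / len(pages) for p in pages}  # initial distribution: uniform over pages
for _ in range(100):
    new = {p: c / len(pages) for p in pages}        # uniform-jump mass
    for p in pages:
        for q in links[p]:                          # follow a random outlink of p
            new[q] += (1 - c) * rank[p] / len(links[p])
    rank = new

print(rank)  # more mass lands on highly reachable pages
```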
Application of Stationary Distributions: Gibbs Sampling
[HMM structure: hidden chain X0 → X1 → X2 → …, with an emission Et observed from each Xt]
Example: Weather HMM
▪ An HMM is defined by:
▪ Initial distribution: P(X0)
▪ Transition model: P(Xt | Xt-1)
▪ Emission model: P(Et | Xt)

Transition model P(Wt | Wt-1):
  Wt-1 = sun:  sun 0.9, rain 0.1
  Wt-1 = rain: sun 0.3, rain 0.7

Emission model P(Ut | Wt):
  Wt = sun:  true 0.2, false 0.8
  Wt = rain: true 0.9, false 0.1

[Diagram: Weather(t-1) → Weather(t) → Weather(t+1)]
Real HMM Examples
▪ Speech recognition HMMs:
▪ Observations are acoustic signals (continuous valued)
▪ States are specific positions in specific words (so, tens of thousands)
▪ Machine translation HMMs:
▪ Observations are words (tens of thousands)
▪ States are translation options
▪ Robot tracking:
▪ Observations are range readings (continuous)
▪ States are positions on a map (continuous)
▪ Molecular biology:
▪ Observations are nucleotides ACGT
▪ States are coding/non-coding/start/stop/splice-site etc.
Inference tasks
(α denotes the normalization constant: α = 1 / P(et+1 | e1:t))
Filtering
▪ Filtering: infer current state given all evidence
▪ Aim: a recursive filtering algorithm of the form
  P(Xt+1 | e1:t+1) = g(et+1, P(Xt | e1:t))
[Diagram: Xt → Xt+1, with evidence Et+1 emitted by Xt+1]
Forward algorithm
▪ P(Xt+1 | e1:t+1) = α P(et+1 | Xt+1) Σxt P(Xt+1 | xt) P(xt | e1:t)
Example: Weather HMM (forward algorithm)
[Diagram: Weather0 → Weather1 → …, with Umbrella1, Umbrella2 observed]

Initial distribution P(W0): sun 0.5, rain 0.5
Transition model P(Wt | Wt-1): sun → sun 0.9, rain 0.1; rain → sun 0.3, rain 0.7
Emission model P(Ut | Wt): sun → true 0.2, false 0.8; rain → true 0.9, false 0.1

f(sun) = 0.5, f(rain) = 0.5
→ predict: (sun 0.6, rain 0.4), then update on U1 = true & normalize: f(sun) = 0.25, f(rain) = 0.75
→ predict: (sun 0.45, rain 0.55), then update on U2 = true & normalize: f(sun) = 0.154, f(rain) = 0.846
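The predict / update & normalize steps above can be sketched directly, reproducing the f values on this slide:

```python
# Forward algorithm (predict, then update & normalize) on the umbrella HMM.
T = {"sun": {"sun": 0.9, "rain": 0.1},
     "rain": {"sun": 0.3, "rain": 0.7}}    # P(Wt | Wt-1)
E = {"sun": {True: 0.2, False: 0.8},
     "rain": {True: 0.9, False: 0.1}}      # P(Ut | Wt)

def forward_step(f, u):
    # predict: push the belief through the transition model
    pred = {s: sum(T[x][s] * f[x] for x in f) for s in T}
    # update: weight each state by the emission probability of observation u
    upd = {s: E[s][u] * pred[s] for s in pred}
    z = sum(upd.values())                  # normalize (z = 1/alpha)
    return {s: v / z for s, v in upd.items()}

f = {"sun": 0.5, "rain": 0.5}              # P(W0)
for u in (True, True):                     # U1 = true, U2 = true
    f = forward_step(f, u)
    print(f)  # step 1: sun 0.25, rain 0.75; step 2: sun ~0.154, rain ~0.846
```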
[State trellis: X0 → X1 → … → XT]

• Forward algorithm computes a sum over all possible paths:
  P(xt+1 | e1:t+1) = Σx0:t P(x0:t+1 | e1:t+1)
• It uses dynamic programming to sum over all paths
• For each state at time t, keep track of the total probability of all paths to it:
  f1:t+1 = FORWARD(f1:t, et+1) = α P(et+1 | Xt+1) Σxt P(Xt+1 | xt) f1:t[xt]
Most Likely Explanation
Inference tasks
▪ Filtering: P(Xt|e1:t)
▪ belief state—input to the decision process of a rational agent
▪ Prediction: P(Xt+k|e1:t) for k > 0
▪ evaluation of possible action sequences; like filtering without the evidence
▪ Smoothing: P(Xk|e1:t) for 0 ≤ k < t
▪ better estimate of past states, essential for learning
▪ Most likely explanation: arg maxx0:t P(x0:t | e1:t)
▪ speech recognition, decoding with a noisy channel
Most likely explanation = most probable path
▪ State trellis: graph of states and transitions over time
[Trellis diagram: sun/rain states at each time step, X0 → X1 → … → XT]

Viterbi Algorithm (max): for each state at time t, keep track of the (unnormalized) maximum probability of any path to it:
  m1:t+1(xt+1) = maxx1:t P(x1:t+1 | e1:t+1)

Forward Algorithm (sum): for each state at time t, keep track of the total probability of all paths to it:
  f1:t+1(xt+1) = P(xt+1 | e1:t+1) = Σx1:t P(x1:t+1 | e1:t+1)
[Viterbi example: chain X0 → X1 → X2 → X3 with evidence U1 = true, U2 = false, U3 = true; emission model P(U | W): sun → true 0.2, false 0.8; rain → true 0.9, false 0.1]
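A minimal Viterbi sketch for this example, using the transition model from the earlier weather slides; the prior over X0 is assumed uniform:

```python
# Viterbi: max-product dynamic programming over the state trellis.
T = {"sun": {"sun": 0.9, "rain": 0.1},
     "rain": {"sun": 0.3, "rain": 0.7}}    # P(Wt | Wt-1)
E = {"sun": {True: 0.2, False: 0.8},
     "rain": {True: 0.9, False: 0.1}}      # P(Ut | Wt)

def viterbi(evidence, prior):
    m = dict(prior)   # m[x] = max (unnormalized) prob of any path ending in x
    back = []         # back[t][s] = best predecessor of state s at step t
    for u in evidence:
        bp, new = {}, {}
        for s in T:
            prev = max(m, key=lambda x: m[x] * T[x][s])  # best predecessor
            bp[s] = prev
            new[s] = E[s][u] * m[prev] * T[prev][s]
        back.append(bp)
        m = new
    # backtrack from the best final state
    path = [max(m, key=m.get)]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    path.reverse()
    return path[1:]   # the most likely x1:t (drop the x0 slot)

print(viterbi([True, False, True], {"sun": 0.5, "rain": 0.5}))
# ['rain', 'rain', 'rain']
```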
[Unrolled DBN over three time steps: variables G1a, G1b, G2a, G2b, G3a, G3b]
▪ Can we do better?
▪ Do we need to unroll for many steps? What is the best variable elimination order?
▪ Online: unroll as we go, eliminate all variables from the previous time step
▪ A generalization of the Forward algorithm
Particle Filtering
Large state space
▪ When |X| is huge (e.g., position in a building), exact inference becomes
infeasible
▪ Can we use approximate inference, e.g., likelihood weighting?
▪ Evidence variables are "downstream": states are sampled while ignoring the evidence
▪ So as more time steps are sampled, the weights drop quickly (samples drift into low-probability regions)
▪ Hence: too few "reasonable" samples
Particle Filtering
▪ Initialization
▪ sample N particles from the initial distribution P(X0)
▪ All particles have a weight of 1
Particles (N = 10, grid positions):
(3,3) (2,3) (3,3) (3,2) (3,3) (3,2) (1,2) (3,3) (3,3) (2,3)
Particle Filtering: Propagate forward
▪ The particle set approximates the belief state P(Xt | e1:t)
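A compact particle-filter sketch, using the umbrella HMM as a stand-in for the grid example. The slides above show initialization; the propagate, weight-by-emission, and resample steps shown here are the standard ones:

```python
import random

# Particle filter on the sun/rain umbrella HMM: propagate each particle
# through the transition model, weight by the emission probability of the
# observation, then resample in proportion to the weights.
T = {"sun": {"sun": 0.9, "rain": 0.1},
     "rain": {"sun": 0.3, "rain": 0.7}}
E = {"sun": {True: 0.2, False: 0.8},
     "rain": {True: 0.9, False: 0.1}}

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def particle_filter(evidence, n=1000):
    particles = [sample({"sun": 0.5, "rain": 0.5}) for _ in range(n)]  # init from P(X0)
    for u in evidence:
        particles = [sample(T[p]) for p in particles]                 # propagate forward
        weights = [E[p][u] for p in particles]                        # weight by P(u | x)
        particles = random.choices(particles, weights=weights, k=n)   # resample
    # belief estimate: fraction of particles in each state
    return {s: particles.count(s) / n for s in T}

random.seed(0)
print(particle_filter([True, True]))  # approximates the exact filter: sun ~0.154, rain ~0.846
```

With enough particles this converges to the exact forward-algorithm answer computed earlier.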