11 Probabilistic Temporal Models

Midterm

▪ Request regrading on Gradescope by this Saturday


Announcement
▪ Homework 4
▪ Due: Nov. 18, 11:59pm
▪ Programming Assignment 4
▪ Due: Nov. 25, 11:59pm
Probabilistic Reasoning over Time

AIMA Chapter 15

[Adapted from slides by Dan Klein and Pieter Abbeel at UC Berkeley]


Uncertainty and Time

▪ Often, we want to reason about a sequence of observations


▪ Speech recognition
▪ Robot localization
▪ Medical monitoring
▪ User attention

▪ Need to introduce time into our models


Markov Models
Markov Models (aka Markov chain/process)
▪ Assume discrete variables that share the same finite domain
▪ Values in the domain are called states

X0 X1 X2 X3
P(X0) P(Xt | Xt-1)

▪ The transition model P(Xt | Xt-1) specifies how the state evolves
over time
▪ Stationarity assumption: same transition probabilities at all time
steps
▪ Joint distribution: P(X0, …, XT) = P(X0) ∏t P(Xt | Xt-1)
Quiz: are Markov models a special case of Bayes nets?

X0 X1 X2 X3

▪ Yes and no!


▪ Yes:
▪ Directed acyclic graph, joint = product of conditionals
▪ No:
▪ Infinitely many variables (unless we truncate)
▪ Repetition of transition model not part of standard Bayes net syntax
Markov Assumption: Conditional Independence

▪ Markov assumption: Xt+1, … are independent of X0, …, Xt-1 given Xt
▪ Past and future independent given the present
▪ Each time step only depends on the previous
▪ This is a first-order Markov model
▪ A kth-order model allows dependencies on k earlier steps
Example: Weather
▪ States {rain, sun}

▪ Initial distribution P(X0)


P(X0)
sun rain
0.5 0.5
Two new ways of representing the same CPT
▪ Transition model P(Xt | Xt-1)

As a conditional probability table:
Xt-1    P(Xt = sun | Xt-1)   P(Xt = rain | Xt-1)
sun     0.9                  0.1
rain    0.3                  0.7

As a state-transition diagram:
sun → sun 0.9,  sun → rain 0.1,  rain → sun 0.3,  rain → rain 0.7
Weather prediction
▪ Time 0: <0.5, 0.5>

Transition model P(Xt | Xt-1):
Xt-1    sun    rain
sun     0.9    0.1
rain    0.3    0.7

▪ What is the weather like at time 1?


▪ P(X1) = x0 P(X1,X0=x0)
▪ = x0 P(X0=x0) P(X1| X0=x0)
▪ = 0.5<0.9,0.1> + 0.5<0.3,0.7> = <0.6,0.4>
Weather prediction, contd.
▪ Time 1: <0.6, 0.4>

Transition model P(Xt | Xt-1):
Xt-1    sun    rain
sun     0.9    0.1
rain    0.3    0.7

▪ What is the weather like at time 2?


▪ P(X2) = x1 P(X2,X1=x1)
▪ = x1 P(X1=x1) P(X2| X1=x1)
▪ = 0.6<0.9,0.1> + 0.4<0.3,0.7> = <0.66,0.34>
Weather prediction, contd.
▪ Time 2: <0.66, 0.34>

Transition model P(Xt | Xt-1):
Xt-1    sun    rain
sun     0.9    0.1
rain    0.3    0.7

▪ What is the weather like at time 3?


▪ P(X3) = x2 P(X3,X2=x2)
▪ = x2 P(X2=x2) P(X3| X2=x2)
▪ = 0.66<0.9,0.1> + 0.34<0.3,0.7> = <0.696,0.304>
Forward algorithm (simple form)
▪ What is the state at time t (given an initial distribution P(X0))?
▪ P(Xt) = xt-1 P(Xt,Xt-1=xt-1)
▪ = xt-1 P(Xt-1=xt-1) P(Xt| Xt-1=xt-1)

Probability from
Transition model
previous iteration

▪ Iterate this update starting at t=0
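
As a concrete illustration, here is a minimal Python sketch of this update; the transition table is the sun/rain weather model used on the following slides, and all names are my own:

```python
# Mini-forward update: P(X_t) = sum over x_{t-1} of P(X_{t-1} = x_{t-1}) * P(X_t | X_{t-1} = x_{t-1})
STATES = ["sun", "rain"]
TRANSITION = {  # TRANSITION[prev][nxt] = P(X_t = nxt | X_{t-1} = prev)
    "sun":  {"sun": 0.9, "rain": 0.1},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def predict(belief):
    """Push a distribution over X_{t-1} through the transition model to get P(X_t)."""
    return {s: sum(belief[p] * TRANSITION[p][s] for p in STATES) for s in STATES}

belief = {"sun": 0.5, "rain": 0.5}  # P(X_0)
for t in range(1, 4):
    belief = predict(belief)
    print(t, belief)  # t=1: <0.6, 0.4>, t=2: <0.66, 0.34>, t=3: <0.696, 0.304>
```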


Example Run of Mini-Forward Algorithm
Transition model:
Xt-1    Xt      P(Xt|Xt-1)
sun     sun     0.9
sun     rain    0.1
rain    sun     0.3
rain    rain    0.7

▪ From initial observation of sun:   P(X0), P(X1), P(X2), P(X3), …, P(X∞)
▪ From initial observation of rain:  P(X0), P(X1), P(X2), P(X3), …, P(X∞)
▪ From yet another initial distribution P(X0):   P(X0), …, P(X∞)

(The plotted distributions converge to the same limit regardless of the initial distribution.)
Stationary Distributions

▪ For most chains:
▪ Influence of the initial distribution gets less and less over time
▪ The distribution we end up in is independent of the initial distribution
▪ Stationary distribution:
▪ The distribution we end up with is called the stationary distribution P∞ of the chain
▪ It satisfies P∞(X) = Σx P(X | x) P∞(x)
Example: Stationary Distributions
▪ Computing the stationary distribution

X0 X1 X2 X3

Transition model:
Xt-1    Xt      P(Xt|Xt-1)
sun     sun     0.9
sun     rain    0.1
rain    sun     0.3
rain    rain    0.7

Also: the stationary probabilities must sum to 1 (see the worked computation below).
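
The computation itself was shown graphically on the slide; a short worked version of it, writing P∞ for the stationary distribution, is:
P∞(sun) = 0.9 P∞(sun) + 0.3 P∞(rain)
P∞(rain) = 0.1 P∞(sun) + 0.7 P∞(rain)
Together with P∞(sun) + P∞(rain) = 1, this gives P∞(sun) = 0.75 and P∞(rain) = 0.25.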
Application of Stationary Distribution: Web Link Analysis

▪ Web browsing
▪ Each web page is a state
▪ Initial distribution: uniform over pages
▪ Transitions:
▪ With prob. c, uniform jump to a random page
▪ With prob. 1-c, follow a random outlink
▪ Stationary distribution: PageRank
▪ Will spend more time on highly reachable pages
▪ Google 1.0 returned the set of pages containing all
your keywords in decreasing rank
▪ Now: use link analysis along with many other factors
(rank actually getting less important)
Application of Stationary Distributions: Gibbs Sampling

▪ Each joint instantiation over all hidden and


query variables is a state: {X1, …, Xn} = H ∪ Q
▪ Transitions:
▪ Pick a variable and resample its value conditioned
on its Markov blanket
▪ Stationary distribution:
▪ Conditional distribution P(X1, X2 , … , Xn|e1, …, em)
▪ When running Gibbs sampling long enough, we
get a sample from the desired distribution
Hidden Markov Models
Hidden Markov Models
▪ Usually the true state is not observed directly
▪ E.g., you stay indoors and cannot see the weather,
but you can see whether people come in with umbrellas.
▪ Hidden Markov models (HMMs)
▪ Underlying Markov chain over states X
▪ You observe evidence E at each time step

X0 X1 X2 X3 …

E1 E2 E3 …
Example: Weather HMM
▪ An HMM is defined by:
▪ Initial distribution: P(X0)
▪ Transition model: P(Xt | Xt-1)
▪ Emission model: P(Et | Xt)

Transition model P(Wt | Wt-1):
Wt-1    sun    rain
sun     0.9    0.1
rain    0.3    0.7

Emission model P(Ut | Wt):
Wt      true   false
sun     0.2    0.8
rain    0.9    0.1

Weathert-1 → Weathert → Weathert+1
Umbrellat-1   Umbrellat   Umbrellat+1

HMM as probability model
▪ Joint distribution for Markov model: P(X0, …, XT) = P(X0) ∏t=1:T P(Xt | Xt-1)
▪ Joint distribution for hidden Markov model:
P(X0, X1, E1, …, XT, ET) = P(X0) ∏t=1:T P(Xt | Xt-1) P(Et | Xt)
▪ Independence in HMM
▪ Future states are independent of the past given the present
▪ Current evidence is independent of everything else given the current state

X0 X1 X2 X3 …

E1 E2 E3 …
Real HMM Examples
▪ Speech recognition HMMs:
▪ Observations are acoustic signals (continuous valued)
▪ States are specific positions in specific words (so, tens of thousands)
▪ Machine translation HMMs:
▪ Observations are words (tens of thousands)
▪ States are translation options
▪ Robot tracking:
▪ Observations are range readings (continuous)
▪ States are positions on a map (continuous)
▪ Molecular biology:
▪ Observations are nucleotides ACGT
▪ States are coding/non-coding/start/stop/splice-site etc.
Inference tasks

▪ Useful notation: Xa:b = Xa , Xa+1, …, Xb


▪ Filtering: P(Xt|e1:t)
▪ belief state — posterior distribution over the most recent state given all evidence
▪ Ex: robot localization
▪ Prediction: P(Xt+k|e1:t) for k > 0
▪ posterior distribution over a future state given all evidence
▪ Smoothing: P(Xk|e1:t) for 0 ≤ k < t
▪ posterior distribution over a past state given all evidence
▪ Most likely explanation: arg maxx0:t P(x0:t | e1:t)
▪ Ex: speech recognition, decoding with a noisy channel
Filtering
▪ Filtering: infer current state given all evidence
▪ Aim: a recursive filtering algorithm of the form
▪ P(Xt+1|e1:t+1) = g(et+1, P(Xt|e1:t) )
Apply Bayes’ rule

▪ P(Xt+1|e1:t+1) = P(Xt+1|e1:t, et+1)


▪ = α P(et+1|Xt+1, e1:t) P(Xt+1| e1:t)

α = 1 / P(et+1|e1:t)
Filtering
▪ Filtering: infer current state given all evidence
▪ Aim: a recursive filtering algorithm of the form
▪ P(Xt+1|e1:t+1) = g(et+1, P(Xt|e1:t) )

▪ P(Xt+1|e1:t+1) = P(Xt+1|e1:t, et+1) Apply conditional independence

▪ = α P(et+1|Xt+1, e1:t) P(Xt+1| e1:t)


▪ = α P(et+1|Xt+1) P(Xt+1| e1:t)

Normalize Update Predict


Filtering
▪ Filtering: infer current state given all evidence
▪ Aim: a recursive filtering algorithm of the form
▪ P(Xt+1|e1:t+1) = g(et+1, P(Xt|e1:t) )

▪ P(Xt+1|e1:t+1) = P(Xt+1|e1:t, et+1)


▪ = α P(et+1|Xt+1, e1:t) P(Xt+1| e1:t)    Condition on Xt

▪ = α P(et+1|Xt+1) P(Xt+1| e1:t)


▪ = α P(et+1|Xt+1) xt P(xt | e1:t) P(Xt+1| xt, e1:t)
Filtering
▪ Filtering: infer current state given all evidence
▪ Aim: a recursive filtering algorithm of the form
▪ P(Xt+1|e1:t+1) = g(et+1, P(Xt|e1:t) )

▪ P(Xt+1|e1:t+1) = P(Xt+1|e1:t, et+1)


▪ = α P(et+1|Xt+1, e1:t) P(Xt+1| e1:t)
▪ = α P(et+1|Xt+1) P(Xt+1| e1:t) Apply conditional
independence

▪ = α P(et+1|Xt+1) xt P(xt | e1:t) P(Xt+1| xt, e1:t)


▪ = α P(et+1|Xt+1) x P(xt | e1:t) P(Xt+1| xt)
t
Filtering
▪ P(Xt+1|e1:t+1) = α P(et+1|Xt+1) x P(xt | e1:t) P(Xt+1| xt)
t

Normalize Update Predict

Forward algorithm
▪ P(Xt+1|e1:t+1) = α P(et+1|Xt+1) x P(xt | e1:t) P(Xt+1| xt)
t

Normalize Update Predict

▪ f1:t+1 = FORWARD(f1:t , et+1)


▪ We start with f1:0 = P(X0) and then iterate
▪ Cost per time step: O(|X|^2) where |X| is the number of states
Example: Weather HMM
Predict:
P(s) = 0.5 × 0.9 + 0.5 × 0.3 = 0.6
P(r) = 0.5 × 0.1 + 0.5 × 0.7 = 0.4
Update & normalize (with Umbrella1 = true):
P(s|u) ∝ 0.6 × 0.2 = 0.12
P(r|u) ∝ 0.4 × 0.9 = 0.36
f(sun) = 0.5, f(rain) = 0.5   →   f(sun) = 0.25, f(rain) = 0.75

Weather0 → Weather1, with Umbrella1 observed

Initial distribution P(W0):
sun    rain
0.5    0.5

Transition model P(Wt | Wt-1):
Wt-1    sun    rain
sun     0.9    0.1
rain    0.3    0.7

Emission model P(Ut | Wt):
Wt      true   false
sun     0.2    0.8
rain    0.9    0.1
Example: Weather HMM

Weather0:            f(sun) = 0.5,   f(rain) = 0.5
  predict            →  <0.6, 0.4>
  update & normalize →  Weather1:  f(sun) = 0.25,  f(rain) = 0.75
  predict            →  <0.45, 0.55>
  update & normalize →  Weather2:  f(sun) = 0.154, f(rain) = 0.846

Weather0 → Weather1 → Weather2 → …, with Umbrella1, Umbrella2, … observed

(Same initial distribution P(W0), transition model P(Wt|Wt-1), and emission model P(Ut|Wt) as on the previous slide.)
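
A minimal Python sketch of one FORWARD step (predict, update with the emission model, normalize); assuming the umbrella is observed (true) on both days, which matches the numbers shown above, and with names of my own choosing:

```python
# One forward step: f_{1:t+1} = alpha * P(e_{t+1} | X_{t+1}) * sum over x_t of P(X_{t+1} | x_t) * f_{1:t}[x_t]
STATES = ["sun", "rain"]
TRANSITION = {"sun": {"sun": 0.9, "rain": 0.1}, "rain": {"sun": 0.3, "rain": 0.7}}
EMISSION = {"sun": {True: 0.2, False: 0.8}, "rain": {True: 0.9, False: 0.1}}  # P(Umbrella | Weather)

def forward(f, umbrella):
    """Predict with the transition model, weight by the evidence, then normalize."""
    predicted = {s: sum(f[p] * TRANSITION[p][s] for p in STATES) for s in STATES}  # predict
    updated = {s: EMISSION[s][umbrella] * predicted[s] for s in STATES}            # update
    z = sum(updated.values())
    return {s: v / z for s, v in updated.items()}                                  # normalize

f = {"sun": 0.5, "rain": 0.5}  # f_{1:0} = P(W_0)
for u in [True, True]:         # umbrella observed at t = 1 and t = 2
    f = forward(f, u)
    print(f)  # step 1: {'sun': 0.25, 'rain': 0.75}; step 2: {'sun': ~0.154, 'rain': ~0.846}
```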
Forward algorithm
▪ P(Xt+1|e1:t+1) = α P(et+1|Xt+1) x P(xt | e1:t) P(Xt+1| xt)
t

Normalize Update Predict

α is just a normalization constant. So if we only want to compute P(Xt | e1:t), we can skip the
normalization when computing P(X1 | e1), P(X2 | e1:2), …, P(Xt-1 | e1:t-1) and normalize once at the end.

Q: How is the algorithm related to variable elimination?


Another view of the algorithm
▪ State trellis: graph of states and transitions over time
sun sun sun sun

rain rain rain rain

X0 X1 … XT

▪ Each arc represents some transition xt-1 → xt


▪ Each arc has weight P(xt | xt-1) P(et | xt) (arcs to initial states have weight P(x0) )
▪ Each path is a sequence of states
▪ The product of weights on a path is proportional to that state sequence’s probability
P(x0) t P(xt | xt-1) P(et | xt) = P(x1:t , e1:t)  P(x1:t | e1:t)
Another view of the algorithm

sun sun sun sun

rain rain rain rain

X0 X1 … Xt+1
• Forward algorithm computes sum over all possible paths
P(xt+1|e1:t+1) = x0:t P(x0:t+1 | e1:t+1)
• It uses dynamic programming to sum over all paths
• For each state at time t, keep track of the total probability of all paths to it
f1:t+1 = FORWARD(f1:t, et+1)
       = α P(et+1|Xt+1) Σxt P(Xt+1 | xt) f1:t[xt]
Most Likely Explanation
Inference tasks
▪ Filtering: P(Xt|e1:t)
▪ belief state—input to the decision process of a rational agent
▪ Prediction: P(Xt+k|e1:t) for k > 0
▪ evaluation of possible action sequences; like filtering without the evidence
▪ Smoothing: P(Xk|e1:t) for 0 ≤ k < t
▪ better estimate of past states, essential for learning
▪ Most likely explanation: arg maxx0:t P(x0:t | e1:t)
▪ speech recognition, decoding with a noisy channel
Most likely explanation = most probable path
▪ State trellis: graph of states and transitions over time
sun sun sun sun

rain rain rain rain

X0 X1 … XT

▪ The product of weights on a path is proportional to that state sequence’s probability


P(x0) t P(xt | xt-1) P(et | xt) = P(x0:t , e1:t)  P(x0:t | e1:t)
▪ Viterbi algorithm computes best paths
arg maxx0:t P(x0:t | e1:t)
Forward / Viterbi algorithms
sun sun sun sun

rain rain rain rain

X0 X1 … XT
Viterbi Algorithm (max):
For each state at time t, keep track of the (unnormalized) maximum probability of any path to it:
m1:t+1(xt+1) = maxx1:t P(x1:t+1 | e1:t+1)
m1:t+1 = VITERBI(m1:t, et+1)
       = P(et+1|Xt+1) maxxt P(Xt+1 | xt) m1:t[xt]

Forward Algorithm (sum):
For each state at time t, keep track of the total probability of all paths to it:
f1:t+1(xt+1) = P(xt+1 | e1:t+1) = Σx1:t P(x1:t+1 | e1:t+1)
f1:t+1 = FORWARD(f1:t, et+1)
       = α P(et+1|Xt+1) Σxt P(Xt+1 | xt) f1:t[xt]
Viterbi algorithm contd.
P(W0):
sun    rain
0.5    0.5

Transition model P(Wt | Wt-1):
Wt-1    sun    rain
sun     0.9    0.1
rain    0.3    0.7

Emission model P(Ut | Wt):
Wt      true   false
sun     0.2    0.8
rain    0.9    0.1

Trellis (states sun/rain at X0, X1, X2, X3; evidence U1=true, U2=false, U3=true):
        X0      X1
sun     0.5     0.09
rain    0.5

▪ m1:t+1 = P(et+1|Xt+1) maxxt P(Xt+1 | xt) m1:t[xt]

m1:1(sun) = 0.2 × max(0.9 × 0.5, 0.3 × 0.5) = 0.09


Viterbi algorithm contd.
P(W0):
sun    rain
0.5    0.5

Transition model P(Wt | Wt-1):
Wt-1    sun    rain
sun     0.9    0.1
rain    0.3    0.7

Emission model P(Ut | Wt):
Wt      true   false
sun     0.2    0.8
rain    0.9    0.1

Trellis values (evidence U1=true, U2=false, U3=true):
        X0      X1      X2      X3
sun     0.5     0.09    0.076   0.0136080
rain    0.5     0.315   0.022   0.0138495

▪ m1:t+1 = P(et+1|Xt+1) maxxt P(Xt+1 | xt) m1:t[xt]

▪ Time complexity: O(|X|^2 T)


▪ Space complexity: O(|X| T)
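
A minimal Python sketch of the Viterbi recursion on the same model (max in place of sum, with backpointers to recover the best path); it reproduces the unnormalized trellis values above, and all names are my own:

```python
# Viterbi: m_{1:t+1}(x) = P(e_{t+1} | x) * max over x_t of P(x | x_t) * m_{1:t}(x_t)
STATES = ["sun", "rain"]
TRANSITION = {"sun": {"sun": 0.9, "rain": 0.1}, "rain": {"sun": 0.3, "rain": 0.7}}
EMISSION = {"sun": {True: 0.2, False: 0.8}, "rain": {True: 0.9, False: 0.1}}

def viterbi(evidence, prior):
    """Return the most likely state sequence x_1..x_T and its (unnormalized) score."""
    m = dict(prior)  # m_{1:0} = P(X_0)
    back = []        # back[t][s] = best predecessor of state s at time t+1
    for e in evidence:
        step, ptr = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: TRANSITION[p][s] * m[p])
            ptr[s] = best_prev
            step[s] = EMISSION[s][e] * TRANSITION[best_prev][s] * m[best_prev]
        back.append(ptr)
        m = step
        print(m)  # t=1: sun 0.09, rain 0.315; t=2: sun 0.0756, rain ~0.022; ...
    path = [max(STATES, key=lambda s: m[s])]  # best final state
    for ptr in reversed(back[1:]):            # follow backpointers to recover the path
        path.append(ptr[path[-1]])
    path.reverse()
    return path, m[path[-1]]

print(viterbi([True, False, True], {"sun": 0.5, "rain": 0.5}))
```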
Dynamic Bayes Nets
Dynamic Bayes Nets (DBNs)
▪ We want to track multiple variables over time, using
multiple sources of evidence
▪ Idea: Repeat a fixed Bayes net structure at each time
▪ Variables from time t can condition on those from t-1
t =1 t =2 t =3

G 1a G 2a G 3a

G 1b G 2b G 3b

E1a E1b E2a E2b E3a E3b


DBNs and HMMs
▪ Every HMM is a DBN
▪ Every discrete DBN can be represented by an HMM
▪ Each HMM state is the Cartesian product of the DBN state variables
▪ E.g., 3 binary state variables => one state variable with 2^3 = 8 possible values
▪ Advantage of DBN vs. HMM?
▪ Sparse dependencies => exponentially fewer parameters
▪ E.g., 20 binary state variables, 2 parents each;
DBN has 20 × 2^(2+1) = 160 parameters, HMM has 2^20 × 2^20 ≈ 10^12 parameters
Exact Inference in DBNs
▪ Variable elimination applies to dynamic Bayes nets
▪ Offline: “unroll” the network for T time steps, then eliminate variables to find P(XT|e1:T)
▪ Problem: results in very large BN
t =1 t =2 t =3

G 1a G 2a G 3a
G 1b G 2b G 3b

E1a E1b E2a E2b E3a E3b

▪ Can we do better?
▪ Do we need to unroll for many steps? What is the best variable order of elimination?
▪ Online: unroll as we go, eliminate all variables from the previous time step
▪ A generalization of the Forward algorithm
Particle Filtering
Large state space
▪ When |X| is huge (e.g., position in a building), exact inference becomes
infeasible
▪ Can we use approximate inference, e.g., likelihood weighting?
▪ Evidence variables are “downstream” of the states
▪ Likelihood weighting ignores the evidence when sampling the states, so as more states are
sampled over time the weights drop quickly (the samples drift into low-probability regions)
▪ Hence: too few “reasonable” samples

X0 X1 X2 X3

E1 E2 E3
Particle Filtering

▪ Represent the belief state at each step by a set of samples
▪ Samples are called particles
▪ Our representation of P(X) is now a list of N particles (samples)
▪ P(x) is approximated by the fraction of particles with value x
▪ So, many x may have P(x) = 0
▪ Generally, N << |X|
▪ More particles, more accuracy; but a very large N would defeat the point

Example grid of approximate probabilities:
0.0   0.1   0.0
0.0   0.0   0.2
0.0   0.2   0.5
Representation: Particles

▪ Initialization
▪ sample N particles from the initial distribution P(X0)
▪ All particles have a weight of 1

Particles:
(3,3)
(2,3)
(3,3)
(3,2)
(3,3)
(3,2)
(1,2)
(3,3)
(3,3)
(2,3)
Particle Filtering: Propagate forward

▪ Each particle is moved by sampling its next position from the transition model:
▪ xt+1 ~ P(Xt+1 | xt)
▪ This captures the passage of time
▪ If enough samples, close to the exact probabilities (consistent)

Particles (before): (3,3) (2,3) (3,3) (3,2) (3,3) (3,2) (1,2) (3,3) (3,3) (2,3)
Particles (after):  (3,2) (2,3) (3,2) (3,1) (3,3) (3,2) (1,3) (2,3) (3,2) (2,2)
Particle Filtering: Observe

▪ Similar to likelihood weighting, weight samples based on the evidence
▪ w = P(et | xt)
▪ Particles that fit the evidence better get higher weights, others get lower weights
▪ What happens if we repeat the Propagate-Observe procedure over time?
▪ It is exactly likelihood weighting (if we multiply the weights)
▪ Weights drop quickly…

Particles (unweighted): (3,2) (2,3) (3,2) (3,1) (3,3) (3,2) (1,3) (2,3) (3,2) (2,2)
Particles (weighted):   (3,2) w=.9, (2,3) w=.2, (3,2) w=.9, (3,1) w=.4, (3,3) w=.4,
                        (3,2) w=.9, (1,3) w=.1, (2,3) w=.2, (3,2) w=.9, (2,2) w=.4
Particle Filtering: Resample

▪ Rather than tracking weighted samples, we resample
▪ Generate N new samples from our weighted samples
▪ Each new sample is selected from the current population of samples; the probability of
picking a sample is proportional to its weight
▪ The new samples have weight of 1
▪ Now the update is complete for this time step; continue with the next one

Particles (weighted):  (3,2) w=.9, (2,3) w=.2, (3,2) w=.9, (3,1) w=.4, (3,3) w=.4,
                       (3,2) w=.9, (1,3) w=.1, (2,3) w=.2, (3,2) w=.9, (2,2) w=.4
(New) Particles:       (3,2) (2,2) (3,2) (2,3) (3,3) (3,2) (1,3) (2,3) (3,2) (3,2)
Summary: Particle Filtering
▪ Particles: track samples of states rather than an explicit distribution
Propagate forward Weight Resample

≈ P(Xt | e1:t)

Particles: Particles: Particles: (New) Particles:


(3,3) (3,2) (3,2) w=.9 (3,2)
(2,3) (2,3) (2,3) w=.2 (2,2)
(3,3) (3,2) (3,2) w=.9 (3,2)
(3,2) (3,1) (3,1) w=.4 (2,3)
(3,3) (3,3) (3,3) w=.4 (3,3)
(3,2) (3,2) (3,2) w=.9 (3,2)
(1,2) (1,3) (1,3) w=.1 (1,3)
(3,3) (2,3) (2,3) w=.2 (2,3)
(3,3) (3,2) (3,2) w=.9 (3,2)
(2,3) (2,2) (2,2) w=.4 (3,2)

Consistency: see proof in AIMA Ch. 15
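
A minimal Python sketch of one particle-filtering update (propagate forward, weight, resample); to keep it self-contained it reuses the two-state umbrella HMM from the filtering slides instead of the grid example, and all names are my own:

```python
import random

STATES = ["sun", "rain"]
TRANSITION = {"sun": {"sun": 0.9, "rain": 0.1}, "rain": {"sun": 0.3, "rain": 0.7}}
EMISSION = {"sun": {True: 0.2, False: 0.8}, "rain": {True: 0.9, False: 0.1}}  # P(Umbrella | Weather)

def particle_filter_step(particles, umbrella):
    # Propagate forward: move each particle by sampling from the transition model
    moved = [random.choices(STATES, weights=[TRANSITION[x][s] for s in STATES])[0]
             for x in particles]
    # Weight: score each particle by how well it explains the evidence
    weights = [EMISSION[x][umbrella] for x in moved]
    # Resample: draw N new unweighted particles with probability proportional to weight
    return random.choices(moved, weights=weights, k=len(moved))

N = 10_000
particles = random.choices(STATES, k=N)  # sample N particles from the uniform P(X_0)
for u in [True, True]:                   # umbrella observed on both days
    particles = particle_filter_step(particles, u)
print(particles.count("sun") / N)        # approximates P(sun | e_1:2); exact value ~0.154
```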


Robot Localization
▪ In robot localization:
▪ We know the map, but not the robot’s position
▪ Observations may be vectors of range finder readings
▪ State space and readings are typically continuous so we
cannot usually represent or compute an exact posterior
▪ Particle filtering is a main technique
Particle Filter Localization (Sonar)

[Dieter Fox, et al.]


Particle Filter Localization (Laser)

[Dieter Fox, et al.]


Robot Mapping
▪ SLAM: Simultaneous Localization And Mapping
▪ We do not know the map or our location
▪ State consists of position AND map!
▪ Main techniques: Kalman filtering (Gaussian HMMs)
and particle methods

DP-SLAM, Ron Parr


Particle Filter SLAM – Video 1

[Sebastian Thrun, et al.]


Particle Filter SLAM – Video 2

[Dirk Haehnel, et al.]


Summary
▪ Probabilistic temporal models
▪ Markov model
▪ Hidden Markov model
▪ Filtering: forward algorithm
▪ MLE: Viterbi algorithm
▪ Dynamic Bayesian network
▪ Approximate inference by particle filtering
