Lecture 11
Russ Salakhutdinov
Department of Statistics
[email protected]
http://www.utstat.utoronto.ca/~rsalakhu/
Sidney Smith Hall, Room 6002
Project Deadline
• Project Deadline is Dec 15, 2011. This is a hard deadline!
• For many applications, e.g. financial forecasting, we want to predict the next
value in a time series, given past values.
• Markov models: future predictions are independent of all but the most recent
observations.
Example of a Spectrogram
• Example of a spectrogram of a spoken
word ‘Bayes theorem’:
• The joint distribution for a sequence of N observations under this model is:
p(x_1, …, x_N) = p(x_1) ∏_{n=2}^N p(x_n | x_{n-1})
• For many applications, these conditional distributions that define the model will
be constrained to be equal.
• This corresponds to the assumption of a stationary time series.
• This model is known as a homogeneous Markov chain.
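As a concrete illustration (not from the slides), here is a minimal Python sketch that evaluates this joint probability for a discrete homogeneous chain; the names pi, A, seq and the toy numbers are assumptions.

import numpy as np

# Log-likelihood of a discrete sequence under a homogeneous first-order Markov chain.
def markov_log_likelihood(seq, pi, A):
    """seq: list of integer symbols; pi: initial probs (K,); A: transition matrix (K, K)."""
    ll = np.log(pi[seq[0]])
    for prev, curr in zip(seq[:-1], seq[1:]):
        ll += np.log(A[prev, curr])          # p(x_n | x_{n-1}), shared across all n
    return ll

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(markov_log_likelihood([0, 0, 1, 1, 0], pi, A))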
Second-Order Markov Models
• We can also consider a second-order Markov chain:
• The joint distribution for a sequence of N observations under this model is:
p(x_1, …, x_N) = p(x_1) p(x_2 | x_1) ∏_{n=3}^N p(x_n | x_{n-1}, x_{n-2})
• Example: for discrete outputs (symbols) and a 2nd-order Markov model we can use the multinomial model:
p(x_n = k | x_{n-1} = j, x_{n-2} = l) = A_{ljk}, a K × K × K table of transition probabilities.
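A minimal sketch of how such a 2nd-order multinomial model can be stored and evaluated, assuming a K × K × K table A2 and a joint distribution p12 over the first two symbols; all names and numbers here are illustrative.

import numpy as np

K = 3
rng = np.random.default_rng(0)
A2 = rng.random((K, K, K))
A2 /= A2.sum(axis=2, keepdims=True)          # A2[l, j, k] = p(x_n=k | x_{n-2}=l, x_{n-1}=j)

def second_order_log_lik(seq, A2, p12):
    """p12[a, b]: joint probability of the first two symbols (an assumption)."""
    ll = np.log(p12[seq[0], seq[1]])
    for n in range(2, len(seq)):
        ll += np.log(A2[seq[n - 2], seq[n - 1], seq[n]])
    return ll

p12 = np.full((K, K), 1.0 / K**2)
print(second_order_log_lik([0, 1, 2, 1, 0], A2, p12))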
• For each observation xn, we have a latent variable zn. Assume that latent
variables form a Markov chain.
• If the latent variables are discrete → Hidden Markov Models (HMMs).
Observed variables can be discrete or continuous.
• There is always a path connecting two observed variables xn, xm via latent
variables.
• The predictive distribution p(x_{n+1} | x_1, …, x_n) therefore depends on all previous observations: it does not exhibit any conditional independence simplifications.
• A set of output probability distributions (one per state) converts a state path into a sequence of observable symbols/vectors (known as emission probabilities):
- Gaussian, if x is continuous.
- Conditional probability table, if x is discrete.
• Standard mixture model for i.i.d. data: special case in which all parameters Ajk
are the same for all j.
• For the discrete, multinomial observed variable x, using 1-of-K encoding, the conditional distribution takes the form:
p(x_n | z_n) = ∏_{i=1}^D ∏_{k=1}^K μ_{ik}^{x_{ni} z_{nk}}
HMM Model Equations
• The joint distribution over the observed and latent variables is given by:
p(X, Z | θ) = p(z_1 | π) [∏_{n=2}^N p(z_n | z_{n-1}, A)] ∏_{n=1}^N p(x_n | z_n, φ),
where X = {x_1, …, x_N}, Z = {z_1, …, z_N}, and θ = {π, A, φ}.
• It looks hard: N variables, each of which has K states, hence K^N total paths.
• If we knew the true state path, then ML parameter estimation would be trivial (see the sketch at the end of this slide for evaluating the joint along a known path).
• We will first look at the E-step: Computing the true posterior distribution over
the state paths.
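Before turning to the E-step machinery, here is a minimal sketch of evaluating the joint probability above along a known state path, with a discrete emission table B standing in for the emission parameters φ; the names and toy numbers are assumptions.

import numpy as np

# log p(X, Z) for a *known* state path, discrete emissions.
def hmm_joint_log_prob(obs, states, pi, A, B):
    """obs, states: integer sequences; pi: (K,); A: (K, K); B: (K, V) emission table."""
    ll = np.log(pi[states[0]]) + np.log(B[states[0], obs[0]])
    for n in range(1, len(obs)):
        ll += np.log(A[states[n - 1], states[n]])   # p(z_n | z_{n-1})
        ll += np.log(B[states[n], obs[n]])          # p(x_n | z_n)
    return ll

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(hmm_joint_log_prob([0, 1, 2], [0, 0, 1], pi, A, B))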
Inference of Hidden States
• We want to estimate the hidden states given observations. To start with, let us estimate a single hidden state:
γ(z_n) = p(z_n | X) = p(X | z_n) p(z_n) / p(X) = α(z_n) β(z_n) / p(X),
where α(z_n) = p(x_1, …, x_n, z_n) and β(z_n) = p(x_{n+1}, …, x_N | z_n).
• Each α(z_n) and β(z_n) represents a set of K numbers, one for each of the possible settings of the 1-of-K binary vector z_n.
• Computational cost of each recursion step scales like O(K²).
• Observe:
α(z_n) = p(x_1, …, x_n, z_n) = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})
• There are exponentially many paths, but this is exactly dynamic programming: at each node, sum up the values of all incoming paths.
[Figure: lattice of states (vertical) against time (horizontal)]
• Initial condition:
α(z_1) = p(x_1, z_1) = p(z_1) p(x_1 | z_1)
• Hence all of the α(z_n) can be computed by a single forward pass through the sequence.
The Backward (β) Algorithm
• The backward recursion runs from n = N down to n = 1, starting from β(z_N) = 1:
β(z_n) = Σ_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
Forward-Backward Recursion
• α(z_nk) gives the total inflow of probability to node (n, k).
• β(z_nk) gives the total outflow of probability from node (n, k).
[Figure: lattice of states (vertical) against time (horizontal), showing inflow and outflow at each node]
• Bugs again: we just let the bugs run forward from time 0 to t and backward from time τ to t.
• In fact, we can do one forward pass to compute all the α(z_n) and one backward pass to compute all the β(z_n), and then compute any γ(z_n) we want.
• Total cost is O(K²N).
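A compact Python sketch of these α-β recursions for a discrete-emission HMM; pi, A, B and the toy numbers are assumptions, and no scaling is used, so it is only suitable for short sequences.

import numpy as np

def forward_backward(obs, pi, A, B):
    N, K = len(obs), len(pi)
    alpha = np.zeros((N, K))
    beta = np.ones((N, K))                      # beta(z_N) = 1
    alpha[0] = pi * B[:, obs[0]]                # alpha(z_1) = p(z_1) p(x_1 | z_1)
    for n in range(1, N):
        alpha[n] = B[:, obs[n]] * (alpha[n - 1] @ A)     # sum over incoming paths
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (B[:, obs[n + 1]] * beta[n + 1])   # sum over outgoing paths
    likelihood = alpha[-1].sum()                # p(X) = sum_z alpha(z_N)
    gamma = alpha * beta / likelihood           # posterior p(z_n | X)
    return alpha, beta, gamma, likelihood

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
alpha, beta, gamma, lik = forward_backward([0, 1, 2, 0], pi, A, B)
print(lik, gamma.sum(axis=1))                   # each gamma row sums to 1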
Computing Likelihood
• Note that
p(X) = Σ_{z_n} α(z_n) β(z_n)   for any choice of n.
• In particular, for n = N this gives p(X) = Σ_{z_N} α(z_N), because β(z_N) = 1.
• It can be computed at any point in parameter space, e.g. to monitor the likelihood during EM.
Complete Data Log-likelihood
• The complete data log-likelihood takes the form:
ln p(X, Z | θ) = Σ_{k=1}^K z_{1k} ln π_k + Σ_{n=2}^N Σ_{j=1}^K Σ_{k=1}^K z_{n-1,j} z_{nk} ln A_{jk} + Σ_{n=1}^N Σ_{k=1}^K z_{nk} ln p(x_n | φ_k)
• Note that any elements of π or A that initially are set to zero will remain zero in subsequent EM updates.
Parameter Estimation: Emission Model
• For the case of discrete multinomial observed variables, the observation model takes the form:
p(x_n | z_n) = ∏_{i=1}^D ∏_{k=1}^K μ_{ik}^{x_{ni} z_{nk}}
• The resulting M-step update for μ is the same as fitting a Bernoulli mixture model (a sketch follows below).
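A hedged sketch of that M-step update, computing μ from the responsibilities γ produced by the forward-backward pass; the variable names (X, gamma, mu) are illustrative.

import numpy as np

def m_step_emissions(X, gamma):
    """X: (N, D) one-hot observations; gamma: (N, K) responsibilities p(z_nk = 1 | X)."""
    # mu[i, k] = sum_n gamma[n, k] * X[n, i] / sum_n gamma[n, k]
    mu = X.T @ gamma / gamma.sum(axis=0)        # shape (D, K)
    return mu

X = np.eye(3)[[0, 1, 1, 2]]                     # four one-hot observations, D = 3
gamma = np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.6, 0.4]])
print(m_step_emissions(X, gamma))               # columns define p(x | z = k)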
• By choosing the state with the largest probability γ(z_n) at each time, we can make an “average” state path. This is the path with the maximum expected number of correct states.
• Viterbi decoding efficiently determines the most probable path from the
exponentially many possibilities.
• The probability of each path is given by the product of the elements of the
transition matrix Ajk, along with the emission probabilities associated with
each node in the path.
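A minimal Viterbi sketch for a discrete-emission HMM, running max-product dynamic programming over the same lattice and backtracking the most probable path; names and toy numbers are assumptions.

import numpy as np

def viterbi(obs, pi, A, B):
    N, K = len(obs), len(pi)
    log_delta = np.zeros((N, K))
    back = np.zeros((N, K), dtype=int)
    log_delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for n in range(1, N):
        scores = log_delta[n - 1][:, None] + np.log(A)     # (K_prev, K_curr)
        back[n] = scores.argmax(axis=0)                    # best predecessor per state
        log_delta[n] = scores.max(axis=0) + np.log(B[:, obs[n]])
    path = [int(log_delta[-1].argmax())]
    for n in range(N - 1, 0, -1):
        path.append(int(back[n][path[-1]]))                # follow backpointers
    return path[::-1], log_delta[-1].max()                 # most probable path, its log-prob

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2, 2], pi, A, B))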
Using HMMs for Recognition
• We can use HMMs for recognition by:
- training one HMM for each class (requires labeled training data)
- evaluating the probability of an unknown sequence under each HMM
- classifying the unknown sequence by choosing the HMM with the highest likelihood (see the sketch at the end of this slide)
[Figure: likelihoods L1, L2, …, Lk of the unknown sequence under each class HMM]
• Model parameters can be efficiently fit using EM, in which the E-step
involves forward-backward recursion.
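A small sketch of this recognition recipe, scoring an unknown sequence under one HMM per class via a forward pass and picking the class with the highest log-likelihood; the class names and parameter values are made up for illustration.

import numpy as np

def hmm_log_likelihood(obs, pi, A, B):
    alpha = pi * B[:, obs[0]]
    for x in obs[1:]:
        alpha = B[:, x] * (alpha @ A)           # forward recursion
    return np.log(alpha.sum())                  # log p(X) = log sum_z alpha(z_N)

class_hmms = {
    "class_0": (np.array([0.6, 0.4]),
                np.array([[0.7, 0.3], [0.4, 0.6]]),
                np.array([[0.8, 0.1, 0.1], [0.1, 0.2, 0.7]])),
    "class_1": (np.array([0.5, 0.5]),
                np.array([[0.5, 0.5], [0.5, 0.5]]),
                np.array([[0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])),
}
obs = [0, 0, 2, 0]
scores = {c: hmm_log_likelihood(obs, *params) for c, params in class_hmms.items()}
print(max(scores, key=scores.get), scores)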
Factorial HMMs
• Example of a factorial HMM comprising two Markov chains of latent variables:
• We can use mixtures of base rates.
• We can also tie parameters across states.
Regularizing Transition Matrices
• One way to regularize large transition matrices is to constrain them to be relatively sparse: instead of being allowed to transition to any other state, each state has only a few possible successor states.
• For example, if each state has at most p possible next states, then the cost of inference is O(pKT) and the number of parameters is O(pK + KM), which are both linear in the number of states.
• A very effective way to constrain the transitions is to order the states in the HMM and allow transitions only to states that come later in the ordering. Such models are known as “linear HMMs”, “chain HMMs” or “left-to-right HMMs”. The transition matrix is upper-diagonal (usually only has a few bands); a small construction sketch follows below.
[Figure: allowed transitions from state s(t) to state s(t+1) in a left-to-right HMM]
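A small sketch of constructing such a banded, upper-triangular (left-to-right) transition matrix; K, the bandwidth p and the random row values are assumptions.

import numpy as np

def left_to_right_transitions(K, p=2, rng=None):
    rng = np.random.default_rng(rng)
    A = np.zeros((K, K))
    for j in range(K):
        succ = np.arange(j, min(j + p + 1, K))   # allowed successors: j, j+1, ..., j+p
        w = rng.random(len(succ))
        A[j, succ] = w / w.sum()                 # normalize each row
    return A

print(np.round(left_to_right_transitions(5, p=1, rng=0), 2))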
Linear Dynamical Systems
• In HMMs, latent variables are discrete, but with arbitrary emission probability distributions.
• In linear dynamical systems (LDS), the latent variables are continuous, and the transition and emission distributions are linear-Gaussian.
• Because the LDS is a linear-Gaussian model, the joint distribution over all
variables, as well as marginals and conditionals, will be Gaussian.
• The prediction step gives the predicted mean over z_n and a prediction of the observation x_n.
• The new observation has shifted and narrowed the distribution compared to the prediction p(z_n | x_1, …, x_{n-1}) (see red curve).
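A hedged sketch of one predict-update (Kalman filter) step for an LDS, assuming the usual linear-Gaussian form z_n = A z_{n-1} + w with w ~ N(0, Gamma) and x_n = C z_n + v with v ~ N(0, Sigma); all variable names and toy values are assumptions.

import numpy as np

def kalman_step(mu, V, x, A, Gamma, C, Sigma):
    # Prediction: p(z_n | x_1..x_{n-1}) is Gaussian with mean mu_pred, covariance P_pred.
    mu_pred = A @ mu
    P_pred = A @ V @ A.T + Gamma
    # Update: incorporating x_n shifts and narrows the distribution.
    S = C @ P_pred @ C.T + Sigma
    gain = P_pred @ C.T @ np.linalg.inv(S)       # Kalman gain
    mu_new = mu_pred + gain @ (x - C @ mu_pred)
    V_new = (np.eye(len(mu)) - gain @ C) @ P_pred
    return mu_new, V_new

# Toy 1-D tracking: latent position with random-walk dynamics, noisy observation.
mu, V = np.zeros(1), np.eye(1)
A, Gamma = np.eye(1), 0.1 * np.eye(1)
C, Sigma = np.eye(1), 0.5 * np.eye(1)
print(kalman_step(mu, V, np.array([1.2]), A, Gamma, C, Sigma))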
Tracking Example
• LDS that is being used to track a moving object in 2-D space:
Particle Filters
• Suppose we observed X_n = {x_1, …, x_n}, and we wish to approximate:
E[f(z_n)] = ∫ f(z_n) p(z_n | X_n) dz_n ≈ Σ_{l=1}^L w_n^(l) f(z_n^(l))
• Hence, for samples {z_n^(l)} drawn from p(z_n | X_{n-1}), the importance weights are
w_n^(l) = p(x_n | z_n^(l)) / Σ_{m=1}^L p(x_n | z_n^(m)).
• Hence the posterior p(zn | Xn) is represented by the set of L samples together
with the corresponding importance weights.
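A hedged sketch of one particle-filter step under these ideas: resample by the importance weights, propagate through an (assumed) random-walk transition model, and reweight by an (assumed) Gaussian emission likelihood; everything here is illustrative.

import numpy as np

def particle_filter_step(particles, weights, x_new, rng):
    L = len(particles)
    # 1. Resample particles in proportion to their importance weights.
    idx = rng.choice(L, size=L, p=weights)
    z = particles[idx]
    # 2. Propagate samples through the transition model p(z_n | z_{n-1}) (random walk here).
    z = z + rng.normal(scale=0.5, size=L)
    # 3. Reweight by the emission likelihood p(x_n | z_n) (Gaussian observation noise here).
    w = np.exp(-0.5 * (x_new - z) ** 2)
    return z, w / w.sum()

rng = np.random.default_rng(0)
particles = rng.normal(size=100)
weights = np.full(100, 1.0 / 100)
particles, weights = particle_filter_step(particles, weights, x_new=1.5, rng=rng)
print(particles[:3], weights[:3])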