
STA 4273H:

Statistical Machine Learning

Russ Salakhutdinov
Department of Statistics
[email protected]
http://www.utstat.utoronto.ca/~rsalakhu/
Sidney Smith Hall, Room 6002

Lecture 11
Project Deadline
• Project Deadline is Dec 15, 2011. This is a hard deadline!

• I will not be able to give any extensions, as I have to turn in the marks by Dec 21.

• Send the projects to me via e-mail to: [email protected]
Sequential Data
• So far we focused on problems that assumed that the data points were
independent and identically distributed (i.i.d. assumption).
• This allowed us to express the likelihood function as a product, over all data points,
of the probability distribution evaluated at each data point.
• This is a poor assumption when working with sequential data.

• For many applications, e.g. financial forecasting, we want to predict the next
value in a time series, given past values.

• Intuitively, the recent observations are likely to be more informative than older
ones in predicting the future.

• Markov models: future predictions are independent of all but the most recent
observations.
Example of a Spectrogram
• Example of a spectrogram of a spoken
word ‘Bayes theorem’:

• Successive observations are highly correlated.
Markov Models
• The simplest model is the first-order Markov chain:

• The joint distribution for a sequence of N observations under this model is:

• From the d-separation property, the conditionals are given by:

• For many applications, these conditional distributions that define the model will
be constrained to be equal.
• This corresponds to the assumption of a stationary time series.
• This model is known as a homogeneous Markov chain.
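For reference, the joint distribution and the resulting conditionals referred to above take the standard form (written here in PRML-style notation):

```latex
p(x_1, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{n-1}),
\qquad
p(x_n \mid x_1, \ldots, x_{n-1}) = p(x_n \mid x_{n-1})
```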
Second-Order Markov Models
• We can also consider a second-order Markov chain:

• The joint distribution for a sequence of N observations under this model is:
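In the same notation, the second-order joint distribution is:

```latex
p(x_1, \ldots, x_N) = p(x_1)\, p(x_2 \mid x_1) \prod_{n=3}^{N} p(x_n \mid x_{n-1}, x_{n-2})
```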

• We can similarly consider extensions to an Mth order Markov chain.

• Increased flexibility → exponential growth in the number of parameters.

• Markov models need high orders to remember past “events”.


Learning Markov Models
• The ML parameter estimates for a simple Markov model are easy.
Consider a Kth order model:

• Each window of K + 1 outputs is a training case for the model.

• Example: for discrete outputs (symbols) and a 2nd-order Markov model we can
use the multinomial model:

• The maximum likelihood values for α are:
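A minimal NumPy sketch of this counting estimate for the 2nd-order case; the function name and the optional smoothing argument are illustrative additions, not part of the lecture:

```python
import numpy as np

def fit_second_order_markov(seq, K, smoothing=0.0):
    """ML estimate of p(x_n | x_{n-2}, x_{n-1}) for a discrete symbol sequence.

    seq: 1-D array of integer symbols in {0, ..., K-1}.
    smoothing: optional add-alpha smoothing (0 gives the pure ML counts).
    Returns a K x K x K array P with P[i, j, k] = p(x_n = k | x_{n-2} = i, x_{n-1} = j).
    """
    counts = np.full((K, K, K), smoothing, dtype=float)
    for i, j, k in zip(seq[:-2], seq[1:-1], seq[2:]):   # each window of 3 symbols is a training case
        counts[i, j, k] += 1.0
    totals = counts.sum(axis=2, keepdims=True)
    totals[totals == 0.0] = 1.0                         # leave unseen contexts as all-zero rows
    return counts / totals

# Example on a toy sequence with K = 3 symbols:
seq = np.array([0, 1, 2, 1, 0, 1, 2, 2, 1, 0])
P = fit_second_order_markov(seq, K=3)
```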


State Space Models
• What about a model that is not limited by the Markov assumption to any
order?
• Solution: Introduce additional latent variables!

• Graphical structure known as the State Space Model.

• For each observation xn, we have a latent variable zn. Assume that the latent
variables form a Markov chain.
• If the latent variables are discrete → Hidden Markov Models (HMMs).
Observed variables can be discrete or continuous.

• If the latent and observed variables are Gaussian → Linear Dynamical
System.
State Space Models
• The joint distribution is given by:

• Graphical structure known as the State Space Model.

• There is always a path connecting any two observed variables xn, xm via the
latent variables.
• The predictive distribution p(xn+1 | x1,…,xn) does not exhibit any conditional
independence properties, and so the prediction depends on all previous observations.

• Even though the hidden state sequence is first-order Markov, the output process
is not Markov of any order!
Hidden Markov Model
• First order Markov chain generates hidden state sequence (known as
transition probabilities):

• A set of output probability distributions (one per state) converts state path into
sequence of observable symbols/vectors (known as emission probabilities):

- Gaussian, if x is continuous.
- Conditional probability table, if x is discrete.


Links to Other Models
• You can view an HMM as:
- a Markov chain with stochastic measurements, or
- a mixture model with states coupled across time.

• We will adopt the latter view, as we have worked with mixture models before.
Transition Probabilities
• It will be convenient to use 1-of-K encoding for the latent variables.
• The matrix of transition probabilities takes form:

• The conditionals can be written as:
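In the standard 1-of-K notation, the transition matrix and the resulting conditional are:

```latex
A_{jk} \equiv p(z_{nk} = 1 \mid z_{n-1,j} = 1), \qquad \sum_{k} A_{jk} = 1,
\qquad
p(z_n \mid z_{n-1}, A) = \prod_{k=1}^{K} \prod_{j=1}^{K} A_{jk}^{\,z_{n-1,j}\, z_{nk}}
```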

• We will focus on homogeneous models: all of the conditional distributions over
latent variables share the same parameters A.

• Standard mixture model for i.i.d. data: special case in which all parameters Ajk
are the same for all j.

• Or the conditional distribution p(zn|zn-1) is independent of zn-1.


Emission Probabilities
• The emission probabilities take form:

• For example, for a continuous x, we have

• For the discrete, multinomial observed variable x, using 1-of-K encoding, the
conditional distribution takes form:
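In the standard notation (with φk denoting the emission parameters of state k; for discrete x, each p(xn | φk) is a row of a conditional probability table):

```latex
p(x_n \mid z_n, \phi) = \prod_{k=1}^{K} p(x_n \mid \phi_k)^{z_{nk}},
\qquad \text{e.g. } p(x_n \mid \phi_k) = \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \text{ for continuous } x_n.
```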
HMM Model Equations
• The joint distribution over the observed and latent variables is given by:
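In the standard notation this is:

```latex
p(X, Z \mid \theta) = p(z_1 \mid \pi) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A) \right] \prod_{m=1}^{N} p(x_m \mid z_m, \phi)
```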

where θ = {π, A, φ} are the model parameters.

• Data are not i.i.d. Everything is coupled across time.


• Three problems: computing probabilities of observed sequences, inference
of hidden state sequences, learning of parameters.
HMM as a Mixture Through Time
• Sampling from a 3-state HMM with a 2-d Gaussian emission model.

• The transition matrix is fixed: Akk = 0.9 and Ajk = 0.05 for j ≠ k.
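A minimal sketch of such a sampler. The transition matrix matches the Akk = 0.9, Ajk = 0.05 setup above; the emission means, the shared covariance, and the uniform initial distribution are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, N = 3, 2, 200                                      # states, observation dim, sequence length
A = np.full((K, K), 0.05) + 0.85 * np.eye(K)             # Akk = 0.9, Ajk = 0.05 for j != k
pi = np.full(K, 1.0 / K)                                 # assumed uniform initial distribution
means = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 4.0]])   # illustrative 2-d emission means
cov = 0.5 * np.eye(D)                                    # illustrative shared covariance

z = np.empty(N, dtype=int)
x = np.empty((N, D))
z[0] = rng.choice(K, p=pi)
x[0] = rng.multivariate_normal(means[z[0]], cov)
for n in range(1, N):
    z[n] = rng.choice(K, p=A[z[n - 1]])                  # sample the next hidden state
    x[n] = rng.multivariate_normal(means[z[n]], cov)     # Gaussian emission given the state
```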


Applications of HMMs
• Speech recognition.
• Language modeling.
• Motion video analysis/tracking.
• Protein sequence and genetic sequence alignment and analysis.
• Financial time series prediction.
Maximum Likelihood for the HMM
• We observe a dataset X = {x1,…,xN}.
• The goal is to determine the model parameters θ = {π, A, φ}.
• The probability of the observed sequence takes the form:

• In contrast to mixture models, the joint distribution p(X, Z | θ) does not
factorize over n.

• It looks hard: N variables, each of which has K states. Hence K^N total paths.

• Remember the inference problem on a simple chain.


Probability of an Observed Sequence
• The joint distribution factorizes:

• Dynamic Programming: By moving the summations inside, we can save a
lot of work.
EM algorithm
• We cannot perform direct maximization (no closed form solution):

• EM algorithm: we will derive an efficient algorithm for maximizing the likelihood
function in HMMs (and later for linear state-space models).

• E-step: Compute the posterior distribution over latent variables:

• M-step: Maximize the expected complete data log-likelihood:

• If we knew the true state path, then ML parameter estimation would be trivial.

• We will first look at the E-step: Computing the true posterior distribution over
the state paths.
Inference of Hidden States
• We want to estimate the hidden states given observations. To start with, let
us estimate a single hidden state:

• Using the conditional independence property, we obtain:


Inference of Hidden States
• Hence:

α(zn) is the joint probability of observing all of the data up to time n and zn.

β(zn) is the conditional probability of all future data from time n+1 to N.

• Each α(zn) and β(zn) represents a set of K numbers, one for each of the
possible settings of the 1-of-K binary vector zn.

• We will derive an efficient recursive algorithm, known as the alpha-beta
recursion, or forward-backward algorithm.

• Remember the sum-product message passing algorithm for tree-structured
graphical models.
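For reference, these quantities are, in the standard notation:

```latex
\alpha(z_n) \equiv p(x_1, \ldots, x_n, z_n),
\qquad
\beta(z_n) \equiv p(x_{n+1}, \ldots, x_N \mid z_n),
\qquad
\gamma(z_n) \equiv p(z_n \mid X) = \frac{\alpha(z_n)\, \beta(z_n)}{p(X)}
```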
The Forward (α) Recursion
• The forward recursion:

Computational cost scales like O(K^2).

• Observe:

• This enables us to easily (cheaply) compute the desired likelihood.


Bugs on a Grid
• Naive algorithm: start a bug in each state, then move each bug forward in time by
making copies of it and incrementing the value of each copy by the probability of the
transition and output emission; repeat until all bugs have reached time N, then sum
up the values on all bugs.

• Trick (the clever recursion): add a step which says: at each node, replace all the
bugs with a single bug carrying the sum of their values.

• There are exponentially many paths. At each node, sum up the values of all
incoming paths.

• This is exactly dynamic programming.


The Forward (α) Recursion
• Illustration of the forward recursion:

Here α(zn,1) is obtained by:
• Taking the elements α(zn-1,j).
• Summing them up with weights Aj1,
corresponding to p(zn | zn-1).
• Multiplying by the data contribution
p(xn | zn,1).

• The initial condition is given by:
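A minimal NumPy sketch of the forward pass, including this initial condition. It assumes discrete emissions; the names pi, A, phi and the K x V emission-matrix convention are illustrative:

```python
import numpy as np

def forward(x, pi, A, phi):
    """Forward (alpha) recursion for an HMM with discrete emissions.

    x:   length-N array of observed symbols in {0, ..., V-1}
    pi:  length-K initial state distribution
    A:   K x K transition matrix, A[j, k] = p(z_n = k | z_{n-1} = j)
    phi: K x V emission matrix, phi[k, v] = p(x_n = v | z_n = k)
    Returns alpha, an N x K array with alpha[n, k] = p(x_1..x_n, z_n = k).
    """
    N, K = len(x), len(pi)
    alpha = np.zeros((N, K))
    alpha[0] = pi * phi[:, x[0]]                     # initial condition
    for n in range(1, N):
        # O(K^2) per step: sum over the previous state, then multiply by the emission term.
        alpha[n] = (alpha[n - 1] @ A) * phi[:, x[n]]
    return alpha

# The likelihood is then p(X) = alpha[N - 1].sum().
# (In practice one rescales alpha at each step to avoid numerical underflow.)
```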


The Backward (β) Recursion
• There is also a simple recursion for β(zn):

The Backward (β) Recursion
• Illustration of the backward recursion:

• Initial condition:

• Hence:
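A matching sketch of the backward pass, under the same (illustrative) conventions as the forward code above:

```python
import numpy as np

def backward(x, A, phi):
    """Backward (beta) recursion for an HMM with discrete emissions.

    A is K x K, phi is K x V, x is a length-N array of symbols.
    Returns beta, an N x K array with beta[n, k] = p(x_{n+1}..x_N | z_n = k).
    """
    N, K = len(x), A.shape[0]
    beta = np.zeros((N, K))
    beta[N - 1] = 1.0                                # initial condition: beta(zN) = 1
    for n in range(N - 2, -1, -1):
        # Sum over z_{n+1}: transition * emission at n+1 * future beta.
        beta[n] = A @ (phi[:, x[n + 1]] * beta[n + 1])
    return beta
```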
The Forward-Backward Algorithm
• α(znk) gives the total inflow of probability to node (n,k).
• β(znk) gives the total outflow of probability.

• Bugs again: we just let the bugs run forward from time 1 to n and backward from
time N to n.

• In fact, we can just do one forward pass to compute all the α(zn) and one backward
pass to compute all the β(zn), and then compute any γ(zn) we want. Total cost is
O(K^2 N).
Computing Likelihood
• Note that we can compute the likelihood at any time n using the α - β recursion:

p(X) = Σzn α(zn) β(zn).

• In the forward calculation we proposed originally, we did this at the final
time step n = N, because β(zN) = 1.

• This is a good way to check your code!


Two-Frame Inference
• We will also need the cross-time statistics for adjacent time steps:

• This is a K × K matrix with elements ξ(i,j) representing the expected number of
transitions from state i to state j that begin at time n-1, given all the observations.

• It can be computed with the same α and β recursions.
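In the standard notation, this two-slice posterior is:

```latex
\xi(z_{n-1}, z_n) \equiv p(z_{n-1}, z_n \mid X)
 = \frac{\alpha(z_{n-1})\, p(x_n \mid z_n)\, p(z_n \mid z_{n-1})\, \beta(z_n)}{p(X)}
```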
EM algorithm
• Intuition: if only we knew the true state path then ML parameter estimation
would be trivial.
• But we can estimate the state path using the DP trick.

Baum-Welch (EM) Training
• E-step: Compute the posterior distribution over the state path using the α - β
recursion (dynamic programming):

• M-step: Maximize the expected complete data log-likelihood (parameter
re-estimation):

• We then iterate. This is also known as the Baum-Welch algorithm (a special case
of EM): estimate the states, re-estimate the parameters, then re-estimate the states,
and so on. It works, and we can prove that it always improves the likelihood.

• In general, finding the ML parameters is NP hard, so initial conditions matter a lot
and convergence is hard to tell.
Complete Data Log-likelihood
• The complete data log-likelihood takes the form shown below; it contains an
initial-state term, a transition-model term, and an observation-model term.
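A reconstruction in the standard 1-of-K notation (the comments label the three terms):

```latex
\ln p(X, Z \mid \theta)
 = \sum_{k=1}^{K} z_{1k} \ln \pi_k                                              % initial-state term
 + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} z_{n-1,j}\, z_{nk} \ln A_{jk}   % transition model
 + \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln p(x_n \mid \phi_k)                  % observation model
```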

• The statistics we need from the E-step are the expected sufficient statistics
γ(zn) = p(zn | X) and ξ(zn-1, zn) = p(zn-1, zn | X).
Expected Complete Data Log-likelihood
• The complete data log-likelihood takes form:

• Hence in the E-step we evaluate:
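In the standard notation, using the E-step statistics γ and ξ defined earlier:

```latex
Q(\theta, \theta^{\text{old}})
 = \sum_{k=1}^{K} \gamma(z_{1k}) \ln \pi_k
 + \sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi(z_{n-1,j}, z_{nk}) \ln A_{jk}
 + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln p(x_n \mid \phi_k)
```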

• In the M-step we optimize Q with respect to parameters:


Parameter Estimation
• Initial state distribution: expected number of times in state k at time 1:

• Expected # of transitions from state j to k which begin at time n-1:

so the estimated transition probabilities are:

• The EM algorithm must be initialized by choosing starting values for π and A.

• Note that any elements of π or A that initially are set to zero will remain zero in
subsequent EM updates.
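A minimal sketch of these two updates, assuming the E-step has produced gamma (N x K) and xi ((N-1) x K x K) arrays; the names and shapes are illustrative:

```python
import numpy as np

def m_step_pi_A(gamma, xi):
    """M-step updates for the initial distribution and the transition matrix.

    gamma: N x K array, gamma[n, k] = p(z_n = k | X)
    xi:    (N-1) x K x K array, xi[n, j, k] = p(z_n = j, z_{n+1} = k | X)
    """
    pi_new = gamma[0] / gamma[0].sum()             # expected occupancy of each state at n = 1
    expected_transitions = xi.sum(axis=0)          # expected number of j -> k transitions
    A_new = expected_transitions / expected_transitions.sum(axis=1, keepdims=True)
    return pi_new, A_new
```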
Parameter Estimation: Emission Model
• For the case of discrete multinomial observed variables, the observation
model takes the form (same as fitting a Bernoulli mixture model):

• And the corresponding M-step update:

• For the case of the Gaussian emission model (remember the Gaussian density):

• And the corresponding M-step updates
(same as fitting a Gaussian mixture model):
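For reference, the standard Gaussian-emission updates, weighted by the responsibilities γ(znk), are:

```latex
\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})},
\qquad
\Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\,(x_n - \mu_k)(x_n - \mu_k)^{\mathsf{T}}}{\sum_{n=1}^{N} \gamma(z_{nk})}
```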
Viterbi Decoding
• The numbers γ(zn) above gave the probability distribution over all states at
any time.

• By choosing the state γ*(zn) with the largest probability at each time, we
can make an “average” state path. This is the path with the maximum
expected number of correct states.

• To find the single best path, we do Viterbi decoding, which is Bellman’s
dynamic programming algorithm applied to this problem.

• The recursions look the same, except with max instead of ∑.

• Same dynamic programming trick: instead of summing, we keep the term
with the highest value at each node.

• There is also a modified EM (Baum-Welch) training based on Viterbi
decoding, like K-means instead of mixtures of Gaussians.

• Remember the max-sum algorithm for tree-structured graphical models.
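A minimal log-space sketch of Viterbi decoding, under the same discrete-emission conventions as the earlier code (it assumes strictly positive entries in pi, A, and phi, or a small floor added to them):

```python
import numpy as np

def viterbi(x, pi, A, phi):
    """Most probable state path for an HMM with discrete emissions.

    x: length-N array of symbols; pi: length-K initial distribution;
    A: K x K transition matrix; phi: K x V emission matrix.
    """
    N, K = len(x), len(pi)
    log_A, log_phi = np.log(A), np.log(phi)
    omega = np.zeros((N, K))             # omega[n, k] = best log-prob of any path ending in state k
    back = np.zeros((N, K), dtype=int)   # back-pointers for recovering the best path
    omega[0] = np.log(pi) + log_phi[:, x[0]]
    for n in range(1, N):
        scores = omega[n - 1][:, None] + log_A   # scores[j, k]: previous state j -> current state k
        back[n] = scores.argmax(axis=0)          # keep only the best incoming term (max instead of sum)
        omega[n] = scores.max(axis=0) + log_phi[:, x[n]]
    # Trace back the single best path.
    path = np.zeros(N, dtype=int)
    path[-1] = omega[-1].argmax()
    for n in range(N - 2, -1, -1):
        path[n] = back[n + 1, path[n + 1]]
    return path
```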


Viterbi Decoding
• A fragment of the HMM lattice showing two possible paths:

• Viterbi decoding efficiently determines the most probable path from the
exponentially many possibilities.

• The probability of each path is given by the product of the elements of the
transition matrix Ajk, along with the emission probabilities associated with
each node in the path.
Using HMMs for Recognition
• We can use HMMs for recognition by:
- training one HMM for each class (requires labeled training data)
- evaluating the probability of an unknown sequence under each HMM
- classifying the unknown sequence by choosing the HMM with the highest likelihood

• This requires the solution of two problems:
- Given a model, evaluate the probability of a sequence. (We can do this exactly
and efficiently.)
- Given some training sequences, estimate the model parameters. (We can find a
local maximum using EM, typically the one nearest our starting point in parameter
space.)
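A minimal sketch of this recipe, using a rescaled forward pass for the likelihood; the models dictionary and the function names are illustrative:

```python
import numpy as np

def log_likelihood(x, pi, A, phi):
    """log p(x_1..x_N) under a discrete-emission HMM, via a rescaled forward pass."""
    alpha = pi * phi[:, x[0]]
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()                              # rescale to avoid numerical underflow
    for obs in x[1:]:
        alpha = (alpha @ A) * phi[:, obs]
        log_lik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return log_lik

def classify(x, models):
    """Pick the class whose HMM assigns the unknown sequence the highest likelihood.

    models: dict mapping a class label to its trained (pi, A, phi) parameters.
    """
    return max(models, key=lambda c: log_likelihood(x, *models[c]))
```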
Autoregressive HMMs
• One limitation of the standard HMM is that it is poor at capturing long-
range correlations between observations, as these have to be mediated via
the first order Markov chain of hidden states.

• Autoregressive HMM: The distribution over xn depends on a
subset of previous observations.
• The number of additional links must be limited to avoid an excessive
number of free parameters.

• The graphical model framework motivates a number of different models
based on HMMs.
Input-Output HMMs
• Both the emission probabilities and the transition probabilities depend on
the values of a sequence of observations u1,…,uN.

• Model parameters can be efficiently fit using EM, in which the E-step
involves forward-backward recursion.
Factorial HMMs
• Example of a Factorial HMM comprising two Markov chains of latent
variables:

• Motivation: In order to represent 10 bits of information at a given time step,
a standard HMM would need K = 2^10 = 1024 states.

• Factorial HMMs would instead use 10 binary chains.
• Much more powerful model.

• The key disadvantage: Exact inference is intractable.


• Observing the x variables introduces dependencies between latent chains.
• Hence the E-step for this model does not correspond to running the forward-
backward recursion along the M latent chains independently.
Factorial HMMs
• The conditional independence property zn+1 ⊥ zn-1 | zn does not hold for the
individual latent chains.

• There is no efficient exact E-step for this model.

• One solution would be to use MCMC techniques to obtain approximate
samples from the posterior.

• Another alternative is to resort to variational inference.


• The variational distribution can be described by M separate Markov chains
corresponding to the latent chains in the original model (structured mean-
field approximation).
Regularizing HMMs
• There are two problems:
- for high dimensional outputs, there are lots of parameters in the emission model
- with many states, the transition matrix has many (squared) elements

• First problem: full covariance matrices in high dimensions, or discrete symbol
models with many symbols, have lots of parameters. To estimate these
accurately requires a lot of training data.

• We can use mixtures of diagonal covariance Gaussians.

• For discrete data, we can use mixtures of base rates.

• We can also tie parameters across states.
Regularizing Transition Matrices
• One way to regularize large transition matrices is to constrain them to be
relatively sparse: instead of being allowed to transition to any other state,
each state has only a few possible successor states.

• For example, if each state has at most p possible next states, then the cost of
inference is O(pKT) and the number of parameters is O(pK + KM), which are
both linear in the number of states.

• A very effective way to constrain the transitions is to order the states in the
HMM and allow transitions only to states that come later in the ordering.

• Such models are known as “linear HMMs”, “chain HMMs” or “left-to-right
HMMs”. The transition matrix is upper-diagonal (usually it only has a few bands).
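A small illustrative sketch of such a banded, upper-triangular transition matrix (the random initialization and the bandwidth are arbitrary choices); as noted earlier, entries set to zero here will stay zero under subsequent EM updates:

```python
import numpy as np

def left_to_right_transition_matrix(K, bandwidth=2, seed=0):
    """Transition matrix where state k may only move to states k, ..., k + bandwidth."""
    rng = np.random.default_rng(seed)
    A = np.triu(rng.random((K, K)))                  # allow transitions only to the same or later states
    A[np.triu_indices(K, k=bandwidth + 1)] = 0.0     # zero out transitions beyond the band
    return A / A.sum(axis=1, keepdims=True)          # normalize each row into a distribution

A = left_to_right_transition_matrix(K=5, bandwidth=1)
```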
Linear Dynamical Systems
• In HMMs, latent variables are discrete but with arbitrary emission
probability distributions.

• We now consider a linear-Gaussian state-space model, so that the latent
variables and the observed variables are multivariate Gaussian.

• An HMM can be viewed as an extension of mixture models to allow for
sequential correlations in the data.

• Similarly, the linear dynamical system (LDS) can be viewed as a
generalization of continuous latent variable models, such as probabilistic
PCA.
Linear Dynamical Systems
• The model is represented by a tree-structured directed graph, so inference
can be solved efficiently using the sum-product algorithm.

• The forward recursions, analogous to the α-messages of HMMs, are known
as the Kalman filter equations.

• The backward recursions, analogous to the β-messages, are known as the
Kalman smoother equations.

• The Kalman filter is used in many real-time tracking applications.

• Because the LDS is a linear-Gaussian model, the joint distribution over all
variables, as well as marginals and conditionals, will be Gaussian.

• This leads to tractable inference and learning.


The Model
• We can write the transition and emission distributions in the general form:

• These can be expressed in terms of noisy linear equations:
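In PRML-style notation (A, C, Γ, Σ, μ0, V0 denote the transition matrix, emission matrix, noise covariances, and initial-state parameters; the symbols are conventional rather than taken from the slide figures):

```latex
z_n = A z_{n-1} + w_n, \qquad w_n \sim \mathcal{N}(0, \Gamma)
x_n = C z_n + v_n,     \qquad v_n \sim \mathcal{N}(0, \Sigma)
z_1 = \mu_0 + u,       \qquad u \sim \mathcal{N}(0, V_0)
```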

• Model parameters can be learned using the EM algorithm (similar to the
standard HMM case).
Inference in LDS
• Consider the forward equations. The initial message is Gaussian, and since each
of the factors is Gaussian, all subsequent messages will also be Gaussian.

• Similar to HMMs, let us define the normalized version of α(zn):

Remember: for HMMs


• Using forward recursion, we get:
Inference in LDS
• Hence we obtain:

in which case α(zn) is Gaussian:

and we have also defined the Kalman gain matrix:


Kalman Filter
• Let us examine the evolution of the mean. The terms in the update are:
- the predicted mean over zn;
- the prediction of the observation for xn;
- the predicted mean plus a correction term, controlled by the Kalman gain matrix,
applied to the error between the predicted observation and the actual observation xn.

• We can view the Kalman filter as a process of making subsequent predictions
and then correcting these predictions in the light of the new observations.
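A minimal sketch of one predict-correct step in the same conventional notation (A, C are the transition and emission matrices, Gamma and Sigma the process and observation noise covariances; none of these names are taken from the slides):

```python
import numpy as np

def kalman_filter_step(mu, V, x, A, C, Gamma, Sigma):
    """One Kalman filter update: mu, V are the previous filtered mean and covariance."""
    # Predict: push the previous belief through the transition model.
    mu_pred = A @ mu
    P = A @ V @ A.T + Gamma
    # Kalman gain: how strongly to weight the new observation.
    S = C @ P @ C.T + Sigma
    K = P @ C.T @ np.linalg.inv(S)
    # Correct: predicted mean plus the gain times the error between the
    # actual observation and the predicted observation.
    mu_new = mu_pred + K @ (x - C @ mu_pred)
    V_new = (np.eye(len(mu)) - K @ C) @ P
    return mu_new, V_new
```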
Kalman Filter
• Example: blue curve → red curve (after incorporating the transition model) → blue
curve (after incorporating the new observation; the density of the new point is given
by the green curve).

• The new observation has shifted and narrowed the distribution (compare with the
red curve).
Tracking Example
• LDS that is being used to track a moving object in 2-D space:

• Blue points indicate the true position of the object.


• Green points denote the noisy measurements.
• Red crosses indicate the means of the posterior distributions over the positions,
as inferred by the Kalman filter.
Particle Filters
• For dynamical systems that are non-Gaussian (e.g. emission densities are non-
Gaussian), we can use sampling methods to find a tractable solution to the
inference problem.
• Consider a class of distributions represented by the graphical model:

• Suppose we observed Xn = {x1,…,xn}, and we wish to approximate:
Particle Filters
• Hence

with importance weights:

• Hence the posterior p(zn | Xn) is represented by the set of L samples together
with the corresponding importance weights.

• We would like to define a sequential algorithm.


• Suppose that a set of samples and weights have been obtained at time step n.
• We wish to find the set of new samples and weights at time step n+1.
Particle Filters
• From our previous result, let

• Summary of the particle filter algorithm:


- At time n, we have a sample representation of the posterior distribution
p(zn | Xn) expressed as L samples with corresponding weights.
- We next draw L samples from the mixture distribution (above).
- For each sample, we then use the new observation to re-evaluate the
weights:
Example

• At time n, the posterior p(zn | Xn) is represented as a mixture distribution.


• We draw a set of L samples from this distribution (incorporating the transition
model).
• The new weights are evaluated by incorporating the new observation xn+1.
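A minimal sketch of this sequential update; transition_sample and emission_density stand in for whatever transition and emission models are being used (the names are illustrative):

```python
import numpy as np

def particle_filter_step(particles, weights, x_new, transition_sample, emission_density, rng=None):
    """One step of a bootstrap-style particle filter.

    particles: (L, D) array of samples representing p(z_n | X_n)
    weights:   length-L importance weights (summing to 1)
    x_new:     the new observation x_{n+1}
    transition_sample: function z_n -> a sample of z_{n+1}   (the transition model)
    emission_density:  function (x, z) -> p(x | z)            (the emission model)
    """
    if rng is None:
        rng = np.random.default_rng()
    L = len(weights)
    # Draw L samples from the weighted mixture: resample, then propagate through the transition model.
    idx = rng.choice(L, size=L, p=weights)
    new_particles = np.array([transition_sample(particles[i]) for i in idx])
    # Re-evaluate the weights by incorporating the new observation.
    new_weights = np.array([emission_density(x_new, z) for z in new_particles])
    new_weights /= new_weights.sum()
    return new_particles, new_weights
```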
