Learning With Hidden Variables - EM Algorithm

The document discusses the Expectation-Maximization (EM) algorithm and its applications. EM is an iterative method for finding maximum likelihood estimates of parameters in probabilistic models, when the model depends on unobserved latent variables. It involves alternating between performing an expectation (E) step, to compute the expected value of the log-likelihood, and a maximization (M) step to compute the parameter values that maximize the expected log-likelihood found in the E step. The document describes how EM can be applied to problems like training Gaussian mixture models and learning stochastic string edit distances.


Expectation-Maximization Algorithm and Applications

Eugene Weinstein, Courant Institute of Mathematical Sciences, November 14th, 2006

List of Concepts
Maximum-Likelihood Estimation (MLE)
Expectation-Maximization (EM)
Conditional Probability
Mixture Modeling
Gaussian Mixture Models (GMMs)
String edit-distance
Forward-backward algorithms

Overview
Expectation-Maximization
Mixture Model Training
Learning String Edit-Distance

One-Slide MLE Review


Say I give you a coin with $P(\text{heads}) = \theta$, but I don't tell you the value of $\theta$. Now say I let you flip the coin n times, and you get h heads and n - h tails.

What is the natural estimate of $\theta$? It is $\hat{\theta} = h / n$.

More formally, the likelihood of $\theta$ is governed by a binomial distribution:
$L(\theta) = P(h \text{ heads in } n \text{ flips} \mid \theta) = \binom{n}{h}\,\theta^h (1 - \theta)^{n - h}$

Can prove $\hat{\theta} = h / n$ is the maximum-likelihood estimate of $\theta$: differentiate the log-likelihood with respect to $\theta$ and set it equal to 0.
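To make this concrete, here is a minimal sketch (not from the slides) that checks the closed-form estimate h/n against a coarse grid search over the binomial log-likelihood; the flip counts are made up for illustration.

```python
import math

def binomial_log_likelihood(theta, n, h):
    """Log of C(n, h) * theta^h * (1 - theta)^(n - h)."""
    return (math.lgamma(n + 1) - math.lgamma(h + 1) - math.lgamma(n - h + 1)
            + h * math.log(theta) + (n - h) * math.log(1 - theta))

n, h = 100, 37                       # hypothetical experiment: 37 heads in 100 flips
mle = h / n                          # closed-form maximum-likelihood estimate

# A grid search over theta confirms the closed-form answer (up to grid resolution).
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: binomial_log_likelihood(t, n, h))
print(f"closed-form MLE = {mle:.3f}, grid-search argmax = {best:.3f}")
```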

EM Motivation
So, to solve any ML-type problem, we analytically maximize the likelihood function?
Seems to work for the 1-D Bernoulli (coin toss)
Also works for the 1-D Gaussian (find $\mu$, $\sigma^2$)

Not quite:
The distribution may not be well-behaved, or may have too many parameters
Say your likelihood function is a mixture of 1000 1000-dimensional Gaussians (about a million parameters)
Direct maximization is not feasible

Solution: introduce hidden variables to
Simplify the likelihood function (more common)
Account for actual missing data

Hidden and Observed Variables


Observed variables: directly measurable from the data, e.g.
The waveform values of a speech recording
Is it raining today?
Did the smoke alarm go off?

Hidden variables: influence the data, but are not trivial to measure, e.g.
The phonemes that produce a given speech recording
P(rain today | rain yesterday)
Is the smoke alarm malfunctioning?

Expectation-Maximization
Model dependent random variables:
Observed variable x
Unobserved (hidden) variable y that generates x

Assume probability distributions $P(x \mid \theta)$ and $P(x, y \mid \theta)$, where $\theta$ represents the set of all parameters of the distribution.

Repeat until convergence:
E-step: compute the expectation
$Q(\theta, \theta') = E[\log P(x, y \mid \theta) \mid x, \theta']$
($\theta'$: old distribution parameters, $\theta$: new distribution parameters)
M-step: find the $\theta$ that maximizes $Q(\theta, \theta')$

Conditional Expectation Review


Let X, Y be random variables drawn from the distributions P(x) and P(y).
The conditional distribution is given by $P(y \mid x) = P(x, y) / P(x)$.
Then $E[Y \mid X = x] = \sum_y y\, P(y \mid x)$.
For a function h(Y): $E[h(Y)] = \sum_y h(y)\, P(y)$.
Given a particular value of X (X = x): $E[h(Y) \mid X = x] = \sum_y h(y)\, P(y \mid x)$.
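As a small illustration of the last identity (my own example, not the slides'), the sketch below tabulates a made-up joint distribution P(x, y) and computes E[h(Y) | X = x].

```python
# Hypothetical joint distribution P(x, y) over X in {0, 1}, Y in {1, 2, 3}.
joint = {
    (0, 1): 0.10, (0, 2): 0.20, (0, 3): 0.10,
    (1, 1): 0.05, (1, 2): 0.25, (1, 3): 0.30,
}

def conditional_expectation(h, x):
    """E[h(Y) | X = x] = sum_y h(y) * P(y | x), with P(y | x) = P(x, y) / P(x)."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return sum(h(y) * p / p_x for (xi, y), p in joint.items() if xi == x)

print(conditional_expectation(lambda y: y, x=1))       # E[Y | X = 1]
print(conditional_expectation(lambda y: y * y, x=1))   # E[Y^2 | X = 1]
```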

Maximum Likelihood Problem


Want to pick the $\theta$ that maximizes the log-likelihood of the observed (x) and unobserved (y) variables, given:
The observed variable x
The previous parameters $\theta'$

The conditional expectation of $\log P(x, y \mid \theta)$ given x and $\theta'$ is
$Q(\theta, \theta') = E[\log P(x, y \mid \theta) \mid x, \theta'] = \sum_y \log P(x, y \mid \theta)\, P(y \mid x, \theta')$

EM Derivation
Lemma (special case of Jensen's inequality): let p(x), q(x) be probability distributions. Then
$\sum_x p(x) \log p(x) \ge \sum_x p(x) \log q(x)$

Proof: rewrite as
$\sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0$

Interpretation: the relative entropy (KL divergence) between two distributions is non-negative.
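A quick numerical check of the lemma (my own illustration, with arbitrary example distributions): the relative entropy Σ_x p(x) log(p(x)/q(x)) is positive for distinct distributions and zero when p = q.

```python
import math

def relative_entropy(p, q):
    """KL divergence sum_x p(x) * log(p(x) / q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]              # arbitrary example distributions
q = [0.2, 0.5, 0.3]
print(relative_entropy(p, q))    # > 0, i.e. sum p log p >= sum p log q
print(relative_entropy(p, p))    # == 0 when the distributions coincide
```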

EM Derivation
EM Theorem:
If $Q(\theta, \theta') \ge Q(\theta', \theta')$, then $P(x \mid \theta) \ge P(x \mid \theta')$.

Proof:
By some algebra and the lemma,
$\log P(x \mid \theta) - \log P(x \mid \theta') \ge Q(\theta, \theta') - Q(\theta', \theta')$
So, if the quantity on the right is non-negative, so is the one on the left.
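For completeness, the algebra alluded to in the proof is the standard decomposition of the observed-data log-likelihood (the intermediate quantity H below is my own label, not from the slides):

$\log P(x \mid \theta) = Q(\theta, \theta') - \sum_y P(y \mid x, \theta') \log P(y \mid x, \theta) = Q(\theta, \theta') + H(\theta, \theta')$

Applying the lemma with $p(\cdot) = P(\cdot \mid x, \theta')$ and $q(\cdot) = P(\cdot \mid x, \theta)$ gives $H(\theta, \theta') \ge H(\theta', \theta')$, and subtracting the same identity evaluated at $\theta'$ yields

$\log P(x \mid \theta) - \log P(x \mid \theta') \ge Q(\theta, \theta') - Q(\theta', \theta')$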

EM Summary
Repeat until convergence:
E-step: compute the expectation $Q(\theta, \theta') = E[\log P(x, y \mid \theta) \mid x, \theta']$
($\theta'$: old distribution parameters, $\theta$: new distribution parameters)
M-step: find the $\theta$ that maximizes $Q(\theta, \theta')$

EM Theorem:
If $Q(\theta, \theta') \ge Q(\theta', \theta')$, then $P(x \mid \theta) \ge P(x \mid \theta')$.

Interpretation:
As long as we can improve the expectation of the log-likelihood, EM improves our model of the observed variable x.
Actually, it is not necessary to maximize the expectation; we just need to make sure that it increases. This is called Generalized EM.

EM Comments
In practice, x is a series of data points $x_1, \ldots, x_n$.
To calculate the expectation, we can assume the points are i.i.d. and sum over all of them:
$Q(\theta, \theta') = \sum_{i=1}^n \sum_y \log P(x_i, y \mid \theta)\, P(y \mid x_i, \theta')$

Problems with EM?
Local maxima
Need to bootstrap the training process (pick an initial $\theta$)

When is EM most useful?
When the model distributions are easy to maximize (e.g., Gaussian mixture models)

EM is a meta-algorithm that needs to be adapted to the particular application.

Overview
Expectation-Maximization
Mixture Model Training
Learning String Distance

EM Applications: Mixture Models


Gaussian/normal distribution:
$N(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
Parameters: mean $\mu$ and variance $\sigma^2$
In the multi-dimensional case, assume an isotropic Gaussian: the same variance in all dimensions
We can model arbitrary distributions with density mixtures

Density Mixtures
Combine m elementary densities to model a complex data distribution:
$P(x \mid \Theta) = \sum_{k=1}^m w_k\, N(x; \mu_k, \sigma_k), \qquad \sum_{k=1}^m w_k = 1$
The kth Gaussian is parametrized by its mean $\mu_k$ and variance $\sigma_k^2$, and enters the mixture with weight $w_k$.

Log-likelihood function of the data x given $\Theta$:
$L(\Theta) = \log P(x \mid \Theta) = \sum_{i=1}^n \log \sum_{k=1}^m w_k\, N(x_i; \mu_k, \sigma_k)$

The log of a sum is hard to optimize analytically! Instead, introduce a hidden variable y:
y = k: x generated by Gaussian k

EM formulation: maximize $Q(\Theta, \Theta')$
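To see why the log of a sum is awkward, here is a small sketch (my own, with made-up weights and parameters) that evaluates the 1-D mixture log-likelihood; the sum sits inside the logarithm, so differentiating with respect to any single mean entangles all components.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1-D normal density N(x; mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_log_likelihood(data, weights, mus, sigmas):
    """sum_i log sum_k w_k N(x_i; mu_k, sigma_k^2) -- the log of a sum."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, mu, s) for w, mu, s in zip(weights, mus, sigmas)))
        for x in data
    )

# Hypothetical two-component mixture and a few data points.
data = [-1.2, -0.8, 0.1, 2.9, 3.4]
print(mixture_log_likelihood(data, weights=[0.5, 0.5], mus=[-1.0, 3.0], sigmas=[1.0, 1.0]))
```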

Gaussian Mixture Model EM


Goal: maximize the log-likelihood $L(\Theta)$ of
n (observed) data points: $x_1, \ldots, x_n$
n (hidden) labels: $y_1, \ldots, y_n$
$y_i = k$: $x_i$ generated by Gaussian k

Several pages of math later, we get:
E step: compute the likelihood of each label given the current parameters, $P(y_i = k \mid x_i, \Theta')$, for every point i and component k
M step: update $w_k$, $\mu_k$, $\sigma_k$ for each Gaussian $k = 1..m$:
$w_k = \frac{1}{n}\sum_i P(y_i = k \mid x_i, \Theta'), \quad \mu_k = \frac{\sum_i P(y_i = k \mid x_i, \Theta')\, x_i}{\sum_i P(y_i = k \mid x_i, \Theta')}, \quad \sigma_k^2 = \frac{\sum_i P(y_i = k \mid x_i, \Theta')\,(x_i - \mu_k)^2}{\sum_i P(y_i = k \mid x_i, \Theta')}$
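Here is a minimal 1-D sketch of those updates (my own simplification, not the slides' derivation; the initialization and iteration count are arbitrary): the E-step computes the responsibilities P(y_i = k | x_i, Θ) and the M-step re-estimates each w_k, μ_k, σ_k from them.

```python
import math, random

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gmm_em(data, m, iterations=50):
    """EM for a 1-D Gaussian mixture with m components."""
    # Bootstrap: random means, unit variances, uniform weights.
    mus = random.sample(data, m)
    sigmas = [1.0] * m
    weights = [1.0 / m] * m
    for _ in range(iterations):
        # E-step: responsibility r[i][k] = P(y_i = k | x_i, current parameters).
        r = []
        for x in data:
            joint = [w * gaussian_pdf(x, mu, s) for w, mu, s in zip(weights, mus, sigmas)]
            total = sum(joint)
            r.append([j / total for j in joint])
        # M-step: re-estimate weights, means, and variances from the expected counts.
        for k in range(m):
            nk = sum(r[i][k] for i in range(len(data)))
            weights[k] = nk / len(data)
            mus[k] = sum(r[i][k] * x for i, x in enumerate(data)) / nk
            var = sum(r[i][k] * (x - mus[k]) ** 2 for i, x in enumerate(data)) / nk
            sigmas[k] = math.sqrt(max(var, 1e-6))   # guard against collapsing components
    return weights, mus, sigmas

# Toy data drawn from two well-separated Gaussians.
data = [random.gauss(-2.0, 0.5) for _ in range(200)] + [random.gauss(3.0, 1.0) for _ in range(200)]
print(gmm_em(data, m=2))
```

Consistent with the EM theorem above, each iteration of this loop cannot decrease the data log-likelihood, although it may converge to a local maximum depending on the random initialization.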

GMM-EM Discussion
Summary:
EM is naturally applicable to training probabilistic models
EM is a generic formulation; you need to do some hairy math to get to an implementation

Problems with GMM-EM?
Local maxima
Need to bootstrap the training process (pick an initial $\Theta$)

GMM-EM is applicable to an enormous number of pattern recognition tasks: speech, vision, etc.
Hours of fun with GMM-EM

Overview
Expectation-Maximization
Mixture Model Training
Learning String Distance

String Edit-Distance
Notation: operate on two strings, x and y
Edit-distance: the cost of transforming one string into the other using:
Substitution: kitten → bitten, with a per-substitution cost
Insertion: cop → crop, with a per-insertion cost
Deletion: learn → earn, with a per-deletion cost

Can be computed efficiently with a recursive dynamic program (a sketch follows below).
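The recursive computation mentioned above is the standard dynamic program; a minimal sketch, assuming unit costs (the slides do not fix the cost values):

```python
def edit_distance(x, y, sub_cost=1, ins_cost=1, del_cost=1):
    """Levenshtein distance computed with the usual dynamic program in O(|x||y|) time."""
    d = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        d[i][0] = i * del_cost                      # delete all of x[:i]
    for j in range(1, len(y) + 1):
        d[0][j] = j * ins_cost                      # insert all of y[:j]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost),  # substitution/match
                d[i][j - 1] + ins_cost,                                       # insertion
                d[i - 1][j] + del_cost,                                       # deletion
            )
    return d[len(x)][len(y)]

print(edit_distance("kitten", "bitten"))   # 1 substitution
print(edit_distance("cop", "crop"))        # 1 insertion
print(edit_distance("learn", "earn"))      # 1 deletion
```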

Stochastic String Edit-Distance


Instead of setting costs by hand, model the edit operation sequence as a random process
Edit operations are selected according to a probability distribution $\delta(\cdot)$
For an edit operation sequence $z_1 \ldots z_n$, view string edit-distance as generated by a process that is
memoryless (Markov): each edit operation is selected independently of the previous ones
stochastic: the random process of selecting edit operations according to $\delta(\cdot)$ is governed by a true probability distribution
transducer: the process reads an input string and writes an output string, and can be drawn as a weighted transducer (next slide)

Edit-Distance Transducer
(Figure: the edit-distance transducer drawn as a weighted finite-state transducer.)
Arc label a:b/0 means: input a, output b, weight 0.

Two Distances
Define the yield of an edit sequence $z_1 \ldots z_n \#$ as the set of all string pairs ⟨x, y⟩ such that $z_1 \ldots z_n \#$ turns x into y
Viterbi edit-distance: the negative log-likelihood of the most likely edit sequence turning x into y
Stochastic edit-distance: the negative log-likelihood of all edit sequences turning x into y (i.e., of their summed probability)

Evaluating Likelihood
Viterbi: $d_V(x, y) = -\log \max_{z} P(z)$, maximizing over edit sequences z whose yield contains ⟨x, y⟩
Stochastic: $d_S(x, y) = -\log \sum_{z} P(z)$, summing over the same set
Both require a calculation over all possible edit sequences, and the number of possibilities grows exponentially in the string lengths (a choice among three edit operations at every step)

However, the memoryless assumption allows us to compute the likelihood efficiently:
Use the forward-backward method!

Forward
Evaluation of forward probabilities $\alpha(i, j)$: the likelihood of picking an edit sequence that generates the prefix pair $(x_1 \ldots x_i,\; y_1 \ldots y_j)$
The memoryless assumption allows efficient recursive computation:
$\alpha(0, 0) = 1$
$\alpha(i, j) = \delta(x_i, \epsilon)\,\alpha(i-1, j) + \delta(\epsilon, y_j)\,\alpha(i, j-1) + \delta(x_i, y_j)\,\alpha(i-1, j-1)$

Backward
Evaluation of backward probabilities $\beta(i, j)$: the likelihood of picking an edit sequence that generates the suffix pair $(x_{i+1} \ldots x_{|x|},\; y_{j+1} \ldots y_{|y|})$ and then terminates
The memoryless assumption allows efficient recursive computation:
$\beta(|x|, |y|) = \delta(\#)$
$\beta(i, j) = \delta(x_{i+1}, \epsilon)\,\beta(i+1, j) + \delta(\epsilon, y_{j+1})\,\beta(i, j+1) + \delta(x_{i+1}, y_{j+1})\,\beta(i+1, j+1)$
With this convention, $\beta(0, 0)$ equals the total likelihood of the pair ⟨x, y⟩.

EM Formulation
Edit operations are selected according to a probability distribution $\delta(\cdot)$
So, EM has to update $\delta(\cdot)$ based on occurrence counts of each operation (similar to the coin-tossing example)
Idea: accumulate expected counts from the forward and backward variables
$\gamma(z)$: expected count of edit operation z

EM Details

$\gamma(z)$: expected count of edit operation z, accumulated from the forward and backward variables. For example, for a substitution $z = (a, b)$:
$\gamma(a, b) \mathrel{+}= \sum_{i, j:\, x_i = a,\, y_j = b} \frac{\alpha(i-1, j-1)\,\delta(a, b)\,\beta(i, j)}{\beta(0, 0)}$
The M-step then renormalizes the expected counts: $\delta(z) = \gamma(z) / \sum_{z'} \gamma(z')$.
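Putting the forward variables, backward variables, and expected counts together, here is a compact sketch of one EM pass over a toy corpus, in the spirit of Ristad and Yianilos; the two-letter alphabet, the uniform initialization, and the string pairs are illustrative assumptions, and δ is represented as a dictionary keyed by edit operations plus the end symbol '#'.

```python
from collections import defaultdict

EPS, END = "", "#"          # empty symbol for insert/delete, end-of-edit-sequence symbol

def make_uniform_model(alphabet):
    """delta(z): one probability per substitution, insertion, deletion, plus the end symbol."""
    ops = ([(a, b) for a in alphabet for b in alphabet]        # substitutions (incl. identity)
           + [(EPS, b) for b in alphabet]                      # insertions
           + [(a, EPS) for a in alphabet])                     # deletions
    p = 1.0 / (len(ops) + 1)
    delta = {z: p for z in ops}
    delta[END] = p
    return delta

def forward(x, y, delta):
    """alpha[i][j]: likelihood of edit prefixes generating (x[:i], y[:j])."""
    a = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    a[0][0] = 1.0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i > 0:
                a[i][j] += delta[(x[i - 1], EPS)] * a[i - 1][j]           # deletion
            if j > 0:
                a[i][j] += delta[(EPS, y[j - 1])] * a[i][j - 1]           # insertion
            if i > 0 and j > 0:
                a[i][j] += delta[(x[i - 1], y[j - 1])] * a[i - 1][j - 1]  # substitution
    return a

def backward(x, y, delta):
    """beta[i][j]: likelihood of edit suffixes generating (x[i:], y[j:]) and then stopping."""
    b = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    b[len(x)][len(y)] = delta[END]
    for i in range(len(x), -1, -1):
        for j in range(len(y), -1, -1):
            if i < len(x):
                b[i][j] += delta[(x[i], EPS)] * b[i + 1][j]
            if j < len(y):
                b[i][j] += delta[(EPS, y[j])] * b[i][j + 1]
            if i < len(x) and j < len(y):
                b[i][j] += delta[(x[i], y[j])] * b[i + 1][j + 1]
    return b

def em_step(pairs, delta):
    """One EM pass: accumulate expected operation counts gamma(z), then renormalize."""
    gamma = defaultdict(float)
    for x, y in pairs:
        a, b = forward(x, y, delta), backward(x, y, delta)
        likelihood = b[0][0]                         # = alpha[|x|][|y|] * delta['#']
        gamma[END] += 1.0                            # every pair ends with exactly one '#'
        for i in range(len(x) + 1):
            for j in range(len(y) + 1):
                if i > 0:
                    gamma[(x[i - 1], EPS)] += a[i - 1][j] * delta[(x[i - 1], EPS)] * b[i][j] / likelihood
                if j > 0:
                    gamma[(EPS, y[j - 1])] += a[i][j - 1] * delta[(EPS, y[j - 1])] * b[i][j] / likelihood
                if i > 0 and j > 0:
                    gamma[(x[i - 1], y[j - 1])] += a[i - 1][j - 1] * delta[(x[i - 1], y[j - 1])] * b[i][j] / likelihood
    total = sum(gamma.values())
    return {z: gamma.get(z, 0.0) / total for z in delta}     # M-step: normalized counts

# Toy corpus of string pairs over a two-letter alphabet (purely illustrative).
pairs = [("ab", "ab"), ("ab", "b"), ("aab", "ab"), ("b", "ab")]
delta = make_uniform_model("ab")
for _ in range(10):
    delta = em_step(pairs, delta)
print(sorted(delta.items(), key=lambda kv: -kv[1])[:5])
```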

References
A. P. Dempster, N. M. Laird, and D. B. Rubin. "Maximum Likelihood from Incomplete Data via the EM Algorithm." Journal of the Royal Statistical Society B, 39(1), 1977, pp. 1-38.
C. F. J. Wu. "On the Convergence Properties of the EM Algorithm." The Annals of Statistics, 11(1), March 1983, pp. 95-103.
F. Jelinek. Statistical Methods for Speech Recognition, 1997.
M. Collins. "The EM Algorithm," 1997.
J. A. Bilmes. "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models." Technical Report TR-97-021, International Computer Science Institute, Berkeley, 1998.
E. S. Ristad and P. N. Yianilos. "Learning String Edit Distance." IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5), 1998, pp. 522-532.
L. R. Rabiner. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE, 77(2), 1989, pp. 257-286.
A. D'Souza. "Using EM To Estimate A Probablity [sic] Density With A Mixture Of Gaussians."
M. Mohri. "Edit-Distance of Weighted Automata." In Proc. Implementation and Application of Automata (CIAA), 2002, pp. 1-23.
J. Glass. Lecture notes, MIT course 6.345: Automatic Speech Recognition, 2003.
C. Tomasi. "Estimating Gaussian Mixture Densities with EM: A Tutorial," 2004.
Wikipedia.
