The document describes the Expectation-Maximization (EM) algorithm. It begins by explaining Kalman filtering and smoothing for state space models. It then introduces the EM algorithm, which simultaneously optimizes state estimates and model parameters given observed data. The document proceeds to derive the log-likelihood function for state space models and shows how to maximize it with respect to the model parameters to perform the M-step of the EM algorithm.

EM Algorithm

Jur van den Berg

Kalman Filtering vs. Smoothing

Dynamics and observation model:
  x_{t+1} = A x_t + w_t,  w_t ~ N(0, Q)
  y_t = C x_t + v_t,  v_t ~ N(0, R)

Kalman Filter:
  Compute X_t | Y_0 = y_0, ..., Y_t = y_t
  Real-time, given data so far

Kalman Smoother:
  Compute X_t | Y_0 = y_0, ..., Y_T = y_T, for 0 <= t <= T
  Post-processing, given all data
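To make the model concrete, here is a minimal simulation sketch of these two equations; the specific matrices A, C, Q, R, the dimensions, and the random seed are illustrative assumptions, not values from the slides.

import numpy as np

rng = np.random.default_rng(0)

# Assumed example system: 2-D state, 1-D observation.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])      # dynamics matrix
C = np.array([[1.0, 0.0]])      # observation matrix
Q = 0.01 * np.eye(2)            # process noise covariance
R = np.array([[0.1]])           # observation noise covariance

T = 100
x = np.zeros((T + 1, 2))
y = np.zeros((T + 1, 1))
for t in range(T + 1):
    if t > 0:
        x[t] = A @ x[t - 1] + rng.multivariate_normal(np.zeros(2), Q)   # dynamics: x_t = A x_{t-1} + w
    y[t] = C @ x[t] + rng.multivariate_normal(np.zeros(1), R)           # observation: y_t = C x_t + v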

EM Algorithm

  x_{t+1} = A x_t + w_t,  w_t ~ N(0, Q)
  y_t = C x_t + v_t,  v_t ~ N(0, R)

Kalman smoother:
  Compute distributions X_0, ..., X_T
  given parameters A, C, Q, R, and data y_0, ..., y_T.

EM Algorithm:
  Simultaneously optimize X_0, ..., X_T and A, C, Q, R
  given data y_0, ..., y_T.

Probability vs. Likelihood

Probability: predict unknown outcomes based on known parameters:
  p(x | θ)

Likelihood: estimate unknown parameters based on known outcomes:
  L(θ | x) = p(x | θ)

Coin-flip example:
  θ is the probability of heads (parameter)
  x = HHHTTH is the outcome

Likelihood for Coin-flip Example

Probability of outcome given parameter:
  p(x = HHHTTH | θ = 0.5) = 0.5^6 ≈ 0.016

Likelihood of parameter given outcome:
  L(θ = 0.5 | x = HHHTTH) = p(x | θ) ≈ 0.016

Likelihood is maximal when θ = 2/3 ≈ 0.667
The likelihood function is not a probability density
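A quick numerical check of this example (a minimal sketch; the grid resolution is an arbitrary choice):

import numpy as np

def likelihood(theta, heads=4, tails=2):
    # L(theta | x) = p(x | theta) for the sequence HHHTTH (4 heads, 2 tails)
    return theta**heads * (1.0 - theta)**tails

print(likelihood(0.5))                         # 0.5**6 ~= 0.016
thetas = np.linspace(0.0, 1.0, 1001)
print(thetas[np.argmax(likelihood(thetas))])   # ~0.667, i.e. 4/6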

Likelihood for Continuous Distributions

Six samples {-3, -2, -1, 1, 2, 3} believed to be drawn from some Gaussian N(0, σ²)

Likelihood of σ given the samples:
  L(σ | {-3, -2, -1, 1, 2, 3}) = p(x = -3 | σ) · p(x = -2 | σ) · ... · p(x = 3 | σ)

Maximum likelihood:
  σ = sqrt( ((-3)² + (-2)² + (-1)² + 1² + 2² + 3²) / 6 ) ≈ 2.16
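The same maximization can be checked numerically (a minimal sketch; the search grid is an arbitrary assumption):

import numpy as np

samples = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])

def log_likelihood(sigma, x=samples):
    # log L(sigma | x) for i.i.d. draws from N(0, sigma^2)
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2) - x**2 / (2.0 * sigma**2))

sigmas = np.linspace(0.5, 5.0, 4501)
best = sigmas[np.argmax([log_likelihood(s) for s in sigmas])]
print(best)                              # ~2.16
print(np.sqrt(np.mean(samples**2)))      # closed form sqrt(28/6) ~= 2.16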

Likelihood for Stochastic Model

Dynamics and observation model:
  x_{t+1} = A x_t + w_t,  w_t ~ N(0, Q)
  y_t = C x_t + v_t,  v_t ~ N(0, R)

Suppose x_t and y_t are given for 0 <= t <= T; what is the likelihood of A, C, Q and R?

  L(A, C, Q, R | x, y) = p(x, y | A, C, Q, R) = Π_{t=0}^{T} p(x_t | x_{t-1}) p(y_t | x_t)

Compute the log-likelihood: log p(x, y | A, C, Q, R)

Log-likelihood

log p(x, y | A, C, Q, R) = log Π_{t=0}^{T} p(x_t | x_{t-1}) p(y_t | x_t)
  = Σ_{t=0}^{T-1} log p(x_{t+1} | x_t) + Σ_{t=0}^{T} log p(y_t | x_t) = ...

Multivariate normal distribution N(μ, Σ) has pdf:
  p(x) = (2π)^{-k/2} |Σ|^{-1/2} exp( -½ (x - μ)^T Σ^{-1} (x - μ) )

From the model: x_{t+1} ~ N(A x_t, Q),  y_t ~ N(C x_t, R), so

  = -½ Σ_{t=0}^{T-1} ( log|Q| + (x_{t+1} - A x_t)^T Q^{-1} (x_{t+1} - A x_t) )
    - ½ Σ_{t=0}^{T} ( log|R| + (y_t - C x_t)^T R^{-1} (y_t - C x_t) ) + const

Log-likelihood #2

  = -½ Σ_{t=0}^{T-1} ( log|Q| + (x_{t+1} - A x_t)^T Q^{-1} (x_{t+1} - A x_t) )
    - ½ Σ_{t=0}^{T} ( log|R| + (y_t - C x_t)^T R^{-1} (y_t - C x_t) ) + const = ...

Using a = Tr(a) if a is scalar, and bringing the summation inward:

  = -T/2 log|Q| - ½ Σ_{t=0}^{T-1} Tr( (x_{t+1} - A x_t)^T Q^{-1} (x_{t+1} - A x_t) )
    - (T+1)/2 log|R| - ½ Σ_{t=0}^{T} Tr( (y_t - C x_t)^T R^{-1} (y_t - C x_t) ) + const

Log-likelihood #3

  = -T/2 log|Q| - ½ Σ_{t=0}^{T-1} Tr( (x_{t+1} - A x_t)^T Q^{-1} (x_{t+1} - A x_t) )
    - (T+1)/2 log|R| - ½ Σ_{t=0}^{T} Tr( (y_t - C x_t)^T R^{-1} (y_t - C x_t) ) + const = ...

Using Tr(AB) = Tr(BA) and Tr(A) + Tr(B) = Tr(A + B):

  = T/2 log|Q^{-1}| - ½ Tr( Q^{-1} Σ_{t=0}^{T-1} (x_{t+1} - A x_t)(x_{t+1} - A x_t)^T )
    + (T+1)/2 log|R^{-1}| - ½ Tr( R^{-1} Σ_{t=0}^{T} (y_t - C x_t)(y_t - C x_t)^T ) + const

Log-likelihood #4

  = T/2 log|Q^{-1}| - ½ Tr( Q^{-1} Σ_{t=0}^{T-1} (x_{t+1} - A x_t)(x_{t+1} - A x_t)^T )
    + (T+1)/2 log|R^{-1}| - ½ Tr( R^{-1} Σ_{t=0}^{T} (y_t - C x_t)(y_t - C x_t)^T ) + const = ...

Expand:

l(A, C, Q, R | x, y) =
  T/2 log|Q^{-1}| - ½ Tr( Q^{-1} Σ_{t=0}^{T-1} ( x_{t+1} x_{t+1}^T - x_{t+1} x_t^T A^T - A x_t x_{t+1}^T + A x_t x_t^T A^T ) )
  + (T+1)/2 log|R^{-1}| - ½ Tr( R^{-1} Σ_{t=0}^{T} ( y_t y_t^T - y_t x_t^T C^T - C x_t y_t^T + C x_t x_t^T C^T ) ) + const
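As a numerical cross-check, this log-likelihood can be evaluated directly; a minimal sketch assuming complete sequences x and y are available and dropping the additive constant (the function name is an assumption).

import numpy as np

def log_likelihood(A, C, Q, R, x, y):
    # l(A, C, Q, R | x, y) up to the additive constant; x has shape (T+1, n), y has shape (T+1, m).
    T = len(x) - 1
    Qi, Ri = np.linalg.inv(Q), np.linalg.inv(R)
    Sq = sum(np.outer(x[t + 1] - A @ x[t], x[t + 1] - A @ x[t]) for t in range(T))
    Sr = sum(np.outer(y[t] - C @ x[t], y[t] - C @ x[t]) for t in range(T + 1))
    return (T / 2.0 * np.log(np.linalg.det(Qi)) - 0.5 * np.trace(Qi @ Sq)
            + (T + 1) / 2.0 * np.log(np.linalg.det(Ri)) - 0.5 * np.trace(Ri @ Sr))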

Maximize likelihood

log is a monotone function:
  max log(f(x))  ⇔  max f(x)

Maximize l(A, C, Q, R | x, y) in turn for A, C, Q and R:
  Solve ∂l(A, C, Q, R | x, y) / ∂A = 0 for A
  Solve ∂l(A, C, Q, R | x, y) / ∂C = 0 for C
  Solve ∂l(A, C, Q, R | x, y) / ∂Q = 0 for Q
  Solve ∂l(A, C, Q, R | x, y) / ∂R = 0 for R

Matrix derivatives

Defined for scalar functions f : R^{n×m} → R

Key identities:
  ∂(x^T A x) / ∂x = x^T (A^T + A)
  ∂(B^T A B) / ∂B = B^T (A^T + A)
  ∂Tr(AB) / ∂A = ∂Tr(BA) / ∂A = ∂Tr(B^T A^T) / ∂A = B^T
  ∂log|A| / ∂A = A^{-T}
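These identities are easy to verify numerically; the sketch below checks the Tr(AB) rule by finite differences on arbitrary random matrices (sizes and seed are assumptions).

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

# Finite-difference estimate of d Tr(AB) / dA, entry by entry.
eps = 1e-6
grad = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        dA = np.zeros_like(A)
        dA[i, j] = eps
        grad[i, j] = (np.trace((A + dA) @ B) - np.trace(A @ B)) / eps

print(np.allclose(grad, B.T, atol=1e-4))   # True: d Tr(AB)/dA = B^T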

Optimizing A

Derivative:
  ∂l(A, C, Q, R | x, y) / ∂A = ½ Q^{-1} Σ_{t=0}^{T-1} ( 2 x_{t+1} x_t^T - 2 A x_t x_t^T )

Maximizer:
  A = ( Σ_{t=0}^{T-1} x_{t+1} x_t^T ) ( Σ_{t=0}^{T-1} x_t x_t^T )^{-1}

Optimizing C

Derivative:
  ∂l(A, C, Q, R | x, y) / ∂C = ½ R^{-1} Σ_{t=0}^{T} ( 2 y_t x_t^T - 2 C x_t x_t^T )

Maximizer:
  C = ( Σ_{t=0}^{T} y_t x_t^T ) ( Σ_{t=0}^{T} x_t x_t^T )^{-1}

Optimizing Q

Derivative with respect to the inverse:
  ∂l(A, C, Q, R | x, y) / ∂Q^{-1} = T/2 Q - ½ Σ_{t=0}^{T-1} ( x_{t+1} x_{t+1}^T - x_{t+1} x_t^T A^T - A x_t x_{t+1}^T + A x_t x_t^T A^T )

Maximizer:
  Q = 1/T Σ_{t=0}^{T-1} ( x_{t+1} x_{t+1}^T - x_{t+1} x_t^T A^T - A x_t x_{t+1}^T + A x_t x_t^T A^T )

Optimizing R

Derivative with respect to the inverse:
  ∂l(A, C, Q, R | x, y) / ∂R^{-1} = (T+1)/2 R - ½ Σ_{t=0}^{T} ( y_t y_t^T - y_t x_t^T C^T - C x_t y_t^T + C x_t x_t^T C^T )

Maximizer:
  R = 1/(T+1) Σ_{t=0}^{T} ( y_t y_t^T - y_t x_t^T C^T - C x_t y_t^T + C x_t x_t^T C^T )
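Collecting the four maximizers from the preceding slides, here is a minimal M-step sketch; it assumes complete sequences x and y are available (with only distributions over X, the sums are replaced by their expectations, as described on the Update Parameters slide).

import numpy as np

def m_step(x, y):
    # Closed-form maximizers for A, C, Q, R from sequences x (T+1, n) and y (T+1, m).
    T = len(x) - 1
    Sxx     = sum(np.outer(x[t], x[t]) for t in range(T))          # sum_{t=0}^{T-1} x_t x_t^T
    Sx1x    = sum(np.outer(x[t + 1], x[t]) for t in range(T))      # sum_{t=0}^{T-1} x_{t+1} x_t^T
    Sxx_all = sum(np.outer(x[t], x[t]) for t in range(T + 1))      # sum_{t=0}^{T} x_t x_t^T
    Syx     = sum(np.outer(y[t], x[t]) for t in range(T + 1))      # sum_{t=0}^{T} y_t x_t^T

    A = Sx1x @ np.linalg.inv(Sxx)
    C = Syx @ np.linalg.inv(Sxx_all)
    # With the new A and C plugged in, the expanded forms above reduce to residual covariances.
    Q = sum(np.outer(x[t + 1] - A @ x[t], x[t + 1] - A @ x[t]) for t in range(T)) / T
    R = sum(np.outer(y[t] - C @ x[t], y[t] - C @ x[t]) for t in range(T + 1)) / (T + 1)
    return A, C, Q, R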

EM-algorithm

  x_{t+1} = A x_t + w_t,  w_t ~ N(0, Q)
  y_t = C x_t + v_t,  v_t ~ N(0, R)

Start with initial guesses of A, C, Q, R.

Kalman smoother (E-step):
  Compute distributions X_0, ..., X_T
  given data y_0, ..., y_T and A, C, Q, R.

Update parameters (M-step):
  Update A, C, Q, R such that the expected log-likelihood is maximized.

Repeat until convergence (local optimum).
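In outline, the procedure is the loop sketched below; kalman_smoother and expected_m_step are hypothetical placeholders for the E-step and M-step routines on the following slides, passed in as arguments rather than defined here.

def em(y, A, C, Q, R, kalman_smoother, expected_m_step, iters=50):
    # Generic EM loop: alternate the E-step (smoothing) and M-step (parameter update).
    for _ in range(iters):
        stats = kalman_smoother(y, A, C, Q, R)      # E-step: distributions over X_0, ..., X_T
        A, C, Q, R = expected_m_step(stats, y)      # M-step: maximize expected log-likelihood
    return A, C, Q, R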

Kalman Smoother

for (t = 0; t < T; ++t)                      // Kalman filter
  x_{t+1|t}   = A x_{t|t}
  P_{t+1|t}   = A P_{t|t} A^T + Q
  K_{t+1}     = P_{t+1|t} C^T ( C P_{t+1|t} C^T + R )^{-1}
  x_{t+1|t+1} = x_{t+1|t} + K_{t+1} ( y_{t+1} - C x_{t+1|t} )
  P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C P_{t+1|t}

for (t = T - 1; t >= 0; --t)                 // Backward pass
  L_t     = P_{t|t} A^T P_{t+1|t}^{-1}
  x_{t|T} = x_{t|t} + L_t ( x_{t+1|T} - x_{t+1|t} )
  P_{t|T} = P_{t|t} + L_t ( P_{t+1|T} - P_{t+1|t} ) L_t^T
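A direct transcription of this pseudocode, as a minimal sketch; it assumes an initial filtered estimate x0 with covariance P0 (which the slide does not specify) and returns the smoothed means and covariances together with the gains L_t needed later.

import numpy as np

def kalman_smoother(y, A, C, Q, R, x0, P0):
    # Forward Kalman filter followed by the backward (RTS) pass above.
    T = len(y) - 1
    n = len(x0)
    xf = np.zeros((T + 1, n)); Pf = np.zeros((T + 1, n, n))     # filtered x_{t|t}, P_{t|t}
    xp = np.zeros((T + 1, n)); Pp = np.zeros((T + 1, n, n))     # predicted x_{t+1|t}, P_{t+1|t}
    xf[0], Pf[0] = x0, P0
    for t in range(T):                                          # Kalman filter
        xp[t + 1] = A @ xf[t]
        Pp[t + 1] = A @ Pf[t] @ A.T + Q
        K = Pp[t + 1] @ C.T @ np.linalg.inv(C @ Pp[t + 1] @ C.T + R)
        xf[t + 1] = xp[t + 1] + K @ (y[t + 1] - C @ xp[t + 1])
        Pf[t + 1] = Pp[t + 1] - K @ C @ Pp[t + 1]
    xs, Ps = xf.copy(), Pf.copy()                               # smoothed x_{t|T}, P_{t|T}
    L = np.zeros((T, n, n))
    for t in range(T - 1, -1, -1):                              # backward pass
        L[t] = Pf[t] @ A.T @ np.linalg.inv(Pp[t + 1])
        xs[t] = xf[t] + L[t] @ (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] + L[t] @ (Ps[t + 1] - Pp[t + 1]) @ L[t].T
    return xs, Ps, L, xf, xp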

Update Parameters

The likelihood is in terms of x, but only distributions over X are available:

l(A, C, Q, R | x, y) =
  T/2 log|Q^{-1}| - ½ Tr( Q^{-1} Σ_{t=0}^{T-1} ( x_{t+1} x_{t+1}^T - x_{t+1} x_t^T A^T - A x_t x_{t+1}^T + A x_t x_t^T A^T ) )
  + (T+1)/2 log|R^{-1}| - ½ Tr( R^{-1} Σ_{t=0}^{T} ( y_t y_t^T - y_t x_t^T C^T - C x_t y_t^T + C x_t x_t^T C^T ) ) + const

The likelihood function is linear in x_t, x_t x_t^T and x_t x_{t+1}^T.

Expected likelihood: replace them with
  E( X_t | y ) = x_{t|T}
  E( X_t X_t^T | y ) = P_{t|T} + x_{t|T} x_{t|T}^T
  E( X_t X_{t+1}^T | y ) = x_{t|t} x_{t+1|T}^T + L_t ( P_{t+1|T} + ( x_{t+1|T} - x_{t+1|t} ) x_{t+1|T}^T )

Use the maximizers to update A, C, Q and R.
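Given the smoother output, these expectations can be assembled as follows; a minimal sketch whose variable names follow the hypothetical kalman_smoother sketch above (xs = x_{t|T}, Ps = P_{t|T}, xf = x_{t|t}, xp = x_{t+1|t}). The resulting sums replace the x_t terms in the maximizers.

import numpy as np

def expected_statistics(xs, Ps, L, xf, xp):
    # E[X_t], E[X_t X_t^T], and E[X_t X_{t+1}^T] from smoother output.
    T = len(xs) - 1
    Ex   = xs                                                        # E(X_t | y) = x_{t|T}
    Exx  = [Ps[t] + np.outer(xs[t], xs[t]) for t in range(T + 1)]    # P_{t|T} + x_{t|T} x_{t|T}^T
    Exx1 = [np.outer(xf[t], xs[t + 1])                               # x_{t|t} x_{t+1|T}^T + L_t (...)
            + L[t] @ (Ps[t + 1] + np.outer(xs[t + 1] - xp[t + 1], xs[t + 1]))
            for t in range(T)]
    return Ex, Exx, Exx1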

Convergence

Convergence to a local optimum is guaranteed.
Similar to coordinate ascent.

Conclusion

The EM-algorithm simultaneously optimizes state estimates and model parameters.
Given "training data", the EM-algorithm can be used (off-line) to learn the model for subsequent use in (real-time) Kalman filters.

Next time
Learning from demonstrations
Dynamic Time Warping
