Motion Segmentation Using EM - A Short Tutorial
Yair Weiss
MIT E10-120, Cambridge, MA 02139, USA
[email protected]
[Footnote 1] Throughout this tutorial we assume the number of models is known and equal to two. A method for automatically estimating the number of models is presented in (Weiss and Adelson, 1996).
1 The Expectation (E) Step

The E step assumes the parameters of both models are known and uses them to compute, for each datapoint, a residual r_1(i) measuring how far the ith datapoint is from model 1, and similarly for r_2(i). For example, suppose we have two line models: (1) y = x + 3 and (2) y = 2x - 1, and suppose the ith datapoint is x = 1, y = 1.1. Then the residual for line 1 is r_1(i) = (1 + 3 - 1.1) = 2.9 and for line 2 we get r_2(i) = (2 - 1 - 1.1) = -0.1. Intuitively we would expect this datapoint to be assigned to model 2 because the residual is smaller in magnitude. Indeed, the formula for the weights is consistent with this intuition:
$$w_1(i) = \frac{e^{-r_1^2(i)/2\sigma^2}}{e^{-r_1^2(i)/2\sigma^2} + e^{-r_2^2(i)/2\sigma^2}} \tag{2}$$

$$w_2(i) = \frac{e^{-r_2^2(i)/2\sigma^2}}{e^{-r_1^2(i)/2\sigma^2} + e^{-r_2^2(i)/2\sigma^2}} \tag{3}$$
There are a few things to note about this formula. First, note that w_1(i) and w_2(i) sum to one. This is because these weights are actually probabilities: the formula is simply derived from Bayes' rule. Second, note that there is a "free parameter" here, sigma. Roughly speaking, this parameter corresponds to the amount of residual expected in your data (e.g. the noise level). Finally, note that if r_1(i) is much smaller than r_2(i) then w_1(i) is approximately 1 and w_2(i) is approximately 0. For this reason equations 2-3 are sometimes known as the "softmin" equations: they are a way of generalizing the concept of a minimum of two numbers into a smooth function. More details on this can be found in the references.
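To make this concrete, plug in the residuals from the example above; the choice sigma = 0.1 (the value used in Figure 1) is ours:

$$w_2(i) = \frac{e^{-(-0.1)^2/0.02}}{e^{-(2.9)^2/0.02} + e^{-(-0.1)^2/0.02}} = \frac{e^{-0.5}}{e^{-420.5} + e^{-0.5}} \approx 1, \qquad w_1(i) \approx 0,$$

so the datapoint is assigned almost entirely to model 2, as expected.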
So to summarize, the E step calculates two weights for every datapoint by first calculating residuals (using the parameters of each model) and then running the residuals through the softmin function defined above.
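As an illustration, here is a minimal NumPy sketch of this E step for the two-line case; the function name, array layout, and vectorization are our own choices, not part of the original algorithm description:

```python
import numpy as np

def e_step(x, y, lines, sigma):
    """Soft-assign each datapoint to one of two lines y = a*x + b.

    x, y  : 1-D arrays of datapoint coordinates
    lines : two (a, b) parameter pairs
    sigma : expected noise level (the free parameter sigma above)
    Returns a (2, n) array of weights; each column sums to one.
    """
    # Residual of every point from each line: r_k(i) = a_k*x_i + b_k - y_i
    r = np.array([a * x + b - y for (a, b) in lines])
    # Softmin (equations 2-3): weights proportional to exp(-r^2 / 2 sigma^2)
    log_w = -r**2 / (2 * sigma**2)
    log_w -= log_w.max(axis=0)  # shift for numerical stability; cancels out
    w = np.exp(log_w)
    return w / w.sum(axis=0)
```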
2 The Maximization (M) Step

To understand the M step, recall first how ordinary least squares fits a single line y = ax + b to datapoints (x_i, y_i): the parameters satisfy the normal equations
$$\begin{bmatrix} \sum_i x_i^2 & \sum_i x_i \\ \sum_i x_i & \sum_i 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \sum_i x_i y_i \\ \sum_i y_i \end{bmatrix}$$
Now in weighted least squares we are also given a weight w_i for each point, and the equations simply become:
"P
#
P w x2 P w x ! " a #
w
x
y
i
i
i
i
i
i
i
i
i
i
P
P
P
=
b
i wi xi
i wi 1
i wi yi
(4)
So in the M step we solve the above equation twice: first with w_i = w_1(i) for the parameters of line 1, and then with w_i = w_2(i) for the parameters of line 2. In general, in the M step we solve two weighted least-squares problems, one for each model, with the weights given by the results of the E step.
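Continuing the sketch above (again, the function and variable names are illustrative, not from the original text):

```python
def m_step(x, y, w):
    """Weighted least-squares fit of a line y = a*x + b (equation 4).

    w : weights for each datapoint (one row of the E-step output)
    """
    A = np.array([[np.sum(w * x**2), np.sum(w * x)],
                  [np.sum(w * x),    np.sum(w)]])
    rhs = np.array([np.sum(w * x * y), np.sum(w * y)])
    a, b = np.linalg.solve(A, rhs)
    return a, b
```

A full EM iteration for line fitting then simply alternates `w = e_step(x, y, lines, sigma)` with `lines = [m_step(x, y, w[k]) for k in range(2)]` until the parameters stop changing.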
[Figure 1 appears here: three columns of panels, one per EM iteration (t = 1, t = 2, t = 3). The top row shows the current line fits to the data; the bottom row shows the corresponding weights.]
Figure 1: An example of the EM algorithm for fitting two lines to data. The algorithm starts at random initial conditions and converges in three iterations. The top panels show the line fits at every iteration and the bottom panels show the weights. (Data was generated by setting y = x + 1 for |x - 0.5| < 0.25 and y = x otherwise. The parameter sigma = 0.1.)
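Using the sketches above, a setup like Figure 1 can be run as follows; the point count, added noise, and iteration budget are our guesses, since the caption only specifies the generating lines and sigma = 0.1:

```python
# Generate data in the style of Figure 1 (details are illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = np.where(np.abs(x - 0.5) < 0.25, x + 1, x) + 0.1 * rng.standard_normal(100)

lines = [tuple(rng.standard_normal(2)) for _ in range(2)]  # random init
for _ in range(10):
    w = e_step(x, y, lines, sigma=0.1)
    lines = [m_step(x, y, w[k]) for k in range(2)]
print(lines)  # should approach (1, 1) and (1, 0), i.e. y = x + 1 and y = x
```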
3 Motion Segmentation Using EM

To apply EM to motion segmentation, two questions need to be answered:

- What is the class of possible motion models we are considering, and what are the parameters that need to be estimated for each model (e.g. translational or affine models)?
- What is the nature of the data that the models need to fit (e.g. raw pixel data or the results of optical flow)?
Here we will describe the simplest version of an EM-based motion segmentation algorithm: the models will be assumed to be translational and the image data will be the results of optical flow. More complex versions are described in the references.
We are given the results of optical flow v_x(i,j) and v_y(i,j). We assume this flow was generated by two global translation models. Each translation model is characterized by two numbers, (u, v), which give the horizontal and vertical translation. The problem is to estimate the parameters (u_1, v_1), (u_2, v_2) as well as to assign each pixel to the model that generated it.
The E step is entirely analogous to the one described for line fitting. We estimate two weights for every pixel, w_1(i,j) and w_2(i,j). This is done by assuming (u_1, v_1), (u_2, v_2) are known and calculating residuals:
$$r_1^2(i,j) = (u_1 - v_x(i,j))^2 + (v_1 - v_y(i,j))^2 \tag{5}$$

$$r_2^2(i,j) = (u_2 - v_x(i,j))^2 + (v_2 - v_y(i,j))^2 \tag{6}$$
The residuals are converted into weights by passing them through the softmin function:
$$w_1(i,j) = \frac{e^{-r_1^2(i,j)/2\sigma^2}}{e^{-r_1^2(i,j)/2\sigma^2} + e^{-r_2^2(i,j)/2\sigma^2}} \tag{7}$$

$$w_2(i,j) = \frac{e^{-r_2^2(i,j)/2\sigma^2}}{e^{-r_1^2(i,j)/2\sigma^2} + e^{-r_2^2(i,j)/2\sigma^2}} \tag{8}$$
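In code, this E step is the same softmin applied over the whole flow field; as before, the NumPy formulation is our own sketch:

```python
def e_step_flow(vx, vy, models, sigma):
    """Soft-assign each pixel to one of two translation models (u, v).

    vx, vy : optical-flow components, arrays of shape (H, W)
    models : two (u, v) translations
    Returns a (2, H, W) array of weights summing to one at each pixel.
    """
    # Squared residuals of the flow from each model (equations 5-6)
    r2 = np.array([(u - vx)**2 + (v - vy)**2 for (u, v) in models])
    # Softmin (equations 7-8)
    log_w = -r2 / (2 * sigma**2)
    log_w -= log_w.max(axis=0)  # numerical stability
    w = np.exp(log_w)
    return w / w.sum(axis=0)
```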
The M step is also analogous to the one used for line fitting. In this case, the parameters (u_1, v_1) satisfy:
!" #
"P
#
P w (i; j )
0
u
w
(
i;
j
)
v
(
i;
j
)
x
i;j
i;j
P w (i; j )
= P w (i; j )v (i; j )
(9)
0
v
y
i;j
i;j
Since the matrix is diagonal, (u_1, v_1) is simply the weighted mean of the flow. The equations for (u_2, v_2) are the same with w_1(i,j) replaced everywhere by w_2(i,j).
To summarize, the motion segmentation algorithm starts by choosing initial random values for the translation parameters, and iterates the E and M steps detailed above until the translation parameters converge.
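Putting the pieces together, a complete illustrative EM loop might look as follows; the initialization scheme, sigma, and the fixed iteration count are arbitrary choices for this sketch (a production version would test the parameters for convergence instead):

```python
def segment_motion(vx, vy, sigma=0.1, n_iter=20, seed=0):
    """Fit two translational motion models to a flow field with EM."""
    rng = np.random.default_rng(seed)
    # Random initial translations
    models = [tuple(rng.standard_normal(2)) for _ in range(2)]
    for _ in range(n_iter):
        w = e_step_flow(vx, vy, models, sigma)
        # M step (equation 9): each translation is the weighted mean flow
        models = [(np.sum(wk * vx) / np.sum(wk),
                   np.sum(wk * vy) / np.sum(wk))
                  for wk in w]
    return models, w
```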
In the simple case described here, since the EM algorithm is using the results of an optical flow algorithm, it makes sense to get the final segmentation by going back to the raw pixel data. That is, after the algorithm has converged and two translation models have been found, we let the models "compete" to explain the pixel data (a code sketch of this competition follows the list). This is done by:

- warping frame 2 to frame 1 with the translation of model 1, and subtracting the warped image from the true frame 1; this gives the prediction error of model 1;
- repeating the previous step with the translation of model 2 to obtain the prediction error of model 2;
- assigning each pixel to the model whose prediction error is lowest at that pixel.
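A minimal sketch of this competition, assuming grayscale frames and rounding the translations to whole pixels (a real implementation would warp with subpixel interpolation):

```python
def final_segmentation(frame1, frame2, models):
    """Assign each pixel to the translation model that best predicts it.

    frame1, frame2 : grayscale images as float arrays of shape (H, W)
    models         : the two (u, v) translations recovered by EM
    """
    errors = []
    for (u, v) in models:
        # Warp frame 2 toward frame 1: undo a flow of (u, v) pixels.
        # (Sign convention: the flow maps frame-1 pixels to frame-2 pixels.)
        warped = np.roll(frame2, shift=(-int(round(v)), -int(round(u))),
                         axis=(0, 1))
        # Per-pixel prediction error of this model
        errors.append(np.abs(frame1 - warped))
    # Label 0 or 1: the model with the lowest error at each pixel
    return np.argmin(np.array(errors), axis=0)
```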
4 Conclusion
This tutorial has discussed how to use the EM algorithm for motion segmentation. Although we have given an informal exposition, it should be noted that the actual algorithm is derived from a rigorous statistical estimation framework, and proofs about convergence and its rate can be found in the literature. The combination of a principled derivation and intuitive, decoupled steps is probably responsible for the algorithm's continued success in many estimation problems.
References
Ayer, S. and Sawhney, H. S. (1995). Layered representation of motion video using robust maximum-likelihood estimation of mixture models and MDL encoding. In Proc. Int'l Conf. Comput. Vision, pages 777-784.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39:1-38.

Jepson, A. and Black, M. J. (1993). Mixture models for optical flow computation. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pages 760-761, New York.

Weiss, Y. and Adelson, E. H. (1994). Perceptually organized EM: a framework for motion segmentation that combines information about form and motion. Technical Report 315, MIT Media Lab, Perceptual Computing Section.

Weiss, Y. and Adelson, E. H. (1996). A unified mixture framework for motion segmentation: incorporating spatial coherence and estimating the number of models. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pages 321-326.