
Gauss-Newton / Levenberg-Marquardt Optimization

Ethan Eade

Updated March 20, 2013

1 Definitions
Let x ∈ X be the state parameters to be optimized, with n degrees of freedom.
The goal of the optimization is to maximize the likelihood of a set of observa-
tions given the parameters, under a specified observation model.

1.1 Observations

For some problems, the observations are represented directly, either as vectors
in R^m or as elements on a manifold with m degrees of freedom. For example,
observations of points in an image are vectors in R^2 (m = 2). Observations of
coordinate transformations in 3D are elements of SE(3) (m = 6). In these cases,
we refer to the collective observation as z ∈ Z. Often the collective observation
is built by stacking up M independent observations {z_i ∈ Z}:

Z ≡ Z^M    (1)

    [ z_1 ]
z ≡ [  ⋮  ]    (2)
    [ z_M ]

The observation model h(x) predicts the value of z given the state parameters:

h : X → Z    (3)

The error vector v is then the difference (in a vector space) between the obser-
vations and the predictions, as a function of the parameters x:

v(x) ≡ z ⊖ h(x)    (4)

When z is composed of independent observations {z_i}, we can also refer to the
corresponding pieces of v:

v_i(x) ≡ z_i ⊖ h_i(x)    (5)

       [ v_1 ]
v(x) ≡ [  ⋮  ]           (6)
       [ v_M ]

The ⊖ operator yields a vector difference between two elements in Z:

⊖ : Z × Z → R^m    (7)

Its definition depends on that space. For observations in a vector space (e.g.
image points), ⊖ is just the plain vector difference. For a Lie group G, and two
elements a, b ∈ G, we can define

a ⊖ b ≡ ln(a · b^{-1}) ∈ g    (8)

where g is the Lie algebra vector space corresponding to G.


For some problems, the error function is easier to express directly, rather than
as a difference between observation and model prediction. For instance, when
estimating epipolar geometry, the errors are the distances between points in an
image and their epipolar lines. Each distance v_i is a scalar (m = 1), though the
points themselves are 2-vectors and the predictions are lines. In such scenarios,
we refer to the error vector v(x) without explicitly defining it in terms of z and
h(x). Nonetheless, we still refer to the pieces v_i as observations.
We describe the uncertainty of v with a covariance matrix R. In the case of
Eq. 4, R is typically just the covariance over z itself. Otherwise, R is computed
by projecting uncertainties of the measured quantities into the space of the
error v. Again using the epipolar geometry example, the covariance of each
point measurement is propagated through the distance-to-line function to yield
variance over the epipolar error.
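As a concrete illustration of this kind of first-order propagation (not from the
text), the following Python sketch pushes an image-point covariance Σ through
the signed point-to-line distance; the function name and the assumption that
the line (a, b, c) need not be pre-normalized are illustrative choices.

```python
import numpy as np

def point_line_error_variance(point_cov, line):
    """First-order propagation of a 2D point's covariance through the
    signed point-to-line distance d = (a*x + b*y + c) / sqrt(a^2 + b^2)."""
    a, b, c = line
    # Gradient of the distance with respect to the point (x, y).
    grad = np.array([a, b]) / np.hypot(a, b)
    # Scalar variance of the distance: grad · Σ · grad^T.
    return grad @ point_cov @ grad

# Example: isotropic 1-pixel measurement noise on the image point.
Sigma = np.eye(2)
print(point_line_error_variance(Sigma, line=(0.6, 0.8, -3.0)))  # ~1.0
```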
When the errors for each observation are independent, as is common in many
optimizations, the matrix R is block diagonal with M blocks, and we refer to
the ith block as R_i:

    [ R_1         ]
R = [      ⋱      ]    (9)
    [         R_M ]

1.2 Jacobians
We define J to be the negative Jacobian (differential) of the error v as a function
of x:
J ≡ −∂v(x)/∂x    (10)

We use the negative Jacobian because when v ≡ z ⊖ h(x), it is more natural to
compute the Jacobian of h(x), and then

J = ∂h(x)/∂x    (11)

As with v and R, for independent errors the whole Jacobian is just the stacked
matrix of individual Jacobians:

J_i ≡ −∂v_i(x)/∂x

    [ J_1 ]
J = [  ⋮  ]
    [ J_M ]
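When an analytic J_i is awkward to derive, it can be approximated or checked
numerically by perturbing the parameters and differencing the error. A minimal
central-difference sketch, assuming the caller supplies v_i as a plain vector-valued
function and an oplus operator implementing the parameter perturbation
introduced in the next subsection (both names are illustrative):

```python
import numpy as np

def numerical_neg_jacobian(v_i, x, oplus, n, eps=1e-6):
    """Central-difference estimate of J_i = -dv_i(x)/dx, with the parameter
    perturbation applied through the oplus operator (Eq. 10 sign convention)."""
    m = len(v_i(x))
    J = np.zeros((m, n))
    for k in range(n):
        d = np.zeros(n)
        d[k] = eps
        # k-th column of dv_i/dx, negated to match J_i = -dv_i/dx.
        J[:, k] = -(v_i(oplus(x, d)) - v_i(oplus(x, -d))) / (2.0 * eps)
    return J
```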

1.3 Parameter Perturbations

Above, the ⊖ operator was defined for computing differences in the observa-
tion space, in such a way that z_i ⊖ h_i(x) is a vector in R^m even when z_i and
h_i(x) are not represented as vectors. Similarly, we define the operator ⊕ for
“adding” a perturbation to our parameters. The parameter space X could be
a vector space like R^n, or instead some other manifold with n degrees of free-
dom. For pose estimation, X = SE(3) and n = 6.
Consider a parameter perturbation vector δ ∈ R^n. Then for x ∈ X, we have

⊕ : X × R^n → X    (12)

When X = R^n, this is just standard vector addition:

x ⊕ δ ≡ x + δ    (13)

When X = G for a Lie group G, the perturbation is expressed as left multipli-
cation in the group:

x ⊕ δ ≡ exp(δ) · x    (14)

Thus in Lie groups, using Eqs. 8 and 14, our intuition for ⊕ and ⊖ holds:

(x ⊕ δ) ⊖ x = ln(exp(δ) · x · x^{-1})    (15)
            = ln(exp(δ))                 (16)
            = δ                          (17)

In manifolds without the exponential map, the perturbation can be computed
as an update that might violate the manifold constraints, followed by a projec-
tion back onto the manifold.
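To make ⊕ and ⊖ concrete, here is a sketch for the rotation group SO(3) using
SciPy's Rotation class, where exp and ln correspond to from_rotvec and
as_rotvec. SO(3) is chosen only for brevity (the pose example in the text uses
SE(3)), and the helper names are illustrative.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def oplus(x, delta):
    """x ⊕ δ = exp(δ) · x, with δ a 3-vector in the Lie algebra so(3)."""
    return R.from_rotvec(delta) * x

def ominus(a, b):
    """a ⊖ b = ln(a · b^{-1}), returned as a 3-vector."""
    return (a * b.inv()).as_rotvec()

x = R.from_rotvec([0.1, -0.2, 0.3])
delta = np.array([0.01, 0.02, -0.03])
# Round trip: (x ⊕ δ) ⊖ x recovers δ, matching Eqs. 15-17.
print(ominus(oplus(x, delta), x))   # ≈ [0.01, 0.02, -0.03]
```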

1.4 Objective Function

The goal is to adjust x so that the likelihood of the observations is maximized:

p(z|x) ∝ exp(−(1/2) · v^T · R^{-1} · v)    (18)
As the logarithm is monotonic, this is equivalent to minimizing the negative
log-likelihood objective function:

L = v^T · R^{-1} · v    (19)

When the individual observations are independent, the covariance matrix R is
block diagonal. Then the objective function reduces to a sum over the observa-
tions (again as a function of x):

L_i ≡ v_i^T · R_i^{-1} · v_i    (20)

L = ∑_i L_i                     (21)

The value of this objective function for some specific parameters x is often
called the residual.

2 Gauss-Newton Method
We approximate v_i as a function of x by a first-order Taylor expansion:

v_i(x ⊕ δ) ≈ v_i(x) + (∂v_i(x)/∂x) · δ
           = v_i(x) − J_i · δ

This approximation then extends trivially to the whole error vector:

v(x ⊕ δ) ≈ v(x) − J · δ    (22)

Substituting this approximation into Eq. 19 yields

L = (v − J · δ)^T · R^{-1} · (v − J · δ)    (23)

To minimize this residual, we differentiate with respect to δ, set equal to zero,
and solve for δ:
∂L/∂δ = −2 · (v − J · δ)^T · R^{-1} · J          (24)
0 = v^T · R^{-1} · J − δ^T · J^T · R^{-1} · J    (25)
v^T · R^{-1} · J = δ^T · J^T · R^{-1} · J        (26)
J^T · R^{-1} · J · δ = J^T · R^{-1} · v          (27)
δ = (J^T · R^{-1} · J)^{-1} · J^T · R^{-1} · v   (28)

The Fisher information matrix J^T · R^{-1} · J is symmetric and positive definite,
so the linear system can be efficiently solved with a Cholesky or LDL^T decom-
position. Further, if the observations are independent, the information matrix
and information vector are simply accumulated over the observations:

J^T · R^{-1} · J = ∑_i J_i^T · R_i^{-1} · J_i    (29)
J^T · R^{-1} · v = ∑_i J_i^T · R_i^{-1} · v_i    (30)

The update δ from Eq. 28 is then applied by perturbing x:

x ← x ⊕ δ    (31)

The whole process is iterated by evaluating J and v at the new parameters,
recomputing δ (Eq. 28), and applying the update (Eq. 31). The iteration con-
tinues until some convergence criterion is met, or the iteration count reaches a
bound.
Note that upon convergence to a minimum of the residual, (J^T · R^{-1} · J)^{-1}
(the inverse of the information matrix) is the Cramér-Rao lower bound for the
covariance of the parameters.
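A compact sketch of the Gauss-Newton loop described above, assuming user-
supplied callbacks: v_and_J returns the error and negative Jacobian for one
observation, R_inv is a list of inverse covariance blocks, and oplus applies the
update. All names are illustrative, not from the text.

```python
import numpy as np

def gauss_newton(x, observations, v_and_J, R_inv, oplus, n, iters=20, tol=1e-9):
    """Iterate Eqs. 28-31: accumulate the information matrix and vector over
    independent observations, solve for delta, and apply x <- x (+) delta."""
    for _ in range(iters):
        H = np.zeros((n, n))          # J^T R^{-1} J  (information matrix)
        g = np.zeros(n)               # J^T R^{-1} v  (information vector)
        for i, z_i in enumerate(observations):
            v_i, J_i = v_and_J(x, z_i)          # error and negative Jacobian
            H += J_i.T @ R_inv[i] @ J_i         # Eq. 29
            g += J_i.T @ R_inv[i] @ v_i         # Eq. 30
        delta = np.linalg.solve(H, g)           # Eq. 28 (H symmetric p.d.)
        x = oplus(x, delta)                     # Eq. 31
        if np.linalg.norm(delta) < tol:
            break
    return x, np.linalg.inv(H)        # inverse information ≈ covariance at convergence
```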

3 Assuring Convergence
Convergence of the Gauss-Newton method is not guaranteed, and it converges
only to a local optimum that depends on the starting parameters. In practice,
if the objective function L(x) is locally well-approximated by a quadratic form,
then convergence to a local minimum is quadratic. However, the curvature of
the error surface of a nonlinear observation model can vary significantly over
the parameter space. The Levenberg-Marquardt method is a refinement to the
Gauss-Newton procedure that increases the chance of local convergence and
prohibits divergence. Note that the results still depend on the starting point.

3.1 Levenberg Method

Define a modified information matrix, with a damping factor λ:

A ≡ J^T · R^{-1} · J + λ · I    (32)

As λ → 0, A approaches the unmodified information matrix. For λ → ∞, A
is dominated by the identity matrix. As λ increases, the computed update δ
tends to the scaled gradient descent direction:

δ → (1/λ) · J^T · R^{-1} · v    (33)
with decreasing step size. Thus if the residual L is not currently at a minimum,
increasing λ and computing the update δ will eventually lead to a decrease in
L.
To control convergence behavior, we modify λ according to a simple schedule,
controlled by two factors 1 < a < b. Typical values are a = 2 and b = 10.
Starting with parameters x, residual L(x), and damping value λ, an update δ_λ
is computed and applied:

x′ = x ⊕ δ_λ    (34)

Then the residual L(x′) is computed under the new parameters. If the resid-
ual has decreased, such that L(x′) < L(x), then the update is valid, and the
damping factor is decreased by factor a:

x ← x′           (35)
λ ← (1/a) · λ    (36)

If the residual has increased, or has not decreased by some threshold amount,
the parameters are left unchanged and λ is increased:

λ ← b · λ    (37)

Thus only parameter updates that decrease the residual are kept. The process is
iterated similarly to the Gauss-Newton method, and can be terminated when
λ reaches a large threshold value (which corresponds to a vanishingly small
update). Note that in the case where the parameter update is rejected and λ
increases, the information matrix and vector need not be recomputed. Instead
only the matrix A needs to be updated using the new λ value, and the linear
system re-solved to find a new δ_λ.
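A sketch of one accept/reject step of this schedule, reusing an already-
accumulated information matrix H = J^T · R^{-1} · J and vector g = J^T · R^{-1} · v
so that a rejected update only re-solves the damped system; callback and
variable names are illustrative, and a = 2, b = 10 follow the text.

```python
import numpy as np

def levenberg_step(x, H, g, residual, oplus, lam, a=2.0, b=10.0):
    """One accept/reject step of Eqs. 32-37. H and g are evaluated at x,
    and residual(x) returns the objective L(x)."""
    L_old = residual(x)
    while True:
        A = H + lam * np.eye(H.shape[0])     # damped information matrix (Eq. 32)
        delta = np.linalg.solve(A, g)
        x_new = oplus(x, delta)              # Eq. 34
        if residual(x_new) < L_old:          # accept: keep update, relax damping
            return x_new, lam / a            # Eqs. 35-36
        lam *= b                             # reject: increase damping (Eq. 37)
        if lam > 1e12:                       # vanishingly small update; stop
            return x, lam
```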

3.2 Levenberg-Marquardt Method

A refinement due to Marquardt changes how A is defined in terms of λ. In-
stead of damping all parameter dimensions equally (by adding a multiple of
the identity matrix), a scaled version of the diagonal of the information ma-
trix itself can be added:
 
A ≡ J^T · R^{-1} · J + λ · diag(J^T · R^{-1} · J)    (38)

As λ grows, δ_λ again tends towards a gradient-descent update, but with each
dimension scaled according to the diagonal of the information matrix. This
can lead to faster convergence than the Levenberg damping term when some
dimensions of the error surface have very different curvature than others.
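In code, the Marquardt variant only changes how the damped matrix A is built;
a minimal sketch (the helper name is illustrative):

```python
import numpy as np

def marquardt_damped(H, lam):
    """Eq. 38: scale the damping by the diagonal of H = J^T R^{-1} J, so each
    parameter dimension is damped in proportion to its own curvature."""
    return H + lam * np.diag(np.diag(H))
```

Using this in place of the λ·I term in the Levenberg sketch above gives the
Levenberg-Marquardt update.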

4 Robust Cost Functions


When dealing with non-Gaussian distributed errors, such as arise when some
observations are outliers, a cost function other than simple quadratic error is
appropriate. An alternative cost function ρ : R^+ → R^+ can be injected into the
objective function inside the sum of per-observation negative log-likelihoods
by generalizing Eq. 20:

C_i ≡ ρ(L_i)                       (39)
    = ρ(v_i^T · R_i^{-1} · v_i)    (40)

C = ∑_i C_i                        (41)

The standard Gaussian least-squares objective function is thus the special case
ρ(L_i) = L_i.
The optimization method presented here assumes the function ρ is continu-
ously differentiable. M-estimators have nontrivial ρ, though often M-estimator
cost functions are specified in the literature as functions of √L_i.
For example, the Huber cost function ρ_H[k] with scale k can be defined in terms
of L_i:

ρ_H[k](L_i) ≡ { L_i              if L_i < k^2
              { 2k·√L_i − k^2    if L_i ≥ k^2      (42)

We can differentiate the more general objective function of Eq. 41 around our
parameters with the aid of the chain rule:

 
∂C/∂x = ∑_i (∂ρ(L_i)/∂L_i) · (∂L_i/∂x)                          (43)
      = ∑_i (∂ρ(L_i)/∂L_i) · 2·v_i^T · R_i^{-1} · ∂v_i(x)/∂x    (44)
      = ∑_i (∂ρ(L_i)/∂L_i) · 2·v_i^T · R_i^{-1} · (−J_i)        (45)

Linearizing the model h_i(x + δ) and setting the differential to zero yields a
linear system for the update vector δ similar to the standard case:

 
0 = ∑_i (∂ρ(L_i)/∂L_i) · [z_i − h_i(x + δ)]^T · R_i^{-1} · J_i    (46)
  ≈ ∑_i (∂ρ(L_i)/∂L_i) · (v_i − J_i·δ)^T · R_i^{-1} · J_i         (47)

∑_i (∂ρ(L_i)/∂L_i) · δ^T · J_i^T · R_i^{-1} · J_i
    = ∑_i (∂ρ(L_i)/∂L_i) · v_i^T · R_i^{-1} · J_i                 (48)

The differential of the cost function is typically called the weight function:

W_i ≡ ∂ρ(L_i)/∂L_i    (49)

For the example of the Huber cost function, the weight function is then:
W_H[k](L_i) ≡ { 1         if L_i < k^2
              { k/√L_i    if L_i ≥ k^2      (50)

Incorporating this abbreviation yields an update equation similar to Eq. 28:

δ = (∑_i W_i · J_i^T · R_i^{-1} · J_i)^{-1} · (∑_i W_i · J_i^T · R_i^{-1} · v_i)    (51)

Note that the scalar weights W_i are evaluated at each iteration from the squared
errors L_i computed from the current parameters x. This method of optimizing
robust cost functions is called iterated re-weighted least-squares.
Depending on the properties of ρ, when the parameter vector x is far from the
optimum, the weights W_i might all tend to zero, in which case the iteration will
not converge or will converge very slowly (i.e. not quadratically). Such failure
modes can sometimes be avoided by first optimizing a convex cost function
(e.g. Huber) to find a reasonable estimate of the parameters before switching
to a non-convex cost function (e.g. Tukey or Cauchy). The latter cost functions
are able to more strongly reject outlier observations by assigning them very
low or zero weights.
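A sketch of one iterated re-weighted least-squares step with the Huber weight
of Eq. 50; the callbacks mirror the Gauss-Newton sketch above, and the default
scale k = 1.345 is a common illustrative choice for whitened errors, not a value
from the text.

```python
import numpy as np

def huber_weight(L_i, k):
    """Eq. 50: unit weight for inliers, down-weight large squared errors."""
    return 1.0 if L_i < k * k else k / np.sqrt(L_i)

def irls_step(x, observations, v_and_J, R_inv, oplus, n, k=1.345):
    """One robust update (Eq. 51): weights are recomputed from the current
    squared errors L_i before accumulating the normal equations."""
    H = np.zeros((n, n))
    g = np.zeros(n)
    for i, z_i in enumerate(observations):
        v_i, J_i = v_and_J(x, z_i)
        L_i = v_i @ R_inv[i] @ v_i                # Eq. 20
        w = huber_weight(L_i, k)                  # Eqs. 49-50
        H += w * (J_i.T @ R_inv[i] @ J_i)
        g += w * (J_i.T @ R_inv[i] @ v_i)
    return oplus(x, np.linalg.solve(H, g))        # Eq. 51, then x ← x ⊕ δ
```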

