Lie Optimization
Ethan Eade
1 Definitions
Let x ∈ X be the state parameters to be optimized, with n degrees of freedom.
The goal of the optimization is to maximize the likelihood of a set of observa-
tions given the parameters, under a specified observation model.
1.1 Observations
For some problems, the observations are represented directly, either as vectors
in Rm or as elements on a manifold with m degrees of freedom. For example,
observations of points in an image are vectors in R2 (m = 2). Observations of
coordinate transformations in 3D are elements of SE(3) (m = 6). In these cases,
we refer to the collective observation as z ∈ Z. Often the collective observation
is built by stacking up M independent observations {zi ∈ Z }:
Z ≡ Z^M                                  (1)

    ⎡ z1 ⎤
z ≡ ⎢ ⋮  ⎥                               (2)
    ⎣ zM ⎦
The observation model h(x) predicts the value of z given the state parameters:
h:X→Z (3)
The error vector v is then the difference (in the relevant vector space) between the observations and the predictions, as a function of the parameters x:

v(x) ≡ z ⊖ h(x)                          (4)

When z is composed of independent observations {zi }, we can also refer to the corresponding pieces of v:

vi(x) ≡ zi ⊖ hi(x)                       (5)
       ⎡ v1 ⎤
v(x) ≡ ⎢ ⋮  ⎥                            (6)
       ⎣ vM ⎦
The ⊖ operator computes a difference in the observation space, and its definition depends on that space. For observations in a vector space (e.g. image points), ⊖ is just the plain vector difference:

a ⊖ b ≡ a − b                            (7)

For a Lie group G, and two elements a, b ∈ G, we can define

a ⊖ b ≡ ln(a · b⁻¹) ∈ g                  (8)

where g is the Lie algebra of G.
1.2 Jacobians
We define J to be the negative Jacobian (differential) of the error v as a function of x:

J ≡ −∂v(x)/∂x                            (10)
We use the negative Jacobian because when v ≡ z ⊖ h(x), it is more natural to compute the Jacobian of h(x), and then

J = ∂h(x)/∂x                             (11)
As with v and its covariance R, for independent errors the whole Jacobian is just the stacked matrix of individual Jacobians:

Ji ≡ −∂vi(x)/∂x

    ⎡ J1 ⎤
J = ⎢ ⋮  ⎥
    ⎣ JM ⎦
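Because the sign convention J ≡ −∂v/∂x is easy to get wrong, it can be worth checking an analytic Jacobian against central differences. The sketch below does so for an arbitrary toy model (the model h and all names are illustrative, not from the text), with X = R² a plain vector space:

```python
import numpy as np

def h(x):
    """Toy observation model: R^2 -> R^2 (illustrative only)."""
    return np.array([np.sin(x[0]) * x[1], x[0] ** 2 + x[1]])

def h_jacobian(x):
    """Analytic dh/dx for the toy model."""
    return np.array([[np.cos(x[0]) * x[1], np.sin(x[0])],
                     [2.0 * x[0], 1.0]])

def numeric_neg_jacobian_of_v(x, z, eps=1e-6):
    """J = -dv/dx with v(x) = z - h(x), by central differences."""
    J = np.zeros((z.size, x.size))
    for k in range(x.size):
        e = np.zeros(x.size)
        e[k] = eps
        v_plus = z - h(x + e)
        v_minus = z - h(x - e)
        J[:, k] = -(v_plus - v_minus) / (2.0 * eps)
    return J

x0 = np.array([0.4, 1.3])
z = h(x0) + 0.01            # arbitrary observation near the prediction
J_num = numeric_neg_jacobian_of_v(x0, z)
J_ana = h_jacobian(x0)      # should agree with J_num: J = -dv/dx = dh/dx
```

The check confirms Eq. 11: the negative Jacobian of the error equals the Jacobian of the prediction.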
1.3 Perturbations

Above, the ⊖ operator was defined for computing differences in the observation space, in such a way that zi ⊖ hi(x) is a vector in R^m even when zi and hi(x) are not represented as vectors. Similarly, we define the operator ⊕ for "adding" a perturbation to our parameters. The parameter space X could be a vector space like R^n, or instead some other manifold with n degrees of freedom. For pose estimation, X = SE(3) and n = 6.

Consider a parameter perturbation vector δ ∈ R^n. Then for x ∈ X, we have

⊕ : X × R^n → X                          (12)

For a vector space, ⊕ is simply addition:

x ⊕ δ ≡ x + δ                            (13)

For a Lie group, the perturbation is applied through the exponential map:

x ⊕ δ ≡ exp(δ) · x                       (14)
Thus in Lie groups, using Eqs. 8 and 14, our intuition for ⊕ and ⊖ holds:

(x ⊕ δ) ⊖ x = ln(exp(δ) · x · x⁻¹)       (15)
            = ln(exp(δ))                 (16)
            = δ                          (17)
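This identity can be verified numerically. The sketch below implements exp and ln for SO(3) via Rodrigues' formula (the choice of SO(3) and all function names are illustrative, not from the text) and checks that (x ⊕ δ) ⊖ x recovers δ:

```python
import numpy as np

def hat(w):
    """Map a vector in R^3 to its skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """exp: R^3 -> SO(3) via Rodrigues' formula."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3) + hat(w)
    W = hat(w)
    return (np.eye(3) + (np.sin(theta) / theta) * W
            + ((1.0 - np.cos(theta)) / theta**2) * (W @ W))

def so3_log(R):
    """ln: SO(3) -> R^3 (assumes rotation angle below pi)."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros(3)
    return (theta / (2.0 * np.sin(theta))) * np.array(
        [R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])

def oplus(x, delta):
    """x ⊕ δ ≡ exp(δ) · x   (Eq. 14)"""
    return so3_exp(delta) @ x

def ominus(a, b):
    """a ⊖ b ≡ ln(a · b⁻¹)  (Eq. 8)"""
    return so3_log(a @ b.T)

x = so3_exp(np.array([0.3, -0.2, 0.1]))      # some element of SO(3)
delta = np.array([0.05, 0.02, -0.04])        # small perturbation
recovered = ominus(oplus(x, delta), x)       # should equal delta (Eq. 17)
```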
1.4 Objective Function

Taking the observation noise to be zero-mean Gaussian with covariance R, maximizing the likelihood of the observations is equivalent to minimizing the quadratic objective function:

L = vᵀ · R⁻¹ · v                         (19)

The value of this objective function for some specific parameters x is often called the residual.
2 Gauss-Newton Method
We approximate vi as a function of x by a first-order Taylor expansion in the perturbation δ:

vi(x ⊕ δ) ≈ vi(x) + ∂vi(x)/∂x · δ        (20)
          = vi(x) − Ji · δ               (21)

v(x ⊕ δ) ≈ v(x) − J · δ                  (22)

Substituting the linearized error into the objective function:

L = (v − J · δ)ᵀ · R⁻¹ · (v − J · δ)     (23)
To minimize this residual, we differentiate with respect to δ, set equal to zero,
and solve for δ:
∂L/∂δ = −2 · (v − J · δ)ᵀ · R⁻¹ · J      (24)

0 = vᵀ · R⁻¹ · J − δᵀ · Jᵀ · R⁻¹ · J     (25)

vᵀ · R⁻¹ · J = δᵀ · Jᵀ · R⁻¹ · J         (26)

Jᵀ · R⁻¹ · J · δ = Jᵀ · R⁻¹ · v          (27)

δ = (Jᵀ · R⁻¹ · J)⁻¹ · Jᵀ · R⁻¹ · v      (28)
The matrix Jᵀ · R⁻¹ · J is symmetric and positive definite (provided J has full column rank), so the linear system can be efficiently solved with a Cholesky or LDLT decomposition. Further, if the observations are independent, the information matrix and information vector are simply accumulated over the observations:

Jᵀ · R⁻¹ · J = ∑i Jiᵀ · Ri⁻¹ · Ji        (29)

Jᵀ · R⁻¹ · v = ∑i Jiᵀ · Ri⁻¹ · vi        (30)

The computed update is then applied to the parameters, and the process iterated:

x ← x ⊕ δ                                (31)
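As a sketch of how this accumulation might look in practice, the following assembles the information matrix and vector from per-observation Jacobians and errors, then solves the normal equations with a Cholesky factorization. The Jacobians, errors, and dimensions here are random placeholders, not a real problem:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, M = 3, 2, 10                  # state dof, observation dof, # observations

A = np.zeros((n, n))                # information matrix  ∑ Jiᵀ · Ri⁻¹ · Ji
b = np.zeros(n)                     # information vector  ∑ Jiᵀ · Ri⁻¹ · vi
for _ in range(M):
    Ji = rng.standard_normal((m, n))     # placeholder per-observation Jacobian
    vi = rng.standard_normal(m)          # placeholder per-observation error
    Ri_inv = np.eye(m)                   # unit observation covariance
    A += Ji.T @ Ri_inv @ Ji
    b += Ji.T @ Ri_inv @ vi

# A is symmetric positive definite for full-rank J, so Cholesky applies:
L_chol = np.linalg.cholesky(A)           # A = L · Lᵀ
y = np.linalg.solve(L_chol, b)           # forward substitution
delta = np.linalg.solve(L_chol.T, y)     # back substitution: the update δ
```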
3 Assuring Convergence
Convergence of the Gauss-Newton method is not guaranteed, and it converges
only to a local optimum that depends on the starting parameters. In practice,
if the objective function L(x) is locally well-approximated by a quadratic form,
then convergence to a local minimum is quadratic. However, the curvature of
the error surface of a nonlinear observation model can vary significantly over
the parameter space. The Levenberg-Marquardt method is a refinement of the
Gauss-Newton procedure that increases the chance of local convergence and
prevents divergence. Note that the results still depend on the starting point.
3.1 Levenberg Method

The Levenberg method augments the Gauss-Newton information matrix with an adjustable damping term before solving for the update δλ:

A ≡ Jᵀ · R⁻¹ · J + λ · I                 (32)

δλ = A⁻¹ · Jᵀ · R⁻¹ · v

When λ is small, δλ is nearly the Gauss-Newton step. As λ grows large, the damping term dominates A and the update approaches a gradient descent step,

δλ → (1/λ) · Jᵀ · R⁻¹ · v                (33)

with decreasing step size. Thus if the residual L is not currently at a minimum, increasing λ and computing the update δλ will eventually lead to a decrease in L.
To control convergence behavior, we modify λ according to a simple schedule,
controlled by two factors 1 < a < b. Typical values are a = 2 and b = 10.
Starting with parameters x, residual L(x), and damping value λ, an update δλ
is computed and applied:
x′ = x ⊕ δλ                              (34)

Then the residual L(x′) is computed under the new parameters. If the residual has decreased, such that L(x′) < L(x), then the update is valid, and the damping factor is decreased by factor a:
x ← x′                                   (35)

λ ← (1/a) · λ                            (36)
If the residual has increased, or has not decreased by some threshold amount,
the parameters are left unchanged and λ is increased:
λ ← b·λ (37)
Thus only parameter updates that decrease the residual are kept. The process is
iterated similarly to the Gauss-Newton method, and can be terminated when
λ reaches a large threshold value (which corresponds to a vanishingly small
update). Note that in the case where the parameter update is rejected and λ
increases, the information matrix and vector need not be recomputed. Instead
only the matrix A needs to be updated using the new λ value, and the linear
system solved to find a new δλ .
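The schedule above might be sketched as follows on a toy curve-fitting problem. The model h(x) = x0 · exp(x1 · t), the data, and the starting point are all illustrative assumptions, with R = I for simplicity:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 20)
x_true = np.array([2.0, -1.5])
z = x_true[0] * np.exp(x_true[1] * t)      # noise-free synthetic observations

def residual(x):
    """v(x) = z - h(x) for the toy model h(x) = x0 * exp(x1 * t)."""
    return z - x[0] * np.exp(x[1] * t)

def jacobian(x):
    """J = dh/dx, stacked over all observations."""
    e = np.exp(x[1] * t)
    return np.column_stack([e, x[0] * t * e])

x = np.array([1.0, 0.0])                   # starting parameters
lam, a, b = 1.0, 2.0, 10.0                 # damping value and schedule factors
v = residual(x)
L = v @ v                                  # residual (R = I)
for _ in range(200):
    J = jacobian(x)
    A = J.T @ J + lam * np.eye(2)          # Levenberg damping (Eq. 32 style)
    delta = np.linalg.solve(A, J.T @ v)
    x_new = x + delta                      # ⊕ is ordinary + in a vector space
    v_new = residual(x_new)
    L_new = v_new @ v_new
    if L_new < L:                          # accept: keep update, relax damping
        x, v, L = x_new, v_new, L_new
        lam /= a
    else:                                  # reject: increase damping, retry
        lam *= b
    if L < 1e-16 or lam > 1e8:             # converged, or update vanishingly small
        break
```

Note that on a rejected step only A and δλ are recomputed; the information matrix JᵀJ and vector Jᵀv are unchanged.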
3.2 Levenberg-Marquardt Method

The Marquardt variation scales the damping term by the diagonal of the information matrix itself:

A ≡ Jᵀ · R⁻¹ · J + λ · diag(Jᵀ · R⁻¹ · J)    (38)

As λ grows, δλ again tends towards a gradient descent update, but with each dimension scaled according to the diagonal of the information matrix. This can lead to faster convergence than the Levenberg damping term when some dimensions of the error surface have much different curvature than others.
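The difference from the Levenberg update is confined to the construction of A. A minimal sketch, with placeholder values for the information matrix and vector:

```python
import numpy as np

info = np.array([[4.0, 1.0],          # placeholder Jᵀ·R⁻¹·J with very
                 [1.0, 0.01]])        # different curvature per dimension
rhs = np.array([1.0, 0.5])            # placeholder Jᵀ·R⁻¹·v
lam = 0.1

A_levenberg = info + lam * np.eye(2)                # Levenberg: λ·I
A_marquardt = info + lam * np.diag(np.diag(info))   # Marquardt: λ·diag(info)
delta = np.linalg.solve(A_marquardt, rhs)           # damped update δλ
```

With the Marquardt term, the damping applied to each dimension is proportional to that dimension's curvature, so poorly and strongly curved dimensions are damped commensurately.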
4 Robust Objective Functions

The influence of outlier observations can be reduced by applying a robust cost function ρ to each independent squared error Li ≡ viᵀ · Ri⁻¹ · vi, and minimizing the sum of the robust costs:

Ci ≡ ρ(Li)                               (39)
   = ρ(viᵀ · Ri⁻¹ · vi)                  (40)

C = ∑i Ci                                (41)

The standard Gaussian least-squares objective function is thus the special case ρ(Li) = Li.
The optimization method presented here assumes the function ρ is continuously differentiable. M-estimators have nontrivial ρ, though often M-estimator cost functions are specified in the literature as functions of √Li.
For example, the Huber cost function ρH[k] with scale k can be defined in terms of Li:

ρH[k](Li) ≡ { Li             if Li < k²
            { 2k·√Li − k²    if Li ≥ k²   (42)
We can differentiate the more general objective function of Eq. 41 around our
parameters with the aid of the chain rule:
∂C/∂x = ∑i ∂ρ(Li)/∂Li · ∂Li/∂x                       (43)

      = ∑i ∂ρ(Li)/∂Li · 2viᵀ · Ri⁻¹ · ∂vi(x)/∂x      (44)

      = ∑i ∂ρ(Li)/∂Li · 2viᵀ · Ri⁻¹ · (−Ji)          (45)

Setting the derivative to zero at the perturbed parameters and linearizing the error as in the Gauss-Newton method:

0 = ∑i ∂ρ(Li)/∂Li · [zi ⊖ hi(x ⊕ δ)]ᵀ · Ri⁻¹ · Ji    (46)

  ≈ ∑i ∂ρ(Li)/∂Li · (vi − Ji·δ)ᵀ · Ri⁻¹ · Ji         (47)

∑i ∂ρ(Li)/∂Li · δᵀ · Jiᵀ · Ri⁻¹ · Ji = ∑i ∂ρ(Li)/∂Li · viᵀ · Ri⁻¹ · Ji    (48)
The differential of the cost function is typically called the weight function:
Wi ≡ ∂ρ(Li)/∂Li                          (49)
For the example of the Huber cost function, the weight function is then:
WH[k](Li) ≡ { 1         if Li < k²
            { k/√Li     if Li ≥ k²        (50)
Transposing Eq. 48 and solving for δ yields the weighted normal equations:

δ = (∑i Wi · Jiᵀ · Ri⁻¹ · Ji)⁻¹ · (∑i Wi · Jiᵀ · Ri⁻¹ · vi)    (51)
Note that the scalar weights Wi are evaluated at each iteration from the squared
errors Li computed from the current parameters x. This method of optimizing
robust cost functions is called iterated re-weighted least-squares.
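A minimal sketch of iterated re-weighted least-squares with the Huber weight of Eq. 50, estimating a scalar location from observations containing one gross outlier. The data and scale k are illustrative assumptions; here hi(x) = x, so Ji = 1 and Ri = 1, and the weighted solve of Eq. 51 reduces to a weighted mean:

```python
import numpy as np

z = np.array([0.9, 1.1, 1.0, 0.95, 8.0])   # observations; 8.0 is an outlier
k = 1.0                                    # Huber scale

def huber_weight(L, k):
    """W(L) = 1 if L < k^2, else k / sqrt(L)   (Eq. 50)."""
    L = np.asarray(L, dtype=float)
    W = np.ones_like(L)
    mask = L >= k * k
    W[mask] = k / np.sqrt(L[mask])
    return W

x = np.mean(z)                 # unweighted least-squares start (pulled by outlier)
for _ in range(20):
    v = z - x                  # errors v_i, with h_i(x) = x and J_i = 1
    L = v * v                  # squared errors L_i
    W = huber_weight(L, k)     # re-evaluate weights at current parameters
    # Eq. 51 with J_i = 1, R_i = 1: x + δ = (∑ W_i z_i) / (∑ W_i)
    x = np.sum(W * z) / np.sum(W)
```

The outlier receives a small weight and the estimate settles near the inlier cluster, well below the unweighted mean.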
Depending on the properties of ρ, when the parameter vector x is far from the
optimum, the weights Wi might all tend to zero, in which case the iteration will
not converge or will converge very slowly (i.e. not quadratically). Such failure
modes can sometimes be avoided by first optimizing a convex cost function
(e.g. Huber) to find a reasonable estimate of the parameters before switching
to a non-convex cost function (e.g. Tukey or Cauchy). The latter cost functions
are able to more strongly reject outlier observations by assigning them very
low or zero weights.