Introduction To Non-Linear Optimization
Ross A. Lippert
D. E. Shaw Research
February 25, 2008
R. A. Lippert Non-linear optimization
Optimization problems
Problem: Let f : Rⁿ → (−∞, ∞]. Find

    min_{x∈Rⁿ} f(x)

i.e. find x* s.t. f(x*) = min_{x∈Rⁿ} f(x)
Quite general, but some cases, like f convex, are fairly solvable.
Today's problem: How about f : Rⁿ → R, smooth?

    find x* s.t. ∇f(x*) = 0
We have a reasonable shot at this if f is twice differentiable.
Two pillars of smooth multivariate optimization
nD optimization rests on:
    1. linear solve / quadratic optimization
    2. 1D optimization
The simplest example we can get
Quadratic optimization: f(x) = c − xᵀb + ½ xᵀAx.

    very common (actually universal, more later)

Finding ∇f(x) = 0:

    ∇f(x) = Ax − b = 0
    x* = A⁻¹b

A has to be invertible (really, b in range of A).
Is this all we need?
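A minimal numpy sketch of this slide: the 2×2 SPD matrix and right-hand side below are made-up examples, not from the slides.

```python
import numpy as np

# Hypothetical 2x2 SPD instance of f(x) = c - x^T b + (1/2) x^T A x:
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Solve A x = b rather than forming A^{-1}
x_star = np.linalg.solve(A, b)
grad = A @ x_star - b            # the gradient vanishes at the minimizer
```

At `x_star` the residual `A @ x_star - b` is zero up to roundoff, which is exactly the condition ∇f(x*) = 0 for this quadratic.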
Max, min, saddle, or what?
Require A to be positive definite. Why?

[Figure: surface plots of the quadratic form in the positive definite (bowl, unique minimum), negative definite (maximum), and indefinite (saddle) cases.]
Universality of linear algebra in optimization
f(x) = c − xᵀb + ½ xᵀAx

Linear solve: x* = A⁻¹b.

Even for non-linear problems: if the optimal x* is near our x,

    f(x*) ≈ f(x) + (x* − x)ᵀ∇f(x) + ½ (x* − x)ᵀ∇²f(x)(x* − x) + ⋯
    x* ≈ x − (∇²f(x))⁻¹∇f(x)

Optimization → Linear solve
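The approximate Newton step above can be sketched numerically. The function f(x) = x₀⁴ + x₁² and its hand-written derivatives here are hypothetical illustrations, not from the slides.

```python
import numpy as np

# Hypothetical smooth function f(x) = x0^4 + x1^2, minimum at the origin:
def grad(x):
    return np.array([4 * x[0] ** 3, 2 * x[1]])

def hess(x):
    return np.array([[12 * x[0] ** 2, 0.0],
                     [0.0,            2.0]])

x = np.array([0.5, 1.0])
# One Newton step: solve (Hessian) dx = -gradient, don't invert
dx = np.linalg.solve(hess(x), -grad(x))
x_new = x + dx
```

The quadratic coordinate is solved exactly in one step (x₁ goes to 0), while the quartic coordinate only contracts toward 0, as expected from a local quadratic model.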
Linear solve
x = A⁻¹b

But really we just want to solve

    Ax = b

Don't form A⁻¹ if you can avoid it.
(Don't form A if you can avoid that!)

For a general A, there are three important special cases:

    diagonal: A = diag(a₁, a₂, a₃), thus xᵢ = bᵢ/aᵢ

    orthogonal: AᵀA = I, thus A⁻¹ = Aᵀ and x = Aᵀb

    triangular: A = [a₁₁ 0 0; a₂₁ a₂₂ 0; a₃₁ a₃₂ a₃₃], thus
        xᵢ = (1/aᵢᵢ)(bᵢ − Σ_{j<i} aᵢⱼxⱼ)
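The triangular formula is forward substitution; a short sketch (the 3×3 system is a made-up example):

```python
import numpy as np

def solve_lower_triangular(a, b):
    """Forward substitution: x_i = (b_i - sum_{j<i} a_ij x_j) / a_ii."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - a[i, :i] @ x[:i]) / a[i, i]
    return x

A = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [4.0, 5.0, 6.0]])
b = np.array([2.0, 5.0, 32.0])
x = solve_lower_triangular(A, b)
```

Each unknown is resolved in order using only previously computed entries, O(n²) work in total.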
Direct methods
A is symmetric positive definite.

Cholesky factorization:

    A = LLᵀ,

where L is lower triangular. So x = L⁻ᵀ(L⁻¹b) by

    Lz = b:    zᵢ = (1/Lᵢᵢ)(bᵢ − Σ_{j<i} Lᵢⱼzⱼ)
    Lᵀx = z:   xᵢ = (1/Lᵢᵢ)(zᵢ − Σ_{j>i} Lⱼᵢxⱼ)
Direct methods
A is symmetric positive definite.

Eigenvalue factorization:

    A = QDQᵀ,

where Q is orthogonal and D is diagonal. Then

    x = Q(D⁻¹(Qᵀb)).

More expensive than Cholesky.
Direct methods are usually quite expensive (O(n³) work).
Iterative method basics
What's an iterative method?

Definition (informal)
An iterative method is an algorithm A which takes what you have, xᵢ, and gives you a new xᵢ₊₁ which is less bad, such that x₁, x₂, x₃, … converges to some x* with badness = 0.

A notion of badness could come from
    1. distance from xᵢ to our problem solution
    2. value of some objective function above its minimum
    3. size of the gradient at xᵢ

e.g. if x is supposed to satisfy Ax = b, we could take ||b − Ax|| to be the measure of badness.
Iterative method considerations
How expensive is one xᵢ → xᵢ₊₁ step?
How quickly does the badness decrease per step?

A thousand and one years of experience yields two cases:
    1. Bᵢ ∼ βⁱ for some β ∈ (0, 1) (linear)
    2. Bᵢ ∼ β^(qⁱ) for β ∈ (0, 1), q > 1 (superlinear)
[Figure: four plots of magnitude vs. iteration on a linear scale; the linear and superlinear histories look much the same.]
Can you tell the difference?
Convergence
[Figure: the same convergence histories plotted on semilog axes, magnitudes running down to 1e−04 and 1e−30 over 10 iterations.]
Now can you tell the difference?
[Figure: two more semilog plots of magnitude vs. iteration; linear convergence appears as a straight line, superlinear convergence bends downward ever faster.]
When evaluating an iterative method against a manufacturer's claims, be sure to make semilog plots.
Iterative methods
Motivation: directly optimize f(x) = c − xᵀb + ½ xᵀAx.

Gradient descent:
    1. Search direction: rᵢ = −∇f = b − Axᵢ
    2. Search step: xᵢ₊₁ = xᵢ + αᵢrᵢ
    3. Pick α: αᵢ = (rᵢᵀrᵢ)/(rᵢᵀArᵢ) minimizes f(xᵢ + αrᵢ)

    f(xᵢ + αrᵢ) = c − xᵢᵀb + ½ xᵢᵀAxᵢ + αrᵢᵀ(Axᵢ − b) + ½ α²rᵢᵀArᵢ
                = f(xᵢ) − αrᵢᵀrᵢ + ½ α²rᵢᵀArᵢ

(Cost of a step = 1 A-multiply.)
Iterative methods
Optimize f(x) = c − xᵀb + ½ xᵀAx.

Conjugate gradient descent:
    1. Search direction: dᵢ = rᵢ + βᵢdᵢ₋₁, with rᵢ = b − Axᵢ.
    2. Pick βᵢ = −(dᵢ₋₁ᵀArᵢ)/(dᵢ₋₁ᵀAdᵢ₋₁); ensures dᵢ₋₁ᵀAdᵢ = 0.
    3. Search step: xᵢ₊₁ = xᵢ + αᵢdᵢ
    4. Pick αᵢ = (dᵢᵀrᵢ)/(dᵢᵀAdᵢ): minimizes f(xᵢ + αdᵢ)

    f(xᵢ + αdᵢ) = c − xᵢᵀb + ½ xᵢᵀAxᵢ − αdᵢᵀrᵢ + ½ α²dᵢᵀAdᵢ

(also means that rᵢ₊₁ᵀdᵢ = 0)

Avoid the extra A-multiply: using Adᵢ₋₁ ∝ rᵢ₋₁ − rᵢ,

    βᵢ = −((rᵢ₋₁ − rᵢ)ᵀrᵢ)/((rᵢ₋₁ − rᵢ)ᵀdᵢ₋₁) = −((rᵢ₋₁ − rᵢ)ᵀrᵢ)/(rᵢ₋₁ᵀdᵢ₋₁) = ((rᵢ − rᵢ₋₁)ᵀrᵢ)/(rᵢ₋₁ᵀrᵢ₋₁)
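The four CG steps above, using the A-multiply-saving β formula, on a hypothetical 3×3 SPD system:

```python
import numpy as np

def conjugate_gradient(A, b, steps):
    """Linear CG for f(x) = c - x^T b + (1/2) x^T A x, starting at x0 = 0."""
    x = np.zeros_like(b)
    r = b.copy()                       # r_0 = b - A x_0
    d = r.copy()
    for _ in range(steps):
        Ad = A @ d                     # the single A-multiply per step
        alpha = (d @ r) / (d @ Ad)
        x = x + alpha * d
        r_new = r - alpha * Ad         # residual update, no second multiply
        if r_new @ r_new < 1e-30:      # converged early
            return x
        beta = ((r_new - r) @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = conjugate_gradient(A, b, 3)        # n = 3 steps suffice in exact arithmetic
```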
A cute result
Conjugate gradient descent:
    1. rᵢ = b − Axᵢ
    2. Search direction: dᵢ = rᵢ + βᵢdᵢ₋₁ (β s.t. dᵢᵀAdᵢ₋₁ = 0)
    3. Search step: xᵢ₊₁ = xᵢ + αᵢdᵢ (α minimizes).

Cute result (not that useful in practice):

Theorem (sub-optimality of CG)
(Assuming x₀ = 0) at the end of step k, the solution xₖ is the optimal linear combination of b, Ab, A²b, …, Aᵏb for minimizing c − bᵀx + ½ xᵀAx.

(computer arithmetic errors make this less than perfect)

Very little extra effort. Much better convergence.
Slow convergence: Conditioning
The eccentricity of the quadratic is a big factor in convergence.

[Figure: contour plot of an eccentric (ill-conditioned) quadratic on [−1, 1]².]
Convergence and eccentricity
    κ = max eig(A) / min eig(A)

For gradient descent,

    ||rᵢ|| ∼ ((κ − 1)/(κ + 1))ⁱ

For CG,

    ||rᵢ|| ∼ ((√κ − 1)/(√κ + 1))ⁱ

Useless CG fact: in exact arithmetic, rᵢ = 0 when i > n (A is n × n).
The truth about descent methods
Very slow unless κ can be controlled. How do we control κ?

    Ax = b  ⟶  (PAPᵀ)y = Pb,  x = Pᵀy

where P is a pre-conditioner you pick. How do we make κ(PAPᵀ) small?

    perfect answer: P = L⁻¹ where LLᵀ = A (Cholesky factorization).
    imperfect answer: P ≈ L⁻¹

Variations on the theme of incomplete factorization:
    P⁻¹ = D^(1/2) where D = diag(a₁₁, …, aₙₙ)
    more generally, incomplete Cholesky decomposition
    some easy nearby solution or simple approximate A (requiring domain knowledge)
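The diagonal choice P⁻¹ = D^(1/2) is the Jacobi preconditioner; here is a sketch of its effect on a made-up badly scaled SPD matrix:

```python
import numpy as np

# Hypothetical badly scaled SPD matrix:
A = np.array([[100.0, 1.0],
              [1.0,   0.02]])

# Jacobi preconditioner: P = D^{-1/2} with D = diag(A)
P = np.diag(1.0 / np.sqrt(np.diag(A)))
M = P @ A @ P.T                    # the preconditioned matrix P A P^T

kappa_before = np.linalg.cond(A)   # ~1e4
kappa_after = np.linalg.cond(M)    # ~6
```

Rescaling the diagonal to 1 collapses the eigenvalue spread, and the descent bounds above improve accordingly.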
Class project?
One idea for a preconditioner is a block diagonal matrix

    P⁻¹ = [L₁₁ 0 0; 0 L₂₂ 0; 0 0 L₃₃]

where LᵢᵢLᵢᵢᵀ = Aᵢᵢ, a diagonal block of A.

In what sense does good clustering give good preconditioners?

End of solvers: there are a few other iterative solvers out there I haven't discussed.
Second pillar: 1D optimization
1D optimization gives important insights into non-linearity.

    min_{s∈R} f(s),  f continuous.

A derivative-free option:
A bracket is (a, b, c) s.t. a < b < c and f(a) > f(b) < f(c); then f(x) has a local min for a < x < c.

[Figure: a bracketed minimum with points a, b, c.]

Golden search: pick a new point b′ with (a < b′ < b) or (b < b′ < c), compare f(b′) with f(b), and keep the sub-bracket that still brackets a minimum.

Newton's method: iterate xᵢ₊₁ = xᵢ − (∇²f(xᵢ))⁻¹∇f(xᵢ). Writing eᵢ = xᵢ − x*,

    ∇f(xᵢ) = ∇²f(x*)eᵢ + O(||eᵢ||²)
    ∇²f(xᵢ) = ∇²f(x*) + O(||eᵢ||)
    eᵢ₊₁ = eᵢ − (∇²fᵢ)⁻¹∇fᵢ = O(||eᵢ||²)

squares the error at every step (exactly eliminates the linear error).
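A sketch of the derivative-free bracket-shrinking idea (golden-section search); the test function is a made-up unimodal example. A real implementation would reuse one function value per iteration instead of re-evaluating both interior points.

```python
import math

def golden_section_min(f, a, c, tol=1e-8):
    """Shrink a bracket [a, c] around a local minimum of f.
    Assumes f is unimodal on [a, c]."""
    invphi = (math.sqrt(5) - 1) / 2          # 1/phi ~ 0.618
    while c - a > tol:
        b = c - invphi * (c - a)             # interior points b < b'
        bp = a + invphi * (c - a)
        if f(b) < f(bp):                     # a minimum lies in (a, b')
            c = bp
        else:                                # a minimum lies in (b, c)
            a = b
    return (a + c) / 2

x_min = golden_section_min(lambda x: (x - 1.5) ** 2, 0.0, 4.0)
```

Each comparison shrinks the bracket by the constant factor 1/φ ≈ 0.618, so convergence is linear in the sense of the earlier slide.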
Naïve Newton's method

Sources of trouble:
    1. if ∇²f(xᵢ) is not posdef, Δxᵢ = xᵢ₊₁ − xᵢ might be in an increasing direction.
    2. if ∇²f(xᵢ) is posdef, (∇f(xᵢ))ᵀΔxᵢ < 0, so Δxᵢ is a direction of decrease (could overshoot)
    3. even if f is convex, f(xᵢ₊₁) ≤ f(xᵢ) is not assured
       (f(x) = 1 + e⁻ˣ + log(1 + eˣ) starting from x = 2).
    4. if all goes well, superlinear convergence!
1D example of Newton trouble
1D example of trouble: f(x) = x⁴ − 2x² + 12x

[Figure: plot of f on roughly [−2, 1.5].]

    Has one local minimum
    Is not convex (note the concavity near x = 0)
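Trouble item 1 above is easy to see numerically on this very f: at x = 0 the second derivative is negative, so the Newton step points uphill, away from the minimum near x ≈ −1.67.

```python
# Newton's method on the slide's f(x) = x^4 - 2x^2 + 12x,
# which is concave near x = 0 (f''(0) = -4 < 0).
def fp(x):   # f'(x)  = 4x^3 - 4x + 12
    return 4 * x ** 3 - 4 * x + 12

def fpp(x):  # f''(x) = 12x^2 - 4
    return 12 * x ** 2 - 4

x0 = 0.0
step = -fp(x0) / fpp(x0)   # = -12 / -4 = 3: an *increasing* direction
x1 = x0 + step
```

Since f′(0) = 12 > 0, moving in the +x direction increases f; the naive step lands at x₁ = 3 with a much larger function value.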
1D example of Newton trouble
Derivative of trouble: f′(x) = 4x³ − 4x + 12

[Figure: plot of f′ on roughly [−2, 1.5]; in the region of negative f″ the quadratic model points the Newton step the wrong way.]

Fix: damp the step. Pick λᵢ ∈ (0, 1] and take xᵢ₊₁ = xᵢ + λᵢΔxᵢ so that f(xᵢ + λᵢΔxᵢ) decreases.

Armijo search uses this rule: λᵢ = β⁻ⁿ for the first n = 0, 1, 2, … such that s = β⁻ⁿ satisfies

    f(xᵢ + sΔxᵢ) − f(xᵢ) ≤ σ s (Δxᵢ)ᵀ∇f(xᵢ)

with β, σ fixed (e.g. β = 2, σ = ½).
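The backtracking rule above, as a sketch (parameter names `beta`, `sigma` follow the slide; the quadratic test function is a made-up example):

```python
import numpy as np

def armijo_step(f, grad_fx, x, dx, beta=2.0, sigma=0.5):
    """Try lam = beta**-n for n = 0, 1, 2, ... until the
    sufficient-decrease condition holds. Assumes dx is a descent
    direction (dx . grad_fx < 0)."""
    fx = f(x)
    lam = 1.0
    while f(x + lam * dx) > fx + sigma * lam * (dx @ grad_fx):
        lam /= beta
        if lam < 1e-12:          # safeguard against a bad direction
            break
    return lam

# Hypothetical test: f(x) = |x|^2 with the steepest-descent direction
f = lambda v: v @ v
x = np.array([1.0, 0.0])
g = 2 * x                        # gradient at x
lam = armijo_step(f, g, x, -g)
```

Here the full step λ = 1 overshoots (f is unchanged sign-flipped), and one halving to λ = ½ satisfies the sufficient-decrease test, landing exactly on the minimum.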
Line searching
1D minimization looks like less of a hack than Armijo; but for Newton the asymptotic convergence is not strongly affected, function evaluations can be expensive, and a careful 1D search matters mainly far from x*.

Non-linear CG, without pre-conditioner:
    1. Search direction: dᵢ = rᵢ + βᵢdᵢ₋₁, with rᵢ = −∇f(xᵢ).
    2. Pick βᵢ = ((rᵢ − rᵢ₋₁)ᵀrᵢ)/(rᵢ₋₁ᵀrᵢ₋₁) (Polak-Ribière)
    3. Search step: xᵢ₊₁ = xᵢ + αᵢdᵢ
    4. Pick α by zero-finding: dᵢᵀ∇f(xᵢ + αdᵢ) = 0

With B = PPᵀ, a change of metric:
    1. Search direction: dᵢ = Brᵢ + βᵢdᵢ₋₁, with rᵢ = −∇f(xᵢ).
    2. Pick βᵢ = ((rᵢ − rᵢ₋₁)ᵀBrᵢ)/(rᵢ₋₁ᵀBrᵢ₋₁)
    3. Search step: xᵢ₊₁ = xᵢ + αᵢdᵢ
    4. Pick α by zero-finding: dᵢᵀ∇f(xᵢ + αdᵢ) = 0
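A sketch of the unpreconditioned variant (B = I); the secant inner loop is a stand-in for the zero-finding line search, and the test problem is a made-up convex quadratic:

```python
import numpy as np

def nonlinear_cg(grad, x0, iters):
    """Polak-Ribiere non-linear CG sketch, no preconditioner.
    Inner loop: secant iteration on phi'(a) = d . grad(x + a d)."""
    x = x0.copy()
    r = -grad(x)
    d = r.copy()
    for _ in range(iters):
        a0, a1, g0 = 0.0, 1.0, d @ grad(x)
        for _ in range(30):
            g1 = d @ grad(x + a1 * d)
            if abs(g1) < 1e-12 or abs(g1 - g0) < 1e-30:
                break
            a0, a1, g0 = a1, a1 - g1 * (a1 - a0) / (g1 - g0), g1
        x = x + a1 * d
        r_new = -grad(x)
        if r_new @ r_new < 1e-20:      # gradient ~ 0: done
            return x
        beta = ((r_new - r) @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return x

# Hypothetical test problem: f(x, y) = (x - 1)^2 + 10 (y + 2)^2
g = lambda v: np.array([2.0 * (v[0] - 1.0), 20.0 * (v[1] + 2.0)])
x = nonlinear_cg(g, np.zeros(2), 10)
```

On a quadratic, φ′(α) is linear, so the secant step is exact and this reduces to linear CG: two iterations locate the minimum of the 2D test problem.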
What else?
Remember this cute property?

Theorem (sub-optimality of CG)
(Assuming x₀ = 0) at the end of step k, the solution xₖ is the optimal linear combination of b, Ab, A²b, …, Aᵏb for minimizing c − bᵀx + ½ xᵀAx.

In a sense, CG learns about A from the history of b − Axᵢ. Noting,
    1. computer arithmetic errors ruin this nice property quickly
    2. non-linearity ruins this property quickly
Quasi-Newton
Quasi-Newton has much popularity/hype. What if we approximate (∇²f(x*))⁻¹ from the data we have?

    (∇²f(x*))(xᵢ − xᵢ₋₁) ≈ ∇f(xᵢ) − ∇f(xᵢ₋₁)
    xᵢ − xᵢ₋₁ ≈ (∇²f(x*))⁻¹(∇f(xᵢ) − ∇f(xᵢ₋₁))

over some fixed, finite history.

Data: yᵢ = ∇f(xᵢ) − ∇f(xᵢ₋₁), sᵢ = xᵢ − xᵢ₋₁, with 1 ≤ i ≤ k.
Problem: find symmetric positive definite Hₖ s.t.

    Hₖyᵢ = sᵢ

Multiple solutions, but BFGS works best in most situations.
BFGS update
    Hₖ = (I − sₖyₖᵀ/(yₖᵀsₖ)) Hₖ₋₁ (I − yₖsₖᵀ/(yₖᵀsₖ)) + sₖsₖᵀ/(yₖᵀsₖ)

Lemma
The BFGS update minimizes min_H ||H⁻¹ − Hₖ₋₁⁻¹||²_F such that Hyₖ = sₖ.

Forming Hₖ is not necessary; e.g. Hₖv can be recursively computed.
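The recursive matrix-free application of Hₖ can be sketched directly from the update formula. The two (s, y) pairs below are a made-up history generated by the quadratic Hessian A = diag(2, 4), so y = As; for this orthogonal history Hₖ matches A⁻¹ exactly.

```python
import numpy as np

def apply_bfgs(history, v):
    """Apply H_k to v without forming H_k, recursing through the
    (s, y) history with the BFGS update formula; H_0 = I."""
    if not history:
        return v
    s, y = history[-1]
    rho = 1.0 / (y @ s)
    u = v - rho * y * (s @ v)           # (I - rho y s^T) v
    h = apply_bfgs(history[:-1], u)     # H_{k-1} u, recursively
    return h - rho * s * (y @ h) + rho * s * (s @ v)

# Hypothetical history with y_i = A s_i, A = diag(2, 4):
s1, y1 = np.array([1.0, 0.0]), np.array([2.0, 0.0])
s2, y2 = np.array([0.0, 1.0]), np.array([0.0, 4.0])
hist = [(s1, y1), (s2, y2)]
h_y2 = apply_bfgs(hist, y2)             # secant condition: equals s2
```

This is the idea behind limited-memory BFGS: keep only the recent pairs and apply Hₖ by recursion, never storing an n × n matrix.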
Quasi-Newton
Typically keep about 5 data points in the history.

    initialize: set H₀ = I, r₀ = −∇f(x₀), d₀ = r₀; go to 3
    1. Compute rₖ = −∇f(xₖ), yₖ = rₖ₋₁ − rₖ
    2. Compute dₖ = Hₖrₖ
    3. Search step: xₖ₊₁ = xₖ + αₖdₖ (line-search)

Asymptotically identical to CG (with αᵢ = (rᵢᵀdᵢ)/(dᵢᵀ(∇²f)dᵢ)).

Armijo line searching has good theoretical properties. Typically used.

Quasi-Newton ideas generalize beyond optimization (e.g. fixed-point iterations).
Summary
    All multi-variate optimization relates to posdef linear solves.
    Simple iterative methods require pre-conditioning to be effective in high dimensions.
    Line searching strategies are highly variable.
    Timing and storage of f, ∇f, ∇²f are all critical in selecting your method.

    f          ∇f     concerns   method
    fast       fast   2,5        quasi-N (zero-search)
    fast       fast   5          CG (zero-search)
    fast       slow   1,2,3      derivative-free methods
    fast       slow   2,5        quasi-N (min-search)
    fast       slow   3,5        CG (min-search)
    fast/slow  slow   2,4,5      quasi-N with Armijo
    fast/slow  slow   4,5        CG (linearized)

    1 = time, 2 = space, 3 = accuracy, 4 = robust vs. nonlinearity, 5 = precondition

Don't take this table too seriously…