Introduction To Non-Linear Optimization: Ross A. Lippert


Introduction to non-linear optimization

Ross A. Lippert
D. E. Shaw Research
February 25, 2008
Optimization problems
Problem: let f : ℝⁿ → (−∞, ∞];
find min over x ∈ ℝⁿ of f(x), i.e. find x* s.t. f(x*) = min over x ∈ ℝⁿ of f(x).
Quite general, but some cases, like f convex, are fairly solvable.
Today's problem: how about f : ℝⁿ → ℝ, smooth?
Find x* s.t. ∇f(x*) = 0.
We have a reasonable shot at this if f is twice differentiable.
Two pillars of smooth multivariate optimization
nD optimization: linear solve / quadratic optimization
1D optimization
The simplest example we can get
Quadratic optimization: f(x) = c − xᵗb + ½xᵗAx.
very common (actually universal, more later)
Finding ∇f(x) = 0:
∇f(x) = Ax − b = 0  ⟹  x* = A⁻¹b
A has to be invertible (really, b in range of A).
Is this all we need?
Max, min, saddle, or what?
Require A be positive definite, why?
[Figure: surface plots over [−1, 1]² of quadratics whose A is positive definite (a bowl: minimum), negative definite (a dome: maximum), and indefinite (a saddle).]
Universality of linear algebra in optimization
f(x) = c − xᵗb + ½xᵗAx
Linear solve: x* = A⁻¹b.
Even for non-linear problems: if the optimal x* is near our x,
f(x*) ≈ f(x) + (x* − x)ᵗ∇f(x) + ½(x* − x)ᵗ∇²f(x)(x* − x) + ⋯
x* ≈ x − (∇²f(x))⁻¹∇f(x)
Optimization ⟺ Linear solve
Linear solve
x = A⁻¹b
But really we just want to solve
Ax = b
Don't form A⁻¹ if you can avoid it.
(Don't form A if you can avoid that!)
For a general A, there are three important special cases:
diagonal: A = [a₁ 0 0; 0 a₂ 0; 0 0 a₃], thus xᵢ = (1/aᵢ)bᵢ
orthogonal: AᵗA = I, thus A⁻¹ = Aᵗ and x = Aᵗb
triangular: A = [a₁₁ 0 0; a₂₁ a₂₂ 0; a₃₁ a₃₂ a₃₃], thus xᵢ = (1/aᵢᵢ)(bᵢ − Σ_{j<i} aᵢⱼxⱼ)
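The triangular case is a single forward pass; a minimal sketch in plain Python (`forward_solve` is a hypothetical helper name, not from the slides):

```python
def forward_solve(A, b):
    """Solve A x = b for lower-triangular A by forward substitution:
    x_i = (b_i - sum_{j<i} A[i][j] * x[j]) / A[i][i]."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / A[i][i]
    return x

# Example: a 3x3 lower-triangular system
A = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
b = [2.0, 5.0, 15.0]
x = forward_solve(A, b)
```

The upper-triangular (back-substitution) case is the same loop run in reverse index order.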
Direct methods
A is symmetric positive definite.
Cholesky factorization:
A = LLᵗ,
where L is lower triangular. So x = L⁻ᵗ(L⁻¹b) by
Lz = b: zᵢ = (1/Lᵢᵢ)(bᵢ − Σ_{j<i} Lᵢⱼzⱼ)
Lᵗx = z: xᵢ = (1/Lᵢᵢ)(zᵢ − Σ_{j>i} Lⱼᵢxⱼ)
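A plain-Python sketch of the factorization plus the two triangular solves (hypothetical helper names; real code would call a library routine such as LAPACK's `dpotrf`/`dpotrs`):

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L L^t (A symmetric posdef)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_spd(A, b):
    """Solve A x = b via L z = b (forward) then L^t x = z (backward)."""
    n = len(b)
    L = cholesky(A)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][j] * z[j] for j in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[j][i] * x[j] for j in range(i + 1, n))) / L[i][i]
    return x

# Example: a small symmetric positive definite system
A = [[4.0, 2.0, 0.0],
     [2.0, 5.0, 1.0],
     [0.0, 1.0, 3.0]]
b = [2.0, 1.0, 4.0]
x = solve_spd(A, b)
```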
Direct methods
A is symmetric positive definite.
Eigenvalue factorization:
A = QDQᵗ,
where Q is orthogonal and D is diagonal. Then
x = Q(D⁻¹(Qᵗb)).
More expensive than Cholesky.
Direct methods are usually quite expensive (O(n³) work).
Iterative method basics
What's an iterative method?
Definition (informal)
An iterative method is an algorithm A which takes what you have, xᵢ, and gives you a new xᵢ₊₁ which is less bad, such that x₁, x₂, x₃, . . . converges to some x* with badness = 0.
A notion of badness could come from
1. distance from xᵢ to our problem solution
2. value of some objective function above its minimum
3. size of the gradient at xᵢ
e.g. if x is supposed to satisfy Ax = b, we could take ||b − Ax|| to be the measure of badness.
Iterative method considerations
How expensive is one xᵢ → xᵢ₊₁ step?
How quickly does the badness decrease per step?
A thousand and one years of experience yields two cases:
1. Bᵢ ∼ αⁱ for some α ∈ (0, 1) (linear)
2. Bᵢ ∼ α^(βⁱ) for α ∈ (0, 1), β > 1 (superlinear)
[Figure: four plots of badness magnitude vs. iteration on linear axes; the linear and superlinear histories look much the same.]
Can you tell the difference?
Convergence
[Figure: the same convergence histories replotted on semilog axes; linear convergence falls to ~1e−4 along a straight line while superlinear convergence plunges to ~1e−30.]
Now can you tell the difference?
When evaluating an iterative method against a manufacturer's claims, be sure to do semilog plots.
Iterative methods
Motivation: directly optimize f(x) = c − xᵗb + ½xᵗAx.
gradient descent:
1. Search direction: rᵢ = −∇f = b − Axᵢ
2. Search step: xᵢ₊₁ = xᵢ + αᵢrᵢ
3. Pick alpha: αᵢ = (rᵢᵗrᵢ)/(rᵢᵗArᵢ) minimizes f(xᵢ + αrᵢ)
f(xᵢ + αrᵢ) = c − xᵢᵗb + ½xᵢᵗAxᵢ + αrᵢᵗ(Axᵢ − b) + ½α²rᵢᵗArᵢ = f(xᵢ) − αrᵢᵗrᵢ + ½α²rᵢᵗArᵢ
(Cost of a step = 1 A-multiply.)
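The three steps above can be sketched directly in plain Python (hypothetical helper name; the exact αᵢ makes each step a 1D minimization along rᵢ):

```python
def grad_descent(A, b, x0, steps):
    """Steepest descent on f(x) = c - x^t b + 0.5 x^t A x with the
    exact step alpha_i = (r^t r)/(r^t A r), where r = b - A x = -grad f."""
    matvec = lambda M, v: [sum(M[i][j] * v[j] for j in range(len(v)))
                           for i in range(len(v))]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = list(x0)
    for _ in range(steps):
        r = [bi - Axi for bi, Axi in zip(b, matvec(A, x))]  # residual
        if dot(r, r) == 0.0:
            break
        alpha = dot(r, r) / dot(r, matvec(A, r))            # exact 1D minimizer
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
    return x

A = [[3.0, 1.0], [1.0, 2.0]]   # symmetric positive definite
b = [1.0, 1.0]
x = grad_descent(A, b, [0.0, 0.0], 100)
# exact solution of A x = b is x = [0.2, 0.4]
```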
Iterative methods
Optimize f(x) = c − xᵗb + ½xᵗAx.
conjugate gradient descent:
1. Search direction: dᵢ = rᵢ + βᵢdᵢ₋₁, with rᵢ = b − Axᵢ.
2. Pick βᵢ = −(dᵢ₋₁ᵗArᵢ)/(dᵢ₋₁ᵗAdᵢ₋₁); ensures dᵢ₋₁ᵗAdᵢ = 0.
3. Search step: xᵢ₊₁ = xᵢ + αᵢdᵢ
4. Pick αᵢ = (dᵢᵗrᵢ)/(dᵢᵗAdᵢ): minimizes f(xᵢ + αdᵢ)
f(xᵢ + αdᵢ) = c − xᵢᵗb + ½xᵢᵗAxᵢ − αdᵢᵗrᵢ + ½α²dᵢᵗAdᵢ
(also means that rᵢ₊₁ᵗdᵢ = 0)
Avoid the extra A-multiply: using Adᵢ₋₁ ∝ rᵢ₋₁ − rᵢ,
βᵢ = −((rᵢ₋₁ − rᵢ)ᵗrᵢ)/((rᵢ₋₁ − rᵢ)ᵗdᵢ₋₁) = −((rᵢ₋₁ − rᵢ)ᵗrᵢ)/(rᵢ₋₁ᵗdᵢ₋₁) = ((rᵢ − rᵢ₋₁)ᵗrᵢ)/(rᵢ₋₁ᵗrᵢ₋₁)
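The whole recurrence can be sketched using the final β formula, so each step costs one A-multiply (hypothetical helper name, plain Python):

```python
def conjugate_gradient(A, b, x0, steps):
    """CG for f(x) = c - x^t b + 0.5 x^t A x, with
    beta_i = (r_i - r_{i-1})^t r_i / (r_{i-1}^t r_{i-1})."""
    n = len(b)
    matvec = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = list(x0)
    r = [bi - Axi for bi, Axi in zip(b, matvec(A, x))]   # r = b - A x
    d = list(r)
    for _ in range(steps):
        if dot(r, r) < 1e-30:
            break
        Ad = matvec(A, d)
        alpha = dot(d, r) / dot(d, Ad)                   # exact line minimization
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r_new = [ri - alpha * Adi for ri, Adi in zip(r, Ad)]
        beta = dot([rn - ro for rn, ro in zip(r_new, r)], r_new) / dot(r, r)
        d = [rn + beta * di for rn, di in zip(r_new, d)]
        r = r_new
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = conjugate_gradient(A, b, [0.0, 0.0, 0.0], 3)  # exact in <= n steps (exact arithmetic)
```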
A cute result
conjugate gradient descent:
1. rᵢ = b − Axᵢ
2. Search direction: dᵢ = rᵢ + βᵢdᵢ₋₁ (βᵢ s.t. dᵢᵗAdᵢ₋₁ = 0)
3. Search step: xᵢ₊₁ = xᵢ + αᵢdᵢ (αᵢ minimizes).
Cute result (not that useful in practice):
Theorem (sub-optimality of CG)
(Assuming x₀ = 0) at the end of step k, the solution xₖ is the optimal linear combination of b, Ab, A²b, . . . , Aᵏb for minimizing c − bᵗx + ½xᵗAx.
(computer arithmetic errors make this less than perfect)
Very little extra effort. Much better convergence.
Slow convergence: Conditioning
The eccentricity of the quadratic is a big factor in convergence.
[Figure: contours of a highly eccentric quadratic over [−1, 1]².]
Convergence and eccentricity
κ = max eig(A) / min eig(A)
For gradient descent,
||rᵢ|| ∼ ((κ − 1)/(κ + 1))ⁱ
For CG,
||rᵢ|| ∼ ((√κ − 1)/(√κ + 1))ⁱ
useless CG fact: in exact arithmetic, rᵢ = 0 when i > n (A is n × n).
The truth about descent methods
Very slow unless κ can be controlled.
How do we control κ?
Ax = b  ⟶  (PAPᵗ)y = Pb, x = Pᵗy
where P is a pre-conditioner you pick.
How to make κ(PAPᵗ) small?
perfect answer: P = L⁻¹ where LLᵗ = A (Cholesky factorization).
imperfect answer: P ≈ L⁻¹
Variations on the theme of incomplete factorization:
P⁻¹ = D^(1/2) where D = diag(a₁₁, . . . , aₙₙ)
more generally, incomplete Cholesky decomposition
some easy nearby solution or simple approximate A (requiring domain knowledge)
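A tiny illustration of why this matters, using steepest descent on a diagonal quadratic (hypothetical example numbers; with A = diag(100, 1) we have κ = 100, while the diagonally rescaled system has κ = 1):

```python
def gd_iterations(diag_a, b, tol=1e-8):
    """Steepest descent with exact alpha on a diagonal quadratic;
    returns the number of iterations to drive ||r|| below tol."""
    x = [0.0] * len(b)
    for it in range(100000):
        r = [bi - ai * xi for ai, bi, xi in zip(diag_a, b, x)]
        rr = sum(ri * ri for ri in r)
        if rr ** 0.5 < tol:
            return it
        alpha = rr / sum(ai * ri * ri for ai, ri in zip(diag_a, r))
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
    return None

plain = gd_iterations([100.0, 1.0], [1.0, 1.0])   # kappa = 100: slow zig-zag
# With P^-1 = D^(1/2), the preconditioned matrix D^(-1/2) A D^(-1/2) is the
# identity (kappa = 1) and the right-hand side becomes D^(-1/2) b = [0.1, 1].
precond = gd_iterations([1.0, 1.0], [0.1, 1.0])   # one step suffices
```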
Class project?
One idea for a preconditioner is a block diagonal matrix
P⁻¹ = [L₁₁ 0 0; 0 L₂₂ 0; 0 0 L₃₃]
where LᵢᵢᵗLᵢᵢ = Aᵢᵢ, a diagonal block of A.
In what sense does good clustering give good preconditioners?
End of solvers: there are a few other iterative solvers out there I haven't discussed.
Second pillar: 1D optimization
1D optimization gives important insights into non-linearity.
min over s ∈ ℝ of f(s), f continuous.
A derivative-free option:
A bracket is (a, b, c) s.t. a < b < c and f(a) > f(b) < f(c); then f(x) has a local min for a < x < c.
[Figure: a bracketed function with points a < b < c.]
Golden search: pick a < b′ < b < c; either (a, b′, b) or (b′, b, c) is a new bracket . . . continue.
Linearly convergent, eᵢ ∼ Gⁱ, G the golden ratio.
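A minimal sketch of golden search (hypothetical helper; assumes f is unimodal on the initial bracket, and shrinks it by the golden ratio each step):

```python
import math

def golden_min(f, a, c, tol=1e-9):
    """Shrink the bracket [a, c] by the golden ratio until it is smaller
    than tol; returns the midpoint as the argmin estimate."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0   # 0.618..., reciprocal golden ratio
    x1 = c - invphi * (c - a)
    x2 = a + invphi * (c - a)
    f1, f2 = f(x1), f(x2)
    while c - a > tol:
        if f1 < f2:               # min lies in [a, x2]; reuse x1 as new x2
            c, x2, f2 = x2, x1, f1
            x1 = c - invphi * (c - a)
            f1 = f(x1)
        else:                     # min lies in [x1, c]; reuse x2 as new x1
            a, x1, f1 = x1, x2, f2
            x2 = a + invphi * (c - a)
            f2 = f(x2)
    return 0.5 * (a + c)

s = golden_min(lambda s: (s - 2.0) ** 2, 0.0, 5.0)   # minimum at s = 2
```

The golden ratio spacing is what lets one of the two interior evaluations be reused each step, so each iteration costs a single new f evaluation.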
1D optimization
Fundamentally limited accuracy of derivative-free argmin:
[Figure: near a flat minimum, very different arguments give numerically indistinguishable function values.]
Derivative-based methods, f′(s) = 0, for an accurate argmin:
bracketed: (a, b) s.t. f′(a), f′(b) of opposite sign
1. bisection (linearly convergent)
2. modified regula falsi & Brent's method (superlinear)
unbracketed:
1. secant method (superlinear)
2. Newton's method (superlinear; requires another derivative)
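Bisection on f′ is the simplest bracketed variant; a sketch (hypothetical helper, assuming f′ changes sign on [a, b]):

```python
def bisect_root(g, a, b, tol=1e-12):
    """Find s with g(s) = 0 given sign(g(a)) != sign(g(b));
    halves the bracket each step (linear convergence, rate 1/2)."""
    ga, gb = g(a), g(b)
    assert ga * gb < 0.0, "need a sign change on [a, b]"
    while b - a > tol:
        m = 0.5 * (a + b)
        gm = g(m)
        if ga * gm <= 0.0:   # root is in [a, m]
            b, gb = m, gm
        else:                # root is in [m, b]
            a, ga = m, gm
    return 0.5 * (a + b)

# f(s) = (s - 2)^2 has f'(s) = 2(s - 2), which changes sign on [0, 5]
s = bisect_root(lambda s: 2.0 * (s - 2.0), 0.0, 5.0)
```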
From quadratic to non-linear optimizations
What can happen when far from the optimum?
−∇f(x) always points in a direction of decrease
∇²f(x) may not be positive definite
For convex problems ∇²f is always positive semi-definite, and for strictly convex it is positive definite.
What do we want?
find a convex neighborhood of x* (be robust against mistakes)
apply a quadratic approximation (do a linear solve)
Fact: for every non-linear optimization algorithm, there is an f which fools it.
Naïve Newton's method
Newton's method: finding x s.t. ∇f(x) = 0,
Δxᵢ = −(∇²f(xᵢ))⁻¹∇f(xᵢ)
xᵢ₊₁ = xᵢ + Δxᵢ
Asymptotic convergence, eᵢ = xᵢ − x*:
∇f(xᵢ) = ∇²f(x*)eᵢ + O(||eᵢ||²)
∇²f(xᵢ) = ∇²f(x*) + O(||eᵢ||)
eᵢ₊₁ = eᵢ − (∇²fᵢ)⁻¹∇fᵢ = O(||eᵢ||²)
squares the error at every step (exactly eliminates the linear error).
Naïve Newton's method
Sources of trouble:
1. if ∇²f(xᵢ) is not posdef, Δxᵢ = xᵢ₊₁ − xᵢ might be in an increasing direction.
2. if ∇²f(xᵢ) is posdef, (∇f(xᵢ))ᵗΔxᵢ < 0, so Δxᵢ is a direction of decrease (could overshoot)
3. even if f is convex, f(xᵢ₊₁) ≤ f(xᵢ) is not assured. (f(x) = 1 + e⁻ˣ + log(1 + eˣ) starting from x = 2)
4. if all goes well, superlinear convergence!
1D example of Newton trouble
1D example of trouble: f(x) = x⁴ − 2x² + 12x
[Figure: plot of f on [−2, 1.5].]
Has one local minimum.
Is not convex (note the concavity near x = 0).
1D example of Newton trouble
derivative of trouble: f′(x) = 4x³ − 4x + 12
[Figure: plot of f′ on [−2, 1.5].]
the negative f′′ region around x = 0 repels the iterates:
0 → 3 → 1.96154 → 1.14718 → 0.00658 → 3.00039 → 1.96182 → 1.14743 → 0.00726 → 3.00047 → 1.96188 → 1.14749 → ⋯
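The orbit above is easy to reproduce; a sketch of naive Newton applied to f′(x) = 4x³ − 4x + 12 with f″(x) = 12x² − 4:

```python
def newton_iterates(x0, steps):
    """Naive Newton for f(x) = x^4 - 2x^2 + 12x:
    x_{i+1} = x_i - f'(x_i) / f''(x_i)."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        g = 4.0 * x ** 3 - 4.0 * x + 12.0   # f'(x)
        h = 12.0 * x ** 2 - 4.0             # f''(x), negative near x = 0
        xs.append(x - g / h)
    return xs

xs = newton_iterates(0.0, 8)
# reproduces the orbit 0, 3, 1.96154, 1.14718, 0.00658, 3.00039, ...
```

Each pass through the concave region (where f″ < 0) throws the iterate back out to about 3, so the sequence cycles instead of converging.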
Non-linear Newton
Try to enforce f(xᵢ₊₁) ≤ f(xᵢ):
Δxᵢ = −(λI + ∇²f(xᵢ))⁻¹∇f(xᵢ)
xᵢ₊₁ = xᵢ + αᵢΔxᵢ
Set λ > 0 to keep Δxᵢ in a direction of decrease (many heuristics).
Pick αᵢ > 0 such that f(xᵢ + αᵢΔxᵢ) ≤ f(xᵢ). If Δxᵢ is a direction of decrease, some αᵢ exists.
1D-minimization: do a 1D optimization problem,
min over αᵢ > 0 of f(xᵢ + αᵢΔxᵢ)
Armijo search: use the rule αᵢ = σβⁿ for some n s.t.
f(xᵢ + σβⁿΔxᵢ) − f(xᵢ) ≤ γσβⁿ(Δxᵢ)ᵗ∇f(xᵢ)
with σ, β, γ fixed (e.g. σ = 2, β = γ = ½).
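A backtracking sketch of the Armijo rule with the example constants σ = 2, β = γ = ½ (hypothetical helper; the caller supplies the directional derivative (Δxᵢ)ᵗ∇f(xᵢ)):

```python
def armijo_step(f, x, dx, grad_dot_dx, sigma=2.0, beta=0.5, gamma=0.5,
                max_halvings=50):
    """Try s = sigma * beta^n for n = 0, 1, 2, ... and accept the first s
    with f(x + s*dx) - f(x) <= gamma * s * grad_dot_dx.
    dx must be a descent direction, i.e. grad_dot_dx < 0."""
    fx = f(x)
    s = sigma
    for _ in range(max_halvings):
        if f(x + s * dx) - fx <= gamma * s * grad_dot_dx:
            return s
        s *= beta
    raise RuntimeError("no acceptable step found")

# 1D example: f(x) = x^4; at x = 1 the Newton direction is dx = -f'/f'' = -1/3
f = lambda x: x ** 4
x0, dx = 1.0, -1.0 / 3.0
s = armijo_step(f, x0, dx, grad_dot_dx=4.0 * dx)   # f'(1) = 4
```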
Line searching
1D-minimization looks like less of a hack than Armijo. But for Newton, asymptotic convergence is not strongly affected, and function evaluations can be expensive.
far from x*, their only value is ensuring decrease
near x*, the methods will return αᵢ ≈ 1.
If you have a Newton step, accurate line-searching adds little value.
Practicality
Direct (non-iterative, non-structured) solves are expensive!
∇²f information is often expensive!
Iterative methods
gradient descent:
1. Search direction: rᵢ = −∇f(xᵢ)
2. Search step: xᵢ₊₁ = xᵢ + αᵢrᵢ
3. Pick alpha (depends on what's cheap):
1. linearized: αᵢ = (rᵢᵗrᵢ)/(rᵢᵗ(∇²f)rᵢ)
2. minimization: min over α of f(xᵢ + αrᵢ) (danger: low quality)
3. zero-finding: rᵢᵗ∇f(xᵢ + αrᵢ) = 0
Iterative methods
conjugate gradient descent:
1. Search direction: dᵢ = rᵢ + βᵢdᵢ₋₁, with rᵢ = −∇f(xᵢ).
2. Pick βᵢ without ∇²f:
1. βᵢ = ((rᵢ − rᵢ₋₁)ᵗrᵢ)/(rᵢ₋₁ᵗrᵢ₋₁) (Polak-Ribière)
2. can also use βᵢ = (rᵢᵗrᵢ)/(rᵢ₋₁ᵗrᵢ₋₁) (Fletcher-Reeves)
3. Search step: xᵢ₊₁ = xᵢ + αᵢdᵢ
1. linearized: αᵢ = (rᵢᵗdᵢ)/(dᵢᵗ(∇²f)dᵢ)
2. 1D minimization: min over α of f(xᵢ + αdᵢ) (danger: low quality)
3. zero-finding: dᵢᵗ∇f(xᵢ + αdᵢ) = 0
Don't forget the truth about iterative methods
To get good convergence you must precondition!
B ≈ (∇²f(x*))⁻¹
With pre-conditioner P:
1. Search direction: dᵢ = rᵢ + βᵢdᵢ₋₁, with rᵢ = −Pᵗ∇f(xᵢ).
2. Pick βᵢ = ((rᵢ − rᵢ₋₁)ᵗrᵢ)/(rᵢ₋₁ᵗrᵢ₋₁) (Polak-Ribière)
3. Search step: xᵢ₊₁ = xᵢ + αᵢdᵢ
4. zero-finding: dᵢᵗ∇f(xᵢ + αdᵢ) = 0
With B = PPᵗ, as a change of metric:
1. Search direction: dᵢ = rᵢ + βᵢdᵢ₋₁, with rᵢ = −∇f(xᵢ).
2. Pick βᵢ = ((rᵢ − rᵢ₋₁)ᵗBrᵢ)/(rᵢ₋₁ᵗBrᵢ₋₁)
3. Search step: xᵢ₊₁ = xᵢ + αᵢdᵢ
4. zero-finding: dᵢᵗB∇f(xᵢ + αdᵢ) = 0
What else?
Remember this cute property?
Theorem (sub-optimality of CG)
(Assuming x₀ = 0) at the end of step k, the solution xₖ is the optimal linear combination of b, Ab, A²b, . . . , Aᵏb for minimizing c − bᵗx + ½xᵗAx.
In a sense, CG learns about A from the history of b − Axᵢ.
Noting:
1. computer arithmetic errors ruin this nice property quickly
2. non-linearity ruins this property quickly
Quasi-Newton
Quasi-Newton has much popularity/hype. What if we approximate (∇²f(x*))⁻¹ from the data we have?
(∇²f(x*))(xᵢ − xₖ₋₁) ≈ ∇f(xᵢ) − ∇f(xₖ₋₁)
xᵢ − xₖ₋₁ ≈ (∇²f(x*))⁻¹(∇f(xᵢ) − ∇f(xₖ₋₁))
over some fixed, finite history.
Data: yᵢ = ∇f(xᵢ) − ∇f(xₖ₋₁), sᵢ = xᵢ − xₖ₋₁, with 1 ≤ i < k.
Problem: find symmetric positive definite Hₖ s.t.
Hₖyᵢ = sᵢ
Multiple solutions, but BFGS works best in most situations.
BFGS update
Hₖ = (I − (sₖyₖᵗ)/(yₖᵗsₖ)) Hₖ₋₁ (I − (yₖsₖᵗ)/(yₖᵗsₖ)) + (sₖsₖᵗ)/(yₖᵗsₖ)
Lemma
The BFGS update minimizes ||H⁻¹ − Hₖ₋₁⁻¹||²_F over H such that Hyₖ = sₖ.
Forming Hₖ is not necessary; e.g. Hₖv can be recursively computed.
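The secant condition Hₖyₖ = sₖ can be checked numerically; a 2×2 sketch of the update formula above (hypothetical helper, dense matrices as lists of lists):

```python
def bfgs_update(H, s, y):
    """One BFGS update of the inverse-Hessian approximation H:
    H_k = (I - s y^t/(y^t s)) H (I - y s^t/(y^t s)) + s s^t/(y^t s)."""
    n = len(s)
    rho = 1.0 / sum(yi * si for yi, si in zip(y, s))
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    left = [[I[i][j] - rho * s[i] * y[j] for j in range(n)] for i in range(n)]
    right = [[I[i][j] - rho * y[i] * s[j] for j in range(n)] for i in range(n)]
    core = matmul(matmul(left, H), right)
    return [[core[i][j] + rho * s[i] * s[j] for j in range(n)] for i in range(n)]

H0 = [[1.0, 0.0], [0.0, 1.0]]
s = [1.0, 2.0]
y = [0.5, 1.0]                  # y^t s = 2.5 > 0 keeps the update posdef
H1 = bfgs_update(H0, s, y)
Hy = [sum(H1[i][j] * y[j] for j in range(2)) for i in range(2)]
# Hy equals s exactly: (I - rho*y*s^t) annihilates y, leaving rho*s*(s^t y) = s
```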
Quasi-Newton
Typically keep about 5 data points in the history.
initialize: set H₀ = I, r₀ = −∇f(x₀), d₀ = r₀, goto 3
1. Compute rₖ = −∇f(xₖ), yₖ = rₖ₋₁ − rₖ
2. Compute dₖ = Hₖrₖ
3. Search step: xₖ₊₁ = xₖ + αₖdₖ (line-search)
Asymptotically identical to CG (with αᵢ = (rᵢᵗdᵢ)/(dᵢᵗ(∇²f)dᵢ))
Armijo line searching has good theoretical properties. Typically used.
Quasi-Newton ideas generalize beyond optimization (e.g. fixed-point iterations).
Summary
All multi-variate optimizations relate to posdef linear solves.
Simple iterative methods require pre-conditioning to be effective in high dimensions.
Line searching strategies are highly variable.
Timing and storage of f, ∇f, ∇²f are all critical in selecting your method.

f          ∇f     concerns   method
fast       fast   2,5        quasi-N (zero-search)
fast       fast   5          CG (zero-search)
fast       slow   1,2,3      derivative-free methods
fast       slow   2,5        quasi-N (min-search)
fast       slow   3,5        CG (min-search)
fast/slow  slow   2,4,5      quasi-N with Armijo
fast/slow  slow   4,5        CG (linearized α)

1 = time, 2 = space, 3 = accuracy, 4 = robust vs. nonlinearity, 5 = precondition
Don't take this table too seriously. . .