Introduction to non-linear optimization

Ross A. Lippert
D. E. Shaw Research
February 25, 2008
Optimization problems
Problem: Let $f : \mathbb{R}^n \to (-\infty, \infty]$; find $\min_{x \in \mathbb{R}^n} f(x)$, i.e. find $x^*$ s.t. $f(x^*) = \min_{x \in \mathbb{R}^n} f(x)$.
Quite general, but some cases, like $f$ convex, are fairly solvable.
Today's problem: how about $f : \mathbb{R}^n \to \mathbb{R}$, smooth?
Find $x^*$ s.t. $\nabla f(x^*) = 0$.
We have a reasonable shot at this if $f$ is twice differentiable.
Two pillars of smooth multivariate optimization
nD optimization rests on two pillars: the linear solve / quadratic optimization, and 1D optimization.
The simplest example we can get
Quadratic optimization: $f(x) = c - x^t b + \frac{1}{2} x^t A x$.
Very common (actually universal, more later).
Finding $\nabla f(x) = 0$:
$\nabla f(x) = A x - b = 0 \quad\Longrightarrow\quad x^* = A^{-1} b$
$A$ has to be invertible (really, $b$ in the range of $A$).
Is this all we need?
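As an illustration (mine, not from the slides), here is a minimal NumPy sketch of this quadratic problem: the optimum comes from one linear solve. The matrix A, vector b, and constant c are made-up example data.

```python
import numpy as np

# Made-up example data: a symmetric positive definite A, plus b and c.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
c = 5.0

def f(x):
    """Quadratic objective f(x) = c - x^t b + 1/2 x^t A x."""
    return c - x @ b + 0.5 * x @ A @ x

# The minimizer satisfies A x = b; solve the system rather than forming A^{-1}.
x_star = np.linalg.solve(A, b)
print("x* =", x_star)
print("gradient at x*:", A @ x_star - b)   # ~ 0
print("f(x*) =", f(x_star))
```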
Max, min, saddle, or what?
Require $A$ to be positive definite. Why?
[Figure: surface plots of the quadratic illustrating minimum, maximum, and saddle behavior.]
Universality of linear algebra in optimization
$f(x) = c - x^t b + \frac{1}{2} x^t A x$
Linear solve: $x^* = A^{-1} b$.
Even for non-linear problems: if the optimal $x^*$ is near our $x$,
$f(x^*) \approx f(x) + (x^* - x)^t \nabla f(x) + \frac{1}{2} (x^* - x)^t \nabla^2 f(x)\, (x^* - x) + \cdots$
$x^* \approx x - (\nabla^2 f(x))^{-1} \nabla f(x)$
Optimization $\leftrightarrow$ linear solve.
Linear solve
$x = A^{-1} b$
But really we just want to solve $A x = b$.
Don't form $A^{-1}$ if you can avoid it.
(Don't form $A$ if you can avoid that!)
For a general $A$, there are three important special cases:
diagonal: $A = \begin{pmatrix} a_1 & 0 & 0 \\ 0 & a_2 & 0 \\ 0 & 0 & a_3 \end{pmatrix}$, thus $x_i = \frac{1}{a_i} b_i$
orthogonal: $A^t A = I$, thus $A^{-1} = A^t$ and $x = A^t b$
triangular: $A = \begin{pmatrix} a_{11} & 0 & 0 \\ a_{21} & a_{22} & 0 \\ a_{31} & a_{32} & a_{33} \end{pmatrix}$, thus $x_i = \frac{1}{a_{ii}} \left( b_i - \sum_{j<i} a_{ij} x_j \right)$
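A minimal sketch (my own, with a made-up matrix) of the triangular special case: forward substitution implements the formula above directly.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L x = b for lower-triangular L: x_i = (b_i - sum_{j<i} L_ij x_j) / L_ii."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

# Made-up lower-triangular example.
L = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [4.0, 1.0, 5.0]])
b = np.array([2.0, 5.0, 14.0])
x = forward_substitution(L, b)
print(x, np.allclose(L @ x, b))
```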
Direct methods
$A$ is symmetric positive definite.
Cholesky factorization: $A = L L^t$, where $L$ is lower triangular. So $x = L^{-t} \left( L^{-1} b \right)$, computed by
$L z = b$: $\quad z_i = \frac{1}{L_{ii}} \left( b_i - \sum_{j<i} L_{ij} z_j \right)$
$L^t x = z$: $\quad x_i = \frac{1}{L_{ii}} \left( z_i - \sum_{j>i} (L^t)_{ij} x_j \right)$
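A hedged sketch of a Cholesky-based solve using SciPy (the example data is made up); `cho_factor`/`cho_solve` perform the factorization and the two triangular solves.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Made-up symmetric positive definite system.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

factor = cho_factor(A)        # A = L L^t, stored compactly
x = cho_solve(factor, b)      # two triangular solves: L z = b, then L^t x = z
print(x, np.allclose(A @ x, b))
```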
Direct methods
$A$ is symmetric positive definite.
Eigenvalue factorization: $A = Q D Q^t$, where $Q$ is orthogonal and $D$ is diagonal. Then
$x = Q \left( D^{-1} \left( Q^t b \right) \right)$.
More expensive than Cholesky.
Direct methods are usually quite expensive ($O(n^3)$ work).
Iterative method basics
What's an iterative method?
Definition (informal): an iterative method is an algorithm $\mathcal{A}$ which takes what you have, $x_i$, and gives you a new $x_{i+1}$ which is less bad, such that $x_1, x_2, x_3, \ldots$ converges to some $x^*$ with badness $= 0$.
A notion of badness could come from
1. distance from $x_i$ to our problem's solution
2. value of some objective function above its minimum
3. size of the gradient at $x_i$
E.g. if $x$ is supposed to satisfy $A x = b$, we could take $\|b - A x\|$ to be the measure of badness.
Iterative method considerations
How expensive is one $x_i \to x_{i+1}$ step?
How quickly does the badness decrease per step?
A thousand and one years of experience yields two cases:
1. $B_i \sim \epsilon^i$ for some $\epsilon \in (0, 1)$ (linear)
2. $B_i \sim \epsilon^{(\rho^i)}$ for $\epsilon \in (0, 1)$, $\rho > 1$ (superlinear)
[Figure: four plots of magnitude vs. iteration on linear axes.]
Can you tell the difference?
Convergence
[Figure: the same convergence curves replotted on semilog axes.]
Now can you tell the difference?
[Figure: further semilog convergence plots.]
When evaluating an iterative method against a manufacturer's claims, be sure to do semilog plots.
Iterative methods
Motivation: directly optimize $f(x) = c - x^t b + \frac{1}{2} x^t A x$.
Gradient descent:
1. Search direction: $r_i = -\nabla f = b - A x_i$
2. Search step: $x_{i+1} = x_i + \alpha_i r_i$
3. Pick alpha: $\alpha_i = \frac{r_i^t r_i}{r_i^t A r_i}$ minimizes $f(x_i + \alpha r_i)$:
$f(x_i + \alpha r_i) = c - x_i^t b + \frac{1}{2} x_i^t A x_i + \alpha\, r_i^t (A x_i - b) + \frac{1}{2} \alpha^2 r_i^t A r_i = f(x_i) - \alpha\, r_i^t r_i + \frac{1}{2} \alpha^2 r_i^t A r_i$
(Cost of a step = 1 $A$-multiply.)
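A minimal sketch (my own, with made-up A and b) of gradient descent on the quadratic, using the exact step size derived above.

```python
import numpy as np

def gradient_descent_quadratic(A, b, x0, n_iter=50):
    """Minimize c - x^t b + 1/2 x^t A x by steepest descent with the exact step size."""
    x = x0.copy()
    for _ in range(n_iter):
        r = b - A @ x                       # residual = -gradient
        if r @ r < 1e-16:
            break
        alpha = (r @ r) / (r @ (A @ r))     # exact minimizer along r
        x = x + alpha * r
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])      # made-up SPD matrix
b = np.array([1.0, 1.0])
x = gradient_descent_quadratic(A, b, np.zeros(2))
print(x, np.linalg.solve(A, b))             # should agree closely
```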
Iterative methods
Optimize $f(x) = c - x^t b + \frac{1}{2} x^t A x$.
Conjugate gradient descent:
1. Search direction: $d_i = r_i + \beta_i d_{i-1}$, with $r_i = b - A x_i$.
2. Pick $\beta_i = -\frac{d_{i-1}^t A r_i}{d_{i-1}^t A d_{i-1}}$: ensures $d_{i-1}^t A d_i = 0$.
3. Search step: $x_{i+1} = x_i + \alpha_i d_i$
4. Pick $\alpha_i = \frac{d_i^t r_i}{d_i^t A d_i}$: minimizes $f(x_i + \alpha d_i)$,
$f(x_i + \alpha d_i) = c - x_i^t b + \frac{1}{2} x_i^t A x_i - \alpha\, d_i^t r_i + \frac{1}{2} \alpha^2 d_i^t A d_i$
(also means that $r_{i+1}^t d_i = 0$).
Avoid the extra $A$-multiply: using $\alpha_{i-1} A d_{i-1} = r_{i-1} - r_i$,
$\beta_i = -\frac{(r_{i-1} - r_i)^t r_i}{(r_{i-1} - r_i)^t d_{i-1}} = -\frac{(r_{i-1} - r_i)^t r_i}{r_{i-1}^t d_{i-1}} = \frac{(r_i - r_{i-1})^t r_i}{r_{i-1}^t r_{i-1}}$
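A hedged sketch of linear conjugate gradient for the same quadratic model (made-up data), using the residual-only beta from the last line above.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Minimize c - x^t b + 1/2 x^t A x (i.e. solve A x = b) by conjugate gradients."""
    x = x0.copy()
    r = b - A @ x
    d = r.copy()
    for _ in range(len(b)):
        if np.linalg.norm(r) < tol:
            break
        Ad = A @ d
        alpha = (d @ r) / (d @ Ad)          # exact line minimization
        x = x + alpha * d
        r_new = r - alpha * Ad              # reuses the same A-multiply
        beta = ((r_new - r) @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])      # made-up SPD matrix
b = np.array([1.0, 1.0])
print(conjugate_gradient(A, b, np.zeros(2)), np.linalg.solve(A, b))
```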
A cute result
Conjugate gradient descent:
1. $r_i = b - A x_i$
2. Search direction: $d_i = r_i + \beta_i d_{i-1}$ ($\beta_i$ s.t. $d_i^t A d_{i-1} = 0$)
3. Search step: $x_{i+1} = x_i + \alpha_i d_i$ ($\alpha_i$ minimizes).
Cute result (not that useful in practice):
Theorem (sub-optimality of CG). (Assuming $x_0 = 0$) at the end of step $k$, the solution $x_k$ is the optimal linear combination of $b, Ab, A^2 b, \ldots, A^k b$ for minimizing $c - b^t x + \frac{1}{2} x^t A x$.
(Computer arithmetic errors make this less than perfect.)
Very little extra effort. Much better convergence.
Slow convergence: Conditioning
The eccentricity of the quadratic is a big factor in convergence
[Figure: contour plot of an eccentric (ill-conditioned) quadratic.]
Convergence and eccentricity
$\kappa = \frac{\max \operatorname{eig}(A)}{\min \operatorname{eig}(A)}$
For gradient descent,
$\|r_i\| \sim \left( \frac{\kappa - 1}{\kappa + 1} \right)^i$
For CG,
$\|r_i\| \sim \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^i$
Useless CG fact: in exact arithmetic, $r_i = 0$ when $i > n$ ($A$ is $n \times n$).
The truth about descent methods
Very slow unless $\kappa$ can be controlled.
How do we control $\kappa$?
$A x = b \quad\Longleftrightarrow\quad (P A P^t) y = P b, \quad x = P^t y$
where $P$ is a pre-conditioner you pick.
How to make $\kappa(P A P^t)$ small?
Perfect answer: $P = L^{-1}$ where $L L^t = A$ (Cholesky factorization).
Imperfect answer: $P \approx L^{-1}$.
Variations on the theme of incomplete factorization:
$P^{-1} = D^{\frac{1}{2}}$ where $D = \operatorname{diag}(a_{11}, \ldots, a_{nn})$
more generally, incomplete Cholesky decomposition
some easy nearby solution or simple approximate $A$ (requiring domain knowledge)
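A minimal sketch (my own, with made-up data) of the diagonal choice $P = D^{-1/2}$; it only illustrates the change of variables above and the drop in condition number.

```python
import numpy as np

def jacobi_preconditioned(A, b):
    """Return (P A P^t, P b, P) for the diagonal preconditioner P = D^{-1/2}."""
    P = np.diag(1.0 / np.sqrt(np.diag(A)))
    return P @ A @ P.T, P @ b, P

# Made-up SPD system with a badly scaled diagonal.
A = np.array([[100.0, 1.0], [1.0, 0.02]])
b = np.array([1.0, 1.0])

M, Pb, P = jacobi_preconditioned(A, b)
print("kappa(A)       =", np.linalg.cond(A))
print("kappa(P A P^t) =", np.linalg.cond(M))

y = np.linalg.solve(M, Pb)   # solve the preconditioned system (or run CG on it)
x = P.T @ y                  # map back: x = P^t y
print(np.allclose(A @ x, b))
```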
Class project?
One idea for a preconditioner is a block diagonal matrix
$P^{-1} = \begin{pmatrix} L_{11} & 0 & 0 \\ 0 & L_{22} & 0 \\ 0 & 0 & L_{33} \end{pmatrix}$
where $L_{ii}^t L_{ii} = A_{ii}$, a diagonal block of $A$.
In what sense does good clustering give good preconditioners?
End of solvers: there are a few other iterative solvers out there I haven't discussed.
Second pillar: 1D optimization
1D optimization gives important insights into non-linearity.
$\min_{s \in \mathbb{R}} f(s)$, $f$ continuous.
A derivative-free option:
A bracket is $(a, b, c)$ s.t. $a < b < c$ and $f(a) > f(b) < f(c)$; then $f(x)$ has a local min for $a < x < c$.
[Figure: a bracketing triple $a$, $b$, $c$ on a 1D function.]
Golden search: based on picking $a < b' < b < c$; either $(a, b', b)$ or $(b', b, c)$ is a new bracket... continue.
Linearly convergent: $e_i \sim G^{-i}$, with $G$ the golden ratio.
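A hedged sketch (my own; the test function is made up) of golden-section search on a bracket. This simple version re-evaluates both interior points each iteration rather than reusing one.

```python
import math

def golden_section_search(f, a, c, tol=1e-8):
    """Shrink the bracket [a, c] around a local minimum using the golden ratio."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0        # 1/G ~ 0.618
    while abs(c - a) > tol:
        b = c - invphi * (c - a)                  # interior point nearer a
        d = a + invphi * (c - a)                  # interior point nearer c
        if f(b) < f(d):
            c = d                                 # minimum lies in [a, d]
        else:
            a = b                                 # minimum lies in [b, c]
    return 0.5 * (a + c)

# Made-up example: the minimizer of (s - 1.3)^2 + 0.5 is s = 1.3.
print(golden_section_search(lambda s: (s - 1.3) ** 2 + 0.5, -2.0, 4.0))
```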
1D optimization
Fundamentally limited accuracy of derivative-free argmin.
[Figure: bracket $a$, $b$, $c$ around a shallow minimum.]
Derivative-based methods, $f'(s) = 0$, for an accurate argmin:
bracketed: $(a, b)$ s.t. $f'(a)$, $f'(b)$ have opposite sign
1. bisection (linearly convergent)
2. modified regula falsi & Brent's method (superlinear)
unbracketed:
1. secant method (superlinear)
2. Newton's method (superlinear; requires another derivative)
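A minimal sketch (my own, with a made-up $f$) of the bracketed, derivative-based route: bisection on $f'$.

```python
def bisect_on_derivative(fprime, a, b, tol=1e-12, max_iter=200):
    """Find s with f'(s) = 0, given f'(a) and f'(b) of opposite sign (linear convergence)."""
    fa = fprime(a)
    for _ in range(max_iter):
        m = 0.5 * (a + b)
        fm = fprime(m)
        if abs(b - a) < tol or fm == 0.0:
            return m
        if fa * fm < 0.0:
            b = m                   # the zero of f' lies in [a, m]
        else:
            a, fa = m, fm           # the zero of f' lies in [m, b]
    return 0.5 * (a + b)

# Made-up example: f(s) = s^4 - 3 s, so f'(s) = 4 s^3 - 3; argmin ~ 0.9086.
print(bisect_on_derivative(lambda s: 4 * s ** 3 - 3, 0.0, 2.0))
```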
From quadratic to non-linear optimizations
What can happen when far from the optimum?
$-\nabla f(x)$ always points in a direction of decrease.
$\nabla^2 f(x)$ may not be positive definite.
For convex problems $\nabla^2 f$ is always positive semi-definite, and for strictly convex it is positive definite.
What do we want?
Find a convex neighborhood of $x^*$ (be robust against mistakes).
Apply a quadratic approximation (do a linear solve).
Fact: for any non-linear optimization algorithm, there is an $f$ which fools it.
Naive Newton's method
Newton's method, finding $x$ s.t. $\nabla f(x) = 0$:
$\Delta x_i = -(\nabla^2 f(x_i))^{-1} \nabla f(x_i)$
$x_{i+1} = x_i + \Delta x_i$
Asymptotic convergence, with $e_i = x_i - x^*$:
$\nabla f(x_i) = \nabla^2 f(x^*)\, e_i + O(\|e_i\|^2)$
$\nabla^2 f(x_i) = \nabla^2 f(x^*) + O(\|e_i\|)$
$e_{i+1} = e_i - (\nabla^2 f_i)^{-1} \nabla f_i = O(\|e_i\|^2)$
Squares the error at every step (exactly eliminates the linear error).
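A hedged sketch (my own) of the naive Newton iteration on a made-up strictly convex test function; no safeguards, as on this slide.

```python
import numpy as np

def newton(grad, hess, x0, n_iter=20):
    """Naive Newton: x_{i+1} = x_i - (hess(x_i))^{-1} grad(x_i), no safeguards."""
    x = x0.copy()
    for _ in range(n_iter):
        x = x - np.linalg.solve(hess(x), grad(x))   # solve, don't invert
    return x

# Made-up example: f(v) = sum_i (exp(v_i) + exp(-v_i)), minimized at v = 0.
grad = lambda v: np.exp(v) - np.exp(-v)
hess = lambda v: np.diag(np.exp(v) + np.exp(-v))
print(newton(grad, hess, np.array([1.0, -2.0])))    # ~ [0, 0]
```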
Naive Newton's method
Sources of trouble:
1. If $\nabla^2 f(x_i)$ is not posdef, $\Delta x_i = x_{i+1} - x_i$ might be in an increasing direction.
2. If $\nabla^2 f(x_i)$ is posdef, $(\nabla f(x_i))^t \Delta x_i < 0$, so $\Delta x_i$ is a direction of decrease (could overshoot).
3. Even if $f$ is convex, $f(x_{i+1}) \le f(x_i)$ is not assured (e.g. $f(x) = 1 + e^{-x} + \log(1 + e^{x})$ starting from $x = 2$).
4. If all goes well, superlinear convergence!
1D example of Newton trouble
1D example of trouble: $f(x) = x^4 - 2x^2 + 12x$
[Figure: plot of $f$ on roughly $[-2, 1.5]$.]
Has one local minimum.
Is not convex (note the concavity near $x = 0$).
1D example of Newton trouble
Derivative of trouble: $f'(x) = 4x^3 - 4x + 12$
[Figure: plot of $f'$ on roughly $[-2, 1.5]$.]
The negative-$f''$ region around $x = 0$ repels the iterates:
$0 \to 3 \to 1.96154 \to 1.14718 \to 0.00658 \to 3.00039 \to 1.96182 \to 1.14743 \to 0.00726 \to 3.00047 \to 1.96188 \to 1.14749 \to \cdots$
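A short sketch (my own) reproducing the cycling iterates: Newton's method for the minimum iterates $x \leftarrow x - f'(x)/f''(x)$, starting here from $x_0 = 0$.

```python
def newton_1d(x, n_iter=12):
    """Newton iterates for minimizing f(x) = x^4 - 2x^2 + 12x (zeroing f')."""
    fp = lambda x: 4 * x ** 3 - 4 * x + 12    # f'
    fpp = lambda x: 12 * x ** 2 - 4           # f''
    xs = [x]
    for _ in range(n_iter):
        x = x - fp(x) / fpp(x)
        xs.append(x)
    return xs

# Starting at 0 the iterates cycle (0, 3, 1.96..., 1.15..., ~0.007, 3.0004, ...)
# instead of reaching the true minimum near x = -1.67.
print([round(v, 5) for v in newton_1d(0.0)])
```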
Non-linear Newton
Try to enforce $f(x_{i+1}) \le f(x_i)$:
$\Delta x_i = -(\lambda I + \nabla^2 f(x_i))^{-1} \nabla f(x_i)$
$x_{i+1} = x_i + \alpha_i \Delta x_i$
Set $\lambda > 0$ to keep $\Delta x_i$ in a direction of decrease (many heuristics).
Pick $\alpha_i > 0$ such that $f(x_i + \alpha_i \Delta x_i) \le f(x_i)$. If $\Delta x_i$ is a direction of decrease, some $\alpha_i$ exists.
1D-minimization: solve the 1D optimization problem $\min_{\alpha_i \in (0, \bar\alpha]} f(x_i + \alpha_i \Delta x_i)$.
Armijo search: use the rule $\alpha_i = s \beta^n$ for the smallest $n$ such that
$f(x_i + \alpha_i \Delta x_i) - f(x_i) \le \sigma\, \alpha_i\, (\Delta x_i)^t \nabla f(x_i)$,
with $s, \beta, \sigma$ fixed (e.g. $s = 2$, $\beta = \sigma = \frac{1}{2}$).
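A minimal sketch (my own) of an Armijo backtracking search of this flavor. The parameter names and values, and the test function, are illustrative, and the direction used below is plain steepest descent standing in for the damped Newton step.

```python
import numpy as np

def armijo_step(f, grad_fx, x, dx, s=2.0, beta=0.5, sigma=0.5, max_tries=50):
    """Return alpha = s * beta^n for the first n meeting the Armijo decrease condition."""
    slope = dx @ grad_fx            # (dx)^t grad f(x); negative for a descent direction
    fx = f(x)
    alpha = s
    for _ in range(max_tries):
        if f(x + alpha * dx) - fx <= sigma * alpha * slope:
            return alpha
        alpha *= beta               # shrink and try again
    return alpha

# Made-up example: f(x) = x1^4 + x2^2, searched along the steepest-descent direction.
f = lambda v: v[0] ** 4 + v[1] ** 2
grad = lambda v: np.array([4 * v[0] ** 3, 2 * v[1]])
x = np.array([2.0, 2.0])
dx = -grad(x)
alpha = armijo_step(f, grad(x), x, dx)
print(alpha, f(x), f(x + alpha * dx))
```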
Line searching
1D-minimization looks like less of a hack than Armijo. For Newton, asymptotic convergence is not strongly affected, and function evaluations can be expensive.
Far from $x^*$, their only value is ensuring decrease.
Near $x^*$, the methods will return $\alpha_i \approx 1$.
If you have a Newton step, accurate line-searching adds little value.
Practicality
Direct (non-iterative, non-structured) solves are expensive!
$\nabla^2 f$ information is often expensive!
Iterative methods
Gradient descent:
1. Search direction: $r_i = -\nabla f(x_i)$
2. Search step: $x_{i+1} = x_i + \alpha_i r_i$
3. Pick alpha (depends on what's cheap):
   1. linearized: $\alpha_i = \frac{r_i^t r_i}{r_i^t (\nabla^2 f)\, r_i}$
   2. minimization: $\min_\alpha f(x_i + \alpha r_i)$ (danger: low quality)
   3. zero-finding: $r_i^t \nabla f(x_i + \alpha r_i) = 0$
Iterative methods
Conjugate gradient descent:
1. Search direction: $d_i = r_i + \beta_i d_{i-1}$, with $r_i = -\nabla f(x_i)$.
2. Pick $\beta_i$ without $\nabla^2 f$:
   1. $\beta_i = \frac{(r_i - r_{i-1})^t r_i}{r_{i-1}^t r_{i-1}}$ (Polak-Ribiere)
   2. can also use $\beta_i = \frac{r_i^t r_i}{r_{i-1}^t r_{i-1}}$ (Fletcher-Reeves)
3. Search step: $x_{i+1} = x_i + \alpha_i d_i$
   1. linearized: $\alpha_i = \frac{d_i^t r_i}{d_i^t (\nabla^2 f)\, d_i}$
   2. 1D minimization: $\min_\alpha f(x_i + \alpha d_i)$ (danger: low quality)
   3. zero-finding: $d_i^t \nabla f(x_i + \alpha d_i) = 0$
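A hedged sketch (my own, on a made-up test function) of non-linear CG with the Polak-Ribiere beta; a crude backtracking step stands in for the 1D minimization, and the beta is clipped at zero (a common safeguard not mentioned on the slide).

```python
import numpy as np

def nonlinear_cg(f, grad, x0, n_iter=100):
    """Non-linear CG: d_i = r_i + beta_i d_{i-1}, r_i = -grad f(x_i), Polak-Ribiere beta."""
    x = x0.copy()
    r = -grad(x)
    d = r.copy()
    for _ in range(n_iter):
        if r @ r < 1e-20:
            break
        alpha, fx = 1.0, f(x)                        # crude backtracking line search
        while f(x + alpha * d) > fx and alpha > 1e-12:
            alpha *= 0.5
        x = x + alpha * d
        r_new = -grad(x)
        beta = max(0.0, ((r_new - r) @ r_new) / (r @ r))
        d = r_new + beta * d
        r = r_new
    return x

# Made-up smooth test function with its minimum at (1, -2).
f = lambda v: (v[0] - 1) ** 2 + 2 * (v[1] + 2) ** 2 + 0.5 * (v[0] - 1) ** 2 * (v[1] + 2) ** 2
grad = lambda v: np.array([2 * (v[0] - 1) + (v[0] - 1) * (v[1] + 2) ** 2,
                           4 * (v[1] + 2) + (v[0] - 1) ** 2 * (v[1] + 2)])
print(nonlinear_cg(f, grad, np.array([3.0, 0.0])))
```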
Don't forget the truth about iterative methods
To get good convergence you must precondition!
$B \approx (\nabla^2 f(x^*))^{-1}$
With the pre-conditioner $P$:
1. Search direction: $d_i = r_i + \beta_i d_{i-1}$, with $r_i = -P^t \nabla f(x_i)$.
2. Pick $\beta_i = \frac{(r_i - r_{i-1})^t r_i}{r_{i-1}^t r_{i-1}}$ (Polak-Ribiere)
3. Search step: $x_{i+1} = x_i + \alpha_i d_i$
4. zero-finding: $d_i^t \nabla f(x_i + \alpha d_i) = 0$
With $B = P P^t$, as a change of metric:
1. Search direction: $d_i = r_i + \beta_i d_{i-1}$, with $r_i = -\nabla f(x_i)$.
2. Pick $\beta_i = \frac{(r_i - r_{i-1})^t B r_i}{r_{i-1}^t B r_{i-1}}$
3. Search step: $x_{i+1} = x_i + \alpha_i d_i$
4. zero-finding: $d_i^t B \nabla f(x_i + \alpha d_i) = 0$
What else?
Remember this cute property?
Theorem (sub-optimality of CG). (Assuming $x_0 = 0$) at the end of step $k$, the solution $x_k$ is the optimal linear combination of $b, Ab, A^2 b, \ldots, A^k b$ for minimizing $c - b^t x + \frac{1}{2} x^t A x$.
In a sense, CG learns about $A$ from the history of $b - A x_i$.
Noting,
1. computer arithmetic errors ruin this nice property quickly
2. non-linearity ruins this property quickly
Quasi-Newton
Quasi-Newton has much popularity/hype. What if we approximate $(\nabla^2 f(x^*))^{-1}$ from the data we have?
$(\nabla^2 f(x^*)) (x_i - x_{k-1}) \approx \nabla f(x_i) - \nabla f(x_{k-1})$
$x_i - x_{k-1} \approx (\nabla^2 f(x^*))^{-1} (\nabla f(x_i) - \nabla f(x_{k-1}))$
over some fixed, finite history.
Data: $y_i = \nabla f(x_i) - \nabla f(x_{k-1})$, $s_i = x_i - x_{k-1}$ with $1 \le i \le k$.
Problem: find symmetric positive definite $H_k$ s.t. $H_k y_i = s_i$.
Multiple solutions, but BFGS works best in most situations.
BFGS update
$H_k = \left( I - \frac{s_k y_k^t}{y_k^t s_k} \right) H_{k-1} \left( I - \frac{y_k s_k^t}{y_k^t s_k} \right) + \frac{s_k s_k^t}{y_k^t s_k}$
Lemma. The BFGS update minimizes $\min_H \|H^{-1} - H_{k-1}^{-1}\|_F^2$ such that $H y_k = s_k$.
Forming $H_k$ is not necessary; e.g. $H_k v$ can be computed recursively.
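A hedged sketch (my own) of one common way to realize the "computed recursively" remark: the L-BFGS two-loop recursion applies $H_k$ to a vector using only the stored $(s_i, y_i)$ pairs, with $H_0 = I$.

```python
import numpy as np

def apply_H(v, s_list, y_list):
    """Compute H_k @ v from (s_i, y_i) pairs via the two-loop recursion, with H_0 = I."""
    q = v.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):           # newest pair first
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q = q - a * y
    r = q                                                          # r = H_0 q
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):   # oldest pair first
        b = (y @ r) / (y @ s)
        r = r + (a - b) * s
    return r                                                       # ~ H_k v

# Quick check against the explicit one-pair BFGS formula (made-up data).
s = np.array([1.0, 0.5]); y = np.array([0.8, 0.3]); v = np.array([0.2, -1.0])
rho = 1.0 / (y @ s)
H = (np.eye(2) - rho * np.outer(s, y)) @ (np.eye(2) - rho * np.outer(y, s)) + rho * np.outer(s, s)
print(np.allclose(H @ v, apply_H(v, [s], [y])))                    # True
```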
Quasi-Newton
Typically keep about 5 data points in the history.
Initialize: set $H_0 = I$, $r_0 = -\nabla f(x_0)$, $d_0 = r_0$; go to 3.
1. Compute $r_k = -\nabla f(x_k)$, $y_k = r_{k-1} - r_k$
2. Compute $d_k = H_k r_k$
3. Search step: $x_{k+1} = x_k + \alpha_k d_k$ (line-search)
Asymptotically identical to CG (with $\alpha_i = \frac{d_i^t r_i}{d_i^t (\nabla^2 f)\, d_i}$).
Armijo line searching has good theoretical properties. Typically used.
Quasi-Newton ideas generalize beyond optimization (e.g. fixed-point iterations).
Summary
All multi-variate optimizations relate to posdef linear solves.
Simple iterative methods require pre-conditioning to be effective in high dimensions.
Line searching strategies are highly variable.
Timing and storage of $f$, $\nabla f$, $\nabla^2 f$ are all critical in selecting your method.

  f          ∇f     concerns   method
  fast       fast   2,5        quasi-N (zero-search)
  fast       fast   5          CG (zero-search)
  fast       slow   1,2,3      derivative-free methods
  fast       slow   2,5        quasi-N (min-search)
  fast       slow   3,5        CG (min-search)
  fast/slow  slow   2,4,5      quasi-N with Armijo
  fast/slow  slow   4,5        CG (linearized alpha)

  Concerns: 1 = time, 2 = space, 3 = accuracy, 4 = robust vs. nonlinearity, 5 = precondition

Don't take this table too seriously...