Gradient Methods
May 2005
Preview
Background
Steepest Descent
Conjugate Gradient
Background
Motivation
The gradient notion
The Wolfe Theorems
Motivation
The min (max) problem:

\min_{x} f(x)

But we learned in calculus how to solve that kind of question!
Motivation
Not exactly.
Functions: f : \mathbb{R}^n \to \mathbb{R}
High-order polynomials, e.g.

x - \frac{1}{6}x^3 + \frac{1}{120}x^5 - \frac{1}{5040}x^7

What about functions that don't have an analytic representation: a "Black Box"?
Motivation - “real world” problem
Connectivity shapes (Isenburg, Gumhold, Gotsman)
mesh = {C = (V, E), geometry}
What do we get from C alone, without the geometry?
Motivation - “real world” problem
First we introduce error functionals and then try to minimize them:

E_s(x \in \mathbb{R}^{n \times 3}) = \sum_{(i,j) \in E} \left( \| x_i - x_j \| - 1 \right)^2

E_r(x \in \mathbb{R}^{n \times 3}) = \sum_{i=1}^{n} \| L(x_i) \|^2

L(x_i) = \frac{1}{d_i} \sum_{(i,j) \in E} \left( x_j - x_i \right)
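To make the functionals concrete, here is a minimal Python/NumPy sketch that evaluates E_s and E_r for a small, made-up edge list. The toy mesh, the vertex positions, and the function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Toy connectivity C = (V, E) and arbitrary positions x in R^{n x 3} (illustrative only).
edges = [(0, 1), (1, 2), (2, 0)]
x = np.array([[0.0, 0.0, 0.0],
              [1.2, 0.0, 0.0],
              [0.5, 0.9, 0.0]])

def e_s(x, edges):
    # E_s(x) = sum over (i,j) in E of (||x_i - x_j|| - 1)^2
    return sum((np.linalg.norm(x[i] - x[j]) - 1.0) ** 2 for i, j in edges)

def e_r(x, edges):
    # E_r(x) = sum_i ||L(x_i)||^2, with L(x_i) = (1/d_i) * sum_{(i,j) in E} (x_j - x_i)
    n = x.shape[0]
    L = np.zeros_like(x)
    deg = np.zeros(n)
    for i, j in edges:
        L[i] += x[j] - x[i]
        L[j] += x[i] - x[j]
        deg[i] += 1
        deg[j] += 1
    L /= deg[:, None]
    return float((L ** 2).sum())

print(e_s(x, edges), e_r(x, edges))
```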
Motivation - “real world” problem
Then we minimize:

E(C, \lambda) = \arg\min_{x \in \mathbb{R}^{n \times 3}} \left[ (1-\lambda)\, E_s(x) + \lambda\, E_r(x) \right]

This is a high-dimensional, non-linear problem.
The authors use the conjugate gradient method, which is perhaps the most popular optimization technique, built on the ideas we will see here.
Motivation - “real world” problem
Changing the parameter \lambda:

E(C, \lambda) = \arg\min_{x \in \mathbb{R}^{n \times 3}} \left[ (1-\lambda)\, E_s(x) + \lambda\, E_r(x) \right]
Motivation
General problem: find the global min (max).
This lecture will concentrate on finding a local minimum.
Background
Motivation
The gradient notion
The Wolfe Theorems
Example surface (3-D plot):

f := (x, y) \mapsto \cos\left(\tfrac{1}{2} x\right) \cos\left(\tfrac{1}{2} y\right) x
Directional Derivatives:
First, the one-dimensional derivative:
Directional Derivatives:
Along the axes…

\frac{\partial f(x,y)}{\partial x}, \qquad \frac{\partial f(x,y)}{\partial y}
Directional Derivatives:
In a general direction…

v \in \mathbb{R}^2, \quad \| v \| = 1

\frac{\partial f(x,y)}{\partial v}
Directional Derivatives

\frac{\partial f(x,y)}{\partial x}, \qquad \frac{\partial f(x,y)}{\partial y}
The Gradient: Definition in \mathbb{R}^2

f : \mathbb{R}^2 \to \mathbb{R}

\nabla f(x,y) := \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)

In the plane: \nabla f(x,y)
The Gradient: Definition

f : \mathbb{R}^n \to \mathbb{R}

\nabla f(x_1, \dots, x_n) := \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)
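As a quick numerical illustration of this definition (not part of the original slides), here is a central finite-difference approximation of the gradient; the test function is an arbitrary choice.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f: R^n -> R at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# Example: f(x, y) = x^2 + 3*y^2, so grad f = (2x, 6y).
f = lambda p: p[0] ** 2 + 3 * p[1] ** 2
print(numerical_gradient(f, [1.0, 2.0]))   # approximately [2.0, 12.0]
```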
The Gradient Properties
The gradient defines the (hyper)plane approximating the function infinitesimally:

\Delta z = \frac{\partial f}{\partial x}\,\Delta x + \frac{\partial f}{\partial y}\,\Delta y
The Gradient Properties
By the chain rule (important for later use): for \| v \| = 1,

\frac{\partial f}{\partial v}(p) = \langle \nabla f|_p , v \rangle
The Gradient Properties
Proposition 1:

\frac{\partial f}{\partial v}(p) is maximal when choosing v = \frac{\nabla f|_p}{\| \nabla f|_p \|},

and minimal when choosing v = -\frac{\nabla f|_p}{\| \nabla f|_p \|}.

(Intuitive: the gradient points in the direction of greatest change.)
The Gradient Properties
Proof (only for the minimum case):
Assign v = -\frac{\nabla f|_p}{\| \nabla f|_p \|}. By the chain rule:

\frac{\partial f}{\partial v}(p) = \left\langle \nabla f|_p , -\frac{\nabla f|_p}{\| \nabla f|_p \|} \right\rangle = -\frac{\| \nabla f|_p \|^2}{\| \nabla f|_p \|} = -\| \nabla f|_p \|
The Gradient Properties
On the other hand, for a general v with \| v \| = 1, the Cauchy–Schwarz inequality gives:

\frac{\partial f}{\partial v}(p) = \langle \nabla f|_p , v \rangle \geq -\| \nabla f|_p \| \, \| v \| = -\| \nabla f|_p \|

so every unit direction satisfies \frac{\partial f}{\partial v}(p) \geq -\| \nabla f|_p \|, and the choice above attains this bound.
The Gradient Properties
Proposition 2: Let f : \mathbb{R}^n \to \mathbb{R} be a smooth C^1 function around p.
If f has a local minimum (maximum) at p, then

\nabla f|_p = 0

(Intuitive: a necessary condition for a local min (max).)
The Gradient Properties
Proof:
Intuitively: (figure)
The Gradient Properties
Formally: for any v \in \mathbb{R}^n \setminus \{0\} we get

0 = \frac{d\, f(p + t v)}{dt}(0) = \langle (\nabla f)|_p , v \rangle

\Rightarrow (\nabla f)|_p = 0
The Gradient Properties
We found the best INFINITESIMAL DIRECTION at each point.
Looking for a minimum: a “blind man” procedure.
How can we derive the way to the minimum using this knowledge?
Background
Motivation
The gradient notion
The Wolfe Theorems
The Wolfe Theorem
This is the link from the previous gradient properties to the constructive algorithm.
The problem:

\min_{x} f(x)
The Wolfe Theorem
We introduce a model algorithm:
Data: x_0 \in \mathbb{R}^n
Step 0: set i = 0
Step 1: if \nabla f(x_i) = 0, stop;
        else, compute a search direction h_i \in \mathbb{R}^n
Step 2: compute the step size
        \lambda_i \in \arg\min_{\lambda \geq 0} f(x_i + \lambda h_i)
Step 3: set x_{i+1} = x_i + \lambda_i h_i and go to Step 1
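A minimal sketch of this model algorithm as a generic loop, assuming the caller supplies the gradient, a search-direction rule, and a step-size rule. The function names and the tolerance-based stopping test are my additions; the slides stop exactly when the gradient vanishes.

```python
import numpy as np

def model_algorithm(grad, direction, step_size, x0, tol=1e-8, max_iter=1000):
    """Generic descent loop: x_{i+1} = x_i + lambda_i * h_i."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:   # Step 1: stop when the gradient (numerically) vanishes
            break
        h = direction(x, g)            # Step 1: search direction h_i
        lam = step_size(x, h)          # Step 2: step size lambda_i
        x = x + lam * h                # Step 3
    return x
```

Steepest descent, introduced next, is the special case direction(x, g) = -g.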
The Wolfe Theorem
The Theorem: Suppose f : \mathbb{R}^n \to \mathbb{R} is C^1 smooth, and there exists a continuous function
k : \mathbb{R}^n \to [0, 1]
such that
\forall x : \nabla f(x) \neq 0 \Rightarrow k(x) > 0,
and the search vectors constructed by the model algorithm satisfy:

\langle \nabla f(x_i), h_i \rangle \leq -k(x_i)\, \| \nabla f(x_i) \| \, \| h_i \|
The Wolfe Theorem
and \nabla f(x_i) \neq 0 \Rightarrow h_i \neq 0.
Then, if \{ x_i \}_{i=0}^{\infty} is the sequence constructed by the model algorithm, any accumulation point y of this sequence satisfies:

\nabla f(y) = 0
The Wolfe Theorem
The theorem has a very intuitive interpretation: always go in a descent direction.
(Figure: the search direction h_i and the gradient \nabla f(x_i).)
Steepest Descent
What does it mean?
We now use what we have learned to implement the most basic minimization technique.
First we introduce the algorithm, which is a version of the model algorithm.
The problem:

\min_{x} f(x)
Steepest Descent
Steepest descent algorithm:
Data: x_0 \in \mathbb{R}^n
Step 0: set i = 0
Step 1: if \nabla f(x_i) = 0, stop;
        else, compute the search direction h_i = -\nabla f(x_i)
Step 2: compute the step size
        \lambda_i \in \arg\min_{\lambda \geq 0} f(x_i + \lambda h_i)
Step 3: set x_{i+1} = x_i + \lambda_i h_i and go to Step 1
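To show the algorithm in action, here is a sketch of steepest descent applied to a quadratic f(x) = ½xᵀAx − bᵀx. The quadratic test problem is an assumption made here so that Step 2 has the closed form λ = rᵀr / rᵀAr with r = −∇f(x); the example matrix is arbitrary.

```python
import numpy as np

def steepest_descent_quadratic(A, b, x0, tol=1e-10, max_iter=10000):
    """Steepest descent for f(x) = 0.5*x^T A x - b^T x, A symmetric positive definite."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = b - A @ x                   # r = -grad f(x): the steepest descent direction
        if np.linalg.norm(r) <= tol:
            break
        lam = (r @ r) / (r @ (A @ r))   # exact minimizer of f(x + lam*r) along the line
        x = x + lam * r
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, 1.0])
print(steepest_descent_quadratic(A, b, np.zeros(2)))   # converges to A^{-1} b
```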
Steepest Descent
From the chain rule, at the exact step size:

\frac{d}{d\lambda} f(x_i + \lambda h_i) = \langle \nabla f(x_i + \lambda h_i), h_i \rangle = 0

so the next gradient is orthogonal to the current direction, and the steepest descent path zig-zags toward the minimum.
Steepest Descent
Steepest descent finds critical points, and in practice local minima.
Implicit step-size rule:
in effect we reduced the problem to finding the minimum of a one-dimensional function

f : \mathbb{R} \to \mathbb{R}

There are extensions that give the step-size rule in a discrete sense (Armijo), as sketched below.
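The slides only mention Armijo by name; here is a common backtracking version of that rule as a sketch. The constants c and rho are conventional choices, and the test function is arbitrary, not taken from the lecture.

```python
import numpy as np

def armijo_step(f, grad, x, h, lam0=1.0, c=1e-4, rho=0.5, max_halvings=50):
    """Backtracking (Armijo) rule: shrink lambda until a sufficient-decrease test holds."""
    g = grad(x)
    slope = g @ h                          # directional derivative along h (should be < 0)
    lam = lam0
    for _ in range(max_halvings):
        if f(x + lam * h) <= f(x) + c * lam * slope:
            return lam
        lam *= rho
    return lam

# Usage sketch with the steepest descent direction h = -grad f(x):
f = lambda p: (p[0] - 1) ** 2 + 10 * p[1] ** 2
grad = lambda p: np.array([2 * (p[0] - 1), 20 * p[1]])
x = np.array([0.0, 1.0])
h = -grad(x)
print(armijo_step(f, grad, x, h))
```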
Steepest Descent
Back to our connectivity shapes: the authors solve the one-dimensional problem analytically.

\lambda_i \in \arg\min_{\lambda \geq 0} f(x_i + \lambda h_i)

They change the spring energy and get a quartic polynomial, so along the line x_i + \lambda h_i the energy is a quartic in \lambda:

E_s(x \in \mathbb{R}^{n \times 3}) = \sum_{(i,j) \in E} \left( \| x_i - x_j \|^2 - 1 \right)^2
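As a sketch of why the modified energy helps: restricted to the line x_i + λh_i, a quartic energy is a quartic polynomial in λ, so the minimizer over λ ≥ 0 is either 0 or a nonnegative real root of the cubic derivative. The coefficients below are an arbitrary example, not the paper's data.

```python
import numpy as np

def minimize_quartic_nonneg(coeffs):
    """Minimize p(lam) over lam >= 0, where p is a quartic with coefficients
    [c4, c3, c2, c1, c0] (highest degree first).

    The minimizer is either lam = 0 or a nonnegative real root of the cubic p'(lam) = 0.
    """
    c4, c3, c2, c1, c0 = coeffs
    p = np.poly1d(coeffs)
    roots = np.roots([4 * c4, 3 * c3, 2 * c2, c1])   # solve p'(lam) = 0
    candidates = [0.0] + [r.real for r in roots if abs(r.imag) < 1e-12 and r.real >= 0]
    return min(candidates, key=p)                    # candidate with the smallest p value

# Illustrative quartic with a positive leading coefficient:
print(minimize_quartic_nonneg([1.0, -2.0, -1.0, 2.0, 0.5]))
```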
Preview
Background
Steepest Descent
Conjugate Gradient
Conjugate Gradient
From now on we assume we want to minimize the quadratic function

f(x) = \frac{1}{2} x^T A x - b^T x + c

(with A symmetric positive definite). This is equivalent to solving the linear problem:

0 = \nabla f(x) = A x - b

There are generalizations to general functions.
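A small numerical check of this equivalence (the matrices are illustrative, assuming A symmetric positive definite): the gradient Ax − b vanishes exactly at the solution of the linear system, which is the minimizer of f.

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])            # symmetric positive definite
b = np.array([1.0, 2.0])
c = 0.0

f = lambda x: 0.5 * x @ A @ x - b @ x + c
grad = lambda x: A @ x - b                        # gradient of the quadratic (A symmetric)

x_star = np.linalg.solve(A, b)                    # solves Ax = b
print(grad(x_star))                               # ~ [0, 0]: the gradient vanishes here
print(f(x_star) <= f(x_star + np.array([0.1, -0.1])))   # True: a nearby point is worse
```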
Conjugate Gradient
What is the problem with steepest descent?
We can repeat the same directions over and
over…
Conjugate gradient takes at most n steps.
Conjugate Gradient
d_0, d_1, \dots, d_j, \dots — search directions, which should span \mathbb{R}^n

x_{i+1} = x_i + \alpha_i d_i

Let \tilde{x} be the exact solution, A \tilde{x} = b, and define the error e_i = x_i - \tilde{x}. Then:

\nabla f(x) = A x - b = A x - A \tilde{x}

\nabla f(x_i) = A (x_i - \tilde{x}) = A e_i

(Figure: x_0, x_1, \tilde{x}, the errors e_0, e_1, and the direction d_0.)
Conjugate Gradient
Given d_i, how do we calculate \alpha_i? (As before:)

d_i^T \nabla f(x_{i+1}) = 0

d_i^T A e_{i+1} = 0

d_i^T A (e_i + \alpha_i d_i) = 0

\alpha_i = -\frac{d_i^T A e_i}{d_i^T A d_i} = -\frac{d_i^T \nabla f(x_i)}{d_i^T A d_i}
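A quick numerical sanity check of this formula (the SPD matrix and the search direction are arbitrary, chosen here only for illustration): after stepping with this α_i, the new gradient is orthogonal to d_i, as derived.

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # symmetric positive definite
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b

x_i = np.array([0.0, 0.0])
d_i = np.array([1.0, 0.5])                # some search direction
alpha_i = -(d_i @ grad(x_i)) / (d_i @ A @ d_i)
x_next = x_i + alpha_i * d_i

print(d_i @ grad(x_next))                 # ~ 0: d_i^T grad f(x_{i+1}) = 0
```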
Conjugate Gradient
How do we find the d_j?
We want the error to be 0 after n steps. Expand the initial error in the (spanning) directions:

e_0 = \sum_{i=0}^{n-1} \delta_i d_i

Since x_{i+1} = x_i + \alpha_i d_i,

e_0 = e_1 - \alpha_0 d_0 = e_2 - \alpha_0 d_0 - \alpha_1 d_1 = \dots = e_j - \sum_{i=0}^{j-1} \alpha_i d_i

e_j = \sum_{i=0}^{n-1} \delta_i d_i + \sum_{i=0}^{j-1} \alpha_i d_i
Conjugate Gradient
Here is an idea: if \alpha_i = -\delta_i, then:

e_j = \sum_{i=0}^{n-1} \delta_i d_i - \sum_{i=0}^{j-1} \delta_i d_i = \sum_{i=j}^{n-1} \delta_i d_i

So if j = n,

e_n = 0
Conjugate Gradient
So we look for d_j such that \alpha_j = -\delta_j.
A simple calculation shows that this holds if we take

d_j^T A d_i = 0 \quad \text{for } i \neq j \qquad (A\text{-conjugate, i.e. } A\text{-orthogonal})
Conjugate Gradient
We have to find an A-conjugate basis

d_j, \quad j = 0, \dots, n-1

We can run a Gram–Schmidt process (see the sketch below), but we should be careful since it is an O(n^3) process: starting from some sequence of vectors u_1, u_2, \dots, u_n,

d_i = u_i + \sum_{k=0}^{i-1} \beta_{i,k} d_k
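A direct (O(n³)) sketch of the A-orthogonal Gram–Schmidt step described above, with an arbitrary starting basis u_1, ..., u_n. This is the naive construction, not the O(m) shortcut discussed next.

```python
import numpy as np

def a_conjugate_basis(A, U):
    """Make the columns of U A-conjugate: d_i = u_i + sum_{k<i} beta_{i,k} d_k."""
    n = U.shape[1]
    D = np.zeros_like(U, dtype=float)
    for i in range(n):
        d = U[:, i].astype(float).copy()
        for k in range(i):
            # beta_{i,k} chosen so that d_i^T A d_k = 0
            beta = -(U[:, i] @ A @ D[:, k]) / (D[:, k] @ A @ D[:, k])
            d += beta * D[:, k]
        D[:, i] = d
    return D

A = np.array([[4.0, 1.0], [1.0, 3.0]])
U = np.eye(2)                              # arbitrary starting vectors
D = a_conjugate_basis(A, U)
print(D[:, 0] @ A @ D[:, 1])               # ~ 0: the directions are A-orthogonal
```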
Conjugate Gradient
So for an arbitrary choice of u_i we gain nothing.
Luckily, we can choose u_i so that the conjugate direction calculation is O(m), where m is the number of non-zero entries of A.
The correct choice of u_i is:

u_i = -\nabla f(x_i)
Conjugate Gradient
So the conjugate gradient algorithm for minimizing f is:
Data: x_0 \in \mathbb{R}^n
Step 0: d_0 = r_0 := -\nabla f(x_0)
Step 1: \alpha_i = \frac{r_i^T r_i}{d_i^T A d_i}, \quad r_i := -\nabla f(x_i)
Step 2: x_{i+1} = x_i + \alpha_i d_i
Step 3: \beta_{i+1} = \frac{r_{i+1}^T r_{i+1}}{r_i^T r_i}
Step 4: d_{i+1} = r_{i+1} + \beta_{i+1} d_i, and repeat n times.
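A runnable sketch of the algorithm above for the quadratic case (Python/NumPy). The tolerance-based early stop is my addition for floating-point safety, and the residual is updated recursively as r_{i+1} = r_i − α_i A d_i, which equals −∇f(x_{i+1}) from the slide; the test matrix is an arbitrary SPD example.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-12):
    """CG for f(x) = 0.5*x^T A x - b^T x, A symmetric positive definite."""
    x = np.asarray(x0, dtype=float)
    r = b - A @ x                            # r_0 = -grad f(x_0)
    d = r.copy()                             # d_0 = r_0
    for _ in range(len(b)):                  # at most n steps in exact arithmetic
        if np.linalg.norm(r) <= tol:
            break
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)           # Step 1
        x = x + alpha * d                    # Step 2
        r_new = r - alpha * Ad               # r_{i+1} = -grad f(x_{i+1})
        beta = (r_new @ r_new) / (r @ r)     # Step 3
        d = r_new + beta * d                 # Step 4
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b, np.zeros(2)))   # agrees with np.linalg.solve(A, b)
```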