
MATH 4211/6211 – Optimization

Gradient Method

Xiaojing Ye
Department of Mathematics & Statistics
Georgia State University



Consider x(k) and compute g(k) := ∇f(x(k)). Set the descent direction to d(k) = −g(k).

Now we want to find α ≥ 0 such that x(k) − αg(k) improves on x(k).

Define φ(α) := f(x(k) − αg(k)); then φ has the Taylor expansion

f(x(k) − αg(k)) = f(x(k)) − α‖g(k)‖² + o(α)

For α sufficiently small, we have

f(x(k) − αg(k)) ≤ f(x(k))



Gradient Descent Method (or Gradient Method):

x(k+1) = x(k) − α_k g(k)


Set an initial guess x(0), and iterate the scheme above to obtain {x(k) : k = 0, 1, . . . }.

• x(k): current estimate;

• g(k) := ∇f(x(k)): gradient at x(k);

• α_k ≥ 0: step size.
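
For concreteness, here is a minimal Python sketch of this scheme with a fixed step size (the function names, tolerance, and iteration cap are illustrative assumptions, not part of the lecture):

```python
import numpy as np

def gradient_method(grad_f, x0, step_size, num_iters=1000, tol=1e-8):
    """Iterate x(k+1) = x(k) - alpha * g(k) with a fixed step size alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = grad_f(x)                  # g(k) = grad f(x(k))
        if np.linalg.norm(g) < tol:    # simple stopping criterion
            break
        x = x - step_size * g          # gradient step
    return x
```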



Steepest Descent Method: choose α_k such that

α_k = arg min_{α≥0} f(x(k) − αg(k))

The steepest descent method is an exact line search method.

We will first discuss some properties of the steepest descent method, and then consider other (inexact) line search methods.



Proposition. Let {x(k)} be obtained by the steepest descent method. Then

(x(k+2) − x(k+1))ᵀ(x(k+1) − x(k)) = 0

Proof. Define φ(α) := f(x(k) − αg(k)). Since α_k = arg min φ(α), we have

0 = φ′(α_k) = −∇f(x(k) − α_k g(k))ᵀg(k) = −g(k+1)ᵀg(k),

so g(k+1)ᵀg(k) = 0. On the other hand, we have

x(k+2) = x(k+1) − α_{k+1} g(k+1)

x(k+1) = x(k) − α_k g(k)

Therefore, we have

(x(k+2) − x(k+1))ᵀ(x(k+1) − x(k)) = α_{k+1} α_k g(k+1)ᵀg(k) = 0.



Proposition. Let {x(k)} be obtained by the steepest descent method and g(k) ≠ 0. Then f(x(k+1)) < f(x(k)).

Proof. Define φ(α) := f(x(k) − αg(k)). Then

φ′(0) = −∇f(x(k) − 0·g(k))ᵀg(k) = −‖g(k)‖² < 0.

Since φ′(0) < 0, we have φ(α) < φ(0) for all sufficiently small α > 0. As α_k is a minimizer of φ over α ≥ 0, it follows that

f(x(k+1)) = φ(α_k) < φ(0) = f(x(k)).



Stopping Criterion.

For a prescribed ε > 0, terminate the iteration if one of the following is met:

• ‖g(k)‖ < ε;

• |f(x(k+1)) − f(x(k))| < ε;

• ‖x(k+1) − x(k)‖ < ε.

More preferable are the choices using “relative change”:

• |f(x(k+1)) − f(x(k))|/|f(x(k))| < ε;

• ‖x(k+1) − x(k)‖/‖x(k)‖ < ε.
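
These tests are straightforward to code; a minimal sketch of the checks follows (the helper name, the default ε, and the assumption that the denominators are nonzero are illustrative choices, not part of the lecture):

```python
import numpy as np

def should_stop(x_prev, x_new, f_prev, f_new, g_new, eps=1e-6):
    """Illustrative stopping tests; assumes f_prev != 0 and x_prev != 0."""
    small_grad = np.linalg.norm(g_new) < eps
    rel_f = abs(f_new - f_prev) < eps * abs(f_prev)
    rel_x = np.linalg.norm(x_new - x_prev) < eps * np.linalg.norm(x_prev)
    return small_grad or rel_f or rel_x
```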



Example. Use the steepest descent method for 3 iterations on

f(x₁, x₂, x₃) = (x₁ − 4)⁴ + (x₂ − 3)² + 4(x₃ + 5)⁴

with initial point x(0) = [4, 2, −1]ᵀ.

Solution. We will repeatedly use the gradient, so let's compute it first:

∇f(x) = [4(x₁ − 4)³, 2(x₂ − 3), 16(x₃ + 5)³]ᵀ

We keep in mind that x∗ = [4, 3, −5]ᵀ.



In the 1st iteration:

• Current iterate: x(0) = [4, 2, −1]ᵀ;

• Current gradient: g(0) = ∇f(x(0)) = [0, −2, 1024]ᵀ;

• Find step size:

α₀ = arg min_{α≥0} f(x(0) − αg(0)) = arg min_{α≥0} [ 0 + (2 + 2α − 3)² + 4(−1 − 1024α + 5)⁴ ]

and use the secant method to get α₀ = 3.967 × 10⁻³.

• Next iterate: x(1) = x(0) − α₀ g(0) = · · · = [4.000, 2.008, −5.062]ᵀ.



[Figure: plot of the line search objective f(x(0) − αg(0)) for α ∈ [0, 0.010]; its values are on the order of 10³ and the minimizer is near α₀ ≈ 3.967 × 10⁻³.]
In the 2nd iteration:

• Current iterate: x(1) = [4.000, 2.008, −5.062]ᵀ;

• Current gradient: g(1) = ∇f(x(1)) = [0.001, −1.984, −0.003875]ᵀ;

• Find step size:

α₁ = arg min_{α≥0} f(x(1) − αg(1)) = arg min_{α≥0} [ 0 + (2.008 + 1.984α − 3)² + 4(−5.062 + 0.003875α + 5)⁴ ]

and use the secant method to get α₁ = 0.500.

• Next iterate: x(2) = x(1) − α₁ g(1) = · · · = [4.000, 3.000, −5.060]ᵀ.



[Figure: plot of the line search objective f(x(1) − αg(1)) for α ∈ [0, 1.0]; the minimizer is near α₁ ≈ 0.500.]


In the 3rd iteration:

• Current iterate: x(2) = [4.000, 3.000, −5.060]ᵀ;

• Current gradient: g(2) = ∇f(x(2)) = [0.000, 0.000, −0.003525]ᵀ;

• Find step size:

α₂ = arg min_{α≥0} f(x(2) − αg(2)) = arg min_{α≥0} [ 0 + 0 + 4(−5.060 + 0.003525α + 5)⁴ ]

and use the secant method to get α₂ = 16.29.

• Next iterate: x(3) = x(2) − α₂ g(2) = · · · = [4.000, 3.000, −5.002]ᵀ.



[Figure: plot of the line search objective f(x(2) − αg(2)) for α ∈ [10, 20]; its values are on the order of 10⁻⁶ and the minimizer is near α₂ ≈ 16.29.]
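
The three iterations above can be reproduced with a short script. This is a sketch only: it uses SciPy's bounded scalar minimizer for the exact line search instead of the secant method used in the slides, and the search interval (0, 20) is an assumption chosen to contain the step sizes reported above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def f(x):
    return (x[0] - 4)**4 + (x[1] - 3)**2 + 4*(x[2] + 5)**4

def grad_f(x):
    return np.array([4*(x[0] - 4)**3, 2*(x[1] - 3), 16*(x[2] + 5)**3])

x = np.array([4.0, 2.0, -1.0])                      # x(0)
for k in range(3):
    g = grad_f(x)
    phi = lambda a: f(x - a * g)                    # line search objective
    alpha = minimize_scalar(phi, bounds=(0.0, 20.0), method="bounded").x
    x = x - alpha * g
    print(k, alpha, x)                              # alpha_k and x(k+1)
```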


A quadratic function f of x can be written as

f(x) = xᵀAx − bᵀx

where A is not necessarily symmetric.

Note that xᵀAx = xᵀAᵀx, and hence xᵀAx = (1/2) xᵀ(A + Aᵀ)x, where A + Aᵀ is symmetric.

Therefore, a quadratic function can always be rewritten as

f(x) = (1/2) xᵀQx − bᵀx

where Q is symmetric. In this case, the gradient and Hessian are:

∇f(x) = Qx − b and ∇²f(x) = Q.
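
A quick numerical check of the symmetrization identity (illustrative, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))     # not necessarily symmetric
x = rng.standard_normal(3)
Q = A + A.T                         # symmetric Q with x^T A x = (1/2) x^T Q x
print(np.allclose(x @ A @ x, 0.5 * x @ Q @ x))   # True
```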



Now let’s see what happens when we apply the steepest descent method to a
quadratic function f :
1 >
f (x) = x Qx − b>x
2
where Q  0.

At k-th iteration, we have x(k) and g (k) = ∇f (x(k)) = Qx(k) − b.

Then we need to find the step size αk = arg minα φ(α) where
1 (k)
(x − αg (k) )> Q(x(k) − αg (k) ) − b> (x(k) − αg (k) )
φ(α) := f (x(k) − αg (k) ) =
2
Solving φ0(α) = −(x(k) − αg (k))>Qg (k) + b>g (k) = 0, we obtain
>
(Qx(k) − b)>g (k) g (k) g (k)
αk = =
g (k) > Qg (k)
>
g (k) Qg (k)



Therefore, the steepest descent method applied to f(x) = (1/2) xᵀQx − bᵀx with Q ≻ 0 yields

x(k+1) = x(k) − ( g(k)ᵀg(k) / (g(k)ᵀQg(k)) ) g(k)
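
A minimal code sketch of this iteration (the function name and the iteration count are illustrative assumptions):

```python
import numpy as np

def sd_quadratic(Q, b, x0, num_iters=50):
    """Steepest descent for f(x) = 0.5 x^T Q x - b^T x with the closed-form step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = Q @ x - b                   # g(k) = Q x(k) - b
        denom = g @ Q @ g
        if denom == 0.0:                # g(k) = 0: already at the minimizer
            break
        alpha = (g @ g) / denom         # alpha_k = g^T g / (g^T Q g)
        x = x - alpha * g
    return x
```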



Several concepts about algorithms and convergence:

• Iterative algorithm: an algorithm that generates a sequence x(0), x(1), x(2), . . . , each based on the points preceding it.

• Descent method: a method/algorithm such that f(x(k+1)) ≤ f(x(k)).

• Globally convergent: an algorithm that generates a sequence x(k) → x∗ starting from ANY x(0).

• Locally convergent: an algorithm that generates a sequence x(k) → x∗ if x(0) is sufficiently close to x∗.

• Rate of convergence: how fast the convergence is (more on this later).



Now we come back to the convergence of the steepest descent method applied to the quadratic function f(x) = (1/2) xᵀQx − bᵀx where Q ≻ 0.

Since ∇²f(x) = Q ≻ 0, f is strictly convex and has a unique minimizer, denoted by x∗.

By the FONC, ∇f(x∗) = Qx∗ − b = 0, i.e., Qx∗ = b.



To examine the convergence, we consider

V(x) := f(x) + (1/2) x∗ᵀQx∗ = · · · = (1/2)(x − x∗)ᵀQ(x − x∗)

(show this as an exercise).

Since Q ≻ 0, we have V(x) = 0 iff x = x∗.



Lemma. Let {x(k)} be generated by the steepest descent method. Then

V(x(k+1)) = (1 − γ_k) V(x(k))

where γ_k = 0 if ‖g(k)‖ = 0, and

γ_k = α_k · ( g(k)ᵀQg(k) / (g(k)ᵀQ⁻¹g(k)) ) · ( 2 g(k)ᵀg(k) / (g(k)ᵀQg(k)) − α_k ) if ‖g(k)‖ ≠ 0.



Proof. If ‖g(k)‖ = 0, then x(k+1) = x(k) and V(x(k+1)) = V(x(k)). Hence γ_k = 0.

If ‖g(k)‖ ≠ 0, then

V(x(k+1)) = (1/2)(x(k+1) − x∗)ᵀQ(x(k+1) − x∗)
= (1/2)(x(k) − x∗ − α_k g(k))ᵀQ(x(k) − x∗ − α_k g(k))
= V(x(k)) − α_k g(k)ᵀQ(x(k) − x∗) + (1/2) α_k² g(k)ᵀQg(k)

Therefore

( V(x(k)) − V(x(k+1)) ) / V(x(k)) = ( α_k g(k)ᵀQ(x(k) − x∗) − (1/2) α_k² g(k)ᵀQg(k) ) / V(x(k))



Note that:

Q(x(k) − x∗) = Qx(k) − b = ∇f(x(k)) = g(k)

x(k) − x∗ = Q⁻¹g(k)

V(x(k)) = (1/2)(x(k) − x∗)ᵀQ(x(k) − x∗) = (1/2) g(k)ᵀQ⁻¹g(k)

Then we obtain

( V(x(k)) − V(x(k+1)) ) / V(x(k)) = ( α_k a − (1/2) α_k² b ) / ( (1/2) c ) = α_k (b/c) ( 2a/b − α_k )

where

a := g(k)ᵀg(k), b := g(k)ᵀQg(k), c := g(k)ᵀQ⁻¹g(k)



Now we have obtained V(x(k+1)) = (1 − γ_k)V(x(k)), from which we have

V(x(k)) = [ ∏_{i=0}^{k−1} (1 − γ_i) ] V(x(0))

Since x(0) is given and fixed, we can see

V(x(k)) → 0 ⟺ ∏_{i=0}^{k−1} (1 − γ_i) → 0 ⟺ ∑_{i=0}^{k−1} −log(1 − γ_i) → +∞ ⟺ ∑_{i=0}^{k−1} γ_i → +∞



We summarize the result below:

Theorem. Let {x(k)} be generated by the gradient algorithm for a quadratic function f(x) = (1/2) xᵀQx − bᵀx (where Q ≻ 0) with step sizes α_k. Then {x(k)} converges, i.e., x(k) → x∗, iff ∑_{k=0}^{∞} γ_k = +∞.

Proof. (Sketch) Use the inequalities

γ ≤ − log(1 − γ) ≤ 2γ
which hold for γ ≥ 0 close to 0. Then use the squeeze theorem.



Rayleigh’s inequality: given a symmetric Q  0, there is

λmin(Q)kxk2 ≤ x>Qx =: kxk2


Q ≤ λmax ( Q )kx k2

for any x.

Here λmin(Q) (λmax(Q)) are the minimum (maximum) eigenvalue of Q.

In addition, we can get the min/max eigenvalues of Q−1:


1 1
λmin(Q−1) = and λmax(Q−1) =
λmax(Q) λmin(Q)



Lemma. If Q ≻ 0, then for any x, we have

λ_min(Q)/λ_max(Q) ≤ ‖x‖⁴ / ( ‖x‖_Q² ‖x‖_{Q⁻¹}² ) ≤ λ_max(Q)/λ_min(Q)

Proof. By Rayleigh's inequality, we have

λ_min(Q)‖x‖² ≤ ‖x‖_Q² ≤ λ_max(Q)‖x‖² and ‖x‖²/λ_max(Q) ≤ ‖x‖_{Q⁻¹}² ≤ ‖x‖²/λ_min(Q)

These imply

1/λ_max(Q) ≤ ‖x‖²/‖x‖_Q² ≤ 1/λ_min(Q) and λ_min(Q) ≤ ‖x‖²/‖x‖_{Q⁻¹}² ≤ λ_max(Q)

Multiplying the two yields the claim.



We can show that the steepest descent method has α_k set to satisfy ∑_k γ_k = +∞:

First recall that α_k = g(k)ᵀg(k) / (g(k)ᵀQg(k)).

Then

γ_k = α_k · ( g(k)ᵀQg(k) / (g(k)ᵀQ⁻¹g(k)) ) · ( 2 g(k)ᵀg(k) / (g(k)ᵀQg(k)) − α_k )
= (g(k)ᵀg(k))² / ( g(k)ᵀQg(k) · g(k)ᵀQ⁻¹g(k) )
= ‖g(k)‖⁴ / ( ‖g(k)‖_Q² ‖g(k)‖_{Q⁻¹}² )
≥ λ_min(Q)/λ_max(Q) > 0

Therefore ∑_k γ_k = +∞.



Now let’s consider the gradient method with fixed step size α > 0:

Theorem. If the step size α > 0 is fixed, then the gradient method converges
if and only if
2
0<α<
λmax(Q)

Proof. “⇐” Suppose 0 < α < λ 2(Q) , then


max

kg (k)k2
Q

kg (k) k2 
γk = α 2
2 2
−α
(k)
kg kQ−1 (k)
kg kQ
λmin(Q)kg (k)k2 2
 
≥α −α
−1 (k) 2
λmax(Q )kg k λmax(Q)
2
 
= αλ2
min (Q) −α >0
λmax(Q)
Therefore k γk = ∞ and hence GM converges.
P



“⇒” Suppose the GM converges but α ≤ 0 or α ≥ 2/λ_max(Q). If x(0) is chosen such that x(0) − x∗ is an eigenvector corresponding to the eigenvalue λ_max(Q) of Q, then we have

x(k+1) − x∗ = x(k) − αg(k) − x∗
= x(k) − α(Qx(k) − b) − x∗
= x(k) − α(Qx(k) − Qx∗) − x∗
= (I − αQ)(x(k) − x∗)
= (I − αQ)^(k+1) (x(0) − x∗)
= (1 − αλ_max(Q))^(k+1) (x(0) − x∗)

Taking norms on both sides yields

‖x(k+1) − x∗‖ = |1 − αλ_max(Q)|^(k+1) ‖x(0) − x∗‖

where |1 − αλ_max(Q)| ≥ 1 if α ≤ 0 or α ≥ 2/λ_max(Q). Contradiction.
max



Example. Find an appropriate α for the GM with fixed step size α for

f(x) = xᵀ [[4, 2√2], [0, 5]] x + xᵀ [3, 6]ᵀ + 24

Solution. First rewrite f into the standard quadratic form with symmetric Q:

f(x) = (1/2) xᵀ [[8, 2√2], [2√2, 10]] x + xᵀ [3, 6]ᵀ + 24

Then we compute the eigenvalues of Q = [[8, 2√2], [2√2, 10]]:

|λI − Q| = det [[λ − 8, −2√2], [−2√2, λ − 10]] = (λ − 8)(λ − 10) − 8 = (λ − 6)(λ − 12)

Hence λ_max(Q) = 12, and the range of α should be (0, 2/12), i.e., (0, 1/6).
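
A quick numerical check of this eigenvalue computation (illustrative, not part of the slides):

```python
import numpy as np

Q = np.array([[8.0, 2*np.sqrt(2)],
              [2*np.sqrt(2), 10.0]])
lam_max = np.linalg.eigvalsh(Q).max()        # largest eigenvalue of the symmetric Q
print(lam_max)                               # 12.0
print("alpha must lie in (0, %.4f)" % (2.0 / lam_max))
```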



Convergence rate of the steepest descent method:

Recall that applying SD to f(x) = (1/2) xᵀQx − bᵀx with Q ≻ 0 yields

V(x(k+1)) ≤ (1 − κ) V(x(k))

where V(x) := (1/2)(x − x∗)ᵀQ(x − x∗) and κ = λ_min(Q)/λ_max(Q).

Remark. λ_max(Q)/λ_min(Q) = ‖Q‖‖Q⁻¹‖ is called the condition number of Q.



Order of convergence

We say x(k) → x∗ with order p if

0 < lim_{k→∞} ‖x(k+1) − x∗‖ / ‖x(k) − x∗‖^p < ∞

It can be shown that p ≥ 1, and the larger p is, the faster the convergence is.



Example.

• x(k) = 1/k → 0. Then

|x(k+1)| / |x(k)|^p = k^p/(k + 1) < ∞

if p ≤ 1. Therefore x(k) → 0 with order 1.

• x(k) = q^k → 0 for some q ∈ (0, 1). Then

|x(k+1)| / |x(k)|^p = q^(k+1)/q^(kp) = q^(k(1−p)+1) < ∞

if p ≤ 1. Therefore x(k) → 0 with order 1.



Example.

• x(k) = q^(2^k) → 0 for some q ∈ (0, 1). Then

|x(k+1)| / |x(k)|^p = q^(2^(k+1)) / q^(p·2^k) = q^(2^k (2−p)) < ∞

if p ≤ 2. Therefore x(k) → 0 with order 2.
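
The order can also be estimated numerically from consecutive errors e_k = |x(k) − x∗| via the standard heuristic p ≈ log(e_{k+1}/e_k) / log(e_k/e_{k−1}). A small sketch for the sequence above (the value q = 0.5 and the number of terms are illustrative assumptions):

```python
import numpy as np

q = 0.5
e = [q**(2**k) for k in range(1, 6)]      # e_k = q^(2^k), expected order p = 2
for k in range(1, len(e) - 1):
    p_est = np.log(e[k+1] / e[k]) / np.log(e[k] / e[k-1])
    print(p_est)                           # prints 2.0 for each k
```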



In general, we have the following result:

Theorem. If ‖x(k+1) − x∗‖ = O(‖x(k) − x∗‖^p), then the convergence is of order at least p.

Remark. Note that p ≥ 1.



Descent method and line search

Given a descent direction d(k) of f : Rⁿ → R at x(k) (e.g., d(k) = −g(k)), we need to decide the step size α_k in order to compute

x(k+1) = x(k) + α_k d(k).

Exact line search computes α_k by solving

α_k = arg min_α φ_k(α), where φ_k(α) := f(x(k) + αd(k)).

Notice that φ_k : R₊ → R and φ′_k(α) = ∇f(x(k) + αd(k))ᵀd(k). Hence we can use the secant method:

α^(l+1) = α^(l) − [ (α^(l) − α^(l−1)) / (φ′_k(α^(l)) − φ′_k(α^(l−1))) ] φ′_k(α^(l))

with initial guesses α^(0), α^(1), and set α_k to lim_{l→∞} α^(l).
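
A sketch of this secant iteration applied to φ′_k (the helper name, initial guesses, and tolerances are illustrative assumptions, not part of the lecture):

```python
import numpy as np

def secant_line_search(dphi, a0=0.0, a1=1.0, num_iters=20, tol=1e-10):
    """Approximately solve dphi(alpha) = 0, where dphi(a) = phi_k'(a)."""
    for _ in range(num_iters):
        d0, d1 = dphi(a0), dphi(a1)
        if abs(d1 - d0) < tol:                 # avoid dividing by (near) zero
            break
        a0, a1 = a1, a1 - (a1 - a0) / (d1 - d0) * d1
        if abs(a1 - a0) < tol:
            break
    return a1

# Usage sketch: alpha_k = secant_line_search(lambda a: grad_f(x + a*d) @ d)
```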



In practice, it is not computationally economical to use exact line search.

Instead, we prefer inexact line search. That is, we do not solve

α_k = arg min_α φ_k(α), where φ_k(α) := f(x(k) + αd(k)),

exactly, but only require α_k to satisfy certain conditions such that the resulting choice:

• is easy to compute in practice;

• guarantees convergence;

• performs well in practice.



There are several commonly used conditions for α_k:

• Armijo condition: let ε ∈ (0, 1), γ > 1 and

φ_k(α_k) ≤ φ_k(0) + ε α_k φ′_k(0) (so α_k not too large)

φ_k(γα_k) ≥ φ_k(0) + ε γ α_k φ′_k(0) (so α_k not too small)

• Armijo-Goldstein condition: let 0 < ε < η < 1 and

φ_k(α_k) ≤ φ_k(0) + ε α_k φ′_k(0) (so α_k not too large)

φ_k(α_k) ≥ φ_k(0) + η α_k φ′_k(0) (so φ_k(α_k) not too small)

• Wolfe condition: let 0 < ε < η < 1 and

φ_k(α_k) ≤ φ_k(0) + ε α_k φ′_k(0) (so α_k not too large)

φ′_k(α_k) ≥ η φ′_k(0) (so φ_k not too steep at α_k)

The strong Wolfe condition replaces the second condition with |φ′_k(α_k)| ≤ η|φ′_k(0)|.
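
For instance, the Wolfe conditions above can be checked with a small helper like the following (a sketch; the defaults ε = 10⁻⁴ and η = 0.9 are common choices and are assumptions here, not values from the slides):

```python
def wolfe_satisfied(phi, dphi, alpha, eps=1e-4, eta=0.9):
    """Check the Wolfe conditions, with phi(a) = f(x + a*d) and dphi(a) = grad f(x + a*d) @ d."""
    sufficient_decrease = phi(alpha) <= phi(0) + eps * alpha * dphi(0)
    curvature = dphi(alpha) >= eta * dphi(0)
    return sufficient_decrease and curvature
```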



Backtracking line search

In practice, we often use the following backtracking line search:

Backtracking: choose an initial guess α^(0) and τ ∈ (0, 1) (e.g., τ = 0.5), then set α = α^(0) and repeat:

1. Check whether φ_k(α) ≤ φ_k(0) + ε α φ′_k(0) (first Armijo condition). If yes, then terminate.

2. Shrink α to τα.

In other words, we find the smallest integer m ∈ ℕ₀ such that α_k = τ^m α^(0) satisfies the first Armijo condition φ_k(α_k) ≤ φ_k(0) + ε α_k φ′_k(0).
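
A minimal sketch of this procedure, assuming d is a descent direction so that the loop terminates (the defaults α^(0) = 1, τ = 0.5, ε = 10⁻⁴ are illustrative):

```python
import numpy as np

def backtracking(f, grad_f, x, d, alpha0=1.0, tau=0.5, eps=1e-4):
    """Shrink alpha until the first Armijo condition holds."""
    fx = f(x)
    slope = grad_f(x) @ d            # phi_k'(0) = grad f(x)^T d, negative for a descent direction
    alpha = alpha0
    while f(x + alpha * d) > fx + eps * alpha * slope:
        alpha *= tau                 # shrink alpha to tau * alpha
    return alpha
```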



Why does line search guarantee convergence?

First, note that here by convergence we mean ‖∇f(x(k))‖ → 0.

We take the Wolfe conditions and d(k) = −g(k) for simplicity. Assume ∇f is L-Lipschitz continuous. Now

x(k+1) = x(k) − α_k g(k)

φ_k(α_k) = f(x(k+1))

φ′_k(α_k) = −∇f(x(k+1))ᵀg(k)

φ_k(0) = f(x(k))

φ′_k(0) = −∇f(x(k))ᵀg(k)

Moreover, L-Lipschitz continuity of ∇f implies

±⟨∇f(x) − ∇f(y), x − y⟩ ≤ ‖∇f(x) − ∇f(y)‖‖x − y‖ ≤ L‖x − y‖²

for any x, y.



Claim. α_k ≥ (1 − η)/L.

Proof of Claim. The second Wolfe condition φ′_k(α_k) ≥ η φ′_k(0) implies φ′_k(α_k) − φ′_k(0) ≥ (η − 1)φ′_k(0), which is

−⟨∇f(x(k+1)) − ∇f(x(k)), g(k)⟩ ≥ (1 − η)‖g(k)‖².

Noting that g(k) = −(x(k+1) − x(k))/α_k, we know

−⟨∇f(x(k+1)) − ∇f(x(k)), g(k)⟩ ≤ (L/α_k)‖x(k+1) − x(k)‖² = L α_k ‖g(k)‖²
Combining the two inequalities above yields the claim.



The first Wolfe condition (Armijo condition) implies

f(x(k+1)) ≤ f(x(k)) − ε α_k ‖g(k)‖² ≤ f(x(k)) − (ε(1 − η)/L) ‖g(k)‖².

Taking the telescoping sum yields

f(x(K)) ≤ f(x(0)) − (ε(1 − η)/L) ∑_{k=0}^{K−1} ‖g(k)‖²,

which implies

(ε(1 − η)/L) ∑_{k=0}^{K−1} ‖g(k)‖² ≤ f(x(0)) − f(x(K)) < ∞

for any K (we assume f is bounded below). Notice that ε(1 − η)/L > 0.

Therefore ‖g(k)‖ = ‖∇f(x(k))‖ → 0.

