Unconstrained Optimization
Gilles Gasso
1 Formulation
2 Optimality conditions
3 Descent algorithms
Main methods of descent
Line search
Summary
Problem formulation
(P)   min_{θ ∈ R^d} J(θ)
Examples

1. Quadratic function J(θ) = (1/2) θ^T Pθ + q^T θ + r, with P a positive definite matrix

2. J(θ) = cos(θ1 − θ2) + sin(θ1 + θ2) + θ1/4

[Figure: surface and contour plots of the two objective functions.]
Different solutions
Global solution
θ ∗ is said to be the global minimum solution of the problem if
J(θ ∗ ) ≤ J(θ), ∀θ ∈ domJ
Local solution
θ̂ is a local minimum solution of problem (P) if there exists ε > 0 such that
J(θ̂) ≤ J(θ), ∀θ ∈ domJ such that ‖θ̂ − θ‖ ≤ ε
[Figure: contour plot of J(θ) = cos(θ1 − θ2) + sin(θ1 + θ2) + θ1/4, showing the global minimum θ* and a local minimum.]
Optimality conditions
Vocabulary
Any vector θ 0 that verifies ∇J(θ 0 ) = 0 is called a stationary point or critical
point
∇J(θ) ∈ Rd is the gradient vector of J at θ.
The gradient is the unique vector such that the directional derivative can be
written as:
lim_{t→0} [J(θ + th) − J(θ)] / t = ∇J(θ)^T h,   h ∈ R^d, t ∈ R
Gilles Gasso Descent methods 7 / 27
Example: J(θ) = θ1^4 + θ2^4 − 4θ1θ2

Gradient: ∇J(θ) = (4θ1^3 − 4θ2, −4θ1 + 4θ2^3)^T

Stationary points verify ∇J(θ) = 0. Three solutions:
θ(1) = (0, 0)^T, θ(2) = (1, 1)^T and θ(3) = (−1, −1)^T

[Figure: contour plot of J with the three stationary points.]
Remarks
θ(2) and θ(3) are local minima, but θ(1) is not
Not every stationary point is a local extremum
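As a quick numerical check, the gradient above can be evaluated at the three candidate points (a minimal sketch, assuming NumPy is available):

```python
import numpy as np

# Gradient of J(theta) = theta1^4 + theta2^4 - 4*theta1*theta2
def grad_J(theta):
    t1, t2 = theta
    return np.array([4 * t1**3 - 4 * t2, -4 * t1 + 4 * t2**3])

# The gradient vanishes at each of the three stationary points
for point in [(0.0, 0.0), (1.0, 1.0), (-1.0, -1.0)]:
    print(point, grad_J(np.array(point)))  # -> gradient (0, 0) at each point
```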
Hessian matrix
Twice differentiable function
J : R^d → R is said to be twice differentiable on its domain domJ if, at
every point θ ∈ domJ, there exists a unique symmetric matrix
H(θ) ∈ R^{d×d}, called the Hessian matrix, such that
J(θ + h) = J(θ) + ∇J(θ)^T h + (1/2) h^T H(θ)h + ‖h‖^2 ε(h),
where ε(h) is a continuous function at 0 with lim_{h→0} ε(h) = 0
Examples

Example 1
Objective function: J(θ) = θ1^4 + θ2^4 − 4θ1θ2
Gradient: ∇J(θ) = (4θ1^3 − 4θ2, −4θ1 + 4θ2^3)^T
Hessian matrix: H(θ) = [ 12θ1^2   −4 ; −4   12θ2^2 ]

Example 2
Quadratic objective function: J(θ) = (1/2) θ^T Pθ + q^T θ + r
Directional derivative: D(h, θ) = lim_{t→0} [J(θ + th) − J(θ)] / t = (Pθ + q)^T h
Gradient: ∇J(θ) = Pθ + q
Hessian matrix: H(θ) = P
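The closed-form gradient of the quadratic case can be verified against central finite differences (an illustrative sketch; the matrix P built below is an arbitrary positive definite example, and r = 0 for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = rng.standard_normal((d, d))
P = A @ A.T + d * np.eye(d)  # positive definite by construction
q = rng.standard_normal(d)

J = lambda th: 0.5 * th @ P @ th + q @ th  # quadratic objective (r = 0)
grad = lambda th: P @ th + q               # closed-form gradient P*theta + q

theta = rng.standard_normal(d)
eps = 1e-6
# Central finite differences approximate each partial derivative
fd = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
               for e in np.eye(d)])
print(np.max(np.abs(fd - grad(theta))))  # -> tiny discrepancy (rounding only)
```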
Remarks
H is positive definite if and only if all its eigenvalues are positive
H is negative definite if and only if all its eigenvalues are negative
If at a stationary point θ0 the Hessian H(θ0) is positive definite, θ0 is a
local minimum of J; if H(θ0) is negative definite, θ0 is a local maximum
For θ ∈ R, this condition means that the derivative of J at the minimum is
zero, J′(θ) = 0, and its second derivative is positive, i.e. J″(θ) > 0
Back to the example J(θ) = θ1^4 + θ2^4 − 4θ1θ2, with gradient
∇J(θ) = (4θ1^3 − 4θ2, −4θ1 + 4θ2^3)^T and stationary points
θ(1) = (0, 0)^T, θ(2) = (1, 1)^T and θ(3) = (−1, −1)^T

Hessian matrix: H(θ) = [ 12θ1^2   −4 ; −4   12θ2^2 ]

At θ(2) and θ(3), H has eigenvalues 8 and 16: it is positive definite, so
both points are local minima. At θ(1), H has eigenvalues −4 and 4: it is
indefinite, so θ(1) is a saddle point, not a local extremum.
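The nature of each stationary point can be checked numerically from the eigenvalues of the Hessian (a minimal sketch using NumPy):

```python
import numpy as np

# Hessian of J(theta) = theta1^4 + theta2^4 - 4*theta1*theta2
def hess_J(theta):
    t1, t2 = theta
    return np.array([[12 * t1**2, -4.0], [-4.0, 12 * t2**2]])

for point in [(0.0, 0.0), (1.0, 1.0), (-1.0, -1.0)]:
    eigvals = np.linalg.eigvalsh(hess_J(np.array(point)))
    print(point, eigvals)
# (0, 0): eigenvalues -4 and 4  -> indefinite, saddle point
# (1, 1) and (-1, -1): eigenvalues 8 and 16 -> positive definite, local minima
```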
Recall
A function J : Rd → R is convex if it verifies
J(αθ + (1 − α)z) ≤ αJ(θ) + (1 − α)J(z), ∀θ, z ∈ domJ, 0≤α≤1
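For instance, the inequality is easy to verify numerically for a quadratic J with P positive definite (an illustrative check; the matrix P and the sampled points are arbitrary choices):

```python
import numpy as np

# Convexity check for J(theta) = 0.5 * theta^T P theta, P positive definite
P = np.array([[2.0, 0.5], [0.5, 1.0]])  # arbitrary positive definite matrix
J = lambda th: 0.5 * th @ P @ th

rng = np.random.default_rng(1)
theta, z = rng.standard_normal(2), rng.standard_normal(2)
for alpha in np.linspace(0.0, 1.0, 11):
    lhs = J(alpha * theta + (1 - alpha) * z)
    rhs = alpha * J(theta) + (1 - alpha) * J(z)
    assert lhs <= rhs + 1e-12  # the convexity inequality holds
print("convexity inequality verified on sampled points")
```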
Direction of descent
Let J : R^d → R. The vector h ∈ R^d is called a direction of
descent at θ if there exists α > 0 such that J(θ + αh) < J(θ)
hk denotes the descent direction at iteration k, and αk the step size
General approach
General algorithm
1: Let k = 0, initialize θ k
2: repeat
3: Find a descent direction hk ∈ Rd
4: Line search: find a step size αk > 0 in the direction hk such that
J(θ k + αk hk ) decreases "enough"
5: Update: θ k+1 ← θ k + αk hk and k ← k + 1
6: until ‖∇J(θ k )‖ < ε
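The algorithm above can be sketched as a generic loop, parameterized by the direction and step-size rules (the helper names `direction` and `step` are hypothetical, introduced here for illustration):

```python
import numpy as np

def descent(theta0, grad, direction, step, tol=1e-8, max_iter=1000):
    """Generic descent scheme: iterate until the gradient norm is below tol."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(max_iter):
        if np.linalg.norm(grad(theta)) < tol:
            break
        h = direction(theta)       # descent direction h_k
        alpha = step(theta, h)     # step size alpha_k
        theta = theta + alpha * h  # update theta_{k+1} = theta_k + alpha_k h_k
    return theta, k

# Example on J(theta) = 0.5*||theta||^2: gradient direction, fixed step
theta_star, k = descent([2.0, 1.0],
                        grad=lambda t: t,
                        direction=lambda t: -t,
                        step=lambda t, h: 0.5)
print(theta_star)  # -> close to (0, 0)
```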
Gradient Algorithm
Theorem [descent direction and opposite direction of gradient]
Let J(θ) be a differentiable function. The direction h = −∇J(θ) ∈ R^d is a
descent direction.
Proof.
J being differentiable, for any t > 0 we have
J(θ + th) = J(θ) + t∇J(θ)^T h + t‖h‖ε(th). Setting h = −∇J(θ), we get
J(θ + th) − J(θ) = −t‖∇J(θ)‖^2 + t‖h‖ε(th). For t small enough, ε(th) → 0,
so the first term dominates and J(θ + th) − J(θ) < 0. It is then a descent
direction.
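Applied to the earlier example J(θ) = θ1^4 + θ2^4 − 4θ1θ2 with a fixed step size (the starting point and step value below are illustrative choices, not from the original):

```python
import numpy as np

def grad_J(theta):
    t1, t2 = theta
    return np.array([4 * t1**3 - 4 * t2, -4 * t1 + 4 * t2**3])

theta = np.array([1.5, 0.5])  # illustrative starting point
for k in range(500):
    g = grad_J(theta)
    if np.linalg.norm(g) < 1e-8:
        break
    theta = theta - 0.05 * g  # h_k = -grad J, fixed step alpha = 0.05
print(theta)  # -> converges to the local minimum (1, 1)
```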
Newton algorithm
2nd order approximation of the twice differentiable J at θ k:
J(θ k + h) ≈ J(θ k ) + ∇J(θ k )^T h + (1/2) h^T H(θ k )h
with H(θ k ) the positive definite Hessian matrix
The direction hk which minimizes this approximation is obtained by setting
the gradient of the model to zero:
∇J(θ k ) + H(θ k )hk = 0 ⇒ hk = −H(θ k )^{−1} ∇J(θ k )
Features
Choice of the descent direction at θ k : hk = −H(θ k )^{−1} ∇J(θ k )
Complexity of the update: θ k+1 ← θ k − αk H(θ k )^{−1} ∇J(θ k ) costs
O(d^3) flops
H(θ k ) is not always guaranteed to be a positive definite matrix. Hence
we cannot always ensure that hk is a direction of descent
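On the same example, the Newton update can be sketched as follows (the starting point is an illustrative choice; near this point the Hessian is positive definite, so the Newton direction is a descent direction):

```python
import numpy as np

def grad_J(theta):
    t1, t2 = theta
    return np.array([4 * t1**3 - 4 * t2, -4 * t1 + 4 * t2**3])

def hess_J(theta):
    t1, t2 = theta
    return np.array([[12 * t1**2, -4.0], [-4.0, 12 * t2**2]])

theta = np.array([1.5, 0.5])  # illustrative starting point
for k in range(50):
    g = grad_J(theta)
    if np.linalg.norm(g) < 1e-10:
        break
    h = np.linalg.solve(hess_J(theta), -g)  # Newton direction (O(d^3) solve)
    theta = theta + h                        # step size alpha_k = 1
print(theta, k)  # -> reaches (1, 1) in a handful of iterations
```

Note the quadratic convergence: far fewer iterations than gradient descent, at a higher per-iteration cost.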
Main methods of descent

[Figure: directions of descent in 2D — at a point θ k on the contour plot, the gradient direction h = −∇J and the Newton direction h = −H^{−1}∇J, with the tangent at θ k.]
Quasi-Newton method
Main features
Choice of the descent direction at θ k : hk = −B(θ k )^{−1} ∇J(θ k )
B(θ k ) is a positive definite approximation of the Hessian matrix
Complexity of the update: most of the time O(d^2)
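As an illustration, SciPy's BFGS implementation (a standard quasi-Newton method) can be run on the earlier example; this sketch assumes SciPy is installed, and the starting point is an arbitrary choice:

```python
import numpy as np
from scipy.optimize import minimize

J = lambda th: th[0]**4 + th[1]**4 - 4 * th[0] * th[1]
grad_J = lambda th: np.array([4 * th[0]**3 - 4 * th[1],
                              -4 * th[0] + 4 * th[1]**3])

# BFGS builds a positive definite approximation of the Hessian internally
res = minimize(J, x0=np.array([1.5, 0.5]), jac=grad_J, method="BFGS")
print(res.x, res.fun)  # -> one of the local minima (+-1, +-1), where J = -2
```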
Line search
Assume the direction of descent hk at θ k is fixed. We aim to find the step
size αk > 0 in the direction hk such that the function J(θ k + αk hk )
decreases enough (compared to J(θ k ))
Several options
Fixed step size: use a fixed value α > 0 at each iteration k:
θ k+1 ← θ k + αhk
Variable step size: find αk by line search at each iteration:
θ k+1 ← θ k + αk hk
Line search
Given the descent direction hk , we have ∇J(θ k )^T hk < 0, which guarantees
that J decreases along hk for small enough steps
Backtracking
1: Fix an initial step ᾱ (Newton method: ᾱ = 1), choose 0 < ρ < 1, set α ← ᾱ
2: repeat
3:   α ← ρα
4: until J(θ k + αhk ) decreases "enough" compared to J(θ k )
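The loop above can be implemented as follows; the Armijo sufficient-decrease test used here is one common way to make "decreases enough" precise (the constant c is an illustrative choice, not from the original):

```python
import numpy as np

def backtracking(J, grad, theta, h, alpha_bar=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until J(theta + alpha*h) has decreased 'enough'
    (Armijo condition), assuming h is a descent direction."""
    alpha = alpha_bar
    slope = grad(theta) @ h  # negative for a descent direction
    while J(theta + alpha * h) > J(theta) + c * alpha * slope:
        alpha *= rho
    return alpha

# Example: quadratic J with the gradient direction
J = lambda th: 0.5 * th @ th
grad = lambda th: th
theta = np.array([2.0, 1.0])
alpha = backtracking(J, grad, theta, -grad(theta))
print(alpha, J(theta + alpha * (-grad(theta))) < J(theta))
```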
Gradient method

[Figure: iterates of the gradient method from the starting point θ0 on the contour plot, and J(θ k ) versus iteration k; the method converges in about 12 iterations.]
Newton method

[Figure: iterates of Newton's method from the same starting point θ0 on the contour plot, and J(θ k ) versus iteration k; the method converges in about 7 iterations.]
Conclusion