Unconstrained Optimization
Gilles Gasso
1 Formulation
2 Optimality conditions
3 Descent algorithms
Main methods of descent
Line search
Summary
Problem formulation
(P)   min_{θ ∈ R^d} J(θ)
Examples

1. Quadratic function J(θ) = (1/2) θ^T Pθ + q^T θ + r, with P a positive definite matrix

2. J(θ) = cos(θ1 − θ2) + sin(θ1 + θ2) + θ1/4

[Figure: surface and contour plots of the two objective functions.]
Different solutions
Global solution
θ ∗ is said to be the global minimum solution of the problem if
J(θ ∗ ) ≤ J(θ), ∀θ ∈ domJ
Local solution
θ̂ is a local minimum solution of problem (P) if there exists ε > 0 such that
J(θ̂) ≤ J(θ), ∀θ ∈ domJ such that ‖θ̂ − θ‖ ≤ ε
[Figure: contour plot of J(θ) = cos(θ1 − θ2) + sin(θ1 + θ2) + θ1/4, showing the global minimum θ* and a local minimum.]
Optimality conditions
Vocabulary
Any vector θ 0 that verifies ∇J(θ 0 ) = 0 is called a stationary point or critical
point
∇J(θ) ∈ Rd is the gradient vector of J at θ.
The gradient is the unique vector such that the directional derivative can be
written as:
lim_{t→0} [J(θ + th) − J(θ)] / t = ∇J(θ)^T h,   h ∈ R^d, t ∈ R
Gilles Gasso Descent methods 7 / 27
Example: J(θ) = θ1^4 + θ2^4 − 4θ1θ2

Gradient: ∇J(θ) = (4θ1^3 − 4θ2, −4θ1 + 4θ2^3)^T

Stationary points verify ∇J(θ) = 0. Three solutions:
θ(1) = (0, 0)^T, θ(2) = (1, 1)^T and θ(3) = (−1, −1)^T

[Figure: contour plot of J with the three stationary points.]
Remarks
θ(2) and θ(3) are local minima, but θ(1) is not
Not every stationary point is a local extremum
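As a quick numerical check, the gradient above can be evaluated at the three candidate points (a minimal sketch, assuming NumPy is available):

```python
import numpy as np

# Gradient of J(theta) = theta1^4 + theta2^4 - 4*theta1*theta2
def grad_J(theta):
    t1, t2 = theta
    return np.array([4 * t1**3 - 4 * t2, -4 * t1 + 4 * t2**3])

# The gradient vanishes at each of the three stationary points
for point in [(0.0, 0.0), (1.0, 1.0), (-1.0, -1.0)]:
    print(point, grad_J(np.array(point)))  # -> gradient (0, 0) at each point
```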
Hessian matrix
Twice differentiable function
J : R^d → R is said to be twice differentiable on its domain domJ if, at
every point θ ∈ domJ, there exists a unique symmetric matrix
H(θ) ∈ R^{d×d}, called the Hessian matrix, such that
J(θ + h) = J(θ) + ∇J(θ)^T h + (1/2) h^T H(θ)h + ‖h‖^2 ε(h),
where ε(h) is a continuous function at 0 with lim_{h→0} ε(h) = 0
Examples

Example 1
Objective function: J(θ) = θ1^4 + θ2^4 − 4θ1θ2
Gradient: ∇J(θ) = (4θ1^3 − 4θ2, −4θ1 + 4θ2^3)^T
Hessian matrix: H(θ) = [ 12θ1^2   −4 ; −4   12θ2^2 ]

Example 2
Quadratic objective function: J(θ) = (1/2) θ^T Pθ + q^T θ + r
Directional derivative: D(h, θ) = lim_{t→0} [J(θ + th) − J(θ)] / t = (Pθ + q)^T h
Gradient: ∇J(θ) = Pθ + q
Hessian matrix: H(θ) = P
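The closed-form gradient of the quadratic case can be verified against central finite differences (an illustrative sketch; the matrix P built below is an arbitrary positive definite example, and r = 0 for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = rng.standard_normal((d, d))
P = A @ A.T + d * np.eye(d)  # positive definite by construction
q = rng.standard_normal(d)

J = lambda th: 0.5 * th @ P @ th + q @ th  # quadratic objective (r = 0)
grad = lambda th: P @ th + q               # closed-form gradient P*theta + q

theta = rng.standard_normal(d)
eps = 1e-6
# Central finite differences approximate each partial derivative
fd = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
               for e in np.eye(d)])
print(np.max(np.abs(fd - grad(theta))))  # -> tiny discrepancy (rounding only)
```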
Remarks
H is positive definite if and only if all its eigenvalues are positive
H is negative definite if and only if all its eigenvalues are negative
If at a stationary point θ0 the Hessian H(θ0) is positive definite, θ0 is a
local minimum of J; if H(θ0) is negative definite, θ0 is a local maximum
For θ ∈ R, this condition means that the derivative of J at the minimum is
zero, J′(θ) = 0, and its second derivative is positive, i.e. J″(θ) > 0
Back to the example J(θ) = θ1^4 + θ2^4 − 4θ1θ2, with gradient
∇J(θ) = (4θ1^3 − 4θ2, −4θ1 + 4θ2^3)^T and stationary points
θ(1) = (0, 0)^T, θ(2) = (1, 1)^T and θ(3) = (−1, −1)^T

Hessian matrix: H(θ) = [ 12θ1^2   −4 ; −4   12θ2^2 ]

At θ(2) and θ(3), H has eigenvalues 8 and 16: it is positive definite, so
both points are local minima. At θ(1), H has eigenvalues −4 and 4: it is
indefinite, so θ(1) is a saddle point, not a local extremum.
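The nature of each stationary point can be checked numerically from the eigenvalues of the Hessian (a minimal sketch using NumPy):

```python
import numpy as np

# Hessian of J(theta) = theta1^4 + theta2^4 - 4*theta1*theta2
def hess_J(theta):
    t1, t2 = theta
    return np.array([[12 * t1**2, -4.0], [-4.0, 12 * t2**2]])

for point in [(0.0, 0.0), (1.0, 1.0), (-1.0, -1.0)]:
    eigvals = np.linalg.eigvalsh(hess_J(np.array(point)))
    print(point, eigvals)
# (0, 0): eigenvalues -4 and 4  -> indefinite, saddle point
# (1, 1) and (-1, -1): eigenvalues 8 and 16 -> positive definite, local minima
```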
Recall
A function J : Rd → R is convex if it verifies
J(αθ + (1 − α)z) ≤ αJ(θ) + (1 − α)J(z), ∀θ, z ∈ domJ, 0≤α≤1
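For instance, the inequality is easy to verify numerically for a quadratic J with P positive definite (an illustrative check; the matrix P and the sampled points are arbitrary choices):

```python
import numpy as np

# Convexity check for J(theta) = 0.5 * theta^T P theta, P positive definite
P = np.array([[2.0, 0.5], [0.5, 1.0]])  # arbitrary positive definite matrix
J = lambda th: 0.5 * th @ P @ th

rng = np.random.default_rng(1)
theta, z = rng.standard_normal(2), rng.standard_normal(2)
for alpha in np.linspace(0.0, 1.0, 11):
    lhs = J(alpha * theta + (1 - alpha) * z)
    rhs = alpha * J(theta) + (1 - alpha) * J(z)
    assert lhs <= rhs + 1e-12  # the convexity inequality holds
print("convexity inequality verified on sampled points")
```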
Direction of descent
Let J : R^d → R. The vector h ∈ R^d is called a direction of
descent at θ if there exists α > 0 such that J(θ + αh) < J(θ)
hk denotes the descent direction at iteration k, and αk the step size
General approach
General algorithm
1: Let k = 0, initialize θ k
2: repeat
3: Find a descent direction hk ∈ Rd
4: Line search: find a step size αk > 0 in the direction hk such that
J(θ k + αk hk ) decreases "enough"
5: Update: θ k+1 ← θ k + αk hk and k ← k + 1
6: until ‖∇J(θ k )‖ < ε
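The algorithm above can be sketched as a generic loop, parameterized by the direction and step-size rules (the helper names `direction` and `step` are hypothetical, introduced here for illustration):

```python
import numpy as np

def descent(theta0, grad, direction, step, tol=1e-8, max_iter=1000):
    """Generic descent scheme: iterate until the gradient norm is below tol."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(max_iter):
        if np.linalg.norm(grad(theta)) < tol:
            break
        h = direction(theta)       # descent direction h_k
        alpha = step(theta, h)     # step size alpha_k
        theta = theta + alpha * h  # update theta_{k+1} = theta_k + alpha_k h_k
    return theta, k

# Example on J(theta) = 0.5*||theta||^2: gradient direction, fixed step
theta_star, k = descent([2.0, 1.0],
                        grad=lambda t: t,
                        direction=lambda t: -t,
                        step=lambda t, h: 0.5)
print(theta_star)  # -> close to (0, 0)
```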
Gradient Algorithm
Theorem [descent direction and opposite direction of gradient]
Let J(θ) be a differentiable function. The direction h = −∇J(θ) ∈ R^d is a
descent direction.
Proof.
J being differentiable, for any t > 0 we have
J(θ + th) = J(θ) + t∇J(θ)^T h + t‖h‖ε(th). Setting h = −∇J(θ), we get
J(θ + th) − J(θ) = −t‖∇J(θ)‖^2 + t‖h‖ε(th). For t small enough, ε(th) → 0,
so the first term dominates and J(θ + th) − J(θ) < 0. It is then a descent
direction.
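Applied to the earlier example J(θ) = θ1^4 + θ2^4 − 4θ1θ2 with a fixed step size (the starting point and step value below are illustrative choices, not from the original):

```python
import numpy as np

def grad_J(theta):
    t1, t2 = theta
    return np.array([4 * t1**3 - 4 * t2, -4 * t1 + 4 * t2**3])

theta = np.array([1.5, 0.5])  # illustrative starting point
for k in range(500):
    g = grad_J(theta)
    if np.linalg.norm(g) < 1e-8:
        break
    theta = theta - 0.05 * g  # h_k = -grad J, fixed step alpha = 0.05
print(theta)  # -> converges to the local minimum (1, 1)
```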
Newton algorithm
2nd order approximation of the twice differentiable J at θ k:
J(θ k + h) ≈ J(θ k ) + ∇J(θ k )^T h + (1/2) h^T H(θ k )h
with H(θ k ) the positive definite Hessian matrix
The direction hk which minimizes this approximation is obtained by setting
the gradient of the model to zero:
∇J(θ k ) + H(θ k )hk = 0 ⇒ hk = −H(θ k )^{−1} ∇J(θ k )
Features
Choice of the descent direction at θ k : hk = −H(θ k )^{−1} ∇J(θ k )
Complexity of the update: θ k+1 ← θ k − αk H(θ k )^{−1} ∇J(θ k ) costs
O(d^3) flops
H(θ k ) is not always guaranteed to be a positive definite matrix. Hence
we cannot always ensure that hk is a direction of descent
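On the same example, the Newton update can be sketched as follows (the starting point is an illustrative choice; near this point the Hessian is positive definite, so the Newton direction is a descent direction):

```python
import numpy as np

def grad_J(theta):
    t1, t2 = theta
    return np.array([4 * t1**3 - 4 * t2, -4 * t1 + 4 * t2**3])

def hess_J(theta):
    t1, t2 = theta
    return np.array([[12 * t1**2, -4.0], [-4.0, 12 * t2**2]])

theta = np.array([1.5, 0.5])  # illustrative starting point
for k in range(50):
    g = grad_J(theta)
    if np.linalg.norm(g) < 1e-10:
        break
    h = np.linalg.solve(hess_J(theta), -g)  # Newton direction (O(d^3) solve)
    theta = theta + h                        # step size alpha_k = 1
print(theta, k)  # -> reaches (1, 1) in a handful of iterations
```

Note the quadratic convergence: far fewer iterations than gradient descent, at a higher per-iteration cost.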
Main methods of descent

[Figure: directions of descent in 2D — at a point θ k on the contour plot, the gradient direction h = −∇J and the Newton direction h = −H^{−1}∇J, with the tangent at θ k.]
Quasi-Newton method
Main features
Choice of the descent direction at θ k : hk = −B(θ k )^{−1} ∇J(θ k )
B(θ k ) is a positive definite approximation of the Hessian matrix
Complexity of the update: most of the time O(d^2)
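As an illustration, SciPy's BFGS implementation (a standard quasi-Newton method) can be run on the earlier example; this sketch assumes SciPy is installed, and the starting point is an arbitrary choice:

```python
import numpy as np
from scipy.optimize import minimize

J = lambda th: th[0]**4 + th[1]**4 - 4 * th[0] * th[1]
grad_J = lambda th: np.array([4 * th[0]**3 - 4 * th[1],
                              -4 * th[0] + 4 * th[1]**3])

# BFGS builds a positive definite approximation of the Hessian internally
res = minimize(J, x0=np.array([1.5, 0.5]), jac=grad_J, method="BFGS")
print(res.x, res.fun)  # -> one of the local minima (+-1, +-1), where J = -2
```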
Line search
Assume the direction of descent hk at θ k is fixed. We aim to find the step
size αk > 0 in the direction hk such that the function J(θ k + αk hk )
decreases enough (compared to J(θ k ))
Several options
Fixed step size: use a fixed value α > 0 at each iteration k:
θ k+1 ← θ k + αhk
Variable step size: find αk by line search at each iteration:
θ k+1 ← θ k + αk hk
Line search
Given the descent direction hk , we have ∇J(θ k )^T hk < 0, which guarantees
that J decreases along hk for small enough steps
Backtracking
1: Fix an initial step ᾱ (Newton method: ᾱ = 1), choose 0 < ρ < 1, set α ← ᾱ
2: repeat
3:   α ← ρα
4: until J(θ k + αhk ) decreases "enough" compared to J(θ k )
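The loop above can be implemented as follows; the Armijo sufficient-decrease test used here is one common way to make "decreases enough" precise (the constant c is an illustrative choice, not from the original):

```python
import numpy as np

def backtracking(J, grad, theta, h, alpha_bar=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until J(theta + alpha*h) has decreased 'enough'
    (Armijo condition), assuming h is a descent direction."""
    alpha = alpha_bar
    slope = grad(theta) @ h  # negative for a descent direction
    while J(theta + alpha * h) > J(theta) + c * alpha * slope:
        alpha *= rho
    return alpha

# Example: quadratic J with the gradient direction
J = lambda th: 0.5 * th @ th
grad = lambda th: th
theta = np.array([2.0, 1.0])
alpha = backtracking(J, grad, theta, -grad(theta))
print(alpha, J(theta + alpha * (-grad(theta))) < J(theta))
```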
Gradient method

[Figure: iterates of the gradient method from the starting point θ0 on the contour plot, and J(θ k ) versus iteration k; the method converges in about 12 iterations.]
Newton method

[Figure: iterates of Newton's method from the same starting point θ0 on the contour plot, and J(θ k ) versus iteration k; the method converges in about 7 iterations.]
Conclusion