02 Gradient Descent
R. Flamary
Optimization problem

    min_{x∈ℝⁿ} F(x)    (1)

▶ F is L-smooth (at least differentiable).
▶ When F is convex, x⋆ is a solution of the problem if ∇F(x⋆) = 0.
▶ When F is non-convex, ∇F(x⋆) = 0 is only a necessary condition: x⋆ is a stationary point (possibly a local minimizer).
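As a quick numerical illustration of the optimality condition (on an illustrative convex quadratic of my choosing, not from the slides), solving ∇F(x⋆) = 0 directly recovers the minimizer:

```python
import numpy as np

# For the convex quadratic F(x) = 0.5 x^T A x - b^T x (A symmetric positive
# definite), the condition grad F(x*) = A x* - b = 0 characterizes the solution.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # illustrative SPD matrix
b = np.array([1.0, 1.0])

x_star = np.linalg.solve(A, b)           # solve grad F(x) = 0
grad_at_star = A @ x_star - b            # gradient at the candidate minimizer
print(np.linalg.norm(grad_at_star))      # ~0: x* is a stationary point
```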
Iterative algorithms
▶ Principle : start from an initial point x(0) and iterate to make it better.
▶ Gradient descent (and variants) when available, proximal methods.
▶ Black box optimization (a.k.a. derivative-free optimization):
▶ Genetic, random search, simulated annealing [Gen and Cheng, 1999].
▶ Particle swarm optimization, etc [Kennedy and Eberhart, 1995].
▶ Nelder-Mead simplex [Nelder and Mead, 1965].
How to choose?
▶ No free lunch theorem [Wolpert and Macready, 1997]:
no algorithm is better than all others on all problems.
▶ But one can use the properties of the problem to choose the algorithm: specialize!
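For reference, the Nelder-Mead simplex [Nelder and Mead, 1965] mentioned above is available off the shelf; a minimal sketch, assuming SciPy is installed and using an illustrative quadratic of my choosing:

```python
import numpy as np
from scipy.optimize import minimize

# Derivative-free minimization with the Nelder-Mead simplex: only function
# values are used, no gradients are ever computed.
def f(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 0.5) ** 2

res = minimize(f, x0=np.zeros(2), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8})
print(res.x)  # close to the minimizer (1, -0.5)
```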
1D Logistic regression

    min_{w,b} ∑_{i=1}^{n} log(1 + exp(−y_i (w x_i + b))) + (λ/2) w²
[Figure: gradient descent iterates in the (w, b) plane (contours, iterations 5 to 1000) and gradient norm vs. iterations.]
Discussion
▶ Steepest descent with fixed step ρ(k) = 0.1
▶ Slow convergence around the solution (small gradients).
▶ After 1000 iterations, still not converged.
▶ Complexity O(nd) per iteration.
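The fixed-step run discussed above can be sketched as follows; the synthetic data, step size, and iteration budget below are illustrative assumptions, not the slides' exact setup:

```python
import numpy as np

# Fixed-step gradient descent on the slides' 1D regularized logistic regression
#   F(w, b) = sum_i log(1 + exp(-y_i (w x_i + b))) + (lam/2) w^2.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
y = np.concatenate([-np.ones(50), np.ones(50)])
lam, rho, n_iter = 0.1, 0.005, 1000

def loss(w, b):
    # np.logaddexp(0, t) = log(1 + e^t), numerically stable
    return np.sum(np.logaddexp(0.0, -y * (w * x + b))) + 0.5 * lam * w ** 2

def grad(w, b):
    s = -y / (1.0 + np.exp(y * (w * x + b)))   # dF/dz_i for z_i = w x_i + b
    return np.sum(s * x) + lam * w, np.sum(s)

w, b = 0.0, 0.0
for _ in range(n_iter):
    gw, gb = grad(w, b)
    w, b = w - rho * gw, b - rho * gb          # fixed-step update

grad_norm = np.hypot(*grad(w, b))
print(grad_norm)  # small, but gradients shrink slowly near the optimum
```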
Principle
▶ Iterative algorithm that minimizes a surrogate function.
▶ Let F be the function to minimize and G a majorization of F : F(x) ≤ G(x, y) ∀x, y, with G(y, y) = F(y).
▶ MM iteration:

    x^(k+1) = argmin_x G(x, x^(k))    (8)

▶ For an L-smooth F, the quadratic upper bound gives the surrogate update:

    x^(k+1) = argmin_x F(x^(k)) + ∇F(x^(k))^⊤(x − x^(k)) + (L/2) ∥x − x^(k)∥²    (10)

    x^(k+1) = x^(k) − (1/L) ∇F(x^(k))    (11)

▶ This is exactly the update of gradient descent with step ρ = 1/L.
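The majorize-minimize view above can be checked numerically; a minimal sketch on an illustrative quadratic of my choosing (the matrix and point are assumptions):

```python
import numpy as np

# Check that the L-smooth surrogate
#   G(x, x_k) = F(x_k) + grad F(x_k)^T (x - x_k) + (L/2) ||x - x_k||^2
# majorizes F, and that its minimizer is the gradient step x_k - (1/L) grad F(x_k).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, -2.0])
F = lambda x: 0.5 * x @ A @ x - b @ x
gF = lambda x: A @ x - b
L = np.linalg.eigvalsh(A).max()          # gradient Lipschitz constant of F

x_k = np.array([1.0, 1.0])
G = lambda x: F(x_k) + gF(x_k) @ (x - x_k) + 0.5 * L * np.sum((x - x_k) ** 2)

rng = np.random.default_rng(0)
majorizes = all(F(x) <= G(x) + 1e-12 for x in rng.normal(size=(100, 2)))
x_next = x_k - gF(x_k) / L               # minimizer of G = one gradient step
print(majorizes, F(x_next) <= F(x_k))    # True True
```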
[Figure: iterates in the (w, b) plane (contours, iterations 5 to 1000) and gradient norm vs. iterations.]
Questions
▶ Does gradient descent converge to an optimal point?
▶ At which speed is the minimum reached?
▶ How to choose the stepsize ρ(k)?
Theoretical convergence and convergence speed
▶ Fixed steps ρ(k) = ρ ?
▶ Smooth and strongly convex functions ?
▶ Acceleration techniques ?
▶ Adaptive steps ρ(k) (linesearch, next course) ?
2.3.0 - Convergence of gradient descent - - 14/36
Convergence for smooth functions
[Figure: L-smooth quadratic upper bound and convex lower bound on F, for small and large L.]
Convergence of gradient descent for L-smooth functions
If the function F is convex and differentiable and its gradient is Lipschitz with constant L, then gradient descent with fixed step ρ(k) = ρ ≤ 1/L converges to a solution x⋆ of the optimization problem with the following speed:

    F(x^(k)) − F(x⋆) ≤ ∥x^(0) − x⋆∥² / (2ρk)    (12)

▶ Best for ρ = 1/L, the largest step that ensures decrease of the cost.
▶ We say that gradient descent has a convergence rate O(1/k).
▶ In order to reach a precision ϵ one needs O(1/ϵ) iterations.
▶ We prove this result in the next slides 1 .
1 See also : https://www.stat.cmu.edu/ ~ryantibs/convexopt-F13/scribes/lec6.pdf
2.3.1 - Convergence of gradient descent - Convergence for smooth functions - 15/36
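Bound (12) can be checked numerically; a sketch on an illustrative convex quadratic (matrix, starting point and iteration count are my assumptions):

```python
import numpy as np

# Verify F(x^k) - F(x*) <= ||x^0 - x*||^2 / (2 rho k) for fixed step rho = 1/L.
A = np.array([[5.0, 2.0], [2.0, 1.0]])   # symmetric positive definite
b = np.array([1.0, 0.0])
F = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)
L = np.linalg.eigvalsh(A).max()
rho = 1.0 / L

x = np.array([3.0, -3.0])
x0 = x.copy()
ok = True
for k in range(1, 201):
    x = x - rho * (A @ x - b)            # gradient step
    bound = np.sum((x0 - x_star) ** 2) / (2 * rho * k)
    ok = ok and (F(x) - F(x_star) <= bound + 1e-12)
print(ok)  # True: the suboptimality stays below the O(1/k) bound
```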
Convergence proof (convex L-smooth)
Step 1 : Descent vs gradient norm lemma

    F(x^(k+1)) ≤ F(x^(k)) − (ρ/2) ∥∇F(x^(k))∥²    (13)

The value decreases at each iteration for ρ ≤ 1/L.

Proof.

    F(x^(k+1)) ≤ F(x^(k)) + ∇F(x^(k))^⊤(x^(k+1) − x^(k)) + (L/2) ∥x^(k+1) − x^(k)∥²    ²
               = F(x^(k)) + ∇F(x^(k))^⊤(−ρ∇F(x^(k))) + (L/2) ∥−ρ∇F(x^(k))∥²    ³
               = F(x^(k)) − ρ ∥∇F(x^(k))∥² + (Lρ²/2) ∥∇F(x^(k))∥²
               = F(x^(k)) − (ρ/2) ∥∇F(x^(k))∥² (2 − ρL)
               ≤ F(x^(k)) − (ρ/2) ∥∇F(x^(k))∥²    ⁴

² L-smoothness of F    ³ Update x^(k+1) = x^(k) − ρ∇F(x^(k))    ⁴ 2 − ρL ≥ 1 when ρ ≤ 1/L
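The descent lemma (13) also holds for non-quadratic functions; a numerical check on an illustrative 1D example of my choosing:

```python
import numpy as np

# F(t) = log(1 + e^t) is L-smooth with L = 1/4 (F'' = sigma(1-sigma) <= 1/4),
# so the descent lemma (13) gives, for rho <= 1/L = 4,
#   F(t - rho F'(t)) <= F(t) - (rho/2) F'(t)^2.
F = lambda t: np.logaddexp(0.0, t)        # log(1 + e^t), numerically stable
dF = lambda t: 1.0 / (1.0 + np.exp(-t))   # sigmoid
rho = 4.0                                 # the limiting step rho = 1/L

t = np.linspace(-10, 10, 1001)
lhs = F(t - rho * dF(t))                  # value after one gradient step
rhs = F(t) - 0.5 * rho * dF(t) ** 2       # descent lemma bound
holds = bool(np.all(lhs <= rhs + 1e-12))
print(holds)  # True on this grid
```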
Convergence proof (convex L-smooth)
Step 2 : Objective w.r.t. optimal value

    F(x^(k+1)) − F(x⋆) ≤ (1/(2ρ)) (∥x^(k) − x⋆∥² − ∥x^(k+1) − x⋆∥²)    (14)

Proof.
Using convexity one has F(x) ≤ F(x⋆) + ∇F(x)^⊤(x − x⋆), so from (13):

    F(x^(k+1)) ≤ F(x^(k)) − (ρ/2) ∥∇F(x^(k))∥²
               ≤ F(x⋆) + ∇F(x^(k))^⊤(x^(k) − x⋆) − (ρ/2) ∥∇F(x^(k))∥²
    F(x^(k+1)) − F(x⋆) ≤ ∇F(x^(k))^⊤(x^(k) − x⋆) − (ρ/2) ∥∇F(x^(k))∥²
               ≤ (1/(2ρ)) (2ρ ∇F(x^(k))^⊤(x^(k) − x⋆) − ρ² ∥∇F(x^(k))∥² − ∥x^(k) − x⋆∥² + ∥x^(k) − x⋆∥²)
               ≤ (1/(2ρ)) (−∥x^(k) − ρ∇F(x^(k)) − x⋆∥² + ∥x^(k) − x⋆∥²)    ⁵
               = (1/(2ρ)) (∥x^(k) − x⋆∥² − ∥x^(k+1) − x⋆∥²)

⁵ Factorization of ∥x^(k) − ρ∇F(x^(k)) − x⋆∥²

2.3.1 - Convergence of gradient descent - Convergence for smooth functions - 17/36
Convergence proof (convex L-smooth)
Step 3 : Putting all iterations together

    F(x^(k)) − F(x⋆) ≤ ∥x^(0) − x⋆∥² / (2ρk)

Proof.

    F(x^(k)) − F(x⋆) = (1/k) ∑_{i=1}^{k} (F(x^(k)) − F(x⋆))
                     ≤ (1/k) ∑_{i=1}^{k} (F(x^(i)) − F(x⋆))    ⁶
                     ≤ (1/(2ρk)) ∑_{i=1}^{k} (∥x^(i−1) − x⋆∥² − ∥x^(i) − x⋆∥²)    ⁷
                     = (∥x^(0) − x⋆∥² − ∥x^(k) − x⋆∥²) / (2ρk)    ⁸
                     ≤ ∥x^(0) − x⋆∥² / (2ρk)

⁶ Descent lemma (13)    ⁷ Inject Eq. (14)    ⁸ Summation of telescopic series

2.3.1 - Convergence of gradient descent - Convergence for smooth functions - 18/36
Convergence example for smooth function
[Figure: logistic regression data and model in (x, y), with gradient descent iterates in the (w, b) plane.]
Discussion
▶ Steepest descent with fixed step ρ(k) = 0.05.
▶ Non-regularized logistic regression (λ = 0).
▶ Slow O(1/k) convergence of Gradient Descent.
2.3.2 - Convergence of gradient descent - Convergence for strongly convex functions - 20/36
Convergence for strongly convex functions
[Figure: L-smooth upper bound and μ-strongly convex lower bound on F around x⋆, for large and small κ = L/μ.]
2.3.2 - Convergence of gradient descent - Convergence for strongly convex functions - 22/36
Convergence proof (µ-strongly convex, L-smooth)

    F(x^(k)) − F(x⋆) ≤ (1 − µ/L)^k (L/2) ∥x^(0) − x⋆∥²

Proof.
Using the descent lemma (13) with ρ = 1/L:

    F(x^(k)) − F(x^(k−1)) ≤ −(1/(2L)) ∥∇F(x^(k−1))∥²
                          ≤ −(µ/L) (F(x^(k−1)) − F(x⋆))    ⁹
    F(x^(k)) − F(x⋆) ≤ (F(x^(k−1)) − F(x⋆)) − (µ/L) (F(x^(k−1)) − F(x⋆))
                     ≤ (1 − µ/L) (F(x^(k−1)) − F(x⋆))
                     ≤ (1 − µ/L)^k (F(x^(0)) − F(x⋆))
                     ≤ (1 − µ/L)^k (L/2) ∥x^(0) − x⋆∥²

⁹ Use PL inequality (17); the last line uses L-smoothness and ∇F(x⋆) = 0.

2.3.2 - Convergence of gradient descent - Convergence for strongly convex functions - 23/36
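The linear (geometric) rate above can be observed numerically; a sketch on an illustrative µ-strongly convex, L-smooth quadratic of my choosing:

```python
import numpy as np

# Fixed-step GD with rho = 1/L on a strongly convex quadratic, checked against
#   F(x^k) - F(x*) <= (1 - mu/L)^k (F(x^0) - F(x*)).
A = np.array([[3.0, 0.5], [0.5, 1.0]])   # SPD: mu, L are its extreme eigenvalues
b = np.array([1.0, 1.0])
F = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)
eigs = np.linalg.eigvalsh(A)
mu, L = eigs.min(), eigs.max()

x = np.array([4.0, -4.0])
gap0 = F(x) - F(x_star)
ok = True
for k in range(1, 101):
    x = x - (A @ x - b) / L              # gradient step with rho = 1/L
    ok = ok and (F(x) - F(x_star) <= (1 - mu / L) ** k * gap0 + 1e-12)
print(ok)  # True: linear convergence with rate (1 - mu/L)
```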
Convergence example for strongly convex function
[Figure: regularized logistic regression data and model in (x, y), with gradient descent iterates in the (w, b) plane.]
Discussion
▶ Steepest descent with fixed step ρ(k) = 0.02.
▶ Fully regularized logistic regression (λ = 1 for w and b).
▶ L-smooth and µ-strongly convex upper bounds.
▶ Fast O(e^(−k/κ)) convergence of Gradient Descent.
2.3.2 - Convergence of gradient descent - Convergence for strongly convex functions - 24/36
How to make Gradient Descent faster?
[Figure: steepest descent iterates in the (w, b) plane, optimization cost and gradient norm vs. iterations.]
Acceleration techniques
▶ Use adaptive stepsizes (smarter ρ(k)).
▶ Use momentum (remember previous gradients).
▶ Use second order information (Newton, quasi-Newton).
▶ Speed up the gradient computation (stochastic gradient: slower convergence, but much cheaper iterations).
2.4.0 - Gradient descent acceleration - - 25/36
Barzilai-Borwein stepsize (BB-rule)
Principle [Barzilai and Borwein, 1988]
▶ Use the current and the previous gradient to compute the stepsize.
▶ It is a two-point approximation of the secant method (to cancel the gradient).
▶ The stepsize is computed as:
▶ Long BB stepsize:

    ρ^(k) = (∆x^⊤∆x) / (∆x^⊤∆g)    (18)

▶ Short BB stepsize:

    ρ^(k) = (∆x^⊤∆g) / (∆g^⊤∆g)    (19)

▶ where ∆x = x^(k) − x^(k−1) and ∆g = ∇F(x^(k)) − ∇F(x^(k−1)).
▶ The stepsize can be clipped to avoid too large steps (or combined with linesearch).
▶ Convergence for quadratic [Raydan, 1993] and non-quadratic functions [Raydan, 1997] with linesearch.
▶ Variants used for hyperparameter-free optimization with provably better constants.
▶ Discussed in more detail in the next courses.
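A minimal sketch of the long BB stepsize (18) on an illustrative quadratic (the matrix and the small first fixed step are my assumptions); on quadratics BB converges [Raydan, 1993]:

```python
import numpy as np

# Gradient descent with the long Barzilai-Borwein stepsize
#   rho_k = (dx^T dx) / (dx^T dg), dx = x_k - x_{k-1}, dg = g_k - g_{k-1}.
A = np.array([[10.0, 2.0], [2.0, 1.0]])   # SPD, so dx^T dg = dx^T A dx > 0
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)

x_prev = np.array([5.0, 5.0])
g_prev = grad(x_prev)
x = x_prev - 0.01 * g_prev                # first iteration: small fixed step
for _ in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-12:         # converged: avoid a 0/0 stepsize
        break
    dx, dg = x - x_prev, g - g_prev
    rho = (dx @ dx) / (dx @ dg)           # long BB stepsize (18)
    x_prev, g_prev = x, g
    x = x - rho * g

err = np.linalg.norm(x - x_star)
print(err)
```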
[Figure: BB-rule iterates in the (w, b) plane and gradient norm vs. iterations.]
Discussion
▶ GD and the first step of the BB rule use step ρ(k) = 0.01.
▶ The acceleration is important w.r.t. the steepest descent step.
▶ Can be unstable: the stepsize may become too large and increase the loss.
▶ The BB rule is best used with linesearch (see next course).
Accelerated Gradient Descent [Nesterov, 1983]

    F(x^(k)) − F(x⋆) ≤ 2L ∥x^(0) − x⋆∥² / k²    (20)

▶ The convergence speed O(1/k²) is optimal for a first order method.
[Figure: GD and AGD iterates in the (w, b) plane and optimization cost vs. iterations.]
Discussion
▶ Both GD and AGD use fixed step ρ(k) = 0.1.
▶ The acceleration speedup is important w.r.t. the steepest descent step.
▶ The momentum due to the Nesterov acceleration can be seen in the trajectory.
▶ Non monotonic convergence, but faster than GD.
▶ Complexity O(nd) per iteration when no line search is used.
2.4.2 - Gradient descent acceleration - Accelerated Gradient Descent - 29/36
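A sketch of one common Nesterov variant, with momentum coefficient (k−1)/(k+2) (my choice; the ill-conditioned quadratic and iteration budget are also illustrative assumptions), compared against plain GD:

```python
import numpy as np

# Nesterov acceleration:
#   y = x_k + (k-1)/(k+2) (x_k - x_{k-1});  x_{k+1} = y - rho * grad F(y),
# vs. plain GD on a quadratic with L = 100, mu = 1 (condition number 100).
A = np.diag([100.0, 1.0])
b = np.array([1.0, 1.0])
F = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)
rho = 1.0 / 100.0                         # fixed step rho = 1/L

x_gd = np.array([5.0, 5.0])
x_agd, x_old = np.array([5.0, 5.0]), np.array([5.0, 5.0])
for k in range(1, 201):
    x_gd = x_gd - rho * grad(x_gd)                      # plain gradient step
    y = x_agd + (k - 1) / (k + 2) * (x_agd - x_old)     # momentum extrapolation
    x_old, x_agd = x_agd, y - rho * grad(y)             # gradient step at y

gap_gd = F(x_gd) - F(x_star)
gap_agd = F(x_agd) - F(x_star)
print(gap_gd, gap_agd)  # AGD reaches a much smaller suboptimality here
```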
Least squares and ridge regression

    min_w F(w) = (1/n) ∑_{i=1}^{n} (w^⊤x_i − y_i)² + λ∥w∥²    (21)
2.5.1 - Smooth machine learning problems - Least Squares and Ridge regression - 30/36
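For the ridge problem (21) the gradient is ∇F(w) = (2/n) X^⊤(Xw − y) + 2λw and the solution is available in closed form, which makes it a convenient sanity check for GD; the data below are an illustrative assumption:

```python
import numpy as np

# GD with rho = 1/L on ridge regression (21) reaches the closed-form solution
#   w* = ((1/n) X^T X + lam I)^{-1} (1/n) X^T y.
rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

H = 2.0 * (X.T @ X / n + lam * np.eye(d))   # Hessian of F (constant)
L = np.linalg.eigvalsh(H).max()
w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

w = np.zeros(d)
for _ in range(500):
    g = 2.0 * (X.T @ (X @ w - y) / n + lam * w)   # gradient of F at w
    w = w - g / L
err = np.linalg.norm(w - w_star)
print(err)  # tiny: GD matches the closed form
```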
Logistic regression

    min_w F(w) = (1/n) ∑_{i=1}^{n} log(1 + exp(−y_i w^⊤x_i)) + λ∥w∥²    (22)
▶ Logistic regression:

    min_w F(w) = (1/n) ∑_{i=1}^{n} log(1 + exp(−y_i w^⊤x_i)) + λ∥w∥²
Your mission
▶ Implement the loss functions f and gradients df for the three problems.
▶ Implement the gradient descent algorithm (and its accelerated variant).
▶ Compare the convergence speed of the three algorithms.
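One possible skeleton for the lab: a generic fixed-step gradient descent taking a loss `f` and gradient `df` (the names used in the mission statement); the quadratic test problem at the bottom is only an illustrative self-check, not one of the three problems:

```python
import numpy as np

def gradient_descent(f, df, x0, rho=0.1, n_iter=1000):
    """Minimize f by fixed-step gradient descent; return iterate and f history."""
    x = np.asarray(x0, dtype=float)
    history = [f(x)]
    for _ in range(n_iter):
        x = x - rho * df(x)          # fixed-step update
        history.append(f(x))
    return x, history

# Sanity check on f(x) = ||x - 1||^2 (minimizer: the all-ones vector)
f = lambda x: np.sum((x - 1.0) ** 2)
df = lambda x: 2.0 * (x - 1.0)
x_opt, hist = gradient_descent(f, df, np.zeros(3), rho=0.1, n_iter=200)
print(np.round(x_opt, 6))  # ≈ [1. 1. 1.]
```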
Bertsekas, D. P. (1997). Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334.

Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547.

Raydan, M. (1997). The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization, 7(1):26–33.

Walkington, N. J. (2023). Nesterov's method for convex optimization. SIAM Review, 65(2):539–562.