02 Gradient Descent
R. Flamary
Optimization problem

    min_{x∈ℝⁿ} F(x)    (1)

▶ F is L-smooth (at least differentiable).
▶ When F is convex, x⋆ is a solution of the problem if ∇F(x⋆) = 0.
▶ When F is non-convex, ∇F(x⋆) = 0 is only a necessary condition: x⋆ is a stationary point (possibly a local minimizer).
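As a quick numerical illustration of the optimality condition (on an illustrative convex quadratic of my choosing, not from the slides), solving ∇F(x⋆) = 0 directly recovers the minimizer:

```python
import numpy as np

# For the convex quadratic F(x) = 0.5 x^T A x - b^T x (A symmetric positive
# definite), the condition grad F(x*) = A x* - b = 0 characterizes the solution.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # illustrative SPD matrix
b = np.array([1.0, 1.0])

x_star = np.linalg.solve(A, b)           # solve grad F(x) = 0
grad_at_star = A @ x_star - b            # gradient at the candidate minimizer
print(np.linalg.norm(grad_at_star))      # ~0: x* is a stationary point
```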
Iterative algorithms
▶ Principle : start from an initial point x(0) and iterate to make it better.
▶ Gradient descent (and variants) when available, proximal methods.
▶ Black box optimization (a.k.a. derivative-free optimization):
▶ Genetic, random search, simulated annealing [Gen and Cheng, 1999].
▶ Particle swarm optimization, etc [Kennedy and Eberhart, 1995].
▶ Nelder-Mead simplex [Nelder and Mead, 1965].
How to choose?
▶ No free lunch theorem [Wolpert and Macready, 1997]:
no algorithm is better than all others on all problems.
▶ But one can use the properties of the problem to choose the algorithm: specialize!
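For reference, the Nelder-Mead simplex [Nelder and Mead, 1965] mentioned above is available off the shelf; a minimal sketch, assuming SciPy is installed and using an illustrative quadratic of my choosing:

```python
import numpy as np
from scipy.optimize import minimize

# Derivative-free minimization with the Nelder-Mead simplex: only function
# values are used, no gradients are ever computed.
def f(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 0.5) ** 2

res = minimize(f, x0=np.zeros(2), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8})
print(res.x)  # close to the minimizer (1, -0.5)
```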
1D Logistic regression

    min_{w,b} ∑_{i=1}^{n} log(1 + exp(−y_i (w x_i + b))) + (λ/2) w²
[Figure: gradient descent iterates in the (w, b) plane (contours, iterations 5 to 1000) and gradient norm vs. iterations.]
Discussion
▶ Steepest descent with fixed step ρ(k) = 0.1
▶ Slow convergence around the solution (small gradients).
▶ After 1000 iterations, still not converged.
▶ Complexity O(nd) per iteration.
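The fixed-step run discussed above can be sketched as follows; the synthetic data, step size, and iteration budget below are illustrative assumptions, not the slides' exact setup:

```python
import numpy as np

# Fixed-step gradient descent on the slides' 1D regularized logistic regression
#   F(w, b) = sum_i log(1 + exp(-y_i (w x_i + b))) + (lam/2) w^2.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
y = np.concatenate([-np.ones(50), np.ones(50)])
lam, rho, n_iter = 0.1, 0.005, 1000

def loss(w, b):
    # np.logaddexp(0, t) = log(1 + e^t), numerically stable
    return np.sum(np.logaddexp(0.0, -y * (w * x + b))) + 0.5 * lam * w ** 2

def grad(w, b):
    s = -y / (1.0 + np.exp(y * (w * x + b)))   # dF/dz_i for z_i = w x_i + b
    return np.sum(s * x) + lam * w, np.sum(s)

w, b = 0.0, 0.0
for _ in range(n_iter):
    gw, gb = grad(w, b)
    w, b = w - rho * gw, b - rho * gb          # fixed-step update

grad_norm = np.hypot(*grad(w, b))
print(grad_norm)  # small, but gradients shrink slowly near the optimum
```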
Principle
▶ Iterative algorithm that minimizes a surrogate function.
▶ Let F be the function to minimize and G a majorization of F : F(x) ≤ G(x, y) ∀x, y, with G(y, y) = F(y).
▶ MM iteration:

    x^(k+1) = argmin_x G(x, x^(k))    (8)

▶ For an L-smooth F, the quadratic upper bound gives the surrogate update:

    x^(k+1) = argmin_x F(x^(k)) + ∇F(x^(k))^⊤(x − x^(k)) + (L/2) ∥x − x^(k)∥²    (10)

    x^(k+1) = x^(k) − (1/L) ∇F(x^(k))    (11)

▶ This is exactly the update of gradient descent with step ρ = 1/L.
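The majorize-minimize view above can be checked numerically; a minimal sketch on an illustrative quadratic of my choosing (the matrix and point are assumptions):

```python
import numpy as np

# Check that the L-smooth surrogate
#   G(x, x_k) = F(x_k) + grad F(x_k)^T (x - x_k) + (L/2) ||x - x_k||^2
# majorizes F, and that its minimizer is the gradient step x_k - (1/L) grad F(x_k).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, -2.0])
F = lambda x: 0.5 * x @ A @ x - b @ x
gF = lambda x: A @ x - b
L = np.linalg.eigvalsh(A).max()          # gradient Lipschitz constant of F

x_k = np.array([1.0, 1.0])
G = lambda x: F(x_k) + gF(x_k) @ (x - x_k) + 0.5 * L * np.sum((x - x_k) ** 2)

rng = np.random.default_rng(0)
majorizes = all(F(x) <= G(x) + 1e-12 for x in rng.normal(size=(100, 2)))
x_next = x_k - gF(x_k) / L               # minimizer of G = one gradient step
print(majorizes, F(x_next) <= F(x_k))    # True True
```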
[Figure: iterates in the (w, b) plane (contours, iterations 5 to 1000) and gradient norm vs. iterations.]
Questions
▶ Does gradient descent converge to an optimal point?
▶ At which speed is the minimum reached?
▶ How to choose the stepsize ρ(k)?
Theoretical convergence and convergence speed
▶ Fixed steps ρ(k) = ρ ?
▶ Smooth and strongly convex functions ?
▶ Acceleration techniques ?
▶ Adaptive steps ρ(k) (linesearch, next course) ?
2.3.0 - Convergence of gradient descent - - 14/36
Convergence for smooth functions
[Figure: L-smooth quadratic upper bound and convex lower bound on F, for small and large L.]
Convergence of gradient descent for L-smooth functions
If the function F is convex and differentiable and its gradient is Lipschitz with constant L, then gradient descent with fixed step ρ(k) = ρ ≤ 1/L converges to a solution x⋆ of the optimization problem with the following speed:

    F(x^(k)) − F(x⋆) ≤ ∥x^(0) − x⋆∥² / (2ρk)    (12)

▶ Best for ρ = 1/L, the largest step that ensures decrease of the cost.
▶ We say that gradient descent has a convergence rate O(1/k).
▶ In order to reach a precision ϵ one needs O(1/ϵ) iterations.
▶ We prove this result in the next slides 1 .
1 See also : https://www.stat.cmu.edu/ ~ryantibs/convexopt-F13/scribes/lec6.pdf
2.3.1 - Convergence of gradient descent - Convergence for smooth functions - 15/36
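Bound (12) can be checked numerically; a sketch on an illustrative convex quadratic (matrix, starting point and iteration count are my assumptions):

```python
import numpy as np

# Verify F(x^k) - F(x*) <= ||x^0 - x*||^2 / (2 rho k) for fixed step rho = 1/L.
A = np.array([[5.0, 2.0], [2.0, 1.0]])   # symmetric positive definite
b = np.array([1.0, 0.0])
F = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)
L = np.linalg.eigvalsh(A).max()
rho = 1.0 / L

x = np.array([3.0, -3.0])
x0 = x.copy()
ok = True
for k in range(1, 201):
    x = x - rho * (A @ x - b)            # gradient step
    bound = np.sum((x0 - x_star) ** 2) / (2 * rho * k)
    ok = ok and (F(x) - F(x_star) <= bound + 1e-12)
print(ok)  # True: the suboptimality stays below the O(1/k) bound
```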
Convergence proof (convex L-smooth)
Step 1 : Descent vs gradient norm lemma

    F(x^(k+1)) ≤ F(x^(k)) − (ρ/2) ∥∇F(x^(k))∥²    (13)

The value decreases at each iteration for ρ ≤ 1/L.

Proof.

    F(x^(k+1)) ≤ F(x^(k)) + ∇F(x^(k))^⊤(x^(k+1) − x^(k)) + (L/2) ∥x^(k+1) − x^(k)∥²    ²
               = F(x^(k)) + ∇F(x^(k))^⊤(−ρ∇F(x^(k))) + (L/2) ∥−ρ∇F(x^(k))∥²    ³
               = F(x^(k)) − ρ ∥∇F(x^(k))∥² + (Lρ²/2) ∥∇F(x^(k))∥²
               = F(x^(k)) − (ρ/2) ∥∇F(x^(k))∥² (2 − ρL)
               ≤ F(x^(k)) − (ρ/2) ∥∇F(x^(k))∥²    ⁴

² L-smoothness of F    ³ Update x^(k+1) = x^(k) − ρ∇F(x^(k))    ⁴ 2 − ρL ≥ 1 when ρ ≤ 1/L
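The descent lemma (13) also holds for non-quadratic functions; a numerical check on an illustrative 1D example of my choosing:

```python
import numpy as np

# F(t) = log(1 + e^t) is L-smooth with L = 1/4 (F'' = sigma(1-sigma) <= 1/4),
# so the descent lemma (13) gives, for rho <= 1/L = 4,
#   F(t - rho F'(t)) <= F(t) - (rho/2) F'(t)^2.
F = lambda t: np.logaddexp(0.0, t)        # log(1 + e^t), numerically stable
dF = lambda t: 1.0 / (1.0 + np.exp(-t))   # sigmoid
rho = 4.0                                 # the limiting step rho = 1/L

t = np.linspace(-10, 10, 1001)
lhs = F(t - rho * dF(t))                  # value after one gradient step
rhs = F(t) - 0.5 * rho * dF(t) ** 2       # descent lemma bound
holds = bool(np.all(lhs <= rhs + 1e-12))
print(holds)  # True on this grid
```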
Convergence proof (convex L-smooth)
Step 2 : Objective w.r.t. optimal value

    F(x^(k+1)) − F(x⋆) ≤ (1/(2ρ)) (∥x^(k) − x⋆∥² − ∥x^(k+1) − x⋆∥²)    (14)

Proof.
Using convexity one has F(x) ≤ F(x⋆) + ∇F(x)^⊤(x − x⋆), so from (13):

    F(x^(k+1)) ≤ F(x^(k)) − (ρ/2) ∥∇F(x^(k))∥²
               ≤ F(x⋆) + ∇F(x^(k))^⊤(x^(k) − x⋆) − (ρ/2) ∥∇F(x^(k))∥²
    F(x^(k+1)) − F(x⋆) ≤ ∇F(x^(k))^⊤(x^(k) − x⋆) − (ρ/2) ∥∇F(x^(k))∥²
               ≤ (1/(2ρ)) (2ρ ∇F(x^(k))^⊤(x^(k) − x⋆) − ρ² ∥∇F(x^(k))∥² − ∥x^(k) − x⋆∥² + ∥x^(k) − x⋆∥²)
               ≤ (1/(2ρ)) (−∥x^(k) − ρ∇F(x^(k)) − x⋆∥² + ∥x^(k) − x⋆∥²)    ⁵
               = (1/(2ρ)) (∥x^(k) − x⋆∥² − ∥x^(k+1) − x⋆∥²)

⁵ Factorization of ∥x^(k) − ρ∇F(x^(k)) − x⋆∥²

2.3.1 - Convergence of gradient descent - Convergence for smooth functions - 17/36
Convergence proof (convex L-smooth)
Step 3 : Putting all iterations together

    F(x^(k)) − F(x⋆) ≤ ∥x^(0) − x⋆∥² / (2ρk)

Proof.

    F(x^(k)) − F(x⋆) = (1/k) ∑_{i=1}^{k} (F(x^(k)) − F(x⋆))
                     ≤ (1/k) ∑_{i=1}^{k} (F(x^(i)) − F(x⋆))    ⁶
                     ≤ (1/(2ρk)) ∑_{i=1}^{k} (∥x^(i−1) − x⋆∥² − ∥x^(i) − x⋆∥²)    ⁷
                     = (∥x^(0) − x⋆∥² − ∥x^(k) − x⋆∥²) / (2ρk)    ⁸
                     ≤ ∥x^(0) − x⋆∥² / (2ρk)

⁶ Descent lemma (13)    ⁷ Inject Eq. (14)    ⁸ Summation of telescopic series

2.3.1 - Convergence of gradient descent - Convergence for smooth functions - 18/36
Convergence example for smooth function
[Figure: logistic regression data and model in (x, y), with gradient descent iterates in the (w, b) plane.]
Discussion
▶ Steepest descent with fixed step ρ(k) = 0.05.
▶ Non-regularized logistic regression (λ = 0).
▶ Slow O(1/k) convergence of Gradient Descent.
2.3.2 - Convergence of gradient descent - Convergence for strongly convex functions - 20/36
Convergence for strongly convex functions
[Figure: L-smooth upper bound and μ-strongly convex lower bound on F around x⋆, for large and small κ = L/μ.]
2.3.2 - Convergence of gradient descent - Convergence for strongly convex functions - 22/36
Convergence proof (µ-strongly convex, L-smooth)

    F(x^(k)) − F(x⋆) ≤ (1 − µ/L)^k (L/2) ∥x^(0) − x⋆∥²

Proof.
Using the descent lemma (13) with ρ = 1/L:

    F(x^(k)) − F(x^(k−1)) ≤ −(1/(2L)) ∥∇F(x^(k−1))∥²
                          ≤ −(µ/L) (F(x^(k−1)) − F(x⋆))    ⁹
    F(x^(k)) − F(x⋆) ≤ (F(x^(k−1)) − F(x⋆)) − (µ/L) (F(x^(k−1)) − F(x⋆))
                     ≤ (1 − µ/L) (F(x^(k−1)) − F(x⋆))
                     ≤ (1 − µ/L)^k (F(x^(0)) − F(x⋆))
                     ≤ (1 − µ/L)^k (L/2) ∥x^(0) − x⋆∥²

⁹ Use PL inequality (17); the last line uses L-smoothness and ∇F(x⋆) = 0.

2.3.2 - Convergence of gradient descent - Convergence for strongly convex functions - 23/36
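The linear (geometric) rate above can be observed numerically; a sketch on an illustrative µ-strongly convex, L-smooth quadratic of my choosing:

```python
import numpy as np

# Fixed-step GD with rho = 1/L on a strongly convex quadratic, checked against
#   F(x^k) - F(x*) <= (1 - mu/L)^k (F(x^0) - F(x*)).
A = np.array([[3.0, 0.5], [0.5, 1.0]])   # SPD: mu, L are its extreme eigenvalues
b = np.array([1.0, 1.0])
F = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)
eigs = np.linalg.eigvalsh(A)
mu, L = eigs.min(), eigs.max()

x = np.array([4.0, -4.0])
gap0 = F(x) - F(x_star)
ok = True
for k in range(1, 101):
    x = x - (A @ x - b) / L              # gradient step with rho = 1/L
    ok = ok and (F(x) - F(x_star) <= (1 - mu / L) ** k * gap0 + 1e-12)
print(ok)  # True: linear convergence with rate (1 - mu/L)
```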
Convergence example for strongly convex function
[Figure: regularized logistic regression data and model in (x, y), with gradient descent iterates in the (w, b) plane.]
Discussion
▶ Steepest descent with fixed step ρ(k) = 0.02.
▶ Fully regularized logistic regression (λ = 1 for w and b).
▶ L-smooth and µ-strongly convex upper bounds.
▶ Fast O(e^(−k/κ)) convergence of Gradient Descent.
2.3.2 - Convergence of gradient descent - Convergence for strongly convex functions - 24/36
How to make Gradient Descent faster?
[Figure: steepest descent iterates in the (w, b) plane, optimization cost and gradient norm vs. iterations.]
Acceleration techniques
▶ Use adaptive stepsizes (smarter ρ(k)).
▶ Use momentum (remember previous gradients).
▶ Use second order information (Newton, quasi-Newton).
▶ Speed up the gradient computation (stochastic gradient: slower convergence, but much cheaper iterations).
2.4.0 - Gradient descent acceleration - - 25/36
Barzilai-Borwein stepsize (BB-rule)
Principle [Barzilai and Borwein, 1988]
▶ Use the current and the previous gradient to compute the stepsize.
▶ It is a two-point approximation of the secant method (to cancel the gradient).
▶ The stepsize is computed as:
▶ Long BB stepsize:

    ρ^(k) = (∆x^⊤∆x) / (∆x^⊤∆g)    (18)

▶ Short BB stepsize:

    ρ^(k) = (∆x^⊤∆g) / (∆g^⊤∆g)    (19)

▶ where ∆x = x^(k) − x^(k−1) and ∆g = ∇F(x^(k)) − ∇F(x^(k−1)).
▶ The stepsize can be clipped to avoid too large steps (or combined with linesearch).
▶ Convergence for quadratic [Raydan, 1993] and non-quadratic functions [Raydan, 1997] with linesearch.
▶ Variants used for hyperparameter-free optimization with provably better constants.
▶ Discussed in more detail in the next courses.
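A minimal sketch of the long BB stepsize (18) on an illustrative quadratic (the matrix and the small first fixed step are my assumptions); on quadratics BB converges [Raydan, 1993]:

```python
import numpy as np

# Gradient descent with the long Barzilai-Borwein stepsize
#   rho_k = (dx^T dx) / (dx^T dg), dx = x_k - x_{k-1}, dg = g_k - g_{k-1}.
A = np.array([[10.0, 2.0], [2.0, 1.0]])   # SPD, so dx^T dg = dx^T A dx > 0
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)

x_prev = np.array([5.0, 5.0])
g_prev = grad(x_prev)
x = x_prev - 0.01 * g_prev                # first iteration: small fixed step
for _ in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-12:         # converged: avoid a 0/0 stepsize
        break
    dx, dg = x - x_prev, g - g_prev
    rho = (dx @ dx) / (dx @ dg)           # long BB stepsize (18)
    x_prev, g_prev = x, g
    x = x - rho * g

err = np.linalg.norm(x - x_star)
print(err)
```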
[Figure: BB-rule iterates in the (w, b) plane and gradient norm vs. iterations.]
Discussion
▶ GD and the first step of the BB rule use step ρ(k) = 0.01.
▶ The acceleration is important w.r.t. the steepest descent step.
▶ Can be unstable: the stepsize may become too large and increase the loss.
▶ The BB rule is best used with linesearch (see next course).
Accelerated Gradient Descent [Nesterov, 1983]

    F(x^(k)) − F(x⋆) ≤ 2L ∥x^(0) − x⋆∥² / k²    (20)

▶ The convergence speed O(1/k²) is optimal for a first order method.
[Figure: GD and AGD iterates in the (w, b) plane and optimization cost vs. iterations.]
Discussion
▶ Both GD and AGD use fixed step ρ(k) = 0.1.
▶ The acceleration speedup is important w.r.t. the steepest descent step.
▶ The momentum due to the Nesterov acceleration can be seen in the trajectory.
▶ Non monotonic convergence, but faster than GD.
▶ Complexity O(nd) per iteration when no line search is used.
2.4.2 - Gradient descent acceleration - Accelerated Gradient Descent - 29/36
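A sketch of one common Nesterov variant, with momentum coefficient (k−1)/(k+2) (my choice; the ill-conditioned quadratic and iteration budget are also illustrative assumptions), compared against plain GD:

```python
import numpy as np

# Nesterov acceleration:
#   y = x_k + (k-1)/(k+2) (x_k - x_{k-1});  x_{k+1} = y - rho * grad F(y),
# vs. plain GD on a quadratic with L = 100, mu = 1 (condition number 100).
A = np.diag([100.0, 1.0])
b = np.array([1.0, 1.0])
F = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)
rho = 1.0 / 100.0                         # fixed step rho = 1/L

x_gd = np.array([5.0, 5.0])
x_agd, x_old = np.array([5.0, 5.0]), np.array([5.0, 5.0])
for k in range(1, 201):
    x_gd = x_gd - rho * grad(x_gd)                      # plain gradient step
    y = x_agd + (k - 1) / (k + 2) * (x_agd - x_old)     # momentum extrapolation
    x_old, x_agd = x_agd, y - rho * grad(y)             # gradient step at y

gap_gd = F(x_gd) - F(x_star)
gap_agd = F(x_agd) - F(x_star)
print(gap_gd, gap_agd)  # AGD reaches a much smaller suboptimality here
```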
Least squares and ridge regression

    min_w F(w) = (1/n) ∑_{i=1}^{n} (w^⊤x_i − y_i)² + λ∥w∥²    (21)
2.5.1 - Smooth machine learning problems - Least Squares and Ridge regression - 30/36
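For the ridge problem (21) the gradient is ∇F(w) = (2/n) X^⊤(Xw − y) + 2λw and the solution is available in closed form, which makes it a convenient sanity check for GD; the data below are an illustrative assumption:

```python
import numpy as np

# GD with rho = 1/L on ridge regression (21) reaches the closed-form solution
#   w* = ((1/n) X^T X + lam I)^{-1} (1/n) X^T y.
rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

H = 2.0 * (X.T @ X / n + lam * np.eye(d))   # Hessian of F (constant)
L = np.linalg.eigvalsh(H).max()
w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

w = np.zeros(d)
for _ in range(500):
    g = 2.0 * (X.T @ (X @ w - y) / n + lam * w)   # gradient of F at w
    w = w - g / L
err = np.linalg.norm(w - w_star)
print(err)  # tiny: GD matches the closed form
```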
Logistic regression

    min_w F(w) = (1/n) ∑_{i=1}^{n} log(1 + exp(−y_i w^⊤x_i)) + λ∥w∥²    (22)
▶ Logistic regression:

    min_w F(w) = (1/n) ∑_{i=1}^{n} log(1 + exp(−y_i w^⊤x_i)) + λ∥w∥²
Your mission
▶ Implement the loss functions f and gradients df for the three problems.
▶ Implement the gradient descent algorithm (and its accelerated variant).
▶ Compare the convergence speed of the three algorithms.
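One possible skeleton for the lab: a generic fixed-step gradient descent taking a loss `f` and gradient `df` (the names used in the mission statement); the quadratic test problem at the bottom is only an illustrative self-check, not one of the three problems:

```python
import numpy as np

def gradient_descent(f, df, x0, rho=0.1, n_iter=1000):
    """Minimize f by fixed-step gradient descent; return iterate and f history."""
    x = np.asarray(x0, dtype=float)
    history = [f(x)]
    for _ in range(n_iter):
        x = x - rho * df(x)          # fixed-step update
        history.append(f(x))
    return x, history

# Sanity check on f(x) = ||x - 1||^2 (minimizer: the all-ones vector)
f = lambda x: np.sum((x - 1.0) ** 2)
df = lambda x: 2.0 * (x - 1.0)
x_opt, hist = gradient_descent(f, df, np.zeros(3), rho=0.1, n_iter=200)
print(np.round(x_opt, 6))  # ≈ [1. 1. 1.]
```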
Bertsekas, D. P. (1997). Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334.

Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547.

Raydan, M. (1997). The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization, 7(1):26–33.

Walkington, N. J. (2023). Nesterov's method for convex optimization. SIAM Review, 65(2):539–562.