
Optimization for data science

Smooth optimization: Gradient descent

R. Flamary

Master Data Science, Institut Polytechnique de Paris

September 17, 2024


Full course overview
1. Introduction to optimization for data science
1.1 ML optimization problems and linear algebra recap
1.2 Optimization problems and their properties (Convexity, smoothness)
2. Smooth optimization : Gradient descent
2.1 First order algorithms, convergence for smooth and strongly convex functions
3. Smooth Optimization : Quadratic problems
3.1 Solvers for quadratic problems, conjugate gradient
3.2 Linesearch methods
4. Non-smooth Optimization : Proximal methods
4.1 Proximal operator and proximal algorithms
4.2 Lab 1: Lasso and group Lasso
5. Stochastic Gradient Descent
5.1 SGD and variance reduction techniques
5.2 Lab 2: SGD for Logistic regression
6. Standard formulation of constrained optimization problems
6.1 LP, QP and Mixed Integer Programming
7. Coordinate descent
7.1 Algorithms and Labs
8. Newton and quasi-newton methods
8.1 Second order methods and Labs
9. Beyond convex optimization
9.1 Nonconvex reg., Frank-Wolfe, DC programming, autodiff
Current course overview
1. Introduction to optimization 4
2. Smooth optimization : Gradient descent 4
2.1 Iterative optimization 4
2.1.1 Optimization problems and properties
2.1.2 Iterative optimization for smooth functions
2.2 (Steepest) Gradient descent 10
2.2.1 Gradient Descent Algorithm
2.2.2 Majorization-minimization view
2.3 Convergence of gradient descent 16
2.3.1 Convergence for smooth functions
2.3.2 Convergence for strongly convex functions
2.4 Gradient descent acceleration 42
2.4.1 Barzilai-Borwein stepsize
2.4.2 Accelerated Gradient Descent
2.5 Smooth machine learning problems 48
2.5.1 Least Squares and Ridge regression
2.5.2 Logistic regression
3. Smooth Optimization : Quadratic problems 51
4. Non-smooth optimization : Proximal methods 51
5. Stochastic Gradient Descent 51
6. Standard formulation of constrained optimization problems 51
7. Coordinate descent 51
8. Newton and quasi-newton methods 51
9. Beyond convex optimization 51
Smooth Optimization problem
[Figure: surface plots of a convex function and two nonconvex functions]
Optimization problem

  min_{x ∈ Rn} F(x)   (1)

▶ F is L-smooth (at least differentiable).
▶ When F is convex, x⋆ is a solution of the problem if and only if
  ∇F(x⋆) = 0
▶ When F is nonconvex, x⋆ can be a local minimizer of the problem only if
  ∇F(x⋆) = 0 and ∇²F(x⋆) ⪰ 0 (necessary conditions)

How to solve optimization problems?

▶ Solve the problem analytically: ∇F(x⋆) = 0
▶ Search for a solution numerically: iterative optimization algorithms
2.1.1 - Iterative optimization - Optimization problems and properties - 4/36
Iterative optimization algorithms

min F (x),
x∈Rn

Iterative algorithms
▶ Principle : start from an initial point x(0) and iterate to make it better.
▶ Gradient descent (and variants) when available, proximal methods.
▶ Black box optimization (a.k.a derivative free optimization) :
▶ Genetic, random search, simulated annealing [Gen and Cheng, 1999].
▶ Particle swarm optimization, etc [Kennedy and Eberhart, 1995].
▶ Nelder-Mead simplex [Nelder and Mead, 1965].

How to choose?
▶ No free lunch theorem [Wolpert and Macready, 1997] :
No algorithm is better than the others for all problems.
▶ But one can use the properties of the problem to choose the algorithm: specialize!

2.1.1 - Iterative optimization - Optimization problems and properties - 5/36


Assumption 1 : Convexity
[Figure: a convex function lies below its chords]

Convex function (recap)

▶ Function F is convex if it lies below its chords, that is ∀x, y ∈ Rn:
  F(αx + (1 − α)y) ≤ αF(x) + (1 − α)F(y), with 0 ≤ α ≤ 1.   (2)
▶ A differentiable function F is convex if and only if
  F(y) ≥ F(x) + ∇F(x)⊤(y − x), ∀y, x ∈ domF   (3)
▶ For C = Rn, x is a global minimum if and only if ∇F(x) = 0.
▶ F is µ-strongly convex with µ > 0 if it satisfies ∀x, y ∈ Rn and 0 ≤ α ≤ 1:
  F(αx + (1 − α)y) ≤ αF(x) + (1 − α)F(y) − (µ/2)α(1 − α)∥x − y∥²   (4)
2.1.1 - Iterative optimization - Optimization problems and properties - 6/36
Assumption 2 : smoothness
[Figure: an L-smooth function and its quadratic upper bound]

L-smooth function (recap)

▶ Function F is gradient Lipschitz, also called L-smooth, if ∀x, y:
  ∥∇F(x) − ∇F(y)∥ ≤ L∥x − y∥   (5)
▶ If F is L-smooth, then the following inequality holds:
  F(x) ≤ F(y) + ∇F(y)⊤(x − y) + (L/2)∥x − y∥²   (6)
▶ If F is twice differentiable and L-smooth, then:
  ∇²F(x) ⪯ LI, i.e. λmax(∇²F(x)) ≤ L   (7)
2.1.1 - Iterative optimization - Optimization problems and properties - 7/36


Descent algorithm for smooth optimization
[Figure: descent direction d and gradient ∇F(x(k)) at iterations k and k + 1; a too-large step size overshoots the minimum]

General iterative algorithm


1: Initialize x(0)
2: for k = 0, 1, 2, . . . do
3: d(k) ← Compute descent direction from x(k)
4: ρ(k) ← Choose stepsize
5: x(k+1) ← x(k) + ρ(k) d(k)
6: end for
▶ x(k) ∈ Rn is the current iterate.
▶ d(k) ∈ Rn is a descent direction if ∇F(x(k))⊤d(k) < 0.
▶ For a small enough step, each iteration decreases the cost: F(x(k+1)) ≤ F(x(k))
▶ Stopping conditions: max number of iterations or small gradient ∥∇F(x(k))∥.
2.1.2 - Iterative optimization - Iterative optimization for smooth functions - 8/36
Gradient Descent (GD) algorithm
[Figure: gradient descent iterations k and k + 1 with direction d = −∇F(x(k)); a too-large step size overshoots]

Gradient descent algorithm (steepest descent)


1: Initialize x(0)
2: for k = 0, 1, 2, . . . do
3: d(k) ← −∇F (x(k) )
4: ρ(k) ← Choose stepsize
5: x(k+1) ← x(k) + ρ(k) d(k)
6: end for
▶ Iterative algorithm with descent direction d = −∇F(x).
▶ −∇F(x) is called the steepest descent direction.
▶ In 1D it is equivalent to the general iterative algorithm above, since any descent direction is a positive multiple of −∇F(x).
▶ In this course we study the constant step case ρ(k) = ρ.
2.2.1 - (Steepest) Gradient descent - Gradient Descent Algorithm - 9/36
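The gradient descent loop above is short to implement. A minimal NumPy sketch, assuming an illustrative test function (the quadratic below and all names are for illustration, not the course labs):

```python
import numpy as np

def gradient_descent(grad_F, x0, rho, n_iter=100, tol=1e-8):
    """Fixed-step gradient descent: x <- x - rho * grad_F(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad_F(x)
        if np.linalg.norm(g) < tol:  # small-gradient stopping condition
            break
        x = x - rho * g
    return x

# Illustrative problem: F(x) = ||x||^2 / 2 with gradient x, so L = 1 and rho <= 1 works.
x_star = gradient_descent(lambda x: x, x0=[3.0, -2.0], rho=0.5)
```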
Example optimization problem
[Figure: training dataset (x, y) and the cost function surface over (w, b)]

1D Logistic regression
  min_{w,b} Σ_{i=1}^n log(1 + exp(−yi(wxi + b))) + (λ/2)w²

▶ Linear prediction model : f (x) = wx + b


▶ Training data (xi , yi ) : (1, −1), (2, −1), (3, 1), (4, 1).
▶ Problem solution for λ = 1 : x∗ = [w⋆ , b⋆ ] = [0.96, −2.40]
▶ Initialization : x(0) = [1, −0.5].
▶ Complexity : Cost and gradient both O(nd)
2.2.1 - (Steepest) Gradient descent - Gradient Descent Algorithm - 10/36
Example of steepest descent
[Figure: steepest descent trajectory over (w, b) and optimization cost / gradient norm vs. iterations]

Discussion
▶ Steepest descent with fixed step ρ(k) = 0.1
▶ Slow convergence around the solution (small gradients).
▶ After 1000 iterations, still not converged.
▶ Complexity O(nd) per iteration.

2.2.1 - (Steepest) Gradient descent - Gradient Descent Algorithm - 11/36


Majorization Minimization (MM) algorithm
[Figure: MM algorithm minimizing the surrogate G(·, x(k)) at iterations k and k + 1]

Principle
▶ Iterative algorithm that minimizes a surrogate function.
▶ Let F be a function to minimize and G a majorization: F (x) ≤ G(x, y) ∀x, y, with G(y, y) = F (y).
▶ MM iteration :
x(k+1) = argmin G(x, x(k) ) (8)
x

▶ The MM algorithm is guaranteed to decrease the cost function at each iteration.


▶ Most efficient when G is close to F , but simple to compute and optimize.
▶ References : [Hunter and Lange, 2004, Sun et al., 2016].
2.2.2 - (Steepest) Gradient descent - Majorization-minimization view - 12/36
Majorization Minimization for smooth functions

Majorization of L-smooth functions

If F is L-smooth, then the following majorization holds:

  F(x) ≤ G(x, y) = F(y) + ∇F(y)⊤(x − y) + (L/2)∥x − y∥²   (9)

Solving the MM iteration with quadratic upper bound

  x(k+1) = argmin_x F(x(k)) + ∇F(x(k))⊤(x − x(k)) + (L/2)∥x − x(k)∥²   (10)

▶ The MM iteration is a quadratic problem that can be solved analytically.
▶ The solution is given by:

  x(k+1) = x(k) − (1/L)∇F(x(k))   (11)

▶ This is exactly the gradient descent update with step ρ = 1/L.

2.2.2 - (Steepest) Gradient descent - Majorization-minimization view - 13/36
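The equivalence between (10) and (11) is easy to check numerically: the gradient of the quadratic surrogate vanishes at the gradient step. An illustrative sketch (the smooth test function below is an assumption, not from the slides):

```python
import numpy as np

# Sketch: the surrogate minimizer in (10) is the gradient step (11).
# Illustrative smooth F(x) = 0.5 x^T A x + b^T x, with grad F(x) = A x + b.
rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))
A = M.T @ M                           # symmetric positive semidefinite
b = rng.standard_normal(3)
L = np.linalg.eigvalsh(A).max()       # smoothness constant of F
grad_F = lambda x: A @ x + b

y = rng.standard_normal(3)
x_next = y - grad_F(y) / L            # Eq. (11): gradient step with rho = 1/L
# Gradient of the surrogate G(., y) at x_next: grad_F(y) + L (x - y) must vanish.
grad_G = grad_F(y) + L * (x_next - y)
```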




Convergence of gradient descent
[Figure: steepest descent trajectory over (w, b) and optimization cost / gradient norm vs. iterations]

Questions
▶ Does gradient descent converge to an optimal point?
▶ At which speed is the minimum reached?
▶ How to choose the stepsize ρ(k)?
Theoretical convergence and convergence speed
▶ Fixed steps ρ(k) = ρ?
▶ Smooth and strongly convex functions?
▶ Acceleration techniques?
▶ Adaptive steps ρ(k) (linesearch, next course)?
2.3.0 - Convergence of gradient descent - - 14/36
Convergence for smooth functions
[Figure: L-smooth upper bound and convex lower bound, for small and large L]

Convergence of gradient descent for L-smooth functions

If function F is convex and differentiable and its gradient is Lipschitz with constant L, then gradient descent with fixed step ρ(k) = ρ ≤ 1/L converges to a solution x⋆ of the optimization problem with the following speed:

  F(x(k)) − F(x⋆) ≤ ∥x(0) − x⋆∥²/(2ρk)   (12)

▶ Best for ρ = 1/L, the largest step that ensures decrease of the cost.
▶ We say that gradient descent has a convergence rate O(1/k).
▶ In order to reach a precision ϵ one needs O(1/ϵ) iterations.
▶ We prove this result in the next slides¹.
¹ See also: https://www.stat.cmu.edu/~ryantibs/convexopt-F13/scribes/lec6.pdf
2.3.1 - Convergence of gradient descent - Convergence for smooth functions - 15/36
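The O(1/k) bound of Eq. (12) can be checked numerically. A small sketch on an illustrative convex quadratic (the test problem and names are assumptions, not from the slides):

```python
import numpy as np

# Numerical check of Eq. (12) on an illustrative convex quadratic
# F(x) = 0.5 x^T A x, so x* = 0, F(x*) = 0 and L = largest eigenvalue of A.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M.T @ M                          # symmetric positive semidefinite
L = np.linalg.eigvalsh(A).max()
rho = 1.0 / L                        # best fixed step from the theorem

F = lambda x: 0.5 * x @ A @ x
x0 = rng.standard_normal(5)
x = x0.copy()
for k in range(1, 201):
    x = x - rho * (A @ x)            # gradient step, grad F(x) = A x
    bound = (x0 @ x0) / (2 * rho * k)
    assert F(x) <= bound + 1e-12     # the O(1/k) bound holds at every k
```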
Convergence proof (convex L-smooth)
Step 1 : Descent VS gradient norm Lemma

  F(x(k+1)) ≤ F(x(k)) − (ρ/2)∥∇F(x(k))∥²   (13)

Value decreases at each iteration for ρ ≤ 1/L.

Proof.
  F(x(k+1)) ≤ F(x(k)) + ∇F(x(k))⊤(x(k+1) − x(k)) + (L/2)∥x(k+1) − x(k)∥²   (2)
           = F(x(k)) + ∇F(x(k))⊤(−ρ∇F(x(k))) + (L/2)∥−ρ∇F(x(k))∥²   (3)
           = F(x(k)) − ρ∥∇F(x(k))∥² + (Lρ²/2)∥∇F(x(k))∥²
           = F(x(k)) − (ρ/2)∥∇F(x(k))∥²(2 − ρL)
           ≤ F(x(k)) − (ρ/2)∥∇F(x(k))∥²   (4)

(2) Smoothness upper bound (6) at x(k)
(3) Inject gradient step x(k+1) = x(k) − ρ∇F(x(k))
(4) For ρ ≤ 1/L, −(2 − ρL) ≤ −1
2.3.1 - Convergence of gradient descent - Convergence for smooth functions - 16/36
Convergence proof (convex L-smooth)
Step 2 : Objective w.r.t. optimal value

  F(x(k+1)) − F(x⋆) ≤ (1/(2ρ))(∥x(k) − x⋆∥² − ∥x(k+1) − x⋆∥²)   (14)

Proof.
Using convexity one has F(x(k)) ≤ F(x⋆) + ∇F(x(k))⊤(x(k) − x⋆), so from (13):

  F(x(k+1)) ≤ F(x(k)) − (ρ/2)∥∇F(x(k))∥²
           ≤ F(x⋆) + ∇F(x(k))⊤(x(k) − x⋆) − (ρ/2)∥∇F(x(k))∥²
  F(x(k+1)) − F(x⋆) ≤ ∇F(x(k))⊤(x(k) − x⋆) − (ρ/2)∥∇F(x(k))∥²
           = (1/(2ρ))(2ρ∇F(x(k))⊤(x(k) − x⋆) − ρ²∥∇F(x(k))∥² − ∥x(k) − x⋆∥² + ∥x(k) − x⋆∥²)
           = (1/(2ρ))(−∥x(k) − ρ∇F(x(k)) − x⋆∥² + ∥x(k) − x⋆∥²)   (5)
           = (1/(2ρ))(∥x(k) − x⋆∥² − ∥x(k+1) − x⋆∥²)

(5) Expand the square ∥x(k) − ρ∇F(x(k)) − x⋆∥²
2.3.1 - Convergence of gradient descent - Convergence for smooth functions - 17/36
Convergence proof (convex L-smooth)
Step 3 : Putting all iterations together

  F(x(k)) − F(x⋆) ≤ ∥x(0) − x⋆∥²/(2ρk)

Proof.
  F(x(k)) − F(x⋆) = (1/k) Σ_{i=1}^k (F(x(k)) − F(x⋆))
                 ≤ (1/k) Σ_{i=1}^k (F(x(i)) − F(x⋆))   (6)
                 ≤ (1/(2ρk)) Σ_{i=1}^k (∥x(i−1) − x⋆∥² − ∥x(i) − x⋆∥²)   (7)
                 = (∥x(0) − x⋆∥² − ∥x(k) − x⋆∥²)/(2ρk)   (8)
                 ≤ ∥x(0) − x⋆∥²/(2ρk)

(6) Descent Lemma (13): F(x(k)) ≤ F(x(i)) for i ≤ k
(7) Inject Eq. (14)
(8) Summation of telescoping series
2.3.1 - Convergence of gradient descent - Convergence for smooth functions - 18/36
Convergence example for smooth function

[Figure: training dataset and the L-smooth cost function surface over (w, b)]

Discussion
▶ Steepest descent with fixed step ρ(k) = 0.05
▶ Non regularized logistic regression (λ = 0).
▶ Slow O( k1 ) convergence of Gradient Descent.

2.3.1 - Convergence of gradient descent - Convergence for smooth functions - 19/36




Assumption 3 : Strong convexity
[Figure: a µ-strongly convex function, its chord upper bound, and its quadratic lower bound]

µ-strongly convex function (recap)

▶ F is µ-strongly convex with µ > 0 if it satisfies ∀x, y ∈ Rn and 0 ≤ α ≤ 1:
  F(αx + (1 − α)y) ≤ αF(x) + (1 − α)F(y) − (µ/2)α(1 − α)∥x − y∥²   (15)
▶ If F is differentiable and µ-strongly convex then:
  F(y) ≥ F(x) + ∇F(x)⊤(y − x) + (µ/2)∥y − x∥², ∀y, x ∈ domF
▶ Strongly convex functions have a unique minimum x⋆.
2.3.2 - Convergence of gradient descent - Convergence for strongly convex functions - 20/36
Convergence for strongly convex functions
[Figure: L-smooth upper bound and µ-strongly convex lower bound, for large and small κ = L/µ]

Convergence of gradient descent for µ-strongly convex functions

If function F is L-smooth and µ-strongly convex, then gradient descent with fixed step ρ(k) = ρ = 1/L converges to the solution x⋆ of the optimization problem with the following speed:

  F(x(k)) − F(x⋆) ≤ (1 − µ/L)^k (F(x(0)) − F(x⋆))   (16)

▶ For a twice differentiable F, one can take µ = min_x λmin(∇²F(x)) and L = max_x λmax(∇²F(x)).
▶ The condition number κ = L/µ ≥ 1 has an important impact (close to 1 means faster convergence).
▶ We say that gradient descent has a convergence rate O(e^{−k/κ}).
▶ In order to reach a precision ϵ one needs O(log(1/ϵ)) iterations.
2.3.2 - Convergence of gradient descent - Convergence for strongly convex functions - 21/36
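The linear rate of Eq. (16) can also be verified numerically. A sketch on an illustrative strongly convex quadratic (the test problem and names are assumptions, not from the slides):

```python
import numpy as np

# Numerical check of Eq. (16) on an illustrative strongly convex quadratic
# F(x) = 0.5 x^T A x with A positive definite; mu, L are its extreme eigenvalues.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M.T @ M + 0.5 * np.eye(4)        # shift to enforce positive definiteness
eigs = np.linalg.eigvalsh(A)
mu, L = eigs.min(), eigs.max()

F = lambda x: 0.5 * x @ A @ x        # minimum value F(x*) = 0 at x* = 0
x0 = rng.standard_normal(4)
x = x0.copy()
for k in range(1, 101):
    x = x - (1.0 / L) * (A @ x)      # fixed step rho = 1/L
    assert F(x) <= (1 - mu / L) ** k * F(x0) + 1e-12  # linear rate of Eq. (16)
```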
Convergence proof (µ-strongly convex, L-smooth)
[Figure: the suboptimality bound F(x⋆) + ∥∇F(x)∥²/(2µ) lies above F around x⋆]

Polyak-Łojasiewicz (PL) inequality

If F is a µ-strongly convex function and x⋆ its optimal point, then ∀x:

  F(x) − F(x⋆) ≤ (1/(2µ))∥∇F(x)∥²   (17)

Proof.
Exercise 3 in class. Hints:
▶ Use the strong convexity lower bound.
▶ Set y = x − (1/µ)∇F(x).
▶ Inject the optimal point x⋆.
2.3.2 - Convergence of gradient descent - Convergence for strongly convex functions - 22/36
Convergence proof (µ-strongly convex, L-smooth)

  F(x(k)) − F(x⋆) ≤ (1 − µ/L)^k (F(x(0)) − F(x⋆))

Proof.
Using the descent lemma (13) with ρ = 1/L:

  F(x(k)) − F(x(k−1)) ≤ −(1/(2L))∥∇F(x(k−1))∥²
                     ≤ −(µ/L)(F(x(k−1)) − F(x⋆))   (9)
  F(x(k)) − F(x⋆) ≤ F(x(k−1)) − F(x⋆) − (µ/L)(F(x(k−1)) − F(x⋆))
                 = (1 − µ/L)(F(x(k−1)) − F(x⋆))
                 ≤ (1 − µ/L)^k (F(x(0)) − F(x⋆))

(9) Use PL inequality (17)
2.3.2 - Convergence of gradient descent - Convergence for strongly convex functions - 23/36
Convergence example for strongly convex function

[Figure: training dataset and the µ-strongly convex cost function surface over (w, b)]

Discussion
▶ Steepest descent with fixed step ρ(k) = 0.02
▶ Fully regularized logistic regression (λ = 1 for w and b).
▶ L-smooth and µ-strongly convex upper bounds.
▶ Fast O(e−k/κ ) convergence of Gradient Descent.

2.3.2 - Convergence of gradient descent - Convergence for strongly convex functions - 24/36
How to make Gradient Descent faster?
[Figure: steepest descent trajectory over (w, b) and optimization cost / gradient norm vs. iterations]

Gradient descent is slow

▶ Unless the function is strongly convex, it only has O(1/k) convergence.
▶ Needs to recompute the gradient at each iteration (O(nd) in ERM).

Acceleration techniques
▶ Use adaptive stepsizes (smarter ρ(k)).
▶ Use momentum (remember previous gradients).
▶ Use second order information (Newton, quasi-Newton).
▶ Speed up gradient computation (stochastic gradient: slower per-iteration progress, but much cheaper iterations).
2.4.0 - Gradient descent acceleration - - 25/36
Barzilai-Borwein stepsize (BB-rule)
Principle [Barzilai and Borwein, 1988]
▶ Use the gradient and the previous gradient to compute the stepsize.
▶ It is a two-step approximation of the secant method (to cancel the gradient).
▶ The stepsize is computed as:
▶ Long BB stepsize:
  ρ(k) = ∆x⊤∆x / ∆x⊤∆g   (18)
▶ Short BB stepsize:
  ρ(k) = ∆x⊤∆g / ∆g⊤∆g   (19)
▶ where ∆x = x(k) − x(k−1) and ∆g = ∇F(x(k)) − ∇F(x(k−1)).
▶ The stepsize can be clipped to avoid too large steps (or combined with linesearch).
▶ Convergence for quadratic [Raydan, 1993] and non-quadratic functions
[Raydan, 1997] with linesearch.
▶ Variants used for hyperparameter-free optimization with provably better constants.
▶ Discussed in more detail in the next courses.

2.4.1 - Gradient descent acceleration - Barzilai-Borwein stepsize - 26/36
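The BB update can be sketched in a few lines. A minimal version with the long stepsize (18), without the clipping or linesearch safeguards mentioned above (the test problem and function names are illustrative assumptions):

```python
import numpy as np

def bb_gradient_descent(grad_F, x0, rho0=0.01, n_iter=200):
    """Gradient descent with the long BB stepsize (18); the first step uses a
    fixed rho0, and no clipping/linesearch safeguard is applied."""
    x = np.asarray(x0, dtype=float)
    g = grad_F(x)
    x_new = x - rho0 * g                         # first step: fixed stepsize
    for _ in range(n_iter):
        g_new = grad_F(x_new)
        dx, dg = x_new - x, g_new - g
        denom = dx @ dg
        rho = (dx @ dx) / denom if denom > 1e-18 else rho0  # long BB step
        x, g = x_new, g_new
        x_new = x - rho * g
    return x_new

# Illustrative quadratic F(x) = 0.5 x^T diag(1, 10) x with gradient A x.
A = np.diag([1.0, 10.0])
sol = bb_gradient_descent(lambda x: A @ x, x0=[1.0, 1.0])
```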


Example of BB rule for Gradient Descent
[Figure: BB-rule gradient descent trajectory over (w, b) and optimization cost / gradient norm vs. iterations, compared to GD]

Discussion
▶ GD and the first step of the BB rule use step ρ(k) = 0.01.
▶ The acceleration w.r.t. the steepest descent step is important.
▶ Unstable: the stepsize can be too large and lead to an increase of the loss.
▶ The BB rule is best used with linesearch (see next course).

2.4.1 - Gradient descent acceleration - Barzilai-Borwein stepsize - 27/36


Accelerated gradient descent

Accelerated gradient descent (AGD) [Nesterov, 1983, Walkington, 2023]

1: Initialize x(0), y(0) = x(0), α(0) = 0 and ρ ≤ 1/L
2: for k = 0, 1, 2, . . . do
3:   y(k+1) ← x(k) − ρ∇F(x(k))
4:   α(k+1) ← (1 + √(1 + 4(α(k))²))/2
5:   x(k+1) ← y(k+1) + ((α(k) − 1)/α(k+1))(y(k+1) − y(k))
6: end for
▶ Also called Nesterov accelerated gradient (NAG).
▶ Acceleration of gradient descent with momentum.
▶ The update is a gradient step (y(k+1)) plus a momentum term from the previous step.
▶ The algorithm has a O(1/k²) convergence for L-smooth functions and ρ = 1/L:

  F(x(k)) − F(x⋆) ≤ 2L∥x(0) − x⋆∥²/k²   (20)

▶ The convergence speed O(1/k²) is optimal for a first order method.

2.4.2 - Gradient descent acceleration - Accelerated Gradient Descent - 28/36
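The AGD iterations above can be sketched as follows (illustrative NumPy code; the quadratic test problem is an assumption, not from the slides):

```python
import numpy as np

def accelerated_gd(grad_F, x0, rho, n_iter=200):
    """Nesterov accelerated gradient with the alpha schedule from the slides."""
    x = np.asarray(x0, dtype=float)
    y, alpha = x.copy(), 0.0
    for _ in range(n_iter):
        y_new = x - rho * grad_F(x)                          # gradient step
        alpha_new = (1 + np.sqrt(1 + 4 * alpha ** 2)) / 2    # momentum schedule
        x = y_new + ((alpha - 1) / alpha_new) * (y_new - y)  # momentum step
        y, alpha = y_new, alpha_new
    return y

# Illustrative quadratic F(x) = 0.5 x^T A x with L = 10, so rho = 1/L = 0.1.
A = np.diag([1.0, 10.0])
sol = accelerated_gd(lambda x: A @ x, x0=[1.0, 1.0], rho=0.1)
```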


Example of Accelerated Gradient Descent
[Figure: GD vs. accelerated GD trajectories over (w, b) and optimization cost / gradient norm vs. iterations]

Discussion
▶ Both GD and AGD use fixed step ρ(k) = 0.1.
▶ The acceleration speedup is important w.r.t. the steepest descent step.
▶ The momentum due to the Nesterov acceleration can be seen in the trajectory.
▶ Non-monotonic convergence, but faster than GD.
▶ Complexity O(nd) per iteration when no linesearch is used.
2.4.2 - Gradient descent acceleration - Accelerated Gradient Descent - 29/36
Least squares and ridge regression

  min_w F(w) = (1/n) Σ_{i=1}^n (w⊤xi − yi)² + λ∥w∥²   (21)

▶ Training dataset {(xi, yi)}_{i=1}^n with yi ∈ R and xi, w ∈ Rd.
▶ Least Squares (λ = 0) and Ridge regression (λ > 0).
▶ Prediction is done with ŷ = w⊤x.

Exercise 1: Linear regression


1. Reformulate the least squares objective as the squared norm of the residual
vector of prediction errors.
2. Compute the gradients for least squares and ridge regression.
3. Express the Hessian and compute the smoothness constant L and strong convexity
constant µ for least squares and ridge regression.

2.5.1 - Smooth machine learning problems - Least Squares and Ridge regression - 30/36
Logistic regression

  min_w F(w) = (1/n) Σ_{i=1}^n log(1 + exp(−yi w⊤xi)) + λ∥w∥²   (22)

▶ Training dataset {(xi, yi)}_{i=1}^n with yi ∈ {−1, 1} and xi, w ∈ Rd.
▶ Regularized logistic regression (λ > 0).
▶ Prediction is done with ŷ = sign(w⊤x).

Exercise 2: Logistic regression


1. Compute the gradients for the logistic regression.
2. Express the Hessian and compute the Lipschitz constant L and µ for the logistic
regression.

2.5.2 - Smooth machine learning problems - Logistic regression - 31/36


Lab: Gradient Descent

For the optimization problems

▶ Least squares regression and Ridge regression:
  min_w F(w) = (1/n) Σ_{i=1}^n (w⊤xi − yi)² + λ∥w∥²
▶ Logistic regression:
  min_w F(w) = (1/n) Σ_{i=1}^n log(1 + exp(−yi w⊤xi)) + λ∥w∥²

Your mission
▶ Implement the loss functions f and gradients df for the three problems.
▶ Implement the gradient descent algorithm (and its accelerated variant).
▶ Compare the convergence speed of the algorithms on the three problems.

2.5.2 - Smooth machine learning problems - Logistic regression - 32/36
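Before comparing the algorithms in the lab, it can help to validate each gradient implementation df against finite differences. A hedged sketch (function names are placeholders, not the lab's API):

```python
import numpy as np

def check_gradient(f, df, w, eps=1e-6):
    """Return the max abs difference between the analytic gradient df(w)
    and a central finite-difference approximation of grad f at w."""
    w = np.asarray(w, dtype=float)
    num = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        num[i] = (f(w + e) - f(w - e)) / (2 * eps)  # central difference
    return np.max(np.abs(num - df(w)))

# Example on a ridge-style term f(w) = ||w||^2 whose gradient is 2w.
err = check_gradient(lambda w: w @ w, lambda w: 2 * w, np.array([1.0, -2.0, 3.0]))
```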


Bibliography I

Convex Optimization [Boyd and Vandenberghe, 2004]


▶ Available freely online: https://web.stanford.edu/~boyd/cvxbook/.

Nonlinear Programming [Bertsekas, 1997]


▶ Reference optimization book, contains also most of the course.
▶ Unconstrained optimization (Ch. 1), duality and Lagrangian (Ch. 3, 4, 5).

Convex analysis and monotone operator theory in Hilbert spaces


[Bauschke et al., 2011]
▶ Awesome book with lots of algorithms and convergence proofs.
▶ All definitions (convexity, lower semicontinuity) in specific chapters.

Numerical optimization [Nocedal and Wright, 2006]


▶ Classic introduction to numerical optimization.
References I

Barzilai, J. and Borwein, J. M. (1988).


Two-point step size gradient methods.
IMA Journal of Numerical Analysis, 8(1):141–148.

Bauschke, H. H., Combettes, P. L., et al. (2011).


Convex analysis and monotone operator theory in Hilbert spaces, volume 408.
Springer.

Bertsekas, D. P. (1997).
Nonlinear programming.
Journal of the Operational Research Society, 48(3):334–334.

Boyd, S. and Vandenberghe, L. (2004).


Convex optimization.
Cambridge university press.

Gen, M. and Cheng, R. (1999).


Genetic algorithms and engineering optimization, volume 7.
John Wiley & Sons.
References II
Hunter, D. R. and Lange, K. (2004).
A tutorial on MM algorithms.
The American Statistician, 58(1):30–37.

Kennedy, J. and Eberhart, R. (1995).


Particle swarm optimization.
In Proceedings of ICNN'95 - International Conference on Neural Networks, volume 4, pages
1942–1948. IEEE.
Nelder, J. A. and Mead, R. (1965).
A simplex method for function minimization.
The computer journal, 7(4):308–313.

Nesterov, Y. E. (1983).
A method for solving the convex programming problem with convergence rate O(1/k²).
In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547.

Nocedal, J. and Wright, S. (2006).


Numerical optimization.
Springer Science & Business Media.
References III
Raydan, M. (1993).
On the Barzilai and Borwein choice of steplength for the gradient method.
IMA Journal of Numerical Analysis, 13(3):321–326.

Raydan, M. (1997).
The Barzilai and Borwein gradient method for the large scale unconstrained minimization
problem.
SIAM Journal on Optimization, 7(1):26–33.

Sun, Y., Babu, P., and Palomar, D. P. (2016).


Majorization-minimization algorithms in signal processing, communications, and machine
learning.
IEEE Transactions on Signal Processing, 65(3):794–816.

Walkington, N. J. (2023).
Nesterov’s method for convex optimization.
SIAM Review, 65(2):539–562.

Wolpert, D. H. and Macready, W. G. (1997).


No free lunch theorems for optimization.
IEEE Transactions on Evolutionary Computation, 1(1):67–82.
