Continuous Optimization

This document discusses continuous optimization techniques used in machine learning. It introduces gradient descent, which iteratively moves parameters in the direction of steepest descent to minimize an objective function. Variants like stochastic gradient descent and momentum are also covered. Convexity and convex functions are defined, and it is shown that the minimum of a convex function over a convex set is globally optimal. Constrained optimization using Lagrange multipliers is also summarized.


7 Continuous Optimization

Introduction

• Since machine learning algorithms are implemented on a computer, the
  mathematical formulations are expressed as numerical optimization problems
• Training a machine learning model: finding a good set of parameters, where
  "good" is determined by the objective function or the probabilistic model
  ⇒ use optimization algorithms



Optimization Using Gradient Descent
• We solve for the minimum of a function f : ℝ^d → ℝ
• Gradient descent exploits the fact that f(x0) decreases fastest if one moves
  from x0 in the direction of the negative gradient −(∇f(x0))^T of f at x0
• For a "good" step-size γ > 0, if
      x1 = x0 − γ(∇f(x0))^T,
  then
      f(x1) ≤ f(x0)

(Figure: gradient descent steps on the example f(x) = x² + 1)



Gradient Descent Algorithm
Algorithm.
1. Choose an initial guess x0
2. Compute xi iteratively until the stopping criterion is met, using
       xi+1 = xi − γi(∇f(xi))^T
3. Return the parameter xi   {f(xi) ≈ min_x f(x)}
For a suitable learning rate γi, the sequence f(x0) ≥ f(x1) ≥ . . . converges
to a local minimum
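The following Python sketch (my own illustration, not part of the slides) implements this loop; the gradient-norm stopping criterion and the parameter names are assumptions.

import numpy as np

def gradient_descent(grad, x0, learning_rate=0.1, max_iters=100, tol=1e-8):
    # Minimal gradient descent: x_{i+1} = x_i - gamma * grad(x_i)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop once the gradient is (almost) zero
            break
        x = x - learning_rate * g
    return x

# Example: minimize f(x) = x^2 + 1 from the previous slide; its gradient is 2x
x_min = gradient_descent(lambda x: 2 * x, x0=np.array([3.0]))
print(x_min)   # close to [0]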



Gradient Descent Algorithm – Ex 1
Find a local minimum of f(x) = (x³ − 3x)/(x² + 3)
Gradient (by the quotient rule)
    ∇f(x) = (x⁴ + 12x² − 9)/(x² + 3)²
x0 = 0, step-size γ = 0.05
number of iterations = 30
    xi = xi−1 − γ(∇f(xi−1))^T
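A small Python sketch of this run (my own addition); with these settings the iterates move toward the local minimizer near x ≈ 0.84.

def f(x):
    return (x**3 - 3*x) / (x**2 + 3)

def grad_f(x):
    # derivative of f obtained with the quotient rule
    return (x**4 + 12*x**2 - 9) / (x**2 + 3)**2

x = 0.0          # x0
gamma = 0.05     # step-size
for i in range(30):
    x = x - gamma * grad_f(x)

print(x, f(x))   # roughly x ≈ 0.77 after 30 steps, approaching the local minimizer near 0.84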



Learning rate (Step-size) γ
• Choosing a good step-size (learning rate) is important in GD
• If the step-size is too small ⇒ GD can be slow
• If the step-size is too large ⇒ GD can overshoot, fail to converge, or even diverge

(Figure, two panels:  γ = 0.01: slowly converges;  γ = 3: fails to converge)



Gradient Descent Algorithm – Ex 2
• Find a local minimum of z = f(x, y) = x² + 2y² + 4
  ⇒ ∇f(x, y) = [2x  4y]
• Choose X0 = (x0, y0)^T = (−1, 2)^T
• Learning rate γ = 0.1
• GD: Xi+1 = Xi − γ∇f(Xi)^T
      X1 = (−1, 2)^T − 0.1·(−2, 8)^T = (−0.8, 1.2)^T
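A short numpy sketch (my addition) reproducing this update; the first step gives X1 = (−0.8, 1.2), matching the slide.

import numpy as np

def grad_f(v):
    x, y = v
    return np.array([2 * x, 4 * y])    # gradient of f(x, y) = x^2 + 2y^2 + 4

X = np.array([-1.0, 2.0])              # X0
gamma = 0.1                            # learning rate
for i in range(20):
    X = X - gamma * grad_f(X)
    if i == 0:
        print("X1 =", X)               # [-0.8  1.2]

print("X after 20 steps =", X)         # approaches the minimizer (0, 0)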



Gradient Descent Algorithm – Ex 2
(Table of GD iterates: x, y, and z = f(x, y) at each iteration)



Too large γ

GD with γ = 0.4 ⇒ zigzag shape



How to choose a suitable learning rate γ
• Adaptive gradient methods rescale the learning rate γ at each iteration,
  depending on local properties of the function f
• If f increases after a gradient step, the learning rate γ was too large
  ⇒ undo the step and decrease the learning rate γ
• If f decreases, the step could have been larger ⇒ try to increase the
  learning rate γ
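A minimal sketch of this adaptive heuristic (my own illustration; the shrink/grow factors 0.5 and 1.1 are assumptions, not from the slides):

def adaptive_gradient_descent(f, grad, x0, gamma=0.1, max_iters=100,
                              shrink=0.5, grow=1.1):
    x, fx = x0, f(x0)
    for _ in range(max_iters):
        x_new = x - gamma * grad(x)
        f_new = f(x_new)
        if f_new > fx:            # f increased: the step was too large
            gamma *= shrink       # undo the step (keep the old x), decrease gamma
        else:                     # f decreased: accept the step
            x, fx = x_new, f_new
            gamma *= grow         # the step could have been larger: increase gamma
    return x

# Example: f(x) = x^2 + 1 with a deliberately large initial learning rate
print(adaptive_gradient_descent(lambda x: x**2 + 1, lambda x: 2*x, x0=3.0, gamma=2.0))
# prints a value near 0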



Gradient Descent With Momentum (1986)
• A method that introduces an additional term to remember what
happened in the previous iteration
• The momentum-based method remembers the update ∆xi at each
iteration i and determines the next update as a linear combination of
the current and previous gradients
    xi+1 = xi − γi(∇f(xi))^T + α∆xi
    ∆xi = xi − xi−1 = α∆xi−1 − γi−1(∇f(xi−1))^T,
where α ∈ [0, 1].
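A minimal Python sketch of this update rule (my own illustration), demonstrated on the quadratic f(x, y) = x² + 2y² + 4 used earlier:

import numpy as np

def gd_momentum(grad, x0, gamma=0.1, alpha=0.7, max_iters=100):
    # x_{i+1} = x_i - gamma * grad(x_i) + alpha * delta_x_i
    x = np.asarray(x0, dtype=float)
    delta = np.zeros_like(x)                     # delta_x_0 = 0
    for _ in range(max_iters):
        delta = alpha * delta - gamma * grad(x)  # remember the previous update
        x = x + delta
    return x

grad = lambda v: np.array([2 * v[0], 4 * v[1]])  # gradient of x^2 + 2y^2 + 4
print(gd_momentum(grad, x0=[-1.0, 2.0]))         # close to the minimizer (0, 0)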



GD with Momentum
Find a local minimum of f(x) = 12x³ − 48x² + 36x

(Figure, left: without momentum, x0 = −1, γ = 0.01; right: with momentum, α = 0.7, x0 = −1, γ = 0.01)



Different values of α

(Figure, left: with momentum, α = 0.7, x0 = 4, γ = 0.01; right: with momentum, α = 1, x0 = 4, γ = 0.01)



Stochastic Gradient Descent (SGD)
• Computing the gradient can be very time-consuming
  ⇒ find a "cheap" approximation of the gradient
• Since machine learning does not necessarily need a precise estimate of the
  minimum of the objective function, approximate gradients have been widely used
• SGD is very effective in large-scale machine learning problems such
as training deep neural networks on millions of images



Mini-batch Gradient Descent (SGD)
In ML, given N data points, consider the sum of the losses Ln incurred by each
example n:
    L(θ) = Σ_{n=1}^N Ln(θ),   where θ are the parameters
• Standard GD (a "batch" optimization method) is performed using
    θ_{i+1} = θ_i − γ_i Σ_{n=1}^N (∇Ln(θ_i))^T,
  which requires computing all N gradients ⇒ very expensive evaluations
• In contrast to batch gradient descent, which uses all Ln, we randomly choose a
  subset of the Ln for mini-batch gradient descent
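A minimal sketch of this idea (my own; the toy least-squares losses, the batch size, and the per-batch averaging are all assumptions):

import numpy as np

def minibatch_sgd(grad_Ln, theta0, data, gamma=0.05, batch_size=32, epochs=10):
    # Approximate the full gradient sum_n grad_Ln(theta) by a random mini-batch
    theta = np.asarray(theta0, dtype=float)
    N = len(data)
    for _ in range(epochs):
        idx = np.random.permutation(N)
        for start in range(0, N, batch_size):
            batch = data[idx[start:start + batch_size]]
            g = sum(grad_Ln(theta, x_n) for x_n in batch) / len(batch)
            theta = theta - gamma * g
    return theta

# Toy example: Ln(theta) = 0.5 * (theta - x_n)^2, so the minimizer of L is the data mean
data = np.random.randn(1000) + 5.0
grad_Ln = lambda theta, x_n: theta - x_n
print(minibatch_sgd(grad_Ln, theta0=0.0, data=data))   # close to 5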



Convex sets

Definition. A set C is a convex set if for any x, y ∈ C and for any scalar θ
with 0 ≤ θ ≤ 1, we have
    θx + (1 − θ)y ∈ C

(Figure, left: example of a convex set; right: example of a non-convex set)

Note. Convex sets are sets such that the straight line segment connecting any
two elements of the set lies inside the set.



Some convex sets
• In ℝ, every interval (a, b) is convex
• In ℝ², C1 = {(x, y) | x² + y² < 1} is convex, but C2 = {(x, y) | 0 < x² + y² < 1}
  is not.
• In ℝⁿ, C = {(x1, x2, …, xn) | c1x1 + c2x2 + … + cnxn ≤ b} is convex, for all
  real numbers c1, c2, …, cn, b
Theorem. The intersection of two convex sets is also convex.

(Figure: two convex sets whose intersection is again a convex set)



Convex functions
• Definition. Let C be a convex subset of ℝ^D.
  A function f : C → ℝ is called a convex function if
      f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y),   ∀x ∈ C, ∀y ∈ C, ∀θ ∈ [0, 1]
Note. A concave function is the negative of a convex function



Concave functions
Definition. Let C be a convex subset of ℝ^D.
A function f : C → ℝ is called a concave function if
    f(θx + (1 − θ)y) ≥ θf(x) + (1 − θ)f(y),   ∀x ∈ C, ∀y ∈ C, ∀θ ∈ [0, 1]

Note. A concave function is the negative of a convex function



Theorem
Suppose that C is a convex set.
• If f : C → ℝ is a convex function, then a local minimum is a global minimum
  of f over C.
• If f : C → ℝ is a concave function, then a local maximum is a global maximum
  of f over C.



Convexity test
• If a function f : ℝⁿ → ℝ is twice differentiable, then
  • f(x) is convex if and only if for any two points x, y it holds that
        f(y) ≥ f(x) + ∇f(x)^T (y − x)
  • f(x) is convex if and only if the Hessian ∇²f(x) is positive semidefinite
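As a quick numerical illustration (my addition), both conditions can be spot-checked for the quadratic f(x) = ½ xᵀQx + cᵀx, whose Hessian is Q:

import numpy as np

Q = np.array([[2.0, 0.0], [0.0, 4.0]])           # Hessian of f
c = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ Q @ x + c @ x
grad = lambda x: Q @ x + c

# Second-order test: Q is positive semidefinite iff all its eigenvalues are >= 0
print(np.all(np.linalg.eigvalsh(Q) >= 0))        # True

# First-order test at two sample points: f(y) >= f(x) + grad(x)^T (y - x)
rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)
print(bool(f(y) >= f(x) + grad(x) @ (y - x)))    # True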



Convex functions - Ex
• The negative entropy f(x) = x·log₂x is convex for x > 0
• In fact,
  Gradient: ∇f(x) = log₂x + x·(log₂x)′ = log₂x + log₂e
  Hessian:  ∇²f(x) = (1/x)·log₂e > 0,   for all x > 0



Some common convex functions
• ax + b on ℝ, for any a, b ∈ ℝ
• e^{ax} on ℝ, for any a ∈ ℝ
• |x|^p on ℝ, for p ≥ 1
• x·log x, for x > 0
• c^T x + b on ℝⁿ, for any c ∈ ℝⁿ, b ∈ ℝ
• Every norm on ℝⁿ
• The spectral norm of a matrix: ‖A‖₂ = σ_max(A) = [λ_max(A^T A)]^{1/2}



Sum of convex functions is convex
Theorem. If f and g are convex functions, then so is f + g.
• In fact, suppose f and g are convex functions
• Then f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)
  and g(θx + (1 − θ)y) ≤ θg(x) + (1 − θ)g(y), for any 0 ≤ θ ≤ 1
⇒ Summing up both sides,
  f(θx + (1 − θ)y) + g(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) + θg(x) + (1 − θ)g(y)
                                      = θ(f(x) + g(x)) + (1 − θ)(f(y) + g(y))
⇒ f + g is convex



Constrained Optimization
Ex. Find the minimum value of the function f(x, y) = x² + 2y² subject to the
constraint x² + y² = 1.

(Figure: level curves of f(x, y) and the constraint curve g(x, y) = 0)



Constrained Optimization. Lagrange multipliers
To minimize f(x, y) subject to the constraint g(x, y) = 0 is to find the smallest
value of c such that the level curve f(x, y) = c intersects g(x, y) = 0.

At such a point (x0, y0) the two curves are tangent and their gradients are
parallel:
    ∇f(x0, y0) = λ∇g(x0, y0) for some scalar λ.

λ is called a Lagrange multiplier, and
    L(x, y, λ) := f(x, y) + λg(x, y)
is called the Lagrangian.

(Figure: level/contour curves of f(x, y) tangent to the constraint curve g(x, y) = 0)



Constrained Optimization. Lagrange multipliers
Ex0. Minimize f(x, y) = x² + 2y² s.t. x² + y² = 1.



Constrained Optimization. Lagrange multipliers
Ex1. Minimize f(x, y) = x² + y² s.t. x − y = 1.

Set 0 = g(x, y) = x − y − 1
and L(x, y, λ) = f(x, y) + λg(x, y) = x² + y² + λ(x − y − 1)
We find all values of (x0, y0, λ) such that
    Lx(x0, y0, λ) = 0, Ly(x0, y0, λ) = 0, and Lλ(x0, y0, λ) = 0   // partial derivatives of L
⇒ 2x0 + λ = 0, 2y0 − λ = 0, and x0 − y0 − 1 = 0
⇒ x0 = 1/2, y0 = −1/2, λ = −1
The minimum value of f s.t. x − y − 1 = 0 is f(x0, y0) = 1/2.
(Note that we can also substitute y = x − 1 into f(x, y) = x² + y².)
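For checking such computations mechanically, here is a small sympy sketch (my addition) that solves the same stationarity conditions:

import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x**2 + y**2
g = x - y - 1
L = f + lam * g                                   # Lagrangian

# Set all partial derivatives of L to zero and solve for (x, y, lam)
sols = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(sols)                                       # x = 1/2, y = -1/2, lam = -1
print([f.subs(s) for s in sols])                  # [1/2]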



Constrained Optimization. Lagrange multipliers
Ex2. Minimize f(x, y) = 2x + y s.t. x² + y² = 1.

Set 0 = g(x, y) = x² + y² − 1
and L(x, y, λ) = f(x, y) + λg(x, y) = 2x + y + λ(x² + y² − 1)
We find all values of (x0, y0, λ) such that
    Lx(x0, y0, λ) = 0, Ly(x0, y0, λ) = 0, and Lλ(x0, y0, λ) = 0
⇒ 2 + 2λx0 = 0, 1 + 2λy0 = 0, and x0² + y0² − 1 = 0
⇒ (x0, y0, λ) = (2/√5, 1/√5, −√5/2) or (x0, y0, λ) = (−2/√5, −1/√5, √5/2)
The minimum value of f s.t. g(x, y) = 0 is f(−2/√5, −1/√5) = −√5.



Constrained Optimization. Lagrange multipliers
For real-valued functions f : ℝ^D → ℝ, we consider the constrained optimization
problem
    min_x f(x)
    subject to gi(x) ≤ 0 for all i = 1, . . . , m

For λ = [λ1 λ2 … λm]^T with Lagrange multipliers λi ≥ 0, set
    L(x, λ) := f(x) + λ^T g(x)   // Lagrangian



Dual Lagrangian
In general, duality in optimization is the idea of converting an optimization
problem in one set of variables x (called the primal variables) into another one
in a different set of variables λ (called the dual variables).

Primal problem:    min_x f(x)
                   s.t. gi(x) ≤ 0, for all i = 1, 2, …, m

Dual Lagrangian:   D(λ) = min_x L(x, λ)

Lagrangian dual problem:   max_λ D(λ)
                           s.t. λ ≥ 0

(Lagrange multipliers are named after the French-Italian mathematician
Joseph-Louis Lagrange (1736–1813).)



Weak duality vs Strong duality: minimax ≥ maximin

(Figure: sketches of f(x) and D(λ) for the weak-duality and strong-duality cases)

Weak duality:
    min_x max_λ L(x, λ) ≥ max_λ min_x L(x, λ)
    f(·) and the gi(·) may be nonconvex

Strong duality:
    min_x max_λ L(x, λ) = max_λ min_x L(x, λ)
    f(·) and the gi(·) are convex

D(λ) = min_x L(x, λ) is concave even though f(·) and the gi(·) may be nonconvex.
The outer problem, the maximization of D(λ) over λ, is the maximum of a concave
function and can be computed efficiently.



Convex Optimization
• Convex optimization problem:
  • f(·) is a convex function,
  • the constraint sets defined by g(·) and h(·) are convex
⇒ strong duality: the optimal value of the dual problem is the same as the
  optimal value of the primal problem



Convex Optimization
Problem
    min_x f(x)
    subject to gi(x) ≤ 0 for all i = 1, . . . , m
               hj(x) = 0 for all j = 1, . . . , n,
where all functions f(x) and gi(x) are convex functions, and each set
{x | hj(x) = 0} is a convex set



Convex optimization
Ex3. Minimize f(x, y) = x² − 4y s.t. g(x, y) = y² − 2x ≤ 0

L(x, y, λ) = f(x, y) + λg(x, y) = x² − 4y + λ(y² − 2x)

Lx = 0 and Ly = 0
⇒ 2x − 2λ = 0 and −4 + 2λy = 0
⇒ x = λ and y = 2/λ
min_{(x,y)} L(x, y, λ) = −λ² − 4/λ =: D(λ)
⇒ max_{λ≥0} D(λ) = D(∛2) = −3·∛4   (since D′(λ) = −2λ + 4/λ² = 0 gives λ³ = 2)
⇒ Result = −3·∛4
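As a quick numerical sanity check (my addition), one can maximize D(λ) = −λ² − 4/λ over λ > 0 and compare with λ = ∛2 ≈ 1.26 and −3·∛4 ≈ −4.76:

from scipy.optimize import minimize_scalar

D = lambda lam: -lam**2 - 4.0 / lam            # dual Lagrangian
res = minimize_scalar(lambda lam: -D(lam), bounds=(1e-6, 10.0), method='bounded')
print(res.x, D(res.x))                         # about 1.2599 and -4.7622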



Example
• Consider the problem
    min_{x,y}  2x + 2y + 3
    s.t.  x² + y² ≤ 4
1/ Find the Lagrangian L(x, y, λ)
2/ Find the dual Lagrangian D(λ)



Linear Programming
• Consider the special case when all the preceding functions are linear, i.e.,
    min_x c^T x
    subject to Ax ≤ b,
  where A ∈ ℝ^{m×d} and b ∈ ℝ^m
• This is known as a linear program, which has d variables and m linear
  constraints



Linear program - Ex
• Consider the linear program
    min_{x ∈ ℝ²}  −[5  3] [x1; x2]        (i.e., minimize −5x1 − 3x2)

    subject to
        [  2   2 ]              [ 33 ]
        [  2  −4 ]   [x1]       [  8 ]
        [ −2   1 ]   [x2]  ≤    [  5 ]
        [  0  −1 ]              [ −1 ]
        [  0   1 ]              [  8 ]






Linear program – Exercise
• Consider the linear program

• Write the program in standard form (matrix notation).



Linear program - Lagrangian
• The Lagrangian is given by
    L(x, λ) = c^T x + λ^T (Ax − b)
            = (c + A^T λ)^T x − λ^T b
• ∂L(x, λ)/∂x = 0  ⇒  c + A^T λ = 0
• Therefore, the dual Lagrangian is
    D(λ) = min_x L(x, λ) = −λ^T b,
  and we would like to find max_{λ≥0} D(λ)



Linear program - Dual program
• The dual optimization problem is
    max_λ (−b^T λ)
    subject to c + A^T λ = 0,
               λ ∈ ℝ^m, λ ≥ 0
  This is also a linear program, but with m variables
• We have two choices:
  • Solve the primal program in d variables
  • Solve the dual program in m variables



Linear program - Lagrangian
• Lagrangian 33
• D(λ) = minx L(x, λ) = −λTb 8
 
= [-1 -2 -3 -4 -5] 5
 
 1
 8 
 D(λ) = -331 -82 -53 +4 -85



Example
• Consider the linear program
    min_{x1,x2}  2x1 + x2

    s.t.  [ 1  2 ]   [x1]       [ 1 ]
          [ 3  1 ]   [x2]  ≤    [ 4 ]
          [ 2  3 ]              [ 3 ]

• Find the dual Lagrangian D(λ)



Quadratic Programming
• Consider the problem
    min_x (1/2) x^T Q x + c^T x
    subject to Ax ≤ b,
  where A ∈ ℝ^{m×d}, b ∈ ℝ^m, c ∈ ℝ^d, and
  Q ∈ ℝ^{d×d} is positive definite (and therefore the objective function is convex)
• This is known as a quadratic program with d variables and m linear constraints



Quadratic Programming – Ex

(Figure: the optimal value must lie in the shaded region, and is indicated by the star)



Quadratic Programming – Exercise
Consider the quadratic program

Write the program in standard form (matrix notation).



Quadratic Programming - Lagrangian
• The Lagrangian is given by
    L(x, λ) = (1/2) x^T Q x + c^T x + λ^T (Ax − b)
            = (1/2) x^T Q x + (c + A^T λ)^T x − λ^T b,
  Taking the derivative of L(x, λ) with respect to x and setting it to zero gives
    Qx + (c + A^T λ) = 0
  Assuming that Q is invertible, we get
    x = −Q^{−1}(c + A^T λ)



Quadratic Programming – Dual Lagrangian
• The dual Lagrangian is
    D(λ) = −(1/2)(c + A^T λ)^T Q^{−1}(c + A^T λ) − λ^T b
  Therefore, the dual optimization problem is given by
    max_λ −(1/2)(c + A^T λ)^T Q^{−1}(c + A^T λ) − λ^T b
    subject to λ ≥ 0

We will see an application of quadratic programming in ML in Chapter 12,
Support Vector Machines.
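To make the dual concrete, here is a numpy sketch (my addition; the Q, c, A, b below are hypothetical toy data, not from the slides) that evaluates D(λ), maximizes it over λ ≥ 0 by a crude grid search, and recovers x = −Q⁻¹(c + Aᵀλ):

import numpy as np

# Hypothetical small QP: min 0.5 x^T Q x + c^T x  s.t.  A x <= b
Q = np.array([[2.0, 0.0], [0.0, 4.0]])          # positive definite
c = np.array([-2.0, -4.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

def dual(lam):
    # D(lambda) = -0.5 (c + A^T lam)^T Q^{-1} (c + A^T lam) - lam^T b
    v = c + A.T @ lam
    return -0.5 * v @ np.linalg.solve(Q, v) - lam @ b

def primal_from_dual(lam):
    # x = -Q^{-1} (c + A^T lam)
    return -np.linalg.solve(Q, c + A.T @ lam)

# Crude search over lambda >= 0 (a single constraint, so lambda is scalar here)
lams = np.linspace(0.0, 5.0, 5001).reshape(-1, 1)
values = [dual(l) for l in lams]
lam_star = lams[int(np.argmax(values))]
print(lam_star, primal_from_dual(lam_star))     # lambda* ≈ 1.333, x* ≈ [0.333, 0.667]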



Example
• Consider the quadratic program
    min_{x1,x2}  (1/2) [x1  x2] [ 2  2 ] [x1]
                                [ 2  4 ] [x2]

    s.t.  [ 2  1 ]   [x1]       [ 1 ]
          [ 3  2 ]   [x2]  ≤    [ 2 ]
          [ 1  1 ]              [ 3 ]

• Find the dual Lagrangian D(λ)



THANKS
