
Lecture 09 - Calculus and Optimization Techniques (3) - Plain

Stochastic gradient descent (SGD) approximates the gradient of the loss function using a single randomly selected training example. This makes each iteration faster than in traditional gradient descent, but more iterations may be needed to converge. Minibatch SGD reduces the variance of the gradient approximation by averaging over a small batch of randomly selected examples. Projected gradient descent enforces constraints via an additional projection step that maps each update back onto the constraint set. Proximal gradient descent minimizes a regularized loss function by performing gradient descent on the loss and then applying the proximal operator of the regularizer. Constrained optimization problems can also be solved using Lagrangian methods, by introducing Lagrange multipliers and optimizing the resulting primal and dual problems.


Optimization for ML (3)

CS771: Introduction to Machine Learning


Piyush Rai
Stochastic Gradient Descent (SGD)

 Consider a loss function of the form
$L(\boldsymbol{w}) = \frac{1}{N} \sum_{n=1}^{N} \ell_n(\boldsymbol{w})$
(Writing $L(\boldsymbol{w})$ as an average instead of a sum won't affect its minimization.)

 The (sub)gradient in this case can be written as
$\boldsymbol{g} = \nabla_{\boldsymbol{w}} L(\boldsymbol{w}) = \frac{1}{N} \sum_{n=1}^{N} \nabla_{\boldsymbol{w}} \ell_n(\boldsymbol{w}) = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{g}_n$
where $\boldsymbol{g}_n$ is the (sub)gradient of the loss on training example $n$. This is expensive to compute: it requires going over all $N$ training examples in each iteration.

 Stochastic Gradient Descent (SGD) approximates $\boldsymbol{g}$ using a single training example
 At iteration $t$, pick an index $i \in \{1, \ldots, N\}$ uniformly at random and approximate $\boldsymbol{g}$ as
$\boldsymbol{g} \approx \boldsymbol{g}_i = \nabla_{\boldsymbol{w}} \ell_i(\boldsymbol{w})$
(Can show that $\boldsymbol{g}_i$ is an unbiased estimate of $\boldsymbol{g}$, i.e., $\mathbb{E}[\boldsymbol{g}_i] = \boldsymbol{g}$)

 May take more iterations than GD to converge, but each iteration is much faster
 SGD per-iteration cost is $O(D)$ whereas GD per-iteration cost is $O(ND)$
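To make the update concrete, here is a minimal NumPy sketch of SGD for an average least-squares loss; the least-squares choice and all names (sgd_least_squares, eta, n_iters) are illustrative assumptions, not from the slides.

```python
import numpy as np

def sgd_least_squares(X, y, n_iters=1000, eta=0.01):
    """SGD sketch for the average least-squares loss L(w) = (1/N) sum_n (y_n - w.x_n)^2."""
    N, D = X.shape
    w = np.zeros(D)
    for t in range(n_iters):
        i = np.random.randint(N)              # pick index i uniformly at random
        g_i = -2 * (y[i] - X[i] @ w) * X[i]   # (sub)gradient of the loss on example i
        w -= eta * g_i                        # O(D) per iteration, vs O(ND) for full GD
    return w
```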
Minibatch SGD
 Gradient approximation using a single training example may be noisy
 The approximation may have a high variance, which may slow down convergence, make updates unstable, and may even give sub-optimal solutions (e.g., a local minimum where GD might have given the global minimum)

 We can instead use $B > 1$ uniformly randomly chosen training examples with indices $i_1, \ldots, i_B \in \{1, \ldots, N\}$

 Using this "minibatch" of examples, we can compute a minibatch gradient
$\boldsymbol{g} \approx \frac{1}{B} \sum_{b=1}^{B} \boldsymbol{g}_{i_b}$

 Averaging helps in reducing the variance of the stochastic gradient

 Time complexity is $O(BD)$ per iteration in this case
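A sketch of the minibatch gradient computation under the same illustrative least-squares setup (the batch size B and the sampling scheme are assumptions):

```python
import numpy as np

def minibatch_gradient(X, y, w, B=32):
    """Minibatch gradient for the average least-squares loss, using B random examples."""
    N = X.shape[0]
    idx = np.random.choice(N, size=B, replace=False)  # indices i_1, ..., i_B
    residuals = y[idx] - X[idx] @ w
    return -2 * X[idx].T @ residuals / B              # average of per-example gradients
```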
Constrained Optimization
Projected Gradient Descent
 Consider an optimization problem of the form
$\min_{\boldsymbol{w} \in \mathcal{C}} L(\boldsymbol{w})$
where $\mathcal{C}$ is the constraint set

 Projected GD is very similar to GD, with an extra projection step
 Each iteration will be of the form
 Perform the update: $\boldsymbol{z}^{(t+1)} = \boldsymbol{w}^{(t)} - \eta_t \boldsymbol{g}^{(t)}$
 Check if $\boldsymbol{z}^{(t+1)}$ satisfies the constraints
 If $\boldsymbol{z}^{(t+1)} \in \mathcal{C}$, set $\boldsymbol{w}^{(t+1)} = \boldsymbol{z}^{(t+1)}$
 If $\boldsymbol{z}^{(t+1)} \notin \mathcal{C}$, apply the projection operator $\Pi_{\mathcal{C}}$ (the projection step): $\boldsymbol{w}^{(t+1)} = \Pi_{\mathcal{C}}(\boldsymbol{z}^{(t+1)})$
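A minimal sketch of the projected GD loop, again assuming an average least-squares loss; project is a placeholder for any projection operator $\Pi_{\mathcal{C}}$ supplied by the user:

```python
import numpy as np

def projected_gd(X, y, project, n_iters=500, eta=0.01):
    """Projected GD: an ordinary GD step followed by a projection onto the set C."""
    N, D = X.shape
    w = np.zeros(D)
    for t in range(n_iters):
        g = -2 * X.T @ (y - X @ w) / N   # gradient of the average least-squares loss
        z = w - eta * g                  # plain GD update
        w = project(z)                   # projection step (identity if z is already in C)
    return w
```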
Projected GD: How to Project?
 Here, projecting a point $\boldsymbol{z}$ means finding the "closest" point from the constraint set $\mathcal{C}$:
$\Pi_{\mathcal{C}}(\boldsymbol{z}) = \arg\min_{\boldsymbol{w} \in \mathcal{C}} \|\boldsymbol{w} - \boldsymbol{z}\|^2$
(Projected GD is commonly used only when the projection step is simple and efficient to compute)

 For some sets $\mathcal{C}$, the projection step is easy:
 $\mathcal{C}$: unit-radius ball. Projection = normalize $\boldsymbol{z}$ to a unit Euclidean length vector (if it lies outside the ball)
 $\mathcal{C}$: set of non-negative reals. Projection = set each negative entry in $\boldsymbol{z}$ to zero
[Figure: two plots illustrating projection onto the unit ball and onto the non-negative orthant]
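The two projections from this slide, written out as short NumPy functions (function names are illustrative); either can be passed as the project argument to the sketch above:

```python
import numpy as np

def project_unit_ball(z):
    """Project onto the unit Euclidean ball: rescale only if z lies outside it."""
    norm = np.linalg.norm(z)
    return z / norm if norm > 1 else z

def project_nonnegative(z):
    """Project onto the non-negative orthant: set each negative entry to zero."""
    return np.maximum(z, 0)
```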
Proximal Gradient Descent
 Consider minimizing a regularized loss function of the form
$L(\boldsymbol{w}) = \ell(\boldsymbol{w}) + R(\boldsymbol{w})$
(Note: the regularization hyperparameter is assumed to be part of $R$ itself)

 Proximal GD is popular when the regularizer $R$ is non-differentiable

 Basic idea: do GD on $\ell$ and use a proximal operator to regularize via $R$
 For a function $R$, its proximal operator is
$\text{prox}_R(\boldsymbol{z}) = \arg\min_{\boldsymbol{w}} \left[ \frac{1}{2} \|\boldsymbol{w} - \boldsymbol{z}\|^2 + R(\boldsymbol{w}) \right]$

Proximal GD (assume a regularized loss function of the form above)
 Initialize $\boldsymbol{w}^{(0)}$
 For iteration $t = 0, 1, 2, \ldots$ (or until convergence)
 Calculate the (sub)gradient $\boldsymbol{g}^{(t)}$ of the training loss $\ell$ (without the regularizer)
 Set the learning rate $\eta_t$
 Step 1: $\boldsymbol{z}^{(t+1)} = \boldsymbol{w}^{(t)} - \eta_t \boldsymbol{g}^{(t)}$
 Step 2: $\boldsymbol{w}^{(t+1)} = \text{prox}_R(\boldsymbol{z}^{(t+1)})$

Special cases
 For $R(\boldsymbol{w}) = \frac{1}{2}\|\boldsymbol{w}\|^2$, $\text{prox}_R(\boldsymbol{z}) = \boldsymbol{z}/2$; that is, regularize by reducing the value of each component of the vector by half, i.e., scaling
 If $R$ defines a set-based constraint, i.e., $R(\boldsymbol{w}) = 0$ if $\boldsymbol{w} \in \mathcal{C}$ and $\infty$ otherwise, proximal GD becomes equivalent to projected GD
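As a concrete instance (an assumption, not spelled out on this slide), the L1 regularizer $R(\boldsymbol{w}) = \lambda \|\boldsymbol{w}\|_1$ is non-differentiable and its prox operator is elementwise soft-thresholding; a sketch of proximal GD built on it, where the prox is applied with the step-size-scaled regularizer as in the standard algorithm:

```python
import numpy as np

def prox_l1(z, lam):
    """Prox operator of R(w) = lam * ||w||_1: elementwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0)

def proximal_gd(X, y, lam=0.1, n_iters=500, eta=0.01):
    """Proximal GD sketch for L1-regularized least squares (names illustrative)."""
    N, D = X.shape
    w = np.zeros(D)
    for t in range(n_iters):
        g = -2 * X.T @ (y - X @ w) / N   # (sub)gradient of the loss, without R
        z = w - eta * g                  # Step 1: GD step on the loss alone
        w = prox_l1(z, eta * lam)        # Step 2: prox of the scaled regularizer
    return w
```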
Constrained Opt. via Lagrangian
 Consider the following constrained minimization problem (using $f$ instead of $L$)
$\min_{\boldsymbol{w}} f(\boldsymbol{w}) \quad \text{s.t.} \quad g(\boldsymbol{w}) \le 0$

 Note: if the constraint is of the form $g(\boldsymbol{w}) \ge 0$, use $-g(\boldsymbol{w}) \le 0$

 Can handle multiple inequality and equality constraints too (will see later)
 Can transform the above into the following equivalent unconstrained problem, using the fact that $\max_{\alpha \ge 0} \alpha g(\boldsymbol{w}) = 0$ if $g(\boldsymbol{w}) \le 0$, and $\infty$ otherwise
 Our problem can now be written as
$\min_{\boldsymbol{w}} \left[ f(\boldsymbol{w}) + \max_{\alpha \ge 0} \alpha g(\boldsymbol{w}) \right]$
Constrained Opt. via Lagrangian
The Lagrangian: $\mathcal{L}(\boldsymbol{w}, \alpha) = f(\boldsymbol{w}) + \alpha g(\boldsymbol{w})$
 Therefore, we can write our original problem as
$\min_{\boldsymbol{w}} \max_{\alpha \ge 0} \mathcal{L}(\boldsymbol{w}, \alpha)$

 The Lagrangian is now optimized w.r.t. both $\boldsymbol{w}$ and $\alpha$ (the Lagrange multiplier)

 We can define the Primal and Dual problems as
Primal: $\min_{\boldsymbol{w}} \max_{\alpha \ge 0} \mathcal{L}(\boldsymbol{w}, \alpha)$
Dual: $\max_{\alpha \ge 0} \min_{\boldsymbol{w}} \mathcal{L}(\boldsymbol{w}, \alpha)$
Both are equal if $f$ and the set $\{\boldsymbol{w} : g(\boldsymbol{w}) \le 0\}$ are convex; at the optimum, $\alpha\, g(\boldsymbol{w}) = 0$ (the complementary slackness / Karush-Kuhn-Tucker (KKT) condition)
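A small worked example (not from the slides) showing the primal and dual giving the same answer: minimize $f(w) = w^2$ subject to $w \ge 1$, written as $g(w) = 1 - w \le 0$.

```latex
\mathcal{L}(w, \alpha) = w^2 + \alpha (1 - w), \qquad \alpha \ge 0
% Inner minimization over w: \partial\mathcal{L}/\partial w = 2w - \alpha = 0, so w = \alpha/2
% Dual function: D(\alpha) = \min_w \mathcal{L}(w, \alpha) = \alpha - \alpha^2/4
% Maximizing over \alpha \ge 0: \alpha^* = 2, hence w^* = \alpha^*/2 = 1 with value 1
% Primal optimum: closest feasible point to 0 is w = 1, also with value 1 (no duality gap)
% Complementary slackness: \alpha^* g(w^*) = 2 \cdot (1 - 1) = 0
```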
Constrained Opt. with Multiple Constraints
 We can also have multiple inequality and equality constraints
$\min_{\boldsymbol{w}} f(\boldsymbol{w}) \quad \text{s.t.} \quad g_i(\boldsymbol{w}) \le 0 \;\, (i = 1, \ldots, K), \quad h_j(\boldsymbol{w}) = 0 \;\, (j = 1, \ldots, L)$

 Introduce Lagrange multipliers $\alpha_i \ge 0$ and $\beta_j$:
$\mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(\boldsymbol{w}) + \sum_{i=1}^{K} \alpha_i g_i(\boldsymbol{w}) + \sum_{j=1}^{L} \beta_j h_j(\boldsymbol{w})$

 The Lagrangian-based primal and dual problems will be
Primal: $\min_{\boldsymbol{w}} \max_{\boldsymbol{\alpha} \ge 0, \boldsymbol{\beta}} \mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha}, \boldsymbol{\beta})$
Dual: $\max_{\boldsymbol{\alpha} \ge 0, \boldsymbol{\beta}} \min_{\boldsymbol{w}} \mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha}, \boldsymbol{\beta})$

Some other useful optimization methods
Co-ordinate Descent (CD)
 Standard gradient descent update for $\boldsymbol{w}$: $\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} - \eta_t \boldsymbol{g}^{(t)}$
 CD: in each iteration, update only one entry (co-ordinate) of $\boldsymbol{w}$, keeping all others fixed:
$w_d^{(t+1)} = w_d^{(t)} - \eta_t g_d^{(t)}$
where $g_d^{(t)} = \partial L / \partial w_d$ is the partial derivative w.r.t. the $d$-th element of the vector $\boldsymbol{w}$ (or the $d$-th element of the gradient vector $\boldsymbol{g}$)

 The cost of each update is now independent of $D$

 In each iteration, the co-ordinate to update can be chosen uniformly at random or in cyclic order
 Instead of updating a single co-ordinate, can also update "blocks" of co-ordinates
 Called Block co-ordinate descent (BCD)
 To avoid the cost of gradient computation, can cache previous computations (see the sketch below)
 Recall that gradient computations may have terms like $\boldsymbol{w}^\top \boldsymbol{x}_n$; if just one co-ordinate of $\boldsymbol{w}$ changes, we should avoid computing the new $\boldsymbol{w}^\top \boldsymbol{x}_n$ from scratch
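A sketch of co-ordinate descent for the same illustrative least-squares loss, with the residual vector cached so that each update avoids recomputing $\boldsymbol{w}^\top \boldsymbol{x}_n$ from scratch:

```python
import numpy as np

def coordinate_descent(X, y, n_iters=1000, eta=0.01):
    """Co-ordinate descent for least squares, updating one entry of w per iteration.
    The residual vector is cached so each update costs O(N), independent of D."""
    N, D = X.shape
    w = np.zeros(D)
    r = y - X @ w                      # cached residuals (initially just y, since w = 0)
    for t in range(n_iters):
        d = t % D                      # cyclic choice of co-ordinate (could be random)
        g_d = -2 * X[:, d] @ r / N     # partial derivative w.r.t. w_d
        step = eta * g_d
        w[d] -= step
        r += step * X[:, d]            # update the cache: only w_d changed
    return w
```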
Alternating Optimization (ALT-OPT)
 Consider optimization problems with several variables, say two variables $\boldsymbol{w}_1$ and $\boldsymbol{w}_2$:
$\min_{\boldsymbol{w}_1, \boldsymbol{w}_2} L(\boldsymbol{w}_1, \boldsymbol{w}_2)$

 Often, this "joint" optimization is hard/impossible to solve directly

 We can take an alternating optimization approach to solve such problems: fix $\boldsymbol{w}_2$ and optimize over $\boldsymbol{w}_1$, then fix $\boldsymbol{w}_1$ and optimize over $\boldsymbol{w}_2$, and repeat until convergence

 Usually converges to a local optimum. But very, very useful. Will see examples later
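A schematic sketch of the alternating loop; solve_w1 and solve_w2 are hypothetical user-supplied subroutines that minimize the objective over one variable with the other held fixed:

```python
def alt_opt(w1, w2, solve_w1, solve_w2, n_iters=100):
    """Alternating optimization: each step solves for one variable, the other fixed."""
    for t in range(n_iters):
        w1 = solve_w1(w2)   # minimize L(w1, w2) over w1, keeping w2 fixed
        w2 = solve_w2(w1)   # minimize L(w1, w2) over w2, keeping w1 fixed
    return w1, w2
```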
Newton's Method
 Unlike GD and its variants, Newton's method uses second-order information (the second derivative, a.k.a. the Hessian)
 At each point $\boldsymbol{w}^{(t)}$, minimize the quadratic (second-order) approximation of $L(\boldsymbol{w})$ around $\boldsymbol{w}^{(t)}$; exercise: show that this yields the update
$\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} - \left[ \nabla^2 L(\boldsymbol{w}^{(t)}) \right]^{-1} \nabla L(\boldsymbol{w}^{(t)})$

 Converges much faster than GD (very fast for convex functions). Also, no "learning rate". But the per-iteration cost is higher due to Hessian computation and inversion
 Faster versions of Newton's method also exist, e.g., those based on approximating the Hessian using previous gradients (see L-BFGS, which is a popular method)
[Figure: Newton's method minimizing $L(\boldsymbol{w})$, with iterates starting at $\boldsymbol{w}^{(1)}$ and converging to $\boldsymbol{w}_{opt}$]
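A sketch of the Newton update; grad and hess are assumed callables returning the gradient vector and Hessian matrix at a point, and the linear system is solved rather than explicitly inverting the Hessian:

```python
import numpy as np

def newtons_method(grad, hess, w0, n_iters=20):
    """Newton's method: w <- w - H^{-1} g, using second-order information."""
    w = w0
    for t in range(n_iters):
        g = grad(w)
        H = hess(w)
        w = w - np.linalg.solve(H, g)   # solve H p = g instead of inverting H
    return w
```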
Coming up next
 Some practical issues in optimization for ML
 Wrapping up the discussion of optimization techniques
 Probabilistic models for ML
