
Lecture 09 - Calculus and Optimization Techniques (3) - Plain

Stochastic gradient descent (SGD) approximates the gradient of the loss function using a single randomly selected training example. This makes each iteration faster than in traditional gradient descent, but more iterations may be needed to converge. Minibatch SGD reduces the variance of the gradient approximation by averaging over a small batch of randomly selected examples. Projected gradient descent enforces constraints via an additional projection step that maps each update back onto the constraint set. Proximal gradient descent minimizes a regularized loss function by performing gradient descent on the loss and then applying the proximal operator of the regularizer. Constrained optimization problems can also be solved using Lagrangian methods, by introducing Lagrange multipliers and optimizing the resulting primal and dual problems.


Optimization for ML (3)

CS771: Introduction to Machine Learning


Piyush Rai
Stochastic Gradient Descent (SGD)

 Consider a loss function of the form
$L(\boldsymbol{w}) = \frac{1}{N} \sum_{n=1}^{N} \ell_n(\boldsymbol{w})$
(Writing $L(\boldsymbol{w})$ as an average instead of a sum won't affect its minimization.)

 The (sub)gradient in this case can be written as
$\boldsymbol{g} = \nabla_{\boldsymbol{w}} L(\boldsymbol{w}) = \frac{1}{N} \sum_{n=1}^{N} \nabla_{\boldsymbol{w}} \ell_n(\boldsymbol{w}) = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{g}_n$
where $\boldsymbol{g}_n$ is the (sub)gradient of the loss on training example $n$. This is expensive to compute: it requires going over all $N$ training examples in each iteration.

 Stochastic Gradient Descent (SGD) approximates $\boldsymbol{g}$ using a single training example
 At iteration $t$, pick an index $i \in \{1, \ldots, N\}$ uniformly at random and approximate $\boldsymbol{g}$ as
$\boldsymbol{g} \approx \boldsymbol{g}_i = \nabla_{\boldsymbol{w}} \ell_i(\boldsymbol{w})$
(Can show that $\boldsymbol{g}_i$ is an unbiased estimate of $\boldsymbol{g}$, i.e., $\mathbb{E}[\boldsymbol{g}_i] = \boldsymbol{g}$)

 May take more iterations than GD to converge, but each iteration is much faster
 SGD per-iteration cost is $O(D)$ whereas GD per-iteration cost is $O(ND)$
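To make the update concrete, here is a minimal NumPy sketch of SGD for an average least-squares loss; the least-squares choice and all names (sgd_least_squares, eta, n_iters) are illustrative assumptions, not from the slides.

```python
import numpy as np

def sgd_least_squares(X, y, n_iters=1000, eta=0.01):
    """SGD sketch for the average least-squares loss L(w) = (1/N) sum_n (y_n - w.x_n)^2."""
    N, D = X.shape
    w = np.zeros(D)
    for t in range(n_iters):
        i = np.random.randint(N)              # pick index i uniformly at random
        g_i = -2 * (y[i] - X[i] @ w) * X[i]   # (sub)gradient of the loss on example i
        w -= eta * g_i                        # O(D) per iteration, vs O(ND) for full GD
    return w
```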
Minibatch SGD
 Gradient approximation using a single training example may be noisy
 The approximation may have a high variance, which may slow down convergence, make updates unstable, and may even give sub-optimal solutions (e.g., a local minimum where GD might have given the global minimum)

 We can instead use $B > 1$ uniformly randomly chosen training examples with indices $i_1, \ldots, i_B \in \{1, \ldots, N\}$

 Using this "minibatch" of examples, we can compute a minibatch gradient
$\boldsymbol{g} \approx \frac{1}{B} \sum_{b=1}^{B} \boldsymbol{g}_{i_b}$

 Averaging helps in reducing the variance of the stochastic gradient

 Time complexity is $O(BD)$ per iteration in this case
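A sketch of the minibatch gradient computation under the same illustrative least-squares setup (the batch size B and the sampling scheme are assumptions):

```python
import numpy as np

def minibatch_gradient(X, y, w, B=32):
    """Minibatch gradient for the average least-squares loss, using B random examples."""
    N = X.shape[0]
    idx = np.random.choice(N, size=B, replace=False)  # indices i_1, ..., i_B
    residuals = y[idx] - X[idx] @ w
    return -2 * X[idx].T @ residuals / B              # average of per-example gradients
```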
Constrained Optimization
Projected Gradient Descent
 Consider an optimization problem of the form
$\min_{\boldsymbol{w} \in \mathcal{C}} L(\boldsymbol{w})$
where $\mathcal{C}$ is the constraint set

 Projected GD is very similar to GD, with an extra projection step
 Each iteration will be of the form
 Perform the update: $\boldsymbol{z}^{(t+1)} = \boldsymbol{w}^{(t)} - \eta_t \boldsymbol{g}^{(t)}$
 Check if $\boldsymbol{z}^{(t+1)}$ satisfies the constraints
 If $\boldsymbol{z}^{(t+1)} \in \mathcal{C}$, set $\boldsymbol{w}^{(t+1)} = \boldsymbol{z}^{(t+1)}$
 If $\boldsymbol{z}^{(t+1)} \notin \mathcal{C}$, apply the projection operator $\Pi_{\mathcal{C}}$ (the projection step): $\boldsymbol{w}^{(t+1)} = \Pi_{\mathcal{C}}(\boldsymbol{z}^{(t+1)})$
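A minimal sketch of the projected GD loop, again assuming an average least-squares loss; project is a placeholder for any projection operator $\Pi_{\mathcal{C}}$ supplied by the user:

```python
import numpy as np

def projected_gd(X, y, project, n_iters=500, eta=0.01):
    """Projected GD: an ordinary GD step followed by a projection onto the set C."""
    N, D = X.shape
    w = np.zeros(D)
    for t in range(n_iters):
        g = -2 * X.T @ (y - X @ w) / N   # gradient of the average least-squares loss
        z = w - eta * g                  # plain GD update
        w = project(z)                   # projection step (identity if z is already in C)
    return w
```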
Projected GD: How to Project?
 Here, projecting a point $\boldsymbol{z}$ means finding the "closest" point from the constraint set $\mathcal{C}$:
$\Pi_{\mathcal{C}}(\boldsymbol{z}) = \arg\min_{\boldsymbol{w} \in \mathcal{C}} \|\boldsymbol{w} - \boldsymbol{z}\|^2$
(Projected GD is commonly used only when the projection step is simple and efficient to compute)

 For some sets $\mathcal{C}$, the projection step is easy:
 $\mathcal{C}$: unit-radius ball. Projection = normalize $\boldsymbol{z}$ to a unit Euclidean length vector (if it lies outside the ball)
 $\mathcal{C}$: set of non-negative reals. Projection = set each negative entry in $\boldsymbol{z}$ to zero
[Figure: two plots illustrating projection onto the unit ball and onto the non-negative orthant]
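The two projections from this slide, written out as short NumPy functions (function names are illustrative); either can be passed as the project argument to the sketch above:

```python
import numpy as np

def project_unit_ball(z):
    """Project onto the unit Euclidean ball: rescale only if z lies outside it."""
    norm = np.linalg.norm(z)
    return z / norm if norm > 1 else z

def project_nonnegative(z):
    """Project onto the non-negative orthant: set each negative entry to zero."""
    return np.maximum(z, 0)
```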
Proximal Gradient Descent
 Consider minimizing a regularized loss function of the form
$L(\boldsymbol{w}) = \ell(\boldsymbol{w}) + R(\boldsymbol{w})$
(Note: the regularization hyperparameter is assumed to be part of $R$ itself)

 Proximal GD is popular when the regularizer $R$ is non-differentiable

 Basic idea: do GD on $\ell$ and use a proximal operator to regularize via $R$
 For a function $R$, its proximal operator is
$\text{prox}_R(\boldsymbol{z}) = \arg\min_{\boldsymbol{w}} \left[ \frac{1}{2} \|\boldsymbol{w} - \boldsymbol{z}\|^2 + R(\boldsymbol{w}) \right]$

Proximal GD (assume a regularized loss function of the form above)
 Initialize $\boldsymbol{w}^{(0)}$
 For iteration $t = 0, 1, 2, \ldots$ (or until convergence)
 Calculate the (sub)gradient $\boldsymbol{g}^{(t)}$ of the training loss $\ell$ (without the regularizer)
 Set the learning rate $\eta_t$
 Step 1: $\boldsymbol{z}^{(t+1)} = \boldsymbol{w}^{(t)} - \eta_t \boldsymbol{g}^{(t)}$
 Step 2: $\boldsymbol{w}^{(t+1)} = \text{prox}_R(\boldsymbol{z}^{(t+1)})$

Special cases
 For $R(\boldsymbol{w}) = \frac{1}{2}\|\boldsymbol{w}\|^2$, $\text{prox}_R(\boldsymbol{z}) = \boldsymbol{z}/2$; that is, regularize by reducing the value of each component of the vector by half, i.e., scaling
 If $R$ defines a set-based constraint, i.e., $R(\boldsymbol{w}) = 0$ if $\boldsymbol{w} \in \mathcal{C}$ and $\infty$ otherwise, proximal GD becomes equivalent to projected GD
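As a concrete instance (an assumption, not spelled out on this slide), the L1 regularizer $R(\boldsymbol{w}) = \lambda \|\boldsymbol{w}\|_1$ is non-differentiable and its prox operator is elementwise soft-thresholding; a sketch of proximal GD built on it, where the prox is applied with the step-size-scaled regularizer as in the standard algorithm:

```python
import numpy as np

def prox_l1(z, lam):
    """Prox operator of R(w) = lam * ||w||_1: elementwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0)

def proximal_gd(X, y, lam=0.1, n_iters=500, eta=0.01):
    """Proximal GD sketch for L1-regularized least squares (names illustrative)."""
    N, D = X.shape
    w = np.zeros(D)
    for t in range(n_iters):
        g = -2 * X.T @ (y - X @ w) / N   # (sub)gradient of the loss, without R
        z = w - eta * g                  # Step 1: GD step on the loss alone
        w = prox_l1(z, eta * lam)        # Step 2: prox of the scaled regularizer
    return w
```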
Constrained Opt. via Lagrangian
 Consider the following constrained minimization problem (using $f$ instead of $L$)
$\min_{\boldsymbol{w}} f(\boldsymbol{w}) \quad \text{s.t.} \quad g(\boldsymbol{w}) \le 0$

 Note: if the constraint is of the form $g(\boldsymbol{w}) \ge 0$, use $-g(\boldsymbol{w}) \le 0$

 Can handle multiple inequality and equality constraints too (will see later)
 Can transform the above into the following equivalent unconstrained problem, using the fact that $\max_{\alpha \ge 0} \alpha g(\boldsymbol{w}) = 0$ if $g(\boldsymbol{w}) \le 0$, and $\infty$ otherwise
 Our problem can now be written as
$\min_{\boldsymbol{w}} \left[ f(\boldsymbol{w}) + \max_{\alpha \ge 0} \alpha g(\boldsymbol{w}) \right]$
Constrained Opt. via Lagrangian
The Lagrangian: $\mathcal{L}(\boldsymbol{w}, \alpha) = f(\boldsymbol{w}) + \alpha g(\boldsymbol{w})$
 Therefore, we can write our original problem as
$\min_{\boldsymbol{w}} \max_{\alpha \ge 0} \mathcal{L}(\boldsymbol{w}, \alpha)$

 The Lagrangian is now optimized w.r.t. both $\boldsymbol{w}$ and $\alpha$ (the Lagrange multiplier)

 We can define the Primal and Dual problems as
Primal: $\min_{\boldsymbol{w}} \max_{\alpha \ge 0} \mathcal{L}(\boldsymbol{w}, \alpha)$
Dual: $\max_{\alpha \ge 0} \min_{\boldsymbol{w}} \mathcal{L}(\boldsymbol{w}, \alpha)$
Both are equal if $f$ and the set $\{\boldsymbol{w} : g(\boldsymbol{w}) \le 0\}$ are convex; at the optimum, $\alpha\, g(\boldsymbol{w}) = 0$ (the complementary slackness / Karush-Kuhn-Tucker (KKT) condition)
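A small worked example (not from the slides) showing the primal and dual giving the same answer: minimize $f(w) = w^2$ subject to $w \ge 1$, written as $g(w) = 1 - w \le 0$.

```latex
\mathcal{L}(w, \alpha) = w^2 + \alpha (1 - w), \qquad \alpha \ge 0
% Inner minimization over w: \partial\mathcal{L}/\partial w = 2w - \alpha = 0, so w = \alpha/2
% Dual function: D(\alpha) = \min_w \mathcal{L}(w, \alpha) = \alpha - \alpha^2/4
% Maximizing over \alpha \ge 0: \alpha^* = 2, hence w^* = \alpha^*/2 = 1 with value 1
% Primal optimum: closest feasible point to 0 is w = 1, also with value 1 (no duality gap)
% Complementary slackness: \alpha^* g(w^*) = 2 \cdot (1 - 1) = 0
```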
Constrained Opt. with Multiple Constraints
 We can also have multiple inequality and equality constraints
$\min_{\boldsymbol{w}} f(\boldsymbol{w}) \quad \text{s.t.} \quad g_i(\boldsymbol{w}) \le 0 \;\, (i = 1, \ldots, K), \quad h_j(\boldsymbol{w}) = 0 \;\, (j = 1, \ldots, L)$

 Introduce Lagrange multipliers $\alpha_i \ge 0$ and $\beta_j$:
$\mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(\boldsymbol{w}) + \sum_{i=1}^{K} \alpha_i g_i(\boldsymbol{w}) + \sum_{j=1}^{L} \beta_j h_j(\boldsymbol{w})$

 The Lagrangian-based primal and dual problems will be
Primal: $\min_{\boldsymbol{w}} \max_{\boldsymbol{\alpha} \ge 0, \boldsymbol{\beta}} \mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha}, \boldsymbol{\beta})$
Dual: $\max_{\boldsymbol{\alpha} \ge 0, \boldsymbol{\beta}} \min_{\boldsymbol{w}} \mathcal{L}(\boldsymbol{w}, \boldsymbol{\alpha}, \boldsymbol{\beta})$

Some other useful optimization methods
Co-ordinate Descent (CD)
 Standard gradient descent update for $\boldsymbol{w}$: $\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} - \eta_t \boldsymbol{g}^{(t)}$
 CD: in each iteration, update only one entry (co-ordinate) of $\boldsymbol{w}$, keeping all others fixed:
$w_d^{(t+1)} = w_d^{(t)} - \eta_t g_d^{(t)}$
where $g_d^{(t)} = \partial L / \partial w_d$ is the partial derivative w.r.t. the $d$-th element of the vector $\boldsymbol{w}$ (or the $d$-th element of the gradient vector $\boldsymbol{g}$)

 The cost of each update is now independent of $D$

 In each iteration, the co-ordinate to update can be chosen uniformly at random or in cyclic order
 Instead of updating a single co-ordinate, can also update "blocks" of co-ordinates
 Called Block co-ordinate descent (BCD)
 To avoid the cost of gradient computation, can cache previous computations (see the sketch below)
 Recall that gradient computations may have terms like $\boldsymbol{w}^\top \boldsymbol{x}_n$; if just one co-ordinate of $\boldsymbol{w}$ changes, we should avoid computing the new $\boldsymbol{w}^\top \boldsymbol{x}_n$ from scratch
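A sketch of co-ordinate descent for the same illustrative least-squares loss, with the residual vector cached so that each update avoids recomputing $\boldsymbol{w}^\top \boldsymbol{x}_n$ from scratch:

```python
import numpy as np

def coordinate_descent(X, y, n_iters=1000, eta=0.01):
    """Co-ordinate descent for least squares, updating one entry of w per iteration.
    The residual vector is cached so each update costs O(N), independent of D."""
    N, D = X.shape
    w = np.zeros(D)
    r = y - X @ w                      # cached residuals (initially just y, since w = 0)
    for t in range(n_iters):
        d = t % D                      # cyclic choice of co-ordinate (could be random)
        g_d = -2 * X[:, d] @ r / N     # partial derivative w.r.t. w_d
        step = eta * g_d
        w[d] -= step
        r += step * X[:, d]            # update the cache: only w_d changed
    return w
```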
Alternating Optimization (ALT-OPT)
 Consider optimization problems with several variables, say two variables $\boldsymbol{w}_1$ and $\boldsymbol{w}_2$:
$\min_{\boldsymbol{w}_1, \boldsymbol{w}_2} L(\boldsymbol{w}_1, \boldsymbol{w}_2)$

 Often, this "joint" optimization is hard/impossible to solve directly

 We can take an alternating optimization approach to solve such problems: fix $\boldsymbol{w}_2$ and optimize over $\boldsymbol{w}_1$, then fix $\boldsymbol{w}_1$ and optimize over $\boldsymbol{w}_2$, and repeat until convergence

 Usually converges to a local optimum. But very, very useful. Will see examples later
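A schematic sketch of the alternating loop; solve_w1 and solve_w2 are hypothetical user-supplied subroutines that minimize the objective over one variable with the other held fixed:

```python
def alt_opt(w1, w2, solve_w1, solve_w2, n_iters=100):
    """Alternating optimization: each step solves for one variable, the other fixed."""
    for t in range(n_iters):
        w1 = solve_w1(w2)   # minimize L(w1, w2) over w1, keeping w2 fixed
        w2 = solve_w2(w1)   # minimize L(w1, w2) over w2, keeping w1 fixed
    return w1, w2
```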
Newton's Method
 Unlike GD and its variants, Newton's method uses second-order information (the second derivative, a.k.a. the Hessian)
 At each point $\boldsymbol{w}^{(t)}$, minimize the quadratic (second-order) approximation of $L(\boldsymbol{w})$ around $\boldsymbol{w}^{(t)}$; exercise: show that this yields the update
$\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} - \left[ \nabla^2 L(\boldsymbol{w}^{(t)}) \right]^{-1} \nabla L(\boldsymbol{w}^{(t)})$

 Converges much faster than GD (very fast for convex functions). Also, no "learning rate". But the per-iteration cost is higher due to Hessian computation and inversion
 Faster versions of Newton's method also exist, e.g., those based on approximating the Hessian using previous gradients (see L-BFGS, which is a popular method)
[Figure: Newton's method minimizing $L(\boldsymbol{w})$, with iterates starting at $\boldsymbol{w}^{(1)}$ and converging to $\boldsymbol{w}_{opt}$]
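A sketch of the Newton update; grad and hess are assumed callables returning the gradient vector and Hessian matrix at a point, and the linear system is solved rather than explicitly inverting the Hessian:

```python
import numpy as np

def newtons_method(grad, hess, w0, n_iters=20):
    """Newton's method: w <- w - H^{-1} g, using second-order information."""
    w = w0
    for t in range(n_iters):
        g = grad(w)
        H = hess(w)
        w = w - np.linalg.solve(H, g)   # solve H p = g instead of inverting H
    return w
```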
Coming up next
 Some practical issues in optimization for ML
 Wrapping up the discussion of optimization techniques
 Probabilistic models for ML
