
MACHINE LEARNING (CS 403/603)

Introduction to Gradient Descent

Dr. Puneet Gupta


Derivatives
The magnitude of the derivative at a point is the rate of change of the function at that point.

A positive derivative means f increases if we increase the value of x by a very small amount; a negative derivative means f decreases.

Understanding how f changes its value as we change x is helpful for understanding optimization (minimization/maximization) algorithms.

The derivative becomes zero at stationary points (optima or saddle points):


● The function becomes “flat” (f’(x) = 0): f barely changes if we change x by a very small amount at such points.
● These are the points where the function has its maxima/minima (unless they are saddle points). At a saddle point the derivative is zero, but the point is neither a minimum nor a maximum. Saddle points are common in deep learning models.
Derivatives
For a multivariate function, each element of the gradient vector tells us how much f will change if we move a little along the corresponding direction.

Optima and saddle points are defined similarly to the one-dimensional case: the properties we saw for the one-dimensional case must be satisfied along all directions.

The second derivative in this case is known as the Hessian.
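For reference, a minimal sketch of this notation in LaTeX, for a function f of a D-dimensional input x (the symbol D for the input dimension is an assumption, not from the slides):

```latex
\[
\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_D} \right]^{\top},
\qquad
\mathbf{H} = \nabla^2 f(\mathbf{x}), \quad H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}
\]
```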


Analyzing optimal solutions for loss functions
We can find the optima (minima) of a loss function by visualizing it as a function of the weights, in terms of curves or surfaces. For convex functions, the local minima and the global minimum coincide; for non-convex functions, they can differ.

[Figure: loss as a function of weight W for a convex function (one global optimum) and a non-convex function (a local optimum distinct from the global optimum).]
Convex set
A subset C is convex if, for all x and y in C, the line segment connecting x
and y is included in C.
This means that the affine combination (1 − t)x + ty belongs to C, for all x
and y in C, and t in the interval [0, 1].

[Figure: examples of a convex set and a non-convex set.]


Convex Functions
f is convex if, for all vectors x, y ∈ C and β ∈ [0, 1]:

f(βx + (1 − β)y) ≤ βf(x) + (1 − β)f(y)

The domain of a convex function must be a convex set. Intuitively, a function f(x) is convex if all of its chords lie above the function everywhere.

[Figure: a convex function with the chord from (x, f(x)) to (y, f(y)); the chord value βf(x) + (1 − β)f(y) lies above the function value f(βx + (1 − β)y).]
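A minimal numerical check of this chord definition, using f(x) = x² and sin(x) purely as illustrative test functions (they are not from the slides):

```python
import numpy as np

def chord_check(f, x, y, n=11):
    """Check f(b*x + (1-b)*y) <= b*f(x) + (1-b)*f(y) on a grid of b in [0, 1]."""
    betas = np.linspace(0.0, 1.0, n)
    lhs = f(betas * x + (1 - betas) * y)       # function values along the segment
    rhs = betas * f(x) + (1 - betas) * f(y)    # corresponding chord values
    return np.all(lhs <= rhs + 1e-12)          # tolerance for floating-point error

print(chord_check(lambda x: x**2, -3.0, 5.0))  # True: x^2 is convex
print(chord_check(np.sin, 0.0, np.pi))         # False: sin is not convex on [0, pi]
```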

Examples

[Figure: example loss curves as functions of W, one for a convex function and one for a non-convex function.]
Convex Functions
Conditions to test the convexity of a differentiable function:

● First-order convexity: the graph of the function f must lie above all of its tangents.
● Second-order convexity: the second derivative, or Hessian (if it exists), must be positive semi-definite. In one dimension, f is convex if and only if f″(x) ≥ 0 for all x.

[Figure: a convex function f(w) with a tangent at (x, f(x)) lying below the graph.]

Some important points to remember:

● All linear and affine functions (e.g., ax + b) are convex
● exp(ax) is convex for x ∈ R, for any a ∈ R
● log(x) is concave (not convex) for x > 0
● x^a is convex for x > 0 when a ≥ 1 or a ≤ 0, and concave for 0 ≤ a ≤ 1
● A non-negative weighted sum of convex functions is also a convex function
● Affine transformations preserve convexity: if f(x) is convex, then g(x) = f(ax + b) is also convex
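These properties can be sanity-checked numerically via the second-order condition; a small sketch, where the finite-difference step and the test grid are arbitrary assumptions:

```python
import numpy as np

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

xs = np.linspace(0.1, 5.0, 50)  # positive grid, since log(x) and x^a need x > 0
print(np.all(second_derivative(np.exp, xs) >= 0))            # True: exp(x) convex
print(np.all(second_derivative(np.log, xs) <= 0))            # True: log(x) concave
print(np.all(second_derivative(lambda x: x**1.5, xs) >= 0))  # True: a >= 1, convex
print(np.all(second_derivative(lambda x: x**0.5, xs) <= 0))  # True: 0<=a<=1, concave
```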
First-order optimality condition
Usually, ML problems are non-convex in nature; they can be solved by non-convex optimization, which is an active research area and outside the scope of this course.

The approach we have used so far: the gradient g must be zero at every optimum (local or global). This is known as the first-order optimality condition. That is, set g = 0 and solve for the unknown parameters (like w for hyperplane-based learning) to find the optima.

This may or may not yield a closed-form solution: it does for linear regression, but not for logistic regression. Even when no closed-form solution exists, the gradient g is still useful in iterative optimization methods.

[Figure: a non-convex loss curve with stationary points labeled “Optimal Solution” and “Wrong Solution”.]
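As an illustration of the closed-form case, a minimal numpy sketch of linear regression solved by setting the gradient to zero (the synthetic data and variable names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 examples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)  # targets with small noise

# The gradient of ||y - Xw||^2 is -2 X^T (y - Xw); setting g = 0 gives
# the normal equations X^T X w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                  # close to w_true
```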
Iterative Optimization using gradients: Gradient Descent
A first-order method (it utilizes only the gradient g of the objective).

Basic idea: start at some location w(0) and move in the opposite direction of the gradient, w(t+1) = w(t) − ηt gt.

By how much? Till when?

● ηt, known as the learning rate, can be constant or can vary at each time step
● The effective step size (how much w moves) depends on both ηt and the current gradient gt

[Figure: descent direction for a negative vs. a positive gradient.]

When to stop: many criteria, e.g., the gradients become negligible, or the validation error starts increasing.

What happens for a convex function?

● Gradient descent is guaranteed to converge to a local optimum (which is the global optimum for convex functions)

[Figure: iterates w(0), w(1), w(2), w(3) approaching w(opt) from both sides of a convex loss curve.]
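A minimal Python sketch of this loop, assuming a toy quadratic objective and an arbitrary constant learning rate:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, max_iters=1000, tol=1e-6):
    """Iterate w(t+1) = w(t) - eta * g(t); stop once the gradient is negligible."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad(w)
        if np.linalg.norm(g) < tol:   # stopping criterion from the slide
            break
        w = w - eta * g               # step in the opposite direction of the gradient
    return w

# Example: minimize f(w) = (w - 3)^2, whose gradient is 2 (w - 3).
w_opt = gradient_descent(lambda w: 2 * (w - 3.0), w0=[0.0])
print(w_opt)                          # approximately [3.0]
```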
Importance of Weight Initializations
A good initialization w(0) plays a crucial role: we may get trapped in a bad local optimum for a non-convex function.

Remedy: run the optimization multiple times with different initializations and select the best one (see the sketch below).

[Figure: a non-convex loss curve where a bad initialization leads to a local optimum while a good initialization reaches the global optimum.]
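A sketch of this multiple-restart remedy, with an illustrative non-convex function (the function and all settings are assumptions, not from the slides):

```python
import numpy as np

def gd(grad, w0, eta=0.01, iters=500):
    """Plain gradient descent from a given starting point."""
    w = float(w0)
    for _ in range(iters):
        w -= eta * grad(w)
    return w

f = lambda w: w**4 - 3 * w**2 + w        # non-convex: two local minima
grad = lambda w: 4 * w**3 - 6 * w + 1

rng = np.random.default_rng(0)
starts = rng.uniform(-2.0, 2.0, size=10)      # several random initializations
candidates = [gd(grad, w0) for w0 in starts]
best = min(candidates, key=f)                 # keep the lowest-loss solution
print(best, f(best))                          # near the global optimum w ≈ -1.3
```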


Importance of Learning Rates

Problems with small learning rates:
● May take too long to converge
● May not be able to “cross” bad optima and reach towards good optima

Problems with large learning rates:
● May keep oscillating
● May jump from a good region to a bad region

The learning rate can be defined as:

● Constant (requires proper tuning for good convergence and proper optima estimation)
● Adaptively decreasing as the time step increases (e.g., divide η by a constant factor; see the sketch below)
● Adaptive learning rates (e.g., using methods such as Adagrad or Adam; revisited later)
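For the second option, one common decay rule looks like the following sketch (the constants η0 = 0.5 and decay = 0.1 are assumed for illustration):

```python
def eta(t, eta0=0.5, decay=0.1):
    """eta_t = eta0 / (1 + decay * t): large early steps, smaller ones later."""
    return eta0 / (1 + decay * t)

print([round(eta(t), 4) for t in range(5)])  # [0.5, 0.4545, 0.4167, 0.3846, 0.3571]
```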
Stochastic Gradient Descent
Mini-batch SGD
SGD uses a single example to approximate the gradient. This is a reasonable estimate of g, but one with large variance.

g = (1/N) Σᵢ gᵢ, the average over all N training examples, is the actual gradient; gᵢ, the gradient of the loss on example i alone, is a stochastic gradient.

One way to control the variance of the gradient approximation is mini-batch SGD, where a mini-batch containing more than one sample is used to approximate the gradient. The actual gradient is approximated in mini-batch SGD using

g ≈ (1/B) Σᵢ gᵢ (sum over the examples in the mini-batch),

where B is the batch size.
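A minimal mini-batch SGD sketch for linear regression with squared loss, using the batch-averaged gradient above (the batch size, learning rate, and synthetic data are assumptions):

```python
import numpy as np

def minibatch_sgd(X, y, B=8, eta=0.1, epochs=50, seed=0):
    """Approximate g by averaging per-example gradients over batches of size B."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))             # reshuffle examples each epoch
        for start in range(0, len(y), B):
            batch = idx[start:start + B]
            Xb, yb = X[batch], y[batch]
            g = -2 * Xb.T @ (yb - Xb @ w) / len(batch)  # (1/B) sum of g_i
            w = w - eta * g
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=200)
print(minibatch_sgd(X, y))                        # approximately [2.0, -1.0]
```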


Gradient Descent: Examples
[Figure slide; example runs not reproduced.]
Gradient Descent: Observations
[Figure slide; observations not reproduced.]
Summary
● Today, we looked at GD and SGD, which are applicable when the function is differentiable.
● We will look at several other ways to compute and use gradients: subgradient descent can be used when the loss function is not differentiable, alternating optimization when several dependent variables need to be evaluated, projected gradient descent when the variables are constrained, and so on.
