EDA Lecture Module 4
Email: [email protected]
Handphone No.: +91-9944226963
Module 4: Optimization
Introduction to Optimization
Gradient Descent
Variants of Gradient Descent
Momentum Optimizer
Nesterov Accelerated Gradient
Adagrad
Adadelta
RMSProp
Adam
AMSGrad
Introduction to Optimization
Optimization is the process of maximizing or minimizing a real
function by systematically choosing input values from an allowed set
of values and computing the value of the function.
It refers to the use of specific methods to determine the best solution from all feasible solutions, for example, finding the best functional representation of data or the best hyperplane to classify data.
The three components of an optimization problem are: the objective function (to be minimized or maximized), the decision variables and the constraints.
Based on the type of objective function, constraints and decision variables, several types of optimization problems exist. An optimization problem can be linear or non-linear, convex or non-convex, iterative or non-iterative, etc.
Optimization is considered one of the three pillars of data science; linear algebra and statistics are the other two.
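As a concrete illustration of these three components, the following toy problem (chosen here purely for illustration, not taken from the lecture) minimizes an objective over two decision variables subject to a single constraint using scipy.optimize:

import numpy as np
from scipy.optimize import minimize

# Objective function (to be minimized)
def objective(x):
    return (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

# Constraint: x1 + x2 <= 2  (scipy expects g(x) >= 0, so pass 2 - x1 - x2)
constraints = [{"type": "ineq", "fun": lambda x: 2.0 - x[0] - x[1]}]

# Decision variables: x = (x1, x2), starting from an initial guess
x0 = np.array([0.0, 0.0])
result = minimize(objective, x0, constraints=constraints)
print(result.x)   # constrained minimizer, approximately (0.5, 1.5)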
Consider the following optimization problem (the standard formulation of the maximal margin classifier), which attempts to find the maximal margin hyperplane with margin M:

maximize M over α0, α1, ..., αp                                            (1)
subject to  Σ_{j=1}^{p} αj² = 1                                            (2)
            yi (α0 + α1 xi1 + α2 xi2 + ... + αp xip) ≥ M for all i = 1, ..., n   (3)

Here (xi1, ..., xip) and yi ∈ {−1, +1} denote the i-th training observation and its class label. Equation (1) is the objective function, equations (2) and (3) are the constraints, and α0, α1, ..., αp are the decision variables.
In general, an objective function is denoted as f(·), and the minimizer of f(·) is the same as the maximizer of −f(·).
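This relationship can be checked numerically with a small sketch (the one-dimensional function g below is chosen here for illustration and is not from the lecture): to maximize g with a routine that only minimizes, one simply minimizes −g.

import numpy as np
from scipy.optimize import minimize

# Illustrative function g with its maximum at x = 3
g = lambda x: -(x[0] - 3.0) ** 2 + 1.0

# scipy.optimize.minimize only minimizes, so the maximizer of g is found
# by minimizing -g: the minimizer of -g coincides with the maximizer of g.
result = minimize(lambda x: -g(x), x0=np.array([0.0]))
print(result.x)       # approximately [3.], the maximizer of g
print(g(result.x))    # approximately 1.0, the maximum value of g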
Gradient Descent
Gradient Descent is the most common optimization algorithm in
machine learning and deep learning.
It is a first-order, iterative optimization algorithm which takes into account only the first derivative when updating the parameters.
Each iteration involves two steps: (i) finding the (locally) steepest direction according to the first derivative of the objective function; and (ii) finding the best point along that direction. The parameters are updated in the direction opposite to the gradient of the objective function.
The learning rate α determines the convergence (i.e., the number of iterations required to reach the local minimum). It should be neither too small nor too large: a very small α leads to very slow convergence, while a very large α leads to oscillations around the minimum or may even cause divergence.
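The effect of the learning rate can be illustrated on a simple one-dimensional quadratic. The sketch below uses f(x) = x² and three illustrative learning rates; neither the function nor the values are taken from the lecture.

# Gradient descent on f(x) = x**2 (gradient 2*x) for different learning rates.
def run_gd(alpha, x0=1.0, iters=20):
    x = x0
    for _ in range(iters):
        x = x - alpha * 2 * x   # update in the direction opposite to the gradient
    return x

print(run_gd(0.01))   # too small: after 20 iterations x is still far from the minimum at 0
print(run_gd(0.5))    # well chosen: reaches the minimum of this quadratic immediately
print(run_gd(1.1))    # too large: the iterates oscillate in sign and diverge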
In each iteration k, the parameters are updated as

Xk = Xk−1 − αGk−1

where Gk−1 is the gradient of the objective function f evaluated at Xk−1 and α is the learning rate.
In this case,

f′(X) = [1 + 4x1 + 2x2,  2x1 + 6x2]^T.

In the first iteration, the direction G0 and the best point X1 are estimated as follows:

G0 = f′(X0) = [4, 4]^T  and  X1 = X0 − αG0 = [0.1, 0.1]^T.
Question 4.1
Continue the gradient descent iterations, taking the stopping criterion to be that the absolute difference between the function values in successive iterations is less than 0.005. Your answer should show the search direction and the value of the function in each iteration.
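A minimal sketch of this iteration is given below. It assumes the objective f(x1, x2) = x1 + 2x1² + 2x1x2 + 3x2² (whose gradient matches the one shown above), the starting point X0 = (0.5, 0.5) and the learning rate α = 0.1 implied by the first iteration; these values are inferred here and are not stated explicitly in the question.

import numpy as np

# Assumed objective and gradient (inferred from the gradient shown above)
def f(x):
    x1, x2 = x
    return x1 + 2 * x1**2 + 2 * x1 * x2 + 3 * x2**2

def grad_f(x):
    x1, x2 = x
    return np.array([1 + 4 * x1 + 2 * x2, 2 * x1 + 6 * x2])

alpha = 0.1                    # learning rate (inferred)
x = np.array([0.5, 0.5])       # starting point X0 (inferred)
prev_val = f(x)

for k in range(1, 100):
    g = grad_f(x)              # the search direction is -g
    x = x - alpha * g          # gradient descent update
    val = f(x)
    print(f"iteration {k}: direction {-g}, f(X{k}) = {val:.4f}")
    if abs(val - prev_val) < 0.005:   # stopping criterion from Question 4.1
        break
    prev_val = val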
Momentum Optimizer
RMSProp
Adam
Adaptive Moment Estimation (Adam) combines RMSProp and Momentum.
It incorporates the momentum term (i.e., the exponentially weighted first moment of the gradient) into the RMSProp update as follows:
wk = wk−1 − α · m̂k−1 / (√v̂k−1 + ε)
where m̂k−1 and v̂k−1 are bias-corrected versions of mk−1 (the first moment) and vk−1 (the second moment) respectively, and ε is a small constant that prevents division by zero. The first and second moments are exponentially weighted averages of the gradient and of the squared gradient:

mk = β1 mk−1 + (1 − β1) Gk
vk = β2 vk−1 + (1 − β2) Gk²

with bias corrections m̂k = mk / (1 − β1^k) and v̂k = vk / (1 − β2^k), where Gk is the gradient at iteration k and β1, β2 ∈ [0, 1) are the decay rates of the moment estimates.
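The Adam recursion above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard update, not a reference implementation from the lecture; the test function and the values β1 = 0.9, β2 = 0.999, ε = 1e−8 and α = 0.05 are assumptions chosen for the example.

import numpy as np

def adam_step(w, grad, m, v, k, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameters and updated moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (RMSProp term)
    m_hat = m / (1 - beta1**k)                # bias correction of the first moment
    v_hat = v / (1 - beta2**k)                # bias correction of the second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Example: minimize f(w) = w1**2 + 3 * w2**2, whose gradient is (2*w1, 6*w2)
w = np.array([1.0, 1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for k in range(1, 201):
    grad = np.array([2 * w[0], 6 * w[1]])
    w, m, v = adam_step(w, grad, m, v, k, alpha=0.05)
print(w)   # approaches the minimizer (0, 0)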
AMSGrad