Module 2 Part1new
DEEP LEARNING
Module 2, Part I
SYLLABUS
Dropout, Parameter initialization.
OPTIMIZATION TECHNIQUES
2. STOCHASTIC GRADIENT DESCENT (SGD)
One of the disadvantages of GD is that the entire dataset must be loaded for every gradient computation, which makes it computationally intensive.
If we consider only a single data point for computing the loss,
Loss = (y - ŷ)^2,
then this is SGD.
SGD iterates one observation at a time.
MINI-BATCH SGD:-
It is a cross-over between GD and SGD.
The dataset is divided into batches and the gradient is computed on each batch.
GD converges smoothly, while SGD converges with some noise.
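The three variants above differ only in how much data feeds each gradient step. A minimal sketch on an assumed toy 1-D linear regression (all data and names here are illustrative, not from the slides):

```python
import numpy as np

# Toy 1-D regression with per-sample loss (y - w*x)^2; true slope is 3.
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + 0.1 * rng.normal(size=100)

def gradient(w, xb, yb):
    """Mean gradient of (y - w*x)^2 over a batch."""
    return np.mean(-2.0 * xb * (yb - w * xb))

def train(w, lr, batch_size, steps):
    n = len(X)
    for _ in range(steps):
        if batch_size >= n:
            idx = np.arange(n)          # GD: the whole dataset every step
        else:
            idx = rng.choice(n, size=batch_size, replace=False)
        w -= lr * gradient(w, X[idx], y[idx])
    return w

w_gd   = train(0.0, 0.1,  100, 200)   # full-batch GD: smooth convergence
w_sgd  = train(0.0, 0.05,   1, 400)   # SGD: one observation per step, noisy
w_mini = train(0.0, 0.1,   16, 200)   # mini-batch: a cross-over of the two
```

All three estimates approach the true slope 3; SGD hovers around it with the most noise, GD with the least.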
3. SGD WITH MOMENTUM
Even with mini-batch SGD there is noise, and convergence takes too much time.
All the other disadvantages of GD, such as falling into a local minimum, also affect SGD.
As we reach a saddle point, the update step size becomes very small, approaching 0.
SGD with momentum works with the concept of EWMA (Exponentially Weighted Moving Average),
which is used to find the trend in a time series. The formula of EWMA is
Vt = β·Vt-1 + (1 - β)·θt
The standard weight update rule is
Wnew = Wold - η·∂L/∂Wold
Modified with respect to EWMA:
Wnew = Wold - [β·Vt-1 + η·∂L/∂Wold]
where Vt-1 = η·[(∂L/∂W)t-1 + β·(∂L/∂W)t-2 + β^2·(∂L/∂W)t-3 + …]
β·Vt-1 is the momentum term.
This will reduce the noise while trying to reach the global
minimum
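The update rule above can be sketched on an assumed toy loss L(w) = (w - 5)^2 (an illustrative example, not from the slides); v accumulates the EWMA-style history of gradients and β·v is the momentum term:

```python
# Minimal sketch of SGD with momentum on the toy loss L(w) = (w - 5)^2.
def momentum_minimize(w=0.0, lr=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 5.0)      # dL/dw of the toy loss
        v = beta * v + lr * grad    # Vt = beta*Vt-1 + eta*dL/dw
        w = w - v                   # Wnew = Wold - [beta*Vt-1 + eta*dL/dw]
    return w
```

The iterate spirals into the minimum at w = 5; the momentum term smooths the noisy steps but can overshoot before settling.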
4. NESTEROV ACCELERATED GRADIENT (NAG):-
When the learning rate η is relatively large, Nesterov Accelerated Gradient allows a larger decay rate α than the Momentum method, while preventing oscillations.
The theorem also shows that the Momentum method and Nesterov Accelerated Gradient become equivalent when η is small.
With momentum, the actual movement is large due to the added momentum term:
we cross the actual minimum point and have to come back to reach it,
so SGD with momentum oscillates around the minimum point.
(MBGD is faster than GD.)
NAG's remedy: look before you leap.
In NAG, we calculate the gradient at a lookahead point and then use it to update the weight.
In momentum-based GD the weight update rule is
Wnew = Wold - [β·Vt-1 + η·∂L/∂Wold]
i.e., the update of W = history of velocity + gradient at the current point.
In NAG,
the update of W = history of velocity + gradient at a lookahead point:
WLA = Wold - β·Vt-1
Vt = β·Vt-1 + η·∂L/∂WLA
Wnew = Wold - Vt
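The three NAG equations can be sketched on the same assumed toy loss L(w) = (w - 5)^2; the only change from plain momentum is where the gradient is evaluated:

```python
# Minimal sketch of NAG on the toy loss L(w) = (w - 5)^2 (assumed example).
def nag_minimize(w=0.0, lr=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        w_la = w - beta * v           # lookahead: WLA = Wold - beta*Vt-1
        grad = 2.0 * (w_la - 5.0)     # dL/dw evaluated at WLA, not at Wold
        v = beta * v + lr * grad      # Vt = beta*Vt-1 + eta*dL/dWLA
        w = w - v                     # Wnew = Wold - Vt
    return w
```

Evaluating the gradient at the lookahead point damps the overshoot that plain momentum exhibits near the minimum.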
5. ADAGRAD OPTIMIZER (Adaptive Gradient Optimizer)
In Adagrad, each weight has a different learning rate (η).
The weight update rule is
Wt = Wt-1 - η·∂L/∂Wt-1
where Wt = Wnew and Wt-1 = Wold.
In SGD and mini-batch SGD, the value of the learning rate is the same for every weight (parameter).
In the Adagrad optimizer, the learning rate gets modified based on how frequently a parameter is updated during training: frequently updated parameters receive smaller effective learning rates.
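A minimal sketch of the standard Adagrad rule on an assumed toy loss L(w) = (w - 5)^2: the accumulator G sums all past squared gradients, so the effective learning rate η/sqrt(G + ε) shrinks as a parameter keeps getting updated.

```python
import math

# Minimal sketch of Adagrad on the toy loss L(w) = (w - 5)^2 (assumed example).
def adagrad_minimize(w=0.0, lr=1.0, eps=1e-8, steps=1000):
    G = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 5.0)
        G += grad * grad                       # accumulate squared gradients
        w -= (lr / math.sqrt(G + eps)) * grad  # per-parameter shrinking step
    return w
```

Because G only ever grows, Adagrad's step size decays monotonically, which is the weakness RMSProp addresses next.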
6. RMSPROP (ROOT MEAN SQUARED PROPAGATION)
The RMSProp algorithm uses exponential smoothing with parameter ρ ∈ (0, 1) in its running estimates of the gradients.
Here, for the adaptation of the step size, instead of considering a decaying average of all past gradients, only the recent past gradients are considered.
It forgets early gradients and focuses on the most recently observed partial gradients.
The running averages of squared gradients are
Sdwt = β·Sdwt-1 + (1 - β)·(∂L/∂wt)^2
Sdbt = β·Sdbt-1 + (1 - β)·(∂L/∂bt)^2
β is selected as 0.95 in most scenarios.
So the weight updates are
wt = wt-1 - η'·∂L/∂wt-1
bt = bt-1 - η'·∂L/∂bt-1
where η' = η / sqrt(Sdwt + ε) (and correspondingly with Sdbt for the bias).
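The Sdw recursion and the η' step size can be sketched on an assumed toy loss L(w) = (w - 5)^2 (illustrative only, not from the slides):

```python
import math

# Minimal sketch of RMSProp on the toy loss L(w) = (w - 5)^2, using the
# slide's notation: Sdw is an EWMA of squared gradients and the effective
# step size is eta' = eta / sqrt(Sdw + eps).
def rmsprop_minimize(w=0.0, lr=0.01, beta=0.95, eps=1e-8, steps=1000):
    s_dw = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 5.0)
        s_dw = beta * s_dw + (1.0 - beta) * grad * grad  # Sdw_t recursion
        w -= (lr / math.sqrt(s_dw + eps)) * grad         # step with eta'
    return w
```

Unlike Adagrad's ever-growing accumulator, the EWMA lets the step size recover when gradients shrink.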
7. ADAM OPTIMIZER
The first moment mₜ = ꞵ₁·mₜ₋₁ + (1 - ꞵ₁)·gₜ expands to mₜ = (1 - ꞵ₁)·(gₜ + ꞵ₁·gₜ₋₁ + ꞵ₁^2·gₜ₋₂ + …).
Substituting the sum of the finite geometric series (treating each gᵢ as approximately equal):
mₜ ≈ (1 - ꞵ₁)·gᵢ·(1 - ꞵ₁^t) / (1 - ꞵ₁)
   = (1 - ꞵ₁^t)·gᵢ
E(mₜ) = E[gᵢ]·(1 - ꞵ₁^t) + ξ
ξ = error due to the approximation
So the bias-corrected first moment is m̂ₜ = mₜ / (1 - ꞵ₁^t).
Second-moment bias correction is done in the same way:
v̂ₜ = vₜ / (1 - ꞵ₂^t)
Hence the weight update is
wₜ = wₜ₋₁ - η·m̂ₜ / sqrt(v̂ₜ + ε)
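The full Adam loop, with both moments and their bias corrections, can be sketched on an assumed toy loss L(w) = (w - 5)^2 (illustrative example):

```python
import math

# Minimal sketch of Adam on the toy loss L(w) = (w - 5)^2 (assumed example).
# m and v are EWMAs of the gradient and squared gradient; both are
# bias-corrected by 1/(1 - beta^t) before the update.
def adam_minimize(w=0.0, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=2000):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = 2.0 * (w - 5.0)
        m = b1 * m + (1.0 - b1) * g          # first moment m_t
        v = b2 * v + (1.0 - b2) * g * g      # second moment v_t
        m_hat = m / (1.0 - b1 ** t)          # bias-corrected m_t
        v_hat = v / (1.0 - b2 ** t)          # bias-corrected v_t
        w -= lr * m_hat / math.sqrt(v_hat + eps)   # slide's update form
    return w
```

The bias correction matters most in the first few steps, when m and v are still warming up from their zero initialization.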
REGULARIZATION TECHNIQUES
If there is a data-generating function F1 for the training data, it should also work for other datasets that are unseen.
That means the function F1 has to generalize without overcomplicating the model.
There should be a function F2 that mimics F1 but is more generalized.
Parameter Norm Penalties:-
The technique is to add a parameter norm penalty Ω(θ) to the objective function J:
J̃(θ; X, y) = J(θ; X, y) + α·Ω(θ)
J - objective function
θ - parameters, such as the weights
X - input features
y - labels
Ω - parameter norm penalty
α - hyperparameter that controls the relative contribution of the penalty term in regularizing the model
The value of α lies in [0, ∞).
If α is 0, there is no regularization.
The two methods of regularization are L1 and L2 regularization.
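The penalized objective J̃ = J + α·Ω can be sketched numerically; the data, function name, and the choice of an L2-style Ω(w) = 0.5·||w||² below are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the penalized objective J~(theta; X, y) = J + alpha*Omega
# with a toy squared-error J and an L2 penalty Omega(w) = 0.5*||w||^2.
def penalized_loss(w, X, y, alpha):
    J = np.mean((y - X @ w) ** 2)      # base objective J(theta; X, y)
    omega = 0.5 * np.sum(w ** 2)       # parameter norm penalty Omega(theta)
    return J + alpha * omega

X = np.eye(2)
y = np.array([1.0, 2.0])
w = np.array([1.0, 2.0])                    # fits the data exactly: J = 0
loss_unreg = penalized_loss(w, X, y, 0.0)   # alpha = 0: no regularization
loss_reg   = penalized_loss(w, X, y, 0.1)   # adds 0.1 * 0.5 * (1 + 4)
```

With α = 0 the penalty vanishes; increasing α trades training fit for smaller weights.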
L1-Regularization
It is also known as LASSO regression (Least Absolute Shrinkage and Selection Operator).
Here we add the 'absolute value of magnitude' of the coefficients as a penalty term to the loss function.
Lasso shrinks the less important features' coefficients to zero, thus removing some features: feature selection.
L1 regularization is robust in dealing with outliers.
It creates sparsity in the solution, which means less important features or noise terms will have zero coefficients.
L1-regularization leads to sparse parameter learning:
– Zero values of wi can be dropped.
– This is equivalent to dropping edges from the neural network.
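The sparsity/feature-selection effect can be sketched with proximal gradient descent (ISTA) on a tiny least-squares problem with orthogonal features; all data and constants here are illustrative assumptions, and the soft-threshold step is the proximal operator of the L1 penalty:

```python
import numpy as np

# The soft-threshold step snaps small coefficients exactly to zero.
def soft_threshold(w, t):
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

X = np.kron(np.ones((10, 1)), np.eye(3))   # 30x3 design, orthogonal columns
true_w = np.array([2.0, 0.05, 0.0])        # only the first feature matters
y = X @ true_w

w = np.zeros(3)
lr, alpha = 0.1, 0.5
for _ in range(500):
    grad = -2.0 * X.T @ (y - X @ w) / len(y)       # gradient of the MSE part
    w = soft_threshold(w - lr * grad, lr * alpha)  # prox step for alpha*|w|_1
```

The two small coefficients end up exactly zero (features dropped), while the large one survives, shrunk from 2.0 toward 1.25 by the penalty.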
L2-regularization
Also called weight decay or ridge regression.
The regularization term Ω is defined as the Euclidean norm (L2 norm) of the weight matrices, which is the sum over all squared weight values of a weight matrix.
This is weighted by a scalar value α/2:
L̂(w) = (α/2)·||w||₂^2 + L(w)
     = (α/2)·ΣΣ wij^2 + L(w)
α is called the regularization rate (an additional hyperparameter).
The gradient of the new loss function can be written as
∇w L̂(w) = α·w + ∇w L(w)
The new update rule is
Wnew = Wold - η·(α·Wold + ∇w L(Wold))
This is gradient descent with L2 regularization, which can be written as
Wnew = (1 - ηα)·Wold - η·∇w L(Wold)
OR
Wnew = (1 - ηα)·Wold - η·∂L/∂Wold
i.e., an additional shrinkage is subtracted from the current weights at every step.
L2 regularization with parameter λ is equivalent to adding Gaussian noise with variance λ to the input.
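The update rule Wnew = (1 - ηα)·Wold - η·∂L/∂Wold can be sketched on an assumed toy loss L(w) = (w - 3)^2 (illustrative example); the (1 - ηα) factor is the "decay" that shrinks the weights a little at every step:

```python
import numpy as np

# Minimal sketch of one L2/weight-decay gradient step.
def l2_sgd_step(w, grad, lr, alpha):
    return (1.0 - lr * alpha) * w - lr * grad

w = np.array([0.0])
for _ in range(500):
    grad = 2.0 * (w - 3.0)                       # gradient of the bare loss
    w = l2_sgd_step(w, grad, lr=0.1, alpha=0.1)
```

The iterate converges to the regularized optimum of 2(w - 3) + αw = 0, i.e. w = 6/2.1 ≈ 2.857, slightly below the unregularized minimum at 3: the penalty pulls weights toward zero.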