Module 2 Part1new

This document outlines the syllabus for a deep learning course, focusing on optimization techniques such as Gradient Descent, Stochastic Gradient Descent, and various adaptive optimizers like Adam and RMSProp. It also covers regularization techniques like L1 and L2 regularization to prevent overfitting in neural networks. The content emphasizes the importance of training deep models effectively and includes mathematical formulations for various methods discussed.


CST414

DEEP LEARNING
Module-2 PART -I

SYLLABUS

Module-2 (Deep learning): Introduction to deep learning, Deep feedforward
network, Training deep models, Optimization techniques - Gradient Descent (GD),
GD with momentum, Nesterov accelerated GD, Stochastic GD, AdaGrad, RMSProp,
Adam. Regularization Techniques - L1 and L2 regularization, Early stopping,
Dataset augmentation, Parameter sharing and tying, Injecting noise at input,
Ensemble methods, Dropout, Parameter initialization.
OPTIMIZATION TECHNIQUES
1. GRADIENT DESCENT (GD)
[Figures only in the original slides; not recoverable from the text.]
2. STOCHASTIC GRADIENT DESCENT (SGD)
 One disadvantage of GD is that the entire dataset is loaded into the
system for gradient computation
 This makes it computationally intensive
 In SGD we consider only a single data point for computing the loss:
 Loss = (y - ŷ)²
 Iterates one observation at a time
 MINI-BATCH SGD:-
 It is a crossover between GD and SGD
 The dataset is divided into batches, and the gradient is computed on
each batch
 GD converges smoothly; SGD converges with some noise
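As a minimal sketch (not from the slides), mini-batch SGD for a one-parameter least-squares model ŷ = w·x might look like this; the dataset, learning rate, batch size, and epoch count are illustrative assumptions:

```python
import random

# Toy dataset generated from y = 3*x; the model is y_hat = w*x (illustrative)
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]

def minibatch_sgd(data, lr=0.05, batch_size=2, epochs=200, seed=0):
    """Mini-batch SGD on the squared loss (y - w*x)^2."""
    rng = random.Random(seed)
    w = 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)                      # new batch split each epoch
        for i in range(0, len(samples), batch_size):
            batch = samples[i:i + batch_size]
            # dL/dw averaged over the batch: mean of -2*x*(y - w*x)
            grad = sum(-2.0 * x * (y - w * x) for x, y in batch) / len(batch)
            w -= lr * grad                        # gradient step on the batch
    return w

w = minibatch_sgd(data)   # converges toward the true slope 3.0
```

With batch_size=1 this reduces to plain SGD, and with batch_size=len(data) to full-batch GD, matching the crossover described above.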
 3. SGD WITH MOMENTUM
 Even with mini-batch SGD there is noise, and convergence takes too
much time
 All the other disadvantages of GD, such as falling into a local minimum,
also affect SGD
 As we reach a saddle point, the update step size becomes very small,
approaching 0
 SGD with momentum works on the concept of EWMA - Exponentially
Weighted Moving Average
 EWMA is used to find the trend in a time series. The formula of EWMA is
 Vt = βVt-1 + (1-β)θt
 β is the weightage given to past events, 0 < β < 1
 For example, with β = 0.5, expanding Vt shows that the most recent
previous values get more importance than earlier events
 The same concept is used in SGD with momentum: if we are repeatedly
asked to go in a particular direction, we can take bigger steps in that
direction
 In GD,
 Wnew = Wold - η ∂L/∂Wold
 Modified with respect to EWMA,
 Wnew = Wold - [βVt-1 + η ∂L/∂Wold]
 where Vt-1 = ∂L/∂Wold + β(∂L/∂W)t-1 + β²(∂L/∂W)t-2 + ...
 βVt-1 is the momentum term
 This reduces the noise while trying to reach the global minimum
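The momentum update above can be sketched in a few lines; the learning rate, β, and the quadratic test loss L(w) = w² are illustrative assumptions, not from the slides:

```python
def sgd_momentum_step(w, v_prev, grad, lr=0.1, beta=0.9):
    """One momentum update: V_t = beta*V_{t-1} + lr*grad, W_new = W_old - V_t."""
    v = beta * v_prev + lr * grad   # EWMA-style accumulation of past gradients
    return w - v, v

# Minimize L(w) = w^2 (gradient 2w) starting from w = 5
w, v = 5.0, 0.0
for _ in range(300):
    w, v = sgd_momentum_step(w, v, 2.0 * w)
# w is now very close to the minimum at 0
```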
4. NESTEROV ACCELERATED GD:-
When the learning rate η is relatively large, Nesterov Accelerated
Gradient allows a larger decay rate β than the momentum method,
while preventing oscillations.
Both the momentum method and Nesterov Accelerated Gradient
become equivalent when η is small.
 With momentum, the actual movement is large due to the added
momentum; we cross the actual minimum point and have to come
back to reach it
 SGD with momentum oscillates around the minimum point
 MBGD is faster than GD
 "Look before you leap"
 In NAG, we calculate the gradient at a lookahead point and then use it
to update the weight
 In momentum-based MBGD, the weight update rule is
 Wnew = Wold - [βVt-1 + η ∂L/∂Wold]
 i.e., the update is based on the history of velocity + the gradient at the
current point
 In NAG, the update is based on the history of velocity + the gradient at
a lookahead point:
 WLA = Wold - βVt-1
 Vt = βVt-1 + η ∂L/∂WLA
 Wnew = Wold - Vt
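The three NAG equations above can be sketched directly; the lr/β values and the quadratic test function are illustrative assumptions:

```python
def nag_step(w, v_prev, grad_fn, lr=0.1, beta=0.9):
    """Nesterov step: the gradient is evaluated at the lookahead point W_LA."""
    w_la = w - beta * v_prev                # W_LA = W_old - beta*V_{t-1}
    v = beta * v_prev + lr * grad_fn(w_la)  # V_t = beta*V_{t-1} + lr*dL/dW_LA
    return w - v, v                         # W_new = W_old - V_t

# "Look before you leap" on L(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for _ in range(300):
    w, v = nag_step(w, v, lambda x: 2.0 * x)
```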
5. ADAGRAD OPTIMIZER (Adaptive Gradient Optimizer)
 Each weight has a different learning rate (η)
 Weight update rule,
 Wt = Wt-1 - η ∂L/∂Wt-1
 Wt = Wnew, Wt-1 = Wold
 In SGD and mini-batch SGD, the value of the learning rate is the same
for each weight or parameter
 In the ADAGRAD optimizer, the learning rate gets modified based on
how frequently a parameter gets updated during training
 Wt = Wt-1 - η't-1 ∂L/∂Wt-1
 where η't = η/sqrt(αt + ε), and αt is the sum of the squared gradients
over all past iterations
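A minimal sketch of the per-parameter AdaGrad rule, assuming the standard accumulator αt = Σ(∂L/∂W)² (the learning rate and quadratic test function below are illustrative):

```python
def adagrad_step(w, acc, grad, lr=0.5, eps=1e-8):
    """AdaGrad: accumulate squared gradients; effective rate = lr/sqrt(acc)."""
    acc = acc + grad ** 2                      # alpha_t grows monotonically
    w = w - (lr / (acc + eps) ** 0.5) * grad   # frequent updates shrink the step
    return w, acc

# Minimize L(w) = w^2; the step size shrinks automatically as acc grows
w, acc = 5.0, 0.0
for _ in range(1000):
    w, acc = adagrad_step(w, acc, 2.0 * w)
```

Because acc only ever grows, the effective learning rate keeps shrinking, which is exactly the weakness RMSProp addresses in the next section.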
6. RMSPROP (ROOT MEAN SQUARED PROPAGATION)
The RMSProp algorithm uses exponential smoothing with
parameter β ∈ (0, 1) in the relative estimation of the gradients.
For the adaptation of the step size, instead of considering a
decaying average of all gradients, only partial (recent) past
gradients are considered.
It forgets early gradients and focuses on the most recently
observed partial gradients.
In ADAGRAD, α(t) accumulates the square of the gradients of
all past iterations. Hence there is a chance that it will grow to a
very high number, so the learning rate will be reduced to a very
low value.
In RMSProp, to avoid this issue, η' is computed from a weighted
average in place of α(t):
 Sdwt = βSdwt-1 + (1-β)(∂L/∂Wt)²
 β is selected as 0.95 in most scenarios
 So the weight update is
 Wt = Wt-1 - η' ∂L/∂Wt-1, where η' = η/sqrt(Sdwt + ε)
 The bias can also be updated (db):
 Sdbt = βSdbt-1 + (1-β)(∂L/∂bt)²
 bt = bt-1 - η' ∂L/∂bt-1
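The Sdw recursion and update above, as a minimal sketch (β = 0.95 per the slide; the learning rate and quadratic test function are illustrative assumptions):

```python
def rmsprop_step(w, s, grad, lr=0.01, beta=0.95, eps=1e-8):
    """RMSProp: Sdw_t = beta*Sdw_{t-1} + (1-beta)*grad^2; eta' = lr/sqrt(Sdw_t)."""
    s = beta * s + (1 - beta) * grad ** 2     # EWMA forgets early gradients
    w = w - (lr / (s + eps) ** 0.5) * grad    # unlike AdaGrad, s can also shrink
    return w, s

# Minimize L(w) = w^2; s tracks only recent squared gradients
w, s = 5.0, 0.0
for _ in range(2000):
    w, s = rmsprop_step(w, s, 2.0 * w)
```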
7. ADAM OPTIMIZER
 Adaptive Moment Estimation
 It is the combination of SGD with momentum (on mini-batches) and
RMSProp
 With momentum-based MBGD, we achieved smoothing while
converging to the minimum
 In RMSProp, we could modify the learning rate
 Adam utilizes the first and second moments of the gradient to adapt
the learning rate for each weight of the neural network
 mt = moving average of the gradient (1st moment)
 vt = moving average of the squared gradient (2nd moment)
 To estimate the moments at time t, ADAM uses the following:
 mt = β₁mt-1 + (1-β₁)gt ..................... momentum
 vt = β₂vt-1 + (1-β₂)gt² ..................... RMSProp
 Finally, the weight update is
 Wt = Wt-1 - η mt/sqrt(vt + ε)
 BIAS CORRECTION IN ADAM:-
We want E[mt] ≈ E[gt] and E[vt] ≈ E[gt²]
When we start the algorithm, m₀ = 0 and v₀ = 0, i.e., the
estimators are biased towards zero
Since m₀ = 0,
m₁ = β₁m₀ + (1-β₁)g₁
   = (1-β₁)g₁
m₂ = β₁m₁ + (1-β₁)g₂
   = β₁(1-β₁)g₁ + (1-β₁)g₂
m₃ = β₁m₂ + (1-β₁)g₃
   = β₁[β₁(1-β₁)g₁ + (1-β₁)g₂] + (1-β₁)g₃
   = β₁²(1-β₁)g₁ + β₁(1-β₁)g₂ + (1-β₁)g₃
Finally we can write
mₜ = (1-β₁) Σᵢ₌₁..ₜ β₁^(t-i) gᵢ
Taking expectations, approximating E[gᵢ] ≈ E[gₜ], and substituting
the sum of the finite geometric series,
E[mₜ] = E[gₜ](1-β₁)(1-β₁^t)/(1-β₁) + ξ
      = E[gₜ](1-β₁^t) + ξ
ξ = error due to the approximation
So m̂ₜ = mₜ/(1-β₁^t)
The second-moment bias correction is done in the same way:
v̂ₜ = vₜ/(1-β₂^t)
Hence the weight update is
wₜ = wₜ₋₁ - η m̂ₜ/sqrt(v̂ₜ + ε)
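Putting the moment estimates and bias corrections together as a minimal sketch (β₁, β₂, and ε are the commonly used defaults; the learning rate and quadratic test function are illustrative assumptions):

```python
def adam_step(w, m, v, grad, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias correction (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad          # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (RMSProp)
    m_hat = m / (1 - b1 ** t)             # bias correction: m_t/(1-b1^t)
    v_hat = v / (1 - b2 ** t)             # bias correction: v_t/(1-b2^t)
    return w - lr * m_hat / (v_hat ** 0.5 + eps), m, v

# Minimize L(w) = w^2 starting from w = 5
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, m, v, 2.0 * w, t)
```

Note how the corrections matter most for small t, when m and v are still biased towards their zero initialization.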

TRACE KTU
REGULARIZATION TECHNIQUES
 Regularization is a set of techniques that can prevent overfitting in
neural networks and thus improve the accuracy of a deep learning
model
 One of the most important aspects when training neural networks is
avoiding overfitting
 If there is a data-generating function F1 for the training data, it should
also work for other, unseen datasets
 That means the function F1 is to be generalized without
overcomplicating the model
 There should be an F2 that mimics F1, but is more generalized
 Parameter Norm Penalties:-
 The technique is to add a parameter norm penalty Ω(θ) to the
objective function J,
 J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)
 J - objective function
 θ - parameters such as the weights
 X - input features
 y - labels
 Ω - parameter norm penalty
 α - hyperparameter that controls the relative contribution of the
penalty term, used to regularize the model
 The value of α lies in [0, ∞)
 If α is 0, there is no regularization
 The two common methods of regularization are L1 and L2
regularization
L1-Regularization
It is also known as LASSO regression (Least Absolute
Shrinkage and Selection Operator)
Here we add the 'absolute value of magnitude' of the coefficients
as a penalty term to the loss function: Ω(θ) = ||w||₁ = Σ|wᵢ|
Lasso shrinks the less important features' coefficients to zero,
thus removing some features (feature selection)
L1 regularization is robust in dealing with outliers
It creates sparsity in the solution, which means less important
features or noise terms will be zero
L1 regularization leads to sparse parameter learning:
- Zero values of wᵢ can be dropped
- Equivalent to dropping edges from the neural network
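The sparsity-inducing shrinkage can be illustrated with the soft-thresholding operator, which is the exact minimization step for a λ|w| penalty in the scalar case (a sketch; the coefficient values below are illustrative, not from the slides):

```python
def soft_threshold(x, lam):
    """Shrink x toward 0 by lam; values with |x| <= lam become exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

coeffs = [0.05, -0.3, 1.2]                         # illustrative coefficients
shrunk = [soft_threshold(c, 0.1) for c in coeffs]  # small first entry -> 0.0
```

The smallest coefficient is driven to exactly zero, which is the feature-selection behavior described above; larger coefficients are merely shrunk.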
L2-regularization
Also called weight decay or ridge regression
The regularization term Ω is defined as the squared Euclidean
norm (L2 norm) of the weight matrices, which is the sum over
all squared weight values of a weight matrix
This is weighted by a scalar value α/2
L̂(w) = (α/2)||w||₂² + L(w)
     = (α/2) ΣΣ wᵢⱼ² + L(w)
α is called the regularization rate (an additional hyperparameter)
The gradient of the new loss function can be written as
 ∇w L̂(w) = αw + ∇w L(w)
The new update rule is
Wnew = Wold - η(αWold + ∇w L(Wold))
This is gradient descent with L2 regularization, which can be
written as
Wnew = (1-ηα)Wold - η ∇w L(Wold)
OR
Wnew = (1-ηα)Wold - η ∂L/∂Wold
i.e., an additional shrinkage of the current weights towards zero
For a linear model, L2 regularization with parameter λ is
equivalent to adding Gaussian noise with variance λ to the input
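The (1 - ηα) shrinkage factor is easy to see in isolation; a minimal sketch with illustrative η and α, holding the data gradient at zero so only the decay term acts:

```python
def l2_update(w_old, grad, lr=0.1, alpha=0.01):
    """Wnew = (1 - lr*alpha)*Wold - lr*grad: gradient step plus weight decay."""
    return (1 - lr * alpha) * w_old - lr * grad

# With the data gradient held at zero, the weight decays geometrically
w = 1.0
for _ in range(100):
    w = l2_update(w, 0.0)
# w == (1 - 0.001)**100, about 0.905
```

This is why L2 regularization is called weight decay: every step multiplies the weights by a factor slightly below 1 before applying the usual gradient step.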