Module 2 Part1new
DEEP LEARNING
Module 2, Part I
SYLLABUS
Dropout, Parameter initialization.
OPTIMIZATION TECHNIQUES
2. STOCHASTIC GRADIENT DESCENT (SGD)
One of the disadvantages of GD is that the entire dataset must be loaded for every gradient computation, which makes it computationally intensive.
If we consider only a single data point for computing the loss,
Loss = (y - ŷ)^2,
then this is SGD.
SGD iterates one observation at a time.
MINI-BATCH SGD:-
It is a cross-over between GD and SGD.
The dataset is divided into batches and the gradient is computed on each batch.
GD converges smoothly, while SGD converges with some noise.
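The three variants above differ only in how much data feeds each gradient step. A minimal sketch on an assumed toy 1-D linear regression (all data and names here are illustrative, not from the slides):

```python
import numpy as np

# Toy 1-D regression with per-sample loss (y - w*x)^2; true slope is 3.
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + 0.1 * rng.normal(size=100)

def gradient(w, xb, yb):
    """Mean gradient of (y - w*x)^2 over a batch."""
    return np.mean(-2.0 * xb * (yb - w * xb))

def train(w, lr, batch_size, steps):
    n = len(X)
    for _ in range(steps):
        if batch_size >= n:
            idx = np.arange(n)          # GD: the whole dataset every step
        else:
            idx = rng.choice(n, size=batch_size, replace=False)
        w -= lr * gradient(w, X[idx], y[idx])
    return w

w_gd   = train(0.0, 0.1,  100, 200)   # full-batch GD: smooth convergence
w_sgd  = train(0.0, 0.05,   1, 400)   # SGD: one observation per step, noisy
w_mini = train(0.0, 0.1,   16, 200)   # mini-batch: a cross-over of the two
```

All three estimates approach the true slope 3; SGD hovers around it with the most noise, GD with the least.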
3. SGD WITH MOMENTUM
Even with mini-batch SGD there is noise, and convergence takes too much time.
All the other disadvantages of GD, such as falling into a local minimum, also affect SGD.
As we reach a saddle point, the update step size becomes very small, approaching 0.
SGD with momentum works with the concept of EWMA (Exponentially Weighted Moving Average),
which is used to find the trend in a time series. The formula of EWMA is
Vt = β·Vt-1 + (1 - β)·θt
The standard weight update rule is
Wnew = Wold - η·∂L/∂Wold
Modified with respect to EWMA:
Wnew = Wold - [β·Vt-1 + η·∂L/∂Wold]
where Vt-1 = η·[(∂L/∂W)t-1 + β·(∂L/∂W)t-2 + β^2·(∂L/∂W)t-3 + …]
β·Vt-1 is the momentum term.
This will reduce the noise while trying to reach the global
minimum
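The update rule above can be sketched on an assumed toy loss L(w) = (w - 5)^2 (an illustrative example, not from the slides); v accumulates the EWMA-style history of gradients and β·v is the momentum term:

```python
# Minimal sketch of SGD with momentum on the toy loss L(w) = (w - 5)^2.
def momentum_minimize(w=0.0, lr=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 5.0)      # dL/dw of the toy loss
        v = beta * v + lr * grad    # Vt = beta*Vt-1 + eta*dL/dw
        w = w - v                   # Wnew = Wold - [beta*Vt-1 + eta*dL/dw]
    return w
```

The iterate spirals into the minimum at w = 5; the momentum term smooths the noisy steps but can overshoot before settling.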
4. NESTEROV ACCELERATED GRADIENT (NAG):-
When the learning rate η is relatively large, Nesterov Accelerated Gradient allows a larger decay rate α than the Momentum method, while preventing oscillations.
The theorem also shows that the Momentum method and Nesterov Accelerated Gradient become equivalent when η is small.
With momentum, the actual movement is large due to the added momentum term:
we cross the actual minimum point and have to come back to reach it,
so SGD with momentum oscillates around the minimum point.
(MBGD is faster than GD.)
NAG's remedy: look before you leap.
In NAG, we calculate the gradient at a lookahead point and then use it to update the weight.
In momentum-based GD the weight update rule is
Wnew = Wold - [β·Vt-1 + η·∂L/∂Wold]
i.e., the update of W = history of velocity + gradient at the current point.
In NAG,
the update of W = history of velocity + gradient at a lookahead point:
WLA = Wold - β·Vt-1
Vt = β·Vt-1 + η·∂L/∂WLA
Wnew = Wold - Vt
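The three NAG equations can be sketched on the same assumed toy loss L(w) = (w - 5)^2; the only change from plain momentum is where the gradient is evaluated:

```python
# Minimal sketch of NAG on the toy loss L(w) = (w - 5)^2 (assumed example).
def nag_minimize(w=0.0, lr=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        w_la = w - beta * v           # lookahead: WLA = Wold - beta*Vt-1
        grad = 2.0 * (w_la - 5.0)     # dL/dw evaluated at WLA, not at Wold
        v = beta * v + lr * grad      # Vt = beta*Vt-1 + eta*dL/dWLA
        w = w - v                     # Wnew = Wold - Vt
    return w
```

Evaluating the gradient at the lookahead point damps the overshoot that plain momentum exhibits near the minimum.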
5. ADAGRAD OPTIMIZER (Adaptive Gradient Optimizer)
In Adagrad, each weight has a different learning rate (η).
The weight update rule is
Wt = Wt-1 - η·∂L/∂Wt-1
where Wt = Wnew and Wt-1 = Wold.
In SGD and mini-batch SGD, the value of the learning rate is the same for every weight (parameter).
In the Adagrad optimizer, the learning rate gets modified based on how frequently a parameter is updated during training: frequently updated parameters receive smaller effective learning rates.
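A minimal sketch of the standard Adagrad rule on an assumed toy loss L(w) = (w - 5)^2: the accumulator G sums all past squared gradients, so the effective learning rate η/sqrt(G + ε) shrinks as a parameter keeps getting updated.

```python
import math

# Minimal sketch of Adagrad on the toy loss L(w) = (w - 5)^2 (assumed example).
def adagrad_minimize(w=0.0, lr=1.0, eps=1e-8, steps=1000):
    G = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 5.0)
        G += grad * grad                       # accumulate squared gradients
        w -= (lr / math.sqrt(G + eps)) * grad  # per-parameter shrinking step
    return w
```

Because G only ever grows, Adagrad's step size decays monotonically, which is the weakness RMSProp addresses next.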
6. RMSPROP (ROOT MEAN SQUARED PROPAGATION)
The RMSProp algorithm uses exponential smoothing with parameter ρ ∈ (0, 1) in its running estimates of the gradients.
Here, for the adaptation of the step size, instead of considering a decaying average of all past gradients, only the recent past gradients are considered.
It forgets early gradients and focuses on the most recently observed partial gradients.
The running averages of squared gradients are
Sdwt = β·Sdwt-1 + (1 - β)·(∂L/∂wt)^2
Sdbt = β·Sdbt-1 + (1 - β)·(∂L/∂bt)^2
β is selected as 0.95 in most scenarios.
So the weight updates are
wt = wt-1 - η'·∂L/∂wt-1
bt = bt-1 - η'·∂L/∂bt-1
where η' = η / sqrt(Sdwt + ε) (and correspondingly with Sdbt for the bias).
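The Sdw recursion and the η' step size can be sketched on an assumed toy loss L(w) = (w - 5)^2 (illustrative only, not from the slides):

```python
import math

# Minimal sketch of RMSProp on the toy loss L(w) = (w - 5)^2, using the
# slide's notation: Sdw is an EWMA of squared gradients and the effective
# step size is eta' = eta / sqrt(Sdw + eps).
def rmsprop_minimize(w=0.0, lr=0.01, beta=0.95, eps=1e-8, steps=1000):
    s_dw = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 5.0)
        s_dw = beta * s_dw + (1.0 - beta) * grad * grad  # Sdw_t recursion
        w -= (lr / math.sqrt(s_dw + eps)) * grad         # step with eta'
    return w
```

Unlike Adagrad's ever-growing accumulator, the EWMA lets the step size recover when gradients shrink.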
7. ADAM OPTIMIZER
The first moment mₜ = ꞵ₁·mₜ₋₁ + (1 - ꞵ₁)·gₜ expands to mₜ = (1 - ꞵ₁)·(gₜ + ꞵ₁·gₜ₋₁ + ꞵ₁^2·gₜ₋₂ + …).
Substituting the sum of the finite geometric series (treating each gᵢ as approximately equal):
mₜ ≈ (1 - ꞵ₁)·gᵢ·(1 - ꞵ₁^t) / (1 - ꞵ₁)
   = (1 - ꞵ₁^t)·gᵢ
E(mₜ) = E[gᵢ]·(1 - ꞵ₁^t) + ξ
ξ = error due to the approximation
So the bias-corrected first moment is m̂ₜ = mₜ / (1 - ꞵ₁^t).
Second-moment bias correction is done in the same way:
v̂ₜ = vₜ / (1 - ꞵ₂^t)
Hence the weight update is
wₜ = wₜ₋₁ - η·m̂ₜ / sqrt(v̂ₜ + ε)
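The full Adam loop, with both moments and their bias corrections, can be sketched on an assumed toy loss L(w) = (w - 5)^2 (illustrative example):

```python
import math

# Minimal sketch of Adam on the toy loss L(w) = (w - 5)^2 (assumed example).
# m and v are EWMAs of the gradient and squared gradient; both are
# bias-corrected by 1/(1 - beta^t) before the update.
def adam_minimize(w=0.0, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=2000):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = 2.0 * (w - 5.0)
        m = b1 * m + (1.0 - b1) * g          # first moment m_t
        v = b2 * v + (1.0 - b2) * g * g      # second moment v_t
        m_hat = m / (1.0 - b1 ** t)          # bias-corrected m_t
        v_hat = v / (1.0 - b2 ** t)          # bias-corrected v_t
        w -= lr * m_hat / math.sqrt(v_hat + eps)   # slide's update form
    return w
```

The bias correction matters most in the first few steps, when m and v are still warming up from their zero initialization.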
REGULARIZATION TECHNIQUES
If there is a data-generating function F1 for the training data, it should also work for other datasets that are unseen.
That means the function F1 has to generalize without overcomplicating the model.
There should be a function F2 that mimics F1 but is more generalized.
Parameter Norm Penalties:-
The technique is to add a parameter norm penalty Ω(θ) to the objective function J:
J̃(θ; X, y) = J(θ; X, y) + α·Ω(θ)
J - objective function
θ - parameters, such as the weights
X - input features
y - labels
Ω - parameter norm penalty
α - hyperparameter that controls the relative contribution of the penalty term in regularizing the model
The value of α lies in [0, ∞).
If α is 0, there is no regularization.
The two methods of regularization are L1 and L2 regularization.
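The penalized objective J̃ = J + α·Ω can be sketched numerically; the data, function name, and the choice of an L2-style Ω(w) = 0.5·||w||² below are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the penalized objective J~(theta; X, y) = J + alpha*Omega
# with a toy squared-error J and an L2 penalty Omega(w) = 0.5*||w||^2.
def penalized_loss(w, X, y, alpha):
    J = np.mean((y - X @ w) ** 2)      # base objective J(theta; X, y)
    omega = 0.5 * np.sum(w ** 2)       # parameter norm penalty Omega(theta)
    return J + alpha * omega

X = np.eye(2)
y = np.array([1.0, 2.0])
w = np.array([1.0, 2.0])                    # fits the data exactly: J = 0
loss_unreg = penalized_loss(w, X, y, 0.0)   # alpha = 0: no regularization
loss_reg   = penalized_loss(w, X, y, 0.1)   # adds 0.1 * 0.5 * (1 + 4)
```

With α = 0 the penalty vanishes; increasing α trades training fit for smaller weights.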
L1-Regularization
It is also known as LASSO regression (Least Absolute Shrinkage and Selection Operator).
Here we add the 'absolute value of magnitude' of the coefficients as a penalty term to the loss function.
Lasso shrinks the less important features' coefficients to zero, thus removing some features: feature selection.
L1 regularization is robust in dealing with outliers.
It creates sparsity in the solution, which means less important features or noise terms will have zero coefficients.
L1-regularization leads to sparse parameter learning:
– Zero values of wi can be dropped.
– This is equivalent to dropping edges from the neural network.
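The sparsity/feature-selection effect can be sketched with proximal gradient descent (ISTA) on a tiny least-squares problem with orthogonal features; all data and constants here are illustrative assumptions, and the soft-threshold step is the proximal operator of the L1 penalty:

```python
import numpy as np

# The soft-threshold step snaps small coefficients exactly to zero.
def soft_threshold(w, t):
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

X = np.kron(np.ones((10, 1)), np.eye(3))   # 30x3 design, orthogonal columns
true_w = np.array([2.0, 0.05, 0.0])        # only the first feature matters
y = X @ true_w

w = np.zeros(3)
lr, alpha = 0.1, 0.5
for _ in range(500):
    grad = -2.0 * X.T @ (y - X @ w) / len(y)       # gradient of the MSE part
    w = soft_threshold(w - lr * grad, lr * alpha)  # prox step for alpha*|w|_1
```

The two small coefficients end up exactly zero (features dropped), while the large one survives, shrunk from 2.0 toward 1.25 by the penalty.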
L2-regularization
Also called weight decay or ridge regression.
The regularization term Ω is defined as the Euclidean norm (L2 norm) of the weight matrices, which is the sum over all squared weight values of a weight matrix.
This is weighted by a scalar value α/2:
L̂(w) = (α/2)·||w||₂^2 + L(w)
     = (α/2)·ΣΣ wij^2 + L(w)
α is called the regularization rate (an additional hyperparameter).
The gradient of the new loss function can be written as
∇w L̂(w) = α·w + ∇w L(w)
The new update rule is
Wnew = Wold - η·(α·Wold + ∇w L(Wold))
This is gradient descent with L2 regularization, which can be written as
Wnew = (1 - ηα)·Wold - η·∇w L(Wold)
OR
Wnew = (1 - ηα)·Wold - η·∂L/∂Wold
i.e., an additional shrinkage is subtracted from the current weights at every step.
L2 regularization with parameter λ is equivalent to adding Gaussian noise with variance λ to the input.
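The update rule Wnew = (1 - ηα)·Wold - η·∂L/∂Wold can be sketched on an assumed toy loss L(w) = (w - 3)^2 (illustrative example); the (1 - ηα) factor is the "decay" that shrinks the weights a little at every step:

```python
import numpy as np

# Minimal sketch of one L2/weight-decay gradient step.
def l2_sgd_step(w, grad, lr, alpha):
    return (1.0 - lr * alpha) * w - lr * grad

w = np.array([0.0])
for _ in range(500):
    grad = 2.0 * (w - 3.0)                       # gradient of the bare loss
    w = l2_sgd_step(w, grad, lr=0.1, alpha=0.1)
```

The iterate converges to the regularized optimum of 2(w - 3) + αw = 0, i.e. w = 6/2.1 ≈ 2.857, slightly below the unregularized minimum at 3: the penalty pulls weights toward zero.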