
Improving ML, DL networks: Hyperparameter tuning, Regularization & Optimization
30 July 2020 12:26

OVERCOMING HIGH VARIANCE:
Regularization:
J = (1/m) · Σ L(y⁽ⁱ⁾, h(x⁽ⁱ⁾))
1. L2 regularization: J(w, b) = J + (λ/2m) · ‖w‖₂²   {using the Euclidean norm}
2. L1 regularization: J(w, b) = J + (λ/2m) · Σ |w|

In a neural network,
J = (1/m) · Σ L(y⁽ⁱ⁾, h(x⁽ⁱ⁾))
Frobenius norm regularization: J = J + (λ/2m) · Σₗ ‖W^[l]‖²_F ; where ‖W^[l]‖²_F = Σᵢ Σⱼ (w_ij^[l])², the sum of the squared entries of layer l's weight matrix.
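Python sketch: a minimal way to add the L2/Frobenius penalty to an already computed cross-entropy cost. The names cross_entropy_cost and weights are assumptions for illustration, not from the notes.

import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    # weights: list of per-layer weight matrices W[1..L]
    # Frobenius penalty: (lambda / 2m) * sum of squared entries of every W[l]
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_penalty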

Dropout Regularization:
• Keep some nodes and drop out the others at random.
• Inverted dropout: create a matrix of random values between 0 and 1 (one value per node) and a variable keep_prob. Keep a node only if its random value is below keep_prob (so roughly a fraction keep_prob of nodes survives), then divide the remaining activations by keep_prob so their expected value stays unchanged. A sketch follows this list.
• Very much used in computer vision.
• The cost function J is no longer well defined, since a different set of nodes is dropped on every iteration.
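Python sketch: a minimal version of inverted dropout for one layer's activation matrix A (the names A, D and keep_prob are illustrative).

import numpy as np

def inverted_dropout(A, keep_prob):
    # D[i, j] is True with probability keep_prob; those units are kept
    D = np.random.rand(*A.shape) < keep_prob
    A = A * D                 # drop the other units
    A = A / keep_prob         # scale up so the expected activation is unchanged
    return A, D               # reuse D in backprop to drop the same units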

Data Augmentation:
Distort, flip, rotate, randomly crop, and zoom images to create more artificial data.

Early Stopping:
Plot a graph of the training and dev set errors against #iterations. Pick the point on the #iterations axis where both the training and dev set errors are low.

Normalizing Training data:


𝑋 = (𝑋 − 𝜇) / 𝜎
µ = mean
σ = standard deviation (the square root of the variance)

This automatically does feature scaling.


The contours of the cost function become more symmetric, so we can use a larger learning_rate.
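Python sketch of this normalization, assuming X holds one example per column (a layout assumption, not stated in the notes); the same µ and σ must be reused on the dev and test sets.

import numpy as np

def normalize_inputs(X, eps=1e-8):
    mu = np.mean(X, axis=1, keepdims=True)      # per-feature mean
    sigma = np.std(X, axis=1, keepdims=True)    # per-feature standard deviation
    X_norm = (X - mu) / (sigma + eps)           # eps guards against constant features
    return X_norm, mu, sigma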

Vanishing / Exploding gradients:


Very deep NNs can have this problem.

Solution:
While INITIALIZING the weights, you can set the variance of the parameters of layer l to one of the following:
A. 1 / n^[l−1]
B. 2 / n^[l−1] : ReLU works better with this
C. sqrt(1 / n^[l−1]) : tanh works better with this --- Xavier initialization
D. sqrt(2 / (n^[l−1] + n^[l]))
In practice, all of the above can be used to initialize the parameters for any activation function.

Python code:
W[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(desiredVariance)

There are 3 types of initializations (a sketch of all three follows this list):


A. Zeros initialization: setting all the parameters to 0
B. Random initialization: initializes the weights to small random values and the biases to 0
C. He initialization: similar to Xavier initialization, except that here we multiply by sqrt(2 / n^[l−1]) ---> suited to ReLU activations
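Python sketch of the three initializations for a layer with n_l units and n_prev inputs; the helper name init_layer and the method strings are illustrative, not from the course.

import numpy as np

def init_layer(n_l, n_prev, method="he"):
    b = np.zeros((n_l, 1))
    if method == "zeros":
        W = np.zeros((n_l, n_prev))                               # symmetric: every unit learns the same thing
    elif method == "random":
        W = np.random.randn(n_l, n_prev) * 0.01                   # small random values break symmetry
    elif method == "xavier":
        W = np.random.randn(n_l, n_prev) * np.sqrt(1.0 / n_prev)  # suits tanh
    else:  # "he"
        W = np.random.randn(n_l, n_prev) * np.sqrt(2.0 / n_prev)  # suits ReLU
    return W, b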

Important points regarding Initializations:


i. Different initializations lead to different results.
ii. Random initialization breaks the symmetry and makes sure that hidden units learn different things.
iii. Don't initialize to values that are too large.
iv. He initialization works very well for ReLU activations.

Gradient checking:
Combine the parameters into one giant 1-D vector:
Θ = np.concatenate((W1, b1, ..., WL, bL), axis=None)   # axis=None flattens each array before joining

Do the same with the derivatives:

dΘ = np.concatenate((dW1, db1, ..., dWL, dbL), axis=None)

dΘ_approx[i] = ( J(Θ₁, Θ₂, …, Θᵢ + ε, …) − J(Θ₁, Θ₂, …, Θᵢ − ε, …) ) / (2ε)

This dΘ_approx[i] must be approximately equal to dΘ[i]; a sketch of the whole check follows.
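Python sketch of the check, assuming a hypothetical cost_fn(theta) that runs forward prop on the flattened parameter vector and returns J.

import numpy as np

def gradient_check(cost_fn, theta, d_theta, eps=1e-7):
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        d_theta_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * eps)
    # relative difference; roughly < 1e-7 suggests backprop is correct
    diff = np.linalg.norm(d_theta_approx - d_theta) / (
        np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta))
    return diff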

Exponentially weighted averages:

When there is a lot of noise in the data (for example, daily temperature readings for London), it is best to consider using exponentially weighted averages.

V₀ = 0
V_t = β · V_{t−1} + (1 − β) · Θ_t

where Θ_t is the temperature on the t-th day and V_t is the exponentially weighted average.

BIAS CORRECTION:
V_t^corrected = V_t / (1 − β^t)
Now V_t^corrected is the corrected estimate we can work with; the correction matters mainly in the first few steps, when V_t is still biased towards 0.
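Python sketch of the weighted average with bias correction over a 1-D series of readings (e.g. daily temperatures); the name ewa is illustrative.

def ewa(series, beta=0.9):
    v = 0.0
    out = []
    for t, theta in enumerate(series, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t))   # bias-corrected estimate
    return out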
WAYS TO OPTIMIZE COST FUNCTIONS:
1. GRADIENT DESCENT (batch / mini-batch / stochastic)
2. GRADIENT DESCENT WITH MOMENTUM
On iteration t:
    Compute dW, db on the current mini-batch
    V_dW := β · V_dW + (1 − β) · dW
    V_db := β · V_db + (1 − β) · db

    W = W − α · V_dW
    b = b − α · V_db
Here, α and β are the hyperparameters. Generally, we fix β = 0.9.
3. RMSprop
On iteration t:
    Compute dW, db on the current mini-batch
    S_dW := β · S_dW + (1 − β) · dW²
    S_db := β · S_db + (1 − β) · db²

    W = W − α · dW / (sqrt(S_dW) + ε)
    b = b − α · db / (sqrt(S_db) + ε)
Here, α and β are the hyperparameters. Generally, we fix β = 0.99.
4. Adam's optimization (combines RMSprop with momentum)
On iteration t:
    Compute dW, db on the current mini-batch
    V_dW := β₁ · V_dW + (1 − β₁) · dW
    V_db := β₁ · V_db + (1 − β₁) · db

    S_dW := β₂ · S_dW + (1 − β₂) · dW²
    S_db := β₂ · S_db + (1 − β₂) · db²

    V_dW^corrected = V_dW / (1 − β₁^t)
    V_db^corrected = V_db / (1 − β₁^t)
    S_dW^corrected = S_dW / (1 − β₂^t)
    S_db^corrected = S_db / (1 − β₂^t)

    W = W − α · V_dW^corrected / (sqrt(S_dW^corrected) + ε)
    b = b − α · V_db^corrected / (sqrt(S_db^corrected) + ε)
Here, α, β₁ and β₂ are the hyperparameters. Generally, we fix β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁸.
α needs tuning using the dev set. A numpy sketch of one Adam step appears after this list.
5. Learning Rate Decay
α = α₀ / (1 + decay_rate · epoch_num)
α = 0.95^epoch_num · α₀
α = (k / sqrt(epoch_num)) · α₀
We can use any of the above to set α for each epoch; α keeps decreasing slowly.
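Python sketch (not the course's reference code) of one Adam step plus the inverse decay schedule; names such as adam_step, vW, sW and decayed_lr are illustrative. The same update is applied to each bias b using db.

import numpy as np

def adam_step(W, dW, vW, sW, t, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the 1-based iteration count, needed for bias correction
    vW = beta1 * vW + (1 - beta1) * dW          # momentum term
    sW = beta2 * sW + (1 - beta2) * dW ** 2     # RMSprop term
    vW_corr = vW / (1 - beta1 ** t)             # bias correction
    sW_corr = sW / (1 - beta2 ** t)
    W = W - alpha * vW_corr / (np.sqrt(sW_corr) + eps)
    return W, vW, sW

def decayed_lr(alpha0, epoch_num, decay_rate=1.0):
    # inverse decay: alpha shrinks slowly as the epochs go by
    return alpha0 / (1 + decay_rate * epoch_num)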

Hyperparameter tuning:
* represents the importance of a hyperparameter while tuning:

****** α

**** β [0.9] for GD with momentum
**** mini-batch size
**** #hidden units

** #layers
** learning rate decay

β₁ [0.9], β₂ [0.999], ε [10⁻⁸] for Adam's optimization (typically left at these defaults rather than tuned)

Hyperparameter tuning process:


1. Choose random points in the hyperparameter space.
2. Use a coarse-to-fine approach.

Choose an appropriate scale for the hyperparameters. For example, if you are choosing α between 0.0001 and 1, then instead of distributing random numbers uniformly, use a log scale to distribute the random numbers:
r = -4 * np.random.rand()
α = 10^r
This makes α lie between 10⁻⁴ and 10⁰, i.e. between 0.0001 and 1.
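Python sketch of sampling α log-uniformly, generalized to any range [10^a, 10^b]; the helper name sample_lr is illustrative.

import numpy as np

def sample_lr(low_exp=-4, high_exp=0):
    r = np.random.uniform(low_exp, high_exp)   # uniform over the exponent
    return 10 ** r                             # log-uniform over [1e-4, 1]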

Batch Normalization
Normalize not only the inputs but also the hidden-layer values z^[l], to speed up learning.
Batch norm works with mini-batches.
How do we implement gradient descent with backprop when using batch norm?

For t = 1 to #mini-batches:
    Compute forward prop on X^{t}
        In each hidden layer, use BN to replace z^[l] with z̃^[l]
    Use backprop to compute dW^[l], db^[l], dβ^[l], dγ^[l]
    Update the parameters:
        W^[l] = W^[l] − α · dW^[l]
        b^[l] = b^[l] − α · db^[l]   (b^[l] can in fact be dropped, since BN's mean subtraction cancels any constant added to z^[l])
        β^[l] = β^[l] − α · dβ^[l]
        γ^[l] = γ^[l] − α · dγ^[l]
This works well with momentum, RMSprop, and Adam. A sketch of the BN transform follows.
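Python sketch of the BN transform that produces z̃^[l] from z^[l] on one mini-batch, assuming Z has one example per column so the statistics are taken per unit across the batch; gamma and beta are the learnable scale and shift (the function name is illustrative).

import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    mu = np.mean(Z, axis=1, keepdims=True)        # mini-batch mean per unit
    var = np.var(Z, axis=1, keepdims=True)        # mini-batch variance per unit
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta               # learnable scale and shift
    return Z_tilde, mu, var                       # mu, var feed the test-time running averages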

Benefits:
1. Manages to keep working even when there is covariate shift in the layer inputs.
2. Provides some regularization effect.

How to apply Batch Normalization at test time?


During training, keep exponentially weighted averages of the mini-batch means µ and variances σ² across all the mini-batches, and use these running estimates (together with the learned β and γ) to normalize at test time.

SOFTMAX - CLASSIFICATION:
Used for multi-class classification. The output layer has one unit per class and outputs a probability for each class (the probabilities sum to 1).
