Improving ML, DL networks Hyperparameter tuning, Regularization & Optimization
Optimization
30 July 2020 12:26
OVERCOMING HIGH-VARIANCE:
Regularization:
J = (1/m) * Σ L(y, h(x))
1. L2 regularization: J(w, b) = J + (λ/2m) * ||w||²  {using the Euclidean norm}
2. L1 regularization: J(w, b) = J + (λ/2m) * Σ |w|
In a Neural Network,
J = (1/m) * Σ L(y^(i), h(x^(i)))
Frobenius Norm Regularization: J = J + (λ/2m) * Σ_l ||W^[l]||_F² ; where ||W^[l]||_F² is the sum of the squared entries of W^[l] (= trace(W^[l]ᵀ · W^[l])), and l is the layer index
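Python sketch of adding the Frobenius-norm penalty to the cost and to each layer's gradient (lambd, m and the weights list are illustrative names, not from the notes):
import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    # add (lambda / 2m) * sum over layers of ||W^[l]||_F^2
    penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + penalty

def l2_regularized_grad(dW, W, lambd, m):
    # backprop picks up an extra (lambda / m) * W term per layer
    return dW + (lambd / m) * W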
Dropout Regularization:
• Keep some nodes and randomly drop out the others.
• Inverted dropout: create a matrix of random values (probabilities) with the same shape as the activations. Declare a variable keep_prob, keep each node whose random value is < keep_prob (drop the rest), then divide the remaining activations by keep_prob so their expected value is unchanged (see the sketch after this list).
• Used very often in computer vision.
• With dropout, the cost function J is no longer properly defined on each iteration, so it is harder to monitor.
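Python sketch of inverted dropout for one layer's activations a (the shape and keep_prob value are illustrative):
import numpy as np

a = np.random.randn(4, 5)                    # activations of one hidden layer (illustrative)
keep_prob = 0.8                              # fraction of nodes to keep
d = np.random.rand(*a.shape) < keep_prob     # boolean mask: True where the node is kept
a = a * d                                    # zero out the dropped nodes
a = a / keep_prob                            # invert: rescale so the expected value of a is unchanged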
Data Augmentation:
Distort, flip, rotate, randomly crop and zoom images to create more artificial data.
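Python sketch of two simple numpy augmentations on a dummy image array (sizes and offsets are illustrative):
import numpy as np

img = np.random.rand(64, 64, 3)              # dummy RGB image
flipped = img[:, ::-1, :]                    # horizontal flip
top, left = np.random.randint(0, 9, size=2)  # random offsets for a 56x56 crop
cropped = img[top:top + 56, left:left + 56, :]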
Early Stopping:
Plot a graph between #iterations and the training/dev set error. Pick a point on the #iterations axis where both the training and dev set errors are low.
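Python sketch of early stopping with a patience counter (init_params, train_one_epoch and dev_error are hypothetical helpers standing in for your own training loop):
params = init_params()                       # hypothetical: initial parameters
best_params, best_dev_error = params, float("inf")
patience, wait = 5, 0                        # stop after 5 epochs without improvement
for epoch in range(100):
    params = train_one_epoch(params)         # hypothetical: one pass of gradient descent
    err = dev_error(params)                  # hypothetical: error on the dev set
    if err < best_dev_error:
        best_params, best_dev_error, wait = params, err, 0
    else:
        wait += 1
        if wait >= patience:
            break                            # keep best_params from the best epoch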
VANISHING / EXPLODING GRADIENTS:
Solution:
While INITIALIZING the weights, you can set the variance of the parameters to:
A. 1/n^[l-1]
B. 2/n^[l-1] : ReLU uses this better
C. sqrt(1/n^[l-1]) : tanh uses this better --- Xavier initialization
D. sqrt(2/(n^[l-1] + n^[l]))
In practice, all of the above can be used to initialize the parameters for any activation function.
Python code:
W[l] = np.random.randn(shape) * np.sqrt(desiredVariance)
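Python sketch of initialization option B for every layer (layer_dims is an illustrative list of layer sizes; swap the 2 for a 1 to get the option A/C scaling):
import numpy as np

layer_dims = [784, 128, 64, 10]              # illustrative layer sizes n^[0] ... n^[L]
W, b = {}, {}
for l in range(1, len(layer_dims)):
    # variance 2 / n^[l-1] (option B above, commonly paired with ReLU)
    W[l] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * np.sqrt(2 / layer_dims[l - 1])
    b[l] = np.zeros((layer_dims[l], 1))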
Gradient checking:
Concatenate all the parameters into one giant 1-D vector:
Θ = np.concatenate((W1, b1, ..., WL, bL), axis=1)
Θ = Θ.flatten()
dΘ_approx[i] = [J(Θ_1, Θ_2, ..., Θ_i + ε, ...) − J(Θ_1, Θ_2, ..., Θ_i − ε, ...)] / (2ε)
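Python sketch of the two-sided gradient check (J is the cost as a function of the flattened parameter vector; the usual rule of thumb is that the returned ratio should be around 1e-7, while 1e-3 or more suggests a bug):
import numpy as np

def gradient_check(J, theta, dtheta, eps=1e-7):
    # theta: flattened parameters, dtheta: gradient from backprop (same shape)
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        dtheta_approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(dtheta_approx - dtheta)
    den = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return num / den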
Exponentially Weighted Averages:
V_0 = 0
V_t = β * V_(t-1) + (1 − β) * Θ_t
BIAS CORRECTION:
V_t^corrected = V_t / (1 − β^t)
Now, V_t (bias-corrected) is the new, smoothed data on which we can train our model.
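Python sketch of an exponentially weighted average with bias correction over a noisy 1-D series (the series itself is illustrative):
import numpy as np

theta = 20 + 5 * np.random.randn(200)        # illustrative noisy data
beta = 0.9
v = 0.0
v_corrected = []
for t, x in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * x            # V_t = beta * V_(t-1) + (1 - beta) * theta_t
    v_corrected.append(v / (1 - beta ** t))  # bias correction matters most for small t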
WAYS TO OPTIMIZE COST FUNCTIONS:
1. GRADIENT DESCENT (batch / mini-batch / stochastic)
2. GRADIENT DESCENT WITH MOMENTUM
On iteration t:
Compute dW, db on the current mini-batch
V_dW := β V_dW + (1 − β) dW
V_db := β V_db + (1 − β) db
W = W − α * V_dW
b = b − α * V_db
Here,
α and β are the hyperparameters. Generally, we fix β = 0.9
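Python sketch of one momentum update (dW, db come from backprop on the current mini-batch; initialize v_dW, v_db to zeros; the default α is illustrative):
def momentum_update(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    v_dW = beta * v_dW + (1 - beta) * dW     # smoothed gradient for W
    v_db = beta * v_db + (1 - beta) * db     # smoothed gradient for b
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db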
3. RMSprop
On iteration t:
Compute dW, db on the current mini-batch
S_dW := β S_dW + (1 − β) dW²
S_db := β S_db + (1 − β) db²
W = W − α * dW / (sqrt(S_dW) + ε)
b = b − α * db / (sqrt(S_db) + ε)
Here,
α and β are the hyperparameters. Generally, we fix β = 0.99
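Python sketch of one RMSprop update (dW, db from backprop on the current mini-batch; initialize s_dW, s_db to zeros; the default α is illustrative):
import numpy as np

def rmsprop_update(W, b, dW, db, s_dW, s_db, alpha=0.001, beta=0.99, eps=1e-8):
    s_dW = beta * s_dW + (1 - beta) * dW ** 2    # average of squared gradients
    s_db = beta * s_db + (1 - beta) * db ** 2
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)   # damp steps along high-variance directions
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return W, b, s_dW, s_db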
4. Adam optimization (combines RMSprop with momentum)
On iteration t:
Compute dW, db on the current mini-batch
V_dW := β1 V_dW + (1 − β1) dW
V_db := β1 V_db + (1 − β1) db
S_dW := β2 S_dW + (1 − β2) dW²
S_db := β2 S_db + (1 − β2) db²
V_dW^corrected = V_dW / (1 − β1^t) ; V_db^corrected = V_db / (1 − β1^t)
S_dW^corrected = S_dW / (1 − β2^t) ; S_db^corrected = S_db / (1 − β2^t)
W = W − α * V_dW^corrected / (sqrt(S_dW^corrected) + ε)
b = b − α * V_db^corrected / (sqrt(S_db^corrected) + ε)
Here,
α, β1, β2 and ε are the hyperparameters. Generally, we fix β1 = 0.9, β2 = 0.999 and ε = 10^-8.
α needs tuning using the dev set.
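Python sketch of one Adam update for W (the same update applies to b; dW comes from backprop on the current mini-batch, t is the iteration count starting at 1):
import numpy as np

def adam_update(W, dW, v_dW, s_dW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v_dW = beta1 * v_dW + (1 - beta1) * dW           # momentum term
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2      # RMSprop term
    v_corr = v_dW / (1 - beta1 ** t)                 # bias correction
    s_corr = s_dW / (1 - beta2 ** t)
    W = W - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return W, v_dW, s_dW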
5. Learning Rate Decay
α = (1 / (1 + decay_rate * epoch_num)) * α_0
α = 0.95^epoch_num * α_0
α = (k / sqrt(epoch_num)) * α_0
We can use any of the above to get α for every iteration. α will go on decreasing slowly.
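Python sketch of the three decay schedules above (alpha0, decay_rate and k are illustrative values; epoch_num starts at 1):
def lr_inverse_decay(alpha0, epoch_num, decay_rate=1.0):
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num, base=0.95):
    return (base ** epoch_num) * alpha0

def lr_sqrt_decay(alpha0, epoch_num, k=1.0):
    return (k / (epoch_num ** 0.5)) * alpha0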
Hyperparameter tuning:
(* marks the importance of a hyperparameter while tuning)
****** α
** #Layers
** Learning rate decay
Batch Normalization
Normalizing the inputs and the activations of some hidden layers of the network to speed up learning.
Batch norm works with mini-batches.
How to implement gradient descent while using batch norm?
For t = 1 to #MiniBatches:
    Compute forward prop on X^{t}
    In each hidden layer, use BN to replace z^[l] with z̃^[l]
    Use backprop to compute dW^[l], db^[l], dβ^[l], dγ^[l]
    Update parameters:
        W^[l] = W^[l] − α * dW^[l]
        b^[l] = b^[l] − α * db^[l]
        β^[l] = β^[l] − α * dβ^[l]
        γ^[l] = γ^[l] − α * dγ^[l]
This works well w/ momentum, RMSprop, Adam.
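A minimal numpy sketch of the BN step that produces z̃^[l] from z^[l] inside the loop above (the ε value and the (units, batch) layout are assumptions; γ, β are the learned scale and shift):
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    # Z: (units, batch) pre-activations of one layer for the current mini-batch
    mu = np.mean(Z, axis=1, keepdims=True)       # per-unit mean over the mini-batch
    var = np.var(Z, axis=1, keepdims=True)       # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)       # zero mean, unit variance
    return gamma * Z_norm + beta                 # z_tilde: learned scale and shift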
Benefits:
1. Manages to work even when there is a covariate shift
2. Provides some regularization effect.
SOFTMAX - CLASSIFICATION:
Used for multi-class classification. The o/p layer has one node per class and outputs, for each class, the probability that the input belongs to it (the probabilities sum to 1).
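A minimal numpy sketch of the softmax activation for the output layer (subtracting the max is only for numerical stability; the (classes, batch) layout is an assumption):
import numpy as np

def softmax(Z):
    # Z: (classes, batch) outputs of the last linear layer
    T = np.exp(Z - np.max(Z, axis=0, keepdims=True))   # stabilized exponentials
    return T / np.sum(T, axis=0, keepdims=True)        # probabilities sum to 1 per example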