
Improving ML, DL networks: Hyperparameter tuning, Regularization & Optimization
30 July 2020 12:26

OVERCOMING HIGH VARIANCE:
Regularization:
J = (1/m) · Σ L(y⁽ⁱ⁾, h(x⁽ⁱ⁾))
1. L2 regularization: J(w, b) = J + (λ/2m) · ‖w‖₂²   {using the Euclidean norm}
2. L1 regularization: J(w, b) = J + (λ/2m) · Σ |w|

In a neural network,
J = (1/m) · Σ L(y⁽ⁱ⁾, h(x⁽ⁱ⁾))
Frobenius norm regularization: J = J + (λ/2m) · Σₗ ‖W^[l]‖²_F ; where ‖W^[l]‖²_F = Σᵢ Σⱼ (w_ij^[l])², the sum of the squared entries of layer l's weight matrix.
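Python sketch: a minimal way to add the L2/Frobenius penalty to an already computed cross-entropy cost. The names cross_entropy_cost and weights are assumptions for illustration, not from the notes.

import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    # weights: list of per-layer weight matrices W[1..L]
    # Frobenius penalty: (lambda / 2m) * sum of squared entries of every W[l]
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_penalty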

Dropout Regularization:
• Keep some nodes and drop out the others at random.
• Inverted dropout: create a matrix of random values between 0 and 1 (one value per node) and a variable keep_prob. Keep a node only if its random value is below keep_prob (so roughly a fraction keep_prob of nodes survives), then divide the remaining activations by keep_prob so their expected value stays unchanged. A sketch follows this list.
• Very much used in computer vision.
• The cost function J is no longer well defined, since a different set of nodes is dropped on every iteration.
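Python sketch: a minimal version of inverted dropout for one layer's activation matrix A (the names A, D and keep_prob are illustrative).

import numpy as np

def inverted_dropout(A, keep_prob):
    # D[i, j] is True with probability keep_prob; those units are kept
    D = np.random.rand(*A.shape) < keep_prob
    A = A * D                 # drop the other units
    A = A / keep_prob         # scale up so the expected activation is unchanged
    return A, D               # reuse D in backprop to drop the same units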

Data Augmentation:
Distort, flip, rotate, randomly crop, and zoom images to create more artificial data.

Early Stopping:
Plot a graph of the training and dev set errors against #iterations. Pick the point on the #iterations axis where both the training and dev set errors are low.

Normalizing Training data:


𝑋 = (𝑋 − 𝜇) / 𝜎
µ = mean
σ = standard deviation (the square root of the variance)

This automatically does feature scaling.


The contours of the cost function become more symmetric, so we can use a larger learning_rate.
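Python sketch of this normalization, assuming X holds one example per column (a layout assumption, not stated in the notes); the same µ and σ must be reused on the dev and test sets.

import numpy as np

def normalize_inputs(X, eps=1e-8):
    mu = np.mean(X, axis=1, keepdims=True)      # per-feature mean
    sigma = np.std(X, axis=1, keepdims=True)    # per-feature standard deviation
    X_norm = (X - mu) / (sigma + eps)           # eps guards against constant features
    return X_norm, mu, sigma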

Vanishing / Exploding gradients:


Very deep NNs can have this problem.

Solution:
While INITIALIZING the weights, you can set the variance of the parameters of layer l to one of the following:
A. 1 / n^[l−1]
B. 2 / n^[l−1] : ReLU works better with this
C. sqrt(1 / n^[l−1]) : tanh works better with this --- Xavier initialization
D. sqrt(2 / (n^[l−1] + n^[l]))
In practice, all of the above can be used to initialize the parameters for any activation function.

Python code:
W[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(desiredVariance)

There are 3 types of initializations (a sketch of all three follows this list):


A. Zeros initialization: setting all the parameters to 0
B. Random initialization: initializes the weights to small random values and the biases to 0
C. He initialization: similar to Xavier initialization, except that here we multiply by sqrt(2 / n^[l−1]) ---> suited to ReLU activations
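Python sketch of the three initializations for a layer with n_l units and n_prev inputs; the helper name init_layer and the method strings are illustrative, not from the course.

import numpy as np

def init_layer(n_l, n_prev, method="he"):
    b = np.zeros((n_l, 1))
    if method == "zeros":
        W = np.zeros((n_l, n_prev))                               # symmetric: every unit learns the same thing
    elif method == "random":
        W = np.random.randn(n_l, n_prev) * 0.01                   # small random values break symmetry
    elif method == "xavier":
        W = np.random.randn(n_l, n_prev) * np.sqrt(1.0 / n_prev)  # suits tanh
    else:  # "he"
        W = np.random.randn(n_l, n_prev) * np.sqrt(2.0 / n_prev)  # suits ReLU
    return W, b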

Important points regarding Initializations:


i. Different initializations lead to different results.
ii. Random initialization breaks the symmetry and makes sure that hidden units learn different things.
iii. Don't initialize to values that are too large.
iv. He initialization works very well for ReLU activations.

Gradient checking:
Combine the parameters into one giant 1-D vector:
Θ = np.concatenate((W1, b1, ..., WL, bL), axis=None)   # axis=None flattens each array before joining

Do the same with the derivatives:

dΘ = np.concatenate((dW1, db1, ..., dWL, dbL), axis=None)

dΘ_approx[i] = ( J(Θ₁, Θ₂, …, Θᵢ + ε, …) − J(Θ₁, Θ₂, …, Θᵢ − ε, …) ) / (2ε)

This dΘ_approx[i] must be approximately equal to dΘ[i]; a sketch of the whole check follows.
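Python sketch of the check, assuming a hypothetical cost_fn(theta) that runs forward prop on the flattened parameter vector and returns J.

import numpy as np

def gradient_check(cost_fn, theta, d_theta, eps=1e-7):
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        d_theta_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * eps)
    # relative difference; roughly < 1e-7 suggests backprop is correct
    diff = np.linalg.norm(d_theta_approx - d_theta) / (
        np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta))
    return diff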

Exponentially weighted averages:

When there is a lot of noise in the data (for example, daily temperature readings for London), it is best to consider using exponentially weighted averages.

V₀ = 0
V_t = β · V_{t−1} + (1 − β) · Θ_t

where Θ_t is the temperature on the t-th day and V_t is the exponentially weighted average.

BIAS CORRECTION:
V_t^corrected = V_t / (1 − β^t)
Now V_t^corrected is the corrected estimate we can work with; the correction matters mainly in the first few steps, when V_t is still biased towards 0.
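Python sketch of the weighted average with bias correction over a 1-D series of readings (e.g. daily temperatures); the name ewa is illustrative.

def ewa(series, beta=0.9):
    v = 0.0
    out = []
    for t, theta in enumerate(series, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t))   # bias-corrected estimate
    return out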
WAYS TO OPTIMIZE COST FUNCTIONS:
1. GRADIENT DESCENT (batch / mini-batch / stochastic)
2. GRADIENT DESCENT WITH MOMENTUM
On iteration t:
    Compute dW, db on the current mini-batch
    V_dW := β · V_dW + (1 − β) · dW
    V_db := β · V_db + (1 − β) · db

    W = W − α · V_dW
    b = b − α · V_db
Here, α and β are the hyperparameters. Generally, we fix β = 0.9.
3. RMSprop
On iteration t:
    Compute dW, db on the current mini-batch
    S_dW := β · S_dW + (1 − β) · dW²
    S_db := β · S_db + (1 − β) · db²

    W = W − α · dW / (sqrt(S_dW) + ε)
    b = b − α · db / (sqrt(S_db) + ε)
Here, α and β are the hyperparameters. Generally, we fix β = 0.99.
4. Adam's optimization (combines RMSprop with momentum)
On iteration t:
    Compute dW, db on the current mini-batch
    V_dW := β₁ · V_dW + (1 − β₁) · dW
    V_db := β₁ · V_db + (1 − β₁) · db

    S_dW := β₂ · S_dW + (1 − β₂) · dW²
    S_db := β₂ · S_db + (1 − β₂) · db²

    V_dW^corrected = V_dW / (1 − β₁^t)
    V_db^corrected = V_db / (1 − β₁^t)
    S_dW^corrected = S_dW / (1 − β₂^t)
    S_db^corrected = S_db / (1 − β₂^t)

    W = W − α · V_dW^corrected / (sqrt(S_dW^corrected) + ε)
    b = b − α · V_db^corrected / (sqrt(S_db^corrected) + ε)
Here, α, β₁ and β₂ are the hyperparameters. Generally, we fix β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁸.
α needs tuning using the dev set. A numpy sketch of one Adam step appears after this list.
5. Learning Rate Decay
α = α₀ / (1 + decay_rate · epoch_num)
α = 0.95^epoch_num · α₀
α = (k / sqrt(epoch_num)) · α₀
We can use any of the above to set α for each epoch; α keeps decreasing slowly.
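Python sketch (not the course's reference code) of one Adam step plus the inverse decay schedule; names such as adam_step, vW, sW and decayed_lr are illustrative. The same update is applied to each bias b using db.

import numpy as np

def adam_step(W, dW, vW, sW, t, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the 1-based iteration count, needed for bias correction
    vW = beta1 * vW + (1 - beta1) * dW          # momentum term
    sW = beta2 * sW + (1 - beta2) * dW ** 2     # RMSprop term
    vW_corr = vW / (1 - beta1 ** t)             # bias correction
    sW_corr = sW / (1 - beta2 ** t)
    W = W - alpha * vW_corr / (np.sqrt(sW_corr) + eps)
    return W, vW, sW

def decayed_lr(alpha0, epoch_num, decay_rate=1.0):
    # inverse decay: alpha shrinks slowly as the epochs go by
    return alpha0 / (1 + decay_rate * epoch_num)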

Hyperparameter tuning:
* represents the importance of a hyperparameter while tuning:

****** α

**** β [0.9] for GD with momentum
**** mini-batch size
**** #hidden units

** #layers
** learning rate decay

β₁ [0.9], β₂ [0.999], ε [10⁻⁸] for Adam's optimization (typically left at these defaults rather than tuned)

Hyperparameter tuning process:


1. Choose random points in the hyperparameter space.
2. Use a coarse-to-fine approach.

Choose an appropriate scale for the hyperparameters. For example, if you are choosing α between 0.0001 and 1, then instead of distributing random numbers uniformly, use a log scale to distribute the random numbers:
r = -4 * np.random.rand()
α = 10^r
This makes α lie between 10⁻⁴ and 10⁰, i.e. between 0.0001 and 1.
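Python sketch of sampling α log-uniformly, generalized to any range [10^a, 10^b]; the helper name sample_lr is illustrative.

import numpy as np

def sample_lr(low_exp=-4, high_exp=0):
    r = np.random.uniform(low_exp, high_exp)   # uniform over the exponent
    return 10 ** r                             # log-uniform over [1e-4, 1]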

Batch Normalization
Normalize not only the inputs but also the hidden-layer values z^[l], to speed up learning.
Batch norm works with mini-batches.
How do we implement gradient descent with backprop when using batch norm?

For t = 1 to #mini-batches:
    Compute forward prop on X^{t}
        In each hidden layer, use BN to replace z^[l] with z̃^[l]
    Use backprop to compute dW^[l], db^[l], dβ^[l], dγ^[l]
    Update the parameters:
        W^[l] = W^[l] − α · dW^[l]
        b^[l] = b^[l] − α · db^[l]   (b^[l] can in fact be dropped, since BN's mean subtraction cancels any constant added to z^[l])
        β^[l] = β^[l] − α · dβ^[l]
        γ^[l] = γ^[l] − α · dγ^[l]
This works well with momentum, RMSprop, and Adam. A sketch of the BN transform follows.
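Python sketch of the BN transform that produces z̃^[l] from z^[l] on one mini-batch, assuming Z has one example per column so the statistics are taken per unit across the batch; gamma and beta are the learnable scale and shift (the function name is illustrative).

import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    mu = np.mean(Z, axis=1, keepdims=True)        # mini-batch mean per unit
    var = np.var(Z, axis=1, keepdims=True)        # mini-batch variance per unit
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta               # learnable scale and shift
    return Z_tilde, mu, var                       # mu, var feed the test-time running averages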

Benefits:
1. Manages to keep working even when there is covariate shift in the layer inputs.
2. Provides some regularization effect.

How to apply Batch Normalization at test time?


During training, keep exponentially weighted averages of the mini-batch means µ and variances σ² across all the mini-batches, and use these running estimates (together with the learned β and γ) to normalize at test time.

SOFTMAX - CLASSIFICATION:
Used for multi-class classification. The output layer has one unit per class and outputs a probability for each class (the probabilities sum to 1).
