Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

Gradient descent optimizes neural networks by iteratively processing training examples to update the network parameters downhill towards a minimum of the cost. Batch gradient descent processes all m examples at once, while stochastic gradient descent processes one example at a time. Mini-batch gradient descent processes examples in batches of a size between 1 and m, balancing the vectorized speed of batch processing with the frequent parameter updates of stochastic gradient descent. Choosing a good mini-batch size gives the fastest learning in practice.

2. Optimization Algorithms

Mini-Batch Gradient Descent
Gradient descent analogy: it "goes downhill", each update moving the parameters toward a minimum of the cost. Implementing it for a neural network requires three for-loops:
1. Over the number of iterations
2. Over the m training examples
3. Over the number of layers in the neural network

The batch size determines the variant:
- Batch size = m (the total training examples): batch gradient descent (BGD)
- Batch size = 1: stochastic gradient descent (SGD)
- Batch size between 1 and m: mini-batch gradient descent

BGD takes too long per update on big data; with SGD you lose the speed-up from vectorization. Mini-batch GD divides huge data into batches for training and updates the parameters after every mini-batch is processed. 5 million data points? Divide into mini-batches of 1000 and you get 5000 updates per pass. Why mini-batch GD is good: in practice it gives the fastest learning.
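A minimal numpy sketch of this update pattern; the synthetic data, the linear model, and all names below are illustrative assumptions rather than anything from the course:

import numpy as np

rng = np.random.default_rng(0)
m = 5000                                   # total training examples
X = rng.normal(size=(m, 3))                # features
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0   # targets from a known linear rule

w, b = np.zeros(3), 0.0
alpha, batch_size = 0.1, 64

for epoch in range(10):                    # loop over passes through the data
    for start in range(0, m, batch_size):  # loop over mini-batches
        Xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        err = Xb @ w + b - yb              # forward pass on one mini-batch
        dw = Xb.T @ err / len(yb)          # MSE gradient w.r.t. w
        db = err.mean()                    # MSE gradient w.r.t. b
        w -= alpha * dw                    # parameters are updated after EVERY
        b -= alpha * db                    # mini-batch, not once per epoch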

Choosing your mini-batch size
- Training set size <= 2000: just use batch gradient descent.
- Bigger? Use mini-batch sizes that are powers of 2 (64, 128, 256, 512).
- Make sure the mini-batch fits in the memory of your GPU or CPU.
- In practice mini-batch GD keeps the vectorized implementation while giving less waiting between parameter updates.
- Shuffling and partitioning are the two steps required to build mini-batches (see the sketch below).
- The cost trend should be going down, but it is noisier than with batch GD and you may not converge every time: the iterates can keep oscillating around the minimum.
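A sketch of the shuffle-and-partition steps; the function name and the column-per-example layout are assumptions:

import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    # X: (n_features, m), Y: (1, m); one column per example (layout assumed).
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                  # step 1: shuffle X and Y together
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    batches = []
    for k in range(0, m, batch_size):          # step 2: partition into batches
        batches.append((X_shuf[:, k:k + batch_size],
                        Y_shuf[:, k:k + batch_size]))
    return batches                             # the last batch may be smaller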
Exponentially Weighted Averages (EWA)
- Key equation: v_t = beta * v_{t-1} + (1 - beta) * theta_t
- Use v_t / (1 - beta^t) for removing the initial bias in the EWA.

Gradient Descent with Momentum
- Use an exponentially weighted average of the gradients, and use those averaged gradients to update your weights.
- Smoothes out the steps of gradient descent; almost always works faster than plain batch GD.
- Value of beta: the larger the value, the smoother the update. Common values: 0.8 - 0.999. Don't feel like tuning? 0.9 is mostly used.
- You may need several attempts to find the right value of beta for your model; tuning alpha together with beta is common.
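A sketch of momentum with bias correction on a toy quadratic cost; momentum_step, the toy gradient, and the constants are illustrative assumptions:

import numpy as np

def momentum_step(theta, grad, v, t, alpha=0.01, beta=0.9):
    v = beta * v + (1 - beta) * grad      # EWA of gradients: v_t = b*v_{t-1} + (1-b)*grad
    v_hat = v / (1 - beta ** t)           # bias correction for the early iterations
    theta = theta - alpha * v_hat         # smoother step than the raw gradient
    return theta, v

theta, v = np.array([1.0, -2.0]), np.zeros(2)
for t in range(1, 101):
    grad = 2 * theta                      # gradient of the toy cost ||theta||^2
    theta, v = momentum_step(theta, grad, v, t)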

RMSProp
- S_dW = beta * S_dW + (1 - beta) * dW^2 (element-wise)
- S_db = beta * S_db + (1 - beta) * db^2 (element-wise)
- Update by dividing the gradient by the root mean square: W := W - alpha * dW / sqrt(S_dW), b := b - alpha * db / sqrt(S_db).

Adam = Adaptive Moment Estimation
- A mix of momentum and RMSProp; one of the most effective optimization techniques.
- Parameters: alpha, beta1, beta2, epsilon.
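One Adam step as a sketch, combining the two exponentially weighted averages above; the names and default values follow common convention and are assumptions, not from the notes:

import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw             # momentum-style EWA of gradients
    s = beta2 * s + (1 - beta2) * dw ** 2        # RMSProp-style EWA of squared gradients
    v_hat = v / (1 - beta1 ** t)                 # bias corrections
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s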
- Advantages of Adam: relatively low memory requirements, and it usually works well even with little tuning of the hyperparameters (except learning_rate).

The problem of local optima
- In high dimensions the real problem is plateaus rather than bad local optima; plateaus can be taken care of by the different optimization algorithms above (momentum, RMSProp, Adam).

Learning Rate Decay
- Reduce alpha by formula as training progresses, e.g. alpha = alpha_0 / (1 + decay_rate * epoch_num), as in the sketch below.
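A tiny sketch of the decay formula; the alpha_0 and decay_rate values are arbitrary examples, and epochs are counted from 1 here:

alpha0, decay_rate = 0.2, 1.0
for epoch_num in range(1, 5):
    alpha = alpha0 / (1 + decay_rate * epoch_num)
    print(epoch_num, alpha)      # 0.1, 0.0667, 0.05, 0.04: alpha shrinks each epoch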
1. Practical Aspects of Deep Learning

DL is a highly iterative process: you have a lot of choices and hyperparameters to take care of, the learning rate for example.

Setting up data in train/dev/test sets
- The ratios depend on the size of the data. Small dataset? The old ratios, 70/30 or 60/20/20, will be okay.
- Because DL has a lot of data, 95/5 will also work.
- Make sure the dev and test set come from the same distribution.

Bias/Variance
- 2D data: we can plot the decision line and see whether bias or variance is there. High-dimensional data? Compare errors instead.
- What is the optimal (Bayes) error? It defines whether you have a bias problem, a variance problem, or both:
- Training error - Bayes error small, train - dev error big? High variance (overfitting).
- Training error - Bayes error big, train - dev error small? High bias (underfitting).
- Training error - Bayes error big, train - dev error big? High bias and high variance (worst of both worlds).
- Training error - Bayes error small, train - dev error small? Low bias and low variance (best of both worlds).

Basic recipe for machine learning: after training an initial model, iterate until bias and variance are both low.
1. Ask if the model has high bias? Try a bigger network (almost always helps), train longer, or try a different NN architecture. More training data will not help here.
2. Ask if the model has high variance? Get more data, use regularization, or try a more appropriate NN architecture.

Regularizing your neural network
- L2 regularization: in an NN application the penalty becomes the Frobenius norm of the weight matrices; in backprop, "weight decay" happens. With L1 the weights W may become sparse, so avoid L1 regularization.
- Why does regularization help with overfitting? If lambda is very large, the weights become smaller, z is confined to a smaller range of values, and hence the NN becomes much simpler.
- Data augmentation: getting more data is tough, so augment the data in practice: flip, rotate, crop, skew, etc. Deals with the high-variance problem and makes the NN more robust.
- Early stopping: makes the hyperparameter search easier, but the downside is that the cost function is no longer clearly optimized; it breaks orthogonalization (one knob for fitting the cost, a separate knob for not overfitting).
- Dropout regularization: inverted dropout is the most common implementation; use it only at training time (see the sketch below).
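A sketch of inverted dropout on one layer's activations; a3, keep_prob = 0.8, and the shapes are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
keep_prob = 0.8                               # keep 80% of the units
a3 = rng.normal(size=(5, 10))                 # activations of some layer 3

d3 = rng.uniform(size=a3.shape) < keep_prob   # random mask per unit
a3 = a3 * d3                                  # zero out the dropped units
a3 = a3 / keep_prob                           # the "inverted" scaling keeps the
                                              # expected activation unchanged, so
                                              # test time needs no rescaling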
Setting up your optimization problem

Normalizing Inputs
1. Subtract the mean: x := x - mu, where mu = (1/m) * np.sum(x^(i)).
2. Normalize the variance: x := x / sigma^2, where sigma^2 = (1/m) * np.sum(x^(i) ** 2) (element-wise, after mean subtraction).
- Use the same mean and variance to normalize the test set.
- The cost function looks elongated when the inputs are not normalized, and hence gradient descent becomes tough to perform. When input features are on different scales, normalization becomes very important; anyway, always normalize.
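A sketch of the two steps on synthetic data; note that dividing by sigma^2 follows these notes (dividing by the standard deviation sigma is the more common variant):

import numpy as np

rng = np.random.default_rng(2)
X_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
X_test = rng.normal(loc=5.0, scale=3.0, size=(200, 4))

mu = X_train.mean(axis=0)                      # (1/m) * sum(x^(i))
Xc = X_train - mu                              # 1. subtract the mean
sigma2 = (Xc ** 2).mean(axis=0)                # (1/m) * sum(x^(i) ** 2), element-wise
X_train = Xc / sigma2                          # 2. normalize the variance
X_test = (X_test - mu) / sigma2                # reuse the SAME mu and sigma^2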
Vanishing and exploding gradients
- In very deep networks gradients can shrink or blow up exponentially; careful random weight initialization helps tackle this problem.
- If tanh activation: Xavier initialization.
- If ReLU activation: He initialization, scaling by np.sqrt(2 / n^(l-1)); a related variant scales by np.sqrt(np.divide(2, n^(l-1) + n^(l))).
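A sketch of He/Xavier scaling for a single layer; the function name and the Gaussian sampling are assumptions:

import numpy as np

def init_layer(n, n_prev, activation="relu", seed=0):
    rng = np.random.default_rng(seed)
    if activation == "relu":
        scale = np.sqrt(2.0 / n_prev)          # He initialization
    else:
        scale = np.sqrt(1.0 / n_prev)          # Xavier, for tanh
    W = rng.normal(size=(n, n_prev)) * scale   # keeps activation variance stable,
    b = np.zeros((n, 1))                       # fighting vanishing/exploding gradients
    return W, b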
Gradient Checking
- Helps find bugs in backprop and helps save time: we use it to check whether the backprop implementation is working correctly or not.
- The derivative is a limit, so approximate it numerically and check the normalized Euclidean distance between the gradient calculated numerically and the one produced by backprop.
- Don't use it in training; it is only for debugging.
- If the algorithm fails grad check, look at the individual components to try to identify the bug.
- Practical tips: remember to include the regularization term in the cost; grad check doesn't work with dropout, so set keep_prob = 1.0; run it at random initialization, and perhaps again after some training.
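A self-contained sketch of gradient checking on a toy cost; J and grad_analytic stand in for a real network's cost and backprop output:

import numpy as np

def J(theta):                      # toy cost
    return np.sum(theta ** 2)

def grad_analytic(theta):          # what "backprop" would return
    return 2 * theta

theta = np.array([1.0, -3.0, 2.0])
eps = 1e-7
grad_approx = np.zeros_like(theta)
for i in range(theta.size):                        # two-sided numerical gradient,
    tp, tm = theta.copy(), theta.copy()            # straight from the limit definition
    tp[i] += eps
    tm[i] -= eps
    grad_approx[i] = (J(tp) - J(tm)) / (2 * eps)

grad = grad_analytic(theta)
diff = (np.linalg.norm(grad_approx - grad)         # normalized Euclidean distance
        / (np.linalg.norm(grad_approx) + np.linalg.norm(grad)))
print(diff)                        # around 1e-7 or smaller suggests backprop is correct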

3. Hyperparameter Tuning, Batch Normalization and Programming Frameworks

Hyperparameter Tuning
- Some hyperparameters are more important than others: the learning rate is the most important, next maybe the momentum term, third the mini-batch size, etc.
- To find the right values you have to sample them randomly (not on a grid), then use a coarse-to-fine scheme.
- Use an appropriate scale to pick hyperparameters: drastically different lower and upper limits? Use a log scale rather than a linear scale. The exponentially weighted average parameter can't be sampled on a linear scale because the model is very sensitive to it near 1.
- In practice: the pandas approach is to babysit one model; the caviar approach is to train multiple models simultaneously. Which approach you select will depend on the availability of computation power.

Batch Normalization
- Normalize the z terms: instead of using unnormalized values of z you use normalized ones. gamma and beta are learnable parameters, and z~(i) = gamma * z_norm(i) + beta.
- Implementation: batch norm is applied using mini-batch gradient descent (or Adam/RMSProp/momentum); see the sketch below.
- Why BN works: it makes the weights deeper in the NN more robust to changes in earlier layers. BN also has a slight regularization effect, though it should not be used as a regularization technique.
- BN at test time: mu and sigma^2 are calculated using exponentially weighted averages across the mini-batches seen during training.
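A sketch of the batch-norm forward pass at training time for one mini-batch; the shapes and function name are assumptions, and the test-time EWA of mu and sigma^2 is not shown:

import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)        # per-unit variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # normalized z
    return gamma * Z_norm + beta              # z~ = gamma * z_norm + beta

Z = np.random.default_rng(3).normal(size=(4, 64))   # 4 units, mini-batch of 64
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))     # learnable, trained with GD/Adam
Z_tilde = batchnorm_forward(Z, gamma, beta)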

Softmax Regression
- A generalization of logistic regression: recognize 1 out of c classes.
- Changes in the labels: make each label a (number of final classes x 1) one-hot vector.
- Training a softmax classifier: normally programming frameworks will take care of the complex operations.
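A sketch of the softmax activation itself; the max subtraction for numerical stability is a standard trick, an assumption rather than something from the notes:

import numpy as np

def softmax(z):
    t = np.exp(z - z.max())       # subtract the max for numerical stability
    return t / t.sum()            # probabilities over the c classes, summing to 1

z = np.array([2.0, 1.0, 0.1, -1.0])   # scores for c = 4 classes
print(softmax(z))                     # the largest score gets the largest probability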

Deep Learning Frameworks
- Frameworks enable you to train large NNs easily: Caffe / Caffe2, CNTK, DL4J, Keras, Lasagne, mxNet, Paddle Paddle, TensorFlow (used in the assignment), Theano, Torch.
