
CS273B Lecture 6: regularization and optimization for deep learning

James Zou

10/12/16

Recap: architectures

•  Feedforward: learning a nonlinear mapping from inputs to outputs.

•  Convnets

•  RNN, LSTM

Applications: predicting TF binding, gene expression, disease status from images, risk from SNPs, and protein structure.





How to train your neural network

Regularization—prevent overfitting

•  Early stopping

•  L2 regularization (aka weight decay)

•  Multi-task learning; data augmentation

•  Dropout

Optimization—overcome underfitting

•  SGD, SGD with momentum

•  RMSProp

Empirical loss vs true loss

Given a training set D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots\}.

The goal of neural networks (and most of ML) is to solve

\theta^* = \arg\min_\theta \, \mathbb{E}\big[ L(f(x, \theta), y) \big]    (true loss)

where L is the loss metric and the expectation is over the data distribution.

However, we can only solve the proxy

\hat{\theta} = \arg\min_\theta \sum_i L(f(x^{(i)}, \theta), y^{(i)})    (empirical loss)

Overfitting arises from using this empirical-loss proxy in place of the true loss.


Early Stopping

Split the entire dataset into train, validation, and test sets.

[Figure: training error and validation error vs. number of training steps; training error keeps decreasing while validation error turns back up, and training is stopped at that point.]

Use in combination with any optimization and regularization.
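A minimal sketch of the early-stopping loop described above. The `model` object with a `weights` attribute and the `train_step` and `validation_loss` helpers are hypothetical placeholders, not from the lecture; the patience and checkpoint interval are illustrative.

```python
import copy

def train_with_early_stopping(model, train_step, validation_loss,
                              max_steps=10000, patience=10, check_every=100):
    """Stop once validation error has not improved for `patience` checks."""
    best_val = float("inf")
    best_weights = copy.deepcopy(model.weights)
    checks_since_best = 0
    for step in range(max_steps):
        train_step(model)                      # one optimization update on a minibatch
        if step % check_every == 0:
            val = validation_loss(model)
            if val < best_val:                 # validation error still improving
                best_val = val
                best_weights = copy.deepcopy(model.weights)
                checks_since_best = 0
            else:
                checks_since_best += 1
            if checks_since_best >= patience:  # stop: validation error no longer improving
                break
    model.weights = best_weights               # roll back to the best checkpoint
    return model
```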




Weight decay

Optimize the loss plus a regularization penalty:

\hat{\theta} = \arg\min_\theta \sum_i L(f(x^{(i)}, \theta), y^{(i)}) + \frac{\lambda}{2} \|\theta\|^2

In gradient descent, the penalty adds a shrinkage term to each update:

g_t = \nabla_\theta \sum_i L(f(x^{(i)}, \theta_t), y^{(i)}) + \lambda \theta_t

\theta_{t+1} = \theta_t - \epsilon \, g_t = (1 - \epsilon \lambda)\,\theta_t - \epsilon \nabla_\theta \sum_i L(f(x^{(i)}, \theta_t), y^{(i)})
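A small NumPy sketch of the shrink-then-step update above on a toy least-squares loss; the data, \lambda, and learning rate are illustrative choices, not from the lecture.

```python
import numpy as np

# Toy weight-decay example: data loss 0.5 * ||A theta - b||^2,
# penalty 0.5 * lam * ||theta||^2. All values are illustrative.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
theta = rng.normal(size=5)
eps, lam = 0.01, 0.1              # learning rate and weight-decay strength

for t in range(500):
    grad_loss = A.T @ (A @ theta - b)                    # gradient of the data loss
    theta = (1 - eps * lam) * theta - eps * grad_loss    # shrink weights, then step

print(theta)
```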
Weight decay

Why does this help to reduce overfitting?

•  It corresponds to a Bayesian prior that the weights are close to zero.

•  It restricts the complexity of the learned neural network.


Increase your training set: multitask learning

[Figure: task-specific predictions y1 and y2 are produced by task-specific hidden layers h1, h2, h3, which sit on top of a shared layer h_shared.]

Leverages all the data.
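A minimal PyTorch sketch of the shared-trunk architecture in the figure: two task heads read the same shared hidden layer, so every training example updates the trunk. Layer sizes and the example tasks are illustrative assumptions, not from the lecture.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared hidden layer h_shared with two task-specific heads."""
    def __init__(self, n_in=100, n_shared=64, n_task=32):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_in, n_shared), nn.ReLU())
        self.head1 = nn.Sequential(nn.Linear(n_shared, n_task), nn.ReLU(),
                                   nn.Linear(n_task, 1))   # e.g. TF binding
        self.head2 = nn.Sequential(nn.Linear(n_shared, n_task), nn.ReLU(),
                                   nn.Linear(n_task, 1))   # e.g. gene expression

    def forward(self, x):
        h_shared = self.shared(x)          # shared representation for both tasks
        return self.head1(h_shared), self.head2(h_shared)

# The joint loss is a (possibly weighted) sum of the per-task losses.
model = MultiTaskNet()
x = torch.randn(8, 100)
y1_hat, y2_hat = model(x)
```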

Increase your training set: data augmentation

First, normalize the input—zero mean and unit standard deviation.

Then transform the input data via rotations, shifts and adding random noise to create new training data.

[Figure: examples of augmented images, from Jason Brownlee]
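A small NumPy sketch of the augmentations mentioned above (shifts and additive noise) applied to a 1-D input vector; rotations apply to 2-D image inputs, and in practice library transforms are typically used. All values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, max_shift=2, noise_std=0.1):
    """Return a randomly shifted, noise-perturbed copy of input vector x."""
    shift = rng.integers(-max_shift, max_shift + 1)
    x_aug = np.roll(x, shift)                                  # random shift
    x_aug = x_aug + rng.normal(0.0, noise_std, size=x.shape)   # additive random noise
    return x_aug

# Normalize first (zero mean, unit standard deviation), then augment.
x = rng.normal(5.0, 2.0, size=50)
x = (x - x.mean()) / x.std()
new_examples = [augment(x) for _ in range(10)]   # extra "training data"
```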




Dropout

Set each hidden unit to 0 with probability 0.5.

Set each input unit to 0 with probability 0.2.

For prediction and backpropagation during training, use only the remaining edges; the weights on dropped-out edges are not updated and keep their previous values.

Dropout: test time

Set each hidden unit to 0 with probability 0.5.

Set each input unit to 0 with probability 0.2.

At test time no units are dropped. Instead, multiply the output of each unit by its retention probability (1 minus its dropout probability), so that the expected activations match those seen during training.
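A NumPy sketch contrasting the train-time masking with the test-time rescaling described above. This follows the formulation on the slides; many modern libraries instead use "inverted" dropout, which rescales during training so that test time needs no change.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p_drop=0.5):
    """Training: zero each unit with probability p_drop and return the mask."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask, mask          # gradients flow only through the kept units

def dropout_test(h, p_drop=0.5):
    """Test: keep every unit but scale by the retention probability 1 - p_drop,
    so the expected activation matches training."""
    return h * (1.0 - p_drop)

h = rng.normal(size=(4, 8))        # a batch of hidden activations
h_train, mask = dropout_train(h, p_drop=0.5)
h_test = dropout_test(h, p_drop=0.5)
```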

Dropout intuition

Dropout is approximately training and averaging an exponentially large ensemble of networks.

Summary: regularization

Three classes of approaches:

•  L2 regularization—reduce the complexity of the function space.

•  Multi-task learning; data augmentation—effectively increase the number of training examples.

•  Dropout and other noise-addition algorithms—increase the stability of the training algorithm.



How to train your neural network

Optimization—overcome underfitting

•  SGD, SGD with momentum

•  RMSProp

Stochastic gradient descent

At each step t:

•  Sample a minibatch \{x^{(1)}, x^{(2)}, \dots, x^{(m)}\} from the training set, with labels y^{(i)}.

•  Compute the minibatch gradient g_t = \frac{1}{m} \sum_i \nabla_\theta L(f(x^{(i)}, \theta_t), y^{(i)}).

•  Update \theta_{t+1} = \theta_t - \epsilon \, g_t, where \epsilon is the learning rate.
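A minimal NumPy sketch of the minibatch SGD loop above on a toy least-squares problem; the data and hyperparameters are illustrative, not from the lecture.

```python
import numpy as np

# Minibatch SGD on a toy least-squares problem (illustrative data and settings).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
theta = np.zeros(10)
eps, m = 0.01, 32                  # learning rate and minibatch size

for t in range(200):
    idx = rng.choice(len(X), size=m, replace=False)   # sample a minibatch
    Xb, yb = X[idx], y[idx]
    g = Xb.T @ (Xb @ theta - yb) / m                  # minibatch gradient
    theta = theta - eps * g                           # SGD update
```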
SGD with momentum

SGD can zig-zag, especially when the loss landscape is ill-conditioned.

Momentum prefers to keep moving in a direction similar to the previous steps.

[Figure from Goodfellow, Bengio, Courville]



SGD with momentum

At each step t:

•  Sample a minibatch \{x^{(1)}, x^{(2)}, \dots, x^{(m)}\} with labels y^{(i)}.

•  Compute the minibatch gradient g_t = \frac{1}{m} \sum_i \nabla_\theta L(f(x^{(i)}, \theta_t), y^{(i)}).

•  Update the velocity v_{t+1} = \alpha v_t - \epsilon \, g_t and the parameters \theta_{t+1} = \theta_t + v_{t+1}, where \alpha is the momentum coefficient.
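The same toy problem with the momentum update above: the velocity accumulates past gradients, so successive steps point in a more consistent direction. Hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
theta, v = np.zeros(10), np.zeros(10)
eps, alpha, m = 0.01, 0.9, 32      # learning rate, momentum, minibatch size

for t in range(200):
    idx = rng.choice(len(X), size=m, replace=False)
    Xb, yb = X[idx], y[idx]
    g = Xb.T @ (Xb @ theta - yb) / m
    v = alpha * v - eps * g        # velocity: decayed sum of past gradients
    theta = theta + v              # step in the accumulated direction
```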
What are limitations of gradient-based methods?

•  Local minima and saddle points.

•  Performance depends crucially on the step size: if it is too small, many steps are required; if it is too large, the gradients are no longer informative.

•  The algorithms we have seen so far require setting the step size by hand.

RMSProp

Idea: set the learning rate adaptively using the history of gradients.

At each step t:

•  Sample a minibatch \{x^{(1)}, x^{(2)}, \dots, x^{(m)}\} with labels y^{(i)}.

•  Compute the minibatch gradient g_t = \frac{1}{m} \sum_i \nabla_\theta L(f(x^{(i)}, \theta_t), y^{(i)}).

•  Accumulate squared gradients: r_t = \rho \, r_{t-1} + (1 - \rho) \, g_t \odot g_t.

•  Update \theta_{t+1} = \theta_t - \frac{\epsilon}{\sqrt{r_t} + \delta} \odot g_t, where \delta is a small constant for numerical stability.
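A NumPy sketch of the RMSProp update above on the same toy problem; \rho, \delta, and the learning rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
theta = np.zeros(10)
r = np.zeros(10)                             # running average of squared gradients
eps, rho, delta, m = 0.001, 0.9, 1e-8, 32    # illustrative hyperparameters

for t in range(200):
    idx = rng.choice(len(X), size=m, replace=False)
    Xb, yb = X[idx], y[idx]
    g = Xb.T @ (Xb @ theta - yb) / m
    r = rho * r + (1 - rho) * g * g                   # accumulate squared gradients
    theta = theta - eps / (np.sqrt(r) + delta) * g    # per-parameter step size
```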
Example: DeepBind

DeepBind optimization

Objective function:

\hat{\theta} = \arg\min_\theta \sum_i L(f(x^{(i)}, \theta), y^{(i)}) + \lambda \|\theta\|^2

Initialization: weights drawn from a normal distribution, with the scale chosen from a specified range.

Early stopping using validation data.

SGD with momentum, with the batch size chosen from a range.

Dropout.

Hyperparameter optimization

•  Amount of weight decay (searched over a range).

•  Learning rate (searched over a range) and momentum.

•  Dropout probability: 0.5, 0.25, or 0.

•  Batch size: 30 to 200.

Shahriari et al. Taking the human out of the loop: a review of Bayesian optimization.
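As a companion to the list above, a sketch of simple random search over hyperparameters of this kind; Bayesian optimization (Shahriari et al.) would replace the random sampling with a model-guided choice. The weight-decay and learning-rate ranges are assumed placeholders (the exact ranges are not legible in the slide), and `train_and_validate` is a hypothetical function returning validation loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_config():
    """Randomly sample one hyperparameter configuration.
    Weight-decay and learning-rate ranges below are assumed, not from the slide."""
    return {
        "weight_decay": 10 ** rng.uniform(-10, -2),   # log-uniform (assumed range)
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform (assumed range)
        "momentum": rng.uniform(0.9, 0.99),           # assumed range
        "dropout": float(rng.choice([0.5, 0.25, 0.0])),  # values from the slide
        "batch_size": int(rng.integers(30, 201)),        # 30 to 200, from the slide
    }

def random_search(train_and_validate, n_trials=30):
    """Train with n_trials random configurations and keep the best one.
    `train_and_validate(config)` is a hypothetical placeholder returning
    validation loss for a model trained with that configuration."""
    return min((sample_config() for _ in range(n_trials)),
               key=train_and_validate)
```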
