
CS273B Lecture 6: regularization and optimization for deep learning

James Zou

10/12/16

Recap: architectures

•  Feedforward: learning a nonlinear mapping from inputs to outputs.

•  Convnets

•  RNN, LSTM

Applications: predicting TF binding, gene expression, disease status from images, risk from SNPs, and protein structure.





How to train your neural network

Regularization—prevent overfitting

•  Early stopping

•  L2 regularization (aka weight decay)

•  Multi-task learning; data augmentation

•  Dropout

Optimization—overcome underfitting

•  SGD, SGD with momentum

•  RMSProp

Empirical loss vs true loss

Given a training set D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots\}.

The goal of neural networks (and most of ML) is to solve

\theta^* = \arg\min_\theta \, \mathbb{E}\big[ L(f(x, \theta), y) \big]    (true loss)

where L is the loss metric and the expectation is over the data distribution.

However, we can only solve the proxy

\hat{\theta} = \arg\min_\theta \sum_i L(f(x^{(i)}, \theta), y^{(i)})    (empirical loss)

Overfitting arises from using this empirical-loss proxy in place of the true loss.


Early Stopping

Split the entire dataset into train, validation, and test sets.

[Figure: training error and validation error vs. number of training steps; training error keeps decreasing while validation error turns back up, and training is stopped at that point.]

Use in combination with any optimization and regularization.
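A minimal sketch of the early-stopping loop described above. The `model` object with a `weights` attribute and the `train_step` and `validation_loss` helpers are hypothetical placeholders, not from the lecture; the patience and checkpoint interval are illustrative.

```python
import copy

def train_with_early_stopping(model, train_step, validation_loss,
                              max_steps=10000, patience=10, check_every=100):
    """Stop once validation error has not improved for `patience` checks."""
    best_val = float("inf")
    best_weights = copy.deepcopy(model.weights)
    checks_since_best = 0
    for step in range(max_steps):
        train_step(model)                      # one optimization update on a minibatch
        if step % check_every == 0:
            val = validation_loss(model)
            if val < best_val:                 # validation error still improving
                best_val = val
                best_weights = copy.deepcopy(model.weights)
                checks_since_best = 0
            else:
                checks_since_best += 1
            if checks_since_best >= patience:  # stop: validation error no longer improving
                break
    model.weights = best_weights               # roll back to the best checkpoint
    return model
```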




Weight decay

Optimize the loss plus a regularization penalty:

\hat{\theta} = \arg\min_\theta \sum_i L(f(x^{(i)}, \theta), y^{(i)}) + \frac{\lambda}{2} \|\theta\|^2

In gradient descent, the penalty adds a shrinkage term to each update:

g_t = \nabla_\theta \sum_i L(f(x^{(i)}, \theta_t), y^{(i)}) + \lambda \theta_t

\theta_{t+1} = \theta_t - \epsilon \, g_t = (1 - \epsilon \lambda)\,\theta_t - \epsilon \nabla_\theta \sum_i L(f(x^{(i)}, \theta_t), y^{(i)})
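A small NumPy sketch of the shrink-then-step update above on a toy least-squares loss; the data, \lambda, and learning rate are illustrative choices, not from the lecture.

```python
import numpy as np

# Toy weight-decay example: data loss 0.5 * ||A theta - b||^2,
# penalty 0.5 * lam * ||theta||^2. All values are illustrative.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
theta = rng.normal(size=5)
eps, lam = 0.01, 0.1              # learning rate and weight-decay strength

for t in range(500):
    grad_loss = A.T @ (A @ theta - b)                    # gradient of the data loss
    theta = (1 - eps * lam) * theta - eps * grad_loss    # shrink weights, then step

print(theta)
```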
Weight decay

Why does this help to reduce overfitting?

•  It corresponds to a Bayesian prior that the weights are close to zero.

•  It restricts the complexity of the learned neural network.


Increase your training set: multitask learning

[Figure: task-specific predictions y1 and y2 are produced by task-specific hidden layers h1, h2, h3, which sit on top of a shared layer h_shared.]

Leverages all the data.
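A minimal PyTorch sketch of the shared-trunk architecture in the figure: two task heads read the same shared hidden layer, so every training example updates the trunk. Layer sizes and the example tasks are illustrative assumptions, not from the lecture.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared hidden layer h_shared with two task-specific heads."""
    def __init__(self, n_in=100, n_shared=64, n_task=32):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_in, n_shared), nn.ReLU())
        self.head1 = nn.Sequential(nn.Linear(n_shared, n_task), nn.ReLU(),
                                   nn.Linear(n_task, 1))   # e.g. TF binding
        self.head2 = nn.Sequential(nn.Linear(n_shared, n_task), nn.ReLU(),
                                   nn.Linear(n_task, 1))   # e.g. gene expression

    def forward(self, x):
        h_shared = self.shared(x)          # shared representation for both tasks
        return self.head1(h_shared), self.head2(h_shared)

# The joint loss is a (possibly weighted) sum of the per-task losses.
model = MultiTaskNet()
x = torch.randn(8, 100)
y1_hat, y2_hat = model(x)
```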

Increase your training set: data augmentation

First, normalize the input—zero mean and unit standard deviation.

Then transform the input data via rotations, shifts and adding random noise to create new training data.

[Figure: examples of augmented images, from Jason Brownlee]
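A small NumPy sketch of the augmentations mentioned above (shifts and additive noise) applied to a 1-D input vector; rotations apply to 2-D image inputs, and in practice library transforms are typically used. All values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, max_shift=2, noise_std=0.1):
    """Return a randomly shifted, noise-perturbed copy of input vector x."""
    shift = rng.integers(-max_shift, max_shift + 1)
    x_aug = np.roll(x, shift)                                  # random shift
    x_aug = x_aug + rng.normal(0.0, noise_std, size=x.shape)   # additive random noise
    return x_aug

# Normalize first (zero mean, unit standard deviation), then augment.
x = rng.normal(5.0, 2.0, size=50)
x = (x - x.mean()) / x.std()
new_examples = [augment(x) for _ in range(10)]   # extra "training data"
```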




Dropout

Set each hidden unit to 0 with probability 0.5.

Set each input unit to 0 with probability 0.2.

For prediction and backpropagation during training, use only the remaining edges; the weights on dropped-out edges are not updated and keep their previous values.

Dropout: test time

Set each hidden unit to 0 with probability 0.5.

Set each input unit to 0 with probability 0.2.

At test time no units are dropped. Instead, multiply the output of each unit by its retention probability (1 minus its dropout probability), so that the expected activations match those seen during training.
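A NumPy sketch contrasting the train-time masking with the test-time rescaling described above. This follows the formulation on the slides; many modern libraries instead use "inverted" dropout, which rescales during training so that test time needs no change.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p_drop=0.5):
    """Training: zero each unit with probability p_drop and return the mask."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask, mask          # gradients flow only through the kept units

def dropout_test(h, p_drop=0.5):
    """Test: keep every unit but scale by the retention probability 1 - p_drop,
    so the expected activation matches training."""
    return h * (1.0 - p_drop)

h = rng.normal(size=(4, 8))        # a batch of hidden activations
h_train, mask = dropout_train(h, p_drop=0.5)
h_test = dropout_test(h, p_drop=0.5)
```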

Dropout intuition

Dropout is approximately training and averaging an exponentially large ensemble of networks.

Summary: regularization

Three classes of approaches:

•  L2 regularization—reduce the complexity of the function space.

•  Multi-task learning; data augmentation—effectively increase the number of training examples.

•  Dropout and other noise-addition algorithms—increase the stability of the training algorithm.



How to train your neural network

Optimization—overcome underfitting

•  SGD, SGD with momentum

•  RMSProp

Stochastic gradient descent

At each step t:

•  Sample a minibatch \{x^{(1)}, x^{(2)}, \dots, x^{(m)}\} from the training set, with labels y^{(i)}.

•  Compute the minibatch gradient g_t = \frac{1}{m} \sum_i \nabla_\theta L(f(x^{(i)}, \theta_t), y^{(i)}).

•  Update \theta_{t+1} = \theta_t - \epsilon \, g_t, where \epsilon is the learning rate.
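A minimal NumPy sketch of the minibatch SGD loop above on a toy least-squares problem; the data and hyperparameters are illustrative, not from the lecture.

```python
import numpy as np

# Minibatch SGD on a toy least-squares problem (illustrative data and settings).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
theta = np.zeros(10)
eps, m = 0.01, 32                  # learning rate and minibatch size

for t in range(200):
    idx = rng.choice(len(X), size=m, replace=False)   # sample a minibatch
    Xb, yb = X[idx], y[idx]
    g = Xb.T @ (Xb @ theta - yb) / m                  # minibatch gradient
    theta = theta - eps * g                           # SGD update
```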
SGD with momentum

SGD can zig-zag, especially when the loss landscape is ill-conditioned.

Momentum prefers to keep moving in a direction similar to the previous steps.

[Figure from Goodfellow, Bengio, Courville]



SGD with momentum

At each step t:

•  Sample a minibatch \{x^{(1)}, x^{(2)}, \dots, x^{(m)}\} with labels y^{(i)}.

•  Compute the minibatch gradient g_t = \frac{1}{m} \sum_i \nabla_\theta L(f(x^{(i)}, \theta_t), y^{(i)}).

•  Update the velocity v_{t+1} = \alpha v_t - \epsilon \, g_t and the parameters \theta_{t+1} = \theta_t + v_{t+1}, where \alpha is the momentum coefficient.
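The same toy problem with the momentum update above: the velocity accumulates past gradients, so successive steps point in a more consistent direction. Hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
theta, v = np.zeros(10), np.zeros(10)
eps, alpha, m = 0.01, 0.9, 32      # learning rate, momentum, minibatch size

for t in range(200):
    idx = rng.choice(len(X), size=m, replace=False)
    Xb, yb = X[idx], y[idx]
    g = Xb.T @ (Xb @ theta - yb) / m
    v = alpha * v - eps * g        # velocity: decayed sum of past gradients
    theta = theta + v              # step in the accumulated direction
```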
What are limitations of gradient-based methods?

•  Local minima and saddle points.

•  Performance depends crucially on the step size: if it is too small, many steps are required; if it is too large, the gradients are no longer informative.

•  The algorithms we have seen so far require setting the step size by hand.

RMSProp

Idea: set the learning rate adaptively using the history of gradients.

At each step t:

•  Sample a minibatch \{x^{(1)}, x^{(2)}, \dots, x^{(m)}\} with labels y^{(i)}.

•  Compute the minibatch gradient g_t = \frac{1}{m} \sum_i \nabla_\theta L(f(x^{(i)}, \theta_t), y^{(i)}).

•  Accumulate squared gradients: r_t = \rho \, r_{t-1} + (1 - \rho) \, g_t \odot g_t.

•  Update \theta_{t+1} = \theta_t - \frac{\epsilon}{\sqrt{r_t} + \delta} \odot g_t, where \delta is a small constant for numerical stability.
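A NumPy sketch of the RMSProp update above on the same toy problem; \rho, \delta, and the learning rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
theta = np.zeros(10)
r = np.zeros(10)                             # running average of squared gradients
eps, rho, delta, m = 0.001, 0.9, 1e-8, 32    # illustrative hyperparameters

for t in range(200):
    idx = rng.choice(len(X), size=m, replace=False)
    Xb, yb = X[idx], y[idx]
    g = Xb.T @ (Xb @ theta - yb) / m
    r = rho * r + (1 - rho) * g * g                   # accumulate squared gradients
    theta = theta - eps / (np.sqrt(r) + delta) * g    # per-parameter step size
```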
Example: DeepBind

DeepBind optimization

Objective function:

\hat{\theta} = \arg\min_\theta \sum_i L(f(x^{(i)}, \theta), y^{(i)}) + \lambda \|\theta\|^2

Initialization: weights drawn from a normal distribution, with the scale chosen from a specified range.

Early stopping using validation data.

SGD with momentum, with the batch size chosen from a range.

Dropout.

Hyperparameter optimization

•  Amount of weight decay (searched over a range).

•  Learning rate (searched over a range) and momentum.

•  Dropout probability: 0.5, 0.25, or 0.

•  Batch size: 30 to 200.

Shahriari et al. Taking the human out of the loop: a review of Bayesian optimization.
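As a companion to the list above, a sketch of simple random search over hyperparameters of this kind; Bayesian optimization (Shahriari et al.) would replace the random sampling with a model-guided choice. The weight-decay and learning-rate ranges are assumed placeholders (the exact ranges are not legible in the slide), and `train_and_validate` is a hypothetical function returning validation loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_config():
    """Randomly sample one hyperparameter configuration.
    Weight-decay and learning-rate ranges below are assumed, not from the slide."""
    return {
        "weight_decay": 10 ** rng.uniform(-10, -2),   # log-uniform (assumed range)
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform (assumed range)
        "momentum": rng.uniform(0.9, 0.99),           # assumed range
        "dropout": float(rng.choice([0.5, 0.25, 0.0])),  # values from the slide
        "batch_size": int(rng.integers(30, 201)),        # 30 to 200, from the slide
    }

def random_search(train_and_validate, n_trials=30):
    """Train with n_trials random configurations and keep the best one.
    `train_and_validate(config)` is a hypothetical placeholder returning
    validation loss for a model trained with that configuration."""
    return min((sample_config() for _ in range(n_trials)),
               key=train_and_validate)
```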
