Practical Aspects of Deep Learning, Part I
Arles Rodríguez
[email protected]
Facultad de Ciencias
Departamento de Matemáticas
Universidad Nacional de Colombia
Motivation
• It is almost impossible to pick the best hyperparameters for a specific application on the first try:
– # layers
– # hidden units per layer
– learning rates
– activation functions
• Deep learning is an iterative process: idea, code, experiment, and repeat.
Motivation
• Neural networks have been successful in NLP, computer vision, speech recognition, and structured data.
• Intuition gained in one area often does not transfer to other application areas.
Data
• Classic splits for small datasets:
– Train 70% / Test 30%
– Train 60% / Dev (hold-out cross-validation set) 20% / Test 20%
• With large datasets (e.g., around 1,000,000 examples):
– Train 98% / Dev 1% / Test 1% (1% of 1,000,000 is 10,000 examples)
• With even more data:
– 99.5% / 0.25% / 0.25%
– 99.5% / 0.4% / 0.1%
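As a minimal sketch (assuming the dataset is held in NumPy arrays X and y, names not taken from the slides), these splits can be produced by shuffling indices and slicing:

import numpy as np

def train_dev_test_split(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the examples and slice them into train/dev/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_dev = int(len(X) * dev_frac)
    n_test = int(len(X) * test_frac)
    dev_idx = idx[:n_dev]
    test_idx = idx[n_dev:n_dev + n_test]
    train_idx = idx[n_dev + n_test:]
    return ((X[train_idx], y[train_idx]),
            (X[dev_idx], y[dev_idx]),
            (X[test_idx], y[test_idx]))

For a 60/20/20 split, pass dev_frac=0.2 and test_frac=0.2.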
train/test set distribution
• The key difference between training and test datasets is that test sets are unseen:
the training procedure has not used the test examples.
• Training and dev/test sets must come from the
same distribution.
Is this a good data setting?
• The cat app is segmented into 4 regions based on the largest markets: US, China, India, and Latin America.
• Is it right to randomly assign two of these
segments to the training/dev sets and the other
two to the test set?
Is this a good data setting?
• Is it right to randomly assign two of these
segments to the training/dev sets and the other
two to the test set?
• Answer: the dev set should reflect the task you want to improve the most: doing well in all four regions, not in only two.
• Otherwise, the app will probably work well on the training set but not on the test set.
Another example
• We trained a model to detect cats in
pictures.
• The data was taken mainly from the internet and split 70%/30% into training and test sets, and the algorithm worked well.
• When users start uploading their own cat pictures, the performance is poor. What happened?
• Data must come from the same
distribution!
Bias/variance
[Figure: three decision boundaries on (x1, x2): high bias (the model underfits the data), just right (medium level of complexity), and high variance (the model overfits the data).]
• With two features (2D) it is possible to plot the data and visualize bias and variance.
• With high-dimensional data it is not possible to plot the data and visualize the decision boundary.
Example: cat classification
• Classify pictures as cat (y = 1) or not cat (y = 0). Human performance is nearly 0% error.

                   Case 1           Case 2          Case 3
Train set error    1%               15%             0.5%
Dev set error      11%              16%             1%
Diagnosis          high variance    high bias       low bias, low variance
                   (overfit)        (underfit)

• If no human can do well either (e.g., the images are blurry), the Bayes error could be high and this analysis would change.
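A minimal sketch of this diagnosis logic (the 0.05 threshold is an illustrative assumption, not a value from the slides; in practice the judgment is made relative to Bayes/human-level error):

def diagnose(train_error, dev_error, bayes_error=0.0, gap=0.05):
    """Rough bias/variance diagnosis from train and dev errors."""
    high_bias = (train_error - bayes_error) > gap
    high_variance = (dev_error - train_error) > gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfit)"
    if high_variance:
        return "high variance (overfit)"
    return "low bias, low variance"

print(diagnose(0.01, 0.11))   # high variance (overfit)
print(diagnose(0.15, 0.16))   # high bias (underfit)
print(diagnose(0.005, 0.01))  # low bias, low variance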
Some assumptions
• Bayes error is quite small.
• Train/dev sets are drawn from the same distribution.
• With high-dimensional data it is possible to have high bias and high variance at the same time.
[Figure: a decision boundary on (x1, x2) illustrating this case.]
bias/variance tradeoff I
• Bias/variance analysis is easy to learn but hard to master.
• Traditional models talk about a bias/variance tradeoff:
– Reducing bias comes at the cost of increasing variance, and vice versa.
– Adding neurons/layers to a NN, or adding features, reduces bias but can increase variance.
– Adding regularization can increase bias but reduces variance.
bias/variance tradeoff II
• With deep learning:
– It is possible to increase a neural network's size and tune regularization to reduce bias without noticeably increasing variance.
– By adding training data, it is possible to reduce
variance without affecting bias.
– With a model architecture that is well suited for a
task, bias and variance can be reduced
simultaneously.
Basic recipe for NN models
[Flowchart: basic training recipe, starting from "Train Model".]
L2 regularization
• L2 regularization is the most used:
$\|w\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$
• With L2 regularization, gradient descent performs weight decay:
$w^{[l]} := w^{[l]} \left(1 - \frac{\alpha \lambda}{m}\right) - \alpha \, (\text{term from backprop})$
[Figure: a small network with inputs x1, x2 and output ŷ, in which hidden units are crossed out as weight decay drives their weights toward zero.]
$J(w^{[1]}, b^{[1]}, \ldots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \|w^{[l]}\|_F^2$
• Penalizes the weight matrices for being too large.
• If $\lambda$ is high, $w$ tends to be close to zero.
• Setting the weights of many hidden units close to zero effectively zeroes out a large part of the input to those units.
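A minimal NumPy sketch of the regularized cost and the weight-decay update, assuming the weight matrices are stored in a Python list (names are illustrative, not from the slides):

import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lam, m):
    """Add the penalty (lambda / 2m) * sum over layers of ||W[l]||_F^2."""
    l2_penalty = (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_penalty

def weight_decay_update(W, dW_backprop, alpha, lam, m):
    """Gradient step W := W * (1 - alpha * lambda / m) - alpha * dW_from_backprop."""
    return W * (1 - alpha * lam / m) - alpha * dW_backprop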
Regularization and activation functions
• With an activation such as tanh, $g(z) \approx z$ (roughly linear), as $z$ is quite small.
• A higher $\lambda$ causes lower $w^{[l]}$ values; since $z^{[l]} = w^{[l]} a^{[l-1]} + b^{[l]}$, the $z^{[l]}$ values are also small, every layer stays in the near-linear regime, and the network cannot fit overly complex decision boundaries.
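A quick numeric check of the near-linearity claim (the z values are arbitrary illustrations):

import numpy as np

z_small = np.array([-0.10, 0.05, 0.10])
z_large = np.array([-3.0, 2.0, 3.0])
print(np.tanh(z_small))  # approx. [-0.0997, 0.0500, 0.0997]: close to z itself
print(np.tanh(z_large))  # approx. [-0.995, 0.964, 0.995]: saturated, far from linear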
Implementation
• When plotting the cost function, include the regularization term: the plotted value of the regularized J(…) (the cross-entropy term plus the Frobenius penalty above) should decrease monotonically.
[Plot: J(…) versus #iterations.]
Dropout regularization
• It is used less than L2 regularization.
• On each training iteration, some units are randomly dropped, and forward and backward propagation are performed on the diminished network.
[Figure: a network with inputs x1, ..., x4 and output ŷ, with some units dropped.]
Dropout Implementation
• A way to implement dropout is inverted dropout.
• The technique takes a layer and defines a dropout vector that keeps each unit with probability keep_prob.
• With keep_prob := 1 - 0.2 = 0.8, the layer's activations are reduced by 20% on average.
• Dividing the surviving activations by keep_prob (0.8 / 0.8 = 1) does not change their expected value.
• If keep_prob = 1, all units are kept and nothing changes.
Dropout Implementation
• Example, if keep_prob = 0.8: about 20% of the units in a layer are shut down at random on each iteration.
[Network diagram: inputs x2, x3 to output ŷ, layer sizes n[0] = 3, 7, 7, 3, 2, 1; in the example, 10 units are shut down.]
• Strategy: apply dropout only to some layers, using the same keep_prob value for those layers (see the code sketch below).
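A minimal NumPy sketch of inverted dropout applied to one layer's activations (the array names are illustrative assumptions):

import numpy as np

def inverted_dropout(a, keep_prob=0.8):
    """Apply inverted dropout to an activation matrix `a` during training."""
    d = np.random.rand(*a.shape) < keep_prob  # keep each unit with prob. keep_prob
    a = a * d                                 # shut down the dropped units
    a = a / keep_prob                         # keep the expected value of a unchanged
    return a

# Example: a hidden layer with 7 units and a mini-batch of 5 examples.
a_hidden = np.random.randn(7, 5)
a_dropped = inverted_dropout(a_hidden, keep_prob=0.8)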
Some considerations
• Many successful applications of dropout are in computer vision.
• Regularization prevents overfitting.
• Without enough data it is easy to overfit.
• Downside: with dropout the cost function J(…) is harder to monitor; a strategy is to set keep_prob = 1, check that J(…) decreases monotonically, and then turn dropout back on.
Other regularization methods
• To deal with overfitting, getting more data can help.
• However, getting more training data can be
expensive.
• Sometimes you cannot get more data.
Technique 1: Data augmentation
• Flip horizontally
• Random zoom and crop
• Rotate
• Distort
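A minimal sketch of some of these augmentations on an image stored as an (H, W, C) NumPy array (the crop fraction and 90-degree rotations are illustrative simplifications):

import numpy as np

def augment(image, rng):
    """Return a randomly augmented copy of an (H, W, C) image array."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                    # flip horizontally
    h, w, _ = out.shape                          # random zoom and crop to 90% size
    ch, cw = int(h * 0.9), int(w * 0.9)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    out = out[top:top + ch, left:left + cw, :]
    out = np.rot90(out, k=rng.integers(0, 4))    # rotate by a multiple of 90 degrees
    return out

rng = np.random.default_rng(0)
augmented = augment(rng.random((64, 64, 3)), rng)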
Technique 2: Early stopping
• Plot the training cost J(…) against #iterations.
• Plot the dev set error against #iterations and stop training around the point where it stops decreasing.
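A minimal sketch of this idea around a generic training loop (train_one_iteration and dev_error are hypothetical callables standing in for the actual training code):

def train_with_early_stopping(train_one_iteration, dev_error, max_iters=1000, patience=10):
    """Stop once the dev set error has not improved for `patience` iterations."""
    best_error = float("inf")
    best_iteration = 0
    for iteration in range(max_iters):
        train_one_iteration()        # one pass of gradient descent updates
        error = dev_error()          # evaluate on the dev (hold-out) set
        if error < best_error:
            best_error = error
            best_iteration = iteration
            # In practice, also keep a copy of the current weights here.
        elif iteration - best_iteration >= patience:
            break                    # dev error stopped improving: stop early
    return best_iteration, best_error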