Practical Aspects of Deep Learning PI

The document discusses several key aspects of setting up deep learning models for success: 1) It is important to properly split data into training, validation, and test sets to avoid overfitting and get an accurate evaluation of model performance. 2) Tuning hyperparameters such as the number of layers, units per layer, learning rate, and activation functions is an iterative process that requires evaluating models on a validation set. 3) Regularization techniques such as L1 and L2 regularization can be applied to reduce overfitting by adding a penalty term to the loss function.

Practical Aspects of Deep Learning, Part I

Arles Rodríguez

Facultad de Ciencias
Departamento de Matemáticas
Universidad Nacional de Colombia
Motivation
• It is impossible to get the best hyperparameters on the first try for a specific application:
– #layers
– #hidden units per layer
– learning rates
– activation functions
• Deep learning is an iterative process.
[Figure: cycle of 1. idea, 2. code, 3. experiment]
Motivation
• NNs have been successful in NLP, computer vision, speech recognition and structured data.
• Intuitions about one area do not transfer to other application areas.
• Success also depends on the amount of data and on the hardware and software configuration.
Motivation
• Idea: how efficiently can we go around the iterative process?
– Setting up data well.
– Hyperparameter tuning.
– Optimizing execution and implementation aspects of the model.
Setting up data well
• In early machine learning models (pre-deep learning era), the data was split into two sets:

Data:  Training 70% | Test 30%

• This split is generally accepted as a good rule of thumb for 100, 1000, or 10000 samples.
Setting up data well
• A good idea can be split the data in three sets (for less
than 10.000 samples):
Data

dev/ hold
Training test
out cv set
60% 20% 20%

Used to train Use dev set to Evaluate the best model


on the model tune parameters, to get unbiased estimate
select features of how the model is
and make doing (generalization)
decisions
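As a minimal sketch of the 60%/20%/20% split, assuming the samples can be shuffled freely and are addressed by index (the function name `split_indices` and its parameters are hypothetical, not taken from the slides):

```python
import numpy as np

def split_indices(n_samples, dev_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle sample indices and split them into train/dev/test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)          # random order, each index once
    n_dev = int(round(n_samples * dev_frac))
    n_test = int(round(n_samples * test_frac))
    dev = idx[:n_dev]
    test = idx[n_dev:n_dev + n_test]
    train = idx[n_dev + n_test:]              # everything that is left
    return train, dev, test

train_idx, dev_idx, test_idx = split_indices(1000)
print(len(train_idx), len(dev_idx), len(test_idx))  # 600 200 200
```

Passing smaller values for `dev_frac` and `test_frac` gives the splits used for larger data sets. Shuffling before splitting is one simple way to keep the three sets on the same distribution.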
Setting up data well
• In the big data era (very large numbers of samples), the dev and test sets have become a much smaller percentage of the total.
• The dev set is useful to estimate which of two different algorithm choices is better.
Data (100000 samples)

Training   dev/cv set   test
98%        1%           1%
With even more data:

Training   dev/cv set   test
99.5%      0.25%        0.25%
99.5%      0.4%         0.1%
Train/test set distribution
• The key difference between training and test datasets is that test sets are unseen: the training procedure has not used the test examples.
• Training and dev/test sets must come from the same distribution.
Is this a good data setting?
• The cat app is segmented in 4 regions based on
the largest markets: US, China, India, Latin
America.
• Is it right to randomly assign two of these
segments to the training/dev sets and the other
two to the test set?
• Answer: No. The dev set should reflect the task you want to improve the most: doing well in all four regions, not in only two.
• The app would probably work well on the training set but not on the test set.
Another example
• We trained a model to detect cats in pictures.
• The data was taken mainly from the internet and split 70%/30% into training and test sets, and the algorithm worked well.
• Users start uploading their own cat pictures and the performance is poor. What happened?
• Data must come from the same distribution!
Bias/variance

[Figure: three scatter plots with axes x1 and x2, each showing a decision boundary]

Panel           Description
High bias       This model underfits the data.
Just right      Medium level of complexity.
High variance   This model overfits the data.

With two features (2D) it is possible to plot the data and visualize bias and variance.
With high-dimensional data it is not possible to plot the data and visualize the decision boundary.
Example: cat classification

[Figure: example images labeled y=1 and y=0]
(Human error is nearly 0%.)

Error             Case 1                    Case 2                 Case 3
Train set error   1%                        15%                    0.5%
Dev set error     11%                       16%                    1%
Diagnosis         high variance (overfit)   high bias (underfit)   low bias, low variance

If no human can do well (e.g., on blurrier images), the Bayes error could be high and the analysis can change.
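The diagnosis above can be sketched as a small helper. The `gap` threshold of 5 percentage points and the `bayes_err` argument are assumptions made for illustration; the slides only give the three example cases.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, gap=0.05):
    """Return (high_bias, high_variance) from error rates given as fractions.

    high bias:     training error is far above the Bayes (human-level) error
    high variance: dev error is far above the training error
    `gap` is a hypothetical threshold for "far above".
    """
    high_bias = (train_err - bayes_err) > gap
    high_variance = (dev_err - train_err) > gap
    return high_bias, high_variance

print(diagnose(0.01, 0.11))    # (False, True)  -> high variance
print(diagnose(0.15, 0.16))    # (True, False)  -> high bias
print(diagnose(0.005, 0.01))   # (False, False) -> low bias, low variance
```

With a high Bayes error the same numbers read differently: `diagnose(0.15, 0.16, bayes_err=0.14)` no longer reports high bias.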
Some assumptions
• Bayes error is quite small.
• Train/dev sets are drawn from the same distribution.
• In high-dimensional data it is possible to have high bias and high variance at the same time.
[Figure: scatter plot with axes x1 and x2]
bias/variance tradeoff I
• The tradeoff is easy to learn but hard to master.
• Traditional models talk about a bias/variance tradeoff:
– Reduce bias error at the cost of increasing variance, and vice versa.
– Adding neurons/layers to a NN, or adding features, reduces bias but can increase variance.
– Adding regularization can increase bias but reduces variance.
bias/variance tradeoff II
• With deep learning:
– It is possible to increase a neural network size and
tune regularization to reduce bias without
noticeably increasing variance.
– By adding training data, it is possible to reduce
variance without affecting bias.
– With a model architecture that is well suited for a
task, bias and variance can be reduced
simultaneously.
Basic recipe for NN models
• Train the model.
• Does the model have high bias (training data performance)? If yes, try the following and train again:
– Try a bigger network
– Add more hidden units/layers
– Train longer
– Try more advanced optimization algorithms
– Try another NN architecture
• If not, does the model have high variance (dev data performance)? If yes, try the following and train again:
– Get more data
– Try regularization
– Try another NN architecture
• If the model has neither high bias nor high variance: done.
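The recipe can be read as a simple decision procedure: deal with bias first, then with variance. A minimal sketch (the function name `next_steps` is hypothetical, and the two flags are assumed to come from comparing training and dev performance):

```python
def next_steps(high_bias, high_variance):
    """Suggest actions following the recipe: fix bias first, then variance."""
    if high_bias:
        return ["try a bigger network", "add more hidden units/layers", "train longer",
                "try more advanced optimization algorithms", "try another NN architecture"]
    if high_variance:
        return ["get more data", "try regularization", "try another NN architecture"]
    return ["done"]

print(next_steps(False, True))
# ['get more data', 'try regularization', 'try another NN architecture']
```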
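Regularization applied to Logistic Regression

• Apply to the high variance problem.
• Example, logistic regression:
$\min_{w,b} J(w, b)$, where $w \in \mathbb{R}^{n_x}$, $b \in \mathbb{R}$

$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \|w\|_2^2$, where $\lambda$ is the regularization parameter

• L2 regularization is the most used: $\|w\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$
• L1 regularization is defined with the L1 norm: $\|w\|_1 = \sum_{j=1}^{n_x} |w_j|$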
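A minimal NumPy sketch of the L2-regularized logistic regression cost, assuming the cross-entropy loss for $L$ and the layout `X` of shape `(n_x, m)` with one column per example (both are assumptions, as is the function name `regularized_cost`):

```python
import numpy as np

def regularized_cost(w, b, X, y, lam):
    """Cross-entropy cost of logistic regression plus (lam / 2m) * ||w||_2^2.

    X: (n_x, m) inputs, y: (m,) labels in {0, 1}, w: (n_x,) weights, b: scalar.
    """
    m = X.shape[1]
    z = w @ X + b
    y_hat = 1.0 / (1.0 + np.exp(-z))                      # sigmoid
    cross_entropy = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    l2_penalty = lam / (2 * m) * np.dot(w, w)             # w^T w
    return cross_entropy + l2_penalty

X = np.array([[0.5, -1.0, 2.0],
              [1.0,  0.0, -0.5]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([1.0, 2.0])
print(regularized_cost(w, 0.1, X, y, lam=1.0))
```

Only `w` is penalized here; the bias `b` is left out of the penalty, matching the formula.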


L2 Regularization applied to NN
$J(w^{[1]}, b^{[1]}, \ldots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \|w^{[l]}\|_F^2$

Squared Frobenius norm:
$\|w^{[l]}\|_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (w_{ij}^{[l]})^2$

Shape of $w^{[l]}$: $(n^{[l]}, n^{[l-1]})$, that is, (#units in layer $l$, #units in layer $l-1$).

Frobenius mini-exercise: |23 5 7 3 9|
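A short sketch of the squared Frobenius norm and the resulting penalty term. The matrices `W1` and `W2` are hypothetical examples, not the ones from the mini-exercise, and `l2_penalty` is an illustrative name:

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """(lam / 2m) times the sum of squared Frobenius norms of all weight matrices."""
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)

W1 = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])   # hypothetical w[1], shape (n[1], n[0]) = (2, 3)
W2 = np.array([[1.0, -1.0]])       # hypothetical w[2], shape (n[2], n[1]) = (1, 2)

squared_norm = np.sum(W1 ** 2)     # 1 + 4 + 9 + 16 + 25 + 36 = 91
print(squared_norm)                # 91.0
penalty = l2_penalty([W1, W2], lam=0.5, m=10)   # 0.5 / 20 * (91 + 2)
```

`np.linalg.norm(W1, "fro") ** 2` gives the same value as `squared_norm` up to rounding.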
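Gradient descent
$dw^{[l]} = (\text{backprop}) + \frac{\lambda}{m} w^{[l]}$

$w^{[l]} := w^{[l]} - \alpha\, dw^{[l]}$

L2 regularization is also called weight decay:

$w^{[l]} := w^{[l]} - \alpha \left[ (\text{backprop}) + \frac{\lambda}{m} w^{[l]} \right]$

$w^{[l]} := w^{[l]} - \alpha \frac{\lambda}{m} w^{[l]} - \alpha\,(\text{backprop})$

$w^{[l]} := w^{[l]} \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha\,(\text{backprop})$

The factor $\left( 1 - \alpha \frac{\lambda}{m} \right)$ is a little less than 1.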
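The two forms of the update can be checked numerically. In this sketch `backprop` is a random stand-in for the gradient of the unregularized cost, and the values of `alpha`, `lam` and `m` are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
backprop = rng.standard_normal((3, 4))   # stand-in for the unregularized gradient
alpha, lam, m = 0.1, 0.7, 50

# Gradient form: add (lam / m) * W to the gradient, then take a step.
dW = backprop + (lam / m) * W
W_gradient_form = W - alpha * dW

# Weight decay form: shrink W by a factor slightly below 1, then take the plain step.
decay = 1 - alpha * lam / m
W_decay_form = W * decay - alpha * backprop

assert np.allclose(W_gradient_form, W_decay_form)
print(decay)   # a little less than 1
```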


How does regularization deal with overfitting?

[Figure: neural network with inputs x1, x2 and output ŷ; many hidden units are crossed out]

$w^{[l]} := w^{[l]} \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha\,(\text{backprop})$

$J(w^{[1]}, b^{[1]}, \ldots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \|w^{[l]}\|_F^2$
• Penalizes the weight matrices for being too large.
• If $\lambda$ is high, $w^{[l]}$ tends to be close to zero.
• This sets the weights of many hidden units so close to zero that it zeroes out a lot of the impact of those hidden units.
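Regularization and activation functions
• For an activation such as tanh, $g(z)$ is roughly linear when $z$ is quite small.
• A higher $\lambda$ causes lower $w^{[l]}$ values and therefore lower $z^{[l]}$ values, because $z^{[l]} = w^{[l]} a^{[l-1]} + b^{[l]}$.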
Implementation
$J(w^{[1]}, b^{[1]}, \ldots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \|w^{[l]}\|_F^2$
• When plotting the cost function, include the regularization term: this is the function that must decrease monotonically.
[Figure: plot of J(…) against #iterations]
Dropout regularization
• Is less used than L2 regularization.
• Go through each layer of the NN and set some probability of eliminating a node in the neural network.
• Toss a coin for each node, with a chance (e.g., 0.5) of keeping it (and 0.5 of dropping it).
• Remove all the outgoing links from the dropped nodes as well.
• Perform backward propagation on the diminished network.
[Figure: network with inputs x1 to x4 and output ŷ, shown before and after crossing out the dropped nodes]
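Dropout Implementation
• A way to implement dropout is inverted dropout.
• The technique takes a layer $l$ and defines a dropout vector $d^{[l]}$ with the same shape as the activations $a^{[l]}$; each entry is True with probability keep_prob.
• keep_prob: probability of keeping a unit. keep_prob=0.8 means a probability of 0.2 of eliminating any hidden unit.
• Multiply $a^{[l]}$ by $d^{[l]}$ element-wise, interpreting True=1, False=0.
• Maintain the expected value of the activation by dividing $a^{[l]}$ by keep_prob.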
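Dropout Implementation
Example if keep_prob=0.8:
• 10 units are shut down, so the activation $a^{[l]}$ is reduced by 20% in expectation: $a^{[l]} - 0.2\,a^{[l]} = 0.8\,a^{[l]}$.
• Dividing by keep_prob leaves the expected value unchanged: $0.8\,a^{[l]} / 0.8 = a^{[l]}$.
• If keep_prob=1, all units are kept and nothing changes.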
Dropout Implementation
• For different training examples, different units are zeroed out.
• Dropout is not applied at test time because it would make the output random.
• Intuition: working with fewer units has a regularizing effect.
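A minimal sketch of inverted dropout for the activations of one layer, assuming NumPy arrays of shape (units, examples). The function name `dropout_forward` and the `training` flag are illustrative choices, not from the slides:

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Apply inverted dropout to activations `a` during training only."""
    if not training or keep_prob >= 1.0:
        return a                              # test time: no dropout, no scaling
    d = rng.random(a.shape) < keep_prob       # True with probability keep_prob
    return a * d / keep_prob                  # zero out units, then rescale

rng = np.random.default_rng(0)
a = np.ones((50, 10000))                      # hypothetical layer with 50 units
a_drop = dropout_forward(a, 0.8, rng)
print(a_drop.mean())                          # close to 1.0, the mean of `a`
```

Each column gets its own random mask, so different examples lose different units. Because of the division by keep_prob, nothing has to be rescaled at test time.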
Dropout intuition
• Because units are removed at random, a unit cannot rely on any single feature, so it has to spread out its weights.
• From the perspective of the single violet unit, we are eliminating features.
Dropout intuition (ii)
[Figure: network with inputs x1, x2, x3 and output ŷ]

Units per layer: $n^{[0]} = 3$, then 7, 7, 3, 2, 1.
Weight matrix shapes shown: (7,3), (7,7), (3,7), (1,2).

keep_prob per layer:
• For layer 2, which has a high probability of overfitting, choose keep_prob=0.5.
• For layer 3, keep_prob=0.7.
• For layer 5, keep_prob=1.

Cons: more hyperparameters.
Strategy: apply dropout to only some layers, using the same keep_prob value for all of them.
Some considerations
• Many successful applications of dropout are in computer vision.
• Regularization prevents overfitting.
• Without enough data, overfitting is possible.
• Downside: the cost function is harder to monitor. One strategy is to set keep_prob=1, monitor J(…), and then turn dropout on.
Other regularization methods
• To deal with overfitting, getting more data can
help
• However, getting more training data can be
expensive.
• Sometimes you cannot get more data.
Technique 1: Data augmentation
• Flip horizontally
• Random zoom and crop
• Rotate
• Distort
[Figure: example images for each transformation]
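Two of these transformations can be sketched with plain NumPy, assuming images are arrays of shape (height, width, channels). Zoom, rotation and distortion need an image library and are left out; the function names are hypothetical:

```python
import numpy as np

def flip_horizontally(image):
    """Mirror an image of shape (height, width, channels) left to right."""
    return image[:, ::-1, :]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a random (crop_h, crop_w) window out of the image."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)     # upper bound is exclusive
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w, :]

rng = np.random.default_rng(0)
image = np.arange(4 * 6 * 3).reshape(4, 6, 3)   # toy 4x6 RGB "image"
flipped = flip_horizontally(image)
cropped = random_crop(image, 3, 4, rng)
print(flipped.shape, cropped.shape)             # (4, 6, 3) (3, 4, 3)
```

Each augmented copy keeps the label of the original image, which is what makes this a cheap substitute for collecting more data.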
Technique 2: Early stopping
[Figure: plot of J(…) against #iterations]
• Plot J(…) against #iterations.
• See at which iteration the NN is doing best and stop at that iteration.
• At the beginning, the parameters w are close to zero; as training iterates, they can grow bigger.
• With early stopping we get a mid-size w (a mid-size $\|w\|_F^2$).
• It is preferable to use L2 regularization and try several values of $\lambda$.
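One way to pick the stopping iteration is sketched below. It assumes that "doing best" is measured by the dev set error recorded after each iteration, and it uses a `patience` parameter; both are assumptions, since the slides do not specify the criterion:

```python
def early_stopping_index(dev_errors, patience=3):
    """Return the index of the best dev error seen before training is stopped.

    Training stops once `patience` iterations pass without a new best value.
    """
    best_idx, best_err = 0, float("inf")
    for i, err in enumerate(dev_errors):
        if err < best_err:
            best_idx, best_err = i, err
        elif i - best_idx >= patience:
            break                              # no improvement for too long
    return best_idx

dev_errors = [0.30, 0.22, 0.18, 0.16, 0.17, 0.19, 0.21, 0.15]
print(early_stopping_index(dev_errors))        # 3
```

In this example training stops at index 6, so the lower error at index 7 is never reached. That illustrates the cost of stopping early: the cost function is no longer optimized independently of the overfitting check.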
Early stopping breaks orthogonalization
• Orthogonalization means thinking about one task at a time.
• Goals:
– Optimize the cost function J (gradient descent, other optimization algorithms (later…))
– Do not overfit the data
• Early stopping breaks orthogonalization because it deals with overfitting and with optimizing the cost function at the same time.
References
• Ng, A. (2022). Deep Learning Specialization. URL: https://www.deeplearning.ai/courses/deep-learning-specialization/
• Ng, A. (2018). Machine learning yearning. URL: http://www.mlyearning.org/
Thank you!
