Lecture 4
2023-2024
CHAPTER 4
PRACTICAL ASPECTS OF DEEP LEARNING
Applied ML is a highly iterative process
(Figure: the loop Idea → Code → Experiment, repeated until the results are good.)
You keep iterating over choices such as the number of layers, the number of hidden units, learning rates, activation functions, …
Train/dev/test sets
Classical ML (100 – 10,000 samples): traditional splits such as 60% / 20% / 20%.
With much larger datasets, the dev and test sets can be tiny fractions:
Training    Dev      Test
98%         1%       1%
99.5%       0.25%    0.25%
Make sure the dev set and test set come from the same distribution.
For example, if the cat training images are scraped from the web while the dev/test
images come from users' cell phones, the distributions will mismatch. It is better to
make sure that the dev and test sets come from the same distribution.
The role of the dev set is to evaluate the promising models you've created and pick between them.
It's OK to have only a dev set without a test set, but many people in this case
call the dev set the "test set". A better terminology is "dev set", since it is used
during development.
Bias and Variance
(Figure: cat classification with labels y = 1 (cat) and y = 0 (non-cat), showing decision boundaries from underfitting (high bias) through just right to overfitting (high variance).)
Keep trying until you have low bias and low variance; then you are done.
Deep learning is very helpful for the classic "bias/variance tradeoff" problem,
because with deep learning you have more options/tools to reduce one without
hurting the other.
Training a bigger neural network almost never hurts (beyond the extra computation).
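As a rough illustration, here is a minimal Python sketch of the diagnosis step; the diagnose helper and the 5% thresholds are illustrative assumptions, not from the course:

# Minimal sketch of bias/variance diagnosis from train/dev error rates.
def diagnose(train_error, dev_error, bayes_error=0.0):
    high_bias = (train_error - bayes_error) > 0.05      # poor fit even on the training set
    high_variance = (dev_error - train_error) > 0.05    # much worse on the dev set
    return high_bias, high_variance

print(diagnose(0.01, 0.11))   # (False, True): low bias, high variance
print(diagnose(0.15, 0.16))   # (True, False): high bias, low variance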
Regularization (Logistic regression)
min over w, b of J(w, b), with w ∈ ℝ^{n_x}, b ∈ ℝ.
L2 regularization (Euclidean norm):
J(w, b) = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i)) + (λ/2m) ‖w‖₂²,  where ‖w‖₂² = Σ_{j=1}^{n_x} w_j² = wᵀw
L1 regularization:
J(w, b) = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i)) + (λ/m) ‖w‖₁,  where ‖w‖₁ = Σ_{j=1}^{n_x} |w_j|
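A minimal numpy sketch of the L2-regularized cost; the function name and arguments are mine, and y_hat is assumed to hold the sigmoid outputs for all m examples:

import numpy as np

# Sketch (assumed shapes): y_hat, y are (1, m); w is (n_x, 1); lambd is λ.
def l2_regularized_cost(y_hat, y, w, lambd, m):
    cross_entropy = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))   # (λ/2m) * w^T w
    return cross_entropy + l2_penalty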
Regularization (Neural network)
dW^[l] = (from backprop) + (λ/m) W^[l]
W^[l] := W^[l] − α dW^[l]
Weight decay:
W^[l] := W^[l] − α [ (from backprop) + (λ/m) W^[l] ]
W^[l] := W^[l] − (αλ/m) W^[l] − α (from backprop)
W^[l] := (1 − αλ/m) W^[l] − α (from backprop)
In practice this penalizes large weights and effectively limits the freedom of your model: the new factor (1 − αλ/m) causes each weight to decay in proportion to its size.
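The same update written as a small numpy-style sketch for one layer; the argument names, such as dW_backprop for the gradient that comes out of backpropagation, are mine:

def update_with_weight_decay(W, dW_backprop, alpha, lambd, m):
    dW = dW_backprop + (lambd / m) * W   # add the regularization gradient
    # Equivalent to W = (1 - alpha*lambd/m) * W - alpha * dW_backprop
    return W - alpha * dW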
How does regularization prevent overfitting?
(Figure: a small network with inputs x1, x2, x3 and output ŷ.)
J(W^[1], b^[1], …, W^[L], b^[L]) = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i)) + (λ/2m) Σ_{l=1}^{L} ‖W^[l]‖_F²
If λ is very big ⇒ W^[l] ≈ 0 ⇒ a much simpler neural network.
Intuition 1:
If λ is too large: many weights will be close to zero, which makes the NN
simpler (you can think of it as behaving closer to logistic regression).
If λ is chosen well: it will just shrink the weights that make the neural
network overfit.
How does regularization prevent overfitting?
Intuition 2 (with tanh activations):
a = tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
tanh(z) is approximately linear for small |z| and nonlinear for large |z|.
If λ is large, the weights W^[l] are small, so z^[l] = W^[l] a^[l−1] + b^[l] stays small and lands in the nearly linear region of tanh. Every layer then computes a roughly linear function, so the whole network behaves almost like a linear model and cannot fit very complicated decision boundaries.
Dropout regularization
(Figure: a network with inputs x1 … x4 and output ŷ; each layer keeps nodes with probability 0.5.)
Go through each layer of the network and set some probability of eliminating
each node of that layer.
For each of these layers we toss a coin per node: a 0.5 chance of keeping the node
and a 0.5 chance of removing it.
Then remove all the incoming and outgoing links of the removed nodes as well.
Dropout regularization
(Figure: the same network after dropping nodes with probability 0.5 in each layer.)
We end up with a much smaller, "thinned" network, and then do back-propagation
training on it.
Implementing dropout ("Inverted dropout")
import numpy as np

keep_prob = 0.8   # 0 <= keep_prob <= 1; here 80% of nodes stay, 20% are dropped
# Illustration for layer l = 3 only; a3 holds the activations of layer 3.
# Entries of d3 are True (keep) where the random number is below keep_prob:
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)   # zero out (drop) the units where d3 is False
# Scale a3 up so the expected value of the activations is unchanged
# (the "inverted" part, which solves the scaling problem):
a3 = a3 / keep_prob
The vector d[l] is used in both forward and back propagation and is the same for
both, but it is resampled for each iteration (pass) and each training example.
At test time we don't use dropout: applying dropout at test time would just
add noise to the predictions.
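In the backward pass the same mask is applied with the same scaling; a minimal sketch continuing the example above, where dA3 is assumed to hold the gradient flowing back into layer 3:

dA3 = dA3 * d3            # kill the gradients of the dropped units (same mask d3)
dA3 = dA3 / keep_prob     # match the inverted-dropout scaling of the forward pass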
Why does drop-out work?
Intuition: Can’t rely on any one feature, so have to spread out weights.
(Figure: keep_prob can differ per layer, e.g. 1.0 for the input layer and small layers, and lower values such as 0.7 or 0.5 for the big layers that are most likely to overfit.)
Data augmentation
(Figure: a digit "4" with random rotations and distortions applied.)
In OCR, you can impose random rotations and distortions on digits/letters.
New data obtained using this technique isn't as good as real independent
data, but it can still be used as a regularization technique.
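A minimal sketch of such an augmentation with scipy, assuming grayscale digit images stored as 2-D numpy arrays; the ±15° range is an arbitrary illustrative choice:

import numpy as np
from scipy.ndimage import rotate

def augment_digit(image, max_angle=15.0):
    # Apply a random small rotation; reshape=False keeps the image size unchanged.
    angle = np.random.uniform(-max_angle, max_angle)
    return rotate(image, angle, reshape=False, mode="nearest")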
Early stopping
(Plot: training set cost and dev set cost J versus # iterations.)
In this technique we plot the training set cost and the dev set cost together at each iteration.
At some iteration the dev set cost will stop decreasing and will start increasing.
We pick the point at which the training set error and dev set error are both best (lowest
training cost with lowest dev cost), and take the parameters from that point as the final parameters.
The advantage of this method is that you don't need to search over an extra hyperparameter,
unlike other regularization approaches (such as λ in L2 regularization).
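A hedged sketch of the bookkeeping, where train_step and dev_cost are hypothetical callables supplied by the caller (one optimization step, and the cost on the dev set); the patience value is an illustrative choice:

import copy

def train_with_early_stopping(params, train_step, dev_cost, max_iters, patience=10):
    best_cost = float("inf")
    best_params = copy.deepcopy(params)
    since_best = 0
    for _ in range(max_iters):
        params = train_step(params)          # one training iteration
        cost = dev_cost(params)              # evaluate on the dev set
        if cost < best_cost:                 # dev cost still decreasing: remember params
            best_cost, best_params, since_best = cost, copy.deepcopy(params), 0
        else:                                # dev cost has stopped improving
            since_best += 1
            if since_best >= patience:
                break
    return best_params                       # parameters from the best dev-cost point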
Normalizing training sets
(Figures: the features x1, x2 before and after normalization, and the cost surface over w and b: elongated contours for unnormalized inputs vs. round contours after normalization.)
Normalize the inputs by subtracting the training-set mean and dividing by the training-set
standard deviation (and use the same μ and σ for the dev/test sets). With normalized inputs
the cost surface is more symmetric, so gradient descent can take larger steps and converges faster.
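As a minimal numpy sketch, assuming X_train and X_test have shape (n_x, m) with one column per example:

import numpy as np

mu = X_train.mean(axis=1, keepdims=True)      # per-feature mean of the training set
sigma = X_train.std(axis=1, keepdims=True)    # per-feature standard deviation
X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma                # reuse the training-set mu and sigma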
Vanishing/exploding gradients
(Figure: a deep network with inputs x1, x2, output ŷ, and weight matrices W^[1], W^[2], W^[3], …, W^[L].)
Vanishing/exploding gradients occur when your derivatives become very small
or very big.
To understand the problem, suppose we have a deep neural network with L layers,
all activation functions linear and every b = 0:
g(z) = z,  b^[l] = 0
ŷ = W^[L] W^[L−1] ⋯ W^[2] W^[1] x
Example (deep neural network with L layers):
If every W^[l] = [[0.5, 0], [0, 0.5]] ⟹ ŷ scales like 0.5^(L−1) ⇒ Vanishing
If every W^[l] = [[1.5, 0], [0, 1.5]] ⟹ ŷ scales like 1.5^(L−1) ⇒ Exploding
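A quick numpy check of both cases (the depth L = 50 is arbitrary):

import numpy as np

L = 50
x = np.ones((2, 1))
for w_diag in (0.5, 1.5):
    W = np.diag([w_diag, w_diag])
    y = x.copy()
    for _ in range(L - 1):
        y = W @ y                 # apply L-1 identical linear layers
    print(w_diag, y[0, 0])        # 0.5 -> ~1.8e-15 (vanishing), 1.5 -> ~4.3e+08 (exploding)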
Vanishing/exploding gradients
A partial solution to vanishing/exploding gradients in a NN is a better, more careful
choice of the random initialization of the weights.
He/Xavier initialization:
o For ReLU (He): W^[l] = np.random.randn(n^[l], n^[l−1]) * sqrt(2 / n^[l−1])
o For tanh (Xavier): W^[l] = np.random.randn(n^[l], n^[l−1]) * sqrt(1 / n^[l−1])
o For tanh (Glorot & Bengio): W^[l] = np.random.randn(n^[l], n^[l−1]) * sqrt(2 / (n^[l−1] + n^[l]))
The 1 or 2 in the numerator can also be a hyperparameter to tune (but not the first one to
start with).
This is one of the best partial solutions to vanishing/exploding gradients (ReLU
+ weight initialization with this variance), and it helps the gradients not to vanish/explode too
quickly.
This initialization is called "He initialization" / "Xavier initialization"; He initialization
was published in a 2015 paper (He et al.).
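A minimal sketch of He initialization for a whole network; the function name and the layer_dims convention (a list like [n_x, n_h1, …, n_y]) are mine:

import numpy as np

def initialize_parameters_he(layer_dims):
    params = {}
    for l in range(1, len(layer_dims)):
        # Gaussian weights scaled to variance 2 / n^[l-1]; biases start at zero.
        params["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                * np.sqrt(2.0 / layer_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params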
Gradient Checking
If your cost does not decrease on each iteration you may have a back-
propagation bug.
Gradient Checking
Take W^[1], b^[1], ⋯, W^[L], b^[L] and reshape them into one big vector θ; likewise
reshape dW^[1], db^[1], ⋯, dW^[L], db^[L] into a big vector dθ.
The question: is dθ really the gradient of J(θ)?
Gradient Checking
Algorithm:
For each component i, compute the two-sided difference approximation
dθ_approx[i] = (J(θ₁, …, θᵢ + ε, …) − J(θ₁, …, θᵢ − ε, …)) / (2ε)
Finally evaluate the relative difference (with ε = 10⁻⁷):
‖dθ_approx − dθ‖₂ / (‖dθ_approx‖₂ + ‖dθ‖₂)
o if it is < 10⁻⁷: great, the backpropagation implementation is very likely correct.
o if it is around 10⁻⁵: can be OK, but inspect whether there are particularly big values in
dθ_approx − dθ.
o if it is ≥ 10⁻³: bad, there is probably a bug in the backpropagation implementation.
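A minimal sketch of the whole check, assuming J is a callable mapping the 1-D parameter vector theta to the scalar cost, and dtheta is the backprop gradient reshaped into the same vector form:

import numpy as np

def gradient_check(J, theta, dtheta, eps=1e-7):
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps                  # J(..., theta_i + eps, ...)
        theta_minus[i] -= eps                 # J(..., theta_i - eps, ...)
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    # Relative difference from the formula above
    return (np.linalg.norm(dtheta_approx - dtheta)
            / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)))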
Gradient checking implementation notes
Don't use the gradient checking algorithm at training time because it's very slow.
If the algorithm fails grad check, look at components to try to identify the bug.
Don't forget to add the regularization term (e.g. (λ/2m) Σ_l ‖W^[l]‖_F² for L2, or
(λ/m) ‖w‖₁ for L1) to J if you are using regularization, so the gradients being checked include it.
Run gradient checking both at random initialization and again after training for a while:
some bugs only show up once the weights w and biases b become larger (further from
0), and can't be seen in the first iterations (when the weights w and biases b are still very small).
Initialization summary
The weights 𝑾 should be initialized randomly to break symmetry.
However, you can initialize the biases 𝑏 to zeros. Symmetry is still broken so long
as 𝑾 is initialized randomly.
Different initializations lead to different results.
Random initialization is used to break symmetry and make sure different hidden
units can learn different things.
Don't initialize to values that are too large.
He initialization works well for networks with ReLU activations.
L2 Regularization summary
Observations:
o λ is a hyperparameter that you can tune using a dev set.
o L2 regularization makes your decision boundary smoother. If λ is too large, it is also possible to
"oversmooth", resulting in a model with high bias.
Dropout summary
What you should remember about dropout:
Dropout is a regularization technique.
Only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
Apply dropout both during forward and backward propagation.
During training time, divide each dropout layer by keep_prob to keep the same expected value for
the activations.
For example:
If keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled
by 0.5 since only the remaining half are contributing to the solution.
Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected
value.
You can check that this works even when keep_prob is other values than 0.5.
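A quick numerical check of that claim (the array shape and the keep_prob values are arbitrary):

import numpy as np

a = np.random.rand(1000, 100)
for keep_prob in (0.5, 0.8):
    mask = np.random.rand(*a.shape) < keep_prob
    dropped = (a * mask) / keep_prob            # inverted dropout
    print(keep_prob, a.mean(), dropped.mean())  # the two means are approximately equal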
References
Andrew Ng. Deep learning. Coursera.
Geoffrey Hinton. Neural Networks for Machine Learning.
Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. MIT Press, 2022.
MIT Deep Learning 6.S191 (http://introtodeeplearning.com/)