
Deep learning

Dr. Aissa Boulmerka


[email protected]

2023-2024

1
CHAPTER 4
PRACTICAL ASPECTS OF DEEP LEARNING

2
Applied ML is a highly iterative process

[Diagram: the loop Idea ⟹ Code ⟹ Experiment ⟹ Idea …, driven by hyperparameter choices such as # layers, # hidden units, learning rates, activation functions, …]

 It's impossible to get all your hyperparameters right for a new application on the first try.
 So the idea is that you go through the loop: Idea ⟹ Code ⟹ Experiment.
 You have to go through the loop many times to figure out your hyperparameters.

3
Train/dev/test sets

Classical ML (100 – 10,000 samples):
  Data:  Training 60%   |  Dev 20%    |  Test 20%

Deep learning (~1,000,000 samples):
  Data:  Training 98%   |  Dev 1%     |  Test 1%
  Data:  Training 99.5% |  Dev 0.25%  |  Test 0.25%

 Your data will be split into three parts:
 Training set (has to be the largest set).
 Hold-out cross-validation set / development or "dev" set.
 Test set.
 You build a model on the training set.
 Then you optimize its hyperparameters on the dev set as much as possible.
 Then, once your model is ready, you evaluate it on the test set.
 The trend now is to give the training set the biggest share; a minimal split sketch follows.
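
As an illustration, here is a minimal NumPy sketch of such a split. The function name, the column-wise (n_x, m) data layout, and the 98/1/1 default fractions are assumptions chosen to match the conventions above:

import numpy as np

def train_dev_test_split(X, Y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle and split column-wise data: X is (n_x, m), Y is (1, m)."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    n_dev, n_test = int(m * dev_frac), int(m * test_frac)
    dev_idx = perm[:n_dev]
    test_idx = perm[n_dev:n_dev + n_test]
    train_idx = perm[n_dev + n_test:]          # the remaining ~98%
    return (X[:, train_idx], Y[:, train_idx],
            X[:, dev_idx], Y[:, dev_idx],
            X[:, test_idx], Y[:, test_idx])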
4
Mismatched train/test distribution

 Make sure the dev set and test set come from the same distribution.
 For example, if the cat training set comes from the web while the dev/test images come from users' cell phones, they will mismatch. It is better to make sure that the dev and test sets come from the same distribution.

Training set: cat pictures from webpages.        Dev/test sets: cat pictures from users of your app.

 The dev set's role is to let you try out some of the good models you've created.
 It's OK to have only a dev set without a test set. A lot of people in this case call the dev set the test set, but a better terminology is "dev set", as it is used during development.

5
Bias and Variance

[Figure: three classifiers — high bias (underfitting), "just right" (appropriate fit), and high variance (overfitting).]

 Bias/variance techniques are easy to learn, but difficult to master.
 So here is the explanation of bias/variance:
 If your model is underfitting, it has high bias.
 If your model is overfitting, it has high variance.
 Your model will be all right if you balance bias and variance.

6
Bias and Variance

Cat classification (y = 1: cat, y = 0: not cat):

                    Case 1          Case 2          Case 3            Case 4
Train set error     1%              15%             15%               0.5%
Dev set error       11%             16%             30%               1%
Diagnosis           High variance   High bias       High bias and     Low bias and
                    (overfitting)   (underfitting)  high variance     low variance
                                                    (over- and        (best)
                                                    underfitting)

Assuming humans get 0% error on this task.
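
As a toy illustration of reading this table, here is a sketch of the diagnosis logic; the 5% threshold and the function name are arbitrary assumptions for illustration:

def diagnose(train_err, dev_err, bayes_err=0.0, gap=0.05):
    """Toy bias/variance diagnosis from error rates given as fractions."""
    bias = "high bias" if train_err - bayes_err > gap else "low bias"
    variance = "high variance" if dev_err - train_err > gap else "low variance"
    return bias, variance

print(diagnose(0.01, 0.11))    # ('low bias', 'high variance')  -> overfitting
print(diagnose(0.15, 0.16))    # ('high bias', 'low variance')  -> underfitting
print(diagnose(0.15, 0.30))    # ('high bias', 'high variance') -> worst case
print(diagnose(0.005, 0.01))   # ('low bias', 'low variance')   -> best case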


7
Basic recipe for machine learning

 High bias? (training data performance) If YES, try:
o A bigger network (more hidden units, more layers).
o Running the training longer.
o A NN architecture search.
o Different (advanced) optimization algorithms.

 If NO: high variance? (dev set performance) If YES, try:
o More data.
o Regularization.
o A NN architecture search (a different model that suits your data).

 If NO: done. Iterate until you have both low bias and low variance.
 Deep learning is very helpful for the "bias/variance tradeoff" problem, because it gives you more options/tools.
 Training a bigger neural network almost never hurts.
8
Regularization (Logistic regression)

$\min_{w,b} J(w,b), \quad w \in \mathbb{R}^{n_x},\ b \in \mathbb{R}$

L2 regularization (with the Euclidean norm):
$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\|w\|_2^2, \qquad \|w\|_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$

L1 regularization:
$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{m}\|w\|_1, \qquad \|w\|_1 = \sum_{j=1}^{n_x} |w_j|$

 Adding regularization to the NN will help it reduce variance (overfitting).
 L1 regularization makes a lot of the w values become exactly zero, which makes the model size smaller.
 L2 regularization is used much more often.
 𝜆 (lambda) is the regularization parameter (a hyperparameter). A cost-function sketch follows.
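
As an illustration, a minimal NumPy sketch of the L2-regularized cost; the function and argument names are assumptions:

import numpy as np

def l2_regularized_cost(A, Y, w, lambd):
    """Cross-entropy cost plus L2 penalty. A: predictions (1, m),
    Y: labels (1, m), w: weight vector (n_x, 1), lambd: λ."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))   # (λ/2m)·‖w‖₂²
    return cross_entropy + l2_penalty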

9
Regularization (Neural network)

$\min_{w,b} J(w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]})$

$J(w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L} \|w^{[l]}\|_F^2$

where $\|\cdot\|_F$ is the Frobenius norm:
$\|w^{[l]}\|_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} \big(w_{ij}^{[l]}\big)^2, \qquad w^{[l]} \text{ has shape } (n^{[l]}, n^{[l-1]})$

Gradient descent with the regularized cost:
$dw^{[l]} = (\text{from backprop}) + \frac{\lambda}{m} w^{[l]}$
$w^{[l]} := w^{[l]} - \alpha\, dw^{[l]} = w^{[l]} - \alpha\left[(\text{from backprop}) + \frac{\lambda}{m} w^{[l]}\right] = \left(1 - \frac{\alpha\lambda}{m}\right) w^{[l]} - \alpha\,(\text{from backprop})$

Weight decay:
 In practice this penalizes large weights and effectively limits the freedom of your model.
 The new factor $\left(1 - \frac{\alpha\lambda}{m}\right)$ causes each weight to decay in proportion to its size.
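
A minimal sketch of this update step for one layer's weight matrix; the function and argument names are assumptions:

def update_with_weight_decay(W, dW_backprop, alpha, lambd, m):
    """One L2-regularized gradient step for a layer's weights W."""
    dW = dW_backprop + (lambd / m) * W    # add the (λ/m)·W regularization term
    return W - alpha * dW                 # equals (1 - αλ/m)·W - α·dW_backprop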
10
How does regularization prevent overfitting?

[Figure: a small network with inputs x1, x2, x3 and output ŷ.]

$J(w^{[l]}, b^{[l]}) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L} \|w^{[l]}\|_F^2$

 If 𝜆 is very big ⇒ w^{[l]} ≈ 0 ⇒ a simpler neural network.

Here are some intuitions:
 Intuition 1:
 If 𝜆 is too large: a lot of the w's will be close to zero, which makes the NN simpler (you can think of it as behaving closer to logistic regression).
 If 𝜆 is just right: it will merely reduce the weights that make the neural network overfit.

11
How does regularization prevent overfitting?

tanh activation:
$a = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

[Figure: the tanh curve — approximately linear near z = 0, nonlinear for large |z|.]

$z^{[l]} = w^{[l]} a^{[l-1]} + b^{[l]}$

If 𝜆 is very big ⇒ w^{[l]} ≈ 0 ⇒ z^{[l]} will be relatively small ⇒ every layer will be ≈ linear.

 Intuition 2 (with the tanh activation function):
 If lambda is too large: the w's will be small (close to zero), so z stays in the linear part of the tanh activation. We go from a nonlinear to a roughly linear activation, which makes the whole NN roughly a linear classifier.
 If lambda is just right: it will just make some of the tanh activations roughly linear, which prevents overfitting.
12
Dropout regularization

[Figure: a network with inputs x1–x4 and output ŷ, shown before and after dropout with keep probability 0.5 per layer.]

 Go through each layer of the network and set a probability of eliminating each node.
 For each of these layers and each node, toss a coin with a 0.5 chance of keeping the node and a 0.5 chance of removing it.
 Then remove all the links into and out of the removed nodes as well.

13
Dropout regularization

[Figure: the thinned network that remains after removing the dropped nodes.]

 We end up with a much smaller network.
 We then do backpropagation training on this thinned network.

14
Implementing dropout (“Inverted dropout”)
import numpy as np

keep_prob = 0.8    # 0 <= keep_prob <= 1; 80% of units stay, 20% are dropped
l = 3              # this code is only for layer 3
# Entries of the mask d3 are True where the generated random number is
# below keep_prob: those units are kept, the others are dropped.
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)    # zero out the dropped units
# Scale a3 up so as not to reduce the expected value of the output
# (ensures that the expected value of a3 remains the same); this
# "inversion" solves the scaling problem.
a3 = a3 / keep_prob

 The vector d[l] is used in both forward and back propagation (the same mask for both), but it is resampled for each iteration (pass) and each training example.
 At test time we don't use dropout. If you implemented dropout at test time, it would only add noise to the predictions.

15
Why does drop-out work?

 Intuition: a unit can't rely on any one feature, because any of its inputs may be dropped, so it has to spread out its weights across all of its inputs.

[Figure: a unit whose input weights (e.g. 0.5, 0.7, 1.0) are spread out rather than concentrated on a single input.]

 Spreading out the weights shrinks their squared norm, which has an effect similar to L2 regularization.
16
Data augmentation

 For computer vision data:
 You can flip all your pictures horizontally; this gives you m more data instances.
 You can also apply random shifts and rotations to an image to get more data.
 In OCR, you can impose random rotations and distortions on the digits/letters.
 New data obtained using this technique isn't as good as real independent data, but it can still be used as a regularization technique. A minimal flip-augmentation sketch follows.
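
As an illustration, a NumPy sketch of horizontal-flip augmentation; the (m, height, width, channels) layout and the function name are assumptions:

import numpy as np

def augment_with_flips(X):
    """Double a dataset of images by appending horizontal flips.
    X is assumed to have shape (m, height, width, channels)."""
    flipped = X[:, :, ::-1, :]                  # reverse the width axis
    return np.concatenate([X, flipped], axis=0)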
17
Early stopping

[Figure: training cost and dev cost plotted against # iterations; the dev cost stops decreasing and starts rising while the training cost keeps falling.]

 In this technique we plot the training set cost and the dev set cost together for each iteration. At some iteration the dev set cost stops decreasing and starts increasing.
 We pick the point at which the training set error and dev set error are jointly best (lowest training cost with lowest dev cost).
 We take the parameters at that point as the best parameters.
 The advantage of this method is that you don't need to search for an extra hyperparameter, unlike other regularization approaches (like lambda in L2 regularization). A skeleton of the procedure follows.
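
A minimal sketch of the stopping loop; the model interface (copy_params/set_params), the train_step and dev_cost callables, and the patience heuristic are all assumptions for illustration:

def train_with_early_stopping(model, train_step, dev_cost,
                              max_iters=10000, patience=100):
    """Stop when the dev cost has not improved for `patience` iterations,
    then restore the best parameters seen so far."""
    best_cost, best_params, wait = float("inf"), model.copy_params(), 0
    for it in range(max_iters):
        train_step(model)                 # one gradient update
        cost = dev_cost(model)
        if cost < best_cost:              # dev cost still improving
            best_cost, best_params, wait = cost, model.copy_params(), 0
        else:
            wait += 1
            if wait >= patience:          # dev cost has kept rising: stop
                break
    model.set_params(best_params)         # keep the best parameters
    return model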
18
Normalizing training sets

[Figure: the data cloud in (x1, x2) — original, after subtracting the mean, and after normalizing the variance.]

Subtract the mean:
$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad x := x - \mu$

Normalize the variance (element-wise, on the mean-subtracted data):
$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \big(x^{(i)}\big)^2, \qquad x := x / \sigma$

Use the same parameters 𝝁 and 𝝈 to normalize the test set.
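
A NumPy sketch of this, fitting μ and σ on the training set only; the function names and the (n_x, m) layout are assumptions:

import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute per-feature mean and standard deviation on the training set."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True) + eps   # eps avoids division by 0
    return mu, sigma

def normalize(X, mu, sigma):
    """Apply the training-set mu/sigma to any split (train, dev, or test)."""
    return (X - mu) / sigma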


19
Why normalize inputs?

$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$

[Figure: contours of J over (w, b). With unnormalized inputs the contours are elongated and gradient descent zig-zags; with normalized inputs the contours are round and gradient descent heads straight for the minimum.]

 If we normalize the inputs, we can use a much larger learning rate α ⟹ this speeds up the training process.
20
Vanishing/exploding gradients

[Figure: a very deep network from inputs x1, x2 to output ŷ, with weight matrices w^{[1]}, w^{[2]}, w^{[3]}, …, w^{[L]}.]

 Vanishing/exploding gradients occur when your derivatives become very small or very big.
 To understand the problem, suppose we have a deep neural network with L layers, where all the activation functions are linear and every b = 0:

$g(z) = z,\ b = 0 \;\Rightarrow\; \hat{y} = w^{[L]} w^{[L-1]} \cdots w^{[2]} w^{[1]} x$

Example (deep neural network with L layers):

$\text{If } w^{[l]} = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix} \Rightarrow \hat{y} \text{ scales like } 0.5^{\,L-1} \Rightarrow \text{vanishing}$

$\text{If } w^{[l]} = \begin{pmatrix} 1.5 & 0 \\ 0 & 1.5 \end{pmatrix} \Rightarrow \hat{y} \text{ scales like } 1.5^{\,L-1} \Rightarrow \text{exploding}$

In both cases gradient descent takes a very long time. The toy computation below makes the scaling concrete.
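
A toy NumPy computation of this scaling, assuming identical 2×2 diagonal weight matrices in every layer as in the example above:

import numpy as np

L = 100                                       # depth of the hypothetical network
for scale in (0.5, 1.5):
    W = np.diag([scale, scale])
    prod = np.linalg.matrix_power(W, L - 1)   # w^[L-1] ... w^[1] collapsed
    print(scale, prod[0, 0])
# 0.5 -> ~1.6e-30  (vanishing)
# 1.5 -> ~2.7e+17  (exploding)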

21
Vanishing/exploding gradients
 A partial solution to vanishing/exploding gradients in a NN is a better, more careful choice of the random initialization of the weights.
 He/Xavier initialization:

o For ReLU: $W^{[l]} = \text{randn}(\cdot) \times \sqrt{2 / n^{[l-1]}}$

o For tanh: $W^{[l]} = \text{randn}(\cdot) \times \sqrt{1 / n^{[l-1]}}$

o For tanh (Bengio et al.): $W^{[l]} = \text{randn}(\cdot) \times \sqrt{2 / (n^{[l]} + n^{[l-1]})}$

 The 1 or 2 in the numerator can also be a hyperparameter to tune (but not the first one to start with).
 This (ReLU + weight initialization with a controlled variance) is one of the best partial solutions to vanishing/exploding gradients; it helps the gradients not to vanish/explode too quickly.
 The ReLU variant is called "He initialization" (published in a 2015 paper); the tanh variant is called "Xavier initialization". A sketch follows.
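
A minimal NumPy sketch of He initialization for all layers; the function name and the layer_dims convention are assumptions:

import numpy as np

def initialize_he(layer_dims, seed=0):
    """He initialization for ReLU networks. layer_dims = [n_x, n_1, ..., n_L]."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = (rng.standard_normal((layer_dims[l], layer_dims[l - 1]))
                                * np.sqrt(2.0 / layer_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))  # biases can start at 0
    return params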
22
Gradient Checking
 If your cost does not decrease on each iteration, you may have a backpropagation bug.

 Gradient checking numerically approximates the gradients and is very helpful for finding errors in your backpropagation implementation, but it is far slower than backpropagation itself (so use it only for debugging).

 Its implementation is very simple.

23
Gradient Checking

 Take W^{[1]}, b^{[1]}, …, W^{[L]}, b^{[L]} and reshape them into one big vector θ.
 The cost function then becomes J(θ).
 Take dW^{[1]}, db^{[1]}, …, dW^{[L]}, db^{[L]} and reshape them into one big vector dθ.
 The question: is dθ the gradient of J(θ)?

24
Gradient Checking
 Algorithm:

eps = 1e-7   # small number
for each component i of θ:
$d\theta_{approx}[i] = \frac{J(\theta_1, \dots, \theta_i + \varepsilon, \dots) - J(\theta_1, \dots, \theta_i - \varepsilon, \dots)}{2\varepsilon}$

 Finally we evaluate this formula (with ε = 10⁻⁷), where ‖·‖₂ is the Euclidean vector norm:

$\frac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}$

o If it is < 10⁻⁷: great, the backpropagation implementation is very likely correct.
o If it is around 10⁻⁵: may be OK, but inspect whether there are any particularly big values in dθ_approx − dθ.
o If it is ≥ 10⁻³: bad, there is probably a bug in the backpropagation implementation.
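
A runnable sketch of the checker; the function names and the toy cost are assumptions:

import numpy as np

def gradient_check(J, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta of J at theta with a
    two-sided numerical estimate, returning the relative difference."""
    approx = np.zeros_like(theta)
    for i in range(len(theta)):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    return (np.linalg.norm(approx - dtheta)
            / (np.linalg.norm(approx) + np.linalg.norm(dtheta)))

# Toy check: J(θ) = θᵀθ has gradient 2θ, so the difference should be tiny.
theta = np.array([1.0, -2.0, 3.0])
print(gradient_check(lambda t: t @ t, theta, 2 * theta))  # prints < 1e-7: correct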

25
Gradient checking implementation notes
 Don't use the gradient checking algorithm at training time, because it's very slow.

 Use gradient checking only for debugging.

 If the algorithm fails grad check, look at the individual components of dθ to try to identify the bug.

 Don't forget to include the regularization term in J if you are using L1 or L2 regularization (e.g. (λ/2m)‖w‖₂² for L2).

 Gradient checking doesn't work with dropout, because J is then not consistent from pass to pass.
o You can first turn off dropout (set keep_prob = 1.0), run gradient checking, and then turn dropout on again.

 Run gradient checking both at random initialization and again after training for a while: there may be a bug that only appears once the weights w and biases b have grown larger (further from 0) and that can't be seen in the first iterations (when w and b are still very small).

26
Initialization summary
 The weights 𝑾 should be initialized randomly to break symmetry.
 However, you can initialize the biases 𝑏 to zeros. Symmetry is still broken as long as 𝑾 is initialized randomly.
 Different initializations lead to different results.
 Random initialization is used to break symmetry and make sure different hidden units can learn different things.
 Don't initialize to values that are too large.
 He initialization works well for networks with ReLU activations.

27
L2 Regularization summary
 Observations:
o λ is a hyperparameter that you can tune using a dev set.
o L2 regularization makes your decision boundary smoother. If λ is too large, it is also possible to "oversmooth", resulting in a model with high bias.

 What is L2-regularization actually doing?
o L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function, you drive all the weights to smaller values: it becomes too costly to have large weights. This leads to a smoother model, in which the output changes more slowly as the input changes.

 What you should remember: implications of L2-regularization on:
o cost computation: a regularization term is added to the cost.
o backpropagation: there are extra terms in the gradients with respect to the weight matrices.
o weights: weights end up smaller ("weight decay"), as they are pushed to smaller values.

28
Dropout summary
What you should remember about dropout:
 Dropout is a regularization technique.
 Only use dropout during training. Don't use dropout (randomly eliminating nodes) at test time.
 Apply dropout during both forward and backward propagation.
 During training, divide each dropout layer's activations by keep_prob to keep the same expected value for the activations.

For example:
 If keep_prob is 0.5, then on average we shut down half the nodes, so the output would be scaled down by 0.5, since only the remaining half are contributing to the solution.
 Dividing by 0.5 is equivalent to multiplying by 2; hence the output now has the same expected value.
 You can check that this works even when keep_prob takes values other than 0.5.

29
References
 Andrew Ng. Deep Learning. Coursera.
 Geoffrey Hinton. Neural Networks for Machine Learning.
 Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. MIT Press, 2022.
 MIT Deep Learning 6.S191 (http://introtodeeplearning.com/)

30
