03 Reg Slides

The lecture on Regularization in Deep Learning covers topics such as model overfitting, methods to limit model capacity, early stopping, ensemble methods, and data augmentation. It discusses techniques like L2 regularization, weight decay, and dropout to mitigate overfitting and improve model generalization. The session also emphasizes the importance of hyperparameter tuning and introduces concepts like Deep Double Descent and Bayesian neural networks.

CS-E4890 Deep Learning

Lecture #3 Regularization

21.1.2025

Jorma Laaksonen ––– Juho Kannala ––– Alexander Ilin


Today’s topics

1. Model overfitting
2. Limiting model capacity
3. Early stopping
4. Ensemble methods
5. Data augmentation
6. Rethinking generalization
7. Deep Double Descent
8. Hyperparameter search
9. Home assignment

1
Overfitting

• Good performance on training data but bad performance on new, test data (poor generalization).

[Figure: Regression with a polynomial function of order M = 12; green: correct model, red: fitted model]

2
Why does overfitting happen?

• Conventional wisdom: The model is too flexible for the amount of training data.
• Wikipedia: An overfitted model is a statistical model that contains more parameters than can
  be justified by the data.
• Rule of thumb (one in ten rule) for logistic regression: To keep the risk of overfitting low,
  the number of examples should be ten times larger than the number of parameters.

[Figure: Regression with a polynomial function of order M = 12]

3
How to detect overfitting?

• Use a validation set (black dots in the figure) to evaluate the performance.

[Figure: fitted curve with validation points shown as black dots]

4
Regularization

• Regularization is a technique used to solve the overfitting problem.
• Neural networks are very powerful (universal approximators); they can model very large and
  complex datasets.

[Figure: Regression with an MLP network; green: correct model, red: fitted model]

5
Regularization methods

1. Limit model capacity:
   • Reduce network size
   • Weight decay
   • Parameter sharing
2. Early stopping
3. Ensemble (committee) methods:
   • Dropout
   • Probabilistic treatment (e.g. Bayesian neural networks)
4. Data augmentation:
   • Noise injection
   • Transformations
   • Adversarial training

6
1. Limiting model capacity
Reduce size of network

• Recall the conventional wisdom: Overfitting is likely to happen when a model contains more
parameters than can be justified by the data.
• Solution: Reduce the number of parameters.
• We can vary the number of neurons/the number of layers to find the architecture that works best
(on the validation data).
• Advantage: Conceptually easy.
• Disadvantage: Other regularization methods often give better accuracy.

8
L2 regularization (Tikhonov, 1943)

• Add a penalty term \Omega(w) to the training cost:

  L_{reg} = L + \Omega(w), \qquad
  \Omega(w) = \frac{\alpha}{2} \|w\|^2 = \frac{\alpha}{2} w^\top w = \frac{\alpha}{2} \sum_i w_i^2

  The penalty term is a function of the parameters w, not of the data.

• L2 regularization pushes the solution towards zero.
• L2 regularization is also called ridge regression.
• L2 regularization is often also called weight decay. For example, torch.optim.Adam(params,
  lr, betas, eps, weight_decay) implements L2 regularization with the hyperparameter \alpha set
  by the parameter weight_decay.

[Figure: contours of the loss and the penalty; w* unregularized solution, w̃ regularized solution]
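
A minimal sketch of two equivalent ways to apply this penalty in PyTorch with plain SGD (the model, data, and value of alpha are illustrative):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    alpha = 1e-4  # regularization strength

    # Option 1: let the optimizer add alpha * w to each gradient.
    opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=alpha)

    # Option 2: add the penalty (alpha / 2) * ||w||^2 to the loss explicitly.
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss = loss + alpha / 2 * sum((w ** 2).sum() for w in model.parameters())
    loss.backward()
    opt.step()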

9
L2 regularization vs weight decay

• Using the term weight decay for L2 regularization may cause confusion.
• Weight decay as described by Hanson and Pratt (1988):

  w_{t+1} = (1 - \lambda) w_t - \eta \nabla L    (1)

• For standard stochastic gradient descent, weight decay is equivalent to L2 regularization:

  L_{reg} = L + \frac{\alpha}{2} \|w\|^2
  w_{t+1} = w_t - \eta \nabla L_{reg} = w_t - \eta \nabla L - \eta \alpha w_t = (1 - \eta \alpha) w_t - \eta \nabla L

• Algorithms like Adam cannot be written in a form similar to (1).
• Loshchilov and Hutter (2017) proposed a regularized version of Adam which tries to follow the
  early-days definition of weight decay. It is available in PyTorch as torch.optim.AdamW.

  w_{t+1} = w_t - \eta \left( \alpha \frac{m_t}{\sqrt{v_t} + \epsilon} + \lambda w_t \right)
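
A short sketch of the difference in PyTorch (the hyperparameter values are illustrative): with Adam, weight_decay is folded into the gradient like an L2 penalty and thus rescaled by the adaptive step sizes, whereas AdamW decays the weights directly, decoupled from the gradient statistics.

    import torch

    params = [torch.nn.Parameter(torch.randn(3))]
    adam = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-2)    # L2-style penalty
    adamw = torch.optim.AdamW(params, lr=1e-3, weight_decay=1e-2)  # decoupled decay

    # The decoupled update of eq. (1), written out by hand for one tensor w
    # with gradient g, learning rate lr, and decay coefficient lam:
    w, g, lr, lam = params[0], torch.randn(3), 1e-3, 1e-2
    with torch.no_grad():
        w.mul_(1 - lam)       # decay the weight directly ...
        w.add_(g, alpha=-lr)  # ... then take the gradient step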

10
Why L2 regularization reduces overfitting

• Intuition: Smaller weights usually produce smoother functions (smaller magnitudes of derivatives).
• Consider a linear regression problem (no bias term for simplicity):

  L(w) = \frac{1}{2N} \sum_{n=1}^N (y_n - w^\top x_n)^2 + \frac{\alpha}{2} w^\top w

• Let us find the minimum by computing the gradient and equating it to zero:

  \nabla_w L = \frac{1}{N} \sum_{n=1}^N (y_n - w^\top x_n)(-x_n) + \alpha w
             = \frac{1}{N} \sum_{n=1}^N x_n x_n^\top w - \frac{1}{N} \sum_{n=1}^N y_n x_n + \alpha w
             = \left( \frac{1}{N} \sum_{n=1}^N x_n x_n^\top + \alpha I \right) w - \frac{1}{N} \sum_{n=1}^N y_n x_n = 0

  which yields

  \tilde{w} = \left( \frac{1}{N} \sum_{n=1}^N x_n x_n^\top + \alpha I \right)^{-1} \left( \frac{1}{N} \sum_{n=1}^N y_n x_n \right)
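
A quick numerical sketch of this closed-form ridge solution (the data are synthetic and the value of alpha is an illustrative choice):

    import torch

    N, d, alpha = 100, 5, 0.1
    X = torch.randn(N, d)                          # rows are the inputs x_n
    y = X @ torch.randn(d) + 0.1 * torch.randn(N)

    A = X.T @ X / N + alpha * torch.eye(d)         # (1/N) sum x_n x_n^T + alpha I
    b = X.T @ y / N                                # (1/N) sum y_n x_n
    w_ridge = torch.linalg.solve(A, b)             # the regularized solution w~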

11
Why weight decay reduces overfitting

• The solution of linear regression with weight decay:

  \tilde{w} = \left( \frac{1}{N} \sum_{n=1}^N x_n x_n^\top + \alpha I \right)^{-1} \left( \frac{1}{N} \sum_{n=1}^N y_n x_n \right)

• L2 regularization causes the learning algorithm to “perceive” the input as having higher variance.
  This makes the weights shrink.
• The regularization effect is larger for the weight values determined by the minor (opposite to
  principal) components of the data.

12
2. Early stopping
Early stopping

• Monitor validation performance during training.


• Stop when it starts to deteriorate (with other
regularization techniques it might never start).
• Keeps solution close to the initialization.
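
A minimal early-stopping loop sketch (train_epoch, validate, model, and the data loaders are hypothetical placeholders for your own training and validation routines):

    best_val, patience, bad_epochs = float("inf"), 10, 0
    best_state = None
    for epoch in range(1000):
        train_epoch(model, train_loader)
        val_loss = validate(model, val_loader)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0   # validation improved: remember the weights
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            bad_epochs += 1
            if bad_epochs >= patience:           # no improvement for `patience` epochs
                break
    model.load_state_dict(best_state)            # restore the best checkpoint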

14
Why early stopping reduces overfitting

• For a linear regression problem, the loss is quadratic and thus can be written in the following form:

  L(w) = L(w^*) + \frac{1}{2} (w - w^*)^\top H (w - w^*)

  where w^* is the global minimum and H is the Hessian, i.e. the second-order derivatives, of the loss.
• The update with gradient descent:

  w_t = w_{t-1} - H(w_{t-1} - w^*)
  w_t - w^* = w_{t-1} - w^* - H(w_{t-1} - w^*) = (I - H)(w_{t-1} - w^*)

• Using the eigendecomposition H = Q \Lambda Q^\top gives

  Q^\top (w_t - w^*) = Q^\top (Q Q^\top - Q \Lambda Q^\top)(w_{t-1} - w^*) = (I - \Lambda) Q^\top (w_{t-1} - w^*)
                     = (I - \Lambda)^2 Q^\top (w_{t-2} - w^*) = \dots = (I - \Lambda)^t Q^\top (w_0 - w^*)

• Assuming w_0 = 0, this yields

  Q^\top w_t = Q^\top w^* - (I - \Lambda)^t Q^\top w^* = [I - (I - \Lambda)^t] Q^\top w^*

15
Optimal solution with weight decay

• Now consider minimizing the same loss with a weight decay penalty:

  L_\alpha(w) = L(w^*) + \frac{1}{2} (w - w^*)^\top H (w - w^*) + \frac{\alpha}{2} w^\top w

  The optimal weights w̃ can be found by equating the gradient to zero:

  \nabla L_\alpha = H(w - w^*) + \alpha w = 0
  \tilde{w} = (H + \alpha I)^{-1} H w^*

  In the rotated coordinate system, the solution is given by

  Q^\top \tilde{w} = Q^\top (Q \Lambda Q^\top + \alpha I)^{-1} Q \Lambda Q^\top w^*
                   = Q^\top \left[ Q (\Lambda + \alpha I) Q^\top \right]^{-1} Q \Lambda Q^\top w^*
                   = (\Lambda + \alpha I)^{-1} \Lambda Q^\top w^*

16
Why early stopping reduces overfitting

• If we use L2 regularization:

  Q^\top \tilde{w} = (\Lambda + \alpha I)^{-1} \Lambda Q^\top w^*

• If we use early stopping after iteration t:

  Q^\top w_t = [I - (I - \Lambda)^t] Q^\top w^*

• If the hyperparameters \alpha and t are chosen such that

  (\Lambda + \alpha I)^{-1} \Lambda = [I - (I - \Lambda)^t]

  then L2 regularization and early stopping can be seen as equivalent.

[Figure: contours of the loss; w* unregularized solution, w̃ regularized solution]

17
Early stopping

• Early stopping stops training before the weights descend into a narrow minimum, in which the
  model may generalize poorly.

[Figure: training error and validation error as functions of w]

18
3. Ensemble methods
Ensemble methods

• Train several models and take the average of their outputs.
• Also known as bagging or model averaging.
• It helps to make the individual models different by
  • varying models or algorithms
  • varying hyperparameters
  • varying data (dropping examples or dimensions)
  • varying the random seed

20
Dropout (Hinton et al., 2012)

• At training time: For each data example x (or mini-batch), randomly delete/inactivate/ignore
  each hidden node with probability p.
• Can be seen as
  • injecting (multiplicative binary) noise
  • training an ensemble of models with shared weights.
• For a network with N neurons, our ensemble contains 2^N models.
21
Dropout: Training and evaluation modes

• At test time neurons are not dropped, which creates a mismatch between training and
  inference modes.
• If a signal x is dropped with probability p, the expected value after the dropout layer is

  E[y] = (1 - p) x

  That means that when we drop neurons, the expected value of y will be (1 - p) times that of
  the “non-dropped” setup.
• To mitigate this, we can do the following trick:
  • Training mode: zero signals with probability p and scale the remaining ones by the factor 1/(1 - p).
  • Evaluation mode: do nothing.

  model = nn.Sequential(
      nn.Linear(1, 100),
      nn.Tanh(),
      nn.Dropout(0.02),
      ...
  )

  # Switch to training mode
  model.train()
  # train the model
  ...
  # Switch to evaluation mode
  model.eval()
  # test the model
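
A minimal sketch of this “inverted dropout” trick written out by hand (PyTorch's nn.Dropout implements the same behavior):

    import torch

    def inverted_dropout(x, p=0.5, training=True):
        if not training or p == 0.0:
            return x                               # evaluation mode: do nothing
        mask = (torch.rand_like(x) > p).float()    # keep each unit with probability 1 - p
        return x * mask / (1 - p)                  # rescale so that E[y] = x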

22
Probabilistic treatment: Bayesian neural networks

• Bayesian neural networks were proposed in the late 1980s, popularized by David MacKay (1992).
• Bayesian methodology: one should combine predictions p(y | x, M_i) given by all possible models:

  p(y \mid x, D) = \sum_i p(y \mid x, M_i) \, p(M_i \mid D)

  weighting them by

  p(M_i \mid D) = \frac{p(M_i) \, p(D \mid M_i)}{p(D)}

23
Probabilistic treatment: Bayesian neural networks

• If we fix the architecture of a neural network, the set of possible models is defined by all possible
  parameter values w:

  p(y \mid x, D) = \int p(y \mid x, w) \, p(w \mid D) \, dw

• We then need to evaluate the posterior distribution p(w | D) of the model parameters given the
  training data. We do that using the Bayes rule:

  p(w \mid D) = \frac{p(D \mid w) \, p(w)}{p(D)}

• We can use different strategies to approximate p(w | D):
  • maximum a posteriori estimation (point estimates of w)
  • variational approximation of p(w | D)
  • drawing samples from p(w | D)

24
Maximum a posteriori with Gaussian prior = L2 regularization

• Maximum a posteriori estimation:

  \tilde{w} = \arg\max_w \log p(w \mid D) = \arg\max_w \left[ \log p(D \mid w) + \log p(w) - \log p(D) \right]

  which is equivalent to minimizing

  L(w) = -\log p(D \mid w) - \log p(w)

• Recall that, for example, the MSE can be viewed as -\log p(D \mid w) for a Gaussian model:

  \frac{1}{N n_y} \sum_{n=1}^N \sum_{j=1}^{n_y} \left( y_j^{(n)} - f_j(x^{(n)}, w) \right)^2
  = -\beta \log \prod_{n=1}^N N(y^{(n)} \mid f(x^{(n)}, w), \sigma^2 I) + \text{const}

• If we assume a Gaussian prior p(w) = N(0, \alpha^{-1} I) we get:

  -\log p(w) = -\log \exp\left( -\frac{\alpha}{2} \|w\|^2 \right) + \text{const} = \frac{\alpha}{2} \|w\|^2 + \text{const}

• Thus, L2 regularization is equivalent to maximum a posteriori estimation in a probabilistic model
  with a Gaussian prior (not really an ensemble method).

25
Variational approximation of the posterior distribution

• Variational approximations: Within a selected family of distributions q(w), for example, Gaussian:

  q(w \mid \theta) = \prod_i N(w_i \mid \mu_i, \sigma_i^2)

  find the one that is closest to the true posterior distribution p(w | D).

[Figure: true distribution vs Gaussian approximation]

• \theta = \{\mu_i, \sigma_i\} are called variational parameters; we need to tune them.
• Blundell et al. (2015) tune \theta = \{\mu_i, \sigma_i\} by minimizing the Kullback-Leibler (KL)
  divergence between q(w | \theta) and the true posterior distribution:

  L(\theta) = KL[q(w \mid \theta) \,\|\, p(w \mid D)]
            = \int q(w \mid \theta) \log \frac{q(w \mid \theta)}{p(w) \, p(D \mid w)} \, dw + \text{const}
            = \underbrace{KL[q(w \mid \theta) \,\|\, p(w)]}_{\text{regularization term}} - \underbrace{E_{q(w \mid \theta)}[\log p(D \mid w)]}_{\text{fit to data}} + \text{const}

26
Bayesian neural networks

• By using an ensemble of models, Bayesian neural networks can reduce overfitting.
• BNNs can produce confidence intervals for their predictions.
• BNNs can miss some of the modes in the posterior distribution over the weights, thus the
  uncertainties can be easily underestimated.

image from (Blundell et al., 2015)

27
Sampling approach: Stein variational gradient descent

• Liu and Wang (2016) find samples w_k from the posterior approximation q(w) that minimizes the
  KL divergence with the true posterior.
  • Each sample defines one neural network.
  • We create an ensemble of neural networks.
  • We do not postulate the form of q(w) explicitly.

[Figure: true distribution, approximation, and samples from the approximation]

• We still can compute the gradient of the KL divergence, which yields the update rule:

  w_k \leftarrow w_k + \eta \frac{1}{K} \sum_{k'=1}^K \Big[ \underbrace{k(w_{k'}, w_k) \nabla_{w_{k'}} \left( \log p(w_{k'}) + \log p(D \mid w_{k'}) \right)}_{\text{smoothed gradient}} + \underbrace{\nabla_{w_{k'}} k(w_{k'}, w_k)}_{\text{repulsive force}} \Big]

  where k(w, w') is some kernel, for example, k(w, w') = \exp(-\frac{1}{h} \|w - w'\|^2).

• The repulsive force term prevents all w_k from collapsing into the same values.

28
4. Data augmentation
Injecting noise (Sietsma and Dow, 1991)

• Inject random noise during training (a different noise instance in each epoch).
• Can be applied to input data, to hidden activations, or to weights.
• Can be seen as data augmentation.
• Simple and effective.
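
A one-line sketch of input-noise injection inside a training step (the batch and the noise scale 0.1 are illustrative):

    import torch

    x = torch.randn(32, 10)                       # a mini-batch of inputs (placeholder)
    x_noisy = x + 0.1 * torch.randn_like(x)       # fresh Gaussian noise at every step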

30
Image transformations

• In some domains it is easy to generate more labeled data by transformations.
• Transformations of images: random crop, translation, scaling, flip, rotation.
• The classification network learns to be invariant to such transformations.

Image from (Dosovitskiy et al., 2014)
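
A sketch of such transformations with torchvision (the specific transforms and parameter values are illustrative choices):

    import torchvision.transforms as T

    train_transform = T.Compose([
        T.RandomResizedCrop(224),       # random crop + scaling
        T.RandomHorizontalFlip(),       # flip
        T.RandomRotation(degrees=15),   # rotation
        T.ToTensor(),
    ])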

31
mixup (Zhang et al., 2017)

• mixup constructs virtual training examples (x̃, ỹ) in the following way:

  \tilde{x} = \lambda x_i + (1 - \lambda) x_j
  \tilde{y} = \lambda y_i + (1 - \lambda) y_j

  where x_i, x_j are raw input vectors and y_i, y_j are one-hot label encodings. (x_i, y_i) and (x_j, y_j)
  are two examples drawn at random from the training set, and \lambda \in [0, 1].
• mixup extends the training distribution by incorporating the prior knowledge that linear
interpolations of feature vectors should lead to linear interpolations of the associated targets.
• Note that for images, we take as training examples mixtures of two different images. Even though
the mixtures do not look like real images, this data augmentation method works and improves
generalization.
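
A minimal mixup sketch for a batch with one-hot labels (in the paper, \lambda is drawn from a Beta(\alpha, \alpha) distribution; \alpha = 0.2 here is an illustrative choice):

    import torch

    def mixup(x, y, alpha=0.2):
        lam = torch.distributions.Beta(alpha, alpha).sample()
        perm = torch.randperm(x.size(0))        # random pairing within the batch
        x_mix = lam * x + (1 - lam) * x[perm]   # interpolate inputs
        y_mix = lam * y + (1 - lam) * y[perm]   # interpolate one-hot targets
        return x_mix, y_mix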

32
Adversarial examples

• Training of neural networks:

  \frac{1}{N} \sum_{n=1}^N L(x^{(n)}, y^{(n)}, w) \to \min_w

  where L(x^{(n)}, y^{(n)}, w) is, for example, a cross-entropy loss.
• Szegedy et al. (2014) discovered that it is very easy to fool a trained neural network. One can
  modify a given input x such that the output of the network changes:

  L(x + r, y, w) \to \max_r

  keeping the perturbation r small, for example, \|r\| \le \varepsilon.
• The modified input x + r is called an adversarial example and r the adversarial perturbation.

33
FGSM attack (Goodfellow et al., 2014)

• Finding adversarial examples is surprisingly easy. For example, with the fast gradient sign method
  (FGSM):

  x + r = x + \varepsilon \, \text{sign}(\nabla_x L(w, x, y))

[Figure: x classified as “panda”; sign(\nabla_x L(w, x, y)); x + r classified as “gibbon”]
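
A minimal FGSM sketch (model and loss_fn stand in for a trained network and its loss function):

    import torch

    def fgsm(model, loss_fn, x, y, eps=0.01):
        x = x.clone().requires_grad_(True)
        loss = loss_fn(model(x), y)
        loss.backward()
        return (x + eps * x.grad.sign()).detach()  # x + eps * sign(grad_x L)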

34
Adversarial training

• Adversarial examples are difficult for neural networks; including them in the training set helps
  reduce the test error. This is called adversarial training.
• Adversarial training is data augmentation with adversarial examples.
• The existence of adversarial examples motivated a new subfield of deep learning in which
techniques are developed to defend neural networks against adversarial attacks.

35
Madry’s defense

• Madry’s defense model (Madry et al., 2017) is one of the strongest defense models.
• Recall standard optimization:

  \min_w E_{(x,y) \sim D} [L(w, x, y)]

• Madry’s defense model:

  \min_w E_{(x,y) \sim D} \left[ \max_{\delta \in S} L(w, x + \delta, y) \right]

• Instead of feeding clean training samples x, we feed the worst adversarial examples found with
  another optimization procedure.
• Saddle point problem: a composition of an inner maximization problem and an outer minimization
  problem.
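
A sketch of the inner maximization with projected gradient descent (PGD), assuming an L∞ ball S = {δ : ‖δ‖∞ ≤ eps}; the step size and iteration count are illustrative:

    import torch

    def pgd_attack(model, loss_fn, x, y, eps=0.03, step=0.007, iters=10):
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(iters):
            loss = loss_fn(model(x + delta), y)
            loss.backward()
            with torch.no_grad():
                delta += step * delta.grad.sign()   # ascend the inner loss
                delta.clamp_(-eps, eps)             # project back onto S
                delta.grad.zero_()
        return (x + delta).detach()                 # worst-case example for training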

36
Adversarial training helps develop more meaningful representations

• Gradients w.r.t. the inputs look much more meaningful for an adversarially trained network.

images from (Madry et al., 2018)

37
Adversarial examples look more meaningful

images from (Madry et al., 2018)

38
Rethinking generalization
(Zhang et al., 2016)
Conventional wisdom

• The model is too flexible for the amount of training data.
• Wikipedia: An overfitted model is a statistical model that contains more parameters than can
  be justified by the data.

[Figure: Regression with a polynomial function of order M = 12]

40
Rethinking generalization (Zhang et al., 2016)

• Deep neural networks easily fit random labels:
  • The effective capacity of neural networks is sufficient for memorizing the entire data set.
  • Optimization on random labels remains easy.
  • The same networks exhibit a remarkably small difference between training and test performance.

[Figure: Fitting random labels and random pixels on CIFAR10]

41
The role of explicit regularization

• Explicit regularization may improve generalization performance, but is neither necessary nor by
itself sufficient for controlling generalization error.

The training and test accuracy of various models on CIFAR10.

42
Implicit regularization

• Batch normalization is usually found to improve the generalization performance, even though it
was not explicitly designed for regularization.

The training and test accuracy of various models on CIFAR10.

• Stochastic gradient descent (SGD) may act as an implicit regularizer:


• For linear models, SGD always converges to a solution with small norm.

43
Deep Double Descent
(Nakkiran et al., 2019)
Deep learning practice

• Conventional wisdom: larger models overfit more, therefore one should use simpler models.
• Deep learning practitioners:
  • Larger models usually generalize better.
  • Early stopping may improve test performance in some settings. However, training large neural
    networks to zero training error usually only improves performance.
  • More data is always better.

[Figure: Train and test error as a function of model size, for ResNet18s of varying width on
CIFAR-10 with 15% label noise]

45
Deep Double Descent (Nakkiran et al., 2019)

• Two regimes of the training procedure:
  1. Underparameterized regime: the test error as a function of model complexity follows the
     classical behavior.
  2. Overparameterized regime: increasing complexity only decreases the test error.
• The transition between the two regimes happens after the model complexity is sufficiently large
  to achieve nearly zero training error.

[Figure: Train and test error as a function of model size, for ResNet18s of varying width on
CIFAR-10 with 15% label noise]

• Hypothesis for any natural data distribution: After the model complexity grows enough to
  interpolate the entire training dataset, the test error decreases with the model complexity.

46
Effect of data augmentation

• Modifications which increase the interpolation threshold (e.g., data augmentation, increasing the
number of training samples) shift the peak in test error towards larger models.

47
In some settings more data can even hurt

• Test loss as a function of Transformer model size (embedding dimension) on language translation:
  • The curve for 18k samples is generally lower than the one for 4k samples, but it is also shifted
    to the right, since fitting 18k samples requires a larger model.
  • Thus, for some model sizes, the performance with 18k samples is worse than with 4k samples.

48
Epoch-wise double descent

• Increasing the training time increases the effective model complexity.
• Sufficiently large models can undergo a “double descent” behavior where the test error first
  decreases, then increases near the interpolation threshold, and then decreases again.

[Figure: Training ResNet18s on CIFAR10 with 20% label noise]

• For “medium-sized” models (for which training to completion will only barely reach ≈ 0 error),
  the test error as a function of training time will follow a classical U-like curve where it is better
  to stop early.
• Models that are too small to reach the interpolation threshold will remain in the
  “underparameterized” regime where increasing training time monotonically decreases the test error.
49
Effect of early stopping

• Early stopping helps for critically parameterized models.

Training ResNet18s of varying width on CIFAR-10 with 15% label noise.

• Double descent does not typically occur with optimal early stopping: early stopping prevents
models from reaching 0 train error, which is necessary for the double descent effect to occur.

50
Hyperparameter search
Selecting hyperparameters

• Hyperparameter search: use the performance on the validation set to select the optimal values of
  the hyperparameters.
• Hyperparameters that you may want to tune:
  • learning rate schedule
  • transformations used for data augmentation
  • weight decay coefficient
  • dropout rate
  • mini-batch size
  • number of layers
  • number of neurons
  • convolution kernel width
  • nonlinearity
• What works best in practice (this is not the case in the home assignment :-)):
  • A large model combined with strong regularization.
  • The training error is very low.

52
Hyperparameter search: Grid search

• Select a fixed set of possible values for each hyperparameter.
• Compute the validation loss for all hyperparameter combinations.
• Problems:
  • Many evaluations will be unnecessary if some hyperparameters are non-influential.
  • Computational cost increases exponentially with the number of hyperparameters.

image from (Bergstra and Bengio, 2012)

53
Hyperparameter search: Random search

• Random combinations of the hyperparameters are formed and evaluated.
• Advantages:
  • Random search does not waste evaluations on non-influential hyperparameters.
  • More convenient and faster than grid search.
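
A minimal random-search sketch (train_and_validate is a hypothetical placeholder that trains a model with the given configuration and returns its validation loss; the ranges are illustrative):

    import random

    best_loss, best_config = float("inf"), None
    for _ in range(50):
        config = {
            "lr": 10 ** random.uniform(-5, -1),            # log-uniform learning rate
            "weight_decay": 10 ** random.uniform(-6, -2),
            "dropout": random.uniform(0.0, 0.5),
        }
        val_loss = train_and_validate(config)
        if val_loss < best_loss:
            best_loss, best_config = val_loss, config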

image from (Bergstra and Bengio, 2012)

54
Recommended reading

• Chapter 7 of Deep learning book


• References in the slides

55
Recap
Summary of Lecture #3

1. Model overfitting may result if a model contains too many parameters compared to training data.
2. Model overfitting leads to good performance on training data but bad performance on test data.
3. Regularization techniques aim to prevent model overfitting.
4. Model capacity can be limited by reducing the model size, by weight decay, and by parameter sharing.
5. Early stopping is based on maximizing performance on a validation set.
6. Dropout is a very efficient ensemble method.
7. Data augmentation methods include noise injection, transformations and adversarial training.
8. Explicit regularization does not necessarily improve generalization.
9. In the deep double descent phenomenon, a DL model’s performance first gets worse and then better as model complexity grows.
10. Random search should be preferred over grid search for hyperparameter optimization.

57
Home assignment
Assignment 03 reg

• First notebook: Experiment with different regularization methods on a toy regression problem.
• Second notebook: Implement and tune a recommender system.
• In order to achieve good performance on the test set, you will have to use regularization
techniques and tune the (hyper)parameters.

59
How to represent users or items

• A simple representation is a one-hot vector. For example, user i can be represented with a vector
  z_i such that (z_i)_i = 1 and (z_i)_j = 0 for j ≠ i.
• A better representation:
  • represent each user i as a vector w_i
  • treat all vectors w_i as model parameters and tune them in the training procedure
  • this is equivalent to W z_i where W is a matrix of “embeddings” (with the vectors w_i in its columns).
• This is implemented in torch.nn.Embedding(num_embeddings, embedding_dim)
  • num_embeddings is the size of the dictionary
  • embedding_dim is the size of each embedding vector w_i
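
A small sketch of this layer (the number of users and the embedding size are illustrative):

    import torch
    import torch.nn as nn

    num_users, dim = 1000, 32
    user_embedding = nn.Embedding(num_users, dim)  # one learnable vector per user
    w_i = user_embedding(torch.tensor([3]))        # embedding of user 3, shape (1, 32)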

60
