Lecture #3 Regularization
21.1.2025
1. Model overfitting
2. Limiting model capacity
3. Early stopping
4. Ensemble methods
5. Data augmentation
6. Rethinking generalization
7. Deep Double Descent
8. Hyperparameter search
9. Home assignment
Overfitting
• Good performance on training data but bad performance on new, test data (poor generalization).
[Figure: regression with a polynomial function of order M = 12; green: correct model, red: fitted model]
Why does overfitting happen?
How to detect overfitting?
Regularization
• Regularization is a technique used to solve the overfitting problem.
• Neural networks are very powerful (universal approximators): they can model very large and complex datasets.
[Figure: regression with an MLP network; green: correct model, red: fitted model]
Regularization methods
1. Limit model capacity:
   • Reduce the size of the network
   • Weight decay
   • Parameter sharing
2. Early stopping
3. Ensemble methods:
   • Dropout
4. Data augmentation
1. Limiting model capacity
Reduce size of network
• Recall the conventional wisdom: Overfitting is likely to happen when a model contains more
parameters than can be justified by the data.
• Solution: Reduce the number of parameters.
• We can vary the number of neurons/the number of layers to find the architecture that works best
(on the validation data).
• Advantage: Conceptually easy.
• Disadvantage: Other regularization methods often give better accuracy.
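For example, a minimal sketch of such a sweep in PyTorch (the train and evaluate helpers and the data tensors are assumptions, not from the slides):

  import torch.nn as nn

  def make_mlp(hidden_size):
      # One-hidden-layer MLP for 1-d regression; capacity grows with hidden_size
      return nn.Sequential(
          nn.Linear(1, hidden_size),
          nn.Tanh(),
          nn.Linear(hidden_size, 1),
      )

  best = None
  for hidden_size in [2, 5, 10, 50, 100]:
      model = make_mlp(hidden_size)
      train(model, x_train, y_train)            # hypothetical training helper
      val_loss = evaluate(model, x_val, y_val)  # hypothetical validation helper
      if best is None or val_loss < best[0]:
          best = (val_loss, hidden_size)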
L2 regularization (Tikhonov, 1943)
Lreg = L + Ω(w),
Ω(w) = (α/2) ‖w‖² = (α/2) wᵀw = (α/2) Σᵢ wᵢ²
L2 regularization vs weight decay
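A minimal sketch of the two implementations in PyTorch (model, criterion, alpha and the data batch are assumptions). For plain SGD the two coincide; for adaptive optimizers such as Adam they differ, which is why decoupled weight decay (AdamW) exists:

  import torch

  # Option 1: explicit L2 penalty Ω(w) added to the loss
  l2 = sum((w ** 2).sum() for w in model.parameters())
  loss = criterion(model(x), y) + 0.5 * alpha * l2

  # Option 2: weight decay handled by the optimizer update
  optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=alpha)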
Why L2 regularization reduces overfitting
• Intuition: Smaller weights usually produce smoother functions (smaller magnitudes of derivatives).
• Consider a linear regression problem (no bias term for simplicity):
L(w) = (1/2N) Σₙ (yₙ − wᵀxₙ)² + (α/2) wᵀw
• Let us find the minimum by computing the gradient and equating it to zero:
∇w L = (1/N) Σₙ (yₙ − wᵀxₙ)(−xₙ) + αw = ((1/N) Σₙ xₙxₙᵀ) w − (1/N) Σₙ yₙxₙ + αw
     = ((1/N) Σₙ xₙxₙᵀ + αI) w − (1/N) Σₙ yₙxₙ = 0
which yields
w̃ = ((1/N) Σₙ xₙxₙᵀ + αI)⁻¹ ((1/N) Σₙ yₙxₙ)
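A quick numerical sketch of this closed-form solution (assuming x is an N×D data matrix and y an N-vector):

  import torch

  alpha = 0.1                              # regularization strength (an assumption)
  N, D = x.shape
  A = x.T @ x / N + alpha * torch.eye(D)   # (1/N) Σₙ xₙxₙᵀ + αI
  b = x.T @ y / N                          # (1/N) Σₙ yₙxₙ
  w_tilde = torch.linalg.solve(A, b)       # the regularized solution w̃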
Why weight decay reduces overfitting
• L2 regularization causes the learning algorithm to “perceive” the input as having higher variance.
This makes the weights shrink.
• The regularization effect is larger for the weight values determined by the minor components (as opposed to the principal components) of the data.
2. Early stopping
Early stopping
Why early stopping reduces overfitting
• For a linear regression problem, the loss is quadratic and thus can be written in the following form:
L(w) = L(w∗) + (1/2)(w − w∗)ᵀ H (w − w∗)
where w∗ is the global minimum and H is the Hessian (the matrix of second-order derivatives) of the loss.
• The update with gradient descent (taking the learning rate to be 1 for simplicity):
wₜ = wₜ₋₁ − H(wₜ₋₁ − w∗)
wₜ − w∗ = wₜ₋₁ − w∗ − H(wₜ₋₁ − w∗) = (I − H)(wₜ₋₁ − w∗)
• Using the eigendecomposition H = QΛQᵀ gives
Qᵀ(wₜ − w∗) = Qᵀ(QQᵀ − QΛQᵀ)(wₜ₋₁ − w∗) = (I − Λ) Qᵀ(wₜ₋₁ − w∗) = (I − Λ)² Qᵀ(wₜ₋₂ − w∗) = … = (I − Λ)ᵗ Qᵀ(w₀ − w∗)
Optimal solution with weight decay
• Now consider minimizing the same loss with a weight decay penalty:
Lα(w) = L(w∗) + (1/2)(w − w∗)ᵀ H (w − w∗) + (α/2) wᵀw
The optimal weights w̃ can be found by equating the gradient to zero:
∇Lα = H(w − w∗) + αw = 0
w̃ = (H + αI)⁻¹ H w∗
Qᵀw̃ = (Λ + αI)⁻¹ Λ Qᵀw∗
Why early stopping reduces overfitting
• If we use L2 regularization: Qᵀw̃ = (Λ + αI)⁻¹ Λ Qᵀw∗, so component i of the solution is scaled by λᵢ/(λᵢ + α).
• With early stopping after t steps (starting from w₀ = 0): Qᵀwₜ = (I − (I − Λ)ᵗ) Qᵀw∗, so component i is scaled by 1 − (1 − λᵢ)ᵗ.
• In both cases the directions with large eigenvalues λᵢ are kept almost intact, while the directions with small eigenvalues are shrunk toward zero; the number of training steps t thus plays a role similar to 1/α.
Early stopping
[Figure: training error and validation error as functions of w]
3. Ensemble methods
Ensemble methods
Dropout (Hinton et al., 2012)
Dropout: Training and evaluation modes
• At test time neurons are not dropped, which will create a mismatch between training and inference modes.
• If a signal x is dropped with probability p, the expected value after the dropout layer is
  E[y] = (1 − p)x
  That means that when we drop neurons, the expected value of y will be (1 − p) times that of the “non-dropped” setup.
• To mitigate this, we can do the following trick:
  • Training mode: zero signals with probability p and scale the remaining ones by factor 1/(1 − p).
  • Evaluation mode: do nothing.

  model = nn.Sequential(
      nn.Linear(1, 100),
      nn.Tanh(),
      nn.Dropout(0.02),
      ...
  )

  # Switch to training mode
  model.train()
  # training the model
  ...

  # Switch to evaluation mode
  model.eval()
  # test the model
Probabilistic treatment: Bayesian neural networks
• Bayesian neural networks were proposed in the late 1980s, popularized by David MacKay (1992).
• Bayesian methodology: one should combine predictions p(y | x, Mᵢ) given by all possible models:
p(y | x, D) = Σᵢ p(y | x, Mᵢ) p(Mᵢ | D)
weighting them by
p(Mᵢ | D) = p(Mᵢ) p(D | Mᵢ) / p(D).
Probabilistic treatment: Bayesian neural networks
• If we fix the architecture of a neural network, the set of possible models is defined by all possible parameter values w:
p(y | x, D) = ∫ p(y | x, w) p(w | D) dw
• We then need to evaluate the posterior distribution p(w | D) of the model parameters given the training data. We do that using Bayes’ rule:
p(w | D) = p(D | w) p(w) / p(D)
• We can use different strategies to approximate p(w | D):
  • maximum a posteriori estimation (point estimates of w)
  • variational approximation of p(w | D)
  • draw samples from p(w | D)
Maximum a posteriori with Gaussian prior = L2 regularization
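A minimal sketch of the argument, assuming a Gaussian prior p(w) = N(0, σ²I):

  w_MAP = arg max_w [log p(D | w) + log p(w)]
        = arg max_w [log p(D | w) − (1/2σ²) wᵀw + const]
        = arg min_w [−log p(D | w) + (α/2) wᵀw],  with α = 1/σ²

so the Gaussian prior contributes exactly the L2 penalty Ω(w).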
Variational approximation of the posterior distribution
q(w | θ) = Πᵢ N(wᵢ | μᵢ, σᵢ²)
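A minimal sketch of sampling weights from this factorized Gaussian with the reparameterization trick (mu and log_sigma, the variational parameters θ, are assumed names):

  import torch

  n_weights = 100                                         # number of network weights (assumed)
  mu = torch.zeros(n_weights, requires_grad=True)         # variational means μᵢ
  log_sigma = torch.zeros(n_weights, requires_grad=True)  # log standard deviations log σᵢ

  eps = torch.randn(n_weights)          # ε ~ N(0, I)
  w = mu + torch.exp(log_sigma) * eps   # a sample w ~ q(w | θ), differentiable w.r.t. θ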
Bayesian neural networks
Sampling approach: Stein variational gradient descent
• Liu and Wang (2016) find samples wₖ from the posterior approximation q(w) that minimizes the KL divergence with the true posterior.
• Each sample defines one neural network.
• We create an ensemble of neural networks.
• We do not postulate the form of q(w) explicitly.
• We still can compute the gradient of the KL divergence, which yields the update rule:
wₖ ← wₖ + η (1/K) Σₖ′ [ k(wₖ′, wₖ) ∇wₖ′ (log p(wₖ′) + log p(D | wₖ′)) + ∇wₖ′ k(wₖ′, wₖ) ]
where the first term is a smoothed gradient and the second term is a repulsive force that keeps the samples apart.
[Figure: true distribution, its approximation, and samples from the approximation]
4. Data augmentation
Injecting noise (Sietsma and Dow, 1991)
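A minimal sketch of injecting Gaussian noise into the inputs during training (the noise scale 0.1 and the surrounding training loop — model, criterion, optimizer, train_loader — are assumptions):

  import torch

  for x, y in train_loader:
      x_noisy = x + 0.1 * torch.randn_like(x)  # add zero-mean Gaussian noise to inputs
      loss = criterion(model(x_noisy), y)
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()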
Image transformations
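A typical torchvision pipeline (a sketch; these particular transforms and parameters are assumptions, not from the slides):

  from torchvision import transforms

  train_transform = transforms.Compose([
      transforms.RandomHorizontalFlip(),                     # random left-right flip
      transforms.RandomCrop(32, padding=4),                  # random shifts via padded crops
      transforms.ColorJitter(brightness=0.2, contrast=0.2),  # random color changes
      transforms.ToTensor(),
  ])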
mixup (Zhang et al., 2017)
x̃ = λxi + (1 − λ)xj
ỹ = λyi + (1 − λ)yj
where xi , xj are raw input vectors and yi , yj are one-hot label encodings. (xi , yi ) and (xj , yj ) are
two examples drawn at random from the training set, λ ∈ [0, 1].
• mixup extends the training distribution by incorporating the prior knowledge that linear
interpolations of feature vectors should lead to linear interpolations of the associated targets.
• Note that for images, we take as training examples mixtures of two different images. Even though
the mixtures do not look like real images, this data augmentation method works and improves
generalization.
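A minimal sketch of one mixup training step (assuming one-hot targets y and a batch x; drawing λ from Beta(0.2, 0.2) follows the paper's λ ~ Beta(α, α) with a small α):

  import torch

  lam = torch.distributions.Beta(0.2, 0.2).sample()  # mixing coefficient λ
  perm = torch.randperm(x.size(0))                   # pair each example with a random one
  x_mix = lam * x + (1 - lam) * x[perm]
  y_mix = lam * y + (1 - lam) * y[perm]
  loss = criterion(model(x_mix), y_mix)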
Adversarial examples
L(x + r, y, w) → max with respect to r
FGSM attack (Goodfellow et al., 2014)
• Finding adversarial examples is surprisingly easy. For example, with the fast gradient sign method
(FGSM):
x + r = x + ε sign(∇x L(w, x, y))
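A minimal FGSM sketch in PyTorch (eps, model, criterion and the batch x, y are assumptions):

  eps = 0.03                           # perturbation size ε (an assumption)
  x.requires_grad_(True)
  loss = criterion(model(x), y)
  loss.backward()
  x_adv = x + eps * x.grad.sign()      # x + ε sign(∇ₓ L(w, x, y))
  x_adv = x_adv.detach().clamp(0, 1)   # keep pixels in the valid range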
Adversarial training
• Adversarial examples are difficult for neural networks; including them in the training set helps reduce the test error. This is called adversarial training.
• Adversarial training is data augmentation with adversarial examples.
• The existence of adversarial examples motivated a new subfield of deep learning in which
techniques are developed to defend neural networks against adversarial attacks.
Madry’s defense
• Madry’s defense model (Madry et al., 2017) is one of the strongest defense models.
• Recall standard optimization:
min_w E(x,y)∼D [L(w, x, y)]
• Instead of feeding clean training samples x, we feed the worst adversarial examples found with
another optimization procedure.
• Saddle point problem: composition of an inner maximization problem and an outer minimization
problem.
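Written out, with S the set of allowed perturbations (e.g. ‖r‖∞ ≤ ε), the saddle-point objective is:

  min_w E(x,y)∼D [ max over r ∈ S of L(w, x + r, y) ]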
Adversarial training helps develop more meaningful representations
• Gradients w.r.t. inputs look much more meaningful for an adversarially trained network.
Adversarial examples look more meaningful
Rethinking generalization
(Zhang et al., 2016)
Conventional wisdom
• The model is too flexible for the amount of training data.
• Wikipedia: “An overfitted model is a statistical model that contains more parameters than can be justified by the data.”
Rethinking generalization (Zhang et al., 2016)
The role of explicit regularization
• Explicit regularization may improve generalization performance, but is neither necessary nor by
itself sufficient for controlling generalization error.
Implicit regularization
• Batch normalization is usually found to improve the generalization performance, even though it
was not explicitly designed for regularization.
Deep Double Descent
(Nakkiran et al., 2019)
Deep learning practice
• Conventional wisdom: larger models overfit more, therefore one should use simpler models.
Deep Double Descent (Nakkiran et al., 2019)
• Hypothesis for any natural data distribution: After the model complexity grows enough to
interpolate the entire training dataset, the test error decreases with the model complexity.
Effect of data augmentation
• Modifications which increase the interpolation threshold (e.g., data augmentation, increasing the
number of training samples) shift the peak in test error towards larger models.
In some settings more data can even hurt
• Test loss as a function of Transformer model size (embedding dimension) on language translation:
• The curve for 18k samples is generally lower than the one for 4k samples, but also shifted to the
right, since fitting 18k samples requires a larger model.
• Thus, for some models, the performance for 18k samples is worse than for 4k samples.
Epoch-wise double descent
• For “medium sized” models (for which training to completion will only barely reach ≈ 0 error) the
test error as a function of training time will follow a classical U-like curve where it is better to
stop early.
• Models that are too small to reach the interpolation threshold will remain in the “under-parameterized” regime where increasing training time monotonically decreases test error.
Effect of early stopping
• Double descent does not typically occur with optimal early stopping: early stopping prevents
models from reaching 0 train error, which is necessary for the double descent effect to occur.
Hyperparameter search
Selecting hyperparameters
• Hyperparameter search: use the performance on the validation set to select the optimal values of
the hyperparameters.
• Hyperparameters that you may want to tune:
  • learning rate schedule
  • transformations used for data augmentation
  • weight decay coefficient
  • dropout rate
  • mini-batch size
  • number of layers
  • number of neurons
  • convolution kernel width
  • nonlinearity
• What works best in practice (this is not the case in the home assignment :-)):
• A large model combined with strong regularization.
• The training error is very low.
Hyperparameter search: Grid search
Hyperparameter search: Random search
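A minimal random-search sketch (the search ranges and the train_and_validate helper are assumptions):

  import random

  best = None
  for trial in range(50):
      config = {
          "lr": 10 ** random.uniform(-5, -1),            # log-uniform learning rate
          "weight_decay": 10 ** random.uniform(-6, -2),  # log-uniform decay coefficient
          "dropout": random.uniform(0.0, 0.5),
      }
      val_error = train_and_validate(config)             # hypothetical helper
      if best is None or val_error < best[0]:
          best = (val_error, config)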
Recommended reading
Recap
Summary of Lecture #3
1. Model overfitting may result if a model contains more parameters than can be justified by the training data.
2. Model overfitting leads to good performance on training data but bad performance on test data.
3. Regularization techniques aim to prevent model overfitting.
4. Model capacity can be limited by reducing the model size, weight decay and parameter sharing.
5. Early stopping is based on maximizing performance on a validation set.
6. Dropout is a very efficient ensemble method.
7. Data augmentation methods include noise injection, transformations and adversarial training.
8. Explicit regularization does not necessarily improve generalization.
9. In the deep double descent phenomenon, a DL model’s test performance first gets worse and then better as model size (or training time) grows.
10. Random search should be preferred over grid search for hyperparameter optimization.
Home assignment
Assignment 03 reg
• First notebook: Experiment with different regularization methods on a toy regression problem.
• Second notebook: Implement and tune a recommender system.
• In order to achieve good performance on the test set, you will have to use regularization
techniques and tune the (hyper)parameters.
How to represent users or items
• A simple representation is a one-hot vector: user i can be represented with a vector zᵢ whose i-th element is 1 and all other elements are 0.
• Better representation:
  • represent each user i as a vector wᵢ
  • treat all vectors wᵢ as model parameters and tune them in the training procedure
  • this is equivalent to Wzᵢ, where W is a matrix of “embeddings” (with vectors wᵢ in its columns).
• This is implemented in torch.nn.Embedding(num_embeddings, embedding_dim)
  • num_embeddings is the size of the dictionary
  • embedding_dim is the size of each embedding vector wᵢ
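A minimal sketch of the lookup (the sizes and indices are assumptions):

  import torch
  import torch.nn as nn

  user_emb = nn.Embedding(num_embeddings=1000, embedding_dim=16)  # 1000 users, 16-d vectors
  user_ids = torch.tensor([3, 14, 159])  # a batch of user indices (no one-hot vectors needed)
  w = user_emb(user_ids)                 # shape (3, 16): one embedding vector per user
  # The lookup is equivalent to multiplying by a one-hot zᵢ;
  # PyTorch stores the embedding vectors as rows of user_emb.weight.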