"Regularization
"Regularization
Magdalena Ivanovska
Department of Data Science and Analytics
I The error increases whenever the Euclidean distance between the prediction and the
target increases.
I In theory:
I expected training error = expected test error
I In practice:
I expected test error ≥ expected training error
I For a good performance, we need the ability to:
1. Make the training error small (avoid underfitting)
2. Make the gap between training and test error small (avoid overfitting)
I Underfitting occurs when the model is not able to obtain a sufficiently low training
error.
I Overfitting occurs when the gap between the training error and test error is too large.
I Capacity of the model is its ability to fit a wide variety of functions.
I We can control underfitting and overfitting by altering the capacity of the model.
I Low capacity may cause underfitting.
I High capacity may cause overfitting.
I One way to control the capacity of a learning algorithm is by choosing its hypothesis
space.
I Hypothesis space is the set of functions that the learning algorithm is allowed to
select as being the solution.
I E.g. the hypothesis space of linear regression consists of all the linear functions of the
input features.
I To increase the capacity of linear regression, we can allow for polynomial functions.
E.g. ŷ = b + w₁x + w₂x² or ŷ = b + ∑ᵢ₌₁ⁿ wᵢxⁱ, for some degree n.
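A minimal sketch (assuming NumPy) of this idea: increasing the polynomial degree enlarges the hypothesis space while the fitting procedure (least squares) stays the same. The data here is made up for illustration.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    # Design matrix with columns [1, x, x^2, ..., x^degree]
    X = np.vander(x, N=degree + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)  # b, w1, ..., wn
    return coeffs

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=20)  # noisy quadratic

w_line = fit_polynomial(x, y, degree=1)  # low capacity: may underfit
w_poly = fit_polynomial(x, y, degree=9)  # high capacity: may overfit
```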
I ML algorithms perform best when their capacity is appropriate to:
I the complexity of the task they need to perform
I the amount of training data they need to fit
[Figure 5.3: Typical relationship between capacity and error. Training error decreases with capacity, while generalization (test) error follows a U-shape: an underfitting zone below the optimal capacity, an overfitting zone above it, and a generalization gap that widens as capacity grows.]
Estimators, Bias, and Variance
I Point estimation is the attempt to provide the single “best” prediction of some
quantity of interest (single parameter, a vector of parameters, or a function).
I We denote the point estimate of θ by θ̂.
I In general, a point estimator or statistic of a set of i.i.d. data points {x⁽¹⁾, . . . , x⁽ᵐ⁾} is any function of the data:
θ̂ₘ = g(x⁽¹⁾, . . . , x⁽ᵐ⁾)
I A good estimator is a function g that produces an estimate that is close to the true θ.
I The bias of an estimator is defined as:
Bias(θ̂ₘ) = E(θ̂ₘ) − θ
I The bias measures the expected deviation of the estimator from the true value of the
parameter.
I An estimator θ̂ₘ is unbiased if Bias(θ̂ₘ) = 0, which implies E(θ̂ₘ) = θ.
I An estimator θ̂ₘ is asymptotically unbiased if lim_{m→∞} Bias(θ̂ₘ) = 0, which implies lim_{m→∞} E(θ̂ₘ) = θ.
Note: Statistical bias should not be confused with the bias parameter in neural networks.
I The variance of an estimator, Var(θ̂ₘ), tells us how much we would expect the estimate to vary depending on the data sample.
I In other words, the variance represents the variability of the model’s predictions across different datasets.
I We also use the standard error of the estimator:
SE(θ̂ₘ) = √Var(θ̂ₘ)
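An illustrative simulation (a sketch, assuming NumPy; the Gaussian data with true θ = 5.0 and σ = 2.0 are made-up values) that measures the bias, variance, and standard error of the sample-mean estimator empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, m, trials = 5.0, 2.0, 50, 10_000

# One estimate per simulated dataset of m i.i.d. points
estimates = np.array(
    [rng.normal(theta, sigma, size=m).mean() for _ in range(trials)]
)

bias = estimates.mean() - theta  # E(theta_hat_m) - theta, close to 0 (unbiased)
var = estimates.var()            # Var(theta_hat_m), close to sigma^2 / m
se = np.sqrt(var)                # SE(theta_hat_m), close to sigma / sqrt(m)
print(bias, var, se)
```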
I The capacity of a network refers to the level of complexity of the function that it can learn to approximate.
I Small networks, or networks with a relatively small number of parameters, have a low capacity and are therefore likely to underfit, resulting in poor performance, since they cannot learn the underlying structure of complex datasets.
I Very large networks may result in overfitting, where the network memorizes the training data and does extremely well on the training dataset while achieving poor performance on the held-out test dataset.
I When we deal with real-world ML problems, we do not know how large the network
should be a priori.
[Figure 5.6: As capacity increases (x-axis), bias tends to decrease and variance tends to increase, yielding a U-shaped generalization error with a minimum at the optimal capacity.]
Bias-Variance Trade-Off
I The hyperparameters of an ML algorithm are settings that are used to control the
algorithm’s behaviour.
I Example: Polynomial regression hyperparameters:
I The degree of the polynomial (capacity hyperparameter)
I The weight decay coefficient λ
I The value of a hyperparameter is not learned by the learning algorithm itself because:
I it is difficult to optimize
I it is not appropriate to learn it on the training set (overfitting)
I To set the hyperparameters, we use a validation set of data.
I Typically, 20% of the training data is used for validation.
I The validation set is used to estimate the generalization error during or after the
training and update the hyperparameters accordingly.
I Random sampling – randomly assigning examples to training, validation, and test set
according to predetermined ratios.
I Stratified Dataset Splitting – a method commonly used with imbalanced datasets,
where certain classes or categories have significantly fewer instances than others.
I the dataset is split into three sets while preserving the relative proportions of each class across the splits, as in the sketch below.
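A minimal sketch of a stratified 60/20/20 split using scikit-learn's train_test_split with its stratify argument; the toy imbalanced labels are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: 90% class 0, 10% class 1
X = np.arange(200).reshape(-1, 1)
y = np.array([0] * 180 + [1] * 20)

# First split off the test set, then the validation set; stratify=... preserves
# the class proportions in every split. Result: 60% train / 20% val / 20% test.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)
```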
I Cross-validation is an alternative procedure that enables using the whole dataset in
estimating the mean test error in cases when the dataset is too small.
I In k-fold cross-validation the dataset is partitioned into k non-overlapping subsets.
I On trial i, the i-th subset of the data is used as the test set and the rest of the data is used as the training set.
I The test error is then estimated by taking the average test error across the k trials.
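A sketch of this procedure; train_and_evaluate(train_idx, test_idx) is a hypothetical callback that trains a model on the training indices and returns its error on the test indices:

```python
import numpy as np

def k_fold_cv(n_examples, k, train_and_evaluate, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_examples), k)  # k non-overlapping subsets
    errors = []
    for i in range(k):
        test_idx = folds[i]  # on trial i, fold i is the test set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(train_and_evaluate(train_idx, test_idx))
    return np.mean(errors)  # average test error across the k trials
```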
Regularization
I The no free lunch theorem states that we must design our ML algorithm to perform
well on a specific task.
I One way to achieve good performance is by modifying the hypothesis space in order
to increase or decrease capacity.
I Another way is to give a learning algorithm a preference for one solution over another
in its hypothesis space.
I E.g. we can modify the training criterion for linear regression to include weight decay:
J(w) = MSE_train(w) + λwᵀw
I We fit a high-degree polynomial regression model to our example training set.
I In all the three cases we use polynomials of degree 9 as models.
I The actual function that we are trying to learn is quadratic.
I We vary the amount of weight decay λ to prevent these high-degree models from overfitting.
[Figure 5.5: We fit a degree-9 polynomial regression model to the training set from figure 5.2. The true function is quadratic. (Left) With very large λ, the model is forced to learn a function with no slope at all; this underfits, since it can only represent a constant function. (Center) With a medium value of λ, the model recovers a curve with the right general shape. (Right) With λ approaching zero, the model overfits.]
Regularization
I Creating an algorithm that performs well on new data is one of the central problems in
machine learning.
I Regularization is any modification we make to a learning algorithm that is intended to
reduce its generalization error but not its training error.
I Similarly as there is no best ML algorithm, there is no best regularization.
I However, many ML tasks can be solved effectively with very general-purpose forms of
regularization.
I An effective regularizer is one that makes a good bias-variance trade-off:
I reduces the variance significantly while not overly increasing the bias
I A DL practice wisdom: The best fitting model (in the sense of minimizing the
generalization error) is a large model that has been regularized appropriately!
I Many regularization approaches are based on limiting the capacity of the models by
adding a parameter norm penalty Ω(θ) to the objective function.
I We denote by J̃ the regularized objective (cost) function:
J̃(θ) = J(θ) + αΩ(θ)
I α ∈ [0, ∞) is a hyperparameter that weights the contribution of the penalty term Ω(θ)
relative to the standard (non-regularized) J.
I In neural networks, we typically choose to regularize the weights but not biases.
I a weight specifies an interaction between two variables, while a bias controls only a single variable,
I hence, the biases typically require less data to fit accurately
I we do not induce much variance by leaving biases unregularized
I regularizing biases can introduce a significant amount of underfitting.
I The L2 parameter norm penalty is one of the simplest and most commonly used penalties.
I It is defined as follows:
J̃(w) = J(w) + αwᵀw
I This is known as L2-regularization because it is based on the L2-norm:
L₂(w) = ||w||₂ = √(wᵀw) = √(w₁² + . . . + wₙ²)
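A sketch (assuming NumPy) of one gradient-descent step on the L2-regularized objective for linear regression with MSE loss; note how the penalty's gradient 2αw shrinks every weight toward zero on each step, which is why this penalty is also called weight decay:

```python
import numpy as np

def l2_regularized_step(w, X, y, alpha, lr):
    grad_mse = 2.0 / len(y) * X.T @ (X @ w - y)  # gradient of the MSE term J(w)
    grad_penalty = 2.0 * alpha * w               # gradient of alpha * w^T w
    # Gradient step on J~(w) = J(w) + alpha * w^T w
    return w - lr * (grad_mse + grad_penalty)
```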
I The L1 parameter norm penalty provides another way to penalize the size of the model
parameters.
I It is defined as follows:
J̃(w) = J(w) + α||w||₁
I This is known as L1-regularization because it is based on the L1-norm:
L₁(w) = ||w||₁ = |w₁| + . . . + |wₙ|
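For comparison, a sketch of both penalties on a made-up weight vector, plus the soft-thresholding operation used by many L1 solvers, which shows why L1 regularization drives small weights exactly to zero (sparsity):

```python
import numpy as np

w = np.array([0.5, -1.2, 3.0])  # made-up weights
alpha = 0.1

l2_penalty = alpha * np.dot(w, w)     # alpha * (w1^2 + ... + wn^2)
l1_penalty = alpha * np.abs(w).sum()  # alpha * (|w1| + ... + |wn|)

def soft_threshold(w, t):
    # Proximal step for the L1 penalty: any weight with |w_i| <= t is set
    # exactly to zero, which is why L1 regularization induces sparsity.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

print(soft_threshold(w, 0.6))  # -> [ 0.  -0.6  2.4]
```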
I When training large models with sufficient capacity to overfit the task, it is often the
case that:
I training set error decreases steadily over time
I validation set error decreases but then starts increasing again.
I A better validation error (and so probably better test error) can be obtained by
returning to the parameter setting at the point in time with lowest validation error.
I Every time the validation set error improves, we store a copy of the model parameters.
I When the training algorithm terminates, we return to the parameters with the best validation error, rather than the latest parameters.
I Training stops when no improvement over the best recorded validation set error has been observed for a pre-specified number of iterations.
[Figure 7.3: Learning curves showing how the negative log-likelihood loss changes over time (indicated as number of training iterations over the dataset, or epochs) for a maxout network trained on MNIST. The training objective decreases consistently over time, while the validation set loss eventually begins to increase again.]
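A minimal early-stopping loop as a sketch; train_one_epoch and validation_error are hypothetical helpers, and patience is the number of epochs we tolerate without an improvement of the best recorded validation error:

```python
import copy

def train_with_early_stopping(model, patience, max_epochs=1000):
    best_error, best_epoch = float("inf"), 0
    best_params = copy.deepcopy(model)
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_error:  # store a copy whenever validation error improves
            best_error, best_epoch = err, epoch
            best_params = copy.deepcopy(model)
        elif epoch - best_epoch >= patience:  # no recent improvement: stop
            break
    return best_params, best_error  # return the best, not the latest, parameters
```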
Advantages and Disadvantages of Early Stopping
I This is probably the most commonly used form of regularization in deep learning.
I The number of training steps (or training time) is just another hyperparameter.
I Early stopping can be seen as a regularization technique where this hyperparameter is
tuned using the validation set.
I It is costly because the validation set must be evaluated periodically during training:
I to reduce the computational cost, one can use a smaller validation set, or
I evaluate the validation loss less frequently.
I Early stopping is an unobtrusive form of regularization:
I it does not change the training procedure, the objective function, or the allowable
parameter values
I it can be easily combined with other forms of regularization.
I Early stopping automatically determines the correct amount of regularization
I while weight decay, for example, requires many training experiments with different values
of its hyperparameter.
I Bagging (short for bootstrap aggregating) is an ensemble method that allows the
same kind of model, training algorithm and objective function to be reused several
times.
I Bagging involves constructing k different datasets with the same number of examples
as the original dataset.
I Each dataset is constructed by sampling with replacement from the original dataset.
I each new dataset is missing some of the examples in the original dataset (about 1/3 of the examples, on average)
I each new dataset contains several duplicate examples.
I Model i is trained on dataset i
I The models’ predictions are then aggregated.
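A sketch of this procedure; train is a hypothetical function that returns a model with a .predict(X) method producing real-valued outputs, and X, y are NumPy arrays:

```python
import numpy as np

def bagging(X, y, k, train, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)  # sample m examples WITH replacement
        models.append(train(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    # Aggregate by averaging the individual models' predictions.
    return np.mean([model.predict(X) for model in models], axis=0)
```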
I We want to train an 8 detector on a dataset of three digits (an 8, a 6, and a 9).
I We make two different resampled datasets based on the original one.
I On the first/second resampled dataset, the detector learns that a loop on the top/bottom of the digit corresponds to an 8.
I The averaged model is more reliable than the individual ones (it will look for both loops to detect an 8).
[Figure 7.5: A cartoon depiction of how bagging works. We construct two resampled datasets by sampling with replacement. The first dataset omits the 9 and repeats the 8; on it, the detector learns that a loop on top of the digit corresponds to an 8. The second dataset repeats the 9 and omits the 6; on it, the detector learns that a loop on the bottom corresponds to an 8. Each of these individual classification rules is brittle, but if we average their outputs the detector is robust, achieving maximal confidence only when both loops of the 8 are present.]
Dropout
I Dropout trains an ensemble consisting of all sub-networks that can be constructed by removing non-output units from an underlying base network:
I in a network with 2 input and 2 hidden units, there are 2⁴ = 16 possible sub-networks in the ensemble;
I in networks with wider layers, the probability of dropping all paths from inputs to outputs becomes smaller.
[Figure 7.6: The ensemble of sub-networks obtained by removing different subsets of non-output units from a base network with inputs x1, x2, hidden units h1, h2, and output y.]
Training with Dropout
I The prediction of the ensemble is the arithmetic mean of the predictions of its k member models:
(1/k) ∑ᵢ₌₁ᵏ p⁽ⁱ⁾(y|x)
I This sum is evaluated after training, usually empirically, by sampling a set of masks µ and aggregating the predictions of the corresponding models.
I Requiring that no event is assigned zero probability, together with renormalization of the resulting distribution, is also usually applied.
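A sketch of (inverted) dropout on one layer, together with the empirical averaging of k sampled sub-models described above; predict_with_mask is a hypothetical function that samples a fresh dropout mask internally and returns p(y|x):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(h, keep_prob):
    # Sample a binary mask mu: each unit is kept with probability keep_prob.
    mu = rng.random(h.shape) < keep_prob
    # "Inverted" scaling keeps the expected activation unchanged.
    return (h * mu) / keep_prob

def mc_average(predict_with_mask, x, k=10):
    # Empirical evaluation of (1/k) * sum_i p^(i)(y|x): sample k masks and
    # aggregate the corresponding sub-models' predictions.
    return np.mean([predict_with_mask(x) for _ in range(k)], axis=0)
```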