
Regularization

GRA3420 Deep learning and explainable AI

Magdalena Ivanovska
Department of Data Science and Analytics

February 16, 2024



These lecture slides are based on www.deeplearningbook.org,
Deep Learning: Foundations and Concepts by C. Bishop and H. Bishop,
and other resources cited throughout the presentation.

Magdalena Ivanovska: Regularization 2


AGENDA

I Generalization, capacity, overfitting, and underfitting


I Bias and variance
I Regularization
I Regularization techniques



Generalization



EVALUATING PERFORMANCE

I To evaluate the performance of the algorithm, we need a test set of m examples:
I design matrix $X^{\text{test}}$
I corresponding labels/targets: $y^{\text{test}}$.
I E.g. we can measure the performance of the linear regression by computing the mean
squared error (MSE) of the model on the test set:

$$\mathrm{MSE}_{\text{test}} = \frac{1}{m} \sum_i \left(\hat{y}^{\text{test}} - y^{\text{test}}\right)_i^2 = \frac{1}{m} \lVert \hat{y}^{\text{test}} - y^{\text{test}} \rVert_2^2$$

I The error increases whenever the Euclidean distance between the prediction and the
target increases.
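
The test MSE above can be computed directly from the prediction and target vectors. The following is a minimal sketch; the function name `mse` and the numeric values are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def mse(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Mean squared error: (1/m) * ||y_pred - y_true||_2^2."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(diff @ diff) / diff.size

# Hypothetical test targets and predictions (illustrative values only).
y_test = np.array([1.0, 2.0, 3.0, 4.0])
y_hat_test = np.array([1.5, 2.0, 2.0, 4.5])

mse_test = mse(y_hat_test, y_test)
```

Both forms of the formula agree: the squared Euclidean norm of the difference divided by m equals the mean of the squared per-example errors.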



GENERALIZATION

I Generalization – the ability of the ML algorithm to perform well on previously
unobserved data
I We train the model to minimize the training error but we care about the
generalization error (the test error).
I How can we affect performance on a test set with training on another set?
I i.i.d. assumptions:
I The examples in each dataset are independent from each other.
I The training set and the test set are identically distributed, hence the examples in them
are drawn from the same probability distribution.
I We call this underlying distribution the data-generating distribution, denoted pdata .
I Under the i.i.d. assumptions, for a model whose parameters are fixed in advance, the
expected value of the test error is the same as the expected value of the training error.



GENERALIZATION

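I In theory (parameters fixed before the datasets are sampled):
I expected training error = expected test error
I In practice (parameters chosen to minimize the training error):
I expected test error ≥ expected training error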
I For a good performance, we need the ability to:
1. Make the training error small (avoid underfitting)
2. Make the gap between training and test error small (avoid overfitting)



CAPACITY, OVERFITTING, AND UNDERFITTING

I Underfitting occurs when the model is not able to obtain a sufficiently low training
error.
I Overfitting occurs when the gap between the training error and test error is too large.
I Capacity of the model is its ability to fit a wide variety of functions.
I We can control underfitting and overfitting by altering the capacity of the model.
I Low capacity may cause underfitting.
I High capacity may cause overfitting.



HYPOTHESIS SPACE

I One way to control the capacity of a learning algorithm is by choosing its hypothesis
space.
I Hypothesis space is the set of functions that the learning algorithm is allowed to
select as being the solution.
I E.g. the hypothesis space of linear regression consists of all the linear functions of the
input features.
I To increase the capacity of linear regression, we can allow for polynomial functions.
E.g. $\hat{y} = b + w_1 x + w_2 x^2$ or $\hat{y} = b + \sum_{i=1}^{n} w_i x^i$, for some degree n.
I ML algorithms perform best when their capacity is appropriate to:
I the complexity of the task they need to perform
I the amount of training data they need to fit
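
To illustrate how the polynomial degree changes the capacity of linear regression, here is a small sketch that fits polynomials of degree 1, 2, and 9 by least squares. The data (a noisy quadratic), the helper names, and the chosen degrees are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic function plus noise.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + 0.1 * rng.standard_normal(x_train.size)

def fit_polynomial(x, y, degree):
    """Least-squares fit of y = b + sum_i w_i x^i; returns [b, w_1, ..., w_n]."""
    features = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, ..., x^degree
    coeffs, *_ = np.linalg.lstsq(features, y, rcond=None)
    return coeffs

def train_mse(x, y, coeffs):
    """Training MSE of the polynomial with the given coefficients."""
    features = np.vander(x, coeffs.size, increasing=True)
    return float(np.mean((features @ coeffs - y) ** 2))

train_errors = {
    d: train_mse(x_train, y_train, fit_polynomial(x_train, y_train, d))
    for d in (1, 2, 9)
}
```

Because the hypothesis spaces are nested, the training error cannot increase with the degree. With 10 points, the degree-9 polynomial can interpolate the training set, which is exactly the high-capacity regime where overfitting becomes a risk.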



UNDERFITTING AND OVERFITTING IN POLYN. ESTIMATION



RELATIONSHIP BETWEEN CAPACITY AND ERROR

[Figure 5.3 (Chapter 5, Machine Learning Basics): Typical relationship between capacity and error. Training error and generalization error are plotted against capacity; the underfitting zone lies to the left of the optimal capacity, the overfitting zone to the right, and the generalization gap is the distance between the two curves.]

Estimators, Bias, and Variance



POINT ESTIMATION

I Point estimation is the attempt to provide the single “best” prediction of some
quantity of interest (single parameter, a vector of parameters, or a function).
I We denote the point estimate of θ by θ̂.
I In general, a point estimator or statistic of a set of i.i.d. data points $\{x^{(1)}, \dots, x^{(m)}\}$
is any function of the data:

$$\hat{\theta}_m = g(x^{(1)}, \dots, x^{(m)})$$

I A good estimator is a function g that produces an estimate that is close to the true θ.



STATISTICAL BIAS

I The point estimate θ̂ m is a random variable.


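I The bias of an estimator $\hat{\theta}_m$ is defined as:

$$\mathrm{Bias}(\hat{\theta}_m) = \mathbb{E}(\hat{\theta}_m) - \theta$$

I The bias measures the expected deviation of the estimator from the true value of the
parameter.
I An estimator $\hat{\theta}_m$ is unbiased if $\mathrm{Bias}(\hat{\theta}_m) = 0$, which implies $\mathbb{E}(\hat{\theta}_m) = \theta$.
I An estimator $\hat{\theta}_m$ is asymptotically unbiased if $\lim_{m \to \infty} \mathrm{Bias}(\hat{\theta}_m) = 0$, which implies
$\lim_{m \to \infty} \mathbb{E}(\hat{\theta}_m) = \theta$.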

Note: Statistical bias should not be confused with the bias parameter in neural networks.
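
A standard example of a biased but asymptotically unbiased estimator is the sample variance that divides by m rather than m − 1. The simulation below is a sketch under assumed settings (Gaussian data, true variance 4, samples of size 5); none of these values come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0          # variance of the (assumed) Gaussian data-generating distribution
m, trials = 5, 200_000  # small samples make the bias easy to see

samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, m))

# Two point estimators of the variance, evaluated on every simulated dataset.
var_biased = samples.var(axis=1, ddof=0)    # divides by m
var_unbiased = samples.var(axis=1, ddof=1)  # divides by m - 1

# Empirical bias: average estimate minus the true value.
bias_biased = var_biased.mean() - true_var      # theory: -true_var / m
bias_unbiased = var_unbiased.mean() - true_var  # theory: 0
```

The bias of the first estimator is −σ²/m, which vanishes as m grows, so it is asymptotically unbiased even though it is biased for every finite m.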



VARIANCE

I The variance of an estimator is:

$$\mathrm{Var}(\hat{\theta}_m) = \mathbb{E}\left[\left(\hat{\theta}_m - \mathbb{E}(\hat{\theta}_m)\right)^2\right]$$

I It tells us how much we would expect an estimator to vary depending on the data
sample.
I In other words, the variance error represents the variability in a model’s predictions across
different datasets.
I We also use the standard error of the estimator:

$$\mathrm{SE}(\hat{\theta}_m) = \sqrt{\mathrm{Var}(\hat{\theta}_m)}$$



BIAS VS. VARIANCE

I Bias and variance measure two different sources of error in an estimator:


I Bias error represents the extent to which a model’s predictions deviate from the actual
values it is trying to predict due to its simplifications or assumptions.
In other words, the model contains a systematic error that is not due to randomness.
I Variance measures the model’s sensitivity to fluctuations in the training dataset. It
quantifies how much the model’s predictions vary when trained on different subsets of the
data.
In other words, the model is sensitive to the randomness in the training data.
I High bias means that a model is too simple and systematically makes errors in its
predictions, leading to underfitting.
I High variance means that a model is too complex and unable to generalize to unseen data,
leading to overfitting.



BIAS VS. VARIANCE



BIAS VS. VARIANCE IN DL

I The capacity of a network refers to the level of complexity of the function that it can
learn to approximate
I Small networks, or networks with a relatively small number of parameters, have a low
capacity and are therefore likely to underfit, resulting in poor performance, since they
cannot learn the underlying structure of complex datasets
I Very large networks may result in overfitting, where the network memorizes the training
data and does extremely well on the training dataset while achieving a poor
performance on the held-out test dataset
I When we deal with real-world ML problems, we do not know how large the network
should be a priori.



CAPACITY, BIAS, AND VARIANCE

Desirable estimators are those with small MSE and these are estimators that
manage to keep both their bias and variance somewhat in check.

[Figure 5.6: Bias, variance, and generalization error plotted against capacity, with the underfitting zone, the optimal capacity, and the overfitting zone marked. As capacity increases (x-axis), bias (dotted) tends to decrease.]

BIAS-VARIANCE TRADE-OFF

I It is desired to have both low bias and low variance.


I The objective of the bias-variance trade-off is to find the right balance between model
simplicity (bias) and model flexibility (variance) to create models that generalize well
to new, unseen data.
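I The bias-variance trade-off is often negotiated via:
I cross-validation
I comparing estimators by their MSE, which accounts for both bias and variance:

$$\mathrm{MSE} = \mathbb{E}\left[(\hat{\theta}_m - \theta)^2\right] = \mathrm{Bias}(\hat{\theta}_m)^2 + \mathrm{Var}(\hat{\theta}_m)$$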
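
The decomposition of the MSE into squared bias plus variance can be checked numerically. The sketch below uses an assumed setup (Gaussian data with true mean 2 and a deliberately shrunk sample mean as the estimator); the estimator and all values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0             # true mean (assumed for the simulation)
m, trials = 10, 100_000

data = rng.normal(loc=theta, scale=1.0, size=(trials, m))
# A deliberately biased estimator: the sample mean shrunk toward zero.
theta_hat = 0.8 * data.mean(axis=1)

mse = np.mean((theta_hat - theta) ** 2)
bias = theta_hat.mean() - theta
variance = theta_hat.var()  # ddof=0, matching the empirical expectation

# Difference between the MSE and bias^2 + variance (zero up to rounding).
decomposition_gap = abs(mse - (bias**2 + variance))
```

Shrinking the estimator lowers its variance (by the factor 0.8² here) at the price of a nonzero bias, which is the trade-off the slide describes.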



HYPERPARAMETERS AND VALIDATION SET

I The hyperparameters of an ML algorithm are settings that are used to control the
algorithm’s behaviour.
I Example: Polynomial regression hyperparameters:
I The degree of the polynomial (capacity hyperparameter)
I The weight decay strength λ
I The value of a hyperparameter is not learned by the learning algorithm itself because:
I it is difficult to optimize
I it is not appropriate to learn it on the training set (overfitting)
I To set the hyperparameters, we use a validation set of data.
I Typically, 20% of the training data is used for validation.
I The validation set is used to estimate the generalization error during or after the
training and update the hyperparameters accordingly.
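
As a sketch of how a validation set can be used to set a capacity hyperparameter, the example below holds out 20% of some hypothetical training data and picks the polynomial degree with the lowest validation error. The data, the range of degrees, and the helper `val_error` are assumptions for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.2 * rng.standard_normal(x.size)  # hypothetical quadratic data

# Hold out 20% of the training data for validation.
perm = rng.permutation(x.size)
n_val = x.size // 5
val_idx, train_idx = perm[:n_val], perm[n_val:]

def val_error(degree):
    """Fit on the training part, report the MSE on the validation part."""
    coeffs = P.polyfit(x[train_idx], y[train_idx], degree)
    pred = P.polyval(x[val_idx], coeffs)
    return float(np.mean((pred - y[val_idx]) ** 2))

val_errors = {d: val_error(d) for d in range(10)}
best_degree = min(val_errors, key=val_errors.get)
```

The degree is selected on data that the least-squares fit never saw, so a degree that merely memorizes the training points is not rewarded.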



TRAINING, VALIDATION AND TEST DATASET



METHODS TO SPLIT THE DATASET

I Random sampling – randomly assigning examples to training, validation, and test set
according to predetermined ratios.
I Stratified Dataset Splitting – a method commonly used with imbalanced datasets,
where certain classes or categories have significantly fewer instances than others.
I the dataset is split into three sets while preserving the relative proportions of each class
across the splits.
I Cross-validation is an alternative procedure that enables using the whole dataset in
estimating the mean test error in cases when the dataset is too small.
I In k-fold cross-validation the dataset is partitioned into k non-overlapping subsets.
I The test error can then be estimated by taking the average test error across k trials.
I On trial i, the i-th subset of the data is used as the test set and the rest of the data is
used as the training set.
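
The k-fold procedure described above can be sketched as follows. The helper names, the polynomial model, and the data are assumptions chosen to keep the example self-contained.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def k_fold_indices(n_examples, k, seed=0):
    """Partition shuffled example indices into k non-overlapping folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_examples), k)

def cross_val_mse(x, y, degree, k=5):
    """Average test MSE of a polynomial fit of the given degree across k trials."""
    folds = k_fold_indices(x.size, k)
    errors = []
    for i, test_idx in enumerate(folds):
        # On trial i, fold i is the test set and the remaining folds form the training set.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = P.polyfit(x[train_idx], y[train_idx], degree)
        pred = P.polyval(x[test_idx], coeffs)
        errors.append(np.mean((pred - y[test_idx]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=30)  # hypothetical small dataset
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.2 * rng.standard_normal(x.size)
cv_errors = {d: cross_val_mse(x, y, d) for d in (1, 2)}
```

Every example is used for testing exactly once, which is what makes the estimate usable when the dataset is too small for a fixed held-out test set.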




Regularization



THE NO FREE LUNCH THEOREM

I The no free lunch theorem states that:


I Averaged over all possible data generating distributions pdata , every classification
algorithm has the same error rate when classifying previously unobserved data points.
I In other words, no machine learning algorithm is universally any better than any other.
I The goal of machine learning research is not to seek a universal learning algorithm or
the absolute best learning algorithm.
I Instead, the goal of ML is to:
I understand what kinds of distributions are relevant to the “real world”
I design algorithms that perform well on data drawn from distributions we care about.



REGULARIZATION

I The no free lunch theorem implies that we must design our ML algorithm to perform
well on a specific task.
I One way to achieve good performance is by modifying the hypothesis space in order
to increase or decrease capacity.
I Another way is to give a learning algorithm a preference for one solution over another
in its hypothesis space.
I E.g. we can modify the training criterion for linear regression to include weight decay:

$$J(w) = \mathrm{MSE}_{\text{train}} + \lambda w^\top w$$

I This means we are regularizing a model that learns a function $f(x; w)$ by adding a
penalty $\Omega(w)$ called a regularizer to the cost function.
I In the above example, $\Omega(w) = w^\top w$.
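
For linear regression, the weight-decay criterion above has a closed-form minimizer. The sketch below fits degree-9 polynomial features with three values of λ; the data, the feature construction, and the choice to penalize the intercept along with the other weights are simplifying assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/m) * ||X w - y||^2 + lam * w^T w in closed form.

    Setting the gradient to zero gives (X^T X + m * lam * I) w = X^T y.
    """
    m, n = X.shape
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 12)  # hypothetical training inputs
y = 1.0 - 2.0 * x**2 + 0.1 * rng.standard_normal(x.size)
X = np.vander(x, 10, increasing=True)  # degree-9 polynomial features

weights = {lam: ridge_fit(X, y, lam) for lam in (1e-6, 1e-2, 1e2)}
norms = {lam: float(w @ w) for lam, w in weights.items()}
```

Larger λ yields a smaller squared weight norm, at the cost of a worse fit to the training set.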



REGULARIZATION IN LINEAR REGRESSION

[Figure 5.5, three panels of y against x: Underfitting (excessive λ), Appropriate weight decay (medium λ), Overfitting (λ → 0). We fit a high-degree polynomial regression model to our example training set from figure 5.2. The true function is quadratic, but here we use only models with degree 9. We vary the amount of weight decay to prevent these high-degree models from overfitting. (Left) With very large λ, we can force the model to learn a function with no slope at all. This underfits because it can only represent a constant function.]

I in all three cases we use polynomials of degree 9 as models
I the actual function that we are trying to learn is quadratic

REGULARIZATION

I Creating an algorithm that performs well on new data is one of the central problems in
machine learning.
I Regularization is any modification we make to a learning algorithm that is intended to
reduce its generalization error but not its training error.
I Just as there is no best ML algorithm, there is no best form of regularization.
I However, many ML tasks can be solved effectively with very general-purpose forms of
regularization.
I An effective regularizer is one that makes a good bias-variance trade-off:
I it reduces the variance significantly while not overly increasing the bias
I A practical rule of thumb in DL: the best fitting model (in the sense of minimizing the
generalization error) is a large model that has been regularized appropriately!



PARAMETER NORM PENALTIES

I Many regularization approaches are based on limiting the capacity of the models by
adding a parameter norm penalty Ω(θ) to the objective function.
I We denote by $\tilde{J}$ the regularized objective (cost) function:

$$\tilde{J}(\theta) = J(\theta) + \alpha\,\Omega(\theta)$$

I α ∈ [0, ∞) is a hyperparameter that weights the contribution of the penalty term Ω(θ)
relative to the standard (non-regularized) J.
I In neural networks, we typically choose to regularize the weights but not biases.
I a weight specifies interaction between two variables while a bias controls only one variable,
I hence, the biases typically require less data to fit accurately
I we do not induce much variance by leaving biases unregularized
I regularizing biases can introduce a significant amount of underfitting.



L2 PARAMETER REGULARIZATION

I The L2 parameter norm penalty is one of the simplest and most commonly used penalties.
I It is defined as follows:

$$\tilde{J}(w) = J(w) + \alpha\, w^\top w$$

I This is known as L2-regularization because it is based on the (squared) L2-norm:

$$L_2(w) = \lVert w \rVert_2 = \sqrt{w^\top w} = \sqrt{w_1^2 + \dots + w_n^2}$$

I It is also known as weight decay in DL and as ridge regression or Tikhonov
regularization in other academic communities.
I With this $\tilde{J}(w)$ we are expressing a preference for the weights to have smaller values.
I the value α controls this preference and is chosen ahead of time.
I larger α forces the weights to become smaller.
I optimization of $\tilde{J}(w)$ results in a choice of weights that makes a trade-off between fitting
the training set and being small.
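
The name "weight decay" can be read off a single gradient descent step on the regularized objective. In the short derivation below, the learning rate $\epsilon$ is notation assumed for this illustration; it does not appear on the slides.

```latex
\nabla_w \tilde{J}(w) = \nabla_w J(w) + 2\alpha w
\qquad\Longrightarrow\qquad
w \leftarrow w - \epsilon\left(\nabla_w J(w) + 2\alpha w\right)
  = (1 - 2\epsilon\alpha)\, w - \epsilon \nabla_w J(w)
```

Each step first multiplies the weights by the constant factor $(1 - 2\epsilon\alpha)$, which is less than one for small positive $\epsilon\alpha$, and then applies the usual gradient update.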



L1 PARAMETER REGULARIZATION

I The L1 parameter norm penalty provides another way to penalize the size of the model
parameters.
I It is defined as follows:

$$\tilde{J}(w) = J(w) + \alpha \lVert w \rVert_1$$

I This is known as L1-regularization because it is based on the L1-norm:

$$L_1(w) = \lVert w \rVert_1 = |w_1| + \dots + |w_n|$$



L1 VERSUS L2

I In comparison to L2 , L1 -regularization results in a solution that is more sparse.


I this means that some of the parameters have an optimal value of zero.
I The sparsity induced by L1 -regularization can be used as a feature selection
mechanism.
I feature selection simplifies a ML algorithm by choosing only a subset of (relevant)
features of the data.
I the L1 penalty causes a subset of the weights to become zero, suggesting that the
corresponding features can be discarded.
I The LASSO (least absolute shrinkage and selection operator) model integrates:
I L1 -penalty
I with a linear model
I and a least-square cost function
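
The difference in sparsity is easiest to see in a simplified one-dimensional setting. The sketch below assumes a quadratic loss 0.5 (w − z)² per coordinate, where z is the unregularized optimum (a simplification, not the general case), and compares the exact minimizers under an L1 and an L2 penalty. The function names and values are illustrative.

```python
import numpy as np

def l1_solution(z, alpha):
    """argmin_w 0.5 * (w - z)^2 + alpha * |w|  (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def l2_solution(z, alpha):
    """argmin_w 0.5 * (w - z)^2 + alpha * w^2  (uniform shrinkage)."""
    return z / (1.0 + 2.0 * alpha)

z = np.array([3.0, -0.4, 0.2, -2.0])  # hypothetical unregularized optimum
w_l1 = l1_solution(z, alpha=0.5)
w_l2 = l2_solution(z, alpha=0.5)
```

Under this assumption the L1 penalty sets every coordinate with |z| ≤ α exactly to zero, while the L2 penalty only rescales the coordinates and leaves all of them nonzero.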



DATASET AUGMENTATION

I The best way to regularize is to train with more data.


I The amount of data is often very limited.
I We can create fake data and add it to the training set.
I This technique is most suitable for classification since a classifier summarizes
complicated high-dimensional data with a single category.
I We can generate new (x, y) pairs just by transforming the inputs x in our training set.
I Dataset augmentation has been particularly effective for object recognition.
I create new data by translating the training images a few pixels in each direction,
I or by rotating or scaling the images.
I one should avoid transformations that change the correct class.
E.g. a horizontal flip can make ‘b’ look like ‘d’, and a 180° rotation can make 6 look like 9.
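
Translating images by a few pixels can be sketched with plain NumPy. The helpers `translate` and `augment`, the zero fill for vacated pixels, and the toy 5×5 images are assumptions for illustration; real pipelines typically use an image library.

```python
import numpy as np

def translate(image, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, filling vacated pixels with zeros."""
    h, w = image.shape
    shifted = np.zeros_like(image)
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    top, left = max(0, dy), max(0, dx)
    shifted[top:top + src.shape[0], left:left + src.shape[1]] = src
    return shifted

def augment(images, labels, shifts=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """Return the original (x, y) pairs plus one translated copy per shift; labels are unchanged."""
    new_images = [images] + [
        np.stack([translate(img, dy, dx) for img in images]) for dy, dx in shifts
    ]
    new_labels = [labels] * (len(shifts) + 1)
    return np.concatenate(new_images), np.concatenate(new_labels)

images = np.zeros((2, 5, 5))
images[0, 2, 2] = 1.0  # toy "image" with a single bright pixel
images[1, 1, 3] = 1.0
labels = np.array([0, 1])
aug_images, aug_labels = augment(images, labels)
```

Only the inputs x are transformed; each new example reuses the label of the image it was derived from, which is why class-changing transformations must be excluded.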



DATASET AUGMENTATION



MULTI-TASK LEARNING

I Multi-task learning is a way to improve generalization by pooling the examples from
several tasks.
I For example, multi-task learning in classification aims to improve the performance of
multiple classification tasks by learning them jointly.
I E.g. Classifying spam mail (spam-filter) for different users:
I different users have different distributions of features which distinguish spam emails from
legitimate ones,
I yet there are common features like text related to money transfer.
I Solving each user’s spam classification problem jointly via multi-task learning can let
the solutions inform each other and improve performance.

Example source: Wikipedia



MULTI-TASK LEARNING

Underlying assumption: There exist common factors that explain variations in x, and
each task is associated with a subset of those.

I Every chain starting from x corresponds to a different task.
I h(shared) is a shared representation across tasks.
I Task-specific parameters are associated with weights into and from h(1) and h(2) and are
learned on top of h(shared).
I h(1) and h(2) are specialized representations for the tasks learning y(1) and y(2) respectively.
I h(3) are factors that explain some variations in x but are not relevant to any of the tasks.

[Figure 7.2: network diagram with input x at the bottom, the shared representation h(shared) above it, the representations h(1), h(2), h(3) above that, and the outputs y(1) and y(2) at the top.]

EARLY STOPPING

I When training large models with sufficient capacity to overfit the task, it is often the
case that:
I training set error decreases steadily over time
I validation set error decreases but then starts increasing again.
I A better validation error (and so probably better test error) can be obtained by
returning to the parameter setting at the point in time with lowest validation error.
I Every time the validation set error improves, we store a copy of the model parameters.
I When the training algorithm terminates, we return the parameters with the best validation error,
rather than the latest parameters.
I Training stops when the validation set error has not improved on the best recorded value for a
pre-specified number of evaluations.
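
The procedure above can be sketched for a linear model trained by gradient descent. The function name, the `patience` parameter (the number of evaluations without improvement before stopping), the learning rate, and the data are all assumptions made for this example.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val, lr=0.05, patience=20, max_steps=5000):
    """Gradient descent on a linear model; returns the weights with the lowest validation MSE."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_val, best_step = w.copy(), np.inf, 0
    steps_since_best = 0
    for step in range(1, max_steps + 1):
        grad = 2.0 / len(y_tr) * X_tr.T @ (X_tr @ w - y_tr)
        w = w - lr * grad
        val = float(np.mean((X_val @ w - y_val) ** 2))
        if val < best_val:
            # Validation error improved: store a copy of the parameters.
            best_w, best_val, best_step = w.copy(), val, step
            steps_since_best = 0
        else:
            steps_since_best += 1
            if steps_since_best >= patience:
                break  # no improvement for `patience` evaluations
    return best_w, best_val, best_step, step

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 40))  # hypothetical data: few examples, many features
y = X[:, 0] + 0.5 * rng.standard_normal(60)
best_w, best_val, best_step, last_step = train_with_early_stopping(
    X[:40], y[:40], X[40:], y[40:]
)
```

The function returns the stored best parameters rather than the final ones, so training past the validation minimum costs compute but does not hurt the returned model.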



EARLY STOPPING

[Figure 7.3 (Chapter 7, Regularization for Deep Learning): Learning curves showing how the negative log-likelihood loss changes over time (indicated as number of training iterations over the dataset, or epochs). In this example, we train a maxout network on MNIST. Axes: time (epochs, 0 to 250) against loss (negative log-likelihood, 0.00 to 0.20); curves: training set loss and validation set loss.]

ADVANTAGES AND DISADVANTAGES OF EARLY STOPPING

I This is probably the most commonly used form of regularization in deep learning
I The number of training steps (or training time) is just another hyperparameter.
I Early stopping can be seen as a regularization technique where this hyperparameter is
tuned using the validation set.
I It has a cost: the validation set must be evaluated periodically during training.
I to reduce the computational cost one can use a smaller validation set, or
I evaluate the validation loss less frequently
I Early stopping is an unobtrusive form of regularization:
I it does not change the training procedure, the objective function, or the allowable
parameter values
I it can be easily combined with other forms of regularization.
I Early stopping automatically determines the correct amount of regularization
I while weight decay, for example, requires many training experiments with different values
of its hyperparameter.



ENSEMBLE METHODS

I Ensemble methods are techniques for reducing generalization error by combining
several models.
I train several different models separately
I all the models vote on the output on a test set.
I The strategy of combining models in ML is called model averaging.
I Different ensemble methods construct the ensemble of models in different ways.
I For example, an ensemble can be a set of k regression models.
I In this case it can be proved (p. 249) that, on average, the ensemble performs at least as
well as any of its members;
I if the errors that the members make are independent, the ensemble will perform significantly better
than its members.
I The combined models can be completely different.
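
The claim about k regression models can be made concrete with a short calculation. The notation here is assumed for the illustration: model i makes a zero-mean error $\epsilon_i$ on each example, with variance $\mathbb{E}[\epsilon_i^2] = v$ and covariances $\mathbb{E}[\epsilon_i \epsilon_j] = c$ for $i \neq j$.

```latex
\mathbb{E}\left[\left(\frac{1}{k}\sum_{i=1}^{k}\epsilon_i\right)^{2}\right]
  = \frac{1}{k^{2}}\,\mathbb{E}\left[\sum_{i}\epsilon_i^{2} + \sum_{i}\sum_{j \neq i}\epsilon_i\epsilon_j\right]
  = \frac{1}{k}\,v + \frac{k-1}{k}\,c
```

If the errors are perfectly correlated ($c = v$), the expected squared error of the averaged prediction stays at $v$, so averaging does not help. If they are uncorrelated ($c = 0$), it drops to $v/k$.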



BAGGING

I Bagging (short for bootstrap aggregating) is an ensemble method that allows the
same kind of model, training algorithm and objective function to be reused several
times.
I Bagging involves constructing k different datasets with the same number of examples
as the original dataset.
I Each dataset is constructed by sampling with replacement from the original dataset.
I each new dataset is missing some of the examples in the original dataset (about 1/3 of the
examples, on average; the expected fraction approaches 1/e ≈ 0.37 for large datasets)
I each new dataset contains several duplicate examples.
I Model i is trained on dataset i
I The models’ predictions are then aggregated.
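
The resampling step of bagging can be sketched in a few lines. The helper `bootstrap_datasets` and the dataset size are assumptions for the example; only indices are sampled, which is enough to see how many original examples each resampled dataset omits.

```python
import numpy as np

def bootstrap_datasets(n_examples, k, seed=0):
    """Return k index arrays, each sampled with replacement and of size n_examples."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_examples, size=n_examples) for _ in range(k)]

n, k = 10_000, 5
datasets = bootstrap_datasets(n, k)

# Fraction of original examples missing from each resampled dataset.
missing = [1.0 - np.unique(idx).size / n for idx in datasets]
```

Model i would then be trained on the examples selected by `datasets[i]`, and the k predictions averaged (regression) or put to a vote (classification).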



BAGGING ILLUSTRATION

[Figure 7.5 (Chapter 7, Regularization for Deep Learning), panels: original dataset; first resampled dataset and first ensemble member; second resampled dataset and second ensemble member. A cartoon depiction of how bagging works. Suppose we train an 8 detector on the dataset depicted above, containing an 8, a 6 and a 9. Suppose we make two different resampled datasets. The bagging training procedure is to construct each of these datasets by sampling with replacement. The first dataset omits the 9 and repeats the 8. On this dataset, the detector learns that a loop on top of the digit corresponds to an 8. On the second dataset, we repeat the 9 and omit the 6. In this case, the detector learns that a loop on the bottom of the digit corresponds to an 8. Each of these individual classification rules is brittle, but if we average their output then the detector is robust, achieving maximal confidence only when both loops of the 8 are present.]

I We want to train an 8 detector on a dataset of three digits.
I We make two different resampled datasets based on the original one.
I On the first/second resampled dataset, the detector learns that a loop on the top/bottom of
the digit corresponds to an 8.
I The averaged model is more reliable than the individual ones (will look for both the loops to
detect an 8).

DROPOUT

I Dropout provides an inexpensive approximation to training and evaluating a bagged
ensemble of exponentially many neural networks.
I The idea is to create an ensemble of models by removing a subset of the non-output
units from an underlying base network.



DROPOUT ENSEMBLE

I in a network with 2 input and 2 hidden units
I there are $2^4 = 16$ possible sub-networks (ensemble members)
I In networks with wider layers, the probability of dropping all possible
paths from inputs to outputs becomes smaller.

[Figure 7.6: the base network (inputs x1, x2; hidden units h1, h2; output y) and the ensemble of its 16 subnetworks, each obtained by removing a different subset of the input and hidden units.]

TRAINING WITH DROPOUT

I To train with dropout, we use minibatch SGD.


I Each time we load an example into a minibatch, we randomly sample a different binary
mask µ to apply to all of the input and hidden units in the network (randomly assign 0
or 1 to each)
I The probability of sampling a mask value of one (causing a unit to be included) is a
hyperparameter fixed before training begins.
I Typically, an input unit is included with probability 0.8 and a hidden unit is included with
probability 0.5.
I We then run forward propagation, back-propagation, and the learning update of the
parameters as usual.
I Dropout training consists of minimizing $\mathbb{E}_\mu J(\theta, \mu)$
I The expectation contains exponentially many terms.



FORWARD PROPAGATION WITH DROPOUT

I the network structure (left)


I forward propagation modification (right)
I e.g. $\hat{x}_1 = x_1 \mu_{x_1}$
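
A forward pass with dropout can be sketched for a network with one hidden layer. The ReLU activation, the layer sizes, and the function names are assumptions; the keep probabilities 0.8 (inputs) and 0.5 (hidden units) follow the typical values quoted in the slides.

```python
import numpy as np

def sample_mask(rng, size, keep_prob):
    """Binary mask: each entry is 1 (unit kept) with probability keep_prob, else 0."""
    return (rng.random(size) < keep_prob).astype(float)

def forward_with_dropout(x, W1, b1, W2, b2, rng, keep_input=0.8, keep_hidden=0.5):
    """One forward pass of a one-hidden-layer network with freshly sampled masks."""
    mu_x = sample_mask(rng, x.shape, keep_input)  # a different mask for every example
    x_hat = x * mu_x                              # e.g. x_hat_1 = x_1 * mu_x1
    h = np.maximum(0.0, x_hat @ W1 + b1)          # ReLU hidden layer (an assumption)
    mu_h = sample_mask(rng, h.shape, keep_hidden)
    h_hat = h * mu_h
    return h_hat @ W2 + b2, mu_x, mu_h

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2))  # minibatch of 4 examples, 2 input units
W1, b1 = rng.standard_normal((2, 2)), np.zeros(2)
W2, b2 = rng.standard_normal((2, 1)), np.zeros(1)
y, mu_x, mu_h = forward_with_dropout(x, W1, b1, W2, b2, rng)
```

Multiplying a unit's value by a mask entry of zero removes that unit from the sub-network for the current example, while all sub-networks keep sharing the same weight matrices.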



DROPOUT VS. BAGGING

I In the case of bagging, the models are all independent


I In the case of dropout, the models share parameters.
I In bagging, each model is trained to convergence on its respective training set.
I In dropout, a tiny fraction of the possible sub-networks are each trained for a single step.
I In both, the training set encountered by each sub-network is a subset of the original
training set sampled with replacement.



ENSEMBLE PREDICTION

I To make a prediction, an ensemble averages the predictions of all of its members.


I In bagging, the prediction of the ensemble is given by

$$\frac{1}{k} \sum_{i=1}^{k} p^{(i)}(y \mid x)$$

I In dropout, the arithmetic mean is weighted by the mask probability $p(\mu)$:

$$\sum_{\mu} p(\mu)\, p(y \mid x, \mu)$$

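I This sum has exponentially many terms, so it is evaluated after training, usually empirically, by
sampling a set of µ’s and averaging the predictions of the corresponding models.
I If the geometric mean of the members’ predictions is used instead, we require that no sub-model
assigns probability zero to any event, and we renormalize the result.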
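
For a very small model the weighted sum over masks can be computed exactly and compared with the sampling approximation. The toy model below (a linear softmax classifier with masked inputs), its sizes, and the keep probability are assumptions chosen so that all 2³ masks can be enumerated.

```python
import itertools
import numpy as np

def predict(x, W, mask):
    """Softmax prediction p(y | x, mu) of a linear classifier with masked inputs (a toy model)."""
    logits = (x * mask) @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(3)       # hypothetical input with 3 units
W = rng.standard_normal((3, 2))  # hypothetical weights, 2 classes
keep_prob = 0.8

# Exact weighted sum over all 2^3 masks: sum_mu p(mu) * p(y | x, mu).
exact = np.zeros(2)
for bits in itertools.product([0.0, 1.0], repeat=3):
    mask = np.array(bits)
    p_mask = np.prod(np.where(mask == 1.0, keep_prob, 1.0 - keep_prob))
    exact += p_mask * predict(x, W, mask)

# Monte Carlo estimate: sample masks and average the corresponding predictions.
masks = (rng.random((20_000, 3)) < keep_prob).astype(float)
estimate = np.mean([predict(x, W, m) for m in masks], axis=0)
```

In a real network the number of masks is far too large to enumerate, so only the sampled average is available in practice.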

