Generalization Error

Outline
- Generalization Error
- Improving Generalization
  - Inductive Biases
  - Regularization
  - Data Augmentation
  - Early Stopping
  - Bagging (Bootstrap Aggregating)
  - Boosting
Until now, we focused on parameter estimation, i.e. finding the best parameters that
explain a given set of data.
Today, we remind ourselves that the goal of supervised learning (and, more generally, of machine learning) is not to find a model that performs well at explaining the given data, but a model that performs well on average over unseen data.
Generalization Error
In regression we want to minimize the generalization error

$L(\theta) = \mathbb{E}_{x \sim p(x)}\big[(f(x) - f_\theta(x))^2\big],$

where $p(x)$ is the distribution of the inputs, $f$ is the target function, and $f_\theta$ is the parametric model. Our goal is to find the set of parameters that minimizes the generalization error, i.e.,

$\theta^* = \arg\min_\theta L(\theta).$

Unfortunately, we don't have access to $p(x)$ and $f$, and we must rely on a finite amount of samples $D = \{(x_i, y_i)\}_{i=1}^N$. All we can do is use those samples to estimate $\theta^*$, for example, by using a maximum likelihood estimator,

$\hat{\theta} = \arg\max_\theta p(D \mid \theta).$
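As an illustration (not from the original slides; the target function, noise level, and model degree are assumptions made here), the following NumPy sketch fits a polynomial model to a finite, noisy sample. Under Gaussian noise, the least-squares fit coincides with the maximum likelihood estimate of the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed target function and a finite, noisy sample drawn from it.
f = np.sin
x = rng.uniform(-3, 3, size=30)
y = f(x) + rng.normal(scale=0.2, size=x.shape)

# Least-squares fit of a degree-3 polynomial; with Gaussian noise this
# is also the maximum likelihood estimate of the parameters.
theta_hat = np.polyfit(x, y, deg=3)
print("estimated parameters:", theta_hat)
```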
Sources of Errors
- Projection Error: the target function is not representable with the parametric model. The projection error forms an irreducible bias, which cannot be mitigated by using infinite samples.
- Finite Samples: the use of finite samples gives only partial information about the target function, resulting in estimation variance, since different datasets produce different models.
- Noise: the relation between input and output is affected by noise. This noise produces an irreducible error, even in the presence of infinite samples and zero projection error.
Expected Error
We can measure the performance of an estimator by computing its expected mean squared error,

$\mathbb{E}\big[(y - f_{\hat{\theta}(D)}(x))^2\big] = \big(f(x) - \mathbb{E}_D[f_{\hat{\theta}(D)}(x)]\big)^2 + \mathbb{E}_D\Big[\big(f_{\hat{\theta}(D)}(x) - \mathbb{E}_D[f_{\hat{\theta}(D)}(x)]\big)^2\Big] + \sigma^2,$

where $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, and $\hat{\theta}(D)$ is the estimate obtained from the dataset $D$. The lower the expected mean squared error is, the better our estimator is. The expected error can be decomposed in three terms: the (squared) bias, the variance, and the irreducible noise.
A high number of parameters defines a large set of models. Therefore, a high number of parameters often results in a high variance of the estimator.
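The following sketch (same illustrative assumptions as above) estimates the bias and variance terms at a single test point by repeatedly drawing new datasets and refitting the model.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin            # assumed target function
sigma = 0.2           # assumed noise level
x_test = 1.0          # input at which we evaluate bias and variance

preds = []
for _ in range(500):                         # many independent datasets
    x = rng.uniform(-3, 3, size=30)
    y = f(x) + rng.normal(scale=sigma, size=x.shape)
    theta = np.polyfit(x, y, deg=3)          # re-estimate the model each time
    preds.append(np.polyval(theta, x_test))

preds = np.array(preds)
bias2 = (f(x_test) - preds.mean()) ** 2      # squared bias at x_test
variance = preds.var()                       # estimator variance at x_test
print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}, noise = {sigma**2:.4f}")
```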
Samples and Model Complexity
The ideal situation is when we have a large set of samples (close to infinite) and a large set of parameters (high model complexity). In this situation, we can contain both the variance and the bias. (Note that this is the current trend in deep learning.)
When samples are scarce, however, it is often preferable to use small models (few parameters, low model complexity) to contain the variance. E.g., if we know that the target function is quadratic, why use a neural network?
When the model complexity is low, the model may underfit, meaning that it does not fit the training data very well.
Hyper-parameters
In parametric machine learning we aim to find the optimal parameters of a model to obtain the smallest expected error.
However, there are many parameters that, classically, are not optimized during training, such as the model complexity, the neural network structure, the regularization factor, and so on. Such parameters are called hyper-parameters and, as we already saw, they also have an influence on the model's performance.
A common way to choose the hyper-parameters is to estimate the expected error and select the set of hyper-parameters that optimizes this (estimated) expected error.
Statistical Bootstrapping: A Variance Estimator
Until now, we assumed we were performing only one estimate based on the data. The
estimate was therefore composed of a single model, making it difficult to understand how
reliable such a model is.
A central idea in the next slides is to perform many estimations with the given data and measure their variance.
Bootstrapping
Bootstrapping consists of repeatedly resampling the dataset with replacement and re-estimating the model on each resample. It can be used in machine learning for estimating the variance of the estimate. In the case of supervised machine learning, bootstrapping yields a pointwise estimate of the variance, which allows us to see in which areas of the input space the estimator is more (or less) reliable.
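A minimal NumPy sketch of this idea (dataset and model are illustrative assumptions): resample the dataset with replacement, refit the model on each resample, and inspect the spread of the predictions across the input space.

```python
import numpy as np

rng = np.random.default_rng(0)

# One fixed dataset (assumed target sin(x) plus Gaussian noise).
x = rng.uniform(-3, 3, size=40)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)

x_grid = np.linspace(-3, 3, 100)
preds = []
for _ in range(200):
    # Bootstrap resample: draw N indices with replacement.
    idx = rng.integers(0, len(x), size=len(x))
    theta = np.polyfit(x[idx], y[idx], deg=3)
    preds.append(np.polyval(theta, x_grid))

# Pointwise standard deviation: where is the estimate (un)reliable?
std = np.array(preds).std(axis=0)
print("least reliable around x =", x_grid[np.argmax(std)])
```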
Model Validation
The model variance, however, is only half of the story. Eventually, we want to estimate the
generalization error.
One thing we can do is divide the dataset into two sets: the training set and the validation set. The training set is used for training the model, while the validation set is used for estimating the generalization error.
Definition
The validation error is an estimator of the generalization error.
Deciding what percentage of the data should be allocated to the training set and to the validation set is a trade-off.
Having a large training set will produce models that are close to the one that could be obtained with the full dataset, but, due to the limited size of the validation set, we will obtain an unreliable (high variance) estimate of the generalization error.
Having a small training set allows us to obtain a lower-variance estimate of the generalization error, but it introduces a high bias, since the estimated models will differ considerably from the one obtained with the whole dataset.
Leave-one-out
However, we can take inspiration from bootstrapping to make better use of the samples: we can divide the dataset into training and validation sets multiple times and perform many estimations.
We can obtain the most accurate estimation of the generalization error by performing $N$ estimates (where $N$ is the number of samples). For each estimate, we use $N-1$ samples for the training set and leave out 1 sample for validation. The total validation error is the average of the individual validation errors.
This way, the estimated generalization error has low bias: each time we use almost the whole dataset for training, and, nevertheless, we also use the whole dataset for estimating the generalization error (since each sample forms a validation set exactly once).
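A minimal NumPy sketch of leave-one-out validation (dataset and model are again illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=25)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)

errors = []
for i in range(len(x)):                     # N estimates, one per sample
    mask = np.arange(len(x)) != i           # train on all samples but one
    theta = np.polyfit(x[mask], y[mask], deg=3)
    errors.append((y[i] - np.polyval(theta, x[i])) ** 2)

print("leave-one-out estimate of the generalization error:", np.mean(errors))
```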
$K$-Fold Cross Validation
The leave-one-out method, however, is computationally very expensive. In practice, most of the time it is enough to perform only $K$ estimates (e.g., $K = 5$ or $K = 10$), where each time $N - N/K$ samples form the training set and $N/K$ samples form the validation set. This technique is called $K$-fold cross validation.
Since the validation error is used to select the hyper-parameters, it becomes an optimistically biased estimate of the generalization error. For this reason, it is essential to always keep a portion of the given data for the test set. Such a set can be used only once, after developing the model, to estimate its performance. To avoid overfitting the test set, we can never reuse it.
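The sketch below (illustrative assumptions as before) uses 5-fold cross validation to choose a hyper-parameter, here the polynomial degree, and keeps a test set aside for a single final evaluation. With $K = N$ the same code reduces to leave-one-out.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=60)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)

# Keep a test set aside; it is used only once, at the very end.
x_train, x_test = x[:50], x[50:]
y_train, y_test = y[:50], y[50:]

def cv_error(deg, K=5):
    """K-fold cross-validation error of a polynomial model of degree deg."""
    folds = np.array_split(np.arange(len(x_train)), K)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(x_train)), val_idx)
        theta = np.polyfit(x_train[train_idx], y_train[train_idx], deg)
        errs.append(np.mean((y_train[val_idx] - np.polyval(theta, x_train[val_idx])) ** 2))
    return np.mean(errs)

# Hyper-parameter selection: pick the degree with the lowest validation error.
degrees = [1, 2, 3, 5, 9]
best_deg = min(degrees, key=cv_error)

# Final model trained on all training data, evaluated once on the test set.
theta = np.polyfit(x_train, y_train, best_deg)
test_mse = np.mean((y_test - np.polyval(theta, x_test)) ** 2)
print(f"best degree: {best_deg}, test MSE: {test_mse:.4f}")
```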
Inductive Biases
An inductive bias is a set of assumptions that reduce the model complexity without
resulting in a projection error that is too high.
An example of an inductive bias is the use of convolutional neural networks (CNNs) for
image processing. CNNs assume that the input data has spatial structure and that local
features are more important than global ones. They also assume that the same features can
appear in different locations of the image, which leads to translation invariance.
Convolutional layers are less "powerful" than fully connected ones, but if the assumptions hold, they do not increase the projection error.
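As a rough illustration (a PyTorch sketch with an assumed 3x32x32 input and 16 output channels/units; the numbers are not from the original slides), compare the parameter count of a convolutional layer with that of a fully connected layer producing an output of the same size:

```python
import torch.nn as nn

# Convolutional layer: 16 filters of size 3x3 over 3 input channels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# Fully connected layer mapping the flattened 3x32x32 input to a 16x32x32 output.
fc = nn.Linear(in_features=3 * 32 * 32, out_features=16 * 32 * 32)

count = lambda m: sum(p.numel() for p in m.parameters())
print("conv parameters:", count(conv))   # 16*3*3*3 + 16 = 448
print("fc parameters:  ", count(fc))     # 3072*16384 + 16384, about 50 million
```

The drastic reduction in parameters is exactly the complexity reduction bought by the inductive bias, and it costs nothing in projection error as long as the locality and translation-invariance assumptions hold.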
Regularization
Regularization penalizes models that are considered a priori "not probable".
Regularization can be seen as a soft form of model complexity reduction: it reduces the estimator variance but increases the bias.
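A classic example is L2 (ridge) regularization, which penalizes large weights. A minimal NumPy sketch (the features and the regularization factor are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=30)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)

# Polynomial features of degree 9: a deliberately over-complex model.
X = np.vander(x, 10)
lam = 1e-2  # regularization factor (a hyper-parameter)

# Ridge solution: minimizes ||y - X theta||^2 + lam * ||theta||^2.
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print("ridge weights:", theta_ridge)
```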
Data Augmentation
To compensate for the lack of data and reduce overfitting, one can produce synthetic data. As we saw, a higher number of samples reduces the estimator variance, preventing overfitting. However, introducing synthetic data might increase the estimator bias, since the new samples might not reflect the true target function.
Data augmentation is frequently used in computer vision. Typical examples include horizontal flips, small shifts and rotations, and the addition of noise, which change the input while preserving the label.
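A minimal NumPy sketch of such augmentations applied to an image array (image size and transformation parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # stand-in for a 32x32 RGB training image

def augment(img):
    """Return a randomly augmented copy of img; the label is assumed unchanged."""
    out = img.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                             # horizontal flip
    out = np.roll(out, rng.integers(-2, 3), axis=1)       # small horizontal shift
    out = out + rng.normal(scale=0.02, size=out.shape)    # small pixel noise
    return np.clip(out, 0.0, 1.0)

augmented = [augment(image) for _ in range(8)]            # 8 synthetic variants
```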
Boosting
Boosting is an ensemble learning method that combines a set of weak learners into a strong learner to minimize training errors. Boosting aims to reduce the bias.
Boosting works by iteratively adding new weak learners that try to correct the mistakes of the previous ones. The final prediction is a weighted combination of the weak learners. This is the main difference from bagging, which instead trains the models independently of each other.
One well-known boosting algorithm is AdaBoost:
(1) Draw a random sample of the data, initially giving all examples equal weight, and train a weak learner on it.
(2) Assign higher weights to the misclassified examples and lower weights to the correctly classified ones.
(3) Draw another random sample of data according to the updated weights and train another weak learner on it.
(4) Repeat steps 2 and 3 until a predefined number of weak learners are obtained or no
further improvement is possible.
(5) Combine the weak learners by giving more weight to those with lower error rates and
less weight to those with higher error rates.
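A compact NumPy sketch of AdaBoost with decision stumps on synthetic 1-D data (all data and names are illustrative; it uses the reweighting form of the algorithm rather than the resampling described in steps (1)-(3), which is equivalent in spirit):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)          # 1-D inputs
y = np.where(np.abs(X) < 0.5, 1, -1)      # labels in {-1, +1}

def fit_stump(X, y, w):
    """Weighted decision stump: threshold and sign minimizing the weighted error."""
    best = None
    for thr in np.unique(X):
        for sign in (1, -1):
            pred = np.where(X > thr, sign, -sign)
            err = np.sum(w[pred != y])
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

w = np.full(len(X), 1 / len(X))            # (1) all examples weighted equally
stumps, alphas = [], []
for _ in range(20):                        # predefined number of weak learners
    err, thr, sign = fit_stump(X, y, w)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # (5) low error -> high weight
    pred = np.where(X > thr, sign, -sign)
    w *= np.exp(-alpha * y * pred)         # (2) up-weight the misclassified examples
    w /= w.sum()
    stumps.append((thr, sign))
    alphas.append(alpha)

def predict(x):
    """(5) Strong learner: weighted vote of the weak learners."""
    votes = sum(a * np.where(x > thr, s, -s) for a, (thr, s) in zip(alphas, stumps))
    return np.sign(votes)

print("training accuracy:", np.mean(predict(X) == y))
```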
Summary
According to classical machine learning:

Technique                        Bias            Variance
Model Complexity ↑               ↓               ↑
Regularization ↑                 ↑               ↓
# Samples ↑                      –               ↓
Bagging                          –               ↓
Boosting                         ↓               –
Early Stopping (↓ # Epochs)      ↑               ↓
Data Augmentation                ↑ (possibly)    ↓
Double Descent
Double descent is a phenomenon observed in machine learning where the test error
first decreases, then increases, and then decreases again as the model complexity
increases.
Double descent contradicts the classical bias-variance trade-off, which predicts that
the test error should monotonically increase after reaching a minimum at the
optimal model complexity.
Double descent has been empirically demonstrated for various types of models, but it
is not yet fully understood.
When the number of parameters is larger than the number of samples, there are many possible models that fit the data. Among those, there are some that have low variance, since they are similar for different training sets. It remains an open question why, among all possible models, many ML techniques naturally tend towards the ones with low variance.