Generalization Error

Outline
- Generalization Error
- Improving Generalization
  - Inductive Biases
  - Regularization
  - Data Augmentation
  - Early Stopping
  - Bagging (Bootstrap Aggregating)
  - Boosting
Until now, we focused on parameter estimation, i.e. finding the best parameters that
explain a given set of data.
Today, we remind ourselves that the goal of supervised learning (and, more generally, of machine learning) is not to find a model that performs well at explaining the given data, but a model that performs well on average over unseen data.
Generalization Error
In regression we want to minimize the generalization error

$L(\theta) = \mathbb{E}_{x \sim p(x)}\big[(f(x) - f_\theta(x))^2\big],$

where $p(x)$ is the distribution of the inputs, $f$ is the target function, and $f_\theta$ is the parametric model. Our goal is to find the set of parameters that minimizes the generalization error, i.e.,

$\theta^* = \arg\min_\theta L(\theta).$

Unfortunately, we don't have access to $p(x)$ and $f$, and we must rely on a finite amount of samples $D = \{(x_i, y_i)\}_{i=1}^N$. All we can do is use those samples to estimate $\theta^*$, for example, by using a maximum likelihood estimator,

$\hat{\theta} = \arg\max_\theta p(D \mid \theta).$
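As an illustration (not from the original slides; the target function, noise level, and model degree are assumptions made here), the following NumPy sketch fits a polynomial model to a finite, noisy sample. Under Gaussian noise, the least-squares fit coincides with the maximum likelihood estimate of the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed target function and a finite, noisy sample drawn from it.
f = np.sin
x = rng.uniform(-3, 3, size=30)
y = f(x) + rng.normal(scale=0.2, size=x.shape)

# Least-squares fit of a degree-3 polynomial; with Gaussian noise this
# is also the maximum likelihood estimate of the parameters.
theta_hat = np.polyfit(x, y, deg=3)
print("estimated parameters:", theta_hat)
```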
Sources of Errors
- Projection Error: the target function is not representable with the parametric model. The projection error forms an irreducible bias, which cannot be mitigated by using infinite samples.
- Finite Samples: the use of finite samples gives only partial information about the target function, resulting in estimation variance, since different datasets produce different models.
- Noise: the relation between input and output is affected by noise. This noise produces an irreducible error, even in the presence of infinite samples and zero projection error.
Expected Error
We can measure the performance of an estimator by computing its expected mean squared error,

$\mathbb{E}\big[(y - f_{\hat{\theta}(D)}(x))^2\big] = \big(f(x) - \mathbb{E}_D[f_{\hat{\theta}(D)}(x)]\big)^2 + \mathbb{E}_D\Big[\big(f_{\hat{\theta}(D)}(x) - \mathbb{E}_D[f_{\hat{\theta}(D)}(x)]\big)^2\Big] + \sigma^2,$

where $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, and $\hat{\theta}(D)$ is the estimate obtained from the dataset $D$. The lower the expected mean squared error is, the better our estimator is. The expected error can be decomposed in three terms: the (squared) bias, the variance, and the irreducible noise.
A high number of parameters defines a large set of models. Therefore, a high number of parameters often results in a high variance of the estimator.
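The following sketch (same illustrative assumptions as above) estimates the bias and variance terms at a single test point by repeatedly drawing new datasets and refitting the model.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin            # assumed target function
sigma = 0.2           # assumed noise level
x_test = 1.0          # input at which we evaluate bias and variance

preds = []
for _ in range(500):                         # many independent datasets
    x = rng.uniform(-3, 3, size=30)
    y = f(x) + rng.normal(scale=sigma, size=x.shape)
    theta = np.polyfit(x, y, deg=3)          # re-estimate the model each time
    preds.append(np.polyval(theta, x_test))

preds = np.array(preds)
bias2 = (f(x_test) - preds.mean()) ** 2      # squared bias at x_test
variance = preds.var()                       # estimator variance at x_test
print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}, noise = {sigma**2:.4f}")
```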
Samples and Model Complexity
The ideal situation is when we have a large set of samples (close to infinite) and a large set of parameters (high model complexity). In this situation, we can contain both the variance and the bias. (Note that this is the current trend in deep learning.)
When samples are scarce, however, it is often preferable to use small models (few parameters, low model complexity) to contain the variance. E.g., if we know that the target function is quadratic, why use a neural network?
When the model complexity is low, the model may underfit, meaning that it does not fit the training data very well.
Hyper-parameters
In parametric machine learning we aim to find the optimal parameters of a model to obtain the smallest expected error.
However, there are many parameters that, classically, are not optimized during training, such as the model complexity, the neural network structure, the regularization factor, and so on. Such parameters are called hyper-parameters and, as we already saw, they also have an influence on the model's performance.
A common way to choose the hyper-parameters is to estimate the expected error and select the set of hyper-parameters that optimizes this (estimated) expected error.
Statistical Bootstrapping: A Variance Estimator
Until now, we assumed we were performing only one estimate based on the data. The
estimate was therefore composed of a single model, making it difficult to understand how
reliable such a model is.
A central idea in the next slides is to perform many estimations with the given data and measure their variance.
Bootstrapping
Bootstrapping consists of repeatedly resampling the dataset with replacement and re-estimating the model on each resample. It can be used in machine learning for estimating the variance of the estimate. In the case of supervised machine learning, bootstrapping yields a pointwise estimate of the variance, which allows us to see in which areas of the input space the estimator is more (or less) reliable.
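A minimal NumPy sketch of this idea (dataset and model are illustrative assumptions): resample the dataset with replacement, refit the model on each resample, and inspect the spread of the predictions across the input space.

```python
import numpy as np

rng = np.random.default_rng(0)

# One fixed dataset (assumed target sin(x) plus Gaussian noise).
x = rng.uniform(-3, 3, size=40)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)

x_grid = np.linspace(-3, 3, 100)
preds = []
for _ in range(200):
    # Bootstrap resample: draw N indices with replacement.
    idx = rng.integers(0, len(x), size=len(x))
    theta = np.polyfit(x[idx], y[idx], deg=3)
    preds.append(np.polyval(theta, x_grid))

# Pointwise standard deviation: where is the estimate (un)reliable?
std = np.array(preds).std(axis=0)
print("least reliable around x =", x_grid[np.argmax(std)])
```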
Model Validation
The model variance, however, is only half of the story. Eventually, we want to estimate the
generalization error.
One thing we can do is divide the dataset into two sets: the training set and the validation set. The training set is used for training the model, while the validation set is used for estimating the generalization error.
Definition
The validation error is an estimator of the generalization error.
Deciding what percentage of the data should be allocated to the training set and to the validation set is a trade-off.
Having a large training set will produce models that are close to the one that could be obtained with the full dataset, but, due to the limited size of the validation set, we will obtain an unreliable (high variance) estimate of the generalization error.
Having a small training set allows us to obtain a lower-variance estimate of the generalization error, but it introduces a high bias, since the estimated models will differ considerably from the one obtained with the whole dataset.
Leave-one-out
However, we can take inspiration from bootstrapping to make better use of the samples: we can divide the dataset into training and validation sets multiple times and perform many estimations.
We can obtain the most accurate estimation of the generalization error by performing $N$ estimates (where $N$ is the number of samples). For each estimate, we use $N-1$ samples for the training set and leave out 1 sample for validation. The total validation error is the average of the individual validation errors.
This way, the estimated generalization error has low bias: each time we use almost the whole dataset for training, and, nevertheless, we also use the whole dataset for estimating the generalization error (since each sample forms a validation set exactly once).
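A minimal NumPy sketch of leave-one-out validation (dataset and model are again illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=25)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)

errors = []
for i in range(len(x)):                     # N estimates, one per sample
    mask = np.arange(len(x)) != i           # train on all samples but one
    theta = np.polyfit(x[mask], y[mask], deg=3)
    errors.append((y[i] - np.polyval(theta, x[i])) ** 2)

print("leave-one-out estimate of the generalization error:", np.mean(errors))
```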
$K$-Fold Cross Validation
The leave-one-out method, however, is computationally very expensive. In practice, most of the time it is enough to perform only $K$ estimates (e.g., $K = 5$ or $K = 10$), where each time $N - N/K$ samples form the training set and $N/K$ samples form the validation set. This technique is called $K$-fold cross validation.
Since the validation error is used to select the hyper-parameters, it becomes an optimistically biased estimate of the generalization error. For this reason, it is essential to always keep a portion of the given data for the test set. Such a set can be used only once, after developing the model, to estimate its performance. To avoid overfitting the test set, we can never reuse it.
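The sketch below (illustrative assumptions as before) uses 5-fold cross validation to choose a hyper-parameter, here the polynomial degree, and keeps a test set aside for a single final evaluation. With $K = N$ the same code reduces to leave-one-out.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=60)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)

# Keep a test set aside; it is used only once, at the very end.
x_train, x_test = x[:50], x[50:]
y_train, y_test = y[:50], y[50:]

def cv_error(deg, K=5):
    """K-fold cross-validation error of a polynomial model of degree deg."""
    folds = np.array_split(np.arange(len(x_train)), K)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(x_train)), val_idx)
        theta = np.polyfit(x_train[train_idx], y_train[train_idx], deg)
        errs.append(np.mean((y_train[val_idx] - np.polyval(theta, x_train[val_idx])) ** 2))
    return np.mean(errs)

# Hyper-parameter selection: pick the degree with the lowest validation error.
degrees = [1, 2, 3, 5, 9]
best_deg = min(degrees, key=cv_error)

# Final model trained on all training data, evaluated once on the test set.
theta = np.polyfit(x_train, y_train, best_deg)
test_mse = np.mean((y_test - np.polyval(theta, x_test)) ** 2)
print(f"best degree: {best_deg}, test MSE: {test_mse:.4f}")
```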
Inductive Biases
An inductive bias is a set of assumptions that reduce the model complexity without
resulting in a projection error that is too high.
An example of an inductive bias is the use of convolutional neural networks (CNNs) for
image processing. CNNs assume that the input data has spatial structure and that local
features are more important than global ones. They also assume that the same features can
appear in different locations of the image, which leads to translation invariance.
Convolutional layers are less "powerful" than fully connected ones, but if the assumptions hold, they do not increase the projection error.
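As a rough illustration (a PyTorch sketch with an assumed 3x32x32 input and 16 output channels/units; the numbers are not from the original slides), compare the parameter count of a convolutional layer with that of a fully connected layer producing an output of the same size:

```python
import torch.nn as nn

# Convolutional layer: 16 filters of size 3x3 over 3 input channels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# Fully connected layer mapping the flattened 3x32x32 input to a 16x32x32 output.
fc = nn.Linear(in_features=3 * 32 * 32, out_features=16 * 32 * 32)

count = lambda m: sum(p.numel() for p in m.parameters())
print("conv parameters:", count(conv))   # 16*3*3*3 + 16 = 448
print("fc parameters:  ", count(fc))     # 3072*16384 + 16384, about 50 million
```

The drastic reduction in parameters is exactly the complexity reduction bought by the inductive bias, and it costs nothing in projection error as long as the locality and translation-invariance assumptions hold.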
Regularization
Regularization penalizes models that are considered a priori "not probable".
Regularization can be seen as a soft form of model complexity reduction: it reduces the estimator variance but increases the bias.
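A classic example is L2 (ridge) regularization, which penalizes large weights. A minimal NumPy sketch (the features and the regularization factor are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=30)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)

# Polynomial features of degree 9: a deliberately over-complex model.
X = np.vander(x, 10)
lam = 1e-2  # regularization factor (a hyper-parameter)

# Ridge solution: minimizes ||y - X theta||^2 + lam * ||theta||^2.
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print("ridge weights:", theta_ridge)
```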
Data Augmentation
To compensate for the lack of data and reduce overfitting, one can produce synthetic data. As we saw, a higher number of samples reduces the estimator variance, preventing overfitting. However, introducing synthetic data might increase the estimator bias, since the new samples might not reflect the true target function.
Data augmentation is frequently used in computer vision. Typical examples include horizontal flips, small shifts and rotations, and the addition of noise, which change the input while preserving the label.
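A minimal NumPy sketch of such augmentations applied to an image array (image size and transformation parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # stand-in for a 32x32 RGB training image

def augment(img):
    """Return a randomly augmented copy of img; the label is assumed unchanged."""
    out = img.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                             # horizontal flip
    out = np.roll(out, rng.integers(-2, 3), axis=1)       # small horizontal shift
    out = out + rng.normal(scale=0.02, size=out.shape)    # small pixel noise
    return np.clip(out, 0.0, 1.0)

augmented = [augment(image) for _ in range(8)]            # 8 synthetic variants
```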
Boosting
Boosting is an ensemble learning method that combines a set of weak learners into a strong learner to minimize training errors. Boosting aims to reduce the bias.
Boosting works by iteratively adding new weak learners that try to correct the mistakes of the previous ones. The final prediction is a weighted combination of the weak learners. This is the main difference from bagging, which instead trains the models independently of each other.
One well-known boosting algorithm is AdaBoost:
(1) Draw a random sample of the data, initially giving all examples equal weight, and train a weak learner on it.
(2) Assign higher weights to the misclassified examples and lower weights to the correctly classified ones.
(3) Draw another random sample of data according to the updated weights and train another weak learner on it.
(4) Repeat steps 2 and 3 until a predefined number of weak learners are obtained or no
further improvement is possible.
(5) Combine the weak learners by giving more weight to those with lower error rates and
less weight to those with higher error rates.
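A compact NumPy sketch of AdaBoost with decision stumps on synthetic 1-D data (all data and names are illustrative; it uses the reweighting form of the algorithm rather than the resampling described in steps (1)-(3), which is equivalent in spirit):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)          # 1-D inputs
y = np.where(np.abs(X) < 0.5, 1, -1)      # labels in {-1, +1}

def fit_stump(X, y, w):
    """Weighted decision stump: threshold and sign minimizing the weighted error."""
    best = None
    for thr in np.unique(X):
        for sign in (1, -1):
            pred = np.where(X > thr, sign, -sign)
            err = np.sum(w[pred != y])
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

w = np.full(len(X), 1 / len(X))            # (1) all examples weighted equally
stumps, alphas = [], []
for _ in range(20):                        # predefined number of weak learners
    err, thr, sign = fit_stump(X, y, w)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # (5) low error -> high weight
    pred = np.where(X > thr, sign, -sign)
    w *= np.exp(-alpha * y * pred)         # (2) up-weight the misclassified examples
    w /= w.sum()
    stumps.append((thr, sign))
    alphas.append(alpha)

def predict(x):
    """(5) Strong learner: weighted vote of the weak learners."""
    votes = sum(a * np.where(x > thr, s, -s) for a, (thr, s) in zip(alphas, stumps))
    return np.sign(votes)

print("training accuracy:", np.mean(predict(X) == y))
```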
Summary
According to classical machine learning:

Technique                        Bias            Variance
Model Complexity ↑               ↓               ↑
Regularization ↑                 ↑               ↓
# Samples ↑                      –               ↓
Bagging                          –               ↓
Boosting                         ↓               –
Early Stopping (↓ # Epochs)      ↑               ↓
Data Augmentation                ↑ (possibly)    ↓
Double Descent
Double descent is a phenomenon observed in machine learning where the test error
first decreases, then increases, and then decreases again as the model complexity
increases.
Double descent contradicts the classical bias-variance trade-off, which predicts that
the test error should monotonically increase after reaching a minimum at the
optimal model complexity.
Double descent has been empirically demonstrated for various types of models, but it
is not yet fully understood.
When the number of parameters is larger than the number of samples, there are many possible models that fit the data. Among those, there are some that have low variance, since they are similar for different training sets. It remains an open question why, among all possible models, many ML techniques naturally tend towards the ones with low variance.