2. Supervised Learning: Errors
Supervised Learning
Errors and other concepts
BITS Pilani Dr Arindam Roy
Pilani Campus
The Supervised Learning Problem
Starting point:
• Vector of p predictor measurements X (also called inputs, regressors, covariates, features, independent
variables).
• Outcome measurement Y (also called the dependent variable, response, target).
• In the classification problem, Y takes values in a finite, unordered set (survived/died, digit 0–9, cancer class of a tissue sample).
• We have training data (x1, y1), . . . ,(xN , yN ). These are observations (examples, instances) of these
measurements.
• In unsupervised learning, by contrast, the objective is more fuzzy: find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation.
• It is different from supervised learning, but can be useful as a pre-processing step for supervised learning.
Shown are Sales vs TV, Radio and Newspaper, with a blue linear-regression line fit
separately to each.
Can we predict Sales using these three?
Perhaps we can do better using a model: Sales ≈ f (TV, Radio, Newspaper)
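As a quick sketch of what such a model could look like, here is a least-squares fit of a linear f on synthetic advertising-style data. The budgets, coefficients, and noise level below are made up purely for illustration; the real Advertising data set is not used here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Advertising data: budgets and resulting sales.
n = 200
TV = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)
# Assumed "true" relationship plus noise (coefficients are illustrative only).
sales = 3.0 + 0.045 * TV + 0.19 * radio + 0.0 * newspaper + rng.normal(0, 1.5, n)

# Approximate f(TV, radio, newspaper) with a linear model via least squares.
X = np.column_stack([np.ones(n), TV, radio, newspaper])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
pred = X @ beta
print("coefficients:", np.round(beta, 3))
print("training RMSE:", round(float(np.sqrt(np.mean((sales - pred) ** 2))), 3))
```

With enough data the recovered coefficients land close to the ones used to generate the data, and the residual error reflects the noise that no linear f can remove.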
• In the simplest terms, the prediction task is: how can we design f̂(X) so that its output comes very close to the value of Y for the given values of X?
Y = f(X) + ε
• Our estimator function is f̂ and our predicted value is Ŷ = f̂(X).
• Now that we have the actual value Y and the predicted value Ŷ, the difference between Y and Ŷ is the prediction error. It has two components:
– the difference between f(X) and f̂(X), which we term the Reducible Error;
– ε, which we call the Irreducible Error.
• Given enough time and flexibility, we know we can find a good enough estimator and bring the Reducible Error close to 0.
• So reducible errors are those errors we have control over: we can reduce them by choosing better models.
• It can be proven mathematically that the average squared prediction error decomposes into these two parts:
E[(Y − f̂(X))²] = [f(X) − f̂(X)]² + Var(ε) = Reducible Error + Irreducible Error
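The decomposition can be checked numerically. The sketch below assumes a toy truth f(x) = sin(x) and a deliberately imperfect, fixed estimator f̂(x) = 0.9 sin(x), both chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                      # the unknown true regression function
    return np.sin(x)

def f_hat(x):                  # a fixed, slightly-off estimator f̂
    return 0.9 * np.sin(x)

x0 = 1.0                       # evaluate the decomposition at X = x0
sigma = 0.5                    # standard deviation of the irreducible noise ε
eps = rng.normal(0, sigma, 1_000_000)
y = f(x0) + eps                # many draws of Y at X = x0

mse = np.mean((y - f_hat(x0)) ** 2)       # E[(Y − f̂(x0))²], estimated
reducible = (f(x0) - f_hat(x0)) ** 2      # [f(x0) − f̂(x0)]²
irreducible = sigma ** 2                  # Var(ε)
print(round(float(mse), 4), round(reducible + irreducible, 4))
```

The two printed numbers agree to within Monte Carlo error, confirming that the average squared prediction error splits cleanly into the reducible and irreducible parts.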
• In machine learning, the irreducible error, often called noise, is the component of
the error that cannot be reduced no matter how good our model is. This error
comes from the inherent randomness and variability in the data that we cannot
model or predict.
• Since irreducible error arises from these inherent aspects of the data and the system being modeled, it cannot be reduced by improving the model or learning algorithm. However, understanding the domain and managing the sources of irreducible error is still crucial:
– Improve data quality: Reducing measurement noise and improving data collection processes can help minimize the
contribution of measurement errors.
– Data augmentation: Adding more relevant data can help ensure that the variability captured is more representative of
the true distribution.
– Feature engineering: Identifying and including relevant features might reduce the perceived irreducible error by
accounting for more variability in the data.
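One hedged illustration of the noise floor: if the data happen to contain repeated measurements at the same x, the within-replicate variance estimates Var(ε) directly, and no model can do better than that. The setup below (100 x values, 5 replicates each, σ = 0.4) is entirely synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical design: 5 repeated measurements of Y at each of 100 fixed x values.
# Replicates at the same x differ only through ε, so their within-group variance
# estimates Var(ε), the noise floor no model can beat.
sigma = 0.4
x = np.linspace(0, 6, 100)
y = np.sin(x)[:, None] + rng.normal(0, sigma, (100, 5))

noise_var = float(np.mean(np.var(y, axis=1, ddof=1)))  # pooled within-x variance
print("estimated Var(eps):", round(noise_var, 3))      # close to sigma^2 = 0.16
```

Knowing this floor tells you when further model tuning is pointless: a test MSE near the estimated Var(ε) is about as good as any model can get.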
[Figure: scatter plot of simulated data, y against x (x from 1 to 6, y from −2 to 2).]
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
The curse of dimensionality
[Figure: the 10% neighborhood of a target point in p = 2 dimensions (left, axes x1 and x2), and the radius needed to capture a given fraction of the volume for p = 1, 2, 3, 5 and 10 (right). The radius of a 10% neighborhood grows rapidly with p.]
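The figure's message can be reproduced with one line of arithmetic: for data uniform on the unit cube in p dimensions, a hypercube neighborhood that captures a fraction of the data needs edge length fraction^(1/p).

```python
# Edge length of a hypercube neighborhood capturing a given fraction of data
# distributed uniformly on the unit cube in p dimensions: edge = fraction**(1/p).
fraction = 0.10  # a "10% neighborhood"
edges = {p: fraction ** (1 / p) for p in [1, 2, 3, 5, 10]}
for p, edge in edges.items():
    print(f"p={p:2d}: edge length needed = {edge:.3f}")
# p=1 needs edge 0.100, but p=10 already needs edge 0.794:
# the "neighborhood" spans most of each axis and is no longer local.
```

This is the curse of dimensionality in miniature: local methods like nearest neighbors stop being local once p is even moderately large.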
[Figure: the same simulated data, y against x. A linear model gives a reasonable fit here; a quadratic model fits slightly better.]
Simulated example. Red points are simulated values for income from the model
income = f(education, seniority) + ε
Linear regression model fit to the simulated data:
f̂_L(education, seniority) = β̂0 + β̂1 × education + β̂2 × seniority
More flexible regression model f̂_S(education, seniority) fit to the simulated data. Here we use a technique called a thin-plate spline to fit a flexible surface; we control the roughness of the fit.
Even more flexible spline regression model f̂_S(education, seniority) fit to the simulated data. Here the fitted model makes no errors on the training data! This is known as overfitting.
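Overfitting of this kind is easy to reproduce with polynomials instead of splines (a simplification; the slides use thin-plate splines). With 15 training points, a degree-14 polynomial has enough coefficients to interpolate the data exactly, driving the training error to essentially zero:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

# Toy stand-in for the income example: 15 noisy observations of a smooth truth.
n = 15
x = np.sort(rng.uniform(0, 6, n))
y = np.sin(x) + rng.normal(0, 0.3, n)

# Training MSE of polynomial fits of increasing flexibility.
train_mse = {}
for degree in [1, 3, 14]:   # degree 14 can pass through all 15 points
    p = Polynomial.fit(x, y, degree)
    train_mse[degree] = float(np.mean((y - p(x)) ** 2))
    print(f"degree {degree:2d}: training MSE = {train_mse[degree]:.6f}")
```

Zero training error is not a virtue: the degree-14 fit is reproducing the noise ε, not the underlying f.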
Some Trade-offs
[Figure: interpretability (High to Low) against flexibility (Low to High). Lasso and Least Squares offer high interpretability and low flexibility; Generalized Additive Models and Trees sit in between; Bagging and Boosting are highly flexible but hard to interpret.]
Assessing Model Accuracy
[Figure: simulated data (X from 0 to 100, Y from 2 to 10) with fits of varying flexibility, alongside mean squared error against flexibility (2 to 20).]
[Figure: a second simulated example with the same layout.]
Here the truth is smoother, so the smoother fit and linear model do
really well.
[Figure: a third simulated example with the same layout; Mean Squared Error against flexibility (2 to 20).]
Here the truth is wiggly and the noise is low, so the more flexible fits
do the best.
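A minimal simulation in the same spirit, using polynomial degree as the flexibility axis (an assumption; the figures above use smoothing splines). Training error always falls as flexibility grows, while test error is what actually measures accuracy:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)

# Wiggly truth sin(x) with modest noise, echoing the third example above.
x_tr = np.linspace(0, 6, 50)                       # training inputs
y_tr = np.sin(x_tr) + rng.normal(0, 0.3, 50)
x_te = rng.uniform(0, 6, 5000)                     # large test set, same truth
y_te = np.sin(x_te) + rng.normal(0, 0.3, 5000)

train_mse, test_mse = {}, {}
for degree in [1, 3, 9, 15]:
    p = Polynomial.fit(x_tr, y_tr, degree)
    train_mse[degree] = float(np.mean((y_tr - p(x_tr)) ** 2))
    test_mse[degree] = float(np.mean((y_te - p(x_te)) ** 2))
    print(f"degree {degree:2d}: train {train_mse[degree]:.3f}, "
          f"test {test_mse[degree]:.3f}")
```

Training MSE decreases monotonically with degree, but test MSE drops and then turns back up once the fit starts chasing noise, which is exactly the U-shape in the figures.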
Understanding Bias – Variance Tradeoff
What is bias?
• Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem. It leads to high error on both training and test data.
What is variance?
• Variance is the variability of the model's prediction for a given data point: how much the prediction would change if we trained on a different training set. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.
• In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. Such models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to fit a linear model to nonlinear data. These models, such as linear and logistic regression, are too simple to capture complex patterns in the data.
• In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train our model extensively on a noisy dataset. These models have low bias and high variance. Very complex models, such as deep decision trees, are prone to overfitting.
Why tradeoff?
• If our model is too simple and has very few parameters then it may have high bias
and low variance. On the other hand if our model has large number of parameters
then it’s going to have high variance and low bias. So we need to find the right/good
balance without overfitting and underfitting the data.
• This tradeoff in complexity is why there is a tradeoff between bias and variance. An
algorithm can’t be more complex and less complex at the same time.
• Typically as the flexibility of ƒˆ increases, its variance increases, and its bias
decreases. So choosing the flexibility based on average test error amounts to a
bias-variance trade-off.
• To build a good model, we need to find a balance between bias and variance that minimizes the total error. A model with an optimal balance of bias and variance neither overfits nor underfits the data.
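The trade-off can be made concrete by refitting a model many times on fresh training sets and measuring the bias and variance of its prediction at a single point. The truth sin(x), the evaluation point x0 = 2.5, and polynomial degree as the flexibility knob are all illustrative choices:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(5)
sigma, n, reps, x0 = 0.3, 40, 2000, 2.5   # evaluate bias/variance at X = x0

results = {}
for degree in [1, 3, 9]:
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 6, n)                   # fresh training set each time
        y = np.sin(x) + rng.normal(0, sigma, n)
        preds[r] = Polynomial.fit(x, y, degree)(x0)
    bias2 = float((preds.mean() - np.sin(x0)) ** 2)  # (avg prediction - truth)^2
    var = float(preds.var())                          # spread of predictions
    results[degree] = (bias2, var)
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

As flexibility grows, squared bias shrinks while variance grows; adding the constant Var(ε) = σ² to their sum gives the expected test error at x0, which is minimized at an intermediate degree.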
[Figure: squared bias, variance (Var), and test MSE against flexibility (2 to 20) for the three simulated examples above. Bias decreases and variance increases with flexibility; test MSE is U-shaped, minimized at an intermediate flexibility.]