
ZG 512
Supervised Learning: Errors and other concepts
Dr Arindam Roy
BITS Pilani, Pilani Campus
The Supervised Learning Problem

Starting point:

• Outcome measurement Y (also called dependent variable, response, target).
• Vector of p predictor measurements X (also called inputs, regressors, covariates, features, independent variables).
• In the regression problem, Y is quantitative (e.g. price, blood pressure).
• In the classification problem, Y takes values in a finite, unordered set (survived/died, digit 0-9, cancer class of tissue sample).
• We have training data (x1, y1), . . . , (xN, yN). These are observations (examples, instances) of these measurements.

BITS Pilani, Pilani Campus


Objectives

On the basis of the training data we would like to:

• Accurately predict unseen test cases.

• Understand which inputs affect the outcome, and how.

• Assess the quality of our predictions and inferences.



Philosophy

• It is important to understand the ideas behind the various techniques, in order to know how and when to use them.
• One has to understand the simpler methods first, in order to grasp the more sophisticated ones.
• It is important to accurately assess the performance of a method, to know how well or how badly it is working [simpler methods often perform as well as fancier ones!]
• This is an exciting research area, having important applications in science, industry and finance.
• Statistical learning/Machine Learning is a fundamental ingredient in the training of a modern data scientist.



Unsupervised Learning

• No outcome variable, just a set of predictors (features) measured on a set of samples.
• The objective is more fuzzy: find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation.
• It is difficult to know how well you are doing.
• Different from supervised learning, but can be useful as a pre-processing step for supervised learning.



Statistical Learning Vs. Machine Learning

• Machine learning arose as a subfield of Artificial Intelligence.
• Statistical learning arose as a subfield of Statistics.
• There is much overlap; both fields focus on supervised and unsupervised problems:
– Machine learning has a greater emphasis on large scale applications and prediction accuracy.
– Statistical learning emphasizes models and their interpretability, and precision and uncertainty.
• But the distinction has become more and more blurred, and there is a great deal of "cross-fertilization".
• Machine learning has the upper hand in Marketing!



What is Machine Learning?

Shown are Sales vs TV, Radio and Newspaper, with a blue linear-regression line fit separately to each.
Can we predict Sales using these three?
Perhaps we can do better using a model: Sales ≈ f(TV, Radio, Newspaper)



Prediction

• In the simplest terms, the prediction task is: how can we design ƒˆ(X) so that its output comes very close to Y for the given values of X?
Y = f(X) + ε
• Our estimator function is ƒˆ and our predicted value is Ŷ = ƒˆ(X).
• f(X) vs ƒˆ(X), and Y vs Ŷ: the true function vs. the estimated function, and the actual vs. the predicted value.
• Now that we have the actual value Y and the predicted value Ŷ, the difference between Y and Ŷ is the prediction error.



Prediction

Prediction error is influenced by two factors:

• The difference between f(X) and ƒˆ(X), which we term the Reducible Error.
• ε, which we call the Irreducible Error.

• Given infinite time, we know that we can figure out a good enough estimator and bring the Reducible Error close to 0.
• So, reducible errors are the errors we have control over; we can reduce them by choosing better models.
• It can be shown mathematically that the average squared prediction error decomposes into these two parts, Reducible plus Irreducible Error:
E[(Y − ƒˆ(X))²] = [f(X) − ƒˆ(X)]² + Var(ε)
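This decomposition is easy to see in a small NumPy sketch (the sine truth and noise level are arbitrary choices for illustration): even predicting with the true f leaves the irreducible error Var(ε), while a cruder estimator adds reducible error on top.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from Y = f(X) + eps, with Var(eps) = 0.5**2 = 0.25
f = np.sin                                   # the (normally unknown) true f
x = rng.uniform(0, 6, 100_000)
y = f(x) + rng.normal(0, 0.5, x.size)

# A deliberately crude estimator: predict the overall mean everywhere
y_hat_crude = np.full_like(x, y.mean())

mse_crude = np.mean((y - y_hat_crude) ** 2)  # reducible + irreducible
mse_true = np.mean((y - f(x)) ** 2)          # irreducible only, about 0.25

print(f"MSE of crude estimator: {mse_crude:.3f}")
print(f"MSE using the true f:   {mse_true:.3f}")
```

Even with the true f in hand, the MSE cannot drop below Var(ε); that gap is exactly the reducible error of the crude model.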



Can we reduce the Irreducible error?

• In machine learning, the irreducible error, often called noise, is the component of
the error that cannot be reduced no matter how good our model is. This error
comes from the inherent randomness and variability in the data that we cannot
model or predict.

• Irreducible error can arise from several sources, including:


– Intrinsic variability: Natural randomness in the data or phenomena being modeled.
– Measurement noise: Errors in data collection or measurement inaccuracies.
– Incomplete data: Important variables or features that are not included in the dataset.
– Latent variables: Hidden variables that affect the outcomes but are not directly observed or measured.



Can we reduce the Irreducible error?

• Since irreducible error is due to these inherent aspects of the data and the system
being modeled, it cannot be reduced through improvements in the model or
learning algorithm. However, understanding the domain and managing irreducible
error is crucial:
– Improve data quality: Reducing measurement noise and improving data collection processes can help minimize the
contribution of measurement errors.
– Data augmentation: Adding more relevant data can help ensure that the variability captured is more representative of
the true distribution.
– Feature engineering: Identifying and including relevant features might reduce the perceived irreducible error by
accounting for more variability in the data.

• Ultimately, the focus in machine learning is to minimize reducible errors: those errors that can be mitigated through better models, algorithms, and data preprocessing techniques.



How to estimate f

• Typically we have few if any data points with X = 4 exactly.
• So we cannot compute E(Y | X = x)!
• Relax the definition and let
ƒˆ(x) = Ave(Y | X ∈ N(x)),
where N(x) is some neighborhood of x.
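A minimal sketch of this local-averaging idea in NumPy (the sine truth, noise level, and neighborhood radius are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Training data (x1, y1), ..., (xN, yN) from Y = f(X) + eps
x_tr = rng.uniform(0, 6, 500)
y_tr = np.sin(x_tr) + rng.normal(0, 0.3, x_tr.size)

def f_hat(x0, radius=0.25):
    """Ave(Y | X in N(x0)): mean of y over a neighborhood of x0."""
    in_nbhd = np.abs(x_tr - x0) <= radius
    return y_tr[in_nbhd].mean()

# In one dimension the local average tracks the true f reasonably well
print(f_hat(4.0), np.sin(4.0))
```

With 500 points on a one-dimensional interval, every neighborhood contains plenty of observations; the next slide shows why this breaks down as the dimension p grows.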
[Figure: simulated scatter plot of y versus x (x from 1 to 6, y from −2 to 3) illustrating estimation by local averaging]
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
The curse of dimensionality

• Nearest neighbor averaging can be pretty good for small p, i.e. p ≤ 4 and large-ish N.
• Nearest neighbor methods can be lousy when p is large. Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
• We need to get a reasonable fraction of the N values of yi to average, to bring the variance down, e.g. 10%.
• A 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating by local averaging.
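The geometry behind this can be checked directly. In a unit hypercube with uniformly distributed data, an axis-aligned neighborhood that captures 10% of the points must span 0.1^(1/p) of each axis, which approaches the whole range as p grows (a sketch, not from the slides):

```python
# Side length of an axis-aligned neighborhood covering 10% of a unit
# hypercube's volume: side**p = 0.10, so side = 0.10 ** (1 / p).
for p in (1, 2, 3, 5, 10, 100):
    side = 0.10 ** (1 / p)
    print(f"p = {p:>3}: a 10% neighborhood spans {side:.2f} of each axis")
```

Already at p = 10 the "neighborhood" covers about 80% of each axis, so it is no longer local in any meaningful sense.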

The curse of dimensionality

[Figure: left panel, 10% neighborhoods around points in two dimensions (x1, x2); right panel, radius of a 10% neighborhood versus fraction of volume for p = 1, 2, 3, 5, 10]


Parametric and structured models
The linear model is an important example of a parametric model:

ƒL(X) = β0 + β1X1 + β2X2 + . . . + βpXp

• A linear model is specified in terms of p + 1 parameters β0, β1, . . . , βp.
• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function f(X).



Parametric and structured models
[Figure: scatter of y versus x (x from 1 to 6) with a linear and a quadratic fit overlaid]

A linear model ƒˆL(X) = βˆ0 + βˆ1X gives a reasonable fit here. A quadratic model ƒˆQ(X) = βˆ0 + βˆ1X + βˆ2X² fits slightly better.
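The linear-versus-quadratic comparison can be run numerically: on data with mild curvature, the quadratic fit should achieve a somewhat lower training MSE than the straight line (a sketch; the data-generating function is made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# Data with mild curvature
x = rng.uniform(1, 6, 120)
y = 0.5 * (x - 3.5) ** 2 + rng.normal(0, 0.4, x.size)

mses = {}
for deg in (1, 2):                      # degree 1: f_L, degree 2: f_Q
    coefs = np.polyfit(x, y, deg)
    mses[deg] = np.mean((y - np.polyval(coefs, x)) ** 2)

print(f"linear    training MSE: {mses[1]:.3f}")
print(f"quadratic training MSE: {mses[2]:.3f}")
```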
Simulated example. Red points are simulated values of income from the model

income = f(education, seniority) + ε

f is the blue surface.
Linear regression model fit to the simulated data:

ƒˆL(education, seniority) = βˆ0 + βˆ1 × education + βˆ2 × seniority
More flexible regression model ƒˆS(education, seniority) fit to the simulated data. Here we use a technique called a thin-plate spline to fit a flexible surface. We can control the roughness of the fit.
Even more flexible spline regression model ƒˆS(education, seniority) fit to the simulated data. Here the fitted model makes no errors on the training data! This is known as overfitting.
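Overfitting is easy to reproduce in one dimension: give a polynomial as many parameters as there are training points and it will pass through every point, driving the training error to essentially zero (a sketch with arbitrary simulated data, standing in for the spline surface on the slide):

```python
import numpy as np

rng = np.random.default_rng(4)

# Ten noisy training points
x = np.linspace(-1, 1, 10)
y = np.sin(np.pi * x) + rng.normal(0, 0.2, x.size)

# A degree-9 polynomial has 10 parameters: it interpolates all 10 points
coefs = np.polyfit(x, y, 9)
train_mse = np.mean((y - np.polyval(coefs, x)) ** 2)

print(f"training MSE: {train_mse:.2e}")      # essentially zero
```

The fit "makes no errors on the training data", yet it has memorized the noise; its predictions between and beyond the training points are poor.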
Some Trade-offs

• Prediction accuracy versus interpretability: linear models are easy to interpret; thin-plate splines are not.
• Good fit versus over-fit or under-fit: how do we know when the fit is just right?
• Parsimony versus black-box: we often prefer a simpler model involving fewer variables over a black-box predictor involving them all.

[Figure: interpretability versus flexibility of methods. Ordered from high interpretability/low flexibility to low interpretability/high flexibility: Subset Selection and Lasso; Least Squares; Generalized Additive Models and Trees; Bagging and Boosting; Support Vector Machines]
Assessing Model Accuracy

Suppose we fit a model ƒˆ(x) to some training data Tr = {(xi, yi)}, and we wish to see how well it performs.

• We could compute the average squared prediction error over Tr:
MSE_Tr = Ave over i in Tr of [yi − ƒˆ(xi)]²  (mean squared error for the training set)
This may be biased toward more overfit models.
• Instead we should, if possible, compute it using fresh test data Te:
MSE_Te = Ave over i in Te of [yi − ƒˆ(xi)]²  (mean squared error for the test set)
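A sketch of this comparison (simulated data, with arbitrary polynomial degrees standing in for "flexibility"): the most flexible fit wins on the training set but loses on fresh test data.

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(np.pi * x) + rng.normal(0, 0.3, n)

x_tr, y_tr = simulate(30)       # training set Tr
x_te, y_te = simulate(1000)     # fresh test set Te

results = {}
for deg in (1, 3, 15):          # increasing flexibility
    coefs = np.polyfit(x_tr, y_tr, deg)
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    results[deg] = (mse_tr, mse_te)
    print(f"degree {deg:>2}: MSE_Tr = {mse_tr:.3f}, MSE_Te = {mse_te:.3f}")
```

MSE_Tr keeps falling as the degree rises, exactly the bias toward overfit models mentioned above, while MSE_Te picks the moderate fit.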



[Figure: left panel, black truth curve with simulated data and three fits of different flexibility (orange, blue and green); right panel, red curve is MSE_Te and grey curve is MSE_Tr as functions of flexibility, with squares marking the three fits]
[Figure: same setup as above. Here the truth is smoother, so the smoother fit and linear model do really well.]
[Figure: same setup as above. Here the truth is wiggly and the noise is low, so the more flexible fits do the best.]
Understanding Bias – Variance Tradeoff

What is bias?
• Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies; it leads to high error on both training and test data.

What is variance?
• Variance is the variability of the model's prediction for a given data point; it tells us how spread out our predictions are. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.



Understanding Bias – Variance Tradeoff

• Suppose we are trying to predict Y, given X. We assume there is a relationship between the two such that Y = f(X) + ε.
• We are estimating f(X) with ƒˆ(x), using linear regression or any other modelling technique.
• So the expected squared error at a point x is
Err(x) = E[(Y − ƒˆ(x))²]
• Err(x) can be further decomposed as
Err(x) = (E[ƒˆ(x)] − f(x))² + E[(ƒˆ(x) − E[ƒˆ(x)])²] + σε²
= Bias² + Variance + Irreducible Error
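This decomposition can be verified by simulation: refit ƒˆ on many independent training sets, record its prediction at a fixed x0, and compare Bias² + Variance + σε² against the directly measured Err(x0). The sketch below uses arbitrary choices (a sine truth and a deliberately biased straight-line model):

```python
import numpy as np

rng = np.random.default_rng(6)

f = np.sin          # true f
sigma = 0.3         # sd of eps
x0 = 2.0            # point at which we evaluate Err(x0)

# Fit a straight line to many independent training sets of size 50
# and record each fit's prediction at x0.
preds = np.empty(2000)
for i in range(preds.size):
    x = rng.uniform(0, 6, 50)
    y = f(x) + rng.normal(0, sigma, x.size)
    preds[i] = np.polyval(np.polyfit(x, y, 1), x0)

bias_sq = (preds.mean() - f(x0)) ** 2
variance = preds.var()

# Direct Monte Carlo estimate of Err(x0) = E[(Y - f_hat(x0))^2]
err = np.mean((f(x0) + rng.normal(0, sigma, preds.size) - preds) ** 2)

print(f"Bias^2 + Var + sigma^2 = {bias_sq + variance + sigma**2:.3f}")
print(f"Direct Err(x0)         = {err:.3f}")
```

For this simple model the bias term dominates: a straight line cannot track a sine, however many training sets we average over.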



Understanding Bias – Variance Tradeoff

• In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to fit a linear model to nonlinear data. Such models are too simple to capture complex patterns in the data, e.g. linear and logistic regression on strongly nonlinear problems.
• In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train our model extensively on a noisy dataset. These models have low bias and high variance. They are typically very complex, like deep decision trees, which are prone to overfitting.



Understanding Bias – Variance Tradeoff

Why tradeoff?

• If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is going to have high variance and low bias. So we need to find the right balance, without overfitting or underfitting the data.
• This tradeoff in complexity is why there is a tradeoff between bias and variance: an algorithm can't be more complex and less complex at the same time.
• Typically, as the flexibility of ƒˆ increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.



Understanding Bias – Variance Tradeoff

• To build a good model, we need to find a balance between bias and variance that minimizes the total error. An optimally balanced model would neither overfit nor underfit.



Bias-variance trade-off for the three
examples
[Figure: MSE, squared bias and variance as functions of flexibility for the three examples above]

