
Introduction to Statistical Learning

Olivier Roustant
& Laurent Carraro for Part 2

Mines Saint-Étienne

2016/09

Part 1 : Famous traps !


Trap #1- Spurious relationship, correlation ≠ causality

What do you think of the correlation of 0.99 between the two variables
illustrated below ?

Trap #1- Spurious relationship, correlation ≠ causality

What do you think of the correlation of 0.52 between the daily returns of two French stocks from different sectors (food and construction)?


Trap #1- Build your own spurious relationship!

Exercise 1: Build a time series independently of the CO2 curve, but with an estimated correlation > 0.95 with it! (A simulation sketch follows below.)
Exercise 2: Same question with the CAC40!
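One way to solve Exercise 1, as a minimal Python sketch (not from the slides): two series generated independently of each other, but both carrying an upward trend, end up with a very high sample correlation. The lengths, drift and noise levels below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 600  # hypothetical monthly series length

# A CO2-like series: upward trend + seasonal cycle + noise
t = np.arange(n)
co2_like = 315 + 0.1 * t + 3 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, n)

# A series built with no reference to the first one: a random walk with drift.
# Any independently generated series with a monotone trend would do.
independent = np.cumsum(0.05 + rng.normal(0, 0.3, n))

print(np.corrcoef(co2_like, independent)[0, 1])  # typically well above 0.95
```

The high estimated correlation is driven entirely by the two trends, which is exactly the point of Trap #1.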


Trap #1- Spurious relationship !

There are at least two problems:

The ESTIMATOR of the correlation is not consistent in the presence of a trend or seasonality!
When it is consistent (for stationary time series, for instance), a THIRD variable can still explain the observed correlation.

Never forget HUMAN THINKING!


Trap #2- Overfitting


Here are some data from a physical phenomenon. What is your preferred model (2nd-order polynomial or interpolating spline)?


Trap #2- Overfitting


The same models, estimated on a training set of 20 points chosen at random (open points). Are the performances similar on the test set (filled points)?


Trap #2- Overfitting

Always look at model performance on data other than the training set → external validation, cross-validation
A good model should behave similarly on the training and test sets (see the sketch below)
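A minimal Python illustration of the trap, under assumptions that are not taken from the slides (a synthetic, roughly quadratic "physical" signal with noise, and a random training set of 20 points out of 40): the interpolating spline reproduces the training data exactly but typically does much worse on the test set, while the 2nd-order polynomial behaves similarly on both.

```python
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(1)

# Hypothetical noisy measurements of a smooth, roughly quadratic phenomenon
x = np.sort(rng.uniform(0, 10, 40))
y = 0.1 * (x - 4) ** 2 + rng.normal(0, 0.3, size=x.size)

# Random training set of 20 points; the remaining points form the test set
train = np.sort(rng.choice(x.size, size=20, replace=False))
test = np.setdiff1d(np.arange(x.size), train)

poly = np.polynomial.Polynomial.fit(x[train], y[train], deg=2)  # least-squares quadratic
spline = CubicSpline(x[train], y[train])                        # interpolates every training point

def mse(model, idx):
    return np.mean((model(x[idx]) - y[idx]) ** 2)

for name, model in [("2nd-order polynomial", poly), ("interpolating spline", spline)]:
    print(f"{name:22s} train MSE = {mse(model, train):.3f}   test MSE = {mse(model, test):.3f}")
```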

Part 2 : A guiding example


What follows is freely adapted from the book The Elements of Statistical Learning by T. Hastie, R. Tibshirani and J. Friedman (Springer, 2nd edition), available online.

We consider a simulated classification example, where two populations, "blue" and "red", are drawn from two mixtures of Gaussian distributions.

The aim is to find a rule deciding to which group a new individual should be assigned.


Construction of the training sets


Step 1: Simulate 10 points M1^1, . . . , M10^1 for the "blue", drawn from N(µ1, Σ), and 10 points M1^2, . . . , M10^2 for the "red", drawn from N(µ2, Σ)


Step 2: Simulate a sample of size 100 as a mixture of the N(Mi^1, Σ0) for the "blue", and of the N(Mi^2, Σ0) for the "red"


Bayes classifier

If we knew the simulation procedure, that is, the class-conditional densities fX|G=i, then we could use the Bayes classifier. Let x be a new point to classify.
if P(G = 1|X = x) > P(G = 2|X = x), then decide that x is "blue"
if P(G = 1|X = x) < P(G = 2|X = x), then decide that x is "red"
if P(G = 1|X = x) = P(G = 2|X = x), then ?

Here:

P(G = i|X = x) = 0.5 fX|G=i(x) / (0.5 fX|G=1(x) + 0.5 fX|G=2(x))
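A sketch of the Bayes classifier for this example, in Python. It uses the true generating mechanism (the class centres from step 1 and Σ0), which is precisely what is unknown in practice; Σ0 = I/5 and the centre distributions are the assumed values from the simulation sketch above.

```python
import numpy as np
from scipy.stats import multivariate_normal

Sigma0 = np.eye(2) / 5   # assumed, as in the simulation sketch

def mixture_density(x, centres):
    # f_{X|G}(x) = (1/10) * sum_i phi(x; M_i, Sigma0): the true class-conditional density
    return np.mean([multivariate_normal.pdf(x, mean=m, cov=Sigma0) for m in centres])

def bayes_classify(x, centres_blue, centres_red):
    # The equal class weights 0.5 cancel, so compare the class densities directly
    return "blue" if mixture_density(x, centres_blue) > mixture_density(x, centres_red) else "red"

# Example usage with freshly drawn (hypothetical) centres
rng = np.random.default_rng(42)
centres_blue = rng.multivariate_normal([1.0, 0.0], np.eye(2), size=10)
centres_red = rng.multivariate_normal([0.0, 1.0], np.eye(2), size=10)
print(bayes_classify(np.array([0.5, 0.5]), centres_blue, centres_red))
```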


Remark. Define Ĝ(x) as a decision rule at point x, and consider the 0-1 loss function:

L(1, 1) = L(2, 2) = 0
L(1, 2) = L(2, 1) = α > 0

Then the Bayes classifier Ĝ minimizes the Expected Prediction Loss E[L(G, Ĝ(X))]. It is enough to show that this holds conditionally on X = x:

EPLx = E[L(G, Ĝ(X)) | X = x]
     = L(1, Ĝ(x)) P(G = 1|X = x) + L(2, Ĝ(x)) P(G = 2|X = x)

The Bayes classifier chooses Ĝ(x) equal to the class i with the highest posterior P(G = i|X = x), so the loss term carrying the largest weight vanishes (L(i, i) = 0) and only α times the smaller posterior remains: this is the minimum possible value of EPLx.


The (optimal) frontier, obtained with the Bayes classifier.


Classifiers from samples based on linear regression

For each sample point, define a value Y equal to 1 if "blue" and 0 otherwise, and let Ŷ(x) be the prediction at a new point x:

Ŷ(x) = β̂0 + β̂1 x1 + β̂2 x2

A classifier is :
if Ŷ (x) > 0.5, then decide that x is "blue"
if Ŷ (x) < 0.5, then decide that x is "red"
if Ŷ (x) = 0.5, then ?
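A minimal Python sketch of this regression-based classifier (not the authors' code). The training data are a simplified stand-in, two Gaussian blobs rather than the full mixture construction; the quadratic and 5th-order frontiers of the next slides are presumably obtained the same way, by adding polynomial features of x1 and x2 to the design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in training data: two Gaussian blobs; 1 = "blue", 0 = "red"
X = np.vstack([rng.normal([1.0, 0.0], 0.8, size=(100, 2)),
               rng.normal([0.0, 1.0], 0.8, size=(100, 2))])
y = np.array([1] * 100 + [0] * 100)

# Least-squares fit of Y on (1, x1, x2)
A = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

def classify_linear(x):
    y_hat = beta_hat[0] + beta_hat[1] * x[0] + beta_hat[2] * x[1]
    return "blue" if y_hat > 0.5 else "red"

# Classification rate on the training set
pred = (A @ beta_hat > 0.5).astype(int)
print("training classification rate:", np.mean(pred == y))
```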


Linear frontier : classification rate 73.5 %


Quadratic frontier : classification rate 79.5 %


5th order polynomial frontier : classification rate 88 %


Nearest Neighbors Classifiers

Let Nk(x) be the set of the k nearest neighbors of x among the training points, and Ŷ(x) the proportion of these neighbors that belong to the "blue" class:

Ŷ(x) = (1/k) Σ_{xi ∈ Nk(x)} Yi

We can define a classifier by :


if Ŷ (x) > 0.5, then decide that x is "blue"
if Ŷ (x) < 0.5, then decide that x is "red"
if Ŷ (x) = 0.5, then ?
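A matching Python sketch of the k-nearest-neighbor rule, again on stand-in data rather than the slides' exact sample; ties at Ŷ(x) = 0.5 are broken in favor of "red" here, an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in training data, as in the linear regression sketch; 1 = "blue", 0 = "red"
X = np.vstack([rng.normal([1.0, 0.0], 0.8, size=(100, 2)),
               rng.normal([0.0, 1.0], 0.8, size=(100, 2))])
y = np.array([1] * 100 + [0] * 100)

def knn_classify(x, X_train, y_train, k=10):
    dist = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to the training points
    neighbors = np.argsort(dist)[:k]             # indices of the k nearest neighbors N_k(x)
    y_hat = y_train[neighbors].mean()            # proportion of "blue" among them
    return "blue" if y_hat > 0.5 else "red"

print(knn_classify(np.array([0.5, 0.5]), X, y, k=10))
```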


kNN with k = 30 : classification rate 84 %


kNN with k = 10 : classification rate 88 %


kNN with k = 1 : classification rate 100 %


Temporary conclusions

kNN is closer to the optimal (Bayes) method
Parameters to estimate: k (number of neighbors) and d (polynomial degree)
A classification rate of 100% on the training set is NOT the aim (see trap #2, 'overfitting'...)


Error decomposition & bias-variance tradeoff

Assume that Y(x) is deterministic, and let x be a new point. Denote µ(x) = E[Ŷ(x)]. The quadratic error (risk) decomposes as:

QE(x) = E[(Ŷ(x) − Y(x))²] = (Y(x) − µ(x))² + var[Ŷ(x)] = Bias² + Variance
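The decomposition follows by inserting µ(x) inside the square and expanding; the cross term vanishes because E[Ŷ(x) − µ(x)] = 0. As a one-line derivation (not shown on the slide):

```latex
\mathrm{QE}(x)
  = \mathbb{E}\big[(\hat{Y}(x)-\mu(x)+\mu(x)-Y(x))^2\big]
  = \underbrace{(\mu(x)-Y(x))^2}_{\mathrm{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{Y}(x)-\mu(x))^2\big]}_{\operatorname{var}\hat{Y}(x)}
  + 2\,(\mu(x)-Y(x))\underbrace{\mathbb{E}\big[\hat{Y}(x)-\mu(x)\big]}_{=0}
```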

Remarks
for kNN (with small k), the bias is ≈ 0
for the linear model, the bias is 0 if there is no model error (good basis functions).


The curse of dimensionality

Exercise: Let X1, . . . , Xn be i.i.d. uniform on [−1, 1]^d, and consider the norm ‖h‖∞ = max_{1≤j≤d} |hj|.
What is the distribution of R = min_{1≤i≤n} ‖Xi‖∞, the distance from 0 to the closest point?
What happens when d → ∞?
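A Monte Carlo sketch of the exercise in Python. The slides do not state n; n = 1000 is assumed below. The closed form quoted in the final comment follows from P(‖Xi‖∞ ≤ r) = r^d for a uniform point on [−1, 1]^d.

```python
import numpy as np

rng = np.random.default_rng(0)

def closest_point_distance(n=1000, d=2, n_rep=200):
    """Monte Carlo draws of R = min_i ||X_i||_inf with X_1, ..., X_n uniform on [-1, 1]^d."""
    R = np.empty(n_rep)
    for rep in range(n_rep):
        X = rng.uniform(-1, 1, size=(n, d))
        R[rep] = np.abs(X).max(axis=1).min()   # sup-norm of each point, then the minimum
    return R

# As d grows, even the closest of the n points drifts away from the origin
for d in [1, 2, 5, 15, 50]:
    print(f"d = {d:2d}: median R ≈ {np.median(closest_point_distance(d=d)):.2f}")

# For reference, P(R <= r) = 1 - (1 - r**d)**n for 0 <= r <= 1,
# since each point satisfies P(||X_i||_inf <= r) = r**d.
```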


Boxplots of the distribution of the distance from 0 to the closest point.


In high dimensions, the sample points are close to the boundaries of the domain
In 15D, the distance to the closest point is around 0.6
There are no close neighbors in high dimensions → kNN cannot be used.
More generally, no local method can be used.


Validation

Internal validation (on the training set only)


External validation : Validate on a separate "test" set
Cross validation : Choose the training set and test set inside the
data (see later).


Validation results on the example

Linear frontier : classification rate 72.8 % (learning : 73.5 %)


Quadratic frontier : classification rate 77.5 % (learning : 79.5 %)


5th order poly. frontier : classification rate 84.5 % (learning : 88 %)


kNN with k = 30 : classification rate 80.2 % (learning : 84 %)


kNN with k = 10 : classification rate 84.9 % (learning : 88 %)


kNN with k = 1 : classification rate 82 % (learning : 100 %)


Conclusions

The performance gap between the training and test sets increases with model complexity
The performance on the test set does not always increase with model complexity
Complex models sometimes take crazy decisions:
- 5th order polynomial: erratic behavior at the boundaries of the x-axis
- kNN with k = 1: isolated "islands" in the middle


Cross validation

K-fold cross-validation (CV) consists of choosing training & test sets among the data, and rotating them.
The CV error is computed by averaging.

(source : The elements of Statistical learning, T. Hastie, R. Tibshirani, J. Friedman)

Define K 'folds' F1, . . . , FK in your data. For k = 1, . . . , K, do:

Estimate the model without Fk and predict on Fk
Compute an error criterion (e.g. MSE) L−k on the predicted values

Compute the CV error by averaging: (1/K) Σ_{k=1}^{K} L−k
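A minimal Python sketch of K-fold CV used to choose the number of neighbors k for kNN. It uses stand-in data and the misclassification rate as the error criterion L−k; the random partition into folds is one of several common choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in training data: two Gaussian blobs; 1 = "blue", 0 = "red"
X = np.vstack([rng.normal([1.0, 0.0], 0.8, size=(100, 2)),
               rng.normal([0.0, 1.0], 0.8, size=(100, 2))])
y = np.array([1] * 100 + [0] * 100)

def knn_predict(x, X_train, y_train, k):
    dist = np.linalg.norm(X_train - x, axis=1)
    return int(y_train[np.argsort(dist)[:k]].mean() > 0.5)

def cv_error(X, y, k_neighbors, K=10):
    folds = np.array_split(rng.permutation(len(X)), K)        # folds F_1, ..., F_K
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(X)), fold)         # estimate without F_k ...
        preds = np.array([knn_predict(x, X[train], y[train], k_neighbors) for x in X[fold]])
        errors.append(np.mean(preds != y[fold]))              # ... and compute L_{-k} on F_k
    return np.mean(errors)                                    # average over the K folds

# Choose the number of neighbors by 10-fold cross-validation
for k in [1, 5, 10, 30]:
    print(f"k = {k:2d}: CV misclassification rate = {cv_error(X, y, k):.3f}")
```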


Cross-validation results on the example


Parameter k of kNN can be chosen by cross-validation
