
Statistical learning

A brief introduction

Gertraud Malsiner-Walli

Readings: ISLR Chapters 1 & 2

Outline

Framework

Modeling approaches: supervised and unsupervised learning

Statistical learning - general approach

Assessing model accuracy

Overfitting

Framework

AI, ML, DL, . . .

Data analytics process cycle

Data analytics process cycle

(1) Business application is the process driver.
(2) Classic data sources are internal and market data.
(3) Newer data sources: real-time data, social media data, IoT (Internet of Things) data.
(4) Statistical analysis involves data description and data exploration, as well as modeling and evaluation.
(5) Proper tools: R, Python, Excel/VBA, cloud services, etc.
(6) Storytelling!
(7) Revise periodically.

Modeling approaches: supervised and
unsupervised learning

Modeling approaches

▶ The statistical analysis step (4) of the data analytics process cycle relies on statistical and machine learning techniques to extract information from data.
▶ Statistical learning problems can be assigned to one of two broad types:
  ▶ Supervised learning
  ▶ Unsupervised learning

Supervised learning

▶ Supervised learning is used to model an observed output (target, dependent variable, etc.) using a set of inputs (features, independent variables, attributes, etc.), where we postulate that there is a relationship between the input and the output.
  ▶ Regression: (typically) used to predict numerical outcomes.
  ▶ Classification: used to predict categorical outcomes, most often binary yes/no variables.
▶ Notation:
  ▶ Y . . . output variable
  ▶ X1, X2, . . . . . . input variables

Examples of regression and classification

▶ Regression:
  ▶ Y . . . sales
  ▶ X1, X2, X3 . . . advertising budgets for TV, radio, and newspaper
▶ Classification:
  ▶ Y . . . clicking on an ad (no, yes)
  ▶ X1, X2 . . . age, time stamp
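To make the two problem types concrete, here is a minimal R sketch. The data frames advertising and ads and their column names are assumptions for illustration (they mirror the examples above):

    # Regression: model a numerical outcome with a linear model.
    # Assumes a hypothetical data frame `advertising` with columns
    # Sales, TV, Radio, Newspaper.
    reg_fit <- lm(Sales ~ TV + Radio + Newspaper, data = advertising)

    # Classification: model a binary outcome with logistic regression.
    # Assumes a hypothetical data frame `ads` with columns
    # Clicked.on.Ad (0/1) and Age.
    clf_fit <- glm(Clicked.on.Ad ~ Age, data = ads, family = binomial)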

Supervised learning: Visualisation

[Figure: two scatter plots. Left: Sales vs. TV (regression); right: Clicked.on.Ad vs. Age (classification).]


Unsupervised learning

▶ Unsupervised learning:
  ▶ There is no target variable Y and no feedback based on the prediction results.
  ▶ It is used to describe real-world events and to discover the latent relationships responsible for them (relationships between variables, relationships between observations).
  ▶ Examples: clustering, dimensionality reduction (see the sketch below).
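A minimal clustering sketch using k-means on R's built-in iris data; choosing 3 clusters is an assumption (it matches the three known species), and the species labels themselves are never used:

    # Cluster the flowers on two measurements, without using the labels.
    X <- iris[, c("Petal.Length", "Sepal.Length")]
    set.seed(1)
    km <- kmeans(X, centers = 3, nstart = 25)  # assume 3 subtypes
    table(km$cluster)                          # sizes of the found groups
    plot(X, col = km$cluster, pch = 19)        # visualize the clusters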

Unsupervised learning: Clustering of Iris data
Measurements of Iris blossoms: How many subtypes of Iris flowers
are there?
[Figure: scatter plots of the iris data. Top row (groups are easier to identify): Sepal.Length vs. Petal.Length. Bottom row (groups are more difficult to identify): Sepal.Length vs. Sepal.Width.]


Supervised & Unsupervised learning

▶ They often go hand in hand.
▶ Example:
  ▶ Unsupervised learning can help an organization to understand its customers by finding customer segments;
  ▶ supervised learning is then used to generate forecasts of the outcomes of interest (e.g., to predict the preferences of the customers in the identified segments).
▶ Quiz: supervised or unsupervised learning?

Statistical learning - general approach

More on supervised learning
▶ In supervised learning we are interested in a relationship of the form:

  Y = f(X) + ϵ, where

  ▶ f is a fixed but unknown function which depends on X1, . . . , Xp, and
  ▶ ϵ is a random error term ("noise") which is independent of X and has mean zero.
▶ We can think of f as the systematic information that the X's provide about Y.
▶ Note: there is always some error/noise ϵ present.
▶ Supervised learning refers to a set of approaches for estimating f.
▶ Q: What does f look like in a linear regression model?

Example: Y = f(X) + ϵ

[Figure: two panels plotting Sales against the TV advertising budget.]


Why estimate f?

▶ Prediction purposes:
  ▶ A good estimate of f can serve as a good basis for predictions.
▶ Inference purposes:
  ▶ We can use f to understand which inputs X1, . . . , Xp are important in explaining Y.
  ▶ We can use f to understand how the different inputs X1, . . . , Xp affect Y.
▶ Different methods serve these purposes differently
  ⇒ the degree of flexibility is crucial.

Trade-off between model flexibility / interpretability
▶ Typically, more rigid models may be a good choice for inference, since they allow us to understand the relationship between inputs and output quite easily.
  ▶ E.g., a linear regression model is very strict about the form of the relationship (linear):
    ⇒ it may not yield predictions as accurate as some other approaches (such as deep learning),
    ⇒ but it allows for relatively simple and interpretable inference.
▶ In contrast, flexible approaches may deliver good predictions.
  ▶ E.g., deep learning methods are very flexible:
    ⇒ they can make accurate predictions,
    ⇒ but they can lead to such complicated estimates of f that it is difficult to understand how any individual predictor is associated with the response.

How do we estimate f?
▶ We assume that we have observed a set of n different data points (xi, yi).
▶ These observations are used to estimate f:

  Y ≈ f̂(X)

▶ Two main approaches:
  ▶ Parametric: we make an assumption about the functional form of f (e.g., linear) and only estimate the parameters of that functional form (e.g., β0 and β1).
  ▶ Non-parametric: we do not make explicit assumptions about the functional form of f:
    ⇒ we try to find an f̂ that gets as close to the data points as possible without being too rough or wiggly (example: k-nearest neighbors, see the sketch below).
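A minimal sketch of both approaches on simulated data where the true f(x) = sin(x) is known. The non-parametric fit uses knn.reg() from the FNN package, and k = 10 is an arbitrary choice:

    library(FNN)  # for knn.reg(); install.packages("FNN") if needed

    set.seed(1)
    x <- runif(100, 0, 10)
    y <- sin(x) + rnorm(100, sd = 0.3)      # Y = f(X) + noise

    # Parametric: assume a linear form and estimate beta0, beta1.
    fit_lm <- lm(y ~ x)

    # Non-parametric: k-nearest neighbors, no assumed functional form.
    grid <- seq(0, 10, by = 0.1)
    fit_knn <- knn.reg(train = matrix(x), test = matrix(grid), y = y, k = 10)

    plot(x, y)
    abline(fit_lm, col = "blue")            # rigid linear fit
    lines(grid, fit_knn$pred, col = "red")  # flexible kNN fit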

Assessing model accuracy

Assessing model fit and prediction accuracy
▶ In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data.
▶ In regression problems, the most commonly used measure is the mean squared error (MSE):

  MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − f̂(xᵢ))²

▶ In classification problems we look at the confusion matrix of true vs. predicted responses (see Unit 2).
▶ Idea: choose the learning method f̂ which minimizes the MSE.
▶ However, there is a risk: the method may learn the seen data too well, but then fail to predict the unseen data
  ⇒ the problem of overfitting!
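In R, the training MSE of a fitted model is a one-liner. A minimal sketch on simulated data:

    set.seed(1)
    x <- runif(100, 0, 10)
    y <- sin(x) + rnorm(100, sd = 0.3)

    fit <- lm(y ~ x)
    mean((y - fitted(fit))^2)   # training MSE: average squared residual
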
Overfitting

Problem of overfitting I

▶ Model:

  Y = f(X) + ϵ

▶ Based on data (Y, X), we want to learn f in order to make good predictions Ŷ:

  Ŷ = f̂(X)

▶ f̂ is a good estimate of f if (Y − Ŷ) is small.
▶ BUT: it is difficult to disentangle f(X) from the error ϵ, as we only observe Y!
▶ If we choose a very flexible f̂, it starts learning the errors ϵ which are specific to this data set, rather than an overall pattern.
▶ Overfitting gives bad predictions for new data!

’Unseen’ data?

▶ Why do we care about the predictions for new, unseen data?
▶ Suppose that we are interested in developing an algorithm to predict a stock's price based on previous stock returns:
  ▶ We can train the method using stock returns from the past 6 months.
  ▶ But we don't really care how well our method predicts, e.g., last week's stock price.
  ▶ We instead care about how well it will predict tomorrow's price or next month's price.
▶ Problem: how can we get 'new' data to evaluate the prediction performance of a method?

Training and test MSE

Remedy:
▶ We split the data set into two parts: the training data and the test data.
▶ We distinguish between the MSE in the training data and the MSE in the test data.
▶ We estimate f on the training data (training MSE) and use the estimated f̂ to make predictions for the test set (test MSE), as sketched below.
▶ We want to choose the method that gives the lowest test MSE (as opposed to the lowest training MSE).
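A minimal train/test split sketch in R on simulated data; the 70/30 split and the degree-5 polynomial are arbitrary choices for illustration:

    set.seed(1)
    x <- runif(200, 0, 10)
    y <- sin(x) + rnorm(200, sd = 0.3)
    dat <- data.frame(x, y)

    train_idx <- sample(nrow(dat), size = 0.7 * nrow(dat))  # 70% training
    train <- dat[train_idx, ]
    test  <- dat[-train_idx, ]

    fit <- lm(y ~ poly(x, 5), data = train)

    train_mse <- mean((train$y - predict(fit, train))^2)
    test_mse  <- mean((test$y - predict(fit, test))^2)
    c(train = train_mse, test = test_mse)  # the test MSE is what counts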

Training and test MSE
▶ We did an experiment: we split a data set into two parts, the training data and the test data.
▶ We estimated f on the training data (for 3 different degrees of flexibility) and used the estimated f̂ to make predictions for the test set.
▶ Plots: left: data points and fitted curves on the training data; right: MSE of the fitted curves (test MSE in red, training MSE in gray).
The bias-variance trade-off
▶ The U-shaped curve observed in the test MSE turns out to be the result of two competing properties of a learning method:

  Expected test MSE = Var(f̂) + [Bias(f̂)]² + Var(ϵ)

▶ Bias(f̂): the error that is introduced by approximating a real-life problem by a simpler f.
▶ Var(f̂): the amount by which f̂ would change if we estimated it on a different data set.
▶ Simpler models may contain bias, but have lower variance.
▶ Flexible models usually have low bias but high variance: changing the data slightly might cause the results to change drastically (see the simulation sketch below).
▶ No free lunch: no one method dominates all others over all possible data sets.
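To see the trade-off numerically, a small simulation sketch: a rigid and a very flexible model (a degree-15 polynomial, as a stand-in) are refit on many fresh data sets, and we compare their predictions at a single point x0 = 5. All choices here are illustrative:

    set.seed(1)
    x0 <- 5
    preds <- replicate(500, {
      x <- runif(100, 0, 10)
      y <- sin(x) + rnorm(100, sd = 0.3)        # true f(x) = sin(x)
      nd <- data.frame(x = x0)
      c(rigid    = unname(predict(lm(y ~ x), nd)),
        flexible = unname(predict(lm(y ~ poly(x, 15)), nd)))
    })
    apply(preds, 1, var)       # variance: far higher for the flexible fit
    rowMeans(preds) - sin(x0)  # bias at x0: larger for the rigid fit
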
How to compute the test MSE?
▶ How can we measure the prediction accuracy of a method on unseen data?
▶ Solution: split the data into two or more parts.
▶ A variety of approaches:
  ▶ Train-test split: split the data randomly into a training and a test sample and compute the MSE on the test sample (can lead to high bias if we have limited data, as information not represented in the training data is missed).
  ▶ K-fold cross-validation: split the entire data randomly into K folds; fit the model using K − 1 folds and evaluate it on the remaining fold; repeat this process until every fold has served as the test set (see the sketch below).
  ▶ Validation set approach: split the data into 3 parts: training, validation, and test set. Use the validation set to tune f̂. Use the test set only for testing.
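A minimal K-fold cross-validation sketch in R (K = 5), written out by hand on simulated data; in practice, packages such as caret or rsample automate this:

    set.seed(1)
    x <- runif(200, 0, 10)
    y <- sin(x) + rnorm(200, sd = 0.3)
    dat <- data.frame(x, y)

    K <- 5
    fold <- sample(rep(1:K, length.out = nrow(dat)))   # random fold labels

    cv_mse <- sapply(1:K, function(k) {
      train <- dat[fold != k, ]                        # fit on K - 1 folds
      test  <- dat[fold == k, ]                        # hold out fold k
      fit <- lm(y ~ poly(x, 5), data = train)
      mean((test$y - predict(fit, test))^2)
    })
    mean(cv_mse)   # cross-validated estimate of the test MSE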
