00_Introduction
A brief introduction
Gertraud Malsiner-Walli
Framework
Overfitting
▶ Regression:
▶ Y . . . sales
▶ X1, X2, X3 . . . advertising budgets in TV, radio and newspaper
▶ Classification:
▶ Y . . . clicking on an ad (no, yes)
▶ X1, X2 . . . age, time stamp
[Figure: left, scatter plot of Sales vs. TV advertising budget; right, Clicked.on.Ad (0/1) vs. Age.]
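The two supervised settings above can be sketched in code. This is a minimal illustration on synthetic data (the true coefficients, sample sizes, and the click threshold at age 45 are assumptions for the example, not the course's data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Regression: synthetic sales generated from a TV advertising budget
# (true intercept 7 and slope 0.05 are assumed for illustration)
tv = rng.uniform(0, 300, size=(200, 1))
sales = 7 + 0.05 * tv[:, 0] + rng.normal(0, 1, 200)
reg = LinearRegression().fit(tv, sales)

# Classification: synthetic click indicator driven by age
# (users older than ~45 tend to click, by assumption)
age = rng.uniform(18, 70, size=(200, 1))
clicked = (age[:, 0] + rng.normal(0, 5, 200) > 45).astype(int)
clf = LogisticRegression().fit(age, clicked)

print(reg.coef_[0])            # estimated slope, close to the true 0.05
print(clf.predict([[25.0]]))   # predicted click for a young user
print(clf.predict([[65.0]]))   # predicted click for an older user
```

In both cases the method learns a mapping from inputs X to the target Y; only the type of Y (quantitative vs. categorical) differs.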
▶ Unsupervised learning:
▶ There is no target variable Y and no feedback based on the
prediction results.
▶ It is used to describe real-world events and to discover latent
relationships responsible for them (relationships between
variables, relationships between observations).
▶ Examples: clustering, dimensionality reduction.
[Figure: iris data, scatter plots of Sepal.Length vs. Petal.Length and Sepal.Length vs. Sepal.Width.]
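Clustering, the first example of unsupervised learning named above, can be sketched on the iris data (the choice of k-means with three clusters is an assumption for illustration; the slides do not specify an algorithm):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Features: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
X = load_iris().data

# Group the 150 observations into 3 clusters without using any target Y
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])  # cluster assignment of the first 10 flowers
```

Note that no target variable enters the fit: the algorithm only looks for structure among the observations themselves.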
[Figure: scatter plots of Sales vs. TV advertising budget.]
▶ Prediction purposes:
▶ A good estimate f̂ can serve as a good basis for predictions.
▶ Inference purposes:
▶ We can use f to understand which inputs X1 , . . . , Xp are
important in explaining Y .
▶ We can use f to understand how the different inputs X1 , . . . , Xp
affect Y .
▶ Different methods serve these two purposes differently
⇒ the degree of flexibility is crucial.
▶ Model:
Y = f(X) + ϵ
▶ Prediction: Y ≈ f̂(X)
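The model Y = f(X) + ϵ can be made concrete with a small simulation. Here the true f is taken to be the sine function and f̂ is a cubic polynomial fit; both choices are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Generate data from Y = f(X) + eps with f(x) = sin(x) and Gaussian noise
x = rng.uniform(0, 6, 100)
y = np.sin(x) + rng.normal(0, 0.2, 100)

# Estimate f with a cubic polynomial via least squares
f_hat = np.poly1d(np.polyfit(x, y, deg=3))

# Prediction at a new input: Y ≈ f̂(X)
print(f_hat(1.5))  # compare with the true f(1.5) = sin(1.5)
```

The fitted f̂ tracks the unknown f up to the irreducible noise ϵ; the better the estimate, the closer f̂(x) is to f(x).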
Remedy:
▶ We split the data set into two parts: the training data and the
test data.
▶ We distinguish between the MSE in the training data and the
MSE in the test data.
▶ We estimate f on the training data (training MSE) and use the
estimated f̂ to make predictions for the test set (test MSE).
▶ We want to choose the method that gives the lowest test
MSE (as opposed to the lowest training MSE).
▶ Plots: left: points + fitted curves for the training data. Right:
MSE for fitted curves (test MSE in red, train MSE in gray).
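The train/test comparison above can be sketched as follows. Polynomial fits of increasing degree stand in for methods of increasing flexibility (the sine ground truth, noise level, and degrees are assumptions for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
x = rng.uniform(0, 6, 300)
y = np.sin(x) + rng.normal(0, 0.3, 300)

# Split the data set into training and test parts
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

# Training vs. test MSE for fits of increasing flexibility
train_mse, test_mse = {}, {}
for deg in (1, 3, 8):
    f_hat = np.poly1d(np.polyfit(x_tr, y_tr, deg))
    train_mse[deg] = mean_squared_error(y_tr, f_hat(x_tr))
    test_mse[deg] = mean_squared_error(y_te, f_hat(x_te))
    print(deg, train_mse[deg], test_mse[deg])
```

The training MSE can only decrease as the degree grows, while the test MSE typically falls and then rises again; this is why the method is chosen by the lowest test MSE, not the lowest training MSE.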
Gertraud Malsiner-Walli, Statistical learning
The bias-variance trade-off
▶ The U-shaped curve observed in the test MSE turns out to be
the result of two competing properties of a learning method: