Regression and Generalization
M. Soleymani
Fall 2016
Topics
Beyond linear regression models
Evaluation & model selection
Regularization
Probabilistic perspective for the regression problem
Recall: Linear regression (squared loss)
Linear regression functions
$f: \mathbb{R} \to \mathbb{R}$: $\; f(x; \mathbf{w}) = w_0 + w_1 x$
$f: \mathbb{R}^d \to \mathbb{R}$: $\; f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_d x_d$
$\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$ are the parameters we need to set.
We obtain $\mathbf{w} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y}$.
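As a rough illustration (not part of the original slides), the closed-form solution above can be computed directly with NumPy; the small dataset below is a made-up placeholder.

```python
import numpy as np

# Toy data (made up for illustration): n = 5 samples with d = 2 features.
X_raw = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0], [5.0, 2.5]])
y = np.array([3.1, 3.9, 6.2, 9.8, 11.0])

# Prepend a column of ones so that w[0] plays the role of the intercept w_0.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Normal equations: w = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)      # [w_0, w_1, w_2]
print(X @ w)  # fitted values f(x^(i); w)
```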
Beyond linear regression
How can we extend linear regression to non-linear functions?
Transform the data using basis functions.
Learn a linear regression model on the new feature vectors (obtained from the basis functions).
Beyond linear regression
$m$-th order polynomial regression (univariate, $f: \mathbb{R} \to \mathbb{R}$):
$$f(x; \mathbf{w}) = w_0 + w_1 x + \ldots + w_{m-1} x^{m-1} + w_m x^m$$
Solution: $\mathbf{w} = \left( \mathbf{X}'^T \mathbf{X}' \right)^{-1} \mathbf{X}'^T \mathbf{y}$
$$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \qquad \mathbf{X}' = \begin{bmatrix} 1 & x^{(1)} & (x^{(1)})^2 & \cdots & (x^{(1)})^m \\ 1 & x^{(2)} & (x^{(2)})^2 & \cdots & (x^{(2)})^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x^{(n)} & (x^{(n)})^2 & \cdots & (x^{(n)})^m \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}$$
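A minimal sketch (assumed toy data, not the slide's dataset) of the $m$-th order polynomial fit using the matrix $\mathbf{X}'$ above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 3                                                 # sample size, polynomial order
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)     # noisy toy targets

# Build X' with rows [1, x, x^2, ..., x^m].
X_prime = np.vander(x, N=m + 1, increasing=True)

# w = (X'^T X')^{-1} X'^T y via the normal equations.
w = np.linalg.solve(X_prime.T @ X_prime, X_prime.T @ y)

def f(x_new, w):
    """Evaluate the fitted polynomial at new inputs."""
    return np.vander(np.atleast_1d(x_new), N=len(w), increasing=True) @ w

print(w)
print(f(0.25, w))
```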
Polynomial regression: example
[Figure: polynomial fits of degree $m = 1, 3, 5, 7$ to the same dataset]
Generalized linear
A linear combination of fixed non-linear functions of the input vector:
$$f(\mathbf{x}; \mathbf{w}) = w_0 + \sum_{j=1}^{m} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}), \qquad \boldsymbol{\phi}(\mathbf{x}) = [1, \phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})]^T$$
Basis functions: examples
Linear: $\phi_j(\mathbf{x}) = x_j$
Polynomial (univariate): $\phi_j(x) = x^j$
Basis functions: examples
Gaussian: $\phi_j(\mathbf{x}) = \exp\left( -\dfrac{\|\mathbf{x} - \mathbf{c}_j\|^2}{2\sigma_j^2} \right)$
Sigmoid: $\phi_j(\mathbf{x}) = \sigma\left( \dfrac{\|\mathbf{x} - \mathbf{c}_j\|}{\sigma_j} \right)$, where $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$
Radial Basis Functions: prototypes
Predictions based on similarity to “prototypes”:
$$\phi_j(\mathbf{x}) = \exp\left( -\frac{1}{2\sigma_j^2} \|\mathbf{x} - \mathbf{c}_j\|^2 \right)$$
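A small sketch (my own, not from the slides) of these Gaussian basis functions; each $\phi_j$ measures the similarity of $\mathbf{x}$ to a prototype $\mathbf{c}_j$ (a single shared width $\sigma$ is used here instead of per-basis $\sigma_j$):

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """phi_j(x) = exp(-||x - c_j||^2 / (2 sigma^2)) for each prototype c_j.

    X: (n, d) inputs, centers: (m, d) prototypes, sigma: scalar width.
    Returns an (n, m) matrix of basis-function values.
    """
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, m)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])    # toy inputs
centers = np.array([[0.0, 0.0], [2.0, 1.0]])          # toy prototypes
print(rbf_features(X, centers, sigma=1.0))
```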
Generalized linear: optimization
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2 = \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(i)}) \right)^2$$
$$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \qquad \boldsymbol{\Phi} = \begin{bmatrix} 1 & \phi_1(\mathbf{x}^{(1)}) & \cdots & \phi_m(\mathbf{x}^{(1)}) \\ 1 & \phi_1(\mathbf{x}^{(2)}) & \cdots & \phi_m(\mathbf{x}^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(\mathbf{x}^{(n)}) & \cdots & \phi_m(\mathbf{x}^{(n)}) \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}$$
$$\mathbf{w} = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y}$$
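Putting the pieces together, a hedged sketch of fitting $\mathbf{w} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{y}$ with Gaussian basis functions on assumed 1-D toy data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)   # toy targets

centers = np.linspace(0.0, 1.0, 5)   # prototypes c_j (assumed)
sigma = 0.2

# Design matrix Phi with a leading column of ones (for w_0).
Phi = np.hstack([np.ones((x.size, 1)),
                 np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))])

# Least squares; lstsq is numerically safer than forming the inverse explicitly.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.mean((y - Phi @ w) ** 2))   # training MSE
```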
Model complexity and overfitting
With limited training data, models may achieve zero
training error but a large test error.
Training (empirical) loss: $\dfrac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \boldsymbol{\theta}) \right)^2 \approx 0$
Expected (test) loss: $\mathbb{E}_{\mathbf{x}, y}\left[ \left( y - f(\mathbf{x}; \boldsymbol{\theta}) \right)^2 \right] \gg 0$
Polynomial regression
[Figure (Bishop): polynomial fits with $m = 0, 1, 3, 9$ to the same data]
Polynomial regression: training and test error
$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \boldsymbol{\theta}) \right)^2}{n}}$$
[Figure (Bishop): training and test RMSE as functions of the polynomial order $m$]
Over-fitting causes
Model complexity
E.g., a model with a large number of parameters (degrees of freedom)
Model complexity
Example:
Polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
[Figure (Bishop): polynomial fits with $m = 0, 1, 3, 9$]
Number of training data & overfitting
The over-fitting problem becomes less severe as the size of the training data increases.
[Figure (Bishop): $m = 9$ polynomial fits with $n = 15$ and $n = 100$ training points]
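To make this concrete, a toy simulation (my own, assuming a noisy sinusoidal target) fits an $m = 9$ polynomial with $n = 15$ and $n = 100$ samples and compares the test RMSE:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    """Sample n noisy points from an assumed sinusoidal target."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
    return x, y

x_test, y_test = make_data(1000)
for n in (15, 100):
    x_tr, y_tr = make_data(n)
    coeffs = np.polyfit(x_tr, y_tr, deg=9)          # m = 9 polynomial fit
    rmse = np.sqrt(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(f"n = {n:4d}  test RMSE = {rmse:.3f}")
```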
How to evaluate the learner’s performance?
Generalization error: true (or expected) error that we
would like to optimize
Evaluation and model selection
Evaluation:
We need to measure how well the learned function can predict the target for unseen examples
Model selection:
Most of the time we need to select among a set of models
Example: polynomials with different degree 𝑚
and thus we need to evaluate these models first
Avoiding over-fitting
Determine a suitable value for model complexity
Simple hold-out method
Cross-validation
Bayesian approach
Simple hold-out: model selection
Steps:
Divide the available training data into a training set and a validation set $v\_set$
Use only the training set to train a set of models
Evaluate each learned model on the validation set
$$J_v(\mathbf{w}) = \frac{1}{|v\_set|} \sum_{i \in v\_set} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
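A rough sketch of the hold-out procedure (toy polynomial models and data assumed): each candidate is trained on the training split and scored with $J_v$ on the validation split.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)   # toy data

# Hold out the last 20 points as the validation set v_set.
x_tr, y_tr = x[:40], y[:40]
x_v, y_v = x[40:], y[40:]

for m in (1, 3, 5, 7):
    coeffs = np.polyfit(x_tr, y_tr, deg=m)                  # train on the training split only
    J_v = np.mean((y_v - np.polyval(coeffs, x_v)) ** 2)     # validation error
    print(f"m = {m}  J_v = {J_v:.3f}")
```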
Simple hold-out:
training, validation, and test sets
Simple hold-out chooses the model that minimizes the error on the validation set.
[Figure: the data split into training, validation, and test sets]
Cross-Validation (CV): Evaluation
𝑘-fold cross-validation steps:
Shuffle the dataset and randomly partition training data into 𝑘 groups of
approximately equal size
for 𝑖 = 1 to 𝑘
Choose the 𝑖-th group as the held-out validation group
Train the model on all but the 𝑖-th group of data
Evaluate the model on the held-out group
The performance scores of the model from the $k$ runs are averaged.
The average error can be considered an estimate of the true performance.
[Figure: the $k$ runs of $k$-fold cross-validation, each holding out a different group]
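A minimal sketch of the steps above (toy polynomial models and data assumed; no libraries beyond NumPy):

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average validation MSE of a degree-`degree` polynomial over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))                  # shuffle the dataset
    folds = np.array_split(idx, k)                 # k groups of roughly equal size
    scores = []
    for i in range(k):
        val = folds[i]                             # i-th group held out
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[tr], y[tr], deg=degree)
        scores.append(np.mean((y[val] - np.polyval(coeffs, x[val])) ** 2))
    return np.mean(scores)                         # average over the k runs

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)   # toy data
for m in (1, 3, 5, 7):
    print(f"m = {m}  CV MSE = {kfold_mse(x, y, m):.3f}")
```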
Cross-Validation (CV): Model Selection
For each model, we first compute the average error found by CV.
Cross-validation: polynomial regression example
[Figure: 5-fold CV (averaged over 100 runs) for polynomial fits with $m = 1, 3, 5, 7$; the reported CV MSE values per panel are 1.45, 0.30, 45.44, and 31759]
Leave-One-Out Cross Validation (LOOCV)
When data is particularly scarce, we can use cross-validation with $k = N$.
Leave-one-out treats each training sample in turn as a test example and all other samples as the training set.
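LOOCV is simply $k$-fold CV with $k = N$; a short sketch (toy polynomial model and data assumed), leaving out one sample per run:

```python
import numpy as np

def loocv_mse(x, y, degree):
    """Leave-one-out CV: each sample serves as the held-out example exactly once."""
    errors = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i              # all samples except the i-th
        coeffs = np.polyfit(x[mask], y[mask], deg=degree)
        errors.append((y[i] - np.polyval(coeffs, x[i])) ** 2)
    return np.mean(errors)

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)   # toy data
print(loocv_mse(x, y, degree=3))
```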
Regularization
Add a penalty term to the cost function to discourage the coefficients from reaching large values.
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(i)}) \right)^2 + \lambda \mathbf{w}^T \mathbf{w}$$
$$\mathbf{w} = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y}$$
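A hedged sketch of this regularized closed-form solution with a polynomial $\boldsymbol{\Phi}$ (toy data assumed); note that, as on the slide, the penalty here also shrinks $w_0$:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 1.0, 15)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)   # toy data

m, lam = 9, 1e-3
Phi = np.vander(x, N=m + 1, increasing=True)       # rows [1, x, ..., x^m]

# w = (Phi^T Phi + lambda I)^{-1} Phi^T y
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m + 1), Phi.T @ y)
print(np.round(w, 3))
```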
Polynomial order
Polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
The magnitude of the coefficients typically gets larger as $m$ increases.
[Bishop]
Regularization parameter
[Table (Bishop): the coefficients $w_0, \ldots, w_9$ of the $m = 9$ fit for different values of the regularization parameter $\lambda$]
Regularization parameter
Generalization: $\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting.
[Bishop]
Choosing the regularization parameter
Train a set of models with different values of $\lambda$.
Select the model with the best (lowest) $J_v(\mathbf{w})$ (or $J_{cv}(\mathbf{w})$).
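A sketch of this selection loop (assumed toy data and a hypothetical grid of $\lambda$ values): fit each candidate on the training split and keep the one with the smallest validation error $J_v$.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)   # toy data
x_tr, y_tr, x_v, y_v = x[:40], y[:40], x[40:], y[40:]

m = 9
Phi_tr = np.vander(x_tr, N=m + 1, increasing=True)
Phi_v = np.vander(x_v, N=m + 1, increasing=True)

best_lam, best_Jv = None, np.inf
for lam in (0.0, 1e-6, 1e-4, 1e-2, 1.0):
    w = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(m + 1), Phi_tr.T @ y_tr)
    J_v = np.mean((y_v - Phi_v @ w) ** 2)
    print(f"lambda = {lam:g}  J_v = {J_v:.3f}")
    if J_v < best_Jv:
        best_lam, best_Jv = lam, J_v
print("selected lambda:", best_lam)
```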
The approximation-generalization trade-off
Complexity of Hypothesis Space: Example
[Figure: house price vs. size data fitted with $w_0 + w_1 x$, $w_0 + w_1 x + w_2 x^2$, and $w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$]
This example has been adapted from Prof. Andrew Ng's slides.
Complexity of Hypothesis Space: Example
$$J_v(\mathbf{w}) = \frac{1}{n_v} \sum_{i \in val\_set} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
$$J_{train}(\mathbf{w}) = \frac{1}{n_{train}} \sum_{i \in train\_set} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
[Figure: $J_{train}$ and $J_v$ (error) as functions of the degree of the polynomial $m$]
Complexity of Hypothesis Space
Less complex $\mathcal{H}$:
$J_{train}(\mathbf{w}) \approx J_v(\mathbf{w})$ and $J_{train}(\mathbf{w})$ is very high
More complex $\mathcal{H}$:
$J_{train}(\mathbf{w}) \ll J_v(\mathbf{w})$ and $J_{train}(\mathbf{w})$ is low
[Figure: $J_{train}(\mathbf{w})$ and $J_v(\mathbf{w})$ (error) as functions of the degree of the polynomial $m$]
Size of training set
Model: $f(x; \mathbf{w}) = w_0 + w_1 x + w_2 x^2$
$$J_v(\mathbf{w}) = \frac{1}{n_v} \sum_{i \in val\_set} \left( y^{(i)} - f(x^{(i)}; \mathbf{w}) \right)^2$$
$$J_{train}(\mathbf{w}) = \frac{1}{n_{train}} \sum_{i \in train\_set} \left( y^{(i)} - f(x^{(i)}; \mathbf{w}) \right)^2$$
[Figure: $J_{train}$ and $J_v$ as functions of the training set size $n$]
This slide has been adapted from Prof. Andrew Ng's slides.
Less complex ℋ
$f(x; \mathbf{w}) = w_0 + w_1 x$
[Figure: price vs. size fit with this simple model, and learning curves of $J_{train}$ and $J_v$ vs. training set size $n$; both converge to a high error]
If the model is very simple, getting more training data will not (by itself) help much.
This slide has been adapted from Prof. Andrew Ng's slides.
More complex $\mathcal{H}$
$f(x; \mathbf{w}) = w_0 + w_1 x + \cdots + w_{10} x^{10}$
[Figure: price vs. size fit with this complex model, and learning curves of $J_{train}$ and $J_v$ vs. training set size, with a gap between $J_v$ and $J_{train}$]
For more complex models, getting more training data usually helps.
This slide has been adapted from Prof. Andrew Ng's slides.
Regularization: Example
$$f(x; \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$$
$$J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}; \mathbf{w}) \right)^2 + \lambda \mathbf{w}^T \mathbf{w}$$
[Figure: price vs. size fits for large $\lambda$ (prefers simpler models; $w_1 = w_2 \approx 0$), intermediate $\lambda$, and small $\lambda$ ($\lambda = 0$; prefers more complex models)]
This example has been adapted from Prof. Andrew Ng's slides.
Model complexity: Bias-variance trade-off
Least squares can lead to severe over-fitting if complex models are trained using data sets of limited size.
Formal discussion on bias, variance, and noise
Noise
The learning diagram: deterministic target
[Figure (Abu-Mostafa et al.): the learning diagram with a deterministic target, showing $h: \mathcal{X} \to \mathcal{Y}$, $f: \mathcal{X} \to \mathcal{Y}$, the inputs $x^{(1)}, \ldots, x^{(N)}$, and the training examples $(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})$]
The learning diagram including noisy target
[Figure (Abu-Mostafa et al.): the learning diagram including a noisy target; examples are drawn from $P(\mathbf{x}, y) = P(\mathbf{x})\,P(y|\mathbf{x})$, the distribution on features times the target distribution]
Best unrestricted regression function
What if we know the joint distribution $P(\mathbf{x}, y)$ and place no constraints on the regression function?
Cost function: mean squared error
$$h^* = \underset{h: \mathbb{R}^d \to \mathbb{R}}{\arg\min}\; \mathbb{E}_{\mathbf{x}, y}\left[ \left( y - h(\mathbf{x}) \right)^2 \right]$$
$$h^*(\mathbf{x}) = \mathbb{E}_{y|\mathbf{x}}[y]$$
Best unrestricted regression function: Proof
$$\mathbb{E}_{\mathbf{x}, y}\left[ \left( y - h(\mathbf{x}) \right)^2 \right] = \int \int \left( y - h(\mathbf{x}) \right)^2 p(\mathbf{x}, y)\, d\mathbf{x}\, dy$$
$$\frac{\delta\, \mathbb{E}_{\mathbf{x}, y}\left[ \left( y - h(\mathbf{x}) \right)^2 \right]}{\delta h(\mathbf{x})} = -2 \int \left( y - h(\mathbf{x}) \right) p(\mathbf{x}, y)\, dy = 0$$
$$\Rightarrow h(\mathbf{x}) = \frac{\int y\, p(\mathbf{x}, y)\, dy}{\int p(\mathbf{x}, y)\, dy} = \frac{\int y\, p(\mathbf{x}, y)\, dy}{p(\mathbf{x})} = \int y\, p(y|\mathbf{x})\, dy = \mathbb{E}_{y|\mathbf{x}}[y]$$
$$\Rightarrow h^*(\mathbf{x}) = \mathbb{E}_{y|\mathbf{x}}[y]$$
Error decomposition
$(\mathbf{x}, y) \sim P$; $h(\mathbf{x})$ minimizes the expected loss.
Expected loss:
$$E_{true}(f_\mathcal{D}) = \mathbb{E}_{\mathbf{x}, y}\left[ \left( f_\mathcal{D}(\mathbf{x}) - y \right)^2 \right] = \mathbb{E}_{\mathbf{x}, y}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) + h(\mathbf{x}) - y \right)^2 \right]$$
$$= \mathbb{E}_{\mathbf{x}}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] + \mathbb{E}_{\mathbf{x}, y}\left[ \left( h(\mathbf{x}) - y \right)^2 \right] + 2\, \mathbb{E}_{\mathbf{x}, y}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right) \left( h(\mathbf{x}) - y \right) \right]$$
The cross term vanishes: $\mathbb{E}_{\mathbf{x}, y}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right) \left( h(\mathbf{x}) - y \right) \right] = \mathbb{E}_{\mathbf{x}}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right) \mathbb{E}_{y|\mathbf{x}}\left[ h(\mathbf{x}) - y \right] \right] = 0$.
Error decomposition (cont.)
Since the cross term is zero,
$$E_{true}(f_\mathcal{D}) = \mathbb{E}_{\mathbf{x}}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] + \underbrace{\mathbb{E}_{\mathbf{x}, y}\left[ \left( h(\mathbf{x}) - y \right)^2 \right]}_{\text{noise}}$$
Expectation of true error
$$E_{true}(f_\mathcal{D}) = \mathbb{E}_{\mathbf{x}, y}\left[ \left( f_\mathcal{D}(\mathbf{x}) - y \right)^2 \right] = \mathbb{E}_{\mathbf{x}}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] + \text{noise}$$
Taking the expectation over training sets $\mathcal{D}$:
$$\mathbb{E}_\mathcal{D}\left[ \mathbb{E}_{\mathbf{x}}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] \right] = \mathbb{E}_{\mathbf{x}}\left[ \mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] \right]$$
We now focus on $\mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right]$.
The average hypothesis
$$\bar{f}(\mathbf{x}) \equiv \mathbb{E}_\mathcal{D}\left[ f_\mathcal{D}(\mathbf{x}) \right]$$
In practice it can be approximated by training on $K$ different datasets:
$$\bar{f}(\mathbf{x}) \approx \frac{1}{K} \sum_{k=1}^{K} f_{\mathcal{D}^{(k)}}(\mathbf{x})$$
Using the average hypothesis
$$\mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] = \mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - \bar{f}(\mathbf{x}) + \bar{f}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right]$$
$$= \mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - \bar{f}(\mathbf{x}) \right)^2 \right] + \left( \bar{f}(\mathbf{x}) - h(\mathbf{x}) \right)^2$$
Bias and variance
$$\mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] = \underbrace{\mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - \bar{f}(\mathbf{x}) \right)^2 \right]}_{\text{var}(\mathbf{x})} + \underbrace{\left( \bar{f}(\mathbf{x}) - h(\mathbf{x}) \right)^2}_{\text{bias}(\mathbf{x})}$$
$$\mathbb{E}_{\mathbf{x}}\left[ \mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] \right] = \mathbb{E}_{\mathbf{x}}\left[ \text{var}(\mathbf{x}) + \text{bias}(\mathbf{x}) \right] = \text{var} + \text{bias}$$
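A simulation sketch (my own, with an assumed noise-free sinusoidal target $h$) that estimates $\bar{f}$, var, and bias by repeatedly drawing datasets $\mathcal{D}$ and refitting; as on the slides, "bias" here denotes the squared bias:

```python
import numpy as np

rng = np.random.default_rng(8)

def h(x):
    return np.sin(2 * np.pi * x)               # assumed target h(x)

x_grid = np.linspace(0.0, 1.0, 200)            # grid for the E_x average
K, n, m = 500, 10, 3                           # number of datasets, samples per dataset, degree

fits = np.empty((K, x_grid.size))
for k in range(K):
    x = rng.uniform(0.0, 1.0, n)               # a fresh dataset D^(k)
    y = h(x) + 0.3 * rng.standard_normal(n)
    coeffs = np.polyfit(x, y, deg=m)
    fits[k] = np.polyval(coeffs, x_grid)       # f_D(x) on the grid

f_bar = fits.mean(axis=0)                      # average hypothesis f_bar(x)
var = np.mean((fits - f_bar) ** 2)             # E_x[ E_D[(f_D - f_bar)^2] ]
bias = np.mean((f_bar - h(x_grid)) ** 2)       # E_x[ (f_bar - h)^2 ]
print(f"var = {var:.3f}  bias = {bias:.3f}")
```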
Bias-variance trade-off
$$\text{var} = \mathbb{E}_{\mathbf{x}}\left[ \mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - \bar{f}(\mathbf{x}) \right)^2 \right] \right], \qquad \text{bias} = \mathbb{E}_{\mathbf{x}}\left[ \left( \bar{f}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right]$$
[Figure (Abu-Mostafa et al.): illustration of bias and variance around the target $h$]
Example: sin target
Only two training examples: $N = 2$
Which is better, $\mathcal{H}_0$ (constant functions) or $\mathcal{H}_1$ (lines)?
Learning from a training set
[Figure (Abu-Mostafa et al.): example fits from $\mathcal{H}_0$ and $\mathcal{H}_1$ to a training set of two points]
Variance ℋ0
[Figure (Abu-Mostafa et al.): fits from many datasets and their average $\bar{f}(x)$, illustrating the variance of $\mathcal{H}_0$]
Variance ℋ1
[Figure (Abu-Mostafa et al.): fits from many datasets and their average $\bar{f}(x)$, illustrating the variance of $\mathcal{H}_1$]
Lesson
Expected training and true error curves
Errors vary with the number of training samples
[Figure (Abu-Mostafa et al.): expected $E_{train}$ and $E_{true}$ as functions of the number of training samples]
Regularization: bias and variance
[Figures (Bishop): individual fits $f(x)$ and their average for $\mathcal{H}_1$ when $\lambda$ is large and when $\lambda$ is small]
Learning curves of bias, variance, and noise
[Bishop]
Bias-variance decomposition: summary
The noise term is unavoidable.
The terms we are interested in are bias and variance.
The approximation-generalization trade-off is seen in the
bias-variance decomposition.
Resources
C. Bishop, "Pattern Recognition and Machine Learning", Chapters 1.1, 1.3, 3.1, 3.2.
Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, "Learning from Data", Chapters 2.3, 3.2, 3.4.