Lecture 2: Supervised Learning
This implies the existence of a "teacher" who knows the right answers.
$h : X_1 \times X_2 \times \cdots \times X_n \to Y$
$h$ is called a hypothesis.
Evaluation = testing
Example: What Model or Hypothesis Space H?
• Training examples: $e_i = \langle x_i, y_i \rangle$
What Error Function? What Algorithm?
We want to find the weight vector $w = (w_0, \ldots, w_n)$ such that $h_w(x_i) \approx y_i$.
We should define an error function that measures the difference between the predictions and the true answers, and then pick w such that this error is minimized.
Sum-of-squares error function:
$$J(w) = \frac{1}{2} \sum_{i=1}^{m} \big[h_w(x_i) - y_i\big]^2$$
Compute w such that J(w) is minimal, that is, such that:
$$\frac{\partial J(w)}{\partial w_j} = 0, \qquad j = 0, \ldots, n$$
A learning algorithm that finds w: the Least Mean Squares (LMS) method, sketched below.
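As a minimal illustrative sketch (not the slides' own code), here is a batch gradient-descent LMS update in NumPy that minimizes J(w); the learning rate and iteration count are arbitrary choices for this toy data.

```python
import numpy as np

def lms_fit(X, y, alpha=0.5, iters=5000):
    """Minimize J(w) = 1/2 * sum((h_w(x_i) - y_i)^2) by batch gradient descent."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])  # prepend a column of 1's for w0
    w = np.zeros(n + 1)
    for _ in range(iters):
        residual = Xb @ w - y             # h_w(x_i) - y_i for all i
        grad = Xb.T @ residual            # gradient of J(w)
        w -= (alpha / m) * grad           # step against the (mean) gradient
    return w

# Toy usage: recover y = 1 + 2x from noisy samples
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = 1 + 2 * X[:, 0] + rng.normal(0, 0.05, size=50)
print(lms_fit(X, y))  # approximately [1.0, 2.0]
```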
Some Linear Algebra - The Solution!
[Slides 8-10: matrix-form derivation of the least-squares solution]
Example of Linear Regression - Data Matrices
[Slides 11-14: worked example building the data matrices, computing $X^T X$ and $X^T Y$, and solving for w to obtain the regression curve]
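A minimal NumPy sketch of the computation those slides illustrate, on invented toy data: form $X^T X$ and $X^T Y$, then solve for w.

```python
import numpy as np

# Toy data: five (x, y) points roughly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

X = np.column_stack([np.ones_like(x), x])  # data matrix with a column of 1's
XtX = X.T @ X                              # the X^T X matrix
XtY = X.T @ y                              # the X^T Y vector
w = np.linalg.solve(XtX, XtY)              # solves (X^T X) w = X^T Y
print(w)                                   # [w0, w1], close to [1, 2]
```

Using `np.linalg.solve` rather than an explicit inverse is numerically preferable, and it fails loudly exactly when $X^T X$ is singular, the problem case noted in the summary below.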
Linear Regression - Summary
The optimal solution can be computed in polynomial time in the size of the data set.
The solution is $w = (X^T X)^{-1} X^T Y$, where
  X is the data matrix, augmented with a column of 1's
  Y is the column vector of target outputs
This is a very rare case in which an exact analytical solution is possible: nice math, a closed-form formula, and a unique global optimum.
Problems arise when $X^T X$ does not have an inverse.
Linear regression is too simple for most real-valued problems. Possible remedies:
1. Include higher-order terms in $h_w$
2. Transform the input X to some other space X′ and apply linear regression on X′ (sketched below)
3. Use a different, more powerful hypothesis representation
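A minimal sketch of remedy 2, assuming a scalar input and a polynomial transform as the new space X′; `poly_features` is an illustrative helper, not a name from the slides.

```python
import numpy as np

def poly_features(x, d):
    """Map scalar inputs x to the transformed space X' = (1, x, x^2, ..., x^d)."""
    return np.vander(x, N=d + 1, increasing=True)

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = x**2 - x + 1                          # a quadratic target
Xp = poly_features(x, d=2)                # transform, then ordinary linear regression
w = np.linalg.solve(Xp.T @ Xp, Xp.T @ y)
print(w)  # approximately [1, -1, 1]: linear regression in X' fits a quadratic in x
```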
Is Linear Regression enough?
Generalization Ability vs Overfitting
A very important issue for any machine learning algorithm: can it predict the correct target y for an unseen x?
A hypothesis may predict perfectly for all known x's yet fail on unseen x's; this is called overfitting.
Each hypothesis h has an unknown true error on the universe, $J_U(h)$, but we can only measure the empirical error on the training set, $J_D(h)$.
Let h1 and h2 be two hypotheses compared on the training set D, such that we obtained $J_D(h_1) < J_D(h_2)$. If h2 is "truly" better, that is, $J_U(h_2) < J_U(h_1)$, then the algorithm is overfitting and won't generalize to unseen data.
We are not interested in memorizing the training set. In our examples, the highest-degree-d hypotheses overfit (i.e., memorize) the data.
Overfitting = low error on the training data but high error on unseen data.
Assume D is drawn from some unknown probability distribution. Given the universe U of data, we want to learn a hypothesis h from the training set $D \subset U$ minimizing the error on the unseen data $U \setminus D$.
Every h has a true error $J_U(h)$ on U, which is the expected error when the data is drawn from the distribution. We can only measure the empirical error $J_D(h)$ on D; we do not have U.
Then how can we estimate the true error $J_U(h)$ from D? Apply a cross-validation method on D.
Determining the hypothesis h which generalizes best is called model selection. The sketch below illustrates overfitting directly.
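A small illustrative experiment, assuming a sinusoidal target and polynomial hypotheses (choices invented here, not from the slides): as the degree grows, the training error falls while the error on unseen data blows up.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)                  # "unknown" target function
x_tr = rng.uniform(0, 1, 10); y_tr = f(x_tr) + rng.normal(0, 0.2, 10)
x_te = rng.uniform(0, 1, 200); y_te = f(x_te) + rng.normal(0, 0.2, 200)

for d in (1, 3, 9):
    w = np.polyfit(x_tr, y_tr, d)                    # degree-d hypothesis h_w
    J_D = np.mean((np.polyval(w, x_tr) - y_tr)**2)   # empirical error on D
    J_U = np.mean((np.polyval(w, x_te) - y_te)**2)   # estimate of the true error
    print(f"d={d}: train error {J_D:.3f}, unseen error {J_U:.3f}")
```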
Avoiding Overfitting
[Figure: training-set error (blue curve) and test-set error (red curve) as model complexity grows]
Cross-Validation and Model Selection
How do we find the degree d which fits the data D best?
Randomly partition the available data D into three disjoint sets: training set T, validation set V, and test set U. Then:
1. Cross-validation: for each degree d, perform a cross-validation method using the T and V sets to evaluate the goodness of d. (Some cross-validation techniques are discussed later.)
2. Model selection: given the best d found in step 1, find $h_{w,d}$ using the T and V sets, and report the prediction error of $h_{w,d}$ on the test set U. (Some model selection approaches are discussed later.)
The prediction error on U is an unbiased estimate of the true error. A sketch of this protocol follows.
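A minimal sketch of the three-way-split protocol, using a single train/validation split in place of the fuller cross-validation discussed next; the 60/20/20 proportions and the `error` helper are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100); y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 100)

idx = rng.permutation(100)
T, V, U = idx[:60], idx[60:80], idx[80:]     # disjoint training/validation/test sets

def error(w, i):                             # mean squared error on subset i
    return np.mean((np.polyval(w, x[i]) - y[i])**2)

# Step 1: evaluate each degree d, training on T and validating on V
scores = {d: error(np.polyfit(x[T], y[T], d), V) for d in range(1, 8)}
d_best = min(scores, key=scores.get)

# Step 2: refit using T and V together, report unbiased error on the untouched U
TV = np.concatenate([T, V])
w = np.polyfit(x[TV], y[TV], d_best)
print(d_best, error(w, U))
```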
Leave-One-Out Cross-Validation
For each degree d do:
  1. for i ← 1 to m do:
     1. Validation set $V_i \leftarrow \{e_i = (x_i, y_i)\}$  ; leave the i-th sample out
     2. Training set $T_i \leftarrow D \setminus V_i$
     3. $w_{d,i} \leftarrow$ Train($T_i$, d)  ; optimal $w_{d,i}$ using training set $T_i$
     4. $J(d, i) \leftarrow$ Test($V_i$)  ; validation error of $w_{d,i}$ on $x_i$
        ; J(d, i) is an unbiased estimate of the true prediction error
  2. Average validation error: $J(d) \leftarrow \frac{1}{m} \sum_{i=1}^{m} J(d, i)$
$d^* \leftarrow \arg\min_d J(d)$  ; select the degree d with the lowest average error
; J(d*) is not an unbiased estimate, since all the data was used to find it.
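A minimal NumPy sketch of the procedure above, assuming the same toy polynomial-regression setup as the earlier sketches:

```python
import numpy as np

def loocv_select_degree(x, y, degrees):
    """Leave-one-out CV: pick the degree d with the lowest average validation error."""
    m = len(x)
    avg = {}
    for d in degrees:
        errs = []
        for i in range(m):                        # V_i = {(x_i, y_i)}, T_i = D \ V_i
            mask = np.arange(m) != i
            w = np.polyfit(x[mask], y[mask], d)   # Train(T_i, d)
            errs.append((np.polyval(w, x[i]) - y[i])**2)  # Test(V_i) -> J(d, i)
        avg[d] = np.mean(errs)                    # J(d): average validation error
    return min(avg, key=avg.get), avg             # d* = argmin_d J(d)

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 20); y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)
print(loocv_select_degree(x, y, range(1, 6))[0])
```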
Example: Estimating True Error for d = 1
[Slide 22: LOOCV error estimation illustrated for the degree-1 model]
Example: Estimation Results for All d
The optimal choice is d = 2.
Overfitting occurs for d > 2, with very high validation error for d = 8 and 9.
Model Selection
J(d*) is not unbiased, since it was obtained using all m samples: we chose the hypothesis class d* based on $J(d) = \frac{1}{m} \sum_{i=1}^{m} J(d, i)$.
We want both a hypothesis class and an unbiased estimate of the true error.
If we want to compare different learning algorithms (or different hypotheses), an independent test set U is required in order to decide on the best algorithm or the best hypothesis.
In our case, we are trying to decide which regression model to use: d = 1, or d = 2, …, or d = 11? And which has the best unbiased true-error estimate?
k-Fold Cross-Validation
Partition D into k disjoint subsets of the same size and the same distribution: $P_1, P_2, \ldots, P_k$.
For each degree d do:
  for i ← 1 to k do:
    1. Validation set $V_i \leftarrow P_i$  ; leave $P_i$ out for validation
    2. Training set $T_i \leftarrow D \setminus V_i$
    3. $w_{d,i} \leftarrow$ Train($T_i$, d)  ; train on $T_i$
    4. $J(d, i) \leftarrow$ Test($V_i$)  ; compute the validation error on $V_i$
  Average validation error: $J(d) \leftarrow \frac{1}{k} \sum_{i=1}^{k} J(d, i)$
$d^* \leftarrow \arg\min_d J(d)$  ; return the optimal degree d
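A minimal sketch of k-fold CV in the same toy setting; the fold count and random seed are illustrative.

```python
import numpy as np

def kfold_select_degree(x, y, degrees, k=10, seed=4):
    """k-fold CV: average validation error of each degree over k folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(x)), k)
    avg = {}
    for d in degrees:
        errs = []
        for V in folds:                            # V_i = P_i, T_i = D \ P_i
            T = np.setdiff1d(np.arange(len(x)), V)
            w = np.polyfit(x[T], y[T], d)          # Train(T_i, d)
            errs.append(np.mean((np.polyval(w, x[V]) - y[V])**2))  # Test(V_i)
        avg[d] = np.mean(errs)                     # J(d) = (1/k) sum_i J(d, i)
    return min(avg, key=avg.get)                   # d*

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 50); y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 50)
print(kfold_select_degree(x, y, range(1, 8)))
```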
kCV-Based Model Selection
Partition D into k disjoint subsets $P_1, P_2, \ldots, P_k$.
1. For j ← 1 to k do:
   1. Test set $U_j \leftarrow P_j$  ; leave $P_j$ out for testing
   2. Training-and-validation set $D_j \leftarrow D \setminus U_j$
   3. $d_j^* \leftarrow$ kCV($D_j$)  ; find the best degree d at iteration j
   4. $w_j^* \leftarrow$ Train($D_j$, $d_j^*$)  ; find the associated w using the full data $D_j$
   5. $J(h_j^*) \leftarrow$ Test($U_j$)  ; estimate the unbiased predictive error of $h_j^*$ on $U_j$
2. Performance of the method: $E \leftarrow \frac{1}{k} \sum_{j=1}^{k} J(h_j^*)$
   ; return E as the performance of the learning algorithm
3. Best hypothesis: $h_{best} \leftarrow \arg\min_j J(h_j^*)$  ; several approaches can be used to come up with just one final hypothesis
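A minimal sketch of this nested procedure in the same toy setting; `inner_kcv_best_degree` is an illustrative helper, not a library call.

```python
import numpy as np

def inner_kcv_best_degree(x, y, degrees, k=5):
    """Inner k-fold CV on D_j: returns the degree with the lowest average error."""
    folds = np.array_split(np.arange(len(x)), k)
    def cv_err(d):
        errs = []
        for V in folds:
            T = np.setdiff1d(np.arange(len(x)), V)
            w = np.polyfit(x[T], y[T], d)
            errs.append(np.mean((np.polyval(w, x[V]) - y[V])**2))
        return np.mean(errs)
    return min(degrees, key=cv_err)

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 60); y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 60)

outer = np.array_split(rng.permutation(60), 5)             # P_1 ... P_k
J = []
for U in outer:                                            # U_j = P_j
    Dj = np.setdiff1d(np.arange(60), U)                    # D_j = D \ U_j
    d_j = inner_kcv_best_degree(x[Dj], y[Dj], range(1, 8)) # d_j* = kCV(D_j)
    w_j = np.polyfit(x[Dj], y[Dj], d_j)                    # w_j* = Train(D_j, d_j*)
    J.append(np.mean((np.polyval(w_j, x[U]) - y[U])**2))   # J(h_j*) = Test(U_j)
print(np.mean(J))   # E: the performance estimate of the whole learning procedure
```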
Variations of k-Fold Cross-Validation
LOOCV: k-CV with k = m, i.e., m-fold CV. The best estimate, but very slow on a large D.
k-CV is a good trade-off between the quality of the true-error estimate, speed, and data size. Each sample is used for validation exactly once; usually k = 10.
Learning a Class from Examples
Class C of a “family car”
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a
family car?
Output:
Positive (+) and negative (–) examples
Input representation:
x1: price, x2 : engine power
$$r = \begin{cases} 1 & \text{if } x \text{ is positive} \\ 0 & \text{if } x \text{ is negative} \end{cases} \qquad x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
Class C: $(p_1 \le \text{price} \le p_2)$ AND $(e_1 \le \text{engine power} \le e_2)$
Error of hypothesis $h \in H$ on the training set X:
$$E(h \mid X) = \sum_{t=1}^{N} \mathbf{1}\big(h(x^t) \ne r^t\big)$$
Any $h \in H$ between the most specific hypothesis S and the most general hypothesis G is consistent with the training set, and these consistent hypotheses make up the version space (Mitchell, 1997).
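A small sketch of the 0/1 empirical error for a rectangle hypothesis of the form above; the price/engine-power numbers and rectangle bounds are invented purely for illustration.

```python
import numpy as np

# Invented toy data: columns are (price, engine power); r = 1 for family cars
X = np.array([[14, 150], [18, 180], [25, 300], [10, 90], [16, 160]])
r = np.array([1, 1, 0, 0, 1])

def h(x, p1=12, p2=20, e1=100, e2=200):      # hypothesis: axis-aligned rectangle
    return int(p1 <= x[0] <= p2 and e1 <= x[1] <= e2)

# E(h | X) = sum_t 1(h(x^t) != r^t): the count of misclassified examples
E = sum(h(x) != rt for x, rt in zip(X, r))
print(E)  # 0: this rectangle is consistent with the toy sample
```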
Lecture notes for E. Alpaydın, Introduction to Machine Learning 2e, © The MIT Press (V1.0)
Noise and Model Complexity
Use the simpler model because it is:
  simpler to use (lower computational complexity)
  easier to train (lower space complexity)
  easier to explain (more interpretable)
  better at generalizing (lower variance; Occam's razor)
Multiple Classes
Train K hypotheses $h_i(x)$, $i = 1, \ldots, K$:
$$h_i(x^t) = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \ne i \end{cases}$$
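A tiny sketch of these K one-vs-rest indicator targets, assuming class labels stored as integers 0..K-1:

```python
import numpy as np

labels = np.array([0, 2, 1, 0, 2])            # x^t belongs to C_{labels[t]}, K = 3
K = 3
# Row t, column i holds h_i's target for x^t: 1 if x^t is in C_i, else 0
targets = (labels[:, None] == np.arange(K)).astype(int)
print(targets)   # each column is the 0/1 training signal for one hypothesis h_i
```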
Regression
Training set: $X = \{x^t, r^t\}_{t=1}^{N}$, with noisy targets $r^t = f(x^t) + \varepsilon$
Linear model: $g(x) = w_1 x + w_0$
Quadratic model: $g(x) = w_2 x^2 + w_1 x + w_0$
Empirical error: $$E(g \mid X) = \frac{1}{N} \sum_{t=1}^{N} \big[r^t - g(x^t)\big]^2$$
For the linear model: $$E(w_1, w_0 \mid X) = \frac{1}{N} \sum_{t=1}^{N} \big[r^t - (w_1 x^t + w_0)\big]^2$$
Model Selection & Generalization
Learning is an ill-posed problem: the data alone is not sufficient to find a unique solution.
Hence the need for an inductive bias: assumptions about H.
Generalization: how well a model performs on new data.
Overfitting: H is more complex than C or f. Underfitting: H is less complex than C or f.
Triple Trade-Off
Optimization procedure: $\theta^* = \arg\min_\theta E(\theta \mid X)$