
Supervised Learning Setup

Course 4232: Machine Learning

Dept. of Computer Science


Faculty of Science and Technology

Week No: 2


Instructor: Dr. M M Manjurul Islam ([email protected])
Supervised Learning
 Training experience: a set of labeled examples of the form
x = ( x1, x2, …, xn, y )

 where xj are values for input variables and y is the output

 This implies the existence of a “teacher” who knows the right answers

 What to learn: a function f : X1 × X2 × … × Xn → Y, which maps the input variables into the output domain

 Goal: minimize the error (loss function) on the training examples


Supervised Learning Problem
 Given a data set D ⊆ X1 × X2 × … × Xn × Y, find a function

h : X1 × X2 × … × Xn → Y

such that h(x) is a good predictor for the value of y

h is called a hypothesis

 If Y is the set of real numbers, this problem is called regression


 If Y is a finite discrete set, this problem is called classification
 Binary classification if Y has 2 discrete values
 Multi-class classification if Y has more than 2
Supervised Learning Steps
 Decide what the training examples are
 Data collection
 Feature extraction or selection:
 Discriminative features
 Relevant and insensitive to noise
 Input space X, output space Y, and feature vectors
 Choose a model, i.e. representation for h;
 or, the hypothesis class H = {h1, …, hr}
 Choose an error function to define the best hypothesis
 Choose a learning algorithm: regression or classification method
 Training

 Evaluation = testing
EX: What Model or Hypothesis Space H ?

• Training examples:

ei = ⟨xi, yi⟩, for i = 1, …, 10


Linear Hypothesis

What Error Function ? Algorithm ?
 Want to find the weight vector w = (w0, …, wn) such that hw(xi) ≈ yi
 Should define the error function to measure the difference between
the predictions and the true answers
 Thus pick w such that the error is minimized
 Sum-of-squares error function:
   J(w) = (1/2) Σ_{i=1}^{m} [h_w(x_i) − y_i]²
 Compute w such that J(w) is minimal, that is, such that:
   ∂J(w)/∂w_j = 0,  j = 0, …, n
 A learning algorithm that finds w: the Least Mean Squares (LMS) method (a gradient-based sketch follows below)
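
A minimal sketch of minimizing the sum-of-squares error by gradient descent, assuming NumPy; the helper name lms_fit and the toy data are illustrative, and the classical LMS rule applies the same update one example at a time rather than over the full batch.

```python
import numpy as np

def lms_fit(X, y, lr=0.01, epochs=1000):
    """Gradient descent on J(w) = 1/2 * sum_i (h_w(x_i) - y_i)^2.
    X is the data matrix already augmented with a leading column of 1's."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        residuals = X @ w - y        # h_w(x_i) - y_i for every example
        grad = X.T @ residuals       # dJ/dw_j = sum_i (h_w(x_i) - y_i) * x_ij
        w -= lr * grad               # step opposite the gradient
    return w

# toy usage: noisy samples of y = 1 + 2x (hypothetical data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.05, size=20)
print(lms_fit(X, y))   # should be close to [1.0, 2.0]
```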
Some Linear Algebra

Some Linear Algebra …

Some Linear Algebra - The Solution!

Example of Linear Regression - Data Matrices

XᵀX

XᵀY

Solving for w – Regression Curve

Linear Regression - Summary
 The optimal solution can be computed in polynomial time in the size of the
data set.
 Too simple for most real-valued problems
 The solution is w = (XᵀX)⁻¹XᵀY (see the sketch after this list), where
 X is the data matrix, augmented with a column of 1’s
 Y is the column vector of target outputs
 A very rare case in which an analytical exact solution is possible
 Nice math, closed-form formula, unique global optimum
 Problems when (XᵀX) does not have an inverse
 Possible solutions to this:
1. Include high-order terms in hw
2. Transform the input X to some other space X’, and apply linear regression on X’
3. Use a different but more powerful hypothesis representation
 Is Linear Regression enough ?
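
A minimal NumPy sketch of the closed-form solution; the function names fit_linear_regression and predict are illustrative, not from the slides.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Closed-form least squares: w = (X^T X)^{-1} X^T y.
    X holds one row per example; a column of 1's is prepended here,
    matching the augmented data matrix on the slides."""
    Xa = np.column_stack([np.ones(len(X)), X])
    # solve the normal equations instead of forming an explicit inverse;
    # this raises LinAlgError exactly when X^T X is singular
    return np.linalg.solve(Xa.T @ Xa, Xa.T @ y)

def predict(w, X):
    Xa = np.column_stack([np.ones(len(X)), X])
    return Xa @ w
```

Using np.linalg.solve rather than an explicit inverse is the usual numerical choice and surfaces the "XᵀX has no inverse" failure mode directly.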
Generalization Ability vs Overfitting
 Very important issue for any machine learning algorithms.
 Can your algorithm predict the correct target y of any unseen x ?
 Hypothesis may perfectly predict for all known x’s but not unseen x’s
 This is called overfitting
 Each hypothesis h has an unknown true error on the universe: JU(h)
 But we only measured the empirical error on the training set: JD(h)
 Let h1 and h2 be two hypotheses compared on training set D, such that
we obtained the result JD(h1) < JD(h2)
 If h2 is “truly” better, that is JU(h2) < JU(h1)
 Then your algorithm is overfitting, and won’t generalize to unseen data
 We are not interested in memorizing the training set
 In our examples, highest degree d hypotheses overfit (i.e. memorize) the data

 We need methods to overcome overfitting.


Overfitting
 We have overfitting when hypothesis h is more complex than the data
 Complexity of h = number of parameters in h
 Number of weight parameters in our example increases with degree d

 Overfitting = low error on training data but high error on unseen data
 Assume D is drawn from some unknown probability distribution
 Given the universe U of data, we want to learn a hypothesis h from the
training set 𝐷 ⊂ 𝑈 minimizing the error on unseen data 𝑈 ∖ 𝐷.
 Every h has a true error JU(h) on U, which is the expected error when the
data is drawn from the distribution
 We can only measure the empirical error JD(h) on D; we do not have U
 Then… How can we estimate the error JU(h) from D?
 Apply a cross-validation method on D
 Determining the hypothesis h that generalizes best is called model selection.
Avoiding Overfitting
• Red curve = Test set
• Blue curve = Training set

• What is the best h?


• Find the degree d such that JT(h) is minimal
• Training error decreases with complexity of h;
degree d in our example
• Testing error decreases initially then increases
• We need three disjoint subsets T, V, U of D
• Learn a potential h using the training set T
• Estimate error of h using the validation set V
• Report an unbiased error estimate of h using the test set U
Cross-Validation
 General procedure for estimating the true error of a learner.

 Randomly partition the data into three subsets:

1. Training Set T: used only to find the parameters of classifier, e.g. w.


2. Validation Set V: used to find the correct hypothesis class, e.g. d.
3. Test Set U: used to estimate the true error of your algorithm

 These three sets do not intersect, i.e. they are disjoint


 Repeat cross-validation many times
 Results are averaged to give true error estimate.

Cross-Validation and Model Selection
 How do we find the degree d which fits the data D best?
 Randomly partition the available data D into three disjoint sets;
training set T, validation set V, and test set U, then:
1. Cross-validation: For each degree d, perform a cross-
validation method using T and V sets for evaluating the
goodness of d.
 Some cross-validation techniques to be discussed later
2. Model Selection: Given the best d found in step 1, find hw,d
using T and V sets and report the prediction error of hw,d
using the test set U
 Some model selection approaches to be discussed later.
 The prediction error on U is an unbiased estimate of the true error
Leave-One-Out Cross-Validation
 For each degree d do:
1. for i ← 1 to m do:
1. Validation set Vi ← {ei = ( xi, yi )} ; leave the i-th sample out
2. Training set: Ti ← D \ Vi
3. wd,i ← Train(Ti, d) ; optimal wd,i using training set Ti
4. J(d, i) ← Test(Vi) ; validation error of wd,i on xi
; J(d, i) is an unbiased estimate of the true prediction error
2. Average validation error: J(d) ← (1/m) Σ_{i=1}^{m} J(d, i)
 d* ← arg min_d J(d) ; select the degree d with the lowest average error
; J(d*) is not an unbiased estimate since all data is used to find it.
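
A minimal sketch of this procedure for polynomial regression, assuming NumPy; numpy.polyfit stands in for Train, and the function name loocv_best_degree is illustrative.

```python
import numpy as np

def loocv_best_degree(x, y, degrees):
    """Leave-one-out cross-validation for choosing a polynomial degree d.
    x, y are 1-D arrays of m samples; returns d* and the averaged errors J(d)."""
    m = len(x)
    avg_err = {}
    for d in degrees:
        errs = []
        for i in range(m):                         # leave the i-th sample out
            mask = np.arange(m) != i
            w = np.polyfit(x[mask], y[mask], d)    # Train(T_i, d)
            pred = np.polyval(w, x[i])
            errs.append((pred - y[i]) ** 2)        # J(d, i): error on V_i = {e_i}
        avg_err[d] = float(np.mean(errs))          # J(d)
    d_star = min(avg_err, key=avg_err.get)         # arg min_d J(d)
    return d_star, avg_err
```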
Example: Estimating True Error for d = 1

Example: Estimation results for all d

 Optimal choice is d = 2
 Overfitting for d > 2
 Very high validation error for d = 8 and 9
Model Selection
 J(d*) is not unbiased since it was obtained using all m samples
 We chose the hypothesis class d* based on J(d) = (1/m) Σ_{i=1}^{m} J(d, i)
 We want both a hypothesis class and an unbiased true error
estimate
 If we want to compare different learning algorithms (or different
hypotheses) an independent test data U is required in order to
decide for the best algorithm or the best hypothesis
 In our case, we are trying to decide which regression model to
use, d=1, or d=2, or …, or d=11?
 And, which has the best unbiased true error estimate
k-Fold Cross-Validation
 Partition D into k disjoint subsets of same size and same
distribution, P1, P2, …, Pk
 For each degree d do:
 for i ← 1 to k do:
1. Validation set Vi ← Pi ; leave Pi out for validation
2. Training set Ti ← D \ Vi
3. wd,i ← Train(Ti, d) ; train on Ti
4. J(d, i) ← Test(Vi) ; compute validation error on Vi
 Average validation error: J(d) ← (1/k) Σ_{i=1}^{k} J(d, i)
 d* ← arg min_d J(d) ; return the optimal degree d
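
A minimal k-fold sketch in the same style, assuming NumPy; kfold_best_degree and the use of numpy.polyfit as the training routine are illustrative choices.

```python
import numpy as np

def kfold_best_degree(x, y, degrees, k=10, seed=0):
    """k-fold cross-validation for choosing a polynomial degree d.
    The indices are shuffled once and split into k folds P_1, ..., P_k."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    avg_err = {}
    for d in degrees:
        errs = []
        for i in range(k):
            val = folds[i]                                   # V_i = P_i
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            w = np.polyfit(x[train], y[train], d)            # Train(T_i, d)
            errs.append(np.mean((np.polyval(w, x[val]) - y[val]) ** 2))
        avg_err[d] = float(np.mean(errs))                    # J(d) averaged over k folds
    return min(avg_err, key=avg_err.get), avg_err            # d*, J(d)
```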
kCV-Based Model Selection
Partition D into k disjoint subsets P1, P2, …, Pk
1. For j ← 1 to k do:
1. Test set Uj ← Pj ; hold out fold Pj for testing
2. TrainingAndValidation set Dj ← D \ Uj
3. dj* ← kCV(Dj) ; find best degree d at iteration j
4. wj* ← Train(Dj, dj*) ; find associated w using full data Dj
5. J(hj*) ← Test(Uj) ; estimate unbiased predictive error of hj* on Uj
2. Performance of the method: E ← (1/k) Σ_{j=1}^{k} J(hj*)
; return E as the performance of the learning algorithm
3. Best hypothesis: hbest ← arg min_j J(hj*) ; final selected predictor
   ; several approaches can be used to come up with just one hypothesis
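
A sketch of this nested procedure, reusing the kfold_best_degree helper from the previous sketch (an assumption, not part of the slides); the outer loop holds out one fold Uj for testing while the inner cross-validation on the remaining data Dj selects dj*.

```python
import numpy as np

def nested_cv(x, y, degrees, k=10, seed=0):
    """kCV-based model selection: returns the averaged test error E
    and the best (degree, weights, error) triple across the outer folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    test_errs, models = [], []
    for j in range(k):
        test = folds[j]                                               # U_j
        rest = np.concatenate([folds[i] for i in range(k) if i != j]) # D_j
        d_star, _ = kfold_best_degree(x[rest], y[rest], degrees, k=k - 1)
        w_star = np.polyfit(x[rest], y[rest], d_star)                 # h_j* trained on all of D_j
        err = np.mean((np.polyval(w_star, x[test]) - y[test]) ** 2)   # J(h_j*) on U_j
        test_errs.append(err)
        models.append((d_star, w_star, err))
    E = float(np.mean(test_errs))                 # performance of the learning method
    h_best = min(models, key=lambda m: m[2])      # arg min_j J(h_j*)
    return E, h_best
```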
Variations of k-Fold Cross-Validation
 LOOCV: k-CV with k = m; i.e. m-fold CV
 Best but very slow on large D

 Holdout-CV: 2-fold CV with 50%T, 25%V, 25%U


 Advantage for large D but not good for small data set

 Repeated Random Sub-Sampling: k-CV with random V in each iteration


 for i ← 1 to k do:
1. Randomly select a fixed fraction αm, 0 < α < 1, of D as Vi
2. Train on D \ Vi and measure the error Ji on Vi
 Return E = (1/k) Σ_{i=1}^{k} Ji as the true error estimate (a sketch follows after this list)
 Usually k = 10 and α = 0.1
 Some samples may never be selected for validation and others may be selected many
times for validation

 k-CV is a good trade-off between true error estimate, speed, and data size
 Each sample is used for validation exactly once. Usually k = 10.
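
A compact sketch of repeated random sub-sampling for a fixed degree, assuming NumPy; the name repeated_random_subsampling is illustrative.

```python
import numpy as np

def repeated_random_subsampling(x, y, degree, k=10, alpha=0.1, seed=0):
    """In each of k rounds, hold out a random fraction alpha of D as V_i,
    train on the rest, and average the k validation errors."""
    rng = np.random.default_rng(seed)
    m = len(x)
    errs = []
    for _ in range(k):
        val = rng.choice(m, size=max(1, int(alpha * m)), replace=False)  # random V_i
        train = np.setdiff1d(np.arange(m), val)                          # D \ V_i
        w = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(w, x[val]) - y[val]) ** 2))      # J_i
    return float(np.mean(errs))                                          # E = (1/k) sum_i J_i
```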
Learning a Class from Examples
 Class C of a “family car”
 Prediction: Is car x a family car?
 Knowledge extraction: What do people expect from a
family car?
 Output:
Positive (+) and negative (–) examples
 Input representation:
x1: price, x2 : engine power



Training set X
X = {x^t, r^t}_{t=1}^{N}

r = 1 if x is a positive example; 0 if x is a negative example

x = [x1, x2]^T
Class C
(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)



Hypothesis class H
h(x) = 1 if h says x is positive; 0 if h says x is negative

Error of h on X

E(h | X) = Σ_{t=1}^{N} 1(h(x^t) ≠ r^t)
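
A small sketch of this hypothesis class and its empirical error for the family-car example; the helper names and the toy (price, engine power) data are hypothetical.

```python
def make_rectangle_hypothesis(p1, p2, e1, e2):
    """Axis-aligned rectangle hypothesis:
    h(x) = 1 iff p1 <= price <= p2 and e1 <= engine power <= e2."""
    def h(x):
        price, power = x
        return int(p1 <= price <= p2 and e1 <= power <= e2)
    return h

def empirical_error(h, X, r):
    """E(h|X): the number of training examples where h(x^t) != r^t."""
    return sum(int(h(x) != label) for x, label in zip(X, r))

# hypothetical toy data: (price, engine power) with family-car labels
X = [(15000, 100), (22000, 150), (40000, 300), (9000, 60)]
r = [1, 1, 0, 0]
h = make_rectangle_hypothesis(10000, 30000, 80, 200)
print(empirical_error(h, X, r))   # 0 -> h is consistent with this toy set
```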



S, G, and the Version Space

most specific hypothesis, S


most general hypothesis, G

Any h ∈ H between S and G is consistent; such hypotheses make up the version space
(Mitchell, 1997)



Margin
 Choose h with largest margin



VC Dimension
 N points can be labeled in 2^N ways as +/–
 H shatters N points if, for each of these labelings, there exists an h ∈ H consistent with it: VC(H) = N

An axis-aligned rectangle shatters 4 points only !
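
A small brute-force sketch of what shattering means for axis-aligned rectangles; rectangle_consistent and shattered are illustrative helpers, and the point sets below are toy examples.

```python
from itertools import product

def rectangle_consistent(points, labels):
    """True if some axis-aligned rectangle classifies the labeled points
    perfectly: the bounding box of the positives must contain no negative."""
    pos = [p for p, lab in zip(points, labels) if lab == 1]
    neg = [p for p, lab in zip(points, labels) if lab == 0]
    if not pos:                       # a degenerate rectangle handles the all-negative labeling
        return True
    x_lo, x_hi = min(x for x, _ in pos), max(x for x, _ in pos)
    y_lo, y_hi = min(y for _, y in pos), max(y for _, y in pos)
    return not any(x_lo <= x <= x_hi and y_lo <= y <= y_hi for x, y in neg)

def shattered(points):
    """True if rectangles realize all 2^N labelings of the given points."""
    return all(rectangle_consistent(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered([(0, 1), (1, 0), (0, -1), (-1, 0)]))           # True: these 4 points are shattered
print(shattered([(0, 1), (1, 0), (0, -1), (-1, 0), (0, 0)]))   # False: adding a fifth point breaks shattering
```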


Probably Approximately Correct (PAC)
Learning
 How many training examples N should we have, such that with probability at
least 1 ‒ δ, h has error at most ε ?
(Blumer et al., 1989)

 Each strip is at most ε/4


 Pr that we miss a strip: 1 − ε/4
 Pr that N instances miss a strip: (1 − ε/4)^N
 Pr that N instances miss 4 strips: 4(1 − ε/4)^N
 4(1 − ε/4)^N ≤ δ and (1 − x) ≤ exp(−x)
 4 exp(−εN/4) ≤ δ and N ≥ (4/ε) log(4/δ)
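
A one-function sketch of this bound; the name pac_sample_size is illustrative, and log is the natural logarithm, matching the (1 − x) ≤ exp(−x) step.

```python
import math

def pac_sample_size(eps, delta):
    """Smallest integer N with N >= (4/eps) * log(4/delta), the slide's
    sample-complexity bound for the axis-aligned rectangle class."""
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))

# error at most 0.05 with probability at least 0.95
print(pac_sample_size(0.05, 0.05))   # 351
```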

Noise and Model Complexity
Use the simpler one because
 Simpler to use
(lower computational
complexity)
 Easier to train (lower
space complexity)
 Easier to explain
(more interpretable)
 Generalizes better (lower
variance - Occam’s razor)



Multiple Classes, Ci i=1,...,K
X = {x^t, r^t}_{t=1}^{N}

r_i^t = 1 if x^t ∈ Ci; 0 if x^t ∈ Cj, j ≠ i

Train hypotheses hi(x), i = 1, …, K:

hi(x^t) = 1 if x^t ∈ Ci; 0 if x^t ∈ Cj, j ≠ i
Regression

X = x , r
t

t N
t =1
g(x ) = w1x + w0
r t 
g(x ) = w2 x 2 + w1x + w0
r t = f (x t ) + 

1 N t
N t =1

E (g | X ) =  r − g (x )
t 2

1 N t
N t =1

E (w1 ,w0 | X ) =  r − (w1 x + w0 )
t 2

Model Selection & Generalization
 Learning is an ill-posed problem; data is not sufficient to find
a unique solution
 The need for inductive bias, assumptions about H
 Generalization: How well a model performs on new data
 Overfitting: H more complex than C or f
 Underfitting: H less complex than C or f

Triple Trade-Off

 There is a trade-off between three factors (Dietterich, 2003):


1. Complexity of H, c (H),
2. Training set size, N,
3. Generalization error, E, on new data
 As N increases, E decreases
 As c(H) increases, E first decreases and then increases



Cross-Validation
 To estimate generalization error, we need data unseen during
training. We split the data as
 Training set (50%)
 Validation set (25%)
 Test (publication) set (25%)
 Resampling when there is little data



Dimensions of a Supervised Learner
1. Model: g(x | θ)

2. Loss function: E(θ | X) = Σ_t L(r^t, g(x^t | θ))

3. Optimization procedure: θ* = arg min_θ E(θ | X)

Textbook/ Reference Materials

 Introduction to Machine Learning by Ethem Alpaydin


 Machine Learning: An Algorithmic Perspective by
Stephen Marsland
 Pattern Recognition and Machine Learning by
Christopher M. Bishop

