Lecture Slide 02 - Supervised Learning - Summer 2023


Supervised Learning Setup

Course 4232: Machine Learning

Dept. of Computer Science

Faculty of Science and Technology

Lecturer No: Week No: 2 Semester: Summer 2022-23

Instructor: Prof. Dr. Md. Asraf Ali ([email protected])
Supervised Learning
• Training experience: a set of labeled examples of the form
x = (x1, x2, …, xn, y)

where the xj are values of the input variables and y is the output

• This implies the existence of a “teacher” who knows the right answers

• What to learn: a function f : X1 × X2 × … × Xn → Y, which maps the input variables into the output domain

• Goal: minimize the error (loss function) on the training examples

Supervised Learning Problem
Given a data set D ⊆ X1 × X2 × … × Xn × Y, find a function

h : X1 × X2 × … × Xn → Y

such that h(x) is a good predictor for the value of y

h is called a hypothesis

If Y is the set of real numbers, this problem is called regression

If Y is a finite discrete set, this problem is called classification
Binary classification if Y has 2 discrete values
Multi-class classification if Y has more than 2
Supervised Learning Steps
Decide what the training examples are
Data collection
Feature extraction or selection:
 Discriminative features
 Relevant and insensitive to noise
Input space X, output space Y, and feature vectors
Choose a model, i.e. a representation for h (or, equivalently, the hypothesis class H = {h1, …, hr})
Choose an error function to define the best hypothesis
Choose a learning algorithm: a regression or classification method
Training
Evaluation = testing
Example: What Model or Hypothesis Space H?

• Training examples:

ei = <xi, yi>

for i = 1, …, 10
Linear Hypothesis
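What Error Function? Algorithm?
We want to find the weight vector w = (w0, …, wn) such that hw(xi) ≈ yi
We should define an error function that measures the difference between the predictions and the true answers
Then pick w such that the error is minimized
Sum-of-squares error function:
J(w) = (1/2) Σ_{i=1}^{m} (hw(xi) − yi)^2, summed over the m training examples (the factor 1/2 is a common convention and does not change the minimizer)

Compute w such that J(w) is minimal, that is, such that:
∂J(w)/∂wj = 0 for j = 0, …, n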

Some Linear Algebra
Some Linear Algebra …
Some Linear Algebra - The Solution!
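
The algebra on these slides did not survive extraction. A standard derivation that arrives at the closed-form solution quoted in the summary can be sketched as follows. It assumes a linear hypothesis, the sum-of-squares error with the conventional factor 1/2 (which does not affect the minimizer), m training examples, X as the m × (n+1) data matrix with a leading column of 1's, and Y as the column vector of targets. These symbols are assumptions where the original formulas are missing.

```latex
\begin{align*}
h_w(x) &= w_0 + \sum_{j=1}^{n} w_j x_j \\
J(w) &= \tfrac{1}{2} \sum_{i=1}^{m} \bigl(h_w(x_i) - y_i\bigr)^2
      = \tfrac{1}{2} (Xw - Y)^{T} (Xw - Y) \\
\nabla_w J(w) &= X^{T} X w - X^{T} Y \\
\nabla_w J(w) = 0 \;&\Longrightarrow\; X^{T} X w = X^{T} Y
      \;\Longrightarrow\; w = (X^{T} X)^{-1} X^{T} Y
\end{align*}
```

The last step is valid only when X^T X is invertible.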
Example of Linear Regression - Data Matrices
X^T X
X^T Y
Solving for w – Regression Curve
Linear Regression - Summary
The optimal solution can be computed in polynomial time in the size of the data set.
The solution is w = (X^T X)^{-1} X^T Y, where
• X is the data matrix, augmented with a column of 1’s
• Y is the column vector of target outputs
A very rare case in which an exact analytical solution is possible
Nice math, closed-form formula, unique global optimum
Problems arise when X^T X does not have an inverse
Too simple for most real-valued problems
Possible solutions to this:
1. Include higher-order terms in hw
2. Transform the input X to some other space X’, and apply linear regression on X’
3. Use a different but more powerful hypothesis representation
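
The closed-form solution summarized above can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's implementation: the function names and the toy data are hypothetical, and the code solves the linear system (X^T X) w = X^T Y instead of forming the inverse explicitly, which gives the same w when X^T X is invertible.

```python
import numpy as np

def fit_linear_regression(x, y):
    """Return w minimizing the sum-of-squares error via the normal equations."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x.reshape(-1, 1)
    # Augment the data matrix with a leading column of 1's for the bias w0.
    X = np.hstack([np.ones((x.shape[0], 1)), x])
    Y = np.asarray(y, dtype=float)
    # Solve (X^T X) w = X^T Y; this raises LinAlgError if X^T X is singular.
    return np.linalg.solve(X.T @ X, X.T @ Y)

def predict(w, x):
    """Evaluate h_w(x) = w0 + w1*x1 + ... + wn*xn for each row of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x.reshape(-1, 1)
    return w[0] + x @ w[1:]

# Hypothetical toy data lying exactly on y = 1 + 2x.
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1.0 + 2.0 * x_toy
w_toy = fit_linear_regression(x_toy, y_toy)
```

When X^T X is singular, a least-squares routine such as `np.linalg.lstsq` is one way to still obtain a minimizer.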
• Is linear regression enough?
Generalization Ability vs Overfitting
A very important issue for any machine learning algorithm.
Can your algorithm predict the correct target y for any unseen x?
A hypothesis may predict perfectly for all known x’s but not for unseen x’s
This is called overfitting
Each hypothesis h has an unknown true error on the universe: JU(h)
But we only measure the empirical error on the training set: JD(h)
Let h1 and h2 be two hypotheses compared on training set D, such that we obtained the result JD(h1) < JD(h2)
If h2 is “truly” better, that is, JU(h2) < JU(h1), then your algorithm is overfitting and won’t generalize to unseen data
We are not interested in memorizing the training set
• In our examples, the hypotheses of highest degree d overfit (i.e. memorize) the data

We need methods to overcome overfitting.

Overfitting
We have overfitting when hypothesis h is more complex than the data
Complexity of h = number of parameters in h
• The number of weight parameters in our example increases with degree d

Overfitting = low error on training data but high error on unseen data
Assume D is drawn from some unknown probability distribution
Given the universe U of data, we want to learn a hypothesis h from the training set that minimizes the error on unseen data.
• Every h has a true error JU(h) on U, which is the expected error when the data is drawn from the distribution
We can only measure the empirical error JD(h) on D; we do not have U
Then… how can we estimate the error JU(h) from D?
• Apply a cross-validation method on D
• Determining the hypothesis h that generalizes best is called model selection.
Avoiding Overfitting
• Red curve = Test set
• Blue curve = Training set

• What is the best h?

• Find the degree d such that JT(h) is minimal
• Training error decreases with the complexity of h (degree d in our example)
• Testing error decreases initially, then increases
• We need three disjoint subsets T, V, U of D
• Learn a potential h using the training set T
• Estimate the error of h using the validation set V
• Report an unbiased error estimate for h using the test set U
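
The behaviour of the two curves can be reproduced with a small experiment. The sketch below is illustrative only: the quadratic-plus-noise data, the 20/20 split, and the degree range are all assumptions, and `np.polyfit` stands in for whatever training routine the slides used. It fits polynomials of increasing degree d and records the error on the training points and on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: a quadratic trend plus Gaussian noise.
x = rng.uniform(-1.0, 1.0, size=40)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.3, size=x.size)

# The first 20 points play the role of the training set; the rest are held out.
x_train, y_train = x[:20], y[:20]
x_held, y_held = x[20:], y[20:]

def mean_squared_error(coeffs, xs, ys):
    """Mean squared difference between polynomial predictions and targets."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

train_error, held_out_error = {}, {}
for d in range(1, 9):
    coeffs = np.polyfit(x_train, y_train, deg=d)  # least-squares fit of degree d
    train_error[d] = mean_squared_error(coeffs, x_train, y_train)
    held_out_error[d] = mean_squared_error(coeffs, x_held, y_held)

for d in sorted(train_error):
    print(f"d={d}  train={train_error[d]:.4f}  held-out={held_out_error[d]:.4f}")
```

Because each degree-d polynomial family contains the lower-degree ones, the training error cannot increase with d. The held-out error carries no such guarantee, which is why it is the one used to choose d.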
Cross-Validation
General procedure for estimating the true error of a learner.
A method for estimating the expected prediction error
Helps select the best-fitting model
Helps ensure the model is not overfit
Types of CV: holdout, k-fold, stratified k-fold, leave-one-out, leave-p-out, time series

Randomly partition the data into three subsets:

1. Training set T: used only to find the parameters of the classifier, e.g. w.

2. Validation set V: used to find the correct hypothesis class, e.g. d.
3. Test set U: used to estimate the true error of your algorithm

These three sets do not intersect, i.e. they are disjoint

Repeat cross-validation many times
Results are averaged to give the true error estimate.
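
A random three-way partition of this kind might look like the sketch below. The function name, the default 50/25/25 fractions (borrowed from the holdout split mentioned elsewhere in these slides), and the seed are illustrative assumptions. The function works on indices, so it can be applied to any data set of size m.

```python
import numpy as np

def three_way_split(m, frac_train=0.5, frac_val=0.25, seed=0):
    """Randomly partition indices 0..m-1 into disjoint sets T, V, U."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(m)            # random ordering of all indices
    n_train = int(round(frac_train * m))
    n_val = int(round(frac_val * m))
    T = order[:n_train]                   # training indices
    V = order[n_train:n_train + n_val]    # validation indices
    U = order[n_train + n_val:]           # whatever remains is the test set
    return T, V, U

T_idx, V_idx, U_idx = three_way_split(100)
```

Slicing one permutation guarantees that the three sets are disjoint and together cover every example exactly once.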
Cross-Validation and Model Selection
How do we find the degree d that fits the data D best?
Randomly partition the available data D into three disjoint sets: training set T, validation set V, and test set U, then:
1. Cross-validation: for each degree d, perform a cross-validation method using the T and V sets to evaluate the goodness of d.
• Some cross-validation techniques are discussed later
2. Model selection: given the best d found in step 1, find hw,d using the T and V sets and report the prediction error of hw,d using the test set U
• Some model selection approaches are discussed later.
The prediction error on U is an unbiased estimate of the true error
Leave-One-Out Cross-Validation
For each degree d do:
1. for i ← 1 to m do:
   1. Validation set Vi ← {ei = (xi, yi)} ; leave the i-th sample out
   2. Training set Ti ← D \ Vi
   3. wd,i ← Train(Ti, d) ; optimal wd,i using training set Ti
   4. J(d, i) ← Test(Vi) ; validation error of wd,i on xi
      ; J(d, i) is an unbiased estimate of the true prediction error
2. Average validation error: J(d) ← (1/m) Σ_{i=1}^{m} J(d, i)
d* ← arg min_d J(d) ; select the degree d with the lowest average error
; J(d*) is not an unbiased estimate since all data is used to find it.
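
The leave-one-out procedure can be turned into runnable form roughly as follows. This is a sketch under stated assumptions: `np.polyfit` plays the role of Train, squared error plays the role of Test, and the data set is hypothetical. None of these choices is specified by the slides.

```python
import numpy as np

def loocv_error(x, y, d):
    """Average leave-one-out validation error J(d) for a degree-d polynomial."""
    m = len(x)
    errors = np.empty(m)
    for i in range(m):
        keep = np.arange(m) != i                        # T_i = D \ V_i
        w = np.polyfit(x[keep], y[keep], deg=d)         # Train(T_i, d)
        errors[i] = (np.polyval(w, x[i]) - y[i]) ** 2   # Test(V_i), squared error
    return float(errors.mean())

def select_degree_loocv(x, y, degrees):
    """Return (d*, {d: J(d)}) where d* has the lowest average error."""
    J = {d: loocv_error(x, y, d) for d in degrees}
    d_star = min(J, key=J.get)
    return d_star, J

# Hypothetical data for illustration only: a noisy quadratic.
rng = np.random.default_rng(1)
x_demo = np.linspace(-1.0, 1.0, 30)
y_demo = 0.5 + x_demo - 2.0 * x_demo**2 + rng.normal(scale=0.1, size=x_demo.size)
d_star, J_loocv = select_degree_loocv(x_demo, y_demo, degrees=range(1, 7))
```

Each degree requires m separate fits, which is the cost that makes LOOCV slow on large data sets.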
Example: Estimating True Error for d = 1
Example: Estimation results for all d

Optimal choice is d = 2
Overfitting for d > 2
Very high validation error for d = 8 and 9
Model Selection
J(d*) is not unbiased since it was obtained using all m samples
We chose the hypothesis class d* based on the average validation error J(d)
We want both a hypothesis class and an unbiased true error estimate
If we want to compare different learning algorithms (or different hypotheses), independent test data U is required in order to decide on the best algorithm or the best hypothesis
In our case, we are trying to decide which regression model to use: d = 1, or d = 2, or …, or d = 11?
And which has the best unbiased true error estimate?
k-Fold Cross-Validation
Partition D into k disjoint subsets of the same size and the same distribution, P1, P2, …, Pk
For each degree d do:
   for i ← 1 to k do:
   1. Validation set Vi ← Pi ; leave Pi out for validation
   2. Training set Ti ← D \ Vi
   3. wd,i ← Train(Ti, d) ; train on Ti
   4. J(d, i) ← Test(Vi) ; compute validation error on Vi
   Average validation error: J(d) ← (1/k) Σ_{i=1}^{k} J(d, i)
d* ← arg min_d J(d) ; return optimal degree d
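
A compact version of k-fold cross-validation for choosing d is sketched below. It again assumes polynomial fitting with `np.polyfit`, squared error as the validation measure, and hypothetical data. The folds come from a random permutation, so they are of nearly equal size, but the sketch makes no attempt to give them the same distribution (no stratification).

```python
import numpy as np

def k_fold_error(x, y, d, k=10, seed=0):
    """Average validation error J(d) over k folds for a degree-d polynomial."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)   # P_1, ..., P_k
    fold_errors = []
    for i in range(k):
        val = folds[i]                                    # V_i = P_i
        train = np.concatenate([folds[j] for j in range(k) if j != i])  # T_i = D \ V_i
        w = np.polyfit(x[train], y[train], deg=d)         # train on T_i
        fold_errors.append(np.mean((np.polyval(w, x[val]) - y[val]) ** 2))  # error on V_i
    return float(np.mean(fold_errors))

# Hypothetical data for illustration only: a noisy quadratic.
rng = np.random.default_rng(2)
x_demo = np.linspace(-1.0, 1.0, 30)
y_demo = 0.5 + x_demo - 2.0 * x_demo**2 + rng.normal(scale=0.1, size=x_demo.size)

J_kfold = {d: k_fold_error(x_demo, y_demo, d) for d in range(1, 7)}
d_star_kfold = min(J_kfold, key=J_kfold.get)
```

Reusing the same seed for every d means all degrees are compared on identical folds, which keeps the comparison fair.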
kCV-Based Model Selection
Partition D into k disjoint subsets P1, P2, …, Pk
1. For j ← 1 to k do:
   1. Test set Uj ← Pj
   2. TrainingAndValidation set Dj ← D \ Uj
   3. dj* ← kCV(Dj) ; find best degree d at iteration j
   4. wj* ← Train(Dj, dj*) ; find associated w using full data Dj
   5. J(hj*) ← Test(Uj) ; estimate unbiased predictive error of hj* on Uj
2. Performance of method: E ← (1/k) Σ_{j=1}^{k} J(hj*)
   ; return E as the performance of the learning algorithm
3. Best hypothesis: hbest ← arg min_j J(hj*) ; final selected predictor
   ; several approaches can be used to come up with just one hypothesis
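
This procedure is often called nested cross-validation: an outer loop holds out a test fold, and an inner k-fold loop picks the degree using only the remaining data. The sketch below assumes the test set at iteration j is the fold Pj, uses `np.polyfit` and squared error as stand-ins for Train and Test, and runs on hypothetical data. All function names are illustrative.

```python
import numpy as np

def mse(w, x, y):
    """Mean squared error of polynomial coefficients w on (x, y)."""
    return float(np.mean((np.polyval(w, x) - y) ** 2))

def make_folds(n, k, rng):
    """Split indices 0..n-1 into k nearly equal random folds."""
    return np.array_split(rng.permutation(n), k)

def inner_kcv(x, y, degrees, k, rng):
    """Return the degree with the lowest average validation error on (x, y)."""
    folds = make_folds(len(x), k, rng)
    J = {}
    for d in degrees:
        errs = []
        for i in range(k):
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            w = np.polyfit(x[train], y[train], deg=d)
            errs.append(mse(w, x[folds[i]], y[folds[i]]))
        J[d] = np.mean(errs)
    return min(J, key=J.get)

def nested_cv(x, y, degrees, k_outer=5, k_inner=5, seed=0):
    """Return (E, results); results holds (d_j*, w_j*, J(h_j*)) per outer fold."""
    rng = np.random.default_rng(seed)
    outer = make_folds(len(x), k_outer, rng)
    results = []
    for j in range(k_outer):
        test = outer[j]                                                      # U_j
        rest = np.concatenate([outer[i] for i in range(k_outer) if i != j])  # D_j
        d_j = inner_kcv(x[rest], y[rest], degrees, k_inner, rng)             # d_j*
        w_j = np.polyfit(x[rest], y[rest], deg=d_j)                          # train on all of D_j
        results.append((d_j, w_j, mse(w_j, x[test], y[test])))               # J(h_j*)
    E = float(np.mean([r[2] for r in results]))  # performance of the whole method
    return E, results

# Hypothetical data for illustration only: a noisy quadratic.
rng_demo = np.random.default_rng(3)
x_demo = np.linspace(-1.0, 1.0, 60)
y_demo = 0.5 + x_demo - 2.0 * x_demo**2 + rng_demo.normal(scale=0.1, size=x_demo.size)
E_demo, results_demo = nested_cv(x_demo, y_demo, degrees=range(1, 6))
```

E estimates how well the whole recipe (select d, then train) performs. It is not the error of any single final hypothesis.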
Variations of k-Fold Cross-Validation
LOOCV: k-CV with k = m; i.e. m-fold CV
Best but very slow on large D

Holdout-CV: 2-fold CV with 50% T, 25% V, 25% U

Advantage for large D but not good for a small data set

Repeated Random Sub-Sampling: k-CV with random V in each iteration

• for i ← 1 to k do:
   1. Randomly select a fixed fraction αm, 0 < α < 1, of D as Vi
   2. Train on D \ Vi and measure the error Ji on Vi
• Return J ← (1/k) Σ_{i=1}^{k} Ji as the true error estimate
• Usually k = 10 and α = 0.1
• Some samples may never be selected for validation and others may be selected many times for validation

k-CV is a good trade-off between true error estimate, speed, and data size
Each sample is used for validation exactly once. Usually k = 10.
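
Repeated random sub-sampling differs from k-fold CV only in how the validation sets are drawn. The sketch below makes that concrete under the same illustrative assumptions as before (polynomial models via `np.polyfit`, squared error, hypothetical data): each iteration draws a fresh random validation set of size αm, so validation sets may overlap across iterations.

```python
import numpy as np

def random_subsampling_error(x, y, d, k=10, alpha=0.1, seed=0):
    """Average validation error over k independent random train/validation splits."""
    rng = np.random.default_rng(seed)
    m = len(x)
    n_val = max(1, int(round(alpha * m)))        # size of each V_i
    errors = []
    for _ in range(k):
        perm = rng.permutation(m)                # a fresh random split every iteration
        val, train = perm[:n_val], perm[n_val:]  # V_i and D \ V_i
        w = np.polyfit(x[train], y[train], deg=d)
        errors.append(np.mean((np.polyval(w, x[val]) - y[val]) ** 2))  # J_i
    return float(np.mean(errors))

# Hypothetical data for illustration only: a noisy quadratic.
rng_demo = np.random.default_rng(4)
x_demo = np.linspace(-1.0, 1.0, 30)
y_demo = 0.5 + x_demo - 2.0 * x_demo**2 + rng_demo.normal(scale=0.1, size=x_demo.size)
J_sub = {d: random_subsampling_error(x_demo, y_demo, d) for d in (1, 2)}
```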
Learning a Class from Examples
Class C of a “family car”
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a
family car?
Output:
Positive (+) and negative (–) examples
Input representation:
x1: price, x2 : engine power
Training set X
Class C
Hypothesis class H

Error of h on H
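
The formulas on this slide were lost in extraction. As a hedged illustration, the sketch below assumes that hypotheses are axis-aligned rectangles in the (price, engine power) plane, as the later VC-dimension slide suggests, and that the error of h is the number of training examples it misclassifies. The car data and the rectangle bounds are invented for the example.

```python
import numpy as np

def rectangle_hypothesis(p1, p2, e1, e2):
    """h(x) = 1 if p1 <= price <= p2 and e1 <= engine power <= e2, else 0."""
    def h(X):
        X = np.asarray(X, dtype=float)
        inside = (X[:, 0] >= p1) & (X[:, 0] <= p2) & (X[:, 1] >= e1) & (X[:, 1] <= e2)
        return inside.astype(int)
    return h

def empirical_error(h, X, r):
    """Number of training examples that h labels differently from r."""
    return int(np.sum(h(X) != np.asarray(r)))

# Hypothetical cars: (price in thousands, engine power in horsepower).
X_cars = np.array([[15, 100], [22, 150], [30, 180], [60, 400], [8, 60]])
r_cars = np.array([1, 1, 1, 0, 0])   # 1 = family car (+), 0 = not (-)

h_demo = rectangle_hypothesis(p1=10, p2=35, e1=90, e2=200)
h_too_wide = rectangle_hypothesis(p1=0, p2=100, e1=0, e2=500)
```

Here `h_demo` is consistent with the training set, while `h_too_wide` accepts every car and therefore misclassifies both negative examples.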

Dr. M M Manjurul Islam

S, G, and the Version Space

most specific hypothesis, S

most general hypothesis, G

Any h ∈ H between S and G is consistent, and these hypotheses make up the version space (Mitchell, 1997)
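
For axis-aligned rectangle hypotheses (an assumption carried over from the family-car example), the most specific hypothesis S is the tightest rectangle around the positive examples. The sketch below computes it and checks consistency. The data and function names are hypothetical, and computing G is left out because it depends on how the negatives constrain each side.

```python
import numpy as np

def most_specific_rectangle(X, r):
    """Tightest axis-aligned rectangle (x1_min, x1_max, x2_min, x2_max) around the positives."""
    X = np.asarray(X, dtype=float)
    pos = X[np.asarray(r) == 1]
    if len(pos) == 0:
        raise ValueError("need at least one positive example")
    return (pos[:, 0].min(), pos[:, 0].max(), pos[:, 1].min(), pos[:, 1].max())

def is_consistent(rect, X, r):
    """True if the rectangle labels every training example correctly."""
    X = np.asarray(X, dtype=float)
    lo1, hi1, lo2, hi2 = rect
    inside = (X[:, 0] >= lo1) & (X[:, 0] <= hi1) & (X[:, 1] >= lo2) & (X[:, 1] <= hi2)
    return bool(np.all(inside == (np.asarray(r) == 1)))

# Hypothetical cars: (price in thousands, engine power in horsepower).
X_cars = np.array([[15, 100], [22, 150], [30, 180], [60, 400], [8, 60]])
r_cars = np.array([1, 1, 1, 0, 0])

S_demo = most_specific_rectangle(X_cars, r_cars)
```

Any rectangle that contains S and excludes every negative example, such as (10, 35, 90, 200) for this data, is also consistent and so belongs to the version space.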
Margin
Choose h with largest margin
VC Dimension
N points can be labeled in 2^N ways as +/–
H shatters N points if, for every one of these labelings, there exists an h ∈ H consistent with it
The VC dimension VC(H) is the largest such N

An axis-aligned rectangle can shatter at most 4 points!
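
Shattering can be checked by brute force for small point sets. The sketch below enumerates all 2^N labelings and, for each one, tests whether the tightest rectangle around the positive points excludes every negative point. If that bounding box contains a negative point, every rectangle containing the positives does too, so the labeling cannot be realized. The function name and the point sets are illustrative.

```python
from itertools import product

import numpy as np

def rectangles_shatter(points):
    """True if axis-aligned rectangles realize every +/- labeling of the points."""
    pts = np.asarray(points, dtype=float)
    for labels in product([False, True], repeat=len(pts)):
        mask = np.array(labels)
        if not mask.any():
            continue  # all-negative: a rectangle placed away from every point works
        pos = pts[mask]
        lo, hi = pos.min(axis=0), pos.max(axis=0)   # tightest rectangle around the positives
        inside = np.all((pts >= lo) & (pts <= hi), axis=1)
        if np.any(inside & ~mask):
            return False  # a negative point is forced inside: labeling not realizable
    return True

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]   # four points in a diamond layout
diamond_plus_center = diamond + [(0, 0)]       # add a fifth point in the middle
```

The diamond is shattered, while adding the center point breaks it: labeling the four outer points + and the center – cannot be realized by any rectangle.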

Probably Approximately Correct (PAC) Learning
• How many training examples N should we have, such that with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)

• Each strip has probability at most ε/4

• Pr that a random instance misses a strip: 1 − ε/4
• Pr that N instances miss a strip: (1 − ε/4)^N
• Pr that N instances miss any of the 4 strips: at most 4(1 − ε/4)^N
• Require 4(1 − ε/4)^N ≤ δ, and use (1 − x) ≤ exp(−x)
• Then 4 exp(−εN/4) ≤ δ, which holds when N ≥ (4/ε) log(4/δ)
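
The bound is easy to evaluate numerically. The helper below is a small sketch with a hypothetical function name. It takes log to be the natural logarithm, which is what the exp(−x) step implies, and rounds up to the next integer.

```python
import math

def pac_sample_size(epsilon, delta):
    """Smallest integer N with N >= (4/epsilon) * ln(4/delta)."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))

# Example: error at most 0.1 with probability at least 0.95.
N_demo = pac_sample_size(epsilon=0.1, delta=0.05)
```

For ε = 0.1 and δ = 0.05 this gives N = 176, and 4(1 − ε/4)^N is then indeed below δ.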
Noise and Model Complexity
Use the simpler one because
• Simpler to use (lower computational complexity)
• Easier to train (lower space complexity)
• Easier to explain (more interpretable)
• Generalizes better (lower variance - Occam’s razor)
Multiple Classes, Ci, i = 1, …, K

Train hypotheses hi(x), i = 1, …, K:
Regression
Model Selection & Generalization
Learning is an ill-posed problem; data is not sufficient
to find a unique solution
The need for inductive bias, assumptions about H
Generalization: How well a model performs on new
data
Overfitting: H more complex than C or f
Underfitting: H less complex than C or f
Triple Trade-Off

• There is a trade-off between three factors (Dietterich, 2003):
1. Complexity of H, c(H),
2. Training set size, N,
3. Generalization error, E, on new data
• As N ↑, E ↓
• As c(H) ↑, first E ↓ and then E ↑
Cross-Validation
To estimate generalization error, we need data unseen during training. We split the data as
• Training set (50%)
• Validation set (25%)
• Test (publication) set (25%)
Resampling when there is little data
Dimensions of a Supervised Learner
1. Model:

2. Loss function:

3. Optimization procedure:
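
The formulas for these three components were lost in extraction. In one commonly used notation, they can be written as below. The symbols are assumptions, not recovered from the slide: g for the model, θ for its parameters, X = {(x^t, r^t)} for the training set of N examples, and L for a per-example loss.

```latex
\begin{align*}
\text{1. Model:} \quad & g(x \mid \theta) \\
\text{2. Loss function:} \quad & E(\theta \mid \mathcal{X}) = \sum_{t=1}^{N} L\bigl(r^{t},\, g(x^{t} \mid \theta)\bigr) \\
\text{3. Optimization procedure:} \quad & \theta^{*} = \arg\min_{\theta} E(\theta \mid \mathcal{X})
\end{align*}
```

Linear regression fits this template with g linear in θ = w, L the squared error, and the normal equations as the optimization procedure.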
Textbook/ Reference Materials

Introduction to Machine Learning by Ethem Alpaydin

Machine Learning: An Algorithmic Perspective by
Stephen Marsland
Pattern Recognition and Machine Learning by
Christopher M. Bishop
