
CS-E4715 Supervised Machine Learning

Lecture 4: Model selection


Model selection and Occam’s Razor

• Occam’s razor principle ”Entities should not be multiplied unnecessarily” captures the trade-off between generalization error and complexity
• Model selection in machine learning can be seen to implement Occam’s razor

William of Ockham (1285–1347): ”Pluralitas non est ponenda sine necessitate”
Stochastic scenario

• The analysis so far assumed that the labels are deterministic functions of the input
• The stochastic scenario relaxes this assumption by assuming the output is a probabilistic function of the input
• The input and output are generated by a joint probability distribution D over X × Y
• This setup covers the cases where the same input x can have different labels y
• In the stochastic scenario, there may not exist a target concept f that has zero generalization error R(f) = 0
Sources of stochasticity

The stochastic dependency between input and output can arise from various sources:

• Imprecision in recording the input data (e.g. measurement error), shifting our examples in the input space
• Errors in the labeling of the training data (e.g. human annotation errors), flipping the labels of some examples
• There may be additional variables that affect the labels but are not part of our input data

All of these sources can be characterized as adding noise (or hiding signal)
Noise and complexity

• The effect of noise is typically to make the decision boundary more complex
• To obtain a consistent hypothesis on noisy data, we can use a more complex model, e.g. a spline curve instead of a hyperplane
• But this may not give a better generalization error if we end up merely re-classifying points corrupted by noise
Noise and complexity

In practice, we need to balance the complexity of the hypothesis and the empirical error carefully:

• A too simple model does not allow the optimal empirical error to be obtained; this is called underfitting
• A too complex model may obtain zero empirical error but worse than optimal generalization error; this is called overfitting
Controlling complexity

Two general approaches to controlling the complexity:

• Selecting a hypothesis class, e.g. the maximum degree of the polynomial used to fit the regression model
• Regularization: penalizing the use of too many parameters, e.g. by bounding the norm of the weights (used in SVMs and neural networks)
Measuring complexity

What is a good measure of the complexity of a hypothesis class? We have already looked at some measures:

• Number of distinct hypotheses |H|: works for finite H (e.g. models built from binary data), but not for infinite classes (e.g. geometric hypotheses such as polygons, hyperplanes, ellipsoids)
• Vapnik–Chervonenkis dimension (VCdim): the maximum number of examples that can be classified in all possible ways by choosing different hypotheses h ∈ H
• Rademacher complexity: measures the capability to classify after randomizing the labels

Many other complexity measures and model selection methods exist, c.f. https://en.wikipedia.org/wiki/Model_selection (these are not in the scope of this course)
Bayes error
Bayes error

• In the stochastic scenario, there is a minimal non-zero error for any hypothesis, called the Bayes error
• The Bayes error is the minimum achievable error, given a distribution D over X × Y, by measurable functions h : X ↦ Y

  R∗ = inf_{h measurable} R(h)

• Note that we cannot actually compute R∗:
  • We cannot compute the generalization error R(h) exactly (c.f. PAC learning)
  • We cannot evaluate all measurable functions (intuitively: the hypothesis class that contains all functions that are mathematically well-behaved enough to allow us to define probabilities on them)
• The Bayes error serves as a theoretical measure of the best possible performance
Bayes error and noise

• A hypothesis h with R(h) = R∗ is called the Bayes classifier
• The Bayes classifier can be defined in terms of conditional probabilities as

  hBayes(x) = argmax_{y∈{0,1}} Pr(y|x)

• The average error made by the Bayes classifier at x ∈ X is called the noise:

  noise(x) = min(Pr(1|x), Pr(0|x))

• Its expectation E[noise(x)] = R∗ is the Bayes error
• Similarly to the Bayes error, the Bayes classifier is a theoretical tool, not something we can compute in practice
Bayes error example

• We have a univariate input space x ∈ {1, 2, 3} and two classes y ∈ {0, 1}
• We assume a uniform distribution of the data: Pr(x) = 1/3
• The Bayes classifier predicts the most probable class for each possible input value

  X            1    2    3
  P(Y = 1|X)   0.6  0.2  0.7
  P(Y = 0|X)   0.4  0.8  0.3
  hBayes(x)    1    0    1
  noise(x)     0.4  0.2  0.3

The Bayes error is the expectation of the noise over the input domain X:

  R∗ = Σ_x P(x) noise(x) = 1/3 (0.4 + 0.2 + 0.3) = 0.3
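To make the calculation concrete, here is a minimal numpy sketch (not part of the original slides) that recovers the Bayes classifier, the noise, and the Bayes error from the conditional probabilities in the table above:

```python
import numpy as np

# Bayes classifier and Bayes error for the three-point example above,
# assuming the uniform input distribution Pr(x) = 1/3.
p_x = np.array([1/3, 1/3, 1/3])
p_y1 = np.array([0.6, 0.2, 0.7])              # P(Y = 1 | X) for X = 1, 2, 3

h_bayes = (p_y1 > 0.5).astype(int)            # predict the more probable class
noise = np.minimum(p_y1, 1 - p_y1)            # min(P(1|x), P(0|x))
bayes_error = np.sum(p_x * noise)
print(h_bayes, noise, bayes_error)            # [1 0 1], [0.4 0.2 0.3], ≈ 0.3
```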
Decomposing the error of a hypothesis

The excess error of a hypothesis h compared to the Bayes error R∗ can be decomposed as:

  R(h) − R∗ = estimation + approximation

• estimation = R(h) − R(h∗) is the excess generalization error h has over the optimal hypothesis h∗ = argmin_{h′∈H} R(h′) in the hypothesis class H
• approximation = R(h∗) − R∗ is the approximation error due to selecting the hypothesis class H instead of the best possible hypothesis class (which is generally unknown to us)

Note: The approximation error is sometimes called the bias and the estimation error the variance, and the decomposition is then called the bias–variance decomposition
Decomposing the error of a hypothesis

The figure on the right depicts the concepts:

• hBayes is the Bayes classifier, with R(hBayes) = R∗
• h∗ = argmin_{h∈H} R(h) is the hypothesis with the lowest generalization error in the hypothesis class H
• h has both a non-zero estimation error R(h) − R(h∗) and a non-zero approximation error R(h∗) − R(hBayes)
Example: Approximation error

• Assume the hypothesis class of univariate threshold functions:

  H = {h_{a,θ} : X ↦ {0, 1} | a ∈ {−1, +1}, θ ∈ R}, where h_{a,θ}(x) = 1_{ax≥θ}

• The classifier from H with the lowest generalization error separates the data with x = 3 from the data with x < 3, e.g. by choosing a = 1, θ = 2.5, giving h∗(1) = h∗(2) = 0, h∗(3) = 1

  X            1    2    3
  P(y = 1|x)   0.6  0.2  0.7
  P(y = 0|x)   0.4  0.8  0.3
  h∗(x)        0    0    1

• The generalization error is R(h∗) = 1/3 (0.6 + 0.2 + 0.3) ≈ 0.367
• The approximation error satisfies:

  approximation = R(h∗) − R∗ ≈ 0.367 − 0.3 ≈ 0.067
Example: estimation error

• Consider now a training set (shown below) consisting of m = 13 examples drawn from the same distribution as before

  x      1  2  3
  y = 1  3  1  3
  y = 0  1  3  2
  Σ      4  4  5

• The classifier h ∈ H with the lowest training error R̂(h) = 5/13 ≈ 0.385 separates the data with x = 1 from the data with x > 1, e.g. by choosing a = −1, θ = 1.5, giving h(1) = 1, h(2) = h(3) = 0
• However, the generalization error of h is R(h) = 1/3 (0.4 + 0.2 + 0.7) ≈ 0.433
• The estimation error satisfies

  estimation = R(h) − R(h∗) ≈ 0.433 − 0.367 ≈ 0.067

• The decomposition of the error of h is given by:

  R(h) = R∗ + approximation + estimation ≈ 0.3 + 0.067 + 0.067 ≈ 0.433
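The numbers in the two examples can be checked with a short numpy sketch (not from the slides); the hypotheses h∗ and h and the training counts are taken from the tables above:

```python
import numpy as np

# Error decomposition for the threshold-classifier example above.
p_x = np.array([1/3, 1/3, 1/3])
p_y1 = np.array([0.6, 0.2, 0.7])           # P(y = 1 | x) for x = 1, 2, 3
counts_y0 = np.array([1, 3, 2])            # training counts of y = 0 per x
counts_y1 = np.array([3, 1, 3])            # training counts of y = 1 per x

def gen_error(h):                          # generalization error of h : x -> {0, 1}
    return np.sum(p_x * np.where(h == 1, 1 - p_y1, p_y1))

bayes = np.sum(p_x * np.minimum(p_y1, 1 - p_y1))      # R* = 0.3
h_star = np.array([0, 0, 1])                          # best hypothesis in H
h_erm = np.array([1, 0, 0])                           # minimizer of the training error

train_error = (counts_y0[h_erm == 1].sum() + counts_y1[h_erm == 0].sum()) / 13
print(train_error)                                    # 5/13 ≈ 0.385
print(gen_error(h_star) - bayes)                      # approximation ≈ 0.067
print(gen_error(h_erm) - gen_error(h_star))           # estimation ≈ 0.067
```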
Error decomposition and model selection

• We can bound the estimation error estimation = R(h) − R(h∗) by generalization bounds arising from PAC theory
• However, we cannot do the same for the approximation error, since approximation = R(h∗) − R∗ remains unknown to us
• In other words, we do not know how good the hypothesis class is at approximating the label distribution
Complexity and model selection

In model selection we have a trade-off: increasing the complexity of the hypothesis class

• decreases the approximation error, as the class is more likely to contain a hypothesis with error close to the Bayes error
• increases the estimation error, as finding the good hypothesis becomes harder and the generalization bounds become looser (due to the increasing log |H| or VC dimension)

To minimize the generalization error over all hypothesis classes, we should find a balance between the two terms
Learning with complex hypothesis classes

• One strategy for model selection is to initially choose a very complex hypothesis class H with zero or very low empirical risk
• Assume in addition that the class can be decomposed into a union of increasingly complex hypothesis classes H = ∪_{γ∈Γ} Hγ, parametrized by γ, e.g.
  • γ = number of variables in a boolean monomial
  • γ = degree of a polynomial function
  • γ = size of a neural network
  • γ = regularization parameter penalizing large weights
• We expect the approximation error to go down and the estimation error to go up when γ increases
• Model selection entails choosing the γ∗ that gives the best trade-off
Half-time poll: Inequalities on different errors

Let R∗ denote the Bayes error, R(h) the generalization error and R̂S(h) the training error of hypothesis h on training set S.
Which of the inequalities always hold true?

1. R∗ ≤ R̂S(h)
2. R̂S(h) ≤ R(h)
3. R∗ ≤ R(h)

Answer the poll in Mycourses by 11:15: go to the Lectures page and scroll down to ”Lecture 4 poll”.
Answers are anonymous and do not affect the grading of the course.
Regularization-based algorithms
Regularization-based algorithms

• Regularization is a technique that penalizes large weights in a model based on weighting input features:
  • linear regression, support vector machines and neural networks fall under this category
• An important example is the class of linear functions x ↦ wᵀx
• The classes are parametrized by bounding the norm ‖w‖ of the weight vector by γ: Hγ = {x ↦ wᵀx : ‖w‖ ≤ γ}
• The norm is typically either
  • the L2 norm (also called the Euclidean norm or 2-norm): ‖w‖₂ = (Σ_{j=1}^n w_j²)^{1/2}, used e.g. in support vector machines and ridge regression
  • the L1 norm (also called the Manhattan norm or 1-norm): ‖w‖₁ = Σ_{j=1}^n |w_j|, used e.g. in LASSO regression
Regularization-based algorithms

• For the L2-norm case, we have an important computational shortcut: the empirical Rademacher complexity of this class can be bounded analytically!
• Let S ⊂ {x | ‖x‖ ≤ r} be a sample of size m and let Hγ = {x ↦ wᵀx : ‖w‖₂ ≤ γ}. Then

  R̂S(Hγ) ≤ √(r²γ²/m) = rγ/√m

• Thus the Rademacher complexity depends linearly on the upper bound γ on the norm of the weight vector, as r and m are constant for any fixed training set
• We can use ‖w‖ as an efficiently computable upper bound of R̂S(Hγ)
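As a sanity check (not from the slides), the empirical Rademacher complexity of this linear class has the closed form E_σ[(γ/m)‖Σ_i σ_i x_i‖₂] by the Cauchy–Schwarz inequality, which a short Monte Carlo sketch can compare against the bound rγ/√m; the dimensions and constants below are arbitrary illustrations:

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of
# H_gamma = {x -> w^T x : ||w||_2 <= gamma}; for this class the supremum
# over w has the closed form gamma * ||sum_i sigma_i x_i||_2 / m.
rng = np.random.default_rng(0)
m, d, gamma = 100, 5, 2.0
X = rng.normal(size=(m, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # enforce ||x||_2 <= r = 1

sigma = rng.choice([-1.0, 1.0], size=(2000, m))                 # Rademacher variables
estimate = np.mean(gamma * np.linalg.norm(sigma @ X, axis=1) / m)
r = np.linalg.norm(X, axis=1).max()
print(estimate, r * gamma / np.sqrt(m))                         # the estimate stays below the bound
```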
Regularization-based algorithms

• A regularized learning problem is to solve

  argmin_{h∈H} R̂S(h) + λΩ(h)

  • R̂S(h) is the empirical error
  • Ω(h) is the regularization term, which increases when the complexity of the hypothesis increases
  • λ is a regularization parameter, which is usually set by cross-validation
• For linear functions h : x ↦ wᵀx, usually Ω(h) = ‖w‖₂² or Ω(h) = ‖w‖₁
• We will study regularization-based algorithms during the next part of the course
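For concreteness, here is a minimal numpy sketch (not part of the slides) of an L2-regularized linear model: ridge regression, where the empirical squared error plus λ‖w‖₂² has a closed-form minimizer. The data and the value of λ are arbitrary illustrations.

```python
import numpy as np

# L2-regularized (ridge) linear regression: minimize ||Xw - y||^2 + lam * ||w||_2^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                   # 50 examples, 5 features
w_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)     # noisy linear labels

lam = 1.0                                      # regularization parameter lambda
# Closed-form minimizer: w = (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(w_ridge)                                 # shrunk towards zero compared to plain least squares
```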
Model selection using a validation set
Model selection by using a validation set

We can use the given dataset for empirical model selection if the algorithm has input parameters (hyperparameters) that define/affect the model complexity:

• Split the data into training, validation and test sets
• For the hyperparameters, use grid search to find the parameter combination that gives the best performance on the validation set
• Retrain a final model using the optimal parameter combination, using both the training and validation data for training
• Evaluate the performance of the final model on the test set
Grid search

• Grid search is a technique frequently used to optimize hyperparameters, including those that define the complexity of the models
• In its basic form it goes through all combinations of parameter values, given a set of candidate values for each parameter
• For two parameters, taking all value combinations (v, u) ∈ V × U, where V and U are the sets of values for the two parameters, defines a two-dimensional grid to be searched
• Even more parameters can be optimized, but the exhaustive search becomes computationally hard due to the exponentially exploding search space

(Figure by Alexander Elvers, Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=842554)
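Below is a sketch of this workflow (training/validation/test split plus grid search) using scikit-learn; the RBF-kernel SVM, the candidate values and the split ratios are illustrative assumptions, not prescribed by the slides.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy dataset; in practice X, y come from your own data.
X, y = make_classification(n_samples=600, random_state=0)

# Split into training, validation and test sets (60/20/20 here).
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0, stratify=y_tmp)

# Grid of candidate values for two hyperparameters of an RBF-kernel SVM.
C_values = [0.1, 1, 10, 100]
gamma_values = [0.01, 0.1, 1]

best_score, best_params = -1.0, None
for C, gamma in product(C_values, gamma_values):
    model = SVC(C=C, gamma=gamma).fit(X_tr, y_tr)
    score = model.score(X_val, y_val)          # accuracy on the validation set
    if score > best_score:
        best_score, best_params = score, (C, gamma)

# Retrain on training + validation data with the best parameters,
# then evaluate once on the held-out test set.
final = SVC(C=best_params[0], gamma=best_params[1]).fit(X_tmp, y_tmp)
print(best_params, final.score(X_test, y_test))
```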
Model selection by using a validation set

• The need for the validation set comes from the need to avoid overfitting
• If we only used a simple training/test split and selected the hyperparameter values by repeated evaluation on the test set, the performance estimate would be optimistic
• A reliable performance estimate can only be obtained from a test set that was not used for selecting the hyperparameters
How large should the training set be in comparison to the validation set?

• The larger the training set, the better the generalization error will be (e.g. by PAC theory)
• The larger the validation set, the less variance there is in the validation error estimate
• When the dataset is small, the training set is generally taken to be as large as possible, typically 90% or more of the total
• When the dataset is large, the training set size is often taken as big as the computational resources allow
Stratification

• The class distributions of the training and validation sets should be as similar to each other as possible, otherwise there will be extra unwanted variance
  • When the data contains classes with a very low number of examples, random splitting might leave a class with no examples in the validation set
• Stratification is a process that tries to ensure similar class distributions across the different sets
• A simple stratification approach is to divide each class separately into training and validation parts, then merge the class-specific training parts into a global training set and the class-specific validation parts into a global validation set (see the sketch below)
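A minimal numpy sketch of this per-class splitting idea (the function name, split fraction and toy labels are illustrative):

```python
import numpy as np

def stratified_split(y, val_fraction=0.2, seed=0):
    """Split each class separately, then merge the class-specific parts
    into global training and validation index sets."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        n_val = max(1, int(round(val_fraction * len(idx))))   # keep at least one example per class
        val_idx.append(idx[:n_val])
        train_idx.append(idx[n_val:])
    return np.concatenate(train_idx), np.concatenate(val_idx)

y = np.array([0] * 90 + [1] * 10)                 # imbalanced toy labels
tr, va = stratified_split(y)
print(np.bincount(y[tr]), np.bincount(y[va]))     # class proportions preserved in both sets
```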
Cross-validation
The need for multiple data splits

One split of the data into training, validation and test sets may not be enough, due to randomness:

• The training and validation sets might be small and contain noise or outliers
• There might be some randomness in the training procedure (e.g. initialization)
• We need to fight the randomness by averaging the evaluation measure over multiple (training, validation) splits
• The best hyperparameter values are chosen as those that have the best average performance over the n validation sets
Generating multiple data splits

• Let us first consider generating a number of training and validation set pairs, after first setting aside a separate test set
• Given a dataset S, we would like to generate n random splits into training and validation sets
• Two general approaches:
  • Repeated random splitting
  • n-fold cross-validation
n-Fold Cross-Validation

• The dataset S is split randomly into n equal-sized parts (or folds)
• We keep one of the n folds as the validation set (light blue in the figure) and combine the remaining n − 1 folds to form the training set for the split
• n = 5 or n = 10 are typical numbers used in practice
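A small scikit-learn sketch of generating the n folds (the toy data and n = 5 are arbitrary):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)                 # 10 toy examples, 2 features
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    # Each example lands in exactly one validation fold; the rest form the training set.
    print(train_idx, val_idx)
```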
Leave-one-out cross-validation (LOO)

• The extreme case of cross-validation is leave-one-out (LOO): given a dataset of m examples, only one example is left out as the validation set and training uses the remaining m − 1 examples
• This gives an unbiased estimate of the average generalization error over samples of size m − 1 (Mohri et al. 2018, Theorem 5.4)
• However, it is computationally demanding to compute if m is large
Nested cross-validation

• n-fold cross-validation gives us a well-founded way for model selection
• However, using only a single test set may result in unwanted variation
• Nested cross-validation solves this problem by using two cross-validation loops
Nested cross-validation

The dataset is initially divided into n outer folds (n = 5 in the figure):

• The outer loop uses one fold at a time as a test set, and the rest of the data is used in the inner loop
• The inner loop splits the remaining examples into k folds, with 1 fold for validation and k − 1 for training (k = 2 in the figure)

The average performance over the n test sets is computed as the final performance estimate
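A sketch of nested cross-validation with scikit-learn, assuming an SVM and a small grid of C values as the inner-loop hyperparameter search (all of these choices are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

inner = KFold(n_splits=2, shuffle=True, random_state=0)   # k = 2 inner folds, as in the figure
outer = KFold(n_splits=5, shuffle=True, random_state=0)   # n = 5 outer folds

# Inner loop: grid search selects hyperparameters on the validation folds.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner)

# Outer loop: each outer fold serves once as a test set; the mean of the
# n test scores is the final performance estimate.
scores = cross_val_score(search, X, y, cv=outer)
print(scores, scores.mean())
```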
Summary

• Model selection concerns the trade-off between model complexity and empirical error on the training data
• Regularization-based methods are based on a continuous parametrization of the complexity of the hypothesis classes
• Empirical model selection can be achieved by grid search on a validation dataset
• Various cross-validation schemes can be used to tackle the variance of the performance estimates
