
CS-E4715 Supervised Machine Learning

Lecture 4: Model selection


Model selection and Occam’s Razor

• Occam’s razor principle ”Entities should not be multiplied unnecessarily” captures the trade-off between generalization error and complexity
• Model selection in machine learning can be seen to implement Occam’s razor

William of Ockham (1285–1347): ”Pluralitas non est ponenda sine necessitate”
Stochastic scenario

• The analysis so far assumed that the labels are deterministic functions of the input
• The stochastic scenario relaxes this assumption by assuming the output is a probabilistic function of the input
• The input and output are generated by a joint probability distribution D over X × Y
• This setup covers the cases where the same input x can have different labels y
• In the stochastic scenario, there may not exist a target concept f that has zero generalization error R(f) = 0
Sources of stochasticity

The stochastic dependency between input and output can arise from various sources:

• Imprecision in recording the input data (e.g. measurement error), shifting our examples in the input space
• Errors in the labeling of the training data (e.g. human annotation errors), flipping the labels of some examples
• There may be additional variables that affect the labels but are not part of our input data

All of these sources can be characterized as adding noise (or hiding signal)
Noise and complexity

• The effect of noise is typically to make the decision boundary more complex
• To obtain a consistent hypothesis on noisy data, we can use a more complex model, e.g. a spline curve instead of a hyperplane
• But this may not give a better generalization error if we end up merely re-classifying points corrupted by noise
Noise and complexity

In practice, we need to balance the complexity of the hypothesis and the empirical error carefully:

• A too simple model does not allow the optimal empirical error to be obtained; this is called underfitting
• A too complex model may obtain zero empirical error but worse than optimal generalization error; this is called overfitting
Controlling complexity

Two general approaches to controlling the complexity:

• Selecting a hypothesis class, e.g. the maximum degree of the polynomial used to fit the regression model
• Regularization: penalizing the use of too many parameters, e.g. by bounding the norm of the weights (used in SVMs and neural networks)
Measuring complexity

What is a good measure of the complexity of a hypothesis class? We have already looked at some measures:

• Number of distinct hypotheses |H|: works for finite H (e.g. models built from binary data), but not for infinite classes (e.g. geometric hypotheses such as polygons, hyperplanes, ellipsoids)
• Vapnik–Chervonenkis dimension (VCdim): the maximum number of examples that can be classified in all possible ways by choosing different hypotheses h ∈ H
• Rademacher complexity: measures the capability to classify after randomizing the labels

Many other complexity measures and model selection methods exist, c.f. https://en.wikipedia.org/wiki/Model_selection (these are not in the scope of this course)
Bayes error
Bayes error

• In the stochastic scenario, there is a minimal non-zero error for any hypothesis, called the Bayes error
• The Bayes error is the minimum achievable error, given a distribution D over X × Y, by measurable functions h : X ↦ Y

  R∗ = inf_{h measurable} R(h)

• Note that we cannot actually compute R∗:
  • We cannot compute the generalization error R(h) exactly (c.f. PAC learning)
  • We cannot evaluate all measurable functions (intuitively: the hypothesis class that contains all functions that are mathematically well-behaved enough to allow us to define probabilities on them)
• The Bayes error serves as a theoretical measure of the best possible performance
Bayes error and noise

• A hypothesis h with R(h) = R∗ is called the Bayes classifier
• The Bayes classifier can be defined in terms of conditional probabilities as

  hBayes(x) = argmax_{y∈{0,1}} Pr(y|x)

• The average error made by the Bayes classifier at x ∈ X is called the noise:

  noise(x) = min(Pr(1|x), Pr(0|x))

• Its expectation E[noise(x)] = R∗ is the Bayes error
• Similarly to the Bayes error, the Bayes classifier is a theoretical tool, not something we can compute in practice
Bayes error example

• We have a univariate input space x ∈ {1, 2, 3} and two classes y ∈ {0, 1}
• We assume a uniform distribution of the data: Pr(x) = 1/3
• The Bayes classifier predicts the most probable class for each possible input value

  X            1    2    3
  P(Y = 1|X)   0.6  0.2  0.7
  P(Y = 0|X)   0.4  0.8  0.3
  hBayes(x)    1    0    1
  noise(x)     0.4  0.2  0.3

The Bayes error is the expectation of the noise over the input domain X:

  R∗ = Σ_x P(x) noise(x) = 1/3 (0.4 + 0.2 + 0.3) = 0.3
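To make the calculation concrete, here is a minimal numpy sketch (not part of the original slides) that recovers the Bayes classifier, the noise, and the Bayes error from the conditional probabilities in the table above:

```python
import numpy as np

# Bayes classifier and Bayes error for the three-point example above,
# assuming the uniform input distribution Pr(x) = 1/3.
p_x = np.array([1/3, 1/3, 1/3])
p_y1 = np.array([0.6, 0.2, 0.7])              # P(Y = 1 | X) for X = 1, 2, 3

h_bayes = (p_y1 > 0.5).astype(int)            # predict the more probable class
noise = np.minimum(p_y1, 1 - p_y1)            # min(P(1|x), P(0|x))
bayes_error = np.sum(p_x * noise)
print(h_bayes, noise, bayes_error)            # [1 0 1], [0.4 0.2 0.3], ≈ 0.3
```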
Decomposing the error of a hypothesis

The excess error of a hypothesis h compared to the Bayes error R∗ can be decomposed as:

  R(h) − R∗ = estimation + approximation

• estimation = R(h) − R(h∗) is the excess generalization error h has over the optimal hypothesis h∗ = argmin_{h′∈H} R(h′) in the hypothesis class H
• approximation = R(h∗) − R∗ is the approximation error due to selecting the hypothesis class H instead of the best possible hypothesis class (which is generally unknown to us)

Note: The approximation error is sometimes called the bias and the estimation error the variance, and the decomposition is then called the bias–variance decomposition
Decomposing the error of a hypothesis

The figure on the right depicts the concepts:

• hBayes is the Bayes classifier, with R(hBayes) = R∗
• h∗ = argmin_{h∈H} R(h) is the hypothesis with the lowest generalization error in the hypothesis class H
• h has both a non-zero estimation error R(h) − R(h∗) and a non-zero approximation error R(h∗) − R(hBayes)
Example: Approximation error

• Assume the hypothesis class of univariate threshold functions:

  H = {h_{a,θ} : X ↦ {0, 1} | a ∈ {−1, +1}, θ ∈ R}, where h_{a,θ}(x) = 1_{ax≥θ}

• The classifier from H with the lowest generalization error separates the data with x = 3 from the data with x < 3, e.g. by choosing a = 1, θ = 2.5, giving h∗(1) = h∗(2) = 0, h∗(3) = 1

  X            1    2    3
  P(y = 1|x)   0.6  0.2  0.7
  P(y = 0|x)   0.4  0.8  0.3
  h∗(x)        0    0    1

• The generalization error is R(h∗) = 1/3 (0.6 + 0.2 + 0.3) ≈ 0.367
• The approximation error satisfies:

  approximation = R(h∗) − R∗ ≈ 0.367 − 0.3 ≈ 0.067
Example: estimation error

• Consider now a training set (shown below) consisting of m = 13 examples drawn from the same distribution as before

  x      1  2  3
  y = 1  3  1  3
  y = 0  1  3  2
  Σ      4  4  5

• The classifier h ∈ H with the lowest training error R̂(h) = 5/13 ≈ 0.385 separates the data with x = 1 from the data with x > 1, e.g. by choosing a = −1, θ = 1.5, giving h(1) = 1, h(2) = h(3) = 0
• However, the generalization error of h is R(h) = 1/3 (0.4 + 0.2 + 0.7) ≈ 0.433
• The estimation error satisfies

  estimation = R(h) − R(h∗) ≈ 0.433 − 0.367 ≈ 0.067

• The decomposition of the error of h is given by:

  R(h) = R∗ + approximation + estimation ≈ 0.3 + 0.067 + 0.067 ≈ 0.433
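The numbers in the two examples can be checked with a short numpy sketch (not from the slides); the hypotheses h∗ and h and the training counts are taken from the tables above:

```python
import numpy as np

# Error decomposition for the threshold-classifier example above.
p_x = np.array([1/3, 1/3, 1/3])
p_y1 = np.array([0.6, 0.2, 0.7])           # P(y = 1 | x) for x = 1, 2, 3
counts_y0 = np.array([1, 3, 2])            # training counts of y = 0 per x
counts_y1 = np.array([3, 1, 3])            # training counts of y = 1 per x

def gen_error(h):                          # generalization error of h : x -> {0, 1}
    return np.sum(p_x * np.where(h == 1, 1 - p_y1, p_y1))

bayes = np.sum(p_x * np.minimum(p_y1, 1 - p_y1))      # R* = 0.3
h_star = np.array([0, 0, 1])                          # best hypothesis in H
h_erm = np.array([1, 0, 0])                           # minimizer of the training error

train_error = (counts_y0[h_erm == 1].sum() + counts_y1[h_erm == 0].sum()) / 13
print(train_error)                                    # 5/13 ≈ 0.385
print(gen_error(h_star) - bayes)                      # approximation ≈ 0.067
print(gen_error(h_erm) - gen_error(h_star))           # estimation ≈ 0.067
```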
Error decomposition and model selection

• We can bound the estimation error estimation = R(h) − R(h∗) by generalization bounds arising from PAC theory
• However, we cannot do the same for the approximation error, since approximation = R(h∗) − R∗ remains unknown to us
• In other words, we do not know how good the hypothesis class is at approximating the label distribution
Complexity and model selection

In model selection we have a trade-off: increasing the complexity of the hypothesis class

• decreases the approximation error, as the class is more likely to contain a hypothesis with error close to the Bayes error
• increases the estimation error, as finding the good hypothesis becomes harder and the generalization bounds become looser (due to the increasing log |H| or VC dimension)

To minimize the generalization error over all hypothesis classes, we should find a balance between the two terms
Learning with complex hypothesis classes

• One strategy for model selection is to initially choose a very complex hypothesis class H with zero or very low empirical risk
• Assume in addition that the class can be decomposed into a union of increasingly complex hypothesis classes H = ∪_{γ∈Γ} Hγ, parametrized by γ, e.g.
  • γ = number of variables in a boolean monomial
  • γ = degree of a polynomial function
  • γ = size of a neural network
  • γ = regularization parameter penalizing large weights
• We expect the approximation error to go down and the estimation error to go up when γ increases
• Model selection entails choosing the γ∗ that gives the best trade-off
Half-time poll: Inequalities on different errors

Let R∗ denote the Bayes error, R(h) the generalization error and R̂S(h) the training error of hypothesis h on training set S.
Which of the inequalities always hold true?

1. R∗ ≤ R̂S(h)
2. R̂S(h) ≤ R(h)
3. R∗ ≤ R(h)

Answer the poll in Mycourses by 11:15: go to the Lectures page and scroll down to ”Lecture 4 poll”.
Answers are anonymous and do not affect the grading of the course.
Regularization-based algorithms
Regularization-based algorithms

• Regularization is a technique that penalizes large weights in a model based on weighting input features:
  • linear regression, support vector machines and neural networks fall under this category
• An important example is the class of linear functions x ↦ wᵀx
• The classes are parametrized by bounding the norm ‖w‖ of the weight vector by γ: Hγ = {x ↦ wᵀx : ‖w‖ ≤ γ}
• The norm is typically either
  • the L2 norm (also called the Euclidean norm or 2-norm): ‖w‖₂ = (Σ_{j=1}^n w_j²)^{1/2}, used e.g. in support vector machines and ridge regression
  • the L1 norm (also called the Manhattan norm or 1-norm): ‖w‖₁ = Σ_{j=1}^n |w_j|, used e.g. in LASSO regression
Regularization-based algorithms

• For the L2-norm case, we have an important computational shortcut: the empirical Rademacher complexity of this class can be bounded analytically!
• Let S ⊂ {x | ‖x‖ ≤ r} be a sample of size m and let Hγ = {x ↦ wᵀx : ‖w‖₂ ≤ γ}. Then

  R̂S(Hγ) ≤ √(r²γ²/m) = rγ/√m

• Thus the Rademacher complexity depends linearly on the upper bound γ on the norm of the weight vector, as r and m are constant for any fixed training set
• We can use ‖w‖ as an efficiently computable upper bound of R̂S(Hγ)
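As a sanity check (not from the slides), the empirical Rademacher complexity of this linear class has the closed form E_σ[(γ/m)‖Σ_i σ_i x_i‖₂] by the Cauchy–Schwarz inequality, which a short Monte Carlo sketch can compare against the bound rγ/√m; the dimensions and constants below are arbitrary illustrations:

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of
# H_gamma = {x -> w^T x : ||w||_2 <= gamma}; for this class the supremum
# over w has the closed form gamma * ||sum_i sigma_i x_i||_2 / m.
rng = np.random.default_rng(0)
m, d, gamma = 100, 5, 2.0
X = rng.normal(size=(m, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # enforce ||x||_2 <= r = 1

sigma = rng.choice([-1.0, 1.0], size=(2000, m))                 # Rademacher variables
estimate = np.mean(gamma * np.linalg.norm(sigma @ X, axis=1) / m)
r = np.linalg.norm(X, axis=1).max()
print(estimate, r * gamma / np.sqrt(m))                         # the estimate stays below the bound
```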
Regularization-based algorithms

• A regularized learning problem is to solve

  argmin_{h∈H} R̂S(h) + λΩ(h)

  • R̂S(h) is the empirical error
  • Ω(h) is the regularization term, which increases when the complexity of the hypothesis increases
  • λ is a regularization parameter, which is usually set by cross-validation
• For linear functions h : x ↦ wᵀx, usually Ω(h) = ‖w‖₂² or Ω(h) = ‖w‖₁
• We will study regularization-based algorithms during the next part of the course
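For concreteness, here is a minimal numpy sketch (not part of the slides) of an L2-regularized linear model: ridge regression, where the empirical squared error plus λ‖w‖₂² has a closed-form minimizer. The data and the value of λ are arbitrary illustrations.

```python
import numpy as np

# L2-regularized (ridge) linear regression: minimize ||Xw - y||^2 + lam * ||w||_2^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                   # 50 examples, 5 features
w_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)     # noisy linear labels

lam = 1.0                                      # regularization parameter lambda
# Closed-form minimizer: w = (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(w_ridge)                                 # shrunk towards zero compared to plain least squares
```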
Model selection using a validation set
Model selection by using a validation set

We can use the given dataset for empirical model selection if the algorithm has input parameters (hyperparameters) that define/affect the model complexity:

• Split the data into training, validation and test sets
• For the hyperparameters, use grid search to find the parameter combination that gives the best performance on the validation set
• Retrain a final model using the optimal parameter combination, using both the training and validation data for training
• Evaluate the performance of the final model on the test set
Grid search

• Grid search is a technique frequently used to optimize hyperparameters, including those that define the complexity of the models
• In its basic form it goes through all combinations of parameter values, given a set of candidate values for each parameter
• For two parameters, taking all value combinations (v, u) ∈ V × U, where V and U are the sets of values for the two parameters, defines a two-dimensional grid to be searched
• Even more parameters can be optimized, but the exhaustive search becomes computationally hard due to the exponentially exploding search space

(Figure by Alexander Elvers, Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=842554)
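Below is a sketch of this workflow (training/validation/test split plus grid search) using scikit-learn; the RBF-kernel SVM, the candidate values and the split ratios are illustrative assumptions, not prescribed by the slides.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy dataset; in practice X, y come from your own data.
X, y = make_classification(n_samples=600, random_state=0)

# Split into training, validation and test sets (60/20/20 here).
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0, stratify=y_tmp)

# Grid of candidate values for two hyperparameters of an RBF-kernel SVM.
C_values = [0.1, 1, 10, 100]
gamma_values = [0.01, 0.1, 1]

best_score, best_params = -1.0, None
for C, gamma in product(C_values, gamma_values):
    model = SVC(C=C, gamma=gamma).fit(X_tr, y_tr)
    score = model.score(X_val, y_val)          # accuracy on the validation set
    if score > best_score:
        best_score, best_params = score, (C, gamma)

# Retrain on training + validation data with the best parameters,
# then evaluate once on the held-out test set.
final = SVC(C=best_params[0], gamma=best_params[1]).fit(X_tmp, y_tmp)
print(best_params, final.score(X_test, y_test))
```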
Model selection by using a validation set

• The need for the validation set comes from the need to avoid overfitting
• If we only used a simple training/test split and selected the hyperparameter values by repeated evaluation on the test set, the performance estimate would be optimistic
• A reliable performance estimate can only be obtained from a test set that was not used for selecting the hyperparameters
How large should the training set be in comparison to the validation set?

• The larger the training set, the better the generalization error will be (e.g. by PAC theory)
• The larger the validation set, the less variance there is in the validation error estimate
• When the dataset is small, the training set is generally taken to be as large as possible, typically 90% or more of the total
• When the dataset is large, the training set size is often taken as big as the computational resources allow
Stratification

• The class distributions of the training and validation sets should be as similar to each other as possible, otherwise there will be extra unwanted variance
  • When the data contains classes with a very low number of examples, random splitting might leave a class with no examples in the validation set
• Stratification is a process that tries to ensure similar class distributions across the different sets
• A simple stratification approach is to divide each class separately into training and validation parts, then merge the class-specific training parts into a global training set and the class-specific validation parts into a global validation set (see the sketch below)
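A minimal numpy sketch of this per-class splitting idea (the function name, split fraction and toy labels are illustrative):

```python
import numpy as np

def stratified_split(y, val_fraction=0.2, seed=0):
    """Split each class separately, then merge the class-specific parts
    into global training and validation index sets."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        n_val = max(1, int(round(val_fraction * len(idx))))   # keep at least one example per class
        val_idx.append(idx[:n_val])
        train_idx.append(idx[n_val:])
    return np.concatenate(train_idx), np.concatenate(val_idx)

y = np.array([0] * 90 + [1] * 10)                 # imbalanced toy labels
tr, va = stratified_split(y)
print(np.bincount(y[tr]), np.bincount(y[va]))     # class proportions preserved in both sets
```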
Cross-validation
The need for multiple data splits

One split of the data into training, validation and test sets may not be enough, due to randomness:

• The training and validation sets might be small and contain noise or outliers
• There might be some randomness in the training procedure (e.g. initialization)
• We need to fight the randomness by averaging the evaluation measure over multiple (training, validation) splits
• The best hyperparameter values are chosen as those that have the best average performance over the n validation sets
Generating multiple data splits

• Let us first consider generating a number of training and validation set pairs, after first setting aside a separate test set
• Given a dataset S, we would like to generate n random splits into training and validation sets
• Two general approaches:
  • Repeated random splitting
  • n-fold cross-validation
n-Fold Cross-Validation

• The dataset S is split randomly into n equal-sized parts (or folds)
• We keep one of the n folds as the validation set (light blue in the figure) and combine the remaining n − 1 folds to form the training set for the split
• n = 5 or n = 10 are typical numbers used in practice
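A small scikit-learn sketch of generating the n folds (the toy data and n = 5 are arbitrary):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)                 # 10 toy examples, 2 features
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    # Each example lands in exactly one validation fold; the rest form the training set.
    print(train_idx, val_idx)
```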
Leave-one-out cross-validation (LOO)

• The extreme case of cross-validation is leave-one-out (LOO): given a dataset of m examples, only one example is left out as the validation set and training uses the remaining m − 1 examples
• This gives an unbiased estimate of the average generalization error over samples of size m − 1 (Mohri et al. 2018, Theorem 5.4)
• However, it is computationally demanding to compute if m is large
Nested cross-validation

• n-fold cross-validation gives us a well-founded way for model selection
• However, using only a single test set may result in unwanted variation
• Nested cross-validation solves this problem by using two cross-validation loops
Nested cross-validation

The dataset is initially divided into n outer folds (n = 5 in the figure):

• The outer loop uses one fold at a time as a test set, and the rest of the data is used in the inner loop
• The inner loop splits the remaining examples into k folds, with 1 fold for validation and k − 1 for training (k = 2 in the figure)

The average performance over the n test sets is computed as the final performance estimate
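A sketch of nested cross-validation with scikit-learn, assuming an SVM and a small grid of C values as the inner-loop hyperparameter search (all of these choices are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

inner = KFold(n_splits=2, shuffle=True, random_state=0)   # k = 2 inner folds, as in the figure
outer = KFold(n_splits=5, shuffle=True, random_state=0)   # n = 5 outer folds

# Inner loop: grid search selects hyperparameters on the validation folds.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner)

# Outer loop: each outer fold serves once as a test set; the mean of the
# n test scores is the final performance estimate.
scores = cross_val_score(search, X, y, cv=outer)
print(scores, scores.mean())
```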
Summary

• Model selection concerns the trade-off between model complexity and empirical error on the training data
• Regularization-based methods are based on a continuous parametrization of the complexity of the hypothesis classes
• Empirical model selection can be achieved by grid search on a validation dataset
• Various cross-validation schemes can be used to tackle the variance of the performance estimates
