SML Lecture 4
Sources of stochasticity
The stochastic dependency between input and output can arise from
various sources
Noise and complexity
Controlling complexity
Measuring complexity
Many other complexity measures and model selection methods exist, cf.
https://en.wikipedia.org/wiki/Model_selection (these are not within the scope of this course)
Bayes error
R∗ = inf_{h measurable} R(h)
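For binary classification this infimum has a well-known closed form (stated here for concreteness; the notation η(x) = P(y = 1 | x) is added here and not taken from the slide):

h∗(x) = 1 if η(x) ≥ 1/2 and 0 otherwise,   R∗ = R(h∗) = E_x[ min{η(x), 1 − η(x)} ]

so the Bayes error is the expected probability of the minority label at each input x.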
Bayes error and noise
Bayes error example
Decomposing the error of a hypothesis
Note: The approximation error is sometimes called the bias and the estimation error the
variance; the decomposition is then referred to as the bias-variance decomposition
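Written out explicitly (in the notation of the surrounding slides, with H the hypothesis class; this spelled-out form is added here for reference), the excess error of a hypothesis h ∈ H decomposes as

R(h) − R∗ = ( R(h) − inf_{h'∈H} R(h') ) + ( inf_{h'∈H} R(h') − R∗ )
                estimation error             approximation error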
Example: Approximation error
Example: estimation error
• Consider now a training set, shown in the table below, consisting of m = 13 examples
  from the same distribution as before:

x       1   2   3
y = 1   3   1   3
y = 0   1   3   2
Σ       4   4   5
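A small sketch (not from the slides) of the empirically best hypothesis on this sample: for each x, predict the majority label among the counts in the table, and measure the resulting training error. Only the counts come from the table; the code itself is illustrative.

# counts[x] = (number of examples with y = 1, number with y = 0), taken from the table
counts = {1: (3, 1), 2: (1, 3), 3: (3, 2)}
m = sum(n1 + n0 for n1, n0 in counts.values())          # 13 examples in total

# Majority-vote hypothesis learned from this sample (ties broken towards y = 1)
h_hat = {x: 1 if n1 >= n0 else 0 for x, (n1, n0) in counts.items()}

# Training error: each x contributes its minority count of misclassified examples
errors = sum(min(n1, n0) for n1, n0 in counts.values())
print(h_hat)        # {1: 1, 2: 0, 3: 1}
print(errors / m)   # 4/13 ≈ 0.31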
Error decomposition and model selection
Complexity and model selection
• γ = number of variables in a Boolean monomial
• γ = degree of a polynomial function
• γ = size of a neural network
• γ = regularization parameter penalizing large weights
• We expect the approximation error to go down and the estimation error to go up when γ
  increases
• Model selection entails choosing the γ∗ that gives the best trade-off (a sketch of this for
  the polynomial case follows below)
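A minimal sketch of that trade-off for one of the examples above, with γ = degree of a polynomial (the data-generating function, noise level and split sizes are arbitrary illustrative choices, and numpy's polynomial fitting stands in for the learning algorithm):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative noisy data from a smooth target function
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + 0.3 * rng.normal(size=x.shape)

# Split into a training part and a validation part
x_tr, y_tr = x[:150], y[:150]
x_val, y_val = x[150:], y[150:]

# gamma = polynomial degree; evaluate each candidate degree on the validation part
val_errors = {}
for gamma in range(1, 13):
    coeffs = np.polyfit(x_tr, y_tr, deg=gamma)          # fit on training data only
    preds = np.polyval(coeffs, x_val)                   # predict on validation data
    val_errors[gamma] = np.mean((preds - y_val) ** 2)   # validation mean squared error

gamma_star = min(val_errors, key=val_errors.get)        # empirically best trade-off
print(gamma_star, val_errors[gamma_star])

Low degrees underfit (large approximation error), very high degrees overfit the 150 training points (large estimation error), and the validation error is smallest somewhere in between.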
Half-time poll: Inequalities on different errors
Let R∗ denote the Bayes error, R(h) the generalization error, and R̂S(h) the training error
of hypothesis h on training set S.
Which of the inequalities always hold true?
1. R∗ ≤ R̂S(h)
2. R̂S(h) ≤ R(h)
3. R∗ ≤ R(h)
Regularization-based algorithms
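As a generic reminder (this formulation is added here and is not necessarily the exact one used on these slides), regularization-based algorithms pick the hypothesis that minimizes a penalized training error,

ĥ = argmin_{h∈H} R̂S(h) + γ Ω(h),

where Ω(h) measures the complexity of h (e.g. Ω(w) = ‖w‖² for penalizing large weights) and γ is the regularization parameter acting as the complexity-controlling hyperparameter.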
Model selection using a validation set
Model selection by using a validation set
We can use the given dataset for empirical model selection if the algorithm has input
parameters (hyperparameters) that define or affect the model complexity
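In symbols (the notation below is added here for clarity): split the available data S into a training part S_train and a validation part S_val, train a hypothesis h_γ on S_train for each candidate hyperparameter value γ, and select

γ∗ = argmin_γ R̂_{S_val}(h_γ),

i.e. the value whose model performs best on the held-out validation part.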
Grid search
• The need for the validation set comes from the need to avoid overfitting during model
  selection
• If we only used a simple training/test split and selected the hyperparameter values by
  repeated evaluation on the test set, the performance estimate would be optimistic
• A reliable performance estimate can only be obtained from a test set that is kept aside
  from hyperparameter selection (see the sketch below)
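A minimal sketch of this protocol (assuming scikit-learn is available; ridge regression, the synthetic data and the particular grid of values are arbitrary illustrative choices):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=300)

# Hold out a test set that is touched only once, at the very end
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split the rest into training and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]                      # candidate hyperparameter values
val_err = {}
for gamma in grid:
    model = Ridge(alpha=gamma).fit(X_tr, y_tr)            # train on the training set only
    val_err[gamma] = mean_squared_error(y_val, model.predict(X_val))  # evaluate on validation set

gamma_star = min(val_err, key=val_err.get)                # best value according to the validation set
final = Ridge(alpha=gamma_star).fit(X_rest, y_rest)       # refit on training + validation data
print("selected gamma:", gamma_star)
print("test MSE:", mean_squared_error(y_test, final.predict(X_test)))  # reported once, from the test set

The test set is used exactly once, so the reported estimate does not suffer from the optimism caused by selecting hyperparameters on it.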
How large should the training set be compared to the validation set?
• The larger the training set, the lower the generalization error will be (e.g. by PAC theory)
• The larger the validation set, the less variance there is in the error estimate computed
  on it
• When the dataset is small, the training set is generally taken to be as large as possible,
  typically 90% or more of the total
• When the dataset is large, the training set size is often taken to be as large as the
  computational resources allow
Stratification
Cross-validation
The need for multiple data splits
One split of data into training, validation and test sets may not be
enough, due to randomness:
• The training and validation sets might be small and contain noise or
outliers
• There might be some randomness in the training procedure (e.g.
initialization)
• We need to fight the randomness by averaging the evaluation
measure over multiple (training, validation) splits
• The best hyperparameter values are chosen as those that have the best average
  performance over the n validation sets (see the sketch below).
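A sketch of this averaging (illustrative only; scikit-learn's KFold generates the n splits and ridge regression stands in for the learning algorithm, both assumptions rather than choices from the slides):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=200)

grid = [0.01, 0.1, 1.0, 10.0]                          # candidate hyperparameter values
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # n = 5 (training, validation) splits

avg_val_err = {}
for gamma in grid:
    errs = []
    for tr_idx, val_idx in kf.split(X):
        model = Ridge(alpha=gamma).fit(X[tr_idx], y[tr_idx])
        errs.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
    avg_val_err[gamma] = np.mean(errs)                 # average the evaluation measure over the splits

gamma_star = min(avg_val_err, key=avg_val_err.get)     # best average performance over the validation sets
print("selected gamma:", gamma_star)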
Generating multiple data splits
n-Fold Cross-Validation
Leave-one-out cross-validation (LOO)
Nested cross-validation
Summary