W3 ECS7020P
12 Oct 2022
Ventris’ decisive check
2/51
Don’t fool yourself!
3/51
What’s different about machine learning?
4/51
Sampling a population
We need to ensure that datasets are representative and provide a
complete picture of the target population by:
Extracting samples randomly and independently.
Ensuring they come from the target population (identically
distributed).
Having a sufficiently large number of samples.
Independent and identically distributed datasets are known as IID.
[Diagram: sampling from the population produces the dataset]
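A minimal sketch of IID sampling (my own example, not from the slides): the synthetic "population", sample size and random seed below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "population": heights (cm) of 1,000,000 hypothetical individuals.
population = rng.normal(loc=170, scale=10, size=1_000_000)

# IID sample: each individual is drawn independently and from the same
# population (sampling with replacement keeps the draws identically distributed).
sample = rng.choice(population, size=500, replace=True)

print(f"Population mean: {population.mean():.2f}")
print(f"Sample mean:     {sample.mean():.2f}")
```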
5/51
Agenda
Training a model
Summary
6/51
Deployment quality
[Diagram: data and priors are fed to the learner, which produces the model]
Take 2 definition: The best model is the one with the highest
deployment quality, i.e. the best on the target population.
7/51
Deployment quality
8/51
Assessing the deployment quality
If we could use all the data that a model could ever be shown (the whole population), we would be able to quantify its true deployment quality.
Instead, we use a subset of the data, the test dataset, to compute the test quality as an estimate of the true deployment quality.
[Diagram: a test dataset sampled from the population gives the test quality, an estimate of the true quality]
9/51
Random nature of the test quality
Test datasets are extracted randomly. Hence, the test quality is itself
random, as different datasets will in general produce different values.
[Diagram: test datasets 1, 2 and 3, each sampled from the population, give test qualities 1, 2 and 3; all of them estimate the same true quality]
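A sketch of this idea (my own example, not from the slides): each randomly drawn test set gives a slightly different estimate of the same true quality. The synthetic population, the fixed model and the use of accuracy as the quality measure are all assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic population: x in [0, 1], label y = 1 if (noisy) x exceeds 0.5.
N_POP = 100_000
x_pop = rng.uniform(0, 1, N_POP)
y_pop = (x_pop + rng.normal(0, 0.1, N_POP) > 0.5).astype(int)

# A fixed (hypothetical) model: predict 1 whenever x > 0.5.
predict = lambda x: (x > 0.5).astype(int)

# True quality: accuracy over the whole population.
true_quality = (predict(x_pop) == y_pop).mean()

# Test quality: accuracy on random test datasets of 200 samples each.
for k in range(3):
    idx = rng.choice(N_POP, size=200, replace=False)
    test_quality = (predict(x_pop[idx]) == y_pop[idx]).mean()
    print(f"Test quality {k + 1}: {test_quality:.3f}")

print(f"True quality:   {true_quality:.3f}")
```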
10/51
Comparing models
Models built by different teams can be compared based on their test
qualities.
[Diagram: the same test dataset, sampled from the population, is used to compute the test quality of each model, e.g. model 2]
11/51
The Infinite Monkey Theorem
12/51
Agenda
Training a model
Summary
13/51
Optimisation theory
14/51
The error surface
The error surface (a.k.a. error, objective, loss or cost function) denoted
by E(w) maps each candidate model w to its error. We will assume
that we can obtain it using the ideal description of our target population.
15/51
The error surface
[Plot: the error surface, showing the error of each candidate model w]
16/51
The error surface and the optimal model
The optimal model can be identified as the one with the lowest error.
17/51
The error surface and the optimal model
The gradient (slope) of the error surface, ∇E(w), is zero at the optimal
model. Hence we can look for it by identifying where ∇E(w) = 0.
[Plot: a convex error surface, whose minimum is the only point where the gradient is zero]
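As a toy worked example (not from the slides): for the one-parameter convex error surface E(w) = (w − 3)^2, the gradient is ∇E(w) = 2(w − 3), which is zero only at w = 3, the minimum of the surface.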
18/51
The error surface and the optimal model
What if we do not have enough computational power to obtain the error or the
gradient of every candidate model? How can we then find the optimal model?
19/51
Gradient descent
The gradient provides the direction along which the error increases the
most. Using the gradient, we can create the following update rule, where η > 0 is the step size:

w ← w − η ∇E(w)

Moving against the gradient takes the model downhill on the error surface.
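A minimal sketch of this update rule (my own toy example; the two-parameter quadratic error surface and step size below are arbitrary choices, not from the module):

```python
import numpy as np

def error(w):
    """A simple convex error surface E(w) with minimum at w = (2, -1)."""
    return (w[0] - 2) ** 2 + (w[1] + 1) ** 2

def gradient(w):
    """Gradient of E(w), computed analytically."""
    return np.array([2 * (w[0] - 2), 2 * (w[1] + 1)])

w = np.array([0.0, 0.0])   # initial model
eta = 0.1                  # step size

for _ in range(100):
    w = w - eta * gradient(w)   # move against the gradient

print(w, error(w))   # w approaches (2, -1), the error approaches 0
```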
20/51
Gradient descent
[Plot: gradient descent iterations moving downhill on the error surface]
21/51
The step size
The step size η controls how much we change the parameters w of our
model in each iteration of gradient descent:
Small values of η result in slow convergence to the optimal model.
Large values of η risk overshooting the optimal model.
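A tiny demonstration (my own example on the one-parameter surface E(w) = w^2, not from the slides): a small η converges slowly, while a too-large η overshoots and diverges.

```python
def run(eta, steps=20, w=5.0):
    """Gradient descent on E(w) = w^2, whose gradient is 2w."""
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

print(run(eta=0.01))   # small step size: still far from the minimum at w = 0
print(run(eta=0.4))    # moderate step size: very close to 0
print(run(eta=1.1))    # too large: overshoots and diverges (|w| keeps growing)
```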
22/51
Small steps
[Plot: gradient descent with a small step size takes many small steps towards the minimum]
23/51
Large steps
[Plot: gradient descent with a large step size overshoots the minimum]
24/51
Starting and stopping
For gradient descent to start, we need an initial model. The choice of the
initial model can be crucial. The initial parameters w are usually chosen
randomly (but within a sensible range of values).
In general, gradient descent will not reach the optimal model, hence it is
necessary to design a stopping strategy. Common choices include:
Number of iterations.
Processing time.
Error value.
Relative change of the error value.
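An illustration of two of these stopping criteria (my own sketch, reusing the toy surface E(w) = w^2): the loop stops after a maximum number of iterations or once the error falls below a target value.

```python
def gradient_descent(w=5.0, eta=0.1, max_iters=1000, target_error=1e-8):
    """Gradient descent on E(w) = w^2 with two stopping criteria:
    a maximum number of iterations and a target error value."""
    for i in range(max_iters):
        if w ** 2 < target_error:      # stop once the error is small enough
            break
        w = w - eta * 2 * w            # gradient of w^2 is 2w
    return w, i

w, iters = gradient_descent()
print(f"Stopped at w = {w:.2e} after {iters} iterations")
```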
25/51
Local and global solutions
26/51
Local and global optimal solutions
Gradient descent can get stuck in local optima. To avoid them, we can
repeat the procedure from several initial models and select the best.
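A sketch of this restart strategy (my own example; the non-convex one-parameter error surface below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

error = lambda w: np.sin(3 * w) + 0.1 * w ** 2    # non-convex, several local minima
grad  = lambda w: 3 * np.cos(3 * w) + 0.2 * w     # its gradient

def descend(w, eta=0.01, steps=2000):
    """Plain gradient descent from a given initial model w."""
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Repeat gradient descent from several random initial models and keep the best.
candidates = [descend(w0) for w0 in rng.uniform(-4, 4, size=10)]
best = min(candidates, key=error)
print(f"Best model found: w = {best:.3f}, error = {error(best):.3f}")
```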
27/51
Agenda
Training a model
Summary
28/51
Where is my error surface?
29/51
The training dataset
[Diagram: sampling from the population produces the training dataset]
30/51
The empirical error surface
The empirical error surface is the error surface computed from the training dataset rather than from the whole population.
The empirical and true error surfaces are in general different. Hence, their
optimal models might differ, i.e. the best model for the training dataset
might not be the best for the population.
31/51
The empirical error surface
[Plot: the empirical error surface alongside the true error surface]
32/51
Least squares: minimising the error on a training dataset
Least squares defines an empirical error surface whose gradient can
be obtained exactly. A linear model applied to a training dataset can be
expressed as
ŷ = Xw
where X is the design matrix, ŷ is the predicted label vector and w is the
coefficients vector. The MSE on the training dataset can be written as
E_MSE(w) = (1/N) (y − ŷ)^T (y − ŷ)
         = (1/N) (y − Xw)^T (y − Xw)

where y is the true label vector. The resulting gradient of the MSE is

∇E_MSE(w) = (−2/N) X^T (y − Xw)
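The sketch below (my own toy data, one feature plus a bias column) evaluates these expressions with NumPy and checks that the gradient vanishes at the closed-form least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy training dataset: N samples, design matrix with a bias column and one feature.
N = 50
x = rng.uniform(0, 1, N)
X = np.column_stack([np.ones(N), x])          # design matrix
y = 1.5 + 2.0 * x + rng.normal(0, 0.1, N)     # true labels (with noise)

def mse(w):
    r = y - X @ w
    return (r @ r) / N

def mse_gradient(w):
    return (-2 / N) * X.T @ (y - X @ w)

# Closed-form least-squares solution: the point where the gradient is zero.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

print("w* =", w_star)
print("MSE at w*:", mse(w_star))
print("Gradient at w*:", mse_gradient(w_star))   # approximately [0, 0]
```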
33/51
Exhaustive approaches
In general, we will not have analytical solutions. We can reconstruct the
empirical error surface by evaluating each model on training data. This is
called brute-force or exhaustive search. Simple, but often impractical.
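A brute-force sketch (my own toy setup): evaluate the training MSE of a one-parameter model y ≈ w·x over a grid of candidate values of w and keep the best one.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy training dataset for a one-parameter model y ≈ w * x.
x = rng.uniform(0, 1, 100)
y = 2.0 * x + rng.normal(0, 0.1, 100)

candidates = np.linspace(-5, 5, 1001)                       # grid of candidate models
errors = [np.mean((y - w * x) ** 2) for w in candidates]    # training MSE of each one
best = candidates[np.argmin(errors)]

print(f"Best candidate: w = {best:.2f}")   # close to 2, the value used to generate y
```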
34/51
Data-driven gradient descent
[Diagram: the training dataset, sampled from the population, is split into batches 1, 2 and 3, each producing its own gradient estimate]
The batch size determines how close the estimated gradient is to the true
gradient of the empirical error surface.
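A sketch of data-driven (mini-batch) gradient descent for the least-squares setting above; the batch size, step size and number of epochs are arbitrary choices for illustration, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy training dataset (bias column + one feature).
N = 1000
x = rng.uniform(0, 1, N)
X = np.column_stack([np.ones(N), x])
y = 1.5 + 2.0 * x + rng.normal(0, 0.1, N)

w = np.zeros(2)
eta, batch_size = 0.1, 32

for epoch in range(50):
    order = rng.permutation(N)                          # shuffle the training dataset
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = (-2 / len(idx)) * Xb.T @ (yb - Xb @ w)   # gradient estimated on the batch
        w = w - eta * grad

print("w =", w)   # close to the coefficients used to generate y: (1.5, 2.0)
```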
35/51
Batch size in data-driven gradient descent
[Plot: gradient descent paths on the error surface for different batch sizes]
36/51
Overfitting and fooling ourselves
The empirical and true error surfaces are in general different. When small
datasets and complex models are used, the differences between the two
can be very large, resulting in trained models that work very well on the
training dataset but poorly when deployed.
Never use the same data for testing and training a model. The test
dataset needs to remain inaccessible to avoid using it (inadvertently or
not) during training.
37/51
Regularisation
Regularisation adds a penalty on the size of the coefficients to the empirical error:

E_MSE+R(w) = (1/N) Σ_{i=1}^{N} e_i^2 + λ Σ_{k=1}^{K} w_k^2

For least squares, the regularised solution has the closed form

w = (X^T X + N λ I)^{-1} X^T y
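A NumPy sketch of this regularised solution (my own made-up data; the polynomial design matrix and the value of λ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

# Small, noisy training dataset: a setting where regularisation typically helps.
N = 20
x = rng.uniform(0, 1, N)
X = np.column_stack([x ** k for k in range(8)])   # degree-7 polynomial design matrix
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)

lam = 0.01
K = X.shape[1]

# Regularised least-squares solution: w = (X^T X + N*lam*I)^{-1} X^T y
w_reg = np.linalg.solve(X.T @ X + N * lam * np.eye(K), X.T @ y)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)         # unregularised, for comparison

print("||w|| without regularisation:", np.linalg.norm(w_ols))
print("||w|| with regularisation:   ", np.linalg.norm(w_reg))   # noticeably smaller
```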
38/51
Cost vs quality
Our goal is always to produce a model that achieves the highest quality
during deployment. How we achieve it is a different question.
39/51
Agenda
Training a model
Summary
40/51
Why do we need validation?
41/51
Why do we need validation?
[Diagram: sampling from the population produces the available dataset]
42/51
Validation set approach
The validation set approach is the simplest method. It randomly splits the
available dataset into a training and a validation (or hold-out) dataset.
[Diagram: the dataset is randomly split into a training part and a validation part]
Models are then fitted on the training part, and the validation part is
used to estimate their quality.
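A plain NumPy sketch of the validation set approach (my own synthetic data and 70/30 split, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

# Available dataset (synthetic): one feature, noisy linear labels.
N = 200
x = rng.uniform(0, 1, N)
y = 1.5 + 2.0 * x + rng.normal(0, 0.3, N)

# Random split: 70% training, 30% validation (hold-out).
order = rng.permutation(N)
n_train = int(0.7 * N)
train, val = order[:n_train], order[n_train:]

# Fit a straight line on the training part only.
X_train = np.column_stack([np.ones(len(train)), x[train]])
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y[train])

# Estimate the quality (here, the MSE) on the validation part.
X_val = np.column_stack([np.ones(len(val)), x[val]])
val_mse = np.mean((y[val] - X_val @ w) ** 2)
print(f"Validation MSE: {val_mse:.3f}")
```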
43/51
Leave-one-out cross-validation (LOOCV)
This method also splits the available dataset into training and validation
sets. However, the validation set contains only one sample.
Multiple splits are considered and the final quality is calculated as the
average of the individual qualities (for N samples, we produce N splits
and obtain N different qualities).
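A LOOCV sketch under the same kind of toy setup (my own example, using the MSE as the quality measure): N splits, each validating on a single sample, with the N individual qualities averaged at the end.

```python
import numpy as np

rng = np.random.default_rng(8)

# Small synthetic dataset: one feature, noisy linear labels.
N = 30
x = rng.uniform(0, 1, N)
y = 1.5 + 2.0 * x + rng.normal(0, 0.3, N)
X = np.column_stack([np.ones(N), x])

squared_errors = []
for i in range(N):                      # one split per sample
    mask = np.arange(N) != i            # train on everything except sample i
    w = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
    squared_errors.append((y[i] - X[i] @ w) ** 2)   # validate on sample i alone

print(f"LOOCV estimate of the MSE: {np.mean(squared_errors):.3f}")
```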
44/51
k-fold cross-validation
In this approach the available dataset is divided into k groups (also
known as folds) of approximately equal size:
We carry out k rounds of training followed by validation, each one
using a different fold for validation and the remaining for training.
The final estimation of the quality is the average of the qualities
from each round.
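A k-fold sketch under the same kind of toy setup (my own example, with k = 5 and the MSE as the quality measure); np.array_split produces folds of approximately equal size.

```python
import numpy as np

rng = np.random.default_rng(9)

# Synthetic dataset: one feature, noisy linear labels.
N = 100
x = rng.uniform(0, 1, N)
y = 1.5 + 2.0 * x + rng.normal(0, 0.3, N)
X = np.column_stack([np.ones(N), x])

k = 5
folds = np.array_split(rng.permutation(N), k)    # k folds of roughly equal size

fold_mse = []
for i in range(k):                               # k rounds of training + validation
    val = folds[i]
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    w = np.linalg.solve(X[train].T @ X[train], X[train].T @ y[train])
    fold_mse.append(np.mean((y[val] - X[val] @ w) ** 2))

print(f"{k}-fold estimate of the MSE: {np.mean(fold_mse):.3f}")
```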
45/51
Validation approaches: Comparison
46/51
Agenda
Training a model
Summary
47/51
Machine learning methodology: Basic tasks
48/51
The role of data
49/51
Using data correctly
50/51
Disappointed you didn’t decipher Linear B?
51/51