
Machine Learning 1, Lecture 09: Regularisation

Tom S. F. Haines
[email protected]

1 / 42
Underfitting & overfitting

• Underfitting: logistic regression
• Balanced: tuned random forest (scikit-learn, min_impurity_decrease=0.008, n_estimators=512)
• Overfitting: badly tuned random forest (scikit-learn, default parameters)
2 / 42
Regularisation

• This lecture is about regularisation...
  ...which avoids overfitting (among other things)

• Effectively “extra information”

3 / 42
Extra information
• 1D regression, 4 points
• Linear solution obvious – to us!
• Assume a general model – anything goes
  • e.g. a sine curve
• Perfect match at known points...
  • Identical cost to a straight line!
• Regularisation emphasises the simpler model (the straight line)
• Common sense for models!

4 / 42
Occam’s razor

The simplest explanation is usually the correct one

• Traceable to Aristotle (384–322 BC)
• Ockham’s version: “Plurality must never be posited without necessity”
  (translated from Latin; William of Ockham was a 13th/14th-century Franciscan friar)

• Overfitting: an unjustifiably complex explanation

5 / 42
Why regularise?

• Overfitting
• Ill posed problem

• Auxiliary data
• Human understanding
• Easier optimisation

Often several at once

6 / 42
Reason: Overfitting

• Already seen. . .
• Overfitting = Fitting to noise

• Another example (SVM, rbf kernel – will be taught later):

[Figure: four SVM decision boundaries with C = 1 and γ = 0.01, 0.1, 1.0, 10.0; one panel is labelled “about right”. A scikit-learn sketch of the same sweep follows below.]

7 / 42
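A minimal scikit-learn sketch of the γ sweep above (the make_moons data and train/test split are stand-ins, not the lecture's actual setup):

```python
# Sweep the RBF kernel width gamma of an SVM on a toy dataset to see
# under- and overfitting: small gamma underfits, large gamma overfits.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in [0.01, 0.1, 1.0, 10.0]:
    model = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X_train, y_train)
    print(f"gamma={gamma}: train={model.score(X_train, y_train):.2f}, "
          f"test={model.score(X_test, y_test):.2f}")
```

A large gap between the train and test scores is the overfitting signature.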
Reason: Ill posed
• Ill posed: Multiple equally good solutions (line from earlier)
• e.g. order irrelevant: bricks when making a wall, nodes in a neural network

• Regularisation: Force selection, even if arbitrary


• Without: Optimisation can “drift” between solutions:

[Figure: a sequence of four frames; colours represent different solutions, chasing each other around an image]

• Drifting = never converges, all solutions bad

8 / 42
Reason: Auxiliary data

• Regularisation may reflect extra information

• e.g. measured noise from a sensor (a separate experiment)
  → gives the correct amount of regularisation to apply to the signal

9 / 42
Reason: Human understanding
• Goal: Learn y = fθ (x)
• Alternatively: Learn y = fθ (z) and z = fη (x)
Subject to z being useful in some way, i.e. human interpretable

• Attribute learning:
• z = fη (x) encodes: Has tail, black & white, four legs etc.
• y = fθ (z) encodes: Is zebra, is horse, is penguin

• Regularisation towards simpler model as judged by a human

• Notes:
• “Sharing statistical strength”:
Recognising black & white objects =⇒ images of penguins improve zebra recognition
• Window into black box (attribute learning can also be uninterpretable)
• Zero shot learning – recognise an unseen animal from a description

10 / 42
Reason: Easier optimisation
• “Drifting” between solutions already an example
• Regularisation: Smooths cost function → fewer local minima
(also removes stationary points, accelerating convergence)

• Find the minimum of the red function
  (starting at the black vertical line; the global optimum is at x = 1)
• Stuck at x = 3 (using BFGS)

• Blue: with L1 regularisation (pushing the answer towards x = 0)
• Finds a better optimum (which happens to be global)
  (a toy sketch of this effect follows below)

11 / 42
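A toy sketch of the effect described above. The quartic below is a made-up stand-in for the lecture's red function (its two basins happen to have equal depth, unlike the lecture's), but it shows the same escape behaviour: plain BFGS started at x = 4 gets stuck in the basin at x = 3, while adding an L1 penalty removes that basin so the optimiser reaches the one near x = 1.

```python
import numpy as np
from scipy.optimize import minimize

def cost(x):
    # Quartic with minima at x = 1 and x = 3, and a barrier at x = 2.
    return (x[0] - 1.0) ** 2 * (x[0] - 3.0) ** 2

def regularised(x, lam=2.0):
    # L1 penalty pulls the answer towards 0 and flattens the basin at x = 3.
    return cost(x) + lam * np.abs(x[0])

print(minimize(cost, x0=[4.0], method='BFGS').x)         # stuck near x = 3
print(minimize(regularised, x0=[4.0], method='BFGS').x)  # reaches ~0.8, in the left basin
```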
Aside: Model limits

• Model limits ≈ regularisation
  e.g. logistic regression only does straight lines

• But rarely the “right amount”, and there is no hyperparameter to tune

• Good for invariants/equivariants
  e.g. a convolutional neural network is invariant to translation

12 / 42
Aside: Early stopping
• Model starts simple. . .
. . . gets more complicated as optimisation runs. . .
. . . until overfitting

• ∴ stop early!

• Bad hack: avoid it =⇒ the regularisation is too weak, so strengthen it!

• May still be the least terrible solution (common for neural networks; a sketch follows below)

13 / 42
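A minimal sketch of early stopping as it is commonly wired up in practice; scikit-learn's MLPClassifier is used here purely as an illustration (it holds out a validation fraction and stops when the validation score plateaus):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(128, 128),
                      early_stopping=True,       # stop on validation plateau
                      validation_fraction=0.1,   # held-out fraction
                      n_iter_no_change=10,       # patience
                      max_iter=500,
                      random_state=0)
model.fit(X, y)
print("stopped after", model.n_iter_, "iterations (of a possible 500)")
```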
Aside: Quantity I

• More data =⇒ less regularisation required
• Infinite data =⇒ no regularisation!
  (simple lookup – nearest neighbour)

• Hyperparameters control regularisation strength
• More/less data → different hyperparameter values
  (can still tune with a subset of the training data, then fine tune with all of it)

• Models often have a sweet spot:
  • Not enough data → fail
  • Too much data → stop improving (underfit)
14 / 42
Aside: Quantity II

[Figure: eight panels showing how the learned model and tuned γ change with training set size]
  exemplars = 16,   γ = 0.5,  accuracy = 83.5%
  exemplars = 32,   γ = 0.5,  accuracy = 83.4%
  exemplars = 64,   γ = 0.5,  accuracy = 87.5%
  exemplars = 128,  γ = 0.25, accuracy = 86.5%
  exemplars = 256,  γ = 0.25, accuracy = 87.2%
  exemplars = 512,  γ = 0.1,  accuracy = 89.1%
  exemplars = 1024, γ = 0.1,  accuracy = 89.3%
  exemplars = 2048, γ = 0.25, accuracy = 89.3%
15 / 42
Model kinds I

• Discussed why
• How depends on model kind. . .

• Non-probabilistic
• Arbitrary loss functions

• Probabilistic
• Maximum likelihood (ML) (no regularisation)
• Maximum a posteriori (MAP)
• Bayesian

16 / 42
Non-probabilistic

• Model fitting: Minimises loss function L(θ), e.g. L2:


$$L(\theta) = \sum_{i=1}^{n} \left( y_i - f_\theta(x_i) \right)^2$$

• Regularise θ, e.g. encourage small parameters:

$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f_\theta(x_i) \right)^2 + \lambda \sum_{j=1}^{k} \theta_j^2$$

• λ = regularisation strength (a hyperparameter); a numpy sketch of this loss follows below

17 / 42
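A minimal numpy sketch of the regularised loss above, for a linear model $f_\theta(x) = \theta^T x$ on made-up data:

```python
import numpy as np

def regularised_loss(theta, X, y, lam):
    residuals = y - X @ theta              # y_i - f_theta(x_i)
    data_term = np.mean(residuals ** 2)    # (1/n) * sum of squared errors
    penalty = lam * np.sum(theta ** 2)     # encourages small parameters
    return data_term + penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
theta_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ theta_true + rng.normal(scale=0.1, size=100)

print(regularised_loss(theta_true, X, y, lam=0.1))
```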
Ridge regression I
Also called Tikhonov regression

Define $f_\theta(x_i)$ as:

$$y_i = a x_i + b, \qquad \theta = [a, b]$$

(linear regression)

Loss function:

$$L(\theta) = \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2 + \lambda (a^2 + b^2)$$

18 / 42
Ridge regression II
Also called Tikhonov regression
• Estimate diamond price given 9 features (carat, cut, colour, multiple for size)
• Linear model: Train RMSE = 1420; Test RMSE = 1831 (overfit)
• Sweep: (x-axis = λ, y-axis = RMSE, blue = validation, red = train); a scikit-learn sketch of such a sweep follows below

• Best (black line): Train RMSE = 1443; Validation RMSE = 1442 (test is now validation)
19 / 42
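A hedged scikit-learn sketch of such a sweep. Ridge's alpha plays the role of λ, and synthetic data from make_regression stands in for the diamonds features used in the lecture:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=9, noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
    print(f"lambda={alpha}: validation RMSE = {rmse:.1f}")
```

In practice the best λ is read off the validation curve, as on the slide, and the test set is only touched once at the end.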
Lasso, ridge and elastic net
• Ridge regression: (L2 norm, without the square root)

$$L(\theta) = \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2 + \lambda (a^2 + b^2)$$

• Lasso regression: (L1 norm)

$$L(\theta) = \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2 + \lambda (|a| + |b|)$$

• Elastic net regression: (blend of lasso and ridge)

$$L(\theta) = \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2 + \lambda \left( \gamma (|a| + |b|) + (1 - \gamma)(a^2 + b^2) \right)$$

(two hyperparameters: λ and γ; a scikit-learn sketch of all three follows below)


20 / 42
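A sketch of the three regularisers in scikit-learn. Note that sklearn names λ "alpha" and γ "l1_ratio", and scales the penalty terms slightly differently from the formulas above; the data here is synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = make_regression(n_samples=200, n_features=9, n_informative=4,
                       noise=10.0, random_state=0)

for name, model in [("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=1.0)),
                    ("elastic net", ElasticNet(alpha=1.0, l1_ratio=0.5))]:
    model.fit(X, y)
    # Lasso and elastic net tend to drive some coefficients exactly to zero.
    print(name, model.coef_.round(2))
```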
Lasso regression
• Sweep: (x-axis = λ, y-axis = RMSE, blue = validation, red = train)

• Best (black line): Train RMSE = 1440; Validation RMSE = 1433

21 / 42
Elastic net regression
• γ = 0.5 (good default if not sweeping)
• Sweep: (x-axis = λ, y-axis = RMSE, blue = validation, red = train)

• Best (black line): Train RMSE = 1443; Validation RMSE = 1442


22 / 42
Robustness

• Results (RMSE):
• Linear: 1831
• Ridge: 1442
• Lasso: 1433
• Elastic: 1442

• Little difference!

• More complicated problems: L1 (lasso) often better than L2 (ridge)
  (L1 is better at ignoring irrelevant features)
• Elastic net lets hyperparameter optimisation decide

• There are more complex regularisers, e.g. robust statistics

23 / 42
Probabilistic: Maximum likelihood

• Find model parameters that maximise data probability

$$\operatorname*{argmax}_{\theta} \; P(\text{data} \mid \theta)$$

(model must be probabilistic)

• No regularisation!
• Need lots of data

24 / 42
Linear regression: Maximum likelihood I
For each exemplar:

$$y_i = a x_i + b + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2)$$

$N(\text{mean}, \text{standard deviation}^2)$ is the Normal distribution
(the simplest modification of linear regression to make it probabilistic)

Exemplar probability:

$$P(y_i \mid x_i, a, b, \sigma) \propto \frac{1}{\sigma} \exp\left( \frac{-(a x_i + b - y_i)^2}{2 \sigma^2} \right)$$

Maximum likelihood solution:

$$[a, b]^T = (X^T X)^{-1} X^T y$$

where

$$X = [[x_1, 1], [x_2, 1], \ldots, [x_n, 1]], \qquad y = [y_1, y_2, \ldots, y_n]^T$$

Given the above we know each $\epsilon_i$, so $\sigma^2$ is the variance of $\epsilon$
(a numpy sketch of this solution follows below)

25 / 42
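A numpy sketch of the maximum likelihood solution above, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true, sigma_true = 2.0, -1.0, 0.5
x = rng.uniform(0.0, 5.0, size=50)
y = a_true * x + b_true + rng.normal(scale=sigma_true, size=50)

X = np.column_stack([x, np.ones_like(x)])   # rows are [x_i, 1]
a, b = np.linalg.solve(X.T @ X, X.T @ y)    # (X^T X)^{-1} X^T y
sigma2 = np.var(y - (a * x + b))            # variance of the residuals
print(a, b, sigma2)
```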
Linear regression: Maximum likelihood II

green = ground truth, orange = estimate

26 / 42
Probabilistic: Maximum a posteriori

• Introduce prior over model parameters (a, b, σ)


• prior: parameter ∼ probability distribution

• Model complete — can generate predictions without data!


• Data quantity irrelevant

• Find maximum likelihood solution (again), including prior

27 / 42
Linear regression: MAP I

For each exemplar:

$$y_i = a x_i + b + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2)$$

but add priors (one choice among many):

$$a, b \sim N(\mu_0, \Sigma_0), \qquad \sigma^2 \sim \text{Inv-Gamma}(\alpha_0, \beta_0)$$

where $\mu_0$, $\Sigma_0$, $\alpha_0$ and $\beta_0$ are hyperparameters

Answer:

$$[a, b]^T = (X^T X + \Sigma_0^{-1})^{-1} (\Sigma_0^{-1} \mu_0 + X^T y)$$

with the same definitions of $X$ and $y$ as before

(ignoring $\sigma$, as it is complicated; a numpy sketch follows below)

28 / 42
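A numpy sketch of the MAP estimate above; mu0 and Sigma0 are made-up hyperparameters, and σ is ignored as on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=10)                    # deliberately few points
y = 2.0 * x - 1.0 + rng.normal(scale=0.5, size=10)

X = np.column_stack([x, np.ones_like(x)])
mu0 = np.zeros(2)                                     # prior mean for [a, b]
Sigma0 = np.eye(2) * 10.0                             # broad prior covariance

precision = X.T @ X + np.linalg.inv(Sigma0)
ab_map = np.linalg.solve(precision, np.linalg.inv(Sigma0) @ mu0 + X.T @ y)
print(ab_map)   # MAP estimate of [a, b], pulled slightly towards mu0
```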
Linear regression: MAP II

green = ground truth, orange = estimate

29 / 42
Probabilistic: Bayesian

• Same model as MAP (priors)


• Instead of maximum likelihood solution find posterior distribution

$$P(\text{model parameters} \mid \text{data}) = \frac{P(\text{data} \mid \text{model parameters}) \, P(\text{model parameters})}{P(\text{data})}$$

• Benefits of MAP
• Plus a distribution over models — it knows how certain it is!
• Occam’s razor built in

30 / 42
Linear regression: Bayesian I

Same formulation as MAP


Answer:
$$[a, b]^T \sim N(\mu_n, \Sigma_n)$$

$$\mu_n = (X^T X + \Sigma_0^{-1})^{-1} (\Sigma_0^{-1} \mu_0 + X^T y)$$

$$\Sigma_n = \sigma^2 (X^T X + \Sigma_0^{-1})^{-1}$$

Note: dependent on $\sigma$, which has not been given
(a numpy sketch follows below)

31 / 42
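A numpy sketch of the posterior above: instead of a single [a, b] we get a Gaussian over it, and can draw sample lines. σ is assumed known here, and the prior values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.uniform(0.0, 5.0, size=10)
y = 2.0 * x - 1.0 + rng.normal(scale=sigma, size=10)

X = np.column_stack([x, np.ones_like(x)])
mu0, Sigma0 = np.zeros(2), np.eye(2) * 10.0

precision = X.T @ X + np.linalg.inv(Sigma0)
mu_n = np.linalg.solve(precision, np.linalg.inv(Sigma0) @ mu0 + X.T @ y)
Sigma_n = sigma ** 2 * np.linalg.inv(precision)

draws = rng.multivariate_normal(mu_n, Sigma_n, size=5)  # five plausible [a, b]
print(draws)
```

The spread of the draws is the model's own statement of how certain it is.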
Linear regression: Bayesian II

green = ground truth, orange = draws from estimate

32 / 42
Comparison

[Figure panels: ML | MAP | Bayesian]

• Given infinite data → identical answers (assuming sane prior)


• With not enough data:
• Maximum likelihood fails
• Maximum a posteriori gives a solution
• Bayesian gives a solution and tells you how confident it is
33 / 42
Should all models be Bayesian?

• In an ideal world, yes!

• But. . .
• Harder to code and optimise
• Slower
• Good prior problem. . .

34 / 42
Priors

• Regularisation — bias towards preferred (simple) solutions

• Indicates likely vs unlikely model parameters


• Assumption that the model is sensible — you can reason about its parameters
• e.g. a chaotic simulation would be almost impossible to set a prior for

35 / 42
Prior: Types

• Uninformative
• Improper
• Minimum description length

• Extra knowledge
• Data driven (dodgy)
• Human belief

36 / 42
Prior: Conjugate

• Prior with analytic solution


• Gaussian and inverse Gamma for linear regression → analytic

• Problem: Conjugate priors are simple, and often a bad match to the data

• Bayesian methods often underperform due to using overly simple priors

37 / 42
Model kinds II
x – Data, y – Label

Discriminative
• Learns P(y|x)
• Used directly
• Learns the boundary between data
  (no requirement to be probabilistic)
• Can only discriminate between classes

Generative
• Learns P(y, x)
• Apply Bayes rule: P(y|x) = P(y, x) / P(x)
  • Often actually P(x|y) and P(y)
• Learns the distribution of the data
  (must be probabilistic)
• Can also generate data
• Handles missing data
• Less vulnerable to overfitting
• Knows when it is unreliable

(a small sketch contrasting the two follows below)
38 / 42
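A small scikit-learn sketch contrasting the two kinds (the dataset and models are illustrative choices, not from the lecture): logistic regression learns P(y|x) directly, while Gaussian naive Bayes learns P(x|y) and P(y) and applies Bayes' rule:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=4, random_state=0)

disc = LogisticRegression().fit(X, y)   # discriminative: models P(y|x)
gen = GaussianNB().fit(X, y)            # generative: models P(x|y) and P(y)

print("discriminative P(y|x):", disc.predict_proba(X[:1]))
print("generative     P(y|x):", gen.predict_proba(X[:1]))
```

Because the generative model keeps per-class distributions over x, it could also be used to generate plausible feature vectors for a class, which the discriminative model cannot do.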
Should all models be generative?

• Yes! In an ideal world. . .


• But often. . .
• Harder to code and optimise
• Slower
• Discriminative approaches “win”. . .

39 / 42
Discriminative vs generative

• If winning means highest accuracy: They keep switching places

• Currently, discriminative is winning. . .


. . . but can already see generative successors
(GANs, Auto-encoders)

40 / 42
Summary

• Regularisation embodies common sense — use it!

• Models can be probabilistic or not


• Probabilistic models have three main approaches (others exist)
• Models can be discriminative or generative

• Generative Bayesian models are the (often unobtainable) gold standard

41 / 42
Further reading

• Chapter 28 of “Information Theory, Inference, and Learning Algorithms” by MacKay.


• Maths for linear regression variants:
  https://en.wikipedia.org/wiki/Bayesian_linear_regression

42 / 42
