L09: Regularisation
Tom S. F. Haines
[email protected]
Underfitting & overfitting
Extra information
• 1D regression, 4 points
• Linear solution obvious – to us!
Occam’s razor
Why regularise?
• Overfitting
• Ill-posed problem
• Auxiliary data
• Human understanding
• Easier optimisation
Reason: Overfitting
• Already seen...
• Overfitting = fitting to noise
Reason: Ill-posed
• Ill-posed: Multiple equally good solutions (line from earlier)
• e.g. order irrelevant: bricks when making a wall, nodes in a neural network
Reason: Auxiliary data
Reason: Human understanding
• Goal: Learn y = fθ(x)
• Alternatively: Learn y = fθ(z) and z = fη(x),
  subject to z being useful in some way, i.e. human interpretable
• Attribute learning:
  • z = fη(x) encodes: has tail, black & white, four legs, etc.
  • y = fθ(z) encodes: is zebra, is horse, is penguin
• Notes:
  • “Sharing statistical strength”:
    recognising black & white objects ⇒ images of penguins improve zebra recognition
  • Window into the black box (attribute learning can also be uninterpretable)
  • Zero-shot learning – recognise an unseen animal from a description
Reason: Easier optimisation
• “Drifting” between solutions already an example
• Regularisation: Smooths cost function → fewer local minima
  (also removes stationary points, accelerating convergence)
• Blue: L1 regularisation
  (pushing answer towards x = 0)
• Finds better optima
  (happens to be global)
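Below is a minimal sketch (not from the slides) of the mechanics being described: take some non-convex 1D cost, add an L1 penalty λ|x|, and optimise with plain (sub)gradient descent. The cost f, the step size and λ are made up for illustration.

```python
import numpy as np

def f(x):                       # some non-convex 1D cost (made up for illustration)
    return np.sin(5.0 * x) + 0.1 * x ** 2

def grad_f(x):                  # its derivative
    return 5.0 * np.cos(5.0 * x) + 0.2 * x

def minimise(grad, x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

lam = 0.5                       # regularisation strength
plain = minimise(grad_f, x0=2.0)
# L1-regularised cost f(x) + lam*|x|; sign(x) is a subgradient of |x|
reg = minimise(lambda x: grad_f(x) + lam * np.sign(x), x0=2.0)
print(f"unregularised minimum near x = {plain:.3f}")
print(f"L1-regularised minimum near x = {reg:.3f} (pulled towards 0)")
```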
Aside: Model limits
Aside: Early stopping
• Model starts simple...
  ...gets more complicated as optimisation runs...
  ...until overfitting
• ∴ stop early!
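A minimal sketch (not from the slides) of early stopping with a patience counter; train_one_epoch and validation_loss are hypothetical placeholders for whatever training step and held-out evaluation the model uses.

```python
# Keep the parameters that gave the best validation loss, and stop once it
# has not improved for `patience` epochs.
import copy

def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            max_epochs=1000, patience=10):
    best_loss = float('inf')
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)              # one pass of optimisation
        loss = validation_loss(model)       # held-out data, never trained on

        if loss < best_loss:
            best_loss = loss
            best_model = copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                       # validation loss stopped improving

    return best_model
```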
Aside: Quantity I
Aside: Quantity II
exemplars    γ      accuracy
16           0.5    83.5%
32           0.5    83.4%
64           0.5    87.5%
128          0.25   86.5%
256          0.25   87.2%
512          0.1    89.1%
1024         0.1    89.3%
2048         0.25   89.3%
Model kinds I
• Discussed why
• How depends on model kind...
• Non-probabilistic
  • Arbitrary loss functions
• Probabilistic
  • Maximum likelihood (ML) (no regularisation)
  • Maximum a posteriori (MAP)
  • Bayesian
Non-probabilistic
L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f_\theta(x_i) \right)^2 + \lambda \sum_{j=1}^{k} \theta_j^2
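As a concrete illustration, a minimal numpy sketch of this loss for a generic parametric model; the model f, the data and λ below are placeholders.

```python
import numpy as np

def regularised_loss(theta, x, y, f, lam):
    """Mean squared error plus an L2 penalty on the parameters.

    theta : parameter vector, x/y : training data, f : model f(theta, x),
    lam : regularisation strength (all placeholders for illustration).
    """
    data_term = np.mean((y - f(theta, x)) ** 2)
    penalty = lam * np.sum(theta ** 2)
    return data_term + penalty

# Example with a linear model f_theta(x) = theta[0]*x + theta[1]:
f = lambda theta, x: theta[0] * x + theta[1]
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
print(regularised_loss(np.array([1.0, 0.0]), x, y, f, lam=0.1))
```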
Ridge regression I
Also called Tikhonov regularisation (linear regression)
Loss function:
L(\theta) = \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2 + \lambda (a^2 + b^2)
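A minimal numpy sketch (not from the slides) of minimising this loss in closed form via (XᵀX + λI)⁻¹Xᵀy; note it penalises the intercept b as well, matching the formula above.

```python
import numpy as np

def ridge_fit(x, y, lam):
    """Minimise sum_i (y_i - (a*x_i + b))^2 + lam*(a^2 + b^2) in closed form."""
    X = np.column_stack([x, np.ones_like(x)])       # design matrix [x_i, 1]
    A = X.T @ X + lam * np.eye(2)                   # (X^T X + lam I)
    a, b = np.linalg.solve(A, X.T @ y)              # solve rather than invert
    return a, b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.1, 1.9, 3.1])
print(ridge_fit(x, y, lam=0.0))   # ordinary least squares
print(ridge_fit(x, y, lam=1.0))   # coefficients shrunk towards zero
```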
Ridge regression II
Also called Tikhonov regularisation
• Estimate diamond price given 9 features (carat, cut, colour, multiple for size)
• Linear model: Train RMSE = 1420; Test RMSE = 1831 (overfit)
• Sweep: (x-axis = λ, y-axis = RMSE, blue = validation, red = train)
• Best (black line): Train RMSE = 1443; Validation RMSE = 1442 (test is now validation)
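A minimal sketch of the kind of sweep plotted here, using sklearn's Ridge; the train/validation arrays are placeholders standing in for the diamond data split.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def sweep_lambda(X_train, y_train, X_val, y_val,
                 lambdas=np.logspace(-3, 3, 25)):
    """Fit a ridge model per lambda; keep the one with lowest validation RMSE."""
    best = None
    for lam in lambdas:
        model = Ridge(alpha=lam).fit(X_train, y_train)
        rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
        if best is None or rmse < best[1]:
            best = (lam, rmse, model)
    return best   # (best lambda, validation RMSE, fitted model)
```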
Lasso, ridge and elastic net
• Ridge regression: (L2 norm, without square root)
  L(\theta) = \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2 + \lambda (a^2 + b^2)
• Lasso regression: (L1 norm)
  L(\theta) = \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2 + \lambda (|a| + |b|)
• Elastic net: a weighted combination of the two penalties, mixing weight γ
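A minimal sketch (not from the slides) of fitting all three with sklearn on placeholder data; sklearn's l1_ratio is its mixing weight between the L1 and L2 penalties, playing a role analogous to the slides' γ (the exact convention may differ).

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Placeholder data standing in for the diamond features/prices on the slides.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = X @ rng.normal(size=9) + rng.normal(scale=0.5, size=200)
X_train, X_val, y_train, y_val = X[:150], X[150:], y[:150], y[150:]

models = {
    'ridge':   Ridge(alpha=1.0),
    'lasso':   Lasso(alpha=0.1),
    'elastic': ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(np.mean((model.predict(X_val) - y_val) ** 2))
    print(f'{name}: validation RMSE = {rmse:.3f}')
```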
Elastic net regression
• γ = 0.5 (good default if not sweeping)
• Sweep: (x-axis = λ, y-axis = RMSE, blue = validation, red = train)
• Results (RMSE):
  • Linear: 1831
  • Ridge: 1442
  • Lasso: 1433
  • Elastic: 1442
• Little difference!
Robustness
Probabilistic: Maximum likelihood
\operatorname{argmax}_{\theta} \; P(\text{data} \mid \theta)
• No regularisation!
• Need lots of data
Linear regression: Maximum likelihood I
For each exemplar:
y_i = a x_i + b + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2)
N(mean, standard deviation²) is the Normal distribution
(simplest modification of linear regression to be probabilistic)
Exemplar probability:
P(y_i \mid x_i, a, b, \sigma) \propto \frac{1}{\sigma} \exp\!\left( \frac{-(a x_i + b - y_i)^2}{2 \sigma^2} \right)
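A minimal numpy sketch (not from the slides): for this Gaussian noise model the maximum-likelihood a, b coincide with ordinary least squares, and the maximum-likelihood σ² is the mean squared residual.

```python
import numpy as np

def ml_linear_regression(x, y):
    """Maximum-likelihood fit of y = a*x + b + eps, eps ~ N(0, sigma^2).

    For Gaussian noise the ML (a, b) is the least-squares solution and
    the ML sigma^2 is the mean of the squared residuals.
    """
    X = np.column_stack([x, np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - (a * x + b)) ** 2)
    return a, b, sigma2

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.2, 0.9, 2.1, 2.8])
print(ml_linear_regression(x, y))
```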
Linear regression: Maximum likelihood II
Probabilistic: Maximum a posteriori
Linear regression: MAP I
a, b \sim N(\mu_0, \Sigma_0), \qquad \sigma^2 \sim \text{Inv-Gamma}(\alpha_0, \beta_0)
Answer:
[a, b]^T = (X^T X + \Sigma_0^{-1})^{-1} (\Sigma_0^{-1} \mu_0 + X^T y)
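A minimal numpy sketch of the MAP formula above; the prior parameters here are purely illustrative.

```python
import numpy as np

def map_linear_regression(x, y, mu0, Sigma0):
    """MAP estimate of [a, b] under a Gaussian prior N(mu0, Sigma0),
    following [a, b]^T = (X^T X + Sigma0^{-1})^{-1} (Sigma0^{-1} mu0 + X^T y)."""
    X = np.column_stack([x, np.ones_like(x)])
    Sigma0_inv = np.linalg.inv(Sigma0)
    A = X.T @ X + Sigma0_inv
    rhs = Sigma0_inv @ mu0 + X.T @ y
    return np.linalg.solve(A, rhs)           # [a, b]

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.2, 0.9, 2.1, 2.8])
mu0 = np.zeros(2)                            # prior mean: a = b = 0
Sigma0 = np.eye(2)                           # prior covariance (illustrative)
print(map_linear_regression(x, y, mu0, Sigma0))
```

With μ0 = 0 and Σ0⁻¹ = λI the expression reduces to the ridge solution (XᵀX + λI)⁻¹Xᵀy, which is why MAP with a Gaussian prior and L2 regularisation coincide.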
Linear regression: MAP II
Probabilistic: Bayesian
• Benefits of MAP
• Plus a distribution over models – it knows how certain it is!
• Occam’s razor built in
Linear regression: Bayesian I
\Sigma_n = \sigma^2 (X^T X + \Sigma_0^{-1})^{-1}
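A minimal numpy sketch that computes the posterior covariance above, together with the posterior mean, which under this parameterisation is the same expression as the MAP estimate on the previous slide; σ² and the prior are assumed known and illustrative.

```python
import numpy as np

def bayesian_linear_regression(x, y, mu0, Sigma0, sigma2):
    """Gaussian posterior over [a, b] for y = a*x + b + noise.

    Covariance follows the slide: Sigma_n = sigma^2 (X^T X + Sigma0^{-1})^{-1};
    the posterior mean is the MAP expression from the previous slide.
    """
    X = np.column_stack([x, np.ones_like(x)])
    Sigma0_inv = np.linalg.inv(Sigma0)
    A = X.T @ X + Sigma0_inv
    Sigma_n = sigma2 * np.linalg.inv(A)
    mu_n = np.linalg.solve(A, Sigma0_inv @ mu0 + X.T @ y)
    return mu_n, Sigma_n                     # posterior N(mu_n, Sigma_n)
```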
Linear regression: Bayesian II
Comparison
(figure panels: ML, MAP, Bayesian)
Should all models be Bayesian?
• But...
  • Harder to code and optimise
  • Slower
  • Good prior problem...
Priors
Prior: Types
• Uninformative
• Improper
• Minimum description length
• Extra knowledge
• Data driven (dodgy)
• Human belief
Prior: Conjugate
Model kinds II
x – Data, y – Label
Discriminative: model P(y | x) directly. Generative: model P(x | y) and P(y), i.e. the data itself.
Should all models be generative?
Discriminative vs generative
Summary
Further reading