This document summarizes a lecture on linear models for machine learning. It discusses polynomial curve fitting to sample data points using linear functions of varying degrees. Higher-degree polynomials can overfit the data, while regularization can help reduce overfitting by penalizing large coefficients. The document also introduces linear basis function models using different basis functions like polynomials, Gaussians, and sigmoids. Maximum likelihood estimation is discussed as a way to fit the linear model parameters to data.


Advanced Machine Learning

Lecture 2: Linear models


Sandjai Bhulai
Vrije Universiteit Amsterdam

[email protected]
8 September 2023
Linear models

Advanced Machine Learning


Polynomial curve fitting
▪ 10 points sampled from sin(2πx) + disturbance

sin x = x − x^3/3! + x^5/5! − x^7/7! + x^9/9! − ⋯
Polynomial curve fitting
▪ Polynomial curve
y(x, w) = w_0 + w_1 x + w_2 x^2 + ⋯ + w_M x^M = Σ_{j=0}^{M} w_j x^j

▪ Performance is measured by

E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}^2
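As an illustration (not part of the original slides), a minimal NumPy sketch of this setup: it samples noisy points from sin(2πx), fits a polynomial of a chosen order by minimizing E(w), and prints the resulting error. The helper name design_matrix and the noise level are assumed choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten points sampled from sin(2*pi*x) plus a Gaussian disturbance
# (the noise level 0.3 is an assumed value, not from the slides)
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

def design_matrix(x, M):
    """Columns x^0, x^1, ..., x^M of the polynomial model."""
    return np.vander(x, M + 1, increasing=True)

M = 3
Phi = design_matrix(x, M)

# Least-squares fit: this minimizes E(w) = 1/2 * sum_n {y(x_n, w) - t_n}^2
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
E = 0.5 * np.sum((Phi @ w - t) ** 2)
print("w =", w, " E(w) =", E)
```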

Polynomial curve fitting: order 0

y(x, w) = w_0 + w_1 x + w_2 x^2 + ⋯ + w_M x^M = Σ_{j=0}^{M} w_j x^j
Polynomial curve fitting: order 1

y(x, w) = w_0 + w_1 x + w_2 x^2 + ⋯ + w_M x^M = Σ_{j=0}^{M} w_j x^j
Polynomial curve fitting: order 3

y(x, w) = w_0 + w_1 x + w_2 x^2 + ⋯ + w_M x^M = Σ_{j=0}^{M} w_j x^j
Polynomial curve fitting: order 9

y(x, w) = w_0 + w_1 x + w_2 x^2 + ⋯ + w_M x^M = Σ_{j=0}^{M} w_j x^j
Overfitting
▪ Root mean square (RMS) error: E_RMS = √(2E(w*)/N)
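The overfitting behaviour can be reproduced with a small sketch (the noise level and the 100-point held-out test set are assumptions): it computes E_RMS on training and test data for several orders M, and the test error typically rises again at M = 9.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n, noise=0.3):
    x = np.linspace(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=noise, size=n)

def rms(Phi, w, t):
    # E_RMS = sqrt(2 E(w*) / N) with E the sum-of-squares error
    return np.sqrt(np.sum((Phi @ w - t) ** 2) / len(t))

x_tr, t_tr = sample(10)    # training set
x_te, t_te = sample(100)   # held-out test set

for M in (0, 1, 3, 9):
    Phi_tr = np.vander(x_tr, M + 1, increasing=True)
    Phi_te = np.vander(x_te, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_tr, t_tr, rcond=None)
    print(f"M={M}: train E_RMS={rms(Phi_tr, w, t_tr):.3f}, test E_RMS={rms(Phi_te, w, t_te):.3f}")
```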

Overfitting

Effect of dataset size
▪ Polynomial of order 9 and N = 15

Effect of dataset size
▪ Polynomial of order 9 and N = 100

Regularization
▪ Penalize large coefficient values

Ẽ(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}^2 + (λ/2) ∥w∥^2

▪ λ becomes an additional model parameter that controls the strength of the penalty

Regularization
▪ Regularization with ln λ = − 18

Regularization
▪ Regularization with ln λ = 0

Regularization
▪ E_RMS versus ln λ

Regularization

A deeper analysis

Advanced Machine Learning


What is the issue?

sin x = x − x^3/3! + x^5/5! − x^7/7! + x^9/9! − ⋯
Linear basis function models
▪ General model is

y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w⊤φ(x)

▪ The φ_j are known as basis functions


▪ Typically, φ_0(x) = 1 so that w_0 acts as a bias term

Linear basis function models
▪ General model is

y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w⊤φ(x)

▪ Polynomial basis functions:


φ_j(x) = x^j

▪ These are global functions

Linear basis function models
▪ General model is

y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w⊤φ(x)

▪ Gaussian basis functions:


φ_j(x) = exp{ −(x − μ_j)^2 / (2s^2) }

▪ These are local functions

> μj controls location


> s controls scale
Linear basis function models
▪ General model is

y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w⊤φ(x)

▪ Sigmoidal basis functions:

φ_j(x) = σ( (x − μ_j) / s )
where
σ(a) = 1 / (1 + exp(−a))
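A brief sketch (not from the slides) of how such basis functions could be evaluated in NumPy; the centre locations, the scale s, and the inclusion of a bias column φ_0(x) = 1 are assumed choices for illustration.

```python
import numpy as np

def gaussian_basis(x, mus, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per centre mu_j."""
    return np.exp(-((x[:, None] - mus[None, :]) ** 2) / (2 * s ** 2))

def sigmoid_basis(x, mus, s):
    """phi_j(x) = sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - mus[None, :]) / s))

x = np.linspace(0.0, 1.0, 10)
mus = np.linspace(0.0, 1.0, 9)            # assumed centre locations
Phi = np.column_stack([np.ones_like(x),   # phi_0(x) = 1, so w_0 acts as a bias
                       gaussian_basis(x, mus, s=0.1)])
print(Phi.shape)                          # (10, 10): one row per input, one column per basis function
print(sigmoid_basis(x, mus, s=0.1).shape) # (10, 9)
```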

Maximum likelihood
▪ Assume observations from a deterministic function with
added Gaussian noise:
t = y(x, w) + ϵ, where p(ϵ | β) = 𝒩(ϵ | 0, β⁻¹)

▪ Note that 𝒩(x | μ, σ^2) = (1 / (2πσ^2)^{1/2}) exp{ −(x − μ)^2 / (2σ^2) }

β = 1/σ^2

𝒩(x | μ, σ^2) > 0

∫_{−∞}^{∞} 𝒩(x | μ, σ^2) dx = 1
Maximum likelihood
▪ Assume observations from a deterministic function with
added Gaussian noise:
t = y(x, w) + ϵ, where p(ϵ | β) = 𝒩(ϵ | 0, β⁻¹)

▪ Note that 𝒩(x | μ, σ^2) = (1 / (2πσ^2)^{1/2}) exp{ −(x − μ)^2 / (2σ^2) }

𝔼[x] = ∫_{−∞}^{∞} x 𝒩(x | μ, σ^2) dx = μ

𝔼[x^2] = ∫_{−∞}^{∞} x^2 𝒩(x | μ, σ^2) dx = μ^2 + σ^2

var[x] = 𝔼[x^2] − 𝔼[x]^2 = σ^2
Maximum likelihood
▪ Assume observations from a deterministic function with
added Gaussian noise:
t = y(x, w) + ϵ, where p(ϵ | β) = 𝒩(ϵ | 0, β⁻¹)

▪ This is the same as saying

p(t | x, w, β) = 𝒩(t | y(x, w), β⁻¹)

▪ Recall: y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w⊤φ(x)

Maximum likelihood
▪ This is the same as saying

p(t | x, w, β) = 𝒩(t | y(x, w), β⁻¹)

▪ Given observed inputs X = {x_1, …, x_N} and targets t = [t_1, …, t_N]⊤, we obtain the likelihood function:
p(t | X, w, β) = ∏_{n=1}^{N} 𝒩(t_n | w⊤φ(x_n), β⁻¹)
Maximum likelihood
▪ Taking the logarithm, we get
ln p(t | w, β) = Σ_{n=1}^{N} ln 𝒩(t_n | w⊤φ(x_n), β⁻¹)
              = (N/2) ln β − (N/2) ln(2π) − β E_D(w)

where E_D(w) = (1/2) Σ_{n=1}^{N} {t_n − w⊤φ(x_n)}^2

▪ Recall: 𝒩(x | μ, σ^2) = (1 / (2πσ^2)^{1/2}) exp{ −(x − μ)^2 / (2σ^2) }
Maximum likelihood
▪ Computing the gradient and setting it to zero yields
∇_w ln p(t | w, β) = β Σ_{n=1}^{N} {t_n − w⊤φ(x_n)} φ(x_n)⊤ = 0
▪ Solving for w, we get

w_ML = (Φ⊤Φ)⁻¹ Φ⊤ t

where (Φ⊤Φ)⁻¹ Φ⊤ is the Moore-Penrose pseudo-inverse of Φ, and the design matrix is

Φ = [ φ_0(x_1)   φ_1(x_1)   ⋯   φ_{M−1}(x_1)
      φ_0(x_2)   φ_1(x_2)   ⋯   φ_{M−1}(x_2)
      ⋮          ⋮          ⋱   ⋮
      φ_0(x_N)   φ_1(x_N)   ⋯   φ_{M−1}(x_N) ]
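A hedged sketch of this maximum-likelihood solution: build Φ from a bias column and Gaussian basis functions (an assumed choice) and compute w_ML with NumPy's pseudo-inverse. The data-generation details (data size, precision β) are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data t = sin(2*pi*x) + Gaussian noise with precision beta
# (data size, beta, and the Gaussian basis are assumed choices)
N = 25
beta = 1.0 / 0.3 ** 2
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=1.0 / np.sqrt(beta), size=N)

mus = np.linspace(0.0, 1.0, 9)
Phi = np.column_stack([np.ones_like(x),
                       np.exp(-((x[:, None] - mus[None, :]) ** 2) / (2 * 0.1 ** 2))])

# w_ML = (Phi^T Phi)^{-1} Phi^T t, via the Moore-Penrose pseudo-inverse
w_ml = np.linalg.pinv(Phi) @ t
print(w_ml)
```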

Interpretation
▪ Consider y = Φ w_ML = [φ_1, …, φ_M] w_ML

▪ y ∈ 𝒮 ⊆ 𝒯, where 𝒯 is the N-dimensional space of target vectors and 𝒮 is the M-dimensional subspace spanned by φ_1, …, φ_M

▪ w_ML minimizes the distance between t and its orthogonal projection onto 𝒮, i.e., y
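This projection view can be checked numerically. In the sketch below (an arbitrary random Φ and t, purely for illustration), the residual t − Φw_ML is orthogonal to every column of Φ.

```python
import numpy as np

rng = np.random.default_rng(3)

# Arbitrary design matrix and targets, purely for illustration
Phi = rng.normal(size=(20, 4))
t = rng.normal(size=20)

w_ml = np.linalg.pinv(Phi) @ t
y = Phi @ w_ml                 # orthogonal projection of t onto the column space of Phi

# Each basis column is orthogonal to the residual t - y (zero up to rounding error)
print(Phi.T @ (t - y))
```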
Regularization
▪ Consider the error function

E_D(w) + λ E_W(w)


data term + regularization term

▪ With the sum-of-squares error function and a quadratic regularizer, we get
(1/2) Σ_{n=1}^{N} {t_n − w⊤φ(x_n)}^2 + (λ/2) w⊤w

▪ This is minimized by w = (λI + Φ⊤Φ)⁻¹ Φ⊤ t
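A minimal sketch of this regularized solution, assuming a degree-9 polynomial design matrix and the values ln λ = −18 and ln λ = 0 from the earlier slides; the noise level and data size are illustrative.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """w = (lam * I + Phi^T Phi)^{-1} Phi^T t, the minimizer of the regularized error."""
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)
Phi = np.vander(x, 10, increasing=True)      # degree-9 polynomial design matrix

print(ridge_fit(Phi, t, lam=np.exp(-18)))    # ln(lambda) = -18
print(ridge_fit(Phi, t, lam=1.0))            # ln(lambda) = 0: coefficients shrink heavily
```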

Regularization
▪ With a more general regularizer, we have

(1/2) Σ_{n=1}^{N} {t_n − w⊤φ(x_n)}^2 + (λ/2) Σ_{j=1}^{M} |w_j|^q

(q = 1 gives the lasso penalty; q = 2 gives the quadratic penalty)

Regularization
▪ Lasso tends to generate sparser solutions than a quadratic
regularizer
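One way to see this, assuming scikit-learn is available: fit the same polynomial features with a lasso (q = 1) and a quadratic/ridge (q = 2) penalty and compare the coefficients. The regularization strengths are arbitrary illustrative values.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=30)

# Degree-9 polynomial features; both models fit their own intercept, so drop x^0
Phi = np.vander(x, 10, increasing=True)[:, 1:]

lasso = Lasso(alpha=1e-3, max_iter=100_000).fit(Phi, t)   # q = 1 penalty
ridge = Ridge(alpha=1e-3).fit(Phi, t)                     # q = 2 penalty

print("lasso:", np.round(lasso.coef_, 3))   # typically several coefficients are exactly zero
print("ridge:", np.round(ridge.coef_, 3))   # typically all coefficients are non-zero
```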
