Lecture 02: Linear models
Advanced Machine Learning
Sandjai Bhulai
8 September 2023
sin x = x − x^3/3! + x^5/5! − x^7/7! + x^9/9! − ⋯
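A quick numerical illustration (not from the slides, Python with the standard library assumed; taylor_sin is a name invented for this sketch) that the truncated series already approximates sin x well on a small interval:

import math

def taylor_sin(x, terms=5):
    """Partial sum x - x^3/3! + x^5/5! - ... with the given number of terms."""
    return sum((-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1) for k in range(terms))

for x in (0.5, 1.0, 2.0):
    print(x, taylor_sin(x), math.sin(x))   # the degree-9 polynomial is already close on this range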
Polynomial curve fitting
▪ Polynomial curve: y(x, w) = w_0 + w_1 x + w_2 x^2 + ⋯ + w_M x^M = ∑_{j=0}^{M} w_j x^j
▪ Performance is measured by
  E(w) = (1/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}^2
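A minimal sketch of fitting polynomials of different orders by least squares and evaluating E(w). Assumptions not in the slides: NumPy, the usual sin(2πx)-plus-Gaussian-noise running example as data, and the invented helper names fit_poly and sse:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)                                  # N = 10 inputs
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)     # noisy targets (assumed running example)

def fit_poly(x, t, M):
    """Least-squares fit of y(x, w) = sum_{j=0}^{M} w_j x^j."""
    X = np.vander(x, M + 1, increasing=True)               # columns x^0, x^1, ..., x^M
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    return w

def sse(w, x, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    y = np.polyval(w[::-1], x)                             # polyval expects highest power first
    return 0.5 * np.sum((y - t) ** 2)

for M in (0, 1, 3, 9):
    w = fit_poly(x, t, M)
    print(M, sse(w, x, t))                                 # training error shrinks as M grows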
Polynomial curve fitting: order 1
[Plot: the fitted order-1 polynomial y(x, w) = ∑_{j=0}^{M} w_j x^j against the training data.]
Polynomial curve fitting: order 3
[Plot: the fitted order-3 polynomial against the training data.]
Polynomial curve fitting: order 9
[Plot: the fitted order-9 polynomial against the training data.]
Overfitting
▪ Root mean square (RMS) error: E_RMS = √(2E(w*)/N)
▪ Regularized error:
  Ẽ(w) = (1/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}^2 + (λ/2) ∥w∥^2
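A sketch of the RMS error and the quadratically regularized fit in closed form. Assumptions: NumPy, the same assumed sin(2πx)-plus-noise data as above, invented names fit_ridge_poly and rms_error, and ln λ = −18 chosen only as an illustrative value:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

def fit_ridge_poly(x, t, M, lam):
    """Closed-form minimizer of 1/2 sum_n (y(x_n, w) - t_n)^2 + lam/2 * ||w||^2."""
    X = np.vander(x, M + 1, increasing=True)
    return np.linalg.solve(lam * np.eye(M + 1) + X.T @ X, X.T @ t)

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w*) / N)."""
    y = np.polyval(w[::-1], x)
    return np.sqrt(np.sum((y - t) ** 2) / x.size)

for ln_lam in (-18, 0):
    w = fit_ridge_poly(x, t, M=9, lam=np.exp(ln_lam))
    print(ln_lam, rms_error(w, x, t))      # larger lambda shrinks the weights and tames overfitting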
Linear basis function models
▪ General model is
  y(x, w) = ∑_{j=0}^{M−1} w_j φ_j(x) = w^⊤ φ(x)
▪ For example, sigmoidal basis functions:
  φ_j(x) = σ((x − μ_j)/s), where σ(a) = 1/(1 + exp(−a))
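A sketch of building the design matrix Φ, with Φ[n, j] = φ_j(x_n), so that y(x, w) = w^⊤ φ(x) becomes a matrix-vector product. Assumptions not in the extracted slide text: NumPy, the invented helper design_matrix, hand-chosen centres mu and scale s, and a Gaussian-bump variant included only for comparison:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def design_matrix(x, mu, s, kind="sigmoid"):
    """Return Phi with Phi[n, j] = phi_j(x_n); column 0 is the bias phi_0(x) = 1."""
    if kind == "sigmoid":
        feats = sigmoid((x[:, None] - mu[None, :]) / s)                       # phi_j(x) = sigma((x - mu_j)/s)
    else:
        feats = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * s ** 2))     # Gaussian bumps (assumed variant)
    return np.hstack([np.ones((x.size, 1)), feats])

x = np.linspace(0, 1, 10)
mu = np.linspace(0, 1, 9)          # 9 basis-function centres, so M = 10 including the bias
Phi = design_matrix(x, mu, s=0.1)
print(Phi.shape)                   # (10, 10)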
▪ Note that 𝒩(x | μ, σ^2) = (1/(2πσ^2)^{1/2}) exp{−(x − μ)^2/(2σ^2)}
▪ The precision is β = 1/σ^2
▪ 𝒩(x | μ, σ^2) > 0 and ∫_{−∞}^{∞} 𝒩(x | μ, σ^2) dx = 1
Maximum likelihood
▪ Assume observations from a deterministic function with added Gaussian noise:
  t = y(x, w) + ϵ, where p(ϵ | β) = 𝒩(ϵ | 0, β^{−1})
▪ Moments of the Gaussian:
  𝔼[x] = ∫_{−∞}^{∞} x 𝒩(x | μ, σ^2) dx = μ
  𝔼[x^2] = ∫_{−∞}^{∞} x^2 𝒩(x | μ, σ^2) dx = μ^2 + σ^2
  var[x] = 𝔼[x^2] − 𝔼[x]^2 = σ^2
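A small numerical check (NumPy assumed; mu and beta are arbitrary illustrative values) that samples from the assumed noise model reproduce the stated moments:

import numpy as np

rng = np.random.default_rng(1)
mu, beta = 1.5, 4.0                      # arbitrary mean and precision; the variance is 1/beta
xs = rng.normal(mu, np.sqrt(1.0 / beta), size=200_000)

print(xs.mean())                         # ~ mu,            matching E[x] = mu
print((xs ** 2).mean())                  # ~ mu^2 + 1/beta, matching E[x^2] = mu^2 + sigma^2
print(xs.var())                          # ~ 1/beta,        matching var[x] = sigma^2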
Maximum likelihood
▪ Recall: y(x, w) = ∑_{j=0}^{M−1} w_j φ_j(x) = w^⊤ φ(x)
▪ Under this noise model the log-likelihood of the targets is
  ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w),
  where E_D(w) = (1/2) ∑_{n=1}^{N} {t_n − w^⊤ φ(x_n)}^2
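A sketch of evaluating E_D(w) and the log-likelihood for a candidate w. Assumptions: NumPy, the invented helper log_likelihood, and a random design matrix and targets standing in for real data:

import numpy as np

def log_likelihood(w, Phi, t, beta):
    """ln p(t | w, beta) = N/2 * ln(beta) - N/2 * ln(2*pi) - beta * E_D(w)."""
    N = t.size
    E_D = 0.5 * np.sum((t - Phi @ w) ** 2)     # sum-of-squares error E_D(w)
    return 0.5 * N * np.log(beta) - 0.5 * N * np.log(2 * np.pi) - beta * E_D

rng = np.random.default_rng(2)
Phi = rng.normal(size=(10, 4))                 # stand-in N x M design matrix
t = rng.normal(size=10)                        # stand-in targets
print(log_likelihood(np.zeros(4), Phi, t, beta=1.0))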
Maximum likelihood
▪ Computing the gradient and setting it to zero yields
  ∇_w ln p(t | w, β) = β ∑_{n=1}^{N} {t_n − w^⊤ φ(x_n)} φ(x_n)^⊤ = 0
▪ Solving for w, we get
  w_ML = (Φ^⊤ Φ)^{−1} Φ^⊤ t,
  where (Φ^⊤ Φ)^{−1} Φ^⊤ is the Moore-Penrose pseudo-inverse of Φ
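A sketch of the maximum-likelihood solution via the pseudo-inverse (NumPy assumed; the design matrix and targets are random stand-ins). np.linalg.lstsq is used as a cross-check and is the numerically safer route in practice:

import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(20, 5))                    # arbitrary N x M design matrix
t = rng.normal(size=20)

w_ml = np.linalg.pinv(Phi) @ t                    # Moore-Penrose pseudo-inverse solution
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(np.allclose(w_ml, w_lstsq))                 # True: both give the least-squares w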
▪ Geometric interpretation: y ∈ 𝒮 ⊆ 𝒯, where 𝒯 is the N-dimensional space containing t and 𝒮 is an M-dimensional subspace
▪ 𝒮 is spanned by φ_1, …, φ_M
▪ w_ML minimizes the distance between t and its orthogonal projection onto 𝒮, i.e., y
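The projection view can be checked numerically: the residual t − y is orthogonal to every column of Φ. This sketch (NumPy assumed, random stand-in data, not slide material) verifies exactly that:

import numpy as np

rng = np.random.default_rng(4)
Phi = rng.normal(size=(20, 5))
t = rng.normal(size=20)

w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
y = Phi @ w_ml                                        # orthogonal projection of t onto span{phi_1, ..., phi_M}
print(np.allclose(Phi.T @ (t - y), 0, atol=1e-9))     # residual is orthogonal to the subspace S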
Regularization
▪ Consider the error function
  (1/2) ∑_{n=1}^{N} {t_n − w^⊤ φ(x_n)}^2 + (λ/2) ∑_{j=1}^{M} |w_j|^q
▪ q = 1 gives the lasso penalty; q = 2 gives the quadratic (ridge) penalty
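For q = 2 the minimizer has the closed form w = (λI + Φ^⊤Φ)^{−1} Φ^⊤ t; the sketch below (NumPy assumed, ridge_fit is an invented name) implements that case. For q = 1 (lasso) there is no closed form and the problem is usually solved iteratively, e.g. by coordinate descent:

import numpy as np

def ridge_fit(Phi, t, lam):
    """Closed-form minimizer for q = 2: w = (lam*I + Phi^T Phi)^{-1} Phi^T t.
    For simplicity this penalizes all components, including any bias term."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(5)
Phi = rng.normal(size=(20, 5))     # stand-in design matrix
t = rng.normal(size=20)            # stand-in targets
print(ridge_fit(Phi, t, lam=0.1))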