Lecture 04

The document summarizes linear models for classification using machine learning. It discusses using linear discriminants to separate classes in input space and classify new data points. The perceptron algorithm is introduced for training linear classifiers by minimizing misclassifications. Probabilistic generative models are also covered, which model class-conditional densities to calculate posterior probabilities for classification. Maximum likelihood estimation is discussed for learning the parameters of these probabilistic linear classifiers from labeled training data.


Advanced Machine Learning

Lecture 4: Classification
Sandjai Bhulai
Vrije Universiteit Amsterdam

[email protected]
15 September 2023
Linear models for classification

Advanced Machine Learning

Classification with linear models
▪ Goal: take input vector x and map it onto one of K discrete
classes

▪ Consider linear models: classes are separable by (D − 1)-dimensional hyperplanes in the D-dimensional input space

▪ Simplest linear regression model: y(x) = w⊤x + w0

▪ Use activation function f( ⋅ ) to map onto discrete classes: y(x) = f(w⊤x + w0)

▪ Due to f( ⋅ ), these models are no longer linear in the parameters

Discriminant functions
▪ The simplest case is the 2-class case: y(x) = w⊤x + w0,
where w is a weight vector and w0 is the bias
▪ The decision boundary is y(x) = 0
▪ Consider two points xa and xb that lie on the decision surface.
Because y(xa) = y(xb) = 0, we have w⊤(xa − xb) = 0.
▪ Thus, vector w is orthogonal to every vector lying within the
decision surface
▪ If x is on the decision surface, then y(x) = 0, indicating that

w⊤x / ∥w∥ = − w0 / ∥w∥
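To make this geometry concrete, here is a minimal numpy sketch (not part of the original slides; the weight vector, bias, and test points are arbitrary assumptions) verifying that w is orthogonal to vectors within the decision surface and that y(x)/∥w∥ is the signed distance from it:

```python
import numpy as np

w = np.array([2.0, -1.0])   # assumed weight vector
w0 = 0.5                    # assumed bias

def y(x):
    """Linear discriminant y(x) = w^T x + w0."""
    return w @ x + w0

def signed_distance(x):
    """Signed perpendicular distance of x from the surface y(x) = 0."""
    return y(x) / np.linalg.norm(w)

# Two points chosen to lie on the decision surface (y = 0):
xa = np.array([0.0, 0.5])   # y(xa) = 2*0 - 1*0.5 + 0.5 = 0
xb = np.array([1.0, 2.5])   # y(xb) = 2*1 - 1*2.5 + 0.5 = 0

print(np.isclose(w @ (xa - xb), 0.0))          # True: w is orthogonal to the surface
print(signed_distance(np.array([3.0, 0.0])))   # positive: x lies on the side w points to
```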

Geometry of linear discriminants

▪ Decision surface is perpendicular to w


▪ Displacement is controlled by w0

Multiple classes
▪ It is generally not a good idea to use multiple 2-class classifiers to do K-class classification
▪ Leads to ambiguous regions

Single K-class classifier
▪ Single discriminant comprising K linear functions of the form yk(x) = wk⊤x + wk0

▪ Point x belongs to class Ck if yk(x) > yj(x) for all j ≠ k (see the sketch after this slide)

▪ Decision boundary between Ck and Cj is given by yk(x) = yj(x) and corresponds to a (D − 1)-dimensional hyperplane: (wk − wj)⊤x + (wk0 − wj0) = 0
▪ Decision region singly connected and convex
(due to linearity of discriminant functions)
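A minimal sketch of this decision rule; the weights, biases, and test point below are assumed values for illustration, not taken from the slides:

```python
import numpy as np

W = np.array([[ 1.0,  0.0],      # row k is the assumed weight vector w_k
              [-1.0,  1.0],
              [ 0.0, -1.0]])
w0 = np.array([0.0, 0.5, -0.5])  # assumed biases w_k0

def classify(x):
    """Assign x to the class with the largest discriminant y_k(x) = w_k^T x + w_k0."""
    return int(np.argmax(W @ x + w0))

print(classify(np.array([2.0, 0.0])))   # 0, since y(x) = [2.0, -1.5, -0.5]
```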

Perceptron algorithm
▪ Rosenblatt (1962)
▪ Linear model with step activation function:

y(x) = f(w⊤φ(x)),  where f(a) = +1 for a ≥ 0 and f(a) = −1 for a < 0

▪ Train using perceptron criterion (here tn ∈ {−1, 1}):

EP(w) = − ∑_{n ∈ ℳ} w⊤φn tn

where ℳ is the set of misclassified patterns


▪ Note that directly minimizing the total number of misclassified patterns will not work because of the non-linear f( ⋅ )

Perceptron algorithm
▪ Total error function is piecewise linear
▪ Stochastic gradient descent:

w(τ+1) = w(τ) − η ∇EP(w) = w(τ) + ηφntn

▪ Update is not a function of w, thus η can be set equal to 1 (see the training sketch after this slide)
▪ Perceptron convergence theorem: if there exists an exact solution, then the perceptron algorithm will find a solution in a finite number of steps
▪ Attacked by Minsky and Papert in Perceptrons (1969). The attack is valid only for single-layer perceptrons. Consequence: research in neural computation stalled for nearly a decade
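A minimal training sketch of the update above with η = 1; the toy data and the identity-plus-bias basis function are assumptions for illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable toy data with targets t_n in {-1, +1}
X = np.vstack([rng.normal(loc=[2, 2], size=(20, 2)),
               rng.normal(loc=[-2, -2], size=(20, 2))])
t = np.hstack([np.ones(20), -np.ones(20)])

def phi(x):
    return np.append(x, 1.0)          # fixed basis: x itself plus a constant bias feature

w = np.zeros(3)
for epoch in range(100):
    misclassified = 0
    for xn, tn in zip(X, t):
        if tn * (w @ phi(xn)) <= 0:   # pattern misclassified (or on the boundary)
            w = w + phi(xn) * tn      # perceptron update with eta = 1
            misclassified += 1
    if misclassified == 0:            # all patterns correct: convergence theorem applies
        break

print("epochs:", epoch + 1, "w:", w)
```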


Probabilistic generative models
▪ Model class-conditional densities p(x | Ck)
▪ Posterior probability for class C1:
p(C1 | x) = p(x | C1)p(C1) / (p(x | C1)p(C1) + p(x | C2)p(C2)) = 1 / (1 + exp(−a)) = σ(a)

where we have defined a = ln [ p(x | C1)p(C1) / (p(x | C2)p(C2)) ]

▪ σ is the logistic sigmoid function; the inverse of σ is the logit function a = ln(σ / (1 − σ))

Probabilistic generative models
▪ Generalization to multiple classes:
p(Ck | x) = p(x | Ck)p(Ck) / ∑j p(x | Cj)p(Cj) = exp(ak) / ∑j exp(aj)

where ak = ln(p(x | Ck)p(Ck))

▪ This is known as the softmax function, because it is a smoothed version of the max (a numerical sketch follows below)

▪ Different representations for class-conditional densities yield different consequences in how classification is done
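A minimal sketch of the softmax; the activation values are assumed, and subtracting the maximum before exponentiating is a standard numerical safeguard that does not change the result:

```python
import numpy as np

def softmax(a):
    """p(C_k | x) = exp(a_k) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())          # shift by max(a) for numerical stability
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])        # assumed a_k = ln(p(x | C_k) p(C_k)) for some x
print(softmax(a), softmax(a).sum())  # posteriors and their sum (1.0)
```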

Continuous inputs
▪ First assume that all classes share the same covariance
matrix and that there are only 2 classes.
▪ We have

p(C1 | x) = σ(w⊤x + w0)

p(x | Ck) = 1 / ((2π)^{D/2} |Σ|^{1/2}) · exp{ −(1/2) (x − μk)⊤Σ−1(x − μk) }

where

w = Σ−1(μ1 − μ2)
w0 = −(1/2) μ1⊤Σ−1μ1 + (1/2) μ2⊤Σ−1μ2 + ln( p(C1) / p(C2) )
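A minimal sketch of these formulas with assumed means, shared covariance, and priors (not taken from the slides):

```python
import numpy as np

mu1 = np.array([1.0, 1.0])           # assumed class means
mu2 = np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3],        # assumed shared covariance
                  [0.3, 1.0]])
p1, p2 = 0.6, 0.4                    # assumed priors p(C1), p(C2)

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2 + np.log(p1 / p2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.5, 0.0])             # assumed test point
print(sigmoid(w @ x + w0))           # posterior p(C1 | x)
```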
Continuous inputs
▪ Quadratic term from Gaussian vanishes. The priors p(Ck)
only enter via the bias parameter
▪ For the general case of K classes, we have
ak(x) = wk⊤x + wk0

where

wk = Σ−1μk
wk0 = −(1/2) μk⊤Σ−1μk + ln p(Ck)
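The same construction for K classes, as a short sketch with assumed parameters; the linear activations ak(x) are passed through the softmax from the previous slide:

```python
import numpy as np

mus = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]])   # assumed class means
priors = np.array([0.5, 0.3, 0.2])                       # assumed priors p(C_k)
Sigma_inv = np.linalg.inv(np.eye(2))                     # assumed shared covariance

Wk = mus @ Sigma_inv                                     # row k is w_k = Sigma^{-1} mu_k
wk0 = -0.5 * np.einsum('ki,ij,kj->k', mus, Sigma_inv, mus) + np.log(priors)

def posterior(x):
    a = Wk @ x + wk0                 # linear discriminants a_k(x)
    e = np.exp(a - a.max())          # stable softmax
    return e / e.sum()

print(posterior(np.array([1.5, 0.5])))                   # p(C_k | x), sums to 1
```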

Linear versus quadratic
▪ When covariance is shared by classes, the decision
boundary is linear
▪ When the covariances are not shared between classes, the decision boundary is quadratic

Maximum likelihood
▪ Since we have a parametric form for class-conditional
densities p(x | Ck), we can determine values of the
parameters and priors p(Ck)

p(xn, C1) = p(C1)p(xn | C1) = q 𝒩(xn | μ1, Σ)

p(xn, C2) = p(C2)p(xn | C2) = (1 − q) 𝒩(xn | μ2, Σ)

▪ Let tn ∈ {0, 1}; the likelihood is then given by


p(t, X | q, μ1, μ2, Σ) = ∏_{n=1}^{N} [ q 𝒩(xn | μ1, Σ) ]^{tn} [ (1 − q) 𝒩(xn | μ2, Σ) ]^{1−tn}
Maximum likelihood
▪ The log-likelihood function with relevant terms for q is:
∑_{n=1}^{N} { tn ln q + (1 − tn) ln(1 − q) }

▪ Maximizing with respect to q yields

q = (1/N) ∑_{n=1}^{N} tn = N1 / N = N1 / (N1 + N2)

Maximum likelihood
▪ The log-likelihood function with relevant terms for μ1 is:
∑_{n=1}^{N} tn ln 𝒩(xn | μ1, Σ) = −(1/2) ∑_{n=1}^{N} tn (xn − μ1)⊤Σ−1(xn − μ1) + const

▪ Maximizing with respect to μ1 yields

μ1 = (1/N1) ∑_{n=1}^{N} tn xn
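A minimal sketch of these estimators on assumed toy data; q is the fraction of points with tn = 1, and μ1, μ2 are the class means (the shared Σ, not shown on the slides, would be estimated analogously as a weighted average of per-class covariances):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[1, 1], size=(30, 2)),     # assumed samples from C1
               rng.normal(loc=[-1, -1], size=(70, 2))])  # assumed samples from C2
t = np.hstack([np.ones(30), np.zeros(70)])               # t_n = 1 for C1, 0 for C2

N = len(t)
N1 = t.sum()
N2 = N - N1

q = N1 / N                                    # ML estimate of the prior p(C1)
mu1 = (t[:, None] * X).sum(axis=0) / N1       # ML estimate of mu1
mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2 # ML estimate of mu2 (by symmetry)

print(q, mu1, mu2)
```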

Logistic regression
▪ Posterior probability of class C1 written as a logistic sigmoid
acting on a linear function of feature vector φ
p(C1 | φ) = y(φ) = σ(w⊤φ),  with dσ(a)/da = σ(a)(1 − σ(a))
▪ More compact than maximum likelihood fitting of Gaussians: for an M-dimensional feature space, logistic regression has M adjustable parameters, whereas the Gaussian model uses 2M parameters for the means and M(M + 1)/2 parameters for the shared covariance matrix
▪ Maximum likelihood: p(t | w) = ∏_{n=1}^{N} yn^{tn} (1 − yn)^{1−tn}

Logistic regression
▪ Negative log of likelihood yields cross entropy
E(w) = − ln p(t | w) = − ∑_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) }

▪ Gradient with respect to w


∇E(w) = ∑_{n=1}^{N} (yn − tn) φn

▪ Therefore, we have the same form as the gradient for the sum-of-squares error:

∇ ln p(t | w, β) = ∑_{n=1}^{N} { tn − w⊤φ(xn) } φ(xn)⊤
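A minimal sketch of logistic regression trained by batch gradient descent on this cross-entropy gradient; the toy data, step size, and iteration count are assumptions for illustration (the slides themselves move on to Newton-Raphson/IRLS instead):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=[1.5, 1.5], size=(50, 2)),
               rng.normal(loc=[-1.5, -1.5], size=(50, 2))])
t = np.hstack([np.ones(50), np.zeros(50)])            # targets t_n in {0, 1}

Phi = np.hstack([X, np.ones((len(t), 1))])            # design matrix, bias absorbed into w
w = np.zeros(Phi.shape[1])
eta = 0.01                                            # assumed step size

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for _ in range(500):
    y = sigmoid(Phi @ w)                              # y_n = sigma(w^T phi_n)
    grad = Phi.T @ (y - t)                            # sum_n (y_n - t_n) phi_n
    w -= eta * grad                                   # gradient descent step
print(w)
```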
n=1
Iterative reweighted least squares
▪ Efficient iterative optimization: Newton-Raphson

wnew = wold − H −1 ∇E(w)


where H is the Hessian matrix (of second derivatives)

▪ For the sum-of-squares error this can be done in one step, because the error function is quadratic

▪ For cross entropy we get a similar set of normal equations for weighted least squares, which depends on w

▪ This dependency forces us to apply the update iteratively

Iterative reweighted least squares
▪ Apply this to linear regression
∇E(w) = ∑_{n=1}^{N} (w⊤φn − tn) φn = Φ⊤Φw − Φ⊤t

H = ∇∇E(w) = ∑_{n=1}^{N} φnφn⊤ = Φ⊤Φ

▪ The Newton-Raphson update then takes the form

wnew = wold − H −1 ∇E(w) = wold − (Φ⊤Φ)−1{Φ⊤Φwold − Φ⊤t}

= (Φ⊤Φ)−1Φ⊤t
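A quick sketch (assumed data) verifying the claim: for the quadratic sum-of-squares error, one Newton-Raphson step from an arbitrary wold already gives the least-squares solution (Φ⊤Φ)−1Φ⊤t:

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = np.hstack([rng.normal(size=(50, 2)), np.ones((50, 1))])   # assumed design matrix
t = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w_old = rng.normal(size=3)                        # arbitrary starting point
grad = Phi.T @ Phi @ w_old - Phi.T @ t            # gradient of E at w_old
H = Phi.T @ Phi                                   # Hessian
w_new = w_old - np.linalg.solve(H, grad)          # single Newton-Raphson step

w_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)    # direct least-squares solution
print(np.allclose(w_new, w_ls))                   # True
```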

Iterative reweighted least squares
▪ Apply this to logistic regression
∇E(w) = ∑_{n=1}^{N} (yn − tn) φn = Φ⊤(y − t)

H = ∇∇E(w) = ∑_{n=1}^{N} yn(1 − yn) φnφn⊤ = Φ⊤RΦ

where R is the diagonal matrix with Rnn = yn(1 − yn)

Iterative reweighted least squares
▪ The Newton-Raphson update then takes the form

wnew = wold − H −1 ∇E(w) = wold − (Φ⊤RΦ)−1Φ⊤(y − t)

= (Φ⊤RΦ)−1{Φ⊤RΦwold − Φ⊤(y − t)}

= (Φ⊤RΦ)−1Φ⊤Rz

where
z = Φwold − R −1(y − t)
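A minimal IRLS sketch following these equations; the toy data are assumed, and the small ridge term added to Φ⊤RΦ is a practical safeguard against singular matrices that is not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=[1, 1], size=(50, 2)),
               rng.normal(loc=[-1, -1], size=(50, 2))])
t = np.hstack([np.ones(50), np.zeros(50)])
Phi = np.hstack([X, np.ones((len(t), 1))])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(Phi.shape[1])
for _ in range(10):
    y = sigmoid(Phi @ w)
    R = y * (1 - y)                                       # diagonal of R, R_nn = y_n(1 - y_n)
    z = Phi @ w - (y - t) / np.maximum(R, 1e-10)          # working targets z
    A = Phi.T @ (R[:, None] * Phi) + 1e-8 * np.eye(Phi.shape[1])  # Phi^T R Phi (+ tiny ridge)
    w = np.linalg.solve(A, Phi.T @ (R * z))               # w_new = (Phi^T R Phi)^{-1} Phi^T R z
print(w)
```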
