
INT3405 Machine Learning

Lecture 3 - Classification

Ta Viet Cuong, Le Duc Trong, Tran Quoc Long


TA: Le Bang Giang, Tran Truong Thuy

VNU-UET

2022

1 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

2 / 33
Recap: Logistic Regression - Binary Classification
Data
x ∈ R^d, y ∈ {0, 1}
Example: image S × S −→ d = S^2 = 784 (MNIST)

D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)}

Model
f(x) = w^T x + w_0
Y | X = x ∼ Ber(y | σ(f(x)))

Sigmoid function

σ(z) = 1 / (1 + e^{−z})

Parameter
θ = (w, w_0)

3 / 33
Recap: Logistic Regression - Binary cross entropy loss

Training
Using the MLE principle, compute the likelihood

L(w, w_0) = P(D) = ∏_{i=1}^{n} P(y_i | x_i) = ∏_{i=1}^{n} µ_i^{y_i} (1 − µ_i)^{1−y_i}

where µ_i = σ(f(x_i)).


Negative log-likelihood (NLL)

ℓ(w, w_0) = − log L(w, w_0) = ∑_{i=1}^{n} [ −y_i log µ_i − (1 − y_i) log(1 − µ_i) ]

⇒ the binary cross-entropy (BCE) loss function

4 / 33
Recap: Training Algorithm - Gradient Descent

TrainLR-GD(D, λ) → (w, w_0):
1. Initialize: w = 0 ∈ R^d, w_0 = 0
2. Loop epoch = 1, 2, . . .
   a. Compute ℓ(w, w_0)
   b. Compute the gradients ∇_w ℓ, ∇_{w_0} ℓ
   c. Update the parameters

      w ← w − λ ∇_w ℓ(w, w_0)
      w_0 ← w_0 − λ ∇_{w_0} ℓ(w, w_0)

3. Stop when:
   a. the number of epochs is large enough,
   b. the decrease in the loss becomes negligible, or
   c. the gradients are small enough: ∥∇_w ℓ∥, |∇_{w_0} ℓ|
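A minimal NumPy sketch of this training loop, assuming full-batch gradient
descent on the BCE loss and stopping rule 3b; the function name train_lr_gd
and the tolerance tol are illustrative, not from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_lr_gd(X, y, lr=0.1, epochs=1000, tol=1e-6):
    # X: (n, d) features, y: (n,) labels in {0, 1}
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    prev_loss = np.inf
    for epoch in range(epochs):
        mu = sigmoid(X @ w + w0)              # µ_i = σ(f(x_i))
        eps = 1e-12                           # guard against log(0)
        loss = -np.sum(y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps))
        grad_w = X.T @ (mu - y)               # ∇_w ℓ
        grad_w0 = np.sum(mu - y)              # ∇_{w_0} ℓ
        w -= lr * grad_w
        w0 -= lr * grad_w0
        if abs(prev_loss - loss) < tol:       # stop: negligible decrease
            break
        prev_loss = loss
    return w, w0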

5 / 33
Multi-class logistic regression model

Given a label set Y = {1, 2, . . . , C }


Categorical distribution
A random variable Y ∼ Cat(y | θ_1, θ_2, . . . , θ_C) means that

P(Y = c) = θ_c,   c = 1, 2, . . . , C

where θ_c is the probability of category c and ∑_c θ_c = 1.

Example: a six-sided die with C = 6, θ_c = 1/6, ∀c.

Equivalently,

P(Y = y) = ∏_{c=1}^{C} θ_c^{I(c=y)}

where I(c = y) is the indicator function denoting whether
c = y or not.
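As a small illustration (not from the slides), the following sketch samples
from a categorical distribution and checks the indicator-product form of
P(Y = y); the values of θ are arbitrary.

import numpy as np

theta = np.array([0.2, 0.5, 0.3])          # θ_c for C = 3, sums to 1
y = np.random.choice(len(theta), p=theta)  # sample Y ∼ Cat(θ)

one_hot = np.eye(len(theta))[y]            # I(c = y) for c = 1, . . . , C
p_y = np.prod(theta ** one_hot)            # ∏_c θ_c^{I(c = y)}
assert np.isclose(p_y, theta[y])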

6 / 33
Example Dataset with Categorical labels
Iris dataset:
▶ Number of Instances: 150
▶ Number of Attributes: 4 (sepal length/width in centimeters,
petal length/width in centimeters)
▶ Number of classes: 3

7 / 33
Example Dataset with Categorical labels
MNIST dataset:
▶ Number of Instances: 60,000 training images and 10,000
testing images.
▶ Number of Attributes: 28 × 28
▶ Number of classes: 10

Figure: Sample images from MNIST test dataset (source: Wikipedia)

8 / 33
Multi-class logistic regression model
Model
Linear functions with parameters w_c ∈ R^d, w_{c0} ∈ R, c = 1, 2, . . . , C

f_c(x) = w_c^T x + w_{c0}

Apply the softmax function to convert the linear outputs to probabilities

S(z_1, z_2, . . . , z_C) = [ e^{z_c} / ∑_{c'=1}^{C} e^{z_{c'}} ]_{c=1,2,...,C}

Probability model

Y | X = x ∼ Cat(y | S(f_1(x), f_2(x), . . . , f_C(x)))

where f_c(x) is called a logit. With logistic regression, the f_c(x) are
(simple) linear functions; more sophisticated methods use neural networks.
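A minimal NumPy sketch of this model's forward pass, assuming the class
weights are stacked as rows of W (shape (C, d)) with intercepts w0 (shape
(C,)); the names softmax and predict_proba are illustrative.

import numpy as np

def softmax(Z):
    # subtract the row-wise max for numerical stability (result unchanged)
    Z = Z - np.max(Z, axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / np.sum(E, axis=-1, keepdims=True)

def predict_proba(X, W, w0):
    # X: (n, d)  ->  probabilities of shape (n, C)
    logits = X @ W.T + w0              # f_c(x) = w_c^T x + w_{c0}
    return softmax(logits)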
9 / 33
Multi-class logistic regression model

Inference
h_LR(x): choose the class c that maximizes the predicted probability,
h_LR(x) = arg max_c P(Y = c | X = x)

Figure: Multi-class logistic regression with Softmax

10 / 33
Multi-class logistic regression model
Training
The likelihood of the parameters with respect to the dataset
D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} is

L(W) = P(D) = ∏_{i=1}^{n} P(Y = y_i | x_i) = ∏_{i=1}^{n} ∏_{c=1}^{C} µ_{ic}^{y_{ic}}

where µ_{ic} is computed as

µ_{ic} = e^{f_c(x_i)} / ∑_{c'=1}^{C} e^{f_{c'}(x_i)}

and y_{ic} = I(c = y_i) is the one-hot encoding of the labels.
11 / 33
Multi-class logistic regression model

Loss function: negative log-likelihood (NLL)

ℓ(W) = − log L(W) = − ∑_{i=1}^{n} ∑_{c=1}^{C} y_{ic} log µ_{ic}

which is also called the cross-entropy (CE) loss function; it measures the
Kullback–Leibler (KL) divergence between the distributions y_{i·} and µ_{i·}.
Recall that the KL divergence between two distributions P and Q is

D_KL(P ∥ Q) = ∑_{x∈X} P(x) log ( P(x) / Q(x) )
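As a small illustration (not from the slides), the CE loss can be computed
directly from the one-hot labels y_{ic} and the model probabilities µ_{ic},
e.g. the output of the hypothetical predict_proba above; eps guards against
log(0).

import numpy as np

def cross_entropy(Y_onehot, Mu, eps=1e-12):
    # Y_onehot: (n, C) one-hot labels; Mu: (n, C) predicted probabilities
    return -np.sum(Y_onehot * np.log(Mu + eps))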

12 / 33
Multi-class logistic regression model

13 / 33
Multi-class logistic regression model

Loss function: negative log-likelihood (NLL)

ℓ(W) = − log L(W) = − ∑_{i=1}^{n} ∑_{c=1}^{C} y_{ic} log µ_{ic}

which is also called the cross-entropy (CE) loss function; it measures the
Kullback–Leibler (KL) divergence between the distributions y_{i·} and µ_{i·}.
Question: Which KL divergence does the NLL correspond to, E[D_KL(y ∥ µ)] or
E[D_KL(µ ∥ y)]? Can we reverse the order? Why?

14 / 33
Multi-class logistic regression model

Gradient descent
The gradients of the loss function w.r.t. the parameters are

∇_{w_c} ℓ(W) = ∑_{i=1}^{n} (µ_{ic} − y_{ic}) x_i

∇_{w_{c0}} ℓ(W) = ∑_{i=1}^{n} (µ_{ic} − y_{ic})
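A minimal NumPy sketch of these gradients (the softmax is inlined so the
snippet is self-contained; names are illustrative). One gradient-descent step
would then be W ← W − λ · grad_W and w0 ← w0 − λ · grad_w0.

import numpy as np

def grad_ce(X, Y_onehot, W, w0):
    # X: (n, d), Y_onehot: (n, C) one-hot labels, W: (C, d), w0: (C,)
    logits = X @ W.T + w0
    logits -= np.max(logits, axis=1, keepdims=True)   # numerical stability
    Mu = np.exp(logits) / np.sum(np.exp(logits), axis=1, keepdims=True)
    R = Mu - Y_onehot               # residuals (µ_{ic} − y_{ic})
    grad_W = R.T @ X                # ∇_{w_c} ℓ, one row per class
    grad_w0 = R.sum(axis=0)         # ∇_{w_{c0}} ℓ
    return grad_W, grad_w0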

15 / 33
Multi-class classification

Problem statement
Given a dataset D = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )}, xi ∈ X and
yi ∈ Y = {1, 2, . . . , C }, we want to find a classifier h(x) : X → Y

Statistical perspective

The dataset D is sampled from an unknown distribution P(x, y).
A 'good' classifier h has a low probability of making an error under P.
Given a sample (X, Y) ∼ P, the event that X is misclassified by h is

{h(X) ̸= Y}

and the error probability (error rate) of h is

err_P(h) = P_{(X,Y)∼P}{h(X) ̸= Y}   (1)

16 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

17 / 33
Multi-class classification
Estimate the error rate
Since P is unknown, the true error (1) is not directly available.
However, we have the dataset D as a realization of P. The
empirical error rate on the training data D is then
err̂_D(h) = (1/n) ∑_{i=1}^{n} I(h(x_i) ̸= y_i)   (2)

Expectation of empirical error rate


The empirical error rate (2) is an unbiased estimator of the true
error rate (1) under P since
E_P[err̂_D(h)] = (1/n) ∑_{i=1}^{n} E_{(x_i, y_i)∼P}[I(h(x_i) ̸= y_i)]
              = P{h(X) ̸= Y} = err_P(h)
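For example, a one-line sketch of the empirical error rate (2), assuming h
maps an array of inputs to an array of predicted labels (names illustrative):

import numpy as np

def empirical_error(h, X, y):
    # err̂_D(h): fraction of samples in D = (X, y) misclassified by h
    return np.mean(h(X) != y)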

18 / 33
Bounding Empirical Error Rate

Concentration bound
Reminder (Hoeffding's inequality for the empirical mean): let
X_1, X_2, . . . , X_n be i.i.d. samples of a random variable X,
bounded a.s. in [a, b]. Then

P{ | (1/n) ∑_{i=1}^{n} X_i − E[X] | ≤ ϵ } ≥ 1 − 2 exp( −2nϵ² / (b − a)² )

Connection to confidence intervals: bound the error range ϵ at
significance level α (the chance of a wrong estimate), given the
required number of samples n.

19 / 33
Bounding Empirical Error Rate

Concentration bound
Applying Hoeffding's inequality to the empirical error rate, with
X_i = I(h(x_i) ̸= y_i) indicator random variables taking values in
{0, 1} (so b − a = 1), we get a concentration bound on the difference
between the empirical and true error rates:

P{ |err̂_D(h) − err_P(h)| ≤ ϵ } ≥ 1 − 2 e^{−2nϵ²}

▶ This estimates how close err̂_D(h) is to err_P(h), given ϵ and the
  number of samples n (see the worked example below).
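A hedged worked example (the numbers are illustrative, not from the slides):
requiring 2 e^{−2nϵ²} ≤ α gives n ≥ ln(2/α) / (2ϵ²). For ϵ = 0.05 and
α = 0.05 this is n ≥ ln(40) / 0.005 ≈ 738 samples.

import math

eps, alpha = 0.05, 0.05                          # illustrative choices
n_needed = math.log(2 / alpha) / (2 * eps**2)    # ≈ 737.8, so n ≥ 738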

20 / 33
Bounding Empirical Error Rate

21 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

22 / 33
The optimal classifier
For fixed X = x, the probability of the error event is

P{h(X) ̸= Y | X = x} = 1 − P(Y = h(x) | X = x)

Question: If X = x is fixed, which of the values 1 to C should h(x)
be to minimize the probability of error, given the posterior

[ P(Y = 1 | X = x), P(Y = 2 | X = x), . . . , P(Y = C | X = x) ] ?

Bayes optimal classifier: the classifier that chooses the most
probable class,

h⋆(x) = arg max_c P(Y = c | X = x)
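For example (with illustrative numbers): if the posterior at x is
(0.2, 0.5, 0.3) for C = 3, then h⋆(x) = 2 and the conditional error
probability is 1 − 0.5 = 0.5; any other choice of h(x) gives a larger error.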

23 / 33
The optimal classifier

Optimal error probability

E⋆ = err_P(h⋆) ≤ err_P(h),   ∀h

Example: a logistic regression model uses the form of the Bayes
optimal classifier, with P(Y = c | X = x) given by the probabilities
of the categorical distribution Cat(y | S(f_1(x), f_2(x), . . . , f_C(x))).

Question: Is it possible to find the Bayes optimal classifier h⋆?

▶ Can we find the distribution P?


▶ Can we approximate P(Y = c|X = x) to approximate h⋆ ?

24 / 33
The optimal classifier

Two approaches in machine learning


▶ Discriminative models: learn a predictor given the
observations
▶ Approximate P(Y = c|X = x)
▶ Approximate the classifier h⋆
▶ Generative models: learn a joint distribution over all the
variables.
▶ Approximate P(Y = c, X = x)

h⋆(x) = arg max_c P(Y = c | X = x)
      = arg max_c P(Y = c | X = x) P(X = x)
      = arg max_c P(Y = c, X = x)

▶ Can solve a variety of problems, not only classification

25 / 33
Discriminative vs generative models

26 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

27 / 33
K nearest neighbor - KNN

Estimate P(Y = c|X = x) in the neighborhood of x


▶ Find the k samples in D that are closest to x.
▶ Approximate P(Y = c | X = x) by the proportion of label c among
  those k samples:

  P̂(Y = c | X = x) = ∑_{i=1}^{n} I(x_i ∈ V_k(x)) I(y_i = c) / k

  where V_k(x) is a neighborhood of x containing k data samples from
  D. The numerator is the number of data samples in V_k(x) that are
  labeled c.

h_KNN(x): select the label c that occurs most often among the k data
samples closest to x in D.

28 / 33
K nearest neighbor - KNN

Pseudocode of KNN
1. Compute d(x, x_i), i = 1, 2, . . . , n, where d is a distance metric.
2. Take the k data points with the smallest distances in the computed
   list, along with their labels, denoted (x_j, y_j)_{j=1}^{k}.
3. Return the label with the most votes.

Question: What data structure and algorithms should we use in


step 2 (sorting, heap, k-d tree)?
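A minimal NumPy sketch of this procedure (the Euclidean metric and the name
knn_predict are assumptions, not from the slides); np.argpartition performs
the selection in step 2 in linear time, without a full sort.

import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    # 1. distances d(x, x_i) for all training points
    dists = np.linalg.norm(X_train - x, axis=1)
    # 2. indices of the k smallest distances (partial selection, not a sort)
    nearest = np.argpartition(dists, k)[:k]
    # 3. majority vote over the labels of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]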

29 / 33
K nearest neighbor - KNN

Theorem (upper bound on the error rate of KNN with k = 1)

When the number of training samples n → ∞, we have

R⋆ ≤ err_P(h_KNN) ≤ R⋆ ( 2 − (C / (C − 1)) R⋆ )

where R⋆ is the Bayes optimal error probability.

30 / 33
K nearest neighbor - KNN
Proof: Let x_(1) be the nearest neighbor of x in D, y_(1) the label of
this data sample, and y^true the label of x.
Suppose that, as the number of samples n → ∞,

x_(1) → x
P(y | x_(1)) → P(y | x),   ∀y = 1, 2, . . . , C

The error of h_KNN on x occurs when y_(1) ̸= y^true. This probability
is equal to

err(h_KNN, x) = P(y^true ̸= y_(1) | x, x_(1))
              = 1 − ∑_{y=1}^{C} P(y^true = y | x) P(y_(1) = y | x_(1))
              → 1 − ∑_{y=1}^{C} P²(y | x)   (as n → ∞)

31 / 33
K nearest neighbor - KNN
If y⋆ = h⋆(x) is the output of the Bayes optimal classifier, then

P(y⋆ | x) = max_y P(y | x) = 1 − err(h⋆, x)

∑_{y=1}^{C} P²(y | x) = P²(y⋆ | x) + ∑_{y ̸= y⋆} P²(y | x)
                      ≥ P²(y⋆ | x) + ( ∑_{y ̸= y⋆} P(y | x) )² / (C − 1)
                        (Bunyakovsky inequality)
                      = (1 − err(h⋆, x))² + err(h⋆, x)² / (C − 1)

So

1 − ∑_{y=1}^{C} P²(y | x) ≤ 2 err(h⋆, x) − (C / (C − 1)) err(h⋆, x)²   (3)
32 / 33
K nearest neighbor - KNN
Taking the expectation of both sides of (3), with R⋆ = E_x[err(h⋆, x)],
we have

E_x[err(h_KNN, x)] = err(h_KNN) ≤ 2R⋆ − (C / (C − 1)) ∫_x err(h⋆, x)² p(x) dx   (4)

Also,

0 ≤ ∫_x (err(h⋆, x) − R⋆)² p(x) dx
  = ∫_x [ err(h⋆, x)² − 2R⋆ err(h⋆, x) + (R⋆)² ] p(x) dx
  = ∫_x err(h⋆, x)² p(x) dx − (R⋆)²

so ∫_x err(h⋆, x)² p(x) dx ≥ (R⋆)². Substituting into (4),

err(h_KNN) ≤ 2R⋆ − (C / (C − 1)) (R⋆)² = R⋆ ( 2 − (C / (C − 1)) R⋆ ).
33 / 33
