
INT3405 Machine Learning

Lecture 3 - Classification

Ta Viet Cuong, Le Duc Trong, Tran Quoc Long


TA: Le Bang Giang, Tran Truong Thuy

VNU-UET

2022

1 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

2 / 33
Recap: Logistic Regression - Binary Classification
Data
x ∈ R^d, y ∈ {0, 1}
Example: image S × S −→ d = S^2 = 784 (MNIST)

D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)}

Model
f(x) = w^T x + w_0
Y | X = x ∼ Ber(y | σ(f(x)))

Sigmoid function

σ(z) = 1 / (1 + e^{−z})

Parameter
θ = (w, w_0)

3 / 33
Recap: Logistic Regression - Binary cross entropy loss

Training
Using the MLE principle, compute the likelihood

L(w, w_0) = P(D) = ∏_{i=1}^{n} P(y_i | x_i) = ∏_{i=1}^{n} µ_i^{y_i} (1 − µ_i)^{1−y_i}

where µ_i = σ(f(x_i)).


Negative log-likelihood (NLL)

ℓ(w, w_0) = − log L(w, w_0) = ∑_{i=1}^{n} [ −y_i log µ_i − (1 − y_i) log(1 − µ_i) ]

⇒ the binary cross-entropy (BCE) loss function

4 / 33
Recap: Training Algorithm - Gradient Descent

TrainLR-GD(D, λ) → (w, w_0):
1. Initialize: w = 0 ∈ R^d, w_0 = 0
2. Loop epoch = 1, 2, . . .
   a. Compute ℓ(w, w_0)
   b. Compute the gradients ∇_w ℓ, ∇_{w_0} ℓ
   c. Update the parameters

      w ← w − λ ∇_w ℓ(w, w_0)
      w_0 ← w_0 − λ ∇_{w_0} ℓ(w, w_0)

3. Stop when:
   a. the number of epochs is large enough,
   b. the decrease in the loss becomes negligible, or
   c. the gradients are small enough: ∥∇_w ℓ∥, |∇_{w_0} ℓ|
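A minimal NumPy sketch of this training loop, assuming full-batch gradient
descent on the BCE loss and stopping rule 3b; the function name train_lr_gd
and the tolerance tol are illustrative, not from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_lr_gd(X, y, lr=0.1, epochs=1000, tol=1e-6):
    # X: (n, d) features, y: (n,) labels in {0, 1}
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    prev_loss = np.inf
    for epoch in range(epochs):
        mu = sigmoid(X @ w + w0)              # µ_i = σ(f(x_i))
        eps = 1e-12                           # guard against log(0)
        loss = -np.sum(y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps))
        grad_w = X.T @ (mu - y)               # ∇_w ℓ
        grad_w0 = np.sum(mu - y)              # ∇_{w_0} ℓ
        w -= lr * grad_w
        w0 -= lr * grad_w0
        if abs(prev_loss - loss) < tol:       # stop: negligible decrease
            break
        prev_loss = loss
    return w, w0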

5 / 33
Multi-class logistic regression model

Given a label set Y = {1, 2, . . . , C }


Categorical distribution
A random variable Y ∼ Cat(y | θ_1, θ_2, . . . , θ_C) means that

P(Y = c) = θ_c,   c = 1, 2, . . . , C

where θ_c is the probability of category c and ∑_c θ_c = 1.

Example: a six-sided die with C = 6, θ_c = 1/6, ∀c.

Equivalently,

P(Y = y) = ∏_{c=1}^{C} θ_c^{I(c=y)}

where I(c = y) is the indicator function denoting whether
c = y or not.
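As a small illustration (not from the slides), the following sketch samples
from a categorical distribution and checks the indicator-product form of
P(Y = y); the values of θ are arbitrary.

import numpy as np

theta = np.array([0.2, 0.5, 0.3])          # θ_c for C = 3, sums to 1
y = np.random.choice(len(theta), p=theta)  # sample Y ∼ Cat(θ)

one_hot = np.eye(len(theta))[y]            # I(c = y) for c = 1, . . . , C
p_y = np.prod(theta ** one_hot)            # ∏_c θ_c^{I(c = y)}
assert np.isclose(p_y, theta[y])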

6 / 33
Example Dataset with Categorical labels
Iris dataset:
▶ Number of Instances: 150
▶ Number of Attributes: 4 (sepal length/width in centimeters,
petal length/width in centimeters)
▶ Number of classes: 3

7 / 33
Example Dataset with Categorical labels
MNIST dataset:
▶ Number of Instances: 60,000 training images and 10,000
testing images.
▶ Number of Attributes: 28 × 28
▶ Number of classes: 10

Figure: Sample images from MNIST test dataset (source: Wikipedia)

8 / 33
Multi-class logistic regression model
Model
Linear functions with parameters w_c ∈ R^d, w_{c0} ∈ R, c = 1, 2, . . . , C

f_c(x) = w_c^T x + w_{c0}

Apply the softmax function to convert the linear outputs to probabilities

S(z_1, z_2, . . . , z_C) = [ e^{z_c} / ∑_{c'=1}^{C} e^{z_{c'}} ]_{c=1,2,...,C}

Probability model

Y | X = x ∼ Cat(y | S(f_1(x), f_2(x), . . . , f_C(x)))

where f_c(x) is called a logit. With logistic regression, the f_c(x) are
(simple) linear functions; more sophisticated methods use neural networks.
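A minimal NumPy sketch of this model's forward pass, assuming the class
weights are stacked as rows of W (shape (C, d)) with intercepts w0 (shape
(C,)); the names softmax and predict_proba are illustrative.

import numpy as np

def softmax(Z):
    # subtract the row-wise max for numerical stability (result unchanged)
    Z = Z - np.max(Z, axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / np.sum(E, axis=-1, keepdims=True)

def predict_proba(X, W, w0):
    # X: (n, d)  ->  probabilities of shape (n, C)
    logits = X @ W.T + w0              # f_c(x) = w_c^T x + w_{c0}
    return softmax(logits)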
9 / 33
Multi-class logistic regression model

Inference
h_LR(x): choose the class c that maximizes the predicted probability,
h_LR(x) = arg max_c P(Y = c | X = x)

Figure: Multi-class logistic regression with Softmax

10 / 33
Multi-class logistic regression model
Training
The likelihood of the parameters with respect to the dataset
D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} is

L(W) = P(D) = ∏_{i=1}^{n} P(Y = y_i | x_i) = ∏_{i=1}^{n} ∏_{c=1}^{C} µ_{ic}^{y_{ic}}

where µ_{ic} is computed as

µ_{ic} = e^{f_c(x_i)} / ∑_{c'=1}^{C} e^{f_{c'}(x_i)}

and y_{ic} = I(c = y_i) is the one-hot encoding of the labels.
11 / 33
Multi-class logistic regression model

Loss function: negative log-likelihood (NLL)

ℓ(W) = − log L(W) = − ∑_{i=1}^{n} ∑_{c=1}^{C} y_{ic} log µ_{ic}

which is also called the cross-entropy (CE) loss function; it measures the
Kullback–Leibler (KL) divergence between the distributions y_{i·} and µ_{i·}.
Recall that the KL divergence between two distributions P and Q is

D_KL(P ∥ Q) = ∑_{x∈X} P(x) log ( P(x) / Q(x) )
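As a small illustration (not from the slides), the CE loss can be computed
directly from the one-hot labels y_{ic} and the model probabilities µ_{ic},
e.g. the output of the hypothetical predict_proba above; eps guards against
log(0).

import numpy as np

def cross_entropy(Y_onehot, Mu, eps=1e-12):
    # Y_onehot: (n, C) one-hot labels; Mu: (n, C) predicted probabilities
    return -np.sum(Y_onehot * np.log(Mu + eps))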

12 / 33
Multi-class logistic regression model

13 / 33
Multi-class logistic regression model

Loss function: negative log-likelihood (NLL)

ℓ(W) = − log L(W) = − ∑_{i=1}^{n} ∑_{c=1}^{C} y_{ic} log µ_{ic}

which is also called the cross-entropy (CE) loss function; it measures the
Kullback–Leibler (KL) divergence between the distributions y_{i·} and µ_{i·}.
Question: Which KL divergence does the NLL correspond to, E[D_KL(y ∥ µ)] or
E[D_KL(µ ∥ y)]? Can we reverse the order? Why?

14 / 33
Multi-class logistic regression model

Gradient descent
The gradients of the loss function w.r.t. the parameters are

∇_{w_c} ℓ(W) = ∑_{i=1}^{n} (µ_{ic} − y_{ic}) x_i

∇_{w_{c0}} ℓ(W) = ∑_{i=1}^{n} (µ_{ic} − y_{ic})
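A minimal NumPy sketch of these gradients (the softmax is inlined so the
snippet is self-contained; names are illustrative). One gradient-descent step
would then be W ← W − λ · grad_W and w0 ← w0 − λ · grad_w0.

import numpy as np

def grad_ce(X, Y_onehot, W, w0):
    # X: (n, d), Y_onehot: (n, C) one-hot labels, W: (C, d), w0: (C,)
    logits = X @ W.T + w0
    logits -= np.max(logits, axis=1, keepdims=True)   # numerical stability
    Mu = np.exp(logits) / np.sum(np.exp(logits), axis=1, keepdims=True)
    R = Mu - Y_onehot               # residuals (µ_{ic} − y_{ic})
    grad_W = R.T @ X                # ∇_{w_c} ℓ, one row per class
    grad_w0 = R.sum(axis=0)         # ∇_{w_{c0}} ℓ
    return grad_W, grad_w0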

15 / 33
Multi-class classification

Problem statement
Given a dataset D = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )}, xi ∈ X and
yi ∈ Y = {1, 2, . . . , C }, we want to find a classifier h(x) : X → Y

Statistical perspective

The dataset D is sampled from an unknown distribution P(x, y).
A 'good' classifier h has a low probability of making an error under P.
Given a sample (X, Y) ∼ P, the event that X is misclassified by h is

{h(X) ̸= Y}

and the error probability (error rate) of h is

err_P(h) = P_{(X,Y)∼P}{h(X) ̸= Y}   (1)

16 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

17 / 33
Multi-class classification
Estimate the error rate
Since P is unknown, the true error (1) is not directly available.
However, we have the dataset D as a realization of P. The
empirical error rate on the training data D is then
err̂_D(h) = (1/n) ∑_{i=1}^{n} I(h(x_i) ̸= y_i)   (2)

Expectation of empirical error rate


The empirical error rate (2) is an unbiased estimator of the true
error rate (1) under P since
E_P[err̂_D(h)] = (1/n) ∑_{i=1}^{n} E_{(x_i, y_i)∼P}[I(h(x_i) ̸= y_i)]
              = P{h(X) ̸= Y} = err_P(h)
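For example, a one-line sketch of the empirical error rate (2), assuming h
maps an array of inputs to an array of predicted labels (names illustrative):

import numpy as np

def empirical_error(h, X, y):
    # err̂_D(h): fraction of samples in D = (X, y) misclassified by h
    return np.mean(h(X) != y)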

18 / 33
Bounding Empirical Error Rate

Concentration bound
Reminder (Hoeffding's inequality for the empirical mean): let
X_1, X_2, . . . , X_n be i.i.d. samples of a random variable X,
bounded a.s. in [a, b]. Then

P{ | (1/n) ∑_{i=1}^{n} X_i − E[X] | ≤ ϵ } ≥ 1 − 2 exp( −2nϵ² / (b − a)² )

Connection to confidence intervals: bound the error range ϵ at
significance level α (the chance of a wrong estimate), given the
required number of samples n.

19 / 33
Bounding Empirical Error Rate

Concentration bound
Applying Hoeffding's inequality to the empirical error rate, with
X_i = I(h(x_i) ̸= y_i) indicator random variables taking values in
{0, 1} (so b − a = 1), we get a concentration bound on the difference
between the empirical and true error rates:

P{ |err̂_D(h) − err_P(h)| ≤ ϵ } ≥ 1 − 2 e^{−2nϵ²}

▶ This estimates how close err̂_D(h) is to err_P(h), given ϵ and the
  number of samples n (see the worked example below).
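A hedged worked example (the numbers are illustrative, not from the slides):
requiring 2 e^{−2nϵ²} ≤ α gives n ≥ ln(2/α) / (2ϵ²). For ϵ = 0.05 and
α = 0.05 this is n ≥ ln(40) / 0.005 ≈ 738 samples.

import math

eps, alpha = 0.05, 0.05                          # illustrative choices
n_needed = math.log(2 / alpha) / (2 * eps**2)    # ≈ 737.8, so n ≥ 738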

20 / 33
Bounding Empirical Error Rate

21 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

22 / 33
The optimal classifier
For fixed X = x, the probability of the error event is

P{h(X) ̸= Y | X = x} = 1 − P(Y = h(x) | X = x)

Question: If X = x is fixed, which of the values 1 to C should h(x)
be to minimize the probability of error, given the posterior

[ P(Y = 1 | X = x), P(Y = 2 | X = x), . . . , P(Y = C | X = x) ] ?

Bayes optimal classifier: the classifier that chooses the most
probable class,

h⋆(x) = arg max_c P(Y = c | X = x)
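For example (with illustrative numbers): if the posterior at x is
(0.2, 0.5, 0.3) for C = 3, then h⋆(x) = 2 and the conditional error
probability is 1 − 0.5 = 0.5; any other choice of h(x) gives a larger error.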

23 / 33
The optimal classifier

Optimal error probability

E⋆ = err_P(h⋆) ≤ err_P(h),   ∀h

Example: a logistic regression model uses the form of the Bayes
optimal classifier, with P(Y = c | X = x) given by the probabilities
of the categorical distribution Cat(y | S(f_1(x), f_2(x), . . . , f_C(x))).

Question: Is it possible to find the Bayes optimal classifier h⋆?

▶ Can we find the distribution P?


▶ Can we approximate P(Y = c|X = x) to approximate h⋆ ?

24 / 33
The optimal classifier

Two approaches in machine learning


▶ Discriminative models: learn a predictor given the
observations
▶ Approximate P(Y = c|X = x)
▶ Approximate the classifier h⋆
▶ Generative models: learn a joint distribution over all the
variables.
▶ Approximate P(Y = c, X = x)

h⋆(x) = arg max_c P(Y = c | X = x)
      = arg max_c P(Y = c | X = x) P(X = x)
      = arg max_c P(Y = c, X = x)

▶ Can solve a variety of problems, not only classification

25 / 33
Discriminative vs generative models

26 / 33
Table of contents

Multi-class logistic regression model

Multi-class classification

The optimal classifier

K nearest neighbor - KNN

27 / 33
K nearest neighbor - KNN

Estimate P(Y = c|X = x) in the neighborhood of x


▶ Find the k samples in D that are closest to x.
▶ Approximate P(Y = c | X = x) by the proportion of label c among
  those k samples:

  P̂(Y = c | X = x) = ∑_{i=1}^{n} I(x_i ∈ V_k(x)) I(y_i = c) / k

  where V_k(x) is a neighborhood of x containing k data samples from
  D. The numerator is the number of data samples in V_k(x) that are
  labeled c.

h_KNN(x): select the label c that occurs most often among the k data
samples closest to x in D.

28 / 33
K nearest neighbor - KNN

Pseudocode of KNN
1. Compute d(x, x_i), i = 1, 2, . . . , n, where d is a distance metric.
2. Take the k data points with the smallest distances in the computed
   list, along with their labels, denoted (x_j, y_j)_{j=1}^{k}.
3. Return the label with the most votes.

Question: What data structure and algorithms should we use in


step 2 (sorting, heap, k-d tree)?
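A minimal NumPy sketch of this procedure (the Euclidean metric and the name
knn_predict are assumptions, not from the slides); np.argpartition performs
the selection in step 2 in linear time, without a full sort.

import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    # 1. distances d(x, x_i) for all training points
    dists = np.linalg.norm(X_train - x, axis=1)
    # 2. indices of the k smallest distances (partial selection, not a sort)
    nearest = np.argpartition(dists, k)[:k]
    # 3. majority vote over the labels of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]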

29 / 33
K nearest neighbor - KNN

Theorem (upper bound on the error rate of KNN with k = 1)

When the number of training samples n → ∞, we have

R⋆ ≤ err_P(h_KNN) ≤ R⋆ ( 2 − (C / (C − 1)) R⋆ )

where R⋆ is the Bayes optimal error probability.

30 / 33
K nearest neighbor - KNN
Proof: Let x_(1) be the nearest neighbor of x in D, y_(1) the label of
this data sample, and y^true the label of x.
Suppose that, as the number of samples n → ∞,

x_(1) → x
P(y | x_(1)) → P(y | x),   ∀y = 1, 2, . . . , C

The error of h_KNN on x occurs when y_(1) ̸= y^true. This probability
is equal to

err(h_KNN, x) = P(y^true ̸= y_(1) | x, x_(1))
              = 1 − ∑_{y=1}^{C} P(y^true = y | x) P(y_(1) = y | x_(1))
              → 1 − ∑_{y=1}^{C} P²(y | x)   (as n → ∞)

31 / 33
K nearest neighbor - KNN
If y⋆ = h⋆(x) is the output of the Bayes optimal classifier, then

P(y⋆ | x) = max_y P(y | x) = 1 − err(h⋆, x)

∑_{y=1}^{C} P²(y | x) = P²(y⋆ | x) + ∑_{y ̸= y⋆} P²(y | x)
                      ≥ P²(y⋆ | x) + ( ∑_{y ̸= y⋆} P(y | x) )² / (C − 1)
                        (Bunyakovsky inequality)
                      = (1 − err(h⋆, x))² + err(h⋆, x)² / (C − 1)

So

1 − ∑_{y=1}^{C} P²(y | x) ≤ 2 err(h⋆, x) − (C / (C − 1)) err(h⋆, x)²   (3)
32 / 33
K nearest neighbor - KNN
Taking the expectation of both sides of (3), with R⋆ = E_x[err(h⋆, x)],
we have

E_x[err(h_KNN, x)] = err(h_KNN) ≤ 2R⋆ − (C / (C − 1)) ∫_x err(h⋆, x)² p(x) dx   (4)

Also,

0 ≤ ∫_x (err(h⋆, x) − R⋆)² p(x) dx
  = ∫_x [ err(h⋆, x)² − 2R⋆ err(h⋆, x) + (R⋆)² ] p(x) dx
  = ∫_x err(h⋆, x)² p(x) dx − (R⋆)²

so ∫_x err(h⋆, x)² p(x) dx ≥ (R⋆)². Substituting into (4),

err(h_KNN) ≤ 2R⋆ − (C / (C − 1)) (R⋆)² = R⋆ ( 2 − (C / (C − 1)) R⋆ ).
33 / 33
