
IG.3510 - Machine Learning
Lecture 2: Supervised learning: Classification
Part I

Dr. Patricia CONDE-CESPEDES

[email protected]

September 16th, 2024

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 1 / 62


Plan

1 Introduction

2 K-Nearest Neighbors

3 Logistic Regression

4 Linear Discriminant Analysis

5 Other forms of Discriminant Analysis

6 Evaluating the quality of the predictions

7 A Comparison of Classification Methods

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 2 / 62



Introduction

Introduction to classification

In classification, the response Y is a qualitative or categorical variable. Some examples:
In the spam detection problem, the target can take only two values: {"Spam", "mail"}.
If the variable is "Origin", the target takes more than two labels: {"American", "Asian", "African", "European"}.
The goal is:

Classification goal
Given feature vectors X and a qualitative response Y, the goal is to build a classifier function C(X) that takes X as input and predicts a value for Y.

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 4 / 62


Introduction

Example
Goal: predict whether an individual will default on his/her credit card payment, on the basis of annual income and monthly credit card balance.

Figure: individuals who defaulted (orange) and those who did not (blue).
Y: default on credit card payment, based on balance X1 and income X2.
P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 5 / 62
Introduction

Training error and test error in classification

Training error rate: the proportion of misclassified observations in the training set:

$$\frac{1}{n}\sum_{i=1}^{n} 1(y_i \neq \hat{y}_i)$$

where $\hat{y}_i$ is the class predicted by our classifier for the i-th observation and $y_i$ is the true value.

Test error rate: for a given test observation $(x_0, y_0)$, a good classifier has the smallest estimated test error:

$$\text{average}\big(1(y_0 \neq \hat{y}_0)\big)$$
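As a quick illustration, a minimal sketch of this computation in Python (the label vectors here are made up):

```python
import numpy as np

# Hypothetical true labels and classifier predictions for 8 observations
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# (1/n) * sum of 1(y_i != yhat_i): the proportion of misclassified observations
error_rate = np.mean(y_true != y_pred)
print(f"error rate = {error_rate:.3f}")  # 2 mistakes out of 8 -> 0.250
```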

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 6 / 62


Introduction

The Bayes Classifier

In practice, we estimate the conditional probability of Y given X.

Suppose Y has κ categories numbered {1, 2, ..., κ}. Then we want to estimate

$$p_k(x) = P(Y = k \mid X = x), \quad k = 1, 2, \ldots, \kappa.$$

$p_k(x)$ is the conditional probability of class k at value x.
The test error rate is minimized, on average, by assigning each observation to its most likely class.
Such a classifier is called the Bayes classifier:

$$C(x) = j \quad \text{if} \quad p_j(x) = \max\{p_1(x), p_2(x), \ldots, p_\kappa(x)\}$$

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 7 / 62


Introduction

Bayes error rate in a two-class problem

If there are only 2 classes, the Bayes classifier will choose the class j for which

$$P(Y = j \mid X = x_0) > 0.5$$

Then, the Bayes error rate will be

$$1 - E\Big(\max_j P(Y = j \mid X = x_0)\Big)$$

In practical applications, we do not know the conditional distribution of Y given X. Then, we will have to estimate it!

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 8 / 62



K-Nearest Neighbors

The K-nearest neighbors (KNN) classifier

Given a positive integer K and a test observation x0, the KNN classifier proceeds as follows:
1 It first identifies the K points in the training data that are closest to x0, represented by N0.
2 It then estimates the conditional probability for class j as

$$\hat{p}_j(x_0) = \frac{1}{K}\sum_{i \in N_0} 1(y_i = j),$$

that is, the fraction of points in N0 whose response value is j.
Finally, KNN applies the Bayes rule to classify the test observation x0 (a minimal sketch follows).
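A minimal from-scratch sketch of this procedure (the toy coordinates are assumptions, merely echoing the six-blue/six-orange example on the next slide):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, K=3):
    """Classify a single test point x0 with the KNN rule described above."""
    # 1) identify the K training points closest to x0 (Euclidean distance), i.e. N0
    dist = np.linalg.norm(X_train - x0, axis=1)
    N0 = np.argsort(dist)[:K]
    # 2) estimate p_j(x0) as the fraction of neighbours belonging to each class j
    classes, counts = np.unique(y_train[N0], return_counts=True)
    probs = counts / K
    # 3) Bayes rule: assign x0 to the class with the largest estimated probability
    return classes[np.argmax(probs)], dict(zip(classes, probs))

# Toy training set: six "blue" (0) and six "orange" (1) points with made-up coordinates
X_train = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [3, 1], [1, 3],
                    [4, 4], [4, 5], [5, 4], [5, 5], [6, 4], [4, 6]], dtype=float)
y_train = np.array([0] * 6 + [1] * 6)
print(knn_predict(X_train, y_train, np.array([3.0, 3.0]), K=3))
```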

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 10 / 62


K-Nearest Neighbors

KNN small example


Training data set consisting of six blue and six orange observations.
Goal: make a prediction for the point labeled by the black cross.
On the left, KNN with K = 3 applied to that point; on the right, KNN with K = 3 applied at every point (the test set) and the corresponding KNN decision boundary.
P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 11 / 62
K-Nearest Neighbors

KNN classifier: simple but good!


KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier. The purple dashed line represents the Bayes decision boundary.

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 12 / 62


K-Nearest Neighbors

KNN: the value of K and the flexibility of the model

The flexibility of KNN decreases as K increases.


P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 13 / 62

Logistic Regression

Motivation with binary classification

Let us suppose we want to predict the marital status. Then we have two levels and we can use the binary coding:

$$Y = \begin{cases} 0: \text{No} \\ 1: \text{Yes} \end{cases}$$

We want to estimate a probability E(Y | X = x) = P(Y = 1 | X = x) (because Y is an indicator variable).
What if we perform linear regression?

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 15 / 62


Logistic Regression

Linear regression with a two-level output

Let us suppose we have only one predictor X; then we want to estimate p(X) = P(Y = 1 | X) using a linear regression model:

$$p(X) = \beta_0 + \beta_1 X$$

and classify as "1: Yes" if p̂ > 0.5.

However, this model has some drawbacks:
Why not reverse the coding to {0: Yes, 1: No}? The fit would be different!
Linear regression might produce probability estimates falling outside the interval [0, 1].
The predictions provide an ordering.

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 16 / 62


Logistic Regression

Solution: the logistic function

The logistic function gives outputs between 0 and 1:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}},$$

(e ≈ 2.71828 is Euler's number.)
No matter what values β0, β1 or X take, p(X) will always lie between 0 and 1.

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 17 / 62


Logistic Regression

Linear versus Logistic Regression

Observations in orange; the fitted curve for each model in blue.
For logistic regression, when y = 0, p(X) takes low values, whereas for y = 1, p(X) takes high values.
P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 18 / 62
Logistic Regression

log odds or logit transformation of p(X )


Rewriting, the logistic regression function is equivalent to

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}.$$

The quantity p(X)/(1 − p(X)) is called the odds.
Interpretation: the odds is the ratio between P(Y = 1 | X) and P(Y = 0 | X).
By taking the logarithm of both sides, we get

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X.$$

The left-hand side is called the log-odds or logit.

Interpretation: increasing X by one unit changes the log-odds by β1, or equivalently it multiplies the odds by $e^{\beta_1}$:
if β1 > 0, then increasing X will be associated with increasing p(X), and
if β1 < 0, then increasing X will be associated with decreasing p(X).

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 19 / 62


Logistic Regression

Estimating the coefficients in logistic regression

Y is a Bernoulli random variable, as it can take only two values:

P(Y = 1 | X) = p(x) and P(Y = 0 | X) = 1 − p(x).

Suppose we have a random (training) sample of size n: (y1, x1), ..., (yn, xn). We suppose the observations are independently distributed, so the joint probability of observing the n values of Y is given as

$$\prod_{i=1}^{n} p(x_i)^{y_i}\,(1 - p(x_i))^{1 - y_i} \qquad (1)$$

The joint probability distribution is known in statistics as the likelihood function and will be denoted ℓ(·):

$$\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i: y_i = 0} (1 - p(x_i)) = \prod_{i: y_i = 1} \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \prod_{i: y_i = 0} \left(1 - \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}\right)$$

We estimate β0 and β1 as the values that maximize the likelihood function (a sketch follows).
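A minimal sketch of this maximum-likelihood fit, obtained by minimizing the negative log-likelihood with SciPy on simulated data (the simulated sample and the true coefficients are assumptions, not the lecture's example):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(size=200)
p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))          # true beta0 = 0.5, beta1 = 2.0
y = (rng.uniform(size=200) < p_true).astype(float)   # Bernoulli responses

def neg_log_likelihood(beta):
    b0, b1 = beta
    p = 1 / (1 + np.exp(-(b0 + b1 * x)))
    p = np.clip(p, 1e-12, 1 - 1e-12)                 # avoid log(0)
    # -log l(b0, b1) = -sum_i [ y_i*log p(x_i) + (1 - y_i)*log(1 - p(x_i)) ]
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(res.x)  # estimated (beta0, beta1), close to (0.5, 2.0)
```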

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 20 / 62


Logistic Regression

Example with the Credit data

Consider the Credit dataset: we want to predict the default of a customer (pay or not) according to the balance.
The parameter estimates are β̂0 and β̂1.
Their standard errors measure the accuracy of the coefficient estimates.
The z-statistic plays the same role as the t-statistic in the linear regression output.
A small p-value implies that there is an association between balance and the probability of default.

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 21 / 62


Logistic Regression

Making predictions

What is our estimated probability of default for someone with a balance of $1,000?

$$\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X}} = \frac{e^{-10.6513 + 0.0055 \times 1{,}000}}{1 + e^{-10.6513 + 0.0055 \times 1{,}000}} = 0.006$$

With a balance of $2,000?

$$\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X}} = \frac{e^{-10.6513 + 0.0055 \times 2{,}000}}{1 + e^{-10.6513 + 0.0055 \times 2{,}000}} = 0.586$$
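These two numbers can be reproduced directly; a small sketch plugging in the estimates β̂0 = −10.6513 and β̂1 = 0.0055 quoted above:

```python
import math

def p_hat(balance, b0=-10.6513, b1=0.0055):
    """Estimated P(default = Yes | balance) from the fitted logistic model."""
    z = b0 + b1 * balance
    return math.exp(z) / (1 + math.exp(z))

print(round(p_hat(1000), 3))  # 0.006
print(round(p_hat(2000), 3))  # 0.586
```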

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 22 / 62


Logistic Regression

Logistic regression with qualitative predictors


We can be interested in predicting default based on a categorical variable, for instance the student status.

What is the estimated probability of defaulting for a student (1: Yes, 0: No)?

$$\hat{p}(X) = \hat{p}(\text{default=Yes} \mid x = \text{student}) = \frac{e^{-3.5041 + 0.4049 \times 1}}{1 + e^{-3.5041 + 0.4049 \times 1}} = 0.0431$$

What about for an individual who is not a student?

$$\hat{p}(X) = \hat{p}(\text{default=Yes} \mid x = \text{non-student}) = \frac{e^{-3.5041 + 0.4049 \times 0}}{1 + e^{-3.5041 + 0.4049 \times 0}} = 0.0292$$

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 23 / 62


Logistic Regression

Multiple Logistic Regression

Suppose there is more than one regressor. Analogously to the extension from simple to multiple linear regression, we have

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p,$$

then

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}}$$

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 24 / 62


Logistic Regression

Multiple Logistic Regression example


Consider the Credit data: we want to predict default based on 3 predictors: balance, income and student.

Then, we can make predictions:

For example, a student with a credit card balance of $1,500 and an income of $40K has an estimated probability of default of

$$\hat{p}(X) = \frac{e^{-10.869 + 0.00574 \times 1{,}500 + 0.003 \times 40 - 0.6468 \times 1}}{1 + e^{-10.869 + 0.00574 \times 1{,}500 + 0.003 \times 40 - 0.6468 \times 1}} = 0.058$$

A non-student with the same balance and income has an estimated probability of default of

$$\hat{p}(X) = \frac{e^{-10.869 + 0.00574 \times 1{,}500 + 0.003 \times 40 - 0.6468 \times 0}}{1 + e^{-10.869 + 0.00574 \times 1{,}500 + 0.003 \times 40 - 0.6468 \times 0}} = 0.105$$
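In practice such a model would be fitted with a statistical package; a hedged sketch with scikit-learn, where the synthetic data and the column ordering (balance, income in K$, student) are assumptions rather than the lecture's Credit file:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
balance = rng.uniform(0, 2500, n)
income = rng.uniform(10, 80, n)                 # in K$
student = rng.integers(0, 2, n)                 # 1 = student, 0 = non-student
logit = -10.869 + 0.00574 * balance + 0.003 * income - 0.6468 * student
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([balance, income, student])
# Large C ~ (almost) no regularization, i.e. plain maximum likelihood
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
print(model.intercept_, model.coef_)
print(model.predict_proba([[1500, 40, 1]])[:, 1])  # estimated P(default = Yes) for the student example
```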
P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 25 / 62
Logistic Regression

Logistic regression with more than two classes

What happens if the target variable has more than 2 categories?

We can generalize logistic regression to a κ-level output variable as follows:

$$P(Y = k \mid X) = \frac{e^{\beta_{0k} + \beta_{1k} X_1 + \ldots + \beta_{pk} X_p}}{\sum_{\ell=1}^{\kappa} e^{\beta_{0\ell} + \beta_{1\ell} X_1 + \ldots + \beta_{p\ell} X_p}}, \quad k \in \{1, \ldots, \kappa\}$$

This is also called the softmax function.
In this case there is a linear function for each class except the last one, since all the probabilities sum up to 1. So only κ − 1 linear functions are fitted.
Logistic regression with more than two classes is also referred to as multinomial logistic regression.
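A small sketch of the softmax mapping from the κ linear scores to class probabilities (the score values are made up):

```python
import numpy as np

def softmax(scores):
    """Map linear scores beta_0k + beta_1k*x1 + ... + beta_pk*xp to probabilities summing to 1."""
    z = scores - np.max(scores)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 0.5, -1.0])              # hypothetical scores for kappa = 3 classes
print(softmax(scores), softmax(scores).sum())    # class probabilities; they sum to 1
```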

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 26 / 62



Linear Discriminant Analysis Introduction

Introduction to Discriminant Analysis

In Discriminant Analysis:
We treat the predictors X as random continuous variables and model
the distribution of X in each of the classes separately.
Next, use Bayes theorem to obtain P(Y = k|X = x).

Bayes’ theorem
Let A1 , A2 , . . . , Aκ be a collection of κ mutually exclusive and exhaustive
events with prior probabilities P(Ak ) ∀k ∈ {1, . . . , κ}. Then, given an
event B for which P(B) > 0, the posterior probability of Ak given that B
has occurred is:

$$P(A_k \mid B) = \frac{P(A_k \cap B)}{P(B)} = \frac{P(B \mid A_k)\,P(A_k)}{\sum_{\ell=1}^{\kappa} P(B \mid A_\ell)\,P(A_\ell)}$$

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 28 / 62


Linear Discriminant Analysis Introduction

Bayes’ theorem scheme

$$P(A_k \mid B) = \frac{P(A_k \cap B)}{P(B)} = \frac{P(B \mid A_k)\,P(A_k)}{\sum_{\ell=1}^{5} P(B \mid A_\ell)\,P(A_\ell)}$$

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 29 / 62


Linear Discriminant Analysis Introduction

Bayes theorem for LDA (Linear Discriminant Analysis)


Using the Bayes’ theorem for the classification problem, the probability of
class k given an observation x is:

P(X = x|Y = k)P(Y = k)


P(Y = k|X = x) =
P(X = x)

We will use the following notation:

πk fk (x)
P(Y = k|X = x) = pk (X = x) = Pκ
`=1 π` f` (x)

where:
πk = P(Y = k) represent the overall or prior probability that a
randomly chosen observation comes from the kth class;
fk (x) = P(X = x|Y = k) is the density of X for an observation that
belongs to class k.
P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 30 / 62
Linear Discriminant Analysis Introduction

Visual example: LDA with κ = 2 and p = 1


To simplify, we assume fk (X ) is a normal distribution.
Example: In the case of 2 classes, we classify a new point according to
which density is higher and one explanatory variable X .

On the left, π1 = π2, then compare f1(x) and f2(x).

On the right, π1 ≠ π2, then compare π1 f1(x) and π2 f2(x).
P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 31 / 62
Linear Discriminant Analysis Introduction

Linear Discriminant Analysis when p = 1


We assume the density of X in class k follows a Gaussian density N(μk, σk²):

$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k}\, e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma_k}\right)^2}$$

where:
μk is the mean of X in class k, and
σk² is the variance of X in class k. For now, we assume σ1 = ... = σκ = σ are the same among all the classes.
Plugging this into the Bayes formula, we get for pk(x) = P(Y = k | X = x):

$$p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma_k}\, e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma_k}\right)^2}}{\sum_{\ell=1}^{\kappa} \pi_\ell \frac{1}{\sqrt{2\pi}\,\sigma_\ell}\, e^{-\frac{1}{2}\left(\frac{x - \mu_\ell}{\sigma_\ell}\right)^2}} \qquad (2)$$

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 32 / 62


Linear Discriminant Analysis Introduction

Discriminant functions

The Bayes classifier involves assigning an observation X = x to the class for which pk(x) is largest. Taking logs, and discarding terms that do not depend on k, this is equivalent to assigning x to the class with the largest discriminant score:

$$\delta_k(x) = x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$

Note that δk is a linear function of x. That is where the name Linear Discriminant Analysis (LDA) comes from. (A small sketch follows.)
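A small sketch of this one-dimensional rule, using the parameter values of the example on the following slides (μ1 = −1.25, μ2 = 1.25, σ² = 1, π1 = π2 = 0.5):

```python
import numpy as np

def delta(x, mu_k, sigma2, pi_k):
    """delta_k(x) = x*mu_k/sigma^2 - mu_k^2/(2*sigma^2) + log(pi_k)"""
    return x * mu_k / sigma2 - mu_k**2 / (2 * sigma2) + np.log(pi_k)

mus, pis, sigma2 = [-1.25, 1.25], [0.5, 0.5], 1.0

x0 = 0.4                                               # a test observation
scores = [delta(x0, mu, sigma2, pi) for mu, pi in zip(mus, pis)]
print(np.argmax(scores) + 1)   # predicted class (1 or 2); the boundary is at x = 0 here
```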

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 33 / 62


Linear Discriminant Analysis Introduction

Discriminant functions, example (1/2)

If there are κ = 2 classes and π1 = π2 = 0.5, then the decision boundary is at

$$x = \frac{\mu_1 + \mu_2}{2}$$

That is, δ1(x) = δ2(x) ⇔ x = (μ1² − μ2²)/(2(μ1 − μ2)) = (μ1 + μ2)/2. Then, a test observation x0 will be classified to class 1 if x0 > (μ1 + μ2)/2 (here we suppose μ1 > μ2) and to class 2 otherwise.

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 34 / 62


Linear Discriminant Analysis Introduction

Discriminant functions, example (2/2)


In this example an observation is equally likely to come from either class,
that is, π1 = π2 = 0.5.

The mean and variance parameters for the two density functions are μ1 = −1.25, μ2 = 1.25, and σ1² = σ2² = 1.
The Bayes classifier assigns the observation to class 1 if x < 0 and class 2 otherwise.
P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 35 / 62
Linear Discriminant Analysis Introduction

Estimation of the parameters

In practice, even if we know X is drawn from a Gaussian distribution, the parameters are unknown, therefore we have to estimate them:

$$\hat{\pi}_k = \frac{n_k}{n}$$

$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i$$

$$\hat{\sigma}^2 = \frac{1}{n - \kappa} \sum_{k=1}^{\kappa} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)^2 = \sum_{k=1}^{\kappa} \frac{n_k - 1}{n - \kappa}\, \hat{\sigma}_k^2$$

where $\hat{\sigma}_k^2 = \frac{1}{n_k - 1} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)^2$ is the usual formula for the estimated variance within the kth class, n is the total number of training observations, and nk is the number of training observations in the kth class. σ̂² can be seen as a weighted average of the sample variances of the κ classes.
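A sketch of these estimates computed from a small training sample (the toy values are made up):

```python
import numpy as np

x = np.array([1.2, 0.8, 1.5, 3.9, 4.2, 3.6, 4.0])   # one predictor
y = np.array([1,   1,   1,   2,   2,   2,   2])      # class labels
n, classes = len(x), np.unique(y)
kappa = len(classes)

pi_hat = {k: np.mean(y == k) for k in classes}       # n_k / n
mu_hat = {k: x[y == k].mean() for k in classes}      # class-specific sample means
# pooled variance: (1/(n - kappa)) * sum over classes of sum_{i: y_i = k} (x_i - mu_hat_k)^2
sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum() for k in classes) / (n - kappa)
print(pi_hat, mu_hat, sigma2_hat)
```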

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 36 / 62


Linear Discriminant Analysis Introduction

Discriminant functions, example with estimated parameters


Simulated data and the corresponding histogram for 20 observations from
each class.

On the left, the theoretical Bayes boundary (dashed line); on the right, the decision boundary calculated with the estimates (black solid line).
Since π̂1 = π̂2, the estimated decision boundary corresponds to the midpoint between the sample means for the two classes, (μ̂1 + μ̂2)/2.

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 37 / 62


Linear Discriminant Analysis Introduction

Linear Discriminant Analysis for p > 1

We assume that X = (X1, X2, ..., Xp) is drawn from a multivariate Gaussian or multinormal distribution.

Left: equal variances and zero correlation. Right: different variances and non-zero correlation.

The density can be written

$$f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

where μ ∈ ℝ^p is the mean vector and Σ is the covariance matrix.

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 38 / 62


Linear Discriminant Analysis Introduction

LDA with p > 1 predictors

The LDA classifier assumes that the observations in the kth class are drawn from a multivariate Gaussian distribution N(μk, Σ), where:
μk is the mean vector of X specific to class k, and
Σ is a covariance matrix that is supposed common to all κ classes.
Plugging the density function for the kth class, fk(X = x), into the Bayes formula and a little algebra reveals that the Bayes classifier assigns an observation X = x to the class for which

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\,\mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k)$$

is largest.
Notice that δk(x) = ck0 + ck1 x1 + ck2 x2 + ... + ckp xp is a linear function. That is the reason for the name LDA (Linear Discriminant Analysis).
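In practice these quantities are rarely coded by hand; a hedged sketch with scikit-learn's LinearDiscriminantAnalysis on synthetic Gaussian classes sharing one covariance matrix (the data are an assumption):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
cov = [[1.0, 0.3], [0.3, 1.0]]                        # common covariance matrix
X = np.vstack([rng.multivariate_normal([0, 0], cov, 100),
               rng.multivariate_normal([2, 2], cov, 100)])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)          # estimates pi_k, mu_k and Sigma internally
print(lda.predict([[1.0, 1.0]]), lda.predict_proba([[1.0, 1.0]]))
```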

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 39 / 62


Linear Discriminant Analysis Introduction

Example for p = 2 and κ = 3


Three equally-sized Gaussian classes are shown with class-specific mean
vectors and a common covariance matrix.
The dashed lines represent the theoretical Bayes decision boundaries. So, they represent the set of values x for which δk(x) = δℓ(x) for k ≠ ℓ. There is one line for each pair of classes.

These three Bayes decision boundaries divide the predictor space into three
regions. The Bayes classifier will classify an observation according to the
region in which it is located.
P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 40 / 62
Linear Discriminant Analysis Introduction

Example of estimation for p = 2 and κ = 3


Once more we estimate the unknown parameters μ1, ..., μκ, π1, ..., πκ, and Σ. Given a new observation X = x, LDA calculates δ̂k(x) and classifies to the class for which it is largest.

On the right, the estimated LDA decision boundaries are shown as solid black lines.
Here, n = 60 observations, 20 per class.
P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 41 / 62
Linear Discriminant Analysis Introduction

From δk (x) to probabilities

Once we have the estimates δ̂k(x), we can turn these into estimates for the class probabilities:

$$\hat{P}(Y = k \mid X = x) = \frac{e^{\hat{\delta}_k(x)}}{\sum_{\ell=1}^{\kappa} e^{\hat{\delta}_\ell(x)}}.$$

So classifying to the largest δ̂k(x) amounts to classifying to the class for which P(Y = k | X = x) is the largest.
When κ = 2, classify to class 2 if P̂(Y = 2 | X = x) > 0.5, else to class 1.

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 42 / 62



Other forms of Discriminant Analysis

Quadratic Discriminant Analysis (QDA)


QDA, like LDA, assumes the X are drawn from a multivariate Gaussian distribution. However, unlike LDA, QDA assumes that each class has its own covariance matrix:

If X comes from the kth class, then X ∼ N(μk, Σk).

According to the Bayes classifier, an observation x will be assigned to the class k for which

$$\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) + \log \pi_k - \frac{1}{2}\log|\Sigma_k| \qquad (3)$$
$$= -\frac{1}{2}\, x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1}\mu_k - \frac{1}{2}\,\mu_k^T \Sigma_k^{-1}\mu_k + \log \pi_k - \frac{1}{2}\log|\Sigma_k|$$

is largest.

Notice that this is a quadratic function of x. That is where the name QDA comes from!
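The scikit-learn counterpart with class-specific covariance matrices, sketched on synthetic data whose correlations (0.7 and −0.7) mimic the QDA example later in this section:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(1)
X = np.vstack([rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], 200),
               rng.multivariate_normal([1, 1], [[1, -0.7], [-0.7, 1]], 200)])
y = np.array([0] * 200 + [1] * 200)

qda = QuadraticDiscriminantAnalysis().fit(X, y)   # one covariance matrix estimated per class
print(qda.predict([[0.5, 0.5]]))                  # the fitted decision boundary is quadratic in x
```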
P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 44 / 62
Other forms of Discriminant Analysis

Why to use QDA instead of LDA?

The answer lies in the bias-variance trade-off:

Estimating a covariance matrix implies estimating p(p + 1)/2 parameters for each class, whereas LDA implies estimating only one covariance matrix.
LDA is a much less flexible classifier than QDA, and so has substantially lower variance.
However, if LDA's assumption that the κ classes share a common covariance matrix is wrong, then LDA can suffer from high bias. In that case, QDA would be a better choice.
If n is small, so that reducing variance is crucial, LDA tends to be better than QDA. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern.

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 45 / 62


Other forms of Discriminant Analysis

QDA, example
Observations drawn from two Gaussian classes.

Bayes’ classifier (purple dashed), LDA (black dotted), and QDA (green solid).
Left: common correlation of 0.7 among the two classes.
Right: Orange class has 0.7 correlation, whereas blue class has -0.7
correlation.
P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 46 / 62
Other forms of Discriminant Analysis

Naive Bayes

Bayes' theorem implies:

$$P(Y = k \mid X = x) = p_k(x) = \frac{\pi_k f_k(x)}{\sum_{\ell=1}^{\kappa} \pi_\ell f_\ell(x)}$$

The naive Bayes classifier assumes conditional independence between the feature variables, $f_k(x) = \prod_{j=1}^{p} f_{kj}(x_j)$. For a Gaussian distribution, this means that the Σk are diagonal:

$$\delta_k(x) \propto \log\left(\pi_k \prod_{j=1}^{p} f_{kj}(x_j)\right) = -\frac{1}{2}\sum_{j=1}^{p} \frac{(x_j - \mu_{kj})^2}{\sigma_{kj}^2} + \log \pi_k$$

It is useful when p is large. Despite strong assumptions, naive Bayes often produces good classification results.
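A brief sketch of the Gaussian naive Bayes classifier (diagonal Σk) with scikit-learn; the synthetic features are an assumption:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
# p = 4 features, modelled as independent Gaussians within each class (means shifted by class)
X = np.vstack([rng.normal(loc=0.0, size=(150, 4)), rng.normal(loc=1.0, size=(150, 4))])
y = np.array([0] * 150 + [1] * 150)

nb = GaussianNB().fit(X, y)
print(nb.predict(X[:3]), nb.predict_proba(X[:3]).round(3))
```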

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 47 / 62



Evaluating the quality of the predictions

LDA on the Credit data, the confusion matrix (1/2)

We want to predict whether or not an individual will default on the basis of credit card balance and income. The confusion matrix (reconstructed from the counts quoted on this and the next slide):

                  Predicted No   Predicted Yes   Total
    True No           9644             23         9667
    True Yes           252             81          333

There are (23 + 252)/10000 errors, so a 2.75% training error rate. In contrast, the quantity (81 + 9644)/10000 = 97.25% is called the accuracy!
However:
Only 3.33% of the individuals defaulted. So a trivial classifier that always predicts "not default" will result in an error rate of 3.33%.
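The rates quoted on this and the following slides can be recomputed from the four cells above; a small sketch:

```python
# Cells of the confusion matrix for the Default example (rows = true class)
TN, FP = 9644, 23      # true No:  predicted No / predicted Yes
FN, TP = 252, 81       # true Yes: predicted No / predicted Yes
n = TN + FP + FN + TP

error_rate  = (FP + FN) / n     # (23 + 252)/10000  = 0.0275
accuracy    = (TP + TN) / n     # (81 + 9644)/10000 = 0.9725
sensitivity = TP / (TP + FN)    # true positive rate: 81/333    ~ 0.243
specificity = TN / (TN + FP)    # true negative rate: 9644/9667 ~ 0.998
print(error_rate, accuracy, round(sensitivity, 3), round(specificity, 3))
```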

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 49 / 62


Evaluating the quality of the predictions

LDA on the Credit data, the confusion matrix (2/2)

Of the true No’s, we make 23/9667 = 0.2% errors; of the true Yes’s,
we make 252/333 = 75.7% errors!

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 50 / 62


Evaluating the quality of the predictions

Types of errors

A binary classifier can make two types of errors:

False positive rate: the fraction of negative examples that are classified as positive, 0.2% in the example.
False negative rate: the fraction of positive examples that are classified as negative, 75.7% in the example.

It is often of interest to evaluate class-specific performance.
The bank may be more interested in detecting people who default than people who do not default.

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 51 / 62


Evaluating the quality of the predictions

Confusion matrix

Sensitivity or recall (true positive rate): percentage of true defaulters that are identified, 24.3% in the example.
Specificity (true negative rate): percentage of non-defaulters that are correctly identified, 99.8% in the example.
P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 52 / 62
Evaluating the quality of the predictions

Changing the threshold

LDA produces a low sensitivity because it tries to approximate the Bayes classifier. The Bayes classifier yields the smallest possible total number of misclassified observations, irrespective of which class the errors come from.
LDA assigns an observation to class Yes if:

p̂(Y = Yes | Balance, Income) ≥ 0.5

In contrast, the bank might particularly wish to avoid incorrectly classifying an individual who will default. Why not change this threshold and classify any customer with a posterior probability of default above 20% to the default class?

p̂(Y = Yes | Balance, Income) ≥ 0.2
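A sketch of how such a threshold change looks with any probabilistic classifier; the logistic model and data below are simulated stand-ins, not the lecture's Default fit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))                                   # two synthetic predictors
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-(X[:, 0] + X[:, 1])))).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]          # posterior probability of class "Yes"

pred_050 = (proba >= 0.5).astype(int)       # default threshold: assign to the most likely class
pred_020 = (proba >= 0.2).astype(int)       # lowered threshold: more observations flagged as "Yes"
print(pred_050.sum(), pred_020.sum())       # the 0.2 rule flags at least as many positives
```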

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 53 / 62


Evaluating the quality of the predictions

LDA for Credit data with threshold 0.2

Now the false negative rate has decreased to 41.4%.

However, the false positive rate has increased. As a result, the overall error rate has increased slightly to 3.73%.
We can try different threshold values.

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 54 / 62


Evaluating the quality of the predictions

Varying the threshold


The figure shows the trade-off that results from modifying the threshold value for the posterior probability of default.

As the threshold is reduced, the error rate among individuals who default decreases, but the error rate among the individuals who do not default increases.
P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 55 / 62
Evaluating the quality of the predictions

ROC (Receiver Operating Characteristics) curve


The overall performance of a classifier, summarized over all possible
thresholds, is given by the area under the (ROC) curve (AUC).
The ROC curve plots the sensitivity versus (1-specificity )

Ideally AUC = 1. A classifier that performs no better than chance has an AUC = 0.5.
ROC curves are useful for comparing different classifiers, since they take into account all possible thresholds.
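A sketch of the ROC curve and AUC computation with scikit-learn, on synthetic scores (the data are assumptions):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(4)
y_true = rng.integers(0, 2, 300)
scores = np.where(y_true == 1, rng.normal(1.0, 1.0, 300), rng.normal(0.0, 1.0, 300))

fpr, tpr, thresholds = roc_curve(y_true, scores)          # (1 - specificity) and sensitivity, one point per threshold
print("AUC =", round(roc_auc_score(y_true, scores), 3))   # 0.5 = no better than chance, 1.0 = perfect
```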

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 56 / 62


Evaluating the quality of the predictions

Confusion matrix: F1 score
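The F1 score combines the precision and the recall obtained from the confusion matrix; a brief sketch, assuming the counts of the Default example shown earlier:

```python
TP, FP, FN = 81, 23, 252            # counts from the Default confusion matrix above

precision = TP / (TP + FP)          # fraction of predicted "Yes" that are truly "Yes"
recall    = TP / (TP + FN)          # sensitivity: fraction of true "Yes" that are recovered
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
print(round(precision, 3), round(recall, 3), round(f1, 3))   # ~ 0.779, 0.243, 0.371
```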

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 57 / 62



A Comparison of Classification Methods

Logistic Regression versus LDA

For a two-class problem, one can show that for LDA:

$$\log\left(\frac{p_1(x)}{1 - p_1(x)}\right) = \log\left(\frac{p_1(x)}{p_2(x)}\right) = c_0 + c_1 x_1 + \ldots + c_p x_p$$

Hence both LDA and logistic regression have a linear boundary.

The difference is in how the parameters are estimated.
LDA assumes that the observations are drawn from a Gaussian distribution with a common covariance matrix in each class, so it is preferable over logistic regression when this assumption approximately holds. Conversely, logistic regression can outperform LDA if these Gaussian assumptions are not met.
Logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 59 / 62


A Comparison of Classification Methods

Summary

When the true decision boundaries are linear, then the LDA and
logistic regression approaches will tend to perform well.
When the boundaries are moderately non-linear, QDA may give better
results.
For much more complicated decision boundaries, a non-parametric
approach such as KNN can be superior. But the level of smoothness
for a non-parametric approach must be chosen carefully.
LDA is useful when n is small, or the classes are well separated, and
Gaussian assumptions are reasonable. Also when κ > 2.
Naive Bayes is useful when p is very large.

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 60 / 62



References

References

James, Gareth; Witten, Daniela; Hastie, Trevor and Tibshirani, Robert. "An Introduction to Statistical Learning with Applications in R", 2nd edition. New York: Springer Texts in Statistics, 2021. Website: https://hastie.su.domains/ISLR2/ISLRv2_website.pdf
Hastie, Trevor; Tibshirani, Robert and Friedman, Jerome (2009). "The Elements of Statistical Learning (Data Mining, Inference, and Prediction)", 2nd edition. New York: Springer Texts in Statistics. Website: http://statweb.stanford.edu/~tibs/ElemStatLearn/

P. Conde-Céspedes Lecture 2: Classification (Part I) September 16th, 2024 62 / 62
