3510-Machine Learning
Lecture 2: Supervised learning: Classification
Part I
Outline
1 Introduction
2 K-Nearest Neighbors
3 Logistic Regression
4 Discriminant Analysis
5 Evaluating the quality of the predictions
6 Summary
Introduction to classification
Example
Goal: predict whether an individual will default on his/her credit card
payment, on the basis of annual income and monthly credit card balance.
Individuals who defaulted are shown in orange and those who did not in blue.
Y: default on the credit card payment, to be predicted from the balance X1 and the income X2.
Introduction
Training error rate: the proportion of mistakes made on the training data,
(1/n) ∑_{i=1}^{n} 1(yi ≠ ŷi),
where ŷi is the predicted class by our classifier for the ith observation and yi is the real value.
Test error rate: for a given test observation (x0, y0), a good classifier will have minimum estimated test error:
average(1(y0 ≠ ŷ0)).
If there are only 2 classes, the Bayes classifier will choose the class j for which P(Y = j | X = x0) is largest, i.e., the class with P(Y = j | X = x0) > 0.5.
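As a small illustration (not part of the original slides), the sketch below computes such an error rate with NumPy; the label vectors are made up.

```python
import numpy as np

# Made-up true labels and classifier predictions, for illustration only.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Error rate = average of the indicator 1(y_i != yhat_i).
error_rate = np.mean(y_true != y_pred)
print(error_rate)  # 2 mistakes out of 8 observations -> 0.25
```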
K-Nearest Neighbors
On the right, KNN for K = 3 applied at every point (the test set) and the corresponding
KNN decision boundary.
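As an illustration (not from the slides), a minimal KNN sketch with scikit-learn on synthetic data; the dataset is made up and K = 3 simply mirrors the figure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the example in the figure.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KNN with K = 3: predict the majority class among the 3 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Test error rate = average(1(y0 != yhat0)) over the held-out observations.
test_error = np.mean(knn.predict(X_test) != y_test)
print(test_error)
```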
Logistic Regression
A first idea is to model the probability directly with a linear model, p(X) = β0 + β1 X, but such a fit can produce values below 0 or above 1.
Logistic regression instead models p(X) with the logistic function:
p(X) = e^(β0 + β1 X) / (1 + e^(β0 + β1 X)),
(e ≈ 2.71828 is Euler's number.)
No matter what values β0, β1 or X take, p(X) will always lie between 0 and 1.
In orange, the observations; in blue, the fitted curve for each model.
For logistic regression, when y = 0, p(X) takes low values, whereas for y = 1, p(X) takes high values.
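A tiny numerical check of the boundedness claim (the coefficients are made up, not estimated from any data):

```python
import numpy as np

beta0, beta1 = -10.0, 3.0              # made-up coefficients
X = np.linspace(-10.0, 10.0, 9)        # a wide range of inputs

# p(X) = e^(b0 + b1 X) / (1 + e^(b0 + b1 X)), written in the equivalent
# and numerically stable form 1 / (1 + e^-(b0 + b1 X)).
p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * X)))

print(p.min() > 0.0, p.max() < 1.0)    # True True: p(X) stays inside (0, 1)
```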
Rearranging the logistic function gives
p(X) / (1 − p(X)) = e^(β0 + β1 X).
The quantity p(X)/(1 − p(X)) is called the odds.
Interpretation: the odds is the ratio between P(Y = 1 | X) and P(Y = 0 | X).
By taking the logarithm of both sides, we get:
log( p(X) / (1 − p(X)) ) = β0 + β1 X,
so the log-odds (or logit) is linear in X.
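A small numerical illustration of the odds interpretation, with made-up coefficients: since the odds equal e^(β0 + β1 X), increasing X by one unit multiplies the odds by e^β1.

```python
import numpy as np

beta0, beta1 = -10.0, 0.0055          # made-up coefficients, for illustration

def p(x):
    """Logistic model p(x) = e^(b0 + b1 x) / (1 + e^(b0 + b1 x))."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

def odds(x):
    """Odds p(x) / (1 - p(x))."""
    return p(x) / (1.0 - p(x))

x = 1500.0
# The odds ratio for a one-unit increase in x equals e^beta1.
print(odds(x + 1) / odds(x), np.exp(beta1))   # both are about 1.0055
```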
Making predictions
Once the coefficients have been estimated, predictions are obtained by plugging the estimates β̂0, β̂1 into the logistic function to get p̂(X).
With several predictors X1, . . . , Xp, the model extends in the same way; then
p(X) = e^(β0 + β1 X1 + ... + βp Xp) / (1 + e^(β0 + β1 X1 + ... + βp Xp)).
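A minimal scikit-learn sketch (synthetic data, not the Default data from the slides) of fitting a logistic regression with several predictors and recovering the same probabilities directly from the fitted coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with p = 3 predictors.
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns [P(Y=0|X), P(Y=1|X)] per row; column 1 is p(X).
p_hat = clf.predict_proba(X)[:, 1]

# The same probabilities computed from the coefficients:
# p(X) = e^(b0 + b1 X1 + ... + bp Xp) / (1 + e^(b0 + b1 X1 + ... + bp Xp)).
z = clf.intercept_[0] + X @ clf.coef_[0]
p_manual = np.exp(z) / (1 + np.exp(z))
print(np.allclose(p_hat, p_manual))   # True
```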
Discriminant Analysis
In Discriminant Analysis:
We treat the predictors X as random continuous variables and model the distribution of X in each of the classes separately.
Next, we use Bayes' theorem to obtain P(Y = k | X = x).
Bayes' theorem
Let A1, A2, . . . , Aκ be a collection of κ mutually exclusive and exhaustive events with prior probabilities P(Ak), ∀k ∈ {1, . . . , κ}. Then, given an event B for which P(B) > 0, the posterior probability of Ak given that B has occurred is:
P(Ak | B) = P(B | Ak) P(Ak) / ∑_{ℓ=1}^{κ} P(B | Aℓ) P(Aℓ).
In the classification setting this becomes:
P(Y = k | X = x) = pk(x) = πk fk(x) / ∑_{ℓ=1}^{κ} πℓ fℓ(x)
where:
πk = P(Y = k) represents the overall or prior probability that a randomly chosen observation comes from the kth class;
fk(x) = P(X = x | Y = k) is the density of X for an observation that belongs to class k.
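A short NumPy sketch of this posterior formula, for a made-up one-dimensional example with κ = 2 Gaussian classes (the priors, means and standard deviations are invented):

```python
import numpy as np

priors = np.array([0.7, 0.3])      # pi_k = P(Y = k), made up
means = np.array([-1.25, 1.25])    # class means, made up
sds = np.array([1.0, 1.0])         # class standard deviations, made up

def gaussian_pdf(x, mu, sd):
    """Normal density, standing in for the class densities f_k(x)."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def posterior(x):
    """P(Y = k | X = x) = pi_k f_k(x) / sum_l pi_l f_l(x)."""
    f = gaussian_pdf(x, means, sds)
    return priors * f / np.sum(priors * f)

print(posterior(0.0))              # the two posterior probabilities, summing to 1
```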
Linear Discriminant Analysis
Discriminant functions
With a single predictor (p = 1) and Gaussian class densities with means µk and a common variance σ², the classifier assigns an observation x to the class for which
δk(x) = x µk/σ² − µk²/(2σ²) + log(πk)
is largest.
Remark that δk is a linear function of x. That is where the name Linear Discriminant Analysis (LDA) comes from.
With two classes and π1 = π2, x0 will be classified to class 1 if x0 > (µ1 + µ2)/2 (here we suppose µ1 > µ2) and to class 2 otherwise.
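A sketch of these one-dimensional discriminant scores, using the made-up parameters of the figure described below (µ1 = −1.25, µ2 = 1.25, σ² = 1, equal priors):

```python
import numpy as np

mu = np.array([-1.25, 1.25])   # class means (as in the figure)
sigma2 = 1.0                   # common variance
pi = np.array([0.5, 0.5])      # equal prior probabilities

def delta(x):
    """delta_k(x) = x mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)."""
    return x * mu / sigma2 - mu ** 2 / (2 * sigma2) + np.log(pi)

# Assign x to the class with the largest discriminant;
# with these parameters the boundary is (mu1 + mu2)/2 = 0.
for x0 in (-2.0, -0.1, 0.1, 2.0):
    print(x0, "-> class", np.argmax(delta(x0)) + 1)
```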
The mean and variance parameters for the two density functions are µ1 = −1.25, µ2 = 1.25, and σ1² = σ2² = 1.
The Bayes classifier assigns the observation to class 1 if x < 0 and to class 2 otherwise.
On the left, the theoretical Bayes boundary (dashed line); on the right, the decision boundary calculated from the estimates (black solid line).
Since π̂1 = π̂2, the decision boundary corresponds to the midpoint between the sample means for the two classes, (µ̂1 + µ̂2)/2.
Left: equal variances and zero correlation; right: different variances and nonzero correlation.
The density of a multivariate Gaussian N(µ, Σ) can be written: f(x) = (1 / ((2π)^(p/2) |Σ|^(1/2))) exp(−(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ)).
The LDA classifier assumes that the observations in the kth class are drawn from a multivariate Gaussian distribution N(µk, Σ), where:
µk is the mean vector of X specific to class k, and
Σ is a covariance matrix that is supposed common to all κ classes.
Plugging the density function for the kth class, fk(X = x), into the Bayes formula, a little algebra reveals that the Bayes classifier assigns an observation X = x to the class for which:
δk(x) = xᵀ Σ⁻¹ µk − (1/2) µkᵀ Σ⁻¹ µk + log(πk)
is largest.
Notice that δk(x) = ck0 + ck1 x1 + ck2 x2 + . . . + ckp xp is a linear function.
That is the reason for the name LDA (Linear Discriminant Analysis).
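A NumPy sketch of this multivariate discriminant, with made-up parameters (two classes, p = 2, a shared covariance matrix Σ):

```python
import numpy as np

mu = np.array([[0.0, 0.0], [2.0, 2.0]])     # class mean vectors mu_k (made up)
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])  # common covariance matrix (made up)
pi = np.array([0.5, 0.5])                   # prior probabilities

Sigma_inv = np.linalg.inv(Sigma)

def delta(x):
    """delta_k(x) = x^T Sigma^-1 mu_k - (1/2) mu_k^T Sigma^-1 mu_k + log(pi_k)."""
    return np.array([x @ Sigma_inv @ m - 0.5 * m @ Sigma_inv @ m + np.log(p)
                     for m, p in zip(mu, pi)])

x = np.array([1.5, 0.5])
print(delta(x), "-> class", np.argmax(delta(x)))
```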
In an example with three classes, the three Bayes decision boundaries divide the predictor space into three regions. The Bayes classifier classifies an observation according to the region in which it is located.
On the right, the estimated LDA decision boundaries are shown as solid black lines.
Here, n = 60 observations, 20 per class.
Once we have the estimates δ̂k(x), we can turn these into estimates for class probabilities:
P̂(Y = k | X = x) = e^(δ̂k(x)) / ∑_{ℓ=1}^{κ} e^(δ̂ℓ(x)).
So classifying to the largest δ̂k(x) amounts to classifying to the class for which P̂(Y = k | X = x) is largest.
When κ = 2, classify to class 2 if P̂(Y = 2 | X = x) > 0.5, else to class 1.
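In practice these quantities can be estimated with scikit-learn, whose LinearDiscriminantAnalysis exposes both the predicted class and the estimated posterior probabilities; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic three-class data, for illustration only.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X, y)

# predict() returns the class with the largest estimated posterior,
# predict_proba() returns the estimated P(Y = k | X = x) for each class.
print(lda.predict(X[:3]))
print(lda.predict_proba(X[:3]).round(3))
```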
Quadratic Discriminant Analysis
QDA assumes that the observations from the kth class are drawn from N(µk, Σk), where each class has its own covariance matrix Σk. The classifier then assigns an observation X = x to the class for which
δk(x) = −(1/2) (x − µk)ᵀ Σk⁻¹ (x − µk) + log(πk) − (1/2) log|Σk|
      = −(1/2) xᵀ Σk⁻¹ x + xᵀ Σk⁻¹ µk − (1/2) µkᵀ Σk⁻¹ µk + log(πk) − (1/2) log|Σk|
is largest. Since x appears through a quadratic function, this is called Quadratic Discriminant Analysis (QDA).
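A minimal comparison of LDA and QDA with scikit-learn on made-up data where the two classes have different covariance matrices, the situation QDA is designed for:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)

# Two made-up Gaussian classes with different covariance matrices.
n = 200
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], n)
X1 = rng.multivariate_normal([1, 1], [[1.0, -0.7], [-0.7, 1.0]], n)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

# QDA fits one covariance matrix per class, LDA a single shared one.
for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    model.fit(X, y)
    print(type(model).__name__, np.mean(model.predict(X) != y))
```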
QDA, example
Observations drawn from two Gaussian classes.
Bayes classifier (purple dashed), LDA (black dotted), and QDA (green solid).
Left: common correlation of 0.7 in the two classes.
Right: the orange class has a correlation of 0.7, whereas the blue class has a correlation of −0.7.
Other forms of Discriminant Analysis
Naive Bayes
Starting again from the posterior
P(Y = k | X = x) = pk(x) = πk fk(x) / ∑_{ℓ=1}^{κ} πℓ fℓ(x),
Naive Bayes assumes that, within each class, the p predictors are independent, so the class density factorizes as fk(x) = fk1(x1) fk2(x2) · · · fkp(xp).
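A minimal Gaussian Naive Bayes sketch with scikit-learn (synthetic data): GaussianNB models each predictor with its own univariate Gaussian within each class, which is one common choice for the factors fkj.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic data with many predictors, the regime where Naive Bayes is handy.
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           random_state=0)

nb = GaussianNB().fit(X, y)

# Estimated posteriors P(Y = k | X = x) under the independence assumption.
print(nb.predict_proba(X[:3]).round(3))
print("training error:", (nb.predict(X) != y).mean())
```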
Evaluating the quality of the predictions
Confusion matrix
Of the true No's, we make 23/9667 = 0.2% errors; of the true Yes's, we make 252/333 = 75.7% errors!
Types of errors
As the threshold is reduced, the error rate among individuals who default
decreases, but the error rate among the individuals who do not default
increases.
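A sketch of this trade-off on synthetic, imbalanced data (not the Default data), using scikit-learn's confusion_matrix: lowering the threshold from 0.5 to 0.2 reduces the error rate among true positives at the cost of more errors among true negatives.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Imbalanced synthetic data: roughly 10% positives ("Yes").
X, y = make_classification(n_samples=5000, n_features=5, n_informative=3,
                           weights=[0.9, 0.1], random_state=0)

p_hat = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

for threshold in (0.5, 0.2):
    y_pred = (p_hat >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
    # Error among true No's = fp / (tn + fp); among true Yes's = fn / (fn + tp).
    print(threshold, round(fp / (tn + fp), 3), round(fn / (fn + tp), 3))
```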
Ideally AUC = 1. A classifier that performs no better than chance has an AUC = 0.5.
ROC curves are useful for comparing different classifiers, since they take into account all possible thresholds.
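A short scikit-learn sketch (synthetic data) computing the ROC curve, which traces the true positive rate against the false positive rate over all thresholds, and the corresponding AUC:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem, for illustration only.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Score the test set with estimated probabilities, not hard class labels.
p_hat = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, p_hat)      # one point per threshold
print("AUC:", roc_auc_score(y_te, p_hat))          # 0.5 = chance, 1 = ideal
```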
Summary
When the true decision boundaries are linear, then the LDA and logistic regression approaches will tend to perform well.
When the boundaries are moderately non-linear, QDA may give better results.
For much more complicated decision boundaries, a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully.
LDA is useful when n is small, or the classes are well separated, and Gaussian assumptions are reasonable. Also when κ > 2.
Naive Bayes is useful when p is very large.