Classification 2024
Asim Tewari, IIT Bombay ME 781: Engineering Data Mining and Applications
What is classification?
• Linear regression models assume that the
response variable Y is quantitative. But in many
situations, the response variable is instead
qualitative.
• For example, eye color is qualitative, taking
on the values blue, brown, or green. Qualitative
variables are often referred to as categorical.
• Approaches for predicting qualitative responses
are known as classification methods.
Asim Tewari, IIT Bombay ME 781: Statistical Machine Learning and Data Mining
The Default data set. Left: The annual incomes and monthly credit card balances of a
number of individuals. The individuals who defaulted on their credit card payments are
shown in orange, and those who did not are shown in blue. Center: Boxplots of balance as a
function of default status. Right: Boxplots of income as a function of default status.
The Logistic Model
• A linear regression model to represent these
probabilities:
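With p(X) = Pr(Y = 1 | X), the linear model referred to here takes the form

```latex
p(X) = \beta_0 + \beta_1 X
```

For some values of X this can produce probabilities below 0 or above 1, which motivates the logistic model.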
For binary response variable Y
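For a binary response coded Y ∈ {0, 1}, the logistic model keeps p(X) = Pr(Y = 1 | X) within [0, 1] by using the logistic function:

```latex
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
```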
The Logistic Model
• The quantity p(X)/[1−p(X)] is called the odds, and can
take on any value between 0 and ∞.
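Rearranging the logistic function gives the odds:

```latex
\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}
```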
The Logistic Model
• We can define log-odds or logit as
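Taking the logarithm of the odds gives the logit, which is linear in X:

```latex
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X
```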
The Logistic Model
• Estimating the Regression Coefficients:
• Likelihood function
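The coefficients β0 and β1 are chosen to maximize the likelihood of the observed labels:

```latex
\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr)
```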
The Logistic Model
Making Predictions
• We predict that the default probability for an
individual with a balance of $1,000 is
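As a sketch, assuming the coefficient estimates reported for the Default data in ISLR (β̂0 ≈ −10.6513 and β̂1 ≈ 0.0055 per dollar of balance), the prediction can be computed directly:

```python
import math

# Assumed coefficient estimates (from the ISLR Default example)
beta0 = -10.6513
beta1 = 0.0055

def predict_default_prob(balance):
    """Logistic model: p(X) = exp(b0 + b1*X) / (1 + exp(b0 + b1*X))."""
    z = beta0 + beta1 * balance
    return math.exp(z) / (1.0 + math.exp(z))

p = predict_default_prob(1000)
print(f"P(default | balance=1000) = {p:.4f}")  # well below 1%
```

Note how the predicted probability stays in (0, 1) for any balance, unlike the linear model.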
The Logistic Model
• Predictors with more than two levels: qualitative
predictors with more than two levels can be used in
the logistic regression model via the dummy
variable approach.
• Multiple Logistic Regression
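With p predictors X = (X1, ..., Xp), multiple logistic regression generalizes the logit to:

```latex
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
```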
Classification of response variable
with more than two classes
• Why Not Linear Regression?
• We could consider encoding these values as a
quantitative response variable, Y , as follows:
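For instance, with three generic classes A, B, and C (placeholders for whatever classes are at hand), one might code

```latex
Y = \begin{cases} 1 & \text{if class A} \\ 2 & \text{if class B} \\ 3 & \text{if class C} \end{cases}
```

The difficulty is that this coding imposes an ordering and equal spacing on the classes; a different assignment of numbers would yield a different linear model and different predictions, so linear regression is unsuitable unless the response is genuinely ordered.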
Classification of response variable
with more than two classes
• The Logistic Model
– The model is specified in terms of K − 1 log-odds or logit transformations (reflecting the constraint
that the probabilities sum to one).
– Although the model uses the last class as the denominator in the odds-ratios, the choice of
denominator is arbitrary in that the estimates are equivariant under this choice.
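Writing βk = (βk0, βk1, ..., βkp) for the coefficients of class k, the K − 1 logit transformations take the form:

```latex
\log\frac{\Pr(Y = k \mid X = x)}{\Pr(Y = K \mid X = x)} = \beta_{k0} + \beta_k^{T} x, \qquad k = 1, \ldots, K - 1
```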
Bayes’ Theorem
You are planning a picnic today, but the morning
is cloudy
• Oh no! 50% of all rainy days start off cloudy!
• But cloudy mornings are common (about 40%
of days start cloudy)
• And this is usually a dry month (only 3 of 30
days tend to be rainy, or 10%)
• What is the chance of rain during the day?
Bayes’ Theorem
Bayes’ Theorem is a way of finding a probability when we know certain other
probabilities.
The formula is:
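In terms of two events A and B:

```latex
P(A \mid B) = \frac{P(A)\, P(B \mid A)}{P(B)}
```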
Bayes’ Theorem
We will use Rain to mean rain during the day, and Cloud
to mean cloudy morning.
The chance of Rain given Cloud is written P(Rain|Cloud)
So let's put that in the formula:
P(Rain|Cloud) = P(Rain) P(Cloud|Rain)/P(Cloud)
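Plugging in the numbers from the picnic example (P(Rain) = 0.1, P(Cloud) = 0.4, P(Cloud | Rain) = 0.5):

```python
# Picnic example: chance of rain given a cloudy morning
p_rain = 0.1               # 3 of 30 days tend to be rainy
p_cloud = 0.4              # 40% of days start cloudy
p_cloud_given_rain = 0.5   # 50% of rainy days start off cloudy

p_rain_given_cloud = p_rain * p_cloud_given_rain / p_cloud
print(f"P(Rain | Cloud) = {p_rain_given_cloud:.3f}")  # 0.125, a 12.5% chance
```

A cloudy morning raises the chance of rain from 10% to 12.5%, so the picnic is probably still on.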
Naive Bayes Classifier
• The naive Bayes classifier is a probabilistic classifier based on
Bayes' theorem. Let A and B denote two random events;
then
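For mutually exclusive and exhaustive events A1, ..., An and an event B with P(B) > 0:

```latex
P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\, P(A_j)}
```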
Naive Bayes Classifier
• The relevant events considered in
classification are “object belongs to class i”, or
more briefly, just “i”, and “object has the
feature vector x”, or “x”. Replacing Ai, Aj and B
by these events yields
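Substituting class i for Ai and the feature vector x for B gives the posterior class probability:

```latex
P(i \mid x) = \frac{p(x \mid i)\, P(i)}{\sum_{j} p(x \mid j)\, P(j)}
```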
Naive Bayes Classifier
If the p features in x are stochastically
independent, then
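Under this independence assumption the class-conditional density factorizes over the individual features:

```latex
p(x \mid i) = \prod_{k=1}^{p} p(x_k \mid i)
```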
Naive Bayes Classifier
• Naive Bayes classifier: data of the student
example
Naive Bayes Classifier
So a student who went to class regularly and studied the material will pass
with a probability of 93%, and a deterministic naive Bayes classifier will yield
"passed" for this student and for all other students with the same history.
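The student data table is not reproduced here; as a sketch with hypothetical counts (all numbers below are invented for illustration, not the slide's actual data), the naive Bayes computation looks like this:

```python
# Naive Bayes for a student example with two binary features:
# attended class regularly (A) and studied the material (S).
# All counts below are hypothetical.
n_pass, n_fail = 12, 8                  # class counts -> priors
p_pass, p_fail = n_pass / 20, n_fail / 20

# Estimated per-class feature probabilities (hypothetical)
p_attended = {"pass": 10 / 12, "fail": 3 / 8}
p_studied = {"pass": 11 / 12, "fail": 2 / 8}

# Joint scores under the naive independence assumption
score_pass = p_pass * p_attended["pass"] * p_studied["pass"]
score_fail = p_fail * p_attended["fail"] * p_studied["fail"]

# Posterior probability of passing given (attended, studied)
posterior_pass = score_pass / (score_pass + score_fail)
print(f"P(pass | attended, studied) = {posterior_pass:.2f}")
```

A deterministic classifier simply outputs the class with the larger posterior.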
Naive Bayes Classifier
Naive Bayes classifier: classifier function for the
student example
Linear Discriminant Analysis
• Suppose that we wish to classify an observation into
one of K classes, where K ≥ 2. In other words, the
qualitative response variable Y can take on K possible
distinct and unordered values.
• Let πk represent the overall or prior probability that a
randomly chosen observation comes from the kth
class; this is the probability that a given observation is
associated with the kth category of the response
variable Y .
• Let fk(x) ≡ Pr(X = x | Y = k) denote the density function of
X for an observation that comes from the kth class.
Linear Discriminant Analysis
• Then Bayes’ theorem states that
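With πk and fk as defined above, the posterior probability that an observation X = x belongs to class k is:

```latex
p_k(x) = \Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}
```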
Linear Discriminant Analysis
• Linear Discriminant Analysis for p = 1
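For p = 1, LDA assumes fk is a one-dimensional Gaussian density with class mean μk and variance σk²:

```latex
f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right)
```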
Linear Discriminant Analysis
If all class variances are equal (σ1² = ··· = σK² = σ²), then:
Linear Discriminant Analysis
• This is equivalent to assigning the observation to the
class for which
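Under the equal-variance assumption, the class with the largest posterior is the class with the largest discriminant value:

```latex
\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log \pi_k
```

The discriminant is linear in x, which is what makes the method a "linear" discriminant analysis.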
Linear Discriminant Analysis
• An example with three classes. The observations from each class are drawn from a
multivariate Gaussian distribution with p = 2, with a class-specific mean vector and a
common covariance matrix. Left: Ellipses that contain 95% of the probability for each of the
three classes are shown. The dashed lines are the Bayes decision boundaries. Right: 20
observations were generated from each class, and the corresponding LDA decision
boundaries are indicated using solid black lines. The Bayes decision boundaries are once
again shown as dashed lines.
Linear Discriminant Analysis
Linear Discriminant Analysis for p > 1
To extend LDA to multiple predictors, we assume that
X = (X1, X2, ..., Xp) is drawn from a multivariate Gaussian
(or multivariate normal) distribution, with a class-specific
multivariate mean vector and a common covariance matrix.
Linear Discriminant Analysis
Linear Discriminant Analysis for p > 1
To indicate that a p-dimensional random variable X has a multivariate
Gaussian distribution, we write X ∼ N(μ, Σ). Here E(X) = μ is the mean of X (a
vector with p components), and Cov(X) = Σ is the p × p covariance matrix of X.
Formally, the multivariate Gaussian density is defined as
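The density, and the resulting discriminant that is maximized over k, are:

```latex
f(x) = \frac{1}{(2\pi)^{p/2}\,\lvert \Sigma \rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu)^{T} \Sigma^{-1} (x - \mu)\right)
```

Plugging fk(x) = f(x; μk, Σ) into Bayes' theorem and taking logarithms, the observation is assigned to the class for which

```latex
\delta_k(x) = x^{T} \Sigma^{-1} \mu_k - \tfrac{1}{2}\, \mu_k^{T} \Sigma^{-1} \mu_k + \log \pi_k
```

is largest.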
Linear Discriminant Analysis
The Bayes classifier works by assigning an
observation to the class for which the posterior
probability pk(X) is greatest. In the two-class
case, this amounts to assigning an observation
to the default class if
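For the Default data, with classes Yes and No, this means assigning an observation to the default class when its posterior probability exceeds one half:

```latex
\Pr(\text{default} = \text{Yes} \mid X = x) > 0.5
```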
Classification Confusion Matrix
• Here,
• Class 1 : Positive
• Class 2 : Negative
• Definition of the Terms:
• Positive (P) : Observation is positive (for example: You have Covid).
• Negative (N) : Observation is not positive (for example: You do not have Covid).
• True Positive (TP) : Observation is positive, and is predicted to be positive.
• False Negative (FN) : Observation is positive, but is predicted negative.
• True Negative (TN) : Observation is negative, and is predicted to be negative.
• False Positive (FP) : Observation is negative, but is predicted positive.
Classification Rate / Accuracy

                 Class 1 Predicted     Class 2 Predicted
Class 1 Actual   True Positive (TP)    False Negative (FN)
Class 2 Actual   False Positive (FP)   True Negative (TN)
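Accuracy is the fraction of all observations that are classified correctly:

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```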
                 Class 1 Predicted     Class 2 Predicted
Class 1 Actual   True Positive (TP)    False Negative (FN)
Class 2 Actual   False Positive (FP)   True Negative (TN)

Recall
• The total number of correctly classified positive examples divided by the
total number of actual positive examples.
• High recall indicates the class is correctly recognized (small number of
FN).
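In terms of the confusion-matrix entries:

```latex
\text{Recall} = \frac{TP}{TP + FN}
```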
Precision
• The total number of correctly classified positive examples divided by the
total number of predicted positive examples.
• High precision indicates that an example labeled as positive is indeed
positive (small number of FP).
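In terms of the confusion-matrix entries:

```latex
\text{Precision} = \frac{TP}{TP + FP}
```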
• High recall, low precision: most of the positive examples are correctly
recognized (low FN), but there are a lot of false positives (high FP).
• Low recall, high precision: we miss a lot of positive examples (high FN),
but those we predict as positive are indeed positive (low FP).
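A sketch computing all of these metrics from raw confusion-matrix counts (the counts below are hypothetical):

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, fn = 40, 10   # actual positives: predicted positive / negative
fp, tn = 5, 45    # actual negatives: predicted positive / negative

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)       # fraction of actual positives caught
precision = tp / (tp + fp)    # fraction of predicted positives that were right
f_measure = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"precision={precision:.2f} F={f_measure:.2f}")
```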
F-measure

                 Class 1 Predicted     Class 2 Predicted
Class 1 Actual   True Positive (TP)    False Negative (FN)
Class 2 Actual   False Positive (FP)   True Negative (TN)
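The F-measure is the harmonic mean of precision and recall, so it is high only when both are high:

```latex
F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```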