
Classification

Prof. Asim Tewari


IIT Bombay

What is classification?
• Linear regression models assume that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative.
• For example, eye color is qualitative, taking on values blue, brown, or green. Qualitative variables are often referred to as categorical.
• Approaches for predicting qualitative responses are known as classification methods.
The Default data set. Left: The annual incomes and monthly credit card balances of a
number of individuals. The individuals who defaulted on their credit card payments are
shown in orange, and those who did not are shown in blue. Center: Boxplots of balance as a
function of default status. Right: Boxplots of income as a function of default status.
The Logistic Model
• A linear regression model to represent these probabilities:

p(X) = β0 + β1X

• In logistic regression, we use the logistic function:

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))
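As an illustrative sketch (not taken from the slides), the logistic function can be evaluated directly in Python; the coefficient values passed in below are placeholders:

import numpy as np

def logistic(x, beta0, beta1):
    # p(X) = e^(beta0 + beta1*x) / (1 + e^(beta0 + beta1*x)),
    # written in the numerically equivalent form 1 / (1 + e^(-z))
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + np.exp(-z))

print(logistic(1000.0, -10.65, 0.0055))  # output always lies between 0 and 1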
For a binary response variable Y:

• Classification using the Default data. Left: estimated probability of default using linear regression; some estimated probabilities are negative! The orange ticks indicate the 0/1 values coded for default (No or Yes). Right: predicted probabilities of default using logistic regression; all probabilities lie between 0 and 1.
The Logistic Model
• The quantity p(X)/[1 − p(X)] is called the odds, and can take on any value between 0 and ∞.
• Values of the odds close to 0 and ∞ indicate very low and very high probabilities of default, respectively. For example, on average 1 in 5 people with an odds of 1/4 will default, since p(X) = 0.2 implies an odds of 0.2/(1 − 0.2) = 1/4. Likewise, on average nine out of every ten people with an odds of 9 will default, since p(X) = 0.9 implies an odds of 0.9/(1 − 0.9) = 9.
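A minimal sketch of the probability-to-odds correspondence described above:

def odds(p):
    # odds = p / (1 - p); takes values in (0, infinity)
    return p / (1.0 - p)

print(odds(0.2))  # 0.25, i.e. odds of 1/4
print(odds(0.9))  # 9.0,  i.e. odds of 9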
The Logistic Model
• We can define the log-odds or logit as

log[p(X) / (1 − p(X))] = β0 + β1X
The Logistic Model
• Estimating the Regression Coefficients:
• Likelihood function:

ℓ(β0, β1) = ∏_{i: yi = 1} p(xi) · ∏_{i′: yi′ = 0} (1 − p(xi′))

• The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for β0 and β1 such that the predicted probability p̂(xi) of default for each individual corresponds as closely as possible to the individual’s observed default status.
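A brief sketch of maximum-likelihood fitting with scikit-learn. The Default data are not reproduced here, so the balance and defaulted arrays below are synthetic stand-ins, and penalty=None (unregularized maximum likelihood) assumes a recent scikit-learn version:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=1000)              # stand-in predictor
p = 1 / (1 + np.exp(-(-10.65 + 0.0055 * balance)))     # assumed true model
defaulted = (rng.uniform(size=1000) < p).astype(int)   # stand-in 0/1 response

model = LogisticRegression(penalty=None).fit(balance.reshape(-1, 1), defaulted)
print(model.intercept_, model.coef_)  # maximum-likelihood estimates of beta0, beta1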
The Logistic Model

• For the Default data, the estimated coefficients of the logistic regression model that predicts the probability of default using balance are β̂0 = −10.6513 and β̂1 = 0.0055. A one-unit increase in balance is associated with an increase in the log odds of default by 0.0055 units.
The Logistic Model
Making Predictions
• We predict that the default probability for an individual with a balance of $1,000 is

p̂(X) = e^(−10.6513 + 0.0055 × 1000) / (1 + e^(−10.6513 + 0.0055 × 1000)) = 0.00576,

which is below 1%. In contrast, the predicted probability of default for an individual with a balance of $2,000 is much higher, and equals 0.586, or 58.6%.
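A worked check of these two numbers, using the coefficient estimates quoted above:

import math

beta0, beta1 = -10.6513, 0.0055  # estimates from the Default data

def p_hat(balance):
    z = beta0 + beta1 * balance
    return math.exp(z) / (1 + math.exp(z))

print(round(p_hat(1000), 5))  # ~0.00576, below 1%
print(round(p_hat(2000), 3))  # ~0.586, i.e. 58.6%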
The Logistic Model
• Predictors with more than two levels: one can use qualitative predictors (with more than two levels) with the logistic regression model using the dummy variable approach.
• Multiple logistic regression:

log[p(X) / (1 − p(X))] = β0 + β1X1 + · · · + βpXp
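A short sketch of the dummy variable approach with pandas; the data frame and its column names are hypothetical:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "balance": [500, 1200, 1900, 800, 2100, 300],
    "region": ["north", "south", "east", "south", "east", "north"],
    "default": [0, 0, 1, 0, 1, 0],
})
# get_dummies converts the qualitative predictor into 0/1 dummy variables
X = pd.get_dummies(df[["balance", "region"]], drop_first=True)
model = LogisticRegression().fit(X, df["default"])
print(X.columns.tolist())  # ['balance', 'region_north', 'region_south']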
Classification of response variable
with more than two classes
• Why Not Linear Regression?
• We could consider encoding these values (e.g., stroke, drug overdose, epileptic seizure) as a quantitative response variable Y, as follows:

Y = 1 if stroke; 2 if drug overdose; 3 if epileptic seizure

• Unfortunately, this coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure.
Classification of response variable
with more than two classes
• The Logistic Model:

log[Pr(Y = k|X = x) / Pr(Y = K|X = x)] = βk0 + βk1 x1 + · · · + βkp xp,   k = 1, . . . , K − 1

– The model is specified in terms of K − 1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one).
– Although the model uses the last class as the denominator in the odds ratios, the choice of denominator is arbitrary in that the estimates are equivariant under this choice.
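A sketch of the K-class model using scikit-learn, which fits the softmax (multinomial) form of these K − 1 logit equations; the data below are synthetic:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = rng.integers(0, 3, size=300)  # three unordered classes

clf = LogisticRegression().fit(X, y)          # multinomial softmax for K > 2
print(clf.predict_proba(X[:2]).sum(axis=1))   # each row of probabilities sums to one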
Bayes’ Theorem
You are planning a picnic today, but the morning
is cloudy
• Oh no! 50% of all rainy days start off cloudy!
• But cloudy mornings are common (about 40%
of days start cloudy)
• And this is usually a dry month (only 3 of 30
days tend to be rainy, or 10%)
• What is the chance of rain during the day?

Bayes’ Theorem
Bayes’ Theorem is a way of finding a probability when we know certain other probabilities. The formula is:

P(A|B) = P(A) P(B|A) / P(B)

which tells us how often A happens given that B happens, written P(A|B), when we know:
• how often B happens given that A happens, written P(B|A),
• how likely A is on its own, written P(A), and
• how likely B is on its own, written P(B).
Bayes’ Theorem
We will use Rain to mean rain during the day, and Cloud to mean cloudy morning. The chance of Rain given Cloud is written P(Rain|Cloud). So let's put that in the formula:

P(Rain|Cloud) = P(Rain) P(Cloud|Rain) / P(Cloud)

• P(Rain) is the probability of Rain = 10%
• P(Cloud|Rain) is the probability of Cloud, given that Rain happens = 50%
• P(Cloud) is the probability of Cloud = 40%
• P(Rain|Cloud) = 0.1 × 0.5 / 0.4 = 0.125

Or a 12.5% chance of rain. Not too bad, let's have a picnic!
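The same calculation as a two-line check:

p_rain, p_cloud_given_rain, p_cloud = 0.10, 0.50, 0.40
print(p_rain * p_cloud_given_rain / p_cloud)  # 0.125, i.e. a 12.5% chance of rain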
Naive Bayes Classifier
• The naive Bayes classifier is a probabilistic classifier based on Bayes’ theorem. Let A and B denote two random events; then

p(A|B) = p(B|A) p(A) / p(B)

• If the event A can be decomposed into the disjoint events A1, . . . , Ac, with p(Ai) > 0, i = 1, 2, . . . , c, then

p(Ai|B) = p(B|Ai) p(Ai) / Σ_{j=1}^{c} p(B|Aj) p(Aj)
Naive Bayes Classifier
• The relevant events considered in classification are “object belongs to class i”, or more briefly just “i”, and “object has the feature vector x”, or “x”. Replacing Ai, Aj and B by these events yields

p(i|x) = p(x|i) p(i) / Σ_{j=1}^{c} p(x|j) p(j)
Naive Bayes Classifier
If the p features in x are stochastically independent, then

p(x|i) = p(x1|i) p(x2|i) · · · p(xp|i)

Inserting this into the expression above yields the classification probabilities of the naive Bayes classifier.
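A minimal sketch that implements these formulas by counting. The slide's student table is not reproduced in this text, so the (attended, studied, passed) rows below are hypothetical:

import numpy as np

# Hypothetical rows: (attended class regularly, studied the material, passed)
data = [(1, 1, 1), (1, 0, 1), (0, 1, 0), (0, 0, 0), (1, 1, 1), (0, 1, 1)]
classes = (0, 1)
prior = {c: sum(1 for *_, y in data if y == c) / len(data) for c in classes}

def cond(k, v, c):
    # p(x_k = v | class c), estimated from the data by counting
    rows = [r for r in data if r[-1] == c]
    return sum(1 for r in rows if r[k] == v) / len(rows)

def posterior(x):
    # p(i | x) via Bayes' theorem with the independence assumption
    joint = {c: prior[c] * np.prod([cond(k, v, c) for k, v in enumerate(x)])
             for c in classes}
    z = sum(joint.values())
    return {c: joint[c] / z for c in classes}

print(posterior((1, 1)))  # posterior for a student who attended and studied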
Naive Bayes Classifier
• Naive Bayes classifier: data of the student
example

Naive Bayes Classifier

So, a student who went to class regularly and studied the material will pass with a probability of 93%, and a deterministic naive Bayes classifier will yield “passed” for this student and for all other students with the same history.
Naive Bayes Classifier
Naive Bayes classifier: classifier function for the
student example

Linear Discriminant Analysis
• Suppose that we wish to classify an observation into
one of K classes, where K ≥ 2. In other words, the
qualitative response variable Y can take on K possible
distinct and unordered values.
• Let πk represent the overall or prior probability that a
randomly chosen observation comes from the kth
class; this is the probability that a given observation is
associated with the kth category of the response
variable Y .
• Let fk(x) ≡ Pr(X = x|Y = k) denote the density function of X for an observation that comes from the kth class.
Linear Discriminant Analysis
• Then Bayes’ theorem states that

Pr(Y = k|X = x) = πk fk(x) / Σ_{l=1}^{K} πl fl(x)
Linear Discriminant Analysis
• Linear Discriminant Analysis for p = 1: assume that fk(x) is normal (Gaussian),

fk(x) = (1/(√(2π) σk)) exp(−(x − μk)² / (2σk²))

• Let us further assume that σ1² = · · · = σK²: that is, there is a shared variance σ² common to all K classes.
Linear Discriminant Analysis
If all the variances σk² are equal to a shared σ², then:

pk(x) = πk (1/(√(2π)σ)) exp(−(x − μk)²/(2σ²)) / Σ_{l=1}^{K} πl (1/(√(2π)σ)) exp(−(x − μl)²/(2σ²))
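A sketch of this computation; the means, priors and shared σ below are made up:

import numpy as np

def posteriors_1d(x, mus, pis, sigma):
    # p_k(x) with a shared variance; the common 1/(sqrt(2*pi)*sigma)
    # factor cancels between numerator and denominator
    mus, pis = np.asarray(mus, float), np.asarray(pis, float)
    w = pis * np.exp(-(x - mus) ** 2 / (2 * sigma ** 2))
    return w / w.sum()

print(posteriors_1d(0.0, mus=[-1.25, 1.25], pis=[0.5, 0.5], sigma=1.0))  # [0.5 0.5]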
Linear Discriminant Analysis
• Taking the log and rearranging, this is equivalent to assigning the observation to the class for which

δk(x) = x · (μk/σ²) − μk²/(2σ²) + log(πk)

is largest. For instance, if K = 2 and π1 = π2, then the Bayes classifier assigns an observation to class 1 if 2x(μ1 − μ2) > μ1² − μ2², and to class 2 otherwise. In this case, the Bayes decision boundary corresponds to the point

x = (μ1² − μ2²) / (2(μ1 − μ2)) = (μ1 + μ2)/2
Linear Discriminant Analysis

• An example with three classes. The observations from each class are drawn from a
multivariate Gaussian distribution with p = 2, with a class-specific mean vector and a
common covariance matrix. Left: Ellipses that contain 95% of the probability for each of the
three classes are shown. The dashed lines are the Bayes decision boundaries. Right: 20
observations were generated from each class, and the corresponding LDA decision
boundaries are indicated using solid black lines. The Bayes decision boundaries are once
again shown as dashed lines.

Linear Discriminant Analysis
Linear Discriminant Analysis for p >1
To do this, we will assume that X = (X1, X2, . . . , Xp) is drawn from a multivariate Gaussian (or multivariate normal) distribution, with a class-specific multivariate mean vector and a common covariance matrix.
Linear Discriminant Analysis
Linear Discriminant Analysis for p >1
To indicate that a p-dimensional random variable X has a multivariate Gaussian distribution, we write X ∼ N(μ, Σ). Here E(X) = μ is the mean of X (a vector with p components), and Cov(X) = Σ is the p × p covariance matrix of X. Formally, the multivariate Gaussian density is defined as

f(x) = (1/((2π)^(p/2) |Σ|^(1/2))) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))

The Bayes classifier assigns an observation X = x to the class for which

δk(x) = xᵀ Σ⁻¹ μk − (1/2) μkᵀ Σ⁻¹ μk + log πk

is largest.
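A short sketch fitting LDA on synthetic data generated exactly this way (class-specific means, common covariance); all numbers are illustrative:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
means = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, 2.0]])  # class-specific mean vectors
cov = np.array([[1.0, 0.3], [0.3, 1.0]])                 # common covariance matrix
X = np.vstack([rng.multivariate_normal(m, cov, size=20) for m in means])
y = np.repeat([0, 1, 2], 20)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict([[0.5, 0.5]]))  # class with the largest discriminant value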
Linear Discriminant Analysis

• A confusion matrix compares the LDA predictions to the true default statuses for the 10,000 training observations in the Default data set. Elements on the diagonal of the matrix represent individuals whose default statuses were correctly predicted, while off-diagonal elements represent individuals that were misclassified. LDA made incorrect predictions for 23 individuals who did not default and for 252 individuals who did default.
Linear Discriminant Analysis

Since only 3.33% of the individuals in the training sample defaulted, a simple but useless classifier that always predicts that each individual will not default, regardless of his or her credit card balance and student status, will result in an error rate of 3.33%.
Linear Discriminant Analysis
The Bayes classifier works by assigning an observation to the class for which the posterior probability pk(X) is greatest. In the two-class case, this amounts to assigning an observation to the default class if

Pr(default = Yes | X = x) > 0.5
Linear Discriminant Analysis

A confusion matrix compares the LDA predictions to the true default statuses for the 10,000 training observations in the Default data set, using a modified threshold value that predicts default for any individual whose posterior default probability exceeds 20%.
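A sketch of lowering the decision threshold from 0.5 to 0.2 on synthetic two-class data; all names and numbers are hypothetical:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0.5).astype(int)  # "default" labels

lda = LinearDiscriminantAnalysis().fit(X, y)
post = lda.predict_proba(X)[:, 1]              # posterior probability of default
print((post > 0.5).sum(), (post > 0.2).sum())  # the 0.2 threshold flags more people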
Classification Confusion Matrix

                 Class 1 Predicted     Class 2 Predicted
Class 1 Actual   True Positive (TP)    False Negative (FN)
Class 2 Actual   False Positive (FP)   True Negative (TN)

• Here,
• Class 1 : Positive
• Class 2 : Negative
• Definition of the Terms:
• Positive (P) : Observation is positive (for example: You have Covid).
• Negative (N) : Observation is not positive (for example: You do not have Covid).
• True Positive (TP) : Observation is positive, and is predicted to be positive.
• False Negative (FN) : Observation is positive, but is predicted negative.
• True Negative (TN) : Observation is negative, and is predicted to be negative.
• False Positive (FP) : Observation is negative, but is predicted positive.

Classification Rate/Accuracy

• Classification rate or accuracy is given by the relation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

• However, there are problems with accuracy. It assumes equal costs for both kinds of errors. A 99.9% accuracy can be excellent, good, mediocre, poor or terrible depending upon the problem.
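A small numeric sketch with hypothetical confusion matrix counts:

tp, fn, fp, tn = 50, 10, 5, 935  # hypothetical counts
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.985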
Recall
• Recall is the total number of correctly classified positive examples divided by the total number of actual positive examples:

Recall = TP / (TP + FN)

• High recall indicates the class is correctly recognized (small number of FN).

Precision
• Precision is the total number of correctly classified positive examples divided by the total number of predicted positive examples:

Precision = TP / (TP + FP)

• High precision indicates an example labeled as positive is indeed positive (small number of FP).

• High recall, low precision: most of the positive examples are correctly recognized (low FN), but there are a lot of false positives.
• Low recall, high precision: we miss a lot of positive examples (high FN), but those we predict as positive are indeed positive (low FP).
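Continuing the same hypothetical counts:

tp, fn, fp, tn = 50, 10, 5, 935
recall = tp / (tp + fn)        # TP / (TP + FN)
precision = tp / (tp + fp)     # TP / (TP + FP)
print(round(recall, 3), round(precision, 3))  # 0.833 0.909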
F-measure
• Since we have two measures (precision and recall), it helps to have a measurement that represents both of them. We calculate an F-measure, which uses the harmonic mean in place of the arithmetic mean, as it punishes the extreme values more:

F-measure = 2 × Precision × Recall / (Precision + Recall)

• The F-measure will always be nearer to the smaller value of precision or recall.
• Precision is also known as the positive predictive value (PPV).
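And the F-measure for the same hypothetical counts:

precision, recall = 50 / 55, 50 / 60   # values from the sketch above
f = 2 * precision * recall / (precision + recall)
print(round(f, 3))  # ~0.87, nearer to the smaller of the two values (recall)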
