Ch 4: Classification

The document discusses generative and discriminative classifiers, focusing on logistic regression and its interpretation through odds ratios. It explains how coefficients in logistic regression can be interpreted in terms of the impact on the probability of an event, using examples related to employment types based on age and gender. Additionally, it covers the linear probability model, generalized linear models, and provides an example of linear discriminant analysis (LDA) for classifying gender based on height.

Mathematically, generative classifiers assume a functional form for P(Y) and P(X|Y), estimate the parameters from the data, and use Bayes' theorem

$$P(Y|X) = \frac{P(X|Y)\,P(Y)}{P(X)}$$

to calculate the posterior probability P(Y|X). Discriminative classifiers, by contrast, assume a functional form for P(Y|X) and estimate its parameters directly from the data.
Odds ratio interpretation of the coefficients of logistic regression
Consider a logistic regression model estimated to explain why a worker has a permanent job
(Y = 1) or a temporary job (Y = 0):

$$\hat{p} = P(Y=1|X) = \frac{1}{1+e^{-\hat{L}}} = \frac{e^{\hat{L}}}{1+e^{\hat{L}}}$$

$$\hat{L} = -2.608 + 0.0549\,Age + 0.1425\,Fem + 0.208\,EduPri + 1.124\,EduSec + 3.562\,EduPost$$
Let p be the probability of an event and q = 1 − p. The odds in favor of the event are defined as the
ratio of the probability of the event occurring to the probability of it not occurring, i.e. p/q. For
example, if the probability of rain today is 0.75, then the odds in favor of rain (versus no rain) are
0.75/0.25 = 3, i.e. 3 to 1.

It can be shown that for a predictor x with coefficient $\beta$, the exponentiated coefficient $e^{\beta}$ is the
ratio of two odds:

$$\text{Odds Ratio} = \frac{\left.\dfrac{p}{q}\right|_{x = x_0 + 1}}{\left.\dfrac{p}{q}\right|_{x = x_0}} = e^{\beta}$$

This provides a way to interpret the impact of a one-unit increase in x on the odds of the event Y = 1.
For example, for the age variable: $e^{0.0549} = 1.056$.
Interpretation: a one-year increase in a worker's age is associated with an increase in the odds of
having a permanent job (versus a temporary job) by a factor of 1.056 (more plainly, a 5.6%
increase in the odds), holding gender and education level fixed.
[The 95% confidence interval for this odds ratio is 1.032 to 1.0832. Since the interval lies entirely
above 1, the odds ratio is statistically significantly different from 1. If the confidence interval for
an odds ratio includes 1, the effect is not statistically significant.]
Odds ratio for gender (Fem): $e^{0.1425} = 1.153$.
Interpretation: for a female worker, the odds of employment in a permanent job (versus a temporary
job) are 1.153 times those of a male worker (more specifically, 15.3% higher), holding age and
education fixed.
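For illustration, a minimal Python sketch that exponentiates the estimated coefficients above to obtain the odds ratios (the dictionary layout is just one convenient way to hold them):

```python
import numpy as np

# Estimated coefficients from the fitted logistic regression above
coefs = {"Age": 0.0549, "Fem": 0.1425, "EduPri": 0.208,
         "EduSec": 1.124, "EduPost": 3.562}

# The odds ratio for each predictor is e^beta
for name, beta in coefs.items():
    print(f"{name}: odds ratio = {np.exp(beta):.3f}")
# Age: 1.056, Fem: 1.153, ...
```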
Marginal effect of a quantitative predictor X (e.g., for the age variable):

$$\frac{\partial p}{\partial Age} = \beta_{Age} \times (\text{logistic pdf value}),$$

which can be shown to equal

$$\frac{\partial p}{\partial Age} = \beta_{Age}\, p(1-p)$$

Thus, the marginal effect (unlike the odds ratio) is not constant but depends on the specific values of
the predictors. For example, for a 25-year-old female worker who has secondary education, find the
marginal effect of a one-year increase in age on the probability of permanent employment.

$$\hat{L} = -2.608 + 0.0549(25) + 0.1425(1) + 0.208(0) + 1.124(1) + 3.562(0) = 0.031$$

$$p = \frac{1}{1+e^{-\hat{L}}} = \frac{1}{1+e^{-0.031}} = 0.50775$$

$$\frac{\partial p}{\partial Age} = \beta_{Age}\, p(1-p) = 0.0549\,(0.50775)(1-0.50775) = 0.01372$$

Thus, a one-year increase in the age of such a person is associated with an increase in the probability of
having a permanent job of 0.01372.
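A minimal Python sketch of this arithmetic, using the coefficient estimates above (the vector layout is an assumption for illustration):

```python
import numpy as np

# Coefficient order: intercept, Age, Fem, EduPri, EduSec, EduPost
b = np.array([-2.608, 0.0549, 0.1425, 0.208, 1.124, 3.562])
x = np.array([1, 25, 1, 0, 1, 0])   # 25-year-old female with secondary education

L = b @ x                           # linear predictor: L-hat = 0.031
p = 1 / (1 + np.exp(-L))            # predicted probability: 0.50775
me_age = b[1] * p * (1 - p)         # marginal effect of age: beta * p * (1 - p) = 0.01372
print(L, p, me_age)
```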
Marginal effect of a dummy predictor D (e.g., for the Fem variable):

$$ME = P(Y=1|X, Fem=1) - P(Y=1|X, Fem=0)$$

Example: for a 25-year-old person who has secondary education, find the marginal effect of the
person's gender on the probability of permanent employment.

$$P(Y=1|X, Fem=1) = 0.50775$$

For a male person:

$$\hat{L} = -2.608 + 0.0549(25) + 0.1425(0) + 0.208(0) + 1.124(1) + 3.562(0) = -0.1115$$

$$P(Y=1|X, Fem=0) = p = \frac{1}{1+e^{-\hat{L}}} = \frac{1}{1+e^{0.1115}} = 0.47215$$

Hence the ME of Fem = $P(Y=1|X, Fem=1) - P(Y=1|X, Fem=0) = 0.50775 - 0.47215 = 0.0356$.
Interpretation: a 25-year-old female with secondary education has a probability of permanent
employment that is 0.0356 (3.56 percentage points) higher than that of a 25-year-old male with
secondary education.
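The same comparison as a short sketch, again assuming the coefficient vector from the previous snippet:

```python
import numpy as np

b = np.array([-2.608, 0.0549, 0.1425, 0.208, 1.124, 3.562])

def prob(x_row):
    """Predicted P(Y=1|X) from the logistic model above."""
    return 1 / (1 + np.exp(-(b @ x_row)))

x_female = np.array([1, 25, 1, 0, 1, 0])   # Fem = 1
x_male   = np.array([1, 25, 0, 0, 1, 0])   # Fem = 0
me_fem = prob(x_female) - prob(x_male)     # difference of predicted probabilities ≈ 0.0356
print(me_fem)
```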
LPM, Logistic Regression and the Generalized Linear Model
Consider a random variable Y taking values 0 and 1 with probabilities $P(Y=1) = p$ and $P(Y=0) = 1-p$.
If the probabilities are fixed under repeated independent trials, Y has the Bernoulli
distribution:

$$f(y) = p^y (1-p)^{1-y}, \quad y = 0, 1$$

It can easily be shown that $E(Y) = p$ and $V(Y) = p(1-p)$.
Let's try to model $E(Y) = p$ via a linear model:

$$E(Y) = p = \beta_0 + \beta_1 x$$
$$Y = E(Y) + \varepsilon = \beta_0 + \beta_1 x + \varepsilon \quad \text{(LPM: linear probability model)}$$
The LPM has some undesirable properties. Notably, the predicted probability P(Y=1) can be negative
or greater than 1. We need a model that is a non-linear function of x, such that the predicted
probability stays within the (0, 1) bounds and that also captures the S-shaped pattern of the scatter
plot. The CDF of any continuous probability distribution has this character. One of the most popular
choices is the logistic pdf, given by:

$$f(y) = \frac{e^{-y}}{(1+e^{-y})^2}, \quad -\infty < y < \infty$$

The CDF of the logistic distribution is $F(y) = \dfrac{e^y}{1+e^y} = \dfrac{1}{1+e^{-y}}$.

Using the logistic CDF we have

$$E(Y) = \mu = p = \frac{1}{1+e^{-(\beta_0 + \beta_1 x)}}$$

This can be seen from a scatter plot of the data together with the fitted S-shaped logistic curve (figure omitted in the source).
After the logit transformation,

$$\ln\left(\frac{\mu}{1-\mu}\right) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$

i.e.

$$g(\mu) = \beta_0 + \beta_1 x$$

Compare this with the linear model:

$$\mu = \beta_0 + \beta_1 x$$

We can see that with the logistic model we have a non-linear function of $\mu$, i.e. $g(\mu)$, on the left-hand
side. This model is called a generalized linear model (GLM). The logistic regression
model is a special form of the GLM with link function $g(\mu) = \ln\left(\frac{\mu}{1-\mu}\right)$. Another GLM is
Poisson regression, with link function $g(\mu) = \ln \mu$, where $\mu = E(Y) = e^{\beta_0 + \beta_1 x}$.
For a GLM, the parameters are estimated using maximum likelihood (MLE). Goodness of fit can be
measured by a pseudo $R^2$, e.g.,

$$R^2 = 1 - \frac{D_m}{D_0},$$

where $D_0$ = null deviance and $D_m$ = residual deviance.

Deviance is the GLM analogue of the residual sum of squares that we use in linear models.
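As an illustrative sketch (not from the text) of fitting a logistic GLM and computing this pseudo $R^2$ with statsmodels, using simulated toy data:

```python
import numpy as np
import statsmodels.api as sm

# Toy data for illustration: one predictor x and a 0/1 response y
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-(0.5 + 1.2 * x)))).astype(int)

X = sm.add_constant(x)                                    # add the intercept column
res = sm.GLM(y, X, family=sm.families.Binomial()).fit()   # logit link by default

pseudo_r2 = 1 - res.deviance / res.null_deviance          # R^2 = 1 - D_m / D_0
print(res.params, pseudo_r2)
```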
LDA Example: Consider a simple example of classifying a set of students by gender based only on
their heights. Classify the gender of a person whose height is 152 cm using LDA.

Heights (cm) — Female: 140, 145, 135, 165, 141, 159; Male: 169, 142, 168, 160, 172.

This example is similar to conceptual Ex 7, Ch 4, p. 191 of ISLR2. In the notation of the textbook:
here p = number of predictors = 1 (height), n = number of examples (cases, or sample size) = 11, and
K = number of classes = 2 (F and M).

$$\hat{\pi}_1(F) = 6/11 = 0.545, \qquad \hat{\pi}_2(M) = 5/11 = 0.454$$
$$\hat{\mu}_1(F) = \frac{\sum x}{n} = \frac{140+145+135+165+141+159}{6} = 147.5, \qquad \hat{\sigma}_1^2(F) = \frac{n\sum x^2 - \left(\sum x\right)^2}{n(n-1)} = 139.9$$

$$\hat{\mu}_2(M) = \frac{\sum x}{n} = \frac{169+142+168+160+172}{5} = 162.2, \qquad \hat{\sigma}_2^2(M) = \frac{n\sum x^2 - \left(\sum x\right)^2}{n(n-1)} = 147.2$$

The pooled variance is

$$\hat{\sigma}^2 = \frac{(n_1-1)\hat{\sigma}_1^2 + (n_2-1)\hat{\sigma}_2^2}{n_1+n_2-2} = \frac{(6-1)(139.9)+(5-1)(147.2)}{6+5-2} = 143.144$$
The class densities at x = 152, using the pooled variance ($\hat{\sigma} = \sqrt{143.144} = 11.964$), are

$$f_1(152)\ (F) = \frac{1}{\sqrt{2\pi}\,\hat{\sigma}}\, e^{-\frac{(x-\hat{\mu}_1)^2}{2\hat{\sigma}^2}} = 0.03334\, e^{-0.003493\,(152-147.5)^2} = 0.03106$$

$$f_2(152)\ (M) = \frac{1}{\sqrt{2\pi}\,\hat{\sigma}}\, e^{-\frac{(x-\hat{\mu}_2)^2}{2\hat{\sigma}^2}} = 0.03334\, e^{-0.003493\,(152-162.2)^2} = 0.02318$$

Using Bayes' theorem, the posterior probabilities of female and male given a height of 152 are:

$$p_1(152) = P(Y=F \mid x=152) = \frac{\hat{\pi}_1 f_1(152)}{\hat{\pi}_1 f_1(152) + \hat{\pi}_2 f_2(152)} = 0.617$$

$$p_2(152) = P(Y=M \mid x=152) = \frac{\hat{\pi}_2 f_2(152)}{\hat{\pi}_1 f_1(152) + \hat{\pi}_2 f_2(152)} = 0.383$$

Hence, for a height of 152 cm, LDA predicts female, as that class has the higher posterior
probability.
Note: alternatively, the posterior probabilities can be found using a simplified version based on the
linear discriminant function:

$$\delta_k(x) = x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln(\pi_k)$$

$$p_k(x) = P(Y=k \mid X=x) = \frac{e^{\delta_k(x)}}{\sum_l e^{\delta_l(x)}}$$

For example:

Female: $\delta_1(152) = 152\,\dfrac{147.5}{143.144} - \dfrac{147.5^2}{286.288} + \ln(0.545) = 80.024$

Male: $\delta_2(152) = 152\,\dfrac{162.2}{143.144} - \dfrac{162.2^2}{286.288} + \ln(0.454) = 79.549$

Female: $p_1(152) = P(Y=F \mid x=152) = \dfrac{e^{\delta_1(152)}}{e^{\delta_1(152)} + e^{\delta_2(152)}} = 0.617$

Male: $p_2(152) = P(Y=M \mid x=152) = \dfrac{e^{\delta_2(152)}}{e^{\delta_1(152)} + e^{\delta_2(152)}} = 0.383$

[Check that for Ex 7: $\delta_1(4) = -0.50092$, $\delta_2(4) = -1.6094379$, and P(Y = Gives Div) = 0.7518.]
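A minimal Python sketch of the whole LDA calculation, using the height data above (the helper `normal_pdf` is an illustrative name, not a library function):

```python
import numpy as np

# Heights in cm from the example above
female = np.array([140, 145, 135, 165, 141, 159])
male   = np.array([169, 142, 168, 160, 172])

n1, n2 = len(female), len(male)
pi1, pi2 = n1 / (n1 + n2), n2 / (n1 + n2)        # prior probabilities 6/11, 5/11
mu1, mu2 = female.mean(), male.mean()            # class means 147.5, 162.2
var_pooled = ((n1 - 1) * female.var(ddof=1) +
              (n2 - 1) * male.var(ddof=1)) / (n1 + n2 - 2)   # 143.144

def normal_pdf(x, mu, var):
    """Normal density with the pooled variance (the LDA assumption)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = 152
f1, f2 = normal_pdf(x, mu1, var_pooled), normal_pdf(x, mu2, var_pooled)
p_female = pi1 * f1 / (pi1 * f1 + pi2 * f2)      # posterior P(Y=F | x=152)
print(p_female, 1 - p_female)                    # ≈ 0.617, 0.383
```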
Confusion matrix, precision, sensitivity (recall), specificity and F score (F1 score)
In some cases (especially with imbalanced classes), the percentage of correctly classified cases, or the
test error rate, is not a very effective measure of a classifier's performance.

Precision = TP/(TP + FP) = TP/(total number of success predictions)

is the proportion of records predicted to belong to the success class that actually belong to it (e.g., the
probability that a patient identified as having cancer actually has cancer).

Recall (or Sensitivity, or True Positive Rate (TPR)) = TP/(TP + FN) = TP/(total number of actual
successes) measures the percentage of actual positives that are correctly identified as positive (i.e., the
proportion of people with cancer who are correctly identified as having cancer).

Specificity (also called the True Negative Rate (TNR)) = TN/(FP + TN) = TN/(total number of actual
failures) measures the percentage of failures correctly identified as failures (i.e., the proportion of
people without cancer who are correctly identified as not having cancer).

F1 = 2 × (precision × recall)/(precision + recall)

The F1 score, which ranges from 0 (a useless classifier with total failure) to 1 (a perfect
classifier), is the harmonic mean of precision and recall and thus balances the two.
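These formulas translate directly into code; a small sketch (the confusion-matrix counts below are hypothetical):

```python
def classifier_metrics(tp, fp, fn, tn):
    """Precision, recall, specificity and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1

# Hypothetical counts for illustration only
print(classifier_metrics(tp=30, fp=10, fn=5, tn=55))
```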
Example: A classifier results in the following confusion matrix for a test sample; verify the
performance measures above. [Confusion matrix omitted in the source.]
ROC and Area Under the Curve (AUC): The area under a receiver operating characteristic (ROC)
curve, abbreviated AUC, is a single scalar value that measures the overall performance of a
binary classifier (Hanley and McNeil 1982). The curve is obtained by varying the classification threshold.
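A minimal sketch with scikit-learn (the labels and scores below are toy values, not from the example):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1])               # actual classes (toy data)
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # predicted P(Y=1|X)

# One (FPR, TPR) point per threshold; AUC summarizes the whole curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))                # ≈ 0.889 for these toy values
```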
Some Exercises:
Ex1: Consider the logistic regression model with two quantitative predictors:

$$P(Y=1|X) = \frac{e^{\beta_0+\beta_1 x_1+\beta_2 x_2}}{1+e^{\beta_0+\beta_1 x_1+\beta_2 x_2}}$$

Show that with the predictive threshold $P(Y=1|X) > 0.5$ for the success class, the decision boundary is
linear and (assuming $\beta_2 > 0$) is given by:

$$x_2 > -\frac{\beta_0}{\beta_2} - \frac{\beta_1}{\beta_2}\,x_1$$
Ex2: Consider the logistic regression model with one quantitative predictor x:

$$p = P(Y=1|X) = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}$$

Show that the marginal effect is given by $\dfrac{dp}{dx} = \beta_1\, p(1-p)$.

Ex3: For the two-class LDA with one quantitative predictor x, show that the log odds

$$\log \frac{P(Y=1|x)}{P(Y=0|x)} = \alpha_0 + \alpha_1 x,$$

i.e., it is linear in x.
Ex4: Consider the logistic regression model with one quantitative predictor x and a dummy predictor D:

$$p = P(Y=1|X) = \frac{e^{\beta_0+\beta_1 x+\beta_2 D}}{1+e^{\beta_0+\beta_1 x+\beta_2 D}}$$

Show that the odds ratio of the dummy variable, i.e.

$$\frac{\left.\dfrac{p}{q}\right|_{D=1}}{\left.\dfrac{p}{q}\right|_{D=0}},$$

is constant and independent of x. Thus the odds ratio provides a neat way of summarizing the effect of a
dummy predictor on Y.
