Week 04 Logistic Regression


31-08-2024

TOD 533
Logistic Regression
Amit Das
TODS / AMSOM / AU
[email protected]

Supervised Learning: Regression and Classification


• SUPERVISED LEARNING
• We use a set of training data to set the parameters of a model, e.g.
a, b1, b2, …, bn of a regression model y = a + b1x1 + b2x2 + … + bnxn
• The training data contains known values of the dependent variable DV (the
variable to be predicted)
• The trained model (with optimum values of the parameters) is used to predict
the outcome for new (unseen) cases (whose DV values are not known)
• When the DV is interval-scaled (“continuous”) -> regression
• We built a model to predict the fuel economy of automobiles
• When the DV is nominal (“categorical”) -> classification
• We will build a model for a DV that has two classes, diabetic or healthy


Why not regression?


• Consider the case of whether a person has diabetes* or not …
Dependent variable = has(diabetes), possible values (yes, no)

• The nominal dependent variable can be transformed to an interval
variable p[has(diabetes)], and we can then build a model such as
p[has(diabetes)] = a + b1X1 + b2X2 + b3X3 + … + bnXn

* WHO: 2 hour post-load plasma glucose at least 200 mg/dl

The bounds of probability


We need an equation of the type
p[has(diabetes)] = a + b1X1 + b2X2 + b3X3 + … + bnXn

• But how do we constrain it to stay between 0 and 1?

• Exponentiation keeps the quantity positive:
exp(a + b1x1 + …) > 0, and

• dividing by one plus the same quantity keeps it below 1:
p = exp(a + b1x1 + …) / (1 + exp(a + b1x1 + …)) < 1
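This construction is the logistic (sigmoid) function. A minimal Python sketch (the helper name `logistic` is my own) confirms the bounds numerically:

```python
import math

def logistic(z):
    """Map a linear predictor z = a + b1*x1 + ... to a probability in (0, 1)."""
    return math.exp(z) / (1.0 + math.exp(z))

# Even extreme linear predictors stay strictly between 0 and 1.
for z in (-10, -1, 0, 1, 10):
    p = logistic(z)
    assert 0.0 < p < 1.0
```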


Shape of the logistic regression function

Odds and log(odds)


With a little algebra,
ln[p / (1 − p)] = a + b1x1 + b2x2 + … + bnxn

Even though the probability p is not a linear function of the xi, the transform
ln[p / (1 − p)] is a linear function of the xi

What is ln[p / (1 − p)]? The log of the odds, also called the logit.

p / (1 − p) = exp(a + b1x1 + …) = e^a * e^(b1x1) * e^(b2x2) … * e^(bnxn) (multiplicative model)
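The multiplicative form has a useful reading: increasing one predictor by one unit multiplies the odds by e^b for that predictor, no matter what the other predictors are. A small sketch (the coefficients and the helper name `odds` are my own illustration, not the diabetes model):

```python
import math

def odds(a, bs, xs):
    """Odds p/(1-p) = exp(a + b1*x1 + ... + bn*xn) for a logistic model."""
    return math.exp(a + sum(b * x for b, x in zip(bs, xs)))

# Illustrative coefficients.
a, bs = -2.0, [0.5, 1.2]
base = odds(a, bs, [3.0, 1.0])
bumped = odds(a, bs, [4.0, 1.0])   # x1 increased by one unit
# Multiplicative model: the odds are multiplied by exactly e^b1.
assert abs(bumped / base - math.exp(0.5)) < 1e-9
```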


The diabetes dataset


• Title: Pima Indians Diabetes Database
• Original owners: National Institute of Diabetes and Digestive and Kidney Diseases

• Attributes
1. preg = Number of times pregnant
2. plas = Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. pres = Diastolic blood pressure (mm Hg)
4. skin = Triceps skin fold thickness (mm)
5. insu = 2-Hour serum insulin (mu U/ml)
6. mass = Body mass index (weight in kg/(height in m)^2)
7. pedi = Diabetes pedigree function
8. age = Age (years)
9. Class variable = tested positive for diabetes (1 (268 instances) or 0 (500 instances))
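The coefficient table below comes from maximum-likelihood estimation. As a toy illustration of how such estimates are obtained, here is a from-scratch gradient-ascent fit on synthetic data with one predictor (the synthetic model and all names are my own, not the Pima data):

```python
import math, random

random.seed(0)

# Synthetic training data: one predictor x, true model ln[p/(1-p)] = -1 + 2x.
xs = [random.uniform(-3, 3) for _ in range(2000)]
ys = [1 if random.random() < 1 / (1 + math.exp(-(-1 + 2 * x))) else 0 for x in xs]

a, b = 0.0, 0.0                 # parameters to estimate
lr = 1.0
for _ in range(500):            # gradient ascent on the average log-likelihood
    ga = gb = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(a + b * x)))
        ga += y - p             # d(log-likelihood)/da
        gb += (y - p) * x       # d(log-likelihood)/db
    a += lr * ga / len(xs)
    b += lr * gb / len(xs)

# The estimates land near the true values (-1, 2), up to sampling noise.
```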

Model Coefficients - diabetes

Predictor Estimate SE Z p Odds ratio

Intercept -5.71139 0.54034 -10.570 < .001 0.00331 a


age 0.03057 0.00821 3.723 < .001 1.03104 b1
pedi 1.00899 0.25928 3.891 < .001 2.74283 b2
mass 0.09974 0.01364 7.314 < .001 1.10489 b3
skin -0.00520 0.00555 -0.936 0.349 0.99482 b4
preg 0.09660 0.02857 3.381 < .001 1.10142 b5

Note. Estimates represent the log odds of "diabetes = 1" vs. "diabetes = 0"
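The "Odds ratio" column is simply exp(Estimate), per the multiplicative form of the model; a quick check of the table's values (my own sketch):

```python
import math

# Estimate and Odds-ratio columns from the model table above.
estimates = {"age": 0.03057, "pedi": 1.00899, "mass": 0.09974,
             "skin": -0.00520, "preg": 0.09660}
reported = {"age": 1.03104, "pedi": 2.74283, "mass": 1.10489,
            "skin": 0.99482, "preg": 1.10142}

for name, b in estimates.items():
    # Each reported odds ratio equals e raised to the estimate.
    assert abs(math.exp(b) - reported[name]) < 1e-4
```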

ln[p / (1 − p)] = −5.711 + 0.031*age + 1.009*pedi + 0.100*mass
− 0.005*skin + 0.097*preg

p / (1 − p) = exp(a + b1x1 + …)
= e^(−5.711) * e^(0.031*age) * e^(1.009*pedi) * e^(0.100*mass) * e^(−0.005*skin) * e^(0.097*preg)
= 0.00331*(1.031^age)*(2.743^pedi)*(1.105^mass)*(0.995^skin)*(1.101^preg)


Prediction for a new patient


Age = 50
Pedigree Score = 0.5 (some history of diabetes in the family)
BMI = 30
SkinFold Thickness = 22
Pregnancies = 3

What is the likelihood of her being diabetic?


p / (1 − p) = 0.00331*(1.031^age)*(2.743^pedi)*(1.105^mass)*(0.995^skin)*(1.101^preg)
= 0.00331*(1.031^50)*(2.743^0.5)*(1.105^30)*(0.995^22)*(1.101^3)
= 0.00331*4.602*1.656*19.993*0.896*1.335 = 0.603
p = odds / (1 + odds) = 0.603 / (1 + 0.603) = 37.6% < 50%
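The same prediction can be computed directly from the fitted coefficients (a sketch; the variable names are my own):

```python
import math

# Slope coefficients and intercept from the model table.
coef = {"age": 0.03057, "pedi": 1.00899, "mass": 0.09974,
        "skin": -0.00520, "preg": 0.09660}
intercept = -5.71139

patient = {"age": 50, "pedi": 0.5, "mass": 30, "skin": 22, "preg": 3}

log_odds = intercept + sum(coef[k] * patient[k] for k in coef)
odds = math.exp(log_odds)
p = odds / (1 + odds)
print(round(p, 3))   # 0.375
```

The unrounded result is p ≈ 0.375; the slide's 37.6% differs only through intermediate rounding of the odds.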

Dimension reduction / feature selection


• LASSO (forces some regression coefficients to be zero)
• Stepwise regression (retains only those variables that improve model
fit)
• Principal components analysis, PCA (creates “combination” variables that
capture the information in the original variables more
parsimoniously)
• Demo in STATA
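The slides demo this in STATA. For readers without STATA, here is an equivalent LASSO sketch with scikit-learn, assuming it is installed (the synthetic data and penalty setting are my own illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))              # six candidate predictors
log_odds = 1.5 * X[:, 0] - 2.0 * X[:, 1]   # only the first two matter
y = (rng.random(500) < 1 / (1 + np.exp(-log_odds))).astype(int)

# An L1 (LASSO) penalty forces the coefficients of uninformative
# predictors to exactly zero, performing feature selection.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
model.fit(X, y)
print(model.coef_.round(2))   # inspect which coefficients survived
```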
