Week 04 Logistic Regression

31-08-2024

TOD 533
Logistic Regression
Amit Das
TODS / AMSOM / AU
[email protected]

Supervised Learning: Regression and Classification


• SUPERVISED LEARNING
• We use a set of training data to set the parameters of a model, e.g. a, b1, b2, …, bn of a regression model y = a + b1x1 + b2x2 + … + bnxn
• The training data contains known values of the dependent variable DV (the variable to be predicted)
• The trained model (with optimum values of the parameters) is used to predict the outcome for new (unseen) cases (whose DV values are not known)
• When the DV is interval-scaled (“continuous”) -> regression
• We built a model to predict the fuel economy of automobiles
• When the DV is nominal (“categorical”) -> classification
• We will build a model for a DV that has two classes, diabetic or healthy (a workflow sketch follows this list)
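As a minimal sketch of this train-then-predict workflow (my illustration, not part of the slides; assumes scikit-learn and synthetic data):

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 2))              # two predictors x1, x2

    # Regression: interval-scaled ("continuous") DV
    y_cont = 3 + 1.5 * X_train[:, 0] - 2.0 * X_train[:, 1] + rng.normal(size=100)
    reg = LinearRegression().fit(X_train, y_cont)    # learns a, b1, b2

    # Classification: nominal ("categorical") DV with two classes
    y_class = (y_cont > 3).astype(int)
    clf = LogisticRegression().fit(X_train, y_class)

    X_new = rng.normal(size=(5, 2))                  # unseen cases, DV unknown
    print(reg.predict(X_new))                        # predicted continuous values
    print(clf.predict(X_new))                        # predicted class labels (0 or 1)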


Why not regression?


• Consider the case of whether a person has diabetes* or not …
  Dependent variable = has(diabetes), possible values (yes, no)

• The nominal dependent variable can be transformed into an interval-scaled variable p[has(diabetes)], and we can build a model such as
  p[has(diabetes)] = a + b1X1 + b2X2 + b3X3 + … + bnXn

* WHO: 2-hour post-load plasma glucose at least 200 mg/dl

The bounds of probability


We need an equation of the type
p[has(diabetes)] = a + b1X1 + b2X2 + b3X3 + … + bnXn

• But how do we constrain it to stay between 0 and 1? Use the logistic transform:

  p = exp(a + b1x1 + …) / (1 + exp(a + b1x1 + …))

• The numerator e^(a + b1x1 + …) > 0, so p > 0, and
• the numerator is smaller than the denominator 1 + e^(a + b1x1 + …), so p < 1
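A quick numeric check (my sketch, not from the slides) that this transform stays strictly inside (0, 1) no matter what value the linear predictor takes:

    import numpy as np

    def logistic(z):
        # p = exp(z) / (1 + exp(z)) is strictly between 0 and 1 for any real z
        return np.exp(z) / (1 + np.exp(z))

    z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])   # z = a + b1*x1 + ... can be any real number
    p = logistic(z)
    print(p)                                      # 0.0000454, 0.119, 0.5, 0.881, 0.99995
    assert np.all((p > 0) & (p < 1))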


Shape of the logistic regression function

[Figure: the S-shaped (sigmoid) logistic curve, rising from 0 to 1 as a + b1x1 + … goes from -∞ to +∞]

Odds and log(odds)


With a little algebra,
  ln[p / (1-p)] = a + b1x1 + b2x2 + … + bnxn

Even though the probability p is not a linear function of the xi, the transform ln[p / (1-p)] is a linear function of the xi.

What is ln[p / (1-p)]? The log of the odds, also called the logit.

  p / (1-p) = e^(a + b1x1 + …) = e^a · e^(b1x1) · e^(b2x2) … e^(bnxn) (multiplicative model)
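A sketch with hypothetical coefficients (my example) showing both facts: the logit recovers the linear predictor, and each unit change in x multiplies the odds by e^b:

    import numpy as np

    a, b1 = -2.0, 0.8                       # hypothetical coefficients
    x = np.array([0.0, 1.0, 2.0])

    p = np.exp(a + b1 * x) / (1 + np.exp(a + b1 * x))
    odds = p / (1 - p)

    print(np.log(odds))                     # the logit: a + b1*x, linear in x -> [-2.0, -1.2, -0.4]
    print(odds[1] / odds[0], np.exp(b1))    # each unit of x multiplies the odds by e^b1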


The diabetes dataset


• Title: Pima Indians Diabetes Database
• Original owners: National Institute of Diabetes and Digestive and Kidney Diseases

• Attributes
1. preg = Number of times pregnant
2. plas = Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. pres = Diastolic blood pressure (mm Hg)
4. skin = Triceps skin fold thickness (mm)
5. insu = 2-Hour serum insulin (μU/ml)
6. mass = Body mass index (weight in kg/(height in m)^2)
7. pedi = Diabetes pedigree function
8. age = Age (years)
9. Class variable = tested positive for diabetes (1 (268 instances) or 0 (500 instances))
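A sketch of fitting the model on this dataset (my illustration; the file name pima-diabetes.csv is hypothetical, and the columns are assumed to match the attribute list above):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("pima-diabetes.csv",
                     names=["preg", "plas", "pres", "skin", "insu",
                            "mass", "pedi", "age", "class"])

    # Same five predictors as the coefficient table below
    X = sm.add_constant(df[["age", "pedi", "mass", "skin", "preg"]])
    model = sm.Logit(df["class"], X).fit()
    print(model.summary())                  # estimates, SEs, z statistics, p-values
    print(np.exp(model.params))             # odds ratios = e^coefficient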

Model Coefficients - diabetes

Predictor   Estimate   SE        Z         p        Odds ratio
Intercept   -5.71139   0.54034   -10.570   < .001   0.00331   (a)
age          0.03057   0.00821     3.723   < .001   1.03104   (b1)
pedi         1.00899   0.25928     3.891   < .001   2.74283   (b2)
mass         0.09974   0.01364     7.314   < .001   1.10489   (b3)
skin        -0.00520   0.00555    -0.936     .349   0.99482   (b4)
preg         0.09660   0.02857     3.381   < .001   1.10142   (b5)

Note. Estimates represent the log odds of "diabetes = 1" vs. "diabetes = 0"

ln[p / (1-p)] = -5.711 + 0.031*age + 1.009*pedi + 0.100*mass - 0.005*skin + 0.097*preg

p / (1-p) = e^(a + b1x1 + …)
          = e^(-5.711) * e^(0.031*age) * e^(1.009*pedi) * e^(0.100*mass) * e^(-0.005*skin) * e^(0.097*preg)
          = 0.00331*(1.031^age)*(2.743^pedi)*(1.105^mass)*(0.995^skin)*(1.101^preg)
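A sketch (my verification) that the odds-ratio column is just e raised to each estimate, using the coefficients from the table above:

    import numpy as np

    estimates = {"Intercept": -5.71139, "age": 0.03057, "pedi": 1.00899,
                 "mass": 0.09974, "skin": -0.00520, "preg": 0.09660}

    for name, b in estimates.items():
        print(f"{name:9s} exp({b:+.5f}) = {np.exp(b):.5f}")
    # exp(-5.71139) = 0.00331, exp(0.03057) = 1.03104, ..., matching the table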


Prediction for a new patient


Age = 50
Pedigree Score = 0.5 (some history of diabetes in the family)
BMI = 30
SkinFold Thickness = 22
Pregnancies = 3

What is the likelihood of her being diabetic?

p / (1-p) = 0.00331*(1.031^age)*(2.743^pedi)*(1.105^mass)*(0.995^skin)*(1.101^preg)
          = 0.00331*(1.031^50)*(2.743^0.5)*(1.105^30)*(0.995^22)*(1.101^3)
          = 0.00331*4.602*1.656*19.993*0.896*1.335 = 0.603
p = 0.603 / (1 + 0.603) = 37.6% < 50%
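The same arithmetic as a short Python sketch (values copied from this slide):

    # Multiplicative (odds-ratio) form, with the patient's values plugged in
    odds = 0.00331 * 1.031**50 * 2.743**0.5 * 1.105**30 * 0.995**22 * 1.101**3
    p = odds / (1 + odds)
    print(f"odds = {odds:.3f}, p = {p:.1%}")   # odds ~ 0.603, p ~ 37.6% (< 50%)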

Dimension reduction / feature selection


• LASSO (forces some regression coefficients to zero; see the sketch after this list)
• Stepwise regression (retains only those variables that improve model fit)
• Principal components analysis (PCA) (creates “combination” variables that capture the information in the original variables more parsimoniously)
• Demo in STATA
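The slides demo this in STATA; as a rough Python equivalent (my assumption, not the author's demo), scikit-learn's LogisticRegression with an L1 penalty performs LASSO-style variable selection:

    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # The L1 (LASSO-style) penalty drives some coefficients exactly to zero,
    # dropping those variables from the model; C controls penalty strength.
    lasso_logit = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
    )
    # lasso_logit.fit(X, y)   # X, y as in the diabetes model above;
    # zeroed coefficients mark the variables the model dropped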
