Logistic Regression With R

- The document describes building a logistic regression model to predict student admissions from GRE, GPA, and rank.
- The model is trained on 80% of the data; GPA and rank are statistically significant predictors, while GRE is not.
- The model is then used to predict admission probabilities for the remaining 20% (test data), giving a misclassification error rate of 29.3%.
- A goodness-of-fit test confirms the model is a statistically significant improvement over the null model.


Logistic Regression

YouTube video: https://www.youtube.com/watch?v=AVx7Wc1CQ7Y

R code

#Logistic Regression

# Read data file (choose the data from your folder)
mydata <- read.csv(file.choose(), header = T)
str(mydata)

'data.frame': 400 obs. of 4 variables:
$ admit: int 0 1 1 1 0 1 1 0 1 0 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : int 3 3 1 4 4 2 1 2 3 2 ...

Notice that admit and rank were read in as int; they are actually categorical variables, so we convert them to factors.

mydata$admit <- as.factor(mydata$admit)
mydata$rank <- as.factor(mydata$rank)
str(mydata)

'data.frame': 400 obs. of 4 variables:
$ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...

# Two-way table of admit by rank (both factors)
xtabs(~admit + rank, data = mydata)

rank
admit 1 2 3 4
0 28 97 93 55
1 33 54 28 12
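
An optional follow-up (not in the original walkthrough): converting these counts to column proportions shows that the admission rate drops as the school rank number increases.

# Admission rate by rank (column proportions of the same two-way table)
round(prop.table(xtabs(~admit + rank, data = mydata), margin = 2), 2)
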
#Partition data - train (80%) & test (20%)
set.seed(1234)
ind <- sample(2, nrow(mydata), replace = T, prob = c(0.8, 0.2))
train <- mydata[ind==1,]
test <- mydata[ind==2,]
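
Note that sample() with prob = c(0.8, 0.2) gives an approximately 80/20 split rather than an exact one; an optional quick check of the resulting partition sizes:

# How many rows ended up in each partition (roughly 80% / 20%)
table(ind)
nrow(train); nrow(test)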

#Logistic regression model
mymodel <- glm(admit ~ gre + gpa + rank, data = train, family = 'binomial')
summary(mymodel)

Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = train)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.5873 -0.8679 -0.6181 1.1301 2.1178

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.009514 1.316514 -3.805 0.000142 ***
gre 0.001631 0.001217 1.340 0.180180
gpa 1.166408 0.388899 2.999 0.002706 **
rank2 -0.570976 0.358273 -1.594 0.111005
rank3 -1.125341 0.383372 -2.935 0.003331 **
rank4 -1.532942 0.477377 -3.211 0.001322 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 404.39 on 324 degrees of freedom
Residual deviance: 369.99 on 319 degrees of freedom
AIC: 381.99

Number of Fisher Scoring iterations: 4
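
As an optional extra step (not in the original), note that the coefficients are on the log-odds scale, so exponentiating them gives odds ratios; for example, exp(1.166) ≈ 3.2 means each additional GPA point roughly triples the odds of admission, holding the other variables fixed.

# Odds ratios and Wald confidence intervals for the fitted coefficients
exp(coef(mymodel))
exp(confint.default(mymodel))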

From the result, GRE is not statistically significant (p ≈ 0.18), so we drop that variable.
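
A more formal way to justify dropping gre is a likelihood-ratio test comparing the two nested models; a minimal sketch (the names full and reduced are illustrative, not from the original):

# Likelihood-ratio test: deviance difference 371.81 - 369.99 = 1.82 on 1 df, p ~ 0.18
full    <- glm(admit ~ gre + gpa + rank, data = train, family = 'binomial')
reduced <- glm(admit ~ gpa + rank, data = train, family = 'binomial')
anova(reduced, full, test = "Chisq")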

#without gre
mymodel <- glm(admit ~ gpa + rank, data = train, family = 'binomial')
summary(mymodel)

Call:
glm(formula = admit ~ gpa + rank, family = "binomial", data = train)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.5156 -0.8880 -0.6318 1.1091 2.1688

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.7270 1.2918 -3.659 0.000253 ***
gpa 1.3735 0.3590 3.826 0.000130 ***
rank2 -0.5712 0.3564 -1.603 0.108976
rank3 -1.1645 0.3804 -3.061 0.002203 **
rank4 -1.5642 0.4756 -3.289 0.001005 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 404.39 on 324 degrees of freedom
Residual deviance: 371.81 on 320 degrees of freedom
AIC: 381.81

Number of Fisher Scoring iterations: 4

#Prediction
p1 <- predict(mymodel, train, type = 'response')
head(p1)

1 2 3 4 6 7
0.2822956 0.2992879 0.6828897 0.1290134 0.2354735 0.3466234

Note that observation 5 is missing from the output because it was assigned to the test set when the data were partitioned.

The result shows that for applicant 1 the predicted probability of admission is only about 28.2%.

head(train)

admit gre gpa rank
1 0 380 3.61 3
2 1 660 3.67 3
3 1 800 4.00 1
4 1 640 3.19 4
6 1 760 3.00 2
7 1 560 2.98 1

Probability calculation by hand, using the fitted coefficients:
For applicant #1 (gpa = 3.61, rank = 3)
> y <- -4.7270 + (1.3735*3.61) + (1*-1.1645)
> y
[1] -0.933165
> exp(y)/(1+exp(y))
[1] 0.282283

For applicant #3 (gpa = 4.00, rank = 1; rank 1 is the baseline level, so no rank term appears)
> y <- -4.7270 + (1.3735*4)
> exp(y)/(1+exp(y))
[1] 0.6828716
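
The same numbers can be obtained without typing the coefficients by hand, using coef() and R's built-in inverse logit plogis():

# plogis(y) = exp(y)/(1+exp(y)); rank 1 is the baseline level, so it has no coefficient
b <- coef(mymodel)
plogis(b["(Intercept)"] + b["gpa"]*3.61 + b["rank3"])   # applicant #1 (gpa 3.61, rank 3)
plogis(b["(Intercept)"] + b["gpa"]*4.00)                # applicant #3 (gpa 4.00, rank 1)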

#Misclassification error - train data
pred1 <- ifelse(p1>0.5, 1, 0)
tab1 <- table(Predicted = pred1, Actual = train$admit)
tab1

Actual
Predicted 0 1
0 208 73
1 15 29
The table above is called a confusion matrix. The 208 means that 208 applicants were predicted not to be admitted and truly were not (predicted = actual), while 29 were predicted to be admitted and truly were. The 73 were predicted not to be admitted but actually were admitted, and the 15 were predicted to be admitted but actually were not.

So the misclassification error on the training data is 27.08%:

1 - sum(diag(tab1))/sum(tab1)

[1] 0.2707692
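
The same confusion matrix also gives sensitivity and specificity (an optional breakdown, not in the original): the model is good at identifying non-admitted applicants but misses most admitted ones.

# Sensitivity: admitted applicants correctly predicted; specificity: non-admitted correctly predicted
tab1["1", "1"] / sum(tab1[, "1"])   # 29/102, about 0.28
tab1["0", "0"] / sum(tab1[, "0"])   # 208/223, about 0.93
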
#Misclassification error - test data
p2 <- predict(mymodel, test, type = 'response')
pred2 <- ifelse(p2>0.5, 1, 0)
tab2 <- table(Predicted = pred2, Actual = test$admit)
tab2

Actual
Predicted 0 1
0 48 20
1 2 5

1 - sum(diag(tab2))/sum(tab2)

[1] 0.2933333
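
For context (not part of the original analysis), compare this with the naive baseline of always predicting "not admitted" on the test set; the model improves on it only modestly.

# Baseline error: always predict the majority class (0 = not admitted)
1 - max(table(test$admit)) / nrow(test)   # 25/75, about 0.333, vs the model's 0.293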

# Goodness-of-fit test
with(mymodel, pchisq(null.deviance - deviance, df.null-df.residual,
lower.tail = F))

[1] 1.450537e-06

Because this p-value is very low, we conclude that the model is a statistically significant improvement over the null model (the intercept-only model).
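
The same p-value can be reproduced by hand from the two deviances reported in summary(mymodel):

# Chi-square statistic = null deviance - residual deviance, on the difference in degrees of freedom
pchisq(404.39 - 371.81, df = 324 - 320, lower.tail = FALSE)   # about 1.45e-06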
