Logistic Regression With R
R code
#Logistic Regression
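This walkthrough assumes a data frame called mydata with columns admit, gre, gpa and rank. A minimal loading sketch (the file name binary.csv is an assumption; the variable names match the well-known graduate-admissions example data):

```r
# Assumed data source: a CSV with columns admit, gre, gpa and rank
mydata <- read.csv("binary.csv")
str(mydata)  # admit and rank are read in as int at this point
```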
Notice that admit and rank were read in as int, although they are really categorical variables (factors).
       rank
admit    1   2   3   4
    0   28  97  93  55
    1   33  54  28  12
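Since admit and rank are categorical, they should be converted to factors before modelling. A minimal sketch (the xtabs call reproduces the cross-table above):

```r
# Convert the two categorical variables from int to factor
mydata$admit <- as.factor(mydata$admit)
mydata$rank  <- as.factor(mydata$rank)
# Cross-tabulate admission outcome against school rank
xtabs(~ admit + rank, data = mydata)
```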
#Partition data - train (80%) & test (20%)
set.seed(1234)
ind <- sample(2, nrow(mydata), replace = T, prob = c(0.8, 0.2))
train <- mydata[ind==1,]
test <- mydata[ind==2,]
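The summary output below comes from a binomial glm fitted on the training set. The fitting call itself is not shown, so here is a sketch (the object name mymodel is an assumption, chosen to match the refit further down):

```r
# Fit a logistic regression of admit on gre, gpa and rank
mymodel <- glm(admit ~ gre + gpa + rank, data = train, family = 'binomial')
summary(mymodel)
```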
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
    data = train)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.5873 -0.8679 -0.6181  1.1301  2.1178

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.009514   1.316514  -3.805 0.000142 ***
gre          0.001631   0.001217   1.340 0.180180
gpa          1.166408   0.388899   2.999 0.002706 **
rank2       -0.570976   0.358273  -1.594 0.111005
rank3       -1.125341   0.383372  -2.935 0.003331 **
rank4       -1.532942   0.477377  -3.211 0.001322 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Refit without gre, whose coefficient is not significant (p = 0.18)
mymodel <- glm(admit ~ gpa + rank, data = train, family = 'binomial')
summary(mymodel)
Call:
glm(formula = admit ~ gpa + rank, family = "binomial", data = train)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.5156 -0.8880 -0.6318  1.1091  2.1688

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -4.7270     1.2918  -3.659 0.000253 ***
gpa           1.3735     0.3590   3.826 0.000130 ***
rank2        -0.5712     0.3564  -1.603 0.108976
rank3        -1.1645     0.3804  -3.061 0.002203 **
rank4        -1.5642     0.4756  -3.289 0.001005 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Prediction
p1 <- predict(mymodel, train, type = 'response')
head(p1)
        1         2         3         4         6         7
0.2822956 0.2992879 0.6828897 0.1290134 0.2354735 0.3466234
Observation 5 is missing from the output because it was assigned to the test set during the partition.
The result shows that for applicant #1 the predicted probability of admission is only about 28.2%.
head(train)
Probability calculation. The fitted linear predictor y is converted to a probability with the logistic function p = exp(y)/(1 + exp(y)).
For applicant #1 (gpa = 3.61, rank = 3, so the rank3 coefficient applies):
> y <- -4.7270 + (1.3735*3.61) + (1*-1.1645)
> y
[1] -0.933165
> exp(y)/(1+exp(y))
[1] 0.282283
For applicant #3 (gpa = 4, rank = 1, the reference level, so no rank dummy is added):
> y <- -4.7270 + (1.3735*4)
> exp(y)/(1+exp(y))
[1] 0.6828716
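The confusion matrix below is built from the training-set predictions. The classification step is not shown above, so here is a sketch (0.5 is the assumed cutoff; the names pred1 and tab1 match the error calculation that follows):

```r
# Classify as admitted (1) when predicted probability exceeds 0.5
pred1 <- ifelse(p1 > 0.5, 1, 0)
# Cross-tabulate predictions against actual outcomes
tab1 <- table(Predicted = pred1, Actual = train$admit)
tab1
```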
         Actual
Predicted   0   1
        0 208  73
        1  15  29
The table above is called a confusion matrix. 208 applicants were predicted not to be admitted and were in fact not admitted (predicted = actual), while 29 were predicted to be admitted and were in fact admitted. The off-diagonal cells are the errors: 73 were predicted not admitted but were actually admitted, and 15 were predicted admitted but were actually not admitted.
#Misclassification error - train data
1 - sum(diag(tab1))/sum(tab1)
[1] 0.2707692
#Misclassification error - test data
p2 <- predict(mymodel, test, type = 'response')
pred2 <- ifelse(p2>0.5, 1, 0)
tab2 <- table(Predicted = pred2, Actual = test$admit)
tab2
         Actual
Predicted  0  1
        0 48 20
        1  2  5
1 - sum(diag(tab2))/sum(tab2)
[1] 0.2933333
# Goodness-of-fit test
with(mymodel, pchisq(null.deviance - deviance, df.null-df.residual,
lower.tail = F))
Because the resulting p-value is very low, we can conclude that the model is statistically significant, i.e. it fits the data significantly better than the null (intercept-only) model.