
Homework 3

Jiawei Li
Sahil Bhagat
Shahrzad Baraeinezhad

Install and load packages


library(rmarkdown)
library(knitr)
library(effects)
library(arf3DS4)

## Loading required package: tcltk
## Loading required package: corpcor
## 
## Attaching package: 'arf3DS4'
## 
## The following object is masked from 'package:stats':
## 
##     BIC

library(MASS)
library(Rcmdr)

## Loading required package: splines
## Loading required package: RcmdrMisc
## Loading required package: car
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:effects':
## 
##     Prestige
## 
## Loading required package: sandwich
## Commander GUI
## 
## Attaching package: 'Rcmdr'
## 
## The following objects are masked from 'package:tcltk':
## 
##     tclvalue, tkfocus

library(stats)

Load the data file


data(Wells, package="effects")

Q1
glm1<-glm(switch~1,family=binomial(logit),data=Wells)

a. AIC = 4120.1
b. glm1$BIC is NULL (the fitted glm object does not store a BIC component)
summary(glm1)

## 
## Call:
## glm(formula = switch ~ 1, family = binomial(logit), data = Wells)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.308  -1.308   1.052   1.052   1.052  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.30296    0.03681    8.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4118.1  on 3019  degrees of freedom
## Residual deviance: 4118.1  on 3019  degrees of freedom
## AIC: 4120.1
## 
## Number of Fisher Scoring iterations: 4

glm1$BIC
## NULL
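
The NULL result reflects that a fitted glm object does not store a BIC component. If the BIC value itself is wanted, it can be computed with the BIC() function from stats; a minimal sketch:

# stats::BIC is called explicitly because arf3DS4 masks BIC (see the startup messages above).
# For 0/1 data the deviance equals -2*logLik, so this should be roughly 4118.1 + log(3020), i.e. about 4126.
stats::BIC(glm1)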

c. Since no predictors are included, the log odds of switching are simply the intercept: log odds = intercept = 0.303.
Q2
glm2<-glm(switch~distance+1,family=binomial(logit),data=Wells)
glm2

## 
## Call:  glm(formula = switch ~ distance + 1, family = binomial(logit), 
##     data = Wells)
## 
## Coefficients:
## (Intercept)     distance  
##    0.605959    -0.006219  
## 
## Degrees of Freedom: 3019 Total (i.e. Null);  3018 Residual
## Null Deviance:     4118 
## Residual Deviance: 4076   AIC: 4080

a. The AIC has decreased by 4120 - 4080 = 40.

summary(glm2)
## 
## Call:
## glm(formula = switch ~ distance + 1, family = binomial(logit), 
##     data = Wells)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4406  -1.3058   0.9669   1.0308   1.6603  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.6059594  0.0603102  10.047  < 2e-16 ***
## distance    -0.0062188  0.0009743  -6.383 1.74e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4118.1  on 3019  degrees of freedom
## Residual deviance: 4076.2  on 3018  degrees of freedom
## AIC: 4080.2
## 
## Number of Fisher Scoring iterations: 4

b. According to the model summary, distance is a significant predictor in this model.

glm2$coefficients

##  (Intercept)     distance 
##  0.605959365 -0.006218819 

c. The sign of the distance coefficient is negative, so the farther a household is from a safe well, the less likely the family is to switch.
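
For interpretation on the odds scale, the coefficients can be exponentiated; a small sketch using the glm2 object above:

# exp(coef) gives odds ratios: each extra meter of distance multiplies the odds
# of switching by exp(-0.00622) ~ 0.994, i.e. about a 0.6% decrease per meter.
exp(glm2$coefficients)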
d. Note that the intercept of glm2 is 0.606, not 0.303 (0.303 is the intercept-only estimate from Q1), so at distance = 0 the log odds would be 0.606 and the probability about 0.65. Using log odds = 0.303, the odds are exp(0.303) and the probability is:

p<-exp(0.303)/(exp(0.303)+1)
p

## [1] 0.5751757

The probability is 57.5%.
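
The manual exp(x)/(1 + exp(x)) conversion can also be done with the built-in inverse-logit function plogis(); a sketch:

# plogis() is the logistic CDF, i.e. the inverse logit.
plogis(0.303)                 # same probability as p above
plogis(glm2$coefficients[1])  # probability at distance = 0, using the glm2 intercept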


Q3 a. The half-way point is:

glm2$coefficients[2]

##     distance 
## -0.006218819 

x1<-(-1)*((glm2$coefficients[1])/(glm2$coefficients[2]))
x1

## (Intercept) 
##    97.43961 

x1 = 97.44

b. The slope at the half-way point:

slope=1/(exp(glm2$coefficients[1]+glm2$coefficients[2]*x1)+1)^2
slope

## (Intercept) 
##        0.25 

slope=0.25
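
The 0.25 above is the derivative of the inverse logit with respect to the linear predictor at the half-way point. If the slope of the fitted probability with respect to distance itself is of interest, it is that value times the distance coefficient; a minimal sketch using glm2 from above:

# dp/d(distance) = beta_distance * p * (1 - p); at the half-way point p = 0.5,
# so the slope is beta_distance / 4, about -0.00155 per meter.
glm2$coefficients[2] * 0.5 * (1 - 0.5)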
Q4
glm3<-glm(switch~.+1,family=binomial(logit),data=Wells)
summary(glm3)
## 
## Call:
## glm(formula = switch ~ . + 1, family = binomial(logit), data = Wells)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5942  -1.1976   0.7541   1.0632   1.6739  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.156712   0.099601  -1.573    0.116    
## arsenic         0.467022   0.041602  11.226  < 2e-16 ***
## distance       -0.008961   0.001046  -8.569  < 2e-16 ***
## education       0.042447   0.009588   4.427 9.55e-06 ***
## associationyes -0.124300   0.076966  -1.615    0.106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4118.1  on 3019  degrees of freedom
## Residual deviance: 3907.8  on 3015  degrees of freedom
## AIC: 3917.8
## 
## Number of Fisher Scoring iterations: 4

a.

According to the summary, arsenic, distance, and education are significant. For the significant predictors the coefficients make sense to interpret; the coefficient of the non-significant predictor (association) does not carry much meaning.

switch.best <- stepwise(glm3, direction='backward/forward', criterion='AIC')

## 
## Direction:  backward/forward
## Criterion:  AIC 
## 
## Start:  AIC=3917.83
## switch ~ arsenic + distance + education + association + 1
## 
##                Df Deviance    AIC
## <none>              3907.8 3917.8
## - association   1   3910.4 3918.4
## - education     1   3927.7 3935.7
## - distance      1   3985.2 3993.2
## - arsenic       1   4056.1 4064.1
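
The stepwise() function above comes from RcmdrMisc (loaded with Rcmdr). A similar AIC-based backward/forward selection could be done with stepAIC() from MASS, which is also loaded above; a sketch (the object name switch.best.alt is just illustrative):

# Backward/forward selection by AIC; expected to keep the same four-predictor model.
switch.best.alt <- stepAIC(glm3, direction = "both")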

summary(switch.best)
## 
## Call:
## glm(formula = switch ~ arsenic + distance + education + association + 
##     1, family = binomial(logit), data = Wells)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5942  -1.1976   0.7541   1.0632   1.6739  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.156712   0.099601  -1.573    0.116    
## arsenic         0.467022   0.041602  11.226  < 2e-16 ***
## distance       -0.008961   0.001046  -8.569  < 2e-16 ***
## education       0.042447   0.009588   4.427 9.55e-06 ***
## associationyes -0.124300   0.076966  -1.615    0.106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4118.1  on 3019  degrees of freedom
## Residual deviance: 3907.8  on 3015  degrees of freedom
## AIC: 3917.8
## 
## Number of Fisher Scoring iterations: 4

The significant predictors are arsenic, distance, and education, and the AIC of the model is 3917.8.
par(mfrow=c(4,3))
crPlots(switch.best,span=0.1)
## Warning in smoother(.x, partial.res[, var], col = col.lines[2], log.x =
## FALSE, : could not fit smooth

crPlots(switch.best,span=0.25)

## Warning in smoother(.x, partial.res[, var], col = col.lines[2], log.x =
## FALSE, : could not fit smooth

crPlots(switch.best,span=0.75)

crPlots(switch.best,span=0.90)

After trying several smoothing parameters (0.1, 0.25, 0.75, 0.90), the component-plus-residual plots give no indication that a quadratic effect should be added.

Q5
Wells$arsenic2<-(Wells$arsenic)^2
Wells$arseniclog<-log(Wells$arsenic)
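
As an alternative to adding columns to Wells, the transformed terms could be written directly in the model formula with I() and log(); a sketch (glm4b and glm5b are illustrative names, with the predictors of the dot-formula models below written out explicitly):

# Quadratic-arsenic and log-arsenic models specified without modifying the data frame.
glm4b <- glm(switch ~ distance + education + association + I(arsenic^2),
             family = binomial(logit), data = Wells)
glm5b <- glm(switch ~ distance + education + association + log(arsenic),
             family = binomial(logit), data = Wells)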

quadratic model
glm4<-glm(switch~.-arseniclog-arsenic,family=binomial(logit),data=Wells)
summary(glm4)
## 
## Call:
## glm(formula = switch ~ . - arseniclog - arsenic, family = binomial(logit), 
##     data = Wells)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1699  -1.2250   0.8184   1.0596   1.5998  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.251526   0.086568   2.906  0.00367 ** 
## distance       -0.007872   0.001021  -7.711 1.25e-14 ***
## education       0.041400   0.009504   4.356 1.32e-05 ***
## associationyes -0.128019   0.076361  -1.676  0.09364 .  
## arsenic2        0.080732   0.009229   8.747  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4118.1  on 3019  degrees of freedom
## Residual deviance: 3954.0  on 3015  degrees of freedom
## AIC: 3964
## 
## Number of Fisher Scoring iterations: 4

logarithm model
glm5<-glm(switch~.-arsenic2-arsenic,family=binomial(logit),data=Wells)
summary(glm5)
## 
## Call:
## glm(formula = switch ~ . - arsenic2 - arsenic, family = binomial(logit), 
##     data = Wells)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0742  -1.1760   0.7369   1.0382   1.8044  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.372386   0.084997   4.381 1.18e-05 ***
## distance       -0.009791   0.001061  -9.227  < 2e-16 ***
## education       0.042740   0.009656   4.426 9.59e-06 ***
## associationyes -0.123718   0.077431  -1.598     0.11    
## arseniclog      0.886959   0.068901  12.873  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4118.1  on 3019  degrees of freedom
## Residual deviance: 3875.6  on 3015  degrees of freedom
## AIC: 3885.6
## 
## Number of Fisher Scoring iterations: 4

compare AIC
glm4$aic
## [1] 3963.983
glm5$aic
## [1] 3885.602

The logarithm model has the lower AIC value.
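
The same comparison can be made in a single call with the AIC() helper; a sketch:

# Side-by-side AIC comparison of the quadratic and logarithmic models.
AIC(glm4, glm5)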


Chi-square (Wald) test
library(aod)
wald.test(b = coef(glm4), Sigma = vcov(glm4), Terms = 1:5)

## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 184.2, df = 5, P(> X2) = 0.0

wald.test(b = coef(glm5), Sigma = vcov(glm5), Terms = 1:5)

## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 265.5, df = 5, P(> X2) = 0.0

The X2 statistic is 184.2 for glm4 and 265.5 for glm5, each with df = 5. Both p-values are reported as 0, but at equal degrees of freedom the larger Wald statistic for glm5 corresponds to a smaller p-value, indicating stronger joint evidence for its coefficients and agreeing with its lower AIC.
Q6 a.
summary(glm5)
## 
## Call:
## glm(formula = switch ~ . - arsenic2 - arsenic, family = binomial(logit), 
##     data = Wells)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0742  -1.1760   0.7369   1.0382   1.8044  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.372386   0.084997   4.381 1.18e-05 ***
## distance       -0.009791   0.001061  -9.227  < 2e-16 ***
## education       0.042740   0.009656   4.426 9.59e-06 ***
## associationyes -0.123718   0.077431  -1.598     0.11    
## arseniclog      0.886959   0.068901  12.873  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4118.1  on 3019  degrees of freedom
## Residual deviance: 3875.6  on 3015  degrees of freedom
## AIC: 3885.6
## 
## Number of Fisher Scoring iterations: 4
glm5$coefficients
##    (Intercept)       distance      education associationyes     arseniclog 
##    0.372385812   -0.009791329    0.042740432   -0.123718388    0.886959380 

switch.logodds=(glm5$coefficients[1]+glm5$coefficients[2]*mean(Wells$distance)
  +glm5$coefficients[3]*mean(Wells$education)+glm5$coefficients[4]*mean(as.numeric(Wells$association))
  +glm5$coefficients[5]*mean(Wells$arseniclog))
switch.prob=exp(switch.logodds)/(exp(switch.logodds)+1)
switch.prob

## (Intercept) 
##    0.551782 

The probability is 0.552.


switch.logodds=(glm5$coefficients[1]+glm5$coefficients[2]*100
  +glm5$coefficients[3]*mean(Wells$education)+glm5$coefficients[4]*mean(as.numeric(Wells$association))
  +glm5$coefficients[5]*mean(Wells$arseniclog))
switch.prob=exp(switch.logodds)/(exp(switch.logodds)+1)
switch.prob

## (Intercept) 
##     0.42604 

0.552-0.426

## [1] 0.126

The probability is now 0.426, a reduction of 0.126.

2.


sd(Wells$education)

## [1] 4.017317

switch.logodds=(glm5$coefficients[1]+glm5$coefficients[2]*mean(Wells$distance)
  +glm5$coefficients[3]*(mean(Wells$education)+sd(Wells$education))
  +glm5$coefficients[4]*mean(as.numeric(Wells$association))
  +glm5$coefficients[5]*mean(Wells$arseniclog))
switch.prob=exp(switch.logodds)/(exp(switch.logodds)+1)
switch.prob

## (Intercept) 
##   0.5937706 

0.594-0.552

## [1] 0.042

The probability is now 0.594, an increase of 0.042.
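
As a cross-check on the hand-coded log odds above, predict(..., type = "response") returns the same kind of predicted probability while handling the association factor automatically (note that as.numeric() on a factor returns the level codes 1 and 2 rather than a 0/1 indicator, so the hand-coded terms weight the association coefficient slightly differently than the fitted model does). A minimal sketch, assuming association has levels "no" and "yes":

# Predicted probability of switching at mean distance, mean education,
# mean log-arsenic, and association = "no" (the baseline level).
newdat <- data.frame(
  distance    = mean(Wells$distance),
  education   = mean(Wells$education),
  arseniclog  = mean(Wells$arseniclog),
  association = factor("no", levels = levels(Wells$association))
)
predict(glm5, newdata = newdat, type = "response")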


Q7
glm6<-glm(switch~distance*arsenic+distance+arsenic,family=binomial(logit),data=Wells)
summary(glm6)

## 
## Call:
## glm(formula = switch ~ distance * arsenic + distance + arsenic, 
##     family = binomial(logit), data = Wells)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7823  -1.2004   0.7696   1.0816   1.8476  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.147868   0.117538  -1.258  0.20838    
## distance         -0.005772   0.002092  -2.759  0.00579 ** 
## arsenic           0.555977   0.069319   8.021 1.05e-15 ***
## distance:arsenic -0.001789   0.001023  -1.748  0.08040 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4118.1  on 3019  degrees of freedom
## Residual deviance: 3927.6  on 3016  degrees of freedom
## AIC: 3935.6
## 
## Number of Fisher Scoring iterations: 4

The coefficients for distance and arsenic have the same signs as in the models above. The interaction coefficient is interpreted as the change in the effect of one predictor per unit increase in the other (here, how the effect of distance changes with arsenic). It is not very meaningful in this model, since the distance:arsenic term is only weakly significant, with a p-value above 0.05.
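
To make the interaction concrete, the effect of distance on the log odds depends on the arsenic level; a small sketch (the arsenic values 0.5 and 2.0 are arbitrary illustration points, not taken from the assignment):

# Change in log odds per extra meter of distance: b_distance + b_interaction * arsenic.
b <- coef(glm6)
b["distance"] + b["distance:arsenic"] * 0.5   # at arsenic = 0.5
b["distance"] + b["distance:arsenic"] * 2.0   # at arsenic = 2.0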
Q8 Centering distance and arsenic

Wells$centereddistance<-Wells$distance-mean(Wells$distance)
Wells$centeredarsenic<-Wells$arsenic-mean(Wells$arsenic)
glm7<-glm(switch~centereddistance*centeredarsenic+centereddistance+centeredarsenic,family=binomial(logit),data=Wells)
summary(glm7)
## 
## Call:
## glm(formula = switch ~ centereddistance * centeredarsenic + centereddistance + 
##     centeredarsenic, family = binomial(logit), data = Wells)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7823  -1.2004   0.7696   1.0816   1.8476  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       0.351094   0.039852   8.810   <2e-16 ***
## centereddistance                 -0.008737   0.001048  -8.337   <2e-16 ***
## centeredarsenic                   0.469508   0.042074  11.159   <2e-16 ***
## centereddistance:centeredarsenic -0.001789   0.001023  -1.748   0.0804 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4118.1  on 3019  degrees of freedom
## Residual deviance: 3927.6  on 3016  degrees of freedom
## AIC: 3935.6
## 
## Number of Fisher Scoring iterations: 4

The coefficient of centereddistance is negative, so distances above the mean are associated with lower odds of switching, while the coefficient of centeredarsenic is positive, so arsenic levels above the mean are associated with higher odds of switching. The interaction coefficient is unchanged by centering, but the main-effect coefficients differ from the uncentered model because centering changes the point at which each main effect is evaluated. They remain meaningful: their p-values are all below 0.001, indicating high statistical significance.
Effect plot
plot(glm7)
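
Note that plot() on a glm object produces the standard regression diagnostic plots. An effect display in the sense of the effects package loaded at the top would typically be produced with allEffects(); a sketch:

# Effect displays (including the distance-by-arsenic interaction) for glm7.
plot(allEffects(glm7))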
