GLM Sol
Solution 1
> #(i)Open file in R
> indices<-read.csv(file.choose(),header=TRUE);indices
Month Sensex BM CD EN FM FI HC
1 Feb-06 0.0444 0.0364 0.0291 -0.0179 0.1233 0.0103 0.0769
2 Mar-06 0.0841 0.1632 0.0783 0.1032 0.1158 0.0185 0.0822
3 Apr-06 0.0654 0.1513 0.0415 0.1518 0.0445 -0.0093 0.0095
4 May-06 -0.1468 -0.1939 -0.0884 -0.1015 -0.2018 -0.0885 -0.1394
5 Jun-06 0.0201 -0.0252 -0.0691 0.0347 0.0314 -0.0806 -0.0784
6 Jul-06 0.0126 -0.0177 -0.0280 -0.0478 -0.0361 0.0536 0.0298
(….continue)
Call:
glm(formula = indices$Sensex_direction ~ indices$BM + indices$CD +
indices$EN + indices$FM + indices$FI + indices$HC + indices$IN +
indices$IT + indices$TE + indices$UT, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.27544 -0.00117 0.00000 0.01354 1.75651
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0086 0.7315 -1.379 0.16796
indices$BM 7.7977 16.5255 0.472 0.63703
indices$CD -87.5335 42.6785 -2.051 0.04027 *
indices$EN 93.9675 38.3193 2.452 0.01420 *
indices$FM 41.1745 20.1436 2.044 0.04095 *
indices$FI 172.8807 60.8192 2.843 0.00448 **
indices$HC -6.4294 13.9394 -0.461 0.64463
indices$IN 4.1735 18.2152 0.229 0.81877
indices$IT 78.3494 30.9307 2.533 0.01131 *
indices$TE 29.9111 13.4184 2.229 0.02581 *
indices$UT -14.4767 23.0602 -0.628 0.53015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(iv)
The sectors with a significant impact on the direction of Sensex returns at the 5%
significance level (95% confidence) are CD, EN, FI, FM, IT and TE. Only FI is also
significant at the 1% level (99% confidence).
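The reading of the coefficient table can be reproduced mechanically. The following is a minimal sketch in R, with the p-values copied by hand from the Pr(>|z|) column of the summary output; the names p_values, sig_5 and sig_1 are introduced here for illustration only.

```r
# p-values copied from the Pr(>|z|) column of the summary output
p_values <- c(BM = 0.63703, CD = 0.04027, EN = 0.01420, FM = 0.04095,
              FI = 0.00448, HC = 0.64463, IN = 0.81877, IT = 0.01131,
              TE = 0.02581, UT = 0.53015)

# Sectors significant at the 5% level
sig_5 <- names(p_values)[p_values < 0.05]

# Sectors significant at the 1% level
sig_1 <- names(p_values)[p_values < 0.01]

print(sig_5)
print(sig_1)
```

With the fitted model object available, the same column could be taken from coef(summary(glmmodel)) instead of being typed in.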
CHAPTER 12 SOLUTION PAPER B
(v) Outlier
> outlier_position<-which(glmmodel$residuals==min(glmmodel$residuals));outlier_position
41
41
> indices$Month[outlier_position]
[1] Jun-09
164 Levels: Apr-06 Apr-07 Apr-08 Apr-09 Apr-10 Apr-11 ... Sep-19
(vi)
Interpretation [2]
The residual deviance fell substantially, from a null deviance of 223.21 to 32.90, so
the sector returns classify the direction of Sensex returns well.
One large outlier (Jun-09) may affect the accuracy of the results; removing it may
reduce the residual deviance further.
The explanatory variables are not independent of one another: correlations among the
sectors are very high. Hence the standard errors may not be reliable.
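As a rough check on the first point, the drop in deviance can be compared with a chi-squared distribution. This is a sketch only: the two deviances are the figures quoted in the interpretation, and the 10 degrees of freedom are assumed from the 10 sector covariates in the model formula.

```r
# Deviances quoted in the interpretation
null_dev  <- 223.21
resid_dev <- 32.90

# The model has 10 sector covariates, so the drop in deviance is
# compared with a chi-squared distribution on 10 degrees of freedom
dev_drop <- null_dev - resid_dev
df_drop  <- 10
p_value  <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)

print(dev_drop)
print(p_value)
```

The p-value is effectively zero, which agrees with the statement that the covariates explain the direction well. The correlation point could be checked by applying cor() to the sector columns of the data frame, which is not reproduced here.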
(vii)
> # Refit the model
> model2<-glm(indices$Sensex_direction~indices$CD+indices$EN+indices$FI+indices$IT+indices$TE+indices$FM,family = binomial(link="logit"))
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
(viii)
> summary(glmmodel)
Call:
glm(formula = indices$Sensex_direction ~ indices$BM + indices$CD +
indices$EN + indices$FM + indices$FI + indices$HC + indices$IN +
indices$IT + indices$TE + indices$UT, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.27544 -0.00117 0.00000 0.01354 1.75651
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Interpretation [1]
The p-value of the comparison is 0.94 > 0.05, so we do not reject the null hypothesis
of no significant difference between the two models. The friend's suggestion therefore
did not significantly improve the model.
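The mechanics of such a comparison can be sketched on simulated data. Everything below (the variables x1, x2, y and the model names full and reduced) is hypothetical and is not the Sensex data; it only illustrates how anova() with test = "Chisq" compares two nested logistic models.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)   # unrelated to the response by construction
y  <- rbinom(n, size = 1, prob = plogis(0.5 * x1))

# Two nested logistic regression models
full    <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
reduced <- glm(y ~ x1,      family = binomial(link = "logit"))

# Analysis of deviance (likelihood ratio) test of the nested models
comparison <- anova(reduced, full, test = "Chisq")
p_value    <- comparison[2, "Pr(>Chi)"]

print(comparison)
```

A p-value above 0.05 means the extra terms in the larger model do not give a significant reduction in deviance, so the smaller model is an adequate description of the data.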
Solution 2
(i) All values are non-negative integers, with some values greater than 1, so a Poisson distribution
is an appropriate error structure
(ii)
model <- glm(formula = Claim.number ~ Age + factor(Car.Group) + Area + factor(NCD) + Gender, data = datatrain, family = poisson())
summary(model)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3203 -0.5624 -0.4445 -0.3185 3.4254
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.895413 0.245584 -7.718 1.18e-14 ***
Age -0.008371 0.001231 -6.802 1.03e-11 ***
factor(Car.Group)2 0.283162 0.285122 0.993 0.320648
factor(Car.Group)3 0.311122 0.282852 1.100 0.271356
factor(Car.Group)4 0.116037 0.292479 0.397 0.691563
factor(Car.Group)5 0.578672 0.265600 2.179 0.029352 *
factor(Car.Group)6 0.912683 0.251178 3.634 0.000279 ***
factor(Car.Group)7 0.260364 0.287397 0.906 0.364968
factor(Car.Group)8 0.914236 0.253831 3.602 0.000316 ***
factor(Car.Group)9 0.877120 0.252408 3.475 0.000511 ***
factor(Car.Group)10 0.914426 0.250011 3.658 0.000255 ***
factor(Car.Group)11 0.799044 0.250434 3.191 0.001420 **
factor(Car.Group)12 1.025303 0.245168 4.182 2.89e-05 ***
factor(Car.Group)13 1.011678 0.248305 4.074 4.61e-05 ***
factor(Car.Group)14 1.118560 0.242695 4.609 4.05e-06 ***
factor(Car.Group)15 1.103179 0.244711 4.508 6.54e-06 ***
factor(Car.Group)16 0.996932 0.247455 4.029 5.61e-05 ***
factor(Car.Group)17 1.128584 0.242389 4.656 3.22e-06 ***
factor(Car.Group)18 1.198728 0.239721 5.001 5.72e-07 ***
factor(Car.Group)19 1.422781 0.238307 5.970 2.37e-09 ***
factor(Car.Group)20 1.317913 0.238579 5.524 3.31e-08 ***
AreaEast Midlands 0.146664 0.132611 1.106 0.268739
AreaLondon 0.318305 0.126773 2.511 0.012045 *
AreaNI 0.393303 0.125065 3.145 0.001662 **
AreaNorth East -0.060812 0.138426 -0.439 0.660439
AreaNorth West -0.193799 0.143745 -1.348 0.177590
AreaSouth East -0.323157 0.151830 -2.128 0.033303 *
AreaSouth West -0.097663 0.142546 -0.685 0.493256
AreaWales -0.309704 0.148283 -2.089 0.036744 *
AreaWest Midlands -0.068206 0.141474 -0.482 0.629730
AreaYorkshire and the Humber 0.100272 0.132276 0.758 0.448421
factor(NCD)1 -0.456078 0.086523 -5.271 1.36e-07 ***
factor(NCD)2 -0.679426 0.091303 -7.441 9.96e-14 ***
factor(NCD)3 -0.885118 0.096849 -9.139 < 2e-16 ***
factor(NCD)4 -0.963244 0.101502 -9.490 < 2e-16 ***
factor(NCD)5 -1.097532 0.104299 -10.523 < 2e-16 ***
GenderMale 0.259982 0.064879 4.007 6.14e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4569.9 on 7999 degrees of freedom
Residual deviance: 4085.6 on 7963 degrees of freedom
AIC: 6455.7 [10]
(ii) Male policyholders have a higher mean number of reported claims (by exp(0.259982) − 1 = 29.7%) than female policyholders.
The difference is significant (p-value = 6.14e-5).[10]
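The 29.7% figure follows from the log link: exponentiating a coefficient gives the multiplicative effect on the mean. A short sketch of the arithmetic in R, using the GenderMale estimate from the summary output (the variable names are illustrative):

```r
# GenderMale coefficient from the summary output
beta_male <- 0.259982

# With a log link, exp(beta) is the multiplicative effect on the mean
rate_ratio   <- exp(beta_male)
pct_increase <- (rate_ratio - 1) * 100

print(round(pct_increase, 1))
```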
(iii) Compared with the null model, the deviance is reduced by 484.3 while the degrees of freedom reduce by 36.
The observed difference in deviance (484.3) is very high compared with the values of the $\chi^2_{36}$
distribution, so the fitted model is significant/good (alternatively, compare the deviance of the fitted
model (4085.6) to the $\chi^2_{7963}$ distribution.)
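The comparison with the chi-squared distribution can be made explicit. A minimal sketch, using the null and residual deviances and degrees of freedom from the summary output; the 5% level used for the critical value is an assumption, since the solution does not state one.

```r
# Figures from the summary output
null_dev  <- 4569.9; null_df  <- 7999
resid_dev <- 4085.6; resid_df <- 7963

dev_drop <- null_dev - resid_dev   # reduction in deviance
df_drop  <- null_df - resid_df     # number of parameters added

# Upper 5% point of chi-squared on df_drop degrees of freedom (assumed level)
critical <- qchisq(0.95, df = df_drop)
p_value  <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)

print(c(dev_drop = dev_drop, df_drop = df_drop, critical = critical))
print(p_value)
```

The reduction in deviance is far above the critical value, which is what "very high compared with the values of the chi-squared distribution" means in practice.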
(iv)
(a) datatrain$Age2 <- datatrain$Age^2 [2]
(b)
model1 <- glm(formula = Claim.number ~ Age + Age2 +
factor(Car.Group) + Area + factor(NCD) + Gender, data = datatrain,
family = poisson())
summary(model1)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2760 -0.5532 -0.4261 -0.3063 3.2856
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.118e-01 2.816e-01 -1.462 0.143613
Age -7.268e-02 6.329e-03 -11.484 < 2e-16 ***
Age2 5.603e-04 5.405e-05 10.366 < 2e-16 ***
factor(Car.Group)2 2.913e-01 2.852e-01 1.022 0.307011
factor(Car.Group)3 2.854e-01 2.829e-01 1.009 0.313049
factor(Car.Group)4 1.361e-01 2.925e-01 0.465 0.641617
factor(Car.Group)5 5.608e-01 2.656e-01 2.111 0.034750 *
factor(Car.Group)6 8.935e-01 2.512e-01 3.557 0.000376 ***
factor(Car.Group)7 2.729e-01 2.875e-01 0.949 0.342393
factor(Car.Group)8 9.135e-01 2.539e-01 3.598 0.000320 ***
factor(Car.Group)9 8.766e-01 2.524e-01 3.473 0.000515 ***
factor(Car.Group)10 9.127e-01 2.501e-01 3.650 0.000263 ***
factor(Car.Group)11 8.012e-01 2.505e-01 3.198 0.001383 **
factor(Car.Group)12 1.039e+00 2.452e-01 4.239 2.25e-05 ***
factor(Car.Group)13 9.643e-01 2.483e-01 3.883 0.000103 ***
factor(Car.Group)14 1.109e+00 2.427e-01 4.568 4.93e-06 ***
factor(Car.Group)15 1.111e+00 2.445e-01 4.543 5.54e-06 ***
factor(Car.Group)16 9.760e-01 2.474e-01 3.945 7.99e-05 ***
factor(Car.Group)17 1.131e+00 2.424e-01 4.667 3.06e-06 ***
factor(Car.Group)18 1.188e+00 2.397e-01 4.957 7.14e-07 ***
factor(Car.Group)19 1.420e+00 2.383e-01 5.962 2.49e-09 ***
factor(Car.Group)20 1.322e+00 2.386e-01 5.540 3.03e-08 ***
AreaEast Midlands 9.654e-02 1.327e-01 0.727 0.466981
AreaLondon 3.190e-01 1.268e-01 2.516 0.011863 *
AreaNI 3.837e-01 1.251e-01 3.067 0.002161 **
AreaNorth East -6.318e-02 1.385e-01 -0.456 0.648255
AreaNorth West -2.015e-01 1.438e-01 -1.401 0.161092
AreaSouth East -3.268e-01 1.519e-01 -2.151 0.031438 *
AreaSouth West -1.148e-01 1.426e-01 -0.805 0.420806
AreaWales -3.166e-01 1.484e-01 -2.134 0.032820 *
AreaWest Midlands -8.999e-02 1.415e-01 -0.636 0.524886
AreaYorkshire and the Humber 1.060e-01 1.323e-01 0.801 0.423004
factor(NCD)1 -4.530e-01 8.657e-02 -5.233 1.67e-07 ***
factor(NCD)2 -6.705e-01 9.130e-02 -7.344 2.07e-13 ***
factor(NCD)3 -8.753e-01 9.692e-02 -9.031 < 2e-16 ***
(c) The p-value of the age-squared coefficient shows that it is significant. Also, the deviance is reduced
by more than twice the change in degrees of freedom. So the variable is significantly associated with the
number of reported claims.
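The significance of the Age2 coefficient can be verified from its estimate and standard error. A short sketch of the Wald test arithmetic in R, with both figures taken from the model1 summary output (variable names are illustrative):

```r
# Age2 row of the model1 summary output
estimate  <- 5.603e-04
std_error <- 5.405e-05

# Wald statistic and two-sided p-value under a standard normal reference
z_value <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_value))

print(round(z_value, 3))
print(p_value)
```

The statistic matches the z value shown in the output, and the p-value is below the 2e-16 reporting threshold used by summary().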
Solution 3
Linear predictor for modelling:
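(a) $\alpha_i + \beta \times \text{temp}$, where the intercept $\alpha_i$, $i = 1, 2$, depends on the semester [2]
(b) $\alpha_i + \beta_i \times \text{temp}$, where $\alpha_i$ is as above and the slope $\beta_i$, $i = 1, 2$, also depends on the semester [2]
(c) $\alpha_i + \beta_i \times \text{temp} + \gamma_j$, with $\alpha_i$, $\beta_i$ as above, where $\gamma_j$, $j = 1, 2$, depends on the route [2]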
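These three linear predictors correspond to R model formulae, which can be inspected through their design matrices. The sketch below uses a hypothetical toy data frame; in particular the level name "holiday" is invented for illustration, since the source only shows the levels "semester", "8am" and "9am". R's default treatment contrasts express each factor as a baseline plus a difference rather than as separate parameters per level, but the fitted models are equivalent.

```r
# Hypothetical toy data; "holiday" is an invented baseline level
toy <- data.frame(
  temp     = c(-2, 0, 3, 5),
  semester = factor(c("holiday", "semester", "holiday", "semester")),
  route    = factor(c("8am", "9am", "9am", "8am"))
)

# (a) intercept depends on semester, common slope for temp
cols_a <- colnames(model.matrix(~ temp + semester, toy))
# (b) intercept and slope both depend on semester
cols_b <- colnames(model.matrix(~ temp * semester, toy))
# (c) as (b), plus an additive route effect
cols_c <- colnames(model.matrix(~ temp * semester + route, toy))

print(cols_a)
print(cols_b)
print(cols_c)
```

The column names for (c) are the same as the coefficient names in a fit of Passengers ~ temp * semester + route.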
Call:
glm(formula = Passengers ~ temp * semester + route, family = poisson(link = "log"))
Deviance Residuals:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.40210 0.31155 1.291 0.1968
temp -0.07878 0.03576 -2.203 0.0276 *
semestersemester 0.53514 0.46691 1.146 0.2517 [1]
route9am 0.17370 0.44520 0.390 0.6964
temp:semestersemester 0.10779 0.05741 1.878 0.0604 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 30.406 on 19 degrees of freedom
Residual deviance: 13.833 on 15 degrees of freedom
AIC: 62.187
(b)
Temperature (temp) is significant at the 5% level (p = 0.0276)
Semester is not significant (p = 0.2517)
Route is not significant (p = 0.6964)
The interaction between temperature (temp) and semester is not significant at the 5% significance level
(p = 0.0604), but it is close to being significant
(c)
>Model2<- update(Model1,~.-route) [2]
Or,
Model2 <- glm(Passengers~temp*semester,family=poisson(link = "log"))
>summary(Model2)
Call:
glm(formula = Passengers ~ temp + semester + temp:semester, family = poisson(link
= "log"))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.84542 -0.66323 -0.06209 0.43732 1.34790
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.44284 0.29121 1.521 0.1283
temp -0.07452 0.03387 -2.200 0.0278 *
semestersemester 0.54602 0.46390 1.177 0.2392
temp:semestersemester 0.10012 0.05316 1.883 0.0597 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 30.406 on 19 degrees of freedom
Residual deviance: 13.982 on 16 degrees of freedom
AIC: 60.33
(b) The AIC has fallen from 62.187 to 60.336, so the new model is an improvement on the initial model
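The size of the fall in AIC can be reconciled with the two residual deviances. For two models from the same family fitted to the same data, the AIC difference equals the deviance difference plus twice the difference in the number of parameters. A short sketch using the figures from the two summary outputs (variable names are illustrative):

```r
# Initial model (with route): 5 coefficients
aic_1 <- 62.187; dev_1 <- 13.833; k_1 <- 5
# Model without route: 4 coefficients
aic_2 <- 60.336; dev_2 <- 13.982; k_2 <- 4

# Observed fall in AIC
aic_drop <- aic_1 - aic_2

# Fall implied by the deviances and the parameter counts
implied_drop <- (dev_1 - dev_2) + 2 * (k_1 - k_2)

print(c(aic_drop = aic_drop, implied_drop = implied_drop))
```

Dropping route raises the deviance by only 0.149, which is less than the penalty of 2 saved by removing one parameter, hence the lower AIC.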
(iv)(a)
>Model3<- glm(Passengers~temp+temp:semester,family=poisson(link="log"))
>Modela<- glm(Passengers~temp+ semester,family=poisson(link="log")) [2]
>Modelb<- glm(Passengers~temp*semester,family=poisson(link="log")) [2]
>Model3$aic
59.65976 [1]
>Modela$aic
62.03591
>Modelb$aic
60.33588 [1]
Model3 has the lowest AIC compared with the other models. We conclude that Model3
outperforms the other models considered here [1]
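The AIC comparison can be written as a small lookup. A sketch using the three AIC values printed in the output (the vector name aic is introduced here for illustration):

```r
# AIC values reported for the three candidate models
aic <- c(Model3 = 59.65976, Modela = 62.03591, Modelb = 60.33588)

# Model with the smallest AIC, and each model's gap to the best
best  <- names(which.min(aic))
delta <- aic - min(aic)

print(best)
print(round(delta, 3))
```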
(b)
Model3 includes the temp:semester interaction without the semester main effect, so it does
not contain both main effects. Despite this, the model still suits the data well [1]
(v)(a)
> plot(Model3,1) [2]
(b)
The residuals plot shows no patterns - exhibiting a fairly random scatter around zero with
constant variance and no outliers [2]
The plot suggests that the model is appropriate [1]
(vi)
>predict(Model3, data.frame(temp=0,semester="semester",route="8am"),type =
"response") [3]
Predicted number is: 1.866568 [1]
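Because Model3 contains only temp and the temp:semester interaction, every term other than the intercept vanishes at temp = 0, so the prediction is exp(intercept) whatever the semester or route. The Model3 coefficients are not shown in the output, so the intercept below is inferred from the prediction rather than quoted.

```r
# Predicted number of passengers at temp = 0, from the output
predicted <- 1.866568

# With a log link and temp = 0, the prediction equals exp(intercept),
# so the intercept of Model3 can be recovered by taking logs
implied_intercept <- log(predicted)

print(round(implied_intercept, 4))
```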
Solution 4
# Open the R data file: File -> Open -> select the file, then check that the data have been
# added to the global environment. Or:
> policydata<-load(file.choose());policydata
[1] "n.policies" "sex.code" "class.code"
It now seems that the number of claims also depends on the gender of policyholders. [1]
The numbers are generally higher for males. [1]
(iii)
Call:
glm(formula = n.policies ~ class.code, family = "poisson")
Deviance Residuals:
1 2 3 4 5 6
0.6865 1.1476 2.1526 -0.1806 0.1879 -0.7118
7 8 9 10
-1.2165 -2.3899 0.1787 -0.1901
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.7257 0.1098 33.943 <2e-16 ***
class.code2 0.1029 0.1514 0.680 0.4965
class.code3 0.2540 0.1463 1.736 0.0825 .
class.code4 -0.2917 0.1679 -1.738 0.0822 .
class.code5 -0.3935 0.1729 -2.275 0.0229 *
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(iv)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.8611 0.1180 32.732 < 2e-16 ***
class.code2 0.1029 0.1514 0.680 0.49648
class.code3 0.2540 0.1463 1.736 0.08248 .
class.code4 -0.2917 0.1679 -1.738 0.08225 .
class.code5 -0.3935 0.1729 -2.275 0.02288 *
sex.code2 -0.2921 0.1011 -2.890 0.00386 **
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(v)
The null hypothesis is that the second model (including both factors) is not an improvement over the first
model.
anova(glm1,glm2,test = "Chisq")
Analysis of Deviance Table
(vi)
> predict(glm2, data.frame(class.code="2", sex.code="1"), type="response")
1
52.67
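The predicted value can be checked by hand from the coefficients in part (iv). A short sketch of the arithmetic; sex.code 1 is the baseline level, so only the intercept and the class.code2 coefficient enter the linear predictor (variable names are illustrative).

```r
# Coefficients from the part (iv) output
intercept   <- 3.8611
class_code2 <- 0.1029

# sex.code = 1 is the baseline level, so it adds nothing
linear_predictor <- intercept + class_code2

# Invert the log link to get the expected count
predicted <- exp(linear_predictor)

print(round(predicted, 2))
```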