Linear Regression Model
Final Project
Fall 2011
Abstract
The objective of this project is to develop a model that helps the government of Country B identify the
factors that predict outbreaks of Virus A infection, as input for future virus-prevention policy.
The data set used to develop the model is a sample of outbreaks of Virus A infection in 100 geographical
areas of the country. The response variable is the percentage of the area population infected during
the winter.
First, a preliminary analysis of the data is made to select the candidate explanatory variables. Econometric
model-selection methods are then used to find the model that best fits the sample data.
Second, diagnostics on the residuals of the model are performed to check for missing explanatory
variables, outliers that could influence future predictions, multicollinearity and other issues.
Finally, the final model is presented together with the conclusions.
Preliminary Analysis
The preliminary analysis starts with the list of candidate explanatory variables, followed by an analysis
of their correlations with each other and with the response variable, the percentage of the area
population infected during the winter (V1).
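List of candidate variables:
1. perc.male
2. perc.college
3. perc.child
4. num.bed
5. num.phys
6. perc.vac
7. population
8. household.size
9. household.income
10. perc.insured
11. geo.size
12. urban
13. camp.virus
14. temp
15. public.per.cap
16. employ
17. rain
18. region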
The following table shows the correlation between each explanatory variable and the response
variable (V1). Note that the percentage of males, of people with a college degree and of insured people,
the employment rate and the average amount of rain during the winter show a correlation with V1 close
to zero, so they are candidates to be dropped from the model.
Correlation between explanatory and response variables
Variable            Correlation with V1
perc.male           -0.0538
perc.college        -0.0162
perc.child           0.1099
num.bed              0.5042
num.phys             0.5087
perc.vac            -0.1311
population           0.4305
household.size       0.1520
household.income     0.1507
perc.insured        -0.0267
geo.size            -0.6037
urban                0.4378
camp.virus          -0.3683
temp                -0.2149
public.per.cap      -0.4556
employ              -0.0072
rain                 0.0118
region              -0.4230
On the other hand, the number of beds and physicians, the population, the size of the area, the urban
indicator variable, the public space and the region show a correlation with V1 greater than 0.40 in
absolute value. They are natural candidates for the model.
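The screening rule used above can be written down explicitly. The following is a minimal sketch in Python (standard library only), not the code used in the project: `pearson` and `screen` are illustrative helper names, and the 0.40 cutoff is the one mentioned in the text. The correlations are the values reported in the table.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def screen(correlations, threshold=0.40):
    """Split variables by the absolute value of their correlation with V1."""
    strong = sorted(name for name, r in correlations.items() if abs(r) > threshold)
    weak = sorted(name for name, r in correlations.items() if abs(r) <= threshold)
    return strong, weak

# Correlations with V1 as reported in the table above.
reported = {
    "perc.male": -0.0538, "perc.college": -0.0162, "perc.child": 0.1099,
    "num.bed": 0.5042, "num.phys": 0.5087, "perc.vac": -0.1311,
    "population": 0.4305, "household.size": 0.1520, "household.income": 0.1507,
    "perc.insured": -0.0267, "geo.size": -0.6037, "urban": 0.4378,
    "camp.virus": -0.3683, "temp": -0.2149, "public.per.cap": -0.4556,
    "employ": -0.0072, "rain": 0.0118, "region": -0.4230,
}

strong, weak = screen(reported)
print(strong)
```

The `strong` list contains exactly the seven variables named in the paragraph above. A low marginal correlation is only a first filter: a variable can still be useful once other variables are in the model, which is why the final choice is left to the model selection step.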
To anticipate possible multicollinearity problems, it is important to check the correlation matrix of the
explanatory variables. This matrix shows high correlation among the percentage of males, of people
with a college degree and of children of school age. The same holds for the number of beds and the
number of physicians, which are highly correlated with each other and with the total population and the
urban indicator variable. Finally, the household size and the household income are almost perfectly
correlated (r = 0.999) [1].
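Some other candidate variables are considered for the model:
1. phys.bed
2. perc.public.area
3. hab.km2
4. households
5. total.income
6. num.unemploy
7. vac.km2
8. income.per.cap
9. phys.per.hab
10. beds.per.hab
11. phys.per.house
12. phys.per.km2
13. phys.per.child
14. beds.per.km2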
Again, it is important to check the relation between these added variables and the response variable
with scatter plots; if they show some relation, we include them in the set of candidate variables [2].
We decided to add inhabitants per km2, percentage of public area, number of unemployed, vaccines per
km2, physicians per bed, physicians per km2 and beds per km2.
All these variables seem to have some correlation with the response variable, so they are included.
However, some of them are highly correlated with each other (for example, physicians per km2 and beds
per km2), so multicollinearity has to be watched.
[1] The table with all the correlations is in Appendix A.
[2] The pairwise plots of the variables and the added variables are in Appendix A.
Model selection
The objective of this section is to find the model that best fits the sample data, using the selected
candidate variables.
The selection method is stepwise regression, starting from the model V1~1 [3] and using the AIC
criterion to add and delete variables.
The final step gives AIC=-1344.3 for the model
V1 ~ hab.km2 + region2 + region1 + camp.virus + perc.vac + perc.child + employ + public.per.cap +
phys.bed + households + temp
and the coefficients are the following:
Coefficients of Model 1
                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)         0.035       0.004    7.925    0.0000
perc.child          0.036       0.017    2.144    0.0348
perc.vac           -0.004       0.001   -3.718    0.0004
camp.virus         -0.002       0.000   -6.381    0.0000
temp                0.000       0.000    1.610    0.1109
public.per.cap     -0.469       0.161   -2.921    0.0044
employ             -0.007       0.004   -1.697    0.0932
region1TRUE         0.002       0.000    6.095    0.0000
region2TRUE         0.002       0.000    6.775    0.0000
phys.bed            0.009       0.005    1.905    0.0600
hab.km2             0.000       0.000   10.923    0.0000
households          0.000       0.000   -1.692    0.0942
Individual t-tests were performed to test the significance of the coefficients. As can be seen, the
p-values of the coefficients for temperature, employment rate, physicians per bed and number of
households are greater than 5%. However, we use a lack-of-fit F-test between the two nested models to
decide whether to continue with this model or with a model without these four variables.
Analysis of Variance Table
Model2 (row 1): V1 ~ perc.child + perc.vac + camp.virus + public.per.cap + region1 + region2 + hab.km2
Model1 (row 2): V1 ~ perc.child + perc.vac + camp.virus + temp + public.per.cap + employ +
                region1 + region2 + phys.bed + hab.km2 + households
     Res.Df  RSS         Df  Sum of Sq  F       Pr(>F)
1    92      0.00012897
2    88      0.00011417  4   1.48E-05   2.8531  0.02829
The reduction in the RSS is quite small, so we might prefer to keep the 4 extra degrees of freedom.
However, the p-value (0.028) leads us to reject H0: b3=b5=b6=b8=0 at the 5% level, so we stay with Model 1.
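The F value in the table can be checked by hand from the two residual sums of squares. A minimal sketch in Python follows; `nested_f_statistic` is an illustrative name, and the inputs are the RSS and residual degrees of freedom reported above.

```python
def nested_f_statistic(rss_reduced, df_reduced, rss_full, df_full):
    """F statistic for comparing a reduced model against a full model that nests it."""
    extra_df = df_reduced - df_full          # number of coefficients set to zero under H0
    numerator = (rss_reduced - rss_full) / extra_df
    denominator = rss_full / df_full         # error mean square of the full model
    return numerator / denominator

# RSS and residual degrees of freedom from the analysis of variance table.
f_value = nested_f_statistic(0.00012897, 92, 0.00011417, 88)
print(round(f_value, 2))
```

The result agrees with the reported 2.8531 up to the rounding of the printed RSS values. The p-value is the upper tail of an F distribution with 4 and 88 degrees of freedom at this value.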
[3] The stepwise search was started from the model V1~1 because it reaches a lower AIC value than
starting from the full model.
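The reported AIC can be reproduced from the residual sum of squares. The sketch below assumes the constant-free form n*ln(RSS/n) + 2p that stepwise routines commonly report for least squares fits, with n = 100 areas and p counting the intercept; `aic_from_rss` is an illustrative name, not a function from the project.

```python
import math

def aic_from_rss(rss, n_obs, n_params):
    """AIC of a least squares fit, up to an additive constant: n*ln(RSS/n) + 2p."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params

# Model 1: intercept plus 11 explanatory variables, RSS from the ANOVA table.
aic_full = aic_from_rss(0.00011417, n_obs=100, n_params=12)
# Reduced model: intercept plus 7 explanatory variables.
aic_reduced = aic_from_rss(0.00012897, n_obs=100, n_params=8)

print(round(aic_full, 1), round(aic_reduced, 1))
```

Under these assumptions the first value matches the AIC=-1344.3 of the last stepwise step, and the reduced model has a higher (worse) AIC, which is consistent with keeping Model 1.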
Residuals analysis
Once we have chosen the best model, we check its residuals. The following graph shows that the
residuals seem to follow a quadratic function, which suggests that a transformation of the response
variable is probably needed.
The Box-Cox transformation was used to identify the power of the required transformation. On the
other hand, the residuals against the fitted values look like white noise.
[Figure: Residuals. Panels: model1$res against Index, and model1$res against model1$fit.]
To check the normality of the residuals we use the Normal Q-Q plot and the histogram. Based on these
two graphs and the previous one, we conclude that the residuals follow a normal distribution, perhaps
with some skew to the right.
[Figure: Histogram of model1$res (Frequency against model1$res) and Normal Q-Q plot (Sample Quantiles
against Theoretical Quantiles).]
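For reference, the points of a Normal Q-Q plot like the one above can be computed as follows. This is a sketch using only the Python standard library; the plotting positions (i + 0.5)/n are one common convention and are an assumption here, and the residual values are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(residuals):
    """Return (theoretical quantile, sample quantile) pairs for a Normal Q-Q plot."""
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Hypothetical residuals, only to show the shape of the output.
points = normal_qq_points([0.3, -0.1, 0.0, 0.2, -0.4])
for theoretical, sample in points:
    print(round(theoretical, 3), sample)
```

If the residuals are normal the pairs fall close to a straight line; a right skew shows up as the largest sample quantiles rising above that line.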
The quadratic behavior shown by the residuals suggests a transformation of the response variable V1.
The variable was transformed to Y=V1^(-1) and Y=V1^(-0.5), and the residuals now seem to behave
randomly. However, they seem to have less variability in the center than at the extremes.
[Figure: Residuals transformed model Y=V1^(-1). Panels: model6$res against Index, and model7$res against Index.]
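The two transformations tried above belong to the Box-Cox power family. A minimal sketch of that family in Python is shown below; `box_cox` is an illustrative helper and the V1 values are hypothetical, chosen only to be in the range of the fitted values.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform: (y**lam - 1)/lam for lam != 0, and ln(y) for lam == 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

# Hypothetical infected shares (V1), not taken from the sample.
infection_share = [0.030, 0.035, 0.040, 0.045]

reciprocal = box_cox(infection_share, -1.0)      # equals 1 - 1/V1
inverse_root = box_cox(infection_share, -0.5)    # equals 2 - 2/sqrt(V1)
print(reciprocal)
print(inverse_root)
```

For lambda = -1 the transform is 1 - 1/V1, a linear function of V1^(-1), so regressing on V1^(-1) directly gives the same fit with the signs of the slopes reversed and a shifted intercept.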
The transformation Y1=V1^(-1) was selected because both give the same results and this one is simpler.
With this transformation we go back to the model selection and obtain a new model with the same
variables but different coefficients.
Coefficients of Model 2
                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)         29.12        3.90     7.46   0.00000
perc.child         -26.39       14.76    -1.79   0.07724
perc.vac             3.10        0.98     3.17   0.00210
camp.virus           1.30        0.22     5.87   0.00000
temp                -0.03        0.01    -2.07   0.04169
public.per.cap     559.30      142.80     3.92   0.00018
employ               5.38        3.64     1.48   0.14297
region1TRUE         -1.69        0.28    -5.94   0.00000
region2TRUE         -1.75        0.25    -6.89   0.00000
phys.bed            -9.64        4.16    -2.32   0.02282
hab.km2             -0.01        0.00    -8.59   0.00000
households           0.00        0.00     1.38   0.16982
A check on the residuals of the new model is now needed. The histogram of the residuals shows a
distribution very similar to a normal one. However, the plot of the residuals against the fitted values
shows the variability increasing as the fitted values increase; this problem could be related to some
multicollinearity between the explanatory variables. On the other hand, the plots of the residuals
against the explanatory variables look like white noise [4].
[Figure: Histogram of model8$res (Frequency against model8$res) and Residuals vs Fitted values
(model8$res against model8$fit).]
Finally, the squared residuals and the absolute residuals were plotted against the fitted values to check
that they behave like white noise. The following graphs show that both the squared and the absolute
residuals have a positive trend and increasing variability as the fitted values increase.
[Figure: model8$res^2 against model8$fit, and abs(model8$res) against model8$fit.]
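The visual check above can be complemented with a number: the least squares slope of the absolute (or squared) residuals on the fitted values. This is a sketch with hypothetical fitted values and residuals, not the project's data; `slope` is an illustrative helper.

```python
def slope(x, y):
    """Least squares slope of y on x (simple regression with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical fitted values and residuals whose spread grows with the fit.
fitted = [22.0, 24.0, 26.0, 28.0, 30.0, 32.0]
residuals = [0.1, -0.2, 0.4, -0.6, 0.9, -1.2]

abs_trend = slope(fitted, [abs(r) for r in residuals])
sq_trend = slope(fitted, [r * r for r in residuals])
print(abs_trend > 0, sq_trend > 0)
```

A slope near zero is what constant variance would produce; a clearly positive slope, as in this made-up example, corresponds to the positive trend described in the text.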
This behavior of the residuals could be due to a missing explanatory variable. For that reason, the
squares of the explanatory variables and some interaction terms are added as candidates. The final
model, with an AIC value of 6.21 (better than the 13.9 of the previous model), was selected. In this last
model the variables were centered to avoid multicollinearity between the variables and their squares.
Coefficients of Model 3
                         Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                 28.07        0.36    78.09      0.00
c.perchild                 -23.77       14.31    -1.66      0.10
c.perc.vac                   3.16        0.95     3.33      0.00
c.households                 0.00        0.00     2.60      0.01
c.employ                    10.18        3.46     2.94      0.00
camp.virus                   1.17        0.21     5.62      0.00
region1TRUE                 -1.70        0.27    -6.18      0.00
region2TRUE                 -1.44        0.25    -5.84      0.00
c.hab.km2                   -0.03        0.00    -6.43      0.00
c.publicpercap             -75.41      162.80    -0.46      0.64
c.publicpercap2         271300.00    92780.00     2.93      0.00
c.hab.km22                   0.00        0.00     3.04      0.00
region2TRUE:c.hab.km2        0.01        0.00     2.56      0.01
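The effect of centering before squaring can be seen on a small artificial example. The sketch below uses hypothetical data (the integers 1 to 10), not the sample, and `pearson` is an illustrative helper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical positive-valued predictor.
x = [float(v) for v in range(1, 11)]
raw_corr = pearson(x, [v ** 2 for v in x])

# Center first, then square.
mean_x = sum(x) / len(x)
centered = [v - mean_x for v in x]
centered_corr = pearson(centered, [v ** 2 for v in centered])

print(round(raw_corr, 3), round(centered_corr, 3))
```

Here the correlation between the variable and its square drops from about 0.97 to zero. The drop is complete only because these values are symmetric around their mean; for a skewed variable some correlation between the centered variable and its square remains.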
The residuals of this new model still show some trend in the variance. To correct this problem we
estimate the model by weighted least squares, using the hat values as the weights. The plot of the
residuals against the fitted values now seems fine [5], and the new coefficients are the following:
Coefficients of Model 4
                         Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                 28.21        0.32    88.19      0.00
c.perchild                 -23.27       13.22    -1.76      0.10
c.perc.vac                   2.62        0.91     2.89      0.00
c.households                 0.00        0.00     2.51      0.01
c.employ                    13.26        3.26     4.07      0.00
camp.virus                   1.12        0.21     5.42      0.00
region1TRUE                 -1.70        0.26    -6.48      0.00
region2TRUE                 -1.44        0.25    -5.68      0.00
c.hab.km2                   -0.03        0.00    -6.84      0.00
c.publicpercap             -28.51      152.30    -0.19      0.64
c.publicpercap2         284300.00    81390.00     3.49      0.00
c.hab.km22                   0.00        0.00     3.31      0.00
region2TRUE:c.hab.km2        0.01        0.00     3.22      0.01
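To make the weighting step concrete, here is a sketch of weighted least squares for a single explanatory variable, where the closed form is short. The data and weights are hypothetical and `weighted_least_squares` is an illustrative name; the model above has many more terms and uses the hat values as weights.

```python
def weighted_least_squares(x, y, w):
    """Fit y = b0 + b1*x by minimizing sum(w_i * (y_i - b0 - b1*x_i)**2)."""
    total_w = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total_w   # weighted mean of x
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total_w   # weighted mean of y
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data: the last observation is noisy and gets a small weight.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 15.0]
equal_weights = [1.0] * 5
down_weighted = [1.0, 1.0, 1.0, 1.0, 0.1]

print(weighted_least_squares(x, y, equal_weights))
print(weighted_least_squares(x, y, down_weighted))
```

With equal weights the fit is ordinary least squares; giving less weight to the high-variance observation pulls the line back toward y = 1 + 2x, which the other four points follow exactly.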
It is important to check for outliers. First we check for outliers in the Y observations using the
studentized deleted residuals; no outliers are present in the Y observations. Then we check for outliers
in the X observations using the leverage values from the hat matrix; no outliers are present in the X
observations [6].
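The leverage check can be illustrated for a model with one explanatory variable, where the diagonal of the hat matrix has a closed form. The sketch uses hypothetical data, and the cutoff 2p/n is a common rule of thumb assumed here, not a threshold stated in this report.

```python
def leverages(x):
    """Hat matrix diagonal for y = b0 + b1*x: h_i = 1/n + (x_i - mean)^2 / Sxx."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    return [1.0 / n + (v - mean_x) ** 2 / sxx for v in x]

def flag_high_leverage(h, n_params):
    """Flag observations whose leverage exceeds 2p/n (assumed rule of thumb)."""
    cutoff = 2.0 * n_params / len(h)
    return [value > cutoff for value in h]

# Hypothetical predictor values; the last one is far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
h = leverages(x)
flags = flag_high_leverage(h, n_params=2)
print([round(value, 3) for value in h])
print(flags)
```

The leverages always sum to the number of parameters (2 here), so the cutoff is twice the average leverage. Only the isolated last observation is flagged in this example.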
The last diagnostic is the Variance Inflation Factor (VIF), used to check whether the explanatory
variables show multicollinearity. The model shows serious multicollinearity because the maximum VIF
value (37.252, for c.hab.km2) is greater than 10 and the average VIF value is well above 1.
Variable                 VIF
c.perchild               1.122
c.perc.vac               1.249
c.households             1.990
c.employ                 1.131
camp.virus               1.196
region1                  1.511
region2                  1.481
c.hab.km2               37.252
c.publicpercap           3.013
c.publicpercap2          2.194
c.hab.km22              29.082
region2:c.hab.km2        1.822
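The two rules applied to the VIF values can be checked directly from the table. The sketch below only summarizes the reported values; it does not recompute the VIFs, which would need the data. It uses the definition VIF = 1/(1 - R2), where R2 comes from regressing one explanatory variable on all the others.

```python
# VIF values as reported in the table above.
reported_vif = {
    "c.perchild": 1.122, "c.perc.vac": 1.249, "c.households": 1.990,
    "c.employ": 1.131, "camp.virus": 1.196, "region1": 1.511,
    "region2": 1.481, "c.hab.km2": 37.252, "c.publicpercap": 3.013,
    "c.publicpercap2": 2.194, "c.hab.km22": 29.082, "region2:c.hab.km2": 1.822,
}

max_name = max(reported_vif, key=reported_vif.get)
mean_vif = sum(reported_vif.values()) / len(reported_vif)

# Invert VIF = 1 / (1 - R2) to see how well the other variables explain this one.
implied_r2 = 1.0 - 1.0 / reported_vif[max_name]

print(max_name, round(mean_vif, 2), round(implied_r2, 3))
```

The largest VIF belongs to c.hab.km2 and implies that about 97% of its variation is explained by the other terms. Only c.hab.km2 and its square exceed 10, so the problem is concentrated in that pair.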
[6] The tables with the results of both tests are in Appendix A.
Conclusions
In this project a model to predict the percentage of the population infected by Virus A in Country B
was developed. To do that we used the variables included in the sample data, created new variables,
and finally added the squares of these variables and the interactions between them. We also transformed
the response variable in order to get the best results.
The best model to predict the transformed response variable (Y1=V1^(-1)) was:
Y1 = 28.21 - 23.27*c.perchild + 2.62*c.perc.vac + 0.41E-4*c.households + 13.26*c.employ
     + 1.12*camp.virus - 1.70*region1 - 1.44*region2 - 0.03*c.hab.km2 - 28.51*c.publicpercap
     + 28.43E+4*c.publicpercap2 + 0.23E-4*c.hab.km22 + 0.01*region2*c.hab.km2
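Because the model predicts Y1=V1^(-1), a prediction has to be inverted to get back to the infected share. The sketch below shows this for the intercept alone, which is the prediction when all centered variables are at zero (their means), camp.virus is 0 and both region indicators are false; `infected_share_from_y1` is an illustrative name.

```python
def infected_share_from_y1(y1):
    """Back-transform a prediction on the Y1 = 1/V1 scale to the V1 scale."""
    if y1 <= 0:
        raise ValueError("Y1 must be positive to be inverted")
    return 1.0 / y1

# Intercept of the final model: prediction at the baseline described above.
baseline = infected_share_from_y1(28.21)

# Same baseline with camp.virus increased by one unit (coefficient +1.12 on Y1).
with_campaign = infected_share_from_y1(28.21 + 1.12)

print(round(baseline, 4), round(with_campaign, 4))
```

Note the sign reversal introduced by the transformation: a positive coefficient on the Y1 scale, such as the one for camp.virus, corresponds to a lower predicted infected share.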
The model seems to fit very well, with an adjusted R2 of 89.24%, which means that 89.24% of the
variation of the response variable is explained by the model.
One limitation of the model is the multicollinearity among the explanatory variables, that is, some of
these variables are highly correlated with each other. This can inflate the mean squared error of the
coefficient estimates of these variables. However, multicollinearity does not affect the predictive
power of the model, so we conclude that this model is a good tool for predicting the percentage of the
population infected with Virus A.
Appendix A
Correlation matrix of the sample variables (lower triangle shown, the matrix is symmetric; values rounded to four decimals)
                           (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)    (10)    (11)    (12)    (13)    (14)    (15)    (16)    (17)    (18)    (19)
(1) V1                  1.0000
(2) perc.male          -0.0538  1.0000
(3) perc.college       -0.0162  0.9629  1.0000
(4) perc.child          0.1099 -0.9148 -0.8742  1.0000
(5) num.bed             0.5042  0.1500  0.1518 -0.0910  1.0000
(6) num.phys            0.5087  0.1317  0.1291 -0.0741  0.9456  1.0000
(7) perc.vac           -0.1311  0.0722  0.0598 -0.0075  0.1951  0.1336  1.0000
(8) population          0.4305  0.0966  0.1150 -0.0587  0.7994  0.6718  0.2044  1.0000
(9) household.size      0.1520  0.0325  0.0823 -0.0234  0.0356  0.0365  0.0155  0.0689  1.0000
(10) household.income   0.1507  0.0760  0.1273 -0.0627  0.0419  0.0421  0.0172  0.0740  0.9989  1.0000
(11) perc.insured      -0.0267  0.1252  0.1016 -0.1292  0.0065  0.0227  0.1307 -0.0231 -0.0265 -0.0208  1.0000
(12) geo.size          -0.6037  0.0405  0.0310 -0.0740 -0.3367 -0.3869 -0.0756 -0.1622 -0.0279 -0.0265 -0.0351  1.0000
(13) urban              0.4378  0.0891  0.0753 -0.0214  0.8424  0.9009  0.1037  0.4764 -0.0209 -0.0179  0.0254 -0.4316  1.0000
(14) camp.virus        -0.3683  0.1006  0.0744 -0.0868 -0.0835 -0.0447  0.1005 -0.1247  0.0196  0.0225  0.1397 -0.0417 -0.0447  1.0000
(15) temp              -0.2149  0.0495  0.0220  0.0039 -0.0283 -0.0435  0.0829 -0.0856 -0.1354 -0.1339  0.0253  0.0988  0.0139  0.2399  1.0000
(16) public.per.cap    -0.4556  0.0304 -0.0037 -0.0294 -0.0889 -0.0031 -0.0500 -0.3859 -0.0775 -0.0784  0.0289  0.4931  0.1850  0.0825  0.2153  1.0000
(17) employ            -0.0072 -0.2315 -0.2510  0.1902 -0.0449 -0.0299 -0.0145  0.0581  0.0167  0.0049  0.0651 -0.0671 -0.0371 -0.0175 -0.2321 -0.1835  1.0000
(18) rain               0.0118 -0.0447 -0.0316 -0.0526  0.1878  0.1910  0.0919  0.2084 -0.1338 -0.1354  0.1051 -0.0051  0.1022  0.0284 -0.0723 -0.0356 -0.0198  1.0000
(19) region            -0.4230  0.1194  0.0608 -0.1262 -0.0964 -0.0709  0.1005 -0.1095 -0.0703 -0.0671  0.0862  0.0854 -0.0048  0.0227  0.1645  0.1267  0.0623  0.0719  1.0000
[Figure: pairwise scatter plots of V1 and the added variables phys.bed, perc.public.area, hab.km2,
households, total.income, num.unemploy, vac.km2, income.per.cap, phys.per.hab, beds.per.hab,
phys.per.house, phys.per.km2, phys.per.child and beds.per.km2.]
[Figure: pairwise plots of model9$res against perc.child, perc.vac, public.per.cap, temp, employ,
phys.bed, phys.per.hab, phys.per.house, camp.virus, region1, region2 and hab.km2.]
[Figure: model13$res against model13$fit.]
[Tables: outlier flags (TRUE/FALSE) by observation number, 1 to 100, for the two outlier tests. In the
first table every flag is FALSE. In the second table one flag is TRUE and all the others are FALSE.]