Ex Day4
We consider data taken from the Current Population Survey (CPS) conducted
in the US in 1985. The dataset contains observations of 6 variables (wage,
edu, exper, age, sex, and occup) for 532 persons.
Remark: Model validation would reveal that it is better to model the logarithm
of the wage. But in order not to give the impression that responses should
always be log-transformed, which indeed isn't the case, and also to keep the
interpretation of the parameter estimates as simple as possible, we will not
transform the response variable. This is ok since the emphasis of this exercise
is multicollinearity and not model validity.

> summary(lm(wage~edu+exper+age,data=subset(wage,(sex==1)&(occup==5))))

Residuals:
    Min      1Q  Median      3Q     Max
-6.5109 -2.9453 -0.6629  2.0672 14.0105

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.4931     6.6457  -2.331    0.024 *
edu           0.7059     0.8524   0.828    0.412
exper        -0.6247     0.8723  -0.716    0.477
age           0.6775     0.7964   0.851    0.399
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the model summary we see that none of the 3 explanatory variables is
close to significance. However, the multilinear regression still explains 30.97
percent (i.e. the R²) of the variation in the wages, and taken together the 3
explanatory variables are highly significant (p=0.0004461).
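The quoted R² and joint p-value come from the overall F-test of the regression. A minimal sketch of how they can be obtained (assuming the data frame wage is loaded; the object names cw, m_full and m_null are only used here for illustration):

cw <- subset(wage, (sex == 1) & (occup == 5))   # the craftswomen subset used above
m_full <- lm(wage ~ edu + exper + age, data = cw)
summary(m_full)$r.squared                       # proportion of the variation explained (R^2)
m_null <- lm(wage ~ 1, data = cw)               # intercept-only model
anova(m_null, m_full)                           # joint F-test of edu, exper and age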
The fit of the multilinear regression after removal of age is given below.
Please consider the following questions:
What has happened to the sign of the slope of exper? Do you think
that the positive sign makes more sense? Why/why not?
> summary(lm(wage~edu+exper,data=subset(wage,(sex==1)&(occup==5))))
Call:
lm(formula = wage ~ edu + exper, data = subset(wage, (sex ==
1) & (occup == 5)))
Residuals:
    Min      1Q  Median      3Q     Max
-5.9828 -3.0854 -0.6495  1.7550 14.1748

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -11.85350    5.07118  -2.337   0.0235 *
edu           1.38007    0.31307   4.408 5.68e-05 ***
exper         0.11552    0.06237   1.852   0.0700 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
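If you want to supplement these p-values with confidence intervals for the slope estimates, confint() may be used (a small sketch; m_red is just an illustrative name):

m_red <- lm(wage ~ edu + exper, data = subset(wage, (sex == 1) & (occup == 5)))
confint(m_red)    # 95 pct confidence intervals for the intercept and the slopes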
The following output from R shows that edu and exper may be considered
uncorrelated in the subpopulation of craftswomen. Does this have any
implication for the interpretation of the slope estimates on edu and exper given
above? Why/why not? And what if edu and exper actually are negatively
correlated, i.e. if working experience in general is shorter for craftswomen
with a longer education?
> with(subset(wage,(sex==1)&(occup==5)),cor.test(edu,exper))
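To see the mechanism behind the last question, here is a purely illustrative simulation (it uses made-up data, not the CPS data): when two predictors are negatively correlated, dropping one of them changes the slope estimate of the other.

set.seed(1)                            # arbitrary seed, only for reproducibility
n  <- 200
x1 <- rnorm(n)
x2 <- -0.8 * x1 + rnorm(n, sd = 0.6)   # x2 negatively correlated with x1
y  <- x1 + x2 + rnorm(n)               # both true slopes equal 1
coef(lm(y ~ x1 + x2))                  # slope on x1 estimated close to 1
coef(lm(y ~ x1))                       # slope on x1 shrinks when x2 is omitted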
Deciding which two of these variables provide the “correct” explanation cannot
be done based on statistics, but relies on the interpretation of the variables.
When there is multicollinearity among the explanatory variables, the p-values
may change from non-significant to highly significant and the estimates may
change sign after model reduction. That there indeed is multicollinearity in
the present dataset may be seen from the following analysis:
> summary(lm(age~edu+exper,data=wage))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.09160    0.19182   31.76   <2e-16 ***
edu           0.98494    0.01281   76.91   <2e-16 ***
exper         1.05558    0.00271  389.51   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
What is the interpretation of the null hypothesis that the slopes on edu
and exper both equal 1?
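If you also want to test this joint hypothesis in R, one possibility (a sketch, not the only way) is to fix the two slopes at 1 via an offset and compare the two fits:

m_free  <- lm(age ~ edu + exper, data = wage)
m_fixed <- lm(age ~ 1 + offset(edu + exper), data = wage)   # slopes on edu and exper fixed at 1
anova(m_fixed, m_free)                                      # F-test of the joint null hypothesis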
Remark: A few more suggestions for the identification of multicollinearity may
be found in solution4_1.R.
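One common diagnostic is the variance inflation factor (VIF), which measures how much the variance of each slope estimate is inflated by correlation with the other explanatory variables. A minimal sketch, assuming the car package is installed:

library(car)
m_cw <- lm(wage ~ edu + exper + age, data = subset(wage, (sex == 1) & (occup == 5)))
vif(m_cw)    # large values (rule of thumb: above 5-10) indicate multicollinearity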
In this exercise we investigate the world records for outdoor running
distances. The records were taken from the website http://www.iaaf.org of
the International Association of Athletics Federations on May 7, 2011. We
want to examine the dependence of the record (time) on the distance, and to
examine the difference between men and women. The purpose of this exercise
is to give a non-trivial example of the choices needed in making simple
statistical models with good interpretations. (Reference: Based on Exercise 8.2
from Anders Tolver & Helle Sørensen: Lecture notes for Applied Statistics.)
Read the dataset available in the text file WR2011.txt into R (in a data
frame called wr), and have a look at the variables:
– Please note that the distances are more or less doubled between
consecutive running disciplines. Thus, the running distances are
almost equidistant on a logarithmic scale.
– The variable DOB contains the date-of-birth of the record holder.
The variables Place and Date contain the place and date of the
record. These variables will not be used in this exercise.
– The variable bend is one I made myself, and it will be used later. This
variable quantifies how many times longer than 1500 meters the
running distance in question is, and it is set to 1 if the distance is
shorter than 1500 meters.
– Make sure that the variables time, distance and bend are numerical,
and that sex is a categorical factor (a sketch of how to do this is
given right after this list).
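As mentioned in the last item, here is a small sketch of how the reading and the type checking could be done (it assumes that WR2011.txt is a whitespace-separated file with a header line; adjust the read.table() arguments if the file is formatted differently):

wr <- read.table("WR2011.txt", header = TRUE)
str(wr)                                # check which types the variables were given
wr$sex      <- factor(wr$sex)          # sex as a categorical factor
wr$time     <- as.numeric(wr$time)     # time, distance and bend as numerical
wr$distance <- as.numeric(wr$distance)
wr$bend     <- as.numeric(wr$bend)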
library(ggplot2)
ggplot(wr) + geom_point(aes(x=distance,y=time,col=sex))
time = α + β ∗ distance
log(time) = α + β ∗ log(distance)
It is not obvious that this is a good model. But what do you think
looking at the plot?
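Since this model is linear on the log-log scale, it may help to redraw the scatter plot with logarithmic axes (a sketch, reusing the ggplot2 call from above):

ggplot(wr) + geom_point(aes(x=distance,y=time,col=sex)) +
  scale_x_log10() + scale_y_log10()    # both axes on a logarithmic scale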
m1 <- lm(log(time)~sex+log(distance)+sex:log(distance),data=wr)
In this model both the alpha and the beta parameter depend on the gender:

log(time) = α(sex) + β(sex) ∗ log(distance)

To find where the bend is positioned, make a residual plot of m1 and mark
points interactively with identify():
par(mfrow=c(1,1))
plot(predict(m1),residuals(m1))
identify(predict(m1),residuals(m1))
and use the mouse to click on the points where you think the bend is
positioned. When you are done, finish the identification as indicated in the
graphics window (in Windows you should press the Esc key).
Remark: If identify() does not work inside RStudio, then a solution
might be to open a separate graphical device using the function x11()
before making the plot.
Remark: The line par(mfrow=c(1,1)) is only necessary if you did par(mfrow=c(2,2)) before.
How do the em-means compare to the raw means listed in the above table? Do
you prefer the raw means or the em-means? And why?
Remark: The emmeans-package may be used to compute and compare the
em-means. The statistical computations done in the emmeans-package are
based on standard errors extracted from the model objects. Suppose e.g.
that your model is available in an lm-object called m2, and try the following
R code (and think about what the code does):
# load library
library(emmeans)
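The call to library(emmeans) is only the first step. A possible continuation, assuming that the factor in your model m2 is sex (the exact call depends on how your model is specified):

emm <- emmeans(m2, ~ sex)    # estimated marginal means (em-means) for each sex
emm
pairs(emm)                   # pairwise comparison of the em-means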
The dataset is shown below, and it is also available in the text file phosphor.txt:
inorganic  organic  available
      0.4       53         64
      0.4       23         60
      3.1       19         71
      0.6       34         61
      4.7       24         54
      1.7       65         77
      9.4       44         81
     10.1       31         93
     11.6       29         93
     12.6       58         51
     10.9       37         76
     23.1       46         96
     23.1       50         77
     21.6       44         93
     23.1       56         95
      1.9       36         54
     26.8       58        168
     29.9       51         99
Is there an association?
(Reference: Exercise 8.4 from Anders Tolver & Helle Sørensen: Lecture notes
for Applied Statistics.)
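One possible starting point for investigating the association is a multiple regression of available on the two other measurements (a sketch; it assumes that phosphor.txt is whitespace-separated with the header shown above):

phosphor <- read.table("phosphor.txt", header = TRUE)
m_phos <- lm(available ~ inorganic + organic, data = phosphor)
summary(m_phos)    # inspect the slope estimates and their p-values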
End of exercises.