05 Inference Lab
05 Inference Lab
DEEPAYAN SARKAR
> extra.hyoscyamine <- c(0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0)
> extra.laevorotatory <- c(1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4)
> extra.racemic <- c(1.5, 1.4, 0.0, -0.7, 0.5, 5.1, 5.7, 1.5, 4.7, 3.5)
First question. Consider the extra.hyoscyamine observations, which represents the extra hours of sleep
on average when given the hyoscyamine drug. The mean increase is
> mean(extra.hyoscyamine)
[1] 0.75
However, we are not really interested in these 10 particular patients, but in the effect of the drug in the
“general population”. To make any statements about that, we need to first link this particular sample to
the population.
This is done by hypothesizing a statistical model. In this simple case, our model needs just one univariate
distribution; we will pretend that we have planned the experiment but not yet observed the data, and
denote the n = 10 observations we will see by the symbols X1 , X2 , . . . , Xn . We will assume that these
quantities are independent random variables coming from a common but unknown distribution F . The
only unknown component of our model is F , which we refer to as the parameter of our model. The link
between the model and the actual observed data is completed by assuming that the observed data is one
realization of these random variables. If we repeated the experiment with n other patients, the observed
numbers would be different, but they would be realizations from the same distribution F .
It is difficult to infer much about the unknown distribution F in this model. However, some simple things
are possible; for example, we may be interested in the expected value of the Xi -s, given by
Z ∞
µ = E(Xi ) = xf (x)dx
−∞
where f is the density function of the distribution F . We call µ the population mean. A common-sense
estimator of µ is the sample mean 0.75 seen above. It also happens to be the least squares estimator, in the
sense that it is the value of a that minimizes the squared-error loss
n
X
(Xi − a)2 .
i=1
This estimator also has nice optimality properties: it is unbiased and has minimum variance among all
linear unbiased estimators. What does that mean? Notice that the sample mean
n
1X
X̄ = Xi
n i=1
is itself a random variable: it would have taken a different value if a different set of 10 patients had been
selected for the experiment. The probablity distribution of X̄ (which depends on F ) has its own mean and
variance. X̄ is unbiased because E(X̄) = µ, and it also has lower variance than any other (linear) unbiased
estimator of µ.
There may be other features of F we may be interested in such as the median and variance; these also have
estimators with various optimality properties.
Second question. In our observed sample, X̄ has the value 0.75. Does this mean µ is exactly equal to
0.75? Of course not. Can we at least say that µ is positive (that is, giving hyoscyamine is better than
giving no drug at all)? Even that is not necessarily true.
To see why, let us do some simulation. It is expensive to perform experiments in real life, but it is cheap to
do them in a computer. Consider our model. We do not know the parameter F , but let us suppose for a
moment that F was the standard Normal distribution N (0, 1). Here is the mean of one sample of size 10
from the N (0, 1) distribution.
> mean(rnorm(10))
[1] -0.4310653
As we can see, X̄ is sometimes positive even when the true F has µ = 0. Thus, the fact that X̄ is positive
in our experiment does not imply that µ > 0. There is not much more we can say about this problem
unless we make further model assumptions.
A more specific model. Our previous model made very few assumptions: only that all observations are
independent and come from the same distribution. Not much inference can be done in such a general
setup. We will now make a much more specific model: we will assume that F is a Normal distribution,
with mean µ and variance σ 2 ; that is,
Xi ∼ N (µ, σ 2 ), i = 1, 2, . . . , n
The question we are interested in is still whether µ is positive. The intuitive idea is that we would be more
inclined to believe this conjecture if in addition to X̄ being positive, the variance is also small. The familiar
AN INTRODUCTION TO R 3
Let us try our simulation approach again, assuming that µ = 0. But looking at a very specific F as we did
before (the standard Normal) will not allow us to make any general conclusions, so we would like to allow
σ 2 to be any positive quantity. However, this seems pointless, as we cannot simulate from N (0, σ 2 ) if we do
not know σ 2 . What can we do then?
The trick is to notice that if we scale X̄ to obtain a new quantity
X̄
U=√
s2
Xi
Exercise 1. Let X1 , . . . , Xn ∼ N (0, σ 2 ) independently. Let Zi = σ ,i = 1, . . . , n. Then by definition,
Zi ∼ N (0, 1). Let
X̄
UX = q Pn , and
1 2
n−1 i=1 (Xi − X̄)
Z̄
UZ = q Pn
1
n−1 i=1 (Zi − Z̄)2
Prove that UX = UZ . As the distribution of UZ clearly does not depend on σ 2 , the distribution of UX must
also not depend on σ 2 .
This means that whatever value of σ 2 we use to simulate, the computed value of U = X̄s will have the same
distribution; so we may as well use σ 2 = 1. Let us compute U from 5000 simulations of our model.
What is the distribution of this U ? Is it Normal? We can check using a Normal Q-Q plot.
> library(lattice)
> qqmath(~u5000, distribution = qnorm, type = c("p", "g"),
aspect = "xy", pch = ".", cex = 2)
4 DEEPAYAN SARKAR
u5000
0
−1
−2
−4 −2 0 2 4
qnorm
Exercise 2. Does this look like a straight line? Compare with the Normal Q-Q plot of values directly
simulated from the Normal distribution.
The systematic curvature is indicative of tails wider than the Normal distribution.
Simulation p-value. We now have 5000 simulated values of U under the assumption that µ = 0 (in the
model that F is a Normal distribution). What is the value of U in our actual experiment?
Let us compute the proportion of cases in which our simulated U is larger than our observed U .
Thus under our model, there is a roughly 10% chance that even when µ = 0, we see a value of U at least as
large as what we have actually seen. Of course, our observed proportion is itself an estimate of this chance,
but we can compute it more accurately by simulating more samples.
We have just performed a hypothesis test, where we tested the null hypothesis
H0 : µ = 0
against the alternative hypothesis
H1 : µ > 0.
AN INTRODUCTION TO R 5
The proportion computed is the p-value P (U > uobserved ), which represents the “surprise factor”: how
unlikely is our observed sample given our null hypothesis?
The true p-value. Our Q-Q plots suggest that the distibution of U when µ = 0 is not Normal, but does
not tell us what it is. It so happens that√ the distribution of U can be derived analytically, and belongs to
the family of t-distributions. In fact, n U ∼ tn−1 (the t-distribution with√ n − 1 degrees of freedom), as
can be graphically verified by a Q-Q plot against the tn−1 distribution. n U is referred to as the test
statistic, as it is the quantity (statistic) on which the test is based.
5
(sqrt(10) * u5000)
−5
−6 −4 −2 0 2 4 6
function(p) qt(p, df = 9)
We do not need to depend on simulation to compute the actual p-value; it can be computed using
A p-value of this magnitude is not considered strong enough evidence against our conjecture (null
hypothesis) that µ = 0; in other words, we would not conclude in this case that that hyoscyamine is an
effective sleep-inducing drug. The usual practice is to consider a p-value less that 0.05 to be “significant”, or
strong enough evidence (if we stick by such a rule whenever we test, we would erroneously reject the null
hypothesis 5% of the time).
Of course, it is possible that our lack of evidence is simply due to the fact that we do not have enough data.
Not having enough evidence “against the null” does not mean that we have evidence against the alternative.
Exercise 3. Repeat the procedure above to compute p-values for testing whether the two other drugs
(laevorotatory and racemic) significantly increase the average duration of sleep.
6 DEEPAYAN SARKAR
The R function t.test() will also perform the test described above, which is also known as a one-sample
t-test.
data: extra.hyoscyamine
t = 1.3257, df = 9, p-value = 0.1088
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
-0.2870553 Inf
sample estimates:
mean of x
0.75
Third question. Let us now ask a slightly different question: is the drug laevorotatory any better than
hyoscyamine? Since the observations are paired (measured on the same patients), this is the same as asking
if the mean of their difference is positive. R can perform a paired t-test using
Exercise 4. Confirm that the paired t-test above gives the same results as
A non-parametric test. Let us now go back to a more general model for our data, with F only assumed
to be symmetric around the mean µ. As before, we would like to test
H0 : µ = 0 vs H1 : µ > 0
The Wilcoxon signed-rank test works as follows: rank the observations according to their numeric value
(ignoring their sign), and then compute the sum of ranks for only the positive observations. The idea is
that with only the symmetry assumption, the test statistic corresponds to selecting each number from 1 to
n with probablity 0.5 and taking their sum.
It is clear that as before, the distribution of the test statistic will not depend on F (as long as µ = 0 and F
is symmetric). We can use simulation from any symmetric distribution with mean 0 to estimate the
p-value, or use a theoretical calculation. The distribution of the test statistic is not a standard distribution,
but is available in R (see ?pwilcox). The wilcox.text() function will perform the directly given the data.
data: extra.hyoscyamine
V = 31, p-value = 0.1716
alternative hypothesis: true location is greater than 0
AN INTRODUCTION TO R 7
Tests such as these are called non-parametric tests because they do not need strict parametric model
assumptions such as Normality. The t-test actually works quite well even under mild departures from
Normality, and has better optimality properties (it is more likely to correctly detect situations where H0 is
not true) when the data is actually close to Normal. For this reason, the t-test is preferred when the
assumption of Normality is justifiable, even though the Wilcoxon signed-rank test works under more
general assumptions.
The two-sample t-test. Consider a more complicated situation where we have independent samples from
two different populations: X1 , X2 , . . . , Xn ∼ F1 , and Y1 , Y2 , . . . , Ym ∼ F2 . Let the population means of F1
and F2 be µ1 and µ2 . We want to test H0 : µ1 = µ2 .
As before, the typical parametric approach is to assume that each Xi ∼ N (µ1 , σ12 ), and each
Yj ∼ N (µ2 , σ22 ). There are two variations of the next step. If we further assume that σ12 = σ22 and use a
pooled estimate of the variance, we get the so-called two-sample t-test with equal variance, also called the
classical t-test. If we allow σ12 6= σ22 , we can use an alternative procedure due to Welch that produces a
test-statistic whose null distribution does not exactly follow a t-distribution, but is well-approximated by it.
Both tests can be performed using the t.test() function.
Using the energy expenditure data seen before, we can do
Both these tests have alternative hypothesis H1 : µ1 6= µ2 . The second example uses a formula interface,
similar to the one we have seen in lattice graphics. The formula interface is used extensively for modeling in
R.
8 DEEPAYAN SARKAR
Two-sample Wilcoxon test. As with the one-sample problem, we can compute a two-sample Wilcoxon
test statistic, which computes the rank of all observations taken together (without regard to grouping) and
then sums the ranks of observations in the first group. The test for H1 : µ1 < µ2 can be performed using
wilcox.test() as follows.
Exercise 5. Using the ideas described above, compute the p-value for this test using simulation. Use 50000
replications. Is your computed p-value close to the p-value reported by wilcox.test()?
Linear models
A wide range of useful statistical models can be viewed as special cases of a class of models called linear
models. The general model may be written as
yi = xi1 β1 + xi2 β2 + · · · + xip βp + εi , i = 1, 2, . . . , n
where each i represents one observation in the sample, yi -s are some response variable that we want to
model, and xij -s are various known predictor or “design” terms. The statistical component in the model
comes through the random variables εi , which are at a minimum assumed to be independent with mean 0
and constant variance σ 2 . Often the more specific model εi ∼ N (0, σ 2 ) is assumed.
Estimation. The parameters in a linear model are the coefficients βj , j = 1, 2, . . . , p, and the variance
parameter σ 2 . The best estimators of the βj -s turn out to be those that minimise the squared error loss.
These estimators have nice analytical solutions, and they can be computed by the R function lm().
Testing. Hypothesis tests under the linear model are conjectures that put linear constraints on the βj
coefficients; typically these are of the form H0 : β1 = 0, H0 : β1 = β2 = β3 = β4 , etc. Such hypotheses can
be tested using a generalization of the t-test known as the F -test, provided we assume the more specific
Normal error model.
Specific models. We will not discuss the general linear model theory, but instead just mention some
common special cases.
The most familiar example of a linear model is simple linear regression. We will re-use the following
artificial example to illustrate regression.
Coefficients:
(Intercept) x
-5.573 5.705
We of course know this model to be incorrect, as E(yi ) cannot be expressed as a linear function of xi . We
can fit the correct model by adding a quadratic term.
yi = β1 + xi β2 + x2i β3 + εi , i = 1, 2, . . . , n = 100
which is fit by
Coefficients:
(Intercept) x I(x^2)
1.22003 -0.06778 0.99664
In practice we would not know the correct model, so we may need to perform a hypothesis test to decide
which model is correct. This is done using
Model 1: y ~ 1 + x
Model 2: y ~ 1 + x + I(x^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 98 171.978
2 97 34.123 1 137.86 391.88 < 2.2e-16
The very small p-value (the column with heading Pr(>F)) indicates that fm2 is a much better fit for the
data. Compare this with a test for a model with a (redundant) cubic term:
Model 1: y ~ 1 + x + I(x^2)
Model 2: y ~ 1 + x + I(x^2) + I(x^3)
Res.Df RSS Df Sum of Sq F Pr(>F)
10 DEEPAYAN SARKAR
1 97 34.123
2 96 33.795 1 0.32815 0.9322 0.3367
There is not always obvious additional terms to try out to test the goodness of a model fit this way. As an
alternative, it is always a good idea to look at graphical diagnostics for a fit. For example, R shows the
following plots for fm1, the model with only the linear term. The first and last plots are clear indicators of
systematic lack-of-fit.
1.5
14 14 14
3
36 2
2
Standardized residuals
2
Standardized residuals
Standardized residuals
36 161
2
1.0
1
Residuals
1
1
0
0
0
0.5
−1
−1
−1
−2
94
2
−2
Cook's distance
0.0
2
−3
−2
0 5 10 15 20 −2 0 1 2 0 5 10 15 20 0.00 0.02
Exercise 6. Look at the corresponding diagnostic plots for fm2. Do they still indicate lack-of-fit?
so that the errors have a Normal distribution, rather than the uniform distribution as before. Re-fit fm2 and
look at its diagnostic plots. Do you see any qualitative difference?
Exercise 8. For the airquality dataset, fit a suitable regression model for Ozone as response and Temp as
predictor. Comment on goodness-of-fit.
Another important class of linear models is factorial models. The most simple factorial models are one-way
classification models, where the predictor is a categorical variable, and the response is allowed to have a
different mean (but the same variance) for each group. One such model with a two-group categorical
variable is
AN INTRODUCTION TO R 11
Response: expend
Df Sum Sq Mean Sq F value Pr(>F)
stature 1 26.485 26.4853 15.568 0.000799
Residuals 20 34.026 1.7013
The anova() call with a single argument tests submodels obtained by sequentially removing terms from
the model, starting with the last one. In this case, the only submodel is equivalent to the hypothesis that
both groups have the same mean.
Exercise 9. In a one-way classification model with two categories, testing equality of means across groups
is equivalent to a two-sample t-test with equal variance. Confirm this for the above example by comparing
p-values for the appropriate tests.
Coefficients:
(Intercept) factor(Month)6 factor(Month)7 factor(Month)8 factor(Month)9
23.615 5.829 35.500 36.346 7.833
Exercise 10. Look at the diagnostic plots produced by plot(fm5). Is there any reason to suspect
lack-of-fit?
Exercise 11. Create a box-and-whisker plot of Ozone by Month. One-way classification models assume
equality of response variance for each group. Is the true for fm5? Fit and assess a new model with
log(Ozone) as the response.
Any number of continuous (regression-type) and categorical terms can be used in a linear model. Here are
a few more complex models for Ozone:
fm8 allows a different regression line for each month, but with the same slope. fm9 allows a different
regression line for each month, with possibly different slopes.
Exercise 12. Use anova() to compare all pairs of models for which the comparison makes sense. Which
model would you suggest as the most appropriate for the data?
Further reading
In this tutorial, we have only outlined some very basic ideas of modeling in R. Fitting linear models in R is
by itself an extensive topic. ?lm is a good place to start learning more, especially about the formula
language, and many resources on the web have more details. There are also many other kinds of models
one can fit in R, which you should explore once you are more comfortable with the basic models.