Data Analysis and Graphics Using R-An Example Based Approach
John Maindonald
Australian National University
and
John Braun
Department of Statistical and Actuarial Science, University of Western Ontario
It is easy to lie with statistics. It is hard to tell the truth without statistics.
[Andrejs Dunkels]
. . . technology tends to overwhelm common sense.
[D. A. Freedman]
Contents
Preface
A Chapter by Chapter Summary
1 A Brief Introduction to R
1.1 A Short R Session
1.1.1 R must be installed!
1.1.2 Using the console (or command line) window
1.1.3 Reading data from a file
1.1.4 Entry of data at the command line
1.1.5 Online help
1.1.6 Quitting R
1.2 The Uses of R
1.3 The R Language
1.3.1 R objects
1.3.2 Retaining objects between sessions
1.4 Vectors in R
1.4.1 Concatenation – joining vector objects
1.4.2 Subsets of vectors
1.4.3 Patterned data
1.4.4 Missing values
1.4.5 Factors
1.5 Data Frames
1.5.1 Variable names
1.5.2 Applying a function to the columns of a data frame
1.5.3 Data frames and matrices
1.5.4 Identification of rows that include missing values
1.6 R Packages
1.6.1 Data sets that accompany R packages
1.7 Looping
1.8 R Graphics
1.8.1 The function plot() and allied functions
1.8.2 Identification and location on the figure region
1.8.3 Plotting mathematical symbols
2 Styles of Data Analysis
3 Statistical Models
3.1 Regularities
3.1.1 Mathematical models
3.1.2 Models that include a random component
3.1.3 Smooth and rough
3.1.4 The construction and use of models
3.1.5 Model formulae
3.2 Distributions: Models for the Random Component
3.2.1 Discrete distributions
3.2.2 Continuous distributions
3.3 The Uses of Random Numbers
3.3.1 Simulation
3.3.2 Sampling from populations
3.4 Model Assumptions
3.4.1 Random sampling assumptions – independence
3.4.2 Checks for normality
3.4.3 Checking other model assumptions
3.4.4 Are non-parametric methods the answer?
3.4.5 Why models matter – adding across contingency tables
3.5 Recap
3.6 Further Reading
3.7 Exercises
4 An Introduction to Formal Inference
5 Regression with a Single Predictor
6 Multiple Linear Regression
7 Exploiting the Linear Model Framework
8 Logistic Regression and Other Generalized Linear Models
9 Multi-level Models, Time Series and Repeated Measures
10 Tree-based Classification and Regression
11 Multivariate Data Exploration and Discrimination
12 The R System – Additional Topics
Epilogue – Models
References
Index of R Symbols and Functions
Index of Terms
Index of Names
Preface
This book is an exposition of statistical methodology that focuses on ideas and concepts,
and makes extensive use of graphical presentation. It avoids, as much as possible, the use
of mathematical symbolism. It is particularly aimed at scientists who wish to do statistical analyses on their own data, preferably with reference as necessary to professional
statistical advice. It is intended to complement more mathematically oriented accounts of
statistical methodology. It may be used to give students with a more specialist statistical
interest exposure to practical data analysis.
The authors can claim, between them, 40 years of experience in working with researchers
from many different backgrounds. Initial drafts of the monograph were constructed from
notes that the first author prepared for courses for researchers, first of all at the University of
Newcastle (Australia) over 1996–1997, and greatly developed and extended in the course
of work in the Statistical Consulting Unit at The Australian National University over
1998–2001. We are grateful to those who have discussed their research with us, brought us
their data for analysis, and allowed us to use it in the examples that appear in the present
monograph. At least these data will not, as often happens once data have become the basis
for a published paper, gather dust in a long-forgotten folder!
We have covered a range of topics that we consider important for many different areas
of statistical application. This diversity of sources of examples has benefits, even for those
whose interests are in one specific application area. Ideas and applications that are useful in
one area often find use elsewhere, even to the extent of stimulating new lines of investigation.
We hope that our book will stimulate such cross-fertilization. As is inevitable in a book that
has this broad focus, there will be specific areas – perhaps epidemiology, or psychology, or
sociology, or ecology – that will regret the omission of some methodologies that they find
important.
We use the R system for the computations. The R system implements a dialect of the
influential S language that is the basis for the commercial S-PLUS system. It follows
S in its close linkage between data analysis and graphics. Its development is the result
of a co-operative international effort, bringing together an impressive array of statistical
computing expertise. It has quickly gained a wide following, among professionals and non-professionals alike. At the time of writing, R users are restricted, for the most part, to a
command line interface. Various forms of graphical user interface will become available in
due course.
The R system has an extensive library of packages that offer state-of-the-art abilities.
Many of the analyses that they offer were not, 10 years ago, available in any of the standard
packages. What did data analysts do before we had such packages? Basically, they adapted
more simplistic (but not necessarily simpler) analyses as best they could. Those whose
skills were unequal to the task did unsatisfactory analyses. Those with more adequate skills
carried out analyses that, even if not elegant and insightful by current standards, were often
adequate. Tools such as are available in R have reduced the need for the adaptations that
were formerly necessary. We can often do analyses that better reflect the underlying science.
There have been challenging and exciting changes from the methodology that was typically
encountered in statistics courses 10 or 15 years ago.
The best any analysis can do is to highlight the information in the data. No amount of
statistical or computing technology can be a substitute for good design of data collection,
for understanding the context in which data are to be interpreted, or for skill in the use of
statistical analysis methodology. Statistical software systems are one of several components
of effective data analysis.
The questions that statistical analysis is designed to answer can often be stated simply. This may encourage the layperson to believe that the answers are similarly simple.
Often, they are not. Be prepared for unexpected subtleties. Effective statistical analysis
requires appropriate skills, beyond those gained from taking one or two undergraduate
courses in statistics. There is no good substitute for professional training in modern tools
for data analysis, and experience in using those tools with a wide range of data sets. No one should be embarrassed that they have difficulty with analyses that involve ideas that
professional statisticians may take 7 or 8 years of professional training and experience to
master.
Influences on the Modern Practice of Statistics
The development of statistics has been motivated by the demands of scientists for a methodology that will extract patterns from their data. The methodology has developed in a synergy
with the relevant supporting mathematical theory and, more recently, with computing. This
has led to methodologies and supporting theory that are a radical departure from the methodologies of the pre-computer era.
Statistics is a young discipline. Only in the 1920s and 1930s did the modern framework of
statistical theory, including ideas of hypothesis testing and estimation, begin to take shape.
Different areas of statistical application have taken these ideas up in different ways, some of
them starting their own separate streams of statistical tradition. Gigerenzer et al. (1989) examine the history, commenting on the different streams of development that have influenced
practice in different research areas.
Separation from the statistical mainstream, and an emphasis on black box approaches,
have contributed to a widespread exaggerated emphasis on tests of hypotheses, to a neglect of pattern, to the policy of some journal editors of publishing only those studies that
show a statistically significant effect, and to an undue focus on the individual study. Anyone who joins the R community can expect to witness, and/or engage in, lively debate
that addresses these and related issues. Such debate can help ensure that the demands of
scientific rationality do in due course win out over influences from accidents of historical
development.
Germany) helped with technical aspects of working with LaTeX, with setting up a cvs server
to manage the LaTeX files, and with helpful comments. Lynne Billard (University of Georgia,
USA), Murray Jorgensen (University of Waikato, NZ) and Berwin Turlach (University of
Western Australia) gave valuable help in the identification of errors and text that required
clarification. Susan Wilson (Australian National University) gave welcome encouragement.
Duncan Murdoch (University of Western Ontario) helped set up the DAAG package. Thanks
also to Cath Lawrence (Australian National University) for her Python program that allowed
us to extract the R code, as and when required, from our LaTeX files. The failings that remain
are, naturally, our responsibility.
Many people have helped by providing data sets.
We give a list, following the list of references for the data near the end of the book. We
apologize if there is anyone that we have inadvertently failed to acknowledge. Finally,
thanks to David Tranah of Cambridge University Press, for his encouragement and help in
bringing the writing of this monograph to fruition.
References
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J. & Kruger, L. 1989. The Empire
of Chance. Cambridge University Press.
SAS Institute Inc. 1996. JMP Start Statistics. Duxbury Press, Belmont, CA.
These (and all other) references also appear in the consolidated list of references near the
end of the book.
Conventions
Text that is R code, or output from R, is printed in a verbatim text style. For example,
in Chapter 1 we will enter data into an R object that we call austpop. We will use the
plot() function to plot these data. The names of R packages, including our own DAAG
package, are printed in italics.
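For example, a first plot might look as follows. This is a minimal sketch; it assumes, as in the DAAG package, that austpop has a Year column and state population columns such as ACT.

    library(DAAG)                     # austpop is supplied with DAAG
    plot(ACT ~ Year, data = austpop)  # ACT population against year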
Starred exercises and sections identify more technical items that can be skipped at a first
reading.
Web sites for supplementary information
The DAAG package, the R scripts that we present, and other supplementary information,
are available from
http://cbis.anu.edu/DAAG
http://www.stats.uwo.ca/DAAG
Solutions to exercises
Solutions to selected exercises are available from the website
http://www.maths.anu.edu.au/johnm/r-book.html
See also www.cambridge.org/0521813360
A Chapter by Chapter Summary
The fitted model determines fitted or predicted values of the signal. The residuals (which
estimate the noise component) are what remain after subtracting the fitted values from the
observed values of the signal.
The normal distribution is widely used as a model for the noise component.
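The following minimal sketch makes the relationship concrete; the data are simulated and all names are invented for illustration.

    x <- 1:20
    y <- 2 + 0.5*x + rnorm(20, sd = 0.4)   # signal plus normal noise
    fit <- lm(y ~ x)
    fitted(fit)                            # fitted (predicted) values
    resid(fit)                             # observed values minus fitted values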
Haphazardly chosen samples should be distinguished from random samples. Inference
from haphazardly chosen samples is inevitably hazardous. Self-selected samples are particularly unsatisfactory.
Chapter 4: An Introduction to Formal Inference
Formal analysis of data leads to inferences about the population(s) from which the data were
sampled. Statistics that can be computed from given data are used to convey information
about otherwise unknown population parameters.
The inferences that are described in this chapter require randomly selected samples from
the relevant populations.
A sampling distribution describes the theoretical distribution of sample values of a statistic, based on multiple independent random samples from the population.
The standard deviation of a sampling distribution has the name standard error.
For sufficiently large samples, the normal distribution provides a good approximation to
the true sampling distribution of the mean or a difference of means.
A confidence interval for a parameter, such as the mean or a difference of means, has the
form

    statistic ± t-critical-value × standard error.
Such intervals give an assessment of the level of uncertainty when using a sample statistic
to estimate a population parameter.
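As a hedged sketch (the data values are invented), such an interval for a mean can be computed directly:

    y <- c(4.2, 5.1, 3.8, 4.9, 5.4, 4.4)
    n <- length(y)
    se <- sd(y)/sqrt(n)              # standard error of the mean
    tcrit <- qt(0.975, df = n - 1)   # t critical value for a 95% interval
    mean(y) + c(-1, 1) * tcrit * se  # lower and upper limits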
Another viewpoint is that of hypothesis testing. Is there sufficient evidence to believe that
there is a difference between the means of two different populations?
Checks are essential to determine whether it is plausible that confidence intervals and
hypothesis tests are valid. Note however that plausibility is not proof!
Standard chi-squared tests for two-way tables assume that items enter independently into
the cells of the table. Even where such a test is not valid, the standardized residuals from
the no association model can give useful insights.
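A minimal sketch (the table entries are invented for illustration):

    counts <- matrix(c(30, 10, 20, 40), nrow = 2)
    result <- chisq.test(counts)
    result          # the standard chi-squared test
    result$stdres   # standardized residuals from the "no association" model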
In the one-way layout, in which there are several independent sets of sample values,
one for each of several groups, data structure (e.g. compare treatments with control, or
focus on a small number of interesting contrasts) helps determine the inferences that are
appropriate. In general, it is inappropriate to examine all possible comparisons.
In the one-way layout with quantitative levels, a regression approach is usually
appropriate.
Chapter 5: Regression with a Single Predictor
Correlation can be a crude and unduly simplistic summary measure of dependence between
two variables. Wherever possible, one should use the richer regression framework to gain
deeper insights into relationships between variables.
The line or curve for the regression of a response variable y on a predictor x is different
from the line or curve for the regression of x on y. Be aware that the inferred relationship
is conditional on the values of the predictor variable.
The model matrix, together with estimated coefficients, allows for calculation of predicted
or fitted values and residuals.
Following the calculations, it is good practice to assess the fitted model using standard
forms of graphical diagnostics.
Simple alternatives to straight line regression using the data in their raw form, sketched in code below, are:
• transforming x and/or y,
• using polynomial regression,
• fitting a smooth curve.
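A minimal sketch of the three alternatives; x and y are invented data, and the log-log line doubles as the allometric model discussed next.

    x <- runif(50, 1, 10)
    y <- 2 * x^0.75 * exp(rnorm(50, sd = 0.1))
    lm(log(y) ~ log(x))    # transforming x and y (here, the allometric model)
    lm(y ~ poly(x, 2))     # polynomial regression, degree 2
    lowess(x, y)           # a smooth curve, here by lowess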
For size and shape data the allometric model is a good starting point. This model assumes
that regression relationships among the logarithms of the size variables are linear.
Chapter 6: Multiple Linear Regression
Scatterplot matrices may provide useful insight, prior to fitting a regression model.
Following the fitting of a regression, one should examine relevant diagnostic plots.
Each regression coefficient estimates the effect of changes in the corresponding explanatory variable when other explanatory variables are held constant.
The use of a different set of explanatory variables may lead to large changes in the
coefficients for those variables that are in both models.
Selective influences in the data collection can have a large effect on the fitted regression
relationship.
For comparing alternative models, the AIC or equivalent statistic (including Mallows'
Cp) can be useful. The R² statistic has limited usefulness.
If the effect of variable selection is ignored, the estimate of predictive power can be
grossly inflated.
When regression models are fitted to observational data, and especially if there are a
number of explanatory variables, estimated regression coefficients can give misleading
indications of the effects of those individual variables.
The most useful test of predictive power comes from determining the predictive accuracy
that can be expected from a new data set.
Cross-validation is a powerful and widely applicable method that can be used for assessing
the expected predictive accuracy in a new sample.
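A hedged sketch of k-fold cross-validation for a straight-line model follows; all names are invented, and the cv.lm() function in the DAAG package automates this kind of calculation.

    df <- data.frame(x = runif(100))
    df$y <- 3 + 2*df$x + rnorm(100, sd = 0.5)
    fold <- sample(rep(1:5, length.out = nrow(df)))   # random fold labels
    press <- 0
    for (k in 1:5) {
        fit <- lm(y ~ x, data = df[fold != k, ])      # fit without fold k
        pred <- predict(fit, newdata = df[fold == k, ])
        press <- press + sum((df$y[fold == k] - pred)^2)
    }
    press/nrow(df)   # cross-validation estimate of prediction error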
Chapter 7: Exploiting the Linear Model Framework
In the study of regression relationships, there are many more possibilities than regression
lines! If a line is adequate, use that. But one is not limited to lines!
A common way to handle qualitative factors in linear models is to make the initial level
the baseline, with the effects of other levels estimated as offsets from this baseline.
Polynomials of degree n can be handled by introducing into the model matrix, in addition
to a column of values of x, columns corresponding to x², x³, . . . , xⁿ. Typically, n = 2, 3 or 4.
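A sketch (with invented data) of two equivalent ways to specify a cubic in R:

    x <- seq(0, 10, by = 0.2)
    y <- 1 + x - 0.1*x^2 + rnorm(length(x))
    lm(y ~ x + I(x^2) + I(x^3))   # explicit columns x, x^2, x^3
    lm(y ~ poly(x, 3))            # orthogonal polynomial; same fitted values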
Multiple lines are fitted as an interaction between the variable and a factor with as many
levels as there are different lines.
Scatterplot smoothing, and smoothing terms in multiple linear models, can also be handled
within the linear model framework.
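For example (a sketch with invented names: x quantitative, g a factor):

    library(splines)                  # for the natural spline ns()
    g <- factor(rep(c("A", "B"), each = 25))
    x <- runif(50)
    y <- as.numeric(g) + 0.5*x + rnorm(50, sd = 0.2)
    lm(y ~ x + g)            # parallel lines (separate intercepts)
    lm(y ~ x*g)              # separate lines (intercepts and slopes differ)
    lm(y ~ ns(x, df = 4))    # a smooth curve as a linear model term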
Chapter 8: Logistic Regression and Other Generalized Linear Models
Generalized linear models (GLMs) are an extension of linear models, in which a function
of the expectation of the response variable y is expressed as a linear model. A further
generalization is that y may have a binomial or Poisson or other non-normal distribution.
Common important GLMs are the logistic model and the Poisson regression model.
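Minimal sketches of the two (the data are invented for illustration):

    x <- runif(100)
    yes <- rbinom(100, size = 1, prob = plogis(-1 + 3*x))
    glm(yes ~ x, family = binomial)     # logistic regression
    counts <- rpois(100, lambda = exp(0.5 + 1.5*x))
    glm(counts ~ x, family = poisson)   # Poisson regression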
Survival analysis may be seen as a further specific extension of the GLM framework.
Chapter 11: Multivariate Data Exploration and Discrimination
Both principal components analysis and discriminant analysis allow the calculation of
scores, which are values of the principal components or discriminant functions, calculated
observation by observation. The scores may themselves be used as variables in, e.g., a
regression analysis.
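A sketch (with invented data) of extracting and re-using principal component scores:

    X <- matrix(rnorm(200), ncol = 4)   # 50 observations, 4 variables
    pc <- prcomp(X, scale. = TRUE)
    scores <- pc$x                      # one row of scores per observation
    y <- rnorm(50)
    lm(y ~ scores[, 1:2])               # scores used as regression variables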
Chapter 12: The R System – Additional Topics
This final chapter gives pointers to some of the further capabilities of R. It hints at the
marvellous power and flexibility that are available to those who extend their skills in the
use of R beyond the basic topics that we have treated. The information in this chapter is
intended, also, for use as a reference in connection with the computations of earlier chapters.