Cambridge University Press

0521813360 - Data Analysis and Graphics Using R: An Example-based Approach


John Maindonald and John Braun
Frontmatter

Data Analysis and Graphics Using R: An Example-based Approach


CAMBRIDGE SERIES IN STATISTICAL AND PROBABILISTIC MATHEMATICS
Editorial Board:
R. Gill, Department of Mathematics, Utrecht University
B.D. Ripley, Department of Statistics, University of Oxford
S. Ross, Department of Industrial Engineering, University of California, Berkeley
M. Stein, Department of Statistics, University of Chicago
D. Williams, School of Mathematical Sciences, University of Bath
This series of high-quality upper-division textbooks and expository monographs covers
all aspects of stochastic applicable mathematics. The topics range from pure and applied
statistics to probability theory, operations research, optimization, and mathematical programming. The books contain clear presentations of new developments in the field and
also of the state of the art in classical methods. While emphasizing rigorous treatment of
theoretical methods, the books also contain applications and discussions of new techniques
made possible by advances in computational practice.
Already published
1. Bootstrap Methods and Their Application, A.C. Davison and D.V. Hinkley
2. Markov Chains, J. Norris
3. Asymptotic Statistics, A.W. van der Vaart
4. Wavelet Methods for Time Series Analysis, D.B. Percival and A.T. Walden
5. Bayesian Methods, T. Leonard and J.S.J. Hsu
6. Empirical Processes in M-Estimation, S. van de Geer
7. Numerical Methods of Statistics, J. Monahan
8. A User's Guide to Measure-Theoretic Probability, D. Pollard
9. The Estimation and Tracking of Frequency, B.G. Quinn and E.J. Hannan


Data Analysis and Graphics Using R: An Example-based Approach

John Maindonald
Centre for Bioinformation Science, John Curtin School of Medical Research
and Mathematical Sciences Institute, Australian National University

and
John Braun
Department of Statistical and Actuarial Science, University of Western Ontario


Cambridge University Press
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo

Cambridge University Press
40 West 20th Street, New York, NY 10011-4211, USA
www.cambridge.org
Information on this title: www.cambridge.org/9780521813365

© Cambridge University Press 2003

This book is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without
the written permission of Cambridge University Press.
First published 2003
Reprinted 2004, 2005
Printed in the United States of America
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloguing in Publication data
Maindonald, J. H. (John Hilary), 1937–
Data analysis and graphics using R : an example-based approach / John Maindonald and John Braun.
p. cm. – (Cambridge series in statistical and probabilistic mathematics)
Includes bibliographical references and index.
ISBN 0 521 81336 0
1. Statistical Data processing. 2. Statistics – Graphic methods – Data processing. 3. R (Computer program
language) I. Braun, John, 1963– II. Title. III. Cambridge series on statistical and probabilistic mathematics.
QA276.4.M245 2003
519.5′0285–dc21 2002031560
ISBN-13 978-0-521-81336-5 hardback
ISBN-10 0-521-81336-0 hardback
Cambridge University Press has no responsibility for
the persistence or accuracy of URLs for external or
third-party Internet Web sites referred to in this book
and does not guarantee that any content on such
Web sites is, or will remain, accurate or appropriate.


It is easy to lie with statistics. It is hard to tell the truth without statistics.
[Andrejs Dunkels]
. . . technology tends to overwhelm common sense.
[D. A. Freedman]


Contents

Preface
A Chapter by Chapter Summary
1 A Brief Introduction to R
1.1 A Short R Session
1.1.1 R must be installed!
1.1.2 Using the console (or command line) window
1.1.3 Reading data from a file
1.1.4 Entry of data at the command line
1.1.5 Online help
1.1.6 Quitting R
1.2 The Uses of R
1.3 The R Language
1.3.1 R objects
1.3.2 Retaining objects between sessions
1.4 Vectors in R
1.4.1 Concatenation – joining vector objects
1.4.2 Subsets of vectors
1.4.3 Patterned data
1.4.4 Missing values
1.4.5 Factors
1.5 Data Frames
1.5.1 Variable names
1.5.2 Applying a function to the columns of a data frame
1.5.3 Data frames and matrices
1.5.4 Identification of rows that include missing values
1.6 R Packages
1.6.1 Data sets that accompany R packages
1.7 Looping
1.8 R Graphics
1.8.1 The function plot() and allied functions
1.8.2 Identification and location on the figure region
1.8.3 Plotting mathematical symbols
1.8.4 Row by column layouts of plots
1.8.5 Graphs – additional notes
1.9 Additional Points on the Use of R in This Book
1.10 Further Reading
1.11 Exercises

2 Styles of Data Analysis
2.1 Revealing Views of the Data
2.1.1 Views of a single sample
2.1.2 Patterns in grouped data
2.1.3 Patterns in bivariate data – the scatterplot
2.1.4 Multiple variables and times
2.1.5 Lattice (trellis style) graphics
2.1.6 What to look for in plots
2.2 Data Summary
2.2.1 Mean and median
2.2.2 Standard deviation and inter-quartile range
2.2.3 Correlation
2.3 Statistical Analysis Strategies
2.3.1 Helpful and unhelpful questions
2.3.2 Planning the formal analysis
2.3.3 Changes to the intended plan of analysis
2.4 Recap
2.5 Further Reading
2.6 Exercises

3 Statistical Models
3.1 Regularities
3.1.1 Mathematical models
3.1.2 Models that include a random component
3.1.3 Smooth and rough
3.1.4 The construction and use of models
3.1.5 Model formulae
3.2 Distributions: Models for the Random Component
3.2.1 Discrete distributions
3.2.2 Continuous distributions
3.3 The Uses of Random Numbers
3.3.1 Simulation
3.3.2 Sampling from populations
3.4 Model Assumptions
3.4.1 Random sampling assumptions – independence
3.4.2 Checks for normality
3.4.3 Checking other model assumptions
3.4.4 Are non-parametric methods the answer?
3.4.5 Why models matter – adding across contingency tables
3.5 Recap
3.6 Further Reading
3.7 Exercises

4 An Introduction to Formal Inference
4.1 Standard Errors
4.1.1 Population parameters and sample statistics
4.1.2 Assessing accuracy – the standard error
4.1.3 Standard errors for differences of means
4.1.4 The standard error of the median
4.1.5 Resampling to estimate standard errors: bootstrapping
4.2 Calculations Involving Standard Errors: the t-Distribution
4.3 Confidence Intervals and Hypothesis Tests
4.3.1 One- and two-sample intervals and tests for means
4.3.2 Confidence intervals and tests for proportions
4.3.3 Confidence intervals for the correlation
4.4 Contingency Tables
4.4.1 Rare and endangered plant species
4.4.2 Additional notes
4.5 One-Way Unstructured Comparisons
4.5.1 Displaying means for the one-way layout
4.5.2 Multiple comparisons
4.5.3 Data with a two-way structure
4.5.4 Presentation issues
4.6 Response Curves
4.7 Data with a Nested Variation Structure
4.7.1 Degrees of freedom considerations
4.7.2 General multi-way analysis of variance designs
4.8 Resampling Methods for Tests and Confidence Intervals
4.8.1 The one-sample permutation test
4.8.2 The two-sample permutation test
4.8.3 Bootstrap estimates of confidence intervals
4.9 Further Comments on Formal Inference
4.9.1 Confidence intervals versus hypothesis tests
4.9.2 If there is strong prior information, use it!
4.10 Recap
4.11 Further Reading
4.12 Exercises

5 Regression with a Single Predictor
5.1 Fitting a Line to Data
5.1.1 Lawn roller example
5.1.2 Calculating fitted values and residuals
5.1.3 Residual plots
5.1.4 The analysis of variance table
5.2 Outliers, Influence and Robust Regression
5.3 Standard Errors and Confidence Intervals
5.3.1 Confidence intervals and tests for the slope
5.3.2 SEs and confidence intervals for predicted values
5.3.3 Implications for design
5.4 Regression versus Qualitative ANOVA Comparisons
5.5 Assessing Predictive Accuracy
5.5.1 Training/test sets, and cross-validation
5.5.2 Cross-validation – an example
5.5.3 Bootstrapping
5.6 A Note on Power Transformations
5.7 Size and Shape Data
5.7.1 Allometric growth
5.7.2 There are two regression lines!
5.8 The Model Matrix in Regression
5.9 Recap
5.10 Methodological References
5.11 Exercises

6 Multiple Linear Regression
6.1 Basic Ideas: Book Weight and Brain Weight Examples
6.1.1 Omission of the intercept term
6.1.2 Diagnostic plots
6.1.3 Further investigation of influential points
6.1.4 Example: brain weight
6.2 Multiple Regression Assumptions and Diagnostics
6.2.1 Influential outliers and Cook's distance
6.2.2 Component plus residual plots
6.2.3 Further types of diagnostic plot
6.2.4 Robust and resistant methods
6.3 A Strategy for Fitting Multiple Regression Models
6.3.1 Preliminaries
6.3.2 Model fitting
6.3.3 An example – the Scottish hill race data
6.4 Measures for the Comparison of Regression Models
6.4.1 R² and adjusted R²
6.4.2 AIC and related statistics
6.4.3 How accurately does the equation predict?
6.4.4 An external assessment of predictive accuracy
6.5 Interpreting Regression Coefficients – the Labor Training Data
6.6 Problems with Many Explanatory Variables
6.6.1 Variable selection issues
6.6.2 Principal components summaries
6.7 Multicollinearity
6.7.1 A contrived example
6.7.2 The variance inflation factor (VIF)
6.7.3 Remedying multicollinearity
6.8 Multiple Regression Models – Additional Points
6.8.1 Confusion between explanatory and dependent variables
6.8.2 Missing explanatory variables
6.8.3 The use of transformations
6.8.4 Non-linear methods – an alternative to transformation?
6.9 Further Reading
6.10 Exercises

7 Exploiting the Linear Model Framework
7.1 Levels of a Factor – Using Indicator Variables
7.1.1 Example – sugar weight
7.1.2 Different choices for the model matrix when there are factors
7.2 Polynomial Regression
7.2.1 Issues in the choice of model
7.3 Fitting Multiple Lines
7.4 Methods for Passing Smooth Curves through Data
7.4.1 Scatterplot smoothing – regression splines
7.4.2 Other smoothing methods
7.4.3 Generalized additive models
7.5 Smoothing Terms in Multiple Linear Models
7.6 Further Reading
7.7 Exercises

8 Logistic Regression and Other Generalized Linear Models
8.1 Generalized Linear Models
8.1.1 Transformation of the expected value on the left
8.1.2 Noise terms need not be normal
8.1.3 Log odds in contingency tables
8.1.4 Logistic regression with a continuous explanatory variable
8.2 Logistic Multiple Regression
8.2.1 A plot of contributions of explanatory variables
8.2.2 Cross-validation estimates of predictive accuracy
8.3 Logistic Models for Categorical Data – an Example
8.4 Poisson and Quasi-Poisson Regression
8.4.1 Data on aberrant crypt foci
8.4.2 Moth habitat example
8.4.3 Residuals, and estimating the dispersion
8.5 Ordinal Regression Models
8.5.1 Exploratory analysis
8.5.2 Proportional odds logistic regression
8.6 Other Related Models
8.6.1 Loglinear models
8.6.2 Survival analysis
8.7 Transformations for Count Data
8.8 Further Reading
8.9 Exercises

9 Multi-level Models, Time Series and Repeated Measures
9.1 Introduction
9.2 Example – Survey Data, with Clustering
9.2.1 Alternative models
9.2.2 Instructive, though faulty, analyses
9.2.3 Predictive accuracy
9.3 A Multi-level Experimental Design
9.3.1 The ANOVA table
9.3.2 Expected values of mean squares
9.3.3 The sums of squares breakdown
9.3.4 The variance components
9.3.5 The mixed model analysis
9.3.6 Predictive accuracy
9.3.7 Different sources of variance – complication or focus of interest?
9.4 Within and between Subject Effects – an Example
9.5 Time Series – Some Basic Ideas
9.5.1 Preliminary graphical explorations
9.5.2 The autocorrelation function
9.5.3 Autoregressive (AR) models
9.5.4 Autoregressive moving average (ARMA) models – theory
9.6 Regression Modeling with Moving Average Errors – an Example
9.7 Repeated Measures in Time – Notes on the Methodology
9.7.1 The theory of repeated measures modeling
9.7.2 Correlation structure
9.7.3 Different approaches to repeated measures analysis
9.8 Further Notes on Multi-level Modeling
9.8.1 An historical perspective on multi-level models
9.8.2 Meta-analysis
9.9 Further Reading
9.10 Exercises

10 Tree-based Classification and Regression
10.1 The Uses of Tree-based Methods
10.1.1 Problems for which tree-based regression may be used
10.1.2 Tree-based regression versus parametric approaches
10.1.3 Summary of pluses and minuses
10.2 Detecting Email Spam – an Example
10.2.1 Choosing the number of splits
10.3 Terminology and Methodology
10.3.1 Choosing the split – regression trees
10.3.2 Within and between sums of squares
10.3.3 Choosing the split – classification trees
10.3.4 The mechanics of tree-based regression – a trivial example
10.4 Assessments of Predictive Accuracy
10.4.1 Cross-validation
10.4.2 The training/test set methodology
10.4.3 Predicting the future
10.5 A Strategy for Choosing the Optimal Tree
10.5.1 Cost–complexity pruning
10.5.2 Prediction error versus tree size
10.6 Detecting Email Spam – the Optimal Tree
10.6.1 The one-standard-deviation rule
10.7 Interpretation and Presentation of the rpart Output
10.7.1 Data for female heart attack patients
10.7.2 Printed Information on Each Split
10.8 Additional Notes
10.9 Further Reading
10.10 Exercises

11 Multivariate Data Exploration and Discrimination
11.1 Multivariate Exploratory Data Analysis
11.1.1 Scatterplot matrices
11.1.2 Principal components analysis
11.2 Discriminant Analysis
11.2.1 Example – plant architecture
11.2.2 Classical Fisherian discriminant analysis
11.2.3 Logistic discriminant analysis
11.2.4 An example with more than two groups
11.3 Principal Component Scores in Regression
11.4 Propensity Scores in Regression Comparisons – Labor Training Data
11.5 Further Reading
11.6 Exercises

12 The R System – Additional Topics
12.1 Graphs in R
12.2 Functions – Some Further Details
12.2.1 Common useful functions
12.2.2 User-written R functions
12.2.3 Functions for working with dates
12.3 Data input and output
12.3.1 Input
12.3.2 Data output
12.4 Factors – Additional Comments
12.5 Missing Values
12.6 Lists and Data Frames
12.6.1 Data frames as lists
12.6.2 Reshaping data frames; reshape()
12.6.3 Joining data frames and vectors – cbind()
12.6.4 Conversion of tables and arrays into data frames
12.6.5 Merging data frames – merge()
12.6.6 The function sapply() and related functions
12.6.7 Splitting vectors and data frames into lists – split()
12.7 Matrices and Arrays
12.7.1 Outer products
12.7.2 Arrays
12.8 Classes and Methods
12.8.1 Printing and summarizing model objects
12.8.2 Extracting information from model objects
12.9 Databases and Environments
12.9.1 Workspace management
12.9.2 Function environments, and lazy evaluation
12.10 Manipulation of Language Constructs
12.11 Further Reading
12.12 Exercises

Epilogue – Models
Appendix – S-PLUS Differences
References
Index of R Symbols and Functions
Index of Terms
Index of Names

Preface

This book is an exposition of statistical methodology that focuses on ideas and concepts,
and makes extensive use of graphical presentation. It avoids, as much as possible, the use
of mathematical symbolism. It is particularly aimed at scientists who wish to do statistical analyses on their own data, preferably with reference as necessary to professional
statistical advice. It is intended to complement more mathematically oriented accounts of
statistical methodology. It may be used to give students with a more specialist statistical
interest exposure to practical data analysis.
The authors can claim, between them, 40 years of experience in working with researchers
from many different backgrounds. Initial drafts of the monograph were constructed from
notes that the first author prepared for courses for researchers, first of all at the University of
Newcastle (Australia) over 1996–1997, and greatly developed and extended in the course
of work in the Statistical Consulting Unit at The Australian National University over 1998–2001.
We are grateful to those who have discussed their research with us, brought us
their data for analysis, and allowed us to use it in the examples that appear in the present
monograph. At least these data will not, as often happens once data have become the basis
for a published paper, gather dust in a long-forgotten folder!
We have covered a range of topics that we consider important for many different areas
of statistical application. This diversity of sources of examples has benefits, even for those
whose interests are in one specific application area. Ideas and applications that are useful in
one area often find use elsewhere, even to the extent of stimulating new lines of investigation.
We hope that our book will stimulate such cross-fertilization. As is inevitable in a book that
has this broad focus, there will be specific areas – perhaps epidemiology, or psychology, or
sociology, or ecology – that will regret the omission of some methodologies that they find
important.
We use the R system for the computations. The R system implements a dialect of the
influential S language that is the basis for the commercial S-PLUS system. It follows
S in its close linkage between data analysis and graphics. Its development is the result
of a co-operative international effort, bringing together an impressive array of statistical
computing expertise. It has quickly gained a wide following, among professionals and non-professionals alike. At the time of writing, R users are restricted, for the most part, to a
command line interface. Various forms of graphical user interface will become available in
due course.
The R system has an extensive library of packages that offer state-of-the-art abilities.
Many of the analyses that they offer were not, 10 years ago, available in any of the standard
packages. What did data analysts do before we had such packages? Basically, they adapted
more simplistic (but not necessarily simpler) analyses as best they could. Those whose
skills were unequal to the task did unsatisfactory analyses. Those with more adequate skills
carried out analyses that, even if not elegant and insightful by current standards, were often
adequate. Tools such as are available in R have reduced the need for the adaptations that
were formerly necessary. We can often do analyses that better reflect the underlying science.
There have been challenging and exciting changes from the methodology that was typically
encountered in statistics courses 10 or 15 years ago.
The best any analysis can do is to highlight the information in the data. No amount of
statistical or computing technology can be a substitute for good design of data collection,
for understanding the context in which data are to be interpreted, or for skill in the use of
statistical analysis methodology. Statistical software systems are one of several components
of effective data analysis.
The questions that statistical analysis is designed to answer can often be stated simply. This may encourage the layperson to believe that the answers are similarly simple.
Often, they are not. Be prepared for unexpected subtleties. Effective statistical analysis
requires appropriate skills, beyond those gained from taking one or two undergraduate
courses in statistics. There is no good substitute for professional training in modern tools
for data analysis, and experience in using those tools with a wide range of data sets. No-one should be embarrassed that they have difficulty with analyses that involve ideas that
professional statisticians may take 7 or 8 years of professional training and experience to
master.
Influences on the Modern Practice of Statistics
The development of statistics has been motivated by the demands of scientists for a methodology that will extract patterns from their data. The methodology has developed in a synergy
with the relevant supporting mathematical theory and, more recently, with computing. This
has led to methodologies and supporting theory that are a radical departure from the methodologies of the pre-computer era.
Statistics is a young discipline. Only in the 1920s and 1930s did the modern framework of
statistical theory, including ideas of hypothesis testing and estimation, begin to take shape.
Different areas of statistical application have taken these ideas up in different ways, some of
them starting their own separate streams of statistical tradition. Gigerenzer et al. (1989) examine the history, commenting on the different streams of development that have influenced
practice in different research areas.
Separation from the statistical mainstream, and an emphasis on black box approaches,
have contributed to a widespread exaggerated emphasis on tests of hypotheses, to a neglect of pattern, to the policy of some journal editors of publishing only those studies that
show a statistically significant effect, and to an undue focus on the individual study. Anyone who joins the R community can expect to witness, and/or engage in, lively debate
that addresses these and related issues. Such debate can help ensure that the demands of
scientific rationality do in due course win out over influences from accidents of historical
development.


New Tools for Statistical Computing


We have drawn attention to advances in statistical computing methodology. These have led
to new powerful tools for exploratory analysis of regression data, for choosing between
alternative models, for diagnostic checks, for handling non-linearity, for assessing the predictive power of models, and for graphical presentation. In addition, we have new computing
tools that make it straightforward to move data between different systems, to keep a record
of calculations, to retrace or adapt earlier calculations, and to edit output and graphics into
a form that can be incorporated into published documents.
One can think of an effective statistical analysis package as a workshop (this analogy
appears in a simpler form in the JMP Start Statistics Manual (SAS Institute Inc. 1996,
p. xiii).). The tools are the statistical and computing abilities that the package provides.
The layout of the workshop, the arrangement both of the tools and of the working area,
is important. It should be easy to find each tool as it is needed. Tools should float back of
their own accord into the right place after use! In other words, we want a workshop where
mending the rocking chair is a pleasure!
The workshop analogy is worth pursuing further. Different users have different requirements. A hobbyist workshop will differ from a professional workshop. The hobbyist may
have less sophisticated tools, and tools that are easy to use without extensive training or experience. That limits what the hobbyist can do. The professional needs powerful and highly
flexible tools, and must be willing to invest time in learning the skills needed to use them.
Good graphical abilities, and good data manipulation abilities, should be a high priority for
the hobbyist statistical workshop. Other operations should be reasonably easy to implement
when carried out under the instructions of a professional. Professionals also require top rate
graphical abilities. The focus is more on flexibility and power, both for graphics and for
computation. Ease of use is important, but not at the expense of power and flexibility.
A Note on the R System
The R system implements a dialect of the S language that was developed at AT&T Bell
Laboratories by Rick Becker, John Chambers and Allan Wilks. Versions of R are available,
at no cost, for 32-bit versions of Microsoft Windows, for Linux and other Unix systems, and
for the Macintosh. It is available through the Comprehensive R Archive Network (CRAN).
Go to http://cran.r-project.org/, and find the nearest mirror site.
The citation for John Chambers' 1998 Association for Computing Machinery Software
award stated that S has "forever altered how people analyze, visualize and manipulate data."
The R project enlarges on the ideas and insights that generated the S language. We are
grateful to the R Core Development Team, and to the creators of the various R packages,
for bringing into being the R system – this marvellous tool for scientific and statistical
computing, and for graphical presentation.
Acknowledgements
Many different people have helped us with this project. Winfried Theis (University of
Dortmund, Germany) and Detlef Steuer (University of the Federal Armed Forces, Hamburg,
Germany) helped with technical aspects of working with LaTeX, with setting up a cvs server
to manage the LaTeX files, and with helpful comments. Lynne Billard (University of Georgia,
USA), Murray Jorgensen (University of Waikato, NZ) and Berwin Turlach (University of
Western Australia) gave valuable help in the identification of errors and text that required
clarification. Susan Wilson (Australian National University) gave welcome encouragement.
Duncan Murdoch (University of Western Ontario) helped set up the DAAG package. Thanks
also to Cath Lawrence (Australian National University) for her Python program that allowed
us to extract the R code, as and when required, from our LaTeX files. The failings that remain
are, naturally, our responsibility.
There are a large number of people who have helped with the providing of data sets.
We give a list, following the list of references for the data near the end of the book. We
apologize if there is anyone that we have inadvertently failed to acknowledge. Finally,
thanks to David Tranah of Cambridge University Press, for his encouragement and help in
bringing the writing of this monograph to fruition.
References
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J. & Krüger, L. 1989. The Empire
of Chance. Cambridge University Press.
SAS Institute Inc. 1996. JMP Start Statistics. Duxbury Press, Belmont, CA.
These (and all other) references also appear in the consolidated list of references near the
end of the book.
Conventions
Text that is R code, or output from R, is printed in a verbatim text style. For example,
in Chapter 1 we will enter data into an R object that we call austpop. We will use the
plot() function to plot these data. The names of R packages, including our own DAAG
package, are printed in italics.
Starred exercises and sections identify more technical items that can be skipped at a first
reading.
Web sites for supplementary information
The DAAG package, the R scripts that we present, and other supplementary information,
are available from
http://cbis.anu.edu/DAAG
http://www.stats.uwo.ca/DAAG
Solutions to exercises
Solutions to selected exercises are available from the website
http://www.maths.anu.edu.au/johnm/r-book.html
See also www.cambridge.org/0521813360

A Chapter by Chapter Summary

Chapter 1: A Brief Introduction to R


This chapter aims to give enough information on the use of R to get readers started.
Note R's extensive online help facilities. Users who have a basic minimum knowledge
of R can often get needed additional information from the help pages as the demand arises.
A facility in using the help pages is an important basic skill for R users.

Chapter 2: Style of Data Analysis


Knowing how to explore a set of data upon encountering it for the first time is an important
skill. What graphs should one draw?
Different types of graph give different views of the data. Which views are likely to be
helpful?
Transformations, especially the logarithmic transformation, may be a necessary preliminary to data analysis.
There is a contrast between exploratory data analysis, where the aim is to allow the data
to speak for themselves, and confirmatory analysis (which includes formal estimation and
testing), where the form of the analysis should have been largely decided before the data
were collected.
Statistical analysis is a form of data summary. It is important to check, as far as this is
possible, that summarization has captured crucial features of the data. Summary statistics, such as the mean or correlation, should always be accompanied by examination
of a relevant graph. For example, the correlation is a useful summary, if at all, only if
the relationship between two variables is linear. A scatterplot allows a visual check on
linearity.
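The point about correlation and linearity can be illustrated with a minimal sketch in R. The data below are made up for illustration (they are not from the book): y depends on x exactly, but through a curve rather than a line.

```r
# Hypothetical data: y is an exact, but non-linear, function of x.
x <- seq(-3, 3, length.out = 61)
y <- x^2

# The correlation is essentially zero, even though y is fully determined by x.
r <- cor(x, y)
print(r)

# A scatterplot shows the curvature that the correlation fails to capture.
plot(x, y, main = "Strong dependence, near-zero correlation")
```

Here the summary statistic alone would suggest no relationship, while the graph reveals it at once.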

Chapter 3: Statistical Models


Formal data analyses assume an underlying statistical model, whether or not it is explicitly
written down.
Many statistical models have two components: a signal (or deterministic) component;
and a noise (or error) component.
Data from a sample (commonly assumed to be randomly selected) are used to fit the
model by estimating the signal component.
The fitted model determines fitted or predicted values of the signal. The residuals (which
estimate the noise component) are what remain after subtracting the fitted values from the
observed values of the signal.
The normal distribution is widely used as a model for the noise component.
Haphazardly chosen samples should be distinguished from random samples. Inference
from haphazardly chosen samples is inevitably hazardous. Self-selected samples are particularly unsatisfactory.
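As a rough sketch of the signal-plus-noise idea, the following R code simulates data from an assumed straight-line signal with normal noise, fits the model with lm(), and confirms that the residuals are the observed values minus the fitted values. The data and the coefficient values are invented for illustration.

```r
set.seed(1)

# Simulated data: a straight-line signal plus normally distributed noise.
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 1)

# Fit the model; this estimates the signal component.
fit <- lm(y ~ x)

fitted_vals <- fitted(fit)   # estimated signal
res <- resid(fit)            # estimated noise

# The residuals are what remain after subtracting fitted from observed values.
check <- all.equal(unname(res), unname(y - fitted_vals))
print(check)
```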
Chapter 4: An Introduction to Formal Inference
Formal analysis of data leads to inferences about the population(s) from which the data were
sampled. Statistics that can be computed from given data are used to convey information
about otherwise unknown population parameters.
The inferences that are described in this chapter require randomly selected samples from
the relevant populations.
A sampling distribution describes the theoretical distribution of sample values of a statistic, based on multiple independent random samples from the population.
The standard deviation of a sampling distribution has the name standard error.
For sufficiently large samples, the normal distribution provides a good approximation to
the true sampling distribution of the mean or a difference of means.
A confidence interval for a parameter, such as the mean or a difference of means, has the
form
statistic ± t-critical-value × standard error.
Such intervals give an assessment of the level of uncertainty when using a sample statistic
to estimate a population parameter.
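For a single mean, the interval can be computed step by step as in the sketch below. The sample is simulated, and the 95% level is an assumption chosen for illustration; the result is compared with the interval that t.test() reports.

```r
set.seed(42)

# Hypothetical sample of 25 observations.
x <- rnorm(25, mean = 10, sd = 2)
n <- length(x)

stat <- mean(x)                   # the statistic
se <- sd(x) / sqrt(n)             # its standard error
t_crit <- qt(0.975, df = n - 1)   # t-critical-value for a 95% interval

# statistic +/- t-critical-value * standard error
ci <- c(stat - t_crit * se, stat + t_crit * se)
print(ci)

# The same interval, as computed by t.test().
ci_ttest <- t.test(x)$conf.int
print(ci_ttest)
```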
Another viewpoint is that of hypothesis testing. Is there sufficient evidence to believe that
there is a difference between the means of two different populations?
Checks are essential to determine whether it is plausible that confidence intervals and
hypothesis tests are valid. Note however that plausibility is not proof!
Standard chi-squared tests for two-way tables assume that items enter independently into
the cells of the table. Even where such a test is not valid, the standardized residuals from
the "no association" model can give useful insights.
In the one-way layout, in which there are several independent sets of sample values,
one for each of several groups, data structure (e.g. compare treatments with control, or
focus on a small number of interesting contrasts) helps determine the inferences that are
appropriate. In general, it is inappropriate to examine all possible comparisons.
In the one-way layout with quantitative levels, a regression approach is usually
appropriate.
Chapter 5: Regression with a Single Predictor
Correlation can be a crude and unduly simplistic summary measure of dependence between
two variables. Wherever possible, one should use the richer regression framework to gain
deeper insights into relationships between variables.
The line or curve for the regression of a response variable y on a predictor x is different
from the line or curve for the regression of x on y. Be aware that the inferred relationship
is conditional on the values of the predictor variable.
The model matrix, together with estimated coefficients, allows for calculation of predicted
or fitted values and residuals.
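The role of the model matrix can be sketched as follows, using simulated data (the variable names and coefficient values are assumptions made for this example). Multiplying the model matrix by the estimated coefficients reproduces the fitted values that lm() returns.

```r
set.seed(2)

# Simulated data for a straight-line regression.
x <- runif(15, 0, 10)
y <- 1 + 3 * x + rnorm(15)

fit <- lm(y ~ x)

X <- model.matrix(fit)   # columns: (Intercept) and x
b <- coef(fit)           # estimated coefficients

# Fitted values are the model matrix times the coefficient vector.
fitted_by_hand <- drop(X %*% b)

# Residuals are observed values minus fitted values.
resid_by_hand <- y - fitted_by_hand

print(head(cbind(fitted_by_hand, resid_by_hand)))
```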
Following the calculations, it is good practice to assess the fitted model using standard
forms of graphical diagnostics.
Simple alternatives to straight line regression using the data in their raw form are
• transforming x and/or y,
• using polynomial regression,
• fitting a smooth curve.
For size and shape data the allometric model is a good starting point. This model assumes
that regression relationships among the logarithms of the size variables are linear.
Chapter 6: Multiple Linear Regression
Scatterplot matrices may provide useful insight, prior to fitting a regression model.
Following the fitting of a regression, one should examine relevant diagnostic plots.
Each regression coefficient estimates the effect of changes in the corresponding explanatory variable when other explanatory variables are held constant.
The use of a different set of explanatory variables may lead to large changes in the
coefficients for those variables that are in both models.
Selective influences in the data collection can have a large effect on the fitted regression
relationship.
For comparing alternative models, the AIC or equivalent statistic (including Mallows'
C_p) can be useful. The R^2 statistic has limited usefulness.
If the effect of variable selection is ignored, the estimate of predictive power can be
grossly inflated.
When regression models are fitted to observational data, and especially if there are a
number of explanatory variables, estimated regression coefficients can give misleading
indications of the effects of those individual variables.
The most useful test of predictive power comes from determining the predictive accuracy
that can be expected from a new data set.
Cross-validation is a powerful and widely applicable method that can be used for assessing
the expected predictive accuracy in a new sample.
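A minimal sketch of k-fold cross-validation in base R follows. It is written from scratch for illustration and is not the book's own code; the simulated data, the choice of five folds, and the use of mean squared error as the accuracy measure are all assumptions.

```r
set.seed(3)

# Simulated data for a straight-line regression.
n <- 60
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 2 + 0.8 * dat$x + rnorm(n)

# Randomly assign each observation to one of k folds of equal size.
k <- 5
fold <- sample(rep(1:k, length.out = n))

sq_err <- numeric(n)
for (i in 1:k) {
  test <- fold == i
  # Fit on the other k - 1 folds, then predict the held-out fold.
  fit <- lm(y ~ x, data = dat[!test, ])
  pred <- predict(fit, newdata = dat[test, ])
  sq_err[test] <- (dat$y[test] - pred)^2
}

# Cross-validated estimate of the mean squared prediction error.
cv_mse <- mean(sq_err)

# For comparison: the error measured on the data used for fitting,
# which is typically the more optimistic of the two.
resub_mse <- mean(resid(lm(y ~ x, data = dat))^2)

print(c(cv = cv_mse, resubstitution = resub_mse))
```

Each observation is predicted by a model that did not see it, which is what makes the estimate relevant to accuracy on a new sample.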
Chapter 7: Exploiting the Linear Model Framework
In the study of regression relationships, there are many more possibilities than regression
lines! If a line is adequate, use that. But one is not limited to lines!
A common way to handle qualitative factors in linear models is to make the initial level
the baseline, with estimates for other levels estimated as offsets from this baseline.
Polynomials of degree n can be handled by introducing into the model matrix, in addition
to a column of values of x, columns corresponding to x^2, x^3, ..., x^n. Typically, n = 2, 3 or 4.
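The extra columns can be seen directly by building a model matrix for a cubic, as in this small sketch. The values of x are arbitrary, and the formula with I() terms is one way, among others, of specifying raw powers in R.

```r
# Arbitrary predictor values, for illustration only.
x <- c(1, 2, 3, 4, 5)

# Model matrix for a polynomial of degree 3: an intercept column,
# a column of x values, and columns for x^2 and x^3.
X <- model.matrix(~ x + I(x^2) + I(x^3))
print(X)
```

The model remains linear in its coefficients, which is why polynomial regression fits within the linear model framework.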
Multiple lines are fitted as an interaction between the variable and a factor with as many
levels as there are different lines.
Scatterplot smoothing, and smoothing terms in multiple linear models, can also be handled
within the linear model framework.
Chapter 8: Logistic Regression and Other Generalized Linear Models
Generalized linear models (GLMs) are an extension of linear models, in which a function
of the expectation of the response variable y is expressed as a linear model. A further
generalization is that y may have a binomial or Poisson or other non-normal distribution.
Common important GLMs are the logistic model and the Poisson regression model.
Survival analysis may be seen as a further specific extension of the GLM framework.
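As an illustrative sketch of a logistic model, the code below simulates a binary response and fits it with glm(). The data and the true coefficient values are invented. The function of the expectation that is modeled linearly is the logit, and applying its inverse to the linear predictor recovers the fitted probabilities.

```r
set.seed(4)

# Simulated binary response whose success probability depends on x.
n <- 200
x <- rnorm(n)
p <- plogis(-0.5 + 1.2 * x)          # inverse logit of a linear predictor
y <- rbinom(n, size = 1, prob = p)

# Logistic regression: binomial response, logit link (the default).
fit <- glm(y ~ x, family = binomial)
print(coef(fit))

eta <- predict(fit, type = "link")       # linear predictor, on the logit scale
mu <- predict(fit, type = "response")    # fitted probabilities

# The link function connects the two scales.
print(all.equal(plogis(eta), mu))
```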

Chapter 9: Multi-level Models, Time Series and Repeated Measures


In a multi-level model, the random component possesses structure; it is a sum of distinct
error terms.
Multi-level models that exhibit suitable balance have traditionally been analyzed within
an analysis of variance framework. Unbalanced multi-level designs require the more general
multi-level modeling methodology.
Observations taken over time often exhibit time-based dependence. Observations that are
close together in time may be more highly correlated than those that are widely separated.
The autocorrelation function can be used to assess levels of serial correlation in time series.
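A short sketch of the autocorrelation function applied to a simulated series follows. The first-order autoregressive model and its coefficient of 0.7 are assumptions chosen so that nearby observations are positively correlated.

```r
set.seed(5)

# Simulated series with serial correlation: an AR(1) process.
y <- arima.sim(model = list(ar = 0.7), n = 300)

# Sample autocorrelations at lags 0 to 10.
a <- acf(y, lag.max = 10, plot = FALSE)
r <- drop(a$acf)   # r[1] is lag 0, r[2] is lag 1, and so on
print(round(r, 2))

# Plotting the result shows how quickly the correlation decays with lag.
plot(a, main = "Autocorrelation of a simulated AR(1) series")
```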
Repeated measures models have measurements on the same individuals at multiple points
in time and/or space. They typically require the modeling of a correlation structure similar
to that employed in analyzing time series.

Chapter 10: Tree-based Classification and Regression


Tree-based models make very weak assumptions about the form of the classification or
regression model. They make limited use of the ordering properties of continuous or ordinal
explanatory variables. They are unsuitable for use with small data sets.
Tree-based models can be an effective tool for analyzing data that are non-linear and/or
involve complex interactions.
The decision trees that tree-based analyses generate may be complex, giving limited
insight into model predictions.
Cross-validation, and the use of training and test sets, are essential tools both for choosing
the size of the tree and for assessing expected accuracy on a new data set.

Chapter 11: Multivariate Data Exploration and Discrimination


Principal components analysis is an important multivariate exploratory data analysis tool.
Examples are presented of the use of two alternative discrimination methods: logistic
regression, including multivariate logistic regression, and linear discriminant analysis.
Both principal components analysis, and discriminant analysis, allow the calculation of
scores, which are values of the principal components or discriminant functions, calculated
observation by observation. The scores may themselves be used as variables in, e.g., a
regression analysis.
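The calculation of scores, and their use in a later regression, might be sketched as follows. The example uses the USArrests data set that ships with R, not a data set from the book, and the choice of variables is arbitrary.

```r
# Principal components of three of the USArrests variables, standardized.
vars <- USArrests[, c("Assault", "UrbanPop", "Rape")]
pca <- prcomp(vars, scale. = TRUE)

# Scores: the value of each principal component, observation by observation.
scores <- pca$x
print(head(scores))

# The same scores, computed as standardized data times the loadings.
by_hand <- scale(vars) %*% pca$rotation

# The scores may themselves be used as explanatory variables in a regression.
reg_dat <- data.frame(Murder = USArrests$Murder, PC1 = scores[, 1])
reg <- lm(Murder ~ PC1, data = reg_dat)
print(coef(reg))
```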
Chapter 12: The R System: Additional Topics
This final chapter gives pointers to some of the further capabilities of R. It hints at the
marvellous power and flexibility that are available to those who extend their skills in the
use of R beyond the basic topics that we have treated. The information in this chapter is
intended, also, for use as a reference in connection with the computations of earlier chapters.