Statistics and Computing: J. Chambers, D. Hand, W. Härdle
Series Editors:
J. Chambers
D. Hand
W. Härdle
Statistics and Computing
Brusco/Stahl: Branch and Bound Applications in Combinatorial
Data Analysis
Chambers: Software for Data Analysis: Programming with R
Dalgaard: Introductory Statistics with R, 2nd ed.
Gentle: Elements of Computational Statistics
Gentle: Numerical Linear Algebra for Applications in Statistics
Gentle: Random Number Generation and Monte
Carlo Methods, 2nd ed.
Härdle/Klinke/Turlach: XploRe: An Interactive Statistical
Computing Environment
Hörmann/Leydold/Derflinger: Automatic Nonuniform Random
Variate Generation
Krause/Olson: The Basics of S-PLUS, 4th ed.
Lange: Numerical Analysis for Statisticians
Lemmon/Schafer: Developing Statistical Software in Fortran 95
Loader: Local Regression and Likelihood
Marasinghe/Kennedy: SAS for Data Analysis: Intermediate
Statistical Methods
Ó Ruanaidh/Fitzgerald: Numerical Bayesian Methods Applied to
Signal Processing
Pannatier: VARIOWIN: Software for Spatial Data Analysis in 2D
Pinheiro/Bates: Mixed-Effects Models in S and S-PLUS
Unwin/Theus/Hofmann: Graphics of Large Datasets:
Visualizing a Million
Venables/Ripley: Modern Applied Statistics with S, 4th ed.
Venables/Ripley: S Programming
Wilkinson: The Grammar of Graphics, 2nd ed.
Peter Dalgaard
Peter Dalgaard
Department of Biostatistics
University of Copenhagen
Denmark
[email protected]
ISSN: 1431-8784
ISBN: 978-0-387-79053-4 e-ISBN: 978-0-387-79054-1
DOI: 10.1007/978-0-387-79054-1
Library of Congress Control Number: 2008932040
springer.com
To Grete, for putting up with me for so long
Preface
R owes its name to typical Internet humour. You may be familiar with
the programming language C (whose name is a story in itself). Inspired
by this, Becker and Chambers chose in the early 1980s to call their newly
developed statistical programming language S. This language was further
developed into the commercial product S-PLUS, which by the end of the
decade was in widespread use among statisticians of all kinds. Ross Ihaka
and Robert Gentleman from the University of Auckland, New Zealand,
chose to write a reduced version of S for teaching purposes, and what was
more natural than choosing the immediately preceding letter? Ross’ and
Robert’s initials may also have played a role.
In 1995, Martin Maechler persuaded Ross and Robert to release the source
code for R under the GPL. This coincided with the upsurge in Open Source
software spurred by the Linux system. R soon turned out to fill a gap for
people like me who intended to use Linux for statistical computing but
had no statistical package available at the time. A mailing list was set up
for the communication of bug reports and discussions of the development
of R.
In August 1997, I was invited to join an extended international core team
whose members collaborate via the Internet and that has controlled the
development of R since then. The core team was subsequently expanded
several times and currently includes 19 members. On February 29, 2000,
version 1.0.0 was released. As of this writing, the current version is 2.6.2.
This book was originally based upon a set of notes developed for the
course in Basic Statistics for Health Researchers at the Faculty of Health
Sciences of the University of Copenhagen. The course had a primary target
of students for the Ph.D. degree in medicine. However, the material
has been substantially revised, and I hope that it will be useful for a larger
audience, although some biostatistical bias remains, particularly in the
choice of examples.
In later years, the course in Statistical Practice in Epidemiology, which has
been held yearly in Tartu, Estonia, has been a major source of inspiration
and experience in introducing young statisticians and epidemiologists to
R.
This book is not a manual for R. The idea is to introduce a number of basic
concepts and techniques that should allow the reader to get started with
practical statistics.
In terms of the practical methods, the book covers a reasonable curriculum
for first-year students of theoretical statistics as well as for engineering
students. These groups will eventually need to go further and study
more complex models as well as general techniques involving actual
programming in the R language.
For fields where elementary statistics is taught mainly as a tool, the book
goes somewhat further than what is commonly taught at the undergraduate
level. Multiple regression methods or analysis of multifactorial
experiments are rarely taught at that level but may quickly become essential
for practical research. I have collected the simpler methods near the
beginning to make the book readable also at the elementary level. However,
in order to keep technical material together, Chapters 1 and 2 do
include material that some readers will want to skip.
The book is thus intended to be useful for several groups, but I will not
pretend that it can stand alone for any of them. I have included brief
theoretical sections in connection with the various methods, but more
than as teaching material, these should serve as reminders or perhaps as
appetizers for readers who are new to the world of statistics.
Acknowledgements
Obviously, this book would not have been possible without the efforts of
my friends and colleagues on the R Core Team, the authors of contributed
packages, and many of the correspondents of the e-mail discussion lists.
I am deeply grateful for the support of my colleagues and co-teachers
Lene Theil Skovgaard, Bendix Carstensen, Birthe Lykke Thomsen, Helle
Rootzen, Claus Ekstrøm, Thomas Scheike, and from the Tartu course
Krista Fischer, Esa Läärä, Martyn Plummer, Mark Myatt, and Michael
Hills, as well as the feedback from several students. In addition, several
people, including Bill Venables, Brian Ripley, and David James, gave
valuable advice on early drafts of the book.
Finally, profound thanks are due to the free software community at large.
The R project would not have been possible without their effort. For the
typesetting of this book, TeX, LaTeX, and the consolidating efforts of the
LaTeX2e project have been indispensable.
Peter Dalgaard
Copenhagen
April 2008
Contents
Preface
1 Basics
1.1 First steps
1.1.1 An overgrown calculator
1.1.2 Assignments
1.1.3 Vectorized arithmetic
1.1.4 Standard procedures
1.1.5 Graphics
1.2 R language essentials
1.2.1 Expressions and objects
1.2.2 Functions and arguments
1.2.3 Vectors
1.2.4 Quoting and escape sequences
1.2.5 Missing values
1.2.6 Functions that create vectors
1.2.7 Matrices and arrays
1.2.8 Factors
1.2.9 Lists
1.2.10 Data frames
1.2.11 Indexing
1.2.12 Conditional selection
1.2.13 Indexing of data frames
1.2.14 Grouped data and data frames
2 The R environment
2.1 Session management
2.1.1 The workspace
2.1.2 Textual output
2.1.3 Scripting
2.1.4 Getting help
2.1.5 Packages
2.1.6 Built-in data
2.1.7 attach and detach
2.1.8 subset, transform, and within
2.2 The graphics subsystem
2.2.1 Plot layout
2.2.2 Building a plot from pieces
2.2.3 Using par
2.2.4 Combining plots
2.3 R programming
2.3.1 Flow control
2.3.2 Classes and generic functions
2.4 Data entry
2.4.1 Reading from a text file
2.4.2 Further details on read.table
2.4.3 The data editor
2.4.4 Interfacing to other programs
2.5 Exercises
C Compendium
Bibliography
Index