An Introduction to R
Table of Contents
• Preface
• 1 Introduction and preliminaries
o 1.1 The R environment
o 1.2 Related software and documentation
o 1.3 R and statistics
o 1.4 R and the window system
o 1.5 Using R interactively
o 1.6 An introductory session
o 1.7 Getting help with functions and features
o 1.8 R commands, case sensitivity, etc.
o 1.9 Recall and correction of previous commands
o 1.10 Executing commands from or diverting output to a file
o 1.11 Data permanency and removing objects
• 2 Simple manipulations; numbers and vectors
o 2.1 Vectors and assignment
o 2.2 Vector arithmetic
o 2.3 Generating regular sequences
o 2.4 Logical vectors
o 2.5 Missing values
o 2.6 Character vectors
o 2.7 Index vectors; selecting and modifying subsets of a data set
o 2.8 Other types of objects
• 3 Objects, their modes and attributes
o 3.1 Intrinsic attributes: mode and length
o 3.2 Changing the length of an object
o 3.3 Getting and setting attributes
o 3.4 The class of an object
• 4 Ordered and unordered factors
o 4.1 A specific example
o 4.2 The function tapply() and ragged arrays
o 4.3 Ordered factors
• 5 Arrays and matrices
o 5.1 Arrays
o 5.2 Array indexing. Subsections of an array
o 5.3 Index matrices
o 5.4 The array() function
▪ 5.4.1 Mixed vector and array arithmetic. The recycling rule
o 5.5 The outer product of two arrays
o 5.6 Generalized transpose of an array
o 5.7 Matrix facilities
▪ 5.7.1 Matrix multiplication
▪ 5.7.2 Linear equations and inversion
▪ 5.7.3 Eigenvalues and eigenvectors
▪ 5.7.4 Singular value decomposition and determinants
▪ 5.7.5 Least squares fitting and the QR decomposition
o 5.8 Forming partitioned matrices, cbind() and rbind()
o 5.9 The concatenation function, c(), with arrays
o 5.10 Frequency tables from factors
• 6 Lists and data frames
o 6.1 Lists
o 6.2 Constructing and modifying lists
▪ 6.2.1 Concatenating lists
o 6.3 Data frames
▪ 6.3.1 Making data frames
▪ 6.3.2 attach() and detach()
▪ 6.3.3 Working with data frames
▪ 6.3.4 Attaching arbitrary lists
▪ 6.3.5 Managing the search path
• 7 Reading data from files
o 7.1 The read.table() function
o 7.2 The scan() function
o 7.3 Accessing builtin datasets
▪ 7.3.1 Loading data from other R packages
o 7.4 Editing data
• 8 Probability distributions
o 8.1 R as a set of statistical tables
o 8.2 Examining the distribution of a set of data
o 8.3 One- and two-sample tests
• 9 Grouping, loops and conditional execution
o 9.1 Grouped expressions
o 9.2 Control statements
▪ 9.2.1 Conditional execution: if statements
▪ 9.2.2 Repetitive execution: for loops, repeat and while
• 10 Writing your own functions
o 10.1 Simple examples
o 10.2 Defining new binary operators
o 10.3 Named arguments and defaults
o 10.4 The ‘…’ argument
o 10.5 Assignments within functions
o 10.6 More advanced examples
▪ 10.6.1 Efficiency factors in block designs
▪ 10.6.2 Dropping all names in a printed array
▪ 10.6.3 Recursive numerical integration
o 10.7 Scope
o 10.8 Customizing the environment
o 10.9 Classes, generic functions and object orientation
• 11 Statistical models in R
o 11.1 Defining statistical models; formulae
▪ 11.1.1 Contrasts
o 11.2 Linear models
o 11.3 Generic functions for extracting model information
o 11.4 Analysis of variance and model comparison
▪ 11.4.1 ANOVA tables
o 11.5 Updating fitted models
o 11.6 Generalized linear models
▪ 11.6.1 Families
▪ 11.6.2 The glm() function
o 11.7 Nonlinear least squares and maximum likelihood models
▪ 11.7.1 Least squares
▪ 11.7.2 Maximum likelihood
o 11.8 Some non-standard models
• 12 Graphical procedures
o 12.1 High-level plotting commands
▪ 12.1.1 The plot() function
▪ 12.1.2 Displaying multivariate data
▪ 12.1.3 Display graphics
▪ 12.1.4 Arguments to high-level plotting functions
o 12.2 Low-level plotting commands
▪ 12.2.1 Mathematical annotation
▪ 12.2.2 Hershey vector fonts
o 12.3 Interacting with graphics
o 12.4 Using graphics parameters
▪ 12.4.1 Permanent changes: The par() function
▪ 12.4.2 Temporary changes: Arguments to graphics functions
o 12.5 Graphics parameters list
▪ 12.5.1 Graphical elements
▪ 12.5.2 Axes and tick marks
▪ 12.5.3 Figure margins
▪ 12.5.4 Multiple figure environment
o 12.6 Device drivers
▪ 12.6.1 PostScript diagrams for typeset documents
▪ 12.6.2 Multiple graphics devices
o 12.7 Dynamic graphics
• 13 Packages
o 13.1 Standard packages
o 13.2 Contributed packages and CRAN
o 13.3 Namespaces
• 14 OS facilities
o 14.1 Files and directories
o 14.2 Filepaths
o 14.3 System commands
o 14.4 Compression and Archives
• Appendix A A sample session
• Appendix B Invoking R
o B.1 Invoking R from the command line
o B.2 Invoking R under Windows
o B.3 Invoking R under macOS
o B.4 Scripting with R
• Appendix C The command-line editor
o C.1 Preliminaries
o C.2 Editing actions
o C.3 Command-line editor summary
• Appendix D Function and variable index
• Appendix E Concept index
• Appendix F References
Preface
This introduction to R is derived from an original set of notes describing the S and S-
PLUS environments written in 1990–2 by Bill Venables and David M. Smith when at
the University of Adelaide. We have made a number of small changes to reflect
differences between the R and S programs, and expanded some of the material.
We would like to extend warm thanks to Bill Venables (and David Smith) for granting
permission to distribute this modified version of the notes in this way, and for being a
supporter of R from way back.
Comments and corrections are always welcome. Please address email correspondence
to [email protected].
Suggestions to the reader
Most R novices will start with the introductory session in Appendix A. This should
give some familiarity with the style of R sessions and more importantly some instant
feedback on what actually happens.
Many users will come to R mainly for its graphical facilities. See Graphical procedures,
which can be read at almost any time and need not wait until all the preceding sections
have been digested.
2.7 Index vectors; selecting and modifying subsets of a data set
Subsets of the elements of a vector may be selected by appending to the name of the
vector an index vector in square brackets. More generally any expression that
evaluates to a vector may have subsets of its elements similarly selected by appending
an index vector in square brackets immediately after the expression.
Such index vectors can be any of four distinct types.
1. A logical vector. In this case the index vector is recycled to the same length as the vector from which elements are to be selected. Values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted. For example
> y <- x[!is.na(x)]
creates (or re-creates) an object y which will contain the non-missing values of x, in the same order. Note that if x has missing values, y will be shorter than x. Also
> (x+1)[(!is.na(x)) & x>0] -> z
creates an object z and places in it the values of the vector x+1 for which the corresponding value in x was both non-missing and positive.
2. A vector of positive integral quantities. In this case the values in the index vector must lie in the set {1, 2, …, length(x)}. The corresponding elements of the vector are selected and concatenated, in that order, in the result. The index vector can be of any length and the result is of the same length as the index vector. For example x[6] is the sixth component of x and
> x[1:10]
selects the first 10 elements of x (assuming length(x) is not less than 10). Also
> c("x","y")[rep(c(1,2,2,1), times=4)]
(an admittedly unlikely thing to do) produces a character vector of length 16 consisting of "x", "y", "y", "x" repeated four times.
3. A vector of negative integral quantities. Such an index vector specifies the values to be excluded rather than included. Thus
> y <- x[-(1:5)]
gives y all but the first five elements of x.
4. A vector of character strings. This possibility only applies where an object has a names attribute to identify its components. In this case a sub-vector of the names vector may be used in the same way as the positive integral labels in item 2 above.
> fruit <- c(5, 10, 1, 20)
> names(fruit) <- c("orange", "banana", "apple", "peach")
> lunch <- fruit[c("apple","orange")]
The advantage is that alphanumeric names are often easier to remember than numeric indices. This option is particularly useful in connection with data frames, as we shall see later.
An indexed expression can also appear on the receiving end of an assignment, in
which case the assignment operation is performed only on those elements of the
vector. The expression must be of the form vector[index_vector] as having an arbitrary
expression in place of the vector name does not make much sense here.
For example
> x[is.na(x)] <- 0
replaces any missing values in x by zeros and
> y[y < 0] <- -y[y < 0]
has the same effect as
> y <- abs(y)
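The four index types and indexed assignment can be tried side by side in a short session; the vector x here is purely illustrative.

```r
# Illustrative vector with missing values
x <- c(10, NA, -3, 4, NA, 7)

x[!is.na(x)]           # logical index: drops the NAs
x[1:3]                 # positive integers: the first three elements
x[-(1:2)]              # negative integers: everything but the first two

names(x) <- c("a", "b", "c", "d", "e", "f")
x[c("d", "a")]         # character index via the names attribute

# Indexed assignment: replace missing values by zero
x[is.na(x)] <- 0
```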
5 Arrays and matrices
5.1 Arrays
An array can be considered as a multiply subscripted collection of data entries, for
example numeric. R allows simple facilities for creating and handling arrays, and in
particular the special case of matrices.
A dimension vector is a vector of non-negative integers. If its length is k then the
array is k-dimensional, e.g. a matrix is a 2-dimensional array. The dimensions are
indexed from one up to the values given in the dimension vector.
A vector can be used by R as an array only if it has a dimension vector as
its dim attribute. Suppose, for example, z is a vector of 1500 elements. The
assignment
> dim(z) <- c(3,5,100)
gives it the dim attribute that allows it to be treated as a 3 by 5 by 100 array.
Other functions such as matrix() and array() are available for simpler and more
natural looking assignments, as we shall see in The array() function.
The values in the data vector give the values in the array in the same order as they
would occur in FORTRAN, that is “column major order,” with the first subscript
moving fastest and the last subscript slowest.
For example if the dimension vector for an array, say a, is c(3,4,2) then there are 3 *
4 * 2 = 24 entries in a and the data vector holds them in the order a[1,1,1],
a[2,1,1], …, a[2,4,2], a[3,4,2].
Arrays can be one-dimensional: such arrays are usually treated in the same way as
vectors (including when printing), but the exceptions can cause confusion.
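The column-major ordering described above can be checked directly with the c(3,4,2) example from the text:

```r
a <- 1:24
dim(a) <- c(3, 4, 2)     # the 3 by 4 by 2 example from the text

# Column-major order: the first subscript moves fastest
a[1, 1, 1]   # 1, the first entry of the data vector
a[2, 1, 1]   # 2, the first subscript has advanced
a[1, 2, 1]   # 4, the second subscript advances every 3 entries
a[3, 4, 2]   # 24, the last entry of the data vector
```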
5.4.1 Mixed vector and array arithmetic. The recycling rule
The precise rule affecting element by element mixed calculations with vectors and
arrays is somewhat quirky and hard to find in the references. From experience we
have found the following to be a reliable guide.
• The expression is scanned from left to right.
• Any short vector operands are extended by recycling their values until they
match the size of any other operands.
• As long as short vectors and arrays only are encountered, the arrays must all
have the same dim attribute or an error results.
• Any vector operand longer than a matrix or array operand generates an error.
• If array structures are present and no error or coercion to vector has been
precipitated, the result is an array structure with the common dim attribute of its
array operands.
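A minimal sketch of these rules (the variable names are illustrative):

```r
A <- matrix(1:6, nrow = 2)        # a 2 by 3 array

# A short vector operand is recycled over the array's cells:
A + c(10, 20)

# Two array operands must have identical dim attributes:
B <- matrix(rep(1, 6), nrow = 2)
A + B                             # fine: same dim, result is a 2 by 3 array

# A vector operand longer than the array signals an error:
# A + 1:12                        # error: vector longer than the array operand
```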
5.7.4 Singular value decomposition and determinants
The function svd(M) takes an arbitrary matrix argument, M, and calculates the singular
value decomposition of M. This consists of a matrix of orthonormal columns U with the
same column space as M, a second matrix of orthonormal columns V whose column
space is the row space of M and a diagonal matrix of positive entries D such that M = U
%*% D %*% t(V). D is actually returned as a vector of the diagonal elements. The result
of svd(M) is actually a list of three components named d, u and v, with evident
meanings.
If M is in fact square, then it is not hard to see that
> absdetM <- prod(svd(M)$d)
calculates the absolute value of the determinant of M. If this calculation were needed
often with a variety of matrices it could be defined as an R function
> absdet <- function(M) prod(svd(M)$d)
after which we could use absdet() as just another R function. As a further trivial but
potentially useful example, you might like to consider writing a function, say tr(), to
calculate the trace of a square matrix. [Hint: You will not need to use an explicit loop.
Look again at the diag() function.]
R has a builtin function det to calculate a determinant, including the sign, and another, determinant, to give the sign and modulus (optionally on log scale).
5.10 Frequency tables from factors
Recall that a factor defines a partition into groups. Similarly a pair of factors defines a
two way cross classification, and so on. The function table() allows frequency tables
to be calculated from equal length factors. If there are k factor arguments, the result is
a k-way array of frequencies.
Suppose, for example, that statef is a factor giving the state code for each entry in a
data vector. The assignment
> statefr <- table(statef)
gives in statefr a table of frequencies of each state in the sample. The frequencies are
ordered and labelled by the levels attribute of the factor. This simple case is
equivalent to, but more convenient than,
> statefr <- tapply(statef, statef, length)
Further suppose that incomef is a factor giving a suitably defined “income class” for
each entry in the data vector, for example with the cut() function:
> factor(cut(incomes, breaks = 35+10*(0:7))) -> incomef
Then to calculate a two-way table of frequencies:
> table(incomef,statef)
statef
incomef act nsw nt qld sa tas vic wa
(35,45] 1 1 0 1 0 0 1 0
(45,55] 1 1 1 1 2 0 1 3
(55,65] 0 3 1 3 2 2 2 1
(65,75] 0 1 0 0 0 0 1 0
Extension to higher-way frequency tables is immediate.
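For instance, with a third factor the same call gives a 3-way array of frequencies (the factors here are illustrative):

```r
# Three parallel factors of equal length
sex    <- factor(c("m", "f", "m", "f", "m"))
smoker <- factor(c("y", "y", "n", "n", "n"))
group  <- factor(c("a", "a", "a", "b", "b"))

tab <- table(sex, smoker, group)
dim(tab)             # 2 2 2: one dimension per factor argument
tab["m", "n", "b"]   # count of male non-smokers in group b
```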
6 Lists and data frames
6.1 Lists
An R list is an object consisting of an ordered collection of objects known as
its components.
There is no particular need for the components to be of the same mode or type, and,
for example, a list could consist of a numeric vector, a logical value, a matrix, a
complex vector, a character array, a function, and so on. Here is a simple example of
how to make a list:
> Lst <- list(name="Fred", wife="Mary", no.children=3,
child.ages=c(4,7,9))
Components are always numbered and may always be referred to as such. Thus
if Lst is the name of a list with four components, these may be individually referred to
as Lst[[1]], Lst[[2]], Lst[[3]] and Lst[[4]]. If, further, Lst[[4]] is a vector
subscripted array then Lst[[4]][1] is its first entry.
If Lst is a list, then the function length(Lst) gives the number of (top level)
components it has.
Components of lists may also be named, and in this case the component may be
referred to either by giving the component name as a character string in place of the
number in double square brackets, or, more conveniently, by giving an expression of
the form
> name$component_name
for the same thing.
This is a very useful convention as it makes it easier to get the right component if you
forget the number.
So in the simple example given above:
Lst$name is the same as Lst[[1]] and is the string "Fred",
Lst$wife is the same as Lst[[2]] and is the string "Mary",
Lst$child.ages[1] is the same as Lst[[4]][1] and is the number 4.
Additionally, one can also use the names of the list components in double square
brackets, i.e., Lst[["name"]] is the same as Lst$name. This is especially useful, when
the name of the component to be extracted is stored in another variable as in
> x <- "name"; Lst[[x]]
It is very important to distinguish Lst[[1]] from Lst[1]. ‘[[…]]’ is the operator used
to select a single element, whereas ‘[…]’ is a general subscripting operator. Thus the
former is the first object in the list Lst, and if it is a named list the name
is not included. The latter is a sublist of the list Lst consisting of the first entry only. If
it is a named list, the names are transferred to the sublist.
The names of components may be abbreviated down to the minimum number of
letters needed to identify them uniquely. Thus Lst$coefficients may be minimally
specified as Lst$coe and Lst$covariance as Lst$cov.
The vector of names is in fact simply an attribute of the list like any other and may be
handled as such. Other structures besides lists may, of course, similarly be given
a names attribute also.
6.2.1 Concatenating lists
When the concatenation function c() is given list arguments, the result is an object of
mode list also, whose components are those of the argument lists joined together in
sequence.
> list.ABC <- c(list.A, list.B, list.C)
Recall that with vector objects as arguments the concatenation function similarly
joined together all arguments into a single vector structure. In this case all other
attributes, such as dim attributes, are discarded.
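A short sketch of the difference between concatenating lists and concatenating vectors:

```r
list.A <- list(x = 1)
list.B <- list(y = "two")
list.C <- list(z = TRUE)

list.ABC <- c(list.A, list.B, list.C)
length(list.ABC)     # 3: components joined in sequence
names(list.ABC)      # "x" "y" "z"

# With vector arguments, c() flattens and drops attributes such as dim:
m <- matrix(1:4, nrow = 2)
c(m, 5)              # a plain vector 1 2 3 4 5; the dim attribute is gone
```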
6.3 Data frames
A data frame is a list with class "data.frame". There are restrictions on lists that may
be made into data frames, namely
• The components must be vectors (numeric, character, or logical), factors,
numeric matrices, lists, or other data frames.
• Matrices, lists, and data frames provide as many variables to the new data
frame as they have columns, elements, or variables, respectively.
• Vector structures appearing as variables of the data frame must all have
the same length, and matrix structures must all have the same number of rows.
A data frame may for many purposes be regarded as a matrix with columns possibly
of differing modes and attributes. It may be displayed in matrix form, and its rows and
columns extracted using matrix indexing conventions.
• Making data frames
• attach() and detach()
• Working with data frames
• Attaching arbitrary lists
• Managing the search path
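The matrix-like view of a data frame can be sketched with a small illustrative example:

```r
df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))

df[2, ]          # second row, returned as a one-row data frame
df[, "x"]        # the x column, extracted as a vector
df[df$x > 1, ]   # rows selected by a logical condition, as with matrices
```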
6.3.1 Making data frames
Objects satisfying the restrictions placed on the columns (components) of a data frame
may be used to form one using the function data.frame:
> accountants <- data.frame(home=statef, loot=incomes, shot=incomef)
A list whose components conform to the restrictions of a data frame may
be coerced into a data frame using the function as.data.frame()
The simplest way to construct a data frame from scratch is to use
the read.table() function to read an entire data frame from an external file. This is
discussed further in Reading data from files.
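A self-contained sketch of data.frame() and as.data.frame(); the column values here are made up, standing in for the statef, incomes and incomef objects of earlier sections:

```r
home <- factor(c("nsw", "vic", "nsw"))     # illustrative stand-in for statef
loot <- c(52, 61, 49)                      # illustrative stand-in for incomes
shot <- factor(c("(45,55]", "(55,65]", "(45,55]"))

accountants <- data.frame(home = home, loot = loot, shot = shot)
nrow(accountants)      # 3
names(accountants)     # "home" "loot" "shot"

# A list whose components conform can be coerced the same way:
as.data.frame(list(home = home, loot = loot, shot = shot))
```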
6.3.2 attach() and detach()
The $ notation, such as accountants$home, for list components is not always very
convenient. A useful facility would be somehow to make the components of a list or
data frame temporarily visible as variables under their component name, without the
need to quote the list name explicitly each time.
The attach() function takes a ‘database’ such as a list or data frame as its argument.
Thus suppose lentils is a data frame with three
variables lentils$u, lentils$v, lentils$w. The attach
> attach(lentils)
places the data frame in the search path at position 2, and provided there are no
variables u, v or w in position 1, u, v and w are available as variables from the data
frame in their own right. At this point an assignment such as
> u <- v+w
does not replace the component u of the data frame, but rather masks it with another
variable u in the workspace at position 1 on the search path. To make a permanent
change to the data frame itself, the simplest way is to resort once again to
the $ notation:
> lentils$u <- v+w
However the new value of component u is not visible until the data frame is detached
and attached again.
To detach a data frame, use the function
> detach()
More precisely, this statement detaches from the search path the entity currently at
position 2. Thus in the present context the variables u, v and w would be no longer
visible, except under the list notation as lentils$u and so on. Entities at positions
greater than 2 on the search path can be detached by giving their number to detach,
but it is much safer to always use a name, for example
by detach(lentils) or detach("lentils").
Note: In R lists and data frames can only be attached at position 2 or above, and what
is attached is a copy of the original object. You can alter the attached
values via assign, but the original list or data frame is unchanged.
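The masking behaviour can be sketched end to end; the lentils data frame here is constructed for illustration:

```r
lentils <- data.frame(u = c(1, 2), v = c(3, 4), w = c(5, 6))
attach(lentils)
u <- v + w          # creates a NEW u in the workspace, masking lentils$u
u                   # 8 10, the workspace variable
lentils$u           # still 1 2: the data frame itself is unchanged
detach("lentils")
rm(u)               # tidy up the masking variable
```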
6.3.3 Working with data frames
A useful convention that allows you to work with many different problems
comfortably together in the same workspace is
• gather together all variables for any well defined and separate problem in a data
frame under a suitably informative name;
• when working with a problem attach the appropriate data frame at position 2,
and use the workspace at level 1 for operational quantities and temporary
variables;
• before leaving a problem, add any variables you wish to keep for future
reference to the data frame using the $ form of assignment, and then detach();
• finally remove all unwanted variables from the workspace and keep it as clean
of left-over temporary variables as possible.
In this way it is quite simple to work with many problems in the same directory, all of
which have variables named x, y and z, for example.
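The convention above, sketched end to end (the names are illustrative):

```r
# 1. Gather a problem's variables into a suitably named data frame
problemA <- data.frame(x = 1:3, y = c(2, 4, 6))

# 2. Attach it at position 2 and use the workspace for temporaries
attach(problemA)
temp <- x + y

# 3. Keep a result in the data frame with $ assignment, then detach
problemA$z <- temp
detach("problemA")

# 4. Remove the left-over temporaries from the workspace
rm(temp)
problemA$z     # 3 6 9, kept for future reference
```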
6.3.4 Attaching arbitrary lists
attach() is a generic function that allows not only directories and data frames to be
attached to the search path, but other classes of object as well. In particular any object
of mode "list" may be attached in the same way:
> attach(any.old.list)
Anything that has been attached can be detached by detach, by position number or,
preferably, by name.
6.3.5 Managing the search path
The function search shows the current search path and so is a very useful way to keep track of which data frames and lists (and packages) have been attached and detached.
Initially it gives
> search()
[1] ".GlobalEnv" "Autoloads" "package:base"
where .GlobalEnv is the workspace.
After lentils is attached we have
> search()
[1] ".GlobalEnv" "lentils" "Autoloads" "package:base"
> ls(2)
[1] "u" "v" "w"
and as we see ls (or objects) can be used to examine the contents of any position on
the search path.
Finally, we detach the data frame and confirm it has been removed from the search
path.
> detach("lentils")
> search()
[1] ".GlobalEnv" "Autoloads" "package:base"
7 Reading data from files
7.1 The read.table() function
By default numeric items (except row labels) are read as numeric variables and non-
numeric variables, such as Cent.heat in the example, as character variables. This can
be changed if necessary.
The function read.table() can then be used to read the data frame directly
> HousePrice <- read.table("houses.data")
Often you will want to omit including the row labels directly and use the default
labels. In this case the file may omit the row label column as in the following.
Input file form without row labels:
7.3.1 Loading data from other R packages
To access data from a particular package, use the package argument, for example
data(package="rpart")
data(Puromycin, package="datasets")
If a package has been attached by library, its datasets are automatically included in
the search.
User-contributed packages can be a rich source of datasets.
8 Probability distributions
• R as a set of statistical tables
• Examining the distribution of a set of data
• One- and two-sample tests
8.1 R as a set of statistical tables
Prefix the name given here by ‘d’ for the density, ‘p’ for the CDF, ‘q’ for the quantile
function and ‘r’ for simulation (random deviates). The first argument
is x for dxxx, q for pxxx, p for qxxx and n for rxxx (except
for rhyper, rsignrank and rwilcox, for which it is nn). In not quite all cases is the non-
centrality parameter ncp currently available: see the on-line help for details.
The pxxx and qxxx functions all have logical arguments lower.tail and log.p and
the dxxx ones have log. This allows, e.g., getting the cumulative (or
“integrated”) hazard function, H(t) = - log(1 - F(t)), by
- pxxx(t, ..., lower.tail = FALSE, log.p = TRUE)
or more accurate log-likelihoods (by dxxx(..., log = TRUE)), directly.
In addition there are functions ptukey and qtukey for the distribution of the
studentized range of samples from a normal distribution,
and dmultinom and rmultinom for the multinomial distribution. Further distributions
are available in contributed packages, notably SuppDists.
Here are some examples
> ## 2-tailed p-value for t distribution
> 2*pt(-2.43, df = 13)
> ## upper 1% point for an F(2, 7) distribution
> qf(0.01, 2, 7, lower.tail = FALSE)
See the on-line help on RNG for how random-number generation is done in R.
8.2 Examining the distribution of a set of data
> stem(eruptions)
16 | 070355555588
18 | 000022233333335577777777888822335777888
20 | 00002223378800035778
22 | 0002335578023578
24 | 00228
26 | 23
28 | 080
30 | 7
32 | 2337
34 | 250077
36 | 0000823577
38 | 2333335582225577
40 | 0000003357788888002233555577778
42 | 03335555778800233333555577778
44 | 02222335557780000000023333357778888
46 | 0000233357700000023578
48 | 00000022335800333
50 | 0370
A stem-and-leaf plot is like a histogram, and R has a function hist to plot histograms.
> hist(eruptions)
## make the bins smaller, make a plot of density
> hist(eruptions, seq(1.6, 5.2, 0.2), prob=TRUE)
> lines(density(eruptions, bw=0.1))
> rug(eruptions) # show the actual data points
More elegant density plots can be made by density, and we added a line produced
by density in this example. The bandwidth bw was chosen by trial-and-error as the
default gives too much smoothing (it usually does for “interesting” densities). (Better
automated methods of bandwidth choice are available, and in this example bw =
"SJ" gives a good result.)
We can plot the empirical cumulative distribution function by using the function ecdf.
> plot(ecdf(eruptions), do.points=FALSE, verticals=TRUE)
This distribution is obviously far from any standard distribution. How about the right-
hand mode, say eruptions of longer than 3 minutes? Let us fit a normal distribution
and overlay the fitted CDF.
> long <- eruptions[eruptions > 3]
> plot(ecdf(long), do.points=FALSE, verticals=TRUE)
> x <- seq(3, 5.4, 0.01)
> lines(x, pnorm(x, mean=mean(long), sd=sqrt(var(long))), lty=3)
Q-Q (quantile-quantile) plots can help us examine agreement with a given distribution more carefully. For example, for simulated data from a t distribution,
x <- rt(250, df = 5)
qqnorm(x); qqline(x)
which will usually (if it is a random sample) show longer tails than expected for a
normal. We can make a Q-Q plot against the generating distribution by
qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
qqline(x)
Finally, we might want a more formal test of agreement with normality (or not). R
provides the Shapiro-Wilk test
> shapiro.test(long)

         Shapiro-Wilk normality test

data:  long
W = 0.9793, p-value = 0.01052
and the Kolmogorov-Smirnov test
> ks.test(long, "pnorm", mean = mean(long), sd = sqrt(var(long)))

         One-sample Kolmogorov-Smirnov test

data:  long
D = 0.0661, p-value = 0.4284
alternative hypothesis: two-sided
(Note that the distribution theory is not valid here as we have estimated the parameters
of the normal distribution from the same sample.)
With a second sample B entered in the same way as A, a pair of boxplots gives a quick graphical comparison:
> B <- scan()
80.02 79.94 79.98 79.97 79.97 80.03 79.95 79.97

> boxplot(A, B)
which indicates that the first group tends to give higher results than the second.
To test for the equality of the means of the two examples, we can use an unpaired t-
test by
> t.test(A, B)

         Welch Two Sample t-test

data:  A and B
t = 3.2499, df = 12.027, p-value = 0.00694
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.01385526 0.07018320
sample estimates:
mean of x mean of y
80.02077 79.97875
which does indicate a significant difference, assuming normality. By default the R
function does not assume equality of variances in the two samples. We can use the F
test to test for equality in the variances, provided that the two samples are from
normal populations.
> var.test(A, B)

         F test to compare two variances

data:  A and B
F = 0.5837, num df = 12, denom df = 7, p-value = 0.3938
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.1251097 2.1052687
sample estimates:
ratio of variances
0.5837405
which shows no evidence of a significant difference, and so we can use the classical t-
test that assumes equality of the variances.
> t.test(A, B, var.equal=TRUE)

         Two Sample t-test

data:  A and B
t = 3.4722, df = 19, p-value = 0.002551
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.01669058 0.06734788
sample estimates:
mean of x mean of y
80.02077 79.97875
All these tests assume normality of the two samples. The two-sample Wilcoxon (or
Mann-Whitney) test only assumes a common continuous distribution under the null
hypothesis.
> wilcox.test(A, B)

         Wilcoxon rank sum test

data:  A and B
W = 89, p-value = 0.007497
alternative hypothesis: true location shift is not equal to 0
Warning message:
Cannot compute exact p-value with ties in: wilcox.test(A, B)
Note the warning: there are several ties in each sample, which suggests strongly that
these data are from a discrete distribution (probably due to rounding).
There are several ways to compare graphically the two samples. We have already seen
a pair of boxplots. The following
> plot(ecdf(A), do.points=FALSE, verticals=TRUE, xlim=range(A, B))
> plot(ecdf(B), do.points=FALSE, verticals=TRUE, add=TRUE)
will show the two empirical CDFs, and qqplot will perform a Q-Q plot of the two
samples. The Kolmogorov-Smirnov test is of the maximal vertical distance between
the two ecdfs, assuming a common continuous distribution:
> ks.test(A, B)

         Two-sample Kolmogorov-Smirnov test

data:  A and B
D = 0.5962, p-value = 0.05919
alternative hypothesis: two-sided
Warning message:
cannot compute correct p-values with ties in: ks.test(A, B)
9.2.1 Conditional execution: if statements
10.6 More advanced examples
• Efficiency factors in block designs
• Dropping all names in a printed array
• Recursive numerical integration
Recursive numerical integration
Functions may be recursive, and may themselves define functions within themselves.
Note, however, that such functions, or indeed variables, are not inherited by called
functions in higher evaluation frames as they would be if they were on the search
path.
The example below shows a naive way of performing one-dimensional numerical
integration. The integrand is evaluated at the end points of the range and in the
middle. If the one-panel trapezium rule answer is close enough to the two panel, then
the latter is returned as the value. Otherwise the same process is recursively applied to
each panel. The result is an adaptive integration process that concentrates function
evaluations in regions where the integrand is farthest from linear. There is, however, a
heavy overhead, and the function is only competitive with other algorithms when the
integrand is both smooth and very difficult to evaluate.
The example is also given partly as a little puzzle in R programming.
area <- function(f, a, b, eps = 1.0e-06, lim = 10) {
  fun1 <- function(f, a, b, fa, fb, a0, eps, lim, fun) {
    ## function ‘fun1’ is only visible inside ‘area’
    d <- (a + b)/2
    h <- (b - a)/4
    fd <- f(d)
    a1 <- h * (fa + fd)
    a2 <- h * (fd + fb)
    if (abs(a0 - a1 - a2) < eps || lim == 0)
      return(a1 + a2)
    else {
      return(fun(f, a, d, fa, fd, a1, eps, lim - 1, fun) +
             fun(f, d, b, fd, fb, a2, eps, lim - 1, fun))
    }
  }
  fa <- f(a)
  fb <- f(b)
  a0 <- ((fa + fb) * (b - a))/2
  fun1(f, a, b, fa, fb, a0, eps, lim, fun1)
}
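As a quick check (not part of the original example, and reusing the area function just defined), the recursive integrator can be compared with R's built-in quadrature on an integral whose exact value is known:

```r
area(sin, 0, pi)               # adaptive trapezium result; the exact integral is 2
integrate(sin, 0, pi)$value    # R's built-in quadrature, for comparison
```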
10.7 Scope
The discussion in this section is somewhat more technical than in other parts of this
document. However, it details one of the major differences between S-PLUS and R.
The symbols which occur in the body of a function can be divided into three classes;
formal parameters, local variables and free variables. The formal parameters of a
function are those occurring in the argument list of the function. Their values are
determined by the process of binding the actual function arguments to the formal
parameters. Local variables are those whose values are determined by the evaluation
of expressions in the body of the functions. Variables which are not formal parameters
or local variables are called free variables. Free variables become local variables if
they are assigned to. Consider the following function definition.
f <- function(x) {
  y <- 2*x
  print(x)
  print(y)
  print(z)
}
In this function, x is a formal parameter, y is a local variable and z is a free variable.
In R the free variable bindings are resolved by first looking in the environment in
which the function was created. This is called lexical scope. First we define a function
called cube.
cube <- function(n) {
  sq <- function() n*n
  n*sq()
}
The variable n in the function sq is not an argument to that function. Therefore it is a
free variable and the scoping rules must be used to ascertain the value that is to be
associated with it. Under static scope (S-PLUS) the value is that associated with a
global variable named n. Under lexical scope (R) it is the parameter to the
function cube since that is the active binding for the variable n at the time the
function sq was defined. The difference between evaluation in R and evaluation in S-
PLUS is that S-PLUS looks for a global variable called n while R first looks for a
variable called n in the environment created when cube was invoked.
## first evaluation in S
S> cube(2)
Error in sq(): Object "n" not found
Dumped
S> n <- 3
S> cube(2)
[1] 18
## then the same function evaluated in R
R> cube(2)
[1] 8
Lexical scope can also be used to give functions mutable state. In the following
example we show how R can be used to mimic a bank account. A functioning bank
account needs to have a balance or total, a function for making withdrawals, a
function for making deposits and a function for stating the current balance. We
achieve this by creating the three functions within account and then returning a list
containing them. When account is invoked it takes a numerical argument total and
returns a list containing the three functions. Because these functions are defined in an
environment which contains total, they will have access to its value.
The special assignment operator, <<-, is used to change the value associated
with total. This operator looks back in enclosing environments for an environment
that contains the symbol total and when it finds such an environment it replaces the
value, in that environment, with the value of right hand side. If the global or top-level
environment is reached without finding the symbol total then that variable is created
and assigned to there. For most users <<- creates a global variable and assigns the
value of the right hand side to it. Only when <<- has been used in a function that was
returned as the value of another function will the special behavior described here
occur.
open.account <- function(total) {
  list(
    deposit = function(amount) {
      if (amount <= 0)
        stop("Deposits must be positive!\n")
      total <<- total + amount
      cat(amount, "deposited. Your balance is", total, "\n\n")
    },
    withdraw = function(amount) {
      if (amount > total)
        stop("You don't have that much money!\n")
      total <<- total - amount
      cat(amount, "withdrawn. Your balance is", total, "\n\n")
    },
    balance = function() {
      cat("Your balance is", total, "\n\n")
    }
  )
}
ross <- open.account(100)
robert <- open.account(200)

ross$withdraw(30)
ross$balance()
robert$balance()
ross$deposit(50)
ross$balance()
ross$withdraw(500)
11 Statistical models in R
This section presumes the reader has some familiarity with statistical methodology, in
particular with regression analysis and the analysis of variance. Later we make some
rather more ambitious presumptions, namely that something is known about
generalized linear models and nonlinear regression.
The requirements for fitting statistical models are sufficiently well defined to make it
possible to construct general tools that apply in a broad spectrum of problems.
R provides an interlocking suite of facilities that make fitting statistical models very
simple. As we mention in the introduction, the basic output is minimal, and one needs
to ask for the details by calling extractor functions.
• Defining statistical models; formulae
• Linear models
• Generic functions for extracting model information
• Analysis of variance and model comparison
• Updating fitted models
• Generalized linear models
• Nonlinear least squares and maximum likelihood models
• Some non-standard models
Before giving a formal specification, a few examples may usefully set the picture.
Suppose y, x, x0, x1, x2, … are numeric variables, X is a matrix and A, B, C, … are
factors. The following formulae on the left side below specify statistical models as
described on the right.
y ~ x
y ~ 1 + x
Both imply the same simple linear regression model of y on x. The first has an
implicit intercept term, and the second an explicit one.
y ~ 0 + x
y ~ -1 + x
y ~ x - 1
Simple linear regression of y on x through the origin (that is, without an
intercept term).
log(y) ~ x1 + x2
Multiple regression of the transformed variable, log(y), on x1 and x2 (with an
implicit intercept term).
y ~ poly(x,2)
y ~ 1 + x + I(x^2)
Polynomial regression of y on x of degree 2. The first form uses orthogonal
polynomials, and the second uses explicit powers, as basis.
y ~ X + poly(x,2)
Multiple regression y with model matrix consisting of the matrix X as well as
polynomial terms in x to degree 2.
y ~ A
Single classification analysis of variance model of y, with classes determined
by A.
y ~ A + x
Single classification analysis of covariance model of y, with classes determined
by A, and with covariate x.
y ~ A*B
y ~ A + B + A:B
y ~ B %in% A
y ~ A/B
Two factor non-additive model of y on A and B. The first two specify the same
crossed classification and the second two specify the same nested classification.
In abstract terms all four specify the same model subspace.
y ~ (A + B + C)^2
y ~ A*B*C - A:B:C
Three factor experiment but with a model containing main effects and two
factor interactions only. Both formulae specify the same model.
y ~ A * x
y ~ A/x
y ~ A/(1 + x) - 1
Separate simple linear regression models of y on x within the levels of A, with
different codings. The last form produces explicit estimates of as many
different intercepts and slopes as there are levels in A.
y ~ A*B + Error(C)
An experiment with two treatment factors, A and B, and error strata determined
by factor C. For example a split plot experiment, with whole plots (and hence
also subplots), determined by factor C.
The operator ~ is used to define a model formula in R. The form, for an ordinary linear
model, is
response ~ op_1 term_1 op_2 term_2 op_3 term_3 ...
where
response
is a vector or matrix, (or expression evaluating to a vector or matrix) defining
the response variable(s).
op_i
is an operator, either + or -, implying the inclusion or exclusion of a term in the
model, (the first is optional).
term_i
is either
• a vector or matrix expression, or 1,
• a factor, or
• a formula expression consisting of factors, vectors or matrices connected
by formula operators.
In all cases each term defines a collection of columns either to be added to or
removed from the model matrix. A 1 stands for an intercept column and is by
default included in the model matrix unless explicitly removed.
The formula operators are similar in effect to the Wilkinson and Rogers notation used
by such programs as Glim and Genstat. One inevitable change is that the operator ‘.’
becomes ‘:’ since the period is a valid name character in R.
The notation is summarized below (based on Chambers & Hastie, 1992, p.29):
Y ~ M
Y is modeled as M.
M_1 + M_2
Include M_1 and M_2.
M_1 - M_2
Include M_1 leaving out terms of M_2.
M_1 : M_2
The tensor product of M_1 and M_2. If both terms are factors, then the
“subclasses” factor.
M_1 %in% M_2
Similar to M_1:M_2, but with a different coding.
M_1 * M_2
M_1 + M_2 + M_1:M_2.
M_1 / M_2
M_1 + M_2 %in% M_1.
M^n
All terms in M together with “interactions” up to order n.
I(M)
Insulate M. Inside M all operators have their normal arithmetic meaning, and
that term appears in the model matrix.
Note that inside the parentheses that usually enclose function arguments all operators
have their normal arithmetic meaning. The function I() is an identity function used to
allow terms in model formulae to be defined using arithmetic operators.
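To see the difference (using a hypothetical ten-row data frame, for illustration only), model.matrix() shows which columns each formula generates. In the formula algebra x^2 collapses back to x, whereas I(x^2) produces a genuine quadratic column:

```r
dat <- data.frame(y = rnorm(10), x = 1:10)    # hypothetical data
colnames(model.matrix(y ~ x + x^2, dat))      # "(Intercept)" "x"
colnames(model.matrix(y ~ x + I(x^2), dat))   # "(Intercept)" "x" "I(x^2)"
```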
Note particularly that the model formulae specify the columns of the model matrix, the
specification of the parameters being implicit. This is not the case in other contexts,
for example in specifying nonlinear models.
• Contrasts
We need at least some idea how the model formulae specify the columns of the model
matrix. This is easy if we have continuous variables, as each provides one column of
the model matrix (and the intercept will provide a column of ones if included in the
model).
What about a k-level factor A? The answer differs for unordered and ordered factors.
For unordered factors k - 1 columns are generated for the indicators of the second,
…, k-th levels of the factor. (Thus the implicit parameterization is to contrast the
response at each level with that at the first.) For ordered factors the k - 1 columns are
the orthogonal polynomials on 1, ..., k, omitting the constant term.
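The two codings can be inspected directly with model.matrix() (the three-level factor here is hypothetical):

```r
A  <- factor(c("lo", "mid", "hi"), levels = c("lo", "mid", "hi"))
colnames(model.matrix(~ A))    # "(Intercept)" "Amid" "Ahi": indicators for levels 2..k
Ao <- ordered(A)
colnames(model.matrix(~ Ao))   # "(Intercept)" "Ao.L" "Ao.Q": linear and quadratic polynomials
```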
Although the answer is already complicated, it is not the whole story. First, if the
intercept is omitted in a model that contains a factor term, the first such term is
encoded into k columns giving the indicators for all the levels. Second, the whole
behavior can be changed by the options setting for contrasts. The default setting in R
is
options(contrasts = c("contr.treatment", "contr.poly"))
The main reason for mentioning this is that R and S have different defaults for
unordered factors, S using Helmert contrasts. So if you need to compare your results
to those of a textbook or paper which used S-PLUS, you will need to set
options(contrasts = c("contr.helmert", "contr.poly"))
This is a deliberate difference, as treatment contrasts (R’s default) are thought easier
for newcomers to interpret.
We have still not finished, as the contrast scheme to be used can be set for each term
in the model using the functions contrasts and C.
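For example (with a hypothetical three-level factor), the default coding can be inspected with contrasts(), and C() attaches a different scheme to a single term:

```r
f <- factor(c("a", "b", "c"))     # hypothetical factor
contrasts(f)                      # treatment contrasts, R's default
contrasts(C(f, contr.helmert))    # Helmert contrasts, the S-PLUS default
```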
We have not yet considered interaction terms: these generate the products of the
columns introduced for their component terms.
Although the details are complicated, model formulae in R will normally generate the
models that an expert statistician would expect, provided that marginality is
preserved. Fitting, for example, a model with an interaction but not the corresponding
main effects will in general lead to surprising results, and is for experts only.
Note also that the analysis of variance table (or tables) are for a sequence of fitted
models. The sums of squares shown are the decrease in the residual sums of squares
resulting from an inclusion of that term in the model at that place in the sequence.
Hence only for orthogonal experiments will the order of inclusion be inconsequential.
For multistratum experiments the procedure is first to project the response onto the
error strata, again in sequence, and to fit the mean model to each projection. For
further details, see Chambers & Hastie (1992).
A more flexible alternative to the default full ANOVA table is to compare two or
more models directly using the anova() function.
> anova(fitted.model.1, fitted.model.2, ...)
The display is then an ANOVA table showing the differences between the fitted
models when fitted in sequence. The fitted models being compared would usually be
an hierarchical sequence, of course. This does not give different information to the
default, but rather makes it easier to comprehend and control.
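A minimal sketch of such a comparison (using the built-in cars data, not from the surrounding text): a straight-line fit against one with an added quadratic term:

```r
fm1 <- lm(dist ~ speed, data = cars)
fm2 <- lm(dist ~ speed + I(speed^2), data = cars)
anova(fm1, fm2)   # one row per model; the F test assesses the added quadratic term
```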
11.6.1 Families
Since the distribution of the response depends on the stimulus variables through a
single linear function only, the same mechanism as was used for linear models can
still be used to specify the linear part of a generalized model. The family has to be
specified in a different way.
The R function to fit a generalized linear model is glm() which uses the form
> fitted.model <- glm(formula, family=family.generator, data=data.frame)
The only new feature is the family.generator, which is the instrument by which the
family is described. It is the name of a function that generates a list of functions and
expressions that together define and control the model and estimation process.
Although this may seem a little complicated at first sight, its use is quite simple.
The names of the standard, supplied family generators are given under “Family
Name” in the table in Families. Where there is a choice of links, the name of the link
may also be supplied with the family name, in parentheses as a parameter. In the case
of the quasi family, the variance function may also be specified in this way.
Some examples make the process clear.
The gaussian family
A call such as
> fm <- glm(y ~ x1 + x2, family = gaussian, data = sales)
achieves the same result as
> fm <- lm(y ~ x1+x2, data=sales)
but much less efficiently. Note how the gaussian family is not automatically provided
with a choice of links, so no parameter is allowed. If a problem requires a gaussian
family with a nonstandard link, this can usually be achieved through the quasi family,
as we shall see later.
Poisson models
With the Poisson family the default link is the log, and in practice the major use of
this family is to fit surrogate Poisson log-linear models to frequency data, whose
actual distribution is often multinomial. This is a large and important subject we will
not discuss further here. It even forms a major part of the use of non-gaussian
generalized models overall.
Occasionally genuinely Poisson data arises in practice and in the past it was often
analyzed as gaussian data after either a log or a square-root transformation. As a
graceful alternative to the latter, a Poisson generalized linear model may be fitted as in
the following example:
> fmod <- glm(y ~ A + B + x, family = poisson(link=sqrt),
data = worm.counts)
Quasi-likelihood models
For all families the variance of the response will depend on the mean and will have
the scale parameter as a multiplier. The form of dependence of the variance on the
mean is a characteristic of the response distribution; for example for the Poisson
distribution Var(y) = mu.
For quasi-likelihood estimation and inference the precise response distribution is not
specified, but rather only a link function and the form of the variance function as it
depends on the mean. Since quasi-likelihood estimation uses formally identical
techniques to those for the gaussian distribution, this family provides a way of fitting
gaussian models with non-standard link functions or variance functions, incidentally.
For example, consider fitting the non-linear regression y = theta_1 z_1 / (z_2 -
theta_2) + e which may be written alternatively as y = 1 / (beta_1 x_1 + beta_2 x_2) +
e where x_1 = z_2/z_1, x_2 = -1/z_1, beta_1 = 1/theta_1, and beta_2 =
theta_2/theta_1. Supposing a suitable data frame to be set up we could fit this non-
linear regression as
> nlfit <- glm(y ~ x1 + x2 - 1,
family = quasi(link=inverse, variance=constant),
data = biochem)
The reader is referred to the manual and the help document for further information, as
needed.
One way to fit a nonlinear model is by minimizing the sum of the squared errors
(SSE) or residuals. This method makes sense if the observed errors could have
plausibly arisen from a normal distribution.
Here is an example from Bates & Watts (1988), page 51. The data are:
> x <- c(0.02, 0.02, 0.06, 0.06, 0.11, 0.11, 0.22, 0.22, 0.56, 0.56,
1.10, 1.10)
> y <- c(76, 47, 97, 107, 123, 139, 159, 152, 191, 201, 207, 200)
The fit criterion to be minimized is:
> fn <- function(p) sum((y - (p[1] * x)/(p[2] + x))^2)
In order to do the fit we need initial estimates of the parameters. One way to find
sensible starting values is to plot the data, guess some parameter values, and
superimpose the model curve using those values.
> plot(x, y)
> xfit <- seq(.02, 1.1, .05)
> yfit <- 200 * xfit/(0.1 + xfit)
> lines(spline(xfit, yfit))
We could do better, but these starting values of 200 and 0.1 seem adequate. Now do
the fit:
> out <- nlm(fn, p = c(200, 0.1), hessian = TRUE)
After the fitting, out$minimum is the SSE, and out$estimate are the least squares
estimates of the parameters. To obtain the approximate standard errors (SE) of the
estimates we do:
> sqrt(diag(2*out$minimum/(length(y) - 2) * solve(out$hessian)))
The 2 which is subtracted in the line above represents the number of parameters. A
95% confidence interval would be the parameter estimate +/- 1.96 SE. We can
superimpose the least squares fit on a new plot:
> plot(x, y)
> xfit <- seq(.02, 1.1, .05)
> yfit <- 212.68384222 * xfit/(0.06412146 + xfit)
> lines(spline(xfit, yfit))
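The 95% intervals mentioned above can be assembled from quantities already computed (a sketch reusing out, x and y from the nlm() fit):

```r
est <- out$estimate
se  <- sqrt(diag(2 * out$minimum/(length(y) - 2) * solve(out$hessian)))
cbind(estimate = est, lower = est - 1.96*se, upper = est + 1.96*se)
```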
The standard package stats provides much more extensive facilities for fitting non-
linear models by least squares. The model we have just fitted is the Michaelis-Menten
model, so we can use
> df <- data.frame(x=x, y=y)
> fit <- nls(y ~ SSmicmen(x, Vm, K), df)
> fit
Nonlinear regression model
model: y ~ SSmicmen(x, Vm, K)
data: df
Vm K
212.68370711 0.06412123
residual sum-of-squares: 1195.449
> summary(fit)
Parameters:
Estimate Std. Error t value Pr(>|t|)
Vm 2.127e+02 6.947e+00 30.615 3.24e-11
K 6.412e-02 8.281e-03 7.743 1.57e-05
12 Graphical procedures
Graphical facilities are an important and extremely versatile component of the R
environment. It is possible to use the facilities to display a wide variety of statistical
graphs and also to build entirely new types of graph.
The graphics facilities can be used in both interactive and batch modes, but in most
cases, interactive use is more productive. Interactive use is also easy because at
startup time R initiates a graphics device driver which opens a special graphics
window for the display of interactive graphics. Although this is done automatically, it
may be useful to know that the command used is X11() under UNIX, windows() under
Windows and quartz() under macOS. A new device can always be opened
by dev.new().
Once the device driver is running, R plotting commands can be used to produce a
variety of graphical displays and to create entirely new kinds of display.
Plotting commands are divided into three basic groups:
• High-level plotting functions create a new plot on the graphics device, possibly
with axes, labels, titles and so on.
• Low-level plotting functions add more information to an existing plot, such as
extra points, lines and labels.
• Interactive graphics functions allow you to interactively add information to, or
extract information from, an existing plot, using a pointing device such as a
mouse.
In addition, R maintains a list of graphical parameters which can be manipulated to
customize your plots.
This manual only describes what are known as ‘base’ graphics. A separate graphics
sub-system in package grid coexists with base – it is more powerful but harder to use.
There is a recommended package lattice which builds on grid and provides ways to
produce multi-panel plots akin to those in the Trellis system in S.
• High-level plotting commands
• Low-level plotting commands
• Interacting with graphics
• Using graphics parameters
• Graphics parameters list
• Device drivers
• Dynamic graphics
One of the most frequently used plotting functions in R is the plot() function. This is
a generic function: the type of plot produced is dependent on the type or class of the
first argument.
plot(x, y)
plot(xy)
If x and y are vectors, plot(x, y) produces a scatterplot of y against x. The
same effect can be produced by supplying one argument (second form) as
either a list containing two elements x and y or a two-column matrix.
plot(x)
If x is a time series, this produces a time-series plot. If x is a numeric vector, it
produces a plot of the values in the vector against their index in the vector.
If x is a complex vector, it produces a plot of imaginary versus real parts of the
vector elements.
plot(f)
plot(f, y)
f is a factor object, y is a numeric vector. The first form generates a bar plot
of f; the second form produces boxplots of y for each level of f.
plot(df)
plot(~ expr)
plot(y ~ expr)
df is a data frame, y is any object, expr is a list of object names separated by ‘+’
(e.g., a + b + c). The first two forms produce distributional plots of the
variables in a data frame (first form) or of a number of named objects (second
form). The third form plots y against every object named in expr.
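These forms can be sketched with built-in data (the small factor below is hypothetical, used only for illustration):

```r
plot(faithful$eruptions, faithful$waiting)  # two numeric vectors: scatterplot
plot(faithful)                              # two-column data frame: the same plot
f <- factor(c("a", "a", "b", "c"))          # hypothetical factor
plot(f)                                     # bar plot of the level counts
plot(f, c(1.2, 3.4, 2.2, 4.1))              # boxplots of a numeric vector by level
```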
Other high-level graphics functions produce different types of plots. Some examples
are:
qqnorm(x)
qqline(x)
qqplot(x, y)
Distribution-comparison plots. The first form plots the numeric vector x against
the expected Normal order scores (a normal scores plot) and the second adds a
straight line to such a plot by drawing a line through the distribution and data
quartiles. The third form plots the quantiles of x against those of y to compare
their respective distributions.
hist(x)
hist(x, nclass=n)
hist(x, breaks=b, …)
Produces a histogram of the numeric vector x. A sensible number of classes is
usually chosen, but a recommendation can be given with the nclass= argument.
Alternatively, the breakpoints can be specified exactly with
the breaks= argument. If the probability=TRUE argument is given, the bars
represent relative frequencies divided by bin width instead of counts.
dotchart(x, …)
Constructs a dot chart of the data in x. In a dot chart the y-axis gives a labelling
of the data in x and the x-axis gives its value. For example it allows easy visual
selection of all data entries with values lying in specified ranges.
image(x, y, z, …)
contour(x, y, z, …)
persp(x, y, z, …)
Plots of three variables. The image plot draws a grid of rectangles using
different colours to represent the value of z, the contour plot draws contour
lines to represent the value of z, and the persp plot draws a 3D surface.
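A compact sketch of all three, on a hypothetical surface z = sin(x) + cos(y):

```r
x <- seq(-pi, pi, length.out = 50)
y <- x
z <- outer(x, y, function(x, y) sin(x) + cos(y))  # 50 x 50 grid of surface heights
image(x, y, z)                        # coloured rectangles keyed to z
contour(x, y, z, add = TRUE)          # contour lines drawn on top of the image
persp(x, y, z, theta = 30, phi = 30)  # 3D surface, rotated for a better view
```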
In some cases, it is useful to add mathematical symbols and formulae to a plot. This
can be achieved in R by specifying an expression rather than a character string in any
one of text, mtext, axis, or title. For example, the following code draws the formula
for the Binomial probability function:
> text(x, y, expression(paste(bgroup("(", atop(n, x), ")"), p^x, q^{n-x})))
More information, including a full listing of the features available can obtained from
within R using the commands:
> help(plotmath)
> example(plotmath)
> demo(plotmath)
It is possible to specify Hershey vector fonts for rendering text when using
the text and contour functions. There are three reasons for using the Hershey fonts:
• Hershey fonts can produce better output, especially on a computer screen, for
rotated and/or small text.
• Hershey fonts provide certain symbols that may not be available in the standard
fonts. In particular, there are zodiac signs, cartographic symbols and
astronomical symbols.
• Hershey fonts provide Cyrillic and Japanese (Kana and Kanji) characters.
More information, including tables of Hershey characters can be obtained from within
R using the commands:
> help(Hershey)
> demo(Hershey)
> help(Japanese)
> demo(Japanese)
The par() function is used to access and modify the list of graphics parameters for the
current graphics device.
par()
Without arguments, returns a list of all graphics parameters and their values for
the current device.
par(c("col", "lty"))
With a character vector argument, returns only the named graphics parameters
(again, as a list.)
par(col=4, lty=2)
With named arguments (or a single list argument), sets the values of the named
graphics parameters, and returns the original values of the parameters as a list.
Setting graphics parameters with the par() function changes the value of the
parameters permanently, in the sense that all future calls to graphics functions (on the
current device) will be affected by the new value. You can think of setting graphics
parameters in this way as setting “default” values for the parameters, which will be
used by all graphics functions unless an alternative value is given.
Note that calls to par() always affect the global values of graphics parameters, even
when par() is called from within a function. This is often undesirable behavior—
usually we want to set some graphics parameters, do some plotting, and then restore
the original values so as not to affect the user’s R session. You can restore the initial
values by saving the result of par() when making changes, and restoring the initial
values when plotting is complete.
> oldpar <- par(col=4, lty=2)
... plotting commands ...
> par(oldpar)
To save and restore all settable(24) graphical parameters use
> oldpar <- par(no.readonly=TRUE)
... plotting commands ...
> par(oldpar)
12.4.2 Temporary changes: Arguments to graphics functions
Graphics parameters may also be passed to (almost) any graphics function as named
arguments. This has the same effect as passing the arguments to the par() function,
except that the changes only last for the duration of the function call. For example:
> plot(x, y, pch="+")
produces a scatterplot using a plus sign as the plotting character, without changing the
default plotting character for future plots.
Unfortunately, this is not implemented entirely consistently and it is sometimes
necessary to set and reset graphics parameters using par().
12.5.2 Axes and tick marks
Many of R’s high-level plots have axes, and you can construct axes yourself with the
low-level axis() graphics function. Axes have three main components: the axis
line (line style controlled by the lty graphics parameter), the tick marks (which mark
off unit divisions along the axis line) and the tick labels (which mark the units.) These
components can be customized with the following graphics parameters.
lab=c(5, 7, 12)
The first two numbers are the desired number of tick intervals on
the x and y axes respectively. The third number is the desired length of axis
labels, in characters (including the decimal point.) Choosing a too-small value
for this parameter may result in all tick labels being rounded to the same
number!
las=1
Orientation of axis labels. 0 means always parallel to axis, 1 means always
horizontal, and 2 means always perpendicular to the axis.
mgp=c(3, 1, 0)
Positions of axis components. The first component is the distance from the axis
label to the axis position, in text lines. The second component is the distance to
the tick labels, and the final component is the distance from the axis position to
the axis line (usually zero). Positive numbers measure outside the plot region,
negative numbers inside.
tck=0.01
Length of tick marks, as a fraction of the size of the plotting region.
When tck is small (less than 0.5) the tick marks on the x and y axes are forced
to be the same size. A value of 1 gives grid lines. Negative values give tick
marks outside the plotting region. Use tck=0.01 and mgp=c(1,-1.5,0) for
internal tick marks.
xaxs="r"
yaxs="i"
Axis styles for the x and y axes, respectively. With styles "i" (internal)
and "r" (the default) tick marks always fall within the range of the data,
however style "r" leaves a small amount of space at the edges.
12.5.4 Multiple figure environment
R allows you to create an n by m array of figures on a single page. Each figure has its
own margins, and the array of figures is optionally surrounded by an outer margin, as
shown in the following figure.
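A minimal sketch of such a layout uses the mfrow graphics parameter (mfcol fills the array by columns instead) together with an outer margin set via oma:

```r
# A 2 x 2 array of figures on one page; mfrow fills it row by row.
oldpar <- par(mfrow = c(2, 2),      # 2 rows and 2 columns of figures
              oma = c(0, 0, 2, 0))  # reserve 2 text lines of outer top margin
for (i in 1:4) plot(rnorm(20), main = paste("Figure", i))
mtext("An overall title in the outer margin", outer = TRUE)
par(oldpar)                         # back to one figure per page
```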
12.6.1 PostScript diagrams for typeset documents
By passing the file argument to the postscript() device driver function, you may
store the graphics in PostScript format in a file of your choice. The plot will be in
landscape orientation unless the horizontal=FALSE argument is given, and you can
control the size of the graphic with the width and height arguments (the plot will be
scaled as appropriate to fit these dimensions.) For example, the command
> postscript("file.ps", horizontal=FALSE, height=5, pointsize=10)
will produce a file containing PostScript code for a figure five inches high, perhaps
for inclusion in a document. It is important to note that if the file named in the
command already exists, it will be overwritten. This is the case even if the file was
only created earlier in the same R session.
Many usages of PostScript output will be to incorporate the figure in another
document. This works best when encapsulated PostScript is produced: R always
produces conformant output, but only marks the output as such when
the onefile=FALSE argument is supplied. This unusual notation stems from S-
compatibility: it really means that the output will be a single page (which is part of the
EPSF specification). Thus to produce a plot for inclusion use something like
> postscript("plot1.eps", horizontal=FALSE, onefile=FALSE,
height=8, width=6, pointsize=10)
12.6.2 Multiple graphics devices
In advanced use of R it is often useful to have several graphics devices in use at the
same time. Of course only one graphics device can accept graphics commands at any
one time, and this is known as the current device. When multiple devices are open,
they form a numbered sequence with names giving the kind of device at any position.
The main commands used for operating with multiple devices, and their meanings are
as follows:
X11()
[UNIX]
windows()
win.printer()
win.metafile()
[Windows]
quartz()
[macOS]
postscript()
pdf()
png()
jpeg()
tiff()
bitmap()
…
Each new call to a device driver function opens a new graphics device, thus
extending by one the device list. This device becomes the current device, to
which graphics output will be sent.
dev.list()
Returns the number and name of all active devices. The device at position 1 on
the list is always the null device which does not accept graphics commands at
all.
dev.next()
dev.prev()
Returns the number and name of the graphics device next to, or previous to the
current device, respectively.
dev.set(which=k)
Can be used to change the current graphics device to the one at position k of the
device list. Returns the number and label of the device.
dev.off(k)
Terminate the graphics device at point k of the device list. For some devices,
such as postscript devices, this will either print the file immediately or
correctly complete the file for later printing, depending on how the device was
initiated.
dev.copy(device, …, which=k)
dev.print(device, …, which=k)
Make a copy of the device k. Here device is a device function, such
as postscript, with extra arguments, if needed, specified by ‘…’. dev.print is
similar, but the copied device is immediately closed, so that end actions, such
as printing hardcopies, are immediately performed.
graphics.off()
Terminate all graphics devices on the list, except the null device.
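The following sketch puts these commands together; it uses only file-based pdf devices so it runs on any platform without a display:

```r
# Open two pdf devices; the most recently opened one becomes current.
pdf(f1 <- tempfile(fileext = ".pdf"))
plot(1:10)
pdf(f2 <- tempfile(fileext = ".pdf"))
plot(10:1)
dev.list()           # the numbered sequence of open devices
dev.set(dev.prev())  # make the first pdf device current again
dev.off()            # close it, completing the file f1
graphics.off()       # close every remaining device
```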
13 Packages
All R functions and datasets are stored in packages. Only when a package is loaded
are its contents available. This is done both for efficiency (the full list would take
more memory and would take longer to search than a subset), and to aid package
developers, who are protected from name clashes with other code. The process of
developing packages is described in Creating R packages in Writing R Extensions. Here,
we will describe them from a user’s point of view.
To see which packages are installed at your site, issue the command
> library()
with no arguments. To load a particular package (e.g., the boot package containing
functions from Davison & Hinkley (1997)), use a command like
> library(boot)
Users connected to the Internet can use
the install.packages() and update.packages() functions (available through
the Packages menu in the Windows and macOS GUIs, see Installing packages in R
Installation and Administration) to install and update packages.
To see which packages are currently loaded, use
> search()
to display the search list. Some packages may be loaded but not available on the
search list (see Namespaces): these will be included in the list given by
> loadedNamespaces()
To see a list of all available help topics in an installed package, use
> help.start()
to start the HTML help system, and then navigate to the package listing in
the Reference section.
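As a brief sketch, the inspection commands above can be combined in any session; stats is one of the standard packages attached by default, so the calls below are safe everywhere:

```r
# What is attached, and which namespaces are loaded.
library(stats)                    # a no-op if stats is already attached
"package:stats" %in% search()     # TRUE: stats is on the search list
"stats" %in% loadedNamespaces()   # TRUE: its namespace is loaded
```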
• Standard packages
• Contributed packages and CRAN
• Namespaces
13.3 Namespaces
Packages have namespaces, which do three things: they allow the package writer to
hide functions and data that are meant only for internal use, they prevent functions
from breaking when a user (or other package writer) picks a name that clashes with
one in the package, and they provide a way to refer to an object within a particular
package.
For example, t() is the transpose function in R, but users might define their own
function named t. Namespaces prevent the user’s definition from taking precedence,
and breaking every function that tries to transpose a matrix.
There are two operators that work with namespaces. The double-colon
operator :: selects definitions from a particular namespace. In the example above, the
transpose function will always be available as base::t, because it is defined in
the base package. Only functions that are exported from the package can be retrieved
in this way.
The triple-colon operator ::: may be seen in a few places in R code: it acts like the
double-colon operator but also allows access to hidden objects. Users are more likely
to use the getAnywhere() function, which searches multiple packages.
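A small sketch of the masking scenario described above, showing how the :: operator still reaches the exported definition:

```r
t <- function(x) "not a transpose"   # masks base::t for direct calls
m <- matrix(1:4, nrow = 2)
t(m)          # returns "not a transpose": the user's definition wins
base::t(m)    # :: selects the real transpose from the base namespace
rm(t)         # remove the mask; plain t(m) transposes again
```

Functions inside other packages are unaffected by the mask, because they resolve t in their own namespace rather than on the search path.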
Packages are often inter-dependent, and loading one may cause others to be
automatically loaded. The colon operators described above will also cause automatic
loading of the associated package. When packages with namespaces are loaded
automatically they are not added to the search list.
14 OS facilities
R has quite extensive facilities to access the OS under which it is running: this allows
it to be used as a scripting language and that ability is much used by R itself, for
example to install packages.
Because R’s own scripts need to work across all platforms, considerable effort has
gone into making the scripting facilities as platform-independent as is feasible.
• Files and directories
• Filepaths
• System commands
• Compression and Archives
14.2 Filepaths
With a few exceptions, R relies on the underlying OS functions to manipulate
filepaths. Some aspects of this are allowed to depend on the OS, and do, even down to
the version of the OS. There are POSIX standards for how OSes should interpret
filepaths and many R users assume POSIX compliance: but Windows does not claim
to be compliant and other OSes may be less than completely compliant.
The following are some issues which have been encountered with filepaths.
• POSIX filesystems are case-sensitive, so foo.png and Foo.PNG are
different files. However, the defaults on Windows and macOS are to be case-
insensitive, and FAT filesystems (commonly used on removable storage) are
not normally case-sensitive (and all filepaths may be mapped to lower case).
• Almost all the Windows’ OS services support the use of slash or backslash as
the filepath separator, and R converts the known exceptions to the form
required by Windows.
• The behaviour of filepaths with a trailing slash is OS-dependent. Such paths are
not valid on Windows and should not be expected to work. POSIX-2008
requires such paths to match only directories, but earlier versions allowed them
to also match files. So they are best avoided.
• Multiple slashes in filepaths such as /abc//def are valid on POSIX
filesystems and treated as if there was only one slash. They
are usually accepted by Windows’ OS functions. However, leading double
slashes may have a different meaning.
• Windows’ UNC filepaths (such as \\server\dir1\dir2\file and
\\?\UNC\server\dir1\dir2\file) are not supported, but they may work in
some R functions. POSIX filesystems are allowed to treat a leading double
slash specially.
• Windows allows filepaths containing drives and relative to the current directory
on a drive, e.g. d:foo/bar refers to d:/a/b/c/foo/bar if the current
directory on drive d: is /a/b/c. It is intended that these work, but the use of
absolute paths is safer.
Functions basename and dirname select parts of a file path: the recommended way to
assemble a file path from components is file.path. Function path.expand does ‘tilde
expansion’, substituting values for home directories (the current user’s, and perhaps
those of other users).
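A brief sketch of these helpers (the commented results assume a POSIX filesystem):

```r
p <- file.path("data", "raw", "obs.csv")  # portable assembly of a path
basename(p)                # "obs.csv"
dirname(p)                 # "data/raw" on a POSIX filesystem
path.expand("~/project")   # substitutes the current user's home directory
normalizePath(".", mustWork = FALSE)      # a canonical absolute form of "."
```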
On filesystems with links, a single file can be referred to by many filepaths.
Function normalizePath will find a canonical filepath.
Windows has the concepts of short (‘8.3’) and long file names: normalizePath will
return an absolute path using long file names and shortPathName will return a version
using short names. The latter does not contain spaces and uses backslash as the
separator, so is sometimes useful for exporting names from R.
File permissions are a related topic. R has support for the POSIX concepts of
read/write/execute permission for owner/group/all but this may be only partially
supported on the filesystem, so for example on Windows only read-only files (for the
account running the R session) are recognized. Access Control Lists (ACLs) are
employed on several filesystems, but do not have an agreed standard and R has no
facilities to control them. Use Sys.chmod to change permissions.
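A sketch of querying and changing permissions on a POSIX filesystem (recall that on Windows only the read-only bit is honoured):

```r
f <- tempfile()
file.create(f)
Sys.chmod(f, mode = "0444")   # owner/group/all: read-only
file.mode(f)                  # octmode "444" on a POSIX filesystem
Sys.chmod(f, mode = "0644")   # restore owner write permission
unlink(f)                     # tidy up
```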
Appendix B Invoking R
Users of R on Windows or macOS should read the OS-specific section first, but
command-line use is also supported.
• Invoking R from the command line
• Invoking R under Windows
• Invoking R under macOS
• Scripting with R
B.4 Scripting with R
You can pass parameters to scripts via additional arguments on the command line: for
example (where the exact quoting needed will depend on the shell in use)
R CMD BATCH "--args arg1 arg2" foo.R &
will pass arguments to a script which can be retrieved as a character vector by
args <- commandArgs(TRUE)
This is made simpler by the alternative front-end Rscript, which can be invoked by
Rscript foo.R arg1 arg2
and this can also be used to write executable script files like (at least on Unix-alikes,
and in some Windows shells)
#! /path/to/Rscript
args <- commandArgs(TRUE)
...
q(status=<exit status code>)
If this is entered into a text file runfoo and this is made executable (by chmod 755
runfoo), it can be invoked for different arguments by
runfoo arg1 arg2
For further options see help("Rscript"). This writes R output
to stdout and stderr, and this can be redirected in the usual way for the shell
running the command.
If you do not wish to hardcode the path to Rscript but have it in your path (which is
normally the case for an installed R except on Windows, but e.g. macOS users may
need to add /usr/local/bin to their path), use
#! /usr/bin/env Rscript
...
At least in Bourne and bash shells, the #! mechanism does not allow extra arguments
like #! /usr/bin/env Rscript --vanilla.
One thing to consider is what stdin() refers to. It is commonplace to write R scripts
with segments like
chem <- scan(n=24)
2.90 3.10 3.40 3.40 3.70 3.70 2.80 2.50 2.40 2.40 2.70 2.20
5.28 3.37 3.03 3.03 28.95 3.77 3.40 2.20 3.50 3.60 3.70 3.70
and stdin() refers to the script file to allow such traditional usage. If you want to refer
to the process’s stdin, use "stdin" as a file connection, e.g. scan("stdin", ...).
Another way to write executable script files (suggested by François Pinard) is to use
a here document like
#!/bin/sh
[environment variables can be set here]
R --no-echo [other options] <<EOF
EOF
but here stdin() refers to the program source and "stdin" will not be usable.
Short scripts can be passed to Rscript on the command-line via the -e flag. (Empty
scripts are not accepted.)
Note that on a Unix-alike the input filename (such as foo.R) should not contain
spaces nor shell metacharacters.
Appendix C The command-line editor
C.1 Preliminaries
When the GNU readline library is available at the time R is configured for
compilation under UNIX, an inbuilt command line editor allowing recall, editing and
re-submission of prior commands is used. Note that other versions of readline exist
and may be used by the inbuilt command line editor: this is most common on macOS.
You can find out which version (if any) is available by running extSoftVersion() in
an R session.
It can be disabled (useful for usage with ESS(25)) using the startup
option --no-readline.
Windows versions of R have somewhat simpler command-line editing: see
‘Console’ under the ‘Help’ menu of the GUI, and the file README.Rterm for
command-line editing under Rterm.exe.
When using R with GNU(26) readline capabilities, the functions described below are
available, as well as others (probably) documented in man readline or info
readline on your system.
Many of these use either Control or Meta characters. Control characters, such
as Control-m, are obtained by holding the CTRL down while you press the m key, and
are written as C-m below. Meta characters, such as Meta-b, are typed by holding
down META(27) and pressing b, and written as M-b in the following. If your terminal does
not have a META key enabled, you can still type Meta characters using two-character
sequences starting with ESC. Thus, to enter M-b, you could type ESCb. The ESC character
sequences are also allowed on terminals with real Meta keys. Note that case is
significant for Meta characters.
Some but not all versions(28) of readline will recognize resizing of the terminal
window, so this is best avoided.
C-p
Go to the previous command (backwards in the history).
C-n
Go to the next command (forwards in the history).
C-r text
Find the last command with the text string in it. This can be cancelled by C-g (and
on some versions of R by C-c).
On most terminals, you can also use the up and down arrow keys instead
of C-p and C-n, respectively.
C-a
Go to the beginning of the command.
C-e
Go to the end of the line.
M-b
Go back one word.
M-f
Go forward one word.
C-b
Go back one character.
C-f
Go forward one character.
On most terminals, you can also use the left and right arrow keys instead
of C-b and C-f, respectively.
text
Insert text at the cursor.
C-f text
Append text after the cursor.
DEL
Delete the previous character (left of the cursor).
C-d
Delete the character under the cursor.
M-d
Delete the rest of the word under the cursor, and “save” it.
C-k
Delete from cursor to end of command, and “save” it.
C-y
Insert (yank) the last “saved” text here.
C-t
Transpose the character under the cursor with the next.
M-l
Change the rest of the word to lower case.
M-c
Change the rest of the word to upper case.
RET
Re-submit the command to R.
The final RET terminates the command line editing sequence.
The readline key bindings can be customized in the usual
way via a ~/.inputrc file. These customizations can be conditioned on
application R, that is by including a section like
$if R
"\C-xd": "q('no')\n"
$endif
Appendix D Function and variable index
-
- Vector arithmetic
:
: Generating regular sequences
:: Namespaces
::: Namespaces
!
! Logical vectors
!= Logical vectors
?
? Getting help
?? Getting help
.
. Updating fitted models
.First Customizing the environment
.Last Customizing the environment
*
* Vector arithmetic
/
/ Vector arithmetic
&
& Logical vectors
&& Conditional execution
%
%*% Multiplication
%o% The outer product of two arrays
^
^ Vector arithmetic
+
+ Vector arithmetic
<
< Logical vectors
<<- Scope
<= Logical vectors
=
== Logical vectors
>
> Logical vectors
>= Logical vectors
|
| Logical vectors
|| Conditional execution
~
~ Formulae for statistical models
A
abline Low-level plotting commands
ace Some non-standard models
add1 Updating fitted models
anova Generic functions for extracting model information
anova ANOVA tables
aov Analysis of variance and model comparison
aperm Generalized transpose of an array
array The array() function
as.data.frame Making data frames
as.vector The concatenation function c() with arrays
attach attach() and detach()
attr Getting and setting attributes
attr<- Getting and setting attributes
attributes Getting and setting attributes
attributes<- Getting and setting attributes
avas Some non-standard models
axis Low-level plotting commands
B
boxplot One- and two-sample tests
break Repetitive execution
bruto Some non-standard models
C
c Vectors and assignment
c Character vectors
c The concatenation function c() with arrays
c Concatenating lists
C Contrasts
cbind Forming partitioned matrices
coef Generic functions for extracting model information
coefficients Generic functions for extracting model information
contour Display graphics
contrasts Contrasts
coplot Displaying multivariate data
cos Vector arithmetic
crossprod Index matrices
crossprod Multiplication
cut Frequency tables from factors
D
data Accessing builtin datasets
data.frame Making data frames
density Examining the distribution of a set of data
det Singular value decomposition and determinants
detach attach() and detach()
determinant Singular value decomposition and determinants
dev.list Multiple graphics devices
dev.next Multiple graphics devices
dev.off Multiple graphics devices
dev.prev Multiple graphics devices
dev.set Multiple graphics devices
deviance Generic functions for extracting model information
diag Multiplication
dim Arrays
dotchart Display graphics
drop1 Updating fitted models
E
ecdf Examining the distribution of a set of data
edit Editing data
eigen Eigenvalues and eigenvectors
else Conditional execution
Error Analysis of variance and model comparison
example Getting help
exp Vector arithmetic
F
F Logical vectors
factor Factors
FALSE Logical vectors
fivenum Examining the distribution of a set of data
for Repetitive execution
formula Generic functions for extracting model information
function Writing your own functions
G
getAnywhere Object orientation
getS3method Object orientation
glm The glm() function
H
help Getting help
help Getting help
help.search Getting help
help.start Getting help
hist Examining the distribution of a set of data
hist Display graphics
I
identify Interacting with graphics
if Conditional execution
if Conditional execution
ifelse Conditional execution
image Display graphics
is.na Missing values
is.nan Missing values
J
jpeg Device drivers
K
ks.test Examining the distribution of a set of data
L
legend Low-level plotting commands
length Vector arithmetic
length The intrinsic attributes mode and length
levels Factors
lines Low-level plotting commands
list Lists
lm Linear models
lme Some non-standard models
locator Interacting with graphics
loess Some non-standard models
loess Some non-standard models
log Vector arithmetic
lqs Some non-standard models
lsfit Least squares fitting and the QR decomposition
M
mars Some non-standard models
max Vector arithmetic
mean Vector arithmetic
methods Object orientation
min Vector arithmetic
mode The intrinsic attributes mode and length
N
NA Missing values
NaN Missing values
ncol Matrix facilities
next Repetitive execution
nlm Nonlinear least squares and maximum likelihood models
nlm Least squares
nlm Maximum likelihood
nlme Some non-standard models
nlminb Nonlinear least squares and maximum likelihood models
nrow Matrix facilities
O
optim Nonlinear least squares and maximum likelihood models
order Vector arithmetic
ordered Ordered factors
ordered Ordered factors
outer The outer product of two arrays
P
pairs Displaying multivariate data
par The par() function
paste Character vectors
pdf Device drivers
persp Display graphics
plot Generic functions for extracting model information
plot The plot() function
pmax Vector arithmetic
pmin Vector arithmetic
png Device drivers
points Low-level plotting commands
polygon Low-level plotting commands
postscript Device drivers
predict Generic functions for extracting model information
print Generic functions for extracting model information
prod Vector arithmetic
Q
qqline Examining the distribution of a set of data
qqline Display graphics
qqnorm Examining the distribution of a set of data
qqnorm Display graphics
qqplot Display graphics
qr Least squares fitting and the QR decomposition
quartz Device drivers
R
range Vector arithmetic
rbind Forming partitioned matrices
read.table The read.table() function
rep Generating regular sequences
repeat Repetitive execution
resid Generic functions for extracting model information
residuals Generic functions for extracting model information
rlm Some non-standard models
rm Data permanency and removing objects
S
scan The scan() function
sd The function tapply() and ragged arrays
search Managing the search path
seq Generating regular sequences
shapiro.test Examining the distribution of a set of data
sin Vector arithmetic
sink Executing commands from or diverting output to a file
solve Linear equations and inversion
sort Vector arithmetic
source Executing commands from or diverting output to a file
split Repetitive execution
sqrt Vector arithmetic
stem Examining the distribution of a set of data
step Generic functions for extracting model information
step Updating fitted models
sum Vector arithmetic
summary Examining the distribution of a set of data
summary Generic functions for extracting model information
svd Singular value decomposition and determinants
T
T Logical vectors
t Generalized transpose of an array
t.test One- and two-sample tests
table Index matrices
table Frequency tables from factors
tan Vector arithmetic
tapply The function tapply() and ragged arrays
text Low-level plotting commands
title Low-level plotting commands
tree Some non-standard models
TRUE Logical vectors
U
unclass The class of an object
update Updating fitted models
V
var Vector arithmetic
var The function tapply() and ragged arrays
var.test One- and two-sample tests
vcov Generic functions for extracting model information
vector Vectors and assignment
W
while Repetitive execution
wilcox.test One- and two-sample tests
windows Device drivers
X
X11 Device drivers
Appendix E Concept index
A
Accessing builtin datasets Accessing builtin datasets
Additive models Some non-standard models
Analysis of variance Analysis of variance and model comparison
Arithmetic functions and operators Vector arithmetic
Arrays Arrays
Assignment Vectors and assignment
Attributes Objects
B
Binary operators Defining new binary operators
Box plots One- and two-sample tests
C
Character vectors Character vectors
Classes The class of an object
Classes Object orientation
Concatenating lists Concatenating lists
Contrasts Contrasts
Control statements Control statements
CRAN Contributed packages and CRAN
Customizing the environment Customizing the environment
D
Data frames Data frames
Default values Named arguments and defaults
Density estimation Examining the distribution of a set of data
Determinants Singular value decomposition and determinants
Diverting input and output Executing commands from or diverting output to a file
Dynamic graphics Dynamic graphics
E
Eigenvalues and eigenvectors Eigenvalues and eigenvectors
Empirical CDFs Examining the distribution of a set of data
F
Factors Factors
Factors Contrasts
Families Families
Formulae Formulae for statistical models
G
Generalized linear models Generalized linear models
Generalized transpose of an array Generalized transpose of an array
Generic functions Object orientation
Graphics device drivers Device drivers
Graphics parameters The par() function
Grouped expressions Grouped expressions
I
Indexing of and by arrays Array indexing
Indexing vectors Index vectors
K
Kolmogorov-Smirnov test Examining the distribution of a set of data
L
Least squares fitting Least squares fitting and the QR decomposition
Linear equations Linear equations and inversion
Linear models Linear models
Lists Lists
Local approximating regressions Some non-standard models
Loops and conditional execution Loops and conditional execution
M
Matrices Arrays
Matrix multiplication Multiplication
Maximum likelihood Maximum likelihood
Missing values Missing values
Mixed models Some non-standard models
N
Named arguments Named arguments and defaults
Namespace Namespaces
Nonlinear least squares Nonlinear least squares and maximum likelihood models
O
Object orientation Object orientation
Objects Objects
One- and two-sample tests One- and two-sample tests
Ordered factors Factors
Ordered factors Contrasts
Outer products of arrays The outer product of two arrays
P
Packages R and statistics
Packages Packages
Probability distributions Probability distributions
Q
QR decomposition Least squares fitting and the QR decomposition
Quantile-quantile plots Examining the distribution of a set of data
R
Reading data from files Reading data from files
Recycling rule Vector arithmetic
Recycling rule The recycling rule
Regular sequences Generating regular sequences
Removing objects Data permanency and removing objects
Robust regression Some non-standard models
S
Scope Scope
Search path Managing the search path
Shapiro-Wilk test Examining the distribution of a set of data
Singular value decomposition Singular value decomposition and determinants
Statistical models Statistical models in R
Student’s t test One- and two-sample tests
T
Tabulation Frequency tables from factors
Tree-based models Some non-standard models
U
Updating fitted models Updating fitted models
V
Vectors Simple manipulations numbers and vectors
W
Wilcoxon test One- and two-sample tests
Workspace Data permanency and removing objects
Writing functions Writing your own functions
Appendix F References
D. M. Bates and D. G. Watts (1988), Nonlinear Regression Analysis and Its
Applications. John Wiley & Sons, New York.
Richard A. Becker, John M. Chambers and Allan R. Wilks (1988), The New S
Language. Chapman & Hall, New York. This book is often called the “Blue Book”.
John M. Chambers and Trevor J. Hastie eds. (1992), Statistical Models in S. Chapman
& Hall, New York. This is also called the “White Book”.
John M. Chambers (1998) Programming with Data. Springer, New York. This is also
called the “Green Book”.
A. C. Davison and D. V. Hinkley (1997), Bootstrap Methods and Their Applications,
Cambridge University Press.
Annette J. Dobson (1990), An Introduction to Generalized Linear Models, Chapman
and Hall, London.
Peter McCullagh and John A. Nelder (1989), Generalized Linear Models. Second
edition, Chapman and Hall, London.
John A. Rice (1995), Mathematical Statistics and Data Analysis. Second edition.
Duxbury Press, Belmont, CA.
S. D. Silvey (1970), Statistical Inference. Penguin, London.
Footnotes
(1)
For portable R code (including that to be used in R packages) only A–Z, a–z, and 0–9
should be used.
(3)
not inside strings, nor within the argument list of a function definition
(4)
some of the consoles will not allow you to enter more, and amongst those which do
some will silently discard the excess and some will use it as the start of the next line.
(5)
of unlimited length.
(6)
The leading “dot” in this file name makes it invisible in normal file listings in UNIX,
and in default GUI file listings on macOS and Windows.
(7)
With other than vector types of argument, such as list mode arguments, the action
of c() is rather different. See Concatenating lists.
(8)
Actually, it is still available as .Last.value before any other statements are executed.
(9)
paste(..., collapse=ss) joins the arguments into a single character string
putting ss in between, e.g., ss <- "|". There are more tools for character
manipulation, see the help for sub and substring.
(10)
Note however that length(object) does not always contain intrinsic useful information,
e.g., when object is a function.
(12)
In general, coercion from numeric to character and back again will not be exactly
reversible, because of roundoff errors in the character representation.
(13)
Readers should note that there are eight states and territories in Australia, namely the
Australian Capital Territory, New South Wales, the Northern Territory, Queensland,
South Australia, Tasmania, Victoria and Western Australia.
(15)
Note that tapply() also works in this case when its second argument is not a factor,
e.g., ‘tapply(incomes, state)’, and this is true for quite a few other functions, since
arguments are coerced to factors when necessary (using as.factor()).
(16)
Note that x %*% x is ambiguous, as it could mean either x’x or x x’, where x is the
column form. In such cases the smaller matrix seems implicitly to be the interpretation
adopted, so the scalar x’x is in this case the result. The matrix x x’ may be calculated
either by cbind(x) %*% x or x %*% rbind(x) since the result of rbind() or cbind() is
always a matrix. However, the best way to compute x’x or x x’ is crossprod(x) or x
%o% x respectively.
(17)
Even better would be to form a matrix square root B with A = BB’ and find the
squared length of the solution of By = x , perhaps using the Cholesky or
eigendecomposition of A.
(18)
See the on-line help for autoload for the meaning of the second term.
(19)
In some sense this mimics the behavior in S-PLUS since in S-PLUS this operator
always creates or assigns to a global variable.
(23)
Some graphics parameters such as the size of the current device are for information
only.
(25)
In particular, not versions 6.3 or later: this is worked around as from R 3.4.0.