R Programming
Guide
Version 2.5
W. J. Owen
Department of Mathematics and Computer Science
University of Richmond
[Cover figure: scatterplot of Volume against Height for the trees data, annotated "large residual" and "Consider a log transform on Volume".]
© W. J. Owen 2010. A license is granted for personal study and classroom use.
Redistribution in any other form is prohibited. Comments/questions can be sent to
[email protected].
Humans are good, she knew, at discerning subtle patterns that are really there, but equally so
at imagining them when they are altogether absent. -- Carl Sagan (1985) Contact
Preface
This document is an introduction to the program R. Although it could be a companion
manuscript for an introductory statistics course, it is designed to be used in a course on
mathematical statistics.
The purpose of this document is to introduce many of the basic concepts and nuances of the
language so that users can avoid a slow learning curve. The best way to use this manual is to
use a learn-by-doing approach. Try the examples and exercises in the chapters and aim to
understand what works and what doesn't. In addition, some topics and special features of R
have been added for the interested reader and to facilitate the subject matter.
The most up-to-date version of this manuscript can be found at
http://www.mathcs.richmond.edu/~wowen/TheRGuide.pdf.
Font Conventions
This document is typed using Times New Roman font. However, when R code is presented,
referenced, or when R output is given, we use 10 point Bold Courier New Font.
Acknowledgements
Much of this document started from class handouts. Those were augmented by including
new material and prose to add continuity between the topics and to expand on various
themes. We thank the large group of R contributors and the authors of other existing
documentation, from which some of the inspiration and ideas used in this document were borrowed. We
also acknowledge the R Core team for their hard work with the software.
Contents
Page
1. Overview and History ............................................... 1
   1.1 What is R? ..................................................... 1
   1.2 Starting and Quitting R ........................................ 1
   1.3 A Simple Example: the c() Function and Vector Assignment ....... 2
   1.4 The Workspace .................................................. 3
   1.5 Getting Help ................................................... 4
   1.6 More on Functions in R ......................................... 5
   1.7 Printing and Saving Your Work .................................. 6
   1.8 Other Sources of Reference ..................................... 7
   1.9 Exercises ...................................................... 7
4. Introduction to Graphics .......................................... 19
   4.1 The Graphics Window ........................................... 19
   4.2 Two Generic Graphing Functions ................................ 19
       4.2.1 The plot() function ..................................... 19
       4.2.2 The curve() function .................................... 20
   4.3 Graph Embellishments .......................................... 21
   4.4 Changing Graphics Parameters .................................. 21
   4.5 Exercises ..................................................... 22
5. Summarizing Data .................................................. 23
   5.1 Numerical Summaries ........................................... 23
   5.2 Graphical Summaries ........................................... 24
   5.3 Exercises ..................................................... 28
7. Statistical Methods ............................................... 35
   7.1 One and Two-sample t-tests .................................... 35
   7.2 Analysis of Variance (ANOVA) .................................. 37
       7.2.1 Factor Variables ........................................ 37
       7.2.2 The ANOVA Table ......................................... 39
       7.2.3 Multiple Comparisons .................................... 39
   7.3 Linear Regression ............................................. 40
   7.4 Chi-square Tests .............................................. 44
       7.4.1 Goodness of Fit ......................................... 44
       7.4.2 Contingency Tables ...................................... 45
   7.5 Other Tests ................................................... 46
8. Advanced Topics ................................................... 48
   8.1 Scripts ....................................................... 48
   8.2 Control Flow .................................................. 48
   8.3 Writing Functions ............................................. 50
   8.4 Numerical Methods ............................................. 51
   8.5 Exercises ..................................................... 53

References ........................................................... 56

Index ................................................................ 57
This is the main site for information on R. Here, you can find information on obtaining the
software, get documentation, read FAQs, etc. For downloading the software directly, you
can visit the Comprehensive R Archive Network (CRAN) in the U.S. at
http://cran.us.r-project.org/
Once R has started, you should be greeted with a command line similar to
R version 2.9.1 (2009-06-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
At the > prompt, you tell R what you want it to do. You give R a command and R does the
work and gives the answer. If your command is too long to fit on a line or if you submit an
incomplete command, a + is used for the continuation prompt.
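For instance, a function call split across two lines is completed under the + prompt; as plain code the same call looks like this (a minimal sketch):

```r
# at the console, the second line would appear after a "+" continuation prompt
total <- sum(1, 2, 3,
             4, 5)
total   # 15
```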
To quit R, type q() or use the Exit option in the File menu.
1.3 A Simple Example: the c() Function and the Assignment Operator
A useful command in R for entering small data sets is the c() function. This function
combines terms together. For example, suppose the following represents eight tosses of a fair
die:
2 5 1 6 5 5 4 1
The assignment operator is <-; to be specific, this is composed of a < ("less than")
and a - ("minus" or "dash") typed together. It is usually read as "gets": the variable
dieroll gets the value c(2,5,1,6,5,5,4,1). Alternatively, as of R version 1.4.0,
you can use = as the assignment operator.

> dieroll <- c(2,5,1,6,5,5,4,1)
> dieroll
[1] 2 5 1 6 5 5 4 1

The value of dieroll doesn't automatically print out. But, it does when we type just
the name on the input line as seen above.

The value of dieroll is prefaced with a [1]. This indicates that the value is a vector
(more on this later).
When entering commands in R, you can save yourself a lot of typing when you learn to use
the arrow keys effectively. Each command you submit is stored in the history, and the up
arrow (↑) will navigate backwards along this history and the down arrow (↓) forwards. The
left (←) and right (→) arrow keys move backwards and forwards along the command line.
These keys combined with the mouse for cutting/pasting can make it very easy to edit and
execute previous commands.
If we define a new variable as a simple function of the variable dieroll, it will be added to
the workspace:

> newdieroll <- dieroll/2     # divide every element by two
> newdieroll
[1] 1.0 2.5 0.5 3.0 2.5 2.5 2.0 0.5
> ls()
[1] "dieroll"    "newdieroll"

The new variable newdieroll has been assigned the value of dieroll divided by 2;
more about algebraic expressions is given in the next section.
You can add a comment to a command line by beginning it with the # character. R
ignores everything on an input line after a #.
To remove objects from the workspace (you'll want to do this occasionally when your
workspace gets too cluttered), use the rm() function:
> rm(newdieroll)
> ls()
[1] "dieroll"
In Windows, you can clear the entire workspace via the Remove all objects option under
the Misc menu. However, this is dangerous; more likely than not you will want to keep
some things and delete others.
When exiting R, the software asks if you would like to save your workspace image. If you
click yes, all objects (both new ones created in the current session and others from earlier
sessions) will be available during your next session. If you click no, all new objects will be
lost and the workspace will be restored to the last time the image was saved. Get in the
habit of saving your work; it will probably help you in the future.
1.5 Getting Help
There is text help available from within R using the function help() or the ? character typed
before a command. If you have questions about any function in this manual, see the
corresponding help file. For example, suppose you would like to learn more about the
function log() in R. The following two commands result in the same thing:
> help(log)
> ?log
log                     package:base                     R Documentation
. . .
So, we see that the log() function in R is the logarithm function from mathematics. This
function takes two arguments: x is the variable or object whose logarithm will be taken,
and base defines which logarithm is calculated. Note that base defaults to e =
2.718281828..., which gives the natural logarithm. We also see that there are other associated
functions, namely log10() and log2() for the calculation of base 10 and 2 (respectively)
logarithms. Some examples:
> log(100)
[1] 4.60517
> log2(16)            # same as log(16,base=2) or just log(16,2)
[1] 4
> log(1000,base=10)   # same as log10(1000)
[1] 3
>
Due to the object oriented nature of R, we can also use the log() function to calculate the
logarithm of numerical vectors and matrices:
> log2(c(1,2,3,4))    # log base 2 of the vector (1,2,3,4)
[1] 0.000000 1.000000 1.584963 2.000000
>
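The same elementwise behavior applies to matrices; a small sketch:

```r
# natural log applied elementwise to a 2x2 matrix of powers of e
m <- matrix(c(1, exp(1), exp(2), exp(3)), nrow = 2)
log(m)   # entries 0, 1, 2, 3 (filled column by column)
```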
Help can also be accessed from the menu on the R Console. This includes both the text help
and help that you can access via a web browser. You can also perform a keyword search
with the function apropos(). As an example, to find all functions in R that contain the
string norm, type:
> apropos("norm")
 [1] "dlnorm"         "dnorm"          "plnorm"         "pnorm"          "qlnorm"
 [6] "qnorm"          "qqnorm"         "qqnorm.default" "rlnorm"         "rnorm"
>
Note that we put the keyword in double quotes, but single quotes ('') will also work.
1.6 More on Functions in R
We have already seen a few functions at this point, but R has an incredible number of
functions that are built into the software, and you even have the ability to write your own (see
Chapter 8). Most functions will return something, and functions usually require one or more
input values. In order to understand how to generally use functions in R, let's consider the
function matrix(). A call to the help file gives the following:
Matrices
Description:
'matrix' creates a matrix from the given set of values.
Usage:
matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE)
Arguments:
    data: an optional data vector.
    nrow: the desired number of rows.
    ncol: the desired number of columns.
   byrow: logical. If 'FALSE' (the default) the matrix is filled by
          columns, otherwise the matrix is filled by rows.
...
So, we see that this is a function that takes vectors and turns them into matrix objects. There
are 4 arguments for this function, and they specify the entries and the size of the matrix
object to be created. The argument byrow is set to either TRUE or FALSE (or T or F; either
form is allowed for logicals) to specify how the values are filled in the matrix.
Often arguments for functions will have default values, and we see that all of the arguments
in the matrix() function do. So, the call
> matrix()
will return a matrix that has one row, one column, with the single entry NA (missing or not
available). However, the following is more interesting:
> a <- c(1,2,3,4,5,6,7,8)
> A <- matrix(a,nrow=2,ncol=4, byrow=FALSE)   # a is different from A
> A
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
>
Note that we could have left off the byrow=FALSE argument, since this is the default value.
In addition, since there is a specified ordering to the arguments in the function, we also could
have typed
> A <- matrix(a,2,4)
to get the same result. For the most part, however, it is best to include the argument names in
a function call (especially when you aren't using the default values) so that you don't confuse
yourself. We will learn more about this function in the next chapter.
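For contrast with the byrow=FALSE example above, setting byrow=TRUE fills the same values across rows (the name M here is arbitrary, chosen only for this sketch):

```r
a <- c(1, 2, 3, 4, 5, 6, 7, 8)
M <- matrix(a, nrow = 2, ncol = 4, byrow = TRUE)   # fill row by row
M   # first row 1 2 3 4, second row 5 6 7 8
```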
The R program: From the Help menu you can access the manuals that come with the
software. These are written by the R core development team. Some are very lengthy
and specific, but the manual An Introduction to R is a good source of useful
information.
Free Documentation: The CRAN website has several user contributed documents in
several languages. These include:
R for Beginners by Emmanuel Paradis (76 pages). A good overview of the software
with some nice descriptions of the graphical capabilities of R. The author assumes
that the reader knows some statistical methods.
R reference card by Tom Short (4 pages). This is a great desk companion when
working with R.
Books: These you have to buy, but they are excellent! Some examples:
Introductory Statistics with R by Peter Dalgaard, Springer-Verlag (2002). Peter is a
member of the R Core team and this book is a fantastic reference that includes both
elementary and some advanced statistical methods in R.
Modern Applied Statistics with S, 4th Ed. by W.N. Venables and B.D. Ripley,
Springer-Verlag (2002). The authoritative guide to the S programming language for
advanced statistical methods.
1.9 Exercises
1. Use the help system to find information on the R functions mean and median.
2. Get a list of all the functions in R that contain the string test.
3. Create the vector info that contains your age, height (in inches/cm), and phone
number.
4. Create the matrix Ident defined as a 3x3 identity matrix.
5. Save your work from this session in the file 1stR.txt.
Other standard functions that are found on most calculators are available in R:
Name                     Operation
sqrt()                   square root
abs()                    absolute value
sin(), cos(), tan()      trig functions (radians); type ?Trig for others
pi                       the number π = 3.1415926...
exp(), log()             exponential and logarithm
gamma()                  Euler's gamma function
factorial()              factorial function
choose()                 combination

> sqrt(2)
[1] 1.414214
> abs(2-4)
[1] 2
> cos(4*pi)
[1] 1
> log(0)          # not defined
[1] -Inf
> factorial(6)    # 6!
[1] 720
> choose(52,5)    # this is 52!/(47!*5!)
[1] 2598960
Ahem. The Greek letters Σ and Π are used to denote sums and products, respectively. Signomi.
Name                   Operation
length()               returns the number of entries in a vector
sum()                  calculates the arithmetic sum of all values in the vector
prod()                 calculates the product of all values in the vector
cumsum(), cumprod()    cumulative sums and products
sort()                 sort a vector
diff()                 computes suitably lagged (default is 1) differences
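A quick sketch of these functions applied to a small vector (the values are illustrative):

```r
x <- c(2, 5, 1, 6)
length(x)    # 4
sum(x)       # 14
prod(x)      # 60
cumsum(x)    # 2 7 8 14
sort(x)      # 1 2 5 6
diff(x)      # 3 -4 5 (successive differences)
```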
Matrix operations (multiplication, transpose, etc.) can easily be performed in R using a few
simple functions like:
Name           Operation
dim()          dimension of the matrix (number of rows and columns)
as.matrix()    used to coerce an argument into a matrix object
%*%            matrix multiplication
t()            matrix transpose
det()          determinant of a square matrix
solve()        matrix inverse; also solves a system of linear equations
eigen()        computes eigenvalues and eigenvectors
Using the matrices A, B, and C just created, we can have some linear algebra fun using the
above functions:
> t(C)
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
> B%*%C
     [,1] [,2] [,3] [,4] [,5]
[1,]   13   16   19   22   25
[2,]   27   34   41   48   55
[3,]   41   52   63   74   85
[4,]   55   70   85  100  115
[5,]   69   88  107  126  145
> D <- C%*%B
> D
     [,1] [,2]
[1,]   95  110
[2,]  220  260
> det(D)
[1] 500
> solve(D)     # this is the inverse of D
      [,1]  [,2]
[1,]  0.52 -0.22
[2,] -0.44  0.19
>
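Two entries in the table above, dim() and eigen(), are not used in this session; a short sketch (re-creating the 2x2 matrix D directly so the lines run on their own):

```r
D <- matrix(c(95, 220, 110, 260), nrow = 2)  # same D as above, filled by column
dim(D)            # number of rows and columns: 2 2
eigen(D)$values   # the two eigenvalues of D
```

Note that the product of the eigenvalues equals det(D).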
2.4 Exercises
Use R to compute the following:
1. 2 3 32 .
2. e^e.
3. (2.3)^8 + ln(7.5) cos(π/2).
4. Let

       A = [ 1 3 5 2 ]         B = [ 1 2 3 2 ]
           [ 2 1 6 4 ]             [ 0 1 3 4 ]
           [ 4 7 2 5 ]             [ 2 4 7 3 ]
                                   [ 1 5 1 2 ]

   Find AB^(-1) and BA^T.
As we will see, there are many other ways to create vectors and datasets in R.
3.1 Sequences
Sometimes we will need to create a string of numerical values that have a regular pattern.
Instead of typing the sequence out, we can define the pattern using some special operators
and functions.
> prod(1:8)
[1] 40320
> seq(1,5)          # same as 1:5
[1] 1 2 3 4 5
> seq(1,5,by=.5)    # increment by 0.5
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
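Another standard pattern function (not shown in the excerpt above) is rep(), which repeats values:

```r
rep(1:3, times = 2)   # 1 2 3 1 2 3  (whole vector repeated)
rep(1:3, each = 2)    # 1 1 2 2 3 3  (each element repeated)
```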
When entering a larger array of values, the c() function can be unwieldy. Alternatively, data
can be read directly from the keyboard by using the scan() function. This is useful since
data values need only be separated by a blank space (although this can be changed in the
arguments of the function). Also, by default the function expects numerical inputs, but you
can specify others by using the what = option. The syntax is:
> x <- scan()
When the above is typed in an R session, you get a prompt signifying that R expects you to
type in values to be added to the vector x. The prompt indicates the indexed value in the
vector that it expects to receive. R will stop adding data when you enter a blank row. After
entering a blank row, R will indicate the number of values it read in.
Suppose that we count the number of passengers (not including the driver) in the next 30
automobiles at an intersection:
> passengers <- scan()
1: 2 4 0 1 1 2 3 1 0 0 3 2 1 2 1 0 2 1 1 2 0 0      # I hit return
23: 1 3 2 2 3 1 0 3                                  # I hit return again
31:                                                  # I hit return one last time
Read 30 items
> passengers      # print out the values
[1] 2 4 0 1 1 2 3 1 0 0 3 2 1 2 1 0 2 1 1 2 0 0 1 3 2 2 3 1 0 3
In addition, the scan() function can be used to read data that is stored (as ASCII text) in an
external file into a vector. To do this, you simply pass the filename (in quotes) to the scan()
function. For example, suppose that the above passenger data was originally saved in a text
file called passengers.txt that is located on a disk drive. To read in this data that is located
on a C: drive, we would simply type
> passengers <- scan("C:/passengers.txt")
Read 30 items
>
Notes:
ALWAYS view the text file first before you read it into R to make sure it is what you
want and formatted appropriately.
To represent directories or subdirectories, use the forward slash (/), not a backslash (\)
in the path of the filename even on a Windows system.
If your computer is connected to the internet, data (contained in a text
file) can also be read from a URL using the scan() function. The basic syntax is given by:
> dat <- scan("http://www...")
If the directory name and/or file name contains spaces, we need to take special care
denoting the space character. This is done by including a backslash (\) before the space is
typed.
As an alternative, you can use the function file.choose() in place of the filename. In so
doing, an explorer-type window will open and the file can be selected interactively.
More on reading in datasets from external sources is given in the next section.
Often in statistics, a dataset will contain more than one variable recorded in an experiment.
For example, in the automobile experiment from the last section, other variables might have
been recorded like automobile type (sedan, SUV, minivan, etc.) and driver seatbelt use (Y,
N). A dataset in R is best stored in an object called a data frame. Individual variables are
designated as columns of the data frame and have unique names. However, all of the
columns in a data frame must be of the same length.
You can enter data directly into a data frame by using the built-in data editor. This allows for
an interactive means for data-entry that resembles a spreadsheet. You can access the editor
by using either the edit() or fix() commands:
> new.data <- data.frame()
> new.data <- edit(new.data)
OR
> new.data <- data.frame()
> fix(new.data)
The data editor allows you to add as many variables (columns) to your data frame as you
wish. The column names can be changed from the default var1, var2, etc. by clicking the
column header. At this point, the variable type (either numeric or character) can also be
specified.
When you close the data editor, the edited frame is saved.
You can also create data frames from preexisting variables in the workspace. Suppose that in
the last experiment we also recorded the seatbelt use of the driver: Y = seatbelt worn, N =
seatbelt not worn. This data is entered by (recall that since these data are text based, we need
to put quotes around each data value):
> seatbelt <- c("Y","N","Y","Y","Y","Y","Y","Y","Y","Y",
+ "N","Y","Y","Y","Y","Y","Y","Y","Y","Y","Y","Y","Y",    # return
+ "Y","Y","N","Y","Y","Y","Y")                            # return
>
We can combine these variables into a single data frame with the command

> car.dat <- data.frame(passengers,seatbelt)

When car.dat is printed, the values along the left side are simply the row numbers.
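To inspect only the first few rows of a data frame (instead of printing all 30), head() is handy; a small self-contained sketch using shortened, hypothetical versions of the two variables:

```r
passengers <- c(2, 4, 0, 1, 1, 2)             # first six counts, for illustration
seatbelt <- c("Y", "N", "Y", "Y", "Y", "Y")
car.dat <- data.frame(passengers, seatbelt)
head(car.dat, 3)   # first three rows; row numbers appear on the left
```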
A window will open and the available datasets are listed (many others are accessible from
external user-written packages, however). To open the dataset called trees, simply type
> data(trees)
After doing so, the data frame trees is now in your workspace. To learn more about this (or
any other included dataset), type help(trees).
You can access single variables in a data frame by using the $ argument. For example, we
see that the trees dataset has three variables:
> trees
   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
3    8.8     63   10.2
4   10.5     72   16.4
5   10.7     81   18.8
. . .
To access single variables in a data frame, use a $ between the data frame and column names:
> trees$Height
 [1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69 75 74 85 86 71 64 78
[22] 80 74 72 77 81 82 80 80 80 87
> sum(trees$Height)     # sum of just these values
[1] 2356
>
You can also access a specific element or row of data by calling the specific position (in row,
column format) in brackets after the name of the data frame:
> trees[4,3]
[1] 16.4
> trees[4,]
  Girth Height Volume
4  10.5     72   16.4
>
Often, we will want to access variables in a data frame, but using the $ argument can get a
little awkward. Fortunately, you can make R find variables in any data frame by adding the
data frame to the search path. For example, to include the variables in the data frame trees
in the search path, type
> attach(trees)
To see exactly what is going on here, we can view the search path by using the search()
command:
> search()
[1] ".GlobalEnv"        "trees"             "package:methods"
[4] "package:stats"     "package:graphics"  "package:utils"
[7] "Autoloads"         "package:base"
>
Note that the data frame trees is placed as the second item in the search path. This is the
order in which R looks for things when you type in commands. FYI, .GlobalEnv is your
workspace and the package quantities are libraries that contain (among other things) the
functions and datasets that we are learning about in this manual.
To remove an object from the search path, use the detach() command in the same way that
attach() is used. However, note that when you exit R, any objects added to the search path
are removed anyway.
To list the features of any object in R, be it a vector, data frame, etc. use the attributes()
function. For example:
> attributes(trees)
$names
[1] "Girth" "Height" "Volume"
$row.names
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
[12] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22"
[23] "23" "24" "25" "26" "27" "28" "29" "30" "31"
$class
[1] "data.frame"
>
Here, we see that the trees object is a data frame with 31 rows and has variable names
corresponding to the measurements taken on each tree.
Lastly, using the variable Height, this is a cool feature:

> Height[Height > 75]   # pick off all heights greater than 75
[1] 81 83 80 79 76 76 85 86 78 80 77 81 82 80 80 80 87
>
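The same idea extends to compound logical conditions; for example (written with the $ form so the line runs without attach()):

```r
data(trees)
trees$Height[trees$Height > 75 & trees$Height <= 80]   # heights in (75, 80]
```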
A dataset that has been created and stored externally (again, as ASCII text) can be read into a
data frame. Here, we use the function read.table() to load the dataset into a data frame. If
the first line of the text file contains the names of the variables in the dataset (which is often
the case), R can take those as the names of the variables in the data frame. This is specified
with the header = T option in the function call. If no header is included in a file, you can
ignore this option and R will use the default variable names for a data frame. Filenames are
specified in the same way as the scan() function, or the file.choose() function can be
used to select the file interactively. For example, the function call
> smith <- read.table(file.choose(), header=T)
would read in data from a user-specified text file where the first line of the file designates the
names of the variables in the data frame. For CSV files, the function read.csv() is used in
a similar fashion.
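As a self-contained sketch of read.table() (the file contents and variable names here are invented for illustration), we can write a two-line file to a temporary location and read it back:

```r
# write a tiny space-delimited file with a header line, then read it back
tf <- tempfile(fileext = ".txt")
writeLines(c("Girth Height", "8.3 70", "8.6 65"), tf)
smith <- read.table(tf, header = TRUE)   # header=TRUE takes names from line 1
smith
```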
If you have a file that is in a software-specific format (e.g. Excel, SPSS, etc.), there are R
functions that can be used to import the data into R. The manual R Data Import/Export
accessible on the R Project website addresses this. However, since most programs allow you
to save data files as tab-delimited text files, this is usually preferred.
3.4 Exercises
"banana"
2. Using the scan() function, enter 10 numbers (picked at random between 1 and 100)
into a vector called blahblah.
3. Create a data frame called schedule with the following variables:
coursenumber: the course numbers of your classes this semester (e.g. 329)
coursedays:
meeting days (either MWF, TR, etc.)
grade:
your anticipated grade (A, B, C, D, or F)
4. Load in the stackloss dataset from within R and save the variables Water.Temp and
Acid.Conc. in a data frame called tempacid.
4. Introduction to Graphics
One of the greatest powers of R is its graphical capabilities (see the exercises section in this
chapter for some amazing demonstrations). In this chapter, some of these features will be
briefly explored.
When pictures are created in R, they are presented in the active graphical device or window
(for the Mac, it's the Quartz device). If no such window is open when a graphical function is
executed, R will open one. Some features of the graphics window:
You can print directly from the graphics window, or choose to copy the graph to the
clipboard and paste it into a word processor. There, you can also resize the graph to
fit your needs. A graph can also be saved in many other formats, including pdf,
bitmap, metafile, jpeg, or postscript.
Each time a new plot is produced in the graphics window, the old one is lost. In MS
Windows, you can save a history of your graphs by activating the Recording
feature under the History menu (seen when the graphics window is active). You can
access old graphs by using the Page Up and Page Down keys. Alternatively, you
can simply open a new active graphics window (by using the function x11() in
Windows/Unix and quartz() on a Mac).
There are many functions in R that produce graphs, and they range from the very basic to the
very advanced and intricate. In this section, two basic functions will be profiled, and
information on ways to embellish plots will be given in the sections that follow. Other
graphical functions will be described in Chapter 5.
The most common function used to graph anything in R is the plot() function. This is a
generic function that can be used for scatterplots, time-series plots, function graphs, etc. If a
single vector object is given to plot(), the values are plotted on the y-axis against the row
numbers or index. If two vector objects (of the same length) are given, a bivariate scatterplot
is produced. For example, consider again the dataset trees in R. To visualize the
relationship between Height and Volume, we can draw a scatterplot:
> plot(Height, Volume)
[Scatterplot of Volume (vertical axis) against Height (horizontal axis) for the trees data.]
Notice the format here: the first variable is plotted along the horizontal axis and the
second variable is plotted along the vertical axis. By default, the variable names are listed
along each axis.
This graph is pretty basic, but the plot() function can allow for some pretty snazzy window
dressing by changing the function arguments from the default values. These include adding
titles/subtitles, changing the plotting character/color (over 600 colors are available!), etc. See
?par for an overwhelming list of these options.
This function will be used again in the succeeding chapters.
To graph a continuous function over a specified range of values, the curve() function can be
used (although interestingly curve() actually calls the plot() function). The basic use of
this function is:
curve(expr, from, to, add = FALSE, ...)
Arguments:
expr: an expression written as a function of 'x'
from, to: the range over which the function will be plotted.
add: logical; if 'TRUE' add to already existing plot.
Note that it is necessary that the expr argument is always written as a function of 'x'. If the
argument add is set to TRUE, the function graph will be overlaid on the current graph in the
graphics window (this useful feature will be illustrated in Chapter 6).
For example, the curve() function can be used to plot the sine function from 0 to 2π:

> curve(sin(x), from = 0, to = 2*pi)

[Graph of sin(x) on the interval [0, 2π].]
In addition to standard graphics functions, there are a host of other functions that can be used
to add features to a drawn graph in the graphics window. These include (see each function's
help file for more information):

Function     Operation
abline()     adds a straight line with specified intercept and slope (or draws a
             vertical or horizontal line)
arrows()     adds an arrow at a specified coordinate
lines()      adds lines between coordinates
points()     adds points at specified coordinates (also for overlaying
             scatterplots)
rug()        adds a rug representation to one axis of the plot
segments()   similar to lines() above
text()       adds text (possibly inside the plotting region)
title()      adds main titles, subtitles, etc. with other options
The plot used on the cover page of this document includes some of these additional features
applied to the graph in Section 4.2.1.
There is still more fine-tuning available for altering the graphics settings. To make changes
to how plots appear in the graphics window itself, or to have every graphic created in the
graphics window follow a specified form, the default graphical parameters can be changed
using the par() function. There are over 70 graphics parameters that can be adjusted, so
only a few will be mentioned here.
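As a sketch of the kinds of settings par() controls (this particular selection of parameters is illustrative, not exhaustive; see ?par):

```r
par(mfrow = c(2, 2))      # lay out subsequent plots in a 2-by-2 grid
par(pch = 19)             # use a solid circle as the plotting symbol
par(mar = c(5, 4, 2, 1))  # margin widths in lines: bottom, left, top, right
```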
Any or all parameters can be changed in a par() command, and they remain in effect until
they are changed again (or if the program is exited). You can save a copy of the original
parameter settings in par(), and then after making changes recall the original parameter
settings. To do this, type
> oldpar <- par(no.readonly = TRUE)
... then, make your changes in par() ...
> par(oldpar)
4.5 Exercises
As mentioned previously, more on graphics will be seen in the next two chapters. For this
section, enter the following commands to see some of R's incredible graphical capabilities.
Also, try pasting/inserting a graph into a word processor or document.
> demo(graphics)
> demo(persp)
> demo(image)
This is used for all line plots (e.g. page 31) presented herein.
5. Summarizing Data
One of the simplest ways to describe what is going on in a dataset is to use a graphical or
numerical summary procedure. Numerical summaries are things like means, proportions,
and variances, while graphical summaries include histograms and boxplots.
Name            Operation
mean()          arithmetic mean
median()        sample median
fivenum()       five-number summary
summary()       generic summary function for data and model fits
min(), max()    smallest/largest values
quantile()      calculate sample quantiles (percentiles)
var(), sd()     sample variance, sample standard deviation
cov(), cor()    sample covariance/correlation
These functions will take one or more vectors as arguments for the calculation; in addition,
they will (in general) work in the correct way when they are given a data frame as the
argument.
If your data contains only discrete counts (like the number of pets owned by each family in a
group of 20 families), or is categorical in nature (like the eye color recorded for a sample of
50 fruit flies), the above numerical measures may not be of much use. For categorical or
discrete data, we can use the table() function to summarize a dataset.
For examples using these functions, let's consider the dataset mtcars in R, which contains
measurements on 11 aspects of automobile design and performance for 32 automobiles
(1973-74 models):
> data(mtcars)      # load in dataset
> attach(mtcars)    # add mtcars to search path
> mtcars
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
. . .
The variables in this dataset are both continuous (e.g. mpg, disp, wt) and discrete (e.g. gear,
carb, cyl) in nature. For the continuous variables, we can calculate:
> mean(hp)
[1] 146.6875
> var(mpg)
[1] 36.3241
> quantile(qsec, probs = c(.20, .80))   # 20th and 80th percentiles
   20%    80%
16.734 19.332
> cor(wt,mpg)    # not surprising that this is negative
[1] -0.8676594
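For a discrete variable like cyl, the table() function tallies the number of vehicles at each value (written with an explicit mtcars$ prefix so the snippet runs on its own):

```r
data(mtcars)
table(mtcars$cyl)   # counts of 4-, 6-, and 8-cylinder vehicles
```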
So, it can be seen that eleven of the vehicles have 4 cylinders, seven vehicles have 6, and
fourteen have 8 cylinders. We can turn the counts into percentages (or relative frequencies)
by dividing by the total number of observations:
> table(cyl)/length(cyl)    # note: length(cyl) = 32
cyl
      4       6       8
0.34375 0.21875 0.43750
For discrete or categorical data, we can display the information given in a table command
in a picture using the barplot() function. This function takes as its argument a table
object created using the table() command discussed above:
> barplot(table(cyl)/length(cyl))   # use relative frequencies on the y-axis

[Bar chart of the relative frequencies of 4-, 6-, and 8-cylinder vehicles.]
See ?barplot on how to change the fill color, add titles to your graph, etc.
hist():
This function will plot a histogram that is typically used to display continuous-type data. Its
format, with the most commonly used options, is:
hist(x, breaks = "Sturges", prob = FALSE, main = paste("Histogram of", xname))
Arguments:
x: a vector of values for which the histogram is desired.
breaks: one of:
* a character string (in double quotes) naming an algorithm to
compute the number of cells
The default for 'breaks' is '"Sturges"': Other names for which
algorithms are supplied are '"Scott"' and '"FD"'
* a single number giving the number of cells for the histogram
prob: logical; if FALSE, the histogram graphic is a representation
of frequencies, the 'counts' component of the result; if
TRUE, _relative_ frequencies ("probabilities"), component
'density', are plotted.
main: the title for the histogram
The breaks argument specifies the number of bins (or classes) for the histogram. Too few
or too many bins can result in a poor picture that won't characterize the data well. By
default, R uses the Sturges formula for calculating the number of bins. This is given by

⌈log2(n) + 1⌉,

where n is the sample size and ⌈·⌉ is the ceiling operator.
Other methods exist that consider finding the optimal bin width (the number of bins required
would then be the sample range divided by the bin width). The Freedman-Diaconis formula
(Freedman and Diaconis 1981) is based on the inter-quartile range (iqr):

2 · iqr · n^(−1/3);

the formula proposed by Scott (1979) is based on the standard deviation (s):

3.5 · s · n^(−1/3).
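The three rules can be compared directly (a sketch, not from the original text; the simulated sample is arbitrary):

```r
# Compare the three bin-count rules on a simulated sample of n = 100.
set.seed(1)
x <- rnorm(100)
n <- length(x)
k.sturges <- ceiling(log2(n) + 1)             # Sturges' bin count
h.fd      <- 2 * IQR(x) * n^(-1/3)            # Freedman-Diaconis bin width
h.scott   <- 3.5 * sd(x) * n^(-1/3)           # Scott bin width
k.fd      <- ceiling(diff(range(x)) / h.fd)   # implied bin counts
k.scott   <- ceiling(diff(range(x)) / h.scott)
c(sturges = k.sturges, fd = k.fd, scott = k.scott)
```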
To see some differences, consider the faithful dataset in R, which is a famous dataset that
exhibits natural bimodality. The variable eruptions gives the duration of the eruption (in
minutes) and waiting is the time between eruptions for the Old Faithful geyser:
> data(faithful)
> attach(faithful)
> hist(eruptions, main = "Old Faithful data", prob = T)
[Figure: histogram of eruptions ("Old Faithful data") with density on the y-axis]
We can give the picture a slightly different look by changing the number of bins:
> hist(eruptions, main = "Old Faithful data", prob = T, breaks=18)
[Figure: the same histogram drawn with more bins]
stem()
This function produces a stem-and-leaf display, a text-based alternative to the histogram.
For the waiting variable:
> stem(waiting)

  The decimal point is 1 digit(s) to the right of the |

  4 | 3
  4 | 55566666777788899999
  5 | 00000111111222223333333444444444
  5 | 555555666677788889999999
  6 | 00000022223334444
  6 | 555667899
  7 | 00001111123333333444444
  7 | 555555556666666667777777777778888888888888889999999999
  8 | 000000001111111111111222222222222333333333333334444444444
  8 | 55555566666677888888999
  9 | 00000012334
  9 | 6
boxplot()
This function will construct a single boxplot if the argument passed is a single vector, but if
many vectors are contained (or if a data frame is passed), a boxplot for each variable is
produced on the same graph.
For the two data files in the Old Faithful dataset:
> boxplot(faithful)   # same as boxplot(eruptions, waiting)

[Figure: side-by-side boxplots of eruptions and waiting]
Thus, the waiting time for an eruption is generally much larger and has higher variability
than the actual eruption time. See ?boxplot for ways to add titles/color, changing the
orientation, etc.
ecdf():
This function will create the values of the empirical distribution function (EDF) Fn(x). It
requires a single argument: the vector of numerical values in a sample. To plot the EDF for
data contained in x, type
> plot(ecdf(x))
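A small sketch (the exponential sample here is only illustrative, not from the text):

```r
# ecdf() returns a step function that can be both evaluated and plotted.
set.seed(5)
x <- rexp(50, rate = 1)
Fn <- ecdf(x)      # the EDF as a function object
Fn(median(x))      # 0.5 by construction for this even sample size
plot(Fn)           # the step-function plot of the EDF
```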
qqnorm(), qqline():
These functions are used to check for normality in a sample of data by constructing a normal
probability plot (NPP) or normal q-q plot. The syntax is:
> qqnorm(x)
> qqline(x)
Only a few of the many functions used for graphics have been discussed so far. Other
graphical functions include:
Name        Operation
pairs()     Draws all possible scatterplots for two columns in a matrix/dataframe
persp()     Three dimensional plots with colors and perspective
pie()       Constructs a pie chart for categorical/discrete data
qqplot()    Quantile-quantile plot to compare two datasets
ts.plot()   Time series plot
5.3 Exercises
Distribution        R name    Additional arguments              Argument defaults
beta                beta      shape1, shape2
binomial            binom     size, prob
chi-square          chisq     df (degrees of freedom)
uniform             unif      min, max                          min = 0, max = 1
exponential         exp       rate                              rate = 1
F                   f         df1, df2
gamma               gamma     shape, rate (or scale = 1/rate)   scale = 1
geometric           geom      prob
hypergeometric      hyper     m, n, k (sample size)
negative binomial   nbinom    size, prob
normal              norm      mean, sd                          mean = 0, sd = 1
Poisson             pois      lambda
t distribution      t         df
Weibull             weibull   shape, scale                      scale = 1
Prefix each R name given above with d for the density or mass function, p for the CDF,
q for the percentile function (also called the quantile), and r for the generation of
pseudo-random variables. The syntax has the following form (we use the wildcard rname to
denote a distribution above):
> drname(x, ...)   # the value of the density/mass function at x
> prname(q, ...)   # the value of the CDF at q
> qrname(p, ...)   # the percentile corresponding to probability p
> rrname(n, ...)   # a simulated sample of size n
> x <- rnorm(100)            # simulate 100 standard normal RVs, put in x
> w <- rexp(1000,rate=.1)    # simulate 1000 from Exp(theta = 10)
> dbinom(3,size=10,prob=.25) # P(X=3) for X ~ Bin(n=10, p=.25)
> dpois(0:2, lambda=4)       # P(X=0), P(X=1), P(X=2) for X ~ Poisson
> pbinom(3,size=10,prob=.25) # P(X <= 3) in the above distribution
> pnorm(12,mean=10,sd=2)     # P(X <= 12) for X ~ N(mu = 10, sigma = 2)
> qnorm(.75,mean=10,sd=2)    # 3rd quartile of N(mu = 10, sigma = 2)
> qchisq(.10,df=8)           # 10th percentile of chi-square(8)
> qt(.95,df=20)              # 95th percentile of t(20)
I = ∫[a,b] g(x) dx,

but the antiderivative of g(x) cannot be found in closed form. Standard techniques involve
approximating the integral by a sum, and many computer packages can do this. Another
approach to finding I is called Monte Carlo integration, and it works as follows. Suppose
that we generate n independent Uniform random variables3 (this we have already seen how
to do) X1, X2, ..., Xn on the interval [a, b] and compute

Î = (b − a) · (1/n) Σ[i=1..n] g(Xi).

By the Law of Large Numbers, as n increases without bound,

Î → (b − a) · E[g(X)].
By mathematical expectation,

E[g(X)] = ∫[a,b] g(x) · [1/(b − a)] dx = I/(b − a),

so that Î converges to (b − a) · I/(b − a) = I.
As an example, consider evaluating ∫ sin(2x) exp(−x²) dx over a finite interval; the
integrand has no closed-form antiderivative.
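The recipe above can be sketched in a few lines (the integrand and interval here are our own illustration, not from the text):

```r
# Monte Carlo estimate of I = integral of g over [a, b].
set.seed(10)
g <- function(x) sin(2*x) * exp(-x^2)
a <- 0; b <- 1
n <- 100000
u <- runif(n, min = a, max = b)   # X1, ..., Xn ~ Uniform(a, b)
I.hat <- (b - a) * mean(g(u))     # (b - a) * (1/n) * sum of g(Xi)
I.hat                             # close to the true value, about 0.476
```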
3 The method can be easily modified to use another distribution defined on the interval
[a, b]. See Robert and Casella (1999).
4 The R function integrate() is another option for numerical quadrature. See ?integrate.
> x <- 0:10
> y <- dbinom(x, size=10, prob=.25)
> plot(x, y, type = "h", lwd = 30, main = "Binomial probabilities", ylab = "p(x)")
We have done a few things here. First, we created the vector x that contains the integers 0
through 10. Then we calculated the binomial probabilities at each of the points in x and
stored them in the vector y. Then, we specified that the plot be of type "h", which gives
the histogram-like vertical lines, and we fattened the lines with the lwd = 30 option (the
default line width is 1). Finally, we gave it some informative titles.
Lastly, we note that to conserve space in the workspace, we could have produced the same
plot without actually creating the vectors x and y (embellishments removed):
> plot(0:10, dbinom(0:10, size=10, prob=.25), type = "h", lwd = 30)
> curve(dnorm(x), from = -3, to = 3)                     # the standard normal density
> curve(pnorm(x, mean = 10, sd = 2), from = 4, to = 16)  # a normal CDF
Note that we restricted the plotting to be between -3 and 3 in the first plot since this is where
the standard normal has the majority of its area. Alternatively, we could have used upper and
lower percentiles (say the .5% and 99.5%) and calculated them by using the qnorm()
function.
Also note that the curve() function has an option (add = TRUE) to be added to an existing
plot. When you overlay a curve, you don't have to specify the from and to arguments because
R defaults them to the low and high values of the x-values on the original plot. Consider
how the histogram of a large simulation of random variables compares to the density
function:
> simdata <- rexp(1000, rate=.1)
> hist(simdata, prob = T, breaks = "FD", main="Exp(theta = 10) RVs")
> curve(dexp(x, rate=.1), add = T)
# overlay the density curve
[Figure: histogram of simdata with the Exp(theta = 10) density curve overlaid]
Simple probability experiments like choosing a number at random between 1 and 100 and
drawing three balls from an urn can be simulated in R. The theory behind games like these
forms the foundation of sampling theory (drawing random samples from fixed populations);
in addition, resampling methods (repeated sampling within the same sample) like the
bootstrap are important tools in statistics. The key function in R is the sample() function. Its
usage is:
sample(x, size, replace = FALSE, prob = NULL)
Arguments:
x: Either a (numeric, complex, character or logical) vector of
more than one element from which to choose, or a positive
integer.
size: non-negative integer giving the number of items to choose.
replace: Should sampling be with or without replacement?
prob: An optional vector of probability weights for obtaining the
elements of the vector being sampled.
> sample(1:100, 1)
[1] 34
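The urn-type draws mentioned above can be sketched the same way (the urn's composition and the weighting below are our own choices):

```r
# Draw three balls, without replacement, from an urn with
# 5 red and 3 white balls.
set.seed(2)
urn <- c(rep("red", 5), rep("white", 3))
draw <- sample(urn, size = 3)   # replace = FALSE is the default
draw
# The prob argument weights the outcomes; here a loaded coin:
flip <- sample(c("H", "T"), size = 10, replace = TRUE, prob = c(.7, .3))
flip
```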
6.5 Exercises
1.
2.
3.
4.
5.
exp( x
)dx
with n = 1,000,000.
6. Simulate 100 observations from the normal distribution with μ = 50 and σ = 4. Plot
the empirical cdf Fn(x) for this sample and overlay the true CDF F(x).
7. Simulate 25 flips of a fair coin where the results are heads and tails (hint: use the
sample() function).
7. Statistical Methods
R includes a host of statistical methods and tests of hypotheses. We will focus on the most
common ones in this Chapter.
The main function that performs these sorts of tests is t.test(). It yields hypothesis tests
and confidence intervals that are based on the t-distribution. Its syntax is:
Usage:
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
Arguments:
x, y: numeric vectors of data values; y is optional, and if it is
omitted a one-sample test is performed.
Note that from the above, t.test() not only performs the hypothesis test but also calculates
a confidence interval. However, if the alternative is either a greater than or less than
hypothesis, a lower (in case of a greater than alternative) or upper (less than) confidence
bound is given.
Example 1: Using the trees dataset, test the hypothesis that the mean black cherry tree
height is 70 ft. versus a two-sided alternative:
> data(trees)
> t.test(trees$Height, mu = 70)
For our test, we will assume that the two population variances are equal. In R, the analysis
of a two-sample t-test would be performed by:
> drug <- c(15, 10, 13, 7, 9, 8, 21, 9, 14, 8)
> plac <- c(15, 14, 12, 8, 14, 7, 16, 10, 15, 12)
> t.test(drug, plac, alternative = "less", var.equal = T)
Two Sample t-test
data: drug and plac
t = -0.5331, df = 18, p-value = 0.3002
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 2.027436
sample estimates:
mean of x mean of y
     11.4      12.3

With a p-value of .3002, we would not reject the claim that the two mean recovery times are
equal.
Example 3: an experiment was performed to determine if a new gasoline additive can
increase the gas mileage of cars. In the experiment, six cars are selected and driven with and
without the additive. The gas mileages (in miles per gallon, mpg) are given below.
Car                 1    2    3    4    5    6
mpg w/ additive  24.6 18.9 27.3 25.2 22.0 30.9
mpg w/o additive 23.8 17.7 26.6 25.1 21.6 29.6
Since this is a paired design, we can test the claim using the paired t-test (under an
assumption of normality for mpg measurements). This is performed by:
> add <- c(24.6, 18.9, 27.3, 25.2, 22.0, 30.9)
> noadd <- c(23.8, 17.7, 26.6, 25.1, 21.6, 29.6)
> t.test(add, noadd, paired=T, alt = "greater")
Paired t-test
data: add and noadd
t = 3.9994, df = 5, p-value = 0.005165
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.3721225       Inf
sample estimates:
mean of the differences
                   0.75

With a p-value of .005165, we can conclude that the mpg improves with the additive.
The simplest way to fit an ANOVA model is to call to the aov() function, and the type of
ANOVA model is specified by a formula statement. Some examples:
> aov(x ~ a)           # one-way ANOVA model
> aov(x ~ a + b)       # two-way ANOVA with no interaction
> aov(x ~ a + b + a:b) # two-way ANOVA with interaction
> aov(x ~ a*b)         # exactly the same as the above
In the above statements, the x variable is continuous and contains all of the responses in
the ANOVA experiment. The variables a and b represent factor variables: they contain the
levels of the experimental factors. The levels of a factor in R can be either numerical
(e.g. 1, 2, 3, ...) or categorical (e.g. "low", "medium", "high", ...), but the variables
must be stored as factor variables. We will see how this is done next.
Type                         A                        B                        C
Strength (lb/in²)   3225, 3320, 3165, 3145   3220, 3410, 3320, 3370   3545, 3600, 3580, 3485
In R:
> Str <- c(3225,3320,3165,3145,3220,3410,3320,3370,3545,3600,3580,3485)
> Type <- c(rep("A",4), rep("B",4), rep("C",4))
Thus, the Type variable specifies the rubber type and the Str variable is the tensile strength.
Currently, R thinks of Type as a character variable; we want to let R know that these letters
actually represent factor levels in an experiment. To do this, use the factor() command:
> Type <- factor(Type)
# consider these alternative expressions
> Type
[1] A A A A B B B B C C C C
Levels: A B C
>
Note that after the values in Type are printed, a Levels list is given. To access the levels in a
factor variable directly, you can type:
> levels(Type)
[1] "A" "B" "C"
With the data in this format, we can easily perform calculations on the subgroups contained
in the variable Str. To calculate the sample means of the subgroups, type
> tapply(Str,Type,mean)
A
B
C
3213.75 3330.00 3552.50
The function tapply() creates a table of the resulting values of a function applied to
subgroups defined by the second (factor) argument. To calculate the variances:
> tapply(Str,Type,var)
A
B
C
6172.917 6733.333 2541.667
We can also get multiple boxplots by specifying the relationship in the boxplot() function:
> boxplot(Str ~ Type, horizontal = T, xlab="Strength", col = "gray")
[Figure: horizontal boxplots of Str for rubber types A, B, and C]
To fit the one-way ANOVA model, type:
> anova.fit <- aov(Str ~ Type)
The object anova.fit is a linear model object. To extract the ANOVA table, use the R
function summary():
> summary(anova.fit)
            Df Sum Sq Mean Sq F value    Pr(>F)
Type         2 237029  118515  23.016 0.0002893 ***
Residuals    9  46344    5149
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
With a p-value of 0.0002893, we can confidently reject the null hypothesis that all three
rubber types have the same mean strength.
Specific percentage points for this distribution can be found using qtukey().
The above output gives 95% confidence intervals for the pairwise differences of mean
strengths for the three types of rubber. So, types B and A do not appear to have
significantly different mean strengths since the confidence interval for their mean difference
contains zero. A graphic can be used to support the analysis:
> plot(TukeyHSD(anova.fit))
[Figure: 95% family-wise confidence intervals for the differences B-A, C-A, and C-B]
Fitting a linear regression model in R is very similar to the ANOVA model material, and this
is intuitive since they are both linear models. To fit the linear regression (least-squares)
model to data, we use the lm() function. This can be used to fit simple linear (single
predictor), multiple linear, and polynomial regression models. With data loaded in, you only
need to specify the linear model desired. Examples of these are:
> lm(y ~ x)                    # simple linear regression
> lm(y ~ x1 + x2)              # multiple regression with two predictors
> lm(y ~ x1 + x2 + x3)         # multiple regression with three predictors
> lm(y ~ x - 1)                # regression with no intercept term
> lm(y ~ x + I(x^2))           # quadratic regression
> lm(y ~ x + I(x^2) + I(x^3))  # cubic regression
In the first example, the model is specified by the formula y ~ x which implies the linear
relationship between y (dependent/response variable) and x (independent/predictor variable).
The vectors x1, x2, and x3 denote possible independent variables used in a multiple linear
regression model. For the polynomial regression model examples, the function I() is used to
tell R to treat the variable as is (and not to actually compute the quantity).
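A short sketch of the quadratic case (the simulated data are our own, not from the text):

```r
# Fit y = b0 + b1*x + b2*x^2, using I() to protect the arithmetic x^2
# from being interpreted by the formula language.
set.seed(3)
x <- runif(30, 0, 5)
y <- 1 + 2*x - 0.5*x^2 + rnorm(30, sd = 0.2)
quad <- lm(y ~ x + I(x^2))
coef(quad)   # estimates should be near 1, 2, and -0.5
```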
The lm() function creates another linear model object from which a wealth of information
can be extracted.
Example: consider the cars dataset. The data give the speed (speed) of cars and the
distances (dist) taken to come to a complete stop. Here, we will fit a linear regression model
using speed as the independent variable and dist as the dependent variable (these variables
should be plotted first to check for evidence of a linear relation).
To compute the least-squares (assuming that the dataset cars is loaded and in the
workspace), type:
> fit <- lm(dist ~ speed)
The object fit is a linear model object. To see what it contains, type:
> attributes(fit)
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

$class
[1] "lm"
So, from the fit object, we can extract, for example, the residuals by accessing the variable
fit$residuals (this is useful for a residual plot, which will be seen shortly).
To get the least squares estimates of the slope and intercept, type
> fit

Call:
lm(formula = dist ~ speed)

Coefficients:
(Intercept)        speed
    -17.579        3.932
So, the fitted regression model has an intercept of -17.579 and a slope of 3.932. More
information about the fitted regression can be obtained by using the summary() function:
> summary(fit)
Call:
lm(formula = dist ~ speed)
Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-Squared: 0.6511,    Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.490e-12
In the output above, we get (among other things) residual statistics, standard errors of the
least squares estimates, and tests of hypotheses on model parameters. In addition, the value
for Residual standard error is the estimate for σ, the standard deviation of the error term
in the linear model. A call to anova(fit) will print the ANOVA table for the regression
model:
> anova(fit)
Analysis of Variance Table
Response: dist
          Df  Sum Sq Mean Sq F value    Pr(>F)
speed      1 21185.5 21185.5  89.567 1.490e-12 ***
Residuals 48 11353.5   236.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Confidence intervals for the slope and intercept parameters are a snap with the function
confint(); the default confidence level is 95%:
> confint(fit)
> plot(speed, dist)           # scatterplot of the data
> abline(fit)                 # add the fitted regression line
> plot(speed, fit$residuals)  # residual plot
Lastly, confidence and prediction intervals for specified levels of the independent variable(s)
can be calculated by using the predict.lm() function. This function accepts the linear
model object as an argument along with several other optional arguments. As an example, to
calculate (default) 95% confidence and prediction intervals for distance for cars traveling
at speeds of 15 and 23 miles per hour:
If values for newdata are not specified in the function call, all of the levels of the
independent variable are used for the calculation of the confidence/prediction intervals.
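For instance (a sketch using the cars fit from earlier in this section; predict() dispatches to predict.lm() for lm objects):

```r
# 95% confidence and prediction intervals at speed = 15 and 23 mph.
data(cars)
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = c(15, 23))
predict(fit, newdata = new, interval = "confidence")
predict(fit, newdata = new, interval = "prediction")
```

The prediction intervals are necessarily wider than the confidence intervals, since they account for the variability of a single new observation rather than of the mean response.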
Hypothesis test for count data that use the Pearson Chi-square statistic are available in R.
These include the goodness-of-fit tests and those for contingency tables. Each of these are
performed by using the chisq.test() function. The basic syntax for this function is (see
?chisq.test for more information):
chisq.test(x, y = NULL, correct = TRUE, p = rep(1/length(x),length(x)))
Arguments:
x: a vector, table, or matrix.
y: a vector; ignored if 'x' is a matrix.
correct: a logical indicating whether to apply continuity correction
when computing the test statistic.
p: a vector of probabilities of the same length of 'x'.
We will see how to use this function in the next two subsections.
7.4.1 Goodness of Fit
In order to perform the Chi-square goodness of fit test to test the appropriateness of a
particular probability model, the vector x above contains the tabulated counts (if your data
vector contains the raw observations that haven't been summarized, you'll need to use the
table() command to tabulate the counts). The vector p contains the probabilities associated
with the individual cells. In the default, the value of p assumes that all of the cell
probabilities are equal.
Example: A die was cast 300 times and the following was observed:

Die face    1   2   3   4   5   6
Frequency  43  49  56  45  66  41
To test that the die is fair, we can use the goodness-of-fit statistic using 1/6 for each cell
probability value:
> counts <- c(43, 49, 56, 45, 66, 41)
> probs <- rep(1/6, 6)
> chisq.test(counts, p = probs)
Chi-squared test for given probabilities
data: counts
X-squared = 8.96, df = 5, p-value = 0.1107
>
Note that the output gives the value of the test statistic, the degrees of freedom, and the
p-value.
7.4.2 Contingency Tables
The easiest way to analyze a tabulated contingency table in R is to enter it as a matrix (again,
if you have the raw counts, you can tabulate them using the table() function).
Example: A random sample of 1000 adults was classified according to sex and whether or
not they were color-blind as summarized below:
              Male   Female
Normal         442      514
Color-blind     38        6
We first enter the counts as a matrix:
> color.blind <- matrix(c(442, 514, 38, 6), nrow = 2, byrow = T)
In a contingency table, the row names and column names are meaningful, so we can change
these from the defaults:
> dimnames(color.blind) <- list(c("normal","c-b"),c("Male","Female"))
> color.blind
       Male Female
normal  442    514
c-b      38      6
This was really not necessary, but it does make the color.blind object look more like a
table.
To test if there is a relationship between gender and color-blind incidence, we obtain the
values of the chi-square statistic:
As with the ANOVA and linear model functions, the chisq.test() function actually creates
an output object so that other information can be extracted if desired:
> out <- chisq.test(color.blind, correct=F)
> attributes(out)
$names
[1] "statistic" "parameter" "p.value"   "method"    "data.name"
[6] "observed"  "expected"  "residuals"
$class
[1] "htest"
So, as an example we can extract the expected counts for the contingency table:
> out$expected
         Male Female
normal 458.88 497.12
c-b     21.12  22.88
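These expected counts, and the statistic itself, can be verified by hand (this check is our own, not from the text):

```r
# Pearson's X^2 = sum over cells of (O - E)^2 / E, where each
# E = (row total)(column total)/(grand total).
obs <- matrix(c(442, 514, 38, 6), nrow = 2, byrow = TRUE)
E <- outer(rowSums(obs), colSums(obs)) / sum(obs)
E                      # matches out$expected above
X2 <- sum((obs - E)^2 / E)
X2                     # the uncorrected chi-square statistic
```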
There are many, many more statistical procedures included in R, but most are used in a
similar fashion to those discussed in this chapter. Below is a list and description of other
common tests (see their corresponding help file for more information):
prop.test()
Large sample test for a single proportion or to compare two or more proportions that uses
a chi-square test statistic. An exact test for the binomial parameter p is given by the
function binom.test()
var.test()
Performs an F test to compare the variances of two independent samples from normal
populations.
cor.test()
Test for significance of the computed sample correlation value for two vectors. Can
perform both parametric and nonparametric tests.
wilcox.test()
One and two-sample (paired and independent samples) nonparametric Wilcoxon tests,
using a similar format to t.test()
kruskal.test()
Performs the Kruskal-Wallis rank sum test (similar to lm() for the ANOVA model).
friedman.test()
The nonparametric Friedman rank sum test for unreplicated blocked data.
ks.test()
Performs one- and two-sample Kolmogorov-Smirnov goodness-of-fit tests.
8. Advanced Topics
In this final chapter we will introduce and summarize some advanced features of R, namely
the ability to perform numerical methods (e.g. optimization) and to write scripts and
functions in order to facilitate many steps in a procedure.
8.1 Scripts
If you have a long series of commands that you would like to save for future use, you can
save all of the lines of code in a file and execute them together using the source() function.
For example, we could type the following statements in a text editor (you don't precede a
line with a > in the editor):
x1 <- rnorm(500)
x2 <- rnorm(500)
x3 <- rnorm(500)
y1 <- x1 + x2
y2 <- x2 + x3
r <- cor(y1,y2)
If we save the file as corsim.R on the C: drive, we execute the script by typing
> source("C:/corsim.R")
> r
[1] 0.5085203
>
Note that not only was the object r created, but so were x1, x2, x3, y1, and y2.
8.2 Control Statements
R provides the usual control statements, such as if(), for(), and while(). Many of these
statements require the evaluation of a logical statement, and these can be expressed using
logical operators:
48
Operator   Meaning
==         Equal to
!=         Not equal to
<, <=      Less than, less than or equal to
>, >=      Greater than, greater than or equal to
&          Logical AND
|          Logical OR
Some examples of these are given below. The first example checks if the number x is greater
than 2, and if so the text contained in the quotes is printed on the screen:
> x <- rnorm(1)
> if(x > 2) print("This value is more than the 97.72 percentile")
The code below creates vectors x and z and fills in 100 entries according to the assignments
contained within the curly brackets (note that x and z must exist before their entries can be
assigned). Since more than one expression is evaluated in the for loop, the expressions are
contained by {}.
> n <- 50
> x <- z <- numeric(100)
> for(i in 1:100) {
+   x[i] <- mean(rexp(n, rate = .5))
+   z[i] <- (x[i] - 2)/sqrt(2/n)
+ }
>
This last bit of code considers a Monte Carlo (MC) estimate of π. The basis of it is as
follows: if we generate a coordinate randomly over a square with vertices (1, 1), (1, −1),
(−1, 1), and (−1, −1), then the probability that the coordinate lies within the circle with
radius 1 centered at the origin (0,0) is π/4. In the code below, n is the number of
coordinates generated and s counts the number of observations contained in the unit circle.
Thus, s/n is the MC estimate of the probability and 4*s/n is the MC estimate of π. The code
stops when we are within .001 of the true value. Since the function uses MC estimation, the
result will be different each time the code executes.
> eps <- 1; s <- 0; n <- 0   # initialize values
> while(eps > .001) {
+   n <- n + 1
+   x <- runif(1,-1,1)
+   y <- runif(1,-1,1)
+   if(x^2 + y^2 < 1) s <- s + 1
+   pihat <- 4*s/n
+   eps = abs(pihat - pi)
+ }
>
> pihat
[1] 3.141343
> n
[1] 1132
8.3 Functions
Probably one of the most powerful aspects of the R language is the ability of a user to write
functions. When doing so, many computations can be incorporated into a single function and
(unlike with scripts) intermediate variables used during the computation are local to the
function and are not saved to the workspace. In addition, functions allow for input values
used during a computation that can be changed when the function is executed.
The general format for creating a function is
fname <- function(arg1, arg2, ...)
{ R code }
In the above, fname is any allowable object name and arg1, arg2, ... are function
arguments. As with any R function, they can be assigned default values. When you write a
function, it is saved in your workspace as a function object.
Here is a simple example of a user-written function. We made the function a little more
readable by adding some comments and spacing to single out the first if statement:
> f1 <- function(a, b) {
+   # This function returns the maximum of two scalars or the
+   # statement that they are equal.
+   if(is.numeric(c(a,b))) {
+     if(a < b) return(b)
+     if(a > b) return(a)
+     else print("The values are equal")   # could also use cat()
+   }
+   else print("Character inputs not allowed.")
+ }
>
The function f1 takes two values and returns one of several possible things. Observe how
the function works: before the values a and b are compared, it is first determined if they
are both numeric. The function is.numeric() returns TRUE if the argument is of a numeric
(real or integer) type, and FALSE otherwise. If this conditional is satisfied, the values
are compared. Otherwise, the user gets the warning message. To use the function:
> f1(4,7)
[1] 7
> f1(pi,exp(1))
[1] 3.141593
> f1(0,exp(log(0)))
[1] "The values are equal"
> f1("Stephen","Christopher")
[1] "Character inputs not allowed."
The function object f1 will remain in your workspace until you remove it.
To make changes to your function, you can use the fix() or edit() commands described in
Section 3.3.
Here is another example10. Below is the formula for calculating the monthly mortgage
payment for a home loan:

P = A · (r/1200) / [1 − (1 + r/1200)^(−12y)],

where A is the loan amount, r is the nominal interest rate (assumed convertible monthly),
and y is the number of years for the loan. The value of P represents the monthly payment.
Below is an R function that computes the monthly payment based on these inputs:
> mortgage <- function(A = 100000, r = 6, y = 30) {
+   P <- A*(r/1200)/(1 - (1 + r/1200)^(-12*y))
+   return(round(P, 2))
+ }
This function takes three inputs, but we have given them default values. Thus, if we don't
give the function any arguments, it will calculate the monthly payment for a $100,000 loan
at 6% interest for 30 years. The round() function is used to make the result look like money
by giving the calculation only two decimal places:
> mortgage()
[1] 599.55
> mortgage(200000,5.5)
# use default 30 year loan value
[1] 1135.58
> mortage(y = 15)
# Doh! Bad spelling...
Error: couldn't find function "mortage"
> mortgage(y = 15)
[1] 843.86
It is often the plight of the statistician to encounter a model that requires an iterative
technique to estimate unknown parameters using observed data. For likelihood-based
approaches, a computer program is usually written to execute some optimization method.
But, R includes functions for doing just that. Two important examples are:
optimize()
Returns the value that minimizes (or maximizes) a function over a specified interval.
uniroot()
Returns the approximate root (zero) of a function over a specified interval.

10 This is from The New S Language, by Becker/Chambers/Wilks, Chapman and Hall, London, 1988.
These are both for one-dimensional searches. To use either of these, you create the function
(as described in Section 8.3), and the search is performed with respect to the function's
first argument. The interval to be searched must be specified as well. The help file for
each of these functions provides the details on additional arguments, etc.
As an example, suppose that a random sample of size n is to be observed from the Weibull
distribution with unknown shape parameter θ and a scale parameter of 1. The pdf for this
model is given by

f(x) = θ x^(θ−1) exp(−x^θ), x > 0, θ > 0.

Using standard likelihood methods, the log likelihood function is given by

l(θ) = n log(θ) + (θ − 1) Σ[i=1..n] log(xi) − Σ[i=1..n] xi^θ,

with derivative

l′(θ) = n/θ + Σ[i=1..n] log(xi) − Σ[i=1..n] xi^θ log(xi).

To find θ̂, the maximum likelihood estimate for θ, we require the value of θ that maximizes
l(θ), given the data x1, x2, ..., xn. Equivalently, we could find the root of l′(θ). Clearly,
the two functions introduced in this section can do the job. To see these at work, data will
be simulated from this Weibull distribution with θ = 4.0:
> data <- rweibull(20, shape=4, scale=1)
> data
[1] 0.6151766 0.9417053 0.8244651 1.3818235 0.9866513 0.6121007 1.1391467
[8] 0.8711759 1.3590739 0.4826331 1.0755436 1.0318024 1.1728524 0.8062276
[15] 1.3630116 1.2094519 0.7547847 1.1178163 0.6184907 0.9310806
> l <- function(theta, x)
+ {
+   return(length(x)*log(theta) + (theta-1)*sum(log(x)) - sum(x^theta))
+ }
Thus, θ̂ is defined as the value that maximizes this function. To do this, enter:
> optimize(l, interval=c(0,20), x = data, max = T)
$maximum
[1] 3.874208        # the value that achieves the maximum
$objective
[1] -1.834197
So, θ̂ = 3.874208, which is pretty close to the true value 4.0. The interval (0, 20) in the
above function call can be considered a guessed range for the parameter.
Of course, this same maximum could also be found by using the derivative function l′() and
uniroot().
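A sketch of that alternative (lprime below is our own coding of the derivative l′(θ), and the simulated sample is new rather than the data vector shown above):

```r
# MLE of the Weibull shape parameter as the root of the score function.
set.seed(4)
x <- rweibull(20, shape = 4, scale = 1)
lprime <- function(theta, x)
  length(x)/theta + sum(log(x)) - sum(x^theta * log(x))
theta.hat <- uniroot(lprime, interval = c(0.1, 20), x = x)$root
theta.hat   # should be near the true shape value of 4
```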
8.5 Exercises
binomial:            p(x) = C(n, x) p^x (1 − p)^(n − x),  x = 0, 1, 2, ..., n

geometric:           p(x) = p (1 − p)^x,  x = 0, 1, 2, ...

hypergeometric:      p(x) = C(m, x) C(n, k − x) / C(m + n, k),  x = # of white balls drawn

negative binomial:   p(x) = C(x + n − 1, n − 1) p^n (1 − p)^x,  x = 0, 1, 2, ...

Poisson:             p(x) = λ^x exp(−λ) / x!,  x = 0, 1, 2, ...

beta:                f(x) = [Γ(a + b) / (Γ(a) Γ(b))] x^(a−1) (1 − x)^(b−1),  0 < x < 1

continuous uniform:  f(x) = 1/(max − min),  min < x < max

exponential:         f(x) = λ exp(−λx),  x > 0

gamma:               f(x) = [1 / (Γ(a) s^a)] x^(a−1) exp(−x/s),  x > 0

normal:              f(x) = [1 / (σ √(2π))] exp(−(x − μ)² / (2σ²)),  −∞ < x < ∞

Weibull:             f(x) = (a/b) (x/b)^(a−1) exp(−(x/b)^a),  x > 0

Here C(n, x) denotes the binomial coefficient "n choose x".
References
Becker, R., Chambers, J., and Wilks, A. (1988). The New S Language, Chapman and Hall,
London.
Freedman, D. and Diaconis, P. (1981). On the histogram as a density estimator: L2 theory,
Zeit. Wahr. ver. Geb. 57, p. 453-476.
Robert, C. and Casella, G. (1999). Monte Carlo Statistical Methods, Springer-Verlag, New
York.
Scott, D.W. (1979). On optimal and data-based histograms, Biometrika 66, p. 605-610.
Sturges, H. (1926). The choice of a class-interval, J. Amer. Statist. Assoc. 21, p. 65-66.
Index
:, 12
$, 16
%*%, 10
abline(), 21, 42
abs(), 8
anova(), 42
aov(), 39
apropos(), 5
arrows(), 21
as.matrix(), 10
attach(), 16
attributes(), 17, 41
barplot(), 24
boxplot(), 27, 38
c(), 2, 12
cat(), 50
chisq.test(), 44
confint(), 42
Control, 48
cor(), 23
cor.test(), 46
cov(), 23
curve(), 20, 32
choose(), 8
cumsum(), 9
data(), 15
data.frame(), 14
dbinom(), 29
det(), 10
detach(), 17
dim(), 10
dimnames(), 45
diff(), 9
ecdf(), 28
edit(), 14
eigen(), 10
exp(), 8
factorial(), 8
FALSE or F, 6
file.choose(), 14, 18
fivenum(), 23
fix(), 14
for(), 47, 48
friedman.test(), 47
function(), 50
gamma(), 8
gl(), 38
help(), 4
hist(), 25
I(), 41
if(), 48, 50, 51
integrate(), 30
kruskal.test(), 47
ks.test(), 47
length(), 9
lines(), 21
lm(), 41
log(), 4
ls(), 3
matrix(), 6
max(), 23
mean(), 23
median(), 23
min(), 23
NA, 6
nls(), 40
optimize(), 51
pairs(), 28
par(), 20, 21
pbinom(), 29
persp(), 28
pi, 8
pie(), 28
plot(), 19
pnorm(), 29
points(), 21
predict.lm(), 43
print(), 49
prod(), 9
prop.test(), 46
q(), 2
qchisq(), 29
qnorm(), 29
qqline(), 28
qqnorm(), 28
qqplot(), 28
qt(), 29
qtukey(), 39
quantile(), 23
quartz(), 19
read.csv() , 18
read.table(), 18
rep(), 13
return(), 50
rexp(), 29, 33
rm(), 3
round(), 51
rug(), 21
runif(), 29
sample(), 33
scan(), 13
sd(), 23
search(), 17
segments(), 21
seq(), 12
shapiro.test(), 47
solve(), 10
sort(), 9
source(), 48
stem(), 27
sum(), 9
summary(), 23, 39
sqrt(), 8
t(), 11
t.test(), 35
table(), 24
text(), 21
title(), 21
TRUE or T, 6
ts.plot(), 28
TukeyHSD(), 40
uniroot(), 51
var(), 23
var.test(), 46
while(), 48, 49
wilcox.test(), 47
x11(), 19