Programming in R
Jane M. Horgan∗
R programming is now a major resource for statistical analysis, research, and
teaching and has an impressive suite of applications and packages. In this
article, we introduce the basic features of the language including data entry,
data description, graphical procedures, writing simple programs, and simulation.
It is not the intention to provide an exhaustive list of features of the language, just
enough to give a flavor of the structure of R, and how to get help to proceed further.
2011 Wiley Periodicals, Inc.
∗ Correspondence to: [email protected]
School of Computing, Dublin City University, Dublin, Ireland

WHAT IS R?
R is a data-analysis system that provides an environment for statistical analysis and graphics. It can be used just as a calculator, all the way up to producing elaborate graphics, performing simulations, and statistical modeling. It is in fact a complete object-oriented programming language, is open source, and available from the web under the General Public License (GPL) which allows free use. It exists for Microsoft Windows, Linux, and Unix platforms, and for Apple Macintosh (OS versions newer than 8.6). Unlike standard statistical packages such as SPSS and Minitab, which use point-and-click graphical-user interfaces, R is command driven; the user types commands at a prompt, and R responds. The purpose of this article is to describe enough of the main features of R to enable the new user to get started. In the next section, we first show how to download R and describe some of its basic operations, editing, and help procedures. The methods used to read and edit statistical data are discussed in Section ‘Data Entry’, and an introduction to data analysis is given in Section ‘Data Analysis’. Some of the graphical features of R are examined in Section ‘Graphical Displays’, and Section ‘Simulation’ deals with an example in queuing, to illustrate the powerful simulation tools available in R. We conclude with some suggestions for further reading.

BASICS
As a start we look at how to download R, and get it to perform simple calculations.

Installing R
R is obtained from the website called CRAN (Comprehensive R Archive Network), and is downloaded by proceeding as follows:
• Go to the CRAN website at http://cran.r-project.org/
• Click ‘Download and Install R’
• Choose an operating system
• Choose the ‘base’ package
• Click on ‘Download R’
• Press the option ‘Run’
R is now installed.
To start, click on the R icon, or go to ‘Programs’, select R, and then click on the R icon. When the R program is started, and after it prints an introductory message on the screen, the interpreter prompts for input with ‘>’.

R as a Calculator
Expressions that are typed at the command prompt (>) are executed by the interpreter. For example:
6+7*3/2
returns
[1] 16.5
x <- 1:4
Here the integers 1, 2, 3, 4 are assigned to the vector x. To check the contents of x, type
x
which returns
[1] 1 2 3 4
xx <- x**2
causes each element in the vector x to be squared and stored in the vector xx. To examine the contents of xx, type
xx
which gives
[1] 1 4 9 16
To multiply a vector by a constant, type
X <- 10
prod1 <- X*x
prod1
[1] 10 20 30 40
Here the integer 10 is stored in X, and X*x causes each element of the vector x to be multiplied by 10.
Some points to note:
• <- is the assignment operator; in the illustration ‘x <- 1:4’, the vector (1, 2, 3, 4) is assigned to x. An alternative assignment operator is just ‘=’;
• R is case sensitive; x and X represent different variables;
• Variable names can consist of any combination of lower and upper case letters, numerals, periods, and underscores, but cannot begin with a numeral or an underscore;
• All of the above examples of variables are numeric, but R supports many other types of data, such as nonnumeric strings and matrices.
The entities that R creates and manipulates are called objects. These include variables, arrays of numbers, strings, or functions. All objects are stored in what is known as the workspace.

Getting Help
The easiest way of getting help when working in the R environment is to click the Help button on the toolbar. Alternatively you can type
help()
for on-line help, or
help.start()
for an HTML browser interface.
It could be helpful to look at some demonstrations of R by typing
demo()
which gives a list of available demonstrations. For example,
demo(graphics)
returns some examples of graphical procedures, along with the code used to implement them.
A more specific way of getting help is to type a question mark followed by the name of the function you require. For example:
?read.table
This will provide details on the exact syntactic structure of the instruction ‘read.table’.
If you do not know the name of the command, type some words from the topic as follows:
help.search("data.entry")
Check for yourself what this gives.

DATA ENTRY
Before carrying out a statistical analysis, it is necessary to get the data into the computer. How this is done varies depending on the amount of data involved. We illustrate the various options of data entry with the Anscombe Quartet1 given in Table 1. It consists of four data sets (x1, y1), (x2, y2), (x3, y3), (x4, y4), each consisting of two variables in each of which there are 11 observations.

Reading and Displaying Data on Screen
A small data set may be entered directly from the screen. It is usually stored as a vector, which is essentially a list of numbers. To input the x1 values given in Table 1 from the screen, type
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
The construct c(...) is used to define a vector containing the data points. These are then assigned to a vector called x1. Similarly for y1, type
y1 <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
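Once both vectors are entered, a quick check helps to catch typing slips. The sketch below is an illustration rather than part of the procedure described above: it confirms that x1 and y1 hold the same number of observations and binds them into a small data frame. The name pair1 is made up for this example.

```r
# Data for the first Anscombe pair, as typed in above
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y1 <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)

# Both vectors should contain the same number of observations
n_x <- length(x1)
n_y <- length(y1)
stopifnot(n_x == n_y)

# Vectors of equal length can be combined column-wise into a data frame
# ('pair1' is a hypothetical name used only in this sketch)
pair1 <- data.frame(x1, y1)
head(pair1, 3)   # display the first three rows
```

If the lengths differ, stopifnot halts with an error, which usually points to a value that was dropped or typed twice.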
Entering Data from a File
When the data set is large, it is better to set up a text file and access the data from this, rather than to enter it directly from the screen. For example, if the data in Table 1 are stored in a file called anscombe.txt in the G directory and data subdirectory, the file can be read into R using
anscombe <- read.table("G:/data/anscombe.txt", header = TRUE)
Here header = TRUE specifies that the first line is a header, in this case containing the names of variables. Notice that the forward slash (/) is used in the filename, not the backslash (\) which would be expected in the Windows environment. The backslash has itself a meaning within R, and a single backslash cannot be used in this context.
In R, this type of data set is stored in what is referred to as a data frame, which is an object with rows and columns. Equivalently it is a list of vectors of the same length; the columns denote the variables, while the rows are the observations on the variables. The read.table instruction above assigns the data to a data frame called anscombe.
The convention for accessing the column variables is to use the name of the data frame followed by a $ sign and the name of the relevant column. For example:
anscombe$x1[5]
returns
[1] 11
which is the 5th observation in the column labeled x1. An easier way of doing this is to type and enter
attach(anscombe)
After attach there is no need to use the data frame name. Now
x1[5]
returns
[1] 11
read.table assumes that the data in the text file are separated by spaces, as in Table 1. Other forms include:
read.csv, used when the data points are separated by commas;
read.csv2, used when the data are separated by semicolons.

Spread Sheets
It is also possible to enter data into a spreadsheet and store it in a data frame as follows:
anscombe <- data.frame()
fix(anscombe)
This brings up a blank spreadsheet called anscombe, and the user may then enter the variable labels and the variable values. When the data have been entered, right-clicking and choosing close creates a data frame anscombe in which the new information is stored.

Editing
If you subsequently need to amend anscombe, type
fix(anscombe)
This brings up the spreadsheet with the data, which can be changed as you wish. Alternatively click on Edit on the tool bar to get access to the Data Editor.

Missing Values
R allows vectors to contain a special NA value to indicate that the data point is not available. The absent values are referred to as missing values. They can be left out at the analysis stage; many functions, mean among them, do so only when called with the argument na.rm = TRUE.

Saving and Retrieving the Workspace
To save the entire workspace use

Summarizing Statistical Data
First we implement some of the most commonly used descriptive statistical measures.
For the mean of x1 write
mean(x1)
which gives
[1] 9
For the standard deviation of x1
sd(x1)
gives
[1] 3.316625
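Missing values interact with the summary functions just shown. The sketch below is an illustration under stated assumptions, not part of the original data: it copies the x1 vector, blanks out one observation (the name x1_missing is made up for this example), and shows that mean and sd return NA unless they are asked to drop missing values with na.rm = TRUE.

```r
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

# Hypothetical copy of x1 in which the fifth observation is missing
x1_missing <- x1
x1_missing[5] <- NA

m_default <- mean(x1_missing)                 # NA: missing values propagate by default
m_removed <- mean(x1_missing, na.rm = TRUE)   # mean of the 10 observed values
s_removed <- sd(x1_missing, na.rm = TRUE)     # standard deviation of the observed values
n_missing <- sum(is.na(x1_missing))           # count of missing values

print(c(m_default, m_removed, s_removed, n_missing))
```

Counting the NA values with is.na before summarizing makes it clear how many observations each statistic is actually based on.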
       x1             y1               x2             y2              x3
 Min.   : 4.0   Min.   : 4.260   Min.   : 4.0   Min.   :3.100   Min.   : 4.0
 1st Qu.: 6.5   1st Qu.: 6.315   1st Qu.: 6.5   1st Qu.:6.695   1st Qu.: 6.5
 Median : 9.0   Median : 7.580   Median : 9.0   Median :8.140   Median : 9.0
 Mean   : 9.0   Mean   : 7.501   Mean   : 9.0   Mean   :7.501   Mean   : 9.0
 3rd Qu.:11.5   3rd Qu.: 8.570   3rd Qu.:11.5   3rd Qu.:8.950   3rd Qu.:11.5
 Max.   :14.0   Max.   :10.840   Max.   :14.0   Max.   :9.260   Max.   :14.0
       y3              x4           y4
 Min.   : 5.39   Min.   : 8   Min.   : 5.520
 1st Qu.: 6.25   1st Qu.: 8   1st Qu.: 6.170
 Median : 7.11   Median : 8   Median : 7.040
 Mean   : 7.50   Mean   : 9   Mean   : 7.525
 3rd Qu.: 7.98   3rd Qu.: 8   3rd Qu.: 8.190
 Max.   :12.74   Max.   :19   Max.   :12.500
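The layout above is what R's summary function prints for a data frame: the minimum, quartiles, median, mean, and maximum of each column. A minimal sketch follows, assuming that the copy of the Anscombe data shipped with R (the built-in data frame anscombe in the datasets package) stands in for the file read earlier. The built-in copy lists the x columns first, so its column order differs from the table.

```r
# Built-in copy of the Anscombe quartet (datasets package, loaded by default)
data(anscombe)

# Minimum, quartiles, median, mean, and maximum of every column
summary(anscombe)

# The same figures for a single column, returned as a named numeric vector
s_y1 <- summary(anscombe$y1)
s_y1
```

Individual entries can be picked out by name, for example s_y1[["Median"]].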
diffx1 <- x1 - mean(x1)   # subtract the mean from each data point
diffsq <- diffx1^2   # obtain the squares of these differences
sumdiffsq <- sum(diffsq)   # sum the squared differences
std <- sqrt(sumdiffsq/(length(x1) - 1))   # divide this sum by (length(x1) - 1), then take the square root
Writing
std
gives
[1] 3.316625
this program as you wish. File/Save causes the file to be saved; you may designate what name you want to call it, and it will be given a .R extension. In subsequent sessions, File/Open Script brings up all the .R files you have saved, and you can select the one
Creating Functions
Users can write functions of their own when what they need is not available as a built-in function in R. We take as an example the skewness coefficient, which measures how much the data differ from symmetry and is defined as
\[ \text{skew} = \frac{\sqrt{n}\,\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}. \tag{3} \]
A perfectly symmetrical set of data will have a skewness of zero; when the skewness coefficient is substantially greater than zero, the data are asymmetric with a long tail to the right, and a negative skewness coefficient means that data have a long tail to the left.
The following syntax calculates the skewness coefficient, and assigns it to a function called skew which has one argument (x).
Example 3: Function which calculates the skewness coefficient
skew <- function(x)
{
  sum2 <- sum((x - mean(x))^2)   # sum of squared deviations from the mean
  sum3 <- sum((x - mean(x))^3)   # sum of cubed deviations from the mean
  skew <- (sqrt(length(x)) * sum3)/(sum2^(1.5))
  return(skew)
}
The function skew can be applied to any data set. For example
skew(y1)
gives
[1] -0.05580807
which indicates that the y1 data is slightly negatively skewed.

GRAPHICAL DISPLAYS
As well as numerical summaries, there are various pictorial representations and graphical displays available which have a more dramatic impact on the user and make for a better understanding of the data. The ease and speed with which graphical displays can be produced is one of the important features of R. We look at some of the most commonly used.

Histogram
The traditional way of examining the ‘shape’ of a set of data is a histogram.
hist(y1)
gives Figure 1.

FIGURE 1 | A histogram. (x-axis: y1; y-axis: Frequency)

Boxplots
A boxplot is a graphical summary based on the median, quartiles, and extreme values. To display the y1 data using a boxplot, type
boxplot(y1)
which gives Figure 2.

FIGURE 2 | A simple boxplot.

Often called the Box and Whiskers Plot, the box represents the interquartile range which contains 50% of cases. The whiskers are the lines that extend from the box to the highest and lowest values. The line across the box indicates the median.
Multiple boxplots can be displayed on the same axis, by adding extra arguments to the boxplot
function or by using the complete data frame. For example
boxplot(y1, y2)
yields Figure 3.

FIGURE 3 | Boxplot of y1 and y2.

Notice the point below the whiskers of the boxplot in y2. This data point is called an outlier and represents a case more than 1.5 box lengths from the upper or lower end of the box. This point is considered atypical of the data in general, being extremely low compared to the rest of the data.
Boxplots of all the variables in the data frame anscombe are obtained with
boxplot(anscombe)
which gives Figure 4.

FIGURE 4 | Boxplots of all the variables in the data frame.

Scatter Plots
Scatter plots are useful to investigate relationships between variables. To examine, for example, the relationship between x1 and y1, we could write:
plot(x1, y1)
to obtain Figure 5.
Here you can see that there is what is called a linear trend in these data. The line that ‘best fits’ these data is obtained and displayed with
abline(lm(y1 ~ x1))
This gives Figure 6.

FIGURE 6 | The line of best fit. (x-axis: x1; y-axis: y1)

When more than two variables are involved, R provides a facility for producing scatter plots of all possible pairs. Writing
pairs(anscombe)
will generate Figure 7.

Graphical Display versus Summary Statistics
Looking again at the Anscombe data set given in Table 1, we calculate the means (rounded to one decimal place) as follows:
round(sapply(anscombe, mean), 1)
gives
 x1  x2  x3  x4  y1  y2  y3  y4
9.0 9.0 9.0 9.0 7.5 7.5 7.5 7.5
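Column-wise summaries such as the means above can be obtained by applying a function to every column of the data frame. The following is a minimal sketch, assuming that the built-in data frame anscombe (datasets package) stands in for the data read from file. It uses sapply, and cross-checks the result against colMeans, a shortcut that exists for the mean only.

```r
data(anscombe)   # built-in copy of the data; an assumption of this sketch

# sapply applies a function to each column and returns a named vector
col_means <- sapply(anscombe, mean)

# colMeans computes the same quantity in one call
same_result <- isTRUE(all.equal(col_means, colMeans(anscombe)))

round(col_means, 1)
same_result
```

Any function that returns a single number per column, such as median or sd, can be passed to sapply in the same way.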
FIGURE 7 | Use of the pairs function.

The standard deviations (rounded to two decimal places) are calculated with
round(sapply(anscombe, sd), 2)
which gives
  x1   x2   x3   x4   y1   y2   y3   y4
3.32 3.32 3.32 3.32 2.03 2.03 2.03 2.03
Notice that the four sets of data (x1, y1), (x2, y2), (x3, y3), (x4, y4) have the same mean and standard deviation, which might lead to the conclusion that the four data sets are essentially the same.
Investigating further using graphical displays gives a different picture. A scatter plot is the obvious exploratory technique to use with paired data:
par(mfrow = c(2, 2))   # gives a two by two display
plot(x1, y1, xlim = c(0, 20), ylim = c(0, 13))
plot(x2, y2, xlim = c(0, 20), ylim = c(0, 13))
plot(x3, y3, xlim = c(0, 20), ylim = c(0, 13))
plot(x4, y4, xlim = c(0, 20), ylim = c(0, 13))
generates Figure 8. We use xlim = c(0, 20) and ylim = c(0, 13) to make the scales on the axes the same in the four plots, to allow for a valid comparison.

FIGURE 8 | Plots of four data sets with same means and standard deviations.

Examining Figure 8, we see that there are very great differences in the data sets:
1. Data set 1 is linear with some scatter;
2. Data set 2 is quadratic;
3. Data set 3 has an outlier. If the outlier were removed the data would be linear;
4. Data set 4 contains x values which are equal except for one outlier. If the outlier were removed, the data would be vertical.
Graphical displays are the core of getting ‘insight/feel’ for the data. Such ‘insight/feel’ does not come from the quantitative statistics; on the contrary, calculations of quantitative statistics should be done after the exploratory data analysis using graphical displays. The powerful graphical procedures of R facilitate this approach.

SIMULATION
With the computational power of R it is easy to simulate problems that might otherwise be difficult to understand. We illustrate with an example from the theory of queues.

Queues
There is an extensive literature on queuing theory; R enables us to sidestep the theory, and to concentrate instead on experimentation. We use as an example the M/M/1 queue, where there is one server dealing with customers on a first-in first-out basis. Customers are usually assumed to arrive in accordance with a Poisson distribution, and are served immediately if the queue is empty, otherwise they join the end of the queue. The service rates are also assumed to be Poisson.
Traffic intensity (I) is the ratio of the arrival rate to the service rate. When the arrival rate is greater than the service rate, I > 1; when it is equal, I = 1; and
when it is less, I < 1. We investigate each of these three scenarios in turn.
The following code simulates a queue in which customers arrive at the rate of 4 per minute, and are serviced at 3.8 per minute (I > 1). It generates 10,000 random Poisson arrivals (rpois(10000, 4)), and 10,000 Poisson services (rpois(10000, 3.8)), and calculates the queue length at each time interval.
Example 4: A program to simulate a simple queue
arrivals <- rpois(10000, 4)   # generates 10,000 values from a Poisson distribution with mean 4
service <- rpois(10000, 3.8)   # generates 10,000 values from a Poisson distribution with mean 3.8
queue <- numeric(10000)   # storage for the queue length at each time interval
queue[1] <- max(arrivals[1] - service[1], 0)
for (t in 2:10000) queue[t] <- max(queue[t-1] + arrivals[t] - service[t], 0)   # length of queue
plot(queue, xlab = "Time", ylab = "Queue length")

FIGURE 9 | Queue lengths. (Panels: traffic intensity I > 1, I = 1, and I < 1; x-axis: Time; y-axis: Queue length)

Figure 9 illustrates the severe problem that develops when the arrival rate is greater than the service rate (I > 1): the length of the queue increases steeply. With arrival and service rates equal (I = 1), the problem is not as severe, but it does exist, and we see that in the long run it will become serious. The only tenable solution to the queuing problem is to keep I < 1.

SUMMARY
We have tried to set before you some of the features of R which make it such a flexible and accessible language within which to tackle your statistical problems; we hope you have been convinced. For further and deeper information, there are many books and manuals both on and off line which, between them, deal with most statistical applications. Venables et al.2 provide a manual which gives an introduction to the language and how to use R for doing statistical analysis and graphics; it is downloadable from the CRAN website (http://cran.r-project.org/). Chambers3 guides the reader in programming with R, from interactive use and writing simple functions to the design of packages and intersystem interfaces. Horgan4 deals with probability problems. Statistical inference examples are tackled in Dalgaard5. The book of Maindonald and Braun6 has extensive examples that illustrate practical data analysis using R. Fox and Weisberg7
ACKNOWLEDGMENT
Thanks to the referees whose observations and suggestions greatly improved this article.
REFERENCES
1. Anscombe FJ. Graphs in statistical analysis. Am Stat 1973, 27:17–21.
2. Venables WN, Smith DM, the R Development Core Team. An Introduction to R: A Programming Environment for Data Analysis and Graphics, 2004. Available at: http://www.r-project.org/. (Accessed July 04, 2011).
3. Chambers JM. Software for Data Analysis: Programming with R. New York: Springer; 2008.
4. Horgan JM. Probability with R: An Introduction with Computer Science Applications. Hoboken, NJ: John Wiley & Sons; 2008.
5. Dalgaard P. Introductory Statistics with R. 2nd ed. Heidelberg: Springer-Verlag; 2008.
6. Maindonald J, Braun J. Data Analysis and Graphics Using R. 2nd ed. Cambridge: Cambridge University Press; 2007.
7. Fox J, Weisberg S. An R Companion to Applied Regression. 2nd ed. Thousand Oaks, CA: Sage Publications; 2011.
8. Gentleman R. R programming for Bioinformatics. London: Chapman and Hall/CRC; 2009.