R Prerequisites
Marina Krémé
This document reviews the basic notions you will need to complete the practical work during the EMA. It is
not a complete course in R; we also invite you to consult the reference R for Beginners.
The documents I used to prepare this course are:
• A First Course in Statistical Programming in R
• The Art of R Programming
I have attached a “cheat sheet” to this document to help you quickly find the names of functions.
Installation
Installing R on Windows
Choose the “Download R for Windows” tab. Click on the “base” tab, then “Download R 3.6.1 for Windows”
for an automatic installation. In most cases, the installation works without problems whatever the
version of Windows (XP, Vista, 7, 8, 10). If warning or error messages are displayed, there may be
special instructions to follow (see the tab “Does R run under my version of Windows?”). You can also
watch the video Installing R on Windows.
Installing R on Linux
Choose the “Download R for Linux” tab. The installation then depends on the distribution you use: Debian,
Red Hat, SUSE, or Ubuntu. On Ubuntu, you must first add one of the following lines to your
/etc/apt/sources.list file (depending on your Ubuntu version):
• deb https://cloud.r-project.org/bin/linux/ubuntu disco-cran35/
• deb https://cloud.r-project.org/bin/linux/ubuntu cosmic-cran35/
• deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/
• deb https://cloud.r-project.org/bin/linux/ubuntu xenial-cran35/
• deb https://cloud.r-project.org/bin/linux/ubuntu trusty-cran35/
Also add to /etc/apt/sources.list a line for an Ubuntu mirror, so that the Ubuntu programs required by
some R packages can be downloaded (the list of Ubuntu mirror sites is at
https://launchpad.net/ubuntu/+archivemirrors).
Then, in a terminal, execute the following commands:
• sudo apt-get update
• sudo apt-get install r-base
Installing R on macOS
Choose the “Download R for Mac OS X” tab and click on “R.3.6.1.pkg”. After downloading the file, install
it by double-clicking on it, as shown in the video Installing R
on Os.
A First R session
Let’s make a simple data set (in R parlance, a vector) consisting of the numbers 1, 2, 3, and 4, and name it x:
x<-c(1,2,3,4)
The standard assignment operator in R is <-. You can also use =, but this is discouraged, as it does not
work in some special situations. Note that there are no fixed types associated with variables. Here, we’ve
assigned a vector to x, but later we might assign something of a different type to it. We’ll look at vectors and
the other types later in this document. The c stands for concatenate. Here, we are concatenating the numbers
1, 2, 3, and 4. More precisely, we are concatenating four one-element vectors that consist of those numbers.
This is because any number is also considered to be a one-element vector. Now we can also do the following:
q<-c(x,x,8)
which sets q to (1,2,3,4,1,2,3,4,8) (yes, including the duplicates). Now let’s confirm that the data is really
in x. To print the vector to the screen, simply type its name. If you type any variable name (or, more
generally, any expression) while in interactive mode, R will print out the value of that variable (or
expression). Programmers who know other languages such as Python will find this feature familiar. For our
example, enter this:
x
## [1] 1 2 3 4
Yep, sure enough, x consists of the numbers 1, 2, 3, and 4. Individual elements of a vector are accessed via [ ].
Here’s how we can print out the third element of x:
x[3]
## [1] 3
Subsetting is a very important operation on vectors. Here’s an example:
x<-c(1,2,3,4)
x[2:3]
## [1] 2 3
The expression x[2:3] refers to the subvector of x consisting of elements 2 through 3, which are 2 and 3 here.
We can easily find the mean and standard deviation of our data set, as follows:
mean(x)
## [1] 2.5
sd(x)
## [1] 1.290994
If we want to save the computed mean in a variable instead of just printing it to the screen, we could
execute this code:
y<-mean(x)
y # print out y
## [1] 2.5
Finally, let’s do something with one of R’s internal data sets (these are used for demos). You can get a list of
these data sets by typing the following:
#data()
One of the data sets is called Nile and contains data on the flow of the Nile River. Let’s find the mean and
standard deviation of this data set:
mean(Nile)
## [1] 919.35
sd(Nile)
## [1] 169.2275
We can also plot a histogram of the data:
hist(Nile)
[Figure: Histogram of Nile (x-axis: Nile, y-axis: Frequency)]
The call hist(z,breaks=12) would draw a histogram of the data set z with about 12 bins (R treats the number
as a suggestion). You can also create nicer labels, make use of color, and make many other changes to create
a more informative and eye-appealing graph. When you become more familiar with R, you’ll be able to
construct complex, rich color graphics of striking beauty.
Well, that’s the end of our first, five-minute introduction to R. Quit R by calling the q() function (or
alternatively by pressing CTRL-D in Linux or CMD-D on a Mac).
Vectors
Vectors are stored like arrays in C, contiguously, and thus you cannot insert or delete elements—something
you may be used to if you are a Python programmer. The size of a vector is determined at its creation, so if
you wish to add or delete elements, you’ll need to reassign the vector. For example, let’s add an element to
the middle of a four-element vector:
x<-c(1,2,12,89)
x<-c(x[1:3],400,x[4]) #insert 400 before 89
x
## [1] 1 2 12 400 89
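As an alternative sketch of the same insertion, base R's append() function builds the new vector in one call. The name x_new is illustrative and not part of the original example.

```r
x <- c(1, 2, 12, 89)
# append() returns a new vector; x itself is not modified in place
x_new <- append(x, 400, after = 3)   # insert 400 after the third element
print(x_new)
```

Like the c() version, this creates a new vector, so the result still has to be assigned to a name to be kept.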
x<-c(1,2,3)
length(x)
## [1] 3
C<-c()
C
## NULL
length(C)
## [1] 0
Recycling
When applying an operation to two vectors that requires them to be the same length, R automatically
recycles, or repeats, the shorter one, until it is long enough to match the longer one. Here is an example:
c(1,2,3) + c(8,10,9,20,2)
## Warning in c(1, 2, 3) + c(8, 10, 9, 20, 2): longer object length is not a
## multiple of shorter object length
## [1] 9 12 12 21 4
The shorter vector was recycled, so the operation was taken to be as follows:
c(1,2,3,1,2) + c(8,10,9,20,2)
## [1] 9 12 12 21 4
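The recycling rule can be sketched with a case where the longer length is an exact multiple of the shorter one, so R recycles without a warning. The vector names below are illustrative.

```r
short <- c(1, 2)
long <- c(10, 20, 30, 40)
# short is recycled to c(1, 2, 1, 2) before the addition
recycled_sum <- short + long
print(recycled_sum)
# a single number is a one-element vector, so it is recycled too
scaled <- long * 2
print(scaled)
```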
Operators such as + are themselves functions, so the two calls below are equivalent:
2+3
## [1] 5
"+"(2,3)
## [1] 5
If you are familiar with linear algebra, you may be surprised at what happens when we multiply two vectors:
x<-c(1,2,3)
y<-c(4,5,6)
x*y
## [1] 4 10 18
But remember, because of the way the function is applied, the multiplication is done element by element.
The same principle applies to other numeric operators. Here’s an example:
x/y
## [1] 0.25 0.40 0.50
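For readers expecting a linear-algebra product, the following sketch contrasts the element-by-element product with two ways of getting the inner product. The result names are illustrative.

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)
elementwise <- x * y        # element-by-element product
dot_sum <- sum(x * y)       # inner product as a plain number
dot_mat <- x %*% y          # inner product as a 1 x 1 matrix
print(elementwise)
print(dot_sum)
print(dot_mat)
```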
Vector indexing
One of the most important and frequently used operations in R is that of indexing vectors, in which we form
a subvector by picking elements of the given vector for specific indices:
y<-c(1,10,5,8,9,4,0,2,1)
y[(c(1,3))] #extract elements 1 and 3 of y
## [1] 1 5
y[2:3]
## [1] 10 5
z<-3:5
y[z]
## [1] 5 8 9
Note that duplicates are allowed:
x <- c(4,2,17,5)
y <- x[c(1,1,3)]
y
## [1] 4 4 17
Negative subscripts mean that we want to exclude the given elements in our output.
z<-c(8,10,11,9)
z[-1]#exclude element 1
## [1] 10 11 9
z[-1:-2] #exclude elements 1 through 2
## [1] 11 9
In such contexts, it is often useful to use the length() function. For instance, suppose we wish to pick up all
elements of a vector z except for the last. The following code will do just that:
z<-c(3,4,27)
z[1:(length(z)-1)]
## [1] 3 4
or more simply:
z[-length(z)]
## [1] 3 4
## [1] 5 6 7 8 9 10
i<-2
1:i-1 # this means (1:i) -1, not 1:(i-1)
## [1] 0 1
1:(i-1)
## [1] 1
A generalization of : is the seq() (or sequence) function, which generates a sequence in arithmetic progression.
seq(from=10, to=20,by=2)
## [1] 10 12 14 16 18 20
The spacing can be a noninteger value, too, say 0.1.
seq(from=1.1,to=2,length.out=10)
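A small sketch of how seq() spaces its values when a length is requested, together with seq_len() as a safer counterpart to the : operator. The name steps is illustrative.

```r
steps <- seq(from = 1.1, to = 2, length.out = 10)
print(steps)
print(diff(steps))   # spacing between neighbours, about 0.1 each
# seq_len(n) counts 1..n and returns an empty vector when n is 0,
# whereas 1:0 counts backwards and gives c(1, 0)
print(seq_len(0))
print(1:0)
```

With 10 values between 1.1 and 2 there are 9 gaps, which is why the spacing comes out as 0.9/9 = 0.1.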
## [1] TRUE
any(x>90)
## [1] FALSE
all(x>0)
## [1] TRUE
Using NA
In many of R’s statistical functions, we can instruct the function to skip over any missing values, or NAs.
Here is an example
x <- c(88,NA,12,168,13)
mean(x)
## [1] NA
mean(x,na.rm =T)
## [1] 70.25
In the first call, mean() returned NA, as one value in x was NA. But by setting the optional argument
na.rm (NA remove) to true (T), we calculated the mean of the remaining elements.
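As a sketch of what na.rm does behind the scenes, missing values can also be located and dropped by hand with is.na(). The helper names are illustrative.

```r
x <- c(88, NA, 12, 168, 13)
missing <- is.na(x)          # logical vector, TRUE where a value is missing
n_missing <- sum(missing)    # TRUE counts as 1
x_complete <- x[!missing]    # keep only the observed values
print(n_missing)
print(mean(x_complete))      # same result as mean(x, na.rm = TRUE)
```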
Using NULL
One use of NULL is to build up vectors in loops, in which each iteration adds another element to the vector.
In this simple example, we build up a vector of even numbers:
# build up a vector of the even numbers in 1:10
z <- NULL
for (i in 1:10) if (i %%2 == 0) z <- c(z,i)
z
## [1] 2 4 6 8 10
Thus the example loop starts with a NULL vector and then adds the element 2 to it, then 4, and so on. But
the point here is to demonstrate the difference between NA and NULL. If we were to use NA instead of
NULL in the preceding example, we would pick up an unwanted NA:
z <- NA
for (i in 1:10) if (i %%2 == 0) z <- c(z,i)
z
## [1] NA 2 4 6 8 10
NULL values really are counted as nonexistent, as you can see here:
u<-NULL
length(u)
## [1] 0
v<-NA
length(v)
## [1] 1
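Growing a vector inside a loop is convenient for illustration; the sketch below shows a vectorised way to obtain the same even numbers. The names z_loop, candidates, and z_vec are illustrative.

```r
# loop version from the text: grows the vector one element at a time
z_loop <- NULL
for (i in 1:10) if (i %% 2 == 0) z_loop <- c(z_loop, i)

# vectorised version: filter 1:10 with a logical condition
candidates <- 1:10
z_vec <- candidates[candidates %% 2 == 0]

print(z_loop)
print(z_vec)
```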
Filtering indices
z <- c(5,2,-3,68)
w<-z[z*z>8] # simple example
w
## [1] 5 -3 68
"<"(2,1) # comparison
## [1] FALSE
x<-1:10
x[x>3]<-0 # replace all elements of x larger than 3 with 0
x
## [1] 1 2 3 0 0 0 0 0 0 0
#which
z <- c(5,2,-3,8)
which(z*z > 8) # find the positions within z at which the condition occurs.
## [1] 1 3 4
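The filtering steps above can be sketched side by side: a logical vector, the positions from which(), and ifelse() as a non-destructive way to replace values. All result names are illustrative.

```r
z <- c(5, 2, -3, 8)
keep <- z * z > 8              # logical vector: TRUE FALSE TRUE TRUE
by_logical <- z[keep]          # filter with the logical vector
by_position <- z[which(keep)]  # filter with the matching positions
# ifelse() builds a new vector instead of overwriting z in place
capped <- ifelse(z > 3, 0, z)
print(by_logical)
print(capped)
```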
## [,1] [,2]
## [1,] -3 83.0
## [2,] 2 0.1
It’s important to be aware of how R fills up the matrix using the entries from data. Looking at the previous
example, you can see that the 2 × 2 matrix A has been filled in a column-by-column fashion when reading
the data entries from left to right. You can control how R fills in data using the argument byrow, as shown
in the following examples:
matrix(data=c(1,2,3,4,5,6),nrow=2,ncol=3,byrow=FALSE)
cbind(c(1,4),c(2,5),c(3,6))
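To make the effect of byrow concrete, here is a short sketch that fills the same six values both ways and compares the row-wise result with rbind() and cbind(). The names vals, by_col, by_row, and stacked are illustrative.

```r
vals <- c(1, 2, 3, 4, 5, 6)
by_col <- matrix(data = vals, nrow = 2, ncol = 3, byrow = FALSE)
by_row <- matrix(data = vals, nrow = 2, ncol = 3, byrow = TRUE)
print(by_col)   # first column is 1, 2
print(by_row)   # first row is 1, 2, 3
# rbind() stacks vectors as rows, giving the same matrix as byrow = TRUE
stacked <- rbind(c(1, 2, 3), c(4, 5, 6))
print(all(stacked == by_row))
```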
## [1] 4 3
nrow(mymat)
## [1] 4
ncol(mymat)
## [1] 3
dim(mymat)[2]
## [1] 3
A <- matrix(c(0.3,4.5,55.3,91,0.1,105.5,-4.2,8.2,27.9),nrow=3,ncol=3)
A
## [1] 105.5
A[1,] # the first row
## [,1] [,2]
## [1,] 105.5 27.9
## [2,] 91.0 -4.2
diag(x=A) # identify the values along the diagonal of A
## [,1] [,2]
## [1,] 0.3 -4.2
## [2,] 4.5 8.2
## [3,] 55.3 27.9
A[-1,3:2] # removes the first row from A and retrieves the third and second column values
## [,1] [,2]
## [1,] 8.2 0.1
## [2,] 27.9 105.5
A[-1,-2] # A without its first row and second column
## [,1] [,2]
## [1,] 4.5 8.2
## [2,] 55.3 27.9
A[-1,-c(2,3)] # deletes the first row and then deletes the second and third columns
A <- rbind(c(2,5,2),c(6,1,4))
A
## [,1] [,2]
## [1,] 2 6
## [2,] 5 1
## [3,] 2 4
I <- diag(x=3) # identity matrix
B <- matrix(data=c(3,4,1,2),nrow=2,ncol=2)
solve(B)#inverse
## [,1] [,2]
## [1,] 1 -0.5
## [2,] -2 1.5
B%*%solve(B) #to verify
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
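Beyond inversion, solve() can also solve a linear system directly when given a right-hand side. In this sketch the vector rhs is a hypothetical example, not taken from the text.

```r
B <- matrix(data = c(3, 4, 1, 2), nrow = 2, ncol = 2)
rhs <- c(5, 8)                 # hypothetical right-hand side
sol <- solve(B, rhs)           # solves B %*% sol = rhs without forming the inverse
print(sol)
print(B %*% sol)               # reproduces rhs up to rounding error
```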
Multidimensional Arrays
D <- array(data=1:24,dim=c(3,4,2)) # 3 rows, 4 columns, 2 layers
D
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 13 16 19 22
## [2,] 14 17 20 23
## [3,] 15 18 21 24
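Indexing a three-dimensional array follows the same pattern as for matrices, with one index per dimension. A minimal sketch, using illustrative names for the results:

```r
D <- array(data = 1:24, dim = c(3, 4, 2))
one_value <- D[2, 3, 1]      # row 2, column 3, first layer
second_layer <- D[, , 2]     # a 3 x 4 matrix
print(one_value)
print(dim(second_layer))
```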
## [[1]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## [[2]]
## [1] TRUE FALSE TRUE TRUE
##
## [[3]]
## [1] "hello"
length(x=foo)
## [1] 3
You can retrieve components from a list using indexes, which are entered in double square brackets.
foo[[1]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
foo[[3]]
## [1] "hello"
This action is known as a member reference. When you’ve retrieved a component this way, you can treat it
just like a stand-alone object in the workspace; there’s nothing special that needs to be done.
foo[[1]] + 5.5
## [,1] [,2]
## [1,] 6.5 8.5
## [2,] 7.5 9.5
foo[[1]][1,2]
## [1] 3
foo[[1]][2,]
## [1] 2 4
cat(foo[[3]],"you!")
## hello you!
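The difference between double and single square brackets on a list can be sketched as follows. The construction of foo is an assumption chosen to match the printed components; member and sublist are illustrative names.

```r
# foo rebuilt to match the components printed in the text (assumed construction)
foo <- list(matrix(data = 1:4, nrow = 2, ncol = 2),
            c(TRUE, FALSE, TRUE, TRUE),
            "hello")
member <- foo[[3]]     # the component itself: a character vector
sublist <- foo[3]      # a list of length 1 that contains the component
print(class(member))
print(class(sublist))
```

Double brackets are therefore the form to use when the component is to be passed on to functions such as cat().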
Naming
You can name list components to make the elements more recognizable and easy to work with.
names(foo) <- c("mymatrix","mylogicals","mystring")
foo
## $mymatrix
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## $mylogicals
## [1] TRUE FALSE TRUE TRUE
##
## $mystring
## [1] "hello"
This has changed how the object is printed to the console. Where earlier it printed [[1]], [[2]], and [[3]]
before each component, now it prints the names you specified: $mymatrix, $mylogicals, and $mystring.
You can now perform member referencing using these names and the dollar operator, rather than the double
square brackets.
foo$mymatrix
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
This is the same as calling foo[[1]]. In fact, even when an object is named, you can still use the numeric index
to obtain a member.
foo[[1]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
Subsetting named members also works the same way.
all(foo$mymatrix[,2]==foo[[1]][,2])
## [1] TRUE
To name the components of a list as it’s being created, assign a label to each component in the list command.
Using some components of foo, create a new, named list.
b <- list(tom=c(foo[[2]],T,T,T,F),dick="g'day mate",harry=foo$mymatrix*2)
b
## $tom
## [1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
##
## $dick
## [1] "g'day mate"
##
## $harry
## [,1] [,2]
## [1,] 2 6
## [2,] 4 8
names(b)
Nesting
As noted earlier, a member of a list can itself be a list. When nesting lists like this, it’s important to keep
track of the depth of any member for subsetting or extraction later. Note that you can add components to
any existing list by using the dollar operator and a new name. Here’s an example using foo and b from
earlier:
b$bobby <- foo
b
## $tom
## [1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
##
## $dick
## [1] "g'day mate"
##
## $harry
## [,1] [,2]
## [1,] 2 6
## [2,] 4 8
##
## $bobby
## $bobby$mymatrix
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## $bobby$mylogicals
## [1] TRUE FALSE TRUE TRUE
##
## $bobby$mystring
## [1] "hello"
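Extracting from a nested list means chaining one reference per level. The sketch below rebuilds a shortened version of b (an assumption made to keep the example self-contained) and reaches into the inner list in two equivalent ways.

```r
foo <- list(mymatrix = matrix(data = 1:4, nrow = 2, ncol = 2),
            mylogicals = c(TRUE, FALSE, TRUE, TRUE),
            mystring = "hello")
b <- list(dick = "g'day mate", bobby = foo)   # shortened version of b
# chain $ (or [[ ]]) once per level of nesting
inner_logicals <- b$bobby$mylogicals[1:3]
same_thing <- b[["bobby"]][["mylogicals"]][1:3]
print(inner_logicals)
```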
Data Frames
To create a data frame from scratch, use the data.frame function. You supply your data, grouped by variable,
as vectors of the same length—the same way you would construct a named list. Consider the following
example data set:
mydata <- data.frame(person=c("Peter","Lois","Meg","Chris","Stewie"),
age=c(42,40,17,14,1),
sex=factor(c("M","F","F","M","M")))
mydata
## [1] 40
mydata[3:5,3] #Now extract the third, fourth, and fifth elements of the third column
## [1] F M M
## Levels: F M
mydata[,c(3,1)] #extracts the entire third and first columns
## sex person
## 1 M Peter
## 2 F Lois
## 3 F Meg
## 4 M Chris
## 5 M Stewie
This results in another data frame giving the sex and then the name of each person. You can also use the
names of the vectors that were passed to data.frame to access variables even if you don’t know their column
index positions, which can be useful for large data sets. You use the same dollar operator you used for
member-referencing named lists.
mydata$age
## [1] 42 40 17 14 1
mydata$person
## Levels: Chris Lois Meg Peter Stewie
mydata <- data.frame(person=c("Peter","Lois","Meg","Chris","Stewie"),
age=c(42,40,17,14,1),sex=factor(c("M","F","F","M","M")),
stringsAsFactors=FALSE)
mydata
## 2 Lois 40 F High
## 3 Meg 17 F Low
## 4 Chris 14 M Med
## 5 Stewie 1 M High
## 6 Brian 7 M Med
The rbind and cbind functions aren’t the only ways to extend a data frame. One useful alternative for adding
a variable is to use the dollar operator, much like adding a new member to a named list. Suppose now you
want to add another variable to mydata by including a column with the age of the individuals in months,
not years, calling this new variable age.mon:
mydata$age.mon <- mydata$age*12
mydata
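A self-contained sketch of the same column addition, followed by a row selection on a logical condition. The 18-year cutoff and the name adults are illustrative choices, not part of the original example.

```r
mydata <- data.frame(person = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
                     age = c(42, 40, 17, 14, 1),
                     sex = factor(c("M", "F", "F", "M", "M")),
                     stringsAsFactors = FALSE)
mydata$age.mon <- mydata$age * 12      # new column computed from an existing one
print(dim(mydata))                     # 5 rows, 4 columns
# logical conditions on a column select rows, as with vector filtering
adults <- mydata[mydata$age >= 18, c("person", "age.mon")]
print(adults)
```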
if statements
if (condition){ do any code here }
a <- 3
mynumber <-4
if(a<=mynumber){
a <- a^2
}
a
## [1] 9
else statements
The if statement executes a chunk of code if and only if a defined condition is TRUE. If you want something
different to happen when the condition is FALSE, you can add an else declaration.
if(condition){ do any code in here if condition is TRUE } else { do any code in here if condition is FALSE }
if(a<=mynumber){
cat("Condition was",a<=mynumber)
a <- a^2
} else {
cat("Condition was",a<=mynumber)
a <- a-3.5
}
## [1] 5.5
}
}
## First condition was FALSE
## Second condition was TRUE
a
## [1] 2
b
## [1] 0.5
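One point worth sketching about if and else: the condition must be a single TRUE or FALSE, so a comparison on a whole vector is usually reduced with any() or all() first. The values and messages below are illustrative.

```r
vals <- c(2, 9, 4)
mynumber <- 4
# if() needs a single TRUE/FALSE, so reduce the vector comparison with any()
if (any(vals > mynumber)) {
  msg <- "at least one value exceeds mynumber"
} else {
  msg <- "no value exceeds mynumber"
}
print(msg)
```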
For Loops
The R for loop always takes the following general form:
for(loopindex in loopvector){ do any code in here }
for(myitem in 5:7){
cat("--BRACED AREA BEGINS--\n")
cat("the current item is",myitem,"\n")
cat("--BRACED AREA ENDS--\n\n")
}
## [1] 0.8
## [1] 2.2
## [1] 0.68
## [1] 1.1
for(i in 1:length(myvec)){
print(2*myvec[i])
}
## [1] 0.8
## [1] 2.2
## [1] 0.68
## [1] 1.1
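The index-based loop above can be sketched in a self-contained form. The values of myvec are an assumption inferred from the printed output, and seq_along() is used as a common alternative to 1:length(myvec).

```r
myvec <- c(0.4, 1.1, 0.34, 0.55)   # assumed values, inferred from the printed output
for (i in seq_along(myvec)) {      # seq_along() is safe even for empty vectors
  print(2 * myvec[i])
}
doubled <- 2 * myvec               # vectorised equivalent, no loop needed
print(doubled)
```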
Functions
A function definition always follows this standard format:
functionname <- function(arg1,arg2,arg3,...){ do any code in here when called; return(returnobject) }
Example: Fibonacci sequence generator
myfib <- function(){
fib.a <- 1
fib.b <- 1
cat(fib.a,", ",fib.b,", ",sep="")
repeat{
temp <- fib.a+fib.b
fib.a <- fib.b
fib.b <- temp
cat(fib.b,", ",sep="")
if(fib.b>150){
cat("BREAK NOW...")
break
}
}
}
myfib()
Adding arguments
Rather than printing a fixed set of terms, let’s add an argument to control how many Fibonacci numbers are
printed. Consider the following new function, myfib2, with this modification:
myfib2 <- function(thresh){
fib.a <- 1
fib.b <- 1
cat(fib.a,", ",fib.b,", ",sep="")
repeat{
temp <- fib.a+fib.b
fib.a <- fib.b
fib.b <- temp
cat(fib.b,", ",sep="")
if(fib.b>thresh){
cat("BREAK NOW...")
break
}
}
}
myfib2(thresh=150)
## 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, BREAK NOW...
If you want to use the results of a function in future operations (rather than just printing output to the
console), you need to return content to the user. Continuing with the current example, here’s a Fibonacci
function that stores the sequence in a vector and returns it
myfib3 <- function(thresh){
fibseq <- c(1,1)
counter <- 2
repeat{
fibseq <- c(fibseq,fibseq[counter-1]+fibseq[counter])
counter <- counter+1
if(fibseq[counter]>thresh){
break
}
}
return(fibseq)
}
myfib3(150)
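Because myfib3 returns its result, the sequence can be stored and reused. The sketch below repeats the function so that it runs on its own; the name fibs and the follow-up operations are illustrative.

```r
myfib3 <- function(thresh){
  fibseq <- c(1,1)
  counter <- 2
  repeat{
    fibseq <- c(fibseq,fibseq[counter-1]+fibseq[counter])
    counter <- counter+1
    if(fibseq[counter]>thresh){
      break
    }
  }
  return(fibseq)
}
fibs <- myfib3(150)          # the returned vector can be stored
print(fibs)
print(length(fibs))          # number of terms generated
print(fibs[fibs %% 2 == 0])  # and reused, e.g. to keep the even terms
```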
Using return
If there’s no return statement inside a function, the function will end when the last line in the body code has
been run, at which point it will return the most recently assigned or created object in the function. If
nothing is created, such as in myfib and myfib2 from earlier, the function returns NULL. To demonstrate this
point, enter the following two dummy functions in the editor:
dummy1 <- function(){
aa <- 2.5
bb <- "string me along"
cc <- "string 'em up"
dd <- 4:8
}
dummy2 <- function(){
aa <- 2.5
bb <- "string me along"
cc <- "string 'em up"
dd <- 4:8
return(dd)
}
The first function, dummy1, simply assigns four different objects in its lexical environment (not the global
environment) and doesn’t explicitly return anything. On the other hand, dummy2 creates the same four
objects and explicitly returns the last one, dd. If you import and run the two functions, both provide the
same return object.
foo <- dummy1()
foo
## [1] 4 5 6 7 8
bar <- dummy2()
bar
## [1] 4 5 6 7 8
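There is one visible difference between the two styles, sketched here with shortened versions of the functions: a function that ends on an assignment returns its value invisibly, so calling it alone at the prompt prints nothing. The names vis1 and vis2 are illustrative.

```r
dummy1 <- function(){
  dd <- 4:8            # last expression is an assignment
}
dummy2 <- function(){
  dd <- 4:8
  return(dd)
}
# withVisible() reports the value and whether R would auto-print it
vis1 <- withVisible(dummy1())
vis2 <- withVisible(dummy2())
print(vis1$visible)
print(vis2$visible)
print(identical(vis1$value, vis2$value))
```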
A function will end as soon as it evaluates a return command, without executing any remaining code in the
function body. To emphasize this, consider one more version of the dummy function:
dummy3 <- function(){
aa <- 2.5
bb <- "string me along"
return(aa)
cc <- "string 'em up"
dd <- 4:8
return(bb)
}
Here, dummy3 has two calls to return: one in the middle and one at the end. But when you import and
execute the function, it returns only one value.
baz <- dummy3()
baz
## [1] 2.5
Basic plots
The easiest way to think about generating plots in R is to treat your screen as a blank, two-dimensional
canvas. You can plot points and lines using x- and y-coordinates. On paper, these coordinates are usually
represented with points written as a pair: (x value, y value). The R function plot, on the other hand, takes
in two vectors—one vector of x locations and one vector of y locations—and opens a graphics device where it
displays the result.
x <- c(1.1,2,3.5,3.9,4.2)
y<-c(2,2.2,-1.3,0,0.2)
plot(x,y)
[Figure: scatterplot of y against x]
Graphics parameters
• type : Tells R how to plot the supplied coordinates (for example, as stand-alone points or joined by
lines or both dots and lines).
• main, xlab, ylab: Options to include plot title, the horizontal axis label, and the vertical axis label,
respectively.
• col: Color (or colors) to use for plotting points and lines.
• pch: Stands for point character. This selects which character to use for plotting individual points.
• cex: Stands for character expansion. This controls the size of plotted point characters.
• lty: Stands for line type. This specifies the type of line to use to connect the points (for example, solid,
dotted, or dashed).
• lwd: Stands for line width. This controls the thickness of plotted lines.
• xlim, ylim: This provides limits for the horizontal range and vertical range (respectively) of the
plotting region
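The parameters listed above can be combined in a single call. The following is only a sketch; the particular values (title, colour, sizes, limits) are arbitrary choices for illustration.

```r
x <- c(1.1, 2, 3.5, 3.9, 4.2)
y <- c(2, 2.2, -1.3, 0, 0.2)
plot(x, y,
     type = "b",                 # points joined by lines
     main = "Parameter demo",    # plot title
     xlab = "x", ylab = "y",     # axis labels
     col = "gray", pch = 19,     # colour and point character
     cex = 1.5,                  # point size multiplier
     lty = 2, lwd = 2,           # dashed line, double width
     xlim = c(0, 5), ylim = c(-2, 3))  # axis ranges
```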
plot(x,y,type="l")
[Figure: line plot of y against x]
# titles and axis labels
plot(x,y,type="b",main="My lovely plot",xlab="x axis label",
ylab="location y")
[Figure: “My lovely plot” (x-axis: x axis label, y-axis: location y)]
plot(x,y,type="b",main="My lovely plot\ntitle on two lines",xlab="",
ylab="")
[Figure: “My lovely plot / title on two lines”, without axis labels]
[Figures: two further variants titled “My lovely plot”]
#install.packages("ggplot2")
xx<- c(1.1,2,3.5,3.9,4.2)
yy<- c(2,2.2,-1.3,0,0.2)
#qplot(xx,yy)
Elementary statistics
Take the data frame chickwts, which is available in the automatically loaded datasets package. At the
prompt, directly entering the following gives you the first five records of this data set.
chickwts[1:5,]
## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
R’s help file (?chickwts) describes these data as comprising the weights of 71 chicks (in grams) after six weeks,
based on the type of food provided to them. Now let’s take a look at the two columns in their entirety as
vectors:
chickwts$weight
## [1] 179 160 136 227 217 168 108 124 143 140 309 229 181 141 260 203 148 169 213
## [20] 257 244 271 243 230 248 327 329 250 193 271 316 267 199 171 158 248 423 340
## [39] 392 339 341 226 320 295 334 322 297 318 325 257 303 315 380 153 263 242 206
## [58] 344 258 368 390 379 260 404 318 352 359 216 222 283 332
chickwts$feed
[Figure: plot with axes labelled Longitude and Latitude]
## [1] 2.825
x_median <- median(xdata)
x_median
## [1] 2.6
min(xdata)
## [1] 2
max(xdata)
## [1] 4.4
range(xdata)
## [1] 261.3099
median(chickwts$weight)
## [1] 258
Many of the functions R uses to compute statistics from a numeric structure will not return a numeric result
if the data set includes missing or undefined values (NAs or NaNs). Here’s an example:
mean(c(1,4,NA))
## [1] NA
mean(c(1,4,NaN))
## [1] NaN
To prevent unintended NaNs or forgotten NAs from being ignored without the user’s knowledge, R does not
by default ignore these special values when running functions such as mean, and therefore does not return
the intended numeric result. You can, however, set an optional argument na.rm to TRUE, which will force the
function to operate only on the numeric values that are present.
mean(c(1,4,NA),na.rm=TRUE)
## [1] 2.5
mean(c(1,4,NaN),na.rm=TRUE)
## [1] 2.5
Finding a mode is perhaps most easily achieved by using R’s table function, which gives you the frequencies
you need:
xtab <- table(xdata)
xtab
## xdata
## 2 2.2 3 4 4.4
## 3 1 2 1 1
You can construct a logical flag vector to get the mode from table
d <- xtab[xtab==max(xtab)]
d
## 2
## 3
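The mode extraction can be sketched end to end as follows. The values of xdata are an assumption, chosen to be consistent with the frequency table printed in the text; mode_value and mode_count are illustrative names.

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)   # assumed values, consistent with the printed table
xtab <- table(xdata)
is_mode <- as.vector(xtab == max(xtab))  # logical flag vector over the table entries
mode_value <- as.numeric(names(xtab)[is_mode])   # table names are stored as text
mode_count <- max(xtab)
print(mode_value)
print(mode_count)
```

Converting the names back with as.numeric() is needed because table() stores the observed values as character labels.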
quantile(xdata,prob=c(0,0.25,0.5,0.75,1))
## 25% 75%
## 204.5 323.5
There are ways to obtain the five-number summary other than using quantile; when applied to a numeric
vector, the summary function also provides these statistics, along with the mean, automatically.
summary(xdata)
Basic data visualisation
Data visualization is an important part of a statistical analysis.
##
## 4 6 8
## 11 7 14
The result is easily displayed as a barplot, as shown here:
barplot(cyl_freq)
[Figure: barplot of cyl_freq with bars for 4, 6, and 8 cylinders]
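A self-contained sketch of the barplot: the definition of cyl_freq below is an assumption (a frequency table of mtcars$cyl), which reproduces the counts 11, 7, and 14 shown in the text. The axis labels are illustrative.

```r
cyl_freq <- table(mtcars$cyl)    # assumed definition; it reproduces the counts shown
print(cyl_freq)
barplot(cyl_freq,
        xlab = "Number of cylinders",   # illustrative labels
        ylab = "Number of cars")
```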
Similar plots may be produced using ggplot2, once you have loaded the installed package with library("ggplot2").
#library("ggplot2")
#qplot(factor(mtcars$cyl),geom="bar")
The venerable pie chart is an alternative option for visualizing frequency-based quantities across levels of
categorical variables, with appropriately sized “slices” representing the relative counts of each categorical
variable.
pie(table(mtcars$cyl),labels=c("V4","V6","V8"),
col=c("white","gray","black"),main="Performance cars by cylinders")
[Figure: pie chart “Performance cars by cylinders” with slices V4, V6, and V8]
Histogram: For a simple example of a histogram, consider the horsepower data of the 32 cars in mtcars, given
in the fourth column, named hp
mtcars$hp
## [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52
## [20] 65 97 150 150 245 175 66 91 113 264 175 335 109
hist(mtcars$hp)
[Figure: Histogram of mtcars$hp (x-axis: mtcars$hp, y-axis: Frequency)]
hist(mtcars$hp,breaks=seq(0,400,25),col="gray",main="Horsepower",xlab="HP")
abline(v=c(mean(mtcars$hp),median(mtcars$hp)),lty=c(2,3),lwd=2)
legend("topright",legend=c("mean HP","median HP"),lty=c(2,3),lwd=2)
[Figure: histogram “Horsepower” (x-axis: HP, y-axis: Frequency) with legend entries mean HP and median HP]
Let’s return to the built-in quakes data frame of the 1,000 seismic events near Fiji. For the sake of comparison,
you can examine both a histogram and a boxplot of the magnitudes of these events using default base R
behavior.
hist(quakes$mag)
[Figure: Histogram of quakes$mag (x-axis: quakes$mag, y-axis: Frequency)]
boxplot(quakes$mag)
[Figure: boxplot of quakes$mag]
Side-by-Side Boxplots
One particularly pleasing aspect of these plots is the ease with which you can compare the five-number
summary distributions of different groups with side-by-side boxplots.
stations.fac <- cut(quakes$stations,breaks=c(0,50,100,150))
boxplot(quakes$mag~stations.fac,
xlab="# stations detected",ylab="Magnitude",col="gray")
[Figure: side-by-side boxplots of Magnitude by # stations detected]
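The grouping step behind the side-by-side boxplots can be sketched on its own: cut() converts the numeric station counts into a factor of intervals, and each interval then gets its own five-number summary. The use of tapply() and fivenum() here is an illustrative addition.

```r
# cut() turns a numeric vector into a factor of intervals (lo, hi]
stations.fac <- cut(quakes$stations, breaks = c(0, 50, 100, 150))
print(levels(stations.fac))
print(table(stations.fac))      # number of events in each group
# each group's five-number summary is what one box in the plot displays
print(tapply(quakes$mag, stations.fac, fivenum))
```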
Scatterplots
A scatterplot is most frequently used to identify a relationship between the observed values of
two different numeric-continuous variables, displayed as x-y coordinate plots.
Another example: the famous iris data. Collected in the mid-1930s, this data frame of 150 rows and 5
columns consists of petal and sepal measurements for three species of perennial iris flowers: Iris setosa,
Iris virginica, and Iris versicolor.
iris[1:5,]
Single Plot
You can modify a simple scatterplot to split the plotted points according to a categorical variable, exposing
potential differences between any visible relationships with respect to the continuous variables. For example,
using base R graphics, you can examine the petal measurements according to the three species:
plot(iris[,4],iris[,3],type="n",xlab="Petal Width (cm)",
ylab="Petal Length (cm)")
points(iris[iris$Species=="setosa",4],
iris[iris$Species=="setosa",3],pch=19,col="black")
points(iris[iris$Species=="virginica",4],
iris[iris$Species=="virginica",3],pch=19,col="gray")
points(iris[iris$Species=="versicolor",4],
iris[iris$Species=="versicolor",3],pch=1,col="black")
legend("topleft",legend=c("setosa","virginica","versicolor"),
col=c("black","gray","black"),pch=c(19,19,1))
[Figure: scatterplot of Petal Length (cm) against Petal Width (cm) with legend entries setosa, virginica, and versicolor]
Matrix plots
The “single” type of planar scatterplot is really useful only when comparing two numeric-continuous variables.
When there are more continuous variables of interest, it isn’t possible to display this information satisfactorily
on a single plot. A simple and common solution is to generate a two-variable scatterplot for each pair of
variables and show them together in a structured way; this is referred to as a scatterplot matrix.
iris_pch <- rep(19,nrow(iris))
iris_pch[iris$Species=="versicolor"] <- 1
iris_col <- rep("black",nrow(iris))
iris_col[iris$Species=="virginica"] <- "gray"
#plot(iris[,4],iris[,3],col=iris_col,pch=iris_pch,
#xlab="Petal Width (cm)",ylab="Petal Length (cm)")
pairs(iris[,1:4],pch=iris_pch,col=iris_col,cex=0.75)
[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width]
To generate a scatterplot matrix in a ggplot2 style, it’s recommended that you install the GGally package.
#install.packages("GGally")
library("GGally")
[Figure: GGally plot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species, with overall and per-species correlations]
## [1] 0.004167619
X.prob <- dbinom(x=0:8,size=8,prob=1/6)
X.prob
## [1] 1
Rounding to three decimal places, the results are easier to read:
round(X.prob,3)
## [1] 0.233 0.372 0.260 0.104 0.026 0.004 0.000 0.000 0.000
The pbinom function
The other R functions for the binomial distribution work in much the same way. The first argument is always
the value (or values) of interest; n is supplied as size and p as prob. To find, for example, the probability
that you observe three or fewer 4s, P(X<=3), you either sum the relevant individual entries from dbinom as
earlier or use pbinom:
sum(dbinom(x=0:3,size=8,prob=1/6))
## [1] 0.9693436
pbinom(q=3,size=8,prob=1/6)
## [1] 0.9693436
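The complementary probability, P(X>3), can be sketched with the lower.tail argument of pbinom. The result names are illustrative.

```r
p_le3 <- pbinom(q = 3, size = 8, prob = 1/6)                      # P(X <= 3)
p_gt3 <- pbinom(q = 3, size = 8, prob = 1/6, lower.tail = FALSE)  # P(X > 3)
print(p_le3)
print(p_gt3)
print(p_le3 + p_gt3)   # the two tails cover every outcome
```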
The qbinom function
Less frequently used is the qbinom function, which is the inverse of pbinom. Where
pbinom provides a cumulative probability when given a quantile value q, the function qbinom provides a
quantile value when given a cumulative probability p. The discrete nature of a binomial random variable
means qbinom will return the nearest value of x below which p lies:
qbinom(p=0.95,size=8,prob=1/6)
## [1] 3
## [1] 2
rbinom(n=1,size=8,prob=1/6)
## [1] 2
rbinom(n=1,size=8,prob=1/6)
## [1] 1
rbinom(n=3,size=8,prob=1/6)
## [1] 1 3 0
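Since rbinom draws random values, repeated calls generally differ. A minimal sketch of making the draws reproducible with set.seed() follows; the seed value 42 is an arbitrary choice.

```r
set.seed(42)                                 # arbitrary seed, chosen for illustration
first <- rbinom(n = 3, size = 8, prob = 1/6)
set.seed(42)                                 # same seed again
second <- rbinom(n = 3, size = 8, prob = 1/6)
print(identical(first, second))              # same seed, same draws
```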
## [1] 0.2223249
barplot(dpois(x=0:10,lambda=3.22),ylim=c(0,0.25),space=0,
names.arg=0:10,ylab="Pr(X=x)",xlab="x")
[Figure: barplot of Pr(X=x) for x = 0 to 10]
# uniform distribution
dunif(x=c(-2,-0.33,0,0.5,1.05,1.2),min=-0.4,max=1.1)
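What this dunif call returns can be sketched as follows: the density is zero outside the interval from min to max and constant at 1/(max - min) inside it. The names dens and height are illustrative.

```r
dens <- dunif(x = c(-2, -0.33, 0, 0.5, 1.05, 1.2), min = -0.4, max = 1.1)
print(dens)
height <- 1 / (1.1 - (-0.4))   # constant density inside [min, max]
print(height)
```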