Introduction to Statistical Programming with R

Marina Krémé

This document reviews the basic notions you will need to complete the practical work that you will have to
do during the EMA. It is not a complete course in R. We invite you to also consult the following
reference: R for Beginners.
The documents that I used to prepare this course are:
• A First Course in Statistical Programming in R
• The Art of R Programming
I have attached a “memory aid” to this document to help you quickly find the names of functions.

Installation

Installing R on Windows

Choose the “Download R for Windows” tab. Click on the “base” tab, then “Download R 3.6.1 for Windows”
for an automatic installation. In most cases, the installation should work without problems whatever the
version of Windows (XP, Vista, 7, 8, 10). If, exceptionally, warning or error messages are displayed,
there may be special instructions to follow (see the tab “Does R run under my version of
Windows?”). You can also watch the following video: Installing R on Windows.

Installing R on Linux

Choose the tab Download R for Linux. The installation then depends on the distribution you use: debian,
redhat, suse or ubuntu. On Ubuntu, you must first add one of the following lines to your file
/etc/apt/sources.list (depending on your version of Ubuntu):
• deb https://cloud.r-project.org/bin/linux/ubuntu disco-cran35/
• deb https://cloud.r-project.org/bin/linux/ubuntu cosmic-cran35/
• deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/
• deb https://cloud.r-project.org/bin/linux/ubuntu xenial-cran35/
• deb https://cloud.r-project.org/bin/linux/ubuntu trusty-cran35/
Also add to the file /etc/apt/sources.list an entry for an Ubuntu mirror, which allows downloading some
Ubuntu programs needed by some R packages (the list of mirror sites for Ubuntu can be found here:
https://launchpad.net/ubuntu/+archivemirrors).
Then, in a terminal, you must execute the following commands:
• sudo apt-get update
• sudo apt-get install r-base

Installing R on macOS

Choose the Download R for Mac OS X tab and click on “R.3.6.1.pkg”. After downloading the file, the
installation is done automatically by double-clicking on the file, as shown in the following video: Installing R
on macOS.

A First R session
Let’s make a simple data set (in R parlance, a vector) consisting of the numbers 1, 2, 3, and 4, and name it x:
x<-c(1,2,3,4)

The standard assignment operator in R is <-. You can also use =, but this is discouraged, as it does not
work in some special situations. Note that there are no fixed types associated with variables. Here, we’ve
assigned a vector to x, but later we might assign something of a different type to it. We’ll look at vectors and
the other types in the sections that follow. The c stands for concatenate. Here, we are concatenating the
numbers 1, 2, 3, and 4. More precisely, we are concatenating four one-element vectors that consist of those
numbers. This is because any number is also considered to be a one-element vector. Now we can also do the following:
q<-c(x,x,8)

which sets q to (1,2,3,4,1,2,3,4,8) (yes, including the duplicates). Now let’s confirm that the data is really in x.
To print the vector to the screen, simply type its name. If you type any variable name (or, more generally, any
expression) while in interactive mode, R will print out the value of that variable (or expression). Programmers
familiar with other languages such as Python will find this feature familiar. For our example, enter this:
x

## [1] 1 2 3 4
Yep, sure enough, x consists of the numbers 1, 2, 3, and 4. Individual elements of a vector are accessed via [ ].
For example, x[3] refers to the third element of x.
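As a minimal illustration of single-element access (using the same values as x above; the name third is only for this sketch), note that R indexes from 1:

```r
x <- c(1, 2, 3, 4)
third <- x[3]   # R indexes from 1, so this is the third element
third
## [1] 3
```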
Subsetting is a very important operation on vectors. Here’s an example
x<-c(1,2,3,4)
x[2:3]

## [1] 2 3
The expression x[2:3] refers to the subvector of x consisting of elements 2 through 3, which are 2 and 3 here.
We can easily find the mean and standard deviation of our data set, as follows:
mean(x)

## [1] 2.5
sd(x)

## [1] 1.290994
If we want to save the computed mean in a variable instead of just printing it to the screen, we could
execute this code:
y<-mean(x)
y # print out y

## [1] 2.5
Finally, let’s do something with one of R’s internal data sets (these are used for demos). You can get a list of
these data sets by typing the following:

#data()

One of the data sets is called Nile and contains data on the flow of the Nile River. Let’s find the mean and
standard deviation of this data set:
mean(Nile)

## [1] 919.35
sd(Nile)

## [1] 169.2275
We can also plot a histogram of the data:
hist(Nile)

[Figure: “Histogram of Nile”; x-axis: Nile, y-axis: Frequency]
The call hist(z,breaks=12) would draw a histogram of the data set z with 12 bins. You can also create nicer
labels, make use of color, and make many other changes to create a more informative and eye-appealing
graph. When you become more familiar with R, you’ll be able to construct complex, rich color graphics of
striking beauty.
Well, that’s the end of our first, five-minute introduction to R. Quit R by calling the q() function (or
alternatively by pressing CTRL-D in Linux or CMD-D on a Mac).

Vectors

Adding, deleting vectors elements


The fundamental data type in R is the vector. You saw a few examples in the previous section, and now you’ll
learn the details.

Vectors are stored like arrays in C, contiguously, and thus you cannot insert or delete elements—something
you may be used to if you are a Python programmer. The size of a vector is determined at its creation, so if
you wish to add or delete elements, you’ll need to reassign the vector. For example, let’s add an element to
the middle of a four-element vector:
x<-c(1,2,12,89)
x<-c(x[1:3],400,x[4]) #insert 400 before 89
x

## [1] 1 2 12 400 89

Obtaining the length of a vector

x<-c(1,2,3)
length(x)

## [1] 3
C<-c()
C

## NULL
length(C)

## [1] 0

Recycling
When applying an operation to two vectors that requires them to be the same length, R automatically
recycles, or repeats, the shorter one, until it is long enough to match the longer one. Here is an example:
c(1,2,3) + c(8,10,9,20,2)

## Warning in c(1, 2, 3) + c(8, 10, 9, 20, 2): longer object length is not a
## multiple of shorter object length
## [1] 9 12 12 21 4
The shorter vector was recycled, so the operation was taken to be as follows:
c(1,2,3,1,2) + c(8,10,9,20,2)

## [1] 9 12 12 21 4
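The warning above appears only because 5 is not a multiple of 3. As a small sketch (the vector names short and long are illustrative), recycling is silent when the longer length is a multiple of the shorter one:

```r
short <- c(1, 2)
long <- c(10, 20, 30, 40)
result <- short + long   # short is recycled to c(1, 2, 1, 2), with no warning
result
## [1] 11 22 31 42
```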

Vector arithmetic and logical operations


Remember that R is a functional language. Every operator, including + in the following example, is actually
a function:
2+3

## [1] 5
"+"(2,3)

## [1] 5
If you are familiar with linear algebra, you may be surprised at what happens when we multiply two vectors:
x<-c(1,2,3)
y<-c(4,5,6)
x*y

## [1] 4 10 18
But remember, because of the way the function is applied, the multiplication is done element by element.
The same principle applies to other numeric operators. Here’s an example:
x/y

## [1] 0.25 0.40 0.50


x%%y

## [1] 1 2 3
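For readers who expected the linear-algebra product, here is a hedged sketch of how the inner (dot) product can be obtained from the same two vectors; the variable names are illustrative only:

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)
elementwise <- x * y     # element by element: 4 10 18
dot <- sum(x * y)        # inner product as a plain number: 4 + 10 + 18
dot_matrix <- x %*% y    # matrix-multiplication operator; gives a 1 x 1 matrix
dot
## [1] 32
```

The %*% operator is the same matrix-multiplication operator used later in the matrix section.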

Vector indexing
One of the most important and frequently used operations in R is that of indexing vectors, in which we form
a subvector by picking elements of the given vector for specific indices:
y<-c(1,10,5,8,9,4,0,2,1)
y[(c(1,3))] #extract elements 1 and 3 of y

## [1] 1 5
y[2:3]

## [1] 10 5
z<-3:5
y[z]

## [1] 5 8 9
Note that duplicates are allowed
x <- c(4,2,17,5)
y <- x[c(1,1,3)]
y

## [1] 4 4 17
Negative subscripts mean that we want to exclude the given elements in our output.
z<-c(8,10,11,9)
z[-1]#exclude element 1

## [1] 10 11 9
z[-1:-2] #exclude elements 1 through 2

## [1] 11 9
In such contexts, it is often useful to use the length() function. For instance, suppose we wish to pick up all
elements of a vector z except for the last. The following code will do just that:

z<-c(3,4,27)
z[1:(length(z)-1)] # parentheses are needed: 1:length(z)-1 would mean (1:length(z))-1

## [1] 3 4
or more simply
z[-length(z)]

## [1] 3 4
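The following sketch puts the approaches side by side and adds one more option, head() with a negative count, which is part of base R. The variable names a, b, and d are illustrative:

```r
z <- c(3, 4, 27)
a <- z[1:(length(z) - 1)]   # explicit parentheses: indices 1 and 2
b <- z[-length(z)]          # negative index: drop the last element
d <- head(z, -1)            # head() with a negative n drops the last n elements
a
## [1] 3 4
```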

Other ways to generate vectors

5:10 # using the : operator

## [1] 5 6 7 8 9 10
i<-2
1:i-1 # this means (1:i) -1, not 1:(i-1)

## [1] 0 1
1:(i-1)

## [1] 1
A generalization of : is the seq() (or sequence) function, which generates a sequence in arithmetic progression.
seq(from=10, to=20,by=2)

## [1] 10 12 14 16 18 20
The spacing can be a noninteger value, too. Here, asking for 12 equally spaced values between 1.1 and 2 gives a spacing of about 0.082.
seq(from=1.1,to=2,length=12)

## [1] 1.100000 1.181818 1.263636 1.345455 1.427273 1.509091 1.590909 1.672727


## [9] 1.754545 1.836364 1.918182 2.000000
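To get a spacing of exactly 0.1 over the same range, one option is to ask for 10 values instead of 12. This is a sketch; length.out is the full name of the argument that length= abbreviates in the call above:

```r
s <- seq(from = 1.1, to = 2, length.out = 10)
steps <- diff(s)   # differences between consecutive values, all about 0.1
length(s)
## [1] 10
```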

Using all() and any()


The any() and all() functions are handy shortcuts. They report whether any or all of their arguments are
TRUE
x<-1:10
any(x>7)

## [1] TRUE
any(x>90)

## [1] FALSE
all(x>0)

## [1] TRUE

Using NA
In many of R’s statistical functions, we can instruct the function to skip over any missing values, or NAs.
Here is an example
x <- c(88,NA,12,168,13)
mean(x)

## [1] NA
mean(x,na.rm =T)

## [1] 70.25
In the first call, mean() refused to calculate, as one value in x was NA. But by setting the optional argument
na.rm (NA remove) to true (T), we calculated the mean of the remaining elements.
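When a function has no na.rm argument, the missing values can be located and removed by hand. A minimal sketch using is.na() from base R (the names missing, n_missing, and x_clean are illustrative):

```r
x <- c(88, NA, 12, 168, 13)
missing <- is.na(x)          # TRUE where a value is missing
n_missing <- sum(missing)    # TRUE counts as 1, so this counts the NAs
x_clean <- x[!missing]       # keep only the non-missing values
m <- mean(x_clean)           # same result as mean(x, na.rm = TRUE)
m
## [1] 70.25
```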

Using NULL
One use of NULL is to build up vectors in loops, in which each iteration adds another element to the vector.
In this simple example, we build up a vector of even numbers:
# build up a vector of the even numbers in 1:10
z <- NULL
for (i in 1:10) if (i %%2 == 0) z <- c(z,i)
z

## [1] 2 4 6 8 10
Thus the example loop starts with a NULL vector and then adds the element 2 to it, then 4, and so on. But
the point here is to demonstrate the difference between NA and NULL. If we were to use NA instead of
NULL in the preceding example, we would pick up an unwanted NA:
z <- NA
for (i in 1:10) if (i %%2 == 0) z <- c(z,i)
z

## [1] NA 2 4 6 8 10
NULL values really are counted as nonexistent, as you can see here:
u<-NULL
length(u)

## [1] 0
v<-NA
length((v))

## [1] 1
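The loop above is useful for showing how NULL behaves, but the same vector of even numbers can also be built without a loop. A sketch of two vectorized alternatives (variable names are illustrative):

```r
candidates <- 1:10
evens_filter <- candidates[candidates %% 2 == 0]   # keep elements with remainder 0
evens_seq <- seq(from = 2, to = 10, by = 2)        # arithmetic progression
evens_filter
## [1]  2  4  6  8 10
```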

Filtering indices

z <- c(5,2,-3,68)
w<-z[z*z>8] # simple example
w

## [1] 5 -3 68
"<"(2,1) # comparison

## [1] FALSE
x<-1:10
x[x>3]<-0 #x in which we wish to replace all elements larger than a 3 with a 0.
x

## [1] 1 2 3 0 0 0 0 0 0 0
#which
z <- c(5,2,-3,8)
which(z*z > 8) # find the positions within z at which the condition occurs.

## [1] 1 3 4
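Filtering with a logical vector and filtering through which() give the same result on complete data, but they differ when the vector contains NA. The sketch below uses a made-up vector to show the difference:

```r
z <- c(5, NA, -3, 8)
by_logical <- z[z > 0]        # the NA comparison yields NA, which is kept in the result
by_which <- z[which(z > 0)]   # which() returns only positions where the test is TRUE
by_logical
## [1]  5 NA  8
by_which
## [1] 5 8
```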

Matrices and arrays


To create a matrix in R, use the aptly named matrix command, providing the entries of the matrix to the
data argument as a vector:
A <- matrix(data=c(-3,2,83,0.1),nrow=2,ncol=2)
A

## [,1] [,2]
## [1,] -3 83.0
## [2,] 2 0.1
It’s important to be aware of how R fills up the matrix using the entries from data. Looking at the previous
example, you can see that the 2 × 2 matrix A has been filled in a column-by-column fashion when reading
the data entries from left to right. You can control how R fills in data using the argument byrow, as shown
in the following examples:
matrix(data=c(1,2,3,4,5,6),nrow=2,ncol=3,byrow=FALSE)

## [,1] [,2] [,3]


## [1,] 1 3 5
## [2,] 2 4 6
matrix(data=c(1,2,3,4,5,6),nrow=2,ncol=3,byrow=TRUE) # same call, but with byrow=TRUE

## [,1] [,2] [,3]


## [1,] 1 2 3
## [2,] 4 5 6
If you have multiple vectors of equal length, you can quickly build a matrix by binding together these vectors
using the built-in R functions, rbind and cbind.
rbind(1:3,4:6)

## [,1] [,2] [,3]


## [1,] 1 2 3
## [2,] 4 5 6
Here, rbind has bound together the vectors as two rows of a matrix, with the top-to-bottom order of the rows
matching the order of the vectors supplied to rbind.

cbind(c(1,4),c(2,5),c(3,6))

## [,1] [,2] [,3]


## [1,] 1 2 3
## [2,] 4 5 6
Another useful function, dim, provides the dimensions of a matrix stored in your workspace.
mymat <- rbind(c(1,3,4),5:3,c(100,20,90),11:13)
mymat

## [,1] [,2] [,3]


## [1,] 1 3 4
## [2,] 5 4 3
## [3,] 100 20 90
## [4,] 11 12 13
dim(mymat) # dimension of matrix

## [1] 4 3
nrow(mymat)

## [1] 4
ncol(mymat)

## [1] 3
dim(mymat)[2]

## [1] 3

Row, column and diagonal extractions

A <- matrix(c(0.3,4.5,55.3,91,0.1,105.5,-4.2,8.2,27.9),nrow=3,ncol=3)
A

## [,1] [,2] [,3]


## [1,] 0.3 91.0 -4.2
## [2,] 4.5 0.1 8.2
## [3,] 55.3 105.5 27.9
A[3,2] # the element in the third row and second column of A

## [1] 105.5
A[1,] # the first row

## [1] 0.3 91.0 -4.2


A[2:3,] # the second and third rows

## [,1] [,2] [,3]


## [1,] 4.5 0.1 8.2
## [2,] 55.3 105.5 27.9
A[c(3,1),2:3] # rows 3 and 1 (in that order), columns 2 and 3

## [,1] [,2]
## [1,] 105.5 27.9
## [2,] 91.0 -4.2
diag(x=A) # identify the values along the diagonal of A

## [1] 0.3 0.1 27.9


A[,-2] # A without its second column

## [,1] [,2]
## [1,] 0.3 -4.2
## [2,] 4.5 8.2
## [3,] 55.3 27.9
A[-1,3:2] # removes the first row from A and retrieves the third and second column values

## [,1] [,2]
## [1,] 8.2 0.1
## [2,] 27.9 105.5
A[-1,-2] # A without its first row and second column

## [,1] [,2]
## [1,] 4.5 8.2
## [2,] 55.3 27.9
A[-1,-c(2,3)] # deletes the first row and then deletes the second and third columns

## [1] 4.5 55.3
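Notice that the last extraction returned a plain vector rather than a matrix. If a matrix result is needed, base R's drop=FALSE argument to [ keeps the dimensions. A short sketch with the same A (the names v and m are illustrative):

```r
A <- matrix(c(0.3,4.5,55.3,91,0.1,105.5,-4.2,8.2,27.9),nrow=3,ncol=3)
v <- A[-1,-c(2,3)]              # dimensions are dropped: a plain vector
m <- A[-1,-c(2,3),drop=FALSE]   # dimensions are kept: a 2 x 1 matrix
dim(v)
## NULL
dim(m)
## [1] 2 1
```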

Matrix Operations and Algebra

A <- rbind(c(2,5,2),c(6,1,4))
A

## [,1] [,2] [,3]


## [1,] 2 5 2
## [2,] 6 1 4
t(A) # transpose of A

## [,1] [,2]
## [1,] 2 6
## [2,] 5 1
## [3,] 2 4
I <- diag(x=3) # identity matrix
B <- matrix(data=c(3,4,1,2),nrow=2,ncol=2)
solve(B)#inverse

## [,1] [,2]
## [1,] 1 -0.5
## [2,] -2 1.5
B%*%solve(B) #to verify

## [,1] [,2]

## [1,] 1 0
## [2,] 0 1
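Besides inverting a matrix, solve() can also solve a linear system directly when given a right-hand side. The sketch below uses the same B; the right-hand side rhs is a made-up example:

```r
B <- matrix(data=c(3,4,1,2),nrow=2,ncol=2)
rhs <- c(5,6)            # hypothetical right-hand side
sol <- solve(B,rhs)      # solves B %*% sol = rhs without forming the inverse
check <- B %*% sol       # should reproduce rhs
sol
## [1]  2 -1
```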

Multidimensional Arrays

D <- array(data=1:24,dim=c(3,4,2)) #
D

## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 13 16 19 22
## [2,] 14 17 20 23
## [3,] 15 18 21 24
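Elements of an array are extracted like those of a matrix, with one index per dimension (row, column, layer). A minimal sketch on the same D, with illustrative variable names:

```r
D <- array(data=1:24,dim=c(3,4,2))
one_value <- D[2,3,1]   # row 2, column 3, layer 1
layer2 <- D[,,2]        # the whole second layer, a 3 x 4 matrix
col1_l2 <- D[,1,2]      # first column of the second layer
one_value
## [1] 8
col1_l2
## [1] 13 14 15
```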

Lists and DataFrames

Definition and components access


Creating a list is much like creating a vector. You supply the elements that you want to include to the list
function, separated by commas:
foo <- list(matrix(data=1:4,nrow=2,ncol=2),c(T,F,T,T),"hello")
foo

## [[1]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## [[2]]
## [1] TRUE FALSE TRUE TRUE
##
## [[3]]
## [1] "hello"
length(x=foo)

## [1] 3
You can retrieve components from a list using indexes, which are entered in double square brackets.
foo[[1]]

## [,1] [,2]
## [1,] 1 3

## [2,] 2 4
foo[[3]]

## [1] "hello"
This action is known as a member reference. When you’ve retrieved a component this way, you can treat it
just like a stand-alone object in the workspace; there’s nothing special that needs to be done.
foo[[1]] + 5.5

## [,1] [,2]
## [1,] 6.5 8.5
## [2,] 7.5 9.5
foo[[1]][1,2]

## [1] 3
foo[[1]][2,]

## [1] 2 4
cat(foo[[3]],"you!")

## hello you!
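Single square brackets also work on lists, but they return a sub-list rather than the member itself. The following sketch contrasts the two forms on the same foo (the names sub, member, and both are illustrative):

```r
foo <- list(matrix(data=1:4,nrow=2,ncol=2),c(TRUE,FALSE,TRUE,TRUE),"hello")
sub <- foo[3]         # a list of length 1 that contains the string
member <- foo[[3]]    # the string itself
both <- foo[c(1,3)]   # single brackets can select several members at once
class(sub)
## [1] "list"
class(member)
## [1] "character"
```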

Naming
You can name list components to make the elements more recognizable and easy to work with.
names(foo) <- c("mymatrix","mylogicals","mystring")
foo

## $mymatrix
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## $mylogicals
## [1] TRUE FALSE TRUE TRUE
##
## $mystring
## [1] "hello"
This has changed how the object is printed to the console. Where earlier it printed [[1]], [[2]], and [[3]]
before each component, now it prints the names you specified: $mymatrix, $mylogicals, and $mystring.
You can now perform member referencing using these names and the dollar operator, rather than the double
square brackets.
foo$mymatrix

## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
This is the same as calling foo[[1]]. In fact, even when an object is named, you can still use the numeric index
to obtain a member.

foo[[1]]

## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
Subsetting named members also works the same way.
all(foo$mymatrix[,2]==foo[[1]][,2])

## [1] TRUE
To name the components of a list as it’s being created, assign a label to each component in the list command.
Using some components of foo, create a new, named list.
b <- list(tom=c(foo[[2]],T,T,T,F),dick="g'day mate",harry=foo$mymatrix*2)
b

## $tom
## [1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
##
## $dick
## [1] "g'day mate"
##
## $harry
## [,1] [,2]
## [1,] 2 6
## [2,] 4 8
names(b)

## [1] "tom" "dick" "harry"

Nesting
A member of a list can itself be a list. When nesting lists like this, it’s important to keep
track of the depth of any member for subsetting or extraction later. Note that you can add components to
any existing list by using the dollar operator and a new name. Here’s an example using foo and b from
earlier:
b$bobby <- foo
b

## $tom
## [1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
##
## $dick
## [1] "g'day mate"
##
## $harry
## [,1] [,2]
## [1,] 2 6
## [2,] 4 8
##
## $bobby
## $bobby$mymatrix
## [,1] [,2]

## [1,] 1 3
## [2,] 2 4
##
## $bobby$mylogicals
## [1] TRUE FALSE TRUE TRUE
##
## $bobby$mystring
## [1] "hello"
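To reach a member of the inner list, chain the references one level at a time. This sketch rebuilds foo and b as above so that it runs on its own; inner and same are illustrative names:

```r
foo <- list(mymatrix=matrix(data=1:4,nrow=2,ncol=2),
            mylogicals=c(TRUE,FALSE,TRUE,TRUE),
            mystring="hello")
b <- list(tom=c(foo[[2]],TRUE,TRUE,TRUE,FALSE),dick="g'day mate",harry=foo$mymatrix*2)
b$bobby <- foo
inner <- b$bobby$mymatrix[2,1]        # list b, then list bobby, then a matrix element
same <- b[["bobby"]][["mystring"]]    # the same chaining with double square brackets
inner
## [1] 2
```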

DataFrames
To create a data frame from scratch, use the data.frame function. You supply your data, grouped by variable,
as vectors of the same length—the same way you would construct a named list. Consider the following
example data set:
mydata <- data.frame(person=c("Peter","Lois","Meg","Chris","Stewie"),
age=c(42,40,17,14,1),
sex=factor(c("M","F","F","M","M")))
mydata

## person age sex


## 1 Peter 42 M
## 2 Lois 40 F
## 3 Meg 17 F
## 4 Chris 14 M
## 5 Stewie 1 M
mydata[2,2]

## [1] 40
mydata[3:5,3] #Now extract the third, fourth, and fifth elements of the third column

## [1] F M M
## Levels: F M
mydata[,c(3,1)] #extracts the entire third and first columns

## sex person
## 1 M Peter
## 2 F Lois
## 3 F Meg
## 4 M Chris
## 5 M Stewie
This results in another data frame giving the sex and then the name of each person. You can also use the
names of the vectors that were passed to data.frame to access variables even if you don’t know their column
index positions, which can be useful for large data sets. You use the same dollar operator you used for
member-referencing named lists.
mydata$age

## [1] 42 40 17 14 1
mydata$person

## [1] Peter Lois Meg Chris Stewie

## Levels: Chris Lois Meg Peter Stewie
Note that person was stored as a factor here. This is the default in R versions before 4.0.0 (including the
3.6.1 release used in this document); from R 4.0.0 onward, character columns stay as character vectors. To
keep person as a character vector in any version, set stringsAsFactors=FALSE:
mydata <- data.frame(person=c("Peter","Lois","Meg","Chris","Stewie"),
age=c(42,40,17,14,1),sex=factor(c("M","F","F","M","M")),
stringsAsFactors=FALSE)
mydata

## person age sex


## 1 Peter 42 M
## 2 Lois 40 F
## 3 Meg 17 F
## 4 Chris 14 M
## 5 Stewie 1 M

Adding Data Columns and Combining Data Frames

newrecord <- data.frame(person="Brian",age=7, sex=factor("M",levels=levels(mydata$sex)))


newrecord

## person age sex


## 1 Brian 7 M
mydata <- rbind(mydata,newrecord) # append the new record as a row
mydata

## person age sex


## 1 Peter 42 M
## 2 Lois 40 F
## 3 Meg 17 F
## 4 Chris 14 M
## 5 Stewie 1 M
## 6 Brian 7 M
Using rbind, you combined mydata with the new record and overwrote mydata with the result. Adding a
variable to a data frame is also quite straightforward. Let’s say you’re now given data on the classification of
how funny these six individuals are, defined as a “degree of funniness.” The degree of funniness can take
three possible values: Low, Med (medium), and High. Suppose Peter, Lois, and Stewie have a high degree of
funniness, Chris and Brian have a medium degree of funniness, and Meg has a low degree of funniness. In R,
you’d have a factor vector like this:
funny <- c("High","High","Low","Med","High","Med")
funny <- factor(x=funny,levels=c("Low","Med","High"))
funny

## [1] High High Low Med High Med


## Levels: Low Med High
The first line creates the basic character vector as funny, and the second line overwrites funny by turning it
into a factor. The order of these elements must correspond to the records in your data frame. Now, you can
simply use cbind to append this factor vector as a column to the existing mydata:
mydata <- cbind(mydata,funny)
mydata

## person age sex funny


## 1 Peter 42 M High

## 2 Lois 40 F High
## 3 Meg 17 F Low
## 4 Chris 14 M Med
## 5 Stewie 1 M High
## 6 Brian 7 M Med
The rbind and cbind functions aren’t the only ways to extend a data frame. One useful alternative for adding
a variable is to use the dollar operator, much like adding a new member to a named list, as in the Nesting
section above. Suppose now you want to add another variable to mydata by including a column with the age of the
individuals in months, not years, calling this new variable age.mon:
mydata$age.mon <- mydata$age*12
mydata

## person age sex funny age.mon


## 1 Peter 42 M High 504
## 2 Lois 40 F High 480
## 3 Meg 17 F Low 204
## 4 Chris 14 M Med 168
## 5 Stewie 1 M High 12
## 6 Brian 7 M Med 84
Suppose you want to examine all records corresponding to males. The following comparison flags the relevant rows:
mydata$sex=="M"

## [1] TRUE FALSE FALSE TRUE TRUE TRUE


Using this logical vector as the row index returns data for all variables for only the male participants. You
can use the same behavior to pick and choose which variables to return in the subset.
mydata[mydata$sex=="M",]

## person age sex funny age.mon


## 1 Peter 42 M High 504
## 4 Chris 14 M Med 168
## 5 Stewie 1 M High 12
## 6 Brian 7 M Med 84
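As a sketch of picking both rows and variables at once, the example below rebuilds a reduced version of mydata (without the funny and age.mon columns) so that it runs on its own. The age cutoff of 18 is an arbitrary choice for illustration:

```r
mydata <- data.frame(person=c("Peter","Lois","Meg","Chris","Stewie","Brian"),
                     age=c(42,40,17,14,1,7),
                     sex=factor(c("M","F","F","M","M","M")),
                     stringsAsFactors=FALSE)
# rows: males only; columns: selected by name
males <- mydata[mydata$sex=="M",c("person","age")]
# conditions can be combined with & (and) or | (or)
adult_women <- mydata[mydata$sex=="F" & mydata$age>=18,"person"]
adult_women
## [1] "Lois"
```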

Conditions and loops


To write more sophisticated programs with R, you’ll need to control the flow and order of execution in your
code. One fundamental way to do this is to make the execution of certain sections of code dependent on a
condition.

if statements
if (condition){ do any code here }
a <- 3
mynumber <-4

if(a<=mynumber){
  a <- a^2
}
a

## [1] 9

else statements
The if statement executes a chunk of code if and only if a defined condition is TRUE. If you want something
different to happen when the condition is FALSE, you can add an else declaration.
if(condition){ do any code in here if condition is TRUE } else { do any code in here if condition is FALSE }
if(a<=mynumber){
  cat("Condition was",a<=mynumber)
  a <- a^2
} else {
  cat("Condition was",a<=mynumber)
  a <- a-3.5
}

## Condition was FALSE


a

## [1] 5.5

Nesting and Stacking Statements


An if statement can itself be placed within the outcome of another if statement. By nesting or stacking
several statements, you can weave intricate paths of decision-making by checking a number of conditions at
various stages during execution.
if(a<=mynumber){
  cat("First condition was TRUE\n")
  a <- a^2
  if(mynumber>3){
    cat("Second condition was TRUE")
    b <- seq(1,a,length=mynumber)
  } else {
    cat("Second condition was FALSE")
    b <- a*mynumber
  }
} else {
  cat("First condition was FALSE\n")
  a <- a-3.5
  if(mynumber>=4){
    cat("Second condition was TRUE")
    b <- a^(3-mynumber)
  } else {
    cat("Second condition was FALSE")
    b <- rep(a+mynumber,times=3)
  }
}

## First condition was FALSE
## Second condition was TRUE
a

## [1] 2
b

## [1] 0.5
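The example above nests statements. Stacking, by contrast, chains conditions with else if so that they are checked in order and only the first matching branch runs. A minimal sketch, where the thresholds and the label variable are made up for illustration:

```r
mynumber <- 4
if(mynumber<0){
  label <- "negative"
} else if(mynumber==0){
  label <- "zero"
} else if(mynumber<=10){
  label <- "small positive"   # first condition that is TRUE for 4
} else {
  label <- "large positive"
}
label
## [1] "small positive"
```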

For Loops
The R for loop always takes the following general form:
for(loopindex in loopvector){ do any code in here }
for(myitem in 5:7){
cat("--BRACED AREA BEGINS--\n")
cat("the current item is",myitem,"\n")
cat("--BRACED AREA ENDS--\n\n")
}

## --BRACED AREA BEGINS--


## the current item is 5
## --BRACED AREA ENDS--
##
## --BRACED AREA BEGINS--
## the current item is 6
## --BRACED AREA ENDS--
##
## --BRACED AREA BEGINS--
## the current item is 7
## --BRACED AREA ENDS--
counter <- 0
for(myitem in 5:7){
counter <- counter+1
cat("The item in run",counter,"is",myitem,"\n")
}

## The item in run 1 is 5


## The item in run 2 is 6
## The item in run 3 is 7
Looping via Index or Value
myvec <- c(0.4,1.1,0.34,0.55)
for(i in myvec){
print(2*i)
}

## [1] 0.8
## [1] 2.2
## [1] 0.68
## [1] 1.1

for(i in 1:length(myvec)){
print(2*myvec[i])
}

## [1] 0.8
## [1] 2.2
## [1] 0.68
## [1] 1.1
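When looping by index, 1:length(myvec) misbehaves if the vector happens to be empty, because 1:0 counts down. Base R's seq_along() avoids this. The sketch below uses an empty vector to show the difference; runs is an illustrative counter:

```r
empty <- c()
bad_indices <- 1:length(empty)     # 1:0, that is, the two values 1 and 0
good_indices <- seq_along(empty)   # an empty index vector

runs <- 0
for(i in seq_along(empty)){
  runs <- runs+1                   # never executed for an empty vector
}
bad_indices
## [1] 1 0
runs
## [1] 0
```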

Functions
A function definition always follows this standard format:
functionname <- function(arg1,arg2,arg3,...){
  do any code in here when called
  return(returnobject)
}
example: Fibonacci sequence generator
myfib <- function(){
fib.a <- 1
fib.b <- 1
cat(fib.a,", ",fib.b,", ",sep="")
repeat{
temp <- fib.a+fib.b
fib.a <- fib.b
fib.b <- temp
cat(fib.b,", ",sep="")
if(fib.b>150){
cat("BREAK NOW...")
break
}
}
}
myfib()

## 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, BREAK NOW...

Adding arguments
Rather than stopping at a fixed value, let’s add an argument, thresh, that controls the threshold at which the
sequence stops. Consider the following new function, myfib2, with this modification:
myfib2 <- function(thresh){
fib.a <- 1
fib.b <- 1
cat(fib.a,", ",fib.b,", ",sep="")
repeat{
temp <- fib.a+fib.b
fib.a <- fib.b
fib.b <- temp
cat(fib.b,", ",sep="")
if(fib.b>thresh){
cat("BREAK NOW...")

break
}
}
}
myfib2(thresh=150)

## 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, BREAK NOW...


myfib2(1000000)

## 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711,
If you want to use the results of a function in future operations (rather than just printing output to the
console), you need to return content to the user. Continuing with the current example, here’s a Fibonacci
function that stores the sequence in a vector and returns it
myfib3 <- function(thresh){
fibseq <- c(1,1)
counter <- 2
repeat{
fibseq <- c(fibseq,fibseq[counter-1]+fibseq[counter])
counter <- counter+1
if(fibseq[counter]>thresh){
break
}
}
return(fibseq)
}
myfib3(150)

## [1] 1 1 2 3 5 8 13 21 34 55 89 144 233


foo <- myfib3(10000)
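Because myfib3 returns its result, the stored vector can be used like any other. The sketch below repeats the definition so that it runs on its own; n_terms, last_term, and even_terms are illustrative names:

```r
myfib3 <- function(thresh){
  fibseq <- c(1,1)
  counter <- 2
  repeat{
    fibseq <- c(fibseq,fibseq[counter-1]+fibseq[counter])
    counter <- counter+1
    if(fibseq[counter]>thresh){
      break
    }
  }
  return(fibseq)
}
foo <- myfib3(10000)
n_terms <- length(foo)               # how many terms were generated
last_term <- foo[length(foo)]        # the first term above the threshold
even_terms <- foo[foo %% 2 == 0]     # filter the returned vector
last_term
## [1] 10946
```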

Using return
If there’s no return statement inside a function, the function will end when the last line in the body code has
been run, at which point it will return the most recently assigned or created object in the function. If nothing
is created, such as in myfib and myfib2 from earlier, the function returns NULL. To demonstrate this
point, enter the following two dummy functions in the editor:
dummy1 <- function(){
aa <- 2.5
bb <- "string me along"
cc <- "string 'em up"
dd <- 4:8
}
dummy2 <- function(){
aa <- 2.5
bb <- "string me along"
cc <- "string 'em up"
dd <- 4:8
return(dd)

}

The first function, dummy1, simply assigns four different objects in its lexical environment (not the global
environment) and doesn’t explicitly return anything. On the other hand, dummy2 creates the same four
objects and explicitly returns the last one, dd. If you import and run the two functions, both provide the
same return object.
foo <- dummy1()
foo

## [1] 4 5 6 7 8
bar <- dummy2()
bar

## [1] 4 5 6 7 8
A function will end as soon as it evaluates a return command, without executing any remaining code in the
function body. To emphasize this, consider one more version of the dummy function:
dummy3 <- function(){
aa <- 2.5
bb <- "string me along"
return(aa)
cc <- "string 'em up"
dd <- 4:8
return(bb)
}

Here, dummy3 has two calls to return: one in the middle and one at the end. But when you import and
execute the function, it returns only one value.
baz <- dummy3()
baz

## [1] 2.5
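One practical use of this behavior is an early exit when the input cannot be handled. The function below, safe_ratio, is a hypothetical helper written only for this sketch:

```r
safe_ratio <- function(num,den){
  if(den==0){
    return(NA)   # exit immediately; the division below is never reached
  }
  num/den        # value of the last evaluated expression is returned
}
safe_ratio(6,3)
## [1] 2
safe_ratio(1,0)
## [1] NA
```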

Basic plots
The easiest way to think about generating plots in R is to treat your screen as a blank, two-dimensional
canvas. You can plot points and lines using x- and y-coordinates. On paper, these coordinates are usually
represented with points written as a pair: (x value, y value). The R function plot, on the other hand, takes
in two vectors—one vector of x locations and one vector of y locations—and opens a graphics device where it
displays the result.
x <- c(1.1,2,3.5,3.9,4.2)
y<-c(2,2.2,-1.3,0,0.2)
plot(x,y)

[Figure: scatterplot of y against x]

Graphics parameters
• type : Tells R how to plot the supplied coordinates (for example, as stand-alone points or joined by
lines or both dots and lines).
• main, xlab, ylab: Options to include plot title, the horizontal axis label, and the vertical axis label,
respectively.
• col: Color (or colors) to use for plotting points and lines.
• pch: Stands for point character. This selects which character to use for plotting individual points.
• cex: Stands for character expansion. This controls the size of plotted point characters.
• lty: Stands for line type. This specifies the type of line to use to connect the points (for example, solid,
dotted, or dashed).
• lwd: Stands for line width. This controls the thickness of plotted lines.
• xlim, ylim: This provides limits for the horizontal range and vertical range (respectively) of the
plotting region
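Several of these parameters can be combined in one call. The following is a sketch with arbitrarily chosen values (the title, color, and axis limits are illustrative), using the same x and y as above:

```r
x <- c(1.1,2,3.5,3.9,4.2)
y <- c(2,2.2,-1.3,0,0.2)
plot(x,y,
     type="b",                      # both points and lines
     main="Parameters combined",    # plot title
     xlab="x",ylab="y",             # axis labels
     col="blue",                    # color of points and lines
     pch=19,cex=1.5,                # filled circles, drawn 1.5 times larger
     lty=2,lwd=2,                   # dashed line, double width
     xlim=c(0,5),ylim=c(-2,3))      # axis ranges
```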

Automatic plot types

plot(x,y,type="l")

[Figure: line plot of y against x]

Titles and axis labels
plot(x,y,type="b",main="My lovely plot",xlab="x axis label",
ylab="location y")

[Figure: “My lovely plot”; x-axis: x axis label, y-axis: location y]
plot(x,y,type="b",main="My lovely plot\ntitle on two lines",xlab="",
ylab="")

[Figure: plot with the two-line title “My lovely plot / title on two lines” and no axis labels]

Color


Adding color to a graph is far from just an aesthetic consideration. Color can make data much clearer—for
example by distinguishing factor levels or emphasizing important numeric limits. You can set colors with the
col parameter in a number of ways. The simplest options are to use an integer selector or a character string.
plot(x,y,type="b",main="My lovely plot",xlab="",ylab="",col=2)

[Figure: “My lovely plot” drawn with col=2]


plot(x,y,type="b",main="My lovely plot",xlab="",ylab="",col="seagreen4")

[Figure: “My lovely plot” drawn with col="seagreen4"]

The ggplot2 Package

#install.packages("ggplot2")
xx<- c(1.1,2,3.5,3.9,4.2)
yy<- c(2,2.2,-1.3,0,0.2)

#qplot(xx,yy)

Elementary statistics
Take the data frame chickwts, which is available in the automatically loaded datasets package. At the
prompt, directly entering the following gives you the first five records of this data set.
chickwts[1:5,]

## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
R’s help file (?chickwts) describes these data as comprising the weights of 71 chicks (in grams) after six weeks,
based on the type of food provided to them. Now let’s take a look at the two columns in their entirety as
vectors:
chickwts$weight

## [1] 179 160 136 227 217 168 108 124 143 140 309 229 181 141 260 203 148 169 213
## [20] 257 244 271 243 230 248 327 329 250 193 271 316 267 199 171 158 248 423 340

## [39] 392 339 341 226 320 295 334 322 297 318 325 257 303 315 380 153 263 242 206
## [58] 344 258 368 390 379 260 404 318 352 359 216 222 283 332
chickwts$feed

## [1] horsebean horsebean horsebean horsebean horsebean horsebean horsebean


## [8] horsebean horsebean horsebean linseed linseed linseed linseed
## [15] linseed linseed linseed linseed linseed linseed linseed
## [22] linseed soybean soybean soybean soybean soybean soybean
## [29] soybean soybean soybean soybean soybean soybean soybean
## [36] soybean sunflower sunflower sunflower sunflower sunflower sunflower
## [43] sunflower sunflower sunflower sunflower sunflower sunflower meatmeal
## [50] meatmeal meatmeal meatmeal meatmeal meatmeal meatmeal meatmeal
## [57] meatmeal meatmeal meatmeal casein casein casein casein
## [64] casein casein casein casein casein casein casein
## [71] casein
## Levels: casein horsebean linseed meatmeal soybean sunflower
Another example:
quakes[1:5,]

## lat long depth mag stations


## 1 -20.42 181.62 562 4.8 41
## 2 -20.62 181.03 650 4.2 15
## 3 -26.00 184.10 42 5.4 43
## 4 -17.97 181.66 626 4.1 19
## 5 -20.42 181.96 649 4.0 11
If you look at the first five records and read the descriptions in the help file ?quakes, you quickly get a good
understanding of what’s presented.
The columns lat and long provide the latitude and longitude of the event, depth provides the depth of the
event (in kilometers), mag provides the magnitude on the Richter scale, and stations provides the number
of observation stations that detected the event. If you’re interested in the spatial dispersion of these
earthquakes, then examining only the latitude or the longitude is rather uninformative. The location of each
event is described with two components: a latitude and a longitude value. You can easily plot these 1,000
events:
plot(quakes$long,quakes$lat,xlab="Longitude",ylab="Latitude")

[Figure: scatterplot of the quakes events, Latitude against Longitude]

Computing the Statistics

xdata <- c(2,4.4,3,3,2,2.2,2,4)

x_bar <- mean(xdata) # mean


x_bar

## [1] 2.825
x_median <- median(xdata)
x_median

## [1] 2.6
min(xdata)

## [1] 2
max(xdata)

## [1] 4.4
range(xdata)

## [1] 2.0 4.4


#The mean and median weights of the chicks are as follows:
mean(chickwts$weight)

## [1] 261.3099
median(chickwts$weight)

## [1] 258

Many of the functions R uses to compute statistics from a numeric structure will not return a numeric result
if the data set includes missing or undefined values (NAs or NaNs). Here’s an example:
mean(c(1,4,NA))

## [1] NA
mean(c(1,4,NaN))

## [1] NaN
So that unintended NaNs or forgotten NAs are not silently ignored, R does not by default skip these special
values when running functions such as mean, and therefore does not return the intended numeric result.
You can, however, set the optional argument na.rm to TRUE, which forces the function to operate only on
the numeric values that are present.
mean(c(1,4,NA),na.rm=TRUE)

## [1] 2.5
mean(c(1,4,NaN),na.rm=TRUE)

## [1] 2.5
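Before dropping values with na.rm, it can help to check how many of them are missing. The sketch below uses a small vector made up for illustration (it is not part of the course data) and relies only on the base functions is.na and is.nan.

```r
x <- c(1, 4, NA, NaN)   # illustrative vector, not from the course data

is.na(x)                     # TRUE for both NA and NaN
is.nan(x)                    # TRUE only for NaN
n_missing <- sum(is.na(x))   # number of values that na.rm = TRUE would drop
n_missing

mean(x, na.rm = TRUE)        # computed from the two remaining values
```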
Finding a mode is perhaps most easily achieved by using R’s table function, which gives you the frequencies
you need:
xtab <- table(xdata)
xtab

## xdata
## 2 2.2 3 4 4.4
## 3 1 2 1 1
You can construct a logical flag vector to extract the mode from the table:
d <- xtab[xtab==max(xtab)]
d

## 2
## 3
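The result above is still a table, with the mode stored as a name. One possible way to get the mode back as a number, and to return every value when several tie for the highest frequency, is sketched below. The helper name find_modes is hypothetical and not part of base R.

```r
# find_modes is a hypothetical helper name, not a base R function
find_modes <- function(x) {
  tab <- table(x)
  # names(tab) holds the distinct values as character strings
  as.numeric(names(tab)[tab == max(tab)])
}

xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)
find_modes(xdata)              # a single mode
find_modes(c(1, 1, 2, 2, 3))   # two values tie for most frequent
```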
The quantile function returns the sample quantiles for the probabilities given in probs:
quantile(xdata,probs=c(0,0.25,0.5,0.75,1))

## 0% 25% 50% 75% 100%


## 2.00 2.00 2.60 3.25 4.40
quantile(chickwts$weight,probs=c(0.25,0.75))

## 25% 75%
## 204.5 323.5
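The difference between these two quartiles is the interquartile range. As a small sketch, it can be computed by hand from the quantile output or with the built-in IQR function; the names iqr_manual and iqr_builtin are chosen here for illustration.

```r
q <- quantile(chickwts$weight, probs = c(0.25, 0.75))

iqr_manual  <- unname(q["75%"] - q["25%"])  # difference between the quartiles
iqr_builtin <- IQR(chickwts$weight)         # same quantity from stats::IQR

c(manual = iqr_manual, builtin = iqr_builtin)
```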
There are ways to obtain the five-number summary other than using quantile; when applied to a numeric
vector, the summary function also provides these statistics, along with the mean, automatically.
summary(xdata)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 2.000 2.000 2.600 2.825 3.250 4.400
summary(quakes$mag[quakes$depth<400])

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 4.00 4.40 4.60 4.67 4.90 6.40
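Base R also offers fivenum, which computes Tukey's five-number summary from hinges rather than interpolated quantiles. The sketch below simply puts the two side by side for xdata; on small samples the quartile values can differ slightly, as the upper quartile does here.

```r
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)

quantile(xdata, probs = c(0, 0.25, 0.5, 0.75, 1))  # interpolated quantiles (default type 7)
fivenum(xdata)                                     # Tukey's five-number summary (hinges)
```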

Basic Data Visualization
Data visualization is an important part of a statistical analysis.

Barplots and Pie Charts


Barplots and pie charts are commonly used to visualize qualitative data by category frequency. In this section
you’ll learn how to generate both using R. As an example, let’s use the mtcars data set:
mtcars[1:5,]

## mpg cyl disp hp drat wt qsec vs am gear carb


## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
The documentation in ?mtcars explains the variables that have been recorded. Of these, cyl provides the
number of cylinders in each engine: four, six, or eight. To find out how many cars were observed with each
number of cylinders, you can use table, as shown here:
cyl_freq <- table(mtcars$cyl)
cyl_freq

##
## 4 6 8
## 11 7 14
The result is easily displayed as a barplot, as shown here:
barplot(cyl_freq)
[Figure: barplot of cyl_freq, with bars for 4, 6, and 8 cylinders]
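If relative frequencies are preferred to raw counts, the frequency table can be rescaled before plotting. This is a minimal sketch using the base function prop.table; the name cyl_prop and the axis labels are chosen here for illustration.

```r
cyl_freq <- table(mtcars$cyl)

cyl_prop <- prop.table(cyl_freq)   # counts divided by the total number of cars
round(cyl_prop, 3)

# Bar heights as proportions instead of counts
barplot(cyl_prop, xlab = "Cylinders", ylab = "Proportion of cars")
```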
Similar plots may be produced using ggplot2, once you load the installed package with library("ggplot2"):
#library("ggplot2")
#qplot(factor(mtcars$cyl),geom="bar")

The venerable pie chart is an alternative option for visualizing frequency-based quantities across levels of
categorical variables, with appropriately sized “slices” representing the relative counts of each category.
pie(table(mtcars$cyl),labels=c("V4","V6","V8"),
col=c("white","gray","black"),main="Performance cars by cylinders")

[Figure: pie chart titled "Performance cars by cylinders", with slices V4, V6, and V8]
Histograms
For a simple example of a histogram, consider the horsepower data of the 32 cars in mtcars, given
in the fourth column, named hp:
mtcars$hp

## [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52
## [20] 65 97 150 150 245 175 66 91 113 264 175 335 109
hist(mtcars$hp)

[Figure: "Histogram of mtcars$hp", Frequency against mtcars$hp]
hist(mtcars$hp,breaks=seq(0,400,25),col="gray",main="Horsepower",xlab="HP")
abline(v=c(mean(mtcars$hp),median(mtcars$hp)),lty=c(2,3),lwd=2)
legend("topright",legend=c("mean HP","median HP"),lty=c(2,3),lwd=2)

[Figure: histogram titled "Horsepower", Frequency against HP, with vertical lines marking the mean HP and median HP]
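The numbers behind a histogram can be inspected without drawing it. As a rough sketch, calling hist with plot=FALSE returns the bin edges and counts as a list; the object name h is chosen here for illustration.

```r
h <- hist(mtcars$hp, breaks = seq(0, 400, 25), plot = FALSE)

h$breaks        # bin edges, every 25 hp from 0 to 400
h$counts        # number of cars falling in each bin
sum(h$counts)   # every car is counted exactly once
```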
Let’s return to the built-in quakes data frame of the 1,000 seismic events near Fiji. For the sake of comparison,
you can examine both a histogram and a boxplot of the magnitudes of these events using default base R
behavior.
hist(quakes$mag)

[Figure: "Histogram of quakes$mag", Frequency against quakes$mag]
boxplot(quakes$mag)
[Figure: boxplot of quakes$mag]
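The values that a boxplot draws can also be obtained numerically. The following sketch uses boxplot.stats from the grDevices package, which is loaded by default; the object name bs is chosen here for illustration.

```r
bs <- boxplot.stats(quakes$mag)

bs$stats        # lower whisker end, lower hinge, median, upper hinge, upper whisker end
length(bs$out)  # number of points drawn individually as outliers
```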

Side-by-Side Boxplots
One particularly pleasing aspect of these plots is the ease with which you can compare the five-number
summary distributions of different groups with side-by-side boxplots.

stations.fac <- cut(quakes$stations,breaks=c(0,50,100,150))
boxplot(quakes$mag~stations.fac,
xlab="# stations detected",ylab="Magnitude",col="gray")
[Figure: side-by-side boxplots of Magnitude for the station groups (0,50], (50,100], and (100,150]; x-axis "# stations detected"]
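The grouping factor created with cut can be reused for numerical summaries of the same groups. This is a small sketch under the same breaks as above; the names group_sizes and group_medians are chosen here for illustration.

```r
stations.fac <- cut(quakes$stations, breaks = c(0, 50, 100, 150))

group_sizes   <- table(stations.fac)                         # events per station group
group_medians <- tapply(quakes$mag, stations.fac, median)    # median magnitude per group

group_sizes
group_medians
```

The medians correspond to the thick central lines of the side-by-side boxplots.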
Scatterplots
A scatterplot is most frequently used to identify a relationship between the observed values of
two different numeric-continuous variables, displayed as x-y coordinate plots.
Another example: the famous iris data. Collected in the mid-1930s, this data frame of 150 rows and 5
columns consists of petal and sepal measurements for three species of perennial iris flowers: Iris setosa, Iris
virginica, and Iris versicolor.
iris[1:5,]

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species


## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa

Single Plot
You can modify a simple scatterplot to split the plotted points according to a categorical variable, exposing
potential differences between any visible relationships with respect to the continuous variables. For example,
using base R graphics, you can examine the petal measurements according to the three species:
plot(iris[,4],iris[,3],type="n",xlab="Petal Width (cm)",
ylab="Petal Length (cm)")
points(iris[iris$Species=="setosa",4],
iris[iris$Species=="setosa",3],pch=19,col="black")
points(iris[iris$Species=="virginica",4],
iris[iris$Species=="virginica",3],pch=19,col="gray")

points(iris[iris$Species=="versicolor",4],
iris[iris$Species=="versicolor",3],pch=1,col="black")
legend("topleft",legend=c("setosa","virginica","versicolor"),
col=c("black","gray","black"),pch=c(19,19,1))
[Figure: scatterplot of Petal Length (cm) against Petal Width (cm), with a legend for setosa, virginica, and versicolor]

Matrix plots
The “single” type of planar scatterplot is really useful only when comparing two numeric-continuous variables.
When there are more continuous variables of interest, it isn’t possible to display this information satisfactorily
on a single plot. A simple and common solution is to generate a two-variable scatterplot for each pair of
variables and show them together in a structured way; this is referred to as a scatterplot matrix.
iris_pch <- rep(19,nrow(iris))
iris_pch[iris$Species=="versicolor"] <- 1
iris_col <- rep("black",nrow(iris))
iris_col[iris$Species=="virginica"] <- "gray"

#plot(iris[,4],iris[,3],col=iris_col,pch=iris_pch,
#xlab="Petal Width (cm)",ylab="Petal Length (cm)")
pairs(iris[,1:4],pch=iris_pch,col=iris_col,cex=0.75)

[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width]
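A scatterplot matrix is often read alongside the matching correlation matrix. As a minimal sketch, the base function cor computes the Pearson correlation for every pair of the four measurements; the name iris_cor is chosen here for illustration.

```r
iris_cor <- cor(iris[, 1:4])   # Pearson correlation for every pair of measurements
round(iris_cor, 3)
```

Each off-diagonal entry corresponds to one panel of the scatterplot matrix.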

To generate a scatterplot matrix in a ggplot2 style, it’s recommended that you install the GGally package.
#install.packages("GGally")
library("GGally")

## Loading required package: ggplot2


library("ggplot2")
ggpairs(iris,mapping=aes(col=Species),axisLabels="internal")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: ggpairs scatterplot matrix of the iris data, colored by Species. Correlations reported in the upper panels:]
Pair                          Corr        setosa     versicolor  virginica
Sepal.Length, Sepal.Width     -0.118      0.743***   0.526***    0.457***
Sepal.Length, Petal.Length    0.872***    0.267.     0.754***    0.864***
Sepal.Length, Petal.Width     0.818***    0.278.     0.546***    0.281*
Sepal.Width, Petal.Length     -0.428***   0.178      0.561***    0.401**
Sepal.Width, Petal.Width      -0.366***   0.233      0.664***    0.538***
Petal.Length, Petal.Width     0.963***    0.332*     0.787***    0.322*

Common Probability Distributions

The dbinom Function


To the dbinom function, you provide the specific value of interest as x; the total number of trials, n, as size;
and the probability of success at each trial, p, as prob.
dbinom(x=5,size=8,prob=1/6)

## [1] 0.004167619
X.prob <- dbinom(x=0:8,size=8,prob=1/6)
X.prob

## [1] 2.325680e-01 3.721089e-01 2.604762e-01 1.041905e-01 2.604762e-02


## [6] 4.167619e-03 4.167619e-04 2.381497e-05 5.953742e-07
sum(X.prob) #these can be confirmed to sum to 1.

## [1] 1
Rounded to three decimal places, the results are easier to read:
round(X.prob,3)

## [1] 0.233 0.372 0.260 0.104 0.026 0.004 0.000 0.000 0.000
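The value returned by dbinom can be checked against the binomial mass function written out by hand, P(X = x) = choose(n, x) * p^x * (1 - p)^(n - x). The sketch below does this for x = 5, n = 8, and p = 1/6; the names manual and builtin are chosen here for illustration.

```r
n <- 8; p <- 1/6; x <- 5

manual  <- choose(n, x) * p^x * (1 - p)^(n - x)   # binomial mass function written out
builtin <- dbinom(x = x, size = n, prob = p)

c(manual = manual, builtin = builtin)
all.equal(manual, builtin)
```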
The pbinom Function
The other R functions for the binomial distribution work in much the same way. The first argument is always
the value (or values) of interest; n is supplied as size and p as prob. To find, for example, the probability
that you observe three or fewer successes, P(X<=3), you either sum the relevant individual entries from dbinom
as earlier or use pbinom:
sum(dbinom(x=0:3,size=8,prob=1/6))

## [1] 0.9693436
pbinom(q=3,size=8,prob=1/6)

## [1] 0.9693436
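The complementary probability, P(X > 3), follows from the same function. As a short sketch, it can be obtained either by subtracting from 1 or by setting the lower.tail argument of pbinom to FALSE; the variable names are chosen here for illustration.

```r
# P(X > 3) as the complement of P(X <= 3)
upper_complement <- 1 - pbinom(q = 3, size = 8, prob = 1/6)

# The same probability requested directly
upper_direct <- pbinom(q = 3, size = 8, prob = 1/6, lower.tail = FALSE)

c(upper_complement, upper_direct)
```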
The qbinom Function
Less frequently used is the qbinom function, which is the inverse of pbinom. Where
pbinom provides a cumulative probability when given a quantile value q, the function qbinom provides a
quantile value when given a cumulative probability p. Because a binomial random variable is discrete,
qbinom returns the smallest value x for which the cumulative probability P(X<=x) is at least p:
qbinom(p=0.95,size=8,prob=1/6)

## [1] 3
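The relationship between qbinom and pbinom can be checked directly. In this sketch, the cumulative probability one step below the returned quantile falls short of 0.95, while the cumulative probability at the quantile reaches it; the name q95 is chosen here for illustration.

```r
q95 <- qbinom(p = 0.95, size = 8, prob = 1/6)

pbinom(q = q95 - 1, size = 8, prob = 1/6)  # still below 0.95
pbinom(q = q95,     size = 8, prob = 1/6)  # first cumulative probability to reach 0.95
```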

The rbinom Function


Lastly, random realizations of a binomially distributed variable are generated using the rbinom
function.
rbinom(n=1,size=8,prob=1/6)

## [1] 2
rbinom(n=1,size=8,prob=1/6)

## [1] 2
rbinom(n=1,size=8,prob=1/6)

## [1] 1
rbinom(n=3,size=8,prob=1/6)

## [1] 1 3 0
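Because the draws are random, repeated calls give different results. One common way to make them reproducible is to fix the random seed with set.seed first. The sketch below assumes an arbitrary seed of 42, chosen only for illustration.

```r
set.seed(42)   # arbitrary seed chosen for illustration
first <- rbinom(n = 5, size = 8, prob = 1/6)

set.seed(42)   # resetting the seed reproduces the same draws
second <- rbinom(n = 5, size = 8, prob = 1/6)

identical(first, second)

# With many draws, the sample mean should be close to size * prob = 8/6
mean(rbinom(n = 10000, size = 8, prob = 1/6))
```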

Probability Mass and Density Functions

dpois(x=3,lambda=3.22) # Poisson distribution

## [1] 0.2223249
barplot(dpois(x=0:10,lambda=3.22),ylim=c(0,0.25),space=0,
names.arg=0:10,ylab="Pr(X=x)",xlab="x")

[Figure: barplot of Pr(X=x) against x for x = 0 to 10]
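As with the binomial functions, the Poisson probability can be checked against its formula, P(X = x) = exp(-lambda) * lambda^x / x!, and cumulative probabilities are available through ppois. The sketch below uses the same lambda = 3.22; the names manual and builtin are chosen here for illustration.

```r
lambda <- 3.22

manual  <- exp(-lambda) * lambda^3 / factorial(3)   # Poisson mass function at x = 3
builtin <- dpois(x = 3, lambda = lambda)
all.equal(manual, builtin)

ppois(q = 3, lambda = lambda)   # cumulative probability P(X <= 3)
```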
# uniform distribution
dunif(x=c(-2,-0.33,0,0.5,1.05,1.2),min=-0.4,max=1.1)

## [1] 0.0000000 0.6666667 0.6666667 0.6666667 0.6666667 0.0000000
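The repeated value 0.6666667 is the constant height of the uniform density, 1/(max - min), which applies to every x inside the interval; outside it the density is 0. The sketch below checks this and adds the cumulative probability from punif; the names a and b are chosen here for illustration.

```r
a <- -0.4; b <- 1.1

1 / (b - a)                        # constant density height inside [a, b]
dunif(x = 0.5, min = a, max = b)   # matches the height above

# Cumulative probability P(X <= 0.5) equals (0.5 - a) / (b - a)
punif(q = 0.5, min = a, max = b)
```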
