Software R


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/354248891


Presentation · September 2021


Pier Giuseppe Giribone


Università degli Studi di Genova

All content following this page was uploaded by Pier Giuseppe Giribone on 31 August 2021.



Lecture notes for the course

Software R
An introduction to Statistical Programming

Master’s Degree in Economics and Data Science


Academic Year 2021/2022

By Pier Giuseppe Giribone


1.
Introduction
The R console
Common basic functions
The R Studio IDE
Software R

R is an integrated software mainly used for:


- Statistical analysis
- Numerical Computing
- Data handling
- Graphical visualization
It was created by the New Zealand statistician Ross Ihaka together
with the Canadian mathematician and statistician Robert Gentleman,
who presented it in 1996 (“R: A Language for Data Analysis and Graphics”, Journal of
Computational and Graphical Statistics, Vol. 5, No. 3, 299-314).

1. 3
Ross Ihaka and Robert Gentleman

1. 4
Software R

Besides being a software environment, R is also a programming language,
considered a "dialect" of the S language, which was created by John
Chambers and his team at Bell Laboratories and marketed under the name
S-PLUS by Insightful.
R is distributed freely under the terms of the GNU General Public
License: its development and distribution are carried out by a group
of statisticians known as the R Development Core Team.
R, whose source code is written mainly in C, is available for several
operating systems (Linux, Windows and macOS).

1. 5
Software R

The reference website for obtaining the files to install R is:


https://cran.r-project.org/

1. 6
R console

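Once R is installed, the console shows the following:
[Screenshot: the R console at start-up]

Examples to type in the R console:
> licence()
> contributors()
> demo()
> colors()
> getwd()
> setwd("~") # R has no cd() function: setwd() changes the working directory
> q() #close the console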

1. 7
R console - help functions
help() is the function that opens the documentation. It is useful for
looking up the meaning of an unfamiliar R command and the correct
syntax for calling it.
For instance, in order to know the meaning of the getwd function,
you can type in the console:
> help("getwd")
Or equivalently:
> ?getwd
The help page dedicated to the function then opens in the
web browser.

1. 8
> setwd("C:/Users/Utente/Desktop/")

1. 9
Traditional structure of a help page

Description: general overview


Usage: syntax
Arguments: details about the inputs of the function
Details: technical details for the implementation
Value: details about the outputs of the function
References: bibliographical references
See Also: related commands/functions
Examples: executable instructions and working samples

1.10
On-line help
Typing the help.start() instruction in the console opens the on-line help.

1.11
R console – Case sensitivity

R is a case-sensitive language; consequently, to call the constant pi
(π) in the console, you must type pi at the prompt and not, for instance,
«Pi» or «PI».
> pi
[1] 3.141593
> PI
Error: object "PI" not found
> Pi
Error: object "Pi" not found

1.12
R console – Comments

«The comment, in the context of programming languages, is a part of


the source code that has the sole purpose of describing its functional
characteristics, that is, to explain the operation of the subsequent
lines of code, and which is not part of the algorithm codified in
programming language.
During the compilation process these instructions are ignored and
consequently do not weigh computationally on the size of the
executable produced.» (Wikipedia)
In R, a comment is introduced by the hash symbol # (hashmark): everything that follows it on the line is ignored.
> # This line is not considered by the R interpreter

1.13
Multiple instructions on the same line

You can type the following instructions in the console:


> a=2
> b=3
> c=a+b
Or, equivalently, you can use the semicolon ; for typing more
instructions in the same line
> a=2; b=3; c=a+b
By typing the name of the declared variable in the prompt, you can
see its contents.
> a
1.14
R console – Editor

Using the Editor provided by the console, you can write the code
more easily, save the file and share its content.

Press Ctrl + R to execute the selected instructions.
The saved script has the extension *.R

1.15
R console – Completion of a command “+”

If the instruction provided is not complete, R signals this
to the user by showing the + symbol in the console instead of
the > symbol.
This indicates that the R interpreter cannot evaluate the instruction
until the missing part has been supplied.
> d=a*
+ b
> d
[1] 6

1.16
R console – Memory management

To see from the console all the variables stored in memory, which
can therefore be used, you can execute the ls() function or,
equivalently, the objects() function.
> ls()
[1] "a" "b" "c" "d"
Consequently, you cannot use any variable which is not currently
stored in the memory.
> e
Error: object "e" not found

1.17
R console – Memory management

The memory area (the workspace) and, therefore, the data contained therein can be
saved in a file with the *.RData extension using the save.image instruction
> save.image("variables.RData")
The RData file is saved in the current working directory or in the path
specified by the user.
A more extensive and customizable way is to use the instruction:
> save(list = ls(all.names = TRUE), file = ".RData",
envir = .GlobalEnv)
You can access the guide for this function by typing:
> help("save")

1.18
R console – Memory management

In order to clear variables from the memory, you can use the rm()
function.
For instance, you can remove the variable a by typing in the prompt:
> rm(a)
> a
Error: object "a" not found
To clean the memory completely, the following instruction is used:
> rm(list=ls())
> ls()
character(0)

1.19
R console – Memory management

In order to import data stored in an RData file, you can use the load()
function.
> load("variables.RData")
> ls()
[1] "a" "b" "c" "d"

If the file is not in the current working directory, you must either
specify the full path of the file or first set the working directory
to the folder containing it with the setwd function.
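The save and load steps of the previous slides can be tried as a round trip. The sketch below is only illustrative: it writes to a temporary folder through tempdir(), and the file name is an assumption chosen for the example. It uses save() on two named objects rather than save.image(), so that the rest of the workspace is not touched.

```r
# Hypothetical file location: a temporary folder, so no real workspace file is overwritten
rdata_file <- file.path(tempdir(), "variables.RData")

a <- 2; b <- 3
save(a, b, file = rdata_file)   # store only the objects a and b

rm(a, b)                        # remove them from memory

load(rdata_file)                # restore a and b from the file
print(a + b)
```

Because the full path is passed to save() and load(), the example works regardless of the current working directory.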

1. 20
R console – Remove the previous instructions from the console

To clear the console of the previously interpreted lines of code, the
usual choice is the keyboard shortcut Ctrl + L.

Be careful: clearing previously interpreted instructions DOES NOT
remove the variables from memory.

To see all the other shortcuts go to the Console menu item:

“Help Console”

1. 21
Save the history of the interpreted instructions in a file

The history of the interpreted instructions can be stored in a file with the
*.Rhistory extension using the savehistory function.
> savehistory(file = "CommandsHistory.Rhistory")

It can be loaded again through the loadhistory function.

To view the last five instructions processed by the
interpreter, write:
> history(max.show=5)

1. 22
R Studio – the most widespread IDE for R programming

RStudio is one of the most popular integrated development environments
(IDE) for the R language.
It is available as free, open-source software for managing projects
coded in R.
RStudio was created by J.J. Allaire and currently Hadley Wickham is
the Chief Scientist of the project.
The IDE is written mainly in Java and C++.
The link from which you can directly download the software is:
https://www.rstudio.com/

1. 23
R Studio – Download

1. 24
[Figure: the RStudio window divided into four panes: A (top left), C (top right), B (bottom left), D (bottom right)]

1. 25
R Studio – the graphic interface of the R development software

In pane A, the R scripts are opened and displayed.
To execute a part of the code, it is sufficient to highlight the
statements and use the shortcut CTRL + Enter.
The selected code will be executed in the R console, pane B.
The objects stored in memory are displayed in pane C.
If you call graphic objects, they are plotted in pane D.

In addition to offering a more orderly layout, RStudio provides
buttons for the common basic instructions discussed in the
previous slides, which makes coding easier.

1. 26
2.
Vectors
Assignment
Numeric and Logical vector
Character and Index vector
Vectors and assignment

R operates on named data structures.
The simplest structure is the numeric vector, a single entity that
consists of an ordered collection of numbers.
To create a vector called x consisting of six numbers,
for instance 9.3, 5.5, 10.6, 5.3, 1.3, 7.7, you can use the following
syntax:
x <- c(9.3, 5.5, 10.6, 5.3, 1.3, 7.7)
This is an assignment statement which uses the c() function.
Remember that a function is always invoked by writing
parentheses (…) after its name.

2.2
Vectors and assignment

The assignment instruction is written in the script, then it is executed
in the console.
The variable appears in the Environment pane because it has been stored in
memory.

2.3
The View() function displays the contents of an object in
a table-like format.
> View(x)

When the help function is called, the user guide opens in a
dedicated tab of the IDE.

2.4
Vectors and assignment
c() can take an arbitrary number of vectors as input: the result
is a single vector made by concatenating all the input
vectors.
Note that a single number (a scalar) is itself a vector of
length 1.
Alternative ways of writing the previous assignment of the vector x
are:
> assign("x", c(9.3, 5.5, 10.6, 5.3, 1.3, 7.7))
> c(9.3, 5.5, 10.6, 5.3, 1.3, 7.7) -> x

2.5
Vectors and assignment

In RStudio there is a useful keyboard shortcut to quickly write the
classic assignment symbol <-, which is Alt + - (Alt and the minus key).
In these examples its effect is the same as that of the operator =,
although the two are not interchangeable in every context.

Be careful: if an expression is evaluated without being assigned to a
variable, the result is displayed but not permanently stored.
It remains available through .Last.value
only until the next instruction is evaluated.

2.6
[Screenshot: RStudio, with the current working directory (WD) highlighted]

The statement 1/x, which calculates the reciprocals of the six numeric
elements contained in the vector, has not been assigned to any variable and
therefore does not appear in memory. The result is only displayed and can
only be recalled temporarily using .Last.value (with a lowercase v: R is case-sensitive)

2.7
Vectors and assignment
The combine function, c(), can obviously be used with vectors of
different dimensions.
x <- c(9.3, 5.5, 10.6, 5.3, 1.3, 7.7)

y <- c(x, 0, c(44,55))

# The print() function displays the elements of a vector in the console
> print(y)
[1] 9.3 5.5 10.6 5.3 1.3 7.7 0.0 44.0 55.0

2.8
Arithmetic vectors – Arithmetic operations

Vectors can be used in arithmetic operations; in this case the
operations are applied element by element.

The vectors do not need to be of the same length: if vectors of different
lengths appear in the expression, the result is a vector as long
as the longest vector in the expression.

The shorter vectors are recycled, i.e. repeated as many times as needed
to match the length of the longest vector.

2.9
Arithmetic vectors – Arithmetic operations
x1 <- c(9.3, 5.5, 10.6, 5.3, 1.3, 7.7)
y1 <- c(1,2)
z1 <- c(6,5,2,4)
v1 <- x1+y1+z1+1
Warning message:
In x1 + y1 + z1 :
  longer object length is not a multiple of shorter object length
x2 <- x1
y2 <- c(1,2,1,2,1,2)
z2 <- c(6,5,2,4,6,5)
v2 <- x2+y2+z2+1
> print(v1-v2)
[1] 0 0 0 0 0 0
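To make the recycling rule visible, the shorter vector can be expanded by hand before the operation. The sketch below uses the base function rep_len() for this purpose; the variable names implicit and explicit are chosen only for illustration.

```r
x <- c(9.3, 5.5, 10.6, 5.3, 1.3, 7.7)
y <- c(1, 2)

# Implicit recycling: y is reused three times to match length(x)
implicit <- x + y

# Explicit equivalent: y is expanded to length(x) before the sum
explicit <- x + rep_len(y, length(x))

print(identical(implicit, explicit))
```

Here no warning is raised, because the length of x (6) is a multiple of the length of y (2).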

2.10
Arithmetic vectors – traditional mathematical functions

All the traditional arithmetic operators can be used with numeric
vectors: +, -, *, / and ^ for exponentiation.
In addition, all the most common mathematical functions are available:
log, exp, sin, cos, tan, sqrt, …
max and min select the greatest and the smallest value in the vector,
respectively.
range is a function that returns a vector of length 2 equal to
c(min(x),max(x)).
length(x) returns the number of elements in x, sum(x) their sum
and prod(x) their product.

2.11
Arithmetic vectors – traditional arithmetic functions
myvec <- c(1, 5, 3.5, -1, +2)
# By default R stores numbers as double (i.e. real numbers).
# 5L denotes the integer 5 (L = long integer).

Range_vec <- range(myvec)
Minimum_vec <- min(myvec)
Maximum_vec <- max(myvec)
Length_vec <- length(myvec)
Sum_vec <- sum(myvec)
Prod_vec <- prod(myvec)

2.12
Arithmetic vectors – mean and variance

Among the most common statistical functions are the
arithmetic mean, mean, and the sample variance, var.
x <- c(1, 5, 3.5, -1, +2)
Average <- mean(x); myAverage <- sum(x)/length(x)
Variance <- var(x)
myVariance <- sum((x-mean(x))^2)/(length(x)-1)
> print(paste("First moment:",Average, "Second moment:",Variance))
[1] "First moment: 2.1 Second moment: 5.3"

2.13
Arithmetic vectors - sorting
sort(x) returns a vector of the same length as x with its
elements sorted in ascending order.
x <- c(1, 5, 3.5, -1, +2)
xsorted <- sort(x)
help("sort")
> print(x)
[1] 1.0 5.0 3.5 -1.0 2.0
> print(xsorted)
[1] -1.0 1.0 2.0 3.5 5.0

2.14
Arithmetic vectors – NaN and complex numbers

Normally the user of R does not have to worry about whether the
numbers contained in a vector are integer, real or complex: the
calculations are done internally in double precision, treating
them as real or complex doubles. In order to work with
complex numbers, however, the complex part must be made explicit.
Consequently:
> sqrt(-16)
[1] NaN
Warning message:
In sqrt(-16) : NaNs produced
> sqrt(-16+0i)
[1] 0+4i
«In programming, NaN (Not a Number) is a warning indicating that the result of a
(numeric) operation was performed on invalid operands.»

2.15
Arithmetic vectors – regular sequences

R has several facilities for generating the most common
number sequences. For instance, in order to create the numeric
sequence which goes from 1 to 10, you can use c():
sequence_vector <- c(1,2,3,4,5,6,7,8,9,10)
Or, more easily, you can do the same with the colon operator :
sequence_vector <- 1:10
The : operator has a higher priority than the arithmetic operators *, /, +
and - (but a lower one than ^ and the unary minus). The
output of the instruction 2*1:10 is therefore not a vector that goes from 2 to 10:
the instruction first generates the sequence from
1 to 10 and then multiplies this vector by 2.

2.16
Arithmetic vectors – “:” and priority

Pay attention to the priority of the operators:

> n <- 10
> 1:n-1
[1] 0 1 2 3 4 5 6 7 8 9
> 1:(n-1)
[1] 1 2 3 4 5 6 7 8 9

The operator : can also be used for sequences that go backward,
for instance 15:1
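The priority rules can be checked directly. The short sketch below (variable names are arbitrary) places : next to the operators that are evaluated before it (^ and the unary minus) and after it (*), as documented in ?Syntax.

```r
a <- 2 * 1:5   # ':' is evaluated before '*': 2 4 6 8 10
b <- -1:3      # the unary minus is evaluated before ':': -1 0 1 2 3
d <- 2^1:3     # '^' is evaluated before ':': same as (2^1):3, i.e. 2 3
e <- -(1:3)    # parentheses force the sequence first: -1 -2 -3

print(a); print(b); print(d); print(e)
```

When in doubt, parentheses make the intended order explicit and cost nothing.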

2.17
Arithmetic vectors – The seq function
The seq() function generates numeric sequences in a more
general and customizable way.
> help("seq")
This function has five input arguments, but not all of them are
required in every call.
The first two inputs are the start and the end of the numeric
sequence. Consequently:
> 2:10
is equivalent to:
> seq(2,10)

2.18
Arithmetic vectors – The seq function and its named-form input
The input arguments for seq(), as well as many other R functions,
can be passed in the so-called named form.
In this case, the order in which the input data are passed is
irrelevant.
The first two mandatory input arguments can therefore also be
written in the named-form using from=value, to=value
The following instructions generate the same outputs:
> seq(3,15)
> seq(from=3,to=15)
> seq(to=15,from=3)

2.19
Arithmetic vectors – The seq function and its named-form input
Looking at the function help, the next two input arguments of seq()
are by=value and length.out=value (which can be abbreviated to length=value).
These specify the step and the length of the number sequence.
The default value for by is by=1 (or -1 for a decreasing sequence).
For instance, the instruction:
> vect1 <- seq(-2,2, by=.5)
generates a vector named vect1 having the following 9 elements:
> vect1
[1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0

2.20
Arithmetic vectors – The seq function and its named-form input

The same output can be obtained using the command:
> vect2 <- seq(length=9, from=-2, by=.5)
The fifth argument of seq() is along.with=vector (abbreviated here to along=),
which is normally used as the only argument to generate the numeric sequence
1,2,...,length(vector).
> vect3 <- seq(along=c(10:15))
> print(vect3)
[1] 1 2 3 4 5 6

Vector element   10 11 12 13 14 15
Vector index      1  2  3  4  5  6
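The abbreviated arguments length= and along= work through partial matching of the full names length.out and along.with. As a complement, the sketch below uses the full names together with the base helpers seq_along() and seq_len(), which are not mentioned in these slides but return the same index sequences; variable names are illustrative.

```r
v <- c(10:15)

idx1 <- seq(along.with = v)   # full name of the along= argument
idx2 <- seq_along(v)          # same result: 1 2 3 4 5 6

grid <- seq(-2, 2, length.out = 9)   # full name of the length= argument

# With an empty vector, seq_len() returns an empty sequence,
# whereas 1:0 counts backward and gives 1 0
n <- 0
safe  <- seq_len(n)   # integer(0)
risky <- 1:n          # 1 0
```

The difference between seq_len(n) and 1:n matters mainly when the sequence is used to index a vector that may be empty.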

2.21
Arithmetic vectors – Random sequences

In statistics it is useful to be able to generate random data; for this
purpose, R provides random number generators for a large number of
probability distributions.
These functions are of the form rfunc(n,p1,p2,...) where func
indicates the probability distribution, n is the number of values to be
generated and p1,p2,... are the parameters that
determine the distribution from which the numbers are drawn. The
following table shows the details for each distribution together with
the possible default values, i.e. those values that, unless otherwise
specified, are used by R. If the default value is not specified in the
table, it must be defined by the user.

2.22
Arithmetic vectors – Random sequences
# 4 numbers drawn from a NID(0,1)
> rnorm(4)
[1] -0.01977535 1.34546924 -0.41916212 -1.15732186
# 4 numbers drawn from a uniform U(3,5)
> runif(4,min=3,max=5)
[1] 3.409750 4.036633 3.157913 3.160682
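Since the numbers are random, running the instructions above again gives different values. When a result has to be reproducible, the generator can be initialized with set.seed(). The sketch below is a minimal illustration; the seed 123 is an arbitrary choice.

```r
set.seed(123)             # arbitrary seed, chosen only for the example
first <- rnorm(4)

set.seed(123)             # same seed: the same four numbers are drawn again
second <- rnorm(4)

print(identical(first, second))

# Draws from U(3,5) always fall between min and max
u <- runif(1000, min = 3, max = 5)
print(range(u))
```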

2.23
Arithmetic vectors – dfunc, pfunc, qfunc

For most of the previous statistical functions:
- replacing rfunc with dfunc returns the probability density.
- replacing rfunc with pfunc returns the cumulative distribution
function.
- replacing rfunc with qfunc returns the quantile.
The last two sets of functions can be used to find the critical values
or P-values used in statistical tests.
For example, the critical values for a two-tailed statistical test
following a normal distribution with α = 0.05 are:

2.24
Arithmetic vectors – dfunc, pfunc, qfunc
> qnorm(c(0.025,0.975))
[1] -1.959964 1.959964
The P-value for a chi-square test with χ² = 3.84 and one degree of
freedom, df = 1, is:
> 1-pchisq(3.84,1)
[1] 0.05004352
The values of the standard normal cumulative distribution function, N(d), for
d = {−1, 0, +1} are:
> pnorm(c(-1,0,+1))
[1] 0.1586553 0.5000000 0.8413447
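The p and q functions are inverses of each other, which gives a quick consistency check. The sketch below illustrates this for the normal distribution and shows the lower.tail argument of pchisq() as an alternative way of writing the upper-tail probability; variable names are illustrative.

```r
p <- c(0.025, 0.5, 0.975)
q <- qnorm(p)              # quantiles of the standard normal
back <- pnorm(q)           # applying the cumulative function recovers p

dens0 <- dnorm(0)          # density at 0, equal to 1/sqrt(2*pi)

# Upper-tail probability written in two equivalent ways
pval1 <- 1 - pchisq(3.84, df = 1)
pval2 <- pchisq(3.84, df = 1, lower.tail = FALSE)

print(back); print(dens0); print(c(pval1, pval2))
```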

2.25
Arithmetic vectors – the rep function
The rep() function replicates an object in different ways
> help("rep")
The simplest way to replicate the object x is:
> x=1:3

> vectRepl1 <- rep(x,times=4)

> print(vectRepl1)
[1] 1 2 3 1 2 3 1 2 3 1 2 3

times= 1 2 3 4

2.26
Arithmetic vectors – the rep function
Another common way to implement rep() is to specify the named
form parameter each=value.
In this case each element is repeated and not the entire sequence of
the object.
> x=1:3
> vectRepl2 <- rep(x,each=4)
> print(vectRepl2)
[1] 1 1 1 1 2 2 2 2 3 3 3 3

each= 4 4 4
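The arguments of rep() can also be combined, and times accepts a vector with one count per element. The sketch below shows these variants together with length.out, another argument listed in help("rep"); variable names are illustrative.

```r
x <- 1:3

r1 <- rep(x, times = c(3, 2, 1))    # one count per element: 1 1 1 2 2 3
r2 <- rep(x, each = 2, times = 2)   # each is applied first: 1 1 2 2 3 3 1 1 2 2 3 3
r3 <- rep(x, length.out = 7)        # repeat until 7 elements: 1 2 3 1 2 3 1

print(r1); print(r2); print(r3)
```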

2.27
Logical vector

As with numeric vectors, R also allows the manipulation of
logical values. The elements of a logical vector can be: TRUE, FALSE
and NA (not available).
Logical vectors are normally generated by testing
conditions.
For instance:
x=c(10, -2, 15, 12, 12.5)
logicTest <- x > 12
> print(logicTest)
[1] FALSE FALSE TRUE FALSE TRUE
The logicTest vector has the same length as x: it contains FALSE for the
elements of x that do not satisfy the condition > 12 and TRUE
for the elements of x that do.

2.28
Logical vectors – logical operators

The comparison (relational) operators of R, which return logical values, are:
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== equal to
!= not equal to
If operations are carried out between numbers and logical values,
FALSE automatically becomes 0 and TRUE becomes 1.
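The conversion of TRUE to 1 and FALSE to 0 is what makes counting with conditions possible. A minimal sketch, with illustrative variable names:

```r
x <- c(10, -2, 15, 12, 12.5)

n_above     <- sum(x > 12)    # number of elements greater than 12: 2
share_above <- mean(x > 12)   # proportion of elements greater than 12: 0.4

print(n_above); print(share_above)
```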

2.29
Logical vectors – logical expressions
Furthermore, if cond1 and cond2 are two logical expressions, the
principles of traditional logic apply:
cond1 & cond2 is the intersection (logical AND)
cond1 | cond2 is the union (logical OR)
!cond1 is the negation of cond1
Therefore, all the rules of the "Truth Tables" are respected.
However, keep in mind that combining a determined logical value
(TRUE or FALSE) with an undetermined one (NA) returns NA, unless the
determined value alone already fixes the result (FALSE & NA is FALSE, TRUE | NA is TRUE).

2.30
Logical vectors – logical expressions
cond1=FALSE; cond2=TRUE
cond3=NA
# Negation
print(!cond1)
# AND
print(cond1 & cond1)
print(cond1 & cond2)
print(cond2 & cond1)
print(cond2 & cond2)
# OR
print(cond1 | cond1)
print(cond1 | cond2)
print(cond2 | cond1)
print(cond2 | cond2)
# Operations with NA
print(!cond3)
print(cond2 & cond3)
print(cond1 | cond3)
print(cond1 & cond3)

2.31
Missing Value

In some cases the components of a vector may not be fully known.


When an element or a value is not available (NA - not available) or
missing in the statistical sense of the term (missing value), the
special value NA has to be used.
Generally all operations involving an NA element return NA (with
some exceptions already discussed).
The reason for this rule is quite intuitive: if the specifications for
carrying out an operation are incomplete, the result will be too.
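In practice, many summary functions accept the argument na.rm, which removes the missing values before the calculation. The sketch below shows the difference on a small vector; variable names are illustrative.

```r
vect <- c(1:5, NA)

total_na    <- sum(vect)                  # NA: one element is unknown
total_clean <- sum(vect, na.rm = TRUE)    # 15: the NA is dropped first
mean_clean  <- mean(vect, na.rm = TRUE)   # 3

print(total_na); print(total_clean); print(mean_clean)
```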

2.32
Missing Values – is.na function
The is.na(x) function returns a logical vector of the
same length as the input vector x, whose elements are TRUE where the
corresponding element of x is NA
and FALSE otherwise.
> vect <- c(1:5,NA)
> print(vect)
[1] 1 2 3 4 5 NA
> logicvect=is.na(vect)
> print(logicvect)
[1] FALSE FALSE FALSE FALSE FALSE TRUE

2.33
Missing Values – NaN

It is worth noting that there is a second kind of "missing value", which
is produced by numerical processing (therefore not natively absent or
missing): the NaN (Not a Number) values.
Typical examples are the indeterminate forms:
> 0/0
[1] NaN
> Inf-Inf
[1] NaN

The is.na() function detects both NA and NaN values, while
is.nan() only detects NaN values.
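The difference between the two functions can be seen on a vector that contains both kinds of value. A minimal sketch, with illustrative variable names:

```r
v <- c(1, NA, 0/0, Inf - Inf)

na_flags  <- is.na(v)    # FALSE TRUE TRUE TRUE: NA and NaN are both flagged
nan_flags <- is.nan(v)   # FALSE FALSE TRUE TRUE: only the NaN values

print(na_flags); print(nan_flags)
```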

2.34
Character vectors – Strings

Character vectors (strings) are frequently used in R both for
graphics (plot labels) and to make the results or the input data of
a function more intelligible.
Strings are declared using "" or '', but in the console they are
always displayed using ""
> string1='example1'; string2="example2"
> print(string1)
[1] "example1"
> print(string2)
[1] "example2"

2.35
Character vectors – C-style escape sequences

R uses many escape sequences inherited from C.
For example, to display characters in the console that would
otherwise be interpreted differently (such as " and '), put the \
symbol in front of them. To start a new line you can use \n (new line); \t
is the tab character and \b is the backspace (see ?Quotes).
> string1="Welcome to \"R course\"!"
> cat(string1,"\nPier Giribone")
Welcome to "R course"!
Pier Giribone
cat() concatenates the character vectors and prints them in the console.
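An escape sequence is written with two keystrokes but counts as a single character. The sketch below checks this with nchar() and compares print(), which shows the escapes, with cat(), which interprets them. The Windows-style path is a made-up example.

```r
s <- "a\tb\n"
n_chars <- nchar(s)         # 4 characters: a, tab, b, new line

print(s)                    # displays the escape sequences as typed
cat(s)                      # interprets them: a, a tab, b, then a new line

win_path <- "C:\\Users"     # hypothetical path: a literal backslash is written \\
n_path <- nchar(win_path)   # 8 characters
cat(win_path, "\n")
```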

2.36
Access vector elements
The elements stored in a vector x can be selected in the simplest
way by writing, in square brackets, the index of the position of
the element to access: x[index].

> x <- c(10,-5,15,1.2,20)
> x[3]
[1] 15

Element of the vector x   10  -5  15  1.2  20
Index of the vector x      1   2   3    4   5

To select the element of the vector x stored in the third position (index = 3),
you have to type x[3]

2.37
Index vectors: logical
In this case, only the elements of the vector x that
satisfy the logical condition expressed in the square brackets are selected.
For instance:
> x <- c(10,-5,15,1.2,20)
> y <- x[x>=10]
> print(y)
[1] 10 15 20
Vector y contains only the values of x that meet the condition of
being greater than or equal to 10. Note that length(y) <= length(x).
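A logical index can be stored, combined with other conditions, or turned into positions. The sketch below uses the base function which(), not covered in these slides, to obtain the positions of the TRUE values; variable names are illustrative.

```r
x <- c(10, -5, 15, 1.2, 20)

keep <- x >= 10               # logical index vector, same length as x
pos  <- which(keep)           # positions of the TRUE values: 1 3 5
y    <- x[keep]               # 10 15 20
z    <- x[x >= 10 & x < 20]   # two conditions combined with &: 10 15

print(pos); print(y); print(z)
```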

2.38
Index vectors: positive integers

In this case, the values of the index vector must be chosen in the set
{1,2,...,length(x)}.
The elements indicated by the index vector are selected from the
vector x and concatenated in that order.
For instance:
> x <- c(10,-5,15,1.2,20,6,3)
> y <- x[c(1,3:5,length(x))]
> print(y)
[1] 10.0 15.0 1.2 20.0 3.0

2.39
Index vectors: negative integers

The index vector in this case specifies the elements to be excluded
instead of those to be included, by placing a minus sign - in front of the
vector.
> x <- c(10,-5,15,1.2,20,6,3)
> y <- x[-c(1,3:5,length(x))]
> print(y)
[1] -5 6
> print(x[-3])
[1] 10.0 -5.0 1.2 20.0 6.0 3.0

2.40
Index vectors: character strings

A vector can be indexed through strings when its names
attribute has been defined to identify its elements.
exams_results <- c(30, 30, 28, 18)
names(exams_results) <- c("FixedIncome", "Derivatives",
"PortfolioManagement","Accounting")
Results <- exams_results[c("Derivatives","Accounting")]
> print(Results)
Derivatives Accounting
30 18

2.41
Index vectors and names attributes

This alphanumeric notation (names) is often used to identify the
elements of a vector, especially in data frames, as it is easier to
remember than numerical indexing.
When an index vector of one of the four kinds described above appears
on the left-hand side of an assignment, the assignment
applies only to the elements of the
vector x selected by the index vector.
x <- c(10,NaN,5,NaN,7,8,NaN); x[is.nan(x)] <- 0
print(x)
[1] 10 0 5 0 7 8 0
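The same kind of selective assignment works with each of the four index types. The sketch below goes through them in turn; the vector marks and its element names are made up for the example.

```r
x <- c(10, -5, 15, 1.2, 20)

x[x < 0] <- 0              # logical index: only the negative element changes
x[c(1, 5)] <- c(11, 21)    # positive integers: positions 1 and 5
x[-(1:4)] <- 99            # negative integers: every position except 1 to 4
print(x)                   # 11 0 15 1.2 99

marks <- c(FixedIncome = 30, Derivatives = 30, Accounting = 18)
marks["Accounting"] <- 24  # character index: the element named "Accounting"
print(marks)
```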

2.42
Other R objects… A preview

Vectors are one of the most important types in R, but there are other
fundamental objects that will be covered in the next slides:
- Matrices and multidimensional arrays: extensions of vectors.
- Factors: objects useful for handling categorical data.
- Lists: generalization of vectors in which the stored elements do
not necessarily have to be of the same nature.
- Data frames: structures which look like matrices, but can
host different types of data (“data matrices”).
- Functions: functions are themselves managed as objects in R
and can be stored in the workspace.
2.43
3.
Mode & Attributes
Mode, Type, Class and Attributes
Recursive Structures
Object Coercion
Intrinsic attributes: mode and length

The entities on which R operates are technically known as
objects. Examples of objects already discussed are:
- Vectors whose elements are integer, double or complex
numbers – numeric vector
- Vectors whose elements are logical – logical vector
- Vectors whose elements are strings – character vector

These objects (numbers, logical values and characters) are primitive
(atomic) structures, each characterized by a specific
behaviour (mode).

3. 2
Intrinsic attributes: mode and length

All the elements of a vector must be of the same nature.
Hence a vector must be unambiguously of a given mode: logical,
numeric, complex, character or raw (bytes).
The only apparent exception to this rule is the special "value" NA, but
in fact there are different types of NA (one for each mode),
depending on the vector to which it belongs.
The is.vector function checks whether an object is a vector (atomic vectors
and lists both qualify); is.atomic checks specifically whether it is atomic.
> print(is.vector(c(1,5,NA)))
[1] TRUE

3. 3
Intrinsic attributes: mode and length
vect_double <- c(1, 2.3, 3.1, -4)
mode(vect_double) #numeric
typeof(vect_double) #double
vect_int <- c(1L, 10L, 5L)
mode(vect_int) #numeric
typeof(vect_int) #integer
vect_complex <- c(0.0 + 1i, 2.5 + 6.0i)
mode(vect_complex) #complex
typeof(vect_complex) #complex

3. 4
Intrinsic attributes: mode and length
vect_text <- c("knight","queen","king")
mode(vect_text) #character
typeof(vect_text) #character
vect_logic <- c(TRUE,3>0,FALSE,0==3)
mode(vect_logic) #logical
typeof(vect_logic) #logical
vect_raw <-raw(3)
mode(vect_raw) #raw
typeof(vect_raw) #raw

3. 5
Intrinsic attributes: mode and length

It is worth noting that vectors can have no elements (be empty), but
they are still characterized by a mode.

vect_empty_num <- numeric(0)
vect_empty_logi <- logical(0)
vect_empty_chr <- character(0)

The same functions can also be used to generate a vector
of a given mode (typically numeric, logical or character) with a
fixed length, whose elements are initialized to a default value
(0, FALSE or "") until they are assigned.

3. 6
Intrinsic attributes: mode and length
vect_empty_num <- numeric(10)
vect_empty_chr <- character(10)

This vector initialization procedure is particularly useful when the
elements are not known a priori, but will be defined later
by the code through a calculation or, as we will see, through a
loop.
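As a preview of how such an initialized vector is typically used, the sketch below fills it inside a for loop. Loops are covered later in the course, so the syntax is shown here only as an illustration; the names n and squares are arbitrary.

```r
n <- 10
squares <- numeric(n)        # ten elements, all initialized to 0

for (i in seq_len(n)) {
  squares[i] <- i^2          # each element is defined during the loop
}

print(squares)
```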

3. 7
Recursive structures and lists

R also works on objects with a mode equal to list.


These are ordered sequences of objects, which are individually
characterized by a potentially different mode.
Lists are not atomic structures, but they are said to be recursive
since the components of a list can be lists themselves.
Other typical structures of the R language are functions and
expressions.
These new objects will be covered in the next modules of the course.
The next slide shows an example dealing with recursion.

3. 8
Recursive structures and lists
# Lists can store objects of different
# nature; as a result, lists are not atomic
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b)
# We can declare lists of lists (object recursion).
xrecursive <- list(x,30)
# We now check the variables stored in memory and their
# classification
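The classification can also be checked from the console. The sketch below inspects the two lists with mode(), is.atomic() and str(), and extracts components with the double square bracket [[ ]], which is presented here only as a preview.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b)
xrecursive <- list(x, 30)

print(mode(x))               # "list"
print(is.atomic(x))          # FALSE: a list is a recursive structure
print(x[[2]])                # second component: the character vector s
print(xrecursive[[1]][[1]])  # first component of the inner list: 2 3 5
str(xrecursive)              # compact view of the nested structure
```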

3. 9
Recursive structures and lists

3.10
Intrinsic attributes

mode(object) and length(object) are intrinsic attributes of an object.
The term “intrinsic” reflects the fact that they define the essence of the
object: every object stored in memory is
characterized by a mode and a length (an empty vector has a
length equal to zero).
Other properties associated with an object are defined
by non-intrinsic attributes.
Attributes of this second category can be explored with the function
attributes(object) and set using attr(object,name) <- value.

3.11
attributes() and attr()
> x <- 1:10

> attr(x,"dim") <- c(2, 5)

> View(x)
> attributes(x)
$dim
[1] 2 5
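An attribute can be read back with attr() and removed by assigning NULL to it. The sketch below follows the dim attribute through these steps; variable names are illustrative.

```r
x <- 1:10
before <- attributes(x)      # NULL: a plain vector has no extra attributes

attr(x, "dim") <- c(2, 5)    # set the attribute: x is now treated as a 2 x 5 matrix
d <- attr(x, "dim")          # read it back: 2 5
m <- is.matrix(x)            # TRUE

attr(x, "dim") <- NULL       # remove the attribute: x is a plain vector again
after <- attributes(x)       # NULL

print(d); print(m)
```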

3.12
Change of mode: Object Coercion

R provides a wide range of functions that allow you to change the
mode of an object.
The usual instructions for this type of operation have the form
as.something().
x_int <- 1:3
x_double <- as.double(x_int)
x_char <- as.character(x_int)

In R, this operation is called coercion or cast.
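Coercion does not always preserve the information. The sketch below collects a few typical cases, including a string that cannot be converted (which yields NA together with a warning, suppressed here) and the implicit coercion performed by c() when modes are mixed. Variable names are illustrative.

```r
i1 <- as.integer(3.9)             # 3: the decimal part is dropped
d1 <- as.numeric("2.5")           # 2.5: the string is a valid number
l1 <- as.logical(c(0, 1, 2))      # FALSE TRUE TRUE: only 0 becomes FALSE

# "abc" is not a number: the result is NA (the warning is suppressed here)
bad <- suppressWarnings(as.numeric("abc"))

# Implicit coercion: mixed elements are converted to the most general mode
mixed <- c(1, "a", TRUE)          # "1" "a" "TRUE"

print(i1); print(d1); print(l1); print(bad); print(mixed)
```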

3.13
The class of an object

R objects have a class, which is viewable through the class() function


For basic objects (like numeric, logical, character or list), the class of
an object coincides with the mode, but for matrix, array, factor and
data.frame, it does not.
This special attribute called class is used to allow advanced
programming using an object oriented programming paradigm.
For example, depending on the class to which an object belongs, the
functions of R can behave differently, characterizing themselves
according to the class to which the object belongs.
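A small sketch may help to see this mechanism at work. First, the built-in summary() is applied to a numeric vector and to a factor; then a toy generic function is defined. The names describe and temperature are hypothetical, invented only for this example, and functions are covered in detail later in the course.

```r
# The same function behaves differently according to the class of its input
x_num <- c(1, 2, 2, 3)
x_fac <- factor(c("a", "b", "b", "c"))
print(summary(x_num))   # minimum, quartiles, mean, maximum
print(summary(x_fac))   # number of observations per level

# Toy generic: describe() and the class "temperature" are hypothetical
describe <- function(obj) UseMethod("describe")
describe.default <- function(obj) "generic object"
describe.temperature <- function(obj) paste(unclass(obj), "degrees")

t1 <- structure(21.5, class = "temperature")
out_t <- describe(t1)   # "21.5 degrees": the method for the class is used
out_n <- describe(3)    # "generic object": no specific method, default is used
print(out_t); print(out_n)
```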

3.14
Class – Mode – Type
x <- c(2.1, 4, 3, 1, 5, 7)
print(class(x)); print(mode(x)); print(typeof(x))
# class: numeric - mode: numeric - type: double
A = matrix(
x, # the data elements
nrow=2, ncol=3, # number of rows and columns
byrow = TRUE) # fill matrix by row
print(class(A)); print(mode(A)); print(typeof(A))
# class: matrix (printed as "matrix" "array" since R 4.0.0) - mode: numeric - type: double

3.15
The class of an object

The unclass() function reduces complex objects (matrix,
array, data.frame, factor) to simpler objects (vector, list).
Removing the class attribute, so that the class coincides with the mode,
removes the specific characterization of the object, so some
functions (such as plot) may behave differently and less appropriately.
However, the handling of the object in memory will be faster.
For instance: if Results belongs to the data frame class, the unclass
function will convert the dataset into an ordinary list.
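The effect can be verified on a small data frame and on a factor. The data in the sketch below are made up for the example.

```r
# Hypothetical data frame with two columns
df <- data.frame(id = 1:3, score = c(30, 28, 18))
print(class(df))            # "data.frame"

plain <- unclass(df)
print(class(plain))         # "list": the columns become the components of a list
print(names(plain))         # "id" "score"

# For a factor, unclass() exposes the underlying integer codes
f <- factor(c("low", "high", "low"))
codes <- unclass(f)         # 2 1 2, with the levels kept as an attribute
print(codes)
```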

3.16
4.
String
String Manipulation
Format
Regular Expression
Rules for the generation of strings

A string is a sequence of characters.
Any value in R between ' ' or " " is a string.
The quotation marks at the beginning and at the end of a string must
be of the same kind, both ' ' or both " ", and therefore not mixed as for instance ' "
The " " can be inserted within a string that begins and ends with ' '
The ' ' can be inserted within a string that begins and ends with " "
The " " cannot be inserted within a string that begins and ends
with " " unless it is escaped with a backslash; the same holds for ' ' within ' '

4.2
Examples of valid and invalid strings
# valid declarations for strings
str1 <- 'Start and end with single quote'
str2 <- "Start and end with double quotes"
str3 <- "single quote ' in between double quotes"
str4 <- 'Double quotes " in between single quote'
# invalid declarations for strings
str5 <- 'Mixed quotes"
str6 <- 'Single quote ' inside single quote'
str7 <- "Double quotes " inside double quotes"
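The last two invalid cases can be made valid by escaping the inner quote with a backslash. A minimal sketch (str8, str9 and n_quote are names introduced here for illustration):

```r
str8 <- "Double quotes \" escaped inside double quotes"
str9 <- 'Single quote \' escaped inside single quotes'
n_quote <- nchar("\"")   # the escaped quote counts as a single character
cat(str8, "\n")          # cat() shows the string without the backslash
```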

4.3
String manipulation – nchar()

The functions that operate on strings are called
string manipulation functions.
R natively provides some useful functions for the manipulation of
strings that will be discussed in this module.
The nchar() function counts the number of characters in a string.
result <- nchar("Asterix and Obelix")
> print(result)
[1] 18

4.4
String manipulation – toupper() and tolower() case
toupper() makes all characters in the string uppercase
tolower() makes all characters in the string lowercase

> result <- toupper("Asterix and Obelix")


> print(result)
[1] "ASTERIX AND OBELIX"
> result <- tolower("Asterix and Obelix")
> print(result)
[1] "asterix and obelix"

4.5
String manipulation – substring()
The substring() function extracts a part of a string.
The basic syntax is:
substring(x,first_index,last_index)
Where x is the input character vector, first_index is the index
position at which to extract the first character, last_index is the
index position at which to extract the last character
> result <- substring("Asterix and Obelix", 13, 18)
> print(result)
[1] "Obelix"

4.6
Numbers and strings formatting – format()

Numbers and strings can be formatted into a character vector
according to a specific style using the format() function.
The basic syntax is:
format(x, digits, nsmall, scientific, width,
justify = c("left", "right", "centre", "none"))
Where x is the input vector (numeric or character), digits is the total
number of significant digits to be displayed, nsmall is the minimum number
of digits to be displayed to the right of the decimal point, scientific,
if TRUE, displays the number using the scientific notation, width is the
minimum width of the output, justify is the alignment of the text.

4.7
Numbers and strings formatting – format()
# Total number of digits to be displayed. Last digit
# will be rounded.
result <- format(pi, digits = 9)
> print(result)
[1] "3.14159265"

# Display of numbers in scientific notation


result <- format(c(42, pi), scientific = TRUE)
> print(result)
[1] "4.200000e+01" "3.141593e+00"

4.8
Numbers and strings formatting – format()
# The minimum number of digits to display to the right
# of the decimal point.
result <- format(3.14, nsmall = 5)
> print(result)
[1] "3.14000"
# To fill the 8 positions of width, white spaces are
# inserted to the left of the number.
result <- format(42, width = 8)
> print(result)
[1] "      42"

4.9
Numbers and strings formatting – format()
# Left justify strings.
result <- format("Idefix", width = 10, justify = "l")
> print(result)
[1] "Idefix    "

# Center justify strings.


result <- format("Idefix", width = 10, justify = "c")
> print(result)
[1] "  Idefix  "

4.10
String concatenation – paste()
Strings can be combined together using the paste() function.
The basic syntax is:
paste(..., sep = " ", collapse = NULL)

... stands for the character vectors to be concatenated.
sep is the string inserted between the character vectors being
concatenated (a single space by default).
collapse is an optional string: when it is not NULL, the elements of the
resulting vector are joined into one single string, separated by collapse.

4.11
String concatenation – paste()
a <- "Asterix"
b <- 'and'
c <- 'Obelix'
d <- "by Uderzo! "
> print(paste(a,b,c,d))
[1] "Asterix and Obelix by Uderzo! "
> print(paste(a,b,c,d, sep = "**"))
[1] "Asterix**and**Obelix**by Uderzo! "
> print(paste(a,b,c,d, sep = "", collapse = ""))
[1] "AsterixandObelixby Uderzo! "
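The effect of collapse is easier to see on a vector with several elements. The sketch below uses a hypothetical vector called heroes; with single strings, as in the example above, collapse has nothing to join.

```r
heroes <- c("Asterix", "Obelix", "Idefix")
v1 <- paste(heroes, "the Gaul")             # sep works element by element: 3 strings
v2 <- paste(heroes, collapse = ", ")        # collapse joins the elements: 1 string
v3 <- paste0(heroes, 1:3, collapse = "-")   # paste0() is paste() with sep = ""
```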

4.12
String splitting – strsplit()

A string can be split into a list of substrings using the
strsplit() function.
The basic syntax is:
strsplit(x,split) with x being the string to be divided and
split being the character vector to be used for the partition.
str1 <- "Asterix and Obelix"
result <- strsplit(str1, " ")
> str(result)
List of 1
$ : chr [1:3] "Asterix" "and" "Obelix"
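Because strsplit() returns a list (one component per input string), the words of a single string are usually extracted with [[1]]. A minimal sketch, where the second input vector is made up for illustration:

```r
str1 <- "Asterix and Obelix"
words <- strsplit(str1, " ")[[1]]          # character vector with the three words
n_words <- length(words)

many <- strsplit(c("a-b", "c-d-e"), "-")   # two input strings give a list of 2
lens <- lengths(many)                      # number of pieces per input string
```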

4.13
Substring replacement – sub() and gsub()
The sub() and gsub() functions substitute a substring with
another one, thus implementing a replacement.
The syntax is:
sub(old_substring, new_substring, string)
gsub(old_substring, new_substring, string)

The first instruction replaces old_substring with new_substring in
string only at the first occurrence, while the second command replaces
all the occurrences.

4.14
Substring replacement – sub() and gsub()
# The sub() function
sentence <- "Savona is a seaside town. Savona is located in Liguria"

print(sentence)
[1] "Savona is a seaside town. Savona is located in Liguria"

sub("Savona", "Genova", sentence)
[1] "Genova is a seaside town. Savona is located in Liguria"

4.15
Substring replacement – sub() and gsub()
# The gsub() function
sentence <- "Savona is a seaside town. Savona is located in Liguria"

print(sentence)
[1] "Savona is a seaside town. Savona is located in Liguria"

gsub("Savona", "Genova", sentence)
[1] "Genova is a seaside town. Genova is located in Liguria"

4.16
Regular expression with R
«A regular expression, regex or regexp (sometimes called a rational
expression) is a sequence of characters that define a search pattern.
Usually such patterns are used by string searching algorithms for "find" or
"find and replace" operations on strings, or for input validation. It is a
technique developed in theoretical computer science and formal language
theory.» (Wikipedia)

R has a number of basic functions capable of dealing with Regular
Expressions. In regular expressions, the best practice is to specify the
parameter perl=TRUE which allows the R engine to use the PCRE Regular
Expressions Library. All the functions concerning the Regular Expressions
are case sensitive by default (ignore.case=FALSE).

4.17
Regex matches in string vector – grep()
The grep() function takes the regular expression (regex) as first input
argument and a string vector as second input parameter.
If you specify the value=FALSE parameter, grep() returns a new vector
with the indices of the elements that satisfy the regular expression.
If you specify value=TRUE, grep() returns a vector with a copy of the
elements of the original one for which the regular expression is verified.
grep("ix", c("Asterix", "Obelix", "Panoramix",
"Beniamina", "Falbalà", "Ordinalfabetix"), perl=TRUE,
value=FALSE)
[1] 1 2 3 6

4.18
Regex matches in string vector – grepl()
grep("ix", c("Asterix", "Obelix", "Panoramix",
"Beniamina", "Falbalà", "Ordinalfabetix"), perl=TRUE,
value=TRUE)
[1] "Asterix" "Obelix" "Panoramix" "Ordinalfabetix"

The grepl() function has the same input arguments as grep(), except for
value=, which grepl() does not accept.

grepl() returns a logical vector of the same length as the vector of input
strings: the elements valued at TRUE correspond to the indexes such that the
regular expression is verified. Elements with FALSE correspond to indices for
which it is not verified.

4.19
Regex matches in string vector – regexpr()
grepl("ix", c("Asterix", "Obelix", "Panoramix",
"Beniamina", "Falbalà", "Ordinalfabetix"), perl=TRUE)
[1] TRUE TRUE TRUE FALSE FALSE TRUE
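The logical vector returned by grepl() is typically used to subset the original vector. The sketch below also shows ignore.case=TRUE; the vector villagers is a shortened, modified copy of the example data (one name is written in capitals on purpose).

```r
villagers <- c("Asterix", "Obelix", "PANORAMIX", "Beniamina")
hit_cs <- grepl("ix", villagers, perl = TRUE)                      # case sensitive
hit_ci <- grepl("ix", villagers, perl = TRUE, ignore.case = TRUE)  # case insensitive
selected <- villagers[hit_ci]                                      # logical indexing
```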
The regexpr() function has the same input arguments as grepl(). It returns
an integer vector that gives, for each input string, the position of the
first character of the first match.
If there is no match, the corresponding element is -1.
The returned vector carries a match.length attribute. The latter is a
vector of integers with the number of characters of the first match found
in each string (-1 where there is no match).

4.20
Regex matches in string vector – regexpr()
regexpr("ix", c("Asterix", "Obelix", "Panoramix",
"Beniamina","Falbalà","Ordinalfabetix"), perl=TRUE)
[1] 6 5 8 -1 -1 13
attr(,"match.length")
[1] 2 2 2 -1 -1 2
regexpr("al", c("Asterix", "Obelix", "Panoramix",
"Beniamina","Falbalà","Ordinalfabetix"), perl=TRUE)
[1] -1 -1 -1 -1 2 6
attr(,"match.length")
[1] -1 -1 -1 -1 2 2

4.21
Regex matches in string vector – gregexpr()
The gregexpr() function has the same task as regexpr() except that it
finds all matches and not just the first one.
> gregexpr("al", c("Asterix", "Obelix", "Panoramix",
"Beniamina", "Falbalà", "Ordinalfabetix"), perl=TRUE)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
...
[[5]]
[1] 2 5
attr(,"match.length")
[1] 2 2
...

4.22
Regex matches in string vector – regmatches()
You use the regmatches() function to get substrings that match the
regular expression.
As the first argument, we use the same input that is passed to regexpr()
or gregexpr().
Regarding the second argument, we pass the output vector returned by
regexpr() or gregexpr(). For instance:
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
> regmatches(x, m)
[1] "a" "a" "aa"

4.23
Regex matches in string vector – regmatches()
m <- gregexpr("a+", x, perl=TRUE)
> regmatches(x, m)
[[1]]
[1] "a"
[[2]]
character(0)
[[3]]
[1] "a" "a"
[[4]]
[1] "aa"

More information on Regular Expressions can be found on the website:
https://www.regular-expressions.info/

4.24
gsub() function supports the syntax for Regular Expression
Remembering the syntax of gsub():
gsub(old_substring, new_substring, string)
The first input argument can be a regular expression.
The following example eliminates the numeric digits in a string using a
typical syntax of Regular Expressions:
sentence <- "The postal code of Savona is 17100"
print(sentence)
[1] "The postal code of Savona is 17100"
gsub("[0-9]*", "", sentence)
[1] "The postal code of Savona is "
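A few variations on the same idea, as a sketch: the quantifier + (one or more) is the more common choice for removing runs of digits, and a second call can strip the trailing blank that remains. The variable names are introduced here for illustration.

```r
sentence <- "The postal code of Savona is 17100"
no_digits <- gsub("[0-9]+", "", sentence, perl = TRUE)  # remove each run of digits
trimmed <- gsub("\\s+$", "", no_digits, perl = TRUE)    # remove trailing whitespace
masked <- gsub("[0-9]", "#", sentence, perl = TRUE)     # replace every single digit
```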

4.25
5.
Factors
Levels
Ordered factors
tapply
Factors

Factors are useful objects for categorizing data and providing for the
discrete classification of the components of a vector.
These objects are very useful in analyzing categorical data and for
statistical modeling.
Factors are handled by R as integers, but they are typically
represented by a textual label.
Although factors are presented to the user in the form of character
strings (and they sometimes even behave as such), in fact they are
characterized by a numerical nature and therefore require particular
attention in their management.

5.2
Factors

Factors are defined in R with the factor() function and they can only
contain predefined values known in statistical analysis with the term
levels.
The number of levels that characterizes a factor can be displayed
using the nlevels() function.
Conversely, levels() displays the values that the factors can assume.
Let us consider, as an example, the gender factor which includes two
levels: Male and Female.

5.3
Factors
sex<-c("Male", "Female", "Male","Male","Female")
> print(sex)
[1] "Male" "Female" "Male" "Male" "Female"
sexF <- factor(sex)
> print(sexF)
[1] Male Female Male Male Female
Levels: Female Male

5.4
Factors
> levels(sexF)
[1] "Female" "Male"

> nlevels(sexF)
[1] 2

RStudio displays the variables stored in memory in the “Environment”
tab and it classifies the levels in numeric terms.
In particular, it associates 1 to the "Female" level and 2 to the "Male"
level, in alphabetical order.
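The integer codes behind a factor can be inspected directly. The sketch below also shows a common pitfall when the labels look like numbers; f_num is hypothetical data added for illustration.

```r
sex <- c("Male", "Female", "Male", "Male", "Female")
sexF <- factor(sex)
codes <- as.integer(sexF)     # integer codes: Female = 1, Male = 2
counts <- table(sexF)         # number of observations per level

# Pitfall when the labels look like numbers (hypothetical data)
f_num <- factor(c("10", "20", "10"))
wrong <- as.integer(f_num)                 # returns the codes, not the labels
right <- as.numeric(as.character(f_num))   # returns the labels as numbers
```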

5.5
Ordered Factors

Sometimes the order of the factors is not important, but it may be
useful to specify it in certain contexts.
Take as an example the result of a survey that provides three levels
of customer satisfaction: high, medium, low.
R allows this specification using the ordered() function or more easily
through the parameter levels = character vector inside the
factor() function.
Customer_satisfaction <- factor(c("medium", "low", "high",
"high", NA))
levels(Customer_satisfaction)

5.6
Ordered Factors
> levels(Customer_satisfaction)
[1] "high" "low" "medium"

The order in which the factor levels are presented is by default
alphabetical.

To specify the correct order of the levels, you have to explicitly define
the optional parameter of the factor function: levels.

5.7
Ordered Factors
Customer_satisfaction <- factor(c("medium", "low", "high",
"high", NA), levels=c("high", "medium", "low"))
> levels(Customer_satisfaction)
[1] "high" "medium" "low"

By operating in this way each level label is associated with the
intended integer:
1 – high, 2 – medium and 3 – low.

5.8
Ordered Factors

Be careful: in order to assign hierarchical importance to the levels
(and therefore not only define the correct labeling) it is necessary to
define a second optional parameter which is ordered = logical
value.
Customer_satisfaction <- factor(c("medium", "low", "high",
"high", NA), levels=c("low", "medium", "high"),
ordered=TRUE)
> min(Customer_satisfaction)
[1] <NA>
Levels: low < medium < high
> min(Customer_satisfaction[!is.na(Customer_satisfaction)])
[1] low
Levels: low < medium < high
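Once the factor is ordered, comparison operators become meaningful, and missing values can be skipped with na.rm=TRUE instead of filtering them by hand. A minimal sketch (the variable name satisfaction is an assumption made to keep the example self-contained):

```r
satisfaction <- factor(c("medium", "low", "high", "high", NA),
                       levels = c("low", "medium", "high"), ordered = TRUE)
above_low <- satisfaction > "low"          # comparisons follow the order of the levels
worst <- min(satisfaction, na.rm = TRUE)   # na.rm = TRUE skips the missing value
best <- max(satisfaction, na.rm = TRUE)
```

On an unordered factor the same comparison would only produce NA values and a warning.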

5.9
The tapply() function

Suppose you have four teams (RAV, GRY, HUF, SLY) and their
scores obtained in different tests.

RAV  GRY  HUF  SLY
 10   30    7   13
 20    8   12   20
 15              5
                 8

A possible representation using factors could be:

5.10
The tapply() function
players <- c("RAV","GRY","HUF","SLY","RAV","GRY", "HUF",
"SLY","RAV","SLY","SLY")
scores <- c(10,30,7,13,20,8,12,20,15,5,8)
player_fact <- factor(players)
To apply a function (such as the mean) to the vector of scores
grouped by factor, you generally use tapply()
scoresAVG <- tapply(scores,player_fact,mean)
> print(scoresAVG)
GRY HUF RAV SLY
19.0 9.5 15.0 11.5

5.11
The tapply() function
scoresMAX<- tapply(scores,player_fact,max)
> print(scoresMAX)
GRY HUF RAV SLY
30 12 20 20
scoresMIN<- tapply(scores,player_fact,min)
> print(scoresMIN)
GRY HUF RAV SLY
8 7 10 5
scoresSTDEV<- tapply(scores,player_fact,sd)
> print(scoresSTDEV)
GRY HUF RAV SLY
15.556349 3.535534 5.000000 6.557439
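tapply() is not limited to built-in summaries: any function of a vector can be passed, including an anonymous one. The sketch below counts the tests per team and computes the score range; n_tests and score_range are names introduced here.

```r
players <- c("RAV","GRY","HUF","SLY","RAV","GRY","HUF","SLY","RAV","SLY","SLY")
scores <- c(10, 30, 7, 13, 20, 8, 12, 20, 15, 5, 8)
player_fact <- factor(players)

n_tests <- tapply(scores, player_fact, length)                            # tests per team
score_range <- tapply(scores, player_fact, function(s) max(s) - min(s))  # max minus min
```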

5.12
6.
Matrix & Array
Bidimensional vectors
Matrix operations
Multidimensional vectors
Matrices

Matrices are R objects in which the elements are arranged according
to a rectangular layout.
Elements in matrices must be primitive (atomic) with the same mode.
Although matrices can also be of a logical or textual nature, they are
normally used for the representation of numerical data in two
dimensions.
The reference function in R for the creation of a matrix object is
matrix() .

6.2
matrix()

The basic syntax is:
matrix(data, nrow, ncol, byrow, dimnames)
Where:
data is the input vector whose elements will fill the matrix
nrow is the number of rows of the matrix
ncol is the number of columns of the matrix
byrow is a logical value: if TRUE, the elements of the input vector
are arranged along the rows; if FALSE (the default), along the columns
dimnames is a list with the names to be assigned to the rows and columns

6.3
matrix()
data <- c(1,5,-1,8,4,3)
A1 <- matrix(data, nrow=3, ncol=2, byrow=TRUE)
A2 <- matrix(data, nrow=3, ncol=2, byrow=FALSE)
View(A1); View(A2)

# A1: byrow = TRUE     A2: byrow = FALSE

6.4
dimnames

We want to represent in matrix form the table containing the scores
for four teams.
players <- c("RAV","GRY","HUF","SLY")
matches <- c("September","December","February")
scores <- c(10,8,6,9,7,7,7,9,9,10,8,5)

MatrixResults <- matrix(scores,
nrow=3,ncol=4,byrow=TRUE,
dimnames = list(matches,players))
#list() defines a list object

          RAV GRY HUF SLY
September  10   8   6   9
December    7   7   7   9
February    9  10   8   5

6.5
dimnames
> print(MatrixResults)
> View(MatrixResults)

          RAV GRY HUF SLY
September  10   8   6   9
December    7   7   7   9
February    9  10   8   5

6.6
Access the elements of a matrix

The selection of elements in a matrix is very similar to the selection
of elements in vectors.
Durer_numbers<-c(16,3,2,13,5,10,11,8,9,6,7,12,4,15,14,1)
A <- matrix(Durer_numbers,nrow=4,ncol=4,byrow=TRUE)
> print(A)
[,1] [,2] [,3] [,4]
[1,] 16 3 2 13
[2,] 5 10 11 8
[3,] 9 6 7 12
[4,] 4 15 14 1

6.7
Access the elements of a matrix

To access the elements of a matrix you can use the traditional notation
A[i,j] where i is the index for the rows and j for the columns.
As a result, for the selection of the number 11 in the matrix, the
indexes must be i=2 and j=3.
> A[2,3]
An alternative method is to use the notation A[k], where k is the
position of the element when the matrix is read column by column.
> A[10]
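The link between the two notations can be sketched as follows: with nrow(A) rows, the element in row i and column j sits at position k = (j - 1) * nrow(A) + i. The names i, j, k, same and pos are introduced here for illustration.

```r
Durer_numbers <- c(16,3,2,13,5,10,11,8,9,6,7,12,4,15,14,1)
A <- matrix(Durer_numbers, nrow = 4, ncol = 4, byrow = TRUE)

i <- 2; j <- 3
k <- (j - 1) * nrow(A) + i              # position counted column by column
same <- A[i, j] == A[k]                 # both notations select the number 11
pos <- which(A == 11, arr.ind = TRUE)   # the reverse: row and column of a value
```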

6.8
Access a set of elements of a matrix

To select a contiguous range of numbers, index vectors can be used
on each dimension of the matrix.
> print(A)
     [,1] [,2] [,3] [,4]
[1,]   16    3    2   13
[2,]    5   10   11    8
[3,]    9    6    7   12
[4,]    4   15   14    1
Code for the range selection
> A[2:3,2:4]
     [,1] [,2] [,3]
[1,]   10   11    8
[2,]    6    7   12

6.9
Access an entire row of a matrix

In order to select all the elements of a row, you can use the notation
A[i,]

> print(A)
     [,1] [,2] [,3] [,4]
[1,]   16    3    2   13
[2,]    5   10   11    8
[3,]    9    6    7   12
[4,]    4   15   14    1
Code for the range selection
> A[2,]
[1] 5 10 11 8
Which is equivalent to:
> A[2,1:ncol(A)]

6.10
Access an entire column of a matrix

In order to select all the elements of a column, you can use the
notation A[,j]
> print(A)
     [,1] [,2] [,3] [,4]
[1,]   16    3    2   13
[2,]    5   10   11    8
[3,]    9    6    7   12
[4,]    4   15   14    1
Code for the range selection
> A[,3]
[1] 2 11 7 14
Which is equivalent to:
> A[seq(1,nrow(A)),3]

6.11
Properties of the Durer matrix – sum of the rows
[,1] [,2] [,3] [,4]
[1,] 16 3 2 13 34
[2,] 5 10 11 8 34
[3,] 9 6 7 12 34
[4,] 4 15 14 1 34

# Sum of rows
RowsSum <-
c(sum(A[1,]),sum(A[2,]),sum(A[3,]),sum(A[4,]))

6.12
Properties of the Durer matrix – sum of the columns
[,1] [,2] [,3] [,4]
[1,] 16 3 2 13
[2,] 5 10 11 8
[3,] 9 6 7 12
[4,] 4 15 14 1
34 34 34 34 ∑
# Sum of columns
ColumnsSum <-
c(sum(A[,1]),sum(A[,2]),sum(A[,3]),sum(A[,4]))
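The same sums can be obtained without listing every row or column by hand. A minimal sketch using rowSums(), colSums() and apply(); the name RowsSumApply is introduced here for illustration.

```r
Durer_numbers <- c(16,3,2,13,5,10,11,8,9,6,7,12,4,15,14,1)
A <- matrix(Durer_numbers, nrow = 4, ncol = 4, byrow = TRUE)

RowsSum <- rowSums(A)              # one sum per row
ColumnsSum <- colSums(A)           # one sum per column
RowsSumApply <- apply(A, 1, sum)   # apply() with margin 1 works row by row
```

apply() is the more general tool: replacing sum with max, for example, gives the largest element of each row.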

6.13
Properties of the Durer matrix – sum of diagonals
# Sum of the main diagonal
Trace <- A[1,1] + A[2,2] + A[3,3] +A[4,4]
# or in a more elegant way
Trace <- sum(diag(A))
# Sum of the main antidiagonal
AntidiagSum <- A[1,4] + A[2,3] + A[3,2] +A[4,1]
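Furthermore, since the determinant is null, the Durer matrix A is not
invertible.
det(A) # det() computes the determinant of a matrix (zero here, up to rounding error)
solve(A) # solve() computes the inverse matrix; for a singular matrix such as A it is expected to stop with an error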
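In floating-point arithmetic the determinant of a singular matrix is rarely exactly zero, so it is safer to compare it with a small tolerance or to look at the rank. The sketch below is one possible check; the tolerance 1e-8 is an arbitrary assumption, and tryCatch() is used so that a failing solve() does not interrupt the script.

```r
Durer_numbers <- c(16,3,2,13,5,10,11,8,9,6,7,12,4,15,14,1)
A <- matrix(Durer_numbers, nrow = 4, ncol = 4, byrow = TRUE)

near_zero <- abs(det(A)) < 1e-8    # determinant is zero up to rounding error
rank_A <- qr(A)$rank               # rank 3 < 4 confirms that A is singular
inv_A <- tryCatch(solve(A),        # NULL if solve() stops with an error
                  error = function(e) NULL)
```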

6.14
Properties of the Durer matrix – eigenvalues

The eigenvalues are also interesting: one of them is zero
(a consequence of the singularity of A) and the largest one is 34,
the common sum of rows and columns.
> eigen(A)
eigen() decomposition
$values
[1] 3.400000e+01 8.000000e+00 -8.000000e+00 4.848185e-17
In computer science 3.4e+01 means 34 in accordance with the
scientific notation: 3.4 × 10^1. Consequently 4.8e-17 is, for all
practical purposes, 0.

6.15
Properties of the Subirachs matrix

Another famous 4x4 matrix is the Subirachs matrix:


Subirachs_numbers<-
c(1,14,14,4,11,7,6,9,8,10,10,5,13,2,3,15)
B <- matrix(Subirachs_numbers,nrow=4,ncol=4,byrow=TRUE)
> print(B)
[,1] [,2] [,3] [,4]
[1,] 1 14 14 4
[2,] 11 7 6 9
[3,] 8 10 10 5
[4,] 13 2 3 15

6.16
Properties of the Subirachs matrix

The sum of rows, the sum of columns, the trace and the sum of the
elements on the main antidiagonal are all equal, as for the Durer
matrix, but the common value is 33.
However, only the Durer matrix has the property of not being invertible.
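These properties can be verified numerically. A minimal sketch, where the antidiagonal is selected with a two-column index matrix and the helper names (anti, det_B, B_inv) are introduced here:

```r
Subirachs_numbers <- c(1,14,14,4,11,7,6,9,8,10,10,5,13,2,3,15)
B <- matrix(Subirachs_numbers, nrow = 4, ncol = 4, byrow = TRUE)

all_rows_33 <- all(rowSums(B) == 33)
all_cols_33 <- all(colSums(B) == 33)
trace_B <- sum(diag(B))
anti <- sum(B[cbind(1:4, 4:1)])   # elements (1,4), (2,3), (3,2), (4,1)

det_B <- det(B)                   # non-zero, so B can be inverted
B_inv <- solve(B)                 # B %*% B_inv is the identity up to rounding
```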

6.17
Transpose a matrix

We now consider the 4x4 matrix B.
You can compute the transposed matrix of B using the function t():
Btranspose <- t(B)
> print(B)
     [,1] [,2] [,3] [,4]
[1,]    1   14   14    4
[2,]   11    7    6    9
[3,]    8   10   10    5
[4,]   13    2    3   15
> print(Btranspose)
     [,1] [,2] [,3] [,4]
[1,]    1   11    8   13
[2,]   14    7   10    2
[3,]   14    6   10    3
[4,]    4    9    5   15

6.18
Operations between matrices

Let us consider matrix A of Durer and matrix B of Subirachs.
The sum and the difference between matrices is not a problem as it
is an operation that is carried out element by element.
Thus A + B means that each element a[i,j] is added to the
element b[i,j]. The same reasoning holds for A - B.
The matrix product of linear algebra is not computed element by
element: the rules of the row by column product must be applied.
The element-by-element product (also known as Hadamard product) is
coded as A * B, while the row by column (and more popular) product is
coded as A %*% B.
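The difference between the two products is easy to check by hand on small matrices. The 2x2 matrices P and Q below are hypothetical and chosen only to keep the arithmetic simple.

```r
P <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
Q <- matrix(c(0, 1, 1, 0), nrow = 2, byrow = TRUE)

hadamard <- P * Q     # element by element: rows (0, 2) and (3, 0)
rowcol <- P %*% Q     # row by column:      rows (2, 1) and (4, 3)
reversed <- Q %*% P   # rows (3, 4) and (1, 2): the matrix product is not commutative
```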

6.19
Operations between matrices
Durer_numbers<-c(16,3,2,13,5,10,11,8,9,6,7,12,4,15,14,1)
A <- matrix(Durer_numbers,nrow=4,ncol=4,byrow=TRUE)
Subirachs_numbers<-
c(1,14,14,4,11,7,6,9,8,10,10,5,13,2,3,15)
B <- matrix(Subirachs_numbers,nrow=4,ncol=4,byrow=TRUE)
> print(A) > print(B)
[,1] [,2] [,3] [,4] [,1] [,2] [,3] [,4]
[1,] 16 3 2 13 [1,] 1 14 14 4
[2,] 5 10 11 8 [2,] 11 7 6 9
[3,] 9 6 7 12 [3,] 8 10 10 5
[4,] 4 15 14 1 [4,] 13 2 3 15

6.20
Operations between matrices
> print(A+B) > print(A-B)
[,1] [,2] [,3] [,4] [,1] [,2] [,3] [,4]
[1,] 17 17 16 17 [1,] 15 -11 -12 9
[2,] 16 17 17 17 [2,] -6 3 5 -1
[3,] 17 16 17 17 [3,] 1 -4 -3 7
[4,] 17 17 17 16 [4,] -9 13 11 -14
> print(A*B) > print(A%*%B)
[,1] [,2] [,3] [,4] [,1] [,2] [,3] [,4]
[1,] 16 42 28 52 [1,] 234 291 301 296
[2,] 55 70 66 72 [2,] 307 266 264 285
[3,] 72 60 70 60 [3,] 287 262 268 305
[4,] 52 30 42 15 [4,] 294 303 289 236
6.21
Merging matrices: cbind() and rbind()

Matrices can be joined together maintaining the same number of
rows (cbind) or the same number of columns (rbind).
cbind allows for example two matrices A, B characterized by the
same number of rows (nrow(A)=nrow(B)) to be joined into a single
matrix of dimension nrow(A) x (ncol(A) + ncol(B)).
rbind allows two matrices A, B characterized by the same number of
columns (ncol(A)=ncol(B)) to be joined into a single matrix of
dimension (nrow(A) + nrow(B)) x ncol(A). If nrow(A)!=nrow(B), cbind()
stops with an error and no matrix is returned; the same happens for
rbind() if ncol(A)!=ncol(B).

6.22
Merging matrices: cbind() and rbind()
Durer_numbers<-c(16,3,2,13,5,10,11,8,9,6,7,12,4,15,14,1)
A <- matrix(Durer_numbers,nrow=4,ncol=4,byrow=TRUE)
Subirachs_numbers<-
c(1,14,14,4,11,7,6,9,8,10,10,5,13,2,3,15)
B <- matrix(Subirachs_numbers,nrow=4,ncol=4,byrow=TRUE)
C <- matrix(rep(c(1,2),each=4),nrow=2,ncol=4,byrow=TRUE)
D <- t(C)
> View(A); View(B)
> View(C); View(D)

6.23
Merging matrices: cbind() and rbind()

6.24
Merging matrices: cbind() and rbind()
> print(cbind(A,B))

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]


[1,] 16 3 2 13 1 14 14 4
[2,] 5 10 11 8 11 7 6 9
[3,] 9 6 7 12 8 10 10 5
[4,] 4 15 14 1 13 2 3 15

A: nrow=4 x ncol=4 B: nrow=4 x ncol=4

6.25
Merging matrices: cbind() and rbind()
> print(rbind(A,B))
[,1] [,2] [,3] [,4]
[1,] 16 3 2 13
[2,] 5 10 11 8
[3,] 9 6 7 12
[4,] 4 15 14 1
[5,] 1 14 14 4
[6,] 11 7 6 9
[7,] 8 10 10 5
[8,] 13 2 3 15

Rows 1-4: A (nrow=4 x ncol=4)     Rows 5-8: B (nrow=4 x ncol=4)

6.26
Merging matrices: cbind() and rbind()
> print(cbind(A,D))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 16 3 2 13 1 2
[2,] 5 10 11 8 1 2
[3,] 9 6 7 12 1 2
[4,] 4 15 14 1 1 2

Columns 1-4: A (nrow=4 x ncol=4)     Columns 5-6: D (nrow=4 x ncol=2)

6.27
Merging matrices: cbind() and rbind()
> print(rbind(A,C))
[,1] [,2] [,3] [,4]
[1,] 16 3 2 13
[2,] 5 10 11 8
[3,] 9 6 7 12
[4,] 4 15 14 1
[5,] 1 1 1 1
[6,] 2 2 2 2

Rows 1-4: A (nrow=4 x ncol=4)     Rows 5-6: C (nrow=2 x ncol=4)

6.28
Merging matrices: cbind() and rbind()
> print(cbind(A,C))
Error in cbind(A, C) : number of rows of matrices must
match (see arg 2)

A: nrow=4 x ncol=4 C: nrow=2 x ncol=4

> print(rbind(A,D))
Error in rbind(A, D) :
number of columns of matrices must match (see arg 2)

A: nrow=4 x ncol=4 D: nrow=4 x ncol=2

6.29
Array: multidimensional vectors

Arrays are R objects in which data with more than two dimensions
can be stored.
For example, if you want to create an array of dimension (2,3,4) it
means that 4 matrices will be created having each two rows and
three columns.
An array is generated through the array() function and takes as input
arguments a vector in which the data will be arranged according to
the dimensional specifications contained in the second input: the dim
parameter.
Elements in an array must have the same mode.

6.30
Indexing a three-dimensional array (a tensor)

6.31
Array: implementation example
EvenSequence <- seq(from=2, to=48, by=2)
Z <- array(EvenSequence, c(2,3,4))
> print(Z)
, , 1

     [,1] [,2] [,3]
[1,]    2    6   10
[2,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   14   18   22
[2,]   16   20   24

, , 3

     [,1] [,2] [,3]
[1,]   26   30   34
[2,]   28   32   36

, , 4

     [,1] [,2] [,3]
[1,]   38   42   46
[2,]   40   44   48
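Elements of a three-dimensional array are selected with three indices, Z[i,j,k], where k identifies the matrix. A short sketch (the names last, mid and second are introduced here):

```r
EvenSequence <- seq(from = 2, to = 48, by = 2)
Z <- array(EvenSequence, c(2, 3, 4))

last <- Z[2, 3, 4]   # row 2, column 3 of the fourth matrix
mid <- Z[1, 2, 3]    # row 1, column 2 of the third matrix
second <- Z[, , 2]   # the whole second matrix (2 x 3)
dims <- dim(Z)       # 2 3 4
```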

6.32
7.
Lists & Data frames
List Components
Data Frame Object
Attach & Detach
Lists

A list is an R object which consists of an ordered collection of objects,
called components.
The components do not necessarily need to have the same mode or
be of the same type.
A list, therefore, could contain a numeric vector, a logical value, a
matrix, a string of characters...
The function that allows the creation of a list object is list().
MyList <- list(name="Emmett Brown", wife="Clara Clayton",
no.children=2, child.names=c("Giulio","Verne"))

7.2
Lists
print(MyList)
View(MyList)
$name
[1] "Emmett Brown"
$wife
[1] "Clara Clayton"
$no.children
[1] 2
$child.names
[1] "Giulio" "Verne"

7.3
Lists - Components

The components that make up a list are characterized by internal
numbering that allows the list to be indexed.
Therefore, if the list is characterized by four components, as in the
example, these can be recalled with the syntax:
MyList[[1]] , MyList[[2]] , MyList[[3]] , MyList[[4]]
Furthermore, if a component of the list is a vector, one of its
elements can be accessed directly with the syntax:
MyList[[4]][1]
In this case the output will be "Giulio"

7.4
Lists - Components
> print(MyList[[1]])
[1] "Emmett Brown"

> print(MyList[[4]][1])
[1] "Giulio"

The length() function applied to a list returns the number of its
components.
> print(length(MyList))
[1] 4

7.5
Lists - Components

The elements of a list can also be accessed more intuitively by using
the name of the component itself.
The syntax to be used is: list_name$component_name
Both MyList[[1]] and MyList$name return as output:
> MyList$name
[1] "Emmett Brown"
Similarly MyList[[4]][1] is equal to:
> MyList$child.names[1]
[1] "Giulio"

7.6
Lists - Components

Furthermore, R allows the names of the components of the list to be
used within the double square brackets.
As a result MyList[["name"]] is equivalent to MyList$name.
Be careful: it is important to be clear about the difference between
MyList[[1]] and MyList[1]

The operator [[…]] is used for the selection of the single element of
the list, while […] is used as a general sub-scripting operator.
[[…]] allows the access to an object stored in a list.

7.7
Lists – [[…]] and […]
[…] extracts a part of a list (sublist) which is a list itself. This is the
method for slicing a list.
MyList <- list(name="Emmett Brown",
wife="Clara Clayton",
no.children=2,
child.names=c("Giulio","Verne"))
element <- MyList[[1]]
sublist <- MyList[1]
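The difference between the two operators shows up in the class of the result. A minimal sketch; the name slice is introduced here for illustration.

```r
MyList <- list(name = "Emmett Brown", wife = "Clara Clayton",
               no.children = 2, child.names = c("Giulio", "Verne"))

element <- MyList[[1]]               # the object stored in the first component
sublist <- MyList[1]                 # a list of length 1 containing that component
slice <- MyList[c("name", "wife")]   # [ ] can select several components at once

class_element <- class(element)      # "character"
class_sublist <- class(sublist)      # "list"
```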

7.8
Lists – Edit and add components in a list

Component names can be abbreviated to the minimum number of
letters needed to identify the field name.
> MyList$w
[1] "Clara Clayton"
Once a list object has been generated, it can be modified:
# changing the value of a component of the list
> MyList$name[1] <- "Emmett Doc Brown"
> print(MyList[[1]])
[1] "Emmett Doc Brown"

7.9
Lists – Edit and add components in a list
# renaming a component of the list
> names(MyList)[1] <- "Person"
> print(MyList[1])
$Person
[1] "Emmett Doc Brown"

# adding a new component in the list


MyList["car"]<-"DeLorean DMC-12"
> str(MyList) # display the list structure

7.10
Lists – Edit and add components in a list
# adding elements in a list component
> MyList$Person[2] <- "Marty McFly"
> str(MyList)
List of 5
$ Person : chr [1:2] "Emmett Doc Brown" "Marty McFly"
$ wife : chr "Clara Clayton"
$ no.children: num 2
$ child.names: chr [1:2] "Giulio" "Verne"
$ car : chr "DeLorean DMC-12"

7.11
Lists – Concatenation

Imagine that the Finance Administration office is divided into two
sub-departments:

- Financial Engineering Desk
ID     Name      Degree
12123  Paolo     Computer Science
23234  Pier      Engineering
67678  Andrea    Mathematics
78789  Nicola    Economics

- Market Risk Reporting Desk
ID     Name      Degree
34345  Matteo    Economics
45456  Fabio     Economics
56567  Marcello  Economics

7.12
Lists – Concatenation
FinAdminDesk <-
list(ID=c(12123,23234,67678,78789,34345,45456,56567),
desk=c(rep("FinEng",4),rep("MktRiskReport",3)))

FinAdminDegree <-
list(name=c("Paolo","Pier","Andrea","Nicola","Matteo",
"Fabio","Marcello"),
degree=c("Computer Science","Engineering","Mathematics",
rep("Economics",each=4)))

FinAdmin <- c(FinAdminDesk,FinAdminDegree)
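Concatenating with c() merges the components of the two lists into one flat list, whereas list() would nest them. A reduced sketch with two hypothetical lists, part1 and part2, built from a subset of the data above:

```r
part1 <- list(ID = c(12123, 34345), desk = c("FinEng", "MktRiskReport"))
part2 <- list(name = c("Paolo", "Matteo"), degree = c("Computer Science", "Economics"))

both <- c(part1, part2)       # one list with four components
nested <- list(part1, part2)  # a list of two lists: nothing is merged
```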

7.13
Lists – Concatenation

7.14
List of lists

A list can have one or more lists among its components.


CourseInformation <-
list(name="Software R",
list(professor="Giribone",
mail="[email protected]",
mobile="338/6343454"),
degree="EDS")

> str(CourseInformation)

7.15
List of lists
> str(CourseInformation)

List of 3
$ name : chr "Software R"
$ :List of 3
..$ professor: chr "Giribone"
..$ mail : chr "[email protected]"
..$ mobile : chr "338/6343454"
$ degree: chr "EDS"
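Components of a nested list are reached by chaining the access operators. The sketch below uses a reduced copy of the list (only the professor field is kept in the inner list); the inner list has no name, so it is selected by position.

```r
CourseInformation <- list(name = "Software R",
                          list(professor = "Giribone"),
                          degree = "EDS")

prof1 <- CourseInformation[[2]]$professor        # position, then name
prof2 <- CourseInformation[[2]][["professor"]]   # equivalent notation
comp_names <- names(CourseInformation)           # the unnamed component has an empty name
```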

7.16
List of lists
> View(CourseInformation)

7.17
Data Frame

A data frame is used to store data in tabular form. This object shares
many features with lists and matrices, but has some restrictions:
- The components of the data frame must be vectors (numeric,
textual or logical), factors, numeric matrices, lists or data frames.
- Numeric, logical and factor vectors are included in the data frame
as is, while string vectors are converted into factors by default in
R versions before 4.0.0 (from 4.0.0 onwards they remain character vectors).
- The length of the components must be the same.
In short, similarly to matrices, the data frame is a two-dimensional
data structure (i.e. a table).

7.18
Data Frame

A data frame can be seen as a special case of a list whose
components must necessarily have the same length.
Each component forms a column and the elements of the component
form rows.
For many purposes a data frame can be treated as a matrix whose
columns are characterized by different modes and attributes.
Obviously, the elements must have the same mode if they belong to
the same column.
A data frame can be displayed in matrix form and its rows, columns
can be extracted using canonical matrix indexing.

7.19
Data Frame

A data frame in R can be created using the data.frame() function.


candidate = c("Rossi", "Bianchi", "Brown")
mark = c(26, 14, 28)
passed = c(TRUE, FALSE, TRUE)
dataframeExam = data.frame(candidate, mark, passed)
> print(dataframeExam)
candidate mark passed
1 Rossi 26 TRUE
2 Bianchi 14 FALSE
3 Brown 28 TRUE

7.20
Data Frame
# the data frame is a particular type of list
> typeof(dataframeExam)
[1] "list"
> class(dataframeExam)
[1] "data.frame"
# Being a list, you can access and modify its elements
in a completely similar way to what we have seen before.

> print(as.character(dataframeExam[[1]]))
[1] "Rossi" "Bianchi" "Brown"

7.21
Data Frame
> print(dataframeExam$mark[2])
[1] 14

> print(dataframeExam[[3]][2:3])
[1] FALSE TRUE

If you want to eliminate the behavior of the data.frame() function
that automatically converts character vectors into factors (the
default before R 4.0.0), you can pass the optional argument:
stringsAsFactors=FALSE
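Setting the argument explicitly makes the result independent of the R version in use. A minimal sketch; df_chr and df_fac are names introduced here.

```r
candidate <- c("Rossi", "Bianchi", "Brown")
df_chr <- data.frame(candidate, stringsAsFactors = FALSE)   # strings stay character
df_fac <- data.frame(candidate, stringsAsFactors = TRUE)    # strings become a factor

class_chr <- class(df_chr$candidate)   # "character"
class_fac <- class(df_fac$candidate)   # "factor"
```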

7.22
Data Frame

Another advantage of using a data frame is that, being a natively
two-dimensional object, its elements can also be extracted using the
matrix notation.
> print(dataframeExam)
candidate mark passed
1 Rossi 26 TRUE
2 Bianchi 14 FALSE
3 Brown 28 TRUE
# matrix-like indexation
> dataframeExam[2,c(1,3)]
candidate passed
2 Bianchi FALSE

7.23
Edit a Data Frame

Data frames can be edited in the same way as elements of a matrix
or list:
# Matrix-like notation
dataframeExam[3,2] <- 30
dataframeExam[3,"mark"] <- 30

# List-like notation
dataframeExam$mark[3] <- 30
dataframeExam[[2]][3] <- 30

7.24
Add elements to a Data Frame
Rows and columns can be added using the rbind() and cbind() matrix
functions.
dataframeExam[,1] <- as.character(dataframeExam[,1])
# rbind() returns a new data frame: the result must be assigned
dataframeExam <- rbind(dataframeExam,c("Silver",23,TRUE))
> print(dataframeExam)
candidate mark passed
1 Rossi 26 TRUE
2 Bianchi 14 FALSE
3 Brown 30 TRUE
4 Silver 23 TRUE
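A vector such as c("Silver", 23, TRUE) stores every value as text, so binding it to a data frame can turn numeric and logical columns into character ones. A minimal sketch of a type-preserving alternative is to bind a one-row data frame instead. The objects exam and new_row are illustrative copies, not objects used elsewhere in these notes.

```r
exam <- data.frame(candidate = c("Rossi", "Bianchi", "Brown"),
                   mark = c(26, 14, 30),
                   passed = c(TRUE, FALSE, TRUE),
                   stringsAsFactors = FALSE)

# A one-row data frame keeps each value in its own mode
new_row <- data.frame(candidate = "Silver", mark = 23, passed = TRUE,
                      stringsAsFactors = FALSE)
exam <- rbind(exam, new_row)

str(exam)         # mark is still numeric, passed is still logical
mean(exam$mark)   # arithmetic on the column still works: 23.25
```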

7.25
Add elements to a Data Frame
> dataframeExam <-
cbind(dataframeExam,State=c("IT","IT","UK","SP"),
stringsAsFactors=FALSE)
> print(dataframeExam)
candidate mark passed State
1 Rossi 26 TRUE IT
2 Bianchi 14 FALSE IT
3 Brown 30 TRUE UK
4 Silver 23 TRUE SP

7.26
Add elements to a Data Frame

Elements can also be added with typical list syntax.


dataframeExam$City <-
c("Genova","Genova","Londra","Madrid")
> print(dataframeExam)
candidate mark passed State City
1 Rossi 26 TRUE IT Genova
2 Bianchi 14 FALSE IT Genova
3 Brown 30 TRUE UK Londra
4 Silver 23 TRUE SP Madrid

7.27
Deleting rows and columns

A column in a dataset can be cleared by assigning it a NULL value.


> dataframeExam$City=NULL
> print(dataframeExam)
candidate mark passed State
1 Rossi 26 TRUE IT
2 Bianchi 14 FALSE IT
3 Brown 30 TRUE UK
4 Silver 23 TRUE SP

7.28
Deleting rows and columns

A row of a dataset can be deleted by reassigning the data frame with
a negative row index.


> dataframeExam <- dataframeExam[-3,]

> print(dataframeExam)
candidate mark passed State
1 Rossi 26 TRUE IT
2 Bianchi 14 FALSE IT
4 Silver 23 TRUE SP
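Rows can also be dropped by condition rather than by position. The following sketch, on an illustrative copy of the data called exam, keeps only the candidates who passed. It also shows a guard that is needed when the negative index is computed, because an empty negative index would select no rows at all.

```r
exam <- data.frame(candidate = c("Rossi", "Bianchi", "Silver"),
                   mark = c(26, 14, 23),
                   passed = c(TRUE, FALSE, TRUE),
                   stringsAsFactors = FALSE)

# Keep only the rows that satisfy a logical condition
passed_only <- exam[exam$passed, ]

# Same result with a computed negative index; the guard covers the
# case in which no row has to be removed
failed_idx <- which(!exam$passed)
passed_only2 <- if (length(failed_idx) > 0) exam[-failed_idx, ] else exam

# Row names keep the original numbering (1 and 3); reset them if needed
rownames(passed_only) <- NULL
print(passed_only)
```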

7.29
attach() and detach() functions for lists and data frames
The $ notation, as in dataframeExam$candidate or MyList$name,
used to access data frame and list components is not always
convenient.
It can be useful to have a function that makes the components of a
list or data frame temporarily visible as variables that can be
recalled by the same name as the component.
This avoids writing the name of the reference database (list or
data.frame) before the dollar symbol each time, making the R code
clearer.

7.30
attach() and detach() functions for lists and data frames
The attach() function takes as input a «database», that is a list or
data.frame object.
Suppose BooksDB is a dataframe consisting of three components:
BooksDB$author, BooksDB$title, BooksDB$year

7.31
attach() and detach() functions for lists and data frames
> attach(BooksDB)
This instruction copies the components of the database into a new
environment on the search path. After this command, therefore, the
variables can be recalled directly by name. Later changes to BooksDB
are not reflected in this copy.
> print(year)
[1] 1990 1985 2005 1884
The detach() function removes the copy of the variables from the
search path.
> detach(BooksDB)
> print(year)
Error in print(year) : object 'year' not found
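When the components are needed for a single expression only, base R also offers with(), which evaluates the expression inside the data frame without changing the search path. In this sketch only the years come from the output shown above; the authors and titles are placeholders, since the content of BooksDB is not given.

```r
# Hypothetical content for BooksDB: only the years come from the slide
BooksDB <- data.frame(
  author = c("Author A", "Author B", "Author C", "Author D"),
  title  = c("Title A", "Title B", "Title C", "Title D"),
  year   = c(1990, 1985, 2005, 1884),
  stringsAsFactors = FALSE
)

# with() evaluates the expression using the columns as variables
oldest <- with(BooksDB, title[year == min(year)])
recent <- with(BooksDB, author[year > 1989])
print(oldest)   # "Title D"
print(recent)   # "Author A" "Author C"
```

Nothing has to be detached afterwards, and no variable named year is left behind in the search path.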

7.32
8.
Import Data from file
Reading data from csv-txt-dat file
Fixed width format file Reading
Writing data in a file
Reading data from file – csv

R uses the working directory for reading and writing into files.
The command for displaying the working directory is getwd() and its
path can be changed with setwd().
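A minimal sketch of the two commands follows. It moves to R's temporary folder (an arbitrary target chosen for illustration), creates a small file there using only its name, and then returns to the original directory.

```r
old_wd <- getwd()               # remember the current working directory
setwd(tempdir())                # move to a temporary folder (illustrative)

writeLines("a;b", "demo.csv")   # a bare file name now refers to this folder
file.exists("demo.csv")         # TRUE
list.files(pattern = "\\.csv$") # csv files visible in the working directory

file.remove("demo.csv")         # clean up
setwd(old_wd)                   # go back to where we started
```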

In RStudio this can be handled more intuitively using the graphical
interface provided (Files tab).

8.2
Reading data from file – csv

If the file you want to import is in the working directory (so that
it is visible in the Files tab of RStudio), the functions dedicated
to importing data need only the name of the file with its extension,
not the entire path.
Take the following csv (comma-separated values) file as an example
of a database: StarWars.csv
source: https://www.kaggle.com/jsphyg/star-wars
The file is in the directory: C:\Users\Utente\Documents

8.3
Reading data from file – csv
Have a look at the csv file

8.4
Reading data from file – csv

The file has been stored in the working directory and therefore
appears in the «Files» tab of RStudio.

Clicking on the file with the left mouse button opens a menu that
starts a simple wizard for importing the data into the R environment.

8.5
Reading data from file – csv

The raw content of the csv file

How to import data

The code generated automatically

8.6
Since the fields of the database are separated as shown in the
preview, the data is imported correctly.
Where a value is missing, R assigns NA.

8.7
Reading data from file – csv

After clicking on Import, an object named StarWars of type list and
class data.frame is created in memory.
Thus, the object can be queried with the methods already learned.
> typeof(StarWars)
[1] "list"

> StarWars$name
> StarWars[1,1]

> na.omit(StarWars$name[StarWars$height>200])

8.8
Reading data from file – csv

The same import could be done from code.
There are many R functions dedicated to this purpose; the most
popular are the following:

read.table(file, header=FALSE,sep="",quote="\"'",dec=".")


read.csv(file, header=TRUE,sep=",",quote="\"",dec=".",fill=TRUE)
read.csv2(file, header=TRUE,sep=";",quote="\"",dec=",",fill=TRUE)
read.delim(file, header=TRUE,sep="\t",quote="\"",dec=".",fill=TRUE)
read.delim2(file, header=TRUE,sep="\t",quote="\"",dec=",",fill=TRUE)

8.9
Reading data from file – csv

file: the name of the file, including its extension, if the file is
in the current directory; otherwise the entire path must be given.
The parameter can also be the URL (Uniform Resource Locator) of a
remote file (http://...).
header: a logical value (TRUE or FALSE) which indicates whether the
names of the variables are in the first row.
sep: the delimiter that separates the elements within the file. For
instance, sep="\t" refers to the tab character.
quote: the character(s) used to enclose textual values.
dec: the character used as the decimal separator.

8.10
Reading data from file – csv

fill: logical value. If TRUE and the rows do not all have the same
number of fields, blank fields are added.
StarWars <- read.table("StarWars.csv",
header=TRUE,sep=";",quote="\"",dec=".",fill=TRUE)
StarWars <- read.csv("StarWars.csv",
header=TRUE,sep=";",quote="\"",dec=".",fill=TRUE)
StarWars <- read.delim2("StarWars.csv",
header=TRUE,sep=";",quote="\"",dec=".",fill=TRUE)
The output object, named StarWars, has a list mode,
mode(StarWars), and a data.frame class, class(StarWars).
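Since StarWars.csv is not distributed with these notes, the following self-contained sketch writes a tiny semicolon-separated file to a temporary location and reads it back with the same arguments. The three rows are a made-up sample in the style of the dataset, and the names csv_path and sw are illustrative.

```r
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name;height;mass",
             "Luke Skywalker;172;77",
             "Chewbacca;228;112",
             "Yoda;66;"),            # the last field is left empty
           csv_path)

sw <- read.csv(csv_path, header = TRUE, sep = ";", quote = "\"",
               dec = ".", fill = TRUE, stringsAsFactors = FALSE)

str(sw)                       # a data.frame with 3 rows and 3 columns
is.na(sw$mass)                # the empty numeric field was read as NA
sw$name[sw$height > 200]      # "Chewbacca"
```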

8.11
Reading data from file – txt

Now suppose you need to import the same dataset from a file with
the extension .txt located in the directory:
C:/Users/Utente/Documents/Database/StarWars.txt

8.12
Reading data from file – txt

The function used to import a txt file into the R environment is
read.delim(). Given that the file is not stored in the working
directory (wd), we can:
- Specify the entire path that allows to reach the file
path="C:/Users/Utente/Documents/Database/StarWars.txt"
StarWars <- read.delim(path,
header=TRUE,sep=";",quote="\"",dec=".",fill=TRUE)
- Set the wd in the file directory and call the function
setwd("C:/Users/Utente/Documents/Database/")
StarWars <- read.delim("StarWars.txt",
header=TRUE,sep=";",quote="\"",dec=".",fill=TRUE)

8.13
The scan() function
The scan() function is more flexible and more customizable than
read.table().
With scan(), the mode of each variable can be specified in advance
through the what argument.
For instance:
mydata <- scan("data.dat", what=list("",0,0))

The instruction reads three variables from the file with the .dat
extension: the first has character mode, while the other two are
numeric.
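The content of data.dat is not shown in the notes, so the sketch below creates a small whitespace-separated file with made-up values and reads it in the same way. The result is a list with one component per variable.

```r
dat_path <- tempfile(fileext = ".dat")
writeLines(c("A 1.50 1.2",
             "B 1.69 4.3"), dat_path)   # made-up sample rows

# what = list("", 0, 0): one character field followed by two numeric fields
mydata <- scan(dat_path, what = list("", 0, 0), quiet = TRUE)

str(mydata)       # a list of 3 components
mydata[[1]]       # "A" "B"
mydata[[2]] * 2   # the numeric components can be used in arithmetic
```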

8.14
The read.fwf() function
The read.fwf() function reads files in fixed width format (fwf).
The widths argument gives the number of characters of each field.
mydata <- read.fwf("data.dat", widths=c(1,4,3))

> str(mydata)
'data.frame': 4 obs. of 3 variables:
$ V1: Factor w/ 2 levels "A","B": 1 1 2 2
$ V2: num 1.5 1.55 1.69 1.95
$ V3: num 1.2 1.3 4.3 4.4
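As an illustration, the following sketch builds a fixed width file whose lines are consistent with the str() output above. The exact content of data.dat is an assumption reconstructed from that output: each line holds 1 character, then 4, then 3, with no separators.

```r
fwf_path <- tempfile(fileext = ".dat")
writeLines(c("A1.501.2",
             "A1.551.3",
             "B1.694.3",
             "B1.954.4"), fwf_path)   # assumed layout: widths 1, 4, 3

mydata <- read.fwf(fwf_path, widths = c(1, 4, 3),
                   stringsAsFactors = FALSE)
str(mydata)   # 4 observations of 3 variables: V1, V2, V3
```

With stringsAsFactors = FALSE the first column is read as character instead of factor.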

8.15
Write contents to a file
The write.table() function writes an R object in a file.

Typically it is used to save data frames, but it also works with the
other R objects like vectors, matrices, ...

The syntax for write.table is:

write.table(x,file="",append=FALSE, quote=TRUE, sep=" ",
eol="\n", na="NA",dec=".",row.names=TRUE,col.names=TRUE,
qmethod=c("escape", "double"))

8.16
Write contents to a file

x: the name of the object to be written to the file


file: the name of the file
quote: a logical value or a numeric vector. If TRUE, character and
factor columns are written surrounded by double quotes; if a numeric
vector, it gives the indices of the columns to be quoted; if FALSE,
nothing is quoted.
sep: the field separator written between the values of each row.
eol: the character(s) written at the end of each line.
na: the string used for NA values.

8.17
Write contents to a file

dec: the character used to separate the decimal part of a number


row.names: a logical value that indicates whether row names should
be written to the file.
col.names: a logical value that indicates whether columns names
should be written to the file.
qmethod: when quoting is active, it specifies how a double quote "
embedded in a textual value is handled: with "escape" (or "e", the
default setting) each " is written as \"; with "double" (or "d")
each " is written as "".
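The interplay of sep, quote, row.names and qmethod can be seen in a short round trip. This sketch writes a made-up two-row data frame (crew is an illustrative name) containing an embedded double quote, inspects the raw lines, and reads the file back.

```r
crew <- data.frame(name = c("James Kirk", "Leonard \"Bones\" McCoy"),
                   rank = c("Captain", "Lieutenant Commander"),
                   stringsAsFactors = FALSE)

out_path <- tempfile(fileext = ".csv")
write.table(crew, file = out_path, sep = ",", quote = TRUE,
            row.names = FALSE, col.names = TRUE, qmethod = "double")

raw_lines <- readLines(out_path)
print(raw_lines)   # the embedded quotes are doubled: ""Bones""

# read.csv understands doubled quotes inside quoted fields
crew_back <- read.csv(out_path, stringsAsFactors = FALSE)
identical(crew_back$name, crew$name)   # TRUE
```

The value qmethod="double" is the convention used by csv files, which is why it is chosen here instead of the default "escape".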

8.18
Write contents to a file
# Example of importing a dataset from a URL
StarTrekDB <- read.csv(
"https://raw.githubusercontent.com/pdxcat/nixmentors/master/lab-databases/startrek.csv",
header=TRUE,sep=",",stringsAsFactors = FALSE)
# Select the indices of the rows whose Rank contains the word
# Lieutenant
indx <- grep("Lieutenant",StarTrekDB$Rank,
perl=TRUE,value=FALSE)
# Store only the matching rows in the object
Lieutenants <- StarTrekDB[indx,]

8.19
Write contents to a file
> View(Lieutenants)

> Lieutenants[3,1] <- "Josè Tyler" #data cleaning

8.20
Write contents to a file
# Save the Lieutenants object in a csv file
write.table(Lieutenants,file="StarTrekLieutenants.csv",
append=FALSE, quote=FALSE, sep=",",eol="\n",
na="NA",row.names=FALSE,col.names=TRUE)

8.21
9.
R packages
Install, Update and Remove Pkgs
Library
Namespace
Packages

R functions are contained in packages. When an R session starts,
the basic packages are pre-loaded by default.
Consequently, all the functions dealt with in the previous modules
are contained in these packages.
To see which ones are currently loaded, use the search() function
> search()
".GlobalEnv" "package:stats" "package:graphics"
"package:grDevices" "package:utils" "package:datasets"
"package:methods" "Autoloads" "package:base"

9.2
Packages

If you consult the help for each function, the package to which it
belongs is displayed.

> help(mean)

The reference package is indicated in braces: in this case the mean
function belongs to the R {base} package.

9.3
Packages
> help(read.fwf)

For instance, read.fwf belongs to the {utils} package, which is
pre-loaded when R starts.
If the package were not loaded, the read.fwf function could not be
called directly by the programmer.
So, as a general rule, a function can be called by its bare name
only when its package is loaded.
9.4
Packages

Packages must be not only installed but also loaded before use.
There are two main reasons for this package management policy:
- Loading all the functions of all installed packages in advance would
require a huge amount of memory. Selecting only the packages that
are needed gives greater computational efficiency and tidier code.

- It helps package developers, because it avoids potential clashes
between functions that have the same name in different packages.
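The relation between functions, packages and the search path can be inspected from the console. The sketch below uses only base R functions and checks where two functions already met in these notes come from.

```r
# Is the utils package attached to the search path?
"package:utils" %in% search()

# Which package does a function come from?
environmentName(environment(utils::read.fwf))   # "utils"
environmentName(environment(mean))              # "base"

# Number of packages installed in the library, loaded or not
length(.packages(all.available = TRUE))
```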

9.5
Packages

To better understand package management (installation and use in
the code) let's start from a concrete application case.
We want to import a dataset stored in an Excel file with the .xlsx
extension.
Searching online in the R community forums we find a function that
seems useful for this purpose: the read_excel() function.
Typing the instruction help(read_excel) produces the message
No documentation for read_excel in specified packages
and libraries: you could try ??read_excel
because the package is not installed yet.

9.6
Packages
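The ??topic command, a shortcut for help.search(), performs a
broader search over the help pages of all installed packages,
including those that are not currently loaded.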
>??read_excel

The reference package is {readxl}

9.7
Packages

The package installation can be done from code or using the RStudio
graphical interface. If we use the IDE:
Tools -> Install Packages
Install from: choose the repository or path where the package to be
installed is located. The recommended choice is to use reliable
packages which come from an official repository.

9.8
Packages

Packages that come from CRAN (the Comprehensive R Archive Network)
can be considered reliable.
You have to type the name of the package (or the series of
packages) that you intend to install in the Packages textbox.
Install to Library: the path where the package will be installed (use
the default one, which already contains the others, to keep things
tidy).
Install dependencies: if the package that you want to install contains
functions that use other packages, these will be automatically
downloaded and installed. To be sure that everything works
correctly, leave the box checked.

9.9
Packages

Given that readxl is in the official repository, it is recognized.

The logs of the successful installation appear in the console.

9.10
Packages
Installing package into ‘C:/Users/Utente/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/readxl_1.3.1.zip'
Content type 'application/zip' length 1517362 bytes (1.4 MB)
downloaded 1.4 MB
package ‘readxl’ successfully unpacked and MD5 sums checked

9.11
Packages

The instruction equivalent to the procedure just shown is to write
the following in the console:
> install.packages("readxl")
At this point the package is installed and can therefore be used by
the programmer after loading it.
The library() function loads an R package
> library(readxl)
search() can be used to check that the package has indeed been
loaded.
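In a script it is often convenient to check whether a package is available before loading it. The following is a minimal sketch of this pattern using requireNamespace(); it deliberately does not install anything by itself and only prints a hint when the package is missing.

```r
pkg <- "readxl"

if (requireNamespace(pkg, quietly = TRUE)) {
  # The package is installed: attach it to the search path
  library(pkg, character.only = TRUE)
  message(pkg, " version ", as.character(packageVersion(pkg)), " loaded")
} else {
  # The package is missing: tell the user how to get it
  message(pkg, " is not installed; run install.packages(\"", pkg, "\") first")
}

# TRUE only if the package has actually been attached
pkg_attached <- paste0("package:", pkg) %in% search()
print(pkg_attached)
```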

9.12
Packages
> search()
".GlobalEnv" "package:readxl" "tools:rstudio"
"package:stats" "package:graphics"
"package:grDevices" "package:utils"
"package:datasets" "package:methods" "Autoloads"
"package:base"

Now the help is available


> help("readxl") #help for the package
> help("read_excel") #help for the function

9.13
Packages

> MasterYoda3D <- read_excel("MasterYoda.xlsx",


sheet="Points",
range="A1:C33863")

9.15
Packages
> mode(MasterYoda3D)
[1] "list"
> class(MasterYoda3D)
[1] "tbl_df" "tbl" "data.frame"
# read_excel() returns a tibble, an extension of the data.frame class

library(plot3D)
scatter3D(MasterYoda3D$Xcoord,
MasterYoda3D$Ycoord,
MasterYoda3D$Zcoord)

9.16
Namespace

If the package contains numerous functions and only a few are
expected to be used, instead of attaching the entire package with
library(), you can refer to the single function of interest with the
syntax:
package_name::function_name
In this case the package is loaded into memory (together with its
dependencies) but is not attached to the search path, so its
functions are reachable only through the :: prefix. The loaded
package environment is called namespace
> MasterYoda3D <-
readxl::read_excel("MasterYoda.xlsx",sheet="Points",
range="A1:C33863")

9.17
Namespace

To view the operating namespaces, use the function:


> loadedNamespaces()
"compiler" "graphics" "tools" "pillar"
"rstudioapi" "utils" "tibble" "yaml"
"grDevices" "crayon" "Rcpp" "stats"
"cellranger" "datasets" "readxl" "methods"
"rematch" "pkgconfig" "rlang" "base"

Note that if the package is not attached via library(), it does not
appear in the search() list, but its functions can still be used
with the :: operator.
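The difference between a loaded namespace and an attached package can be verified with a package that ships with R but is not attached by default. The sketch below uses tools::file_ext(), chosen here only as a convenient example.

```r
# Call a single function through its namespace, without library(tools)
ext <- tools::file_ext("StarWars.csv")
print(ext)                        # "csv"

"tools" %in% loadedNamespaces()   # TRUE: the namespace is loaded
"package:tools" %in% search()     # FALSE unless library(tools) was called
```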

9.18
Package Management

Both the Rgui of the R console and RStudio allow packages to be
managed from a GUI (Graphical User Interface).
The commands for installing, updating or removing a package are:
install.packages()
update.packages()
remove.packages()

9.19
Package Management

Install
Update
Remove
R Packages
from RStudio

9.20
Package Management

It is essential to manage R packages wisely when installing and
uninstalling them, but above all when updating them.
R Core Team suggests carrying out this massive update operation on
a regular and scheduled basis.
It can be conducted automatically by going through the RStudio
menu items:

Tools -> Check for Package Updates

9.21
Package management

NEWS displays the changes made in the new version of the package
compared to the previous one.

Logs will be shown in the R Console.

9.22
10.
Graphics
Low level function
High level function
Layout
Introduction

The R Graphics environment is extremely powerful and is particularly
oriented to the scientific visualization of data. R provides numerous
functions dedicated to graphics management both at a basic level
and with specific packages.
These outputs are easily accessible, fully programmable and
exportable in the most common graphic formats (png, jpeg, bmp, pdf, …).
Full compatibility with LaTeX
https://cran.r-project.org/web/views/Graphics.html
https://www.r-graph-gallery.com/

10.2
The Graphics environment

The four main graphical environments are:
Base Graphics
- R Base Graphics (package:graphics)
- grid package
https://www.stat.auckland.ac.nz/~paul/grid/grid.html
Advanced Graphics
- lattice package
http://lmdvr.r-forge.r-project.org/figures/figures.html
- ggplot2 package: https://ggplot2.tidyverse.org/reference/

10.3
Base Graphics

The most important high-level functions for basic graphics are:
plot: generic x-y plotting
barplot: bar plots
boxplot: box-and-whisker plots
hist: histograms
pie: pie charts
dotchart: Cleveland dot plots
qqnorm, qqline, qqplot: distribution comparison plots
pairs, coplot: display of multivariate data
The main inputs for graphical objects are matrices, data frames and
vectors.

10.4
Base Graphics – Scatter Plots

We define the following data set:


# Initializing random seed
set.seed(29091984)
# Define y as a 10 x 3 matrix
y <- matrix(runif(30), ncol=3,
dimnames=list(letters[1:10], LETTERS[1:3]))
In order to plot the first two columns using a scatter plot, we can use
the plot() function.
> plot(y[,1], y[,2])

10.5
Base Graphics – Scatter Plots

On the right is the dataset used for exploring the basic R graphical
functions.
The graphical output is hosted in the RStudio Plots tab, which
provides a series of buttons to enlarge the chart, export it to a
file (image or pdf), remove the current chart or remove all graphic
objects.

10.6
Base Graphics – All pairs
> pairs(y)

10.8
Base Graphics – Plot Labels and Text
plot(y[,1], y[,2],
pch=20, col="red",
main="Scatter")
text(y[,1]+0.01,
y[,2],rownames(y))

The text() function adds labels to the plot at the specified x-y
coordinates.

10.9
Base Graphics – Plotting characters (pch)

The pch parameter defines the type of point to be drawn with a
numerical value.
For instance, pch=8 will draw an asterisk instead of the filled
circle, whose color is defined by the parameter col.
The main parameter defines the title of the plot.

10.10
Base Graphics – Scatter Plot
plot(y[,1], y[,2], type="n")
text(y[,1], y[,2], rownames(y))
The plot() function is equipped with many input parameters that
allow an extensive customization of the graph.
In this regard, consult the guide:
> help(plot)
# graphical parameters
> help(par)

10.11
Base Graphics – Plot Parameters
# all graphic objects will have the properties expressed
# in the par() function
op <- par(mar=c(7,7,7,7), bg="lightyellow")
# the graphical properties expressed in the plot
# function, on the other hand, have a local validity
plot(y[,1], y[,2], type="p", col="red", cex.lab=1.2,
cex.axis=1.2, cex.main=1.2, cex.sub=1, lwd=4, pch=20,
xlab="x label", ylab="y label", main="Main Title",
sub="Sub Title")
grid(3, 3, lwd = 2)

10.12
10.13
Base Graphics – Plot Parameters
The graphical parameters set with the par() function are applied to
all subsequent graphs. As a result, if you draw a second graph with
the plot() command after executing the previous lines of code, it
will have the same margins and the same background color. The mar
parameter is a numerical vector of 4 elements that defines the size
of the margins, in lines of text, on the four sides of the graph in
accordance with the syntax:
c(bottom,left,top,right).
If not specified, the default values are: c(5.1,4.1,4.1,2.1).
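Because par() returns the previous values of the parameters it changes, the settings can be restored without closing the device. A minimal sketch, using the same dataset as the other graphics examples (the variable old_mar is introduced only to verify the result):

```r
set.seed(29091984)
y <- matrix(runif(30), ncol = 3,
            dimnames = list(letters[1:10], LETTERS[1:3]))

old_mar <- par("mar")                 # margins before any change

# par() returns the old values of the parameters being modified
op <- par(mar = c(7, 7, 7, 7), bg = "lightyellow")
plot(y[, 1], y[, 2], main = "Custom margins")

par(op)                               # restore the previous settings
plot(y[, 1], y[, 2], main = "Restored margins")
par("mar")                            # same values as old_mar
```

This is a lighter alternative to dev.off(), which also discards the graphs already drawn.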

10.14
Base Graphics – Plot Parameters

The bg parameter defines the background colour of the plot. The list
of all 657 colors made available by R can be viewed with the
colors() command.
To delete all the graphical properties stored with par() including
graphic objects, you can use the command dev.off().
Now let’s discuss the parameters in the plot() function.
type indicates which type of graph is to be drawn. In this case,
dealing with a type = "p" scatter plot means that the (x, y)
coordinates will be represented in the graph as points.

10.15
Base Graphics – Plot Parameters

From the guide you can see all the other possible graphs managed:

10.16
Base Graphics – Plot Parameters

cex is a numeric value that sets the size of text and symbols.
It indicates how many times the textual character must be enlarged
with respect to the default value which is equal to 1.
The following parameters use the same described logic applying it:
- To the numbers in the axes (cex.axis),
- To the axes labels (cex.lab),
- To the title of the plot (cex.main)
- To the subtitle of the plot (cex.sub)

10.17
Base Graphics – Plot Parameters

col controls the colors of the symbols. As for the cex parameter there
are: col.axis, col.lab, col.main, col.sub.
lwd is the thickness of the line (or of the point as in this case). The
default value is 1. Its interpretation is "device-specific": the
width actually rendered depends on the graphics device in use.
xlab and ylab are the textual labels to be applied to the x axis and the
y axis, respectively.
main and sub are the parameters that specify the main title of the
graph and the subtitle, respectively.

10.18
Base Graphics – Plot Parameters
To conclude, the grid() function overlays a grid on the graph,
dividing it into nx cells along the abscissa (nx = 3) and ny cells
along the ordinate (ny = 3).
# We proceed to clear the memory of all graphic devices:
# objects and properties defined by the par() function

> dev.off()
null device
1

10.19
Base Graphics – abline()
The abline() function adds a line in the current plot.
So if you want to add the regression line to the base chart you can
write the following code:
# draw the generated points using the scatter plot
plot(y[,1], y[,2])
# perform the linear regression using lm() where y[,1]
# is the regressor and y[,2] is the dependent variable
myline <- lm(y[,2]~y[,1])
# add the regression line
abline(myline, lwd=2, lty=5)

10.20
Base Graphics – abline()

lty indicates the type of line to be drawn and it is an integer:
1: solid
2: dashed
3: dotted
4: dotdash
5: longdash
6: twodash
10.21
Base Graphics – abline()

Looking at the guide, we now check the input arguments for the
abline() function:
>?abline

10.22
Base Graphics – abline()

In the example, the fitted model was passed as the reg argument.
The following more general formulation could also be used:
plot(y[,1], y[,2])
myline <- lm(y[,2]~y[,1]);
print(myline$coeff)
a <- myline$coeff[1]
b <- myline$coeff[2]
# line defined by the intercept a and the slope b
abline(a,b, lwd=2, lty=5)
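The same coefficients can be read with the accessor coef(), which avoids relying on partial matching of the component name. The sketch below also uses the h and v arguments of abline() to draw reference lines through the sample means; a least-squares line with intercept always passes through their intersection. The names fit and ab are illustrative.

```r
set.seed(29091984)
y <- matrix(runif(30), ncol = 3,
            dimnames = list(letters[1:10], LETTERS[1:3]))

fit <- lm(y[, 2] ~ y[, 1])
ab <- coef(fit)          # named vector: intercept first, slope second
print(ab)

plot(y[, 1], y[, 2])
abline(a = ab[1], b = ab[2], lwd = 2, lty = 5)   # regression line
abline(h = mean(y[, 2]), lty = 3)                # horizontal reference line
abline(v = mean(y[, 1]), lty = 3)                # vertical reference line
```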

10.23
Base Graphics – log-scale

The log parameter plots the values using a logarithmic scale on the
x-axis, log="x", on the y-axis, log="y", or on both axes, log="xy".

> plot(y[,1], y[,2], log="xy",main="Logscale")

10.24
Base Graphics – LaTeX compatibility

The following code shows how to add a label containing a
mathematical expression written in TeX notation using the latex2exp
package.
#install.packages("latex2exp")
library(latex2exp)
plot(y[,1], y[,2],xlab=TeX("$\\alpha$"),
ylab=TeX("$\\beta$"),main=(TeX("$\\LaTeX$compatibility")))
text(y[1,1]+.04, y[1,2],
TeX("$\\sum_{i=1}^{N}\\frac{1}{h^{2}}$"), cex=1.3)

10.25
Base Graphics – LaTeX compatibility
LaTeX is a markup language used for the preparation of texts, based
on the WYSIWYM (What You See Is What You Mean) paradigm as opposed
to the more widespread WYSIWYG (What You See Is What You Get).
It is particularly used in the academic community and in the
scientific field.

10.26
Base Graphics – Line Plot

The following code shows how to create a line plot from a single
dataset.
The function you can use is still plot with the parameter type="l"
#dataset
set.seed(29091984)
y <- matrix(runif(30), ncol=3,
dimnames=list(letters[1:10], LETTERS[1:3]))
#line plot – single dataset
plot(y[,1], type="l", lwd=2, col="blue")

10.27
Base Graphics – Line Plot

10.28
Base Graphics – Line Plot with more datasets

This code plots three lines in the same graphical device. The same
ylim is passed to every call so that the three lines share one
y scale:
split.screen(c(1,1)) #dataset 1
plot(y[,1], type="l", lwd=2, ylim=c(0,1), col="blue")
screen(1, new=FALSE) #dataset 2
plot(y[,2], type="l", lwd=2, ylim=c(0,1), col="red", xaxt="n",
yaxt="n", ylab="", xlab="", main="", bty="n")
screen(1, new=FALSE) #dataset 3
plot(y[,3], type="l", lwd=2, ylim=c(0,1), col="green", xaxt="n",
yaxt="n", ylab="", xlab="", main="", bty="n")

10.29
Base Graphics – Line Plot with more datasets

10.30
Base Graphics – Line Plot with more datasets
The split.screen() function indicates how the graphic layout must
be divided when several graphical objects must be hosted in it at the
same time.
The value c(1,1) indicates that the screen (the part of RStudio that
hosts graphic objects: tab plot) will not be divided into sub-charts.
In the first call of the plot() function we specify the following
parameters: type (line plot), lwd (line width), col (colour) and ylim
which indicates the range of variation of the y-axis. Since the dataset
is composed of numbers drawn according to a uniform distribution
[0,1] it is reasonable to set the parameter to c(0,1).

10.31
Base Graphics – Line Plot with more datasets
screen(1,new=FALSE) indicates that the next graph will be hosted
in the same first screen, or, more in general, in the same graphical
area that already hosts the current plot. The next times that the plot
function will be invoked, in addition to the input parameters already
discussed (type, lwd, col), there will also be:
xaxt="n" and yaxt="n" indicating that the x axis and y axis are
set but not drawn.
xlab="" and ylab="": the x and y axis labels are not displayed
main="": the chart title is not displayed
bty="n": the box containing the graph is not displayed
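A more compact way to overlay several series, offered here as an alternative sketch rather than as the approach of these notes, is the low-level function lines(), which adds to the current plot and therefore reuses its axes automatically.

```r
set.seed(29091984)
y <- matrix(runif(30), ncol = 3,
            dimnames = list(letters[1:10], LETTERS[1:3]))

line_cols <- c("blue", "red", "green")

# The first call creates the plot and fixes the axes
plot(y[, 1], type = "l", lwd = 2, ylim = c(0, 1), col = line_cols[1],
     xlab = "Index", ylab = "Value")
# lines() adds the other series on the same axes
lines(y[, 2], lwd = 2, col = line_cols[2])
lines(y[, 3], lwd = 2, col = line_cols[3])
legend("topright", legend = colnames(y), col = line_cols,
       lwd = 2, bty = "n")
```

The single call matplot(y, type="l", lty=1, lwd=2, col=line_cols) draws all the columns of the matrix at once.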

10.32
Base Graphics – box parameter (bty)

The bty parameter controls the type of box drawn around the graph.
Allowed values, besides "n", are: "o", "l", "7", "c", "u", "]"
par(mfrow=c(2,3))
plot(y[,1],type="l",bty="o",xaxt="n",yaxt="n",main="o")
plot(y[,1],type="l",bty="l",xaxt="n", yaxt="n",main="l")
plot(y[,1],type="l",bty="7",xaxt="n", yaxt="n",main="7")
plot(y[,1],type="l",bty="c",xaxt="n", yaxt="n",main="c")
plot(y[,1],type="l",bty="u",xaxt="n", yaxt="n",main="u")
plot(y[,1],type="l",bty="]",xaxt="n", yaxt="n",main="]")

10.33
Base Graphics – box parameter (bty)

10.34
Base Graphics – splitting the graphic window (mfrow and mfcol)

The mfrow and mfcol parameters of the par() function define the
partitions of the graphic window.
In our case, mfrow=c(2,3) means that the RStudio Plots tab will host
six graphical objects in a matrix with 2 rows and 3 columns.
The graphs fill these cells by rows when mfrow is used and by
columns when mfcol is used.
The next graph was generated from the same code but using
mfcol=c(2,3) instead of mfrow=c(2,3).

10.35
Base Graphics – splitting the graphic window (mfrow and mfcol)

10.36
Base Graphics – bar plot

The following code allows to draw a bar plot:


#dataset
set.seed(29091984)
y <- matrix(runif(30), ncol=3,
dimnames=list(letters[1:10], LETTERS[1:3]))
#barplot function
barplot(y[1:6,], ylim=c(0, 1), beside=TRUE,
legend=letters[1:6])
> help(barplot)

10.37
Base Graphics – bar plot

10.38
Base Graphics – Error bars

Error bars are graphical representations of data variability and they
are used to indicate the error or the uncertainty in a measurement.
They can be represented in R using the following script:
bar <- barplot(m <- rowMeans(y) * 10, ylim=c(0, 10))
stdev <- sd(t(y)) # print(stdev) gives 0.266
arrows(bar, m, bar, m + stdev, length=0.15, angle = 90)

> help(arrows)
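In the script above sd() is applied to the whole matrix and therefore returns a single number, on the original scale of the data. As a variant, the following sketch computes one standard deviation per row with apply(), rescales it by 10 like the means, and draws symmetric bars. The names s, lower and upper are illustrative.

```r
set.seed(29091984)
y <- matrix(runif(30), ncol = 3,
            dimnames = list(letters[1:10], LETTERS[1:3]))

m <- rowMeans(y) * 10
s <- apply(y, 1, sd) * 10          # one standard deviation per row

upper <- m + s
lower <- pmax(m - s, 0)            # do not draw below the baseline

bar <- barplot(m, ylim = c(0, max(upper)))
# code = 3 draws a flat head at both ends of each segment
arrows(bar, lower, bar, upper, length = 0.05, angle = 90, code = 3)
```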

10.39
Base Graphics – Error bars
print(cbind(bar,round(m,1)))
[,1] [,2]
a 0.7 2.1
b 1.9 5.4
c 3.1 6.7
d 4.3 7.3
e 5.5 4.8
f 6.7 2.1
g 7.9 2.9
h 9.1 5.8
i 10.3 5.3
j 11.5 5.8

10.40
Base Graphics – Error bars

10.41
Base Graphics – Histogram (hist)
hist(y, freq=TRUE, breaks=10); help(hist)

10.42
Base Graphics – Density plot
plot(density(y), col="red")

10.43
Base Graphics – Pie chart
The pie() function draws a pie chart
pie(y[,3], col=rainbow(length(y[,3]), start=0.1,
end=0.8), clockwise=TRUE)
The clockwise parameter is a logical value: if TRUE the input vector
data is arranged on the pie chart clockwise, if FALSE anti-clockwise.
The rainbow() function creates a vector of n (length(y[,3]))
contiguous colors with tones that span from start = 0.1 to end =
0.8. The shades vary according to the following scale: red = 0,
yellow = 1/6, green = 2/6, cyan = 3/6, blue = 4/6 and magenta = 5/6.
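The vector returned by rainbow() can be inspected before it is passed to a plotting function. A short sketch, where cols is an illustrative name:

```r
cols <- rainbow(10, start = 0.1, end = 0.8)
print(cols)        # ten colours encoded as hexadecimal strings

# Display the palette as a strip of coloured bars
barplot(rep(1, 10), col = cols, border = NA, axes = FALSE)
```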

10.44
Base Graphics – Pie chart

The legend of the pie chart is then inserted using the legend()
function

legend("topright",
legend=row.names(y), cex=1.3,
bty="n", pch=15, pt.cex=1.8,
col=rainbow(length(y[,1]),
start=0.1, end=0.8), ncol=1)

10.45
Base Graphics – Manage the layout of the graphic window

As already mentioned, the split.screen() function divides the
active graphics area (i.e. the graphical device) into multiple
partitions.
For example, split.screen(c(1,2)) divides the device into two parts
that can be selected with screen(1) and screen(2).
erase.screen() clears the content of the current screen.
An existing partition can be further divided by calling
split.screen() again, which makes more complex graphic arrangements
possible.
These functions are incompatible with other layout mechanisms (such
as layout() or coplot()), and it is suggested to use them for a first
graphical exploration of the data.

10.46
Base Graphics – Manage the layout of the graphic window

The R function that allows a complete and structured management of
graphical partitions is layout(): it divides the active graphical
device into the areas in which the graphs will be drawn.
Its main argument is an integer matrix indicating the sub-window
numbers.
For example, to divide the device into four equal parts, use:
>layout(matrix(1:4,2,2))

This can be rewritten so that the matrix is created first, which
helps to understand how the device will be divided.

10.47
Base Graphics – Manage the layout of the graphic window
> mat <- matrix(1:4,2,2)
> print(mat)
[,1] [,2]
[1,] 1 3
[2,] 2 4
> layout(mat)

To view the partition created, use the layout.show() function which


takes the number of sub-windows to be displayed as input data (in
this case 4).

10.48
Base Graphics – Manage the layout of the graphic window
> layout.show(4)

10.49
Base Graphics – Manage the layout of the graphic window
mat <- matrix(1:6,3,2)
layout(mat)
layout.show(6)

10.50
Base Graphics – Manage the layout of the graphic window
mat <- matrix(1:6,2,3)
layout(mat)
layout.show(6)

10.51
Base Graphics – Manage the layout of the graphic window
> m <- matrix(c(1:3,3),2,2)
> print(m)
[,1] [,2]
[1,] 1 3
[2,] 2 3
> layout(m)
> layout.show(3)

10.52
Base Graphics – Manage the layout of the graphic window

In these examples, the byrow option of matrix() was not used and
therefore, by default, the sub-windows were numbered column by
column.
To number them row by row, simply specify the parameter byrow=TRUE in
the matrix() function.
By default, layout() splits the device into sub-windows of regular
width and height. These can be changed as needed with the widths and
heights options.
Dimensions are usually given in a relative way, but can also be
specified in centimeters (see ?layout).
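
The two options just described can be sketched together as follows. This is only an illustration: the object names m_col, m_row and n_fig are hypothetical, and pdf(NULL) is an assumption used so that the sketch runs without opening a window (it can be dropped in an interactive session).

```r
# Column-wise numbering (default) versus row-wise numbering (byrow = TRUE)
m_col <- matrix(1:4, 2, 2)
m_row <- matrix(1:4, 2, 2, byrow = TRUE)
print(m_col)
print(m_row)

pdf(NULL)   # assumption: null device, nothing is written to disk
# Relative sizes: the second column is three times as wide as the first,
# the first row is three times as tall as the second
n_fig <- layout(m_row, widths = c(1, 3), heights = c(3, 1))
layout.show(n_fig)   # layout() returns the number of sub-windows
invisible(dev.off())
```

With byrow=TRUE the sub-windows 1 and 2 sit side by side in the first row, instead of one above the other in the first column.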

m <- matrix(1:4,2,2)
layout(m,
widths=c(1,3),
heights = c(3,1))
layout.show(4)

m <- matrix(c(1,1,2,1),2,2)
> print(m)
     [,1] [,2]
[1,]    1    2
[2,]    1    1
layout(m,
widths=c(2,1),
heights = c(1,2))
layout.show(2)

Finally, the numbers in the matrix can include zero (a cell numbered 0
is left empty), giving the possibility to create complex partitions.
m <- matrix(0:3,2,2)
layout(m, c(1,3),c(1,3))
layout.show(3)

Referring to the usual data set:


set.seed(29091984)
y <- matrix(runif(30), ncol=3,
dimnames=list(letters[1:10], LETTERS[1:3]))
We use a graphical device with three partitions to host three
graphical objects:
m <- matrix(c(1,2,3,2),2,2); layout(m)
hist(y, freq=TRUE, breaks=10)
plot(density(y), col="red")
plot(y[,3], type="l", lwd=2, col="blue")

Base Graphics – Save a chart from code to a file

In addition to exporting from the RStudio Plots tab, you can also save
the graphs from code.
For instance, in order to save a plot in pdf format:
pdf("test.pdf")
plot(density(y), col="red")
dev.off()
The procedure is very similar for other graphic formats, such as jpeg,
bmp, png and ps (functions jpeg(), bmp(), png() and postscript()):
jpeg("test.jpg"); plot(density(y), col="red"); dev.off()
bmp("test.bmp"); plot(density(y), col="red"); dev.off()
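
As a sketch that combines the two previous topics, the three-panel layout can be written directly to a file. The file name layout_example.pdf and the use of tempdir() are assumptions made for illustration; width and height are given in inches.

```r
# The usual data set
set.seed(29091984)
y <- matrix(runif(30), ncol = 3,
            dimnames = list(letters[1:10], LETTERS[1:3]))

# Hypothetical output file in the session's temporary directory
out_file <- file.path(tempdir(), "layout_example.pdf")

pdf(out_file, width = 7, height = 5)   # open the file device
layout(matrix(c(1, 2, 3, 2), 2, 2))    # three sub-windows
hist(y, freq = TRUE, breaks = 10)
plot(density(y), col = "red")
plot(y[, 3], type = "l", lwd = 2, col = "blue")
invisible(dev.off())                   # close the device: the file is written here

file.exists(out_file)
```

Nothing is written to the file until dev.off() is called, so forgetting it typically leaves an empty or unreadable file.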

11.
Statistical analysis
Stats package
Formulae
Generic Functions
The stats package

It would be time-consuming to go into details on all the possibilities
offered by R for statistical analysis, both for the basic functions and
for the countless specific packages. The objective of this module is
therefore to provide basic knowledge on the strengths of the software
in data analysis. The stats package, loaded automatically when R
starts, contains functions for conducting a wide range of
traditional statistical analyses including: null hypothesis testing, linear
models (which include least squares regressions, generalized linear
models and analysis of variance techniques), probability
distributions, clustering, time-series analysis, non-linear least
squares, multivariate analysis…

Recommended and Contributed statistical packages

Other statistical techniques are available in a large number of
downloadable packages.
Some of these are shipped with the standard installation of R and
are classified as recommended (recommended packages), others are
contributed by the Community (contributed packages) and have to be
installed by the user.
It is a widespread and reasonable belief that the packages
downloadable from the Comprehensive R Archive Network (CRAN)
are reliable. Be careful: it is always up to the user to critically
analyze the reliability of the results from these calculation libraries.

The structure of the module

To introduce data analysis in R, we first carry out a simple example
of statistical analysis that does not require the import of specific
packages.
The concepts of formulae and generic functions are then dealt with
in detail.
The module concludes with a significant, but not exhaustive,
overview of the statistical packages available from the R Community.
At the end of the course, professional and academic cases of
statistical data analysis in the economic and financial field will be
presented.

A simple example of an analysis of variance – ANOVA
The function to carry out the analysis of variance in stats is aov().
We use the R built-in dataset, called: InsectSprays.
https://www.rdocumentation.org/packages/datasets/versions/3.6.1/topics/InsectSprays

It contains information regarding the effectiveness of six insecticides
(A-B-C-D-E-F) on an observed response variable, which is the
number of insects counted in each treated experimental unit.
Each insecticide has been tested 12 times, so the data frame
contains a total of 72 observations.
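
The structure just described can be checked directly in the console. A minimal sketch; the object names n_obs and per_spray are introduced here only for illustration.

```r
data(InsectSprays)
str(InsectSprays)                       # columns: count (numeric), spray (factor)

n_obs     <- nrow(InsectSprays)         # total number of observations
per_spray <- table(InsectSprays$spray)  # number of trials per insecticide

print(n_obs)
print(per_spray)
```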

We can import the R built-in dataset using the data() function.
> data(InsectSprays)
> View(InsectSprays)

Before doing the analysis of variance, we perform a preliminary data
analysis using the box plot:
boxplot(count ~ spray, data = InsectSprays,
xlab = "Type of spray", ylab = "Insect count",
main = "InsectSprays data", varwidth = TRUE)

[Figure: Box plot – Descriptive statistics. Box plot of the insect
count by type of spray for the InsectSprays data.]

ANOVA has been carried out on the square root of the response
through the aov() function:
aov.spray <- aov(sqrt(count) ~ spray, data=InsectSprays)

The main (and compulsory) input argument of aov(), as in the
boxplot() function, is a formula, which specifies the output
(response) on the left of the tilde symbol ~ and the predictor on the
right.
The option data=InsectSprays specifies that the variables (count
and spray) are columns of the InsectSprays data frame.

Equivalently:
aov.spray <-
aov(sqrt(InsectSprays$count)~InsectSprays$spray)
Or, if you know the column numbers of the dataset, also:
aov.spray <-
aov(sqrt(InsectSprays[,1])~InsectSprays[,2])

The former syntax with $ is generally preferable, as it is clearer.

Results are not displayed when they are assigned to an object, in
this case named aov.spray.

The summary() function

Generally, to display a summary of the main statistical results in the
R console, the print() or the summary() functions are used.
summary() gives the user more information, while print() is more
concise.
Typing the object name in the console is equivalent to applying the
print() function to the object itself.

The display of the outputs, however, depends on the particular
function implemented, so it is essential to always consult the help of
the function.

Analysis of the results
> print(aov.spray)
Call:
   aov(formula = sqrt(InsectSprays[, 1]) ~ InsectSprays[, 2])

Terms:
                InsectSprays[, 2] Residuals
Sum of Squares           88.43787  26.05798
Deg. of Freedom                 5        66

Residual standard error: 0.6283453
Estimated effects may be unbalanced

> summary(aov.spray)
                  Df Sum Sq Mean Sq F value Pr(>F)
InsectSprays[, 2]  5  88.44  17.688    44.8 <2e-16 ***
Residuals         66  26.06   0.395
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A graphical representation of the results can be produced with:
par(mfcol=c(2,2))
plot(aov.spray)
termplot(aov.spray, se=TRUE, partial.resid = TRUE, rug=TRUE)

Formulae

Formulas are a key element in statistical analysis with R: the notation
used is almost the same for all functions.
A formula is typically expressed by y ~ model where y is the
analyzed response and model is a set of variables for which the
parameters must be estimated.
These variables are separated with arithmetic symbols, but in this
context they take on particular meanings.
a+b          additive effects of a and b
X            if X is a matrix, this specifies an additive effect of each
             of its columns: X[,1]+X[,2]+...+X[,ncol(X)]
a:b          interaction effect between a and b
a*b          additive and interaction effects: a*b is equal to a+b+a:b
poly(a,n)    polynomials of a up to degree n
^n           includes the interactions up to the level n: (a+b+c)^2 is
             equal to a+b+c+a:b+a:c+b:c
b %in% a     the effects of b are nested in a: b %in% a is equal to
             a+a:b or a/b
-b           removes the effect of b: for instance, (a+b+c)^2-a:b is
             equal to a+b+c+a:c+b:c
-1           y~x-1 is a regression that passes through the origin
1            y~1 fits a model without effects (only the intercept)
offset(...)  adds an effect to the model without the estimation of
             other parameters (for instance: offset(3*x))
We observe that the arithmetic operators of R used in a formula take
on a different meaning than the one they have in a traditional
mathematical expression.
For instance, the formula y~x1+x2 defines a model
$y = \beta_1 x_1 + \beta_2 x_2 + \alpha$ and not
$y = \beta (x_1 + x_2) + \alpha$ as in the usual meaning of the +
operator.

Formulae – the I() function
To include arithmetic operations in a formula, use the I() function:
the formula y~I(x1+x2) defines the model $y = \beta (x_1 + x_2) + \alpha$.
Similarly, to define the model $y = \beta_1 x + \beta_2 x^2 + \alpha$,
we will use the formula y~poly(x,2) and not y~x+x^2.
Furthermore, it is possible to include a function in a formula in order
to perform a variable transformation, as done with sqrt(count) in the
ANOVA example.
aov() accepts a particular syntax for the definition of the random
effect: y~a+Error(b) means the additive effect of the fixed term a
with the random effect of b.
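
The difference between these formulas can be seen by counting the estimated coefficients. The sketch below uses simulated data, which are an assumption made only for illustration; the names fit_add, fit_sum, fit_bad, fit_poly and fit_raw are hypothetical.

```r
# Simulated data (assumption, for illustration only)
set.seed(1)
n  <- 50
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n, sd = 0.1)

fit_add <- lm(y ~ x1 + x2)       # intercept + one slope per variable
fit_sum <- lm(y ~ I(x1 + x2))    # intercept + one slope for the sum
length(coef(fit_add))
length(coef(fit_sum))

# Inside a formula ^ is not a power: x1 + x1^2 reduces to x1 alone
fit_bad  <- lm(y ~ x1 + x1^2)
fit_poly <- lm(y ~ poly(x1, 2))  # orthogonal polynomials (default)
fit_raw  <- lm(y ~ x1 + I(x1^2)) # raw powers protected by I()
length(coef(fit_bad))
length(coef(fit_poly))

# Same fitted model, even if the individual coefficients differ
all.equal(fitted(fit_poly), fitted(fit_raw))
```

Note that poly(x,2) uses orthogonal polynomials by default, so its coefficients are not directly $\beta_1$ and $\beta_2$ of the raw parabola; y~x+I(x^2) or poly(x,2,raw=TRUE) gives those.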

Generic function

It is common practice that the R statistical functions return an object
of a class that has the same name as the function from which it has
been generated.
For instance, aov() returns an object of the aov class, just as
lm() returns an object of the lm class.

The functions that are used to extract the results of the analyses
(typically print() and summary()) act according to the class of the
object passed as an argument.
A function of this kind is called a generic function.


For example, the most used function to extract the results of
statistical analysis is summary().
If this function is applied to the result of lm (linear model) or of
aov (analysis of variance), the type of output displayed in the console
will be different, because the kind of information generated by the two
statistical analyses is different.
Consequently, despite the same name of the function, its behavior is
different depending on the class of the object passed as an
argument. The advantage of a generic function is therefore to use
the same syntax for all cases.
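
The dispatch mechanism can be observed on the InsectSprays example. A minimal sketch: lm.spray, s_aov, s_lm and the toy class "myresult" are hypothetical names introduced only for illustration.

```r
aov.spray <- aov(sqrt(count) ~ spray, data = InsectSprays)
lm.spray  <- lm(sqrt(count) ~ spray, data = InsectSprays)

class(aov.spray)
class(lm.spray)

# Same call, different method selected according to the class
s_aov <- summary(aov.spray)   # handled by summary.aov()
s_lm  <- summary(lm.spray)    # handled by summary.lm()
class(s_aov)
class(s_lm)

# A toy method for a made-up class: print() now knows how to display it
res <- structure(list(value = 42), class = "myresult")
print.myresult <- function(x, ...) {
  cat("Result:", x$value, "\n")
  invisible(x)
}
print(res)
```

The list of methods available for a generic can be inspected with methods(summary).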

Typically an object that contains the results of a statistical analysis is
a list and the way it is displayed is determined by its class.
We have just discussed this concept according to which the behavior
of the same function is influenced by the class of the object passed
as an argument.
This is a general feature of R, which has more than 100 generic
functions.
The following table summarizes the main generic functions that can
be used with the statistical functions.

Functions such as aov() and lm() return a list with the results of
the statistical analysis: not only can the results be viewed, but they
can also be reused in further computations in the environment.

Considering the example of variance analysis, the structure of the
returned object can be viewed in the console with str(aov.spray).
To display the names of the objects contained in the list, we use:
> names(aov.spray)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "contrasts"     "xlevels"       "call"          "terms"
[13] "model"

The results of the statistical analysis can be extracted with the usual
syntax used for lists.

aov.spray$coefficients
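
As a sketch of how the extracted components can be reused, the residual standard error printed earlier can be recomputed by hand from the list elements. The names coefs, df_res, rss and rse are introduced here only for illustration.

```r
aov.spray <- aov(sqrt(count) ~ spray, data = InsectSprays)

coefs  <- aov.spray$coefficients     # same result as coef(aov.spray)
df_res <- aov.spray$df.residual      # residual degrees of freedom
rss    <- sum(aov.spray$residuals^2) # residual sum of squares
rse    <- sqrt(rss / df_res)         # residual standard error

print(coefs)
print(c(df = df_res, RSS = rss, RSE = rse))
```

The values of df_res, rss and rse correspond to the entries "Deg. of Freedom", "Sum of Squares" and "Residual standard error" of the residuals in the print(aov.spray) output.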

Many contributed packages add to the list of standard
statistical methods contained in R.
These are distributed separately and must be installed with
install.packages() and loaded with library() before using them.
The complete list of packages can be found on the CRAN Web site:
https://cran.r-project.org/

14781 Contributed Packages

12.
Programming
Function, Scope, Debug Mode
Conditional execution
Loops and Vectorization
Script versus function

The R instructions written in the previous module have been typed in
a text file with the .R extension.
The instructions were written in sequence and they could be
executed partially, by highlighting the lines of code, or entirely, by
clicking the Source command in RStudio.

Run executes the current line or the highlighted lines, Source runs
the entire program. The button in the center re-executes the
instructions.

We want to implement the Black-Scholes-Merton pricing formula for
European-style vanilla options in an R script:
$$c = S\,N(d_1) - X e^{-rT} N(d_2)$$
$$p = X e^{-rT} N(-d_2) - S\,N(-d_1)$$
Where:
$$d_1 = \frac{\ln(S/X) + (r + \sigma^2/2)\,T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T}$$
With $S$ the spot value for the asset, $X$ the strike price, $r$ the risk-free
rate, $T$ the time to maturity in years, $\sigma$ the annualized volatility
and $N(\cdot)$ the standard normal cumulative distribution function.

# Script that implements the BSM formula
S <- 100; X <- 110; r <-0.05; T <- 1; sigma <- 0.2
d1 <- (log(S/X)+(r+sigma^2/2)*T)/(sigma*sqrt(T))
d2 <- d1 - sigma * sqrt(T)
Call <- S*pnorm(d1) - X*exp(-r*T)*pnorm(d2)
Put <- X*exp(-r*T) * pnorm(-d2) - S*pnorm(-d1)
paste("Call Option price: ", round(Call,2),
"Put Option price: ", round(Put,2))

By clicking on Source, the code is executed:
[1] "Call Option price:  6.04 Put Option price:  10.68"

All variables are stored in R regardless of their logical purpose.
Let's proceed to a more careful functional analysis of the variables.

The variables declared within the script can be functionally divided
into three categories:
- The model input variables (S, X, r, T and sigma)
- Auxiliary variables preparatory to the output calculation (d1 and d2)
- The model output variables (Call and Put)
[Diagram: inputs (S, X, r, T, sigma) -> auxiliary variables (d1, d2)
-> outputs (Call, Put)]

The previous graphical representation can be described by a
function.
The syntax for creating an R function is:

myfunction <- function(arg1, arg2, ... ){   # function name and input
  statements                                # engine, auxiliary objects
  return(object)                            # output
}

The Black-Scholes-Merton script can be re-written more efficiently by
creating a BSMprice() function:
BSMprice <- function(S, X, r, T, sigma ){
  d1 <- (log(S/X)+(r+sigma^2/2)*T)/(sigma*sqrt(T))
  d2 <- d1 - sigma * sqrt(T)
  Call <- S*pnorm(d1) - X*exp(-r*T)*pnorm(d2)
  Put <- X*exp(-r*T) * pnorm(-d2) - S*pnorm(-d1)
  return(list("Call.Price"=Call,"Put.Price"=Put))
}

In R the functions are managed as objects (note the assignment in
the first row):
BSMprice <- function(S, X, r, T, sigma)

Therefore, in order to be usable by the user, the function must be
stored in the memory and appear in the environment.
The procedure for bringing the function into the environment is the
same as for the other R objects.
From the RStudio environment we now check that only the function is
present (if needed we delete all the other objects).

The function we have created can be called in the same way
as we have used the R functions coming from the various packages
in the previous modules:
> Prices <- BSMprice(S=100,X=110,r=0.05,T=1,sigma=0.2)
> print(Prices)
$Call.Price
[1] 6.040088

$Put.Price
[1] 10.67532
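
Since the output is a list, its elements can be extracted and reused. The sketch below repeats the function definition so that it is self-contained; the put-call parity check and the vector of strikes are additions made for illustration (parity_gap, strikes and calls are hypothetical names).

```r
BSMprice <- function(S, X, r, T, sigma){
  d1 <- (log(S/X)+(r+sigma^2/2)*T)/(sigma*sqrt(T))
  d2 <- d1 - sigma * sqrt(T)
  Call <- S*pnorm(d1) - X*exp(-r*T)*pnorm(d2)
  Put <- X*exp(-r*T) * pnorm(-d2) - S*pnorm(-d1)
  return(list("Call.Price"=Call,"Put.Price"=Put))
}

Prices <- BSMprice(S=100, X=110, r=0.05, T=1, sigma=0.2)
Prices$Call.Price        # extraction with $
Prices[["Put.Price"]]    # extraction with [[ ]]

# Put-call parity: c - p = S - X*exp(-r*T), so the gap should be ~0
parity_gap <- (Prices$Call.Price - Prices$Put.Price) -
              (100 - 110*exp(-0.05*1))
print(parity_gap)

# The arithmetic in the body is vectorized, so a vector of strikes works too
strikes <- c(90, 100, 110)
calls   <- BSMprice(100, strikes, 0.05, 1, 0.2)$Call.Price
print(calls)
```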

The only object, besides the function stored in memory, is the price
list, i.e. the output.

Scope of variables

If you also want to have the inputs stored in memory, you can rewrite
the call to the function using <- instead of =
> Prices <- BSMprice(S <- 100, X <- 110, r <- 0.05, T <- 1,
+                    sigma <- 0.2)
> ls()
[1] "BSMprice" "Prices"   "r"        "S"        "sigma"    "T"        "X"

Or equivalently:
S <- 100; X <- 110; r <- 0.05; T <- 1; sigma <- 0.2
Prices <- BSMprice(S,X,r,T,sigma)

Or equivalently:
S = 100; X = 110; r = 0.05; T = 1; sigma = 0.2
Prices <- BSMprice(S,X,r,T,sigma)
Note how the operators = and <- have the same effect on the visibility
of the variables (scope) when assigning values to an object outside a
function call, while they take on different meanings if they are used
within the input arguments of a function.
Prices <- BSMprice(S=100, ...) means that when the function
is called, the variable S exists only inside the function and not
outside.

Prices <- BSMprice(S <- 100, ...) means that when the
function is called, the variable S exists inside the function and will
continue to exist in memory even outside.
Defining your own functions within a code allows you to manage
programming more effectively in terms of:
- Management of the scope of objects and therefore of memory
- Reuse of the same function in different parts of the code
(greater efficiency)
- Sharing of calculation capabilities with other programmers.
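
The different effect of = and <- inside a function call can be verified with a toy function. This is only a sketch: half(), spot, before and after are hypothetical names, not part of the pricing example.

```r
half <- function(spot) spot / 2     # toy function, for illustration only

# Make sure no object called spot exists before starting
if (exists("spot", inherits = FALSE)) rm(spot)

a <- half(spot = 100)               # named argument: nothing is created outside
before <- exists("spot", inherits = FALSE)
print(before)

b <- half(spot <- 100)              # assignment made in the calling environment
after <- exists("spot", inherits = FALSE)
print(after)
```

In both calls the function receives the value 100; only the second one leaves an object named spot in the calling environment.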

Debug mode

In programming it is essential to be able to monitor the flow of
information generated by the code step by step and the value that
the variables assume during the execution process.
Having a tool that allows the programmer to understand if the written
code actually reflects the logic designed, in order to intervene and
correct any errors (bugs), is of crucial importance.
The activity that allows such monitoring is called debugging.
In RStudio a breakpoint is set by clicking on the left of a line of the
script: a red circle appears, and the debug mode is entered when the
execution reaches that line.

Let's consider the previously programmed script and insert a
breakpoint in the first line of code.
Running the code with Source, we observe that the instructions
are not executed but the code flow stops in the first line, that is
where the breakpoint has been set, in correspondence with the
green arrow.

Reading the logs that appear in the console we can check that we
are in the debug mode.
> debugSource('C:/Software R Course/Bs.R')
Called from: eval(expr, p)
Browse[1]> n
debug at C:/Software R Course/Bs.R#1: S <- 100
Browse[2]>

At this point, no R object has been stored in memory yet.
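
The Browse[n]> prompt shown in the log can also be reached from code, without the RStudio interface, using the base functions debug() and undebug(). The sketch below only sets and removes the debugging flag on a hypothetical toy function f, so that it also runs non-interactively; the browser commands are listed as comments.

```r
f <- function(x) {      # hypothetical toy function
  y <- x^2
  y + 1
}

debug(f)                    # flag f: its next call would open the browser
flag_on <- isdebugged(f)
undebug(f)                  # remove the flag
flag_off <- isdebugged(f)
print(c(flag_on, flag_off))

# In an interactive session, after debug(f), calling f(3) shows the
# Browse[n]> prompt, which accepts among others:
#   n  execute the next line
#   s  step into the function called on the next line
#   f  finish the current loop or function
#   c  continue the execution
#   Q  quit the debug mode
```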

Above the Console, the buttons needed to inspect the code appear.
From left to right, they:
- execute the next line of code;
- execute the next line of code, also stepping into the code written
inside a function;
- execute the remaining lines of code until the script ends;
- continue until the next breakpoint is reached;
- exit from the debug mode.

By clicking on the Next button, the script instructions are
executed and the objects progressively stored in the environment.

A breakpoint is set in line 9, in correspondence with the definition of
variable d2.
The continue button is used to go to the next breakpoint.

The objects stored in memory inside the function are:

If you proceed to execute the remaining lines of code, it is
observed that the auxiliary internal variables d1 and d2 are no longer
present in memory, as they are local to the function and are not
returned by the return statement.

To remove a breakpoint, simply click on it.

Conditional execution: if statement

Conditional blocks can be introduced into the R code. They are
constructs that execute portions of code only if certain logical
conditions are met.
These code flow control instructions are called if statements.
The syntax for implementing such conditional code execution is:

if (logical_expression) statement_1 else statement_2

It indicates that if the logical expression is true then statement_1 will
be executed, otherwise statement_2 will be executed.

For example, if you run these lines of code in the console:
age <- 19
if(age>=18) print("Adult") else print("Minor")

The result is:
[1] "Adult"

Since the condition age>=18 is verified, statement_1, i.e.
print("Adult"), is executed instead of statement_2, i.e.
print("Minor").

The same command could be written in an extended form using the
so-called grouped expression:

if (test_expression) {
  Statement1
} else {
  Statement2
}

For the previous example:

age <- 19
if (age>=18){
  print("Adult")
}else{
  print("Minor")
}

Multiple logical expressions can be combined using the traditional
operators (&& and ||).
age <- 19
if ((age>=18) && (age<150) && (age>0)){
print("Adult")
}else if ((age<18) && (age<150) && (age>0)) {
print("Minor")
}else {
print("Invalid Age")
}
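
The same logic can be wrapped in a function and applied to several ages. A sketch under the assumption that the valid range is 0 < age < 150, as in the code above; classify_age, ages, labels_loop and labels_vec are hypothetical names. Since && and || work on single logical values, the vectorized variant uses & together with ifelse().

```r
classify_age <- function(age) {   # hypothetical helper, one age at a time
  if (age >= 18 && age < 150) {
    "Adult"
  } else if (age > 0 && age < 18) {
    "Minor"
  } else {
    "Invalid Age"
  }
}

ages <- c(19, 7, -3, 200)

# One call per element
labels_loop <- sapply(ages, classify_age)

# Vectorized alternative: & compares element by element
labels_vec <- ifelse(ages >= 18 & ages < 150, "Adult",
              ifelse(ages > 0 & ages < 18, "Minor", "Invalid Age"))

print(labels_loop)
print(labels_vec)
```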

A refinement for the pricing function of a European option according
to the Black-Scholes-Merton model could be:
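
One possible refinement is sketched below under assumptions that are not taken from the notes: the function name BSMprice2, the type argument and the input checks are all hypothetical choices, meant only to show how if statements can be used inside the pricing function.

```r
BSMprice2 <- function(S, X, r, T, sigma, type = "call") {
  # Input validation (assumption: these inputs must be strictly positive)
  if (S <= 0 || X <= 0 || T <= 0 || sigma <= 0) {
    stop("S, X, T and sigma must be strictly positive")
  }
  d1 <- (log(S/X) + (r + sigma^2/2)*T) / (sigma*sqrt(T))
  d2 <- d1 - sigma*sqrt(T)
  # Conditional execution: only the requested price is computed
  if (type == "call") {
    price <- S*pnorm(d1) - X*exp(-r*T)*pnorm(d2)
  } else if (type == "put") {
    price <- X*exp(-r*T)*pnorm(-d2) - S*pnorm(-d1)
  } else {
    stop("type must be \"call\" or \"put\"")
  }
  return(price)
}

call_price <- BSMprice2(100, 110, 0.05, 1, 0.2, type = "call")
put_price  <- BSMprice2(100, 110, 0.05, 1, 0.2, type = "put")
print(c(call_price, put_price))

# A negative volatility is rejected instead of silently returning NaN
bad_input <- try(BSMprice2(100, 110, 0.05, 1, -0.2), silent = TRUE)
print(class(bad_input))
```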

Loops and vectorization

Suppose we have a vector X and, for each element of X with a value
equal to b, we want to assign the value 0 to the corresponding element
of another vector Y, otherwise 1. For instance, setting b = 5 and a
given vector equal to
X = c(4,-1,5,12,5,5,-4)
the resulting vector Y will be:
Y = c(1,1,0,1,0,0,1)

X      4  -1   5  12   5   5  -4
Y      1   1   0   1   0   0   1
indx   1   2   3   4   5   6   7

The solution of this exercise can be addressed by programming a for
loop.
The syntax of an iterative execution in R is as follows:

for (val in sequence) {
  statement
}

[Flowchart: for each item in the sequence the body of the for loop is
executed; once the last item has been reached, the loop exits.]

b <- 5
X <- c(4,-1,5,12,5,5,-4)
Y <- numeric(length(X))
for (i in 1:length(X)) {
  if(X[i]==b){
    Y[i] <- 0
  } else {
    Y[i] <- 1
  }
}

When using a for loop to fill an array, it is mandatory to define its
size and mode in advance.
An ordered indentation of the code allows a better understanding and
readability of the code. Note the alignment of the curly brackets.
To fully understand the logic, the reader is invited to activate the
debug mode and execute the statements step-by-step.
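
A detail worth noting about the index sequence: 1:length(X) misbehaves when X is empty, because 1:0 is the vector c(1, 0). The base function seq_along() avoids this. A minimal sketch; X_empty, n_bad and n_good are hypothetical names.

```r
X_empty <- numeric(0)

n_bad <- 0
for (i in 1:length(X_empty)) n_bad <- n_bad + 1      # iterates over 1 and 0

n_good <- 0
for (i in seq_along(X_empty)) n_good <- n_good + 1   # never entered

print(c(n_bad, n_good))
```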

Loops and other control structures can however be avoided in many
cases using a feature of the R software called vectorization.
Using vectorization, loops are made implicit in the expressions.
Consider, for example, vector z defined as the sum of two vectors of
equal length: x and y.
R allows the flexibility to add element by element without
programming an explicit loop.

z <- x + y
In traditional programming languages where the vectorization feature
is not supported, the use of for is essential.
The equivalent instruction in short form (i.e. without coding a grouped
expression) is:
z <- numeric(length(x))
for (i in 1:length(z)) z[i] <- x[i] + y[i]
Besides requiring more code, the execution of a loop
or more generally of a control structure is computationally much
more expensive than an instruction in vectorized form.
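
The two forms can be compared with the base function system.time(). In this sketch the vector length of 1e5 and the names z_vec, z_loop, t_vec and t_loop are arbitrary choices made for illustration; the measured times depend on the machine, so no values are reported here.

```r
set.seed(1)
x <- runif(1e5)
y <- runif(1e5)

# Vectorized form
t_vec <- system.time(z_vec <- x + y)

# Explicit loop
t_loop <- system.time({
  z_loop <- numeric(length(x))
  for (i in seq_along(x)) z_loop[i] <- x[i] + y[i]
})

identical(z_vec, z_loop)   # the two results coincide
print(t_vec["elapsed"])
print(t_loop["elapsed"])
```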

The previous example could be rewritten more efficiently (and
elegantly) as follows:
b <- 5
X <- c(4,-1,5,12,5,5,-4)
Y <- numeric(length(X))
Y[X!=b] <- 1
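
Other vectorized formulations give the same vector. Two hedged alternatives, shown only for comparison (Y1 and Y2 are hypothetical names): converting the logical comparison to numbers, and using ifelse().

```r
b <- 5
X <- c(4,-1,5,12,5,5,-4)

Y1 <- as.numeric(X != b)     # TRUE/FALSE converted to 1/0
Y2 <- ifelse(X == b, 0, 1)   # element-wise if/else

print(Y1)
print(Y2)
```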
The tictoc package measures, in a simple and intuitive way,
the time taken by R to execute the lines of code included
between the instructions tic() and toc(). It is therefore a useful
tool for improving code performance.

tic and toc
# install.packages("tictoc")
library(tictoc)
tic()
b <- 5
X <- c(4,-1,5,12,5,5,-4)
Y <- numeric(length(X))
Y[X!=b] <- 1
toc()

0.05 sec elapsed

While loop

An alternative way to program iterative executions is to use the
while loop.
Its syntax is:

while(test_expression) {
  statement
}

[Flowchart: on entering the while loop the test expression is
evaluated; if it is TRUE the body of the while loop is executed and the
test is repeated, if it is FALSE the loop exits.]

The previous example is repeated using a while loop.

b=5;X=c(4,-1,5,12,5,5,-4)
Y=numeric(length(X));i=1
while (i <= length(X)) {
  if(X[i]==b){
    Y[i] <- 0
  } else {
    Y[i] <- 1
  }
  i=i+1
}

When using a while loop to fill an array, it is mandatory to define its
size and mode in advance.
An ordered indentation of the code allows a better understanding and
readability of the code. Note the alignment of the curly brackets.
To fully understand the logic, the reader is invited to activate the
debug mode and execute the statements step-by-step.

To conclude… three types of programmers

Once a programmer of the first type has reached the goal for which
"the code does what it has to do" he feels satisfied.

A programmer of the second type differs from the previous one
because he constantly challenges himself. Not only has he achieved
the goal of writing a code that actually works, he is also preoccupied
with writing it in the most efficient way, in terms of performance.

A programmer of the third type, however, adds one more element to
this challenge, that is, he makes sure that the way he writes the
instructions is also elegant: intelligible in its sublime logical form.

Programming between art and technique

The Art of Computer Programming by Prof. Donald Knuth can be
considered a sort of encyclopedia for programming and, like all arts,
it is constantly evolving.

Code Wisdom – https://twitter.com/CodeWisdom
"Code never lies, comments sometimes do." – Ron Jeffries
"Make it correct, make it clear, make it concise, make it faster. In that order."
– Wes Dyer
"Debugging is like being the detective in a crime movie where you are also
the murderer" – Filipe Fortes
"The only way to learn a new programming language is by writing programs
in it" – Dennis Ritchie
"Everyday life is like programming, I guess. If you love something you can
put beauty into it." – Donald Knuth
"Tidy datasets are all alike, but every messy dataset is messy in its own
way." – Hadley Wickham

Bibliography and Sitography

These handouts have been drawn up based on various contributions
and guides found on CRAN:
https://cran.r-project.org/doc/contrib/ and on various specialized sites.

I would like to mention two sources in particular:

An Introduction to R – Notes on R: A Programming Environment for
Data Analysis and Graphics, version 3.6.1 (2019-07-05) by W. N.
Venables, D. M. Smith and the R Core Team

R for Beginners by Emmanuel Paradis

Pier Giuseppe Giribone
PhD, CIIA®, CESGA®, CIWM®

During the course, on Mondays from 16:30 to 17:30, I will be at the
office of the Adjunct professors.
Upon request, by email, I will always be available to make an
appointment.

[email protected]
Website: http://www.diptem.unige.it/piergiribone/