Software R
net/publication/354248891
All content following this page was uploaded by Pier Giuseppe Giribone on 31 August 2021.
An introduction to Statistical Programming
1. 3
Ross Ihaka and Robert Gentleman
1. 4
Software R
1. 5
Software R
1. 6
R console
1. 7
R console - help functions
> help() is the function that opens the online help. It is useful for
knowing the meaning of an unknown R command and being able to
implement it using the right syntax.
For instance, in order to know the meaning of the getwd function,
you can type in the console:
> help("getwd")
Or equivalently:
> ?getwd
And the page of the guide dedicated to the function appears on the
web browser.
1. 8
> setwd("C:/Users/Utente/Desktop/")
1. 9
Traditional structure of a help page
1.10
On-line help
Typing the help.start() instruction in the console opens the on-line help.
1.11
R console – Case sensitivity
1.12
R console – Comments
1.13
Multiple instructions on the same line
Using the Editor provided by the console, you can write the code
more easily, save the file and share its content.
Press Ctrl + R to execute the instructions.
The saved script has the extension *.R
1.15
R console – Completion of a command “+”
1.16
R console – Memory management
To see from the console all the variables stored in the memory, which
can therefore be used, you can execute the ls() function or,
equivalently, the objects() function.
> ls()
[1] "a" "b" "c" "d"
Consequently, you cannot use any variable which is not currently
stored in the memory.
> e
Error: object "e" not found
1.17
R console – Memory management
The memory area and, therefore, the data contained therein can be
saved in a file with the *.RData extension using the save.image() instruction
> save.image("variables.RData")
The RData file is saved in the current working directory or in the path
specified by the user.
A more extensive and customizable way is to use the instruction:
> save(list = ls(all.names = TRUE), file = ".RData",
envir = .GlobalEnv)
You can access the guide for this function by typing:
> help("save")
1.18
R console – Memory management
In order to clear variables from the memory, you can use the rm()
function.
For instance, you can remove the variable a by typing in the prompt:
> rm(a)
> a
Error: object "a" not found
To clean the memory completely, the following instruction is used:
> rm(list=ls())
> ls()
character(0)
1.19
R console – Memory management
In order to import data stored in an .RData file, you can use the load()
function.
> load("variables.RData")
> ls()
[1] "a" "b" "c" "d"
If the file is in a different path than the current working
directory, you must either specify the path of the file or set the
working directory where the file of interest is located using the setwd()
function.
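The save / remove / load cycle described in these slides can be tried in one go. This is a minimal sketch: the variable demo_value is hypothetical, and tempfile() is used (an assumption of this example) so that nothing is written into the current working directory.

```r
# A hypothetical variable to be saved
demo_value <- c(1, 2, 3)

# Save it in a temporary .RData file instead of the working directory
rdata_path <- tempfile(fileext = ".RData")
save(demo_value, file = rdata_path)

# Remove it from memory, then restore it from the file
rm(demo_value)
load(rdata_path)
print(demo_value)
```

After load() the variable is available again under its original name.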
1. 20
R console – Remove the previous instructions from the console
“Help Console”
1. 21
Save the history of the interpreted instructions in a file
1. 22
R Studio – the most widespread IDE for R programming
1. 23
R Studio – Download
1. 24
25
R Studio – the graphic interface of the R development software
1. 26
2.
Vectors
Assignment
Numeric and Logical vector
Character and Index vector
Vectors and assignment
2.2
Vectors and assignment
The assignment instruction is written in the script, then it is executed
in the console.
The variable is in the Environment because it has been stored in the
memory.
2.3
The View() function allows you to see the contents of an object in
a table-like format.
> View(x)
2.4
Vectors and assignment
c() can take an arbitrary number of vectors as input data: the result
will be a single vector made by the concatenation of all the input
vectors.
It is worth noting that a number, or better a scalar, is itself a vector of
length 1.
Alternative ways of writing the previous assignment of the vector x
are:
> assign("x", c(9.3, 5.5, 10.6, 5.3, 1.3, 7.7))
> c(9.3, 5.5, 10.6, 5.3, 1.3, 7.7) -> x
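As a quick check, the three assignment forms can be compared directly. The variable names x_arrow, x_named, x_right and joined are chosen only for this sketch.

```r
# Three equivalent ways to bind the same values to a name
x_arrow <- c(9.3, 5.5, 10.6, 5.3, 1.3, 7.7)
assign("x_named", c(9.3, 5.5, 10.6, 5.3, 1.3, 7.7))
c(9.3, 5.5, 10.6, 5.3, 1.3, 7.7) -> x_right

# c() concatenates vectors and scalars into one single vector
joined <- c(x_arrow, 0, x_arrow)
print(length(joined))
```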
2.5
Vectors and assignment
2.6
The statement 1/x, which calculates the reciprocals of the six numeric
elements contained in the vector, has not been assigned to any variable and
therefore does not appear in memory. The result is only displayed and can
only be recalled temporarily using the command .Last.value
2.7
Vectors and assignment
The combine function, c(), can also be used with vectors of
different lengths.
x <- c(9.3, 5.5, 10.6, 5.3, 1.3, 7.7)
2.8
Arithmetic vectors – Arithmetic operations
2.9
Arithmetic vectors – Arithmetic operations
x1 <- c(9.3, 5.5, 10.6, 5.3, 1.3, 7.7)
y1 <- c(1,2)
z1 <- c(6,5,2,4)
v1 <- x1+y1+z1+1
Warning message:
In x1 + y1 + z1 :
  longer object length is not a multiple of shorter object length
x2 <- x1
y2 <- c(1,2,1,2,1,2)
z2 <- c(6,5,2,4,6,5)
v2 <- x2+y2+z2+1
> print(v1-v2)
[1] 0 0 0 0 0 0
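The result v1 - v2 is zero because R recycles the shorter vectors until they match the longest one. A minimal sketch of the same rule, with vector names (short_vec, long_vec) invented for the example and rep() used to spell out the recycling explicitly:

```r
short_vec <- c(1, 2)
long_vec <- c(10, 20, 30, 40, 50, 60)

# Implicit recycling: short_vec is reused three times
recycled <- long_vec + short_vec

# The same operation with the recycling written out by hand
explicit <- long_vec + rep(short_vec, length.out = length(long_vec))

print(recycled)
print(identical(recycled, explicit))
```

No warning is raised here because 6 is a multiple of 2; the warning only appears when the longer length is not a multiple of the shorter one.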
2.10
Arithmetic vectors – traditional mathematical functions
2.11
Arithmetic vectors – traditional arithmetic functions
myvec <- c(1, 5, 3.5, -1, +2)
5L means the integer number 5 (L = Long Integer).
By default R interprets the numbers as double (i.e. real numbers).
2.12
Arithmetic vectors – mean and variance
2.13
Arithmetic vectors - sorting
sort(x) returns a vector of the same length as x having its
elements sorted in ascending order.
x <- c(1, 5, 3.5, -1, +2)
xsorted <- sort(x)
help("sort")
> print(x)
[1] 1.0 5.0 3.5 -1.0 2.0
> print(xsorted)
[1] -1.0 1.0 2.0 3.5 5.0
2.14
Arithmetic vectors – NaN and complex numbers
Normally the user of R will not have to worry about whether the
numbers contained in a vector are integers, real or complex: the
calculations are done internally in the most precise way, treating
them as real double or complex double. In order to work with
complex numbers, it is necessary to make the imaginary part explicit.
Consequently:
> sqrt(-16)
[1] NaN
> sqrt(-16+0i)
[1] 0+4i
«In programming, NaN (Not a Number) is a warning indicating that the
result of a (numeric) operation was performed on invalid operands.»
2.15
Arithmetic vectors – regular sequences
R has several facilities that allow the generation of the most common
number sequences. For instance, in order to create the numeric
sequence which goes from 1 to 10, you can use c():
sequence_vector <- c(1,2,3,4,5,6,7,8,9,10)
Or, more easily, you can do the same task using :
sequence_vector <- 1:10
: has a higher priority than the arithmetic operators *, /, + and -
(but a lower priority than ^). The output of the instruction 2*1:10
is therefore not a vector that goes from 2 to 10: R first generates the
sequence from 1 to 10 and then multiplies this vector by 2.
2.16
Arithmetic vectors – “:” and priority
> n <- 10
> 1:n-1
[1] 0 1 2 3 4 5 6 7 8 9
> 1:(n-1)
[1] 1 2 3 4 5 6 7 8 9
2.17
Arithmetic vectors – The seq function
The seq() function allows you to generate numeric sequences in a more
general and customizable way.
> help("seq")
This function has five input arguments, but not all of these are
compulsory during its call.
The first two inputs are the starting and the ending of the numeric
sequence. Consequently:
> 2:10
is equivalent to:
> seq(2,10)
2.18
Arithmetic vectors – The seq function and its named-form input
The input arguments for seq(), as well as many other R functions,
can be passed in the so-called named form.
In this case, the order in which the input data are passed is
irrelevant.
The first two mandatory input arguments can therefore also be
written in the named-form using from=value, to=value
The following instructions generate the same outputs:
> seq(3,15)
> seq(from=3,to=15)
> seq(to=15,from=3)
2.19
Arithmetic vectors – The seq function and its named-form input
Looking at the function help, the next two input arguments of seq()
are: by=value, length.out=value (usually abbreviated to length=value).
These specify a step and a length for creating the number sequence.
When only from and to are given, the step is 1 (or -1 if from > to).
For instance, the instruction:
> vect1 <- seq(-2,2, by=.5)
generates a vector named vect1 having the following 9 elements:
> vect1
[1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
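The same nine-element vector can also be requested by length instead of by step. A small sketch, with the variable names by_step and by_length chosen for the example:

```r
# Fix the step: the length follows
by_step <- seq(-2, 2, by = 0.5)

# Fix the length: the step follows
by_length <- seq(-2, 2, length.out = 9)

print(length(by_step))
print(all.equal(by_step, by_length))
```

all.equal() is used rather than == because the two sequences are built with floating-point arithmetic.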
2.20
Arithmetic vectors – The seq function and its named-form input
2.21
Arithmetic vectors – Random sequences
2.22
Arithmetic vectors – Random sequences
# 4 numbers drawn from a NID(0,1)
> rnorm(4)
[1] -0.01977535  1.34546924 -0.41916212 -1.15732186
# 4 numbers drawn from a uniform U(3,5)
> runif(4,min=3,max=5)
[1] 3.409750 4.036633 3.157913 3.160682
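Each call to rnorm() or runif() returns different numbers, so the values printed on the slide will not be reproduced. If repeatable draws are needed, set.seed() can be called first. A minimal sketch; the seed 123 is an arbitrary choice made for this example:

```r
# Fix the state of the random number generator
set.seed(123)
first_draw <- rnorm(4)

# Resetting the same seed reproduces the same draw
set.seed(123)
second_draw <- rnorm(4)
print(identical(first_draw, second_draw))

# Uniform draws always fall inside [min, max]
u <- runif(1000, min = 3, max = 5)
print(range(u))
```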
2.23
Arithmetic vectors – dfunc, pfunc, qfunc
2.24
Arithmetic vectors – dfunc, pfunc, qfunc
> qnorm(c(0.025,0.975))
[1] -1.959964 1.959964
The P-value for a chi-square test χ² = 3.84 with one degree of
freedom, df = 1, is:
> 1-pchisq(3.84,1)
[1] 0.05004352
The values of a standard normal cumulative distribution, N(d), for
d = {−1, 0, +1} are:
> pnorm(c(-1,0,+1))
[1] 0.1586553 0.5000000 0.8413447
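The q and p functions are inverses of each other, and the upper tail can be requested directly instead of computing 1 - p. A short sketch of these relations for the distributions used above (variable names are chosen for the example):

```r
# qnorm() and pnorm() are inverse functions
probs <- c(0.025, 0.5, 0.975)
quantiles <- qnorm(probs)
round_trip <- pnorm(quantiles)
print(all.equal(round_trip, probs))

# Upper-tail probability without writing 1 - pchisq(...)
p_upper <- pchisq(3.84, df = 1, lower.tail = FALSE)
print(p_upper)

# dnorm() is the density: at 0 it equals 1 / sqrt(2 * pi)
density_at_zero <- dnorm(0)
print(density_at_zero)
```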
2.25
Arithmetic vectors – the rep function
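The rep() function allows you to replicate an object in different ways
> help("rep")
The easiest method for creating replications of the object x is to
specify the named-form parameter times=value:
> x=1:3
> vectRepl1 <- rep(x,times=4)
> print(vectRepl1)
[1] 1 2 3 1 2 3 1 2 3 1 2 3
# times = 4: the whole sequence 1 2 3 appears 4 times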
2.26
Arithmetic vectors – the rep function
Another common way to implement rep() is to specify the named
form parameter each=value.
In this case each element is repeated and not the entire sequence of
the object.
> x=1:3
> vectRepl2 <- rep(x,each=4)
> print(vectRepl2)
[1] 1 1 1 1 2 2 2 2 3 3 3 3
# each = 4: every element of x appears 4 times in a row
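The times and each parameters can also be combined, and times accepts a vector. A brief sketch with variable names invented for the example:

```r
x <- 1:3

# each is applied first, then the result is repeated times
both <- rep(x, each = 2, times = 2)
print(both)

# A vector for times gives one repetition count per element
per_element <- rep(x, times = c(1, 2, 3))
print(per_element)

# length.out truncates or extends the replication to a fixed length
fixed_length <- rep(x, length.out = 7)
print(fixed_length)
```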
2.27
Logical vector
2.28
Logical vectors – logical operators
2.29
Logical vectors – logical expressions
Furthermore, if cond1 and cond2 are two logical expressions, the
principles of traditional logic apply:
cond1 & cond2 is the intersection (the logical and shortcut)
cond1 | cond2 is the union (the logical or shortcut)
!cond1 is the negation of cond1
Therefore, all the rules of the "Truth Tables" are respected.
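However, keep in mind that a determined logical expression
(therefore true or false) combined with an undetermined logical
expression (NA) will usually return NA. The exceptions are FALSE & NA,
which is FALSE, and TRUE | NA, which is TRUE, because the result does
not depend on the unknown value.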
2.30
Logical vectors – logical expressions
cond1=FALSE; cond2=TRUE
cond3=NA
# Negation
print(!cond1)
# AND
print(cond1 & cond1)
print(cond1 & cond2)
print(cond2 & cond1)
print(cond2 & cond2)
# OR
print(cond1 | cond1)
print(cond1 | cond2)
print(cond2 | cond1)
print(cond2 | cond2)
# Operations with NA
print(!cond3)
print(cond2 & cond3)
print(cond1 | cond3)
print(cond1 & cond3)
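The four NA cases can be summarised in a few lines. This sketch only restates the rule that the result is NA whenever it would depend on the unknown value:

```r
# The result depends on the unknown value: NA
na_and <- TRUE & NA
na_or <- FALSE | NA
na_not <- !NA

# The result is decided by the known operand alone
known_and <- FALSE & NA
known_or <- TRUE | NA

print(c(na_and, na_or, na_not, known_and, known_or))
```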
2.31
Missing Value
2.32
Missing Values – is.na function
The is.na(x) function returns a logical vector of the same length as
the input vector x: each element is TRUE if the corresponding element
of x is NA, and FALSE otherwise.
> vect <- c(1:5,NA)
> print(vect)
[1] 1 2 3 4 5 NA
> logicvect=is.na(vect)
> print(logicvect)
[1] FALSE FALSE FALSE FALSE FALSE TRUE
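The logical vector returned by is.na() is typically used to count or to drop missing values. A small sketch on the same vect (the other variable names are chosen for the example):

```r
vect <- c(1:5, NA)

# Number of missing values
n_missing <- sum(is.na(vect))

# Keep only the observed values
observed <- vect[!is.na(vect)]

# Many functions return NA unless missing values are removed
mean_with_na <- mean(vect)
mean_without_na <- mean(vect, na.rm = TRUE)

print(n_missing)
print(observed)
print(mean_without_na)
```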
2.33
Missing Values – NaN
2.34
Character vectors – Strings
2.35
Character vectors – C-style escape sequences
2.36
Access vector elements
The elements stored in a vector x can be selected in the simplest
way by using the index of the position at which you want to access
the element in square brackets x[index].
To select the vector element x stored in the third position (index = 3),
you have to type x[3]
2.37
Index vectors: logical
In this case, only the elements of the vector x that satisfy the
logical condition expressed in the square brackets will be selected.
For instance:
> x <- c(10,-5,15,1.2,20)
> y <- x[x>=10]
> print(y)
[1] 10 15 20
Vector y contains only the values of x that meet the condition of
being greater than or equal to 10. Note that length(y) <= length(x).
2.38
Index vectors: positive integers
In this case, the values of the index vector must be chosen in the set
{1,2,...,length(x)}.
The elements of x at the positions given by the index vector are
selected and concatenated in that order.
For instance:
> x <- c(10,-5,15,1.2,20,6,3)
> y <- x[c(1,3:5,length(x))]
> print(y)
[1] 10.0 15.0 1.2 20.0 3.0
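Logical and positive-integer indexing are closely related: which() converts a logical condition into the corresponding positions. A minimal sketch on the same x (variable names invented for the example):

```r
x <- c(10, -5, 15, 1.2, 20, 6, 3)

# Logical index vector
by_condition <- x[x >= 10]

# The same selection through positive integer positions
positions <- which(x >= 10)
by_position <- x[positions]

# An index outside 1..length(x) does not fail: it returns NA
beyond_end <- x[length(x) + 1]

print(positions)
print(by_position)
```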
2.39
Index vectors: negative integers
2.40
Index vectors: character strings
2.41
Index vectors and names attributes
2.42
Other R objects… A preview
Vectors are one of the most important types in R, but there are other
fundamental objects that will be covered in the next slides:
- Matrix and multidimensional array: extensions of vectors.
- Factors: objects useful for handling categorical data.
- Lists: generalization of vectors in which the stored elements do
not necessarily have to be of the same nature.
- Data Frames: structures which look like matrices, but they can
host different types of data (“data matrices”).
- Functions: functions are themselves managed as objects in R
and they can be stored in the workspace.
2.43
3.
Mode & Attributes
Mode, Type, Class and Attributes
Recursive Structures
Object Coercion
Intrinsic attributes: mode and length
3. 2
Intrinsic attributes: mode and length
3. 3
Intrinsic attributes: mode and length
vect_double <- c(1, 2.3, 3.1, -4)
mode(vect_double) #numeric
typeof(vect_double) #double
vect_int <- c(1L, 10L, 5L)
mode(vect_int) #numeric
typeof(vect_int) #integer
vect_complex <- c(0.0 + 1i, 2.5 + 6.0i)
mode(vect_complex) #complex
typeof(vect_complex) #complex
3. 4
Intrinsic attributes: mode and length
vect_text <- c("knight","queen","king")
mode(vect_text) #character
typeof(vect_text) #character
vect_logic <- c(TRUE,3>0,FALSE,0==3)
mode(vect_logic) #logical
typeof(vect_logic) #logical
vect_raw <-raw(3)
mode(vect_raw) #raw
typeof(vect_raw) #raw
3. 5
Intrinsic attributes: mode and length
3. 6
Intrinsic attributes: mode and length
vect_empty_num <- numeric(10)
vect_empty_chr <- character(10)
3. 7
Recursive structures and lists
3. 8
Recursive structures and lists
# Lists can store objects characterized by a different
# nature, as a result lists are not atomic
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b)
# We can declare lists of lists (object recursion).
xrecursive <- list(x,30)
# We now check the variables stored in memory and their
# classification
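One possible way to carry out this check from the console, instead of the Environment pane, is sketched here with is.atomic(), is.recursive() and typeof(). The objects are the ones defined on this slide; the variable component_types is introduced only for the example.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b)
xrecursive <- list(x, 30)

# Vectors are atomic, lists are recursive
print(is.atomic(n))
print(is.recursive(x))

# Type of each component stored in the list
component_types <- sapply(x, typeof)
print(component_types)

# The first component of xrecursive is itself a list
print(is.list(xrecursive[[1]]))
print(xrecursive[[1]][[2]][3])
```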
3. 9
Recursive structures and lists
3.10
Intrinsic attributes
3.11
attributes() and attr()
> x <- 1:10
> attr(x, "dim") <- c(2, 5)
> View(x)
> attributes(x)
$dim
[1] 2 5
3.12
Change of mode: Object Coercion
3.13
The class of an object
3.14
Class – Mode – Type
x <- c(2.1, 4, 3, 1, 5, 7)
print(class(x)); print(mode(x)); print(typeof(x))
# class: numeric - mode: numeric – type: double
A = matrix(
x, # the data elements
nrow=2, ncol=3, # number of rows and columns
byrow = TRUE) # fill matrix by row
print(class(A)); print(mode(A)); print(typeof(A))
# class: matrix (and "array" from R 4.0.0) - mode: numeric - type: double
3.15
The class of an object
3.16
4.
String
String Manipulation
Format
Regular Expression
Rules for the generation of strings
4.2
Examples of valid and invalid strings
# valid declarations for strings
str1 <- 'Start and end with single quote'
str2 <- "Start and end with double quotes"
str3 <- "single quote ' in between double quotes"
str4 <- 'Double quotes " in between single quote'
# invalid declarations for strings
str5 <- 'Mixed quotes"
str6 <- 'Single quote ' inside single quote'
str7 <- "Double quotes " inside double quotes"
4.3
String manipulation – nchar()
4.4
String manipulation – toupper() and tolower() case
toupper() makes all characters in the string uppercase
tolower() makes all characters in the string lowercase
4.5
String manipulation – substring()
The substring() function extracts a part of a string.
The basic syntax is:
substring(x,first_index,last_index)
Where x is the input character vector, first_index is the index
position at which to extract the first character, last_index is the
index position at which to extract the last character
> result <- substring("Asterix and Obelix", 13, 18)
> print(result)
[1] "Obelix"
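The string functions seen so far combine naturally. A short sketch on the same sentence; the variable names are chosen for the example:

```r
title <- "Asterix and Obelix"

# Number of characters, spaces included
n_chars <- nchar(title)

# First word, extracted by position and converted to upper case
first_word <- substring(title, 1, 7)
first_upper <- toupper(first_word)

# When the last index is omitted, substring() goes to the end of the string
last_word <- substring(title, 13)

print(n_chars)
print(first_upper)
print(last_word)
```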
4.6
Numbers and strings formatting – format()
4.7
Numbers and strings formatting – format()
# Total number of digits to be displayed. Last digit
# will be rounded.
result <- format(pi, digits = 9)
> print(result)
[1] "3.14159265"
4.8
Numbers and strings formatting – format()
# The minimum number of digits to display to the right
# of the decimal point.
result <- format(3.14, nsmall = 5)
> print(result)
[1] "3.14000"
# To fill the 8 positions of width, white spaces are
# inserted to the left of the number.
result <- format(42, width = 8)
> print(result)
[1] "      42"
4.9
Numbers and strings formatting – format()
# Left justify strings.
result <- format("Idefix", width = 10, justify = "l")
> print(result)
[1] "Idefix    "
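Because format() always returns a string, the padding can be verified with nchar(). A minimal sketch of the width and nsmall parameters (variable names invented for the example):

```r
# Numbers are padded on the left up to the requested width
padded_number <- format(42, width = 8)

# Strings are padded on the right when left-justified
padded_string <- format("Idefix", width = 10, justify = "left")

# nsmall sets the minimum number of decimals
with_decimals <- format(3.14, nsmall = 5)

print(nchar(padded_number))
print(nchar(padded_string))
print(with_decimals)
```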
4.10
String concatenation – paste()
Strings can be combined together using the paste() function.
The basic syntax is:
paste(..., sep = " ", collapse = NULL)
4.11
String concatenation – paste()
a <- "Asterix"
b <- 'and'
c <- 'Obelix'
d <- "by Uderzo! "
> print(paste(a,b,c,d))
[1] "Asterix and Obelix by Uderzo! "
> print(paste(a,b,c,d, sep = "**"))
[1] "Asterix**and**Obelix**by Uderzo! "
> print(paste(a,b,c,d, sep = "", collapse = ""))
[1] "AsterixandObelixby Uderzo! "
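In the calls above collapse has no visible effect, because every argument is a single string. It matters when paste() receives a vector: sep joins the arguments element by element, while collapse merges the resulting vector into one string. A sketch with names invented for the example:

```r
heroes <- c("Asterix", "Obelix", "Idefix")

# sep joins the arguments element by element: the result is still a vector
labelled <- paste("Hero", heroes, sep = ": ")

# collapse merges the elements of the result into a single string
one_string <- paste(heroes, collapse = " & ")

# paste0() is paste() with sep = ""
ids <- paste0("x", 1:3)

print(labelled)
print(one_string)
print(ids)
```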
4.12
String splitting – strsplit()
4.13
Substring replacement – sub() and gsub()
The sub() and gsub() functions substitute a substring with
another one, thus implementing a replacement: sub() replaces only the
first occurrence, while gsub() replaces all of them.
The syntax is:
sub(old_substring, new_substring, string)
gsub(old_substring, new_substring, string)
4.14
Substring replacement – sub() and gsub()
# The sub() function
sentence <- "Savona is a seaside town. Savona is located in Liguria"
print(sentence)
[1] "Savona is a seaside town. Savona is located in Liguria"
4.15
Substring replacement – sub() and gsub()
# The gsub() function
sentence <- "Savona is a seaside town. Savona is located in Liguria"
print(sentence)
[1] "Savona is a seaside town. Savona is located in Liguria"
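The difference between the two functions shows up on this sentence, where "Savona" occurs twice. The replacement string "Genova" is an assumption made for this sketch and does not come from the slides.

```r
sentence <- "Savona is a seaside town. Savona is located in Liguria"

# sub() replaces only the first occurrence
first_only <- sub("Savona", "Genova", sentence)

# gsub() replaces every occurrence
all_matches <- gsub("Savona", "Genova", sentence)

print(first_only)
print(all_matches)
```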
4.16
Regular expression with R
«A regular expression, regex or regexp (sometimes called a rational
expression) is a sequence of characters that define a search pattern.
Usually such patterns are used by string searching algorithms for "find" or
"find and replace" operations on strings, or for input validation. It is a
technique developed in theoretical computer science and formal language
theory.» (Wikipedia)
4.17
Regex matches in string vector – grep()
The grep() function takes the regular expression (regex) as first input
argument and a string vector as second input parameter.
If you specify the value=FALSE parameter, grep() returns a new vector
with the indices of the elements that satisfy the regular expression.
If you specify value=TRUE, grep() returns a vector with a copy of the
elements of the original one for which the regular expression is verified.
grep("ix", c("Asterix", "Obelix", "Panoramix",
"Beniamina", "Falbalà", "Ordinalfabetix"), perl=TRUE,
value=FALSE)
[1] 1 2 3 6
4.18
Regex matches in string vector – grepl()
grep("ix", c("Asterix", "Obelix", "Panoramix",
"Beniamina", "Falbalà", "Ordinalfabetix"), perl=TRUE,
value=TRUE)
[1] "Asterix" "Obelix" "Panoramix" "Ordinalfabetix"
The grepl() function has the same input arguments as grep(), except for
value, which grepl() does not accept.
grepl() returns a logical vector of the same length as the vector of input
strings: the elements valued at TRUE correspond to the indexes such that the
regular expression is verified. Elements with FALSE correspond to indices for
which it is not verified.
4.19
Regex matches in string vector – regexpr()
grepl("ix", c("Asterix", "Obelix", "Panoramix",
"Beniamina", "Falbalà", "Ordinalfabetix"), perl=TRUE)
[1] TRUE TRUE TRUE FALSE FALSE TRUE
The regexpr() function has the same input arguments as grepl(). It returns an
integer vector with, for each input string, the position of the first
character of the first match of the regular expression.
If there is no match, the corresponding element is -1.
The returned vector has a match.length attribute: a vector of integers with
the number of characters of the first match found in each string (-1 if
there is no match).
4.20
Regex matches in string vector – regexpr()
regexpr("ix", c("Asterix", "Obelix", "Panoramix",
"Beniamina","Falbalà","Ordinalfabetix"), perl=TRUE)
[1] 6 5 8 -1 -1 13
attr(,"match.length")
[1] 2 2 2 -1 -1 2
regexpr("al", c("Asterix", "Obelix", "Panoramix",
"Beniamina","Falbalà","Ordinalfabetix"), perl=TRUE)
[1] -1 -1 -1 -1 2 6
attr(,"match.length")
[1] -1 -1 -1 -1 2 2
4.21
Regex matches in string vector – gregexpr()
The gregexpr() function has the same task as regexpr() except that it
finds all matches and not just the first one.
> gregexpr("al", c("Asterix", "Obelix", "Panoramix",
"Beniamina", "Falbalà", "Ordinalfabetix"), perl=TRUE)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
...
[[5]]
[1] 2 5
attr(,"match.length")
[1] 2 2
...
4.22
Regex matches in string vector – regmatches()
You use the regmatches() function to get substrings that match the
regular expression.
As the first argument, we use the same input that is passed to regexpr()
or gregexpr().
Regarding the second argument, we pass the output vector returned by
regexpr() or gregexpr(). For instance:
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
> regmatches(x, m)
[1] "a" "a" "aa"
4.23
Regex matches in string vector – regmatches()
m <- gregexpr("a+", x, perl=TRUE)
> regmatches(x, m)
[[1]]
[1] "a"
[[2]]
character(0)
[[3]]
[1] "a" "a"
[[4]]
[1] "aa"
More information on Regular Expressions can be found on the website:
https://www.regular-expressions.info/
4.24
gsub() function supports the syntax for Regular Expression
Remembering the syntax of gsub():
gsub(old_substring, new_substring, string)
The first input argument can be a regular expression.
The following example eliminates the numeric digits in a string using a
typical syntax of Regular Expressions:
sentence <- "The postal code of Savona is 17100"
print(sentence)
[1] "The postal code of Savona is 17100"
gsub("[0-9]*", "", sentence)
[1] "The postal code of Savona is "
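A few variations on the same sentence, as a sketch: the pattern "[0-9]+" (one or more digits) gives the same result here without matching empty strings, a second pattern removes the trailing space left behind, and regmatches() extracts the digits instead of deleting them. Variable names are chosen for the example.

```r
sentence <- "The postal code of Savona is 17100"

# Remove runs of one or more digits
no_digits <- gsub("[0-9]+", "", sentence)

# Remove the white space left at the end of the string
trimmed <- sub("[[:space:]]+$", "", no_digits)

# Extract the first run of digits instead of removing it
postal_code <- regmatches(sentence, regexpr("[0-9]+", sentence))

print(trimmed)
print(postal_code)
```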
4.25
5.
Factors
Levels
Ordered factors
tapply
Factors
Factors are useful objects for categorizing data and providing for the
discrete classification of the components of a vector.
These objects are very useful in analyzing categorical data and for
statistical modeling.
Factors are handled by R as integers, but they are typically
represented by a textual label.
Although factors are presented to the user in the form of character
strings (and they sometimes even behave as such), in fact they are
characterized by a numerical nature and therefore require particular
attention in their management.
5.2
Factors
Factors are defined in R with the factor() function and they can only
contain predefined values known in statistical analysis with the term
levels.
The number of levels that characterizes a factor can be displayed
using the nlevels() function.
Conversely, levels() displays the values that the factors can assume.
Let us consider, as an example, the gender factor which includes two
levels: Male and Female.
5.3
Factors
sex<-c("Male", "Female", "Male","Male","Female")
> print(sex)
[1] "Male" "Female" "Male" "Male" "Female"
sexF <- factor(sex)
> print(sexF)
[1] Male Female Male Male Female
Levels: Female Male
5.4
Factors
> levels(sexF)
[1] "Female" "Male"
> nlevels(sexF)
[1] 2
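The integer nature of factors mentioned earlier can be made visible with as.integer(), which returns the position of each value in levels(). A minimal sketch on the same sexF (the other variable names are invented for the example):

```r
sex <- c("Male", "Female", "Male", "Male", "Female")
sexF <- factor(sex)

# Internal integer codes: 1 = "Female", 2 = "Male" (alphabetical levels)
codes <- as.integer(sexF)
print(codes)

# Back to the textual labels
labels_again <- as.character(sexF)
print(labels_again)

# Frequency of each level
counts <- table(sexF)
print(counts)
```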
5.5
Ordered Factors
5.6
Ordered Factors
> levels(Customer_satisfaction)
[1] "high" "low" "medium"
To specify the correct order of the levels, you have to explicitly define
the optional parameter of the factor function: levels.
5.7
Ordered Factors
Customer_satisfaction <- factor(c("medium", "low", "high",
"high", NA), levels=c("high", "medium", "low"))
> levels(Customer_satisfaction)
[1] "high" "medium" "low"
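Setting levels only fixes the order in which the levels are listed. If the levels should also be comparable (low < medium < high), factor() accepts ordered=TRUE. The following is a sketch under that assumption: the ascending level order and the variable names are chosen for this example and are not taken from the slides.

```r
satisfaction <- factor(c("medium", "low", "high", "high", NA),
                       levels = c("low", "medium", "high"),
                       ordered = TRUE)

print(is.ordered(satisfaction))
print(levels(satisfaction))

# Ordered factors support comparisons between levels; NA stays NA
below_high <- satisfaction < "high"
print(below_high)
```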
5.8
Ordered Factors
5.9
The tapply() function
Suppose you have four teams (RAV, GRY, HUF, SLY) and their
scores obtained in different tests.
RAV  GRY  HUF  SLY
 10   30    7   13
 20    8   12   20
 15              5
                 8
5.10
The tapply() function
players <- c("RAV","GRY","HUF","SLY","RAV","GRY", "HUF",
"SLY","RAV","SLY","SLY")
scores <- c(10,30,7,13,20,8,12,20,15,5,8)
player_fact <- factor(players)
To apply a function (such as the mean) to the vector of scores
grouped by factor, you generally use tapply()
scoresAVG <- tapply(scores,player_fact,mean)
> print(scoresAVG)
GRY HUF RAV SLY
19.0 9.5 15.0 11.5
5.11
The tapply() function
scoresMAX<- tapply(scores,player_fact,max)
> print(scoresMAX)
GRY HUF RAV SLY
30 12 20 20
scoresMIN<- tapply(scores,player_fact,min)
> print(scoresMIN)
GRY HUF RAV SLY
8 7 10 5
scoresSTDEV<- tapply(scores,player_fact,sd)
> print(scoresSTDEV)
GRY HUF RAV SLY
15.556349 3.535534 5.000000 6.557439
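Any function that takes a vector can be passed to tapply(). As a sketch, length counts how many scores each team has, which explains why the groups contribute differently to the statistics above (the variable names scoresN and scoresSUM are chosen for the example):

```r
players <- c("RAV", "GRY", "HUF", "SLY", "RAV", "GRY", "HUF",
             "SLY", "RAV", "SLY", "SLY")
scores <- c(10, 30, 7, 13, 20, 8, 12, 20, 15, 5, 8)
player_fact <- factor(players)

# Number of scores per team
scoresN <- tapply(scores, player_fact, length)
print(scoresN)

# Total score per team
scoresSUM <- tapply(scores, player_fact, sum)
print(scoresSUM)
```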
5.12
6.
Matrix & Array
Bidimensional vectors
Matrix operations
Multidimensional vectors
Matrices
6.2
matrix()
6.3
matrix()
data <- c(1,5,-1,8,4,3)
A1 <- matrix(data, nrow=3, ncol=2, byrow=TRUE)
A2 <- matrix(data, nrow=3, ncol=2, byrow=FALSE)
View(A1); View(A2)
6.4
dimname
6.5
dimname
> print(MatrixResults)
> View(MatrixResults)
6.6
Access the elements of a matrix
6.7
Access the elements of a matrix
6.8
Access a set of elements of a matrix
6.9
Access an entire row of a matrix
In order to select all the elements of a row, you can use the notation
A[i,]
> print(A)
     [,1] [,2] [,3] [,4]
[1,]   16    3    2   13
[2,]    5   10   11    8
[3,]    9    6    7   12
[4,]    4   15   14    1
Code for the range selection
> A[2,]
[1]  5 10 11  8
Which is equivalent to:
> A[2,1:ncol(A)]
6.10
Access an entire column of a matrix
In order to select all the elements of a column, you can use the
notation A[,j]
> print(A)
     [,1] [,2] [,3] [,4]
[1,]   16    3    2   13
[2,]    5   10   11    8
[3,]    9    6    7   12
[4,]    4   15   14    1
Code for the range selection
> A[,3]
[1]  2 11  7 14
Which is equivalent to:
> A[seq(1,nrow(A)),3]
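The row and column selections can be checked against each other, and the same notation extends to sub-matrices. A minimal sketch on the matrix A shown on these slides; drop = FALSE is a standard indexing option, shown here as an addition to keep the result a matrix.

```r
A <- matrix(c(16, 3, 2, 13, 5, 10, 11, 8, 9, 6, 7, 12, 4, 15, 14, 1),
            nrow = 4, ncol = 4, byrow = TRUE)

# An entire row and an entire column: both are returned as plain vectors
second_row <- A[2, ]
third_col <- A[, 3]

# Rows 2-3 and columns 1 and 4: the result is a 2 x 2 matrix
block <- A[2:3, c(1, 4)]

# drop = FALSE keeps a single row as a 1 x 4 matrix
row_as_matrix <- A[2, , drop = FALSE]

print(block)
print(dim(row_as_matrix))
```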
6.11
Properties of the Durer matrix – sum of the rows
     [,1] [,2] [,3] [,4]    ∑
[1,]   16    3    2   13   34
[2,]    5   10   11    8   34
[3,]    9    6    7   12   34
[4,]    4   15   14    1   34
# Sum of rows
RowsSum <-
c(sum(A[1,]),sum(A[2,]),sum(A[3,]),sum(A[4,]))
6.12
Properties of the Durer matrix – sum of the columns
     [,1] [,2] [,3] [,4]
[1,]   16    3    2   13
[2,]    5   10   11    8
[3,]    9    6    7   12
[4,]    4   15   14    1
  ∑    34   34   34   34
# Sum of columns
ColumnsSum <-
c(sum(A[,1]),sum(A[,2]),sum(A[,3]),sum(A[,4]))
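The same sums can be obtained without listing every row and column by hand. This sketch uses the base functions rowSums() and colSums(), which are not part of the slide code:

```r
A <- matrix(c(16, 3, 2, 13, 5, 10, 11, 8, 9, 6, 7, 12, 4, 15, 14, 1),
            nrow = 4, ncol = 4, byrow = TRUE)

# One sum per row and one sum per column
RowsSum <- rowSums(A)
ColumnsSum <- colSums(A)

print(RowsSum)
print(ColumnsSum)
```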
6.13
Properties of the Durer matrix – sum of diagonals
# Sum of the main diagonal
Trace <- A[1,1] + A[2,2] + A[3,3] +A[4,4]
# or in a more elegant way
Trace <- sum(diag(A))
# Sum of the main antidiagonal
AntidiagSum <- A[1,4] + A[2,3] + A[3,2] +A[4,1]
Furthermore, since the determinant is zero, the Durer matrix A is not
invertible.
det(A) # det() computes the determinant of a matrix (here zero, up to rounding error)
solve(A) # solve() computes the inverse matrix; here it is expected to stop with an error because A is singular
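Two further checks, given as a sketch. The antidiagonal can be read as the main diagonal of A with its columns reversed, and singularity can be shown exactly by exhibiting a non-zero vector that A maps to zero. The vector v was worked out by hand for this example and is not from the slides.

```r
A <- matrix(c(16, 3, 2, 13, 5, 10, 11, 8, 9, 6, 7, 12, 4, 15, 14, 1),
            nrow = 4, ncol = 4, byrow = TRUE)

# Antidiagonal = diagonal of A with the column order reversed
AntidiagSum <- sum(diag(A[, ncol(A):1]))
print(AntidiagSum)

# A non-zero vector in the null space of A proves that A is singular
v <- c(1, -3, 3, -1)
print(A %*% v)

# solve() is expected to fail on a singular matrix; the error is caught here
inverse_or_message <- tryCatch(solve(A),
                               error = function(e) conditionMessage(e))
print(inverse_or_message)
```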
6.14
Properties of the Durer matrix – eigenvalues
6.15
Properties of the Subirachs matrix
6.16
Properties of the Subirachs matrix
6.17
Transpose a matrix
6.18
Operations between matrices
6.19
Operations between matrices
Durer_numbers<-c(16,3,2,13,5,10,11,8,9,6,7,12,4,15,14,1)
A <- matrix(Durer_numbers,nrow=4,ncol=4,byrow=TRUE)
Subirachs_numbers<-
c(1,14,14,4,11,7,6,9,8,10,10,5,13,2,3,15)
B <- matrix(Subirachs_numbers,nrow=4,ncol=4,byrow=TRUE)
> print(A) > print(B)
[,1] [,2] [,3] [,4] [,1] [,2] [,3] [,4]
[1,] 16 3 2 13 [1,] 1 14 14 4
[2,] 5 10 11 8 [2,] 11 7 6 9
[3,] 9 6 7 12 [3,] 8 10 10 5
[4,] 4 15 14 1 [4,] 13 2 3 15
6.20
Operations between matrices
> print(A+B) > print(A-B)
[,1] [,2] [,3] [,4] [,1] [,2] [,3] [,4]
[1,] 17 17 16 17 [1,] 15 -11 -12 9
[2,] 16 17 17 17 [2,] -6 3 5 -1
[3,] 17 16 17 17 [3,] 1 -4 -3 7
[4,] 17 17 17 16 [4,] -9 13 11 -14
> print(A*B) > print(A%*%B)
[,1] [,2] [,3] [,4] [,1] [,2] [,3] [,4]
[1,] 16 42 28 52 [1,] 234 291 301 296
[2,] 55 70 66 72 [2,] 307 266 264 285
[3,] 72 60 70 60 [3,] 287 262 268 305
[4,] 52 30 42 15 [4,] 294 303 289 236
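The operator * multiplies element by element, while %*% is the row-by-column matrix product, which in general is not commutative. A short sketch on the same A and B (variable names invented for the example):

```r
A <- matrix(c(16, 3, 2, 13, 5, 10, 11, 8, 9, 6, 7, 12, 4, 15, 14, 1),
            nrow = 4, ncol = 4, byrow = TRUE)
B <- matrix(c(1, 14, 14, 4, 11, 7, 6, 9, 8, 10, 10, 5, 13, 2, 3, 15),
            nrow = 4, ncol = 4, byrow = TRUE)

# Element-wise product: entry [1,2] is 3 * 14
elementwise <- A * B
print(elementwise[1, 2])

# Matrix product: entry [1,1] is the first row of A times the first column of B
AB <- A %*% B
by_hand <- sum(A[1, ] * B[, 1])
print(c(AB[1, 1], by_hand))

# The matrix product depends on the order of the factors
BA <- B %*% A
print(identical(AB, BA))
```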
6.21
Merging matrices: cbind() and rbind()
6.22
Merging matrices: cbind() and rbind()
Durer_numbers<-c(16,3,2,13,5,10,11,8,9,6,7,12,4,15,14,1)
A <- matrix(Durer_numbers,nrow=4,ncol=4,byrow=TRUE)
Subirachs_numbers<-
c(1,14,14,4,11,7,6,9,8,10,10,5,13,2,3,15)
B <- matrix(Subirachs_numbers,nrow=4,ncol=4,byrow=TRUE)
C <- matrix(rep(c(1,2),each=4),nrow=2,ncol=4,byrow=TRUE)
D <- t(C)
> View(A); View(B)
> View(C); View(D)
6.23
Merging matrices: cbind() and rbind()
6.24
Merging matrices: cbind() and rbind()
> print(cbind(A,B))
6.25
Merging matrices: cbind() and rbind()
> print(rbind(A,B))
     [,1] [,2] [,3] [,4]
[1,]   16    3    2   13
[2,]    5   10   11    8
[3,]    9    6    7   12
[4,]    4   15   14    1
[5,]    1   14   14    4
[6,]   11    7    6    9
[7,]    8   10   10    5
[8,]   13    2    3   15
Rows 1-4 come from A (nrow=4 x ncol=4), rows 5-8 from B (nrow=4 x ncol=4).
6.26
Merging matrices: cbind() and rbind()
> print(cbind(A,D))
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   16    3    2   13    1    2
[2,]    5   10   11    8    1    2
[3,]    9    6    7   12    1    2
[4,]    4   15   14    1    1    2
Columns 1-4 come from A (nrow=4 x ncol=4), columns 5-6 from D (nrow=4 x ncol=2).
6.27
Merging matrices: cbind() and rbind()
> print(rbind(A,C))
     [,1] [,2] [,3] [,4]
[1,]   16    3    2   13
[2,]    5   10   11    8
[3,]    9    6    7   12
[4,]    4   15   14    1
[5,]    1    1    1    1
[6,]    2    2    2    2
Rows 1-4 come from A (nrow=4 x ncol=4), rows 5-6 from C (nrow=2 x ncol=4).
6.28
Merging matrices: cbind() and rbind()
> print(cbind(A,C))
Error in cbind(A, C) : number of rows of matrices must
match (see arg 2)
> print(rbind(A,D))
Error in rbind(A, D) :
number of columns of matrices must match (see arg 2)
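The rule behind these errors is that cbind() needs the same number of rows and rbind() the same number of columns. A sketch that checks the dimensions and catches the error with tryCatch() instead of stopping (the names wide, tall and bind_error are invented for the example):

```r
A <- matrix(c(16, 3, 2, 13, 5, 10, 11, 8, 9, 6, 7, 12, 4, 15, 14, 1),
            nrow = 4, ncol = 4, byrow = TRUE)
C <- matrix(rep(c(1, 2), each = 4), nrow = 2, ncol = 4, byrow = TRUE)
D <- t(C)

# Compatible shapes
wide <- cbind(A, D)   # 4 rows each: result is 4 x 6
tall <- rbind(A, C)   # 4 columns each: result is 6 x 4
print(dim(wide))
print(dim(tall))

# Incompatible shapes: the error message is captured as a string
bind_error <- tryCatch(cbind(A, C),
                       error = function(e) conditionMessage(e))
print(bind_error)
```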
6.29
Array: multidimensional vectors
Arrays are R objects in which data with more than two dimensions
can be stored.
For example, if you want to create an array of dimension (2,3,4) it
means that 4 matrices will be created having each two rows and
three columns.
An array is generated through the array() function and takes as input
arguments a vector in which the data will be arranged according to
the dimensional specifications contained in the second input: the dim
parameter.
Elements in an array must have the same mode.
6.30
Indexing a three-dimensional array (a tensor)
6.31
Array: implementation example
EvenSequences <- seq(from=2, to=48, by=2)
Z <- array(EvenSequences, c(2,3,4))
> print(Z)
, , 1
     [,1] [,2] [,3]
[1,]    2    6   10
[2,]    4    8   12
, , 2
     [,1] [,2] [,3]
[1,]   14   18   22
[2,]   16   20   24
, , 3
     [,1] [,2] [,3]
[1,]   26   30   34
[2,]   28   32   36
, , 4
     [,1] [,2] [,3]
[1,]   38   42   46
[2,]   40   44   48
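Elements of a three-dimensional array are addressed with three indices, Z[row, column, slice], and leaving an index empty selects everything along that dimension. A minimal sketch on the same array (variable names invented for the example):

```r
EvenSequences <- seq(from = 2, to = 48, by = 2)
Z <- array(EvenSequences, dim = c(2, 3, 4))

# A single element: row 1, column 2 of the third matrix
one_element <- Z[1, 2, 3]

# A whole 2 x 3 matrix: the second slice
second_slice <- Z[, , 2]

# The same position across all four slices
across_slices <- Z[2, 3, ]

print(one_element)
print(second_slice)
print(across_slices)
```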
6.32
7.
Lists & Data frames
List Components
Data Frame Object
Attach & Detach
Lists
7.2
Lists
print(MyList)
View(MyList)
$name
[1] "Emmett Brown"
$wife
[1] "Clara Clayton"
$no.children
[1] 2
$child.names
[1] "Giulio" "Verne"
7.3
Lists - Components
7.4
Lists - Components
> print(MyList[[1]])
[1] "Emmett Brown"
> print(MyList[[4]][1])
[1] "Giulio"
7.5
Lists - Components
7.6
Lists - Components
The operator [[…]] is used for the selection of the single element of
the list, while […] is used as a general sub-scripting operator.
[[…]] allows the access to an object stored in a list.
7.7
Lists – [[…]] and […]
[…] extracts a part of a list (sublist) which is a list itself. This is the
method for slicing a list.
MyList <- list(name="Emmett Brown",
wife="Clara Clayton",
no.children=2,
child.names=c("Giulio","Verne"))
element <- MyList[[1]]
sublist <- MyList[1]
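The difference between the two results can be seen from their classes. A short sketch on the same list; no.children is stored as a number here, consistent with the str() output shown on the later slide.

```r
MyList <- list(name = "Emmett Brown",
               wife = "Clara Clayton",
               no.children = 2,
               child.names = c("Giulio", "Verne"))

# [[...]] returns the stored object itself
element <- MyList[[1]]
print(class(element))

# [...] returns a list containing that object
sublist <- MyList[1]
print(class(sublist))

# Components can also be reached by name
second_child <- MyList$child.names[2]
print(second_child)
```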
7.8
Lists – Edit and add components in a list
7.9
Lists – Edit and add components in a list
# modification of the value and of the name of a list component
> MyList[[1]] <- "Emmett Doc Brown"
> names(MyList)[1] <- "Person"
> print(MyList[1])
$Person
[1] "Emmett Doc Brown"
7.10
Lists – Edit and add components in a list
# adding elements in a list component
> MyList$Person[2] <- "Marty McFly"
> str(MyList)
List of 5
$ Person : chr [1:2] "Emmett Doc Brown" "Marty McFly"
$ wife : chr "Clara Clayton"
$ no.children: num 2
$ child.names: chr [1:2] "Giulio" "Verne"
$ car : chr "DeLorean DMC-12"
7.11
Lists – Concatenation
7.12
Lists – Concatenation
FinAdminDesk <-
list(ID=c(12123,23234,67678,78789,34345,45456,56567),
desk=c(rep("FinEng",each=4),rep("MktRiskReport",3)))
FinAdminDegree <-
list(name=c("Paolo","Pier","Andrea","Nicola","Matteo",
"Fabio","Marcello"),
degree=c("Computer Science","Engineering","Mathematics",
rep("Economics",each=4)))
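Two lists such as these can be concatenated with c(), which returns a single list holding the components of both. A sketch of that step; the name FinAdmin for the result is an assumption made for this example.

```r
FinAdminDesk <- list(
  ID = c(12123, 23234, 67678, 78789, 34345, 45456, 56567),
  desk = c(rep("FinEng", each = 4), rep("MktRiskReport", 3)))
FinAdminDegree <- list(
  name = c("Paolo", "Pier", "Andrea", "Nicola", "Matteo",
           "Fabio", "Marcello"),
  degree = c("Computer Science", "Engineering", "Mathematics",
             rep("Economics", each = 4)))

# c() concatenates the components of the two lists
FinAdmin <- c(FinAdminDesk, FinAdminDegree)

print(names(FinAdmin))
print(lengths(FinAdmin))
```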
7.13
Lists – Concatenation
7.14
List of lists
> str(CourseInformation)
7.15
List of lists
> str(CourseInformation)
List of 3
$ name : chr "Software R"
$ :List of 3
..$ professor: chr "Giribone"
..$ mail : chr "[email protected]"
..$ mobile : chr "338/6343454"
$ degree: chr "EDS"
7.16
List of lists
> View(CourseInformation)
7.17
Data Frame
A data frame is used to store data in tabular form. This object shares
many features with lists and matrices, but has some restrictions:
- The components of the data frame must be vectors (numeric,
textual or logical), factors, numeric matrices, lists or data frames.
- Numeric, logical and factor vectors are included in the data frame
as is, while string vectors are converted into factors by default
in R versions before 4.0.0 (stringsAsFactors=TRUE).
- The length of the components must be the same.
In short, similarly to matrices, the data frame is a two-dimensional
data structure (i.e. a table).
7.18
Data Frame
# the data frame is a particular type of list
> typeof(dataframeExam)
[1] "list"
> class(dataframeExam)
[1] "data.frame"
# Being a list, its elements can be accessed and modified
# in the same way as seen before for lists.
> print(as.character(dataframeExam[[1]]))
[1] "Rossi" "Bianchi" "Brown"
7.21
Data Frame
> print(dataframeExam$mark[2])
[1] 14
> print(dataframeExam[[3]][2:3])
[1] FALSE TRUE
7.22
Data Frame
7.23
Edit a Data Frame
# List-like notation
dataframeExam$mark[3] <- 30
dataframeExam[[2]][3] <- 30
7.24
Add elements to a Data Frame
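Components can be added using the matrix functions rbind() (rows)
and cbind() (columns).
dataframeExam[,1] <- as.character(dataframeExam[,1])
# rbind() returns a new object: assign it to keep the added row.
# list() (unlike c()) keeps 23 numeric and TRUE logical.
dataframeExam <- rbind(dataframeExam,list("Silver",23,TRUE))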
> print(dataframeExam)
candidate mark passed
1 Rossi 26 TRUE
2 Bianchi 14 FALSE
3 Brown 30 TRUE
4 Silver 23 TRUE
7.25
Add elements to a Data Frame
> dataframeExam <-
cbind(dataframeExam,State=c("IT","IT","UK","SP"),
stringsAsFactors=FALSE)
> print(dataframeExam)
candidate mark passed State
1 Rossi 26 TRUE IT
2 Bianchi 14 FALSE IT
3 Brown 30 TRUE UK
4 Silver 23 TRUE SP
7.26
Deleting rows and columns
> print(dataframeExam)
candidate mark passed State
1 Rossi 26 TRUE IT
2 Bianchi 14 FALSE IT
4 Silver 23 TRUE SP
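The instruction that removed the third row is not available in this text. A minimal sketch of one common way to obtain the output above uses negative indices; the data frame is rebuilt from the values printed in the slides, and the name examCopy is hypothetical.

```r
# Data frame rebuilt from the values printed above
dataframeExam <- data.frame(candidate = c("Rossi", "Bianchi", "Brown", "Silver"),
                            mark = c(26, 14, 30, 23),
                            passed = c(TRUE, FALSE, TRUE, TRUE),
                            State = c("IT", "IT", "UK", "SP"),
                            stringsAsFactors = FALSE)

# Delete the third row: a negative index excludes that position
dataframeExam <- dataframeExam[-3, ]
print(dataframeExam)

# Delete a column (on a copy) by assigning NULL to it
examCopy <- dataframeExam
examCopy$State <- NULL
names(examCopy)   # "candidate" "mark" "passed"
```

Note that the row labels are not renumbered after the deletion, which is why the printed rows are 1, 2 and 4.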
7.29
attach() and detach() functions for lists and data frames
The $ notation, like dataframeExam$candidate or MyList$name,
used for the data frame and list components may not always be
convenient.
A useful support could be a function that allows the components of a
list or data frame to be temporarily visible as variables that can be
recalled from memory with the same name as the component.
In this way it would be possible to avoid writing the reference
database (list or data.frame) each time before the dollar symbol,
increasing the clarity of the R code.
7.30
attach() and detach() functions for lists and data frames
The attach() function takes as input a «database», that is a list or
data.frame object.
Suppose BooksDB is a dataframe consisting of three components:
BooksDB$author, BooksDB$title, BooksDB$year
7.31
attach() and detach() functions for lists and data frames
> attach(BooksDB)
This instruction places a copy of the components of the database on
the search path. After this command, therefore, the components can
be recalled directly by name:
> print(year)
[1] 1990 1985 2005 1884
The detach() function destroys the copy of the variables in memory.
> detach(BooksDB)
> print(year)
Error in print(year) : object 'year' not found
7.32
8.
Import Data from file
Reading data from csv-txt-dat file
Fixed width format file Reading
Writing data in a file
Reading data from file – csv
R uses the working directory for reading and writing into files.
The command for displaying the working directory is getwd() and its
path can be changed with setwd().
Tab: files
8.2
Reading data from file – csv
If the file from which you want to import the data is present in the
working directory, so you can see it in the Files tab from RStudio, it
will not be necessary to express the entire path within the functions
dedicated to importing data from file, but only the name of the file
with its extension.
Take the following csv (comma-separated values) file as an example
of a database: StarWars.csv
source: https://www.kaggle.com/jsphyg/star-wars
The file is in the directory: C:\Users\Utente\Documents
8.3
Reading data from file – csv
Have a look at the csv file
8.4
Reading data from file – csv
The file has been stored in the working directory and therefore
appears in the «Files» tab of RStudio.
8.5
Reading data from file – csv
8.6
Because the elements of the database are separated in this way, the
data are imported correctly.
Where a value is missing, R assigns NA.
8.7
Reading data from file – csv
> StarWars$name
> StarWars[1,1]
> na.omit(StarWars$name[StarWars$height>200])
8.8
Reading data from file – csv
8.9
Reading data from file – csv
file: the file name, including its extension, if the file is in the
current directory. Otherwise it is necessary to specify the entire
path. The parameter can also be the URL (Uniform Resource Locator) of
a remote file (http://...)
header: a logical value (TRUE or FALSE) which indicates whether the
names of the variables are in the first row.
sep: the delimiter used for separating the elements within the file. For
instance, sep="\t" refers to the tab character.
quote: the character used to enclose text fields.
dec: the character used as the decimal separator.
8.10
Reading data from file – csv
fill: logical value. If TRUE and the rows do not have the same
number of fields, blank fields are added to the shorter rows.
StarWars <- read.table("StarWars.csv",
header=TRUE,sep=";",quote="\"",dec=".",fill=TRUE)
StarWars <- read.csv("StarWars.csv",
header=TRUE,sep=";",quote="\"",dec=".",fill=TRUE)
StarWars <- read.delim2("StarWars.csv",
header=TRUE,sep=";",quote="\"",dec=".",fill=TRUE)
The output object, named StarWars, has a list mode,
mode(StarWars), and a data.frame class, class(StarWars).
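The parameters described above can be tried without the original file. The following sketch writes a small semicolon-separated file to a temporary location and reads it back; the three rows and the object name StarWarsMini are illustrative and are not taken from the real StarWars.csv.

```r
# Write a tiny semicolon-separated file (illustrative rows only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name;height;mass",
             "Luke Skywalker;172;77",
             "Darth Vader;202;136",
             "Yoda;66;"),            # the mass is left empty on purpose
           csv_path)

StarWarsMini <- read.table(csv_path, header = TRUE, sep = ";",
                           quote = "\"", dec = ".", fill = TRUE,
                           stringsAsFactors = FALSE)

str(StarWarsMini)
mode(StarWarsMini)    # "list"
class(StarWarsMini)   # "data.frame"

# The empty field is imported as NA
tall <- StarWarsMini$name[StarWarsMini$height > 200]
unlink(csv_path)      # remove the temporary file
```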
8.11
Reading data from file – txt
Now suppose you need to import the same dataset from a file with
the extension .txt located in the directory:
C:/Users/Utente/Documents/Database/StarWars.txt
8.12
Reading data from file – txt
8.13
The scan() function
The scan() function is more flexible and more customizable than
read.table().
With scan() we can specify the mode of the variables in advance.
For instance:
mydata <- scan("data.dat", what=list("",0,0))
The instruction reads three variables in the file with .dat extension:
the first one has a textual mode, while the others have been defined
as numeric variables.
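A self-contained sketch of the same call, assuming a whitespace-separated file with one text field and two numeric fields per row. The file content is hypothetical and is written to a temporary file.

```r
# Hypothetical content for data.dat, written to a temporary file
dat_path <- tempfile(fileext = ".dat")
writeLines(c("A 1.5 1.2",
             "B 1.69 4.3"), dat_path)

# what = list("", 0, 0): first field is text, the other two are numeric
mydata <- scan(dat_path, what = list("", 0, 0), quiet = TRUE)

str(mydata)   # a list with three components, one per variable
unlink(dat_path)
```

Unlike read.table(), scan() returns a plain list here rather than a data frame.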
8.14
The read.fwf() function
The read.fwf() function can be implemented for reading a fixed
width format (fwf).
mydata <- read.fwf("data.dat", widths=c(1,4,3))
> str(mydata)
'data.frame': 4 obs. of 3 variables:
$ V1: Factor w/ 2 levels "A","B": 1 1 2 2
$ V2: num 1.5 1.55 1.69 1.95
$ V3: num 1.2 1.3 4.3 4.4
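The content of data.dat is not shown in the slides. The sketch below assumes a file whose rows are consistent with the str() output above: each row is 8 characters long and is cut into fields of 1, 4 and 3 characters.

```r
# Assumed file content, reconstructed from the str() output above
fwf_path <- tempfile(fileext = ".dat")
writeLines(c("A1.501.2",
             "A1.551.3",
             "B1.694.3",
             "B1.954.4"), fwf_path)

# widths = c(1,4,3): "A" | "1.50" | "1.2"
mydata <- read.fwf(fwf_path, widths = c(1, 4, 3))
str(mydata)
unlink(fwf_path)
```

In R 4.0.0 and later, V1 is imported as a character vector instead of a factor, because the default value of stringsAsFactors changed.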
8.15
Write contents to a file
The write.table() function writes an R object in a file.
Typically it is used to save data frames, but it also works with the
other R objects like vectors, matrices, ...
8.16
Write contents to a file
8.17
Write contents to a file
8.18
Write contents to a file
# Example of importing a dataset from a URL
StarTrekDB <-
read.csv("https://raw.githubusercontent.com/pdxcat/nixmentors/master/lab-databases/startrek.csv",
header=TRUE,sep=",",stringsAsFactors = FALSE)
# Select the indices of the rows whose Rank contains the word Lieutenant
indx <- grep("Lieutenant",StarTrekDB$Rank,
perl=TRUE,value=FALSE)
# Store only the matching rows in the object
Lieutenants <- StarTrekDB[indx,]
8.19
Write contents to a file
> View(Lieutenants)
8.20
Write contents to a file
# Save the Lieutenants object in a csv file
write.table(Lieutenants,file="StarTrekLiutenants.csv",
append=FALSE, quote=FALSE, sep=",",eol="\n",
na="NA",row.names=FALSE,col.names=TRUE)
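To check what write.table() produces without downloading the dataset, the same arguments can be applied to a small data frame and the file read back. This is a sketch only: the two rows, the column names and the object names below are hypothetical.

```r
# Hypothetical stand-in for the Lieutenants data frame
LieutenantsDemo <- data.frame(Name = c("Officer One", "Officer Two"),
                              Rank = c("Lieutenant", "Lieutenant"),
                              stringsAsFactors = FALSE)

out_path <- tempfile(fileext = ".csv")
write.table(LieutenantsDemo, file = out_path,
            append = FALSE, quote = FALSE, sep = ",", eol = "\n",
            na = "NA", row.names = FALSE, col.names = TRUE)

written <- readLines(out_path)    # raw text of the file
print(written)

# Reading the file back gives the same table
ReadBack <- read.csv(out_path, stringsAsFactors = FALSE)
unlink(out_path)
```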
8.21
9.
R packages
Install, Update and Remove Pkgs
Library
Namespace
Packages
9.2
Packages
If you consult the help for each function, the package to which it
belongs is displayed.
> help(mean)
9.3
Packages
> help(read.fwf)
This package management policy, under which packages must be not
only installed but also loaded, has two main motivations:
- Loading all the functions of all installed packages in advance would
require a huge amount of memory. An accurate selection of
packages allows for a greater computational efficiency and greater
order in writing the code.
9.5
Packages
9.6
Packages
The ??topic command (shorthand for help.search()) performs a broader
search across the help pages of all installed packages.
>??read_excel
9.7
Packages
The package installation can be done from code or using the RStudio
graphical interface. If we use the IDE:
Tools -> Install packages
Install from: choose the repository or path where the package to be
installed is located. The recommended choice is to use reliable
packages which come from an official repository.
9.8
Packages
9.9
Packages
9.10
Packages
Installing package into
‘C:/Users/Utente/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL
'https://cran.rstudio.com/bin/windows/contrib/3.4/readxl_1.3.1.zip'
Content type 'application/zip' length 1517362 bytes (1.4 MB)
downloaded 1.4 MB
package ‘readxl’ successfully unpacked and MD5 sums checked
9.11
Packages
9.12
Packages
> search()
".GlobalEnv" "package:readxl" "tools:rstudio"
"package:stats" "package:graphics"
"package:grDevices" "package:utils"
"package:datasets" "package:methods" "Autoloads"
"package:base"
9.13
Packages
9.14
Packages
9.15
Packages
> mode(MasterYoda3D)
[1] "list"
> class(MasterYoda3D)
[1] "data.frame"
library(plot3D)
scatter3D(MasterYoda3D$Xcoord,
MasterYoda3D$Ycoord,
MasterYoda3D$Zcoord)
9.16
Namespace
9.17
Namespace
Note that if a package is not attached via library(), it does not
appear in the search() list, but its functions can still be used with
the :: operator (package::function).
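A minimal sketch of the :: operator, using the tools package that ships with R (chosen here only because it is normally installed but not attached by default).

```r
# Call a function from an installed package without attaching it
ext <- tools::file_ext("StarWars.csv")
print(ext)    # "csv"

# Check whether the package was attached to the search path
"package:tools" %in% search()
```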
9.18
Package Management
9.19
Package Management
Install, update and remove R packages from RStudio
9.20
Package Management
9.21
Package management
9.22
10.
Graphics
Low level function
High level function
Layout
Introduction
10.2
The Graphics environment
10.3
Base Graphics
10.4
Base Graphics – Scatter Plots
10.5
Base Graphics – All pairs
> pairs(y)
10.8
Base Graphics – Plot Labels and Text
plot(y[,1], y[,2],
pch=20, col="red",
main="Scatter")
text(y[,1]+0.01,
y[,2],rownames(y))
10.9
Base Graphics – Plotting characters (pch)
10.10
Base Graphics – Scatter Plot
plot(y[,1], y[,2], type="n")
text(y[,1], y[,2], rownames(y))
The plot() function is equipped with many input parameters that allow
an extensive customization of the graph. In this regard, consult the
guide:
> help(plot)
# graphical parameters
> help(par)
10.11
Base Graphics – Plot Parameters
# properties set with par() apply to all subsequent graphic objects
op <- par(mar=c(7,7,7,7), bg="lightyellow")
# properties passed to plot(), on the other hand, apply only to
# that plot
plot(y[,1], y[,2], type="p", col="red", cex.lab=1.2,
cex.axis=1.2, cex.main=1.2, cex.sub=1, lwd=4, pch=20,
xlab="x label", ylab="y label", main="Main Title",
sub="Sub Title")
grid(3, 3, lwd = 2)
10.12
10.13
Base Graphics – Plot Parameters
The graphical parameters set with par() apply to all subsequent
graphs. As a result, if you draw a second graph with plot() after
executing the previous lines of code, it will have the same margins
and the same background color. The mar parameter is a numerical
vector of 4 elements that defines the margin, in lines of text,
between the plot region and the edge of the figure, in accordance
with the syntax: c(bottom,left,top,right).
If not specified, the default values are: c(5.1,4.1,4.1,2.1).
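Because par() returns the previous values of the parameters it changes, the object op can be used to restore them without closing the device. The sketch below assumes an off-screen device, pdf(NULL), so that it runs without opening a window; the variable names are hypothetical.

```r
pdf(NULL)                        # off-screen device, nothing is written to disk
default_mar <- par("mar")        # margins before any change

op <- par(mar = c(7, 7, 7, 7), bg = "lightyellow")  # op stores the old values
plot(1:10, main = "Wide margins")
changed_mar <- par("mar")        # 7 7 7 7

par(op)                          # restore the previous settings
restored_mar <- par("mar")

invisible(dev.off())             # close the device
```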
10.14
Base Graphics – Plot Parameters
The bg parameter defines the background colour of the plot. The list
of all 657 colors made available by R can be viewed with the
colors() command.
To discard all the graphical properties set with par(), together with
the graph itself, close the current graphics device with dev.off().
Now let’s discuss the parameters in the plot() function.
type indicates which type of graph is to be drawn. In this case,
dealing with a type = "p" scatter plot means that the (x, y)
coordinates will be represented in the graph as points.
10.15
Base Graphics – Plot Parameters
From the guide you can see all the other possible graphs managed:
10.16
Base Graphics – Plot Parameters
cex is a numeric value that sets the size of text and symbols.
It indicates how many times the textual character must be enlarged
with respect to the default value which is equal to 1.
The following parameters use the same described logic applying it:
- To the numbers in the axes (cex.axis),
- To the axes labels (cex.lab),
- To the title of the plot (cex.main)
- To the subtitle of the plot (cex.sub)
10.17
Base Graphics – Plot Parameters
col controls the colors of the symbols. As for the cex parameter, there
are also: col.axis, col.lab, col.main, col.sub.
lwd is the thickness of the line (or of the point, as in this case). The
default value is 1. The interpretation of this parameter is
"device-specific": the same value can be rendered differently by
different graphics devices.
xlab and ylab are the textual labels to be applied to the x axis and the
y axis, respectively.
main and sub are the parameters that specify the main title of the
graph and the subtitle, respectively.
10.18
Base Graphics – Plot Parameters
To conclude, the grid() function overlays a grid on the graph,
dividing it into nx cells along the abscissa (nx = 3) and ny cells
along the ordinate (ny = 3).
# We close the graphics device, discarding the graphic objects
# and the properties defined with the par() function
> dev.off()
null device
1
10.19
Base Graphics – abline()
The abline() function adds a line in the current plot.
So if you want to add the regression line to the base chart you can
write the following code:
# I draw the generated points using the scatter plot
plot(y[,1], y[,2])
# I perform the linear regression using lm(), where y[,1]
# is the regressor and y[,2] is the dependent variable
myline <- lm(y[,2]~y[,1])
# add the regression line
abline(myline, lwd=2, lty=5)
10.20
Base Graphics – abline()
Looking at the guide, we now check the input arguments for the
abline() function:
>?abline
10.22
Base Graphics – abline()
10.23
Base Graphics – log-scale
10.24
Base Graphics – LaTeX compatibility
10.25
Base Graphics – LaTeX compatibility
LaTeX is a markup language used for the preparation of texts, based on
the WYSIWYM (What You See Is What You Mean) paradigm, as opposed to the
more widespread WYSIWYG (What You See Is What You Get).
It is particularly used in the academic community and in the scientific
field.
10.26
Base Graphics – Line Plot
The following code shows how to create a line plot from a single
dataset.
The function you can use is still plot with the parameter type="l"
#dataset
set.seed(29091984)
y <- matrix(runif(30), ncol=3,
dimnames=list(letters[1:10], LETTERS[1:3]))
#line plot – single dataset
plot(y[,1], type="l", lwd=2, col="blue")
10.27
Base Graphics – Line Plot
10.28
Base Graphics – Line Plot with more datasets
This code plots three lines in the same graphical device:
split.screen(c(1,1)) #dataset 1
plot(y[,1], type="l", lwd=2, ylim=c(0,1), col="blue")
screen(1, new=FALSE) #dataset 2
plot(y[,2], type="l", lwd=2, col="red", xaxt="n",
yaxt="n", ylab="", xlab="", main="", bty="n")
screen(1, new=FALSE) #dataset 3
plot(y[,3], type="l", lwd=2, col="green", xaxt="n",
yaxt="n", ylab="", xlab="", main="", bty="n")
10.29
Base Graphics – Line Plot with more datasets
10.30
Base Graphics – Line Plot with more datasets
The split.screen() function indicates how the graphic layout must
be divided when several graphical objects must be hosted in it at the
same time.
The value c(1,1) indicates that the screen (the part of RStudio that
hosts graphic objects: the Plots tab) will not be divided into sub-charts.
In the first call of the plot() function we specify the following
parameters: type (line plot), lwd (line width), col (colour) and ylim
which indicates the range of variation of the y-axis. Since the dataset
is composed of numbers drawn according to a uniform distribution
[0,1] it is reasonable to set the parameter to c(0,1).
10.31
Base Graphics – Line Plot with more datasets
screen(1,new=FALSE) indicates that the next graph will be hosted
in the same first screen, or, more in general, in the same graphical
area that already hosts the current plot. The next times that the plot
function will be invoked, in addition to the input parameters already
discussed (type, lwd, col), there will also be:
xaxt="n" and yaxt="n" indicating that the x axis and y axis are
set but not drawn.
xlab="" and ylab="": the x and y axis labels are not displayed
main="": the chart title is not displayed
bty="n": the box containing the graph is not displayed
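As an alternative to the split.screen() approach described above (a different technique, not the one used in the slides), further series can be added to an existing plot with the low-level function lines(). In this sketch all three series share the y axis set by ylim in the first call, whereas the overlaid plot() calls above each compute their own y range. An off-screen device is assumed so that the code runs anywhere.

```r
pdf(NULL)                         # off-screen device
set.seed(29091984)
y <- matrix(runif(30), ncol = 3,
            dimnames = list(letters[1:10], LETTERS[1:3]))

# The first series creates the plot and fixes the axes
plot(y[, 1], type = "l", lwd = 2, ylim = c(0, 1), col = "blue",
     xlab = "Index", ylab = "value")
# lines() adds to the current plot without redrawing axes or labels
lines(y[, 2], lwd = 2, col = "red")
lines(y[, 3], lwd = 2, col = "green")
legend("topright", legend = colnames(y),
       col = c("blue", "red", "green"), lwd = 2, bty = "n")

invisible(dev.off())
```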
10.32
Base Graphics – box parameter (bty)
The bty parameter controls the type of box drawn around the graph.
Allowed values, besides "n", are: "o", "l", "7", "c", "u", "]"
par(mfrow=c(2,3))
plot(y[,1],type="l",bty="o",xaxt="n",yaxt="n",main="o")
plot(y[,1],type="l",bty="l",xaxt="n", yaxt="n",main="l")
plot(y[,1],type="l",bty="7",xaxt="n", yaxt="n",main="7")
plot(y[,1],type="l",bty="c",xaxt="n", yaxt="n",main="c")
plot(y[,1],type="l",bty="u",xaxt="n", yaxt="n",main="u")
plot(y[,1],type="l",bty="]",xaxt="n", yaxt="n",main="]")
10.33
Base Graphics – box parameter (bty)
10.34
Base Graphics – splitting the graphic window (mfrow and mfcol)
The mfrow and mfcol parameters of the par() function define how the
graphic window is partitioned.
In our case, mfrow=c(2,3) means that the RStudio Plots tab will host
six graphical objects in a matrix with 2 rows and 3 columns.
The graphs fill these cells by rows in the case of mfrow and by
columns if the mfcol parameter is used.
The next graph was generated from the same code but using
mfcol=c(2,3) instead of mfrow=c(2,3).
10.35
Base Graphics – splitting the graphic window (mfrow and mfcol)
10.36
Base Graphics – bar plot
10.37
Base Graphics – bar plot
10.38
Base Graphics – Error bars
> help(arrows)
10.39
Base Graphics – Error bars
print(cbind(bar,round(m,1)))
[,1] [,2]
a 0.7 2.1
b 1.9 5.4
c 3.1 6.7
d 4.3 7.3
e 5.5 4.8
f 6.7 2.1
g 7.9 2.9
h 9.1 5.8
i 10.3 5.3
j 11.5 5.8
10.40
Base Graphics – Error bars
10.41
Base Graphics – Histogram (hist)
hist(y, freq=TRUE, breaks=10); help(hist)
10.42
Base Graphics – Density plot
plot(density(y), col="red")
10.43
Base Graphics – Pie chart
The pie() function draws a pie chart
pie(y[,3], col=rainbow(length(y[,3]), start=0.1,
end=0.8), clockwise=TRUE)
The clockwise parameter is a logical value: if TRUE the input vector
data is arranged on the pie chart clockwise, if FALSE anti-clockwise.
The rainbow() function creates a vector of n (length(y[,3]))
contiguous colors with tones that span from start = 0.1 to end =
0.8. The shades vary according to the following scale: red = 0,
yellow = 1/6, green = 2/6, cyan = 3/6, blue = 4/6 and magenta = 5/6.
10.44
Base Graphics – Pie chart
legend("topright",
legend=row.names(y), cex=1.3,
bty="n", pch=15, pt.cex=1.8,
col=rainbow(length(y[,1]),
start=0.1, end=0.8), ncol=1)
10.45
Base Graphics – Manage the layout of the graphic window
10.46
Base Graphics – Manage the layout of the graphic window
> mat <- matrix(1:4,2,2)
> print(mat)
[,1] [,2]
[1,] 1 3
[2,] 2 4
> layout(mat)
10.48
Base Graphics – Manage the layout of the graphic window
> layout.show(4)
10.49
Base Graphics – Manage the layout of the graphic window
mat <- matrix(1:6,3,2)
layout(mat)
layout.show(6)
10.50
Base Graphics – Manage the layout of the graphic window
mat <- matrix(1:6,2,3)
layout(mat)
layout.show(6)
10.51
Base Graphics – Manage the layout of the graphic window
> m <- matrix(c(1:3,3),2,2)
> print(m)
[,1] [,2]
[1,] 1 3
[2,] 2 3
> layout(m)
> layout.show(3)
10.52
Base Graphics – Manage the layout of the graphic window
In these examples, the byrow option of matrix() was not used and
therefore, by default, the sub-windows were ordered by columns.
To order them by rows, simply specify the parameter byrow=TRUE in
the matrix() function.
By default, layout() splits the device into cells of equal width and
height. These can be changed as needed with the widths and heights
options.
Dimensions are usually given in a relative way, but can also be
specified in centimeters (see ?layout).
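A short sketch that combines the options just described: byrow=TRUE for the ordering of the sub-windows and relative widths and heights. An off-screen device is assumed; the names m_byrow and n_fig are hypothetical.

```r
pdf(NULL)                                  # off-screen device

# Sub-windows numbered by rows: 1 2 on the first row, 3 4 on the second
m_byrow <- matrix(1:4, 2, 2, byrow = TRUE)
print(m_byrow)

# Second column three times wider, first row three times taller
n_fig <- layout(m_byrow, widths = c(1, 3), heights = c(3, 1))
layout.show(n_fig)                         # outline the sub-windows

invisible(dev.off())
```

layout() returns the number of figures it defined, so its value can be passed directly to layout.show().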
10.53
Base Graphics – Manage the layout of the graphic window
m <- matrix(1:4,2,2)
layout(m,
widths=c(1,3),
heights = c(3,1))
layout.show(4)
10.54
Base Graphics – Manage the layout of the graphic window
m <- matrix(c(1,1,2,1),2,2)
> print(m)
[,1] [,2]
[1,] 1 2
[2,] 1 1
layout(m,
widths=c(2,1),
heights = c(1,2))
layout.show(2)
10.55
Base Graphics – Manage the layout of the graphic window
Finally, the numbers in the matrix can include zero: the cells marked
with 0 are left empty, giving the possibility to create complex
partitions.
m <- matrix(0:3,2,2)
layout(m, c(1,3),c(1,3))
layout.show(3)
10.56
Base Graphics – Save a chart from code to a file
Besides exporting from the RStudio Plots tab, graphs can also be saved
from code.
For instance, in order to save a plot in pdf format:
pdf("test.pdf");
plot(density(y), col="red")
dev.off()
The procedure is very similar for other graphic formats (jpeg,png,ps):
jpeg("test.jpg"); plot(density(y), col="red"); dev.off()
bmp("test.bmp"); plot(density(y), col="red"); dev.off()
10.59
11.
Statistical analysis
Stats package
Formulae
Generic Functions
The stats package
11. 2
Recommended and Contributed statistical packages
11. 3
The structure of the module
11. 4
A simple example of an analysis of variance – ANOVA
The function to carry out the analysis of variance in stats is aov().
We use the R built-in dataset, called: InsectSprays.
https://www.rdocumentation.org/packages/datasets/versions/3.6.1/topics/InsectSprays
11. 5
A simple example of an analysis of variance – ANOVA
We can import the R built-in dataset using the data() function.
> data(InsectSprays)
> View(InsectSprays)
11. 6
Box plot – Descriptive statistics
11. 7
A simple example of an analysis of variance – ANOVA
ANOVA has been carried out on the square root of the response
through the aov() function:
aov.spray <- aov(sqrt(count) ~ spray, data=InsectSprays)
The main (and compulsory) input argument for the aov(), as in the
boxplot() function, is a formula which specifies the output
(response) on the left of the tilde symbol ~ and the predictor on the
right.
The option data=InsectSprays specifies that the variables (count
and spray) are components in the InsectSprays data frame.
11. 8
A simple example of an analysis of variance – ANOVA
Equivalently:
aov.spray <-
aov(sqrt(InsectSprays$count)~InsectSprays$spray)
Or, if you know the column numbers of the dataset, also:
aov.spray <-
aov(sqrt(InsectSprays[,1])~InsectSprays[,2])
11. 9
The summary() function
11.10
Analysis of the results
> print(aov.spray)
Call:
aov(formula = sqrt(InsectSprays[, 1]) ~
InsectSprays[, 2])
Terms:
InsectSprays[, 2] Residuals
Sum of Squares 88.43787 26.05798
Deg. of Freedom 5 66
Residual standard error: 0.6283453
Estimated effects may be unbalanced
11.11
Analysis of the results
> summary(aov.spray)
Df Sum Sq Mean Sq F value Pr(>F)
InsectSprays[, 2] 5 88.44 17.688 44.8 <2e-16 ***
Residuals 66 26.06 0.395
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A graphical representation of the results can be performed with
par(mfcol=c(2,2))
plot(aov.spray)
termplot(aov.spray, se=TRUE, partial.resid = TRUE, rug=TRUE)
11.12
Formulae
11.15
Formulae
a:b          interaction effect between a and b
a*b          additive and interaction effects: a*b is equal to a+b+a:b
poly(a,n)    polynomials of a up to degree n
^n           includes the interactions up to the level n: (a+b+c)^2 is
             equal to a+b+c+a:b+a:c+b:c
b %in% a     the effects of b are nested in a: a+b %in% a is equal to
             a+a:b or a/b
-b           removes the effect of b: for instance, (a+b+c)^2-a:b is
             equal to a+b+c+a:c+b:c
11.16
Formulae
-1           y~x-1 is a regression that passes through the origin
1            y~1 fits a model without effects (only the intercept)
offset(...)  adds an effect to the model without estimating a
             coefficient for it (for instance: offset(3*x))
We observe that the arithmetic operators of R used in a formula take
on a different meaning than the one they have in a traditional
mathematical expression.
For instance, the formula y~x1+x2 defines the model
$y = \beta_1 x_1 + \beta_2 x_2 + \alpha$ and not
$y = \beta (x_1 + x_2) + \alpha$, as the usual meaning of the +
operator would suggest.
11.17
Formulae – the I() function
To include arithmetic operations in a formula, use the I() function:
the formula y~I(x1+x2) defines the model $y = \beta (x_1 + x_2) + \alpha$.
Similarly, to define the model $y = \beta_1 x + \beta_2 x^2 + \alpha$,
we will use the formula y~poly(x,2) (or y~x+I(x^2)) and not y~x+x^2.
Furthermore, it is possible to include a function in a formula in order
to perform a variable transformation, as done in the previous
example.
aov() accepts a particular syntax for the definition of the random
effect: y~a+Error(b) means the additive effect of the fixed term a
with the random effect of b
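The different meaning of + inside a formula can be verified by counting the estimated coefficients. The following sketch uses simulated data; the variable names and the data-generating model are assumptions made only for illustration.

```r
set.seed(1)
x1 <- runif(20)
x2 <- runif(20)
y  <- 1 + 2 * (x1 + x2) + rnorm(20, sd = 0.1)   # simulated response

fit_sep <- lm(y ~ x1 + x2)      # two separate slopes plus the intercept
fit_sum <- lm(y ~ I(x1 + x2))   # a single slope for the sum plus the intercept

names(coef(fit_sep))   # "(Intercept)" "x1" "x2"
names(coef(fit_sum))   # "(Intercept)" "I(x1 + x2)"
```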
11.18
Generic function
The functions that are used to extract the results of the analyses
(typically print() and summary()) act according to the class of the
object passed as input data.
This kind of function is called a generic function.
11.19
Generic function
Functions such as aov() and lm() return a list with the results of
the statistical analysis: not only can these results be viewed, but
they can also be used in the environment.
11. 22
Generic function
The results of the statistical analysis can be extracted with the usual
syntax used for lists.
11. 23
Generic function
aov.spray$coefficients
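A sketch of how the components of the fitted object can be listed and reused, based on the InsectSprays analysis shown earlier. The names anova_table and f_value are hypothetical.

```r
aov.spray <- aov(sqrt(count) ~ spray, data = InsectSprays)

class(aov.spray)            # the class drives print() and summary()
names(aov.spray)            # components available through $

aov.spray$coefficients      # same values as coef(aov.spray)
aov.spray$df.residual       # residual degrees of freedom

# The summary is itself a list: its first element is the ANOVA table
anova_table <- summary(aov.spray)[[1]]
f_value <- anova_table[["F value"]][1]
round(f_value, 1)
```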
11. 24
14781 Contributed Packages
11. 25
12.
Programming
Function, Scope, Debug Mode
Conditional execution
Loops and Vectorization
Script versus function
Run executes the program line by line, Source runs the entire
program. The button in the center re-executes the instructions.
12.2
Script versus function
12.3
Script versus function
# Script that implements the BSM formula
S <- 100; X <- 110; r <-0.05; T <- 1; sigma <- 0.2
d1 <- (log(S/X)+(r+sigma^2/2)*T)/(sigma*sqrt(T))
d2 <- d1 - sigma * sqrt(T)
Call <- S*pnorm(d1) - X*exp(-r*T)*pnorm(d2)
Put <- X*exp(-r*T) * pnorm(-d2) - S*pnorm(-d1)
paste("Call Option price: ", round(Call,2),
"Put Option price: ", round(Put,2))
12.4
Script versus function
[1] "Call Option price: 6.04 Put Option price: 10.68"
12.5
Script versus function
12.6
Script versus function
return(object) output
12.7
Script versus function
The only object, besides the function stored in memory, is the price
list, i.e. the output.
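The definition of BSMprice is not available in this text. The sketch below is one possible version, assuming that it wraps the script shown earlier and returns a list; the component names Call and Put are assumptions.

```r
# Possible definition of the function (a sketch, not the original one)
BSMprice <- function(S, X, r, T, sigma) {
  d1 <- (log(S / X) + (r + sigma^2 / 2) * T) / (sigma * sqrt(T))
  d2 <- d1 - sigma * sqrt(T)
  Call <- S * pnorm(d1) - X * exp(-r * T) * pnorm(d2)
  Put  <- X * exp(-r * T) * pnorm(-d2) - S * pnorm(-d1)
  # d1, d2, Call and Put are local: only the returned list leaves the function
  return(list(Call = Call, Put = Put))
}

Prices <- BSMprice(S = 100, X = 110, r = 0.05, T = 1, sigma = 0.2)
round(unlist(Prices), 2)
```

The intermediate quantities d1 and d2 exist only while the function runs, which is why they do not appear in the workspace afterwards.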
12.11
Scope of variables
If you also want to have the inputs stored in memory, you can rewrite
the call to the function using <- instead of =
Prices<- BSMprice(S <- 100, X <- 110, r <- 0.05, T <- 1,
sigma <- 0.2)
> ls()
[1] "BSMprice" "Prices" "r" "S" "sigma" "T" "X"
Or equivalently:
S <- 100; X <- 110; r <- 0.05; T <- 1; sigma <- 0.2
Prices <- BSMprice(S,X,r,T,sigma)
12.12
Scope of variables
Or equivalently:
S = 100; X = 110; r = 0.05; T = 1; sigma = 0.2
Prices <- BSMprice(S,X,r,T,sigma)
Note how the operators = and <- have the same effect on the visibility
of the variables (scope) when assigning values to an object outside
a function call, while they take on different meanings when used
within the input arguments of a function.
Prices <- BSMprice(S=100, ...) means that when the function
is called, the variable S exists only inside the function and not
outside.
12.13
Scope of variables
Prices <- BSMprice(S <- 100, ...) means that when the
function is called, the variable S is created in the calling
environment, passed to the function, and continues to exist in memory
after the call.
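The difference can be checked with exists(). This is a minimal sketch with a hypothetical one-argument function, twice(), used in place of BSMprice.

```r
twice <- function(val) 2 * val            # hypothetical helper

if (exists("val", inherits = FALSE)) rm(val)   # start from a clean state

twice(val = 10)                            # = binds the argument only
after_equals <- exists("val", inherits = FALSE)

twice(val <- 10)                           # <- also assigns in the caller
after_arrow <- exists("val", inherits = FALSE)

after_equals   # FALSE
after_arrow    # TRUE
```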
Defining your own functions within a code allows you to manage
programming more effectively in terms of:
- Management of the scope of objects and therefore of memory
- Reuse the same function more easily in different parts of the code
(greater efficiency)
- Share calculation capabilities with other programmers more
effectively.
12.14
Debug mode
12.15
Debug mode
12.16
Debug mode
Reading the logs that appear in the console we can check that we
are in the debug mode.
> debugSource('C:/Software R Course/Bs.R')
Called from: eval(expr, p)
Browse[1]> n
debug at C:/Software R Course/Bs.R#1: S <- 100
Browse[2]>
12.17
Conditional execution: if statement
12.23
Loops and vectorization
X 4 -1 5 12 5 5 -4
Y 1 1 0 1 0 0 1
indx 1 2 3 4 5 6 7
12.28
Loops and vectorization
12.29
Loops and vectorization
b <- 5
X <- c(4,-1,5,12,5,5,-4)
Y <- numeric(length(X))
for (i in 1:length(X)) {
  if(X[i]==b){
    Y[i] <- 0
  } else {
    Y[i] <- 1
  }
}
When using a for loop to fill an array, its size and mode should be
defined in advance (here with numeric(length(X))).
An ordered indentation of the code allows a better understanding and
readability of the code. Note the alignment of the curly brackets }
To fully understand the logic, the reader is invited to activate the
debug mode and execute the statements step-by-step.
12.30
Loops and vectorization
12.31
Loops and vectorization
z <- x + y
In traditional programming languages where vectorization is not
supported, the use of for is essential.
The equivalent instruction in short form (i.e. without coding a grouped
expression) is:
z <- numeric(length(x))
for (i in 1:length(z)) z[i] <- x[i] + y[i]
Besides requiring more code, the execution of a loop, or more
generally of a control structure, is computationally much more
expensive than an instruction in vectorized form.
12.32
Loops and vectorization
12.33
tic and toc
# install.packages("tictoc")
library(tictoc)
tic()
b <- 5
X <- c(4,-1,5,12,5,5,-4)
Y <- numeric(length(X))
Y[X!=b] <- 1
toc()
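If the tictoc package is not installed, a similar measurement can be sketched with system.time() from base R (a different tool from the one used in the slide). The input vector is enlarged here, as an assumption, so that the timings are measurable; the actual times depend on the machine.

```r
b <- 5
set.seed(1)
X <- sample(1:10, 1e5, replace = TRUE)   # larger input than in the slide

# Loop version
loop_time <- system.time({
  Y_loop <- numeric(length(X))
  for (i in seq_along(X)) {
    if (X[i] == b) {
      Y_loop[i] <- 0
    } else {
      Y_loop[i] <- 1
    }
  }
})

# Vectorized version
vec_time <- system.time({
  Y_vec <- numeric(length(X))
  Y_vec[X != b] <- 1
})

identical(Y_loop, Y_vec)     # both versions give the same result
loop_time["elapsed"]
vec_time["elapsed"]
```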
12.34
While loop
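while(test_expression) {
  statement
}
[Flowchart: the test expression is evaluated; if TRUE, the body of the
while loop is executed and the test is evaluated again; if FALSE, the
loop exits.]
The previous example is repeated using a while loop.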
12.35
While loop
b=5;X=c(4,-1,5,12,5,5,-4)
Y=numeric(length(X));i=1
while (i <= length(X)) {
  if(X[i]==b){
    Y[i] <- 0
  } else {
    Y[i] <- 1
  }
  i=i+1
}
When using a while loop to fill an array, its size and mode should be
defined in advance (here with numeric(length(X))).
An ordered indentation of the code allows a better understanding and
readability of the code. Note the alignment of the curly brackets }
To fully understand the logic, the reader is invited to activate the
debug mode and execute the statements step-by-step.
12.36
To conclude… three types of programmers
Once a programmer of the first type has reached the goal for which
"the code does what it has to do" he feels satisfied.
12.37
Programming between art and technique
12.38
Code Wisdom – https://twitter.com/CodeWisdom
"Code never lies, comments sometimes do." – Ron Jeffries
"Make it correct, make it clear, make it concise, make it faster. In that order."
– Wes Dyer
"Debugging is like being the detective in a crime movie where you are also
the murderer" – Filipe Fortes
"The only way to learn a new programming language is by writing programs
in it" – Dennis Ritchie
"Everyday life is like programming, I guess. If you love something you can
put beauty into it." – Donald Knuth
"Tidy datasets are all alike, but every messy dataset is messy in its own
way." – Hadley Wickham
12.39
Bibliography and Sitography
12.40
Pier Giuseppe Giribone
Phd, CIIA®, CESGA®, CIWM®, PhD
[email protected]
Website: http://www.diptem.unige.it/piergiribone/