Tutorial 1 – Introduction to R
Dr Ivan Olier
What is R
25/06/2019
R and RStudio
• RStudio is an integrated development environment (IDE) for R (www.rstudio.com)
• It can be run locally (desktop version) or on a server (RStudio Server).
[Figure: the RStudio window, with panes labelled Source, Console, Environment, and Files/Plots/Packages/Help/Viewer]
Resources
• CRAN (http://cran.r-project.org/)
• Bioconductor (http://www.bioconductor.org/)
• Omegahat (http://www.omegahat.org/)
• Inside-R (http://www.inside-r.org)
• R-bloggers (http://www.r-bloggers.com)
First steps
• Finding help:
> help()
> help(solve)
> ?solve
> help("[[")
> help.start()
> ??solve
> example(solve)
• To run demos:
>demo()
>demo(lm.glm)
2019 - ECI - International Summer School/Machine Learning - Dr Ivan Olier
Numeric vectors
• Vector arithmetic
>y<-2*x+1
• Basic operators: +, -, *, /, ^
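The vector arithmetic above can be tried with a small sketch like the following. The values of x are hypothetical, since the slides do not show how x was defined.

```r
# Hypothetical input vector (not taken from the slides)
x <- c(1, 2, 3, 4)

# Arithmetic operators work element by element
y <- 2 * x + 1   # 3 5 7 9
z <- x^2 / 2     # 0.5 2.0 4.5 8.0

print(y)
print(z)
```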
Exercise
• Write an expression in R that computes the sample standard deviation (without using sd()) of this vector: x = (1.2, 2.3, 1.4, 2.1, 3.2)
Solution
> x<-c(1.2,2.3,1.4,2.1,3.2)
> x.bar<-mean(x)
> N<-length(x)
> s<-sqrt(sum((x-x.bar)^2)/(N-1))
>s
[1] 0.795613
Regular sequences
• 1:30 will generate the vector c(1,2,3,…,29,30)
• 30:1 may be used to generate the sequence backwards.
• The colon operator (:) has higher precedence than *, /, + and -, so 2*1:5 is evaluated as 2*(1:5):
> 2*1:5
[1] 2 4 6 8 10
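The precedence of the colon operator is a common source of off-by-one mistakes. The sketch below (with an arbitrary n) contrasts 1:n-1 with 1:(n-1), and shows seq() as a more explicit way to build sequences.

```r
n <- 5   # arbitrary value for illustration

a <- 1:n - 1      # parsed as (1:n) - 1, gives 0 1 2 3 4
b <- 1:(n - 1)    # gives 1 2 3 4

# seq() gives explicit control over the step or the number of points
s1 <- seq(from = 0, to = 1, by = 0.25)        # 0.00 0.25 0.50 0.75 1.00
s2 <- seq(from = 10, to = 1, length.out = 4)  # 10 7 4 1

print(a); print(b); print(s1); print(s2)
```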
Logical vectors
• Logical values: TRUE (or T), FALSE (or F), and NA (not available)
> a<-TRUE
> b<-F
> c<-NA
• Logical operators: <, <=, >, >=, == (equality), != (inequality), & (and, intersection), | (or, union), and ! (negation).
• Examples, with x<-c(1,2,3,4,5,6):
> x>3
[1] FALSE FALSE FALSE TRUE TRUE TRUE
> !(x<=3)
[1] FALSE FALSE FALSE TRUE TRUE TRUE
• is.nan(x) detects NaN (not a number) values, is.infinite(x) detects infinite elements, and is.finite(x) detects finite elements.
• Try:
> is.nan(0/0)
> is.na(0/0)
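A short sketch of how these tests differ on a vector that mixes an ordinary number, NA, NaN and Inf (the vector v is made up for illustration):

```r
v <- c(1, NA, 0/0, 1/0)   # 0/0 is NaN, 1/0 is Inf

is.na(v)        # FALSE  TRUE  TRUE FALSE  (NA and NaN both count as missing)
is.nan(v)       # FALSE FALSE  TRUE FALSE  (only the NaN)
is.infinite(v)  # FALSE FALSE FALSE  TRUE
is.finite(v)    #  TRUE FALSE FALSE FALSE
```

So is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN.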
Character vectors
• Character strings are entered using matching double (") or single (') quotes:
> "Hello world"
> 'Hello world'
• Escape sequences: \n (newline), \t (tab), and \b (backspace); see ?Quotes for a full list.
• Character vectors:
> c("hello","world")
> rep("hello",5)
Subsetting
> x<-c(10,20,30,40,50)
• By index vector:
>x[3]
[1] 30
> x[2:4]
[1] 20 30 40
• Logical conditions:
> x[x>30]
[1] 40 50
• Excluding values:
> x[-c(1,3)]
[1] 20 40 50
Exercise
• Detect the missing values in the following vector and replace them with the mean of the observed values.
x = [1, 2, 3, 4, ?, 6, 7, ?]
Answer:
Solution
> x<-c(1,2,3,4,NA,6,7,NA)
> mean(x)
[1] NA
> x.nomiss<-x[!is.na(x)]
> x.nomiss
[1] 1 2 3 4 6 7
> x[is.na(x)]<-mean(x.nomiss)
>x
[1] 1.000 2.000 3.000 4.000 3.833 6.000 7.000 3.833
Alternatively:
> mean(x,na.rm = T)
[1] 3.833333
> x[is.na(x)]<-mean(x,na.rm = T)
Objects
• Object – any entity which R operates on.
• Class of an object:
• numeric
• logical
• character
• list
• matrix
• array
• factor
• data.frame
• …
Objects
• Creating objects: class_name(length)
> numeric(3)
[1] 0 0 0
> numeric()
numeric(0)
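The same class_name(length) pattern works for other basic classes. A minimal sketch (the object names are arbitrary):

```r
n <- numeric(3)          # 0 0 0
l <- logical(2)          # FALSE FALSE
s <- character(2)        # "" ""
v <- vector("list", 2)   # a list with two NULL components

# class() reports the class of each object
class(n)   # "numeric"
class(l)   # "logical"
class(s)   # "character"
class(v)   # "list"
```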
Factors
• A factor is a vector object used to group the components of other vectors of the same length.
• Example:
Factors
• Example (cont.):
> x<-c(2,1.5,3.8,1.3,4.2,7.1,5.5,2.9)
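The rest of this example is not shown in the slides. As a sketch of how a factor groups the components of x, assume some hypothetical group labels g, one per element of x:

```r
x <- c(2, 1.5, 3.8, 1.3, 4.2, 7.1, 5.5, 2.9)

# Hypothetical group labels (an assumption, not from the slides)
g <- factor(c("a", "b", "a", "c", "b", "c", "a", "b"))

levels(g)                    # the distinct groups: "a" "b" "c"
table(g)                     # number of elements in each group
grp.means <- tapply(x, g, mean)  # mean of x within each group
grp.means
```

tapply() applies a function to the elements of x that share the same factor level, which is the typical use of a factor.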
Arrays
• An array is a collection of data entries indexed by multiple subscripts.
• Creating an array: array(data=vector, dim=dimension_vector)
• Example:
> array(data=1:24,dim=c(3,4,2))
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24
Matrices
• Matrix – 2D array.
• It can be created with the array() function or with: matrix(data=vector, nrow, ncol)
• Example:
> matrix(data=1:12,nrow=3,ncol=4)
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
• To retrieve an element:
> x[2,3]
[1] 10
• And a row:
> x[2,]
[1] 2 6 10 14 18
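The definition of x is not shown here. The sketch below assumes x <- matrix(1:20, nrow=4, ncol=5), which is consistent with the output above but is only an assumption.

```r
# Assumed definition: reproduces the output shown, but is not given in the slides
x <- matrix(1:20, nrow = 4, ncol = 5)

x[2, 3]               # single element: 10
x[2, ]                # whole row: 2 6 10 14 18
x[, 3]                # whole column: 9 10 11 12
x[2, , drop = FALSE]  # keep the result as a 1 x 5 matrix
dim(x)                # 4 5
```

By default a single row or column is returned as a plain vector; drop = FALSE keeps the matrix structure.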
• Mixed vector and array arithmetic (recycling rule): any shorter vector operand is extended by recycling its values until it matches the size of the other operands.
> X<-matrix(data=1:12,nrow=3,ncol=4)
> z<-c(2,3)
> z*X
[,1] [,2] [,3] [,4]
[1,] 2 12 14 30
[2,] 6 10 24 22
[3,] 6 18 18 36
Matrix operations
Operation                       Function   Example
Matrix multiplication           %*%        X %*% Y
Matrix inversion                solve()    solve(X)
Linear equation                 solve()    solve(X,b)
Transpose of a matrix           t()        t(X)
Eigenvalues and eigenvectors    eigen()    eigen(X)
Binding matrices, column-wise   cbind()    cbind(X,Y,Z)
Binding matrices, row-wise      rbind()    rbind(X,Y,Z)
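A minimal sketch of these operations on two small, hypothetical 2 x 2 matrices:

```r
# Hypothetical example matrices
X <- matrix(c(2, 1, 1, 3), nrow = 2)   # symmetric, invertible (determinant 5)
Y <- diag(2)                           # 2 x 2 identity matrix

P  <- X %*% Y     # matrix product; equals X because Y is the identity
Xt <- t(X)        # transpose
Xi <- solve(X)    # inverse
E  <- eigen(X)    # list with components $values and $vectors

C <- cbind(X, Y)  # column-wise binding: 2 x 4
R <- rbind(X, Y)  # row-wise binding: 4 x 2

X %*% Xi          # identity matrix, up to rounding error
```

Note that * is element-wise multiplication, whereas %*% is the matrix product.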
Exercise
• Create a 3x3 random matrix A and a random vector b with 3 elements.
• Solve the linear system Ax = b:
1. By using solve(A,b)
2. By computing the inverse of A (using solve) and then x = A^(-1) b
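One possible solution sketch, assuming the intended system is Ax = b (which is what solve(A,b) solves). The seed and the use of rnorm() are arbitrary choices.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
A <- matrix(rnorm(9), nrow = 3)    # random 3 x 3 matrix
b <- rnorm(3)                      # random vector with 3 elements

x1 <- solve(A, b)                  # 1. solve the system directly
x2 <- solve(A) %*% b               # 2. invert A, then multiply by b

all.equal(x1, as.vector(x2))       # both routes should agree
all.equal(as.vector(A %*% x1), b)  # and the solution satisfies A x = b
```

In practice solve(A, b) is usually preferred, since it avoids forming the inverse explicitly.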
Lists
• List - an ordered collection of objects.
> Lst<-list(name="Fred", wife="Mary", no.children=3,
+ child.ages=c(4,7,9))
> Lst
$name
[1] "Fred"
$wife
[1] "Mary"
$no.children
[1] 3
$child.ages
[1] 4 7 9
Lists
• Accessing the components:
> Lst[[1]]
[1] "Fred"
> Lst$wife
[1] "Mary"
> Lst$wife<-"Liz"
> Lst[["no.children"]]
[1] 3
> Lst[["no.children"]]<-4
Lists
• Getting the component names:
> names(Lst)
[1] "name" "wife" "no.children" "child.ages" "pet" "occupation"
• Deleting a component:
> Lst$occupation<-NULL
• Concatenating lists:
List.ABC<-c(list.A, list.B, list.C)
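The list operations above can be put together in one runnable sketch. The values given to pet and occupation, and the second list extra, are hypothetical.

```r
Lst <- list(name = "Fred", wife = "Mary", no.children = 3,
            child.ages = c(4, 7, 9))

Lst$pet <- "dog"                # hypothetical new component, added by name
Lst[["occupation"]] <- "chef"   # hypothetical; same effect using [[ ]]
names(Lst)

Lst["name"]     # single brackets: a list of length 1
Lst[["name"]]   # double brackets: the component itself, "Fred"

Lst$occupation <- NULL          # delete a component
extra <- list(city = "Bogota")  # hypothetical second list
both <- c(Lst, extra)           # concatenate the two lists
length(both)                    # 6
```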
Data frames
• A data frame (data.frame) is a list with the restriction that its components must be vectors (numeric, character, or logical) of the same length.
• It is the most suitable R object for datasets.
• In R:
> dat1<-data.frame(var1=1:4,var2=-4:-1,var3=-1:2)
Data frames
• It is still a list, so:
> dat1$var4<-5:8
> dat1
var1 var2 var3 var4
1 1 -4 -1 5
2 2 -3 0 6
3 3 -2 1 7
4 4 -1 2 8
> names(dat1)
[1] "var1" "var2" "var3" "var4"
Data frames
• To access a particular row:
> dat1[2,]
var1 var2 var3 var4
2 2 -3 0 6
• Subsetting:
> dat1[dat1$var2>=-2,c("var1","var3")]
var1 var3
3 3 1
4 4 2
• To see the first or last rows of a data frame (or the first or last parts of any other object): head() or tail(), respectively.
• Or, if you are in RStudio: View()
> View(mtcars)
Exercise
• Using the ‘mtcars’ dataset, compute the average mpg (miles per gallon) for each number of cylinders.
Solution
• Let’s have a look at the mtcars dataset:
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
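The remainder of the solution is not shown. One possible way to finish it in base R is sketched below; tapply() and aggregate() are just two of several options.

```r
# Split mpg into the groups defined by cyl and average each group
avg1 <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Same computation with the formula interface, returned as a data frame
avg2 <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

avg1
avg2
```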
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")
• Example:
> write.table(dat1, file="file1.txt", sep="\t", row.names=F)
• write.csv()
• read.csv()
read.csv(file, header = TRUE, sep = ",", quote = "\"",
dec = ".", fill = TRUE, comment.char = "")
• read.delim()
read.delim(file, header = TRUE, sep = "\t", quote = "\"",
dec = ".", fill = TRUE, comment.char = "")
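A sketch of a write/read round trip using these functions. It writes to a temporary file (an assumption made here so that no existing file is overwritten) and reads the data back with read.delim().

```r
dat1 <- data.frame(var1 = 1:4, var2 = -4:-1, var3 = -1:2)

f <- tempfile(fileext = ".txt")   # temporary file path
write.table(dat1, file = f, sep = "\t", row.names = FALSE)

dat2 <- read.delim(f)             # tab-separated file with a header row
all.equal(dat1, dat2)             # TRUE if the round trip preserved the data

unlink(f)                         # remove the temporary file
```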
R Packages
• All R functions and datasets are stored in packages.
• The standard (or base) packages are considered part of the R source code.
The tidyverse
• The tidyverse is a collection of R
packages designed for data science.
• An R package is a collection of functions,
data, and documentation that extends
the capabilities of base R.
• tidyverse includes the two packages
that we will learn today:
• ggplot2 package for data visualisation
• dplyr package for data transformation
• You can install the complete tidyverse with a
single line of code:
install.packages("tidyverse")
The tidyverse
• You will not be able to use the functions, objects, and help files in a package until you
load it with library().
• Once you have installed a package, you can load it with the library() function:
library(tidyverse)
ggplot2
• R has several systems for making graphs, but ggplot2 is one of the most elegant
and most versatile.
• ggplot2 implements the grammar of graphics, a coherent system for describing and
building graphs.
Do cars with big engines use more fuel than cars with small engines?
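The plot for this question is not reproduced here. A minimal sketch of how it could be drawn, assuming the mpg dataset that ships with ggplot2 (displ is engine size in litres, hwy is highway miles per gallon); the choice of dataset and variables is an assumption.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Scatterplot of fuel efficiency against engine size
  p <- ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy))

  print(p)
}
```

A downward trend in such a plot would suggest that cars with bigger engines travel fewer miles per gallon.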
Geometric objects
• ggplot2 provides over 30 geoms, and extension packages provide even more (see https://www.ggplot2-exts.org)
• The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at http://rstudio.com/cheatsheets.
• To learn more about any single geom, use help:
?geom_smooth
dplyr
• Another core member of the tidyverse
install.packages("nycflights13")
library(nycflights13)
library(tidyverse)
nycflights13::flights
• Data types:
• int : integers.
• dbl : doubles, or real numbers.
• chr : character vectors, or strings.
• dttm : date-times (a date + a
time).
• lgl : logical (TRUE or FALSE).
• fctr : factors, which R uses to
represent categorical variables with
fixed possible values.
• date : dates.
dplyr basics
• dplyr has five key functions that allow you to solve the vast majority of your data
manipulation challenges:
• Pick observations by their values (filter()).
• Reorder the rows (arrange()).
• Pick variables by their names (select()).
• Create new variables with functions of existing variables (mutate()).
• Collapse many values down to a single summary (summarise()).
• These can all be used in conjunction with group_by() which changes the scope of each
function from operating on the entire dataset to operating on it group-by-group.
• These six functions provide the verbs for a language of data manipulation.
dplyr examples
To filter flights on 1st of January:
dset <- flights %>%
filter(month == 1, day == 1)
… or, to keep the 1st of both January and December (note that this is a different selection, not an equivalent one):
dset <- flights %>%
filter(month %in% c(1,12), day == 1)
dplyr examples
• … and selecting year, month, and day columns only:
dset <- flights %>%
filter(month %in% c(1,12), day == 1) %>%
select(year, month, day)
dplyr examples
• Adding a new column:
dset <- flights %>%
mutate(gain = arr_delay - dep_delay)
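The remaining verbs, arrange() and summarise() together with group_by(), can be sketched in the same style. The summary chosen here (mean departure delay per month) and the column name mean_dep_delay are only illustrations.

```r
if (requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("nycflights13", quietly = TRUE)) {
  library(dplyr)
  library(nycflights13)

  # Average departure delay per month, sorted from largest to smallest
  by_month <- flights %>%
    group_by(month) %>%
    summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
    arrange(desc(mean_dep_delay))

  print(by_month)
}
```

na.rm = TRUE is needed because cancelled flights have a missing dep_delay.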
Functions
[Diagram: Arguments → Function → Value (R object)]
• Example:
> pow<-function(x,ex){
+ x^ex
+}
> pow(4,2)
[1] 16
• A function can also be saved in a script file, for example script_pow.R:
pow<-function(x, ex){
x^ex
}
• Then:
> source("script_pow.R")
Functions
• Returned value – the value of the last expression evaluated in the function.
• return(value) – to explicitly indicate the value to return.
• Example:
pow<-function(x, ex){
res<-x^ex
return(res)
}
Functions
• Default values – they can be defined along with the arguments:
• Example:
pow<-function(x, ex=2){
res<-x^ex
return(res)
}
> pow(4)
[1] 16
• Assignments within functions – are local and temporary and are lost after exit from the function.
> res<-3
> pow(2,4)
[1] 16
> res
[1] 3
Conditional execution
• if statements: if (expr_1) expr_2 else expr_3
• Example:
if(x>0) {
p<- 1
} else if(x<0) {
p<- -1
} else {
p<- 0
}
loops
• for loops: for (name in expr_1) expr_2
• name – the loop variable
• expr_1 – vector expression (often a sequence)
• expr_2 – expression repeatedly evaluated.
• Example:
> y<-numeric()
> for(ix in 1:length(x)){
+ if(x[ix]>0) y[ix]<- 1
+ else if(x[ix]<0) y[ix]<- -1
+ else y[ix]<- 0
+}
>y
[1] -1 1 1 0 -1
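The input x is not shown for this loop. The sketch below assumes a hypothetical x chosen to reproduce the output above, and compares the loop with two vectorised alternatives.

```r
# Hypothetical input, chosen to reproduce the output shown above
x <- c(-2, 3, 1, 0, -5)

y <- numeric(length(x))        # pre-allocate the result
for (ix in seq_along(x)) {     # seq_along() is safe even when x is empty
  if (x[ix] > 0) {
    y[ix] <- 1
  } else if (x[ix] < 0) {
    y[ix] <- -1
  } else {
    y[ix] <- 0
  }
}
y                              # -1 1 1 0 -1

# Vectorised alternatives that avoid the explicit loop
y2 <- ifelse(x > 0, 1, ifelse(x < 0, -1, 0))
y3 <- sign(x)
```

In R, vectorised code such as sign(x) is usually shorter and faster than an explicit loop.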
$y
[1] 15
$z
[1] 200
Exercise
• Write a function (you may need more than one) that imputes missing values in a dataset. It
should support two different imputation methods: mean and median. “mean” should be
the default method.
[Diagram: miss.imp takes a Dataset and Method={"mean" | "median"} as inputs and returns the imputed Dataset]
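One possible solution sketch, not the official one. It assumes the dataset is a data frame and that only numeric columns should be imputed; the helper name impute.vec and the test data d are hypothetical.

```r
# Impute one numeric vector; 'method' selects the summary statistic
impute.vec <- function(v, method = c("mean", "median")) {
  method <- match.arg(method)   # the first choice, "mean", is the default
  fill <- if (method == "mean") mean(v, na.rm = TRUE) else median(v, na.rm = TRUE)
  v[is.na(v)] <- fill
  v
}

# Apply the helper to every numeric column of a data frame
miss.imp <- function(dataset, method = c("mean", "median")) {
  method <- match.arg(method)
  num <- vapply(dataset, is.numeric, logical(1))   # which columns are numeric
  dataset[num] <- lapply(dataset[num], impute.vec, method = method)
  dataset
}

# Hypothetical test data: one numeric and one character column
d <- data.frame(a = c(1, 2, NA, 9), b = c("u", "v", "w", NA))
miss.imp(d)             # a: NA replaced by 4 (mean of 1, 2, 9)
miss.imp(d, "median")   # a: NA replaced by 2 (median of 1, 2, 9)
```

match.arg() both sets the default and rejects any method other than "mean" or "median". Non-numeric columns are left unchanged in this sketch.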
Statistical models in R
• Linear regression
R formulae
• R uses a formula syntax to specify the form of many statistical models (and other procedures):
response ~ predictor_variables
• The model Y = β0 + β1X + ε is formulated in R as:
Y~X
• And the model Y = β0 + β1X + β2Z + ε as:
Y~X+Z
Example
• “women” data
• Weight is modeled as a function
of Height:
Weight = β0 + β1Height + ε
> dat<-women
> mod<-lm(weight~height, dat)
> dat$pred.weight<-predict(mod)
> plot(dat$height,dat$weight,ylab="Weight",xlab="Height")
> lines(dat$height,dat$pred.weight, col=2)
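Beyond plotting the fit, the fitted model object can be inspected directly. A short sketch, where the new heights passed to predict() are hypothetical values:

```r
mod <- lm(weight ~ height, data = women)

coef(mod)                # estimates of beta0 (intercept) and beta1 (slope)
summary(mod)$r.squared   # proportion of variance explained by the model

# Predicted weights for new heights (hypothetical values, in inches)
new <- data.frame(height = c(60, 70))
pred <- predict(mod, newdata = new)
pred
```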
R formulae
R formulae
Likewise, this model (in which the three-way interaction has been omitted):
is represented as:
Y~.
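The formulae on these slides are not fully reproduced here. As a hedged sketch of the standard formula operators, terms() can be used to show which terms a formula expands to. The variable names A, B, C are placeholders, and whether (A+B+C)^2 is the formula the slide originally showed is an assumption.

```r
# Helper (hypothetical name) that lists the terms a formula expands to
labs <- function(f, ...) attr(terms(f, ...), "term.labels")

labs(Y ~ A + B)          # "A" "B"           main effects only
labs(Y ~ A:B)            # "A:B"             interaction only
labs(Y ~ A * B)          # "A" "B" "A:B"     main effects plus interaction
labs(Y ~ (A + B + C)^2)  # "A" "B" "C" "A:B" "A:C" "B:C"  (no three-way term)

# "." stands for all other columns of the data, so it needs a data argument
labs(mpg ~ ., data = mtcars[1:3])   # "cyl" "disp"
```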
• Tips:
• Functions: nrow(), ncol(), is.factor(), str(), table(), cor(), hist(), plot()
Call:
lm(formula = charges ~ ., data = dset)
Coefficients:
(Intercept) age sexmale bmi children smokeryes
-11938.5 256.9 -131.3 339.2 475.5 23848.5
• Estimate the RMSE between the true and the estimated insurance charges.
R code
#Step 1 - Importing "insurance" dataset
dset<-read.csv("insurance.csv", stringsAsFactors=T)
#Step 2
# Dataset size
nrow(dset)
ncol(dset)
# Dataset format
str(dset)
# Number of male/female, smoker/non-smoker.
table(dset$sex)
table(dset$smoker)
# Data distribution among the regions
table(dset$region)
# Distribution of the response variable
summary(dset$charges)
hist(dset$charges) # To confirm the distribution is right-skewed
# Use "cor" function to compute correlation matrix
cor(dset[c("age", "bmi", "children", "charges")])
# or, "plot" for a distributional plot
plot(dset[c("age", "bmi", "children", "charges")])
R code
#Step 3
#Fitting a linear model:
ins_model <- lm(charges ~ age + children + bmi + sex + smoker + region, data = dset)
# or using "."
ins_model <- lm(charges ~ ., data = dset)
ins_model #notice the dummy variables.
#Step 4
summary(ins_model)
#Histogram of the residuals
est.charges<-predict(ins_model)
res<-dset$charges-est.charges
hist(res,breaks=30)
#RMSE function
RMSE<-function(y.est, y.tru){
sqrt(sum((y.est-y.tru)^2)/length(y.est))
}
RMSE(y.est=est.charges,y.tru=dset$charges)
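Since insurance.csv may not be at hand, the RMSE function can be checked on data that ships with R. The sketch below uses the built-in women dataset as a stand-in (an assumption made only for illustration) and a small hand-computed case.

```r
RMSE <- function(y.est, y.tru) {
  sqrt(mean((y.est - y.tru)^2))   # same as sqrt(sum(...) / length(...))
}

# Stand-in data: the built-in 'women' dataset instead of insurance.csv
mod <- lm(weight ~ height, data = women)
rmse <- RMSE(y.est = predict(mod), y.tru = women$weight)
rmse

# Hand-computed check: the errors are 0, 0 and 2, so RMSE = sqrt(4/3)
RMSE(c(1, 2, 5), c(1, 2, 3))
```

The RMSE is expressed in the units of the response, so for the insurance model it would be in the same units as charges.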