25/06/2019

Tutorial 1 – Introduction to R
Dr Ivan Olier
[email protected]

ECI – International Summer School /


Machine Learning
2019

What is R

• R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has:
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular
matrices,
• a large, coherent, integrated collection of intermediate tools for
data analysis,
• graphical facilities for data analysis and display either directly at
the computer or on hardcopy, and
• a well developed, simple and effective programming language
(called ‘S’) which includes conditionals, loops, user defined
recursive functions and input and output facilities.

2019 - ECI - International Summer School/Machine Learning - Dr Ivan Olier 2

Why R
• R is a leading tool for statistics, data analysis, and machine learning.
• It is a programming language (objects, functions, and packages).
• Over 2,000 cutting-edge, user-contributed packages are available on CRAN (http://www.cran.r-project.org).
• It is platform independent (Linux, Mac, Windows).
• R integrates with other languages (C++, Java, Python) and can interact with many data sources and other statistical packages (SPSS, SAS, Stata, etc.).
• A robust, vibrant community.
• R is free.

R and RStudio
• RStudio is an integrated development environment for R (www.rstudio.com).
• It can be run locally (desktop version) or on a server (RStudio Server).

[Figure: the RStudio window, with four panes: Source; Console; Environment; and Files, plots, packages, help, viewer.]

Resources
• CRAN (http://cran.r-project.org/)
• Bioconductor (http://www.bioconductor.org/)
• Omegahat (http://www.omegahat.org/)
• Inside-R (http://www.inside-r.org)
• R-bloggers (http://www.r-bloggers.com)

First steps
• Finding help:
>help()
>help(solve)
>?solve
>help("[[")
>help.start()
>??solve
>example(solve)

• To quit the R program:


>q()

• To run demos:
>demo()
>demo(lm.glm)

R functions and variables


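• Names are case sensitive (A ≠ a).
• Names may contain all alphanumeric symbols plus “.” and “_”.
• They must start with a letter, or
• with “.” if the next character is not a digit.
• Examples:
>var1    # valid
>.var2   # valid
>.var_3  # valid
>_var3   # invalid: starts with "_"
>1var    # invalid: starts with a digit
>.1var   # invalid: "." followed by a digit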
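
The naming rules can be checked programmatically. The sketch below assumes that comparing a string with the output of base R's make.names() is an adequate validity test (it also rejects reserved words such as "if"); the helper vector names are illustrative only.

```r
# Candidate names taken from the examples on this slide.
candidates <- c("var1", ".var2", ".var_3", "_var3", "1var", ".1var")

# make.names() rewrites a string into a syntactically valid name, so a
# string that comes back unchanged was already valid.
is_valid <- make.names(candidates) == candidates
names(is_valid) <- candidates
print(is_valid)
```

The first three names come back TRUE and the last three FALSE.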

R functions and variables


• ; or a newline: separate commands.
>help(solve); example(solve)
>help(solve)
>example(solve)

• { and }: group commands


> {help(solve)
+ example(solve)
+}

• +: this continuation prompt appears when the command is not yet complete


• #: Everything to the end of the line is a comment
>example(solve) #this is an example of ‘solve’

Numbers and vectors


• Assigning operators
> x<-1
Or
> x=1
Or
> assign("x",1)

• Vectors: simplest data structure in R


> x<-c(2,3,1,4)

Numeric vectors
• Vector arithmetic
>y<-2*x+1

• Basic operators: +, -, *, /, ^

• Basic math functions: log(x), exp(x), sin(x), cos(x), tan(x), sqrt(x), …

• Largest and smallest elements of a vector: max(x) and min(x).

• Number of elements: length(x).

• Their total and their product: sum(x) and prod(x)

• Their mean, variance and standard deviation: mean(x), var(x), sd(x).
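
As a quick illustration of the functions listed above, here is a minimal sketch using the small vector c(2,3,1,4) defined earlier; the variable names other than x and y are illustrative.

```r
x <- c(2, 3, 1, 4)
y <- 2 * x + 1          # element-wise arithmetic: 5 7 3 9

range_x <- c(min(x), max(x))   # smallest and largest elements: 1 4
n       <- length(x)           # number of elements: 4
total   <- sum(x)              # 10
product <- prod(x)             # 24
avg     <- mean(x)             # 2.5

# var() and sd() use the sample (N - 1) denominator.
s2 <- var(x)                   # 5/3
s  <- sd(x)                    # sqrt(5/3)
```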


Exercise
• Write an expression in R that estimates the standard deviation of this vector without using the built-in sd() function:

X=[1.2, 2.3, 1.4, 2.1, 3.2]’

Solution
> x<-c(1.2,2.3,1.4,2.1,3.2)
> x.bar<-mean(x)
> N<-length(x)
> s<-sqrt(sum((x-x.bar)^2)/(N-1))
>s
[1] 0.795613

Regular sequences
• 1:30 will generate the vector c(1,2,3,…,29,30)
• 30:1 may be used to generate the sequence backwards.
• : (colon) has high priority:
> 2*1:5
[1] 2 4 6 8 10

• Sequences: seq(from=value, to=value, by=value)


> seq(from=2, to=10, by=2)
> seq(2,10,2)

• To replicate an object: rep(object, times=value, each=value)


> x<-c(1,2,3)
> rep(x,times=4)
> rep(x,each=4)
> rep(x,times=4,each=4)
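
The calls on this slide are shown without output, so the following sketch spells out what they produce and adds one contrast for the priority of the colon operator. The names s1, s2, r1, r2 and r3 are illustrative.

```r
s1 <- seq(from = 2, to = 10, by = 2)   # 2 4 6 8 10
s2 <- seq(2, 10, 2)                    # same call with positional arguments

x <- c(1, 2, 3)
r1 <- rep(x, times = 4)            # 1 2 3 1 2 3 1 2 3 1 2 3
r2 <- rep(x, each = 4)             # 1 1 1 1 2 2 2 2 3 3 3 3
r3 <- rep(x, times = 4, each = 4)  # the "each" pattern repeated 4 times

length(r3)   # 48

# The colon operator binds tighter than arithmetic:
2 * 1:5      # 2 4 6 8 10
(2 * 1):5    # 2 3 4 5
```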

Logical vectors
• Logical values: TRUE or T, FALSE, or F, and NA (not available)
> a<-TRUE
> b<-F
> c<-NA

• Logical operators: <, <=, >, >=, == for equality and != for inequality, & (and) for intersection, | (or) for union, and
! for negation.
• Examples: if x<-c(1,2,3,4,5,6)
> x>3
[1] FALSE FALSE FALSE TRUE TRUE TRUE

> !(x<=3)
[1] FALSE FALSE FALSE TRUE TRUE TRUE

> x<3 | x>5


[1] TRUE TRUE FALSE FALSE FALSE TRUE

Missing values (and similar)


• NA represents a “not available” or “missing value”.
• NaN is a “Not a Number” value; it results from a numerical computation whose result is undefined.
• Inf is an infinite value.
• Examples:
> c(1,3,NA,5)
> 0/0 #produces a NaN
> Inf-Inf

• Detecting missing values in a vector:


> x<-c(1,2,3,NA,5)
> is.na(x)

• Use is.nan(x) to detect NaNs, is.infinite(x) to detect infinite elements, and is.finite(x) to detect finite elements.
• Try:
> is.nan(0/0)
> is.na(0/0)
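
The difference between the detection functions is easiest to see on one vector that mixes all the special values. This is a small sketch; the vector is made up for illustration.

```r
x <- c(1, NA, 0/0, 1/0, -1/0)   # value, NA, NaN, Inf, -Inf

is.na(x)        # FALSE  TRUE  TRUE FALSE FALSE  (NaN counts as missing)
is.nan(x)       # FALSE FALSE  TRUE FALSE FALSE  (NA is not NaN)
is.infinite(x)  # FALSE FALSE FALSE  TRUE  TRUE
is.finite(x)    #  TRUE FALSE FALSE FALSE FALSE

Inf - Inf       # NaN
```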

Character vectors
• Character strings are entered using either matching double (") or single (') quotes:
> "Hello world"
> 'Hello world'

• Escape sequences: \n (newline), \t (tab), and \b (backspace); see ?Quotes for a full list.
• Character vectors:
> c("hello","world")
> rep("hello",5)

• paste() function concatenates strings:
> paste("hello","world")
> paste("hello",1:5)
> paste("hello",1:5,"world",sep="_")
> paste("hello",1:5,"world",sep="_",collapse=",")

Subsetting
> x<-c(10,20,30,40,50)

• By index vector:
>x[3]
[1] 30
> x[2:4]
[1] 20 30 40

• Logical conditions:
> x[x>30]
[1] 40 50

• Excluding values:
> x[-c(1,3)]
[1] 20 40 50

Exercise
• Detect missing values in the following vector and replace by mean value.

X=[1, 2, 3, 4, ?, 6, 7, ?]

Answer:

X=[1, 2, 3, 4, 3.833, 6, 7, 3.833]

Solution
> x<-c(1,2,3,4,NA,6,7,NA)
> mean(x)
[1] NA
> x.nomiss<-x[!is.na(x)]
> x.nomiss
[1] 1 2 3 4 6 7
> x[is.na(x)]<-mean(x.nomiss)
> x
[1] 1.000000 2.000000 3.000000 4.000000 3.833333 6.000000 7.000000 3.833333

Alternatively, starting again from the original x:
> mean(x,na.rm = T)
[1] 3.833333
> x[is.na(x)]<-mean(x,na.rm = T)

Objects
• Object – any entity which R operates on.
• Class of an object:
• numeric
• logical
• character
• list
• matrix
• array
• factor
• data.frame
• …

Objects
• Creating objects: class_name(length)
> numeric(3)
[1] 0 0 0
> numeric()
numeric(0)

• Testing an object: is.class_name(object)


> x<-c(1,2,3)
> is.numeric(x)
[1] TRUE

• Coercing an object to be of a different class: as.class_name(object)


> as.character(x)
[1] "1" "2" "3"

Factors
• factor – is a vector object used to group components of other vectors of the
same length.
• Example:

x 2.0 1.5 3.8 1.3 4.2 7.1 5.5 2.9


c g1 g1 g2 g1 g2 g2 g1 g2
• In R:
> c<-factor(c("g1","g1","g2","g1","g2","g2","g1","g2"))
>c
[1] g1 g1 g2 g1 g2 g2 g1 g2
Levels: g1 g2

• To know the labels:
> levels(c)
[1] "g1" "g2"

Factors
• Example (cont.):
> x<-c(2,1.5,3.8,1.3,4.2,7.1,5.5,2.9)

• tapply – apply a function to each group of values defined by a factor (or a list of factors):
tapply(vector, factors, function)

> tapply(x, c, mean)
   g1    g2
2.575 4.500

Arrays
• Array – it is a multiply subscripted collection of data entries.
• Creating an array: array(data=vector, dim=dimension_vector)
• Example:
> array(data=1:24,dim=c(3,4,2))
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24

Matrices
• Matrix – 2D array.
• It can be created with the “array” function or with: matrix(data=vector, nrow, ncol)
• Example:
> matrix(data=1:12,nrow=3,ncol=4)
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12

• is.matrix and is.array test whether an object is a matrix or an array.
• as.matrix and as.array coerce an object to a matrix or an array, respectively.

Indexing arrays and matrices


> x <- array(1:20, dim=c(4,5))
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20

• To retrieve an element:
> x[2,3]
[1] 10

• And a row:
> x[2,]
[1] 2 6 10 14 18

Indexing arrays and matrices (cont.)


• Using an array of indexes to retrieve values:
> y<-array(c(2:4,2:3),dim=c(3,2))
>y
[,1] [,2]
[1,] 2 2
[2,] 3 3
[3,] 4 2
> x[y]
[1] 6 11 8

• Or, to assign a value:


> x[y]<-NA
>x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 NA 10 14 18
[3,] 3 7 NA 15 19
[4,] 4 NA 12 16 20

Operations with arrays


• Arithmetic operations – are done element-wise.
> X<-array(1:12,c(3,4)); Y<-array(12:1,c(3,4))
> 2*X+Y
[,1] [,2] [,3] [,4]
[1,] 14 17 20 23
[2,] 15 18 21 24
[3,] 16 19 22 25

• Mixed vector and array arithmetic (recycling rule): Any short vector operands are extended by recycling their values
until they match the size of any other operands.

> z<-c(2,3)
> z*X
[,1] [,2] [,3] [,4]
[1,] 2 12 14 30
[2,] 6 10 24 22
[3,] 6 18 18 36
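
The z*X result above follows from the fact that recycling runs along the column-major storage of the array. The sketch below makes that explicit; the names recycled, manual and w are introduced here for illustration.

```r
X <- array(1:12, c(3, 4))
z <- c(2, 3)

# Recycling works on the column-major storage of X: z is repeated as
# 2 3 2 3 ... along the elements 1, 2, ..., 12.
recycled <- rep(z, length.out = length(X))
manual   <- array(as.vector(X) * recycled, dim(X))

all(z * X == manual)   # TRUE

# A vector whose length equals nrow(X) scales each row instead.
w <- c(1, 10, 100)
w * X
```

Because the length of w matches the number of rows, the first row is multiplied by 1, the second by 10 and the third by 100.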

Matrix operations
Operation                      Function   Example
Matrix multiplication          %*%        X %*% Y
Matrix inversion               solve()    solve(X)
Linear equation                solve()    solve(X,b)
Transpose of a matrix          t()        t(X)
Eigenvalues and eigenvectors   eigen()    eigen(X)
Binding matrices, column-wise  cbind()    cbind(X,Y,Z)
Binding matrices, row-wise     rbind()    rbind(X,Y,Z)

Exercise
• Create a 3x3 random matrix A and a random vector b with 3 elements.
• Solve the linear system Ax = b:

1. By using solve(A,b)
2. By computing the inverse of A (using solve) and then x = A^(-1) b

• Compute the eigenvalues and eigenvectors of A.
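
One possible way to work through this exercise is sketched below. It assumes the system is read as Ax = b (which is what solve(A, b) solves), draws the random entries with rnorm(), and fixes an arbitrary seed; none of these choices comes from the slides.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
A <- matrix(rnorm(9), nrow = 3)   # 3x3 random matrix
b <- rnorm(3)                     # random vector with 3 elements

x1 <- solve(A, b)                 # 1. solve A x = b directly
x2 <- solve(A) %*% b              # 2. invert A, then multiply by b

all.equal(x1, as.vector(x2))      # both routes agree up to rounding
all.equal(as.vector(A %*% x1), b) # x1 satisfies the system

e <- eigen(A)
e$values                          # eigenvalues (may be complex for a random A)
e$vectors                         # eigenvectors, one per column
```

Calling solve(A, b) directly is generally preferable to forming the inverse, since it avoids an extra matrix product and is numerically more stable.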

Lists
• List - an ordered collection of objects.
> Lst<-list(name="Fred", wife="Mary", no.children=3,
+ child.ages=c(4,7,9))

> Lst
$name
[1] "Fred"

$wife
[1] "Mary"

$no.children
[1] 3

$child.ages
[1] 4 7 9

Lists
• Accessing the components:
> Lst[[1]]
[1] "Fred"
> Lst$wife
[1] "Mary"

> Lst$wife<-"Liz"

> Lst[["no.children"]]
[1] 3
> Lst[["no.children"]]<-4

• Adding components to the list:


> Lst$pet<-"dog"
> Lst[["occupation"]]<-"Teacher"

Lists
• Getting the component names:
> names(Lst)
[1] "name" "wife" "no.children" "child.ages" "pet" "occupation"

• Deleting a component:
> Lst$occupation<-NULL

• Concatenating lists:
List.ABC<-c(list.A, list.B, list.C)

Data frames
• A data frame (data.frame) is a list with the restriction that components must be vectors (numeric, character, or logical) of the same length.
• It is the most suitable R object for datasets.

var1 var2 var3
1 -4 -1
2 -3 0
3 -2 1
4 -1 2

• In R:
> dat1<-data.frame(var1=1:4,var2=-4:-1,var3=-1:2)

Data frames
• It is still a list, so:

> dat1[["var1"]] #or dat1[[1]] or dat1$var1


[1] 1 2 3 4

> dat1$var4<-5:8
> dat1
var1 var2 var3 var4
1 1 -4 -1 5
2 2 -3 0 6
3 3 -2 1 7
4 4 -1 2 8

> names(dat1)
[1] "var1" "var2" "var3" "var4"

> dat1$var4<-NULL #drops "var4"

Data frames
• To access a particular row:
> dat1[2,]
var1 var2 var3
2 2 -3 0

• Subsetting:
> dat1[dat1$var2>=-2,c("var1","var3")]
var1 var3
3 3 1
4 4 2

• Or, by using the function subset:


> subset(dat1, var2>=-2, select=c(var1,var3))
var1 var3
3 3 1
4 4 2

Accessing built-in datasets


• Around 100 datasets are supplied with R.
> data()

• All the datasets supplied with R are available directly by name.


> mtcars

• To see the first or last parts of a data frame (or any other object): head() or tail(), respectively.
• Or, if you are in RStudio: View()
> View(mtcars)

• Number of rows or columns: nrow() or ncol().


• Row names: row.names()
> row.names(mtcars)
> row.names(mtcars)<-c("dfhdf","asdasd",…)

Exercise
• From ‘mtcars’ dataset, estimate the average mpg (miles per gallon) by number of
cylinders.

Solution
• Let’s have a look at the mtcars dataset:
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

• Create a factor of the ‘cyl’ variable


> cyl.f<-factor(mtcars$cyl)
> cyl.f
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
Levels: 4 6 8

• Use tapply to estimate the averages


> tapply(mtcars$mpg, cyl.f, mean)
4 6 8
26.66364 19.74286 15.10000
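
The same group averages can also be obtained without building the factor by hand. The sketch below shows two base R alternatives; the name avg.mpg is illustrative, and this is offered as an equivalent route rather than as the intended solution.

```r
# Formula interface: mean of mpg for each distinct value of cyl.
avg.mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
avg.mpg

# tapply() also accepts the grouping variable without an explicit factor().
tapply(mtcars$mpg, mtcars$cyl, mean)
```

aggregate() returns a data frame with one row per group, which is convenient when the result feeds into further data frame operations.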

Exporting data to files


• write.table()

write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")

• Example:
> write.table(dat1, file="file1.txt", sep="\t", row.names=F)

• write.csv()

> write.csv(dat1, file="file2.csv")

Reading data from files


• read.table()
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

• read.csv()
read.csv(file, header = TRUE, sep = ",", quote = "\"",
dec = ".", fill = TRUE, comment.char = "")

• read.delim()
read.delim(file, header = TRUE, sep = "\t", quote = "\"",
dec = ".", fill = TRUE, comment.char = "")
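
To see the export and import functions working together, here is a minimal round-trip sketch. It writes to temporary files (via tempfile()) rather than to the working directory, and the small data frame and path names are made up for illustration.

```r
dat1 <- data.frame(var1 = 1:4, var2 = -4:-1)

# tempfile() gives a throwaway path, so nothing is written to the
# working directory.
csv.path <- tempfile(fileext = ".csv")
txt.path <- tempfile(fileext = ".txt")

# row.names = FALSE avoids an extra unnamed column on re-import.
write.csv(dat1, file = csv.path, row.names = FALSE)
write.table(dat1, file = txt.path, sep = "\t", row.names = FALSE)

from.csv <- read.csv(csv.path)
from.txt <- read.delim(txt.path)     # header = TRUE, sep = "\t" by default

all.equal(dat1, from.csv)   # TRUE
all.equal(dat1, from.txt)   # TRUE
```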

R Packages
• All R functions and datasets are stored in packages.

• The standard (or base) packages are considered part of the R source code.

• To see installed packages: library()

• To load a particular package: library(package_name)

• To install a new package: install.packages(new_packages)

• To update packages: update.packages()

• To see loaded packages: search()

• Contributed packages (CRAN, Bioconductor, Omegahat)


The tidyverse
• The tidyverse is a collection of R
packages designed for data science.
• An R package is a collection of functions,
data, and documentation that extends
the capabilities of base R.
• tidyverse includes the two packages
that we will learn today:
• ggplot2 package for data visualisation
• dplyr package for data transformation
• You can install the complete tidyverse with a
single line of code:

install.packages("tidyverse")

The tidyverse
• You will not be able to use the functions, objects, and help files in a package until you
load it with library().
• Once you have installed a package, you can load it with the library() function:

library(tidyverse)

The startup message tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages.

ggplot2
• R has several systems for making graphs, but ggplot2 is one of the most elegant
and most versatile.
• ggplot2 implements the grammar of graphics, a coherent system for describing and
building graphs.

Do cars with big engines use more fuel than cars with small engines?

• Data (data frame): ggplot2::mpg
• A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
• mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.
• Open its help page by running: ?mpg

Do cars with big engines use more fuel than cars with small engines?

[Figure: hwy (fuel efficiency) plotted against displ (engine size).]

Cars with big engines use more fuel.
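
A plot of this relationship could be produced along the following lines. This is a sketch that assumes the ggplot2 package is installed and that a scatterplot (geom_point) with displ on the x axis is intended, which the slide does not state; the object name p is illustrative.

```r
library(ggplot2)

# displ (engine size) on the x axis, hwy (highway fuel efficiency) on the
# y axis, one point per row of ggplot2::mpg.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

print(p)
```

The points trend downwards: larger engines tend to have lower highway fuel efficiency, which is the conclusion stated above.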

Geometric objects
• ggplot2 provides over 30 geoms, and extension packages provide even more (see https://www.ggplot2-exts.org).
• The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at http://rstudio.com/cheatsheets.
• To learn more about any single geom, use help:
?geom_smooth

dplyr
• Another core member of the tidyverse

install.packages("nycflights13")
library(nycflights13)
library(tidyverse)

• To explore the basic data manipulation verbs of dplyr, we will use nycflights13::flights.
• This data frame contains all 336,776 flights that departed from New York City in 2013.
• The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights

nycflights13::flights

• Data types:
• int : integers.
• dbl : doubles, or real numbers.
• chr : character vectors, or strings.
• dttm : date-times (a date + a
time).
• lgl : logical (TRUE or FALSE).
• fctr : factors, which R uses to
represent categorical variables with
fixed possible values.
• date : dates.

dplyr basics
• dplyr has five key functions that allow you to solve the vast majority of your data
manipulation challenges:
• Pick observations by their values (filter()).
• Reorder the rows (arrange()).
• Pick variables by their names (select()).
• Create new variables with functions of existing variables (mutate()).
• Collapse many values down to a single summary (summarise()).
• These can all be used in conjunction with group_by() which changes the scope of each
function from operating on the entire dataset to operating on it group-by-group.
• These six functions provide the verbs for a language of data manipulation.

dplyr examples
To filter flights on 1st of January:
dset <- flights %>%
filter(month == 1, day == 1)

… or on 1st of January or 1st of December:


dset <- flights %>%
filter(month == 1 | month == 12, day == 1)

… or equivalently:
dset <- flights %>%
filter(month %in% c(1,12), day == 1)

dplyr examples
• … and selecting year, month, and day columns only:
dset <- flights %>%
filter(month %in% c(1,12), day == 1) %>%
select(year, month, day)

… or dropping those columns:


dset <- flights %>%
filter(month %in% c(1,12), day == 1) %>%
select(-c(year, month, day))

dplyr examples
• Adding a new column:
dset <- flights %>%
mutate(gain = arr_delay - dep_delay)

• Using “group_by” and “summarise” to aggregate rows:


dset <- flights %>%
group_by(year, month, day) %>%
summarise(delay = mean(dep_delay, na.rm = T))

Files and directories


• getwd() – Get the absolute path of the current working directory.
> getwd()

• setwd() – Set the working directory.
> setwd("/home/joy/Desktop/")

• list.files() – Character vector of the names of files in a particular directory.
> list.files()
> list.files("/home/joy/Desktop/")

• list.dirs() – The same, for directories.
> list.dirs()
> list.dirs("/home/joy/Desktop/", recursive=F)

Functions

[Diagram: Arguments → Function → Value (R object)]

• function – R object that can be used to do a particular operation or process.


• It is defined as:
> name <- function(arg_1, arg_2, ...) expression

• Example:
> pow<-function(x,ex){
+ x^ex
+}
> pow(4,2)
[1] 16

Using a text editor


• In RStudio:
File -> New File -> R Script
• R scripts – To write one or more functions, and/or pieces of code.
• Suggested file extension for R scripts: .R
• To run the code: source("file_name.R")
• Example: script_pow.R

pow<-function(x, ex){
x^ex
}

• Then:
> source("script_pow.R")

Functions
• Returned value – It is the value of the last expression evaluated in the function.
• return(value) – to explicitly indicate the value to return.
• Example:
pow<-function(x, ex){
res<-x^ex
return(res)
}

• Named arguments – when arguments are named, they may be given in any order:


> pow(x=4, ex=2)
> pow(ex=2, x=4)
> pow(4, 2) # x<-4, ex<-2
> pow(2, 4) # x<-2, ex<-4

Functions
• Default values – they can be defined along with the arguments:
• Example:
pow<-function(x, ex=2){
res<-x^ex
return(res)
}
> pow(4)
[1] 16

• Assignments within functions – are local and temporary and are lost after exit from the function.
> res<-3
> pow(2,4)
> res
[1] 3

Conditional execution
• if statements: if (expr_1) expr_2 else expr_3
• Example:
if(x>0) {
p<- 1
} else if(x<0) {
p<- -1
} else {
p<- 0
}

• ifelse – vectorised version of if/else.


> x<-c(-2,3,1,0,-1)
> ifelse(x>=0,1,-1)
[1] -1 1 1 1 -1

Loops
• for loops: for (name in expr_1) expr_2
• name – the loop variable
• expr_1 – vector expression (often a sequence)
• expr_2 – expression repeatedly evaluated.
• Example:

> y<-numeric()
> for(ix in seq_along(x)){
+ if(x[ix]>0) y[ix]<- 1
+ else if(x[ix]<0) y[ix]<- -1
+ else y[ix]<- 0
+ }
> y
[1] -1 1 1 0 -1

Apply a function over a list


• lapply – Apply a function to the elements of a list (or vector)
• Example:
> lst1<-list(x=c(1,2,3,4,NA), y=c(10,20), z=c(100,200,NA,300))
> res.lst1<-lapply(lst1, mean, na.rm=T)
> res.lst1
$x
[1] 2.5

$y
[1] 15

$z
[1] 200

• sapply – like lapply, but attempts to simplify to a vector or matrix if possible.


> res2.lst1<-sapply(lst1, mean, na.rm=T)
> res2.lst1
x y z
2.5 15.0 200.0

Exercise
• Write a function (you may need more than one) that imputes missing values in a dataset. It
should support two different imputation methods: mean and median. “mean” should be
the default method.

[Diagram: Dataset and Method = {“mean” | “median”} → miss.imp → Dataset]
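
One possible shape for a solution is sketched below. The helper name impute.vector, the argument names, the decision to impute only numeric columns, and the toy data are all assumptions made for this illustration; only the function name miss.imp and the two methods come from the exercise.

```r
# Impute the missing values of one numeric vector.
# FUN is the summary function used for the replacement value.
impute.vector <- function(v, FUN) {
  v[is.na(v)] <- FUN(v, na.rm = TRUE)
  v
}

# Impute every numeric column of a data frame.
# method = "mean" is the default, as the exercise asks.
miss.imp <- function(dataset, method = c("mean", "median")) {
  method <- match.arg(method)
  FUN <- if (method == "mean") mean else median
  is.num <- sapply(dataset, is.numeric)
  dataset[is.num] <- lapply(dataset[is.num], impute.vector, FUN = FUN)
  dataset
}

# Hypothetical toy data for a quick check.
toy <- data.frame(a = c(1, 2, NA, 9), b = c("u", "v", "w", NA))
miss.imp(toy)             # a becomes 1 2 4 9
miss.imp(toy, "median")   # a becomes 1 2 2 9
```

match.arg() picks the first element of the default vector when no method is given, which is how "mean" becomes the default; non-numeric columns such as b are returned unchanged.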

Statistical models in R
• Linear regression

• R function for linear regression: lm()

R formulae
• R uses a formula syntax to specify the form of many statistical models (and others) :
response ~ predictor_variables

• ~ operator – is read as “is modeled as a function of”


• For example, the model:
Y = β0 + β1X + ε
• is formulated in R as:
Y~X

• And the model:
Y = β0 + β1X + β2Z + ε
• as:
Y~X+Z

Example
• “women” data
• Weight is modeled as a function
of Height:

Weight = β0 + β1Height + ε

> dat<-women
> mod<-lm(weight~height, dat)
> dat$pred.weight<-predict(mod)

> plot(dat$height,dat$weight,ylab="Weight",xlab="Height")
> lines(dat$height,dat$pred.weight, col=2)
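
Beyond plotting the fitted line, the model object can be queried for the estimates of β0 and β1 and used to predict at new heights. The following is a sketch; the new.heights values are hypothetical inputs chosen for illustration.

```r
dat <- women
mod <- lm(weight ~ height, data = dat)

coef(mod)                    # intercept (β0) and slope (β1) estimates
summary(mod)$r.squared       # proportion of variance explained

# Predictions for new heights need a data frame with the same column name.
new.heights <- data.frame(height = c(60, 70))   # hypothetical inputs
predict(mod, newdata = new.heights)
```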

R formulae

For example, the following model:

can be represented by the following three formulae:

R formulae
Likewise, this model (in which the three-way interaction has been omitted):

is represented as:

“.” can be used to indicate “include all variables”


For example, if X,Z, and W are all variables in a dataset, then:
Y~X+Z+W
is equivalent to:

Y~.
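
As an illustration of how the formula operators expand, the sketch below lists the terms R derives from several formulas. The specific formulas, the helper expand() and the data frame d are examples chosen here, not taken from the slides.

```r
# Term labels that a formula expands to.
expand <- function(f) attr(terms(f), "term.labels")

# Three ways to write main effects plus the two-way interaction.
expand(Y ~ X + Z + X:Z)
expand(Y ~ X * Z)
expand(Y ~ (X + Z)^2)
# each gives "X" "Z" "X:Z"

# All main effects and two-way interactions of three predictors,
# without the three-way interaction X:Z:W.
expand(Y ~ X * Z * W - X:Z:W)
expand(Y ~ (X + Z + W)^2)

# "." expands to every other column of the data argument.
d <- data.frame(Y = 1:3, X = 1:3, Z = 1:3, W = 1:3)   # hypothetical data
attr(terms(Y ~ ., data = d), "term.labels")
```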

Problem – predicting medical expenses using linear regression


• For an insurance company to make money, it needs to collect more in yearly premiums than it spends on medical care for its beneficiaries.
• The goal of this analysis is to use patient data to estimate the average medical care expenses for segments of the insured population.

• STEP 1 – collecting data.


• STEP 2 – exploring and preparing the data.
• STEP 3 – training a model on the data.
• STEP 4 – evaluating model performance.

Step 1 – collecting data


• insurance.csv (simulated data) – 1338 examples of beneficiaries currently enrolled in the insurance plan.
• Variables:
• age: age of the beneficiary.
• sex: gender, either male or female.
• bmi: body mass index.
• children: number of children
• smoker: yes or no, depending on whether the insured regularly smokes tobacco.
• region: place of residence in the US (northeast, southeast, southwest, or northwest).
• charges: medical care expenses.

• Import dataset into R environment.


• Tip:
• read.csv, argument stringsAsFactors.

Step 2 – exploring the data


• Size of the dataset?
• Is the dataset formatted as expected (numeric variables, factors)?
• Indicate number of male/female, smokers/non-smokers.
• Are the data evenly divided among the regions?
• Would you say that the response variable approximately follows a normal distribution?
• Determine how the independent variables are related to the dependent variable and each other.

• Tips:
• Functions: nrow(), ncol(), is.factor(), str(), table(), cor(), hist(), plot()

Step 3 – training a model on the data


• Fit a linear regression model to data.
• Tips:
• Use lm() to fit the model and predict() to make predictions.

Call:
lm(formula = charges ~ ., data = dset)

Coefficients:
(Intercept)      age  sexmale    bmi  children  smokeryes
   -11938.5    256.9   -131.3  339.2     475.5    23848.5

regionnorthwest  regionsoutheast  regionsouthwest
         -353.0          -1035.0           -960.1

• Dummy coding – applied to factors: sexfemale/sexmale, smokerno/smokeryes, regionnortheast/regionnorthwest/regionsoutheast/regionsouthwest.
• Reference categories: sexfemale, smokerno, regionnortheast.
• Thus, holding the other variables constant, males are estimated to have $131.30 less in medical costs each year than females, and smokers cost an average of $23,848.50 more per year than non-smokers.

Step 4 – evaluating model performance


• Produce a result summary of the fitted model.
> summary(ins_model)

1. Residuals: summary statistics for the errors.
2. Stars: statistical significance of each variable (a variable is conventionally considered statistically significant when its p-value is below 0.05).
3. R-squared: how well the model explains the response.

Step 4 – evaluating model performance


• Plot a histogram of the residuals (errors)
• Write an R function that computes the RMSE between two numeric vectors.
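• RMSE is defined as:
RMSE = sqrt( (1/N) Σ (ŷi − yi)² ), where ŷi is the estimated value, yi is the true value, and N is the number of observations.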

• Estimate the RMSE between the true and the estimated insurance charges.

R code
#Step 1 - Importing "insurance" dataset
dset<-read.csv("insurance.csv", stringsAsFactors=T)

#Step 2
# Dataset size
nrow(dset)
ncol(dset)
# Dataset format
str(dset)
# Number of male/female, smoker/non-smoker.
table(dset$sex)
table(dset$smoker)
# Data distribution among the regions
table(dset$region)
# Distribution of the response variable
summary(dset$charges)
hist(dset$charges) # To confirm the distribution is right-skewed
# Use "cor" function to compute correlation matrix
cor(dset[c("age", "bmi", "children", "charges")])
# or, "plot" for a scatterplot matrix
plot(dset[c("age", "bmi", "children", "charges")])

R code
#Step 3
#Fitting a linear model:
ins_model <- lm(charges ~ age + children + bmi + sex + smoker + region, data = dset)
# or using "."
ins_model <- lm(charges ~ ., data = dset)
ins_model #notice the dummy variables.

#Step 4
summary(ins_model)
#Histogram of the residuals
est.charges<-predict(ins_model)
res<-dset$charges-est.charges
hist(res,breaks=30)
#RMSE function
RMSE<-function(y.est, y.tru){
sqrt(sum((y.est-y.tru)^2)/length(y.est))
}
RMSE(y.est=est.charges,y.tru=dset$charges)
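
Since insurance.csv may not be at hand, the RMSE function can be sanity-checked on small inputs. The sketch below uses made-up toy vectors and the built-in women dataset as a stand-in; it rewrites the body with mean(), which gives the same value as sum(...)/length(...).

```r
RMSE <- function(y.est, y.tru) {
  sqrt(mean((y.est - y.tru)^2))   # same value as sum(...)/length(...)
}

# Hypothetical toy vectors: errors are 0, 0 and -2, so RMSE = sqrt(4/3).
RMSE(y.est = c(1, 2, 3), y.tru = c(1, 2, 5))

# For a fitted lm, the residuals give the same quantity directly.
mod <- lm(weight ~ height, data = women)
all.equal(RMSE(predict(mod), women$weight),
          sqrt(mean(residuals(mod)^2)))   # TRUE
```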
