
25/06/2019

Tutorial 1 – Introduction to R
Dr Ivan Olier

ECI – International Summer School / Machine Learning 2019

What is R

• R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has:
  • an effective data handling and storage facility,
  • a suite of operators for calculations on arrays, in particular matrices,
  • a large, coherent, integrated collection of intermediate tools for data analysis,
  • graphical facilities for data analysis and display either directly at the computer or on hardcopy, and
  • a well developed, simple and effective programming language (called ‘S’) which includes conditionals, loops, user defined recursive functions and input and output facilities.


Why R
• R is a leading tool for statistics, data analysis, and machine learning.
• It is a programming language (objects, functions, and packages).
• Over 2,000 cutting-edge, user-contributed packages are available on CRAN (http://www.cran.r-project.org).
• It is platform independent (Linux, Mac, Windows).
• R integrates with other languages (C++, Java, Python) and lets you interact with many data sources and other statistical packages (SPSS, SAS, Stata, etc.).
• A robust, vibrant community.
• R is free.


R and RStudio
• RStudio is an integrated development environment for R (www.rstudio.com).
• It can be run locally (desktop version) or on a server (RStudio Server).


[Figure: the RStudio window and its four panes: Source, Console, Environment, and Files/Plots/Packages/Help/Viewer.]


Resources
• CRAN (http://cran.r-project.org/)
• Bioconductor (http://www.bioconductor.org/)
• Omegahat (http://www.omegahat.org/)
• Inside-R (http://www.inside-r.org)
• R-bloggers (http://www.r-bloggers.com)


First steps
• Finding help:
> help()
> help(solve)
> ?solve
> help("[[")
> help.start()
> ??solve
> example(solve)

• To quit the R program:
> q()

• To run demos:
> demo()
> demo(lm.glm)

R functions and variables


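• Names are case sensitive (A ≠ a).
• They may contain any alphanumeric symbols plus "." and "_".
• They must start with a letter, or with "." if the next character is not a digit.
• Examples:
> var1     # valid
> .var2    # valid
> .var_3   # valid
> _var3    # invalid: starts with "_"
> 1var     # invalid: starts with a digit
> .1var    # invalid: "." followed by a digit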


R functions and variables


• ; or a newline: separates commands.
> help(solve); example(solve)
> help(solve)
> example(solve)

• { and }: group commands.
> {help(solve)
+ example(solve)
+ }

• +: the continuation prompt. R shows it when the command typed so far is not complete.

• #: everything from this symbol to the end of the line is a comment.
> example(solve) #this is an example of 'solve'


Numbers and vectors


• Assignment operators:
> x<-1
Or
> x=1
Or
> assign("x",1)

• Vectors: the simplest data structure in R.
> x<-c(2,3,1,4)


Numeric vectors
• Vector arithmetic
> y<-2*x+1

• Basic operators: +, -, *, /, ^

• Basic math functions: log(x), exp(x), sin(x), cos(x), tan(x), sqrt(x), …

• Largest and smallest elements of a vector: max(x) and min(x).

• Number of elements: length(x).

• Their total and their product: sum(x) and prod(x)

• Their mean, variance and standard deviation: mean(x), var(x), sd(x).
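
As a quick illustration of the functions listed above, the sketch below applies them to the vector x from the previous slide. The values in the comments follow from simple arithmetic on that vector; the variable name y is the one used in the vector arithmetic bullet.

```r
# Summary functions applied to a small numeric vector
x <- c(2, 3, 1, 4)
y <- 2 * x + 1     # element-wise arithmetic: 5 7 3 9

length(y)          # number of elements: 4
min(y); max(y)     # smallest and largest elements: 3 and 9
sum(y)             # total: 24
prod(y)            # product: 945
mean(y)            # arithmetic mean: 6
var(y)             # sample variance (divides by n - 1)
sd(y)              # sample standard deviation, sqrt(var(y))
```

Note that var() and sd() use the n - 1 denominator, which matters for the exercise that follows.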



Exercise
• Write an expression in R that computes the sample standard deviation of this vector without using the function sd():

X=[1.2, 2.3, 1.4, 2.1, 3.2]’


Solution
> x<-c(1.2,2.3,1.4,2.1,3.2)
> x.bar<-mean(x)
> N<-length(x)
> s<-sqrt(sum((x-x.bar)^2)/(N-1))
> s
[1] 0.795613


Regular sequences
• 1:30 generates the vector c(1,2,3,…,29,30).
• 30:1 generates the same sequence backwards.
• The colon operator has higher precedence than *, / and binary + and -, so 2*1:5 is evaluated as 2*(1:5):
> 2*1:5
[1] 2 4 6 8 10

• Sequences: seq(from=value, to=value, by=value)
> seq(from=2, to=10, by=2)
> seq(2,10,2)

• To replicate an object: rep(object, times=value, each=value)
> x<-c(1,2,3)
> rep(x,times=4)
> rep(x,each=4)
> rep(x,times=4,each=4)
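
The commands above are shown without output, so here is a small sketch with the results written as comments. It uses shorter repetition counts than the slide to keep the output readable; the variable n and the length.out argument are additions for illustration only.

```r
# seq(): regular sequences by step or by length
seq(2, 10, by = 2)             # 2 4 6 8 10
seq(0, 1, length.out = 5)      # 0.00 0.25 0.50 0.75 1.00

# rep(): "each" is applied first, then "times"
x <- c(1, 2, 3)
rep(x, times = 2)              # 1 2 3 1 2 3
rep(x, each = 2)               # 1 1 2 2 3 3
rep(x, times = 2, each = 2)    # 1 1 2 2 3 3 1 1 2 2 3 3

# Precedence of ":" over binary minus
n <- 5
1:n - 1                        # (1:n) - 1  ->  0 1 2 3 4
1:(n - 1)                      # 1 2 3 4
```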

Logical vectors
• Logical values: TRUE or T, FALSE, or F, and NA (not available)
> a<-TRUE
> b<-F
> c<-NA

• Logical operators: <, <=, >, >=, == for equality and != for inequality, & (and) for intersection, | (or) for union, and ! for negation.
• Examples: if x<-c(1,2,3,4,5,6)
> x>3
[1] FALSE FALSE FALSE TRUE TRUE TRUE

> !(x<=3)
[1] FALSE FALSE FALSE TRUE TRUE TRUE

> x<3 | x>5
[1] TRUE TRUE FALSE FALSE FALSE TRUE


Missing values (and similar)


• NA represents a “not available” or “missing” value.
• NaN is a “Not a Number” value; it results from an undefined numerical computation such as 0/0.
• Inf is an infinite value.
• Examples:
> c(1,3,NA,5)
> 0/0 #produces a NaN
> Inf-Inf

• Detecting missing values in a vector:
> x<-c(1,2,3,NA,5)
> is.na(x)

• Use is.nan(x) to detect NaNs, is.infinite(x) to detect infinite elements, and is.finite(x) to detect finite elements.
• Try:
> is.nan(0/0)
> is.na(0/0)
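
The "Try" commands hint at an asymmetry between is.na() and is.nan(). The sketch below makes it explicit on a single illustrative vector (the vector itself is not from the slides): is.na() is TRUE for both NA and NaN, whereas is.nan() is TRUE only for NaN.

```r
# One vector holding a regular value, NA, NaN, Inf and -Inf
x <- c(1, NA, 0/0, 1/0, -1/0)

is.na(x)        # FALSE  TRUE  TRUE FALSE FALSE  (NaN also counts as missing)
is.nan(x)       # FALSE FALSE  TRUE FALSE FALSE
is.infinite(x)  # FALSE FALSE FALSE  TRUE  TRUE
is.finite(x)    #  TRUE FALSE FALSE FALSE FALSE

# Missing values propagate through computations unless removed
NA > 1                          # NA
mean(c(1, NA, 3))               # NA
mean(c(1, NA, 3), na.rm = TRUE) # 2
```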


Character vectors
• Character strings are entered using either matching double (") or single (') quotes:
> "Hello world"
> 'Hello world'

• Escape sequences: \n, newline, \t, tab and \b, backspace; see ?Quotes for a full list.
• Character vectors:
> c("hello","world")
> rep("hello",5)

• The paste() function concatenates strings:
> paste("hello","world")
> paste("hello",1:5)
> paste("hello",1:5,"world",sep="_")
> paste("hello",1:5,"world",sep="_",collapse=",")
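
To make the roles of sep and collapse concrete, the following sketch repeats the paste() calls with 1:3 instead of 1:5 and shows the results as comments. paste0() is not on the slide; it is included as a common shorthand for sep = "".

```r
paste("hello", "world")
# "hello world"                      (default separator is a space)

paste("hello", 1:3)
# "hello 1" "hello 2" "hello 3"      (shorter arguments are recycled)

paste("hello", 1:3, "world", sep = "_")
# "hello_1_world" "hello_2_world" "hello_3_world"

paste("hello", 1:3, "world", sep = "_", collapse = ",")
# "hello_1_world,hello_2_world,hello_3_world"   (one single string)

paste0("var", 1:3)
# "var1" "var2" "var3"
```

In short, sep joins the arguments within each element, while collapse joins the resulting elements into one string.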


Subsetting
> x<-c(10,20,30,40,50)

• By index vector:
> x[3]
[1] 30
> x[2:4]
[1] 20 30 40

• Logical conditions:
> x[x>30]
[1] 40 50

• Excluding values:
> x[-c(1,3)]
[1] 20 40 50


Exercise
• Detect the missing values in the following vector and replace them with the mean of the observed values.

X=[1, 2, 3, 4, ?, 6, 7, ?]

Answer:

X=[1, 2, 3, 4, 3.833, 6, 7, 3.833]


Solution
> x<-c(1,2,3,4,NA,6,7,NA)
> mean(x)
[1] NA
> x.nomiss<-x[!is.na(x)]
> x.nomiss
[1] 1 2 3 4 6 7
> x[is.na(x)]<-mean(x.nomiss)
> x
[1] 1.000 2.000 3.000 4.000 3.833 6.000 7.000 3.833

Alternatively, starting again from the original x:
> mean(x,na.rm = T)
[1] 3.833333
> x[is.na(x)]<-mean(x,na.rm = T)


Objects
• Object – any entity which R operates on.
• Class of an object:
• numeric
• logical
• character
• list
• matrix
• array
• factor
• data.frame
• …


Objects
• Creating objects: class_name(length)
> numeric(3)
[1] 0 0 0
> numeric()
numeric(0)

• Testing an object: is.class_name(object)


> x<-c(1,2,3)
> is.numeric(x)
[1] TRUE

• Coercing an object to be of a different class: as.class_name(object)


> as.character(x)
[1] "1" "2" "3"


Factors
• factor – a vector object used to group the components of other vectors of the same length.
• Example:

x  2.0  1.5  3.8  1.3  4.2  7.1  5.5  2.9
c  g1   g1   g2   g1   g2   g2   g1   g2

• In R:
> c<-factor(c("g1","g1","g2","g1","g2","g2","g1","g2"))
> c
[1] g1 g1 g2 g1 g2 g2 g1 g2
Levels: g1 g2

• To know the labels:
> levels(c)
[1] "g1" "g2"


Factors
• Example (cont.):
> x<-c(2,1.5,3.8,1.3,4.2,7.1,5.5,2.9)

• tapply – applies a function to each group of values of a vector, with the groups defined by one or more factors:
tapply(vector, factors, function)

> tapply(x, c, mean)
   g1    g2
2.575 4.500


Arrays
• Array – a multiply subscripted collection of data entries.
• Creating an array: array(data=vector, dim=dimension_vector)
• Example:
> array(data=1:24,dim=c(3,4,2))
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24


Matrices
• Matrix – a 2D array.
• It can be created with the “array” function or with: matrix(data=vector, nrow, ncol)
• Example:
> matrix(data=1:12,nrow=3,ncol=4)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

• is.matrix and is.array test whether an object is a matrix or an array.
• as.matrix and as.array coerce an object to a matrix or an array, respectively.


Indexing arrays and matrices


> x <- array(1:20, dim=c(4,5))
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

• To retrieve an element:
> x[2,3]
[1] 10

• And a row:
> x[2,]
[1] 2 6 10 14 18
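
Columns and sub-matrices are retrieved in the same way. The sketch below is an addition to the slide; the drop = FALSE argument is a standard option of the `[` operator that stops R from simplifying a single row or column to a plain vector.

```r
x <- array(1:20, dim = c(4, 5))

x[, 3]                      # third column as a vector: 9 10 11 12
x[1:2, c(1, 5)]             # rows 1-2 and columns 1 and 5: a 2 x 2 matrix
x[2, ]                      # a row is simplified to a vector by default
x[2, , drop = FALSE]        # the same row kept as a 1 x 5 matrix
dim(x[2, , drop = FALSE])   # 1 5
x[x > 18]                   # logical indexing works too: 19 20
```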


Indexing arrays and matrices (cont.)


• Using an array of indexes to retrieve values (each row of the index array is a (row, column) pair):
> y<-array(c(2:4,2:3),dim=c(3,2))
> y
     [,1] [,2]
[1,]    2    2
[2,]    3    3
[3,]    4    2
> x[y]
[1] 6 11 8

• Or, to assign a value:
> x[y]<-NA
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2   NA   10   14   18
[3,]    3    7   NA   15   19
[4,]    4   NA   12   16   20


Operations with arrays


• Arithmetic operations – are done element-wise.
> X<-array(1:12,c(3,4)); Y<-array(12:1,c(3,4))
> 2*X+Y
[,1] [,2] [,3] [,4]
[1,] 14 17 20 23
[2,] 15 18 21 24
[3,] 16 19 22 25

• Mixed vector and array arithmetic (recycling rule): any short vector operands are extended by recycling their values until they match the size of any other operands.

> z<-c(2,3)
> z*X
[,1] [,2] [,3] [,4]
[1,] 2 12 14 30
[2,] 6 10 24 22
[3,] 6 18 18 36


Matrix operations
Operation                      Function   Example
Matrix multiplication          %*%        X %*% Y
Matrix inversion               solve()    solve(X)
Linear equation                solve()    solve(X,b)
Transpose of a matrix          t()        t(X)
Eigenvalues and eigenvectors   eigen()    eigen(X)
Binding matrices:
- Column-wise                  cbind()    cbind(X,Y,Z)
- Row-wise                     rbind()    rbind(X,Y,Z)


Exercise
• Create a 3x3 random matrix A and a random vector b with 3 elements.
• Solve the linear system Ax = b:

1. By using solve(A,b)
2. By computing the inverse of A (using solve) and then x = A^(-1) b

• Compute the eigenvalues and eigenvectors of A.
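
One possible way to approach this exercise is sketched below. It is not the tutorial's own solution: the seed value and the use of runif() for the random entries are assumptions made only so that the example is reproducible.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
A <- matrix(runif(9), nrow = 3)    # 3 x 3 matrix of uniform random numbers
b <- runif(3)                      # random vector with 3 elements

# 1. Solve A %*% x = b directly
x1 <- solve(A, b)

# 2. Compute the inverse first, then multiply
A.inv <- solve(A)
x2 <- as.vector(A.inv %*% b)

all.equal(x1, x2)                  # both routes should agree up to rounding
max(abs(A %*% x1 - b))             # residual, should be close to zero

# Eigenvalues and eigenvectors (these may be complex for a non-symmetric A)
e <- eigen(A)
e$values
e$vectors
```

Calling solve(A, b) directly is generally preferable to forming the inverse, since it does less work and is numerically more stable.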


Lists
• List – an ordered collection of objects.
> Lst<-list(name="Fred", wife="Mary", no.children=3,
+ child.ages=c(4,7,9))

> Lst
$name
[1] "Fred"

$wife
[1] "Mary"

$no.children
[1] 3

$child.ages
[1] 4 7 9

Lists
• Accessing the components:
> Lst[[1]]
[1] "Fred"
> Lst$wife
[1] "Mary"

> Lst$wife<-"Liz"

> Lst[["no.children"]]
[1] 3
> Lst[["no.children"]]<-4

• Adding components to the list:
> Lst$pet<-"dog"
> Lst[["occupation"]]<-"Teacher"

Lists
• Getting the component names:
> names(Lst)
[1] "name" "wife" "no.children" "child.ages" "pet" "occupation"

• Deleting a component:
> Lst$occupation<-NULL

• Concatenating lists:
> List.ABC<-c(list.A, list.B, list.C)
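
The slide's concatenation example uses placeholder names (list.A, list.B, list.C). The sketch below uses small concrete lists instead; the objects extra and both are hypothetical names introduced here. It also shows the difference between single and double brackets, which is a frequent source of confusion.

```r
Lst <- list(name = "Fred", no.children = 3, child.ages = c(4, 7, 9))

# Single brackets return a (sub)list; double brackets return the component
class(Lst["name"])       # "list"
class(Lst[["name"]])     # "character"
Lst$child.ages[2]        # second element of a component: 7

# Concatenating two lists with c()
extra <- list(pet = "dog")
both <- c(Lst, extra)
names(both)              # "name" "no.children" "child.ages" "pet"
length(both)             # 4

# Deleting a component by assigning NULL
both$pet <- NULL
length(both)             # 3
```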


Data frames
• A data frame (data.frame) is a list with the restriction that components must be vectors (numeric, character, or logical) of the same length.
• It is the most suitable R object for datasets.

var1  var2  var3
   1    -4    -2
   2    -3    -1
   3    -2     0
   4    -1     1

• In R:
> dat1<-data.frame(var1=1:4,var2=-4:-1,var3=-2:1)


Data frames
• It is still a list, so:
> dat1[["var1"]] #or dat1[[1]] or dat1$var1
[1] 1 2 3 4

> dat1$var4<-5:8
> dat1
  var1 var2 var3 var4
1    1   -4   -2    5
2    2   -3   -1    6
3    3   -2    0    7
4    4   -1    1    8

> names(dat1)
[1] "var1" "var2" "var3" "var4"

> dat1$var4<-NULL #drops "var4"


Data frames
• To access a particular row:
> dat1[2,]
  var1 var2 var3
2    2   -3   -1

• Subsetting:
> dat1[dat1$var2>=-2,c("var1","var3")]
  var1 var3
3    3    0
4    4    1

• Or, by using the function subset:
> subset(dat1, var2>=-2, select=c(var1,var3))
  var1 var3
3    3    0
4    4    1

Accessing built-in datasets


• Around 100 datasets are supplied with R.
> data()

• All the datasets supplied with R are available directly by name.
> mtcars

• To see the first or last parts of a data frame (or any other object): head() or tail(), respectively.
• Or, if you are in RStudio: View()
> View(mtcars)

• Number of rows or columns: nrow() or ncol().

• Row names: row.names()
> row.names(mtcars)
> row.names(mtcars)<-c("dfhdf","asdasd",…)


Exercise
• From ‘mtcars’ dataset, estimate the average mpg (miles per gallon) by number of
cylinders.


Solution
• Let’s have a look at the mtcars dataset:
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

• Create a factor of the ‘cyl’ variable:
> cyl.f<-factor(mtcars$cyl)
> cyl.f
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
Levels: 4 6 8

• Use tapply to estimate the averages:
> tapply(mtcars$mpg, cyl.f, mean)
       4        6        8
26.66364 19.74286 15.10000
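
Two equivalent routes to the same group means are sketched below as alternatives to the solution above; neither appears on the slide. tapply() coerces its grouping argument to a factor itself, and aggregate() returns the result as a data frame, using the formula syntax introduced later in the tutorial.

```r
# tapply() converts the grouping vector to a factor internally
m1 <- tapply(mtcars$mpg, mtcars$cyl, mean)
m1

# aggregate() with a formula: mean mpg for each value of cyl
m2 <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
m2
```

The aggregate() form is convenient when the result is needed as a data frame for further processing.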


Exporting data to files


• write.table()

write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
            eol = "\n", na = "NA", dec = ".", row.names = TRUE,
            col.names = TRUE, qmethod = c("escape", "double"),
            fileEncoding = "")

• Example:
> write.table(dat1, file="file1.txt", sep="\t", row.names=F)

• write.csv()

> write.csv(dat1, file="file2.csv")


Reading data from files


• read.table()
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

• read.csv()
read.csv(file, header = TRUE, sep = ",", quote = "\"",
dec = ".", fill = TRUE, comment.char = "")

• read.delim()
read.delim(file, header = TRUE, sep = "\t", quote = "\"",
dec = ".", fill = TRUE, comment.char = "")
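
As a small round-trip check of the writing and reading functions, the sketch below writes the data frame dat1 from the earlier slides to a temporary CSV file and reads it back. The use of tempfile() and the object names f and dat2 are choices made for this illustration, so that no file is left in the working directory.

```r
dat1 <- data.frame(var1 = 1:4, var2 = -4:-1, var3 = -2:1)

f <- tempfile(fileext = ".csv")                # temporary file path
write.csv(dat1, file = f, row.names = FALSE)   # omit the row-name column

dat2 <- read.csv(f)                            # header = TRUE, sep = "," by default
str(dat2)                                      # three integer columns, four rows
all.equal(dat1, dat2)                          # the data survive the round trip

unlink(f)                                      # remove the temporary file
```

Without row.names = FALSE, write.csv() adds the row names as a first column, which then comes back as an extra variable when the file is read.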


R Packages
• All R functions and datasets are stored in packages.

• The standard (or base) packages are considered part of the R source code.

• To see installed packages: library()

• To load a particular package: library(package_name)

• To install a new package: install.packages(new_packages)

• To update packages: update.packages()

• To see loaded packages: search()

• Contributed packages (CRAN, Bioconductor, Omegahat)



The tidyverse
• The tidyverse is a collection of R packages designed for data science.
• An R package is a collection of functions, data, and documentation that extends the capabilities of base R.
• tidyverse includes the two packages that we will learn today:
  • ggplot2 package for data visualisation
  • dplyr package for data transformation
• You can install the complete tidyverse with a single line of code:

install.packages("tidyverse")

The tidyverse
• You will not be able to use the functions, objects, and help files in a package until you load it with library().
• Once you have installed a package, you can load it with the library() function:

> library(tidyverse)

• The message printed on loading tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages.


ggplot2
• R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile.
• ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs.


Do cars with big engines use more fuel than cars with small engines?

• Data (data frame): ggplot2::mpg
• A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
• mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.
• Open its help page by running:
?mpg


Do cars with big engines use more fuel than cars with small engines?

• displ: engine size
• hwy: fuel efficiency

[Figure: plot of hwy against displ]

• Cars with big engines use more fuel.
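
The plotting code itself is not in the extracted slide text. A minimal sketch of how such a plot is commonly drawn with ggplot2 is given below; the choice of geom_point() (a scatter plot) is an assumption about the original figure. The code is guarded so that it only runs when ggplot2 is installed.

```r
p <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Engine size (displ) on the x axis, highway fuel efficiency (hwy) on the y axis
  p <- ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy))
  print(p)
}
```

A downward trend in such a plot, with hwy falling as displ grows, is what supports the statement that cars with big engines use more fuel.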


Geometric objects
• ggplot2 provides over 30 geoms, and extension packages provide even more (see https://www.ggplot2-exts.org).
• The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at http://rstudio.com/cheatsheets.
• To learn more about any single geom, use help:
?geom_smooth


dplyr
• Another core member of the tidyverse

install.packages("nycflights13")
library(nycflights13)
library(tidyverse)

• To explore the basic data manipulation verbs of dplyr, we will use nycflights13::flights
• This data frame contains all 336,776 flights that departed from New York City in 2013.
• The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights

nycflights13::flights

• Data types:
  • int : integers.
  • dbl : doubles, or real numbers.
  • chr : character vectors, or strings.
  • dttm : date-times (a date + a time).
  • lgl : logical (TRUE or FALSE).
  • fctr : factors, which R uses to represent categorical variables with fixed possible values.
  • date : dates.


dplyr basics
• dplyr has five key functions that allow you to solve the vast majority of your data manipulation challenges:
  • Pick observations by their values (filter()).
  • Reorder the rows (arrange()).
  • Pick variables by their names (select()).
  • Create new variables with functions of existing variables (mutate()).
  • Collapse many values down to a single summary (summarise()).
• These can all be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
• These six functions provide the verbs for a language of data manipulation.


dplyr examples
To filter flights on 1st of January:
dset <- flights %>%
filter(month == 1, day == 1)

… or on 1st of January or 1st of December:


dset <- flights %>%
filter(month == 1 | month == 12, day == 1)

… or equivalently:
dset <- flights %>%
filter(month %in% c(1,12), day == 1)


dplyr examples
• … and selecting year, month, and day columns only:
dset <- flights %>%
filter(month %in% c(1,12), day == 1) %>%
select(year, month, day)

… or dropping those columns:


dset <- flights %>%
filter(month %in% c(1,12), day == 1) %>%
select(-c(year, month, day))


dplyr examples
• Adding a new column:
dset <- flights %>%
mutate(gain = arr_delay - dep_delay)

• Using “group_by” and “summarise” to aggregate rows:


dset <- flights %>%
group_by(year, month, day) %>%
summarise(delay = mean(dep_delay, na.rm = T))
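
The verbs can be chained into a single pipeline. The sketch below does so on the built-in mtcars data rather than on flights, so that it runs without nycflights13; the column names wt_kg, mean_mpg and n are hypothetical names introduced for illustration, and the code is guarded so that it only runs when dplyr is installed.

```r
res <- NULL
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  res <- mtcars %>%
    filter(cyl %in% c(4, 8)) %>%              # keep 4- and 8-cylinder cars
    mutate(wt_kg = wt * 1000 * 0.4536) %>%    # wt is in 1000 lbs; convert to kg
    group_by(cyl) %>%                         # one group per cylinder count
    summarise(mean_mpg = mean(mpg, na.rm = TRUE),
              n = n()) %>%                    # group mean and group size
    arrange(desc(mean_mpg))                   # highest average mpg first
  print(res)
}
```

Each verb takes a data frame and returns a data frame, which is what makes this kind of chaining with %>% possible.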


Files and directories


• getwd() – gets the absolute path of the current working directory.
> getwd()

• setwd() – sets the working directory.
> setwd("/home/joy/Desktop/")

• list.files() – character vector of the names of files in a particular directory.
> list.files()
> list.files("/home/joy/Desktop/")

• list.dirs() – the same for directories.
> list.dirs()
> list.dirs("/home/joy/Desktop/", recursive=F)


Functions

[Diagram: Arguments → Function → Value (R object)]

• function – R object that can be used to do a particular operation or process.
• It is defined as:
> name <- function(arg_1, arg_2, ...) expression

• Example:
> pow<-function(x,ex){
+ x^ex
+ }
> pow(4,2)
[1] 16


Using a text editor


• In RStudio:
File -> New File -> R Script
• R scripts – used to write one or more functions, and/or pieces of code.
• Suggested file extension for R scripts: .R
• To run the code: source("file_name.R")
• Example: script_pow.R

pow<-function(x, ex){
  x^ex
}

• Then:
> source("script_pow.R")


Functions
• Returned value – the value of the last expression evaluated in the function.
• return(value) – explicitly indicates the value to return.
• Example:
pow<-function(x, ex){
  res<-x^ex
  return(res)
}

• Named arguments – when arguments are given by name, they may be given in any order:
> pow(x=4, ex=2)
> pow(ex=2, x=4)
> pow(4, 2) # x<-4, ex<-2
> pow(2, 4) # x<-2, ex<-4


Functions
• Default values – they can be defined along with the arguments.
• Example:
pow<-function(x, ex=2){
  res<-x^ex
  return(res)
}
> pow(4)
[1] 16

• Assignments within functions – are local and temporary and are lost after exit from the function.
> res<-3
> pow(2,4)
[1] 16
> res
[1] 3


Conditional execution
• if statements: if (expr_1) expr_2 else expr_3
• Example:
if(x>0) {
p<- 1
} else if(x<0) {
p<- -1
} else {
p<- 0
}

• ifelse – vectorised version of if/else.


> x<-c(-2,3,1,0,-1)
> ifelse(x>=0,1,-1)
[1] -1 1 1 1 -1
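
The ifelse() call above maps zero to 1, unlike the three-way if/else example. Nesting a second ifelse() reproduces the three cases in vectorised form, as the sketch below shows; the comparison with the base function sign() is an addition for illustration.

```r
x <- c(-2, 3, 1, 0, -1)

# Vectorised three-way branch: 1 if positive, -1 if negative, 0 otherwise
p <- ifelse(x > 0, 1, ifelse(x < 0, -1, 0))
p                        # -1  1  1  0 -1

# For this particular task base R already provides sign()
identical(p, sign(x))    # TRUE
```

A plain if statement, by contrast, expects a condition of length one, which is why it is used inside loops or on scalar values.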


Loops
• for loops: for (name in expr_1) expr_2
• name – the loop variable
• expr_1 – vector expression (often a sequence)
• expr_2 – expression repeatedly evaluated.
• Example (with x<-c(-2,3,1,0,-1) as in the previous example; seq_along(x) is a safer form of 1:length(x) because it also works when x is empty):

> y<-numeric()
> for(ix in seq_along(x)){
+ if(x[ix]>0) y[ix]<- 1
+ else if(x[ix]<0) y[ix]<- -1
+ else y[ix]<- 0
+ }
> y
[1] -1 1 1 0 -1


Apply a function over a list


• lapply – Apply a function to the elements of a list (or vector)
• Example:
> lst1<-list(x=c(1,2,3,4,NA), y=c(10,20), z=c(100,200,NA,300))
> res.lst1<-lapply(lst1, mean, na.rm=T)
> res.lst1
$x
[1] 2.5

$y
[1] 15

$z
[1] 200

• sapply – like lapply, but attempts to simplify to a vector or matrix if possible.


> res2.lst1<-sapply(lst1, mean, na.rm=T)
> res2.lst1
x y z
2.5 15.0 200.0


Exercise
• Write a function (you may need more than one) that imputes missing values in a dataset. It should support two different imputation methods: mean and median. “mean” should be the default method.

[Diagram: miss.imp takes a dataset and Method={"mean" | "median"} as inputs and returns a dataset.]
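
One possible solution is sketched below; it is not the tutorial's own answer. The helper name imp.vec, the decision to leave non-numeric columns untouched, and the example data frame d are all assumptions made for this illustration. Only the name miss.imp and the two methods come from the exercise.

```r
# Helper (hypothetical name): impute the missing values of one vector
imp.vec <- function(v, method) {
  if (!is.numeric(v)) return(v)       # assumption: skip non-numeric columns
  fill <- if (method == "mean") mean(v, na.rm = TRUE) else median(v, na.rm = TRUE)
  v[is.na(v)] <- fill
  v
}

# Impute every column of a data frame; "mean" is the default method
miss.imp <- function(dataset, method = c("mean", "median")) {
  method <- match.arg(method)         # picks "mean" when method is not given
  dataset[] <- lapply(dataset, imp.vec, method = method)
  dataset
}

# Small made-up dataset to try it on
d <- data.frame(a = c(1, 2, NA, 4),
                b = c(NA, 10, 20, 60),
                g = c("u", "v", "u", "v"))
d.mean <- miss.imp(d)                       # a: NA -> 7/3, b: NA -> 30
d.median <- miss.imp(d, method = "median")  # a: NA -> 2,   b: NA -> 20
```

Assigning to dataset[] rather than to dataset keeps the result a data frame, since lapply() on its own would return a plain list.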


Statistical models in R
• Linear regression

• R function for linear regression: lm()


R formulae
• R uses a formula syntax to specify the form of many statistical models (and others):
response ~ predictor_variables

• ~ operator – is read as “is modeled as a function of”

• For example:
Y = β0 + β1X + ε
• is formulated in R as:
Y~X

• And:
Y = β0 + β1X + β2Z + ε
• As:
Y~X+Z


Example
• “women” data
• Weight is modeled as a function of Height:

Weight = β0 + β1 Height + ε

> dat<-women
> mod<-lm(weight~height, dat)
> dat$pred.weight<-predict(mod)

> plot(dat$height,dat$weight,ylab="Weight",xlab="Height")
> lines(dat$height,dat$pred.weight, col=2)
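
Beyond plotting the fitted line, the estimated coefficients can be read off with coef() and the model can be applied to new heights through the newdata argument of predict(). The sketch below is an addition to the slide; the two heights (60 and 70 inches) are arbitrary illustrative values.

```r
mod <- lm(weight ~ height, data = women)

# Estimated intercept (β0) and slope (β1)
b <- coef(mod)
b

# Predictions for heights that need not be in the data
new.heights <- data.frame(height = c(60, 70))
pred <- predict(mod, newdata = new.heights)
pred

# The same predictions computed by hand from the coefficients
b[1] + b[2] * new.heights$height
```

The column name in newdata must match the predictor name used in the formula, here height.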


R formulae

For example, the following model:

can be represented by the following three formulae:


R formulae
Likewise, this model (in which the three-way interaction has been omitted):

is represented as:

“.” can be used to indicate “include all variables”


For example, if X,Z, and W are all variables in a dataset, then:
Y~X+Z+W
is equivalent to:

Y~.
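
The model equations and formulae on the two preceding slides did not survive extraction, so the sketch below does not reproduce them. It only illustrates, as a general property of R's formula syntax, how interaction terms can be written in several equivalent ways; the helper labs is a hypothetical name, and X, Z and W are the variable names used above.

```r
# Helper (hypothetical name): list the terms a formula expands to
labs <- function(f) attr(terms(f), "term.labels")

# Three ways of writing main effects plus all interactions
labs(Y ~ X + Z + W + X:Z + X:W + Z:W + X:Z:W)
labs(Y ~ X * Z * W)
labs(Y ~ (X + Z + W)^3)

# Main effects and two-way interactions only (no three-way term)
labs(Y ~ (X + Z + W)^2)
```

Here ":" denotes an interaction, "*" expands to main effects plus their interactions, and "^k" limits interactions to order k.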


Problem – predicting medical expenses using linear regression


• In order for an insurance company to make money, it needs to collect more in yearly premiums than it spends on medical care for its beneficiaries.
• The goal of this analysis is to use patient data to estimate the average medical care expenses for segments of the insured population.

• STEP 1 – collecting data.
• STEP 2 – exploring and preparing the data.
• STEP 3 – training a model on the data.
• STEP 4 – evaluating model performance.


Step 1 – collecting data


• insurance.csv (simulated data) – 1338 examples of beneficiaries currently enrolled in the insurance plan.
• Variables:
• age: age of the beneficiary.
• sex: gender, either male or female.
• bmi: body mass index.
• children: number of children
• smoker: yes or no, depending on whether the insured regularly smokes tobacco.
• region: place of residence in the US (northeast, southeast, southwest, or northwest).
• charges: medical care expenses.

• Import dataset into R environment.


• Tip:
• read.csv, argument stringsAsFactors.


Step 2 – exploring the data


• Size of the dataset?
• Is the dataset formatted as expected (numeric variables, factors)?
• Indicate number of male/female, smokers/non-smokers.
• Are the data evenly divided among the regions?
• Would you say that the response variable approximately follows a normal distribution?
• Determine how the independent variables are related to the dependent variable and each other.

• Tips:
• Functions: nrow(), ncol(), is.factor(), str(), table(), cor(), hist(), plot()


Step 3 – training a model on the data


• Fit a linear regression model to data.
• Tips:
• Use lm() to fit the model and predict() to make predictions.

Call:
lm(formula = charges ~ ., data = dset)

Coefficients:
    (Intercept)              age          sexmale              bmi         children        smokeryes
       -11938.5            256.9           -131.3            339.2            475.5          23848.5
regionnorthwest  regionsoutheast  regionsouthwest
         -353.0          -1035.0           -960.1

• Dummy coding – applied to factors: sexfemale/sexmale, smokerno/smokeryes, regionnortheast/regionnorthwest/regionsoutheast/regionsouthwest.
• Reference categories: sexfemale, smokerno, regionnortheast.
• Thus, holding the other variables fixed, the model estimates that males incur $131.30 less in medical costs each year than females, and that smokers cost an average of $23,848.50 more than non-smokers.


Step 4 – evaluating model performance


• Produce a result summary of the fitted model.
> summary(ins_model)

1. Residuals: summary statistics for the errors.
2. Stars: significance level of each coefficient (a variable is conventionally taken as statistically significant when its p-value < 0.05).
3. R-squared: how well the model explains the response.


Step 4 – evaluating model performance


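• Plot a histogram of the residuals (errors).
• Write an R function that computes the RMSE between two numeric vectors.
• RMSE is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$

where $\hat{y}_i$ are the estimated values, $y_i$ the true values, and $N$ the number of observations.

• Estimate the RMSE between the true and the estimated insurance charges.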


R code
#Step 1 - Importing "insurance" dataset
dset<-read.csv("insurance.csv", stringsAsFactors=T)

#Step 2
# Dataset size
nrow(dset)
ncol(dset)
# Dataset format
str(dset)
# Number of male/female, smoker/non-smoker.
table(dset$sex)
table(dset$smoker)
# Data distribution among the regions
table(dset$region)
# Distribution of the response variable
summary(dset$charges)
hist(dset$charges) # To confirm the distribution is right-skewed
# Use "cor" function to compute correlation matrix
cor(dset[c("age", "bmi", "children", "charges")])
# or, "plot" for a distributional plot
plot(dset[c("age", "bmi", "children", "charges")])


R code
#Step 3
#Fitting a linear model:
ins_model <- lm(charges ~ age + children + bmi + sex + smoker + region, data = dset)
# or using "."
ins_model <- lm(charges ~ ., data = dset)
ins_model #notice the dummy variables.

#Step 4
summary(ins_model)
#Histogram of the residuals
est.charges<-predict(ins_model)
res<-dset$charges-est.charges
hist(res,breaks=30)
#RMSE function
RMSE<-function(y.est, y.tru){
sqrt(sum((y.est-y.tru)^2)/length(y.est))
}
RMSE(y.est=est.charges,y.tru=dset$charges)
