Tutorial01 R Introduction

This document provides an introduction to R, including: 1) R is a leading tool for statistics, data analysis, and machine learning that includes a programming language and over 2,000 user-contributed packages. 2) RStudio is an integrated development environment for R that can be used locally or on a server. 3) Resources for learning R include CRAN, Bioconductor, Omegahat, Inside-R, and R-bloggers. Basic functions, variables, vectors, sequences, and logical values are demonstrated.

25/06/2019

Tutorial 1 – Introduction to R
Dr Ivan Olier
[email protected]

ECI – International Summer School /

Machine Learning
2019

What is R

• R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has:
  • an effective data handling and storage facility,
  • a suite of operators for calculations on arrays, in particular matrices,
  • a large, coherent, integrated collection of intermediate tools for data analysis,
  • graphical facilities for data analysis and display either directly at the computer or on hardcopy, and
  • a well developed, simple and effective programming language (called ‘S’) which includes conditionals, loops, user defined recursive functions and input and output facilities.

2019 - ECI - International Summer School/Machine Learning - Dr Ivan Olier 2


Why R
• R is a leading tool for statistics, data analysis, and machine learning.
• It is a programming language (objects, functions, and packages).
• Over 2,000 cutting-edge, user-contributed packages available on CRAN (http://www.cran.r-project.org).
• It is platform independent (Linux, Mac, Windows).
• R allows you to integrate with other languages (C++, Java, Python), and enables you to interact with many data sources and other statistical packages (SPSS, SAS, Stata, etc).
• A robust, vibrant community.
• R is free.


R and RStudio
• RStudio is an integrated development environment for R (www.rstudio.com).
• It can be run locally (desktop version) or on a server (RStudio Server).


[Figure: the RStudio window, with its Source, Console, Environment, and Files/Plots/Packages/Help/Viewer panes.]


Resources
• CRAN (http://cran.r-project.org/)
• Bioconductor (http://www.bioconductor.org/)
• Omegahat (http://www.omegahat.org/)
• Inside-R (http://www.inside-r.org)
• R-bloggers (http://www.r-bloggers.com)


First steps
• Finding help:
> help()
> help(solve)
> ?solve
> help("[[")
> help.start()
> ??solve
> example(solve)

• To quit the R program:
> q()

• To run demos:
> demo()
> demo(lm.glm)

R functions and variables

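• Names are case sensitive (A ≠ a).
• They may contain any alphanumeric symbols plus "." and "_".
• They must start with a letter, or
• with "." if the next character is not a digit.
• Examples:
> var1      # valid
> .var2     # valid
> .var_3    # valid
> _var3     # invalid: starts with "_"
> 1var      # invalid: starts with a digit
> .1var     # invalid: "." followed by a digit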
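
The naming rules above can be checked from the console. The sketch below relies on base R's make.names(), which rewrites any string into a syntactically valid name, so a string that comes back unchanged was already valid. The helper name is_valid_name is illustrative only and is not part of R.

```r
# Hypothetical helper: a name is valid if make.names() leaves it unchanged.
is_valid_name <- function(name) make.names(name) == name

candidates <- c("var1", ".var2", ".var_3", "_var3", "1var", ".1var")
valid <- is_valid_name(candidates)
names(valid) <- candidates
print(valid)   # the first three are TRUE, the last three FALSE
```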


R functions and variables

• ; or a newline: separates commands.
> help(solve); example(solve)
> help(solve)
> example(solve)

• { and }: group commands.
> {help(solve)
+ example(solve)
+ }

• +: the continuation prompt; R shows it when the command entered so far is not complete.

• #: everything from this symbol to the end of the line is a comment.
> example(solve) #this is an example of 'solve'


Numbers and vectors

• Assignment operators
> x<-1
Or
> x=1
Or
> assign("x",1)

• Vectors: simplest data structure in R
> x<-c(2,3,1,4)


Numeric vectors
• Vector arithmetic
>y<-2*x+1

• Basic operators: +, -, *, /, ^

• Basic math functions: log(x), exp(x), sin(x), cos(x), tan(x), sqrt(x), …

• Largest and smallest elements of a vector: max(x) and min(x).

• Number of elements: length(x).

• Their total and their product: sum(x) and prod(x)

• Their mean, variance and standard deviation: mean(x), var(x), sd(x).
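
As a quick illustration of the functions listed above, here they are applied to the vector x defined earlier. The variable names other than x and y are chosen here for readability only.

```r
x <- c(2, 3, 1, 4)
y <- 2 * x + 1                 # element-wise arithmetic: 5 7 3 9

x_range <- c(min(x), max(x))   # 1 4
n <- length(x)                 # 4
total <- sum(x)                # 10
product <- prod(x)             # 24
avg <- mean(x)                 # 2.5
variance <- var(x)             # 5/3; var() uses the N-1 denominator
std_dev <- sd(x)               # square root of the variance
```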


Exercise
• Write an expression in R that estimates the standard deviation of this vector without using sd():

X=[1.2, 2.3, 1.4, 2.1, 3.2]’


Solution
> x<-c(1.2,2.3,1.4,2.1,3.2)
> x.bar<-mean(x)
> N<-length(x)
> s<-sqrt(sum((x-x.bar)^2)/(N-1))
> s
[1] 0.795613


Regular sequences
• 1:30 will generate the vector c(1,2,3,…,29,30)
• 30:1 may be used to generate the sequence backwards.
• : (colon) has high priority:
> 2*1:5
[1] 2 4 6 8 10

• Sequences: seq(from=value, to=value, by=value)

> seq(from=2, to=10, by=2)
> seq(2,10,2)

• To replicate an object: rep(object, times=value, each=value)

> x<-c(1,2,3)
> rep(x,times=4)
> rep(x,each=4)
> rep(x,times=4,each=4)
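
The slide does not show what the seq() and rep() calls return. The short sketch below uses smaller repeat counts than the slide (2 instead of 4) to keep the results easy to read. When both arguments are given, each is applied before times.

```r
doubled <- 2 * 1:5                        # ':' binds tighter than '*': 2 4 6 8 10
evens <- seq(from = 2, to = 10, by = 2)   # 2 4 6 8 10

x <- c(1, 2, 3)
by_times <- rep(x, times = 2)             # 1 2 3 1 2 3
by_each <- rep(x, each = 2)               # 1 1 2 2 3 3
both <- rep(x, times = 2, each = 2)       # 1 1 2 2 3 3 1 1 2 2 3 3
```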

Logical vectors
• Logical values: TRUE or T, FALSE, or F, and NA (not available)
> a<-TRUE
> b<-F
> c<-NA

• Logical operators: <, <=, >, >=, == for equality and != for inequality, & (and) for intersection, | (or) for union, and ! for negation.
• Examples: if x<-c(1,2,3,4,5,6)
> x>3
[1] FALSE FALSE FALSE TRUE TRUE TRUE

> !(x<=3)
[1] FALSE FALSE FALSE TRUE TRUE TRUE

> x<3 | x>5
[1] TRUE TRUE FALSE FALSE FALSE TRUE


Missing values (and similar)

• NA represents a “not available” or “missing value”.
• NaN is a “Not a Number” value; it results from a numerical computation whose result is undefined.
• Inf is an infinite value.
• Examples:
> c(1,3,NA,5)
> 0/0 #produces a NaN
> Inf-Inf

• Detecting missing values in a vector:

> x<-c(1,2,3,NA,5)
> is.na(x)

• Use is.nan(x) to detect NaNs, is.infinite(x) to detect infinite elements, and is.finite(x) to detect finite elements.
• Try:
> is.nan(0/0)
> is.na(0/0)
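
The two "Try" calls above both return TRUE, because is.na() treats NaN as missing while is.nan() does not treat NA as NaN. A small sketch on one vector (the name values is arbitrary) shows the four tests side by side:

```r
values <- c(1, NA, 0/0, 1/0)   # a number, NA, NaN, Inf

na_flags <- is.na(values)          # FALSE TRUE  TRUE  FALSE (NA and NaN)
nan_flags <- is.nan(values)        # FALSE FALSE TRUE  FALSE (NaN only)
inf_flags <- is.infinite(values)   # FALSE FALSE FALSE TRUE
fin_flags <- is.finite(values)     # TRUE  FALSE FALSE FALSE
```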


Character vectors
• Character strings are entered using either matching double (") or single (') quotes:
> "Hello world"
> 'Hello world'

• Escape sequences: \n, newline, \t, tab and \b, backspace. See ?Quotes for a full list.
• Character vectors:
> c("hello","world")
> rep("hello",5)

• paste() function concatenates strings:
> paste("hello","world")
> paste("hello",1:5)
> paste("hello",1:5,"world",sep="_")
> paste("hello",1:5,"world",sep="_",collapse=",")
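
To make the effect of sep and collapse concrete, the sketch below repeats the paste() calls with 1:3 instead of 1:5 (a shortening made here so the results fit on a line) and stores each result under an arbitrary name.

```r
p1 <- paste("hello", "world")                    # "hello world"
p2 <- paste("hello", 1:3)                        # "hello 1" "hello 2" "hello 3"
p3 <- paste("hello", 1:3, "world", sep = "_")    # "hello_1_world" "hello_2_world" "hello_3_world"

# collapse joins the elements of the result into a single string
p4 <- paste("hello", 1:3, "world", sep = "_", collapse = ",")
```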


Subsetting
> x<-c(10,20,30,40,50)

• By index vector:
> x[3]
[1] 30
> x[2:4]
[1] 20 30 40

• Logical conditions:
> x[x>30]
[1] 40 50

• Excluding values:
> x[-c(1,3)]
[1] 20 40 50


Exercise
• Detect the missing values in the following vector and replace them with the mean of the observed values.

X=[1, 2, 3, 4, ?, 6, 7, ?]

Answer:

X=[1, 2, 3, 4, 3.833, 6, 7, 3.833]


Solution
> x<-c(1,2,3,4,NA,6,7,NA)
> mean(x)
[1] NA
> x.nomiss<-x[!is.na(x)]
> x.nomiss
[1] 1 2 3 4 6 7
> x[is.na(x)]<-mean(x.nomiss)
> x
[1] 1.000 2.000 3.000 4.000 3.833 6.000 7.000 3.833

Alternatively:
> mean(x,na.rm = T)
[1] 3.833333
> x[is.na(x)]<-mean(x,na.rm = T)


Objects
• Object – any entity which R operates on.
• Class of an object:
• numeric
• logical
• character
• list
• matrix
• array
• factor
• data.frame
• …


Objects
• Creating objects: class_name(length)
> numeric(3)
[1] 0 0 0
> numeric()
numeric(0)

• Testing an object: is.class_name(object)

> x<-c(1,2,3)
> is.numeric(x)
[1] TRUE

• Coercing an object to be of a different class: as.class_name(object)

> as.character(x)
[1] "1" "2" "3"


Factors
• factor – a vector object used to group components of other vectors of the same length.
• Example:

x   2.0  1.5  3.8  1.3  4.2  7.1  5.5  2.9
c   g1   g1   g2   g1   g2   g2   g1   g2

• In R:
> c<-factor(c("g1","g1","g2","g1","g2","g2","g1","g2"))
> c
[1] g1 g1 g2 g1 g2 g2 g1 g2
Levels: g1 g2

• To know the labels:
> levels(c)
[1] "g1" "g2"


Factors
• Example (cont.):
> x<-c(2,1.5,3.8,1.3,4.2,7.1,5.5,2.9)

• tapply – apply a function to each group of values defined by a factor

tapply(vector, factors, function)

> tapply(x, c, mean)
   g1    g2
2.575 4.500


Arrays
• Array – a multiply subscripted collection of data entries.
• Creating an array: array(data=vector, dim=dimension_vector)
• Example:
> array(data=1:24,dim=c(3,4,2))
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24


Matrices
• Matrix – 2D array.
• It can be created with the array() function or with: matrix(data=vector, nrow, ncol)
• Example:
> matrix(data=1:12,nrow=3,ncol=4)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

• is.matrix and is.array can be used to test whether an object is a matrix or an array.

• as.matrix and as.array coerce an object to be a matrix or an array, respectively.


Indexing arrays and matrices

> x <- array(1:20, dim=c(4,5))
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

• To retrieve an element:
> x[2,3]
[1] 10

• And a row:
> x[2,]
[1] 2 6 10 14 18


Indexing arrays and matrices (cont.)

• Using a two-column array of (row, column) indexes to retrieve values:
> y<-array(c(2:4,2:3),dim=c(3,2)) #the 5 values are recycled to fill the 6 cells
> y
     [,1] [,2]
[1,]    2    2
[2,]    3    3
[3,]    4    2
> x[y]
[1]  6 11  8

• Or, to assign a value:

> x[y]<-NA
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2   NA   10   14   18
[3,]    3    7   NA   15   19
[4,]    4   NA   12   16   20


Operations with arrays

• Arithmetic operations – are done element-wise.
> X<-array(1:12,c(3,4)); Y<-array(12:1,c(3,4))
> 2*X+Y
     [,1] [,2] [,3] [,4]
[1,]   14   17   20   23
[2,]   15   18   21   24
[3,]   16   19   22   25

• Mixed vector and array arithmetic (recycling rule): Any short vector operands are extended by recycling their values until they match the size of any other operands.

> z<-c(2,3)
> z*X
     [,1] [,2] [,3] [,4]
[1,]    2   12   14   30
[2,]    6   10   24   22
[3,]    6   18   18   36
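
The z*X result can look surprising until the recycling is written out by hand. The sketch below, using the same X and z, repeats z down the columns of X (R stores arrays in column-major order) and confirms that this matches z * X. The names z_full and manual are introduced here for illustration.

```r
X <- array(1:12, c(3, 4))
z <- c(2, 3)

# Recycling repeats z until it has as many elements as X: 2 3 2 3 ...
z_full <- rep(z, length.out = length(X))

# Multiply element by element in storage (column-major) order,
# then restore the 3 x 4 shape.
manual <- array(z_full * as.vector(X), dim = dim(X))

same <- isTRUE(all.equal(z * X, manual))
```

Because 12 is a multiple of 2, no warning is raised; when the longer length is not a multiple of the shorter one, R still recycles but issues a warning.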


Matrix operations
Operation                       Function   Example
Matrix multiplication           %*%        X %*% Y
Matrix inversion                solve()    solve(X)
Linear equation                 solve()    solve(X,b)
Transpose of a matrix           t()        t(X)
Eigenvalues and eigenvectors    eigen()    eigen(X)
Binding matrices, column-wise   cbind()    cbind(X,Y,Z)
Binding matrices, row-wise      rbind()    rbind(X,Y,Z)


Exercise
• Create a 3x3 random matrix A and a random vector b with 3 elements.
• Solve the linear system Ax = b:

1. By using solve(A,b)
2. By estimating the inverse of A (using solve) and then computing x = A⁻¹b

• Compute the eigenvalues and eigenvectors of A.
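
The slides give no solution for this exercise; one possible sketch follows. It takes the system in the form A x = b, which is the form solve(A, b) solves, draws the random values with rnorm() (an arbitrary choice), and fixes a seed only so that the run is reproducible.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
A <- matrix(rnorm(9), nrow = 3)    # random 3 x 3 matrix
b <- rnorm(3)                      # random vector with 3 elements

x1 <- solve(A, b)                  # 1. solve A x = b directly
x2 <- solve(A) %*% b               # 2. via the explicit inverse (a 3 x 1 matrix)

agree <- isTRUE(all.equal(x1, as.vector(x2)))

eig <- eigen(A)                    # eig$values and eig$vectors
```

For a non-symmetric matrix such as this one, the eigenvalues returned by eigen() may be complex.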


Lists
• List - an ordered collection of objects.
> Lst<-list(name="Fred", wife="Mary", no.children=3,
+ child.ages=c(4,7,9))

> Lst
$name
[1] "Fred"

$wife
[1] "Mary"

$no.children
[1] 3

$child.ages
[1] 4 7 9

Lists
• Accessing the components:
> Lst[[1]]
[1] "Fred"
> Lst$wife
[1] "Mary"

> Lst$wife<-"Liz"

> Lst[["no.children"]]
[1] 3
> Lst[["no.children"]]<-4

• Adding components to the list:

> Lst$pet<-"dog"
> Lst[["occupation"]]<-"Teacher"

Lists
• Getting the component names:
> names(Lst)
[1] "name" "wife" "no.children" "child.ages" "pet" "occupation"

• Deleting a component:
> Lst$occupation<-NULL

• Concatenating lists:
List.ABC<-c(list.A, list.B, list.C)


Data frames
• A data frame (data.frame) is a list with the restriction that components must be vectors (numeric, character, or logical) of the same length.
• It is the most suitable R object for datasets.

var1 var2 var3
   1   -4   -1
   2   -3    0
   3   -2    1
   4   -1    2

• In R:
> dat1<-data.frame(var1=1:4,var2=-4:-1,var3=-1:2)


Data frames
• It is still a list, so:

> dat1[["var1"]] #or dat1[[1]] or dat1$var1

[1] 1 2 3 4

> dat1$var4<-5:8
> dat1
  var1 var2 var3 var4
1    1   -4   -1    5
2    2   -3    0    6
3    3   -2    1    7
4    4   -1    2    8

> names(dat1)
[1] "var1" "var2" "var3" "var4"

> dat1$var4<-NULL #drops "var4"


Data frames
• To access a particular row:
> dat1[2,]
var1 var2 var3
2 2 -3 0

• Subsetting:
> dat1[dat1$var2>=-2,c("var1","var3")]
var1 var3
3 3 1
4 4 2

• Or, by using the function subset:

> subset(dat1, var2>=-2, select=c(var1,var3))
var1 var3
3 3 1
4 4 2

Accessing built-in datasets

• Around 100 datasets are supplied with R.
> data()

• All the datasets supplied with R are available directly by name.

> mtcars

• To see the first or last parts of a data frame (or any other object): head() or tail(), respectively.
• Or, if you are in RStudio: View()
> View(mtcars)

• Number of rows or columns: nrow() or ncol().

• Row names: row.names()
> row.names(mtcars)
> row.names(mtcars)<-c("dfhdf","asdasd",…)


Exercise
• From the ‘mtcars’ dataset, estimate the average mpg (miles per gallon) by number of cylinders.


Solution
• Let’s have a look at the mtcars dataset:
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

• Create a factor of the ‘cyl’ variable

> cyl.f<-factor(mtcars$cyl)
> cyl.f
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
Levels: 4 6 8

• Use tapply to estimate the averages

> tapply(mtcars$mpg, cyl.f, mean)
       4        6        8
26.66364 19.74286 15.10000
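
Two variations on the solution above, offered as alternatives rather than as the slides' method: tapply() converts a non-factor grouping vector to a factor itself, so the explicit factor() step can be skipped, and base R's aggregate() returns the same group means as a data frame.

```r
# tapply() coerces the grouping vector to a factor internally.
by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

# aggregate() with a formula gives one row per group.
by_cyl_df <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
```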


Exporting data to files

• write.table()

write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")

• Example:
> write.table(dat1, file="file1.txt", sep="\t", row.names=F)

• write.csv()

> write.csv(dat1, file="file2.csv")


Reading data from files

• read.table()
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

• read.csv()
read.csv(file, header = TRUE, sep = ",", quote = "\"",
dec = ".", fill = TRUE, comment.char = "")

• read.delim()
read.delim(file, header = TRUE, sep = "\t", quote = "\"",
dec = ".", fill = TRUE, comment.char = "")
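
A minimal round trip may help tie the writing and reading functions together. The sketch below writes a small data frame to a temporary file and reads it back; the data frame contents and the variable names are made up for the example.

```r
dat <- data.frame(var1 = 1:4, var2 = c("a", "b", "c", "d"))

path <- tempfile(fileext = ".csv")               # temporary file
write.csv(dat, file = path, row.names = FALSE)   # omit the row-name column

# stringsAsFactors = FALSE keeps var2 as a character vector
dat_back <- read.csv(path, stringsAsFactors = FALSE)
unlink(path)                                     # delete the temporary file
```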


R Packages
• All R functions and datasets are stored in packages.

• The standard (or base) packages are considered part of the R source code.

• To see installed packages: library()

• To load a particular package: library(package_name)

• To install a new package: install.packages(new_packages)

• To update packages: update.packages()

• To see loaded packages: search()

• Contributed packages (CRAN, Bioconductor, Omegahat)


The tidyverse
• The tidyverse is a collection of R packages designed for data science.
• An R package is a collection of functions, data, and documentation that extends the capabilities of base R.
• tidyverse includes the two packages that we will learn today:
  • ggplot2 package for data visualisation
  • dplyr package for data transformation
• You can install the complete tidyverse with a single line of code:

install.packages("tidyverse")

The tidyverse
• You will not be able to use the functions, objects, and help files in a package until you load it with library().
• Once you have installed a package, you can load it with the library() function:

library(tidyverse)

The startup message tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages.

ggplot2
• R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile.
• ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs.


Do cars with big engines use more fuel than cars with small engines?

• Data (data frame): ggplot2::mpg

• A data frame is a rectangular collection of variables (in the columns) and observations (in the rows)
• mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.

Open its help page by running:
?mpg

[Figure: scatterplot of hwy (fuel efficiency) against displ (engine size), showing that cars with big engines use more fuel.]
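
The plotting code for this question did not survive in the text. A minimal sketch that draws a scatterplot of the two variables named above is given here; it assumes the ggplot2 package is installed and is not necessarily the exact call used in the lecture.

```r
library(ggplot2)

# Scatterplot of engine size (displ) against fuel efficiency (hwy).
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

print(p)
```

The downward trend in the points is what supports the conclusion that cars with big engines use more fuel.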


Geometric objects
• ggplot2 provides over 30 geoms, and extension packages provide even more (see https://www.ggplot2-exts.org).
• The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at http://rstudio.com/cheatsheets.
• To learn more about any single geom, use help:
?geom_smooth

dplyr
• Another core member of the tidyverse

install.packages("nycflights13")
library(nycflights13)
library(tidyverse)

• To explore the basic data manipulation verbs of dplyr, we will use

nycflights13::flights
• This data frame contains all 336,776 flights that departed from New York City in 2013.
• The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights

nycflights13::flights

• Data types:
• int : integers.
• dbl : doubles, or real numbers.
• chr : character vectors, or strings.
• dttm : date-times (a date + a time).
• lgl : logical (TRUE or FALSE).
• fctr : factors, which R uses to represent categorical variables with fixed possible values.
• date : dates.

dplyr basics
• dplyr has five key functions that allow you to solve the vast majority of your data manipulation challenges:
  • Pick observations by their values (filter()).
  • Reorder the rows (arrange()).
  • Pick variables by their names (select()).
  • Create new variables with functions of existing variables (mutate()).
  • Collapse many values down to a single summary (summarise()).
• These can all be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
• These six functions provide the verbs for a language of data manipulation.


dplyr examples
To filter flights on 1st of January:
dset <- flights %>%
  filter(month == 1, day == 1)

… or on 1st of January or 1st of December:
dset <- flights %>%
  filter(month == 1 | month == 12, day == 1)

… or equivalently:
dset <- flights %>%
  filter(month %in% c(1,12), day == 1)


dplyr examples
• … and selecting year, month, and day columns only:
dset <- flights %>%
  filter(month %in% c(1,12), day == 1) %>%
  select(year, month, day)

… or dropping those columns:
dset <- flights %>%
  filter(month %in% c(1,12), day == 1) %>%
  select(-c(year, month, day))


dplyr examples
• Adding a new column:
dset <- flights %>%
  mutate(gain = arr_delay - dep_delay)

• Using “group_by” and “summarise” to aggregate rows:
dset <- flights %>%
  group_by(year, month, day) %>%
  summarise(delay = mean(dep_delay, na.rm = T))
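
The same group_by() and summarise() pattern can be tried without installing nycflights13. The sketch below applies it to the built-in mtcars data and reproduces the average mpg per number of cylinders computed earlier with tapply(). It assumes the dplyr package is installed; the column names mean_mpg and n_cars are chosen here.

```r
library(dplyr)

mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE),  # average mpg in each group
            n_cars = n())                        # number of rows in each group
```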


Files and directories

• getwd() – Get the absolute path of the current working directory.
> getwd()

• setwd() – To set the working directory.

> setwd("/home/joy/Desktop/")

• list.files() – character vector of the names of files in a particular directory.

> list.files()
> list.files("/home/joy/Desktop/")

• list.dirs() – for directories.

> list.dirs()
> list.dirs("/home/joy/Desktop/", recursive=F)


Functions

[Diagram: Arguments → Function → Value (R object)]

• function – R object that can be used to do a particular operation or process.

• It is defined as:
> name <- function(arg_1, arg_2, ...) expression

• Example:
> pow<-function(x,ex){
+ x^ex
+}
> pow(4,2)
[1] 16


Using a text editor

• In RStudio:
File -> New File -> R Script
• R scripts – To write one or more functions, and/or pieces of code.
• Suggested file extension for R scripts: .R
• To run the code: source("file_name.R")
• Example: script_pow.R

pow<-function(x, ex){
  x^ex
}

• Then:
> source("script_pow.R")


Functions
• Returned value – the value of the last expression evaluated in the function.
• return(value) – to explicitly indicate the value to return.
• Example:
pow<-function(x, ex){
  res<-x^ex
  return(res)
}

• Named arguments – when arguments are passed by name, they may be given in any order:

> pow(x=4, ex=2)
> pow(ex=2, x=4)
> pow(4, 2) # x<-4, ex<-2
> pow(2, 4) # x<-2, ex<-4


Functions
• Default values – they can be defined along with the arguments:
• Example:
pow<-function(x, ex=2){
res<-x^ex
return(res)
}
> pow(4)
[1] 16

• Assignments within functions – are local and temporary and are lost after exit from the function.
> res<-3
> pow(2,4)
> res
[1] 3


Conditional execution
• if statements: if (expr_1) expr_2 else expr_3
• Example:
if(x>0) {
p<- 1
} else if(x<0) {
p<- -1
} else {
p<- 0
}

• ifelse – vectorised version of if/else.

> x<-c(-2,3,1,0,-1)
> ifelse(x>=0,1,-1)
[1] -1 1 1 1 -1


Loops
• for loops: for (name in expr_1) expr_2
• name – the loop variable
• expr_1 – vector expression (often a sequence)
• expr_2 – expression repeatedly evaluated.
• Example (with x<-c(-2,3,1,0,-1), as defined for the ifelse example):

> y<-numeric()
> for(ix in seq_along(x)){
+ if(x[ix]>0) y[ix]<- 1
+ else if(x[ix]<0) y[ix]<- -1
+ else y[ix]<- 0
+ }
> y
[1] -1 1 1 0 -1
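
For comparison, the loop above can be set against two vectorised ways of getting the same result. This is a sketch with names (y_loop, y_ifelse, y_sign) chosen here; it also shows seq_along(x), which stays correct when x is empty, whereas 1:length(x) would then produce c(1, 0).

```r
x <- c(-2, 3, 1, 0, -1)

# Loop version, with the result vector allocated up front.
y_loop <- numeric(length(x))
for (ix in seq_along(x)) {
  if (x[ix] > 0) {
    y_loop[ix] <- 1
  } else if (x[ix] < 0) {
    y_loop[ix] <- -1
  } else {
    y_loop[ix] <- 0
  }
}

# Vectorised equivalents: nested ifelse(), or the built-in sign().
y_ifelse <- ifelse(x > 0, 1, ifelse(x < 0, -1, 0))
y_sign <- sign(x)
```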


Apply a function over a list

• lapply – Apply a function to the elements of a list (or vector)
• Example:
> lst1<-list(x=c(1,2,3,4,NA), y=c(10,20), z=c(100,200,NA,300))
> res.lst1<-lapply(lst1, mean, na.rm=T)
> res.lst1
$x
[1] 2.5

$y
[1] 15

$z
[1] 200

• sapply – like lapply, but attempts to simplify to a vector or matrix if possible.

> res2.lst1<-sapply(lst1, mean, na.rm=T)
> res2.lst1
x y z
2.5 15.0 200.0


Exercise
• Write a function (you may need more than one) that imputes missing values in a dataset. It should support two different imputation methods: mean and median. “mean” should be the default method.

[Diagram: miss.imp takes a dataset and a method (“mean” or “median”) and returns a dataset.]
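
The slides give no solution for this exercise; one possible sketch is shown below. It assumes the dataset is a data frame and that only numeric columns should be imputed, with other columns returned unchanged. The name miss.imp comes from the exercise, while impute_vector is a helper name introduced here.

```r
# Hypothetical helper: impute the NAs of one numeric vector.
impute_vector <- function(v, method = c("mean", "median")) {
  method <- match.arg(method)          # "mean" is the default
  if (!is.numeric(v)) return(v)        # leave non-numeric columns untouched
  fill <- if (method == "mean") {
    mean(v, na.rm = TRUE)
  } else {
    median(v, na.rm = TRUE)
  }
  v[is.na(v)] <- fill
  v
}

# Apply the helper to every column, keeping the data frame structure.
miss.imp <- function(dataset, method = c("mean", "median")) {
  method <- match.arg(method)
  dataset[] <- lapply(dataset, impute_vector, method = method)
  dataset
}

# Usage on a small made-up data frame
d <- data.frame(a = c(1, NA, 3), b = c(1, 2, NA), g = c("x", NA, "z"))
d_mean <- miss.imp(d)
d_median <- miss.imp(d, method = "median")
```

Using match.arg() means any method other than "mean" or "median" raises an error instead of being silently ignored.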


Statistical models in R
• Linear regression

• R function for linear regression: lm()


R formulae
• R uses a formula syntax to specify the form of many statistical models (and others) :
response ~ predictor_variables

• ~ operator – is read as “is modeled as a function of”

• For example:
Y = β0 + β1X + ε
• is formulated in R as:
Y~X

• And:
Y = β0 + β1X + β2Z + ε
• As:
Y~X+Z


Example
• “women” data
• Weight is modeled as a function
of Height:

Weight = β0 + β1Height + ε

> dat<-women
> mod<-lm(weight~height, dat)
> dat$pred.weight<-predict(mod)

> plot(dat$height,dat$weight,ylab="Weight",xlab="Height")
> lines(dat$height,dat$pred.weight, col=2)
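
To connect the fitted object with the model equation, the estimates of β0 and β1 can be read from it with coef(). The sketch below refits the same model on the built-in women data; the variable names beta, pred and r2 are chosen here.

```r
mod <- lm(weight ~ height, data = women)

beta <- coef(mod)              # named vector: (Intercept) = β0, height = β1
pred <- predict(mod)           # fitted weights, one per observation
r2 <- summary(mod)$r.squared   # proportion of variance explained

print(beta)
```

With this dataset the slope is 3.45, meaning the fitted weight rises by 3.45 units for each additional unit of height.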


R formulae

For example, the following model:

can be represented by the following three formulae:


R formulae
Likewise, this model (in which the three-way interaction has been omitted):

is represented as:

“.” can be used to indicate “include all variables”

For example, if X,Z, and W are all variables in a dataset, then:
Y~X+Z+W
is equivalent to:

Y~.
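
The model equations and formulae on the two preceding slides were images and are not reproduced in the text. As a general illustration of R's interaction operators (not a reconstruction of those slides), the sketch below asks R which terms a formula expands to: * crosses variables, ^ limits the interaction order, : names a single interaction, and - removes a term. The helper term_labels is a name introduced here.

```r
# Hypothetical helper: the term labels R derives from a formula (no data needed).
term_labels <- function(f) attr(terms(f), "term.labels")

full_star  <- term_labels(Y ~ X * Z * W)           # main effects plus all interactions
full_power <- term_labels(Y ~ (X + Z + W)^3)       # same terms, written with ^
no_3way    <- term_labels(Y ~ X * Z * W - X:Z:W)   # three-way interaction removed

print(full_star)
print(no_3way)
```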


Problem – predicting medical expenses using linear regression

• In order for an insurance company to make money, it needs to collect more in yearly premiums than it spends on medical care for its beneficiaries.
• The goal of this analysis is to use patient data to estimate the average medical care expenses for segments of the insured population.

• STEP 1 – collecting data.

• STEP 2 – exploring and preparing the data.
• STEP 3 – training a model on the data.
• STEP 4 – evaluating model performance.


Step 1 – collecting data

• insurance.csv (simulated data) – 1338 examples of beneficiaries currently enrolled in the insurance plan.
• Variables:
• age: age of the beneficiary.
• sex: gender, either male or female.
• bmi: body mass index.
• children: number of children
• smoker: yes or no, depending on whether the insured regularly smokes tobacco.
• region: place of residence in the US (northeast, southeast, southwest, or northwest).
• charges: medical care expenses.

• Import dataset into R environment.

• Tip:
• read.csv, argument stringsAsFactors.


Step 2 – exploring the data

• Size of the dataset?
• Is the dataset formatted as expected (numeric variables, factors)?
• Indicate number of male/female, smokers/non-smokers.
• Are the data evenly divided among the regions?
• Would you say that the response variable approximately follows a normal distribution?
• Determine how the independent variables are related to the dependent variable and each other.

• Tips:
• Functions: nrow(), ncol(), is.factor(), str(), table(), cor(), hist(), plot()


Step 3 – training a model on the data

• Fit a linear regression model to data.
• Tips:
• Use lm() to fit the model and predict() to make predictions.

Call:
lm(formula = charges ~ ., data = dset)

Coefficients:
    (Intercept)              age          sexmale              bmi         children        smokeryes
       -11938.5            256.9           -131.3            339.2            475.5          23848.5
regionnorthwest  regionsoutheast  regionsouthwest
         -353.0          -1035.0           -960.1

• Dummy coding – applied to factors: sexfemale/sexmale, smokerno/smokeryes, regionnortheast/regionnorthwest/regionsoutheast/regionsouthwest.
• Reference categories: sexfemale, smokerno, regionnortheast.
• Thus, holding the other variables constant, males are estimated to have $131.30 less in medical costs each year than females, and smokers cost an average of $23,848.50 more than non-smokers.


Step 4 – evaluating model performance

• Produce a result summary of the fitted model.
> summary(ins_model)

1. Residuals: summary statistics for the errors

2. Stars: statistical significance of each coefficient (a variable is conventionally considered statistically significant when its p-value < 0.05)
3. R-squared: how well the model explains the response.


Step 4 – evaluating model performance

• Plot a histogram of the residuals (errors)
• Write an R function that computes the RMSE between two numeric vectors.
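• RMSE is defined as: $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$, where $\hat{y}_i$ are the estimated values, $y_i$ the true values, and $N$ the number of observations.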

• Estimate the RMSE between the true and the estimated insurance charges.


R code
#Step 1 - Importing "insurance" dataset
dset<-read.csv("insurance.csv", stringsAsFactors=T)

#Step 2
# Dataset size
nrow(dset)
ncol(dset)
# Dataset format
str(dset)
# Number of male/female, smoker/non-smoker.
table(dset$sex)
table(dset$smoker)
# Data distribution among the regions
table(dset$region)
# Distribution of the response variable
summary(dset$charges)
hist(dset$charges) # To confirm the distribution is right-skewed
# Use "cor" function to compute correlation matrix
cor(dset[c("age", "bmi", "children", "charges")])
# or, "plot" for a distributional plot
plot(dset[c("age", "bmi", "children", "charges")])


R code
#Step 3
#Fitting a linear model:
ins_model <- lm(charges ~ age + children + bmi + sex + smoker + region, data = dset)
# or using "."
ins_model <- lm(charges ~ ., data = dset)
ins_model #notice the dummy variables.

#Step 4
summary(ins_model)
#Histogram of the residuals
est.charges<-predict(ins_model)
res<-dset$charges-est.charges
hist(res,breaks=30)
#RMSE function
RMSE<-function(y.est, y.tru){
sqrt(sum((y.est-y.tru)^2)/length(y.est))
}
RMSE(y.est=est.charges,y.tru=dset$charges)

R-Basic Concepts
67 pages
R Programming Slides
No ratings yet
R Programming Slides
73 pages
Prerequis R
No ratings yet
Prerequis R
38 pages
Intr2R Week2 2020
No ratings yet
Intr2R Week2 2020
13 pages
An Introduction To R: Biostatistics 615/815
No ratings yet
An Introduction To R: Biostatistics 615/815
59 pages
Lecture 1
No ratings yet
Lecture 1
42 pages
2 Undefined
No ratings yet
2 Undefined
86 pages
Data Anlytics Using R Notes
No ratings yet
Data Anlytics Using R Notes
14 pages
Introduction To R Chap 2
No ratings yet
Introduction To R Chap 2
30 pages
S24 Stats10 Lab1-1
No ratings yet
S24 Stats10 Lab1-1
8 pages
Vuniq User's Manual
No ratings yet
Vuniq User's Manual
96 pages
Project Report On Electronic Voting Machine Using Fpga Verilog
50% (2)
Project Report On Electronic Voting Machine Using Fpga Verilog
23 pages
R Studio
No ratings yet
R Studio
41 pages
R Session A
No ratings yet
R Session A
107 pages
R For Absolute Beginners - Hands-On R Tutorial: June 2018
No ratings yet
R For Absolute Beginners - Hands-On R Tutorial: June 2018
43 pages
R Intro STAT5000
No ratings yet
R Intro STAT5000
17 pages
STATS LAB Basics of R PDF
No ratings yet
STATS LAB Basics of R PDF
77 pages
Session For R
No ratings yet
Session For R
23 pages
MIS 4.hafta (Introduction To R)
No ratings yet
MIS 4.hafta (Introduction To R)
52 pages
Basic Statistics
No ratings yet
Basic Statistics
66 pages
RBigData NTL
No ratings yet
RBigData NTL
24 pages
Lecture 1
No ratings yet
Lecture 1
35 pages
COBOL IMS DB Sample Program
No ratings yet
COBOL IMS DB Sample Program
9 pages
R Programming Course Notes: Overview and History of R
No ratings yet
R Programming Course Notes: Overview and History of R
22 pages
Introduction To Analytics and R File
No ratings yet
Introduction To Analytics and R File
29 pages
R Unit1
No ratings yet
R Unit1
26 pages
E Commerce
No ratings yet
E Commerce
24 pages
11-3085 Energy Analyzer Jump Start
No ratings yet
11-3085 Energy Analyzer Jump Start
26 pages
Chapter 1 Introduction To R
No ratings yet
Chapter 1 Introduction To R
33 pages
Statistical Lab Using R-Programming Lab Manual and Workbook: Department of Mathematics
No ratings yet
Statistical Lab Using R-Programming Lab Manual and Workbook: Department of Mathematics
58 pages
Cgaxis Models Volume 15
100% (1)
Cgaxis Models Volume 15
20 pages
Introduction To R PDF
No ratings yet
Introduction To R PDF
56 pages
Part I: Introductory Materials: Introduction To R
No ratings yet
Part I: Introductory Materials: Introduction To R
25 pages
Technical Publications: Invenia ABUS 2.0 Version 2.0.x Dicom Conformance Statement
No ratings yet
Technical Publications: Invenia ABUS 2.0 Version 2.0.x Dicom Conformance Statement
59 pages
Introduction To R
No ratings yet
Introduction To R
21 pages
R Programming
No ratings yet
R Programming
1 page
Homo Deus A Brief History of Tomorrow
No ratings yet
Homo Deus A Brief History of Tomorrow
19 pages
Rintro
No ratings yet
Rintro
14 pages
Introduction To R
No ratings yet
Introduction To R
20 pages
Assignment 2: Introduction To R: Text Like This Will Be Problems For You To Do and Turn In. (There Are 7 in All.)
No ratings yet
Assignment 2: Introduction To R: Text Like This Will Be Problems For You To Do and Turn In. (There Are 7 in All.)
15 pages
100-00070 Rev A SpectroVisc Q3000 Series User's Guide
No ratings yet
100-00070 Rev A SpectroVisc Q3000 Series User's Guide
50 pages
Introduction To R: 1 Getting Started
No ratings yet
Introduction To R: 1 Getting Started
14 pages
# Frontend Development Roadmap 2025
No ratings yet
# Frontend Development Roadmap 2025
1 page
History of A I Poster Final
No ratings yet
History of A I Poster Final
1 page
EIS Installation Checklist For The X9-2/L Servers (Solaris/ X64)
No ratings yet
EIS Installation Checklist For The X9-2/L Servers (Solaris/ X64)
30 pages
IRR Fallohara Cup
No ratings yet
IRR Fallohara Cup
23 pages
PLC User - S Manual of Communication Module
No ratings yet
PLC User - S Manual of Communication Module
22 pages
CS687 - Access Control 1 - Spring 2020
No ratings yet
CS687 - Access Control 1 - Spring 2020
41 pages
Tara's Treetop Adventure: Author: Sangeeta Velegar Illustrator: Sarthak Sinha
No ratings yet
Tara's Treetop Adventure: Author: Sangeeta Velegar Illustrator: Sarthak Sinha
15 pages
Crash 2025 06 21 - 01.47.36 FML
No ratings yet
Crash 2025 06 21 - 01.47.36 FML
31 pages
Web Application Development Using Open Source and Java Technologies
No ratings yet
Web Application Development Using Open Source and Java Technologies
99 pages
Session 04 - Tree-Based Methods
No ratings yet
Session 04 - Tree-Based Methods
25 pages
Session 02 - Regression - and - Classification
No ratings yet
Session 02 - Regression - and - Classification
22 pages
Grade 8-COMPUTER 1st Monthly
No ratings yet
Grade 8-COMPUTER 1st Monthly
2 pages
Big Data Computing
No ratings yet
Big Data Computing
57 pages
Big Data Computing: Working With Data Models and Big Data Processing
No ratings yet
Big Data Computing: Working With Data Models and Big Data Processing
46 pages
Knowledge Management Complete Notes
No ratings yet
Knowledge Management Complete Notes
54 pages
Danfoss 3
No ratings yet
Danfoss 3
2 pages
Examenes Corte 1 y 3
No ratings yet
Examenes Corte 1 y 3
40 pages
Rane Sixty-Two Mixer Manual
No ratings yet
Rane Sixty-Two Mixer Manual
44 pages
Session 01 - Introduction
No ratings yet
Session 01 - Introduction
28 pages
Comprehensive Examination
No ratings yet
Comprehensive Examination
25 pages
Session 03 - Neural Networks
No ratings yet
Session 03 - Neural Networks
21 pages
Hands-On Activity: 2. Exploring The Semi-Structured Data Model of JSON
No ratings yet
Hands-On Activity: 2. Exploring The Semi-Structured Data Model of JSON
3 pages
E-Poster Terms and Conditions in 6th FSSM and 3rd Ina Sleep 2021
No ratings yet
E-Poster Terms and Conditions in 6th FSSM and 3rd Ina Sleep 2021
2 pages
F
No ratings yet
F
27 pages
Coulomb, Potencial, Campo, Densidad (V, S, L) 2
No ratings yet
Coulomb, Potencial, Campo, Densidad (V, S, L) 2
13 pages
Hands-On Activity: 3. Exploring The Array Data Model of An Image
No ratings yet
Hands-On Activity: 3. Exploring The Array Data Model of An Image
3 pages
HandsOn 3. Sensor Data
No ratings yet
HandsOn 3. Sensor Data
3 pages
Scraplane: A Blockchain and ML Based System To Facilitate Scrapping of Cars
No ratings yet
Scraplane: A Blockchain and ML Based System To Facilitate Scrapping of Cars
19 pages
Coulomb, Potencial, Campo, Densidad (V, S, L)
No ratings yet
Coulomb, Potencial, Campo, Densidad (V, S, L)
7 pages
Coulomb, Potencial, Campo, Densidad (V, S, L)
No ratings yet
Coulomb, Potencial, Campo, Densidad (V, S, L)
7 pages
Teorema de Divergencia
No ratings yet
Teorema de Divergencia
5 pages
Blockchain and BIM (Building InformationModeling) Progress in Academia and Industry2104.00547
No ratings yet
Blockchain and BIM (Building InformationModeling) Progress in Academia and Industry2104.00547
6 pages
Arjun Resume
No ratings yet
Arjun Resume
3 pages