Rcourse PDF
Jonathan D. Rosenblatt
2019-09-29
Contents
1 Preface
1.1 Notation Conventions
1.2 Acknowledgements
2 Introduction
2.1 What is R?
2.2 The R Ecosystem
2.3 Bibliographic Notes
3 R Basics
3.1 File types
3.2 Simple calculator
3.3 Probability calculator
3.4 Getting Help
3.5 Variable Assignment
3.6 Missing
3.7 Piping
3.8 Vector Creation and Manipulation
3.9 Search Paths and Packages
3.10 Simple Plotting
3.11 Object Types
3.12 Data Frames
3.13 Extraction
3.14 Augmentations of the data.frame class
3.15 Data Import and Export
3.16 Functions
3.17 Looping
3.18 Apply
3.19 Recursion
3.20 Strings
3.21 Dates and Times
3.22 Complex Objects
3.23 Vectors and Matrix Products
3.24 Bibliographic Notes
3.25 Practice Yourself
4 data.table
4.1 Make your own variables
4.2 Join
4.3 Reshaping data
4.4 Bibliographic Notes
4.5 Practice Yourself
6 Linear Models
6.1 Problem Setup
6.2 OLS Estimation in R
6.3 Inference
6.4 Extra Diagnostics
6.5 Bibliographic Notes
6.6 Practice Yourself
12 Plotting
12.1 The graphics System
12.2 The ggplot2 System
12.3 Interactive Graphics
12.4 Other R Interfaces to JavaScript Plotting
12.5 Bibliographic Notes
12.6 Practice Yourself
13 Reports
13.1 knitr
19 RCpp
19.1 Bibliographic Notes
19.2 Practice Yourself
Chapter 1
Preface
This book accompanies BGU’s “R” course, at the department of Industrial Engineering and Management. It has
several purposes:
• Help me organize and document the course material.
• Help students during class so that they may focus on listening and not writing.
• Help students after class, so that they may self-study.
In its current state it is experimental. It can thus be expected to change from time to time, and to include mistakes. I will be enormously grateful to anyone who shares with me any mistakes they find.
I am enormously grateful to Yihui Xie, whose bookdown R package made it possible to easily write a book which has many mathematical formulae and R output.
I hope the reader will find this text interesting and useful.
To reproduce my results you will want to run set.seed(1).
1.2 Acknowledgements
I have consulted many people during the writing of this text. I would like to thank Yoav Kessler1 , Lena Novack2 ,
Efrat Vilenski, Ron Sarafian, and Liad Shekel in particular, for their valuable inputs.
1 https://kesslerlab.wordpress.com/
2 http://fohs.bgu.ac.il/research/profileBrief.aspx?id=VeeMVried
Chapter 2
Introduction
2.1 What is R?
R was not designed to be a bona fide programming language. It is an evolution of the S language, developed at Bell Labs (later Lucent) as a wrapper for the endless collection of statistical libraries they wrote in Fortran.
• R-help3: an immensely active mailing list. Nowadays it is being replaced by the StackExchange sites; look for the R tags on StackOverflow4 and CrossValidated5.
• Books9: An insane number of books have been written on the language. Some are free, some are not.
• Commercial R: being open source and lacking formal support may seem like a problem that would prevent R from being adopted for commercial applications. This void is filled by several very successful commercial versions, such as Microsoft R11, with its accompanying CRAN equivalent called MRAN12, Tibco’s Spotfire13, and others14.
1 https://wrathematics.github.io/2011/08/27/how-much-of-r-is-written-in-r/
2 https://cran.r-project.org/
3 https://www.r-project.org/mail.html
4 http://stackoverflow.com/
5 http://stats.stackexchange.com/
6 https://cran.r-project.org/web/views/
7 https://www.bioconductor.org/
8 https://www.neuroconductor.org/
9 https://www.r-project.org/doc/bib/R-books.html
10 https://groups.google.com/forum/#!forum/israel-r-user-group
11 https://mran.microsoft.com/open/
12 https://mran.microsoft.com/
13 http://spotfire.tibco.com/discover-spotfire/what-does-spotfire-do/predictive-analytics/tibco-enterprise-runtime-for-r-terr
14 https://en.wikipedia.org/wiki/R_(programming_language)#Commercial_support_for_R
• RStudio15 : since its earliest days R came equipped with a minimal text editor. It later received plugins for major
integrated development environments (IDEs) such as Eclipse, WinEdit and even VisualStudio16 . None of these,
however, had the impact of the RStudio IDE. Written completely in JavaScript, the RStudio IDE allows the
seamless integration of cutting edge web-design technologies, remote access, and other killer features, making it
today’s most popular IDE for R.
• CheatSheets17: RStudio curates a list of cheatsheets. It is very useful to print some and keep them around when coding.
• RStartHere18 : a curated list of useful packages.
15 https://www.rstudio.com/products/rstudio/download-server/
16 https://www.visualstudio.com/vs/rtvs/
17 https://www.rstudio.com/resources/cheatsheets/
18 https://github.com/rstudio/RStartHere/blob/master/README.md#import
19 http://www.research.att.com/articles/featured_stories/2013_09/201309_SandR.html?fbid=Yxy4qyQzmMa
20 https://www.youtube.com/watch?v=_hcpuRB5nGs
21 https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2018.01169.x
22 https://blog.revolutionanalytics.com/2017/10/updated-history-of-r.html
Chapter 3
R Basics
We now start with the basics of R. If you have any experience at all with R, you can probably skip this section.
First, make sure you work with the RStudio IDE. Some useful pointers for this IDE include:
• Ctrl+Return(Enter) to run lines from editor.
• Alt+Shift+k for RStudio keyboard shortcuts.
• Ctrl+r to browse the command history.
• Alt+Shift+j to navigate between code sections.
• tab for auto-completion.
• Ctrl+1 to skip to editor.
• Ctrl+2 to skip to console.
• Ctrl+8 to skip to the environment list.
• Ctrl + Alt + Shift + M to select all instances of the selection (for refactoring).
• Code Folding:
– Alt+l collapse chunk.
– Alt+Shift+l unfold chunk.
– Alt+o collapse all.
– Alt+Shift+o unfold all.
• Alt+"-" for the assignment operator <-.
## [1] 15
70*81
## [1] 5670
2**4
## [1] 16
2^4
## [1] 16
log(10)
## [1] 2.302585
log(16, 2)
## [1] 4
log(1000, 10)
## [1] 3
dbinom(x=3, size=10, prob=0.5) # Compute P(X=3) for X~B(n=10, p=0.5)
## [1] 0.1171875
Notice that arguments do not need to be named explicitly:
dbinom(3, 10, 0.5)
## [1] 0.1171875
The Binomial cumulative distribution function (CDF):
pbinom(q=3, size=10, prob=0.5) # Compute P(X<=3) for X~B(n=10, p=0.5)
## [1] 0.171875
The Binomial quantile function:
qbinom(p=0.1718, size=10, prob=0.5) # For X~B(n=10, p=0.5) returns k such that P(X<=k)=0.1718
## [1] 3
Generate random variables:
rbinom(n=10, size=10, prob=0.5)
## [1] 4 4 5 7 4 7 7 6 6 3
R has many built-in distributions. Their names may change, but the prefixes do not:
• d prefix for the distribution function.
• p prefix for the cumulative distribution function (CDF).
• q prefix for the quantile function (i.e., the inverse CDF).
• r prefix to generate random samples.
Demonstrating this idea, using the CDF of several popular distributions:
• pbinom() for the Binomial CDF.
• ppois() for the Poisson CDF.
• pnorm() for the Gaussian CDF.
• pexp() for the Exponential CDF.
For more information see ?distributions.
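For example, the same four prefixes applied to the Gaussian distribution (a small sketch):
dnorm(0)     # density of a N(0,1) at 0
pnorm(1.96)  # CDF: P(X<=1.96) for X~N(0,1)
qnorm(0.975) # quantile function, i.e., the inverse CDF, at 0.975
rnorm(n=3)   # three random draws from a N(0,1)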
If you don’t know the name of the function you are looking for, search local help files for a particular string:
??binomial
help.search('dbinom')
Or load a menu where you can navigate local help in a web-based fashion:
help.start()
If you are familiar with other programming languages you may prefer the = assignment rather than the <- assignment. We recommend you make the effort to change your habits. This is because thinking with <- helps you read your code: it distinguishes between assignments and function arguments. Think of function(argument=value) versus function(argument<-value). It also helps in understanding special assignment operators such as <<- and ->.
Remark. Style: We do not discuss style guidelines in this text, but merely remind the reader that good style is extremely important. When you write code, think of other readers, but also of your future self. See Hadley’s style guide8 for more.
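For instance, assign ten Binomial draws to an object named x (a sketch):
x <- rbinom(n=10, size=10, prob=0.5) # assign ten draws from B(10, 0.5) to x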
To print the contents of an object just type its name
x
## [1] 7 4 6 3 4 5 2 5 7 4
8 http://adv-r.had.co.nz/Style.html
## [1] 7 4 6 3 4 5 2 5 7 4
Alternatively, you can assign and print simultaneously using parentheses.
(x <- rbinom(n=10, size=10, prob=0.5)) # Assign and print.
## [1] 5 5 5 4 6 6 6 3 6 5
Operate on the object
mean(x) # compute mean
## [1] 5.1
var(x) # compute variance
## [1] 0.9888889
hist(x) # plot histogram
R saves every object you create in RAM9 . The collection of all such objects is the workspace which you can inspect
with
ls()
## [1] "x"
or with Ctrl+8 in RStudio.
If you lost your object, you can use ls with a text pattern to search for it
ls(pattern='x')
## [1] "x"
To remove objects from the workspace:
rm(x) # remove variable
ls() # verify
## character(0)
You may think that if an object is removed then its memory is freed. This is almost true, and depends on a negotiation
mechanism between R and the operating system. R’s memory management is discussed in Chapter 15.
9 S and S-Plus used to save objects on disk. Working from RAM has advantages and disadvantages. More on this in Chapter 15.
3.6 Missing
Unlike typical programming tasks, when working with real-life data you may have missing values: measurements that were simply not recorded/stored/etc. R has rather sophisticated mechanisms to deal with missing values. It distinguishes
between the following types:
between the following types:
1. NA: Not Available entries.
2. NaN: Not a number.
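A small sketch of the difference between the two:
vals <- c(1, NA, 0/0) # 0/0 produces NaN
is.na(vals)           # TRUE for both NA and NaN
is.nan(vals)          # TRUE only for NaN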
R tries to defend the analyst, and return an error, or NA when the presence of missing values invalidates the calculation:
missing.example <- c(10,11,12,NA)
mean(missing.example)
## [1] NA
Most functions will typically have an inner mechanism to deal with these. In the mean function, there is an na.rm argument, telling R whether to remove the NAs.
mean(missing.example, na.rm = TRUE)
## [1] 11
A more general mechanism is removing these manually:
clean.example <- na.omit(missing.example)
mean(clean.example)
## [1] 11
3.7 Piping
Because R originates in Unix and Linux environments, it inherits much of their flavor. Piping10 is an idea taken from the Linux shell which allows you to use the output of one expression as the input to another. Piping thus makes code easier to read and write.
Remark. Volleyball fans may be confused with the idea of spiking a ball from the 3-meter line, also called piping11 .
So: (a) These are very different things. (b) If you can pipe, ASA-BGU12 is looking for you!
Prerequisites:
library(magrittr) # load the piping functions
x <- rbinom(n=1000, size=10, prob=0.5) # generate some toy data
Examples
x %>% var() # Instead of var(x)
x %>% hist() # Instead of hist(x)
x %>% mean() %>% round(2) %>% add(10)
The next example13 demonstrates the benefits of piping. The next two chunks of code do the same thing. Try parsing
them in your mind:
# Functional (onion) style
car_data <-
transform(aggregate(. ~ cyl,
data = subset(mtcars, hp > 100),
FUN = function(x) round(mean(x, 2))),
kpl = mpg*0.4251)
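And the piped style, in the spirit of the magrittr vignette this example is taken from (a sketch; multiply_by is a magrittr alias for *), which reads top to bottom and left to right:
# Piped (magrittr) style
car_data <-
  mtcars %>%
  subset(hp > 100) %>%
  aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>%
  transform(kpl = mpg %>% multiply_by(0.4251))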
10 http://ryanstutorials.net/linuxtutorial/piping.php
11 https://www.youtube.com/watch?v=DEaj4X_JhSY
12 http://in.bgu.ac.il/sport/Pages/asa.aspx
13 Taken from http://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html
Tip: RStudio has a keyboard shortcut for the %>% operator. Try Ctrl+Shift+m.
## [1] 12 13 14 15 16 17 18 19 20 21 22 23
x*2
## [1] 20 22 24 26 28 30 32 34 36 38 40 42
x^2
## [1] 100 121 144 169 196 225 256 289 324 361 400 441
sqrt(x)
read.csv # print a function's code by typing its name without parentheses
## function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
## fill = TRUE, comment.char = "", ...)
## read.table(file = file, header = header, sep = sep, quote = quote,
## dec = dec, fill = fill, comment.char = comment.char, ...)
## <bytecode: 0x3e49070>
## <environment: namespace:utils>
Never mind what the function does. Note the environment: namespace:utils line at the end. It tells us that this
function is part of the utils package. We did not need to know this because it is loaded by default. Here are some
packages that I have currently loaded:
search()
Other packages can be loaded via the library function, or downloaded from the internet using the install.packages
function before loading with library. Note that you can easily speedup package download by using multiple CPUs.
Just call options(Ncpus = XXX), where XXX is the number of CPUs you want to use. Run parallel::detectCores()
if you are unsure how many CPUs you have on your machine.
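A sketch of the whole flow (the package name is only an example):
options(Ncpus = parallel::detectCores()) # parallel downloads
install.packages("data.table")           # download and install from CRAN
library(data.table)                      # load, i.e., attach to the search path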
Given an x argument and a y argument, plot tries to present a scatter plot. We call this the x,y syntax. R has another unique syntax to state functional relations. We call y~x the “tilde” syntax, which originates in the work of Wilkinson and Rogers (1973) and was adopted in the early days of S.
plot(y ~ x, type='l') # y~x syntax
The syntax y~x is read as “y is a function of x”. We will prefer the y~x syntax over the x,y syntax since it is easier
to read, and will be very useful when we discuss more complicated models.
Here are some arguments that control the plot’s appearance. We use type to control the plot type, main to control
the main title.
plot(y~x, type='l', main='Plotting a connected line')
We use xlab for the x-axis label, and ylab for the y-axis label.
plot(y~x, type='h', main='Sticks plot', xlab='Insert x axis label', ylab='Insert y axis label')
We use pch to control the point type (pch is an acronym for Plotting CHaracter).
plot(y~x, pch=5) # Point type is set with pch
We use col to control the color, cex (Character EXpansion) for the point size, and abline(a, b) to add the straight line y=a+bx.
plot(y~x, pch=10, type='p', col='blue', cex=4)
abline(3, 0.002)
x<- 1:10
y<- 3 + sin(x)
frame1 <- data.frame(x=x, sin=y)
## x sin
## 1 1 3.841471
## 2 2 3.909297
## 3 3 3.141120
## 4 4 2.243198
## 5 5 2.041076
## 6 6 2.720585
Now using the RStudio Excel-like viewer:
View(frame1)
We highly advise against editing the data this way since there will be no documentation of the changes you made.
Always transform your data using scripts, so that everything is documented.
Verifying this is a data frame:
class(frame1) # the object is of type data.frame
## [1] "data.frame"
Check the dimension of the data
dim(frame1)
## [1] 10 2
Note that checking the dimension of a vector is different from checking the dimension of a data frame.
length(x)
## [1] 10
The length of a data.frame is merely the number of columns.
length(frame1)
## [1] 2
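To avoid this ambiguity, nrow and ncol query the two dimensions explicitly:
nrow(frame1) # number of rows: 10
ncol(frame1) # number of columns: 2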
3.13 Extraction
R provides many ways to subset and extract elements from vectors and other objects. The basics are fairly simple, but not paying attention to the “personality” of each extraction mechanism may cause you a lot of headaches.
For starters, extraction is done with the [ operator. The operator can take vectors of many types.
Extracting an element by its integer index:
frame1[1, 2] # extract the element in the 1st row and 2nd column.
## [1] 3.841471
Extract column by index:
frame1[,1]
## [1] 1 2 3 4 5 6 7 8 9 10
Extract column by name:
frame1[, 'sin']
class(frame1[, 'sin']) # extracts a numeric vector
## [1] "numeric"
class(frame1['sin']) # extracts a data frame
## [1] "data.frame"
class(frame1[,1:2]) # extracts a data frame
## [1] "data.frame"
class(frame1[2]) # extracts a data frame
## [1] "data.frame"
class(frame1[2, ]) # extract a data frame
## [1] "data.frame"
class(frame1$sin) # extracts a column vector
## [1] "numeric"
The subset() function does the same
subset(frame1, select=sin)
subset(frame1, select=2)
subset(frame1, select= c(2,0))
If you want to force the stripping of the class attribute when extracting, try the [[ mechanism instead of [.
a <- frame1[1] # [ extraction
b <- frame1[[1]] # [[ extraction
class(a)==class(b) # objects have differing classes
## [1] FALSE
a==b # objects are element-wise identical
## x
## [1,] TRUE
## [2,] TRUE
## [3,] TRUE
## [4,] TRUE
## [5,] TRUE
## [6,] TRUE
## [7,] TRUE
## [8,] TRUE
## [9,] TRUE
## [10,] TRUE
The different types of output classes cause different behaviors. Compare the behavior of [ on seemingly identical
objects.
frame1[1][1]
## x
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
frame1[[1]][1]
## [1] 1
If you want to learn more about subsetting see Hadley’s guide19 .
## V1 V2 V3 V4
## 1 idnum age gender spnbmd
## 2 1 11.7 male 0.01808067
## 3 1 12.7 male 0.06010929
## 4 1 13.75 male 0.005857545
## 5 2 13.25 male 0.01026393
## 6 2 14.3 male 0.2105263
Oh dear. read.table tried to guess the structure of the input, but failed to recognize the header row. Set it manually with header=TRUE:
tirgul1 <- read.table('data/bone.data', header = TRUE)
head(tirgul1)
Now let’s import the exported file. Being a .csv file, I can use read.csv instead of read.table.
my.data<- read.csv(file=temp.file.name) # import
head(my.data) # verify import
We can now call the read.table function to import text files. If you care about your sanity, see ?read.table before
starting imports. Some notable properties of the function:
• read.table will try to guess column separators (tab, comma, etc.)
• read.table will try to guess if a header row is present.
• read.table will convert character vectors to factors unless told not to using the stringsAsFactors=FALSE
argument.
• The output of read.table needs to be explicitly assigned to an object for it to be saved.
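A small sketch demonstrating these properties with a round trip through a temporary file:
temp.file.name <- tempfile(fileext = '.csv') # a temporary file path
write.csv(frame1, file = temp.file.name, row.names = FALSE) # export
my.data <- read.table(temp.file.name, header = TRUE, sep = ',',
                      stringsAsFactors = FALSE) # be explicit about header, separator, and factors
head(my.data) # verify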
3.15.9 Databases
R does not need to read from text files; it can read directly from a database. This is very useful since it allows the filtering, selecting and joining operations to rely on the database’s optimized algorithms. Then again, if you will only be analyzing your data with R, you are probably better off working from a file, without the database’s overhead.
See Chapter 15 for more on this matter.
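A minimal sketch of what this looks like, assuming the DBI and RSQLite packages are installed:
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:") # a toy in-memory database
dbWriteTable(con, "cars", mtcars)               # push a table into it
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg FROM cars GROUP BY cyl") # aggregate inside the database
dbDisconnect(con)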
3.16 Functions
One of the most basic building blocks of programming is the ability to write your own functions. A function in R, like everything else, is an object, accessible using its name. We first define a simple function that sums its two arguments:
my.sum <- function(x,y) {
return(x+y)
}
my.sum(10,2)
## [1] 12
From this example you may notice that:
• The function function tells R to construct a function object.
• Unlike some programming languages, a period (.) is allowed as part of an object’s name.
• The arguments of the function, i.e. (x,y), need to be named, but we are not required to specify their class. This makes writing functions very easy, but it is also the source of many bugs, and of R’s slowness compared to type-declaring languages (C, Fortran, Java, …).
• A typical R function does not change objects22 but rather creates new ones. To save the output of my.sum we
will need to assign it using the <- operator.
Here is a (slightly) more advanced function:
my.sum.2 <- function(x, y , absolute=FALSE) {
if(absolute==TRUE) {
result <- abs(x+y)
}
else{
result <- x+y
}
result
}
my.sum.2(-10,2,TRUE)
## [1] 8
Things to note:
• if(condition){expression1} else{expression2} does just what the name suggests.
• The function will output its last evaluated expression. You don’t need to use the return function explicitly.
• Using absolute=FALSE sets the default value of absolute to FALSE. This is overridden if absolute is stated
explicitly in the function call.
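Omitting the argument falls back on the default:
my.sum.2(-10,2) # absolute defaults to FALSE, so the result is -8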
21 https://github.com/r-lib/vroom
22 This is a classical functional programming paradigm. If you want an object oriented flavor of R programming, see Hadley’s Advanced
R book23 .
An important behavior of R is the scoping rules. This refers to the way R seeks for variables used in functions. As a
rule of thumb, R will first look for variables inside the function and if not found, will search for the variable values in
outer environments24 . Think of the next example.
a <- 1
b <- 2
x <- 3
scoping <- function(a,b){
a+b+x
}
scoping(10,11)
## [1] 24
3.17 Looping
The real power of scripting is when repeated operations are done by iteration. R supports the usual for, while, and repeat loops. Here is an embarrassingly simple example:
for (i in 1:5){
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
A slightly more advanced example is vector multiplication:
result <- 0
n <- 1e3
x <- 1:n
y <- (1:n)/n
for(i in 1:n){
result <- result+ x[i]*y[i]
}
Remark. Vector Operations: You should NEVER write your own vector and matrix products like in the previous
example. Only use existing facilities such as %*%, sum(), etc.
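For instance, the loop above collapses into a single vectorized expression:
result.vectorized <- sum(x*y) # the same dot product, with no explicit loop
all.equal(result, result.vectorized)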
Remark. Parallel Operations: If you already know that you will need to parallelize your work, get used to working with foreach loops from the foreach package, rather than regular for loops.
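A minimal sketch of such a loop, assuming the foreach and doParallel packages are installed:
library(foreach)
library(doParallel)
registerDoParallel(2) # register two parallel workers
result.par <- foreach(i = 1:n, .combine = '+') %dopar% (x[i]*y[i]) # same sum, split across workers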
3.18 Apply
For applying the same function to a set of elements, there is no need to write an explicit loop. This is such an elementary operation that every programming language provides some facility to apply, or map, a function to all elements of a set. R provides several facilities to perform this. The most basic is lapply, which applies a function over all elements of a list, and returns a list of outputs:
the.list <- list(1,'a',mean) # a list of 3 elements from different classes
lapply(X = the.list, FUN = class) # apply the function `class` to each element
## [[1]]
## [1] "numeric"
##
## [[2]]
24 More formally, this is called Lexical Scoping25 .
## [1] "character"
##
## [[3]]
## [1] "standardGeneric"
## attr(,"package")
## [1] "methods"
sapply(X = the.list, FUN = class) # lapply with cleaned output
What if the function you are using requires some arguments? One useful trick is to create your own function that takes only one argument:
quantile.25 <- function(x) quantile(x,0.25)
sapply(USArrests, quantile.25)
What if you are applying the same function with two lists of arguments? Use mapply. The following will compute a different quantile for each column in the data:
quantiles <- c(0.1, 0.5, 0.3, 0.2)
mapply(quantile, USArrests, quantiles)
Other members of the apply family include:
• sapply: The same as lapply, but tries to arrange the output in a vector or matrix instead of an unstructured list.
• vapply: A safer version of sapply, where the output class is pre-specified.
• apply: For applying over the rows or columns of matrices.
• mapply: For applying functions with more than a single input.
• tapply: For splitting vectors and applying functions on subsets.
• rapply: A recursive version of lapply.
• eapply: Like lapply, only operates on environments instead of lists.
• Map+Reduce: For a Common Lisp26 look and feel of lapply.
• parallel::parLapply: A parallel version of lapply from the package parallel.
• parallel::parLapplyLB: A parallel version of lapply, with load balancing, from the package parallel.
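A small sketch of apply over the margins of a matrix (see the apply entry above):
(m <- matrix(1:6, nrow=2))
apply(m, 1, sum) # row sums (margin 1)
apply(m, 2, sum) # column sums (margin 2)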
3.19 Recursion
The R compiler is really not designed for recursion, and you will rarely need to use it.
See the RCpp Chapter 19 for linking C code, which is better suited for recursion. If you really insist on writing recursions in R, make sure to use the Recall function, which, as the name suggests, recalls the function in which it is placed. Here
is a demonstration with the Fibonacci series.
fib<-function(n) {
if (n <= 2) fn<-1
else fn <- Recall(n - 1) + Recall(n - 2)
return(fn)
}
fib(5)
## [1] 5
26 https://en.wikipedia.org/wiki/Common_Lisp
3.20 Strings
Note: this section is courtesy of Ron Sarafian.
Strings may appear as character vectors, file names, paths (directories), graphing elements, and more.
Strings can be concatenated with the super useful paste function.
a <- "good"
b <- "morning"
is.character(a)
## [1] TRUE
paste(a,b)
## [1] "good morning"
paste(a, b, 1:3, sep='@@@', collapse = '^^^^') # control the separator, and collapse the result into one string
c <- paste(a, b) # c now holds "good morning"
substr(c, start=6, stop=12) <- "evening" # replace characters 6 to 12, giving "good evening"
The grep function is a very powerful tool to search for patterns in text. These patterns are called regular expressions27
(d <- c(a,b,c))
grep("good", d) # the indices of the elements that match "good"
## [1] 1 3
grep("good", d, value=TRUE, ignore.case=TRUE) # return the matching values themselves
## a b c
## "thiszis" "justzan" "example"
strsplit(x, "z") # split x on the letter z
## $a
## [1] "this" "is"
##
## $b
## [1] "just" "an"
##
## $c
## [1] "example"
Some more examples:
nchar(x) # count the number of characters in every element of a string vector.
## a b c
## 7 7 7
toupper(x) # translate characters in character vectors to upper case
## a b c
## "THISZIS" "JUSTZAN" "EXAMPLE"
tolower(toupper(x)) # and vice versa
## a b c
## "thiszis" "justzan" "example"
letters[1:10] # lower case letters vector
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
LETTERS[1:10] # upper case letters vector
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
cat("the sum of", 1, "and", 2, "is", 1+2) # concatenate and print strings and values
3.21.1 Dates
R provides several packages for dealing with date and date/time data. We start with the base package.
R needs to be informed explicitly that an object holds dates. The as.Date function converts values to dates. You can pass it a character, a numeric, or a POSIXct (we’ll soon explain what that is).
29 https://r4ds.had.co.nz/strings.html
30 https://www.linkedin.com/in/ron-sarafian-4a5a95110/
## [1] "character"
start <- as.Date(start)
class(start)
## [1] "Date"
But what if our date is not in the yyyy-mm-dd format? We can tell R what is the character date’s format.
as.Date("14/5/1948", format="%d/%m/%Y")
## [1] "1948-05-14"
as.Date("14may1948", format="%d%b%Y")
## [1] "1948-05-14"
Things to note:
• The format of the date is specified with the format= argument. %d for day of the month, / for separation, %m
for month, and %Y for year in four digits. See ?strptime for more available formatting.
• If it returns NA, then use the command Sys.setlocale("LC_TIME","C")
Many functions are content aware, and adapt their behavior when dealing with dates:
(today <- Sys.Date()) # the current date
## [1] "2019-03-31"
today + 1 # Add one day
## [1] "2019-04-01"
today - start # Difference between dates (a difftime, in days)
3.21.2 Times
Specifying times is similar to dates, only that more formatting parameters are required. The POSIXct is the object
class for times. It expects strings to be in the format YYYY-MM-DD HH:MM:SS. With POSIXct you can also specify
the timezone, e.g., "Asia/Jerusalem".
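For instance (a sketch):
as.POSIXct("1948-05-14 16:00:00", tz="Asia/Jerusalem")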
time1 <- Sys.time()
class(time1)
## [1] "difftime"
Things to note:
• Be careful about DST, because as.POSIXct("2019-03-29 01:30")+3600 will not add 1 hour, but 2 with the
result: [1] "2019-03-29 03:30:00 IDT"
## [1] "2017-01-31"
mdy("January 31st, 2017")
## [1] "2017-01-31"
dmy("31-Jan-2017")
## [1] "2017-01-31"
ymd_hms("2000-01-01 00:00:01")
## [1] "1S"
minutes(c(2,3))
days(5)
And you can also extract and assign the time components:
t
second(t) # extract the seconds component
## [1] 1
second(t) <- 26 # assign a new value to the seconds component
t
Analyzing temporal data is different from merely storing it. If you are interested in time-series analysis, try the tseries, forecast and zoo packages.
## List of 4
## $ : num 7
## $ : chr "hello"
## $ :List of 3
## ..$ a: num 7
## ..$ b: num 8
## ..$ c: num 9
## $ FOO:function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
## fill = TRUE, comment.char = "", ...)
Some (very) advanced users may want a deeper look into objects. Try the lobstr31 package, or the .Internal(inspect(…)) function described here32.
x <- c(7,10)
.Internal(inspect(x))
Definition 3.1 (Matrix Product). The matrix product between an $n \times m$ matrix $A$ and an $m \times p$ matrix $B$ is the $n \times p$ matrix $C$, where
$$c_{i,j} := \sum_{k=1}^{m} a_{i,k} b_{k,j}.$$
Vectors can be seen as single row/column matrices. We can thus use matrix products to define the following:
Definition 3.2 (Dot Product). The dot product, a.k.a. scalar product, or inner product, between row vectors $x := (x_1, \dots, x_n)$ and $y := (y_1, \dots, y_n)$ is defined as the matrix product between the $1 \times n$ matrix $x'$ and the $n \times 1$ matrix $y$:
$$x'y := \sum_i x_i y_i.$$
Definition 3.3 (Outer Product). The outer product between row vectors $x := (x_1, \dots, x_n)$ and $y := (y_1, \dots, y_n)$ is defined as the matrix product between the $n \times 1$ matrix $x$ and the $1 \times n$ matrix $y'$:
$$(xy')_{i,j} := x_i y_j.$$
## [,1]
## [1,] -3.298627
x %*% y # Dot product.
## [,1]
## [1,] -3.298627
crossprod(x,y) # Dot product.
## [,1]
## [1,] -3.298627
crossprod(t(x),y) # Outer product.
## [1] 1 1 1 1 1
(A <- matrix(data = rep(1:5,5), nrow = 5, ncol = 5, byrow = TRUE)) #
## [,1]
## [1,] 15
## [2,] 15
## [3,] 15
## [4,] 15
## [5,] 15
0.5 * A
## [,1]
## [1,] 75
Can I write these functions myself? Yes! But a pure-R implementation will be much slower than %*%:
my.crossprod <- function(x,y){
result <- 0
for(i in 1:length(x)) result <- result + x[i]*y[i]
result
}
x <- rnorm(1e8)
y <- rnorm(1e8)
system.time(a1 <- my.crossprod(x,y))
## [1] TRUE
all.equal(a1,a3)
## [1] TRUE
all.equal(a2,a3)
## [1] TRUE
33 http://stats.stackexchange.com/questions/138/free-resources-for-learning-r
34 https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
35 http://www.christophsax.com/2018/05/15/tsbox/
36 http://www.gastonsanchez.com/r4strings/
37 http://adv-r.had.co.nz/
38 https://github.com/rstudio/RStartHere/blob/master/README.md
39 https://makingnoiseandhearingthings.com/2018/04/19/datasets-for-data-cleaning-practice/
40 https://www.datacamp.com/courses/free-introduction-to-r
41 https://www.r-exercises.com/start-here-to-learn-r/
Chapter 4
data.table
data.table is an excellent extension of the data.frame class. If used as a data.frame, it will look and feel like a data frame. If, however, it is used with its unique capabilities, it will prove faster and easier to manipulate. This is because data.frames, like most R objects, make a copy of themselves when modified. This is known as passing by value1, and it is done to ensure that objects are not corrupted if an operation fails (if your computer shuts down before the operation is completed, for instance). Making copies of large objects is clearly time and memory consuming. A data.table can make changes in place. This is known as passing by reference2, which is considerably faster than passing by value.
Let’s start with importing some freely available car sales data from Kaggle3 .
library(data.table)
library(magrittr)
auto <- fread('data/autos.csv')
dim(auto) # number of rows and columns
## [1] 371824 20
names(auto) # Variable names
## 103.3 Mb
Things to note:
• The import has been done with fread instead of read.csv. This is more efficient, and directly creates a
data.table object.
• The import is very fast.
• The data after import is slightly larger than when stored on disk (in this case). The extra data allows faster
operation of this object, and the rule of thumb is to have 3 to 5 times more RAM4 than file size (e.g.: 4GB RAM
for 1GB file)
• auto has two classes. This means that anything that expects a data.frame can be fed a data.table, and it will just work.
Let’s start with verifying that it behaves like a data.frame when expected.
auto[,2] %>% head
## name
## 1: Golf_3_1.6
## 2: A5_Sportback_2.7_Tdi
## 3: Jeep_Grand_Cherokee_"Overland"
## 4: GOLF_4_1_4__3T\xdcRER
## 5: Skoda_Fabia_1.4_TDI_PD_Classic
## 6: BMW_316i___e36_Limousine___Bastlerfahrzeug__Export
auto[[2]] %>% head
## [1] "Golf_3_1.6"
## [2] "A5_Sportback_2.7_Tdi"
## [3] "Jeep_Grand_Cherokee_\"Overland\""
## [4] "GOLF_4_1_4__3T\xdcRER"
## [5] "Skoda_Fabia_1.4_TDI_PD_Classic"
## [6] "BMW_316i___e36_Limousine___Bastlerfahrzeug__Export"
auto[1,2] %>% head
## name
## 1: Golf_3_1.6
But notice the difference between data.frame and data.table when subsetting with a single, unnamed index. Uhh!
auto[1:3] %>% dim # data.table will extract *rows*
## [1] 3 20
as.data.frame(auto)[1:3] %>% dim # data.frame will extract *columns*
## [1] 371824 3
Just use a comma (,) and be explicit regarding the dimension you are extracting…
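For instance (a sketch):
auto[1:3, ] %>% dim # with an explicit comma: rows
auto[, 1:3] %>% dim # columns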
4 https://en.wikipedia.org/wiki/Random-access_memory
Now let’s do some data.table specific operations. The general syntax has the form DT[i,j,by]. SQL users may
think of i as WHERE, j as SELECT, and by as GROUP BY. We don’t need to name the arguments explicitly. Also, the Tab
key will typically help you to fill in column names.
auto[,vehicleType,] %>% table # Extract a column and tabulate it
## .
## andere bus cabrio coupe kleinwagen
## 37899 3362 30220 22914 19026 80098
## kombi limousine suv
## 67626 95963 14716
auto[vehicleType=='coupe',,] %>% dim # Extract rows
## [1] 19026 20
auto[,gearbox:model,] %>% head # extract a column range
## .
## automatik manuell
## 20223 77169 274432
auto[vehicleType=='coupe' & gearbox=='automatik',,] %>% dim # intersect conditions
## [1] 6008 20
auto[,table(vehicleType),] # uhh? why would this even work?!?
## vehicleType
## andere bus cabrio coupe kleinwagen
## 37899 3362 30220 22914 19026 80098
## kombi limousine suv
## 67626 95963 14716
auto[, mean(price), by=vehicleType] # average price by car group
## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.
## vehicleType V1
## 1: 20124.688
## 2: coupe 25951.506
## 3: suv 13252.392
## 4: kleinwagen 5691.167
## 5: limousine 11111.107
## 6: cabrio 15072.998
## 7: bus 10300.686
## 8: kombi 7739.518
## 9: andere 676327.100
The .N operator is very useful if you need to count the length of the result. Notice where I use it:
auto[,.N,] # .N alone counts all rows
## [1] 371824
auto[,.N, vehicleType] # will count rows by type
## vehicleType N
## 1: 37899
## 2: coupe 19026
## 3: suv 14716
## 4: kleinwagen 80098
## 5: limousine 95963
## 6: cabrio 22914
## 7: bus 30220
## 8: kombi 67626
## 9: andere 3362
You may concatenate results into a vector:
auto[,c(mean(price), mean(powerPS)), by=vehicleType]
## vehicleType V1
## 1: 20124.68801
## 2: 71.23249
## 3: coupe 25951.50589
## 4: coupe 172.97614
## 5: suv 13252.39182
## 6: suv 166.01903
## 7: kleinwagen 5691.16738
## 8: kleinwagen 68.75733
## 9: limousine 11111.10661
## 10: limousine 132.26936
## 11: cabrio 15072.99782
## 12: cabrio 145.17684
## 13: bus 10300.68561
## 14: bus 113.58137
## 15: kombi 7739.51760
## 16: kombi 136.40654
## 17: andere 676327.09964
## 18: andere 102.11154
Use list() instead of c() within data.table commands to get each result in its own column:
auto[,list(mean(price), mean(powerPS)), by=vehicleType]
## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.
## vehicleType V1 V2
## 1: 20124.688 71.23249
## 2: coupe 25951.506 172.97614
## 3: suv 13252.392 166.01903
## 4: kleinwagen 5691.167 68.75733
## 5: limousine 11111.107 132.26936
## 6: cabrio 15072.998 145.17684
## 7: bus 10300.686 113.58137
## 8: kombi 7739.518 136.40654
## 9: andere 676327.100 102.11154
You can add names to your new variables:
auto[,list(Price=mean(price), Power=mean(powerPS)), by=vehicleType]
## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.
## vehicleType Price Power
## 1: 20124.688 71.23249
## 2: coupe 25951.506 172.97614
## 3: suv 13252.392 166.01903
## 4: kleinwagen 5691.167 68.75733
## 5: limousine 11111.107 132.26936
## 6: cabrio 15072.998 145.17684
## 7: bus 10300.686 113.58137
## 8: kombi 7739.518 136.40654
## 9: andere 676327.100 102.11154
You can use .() to replace the longer list() command:
auto[,.(Price=mean(price), Power=mean(powerPS)), by=vehicleType]
## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.
## vehicleType Price Power
## 1: 20124.688 71.23249
## 2: coupe 25951.506 172.97614
## 3: suv 13252.392 166.01903
## 4: kleinwagen 5691.167 68.75733
## 5: limousine 11111.107 132.26936
## 6: cabrio 15072.998 145.17684
## 7: bus 10300.686 113.58137
## 8: kombi 7739.518 136.40654
## 9: andere 676327.100 102.11154
And split by multiple variables:
auto[,.(Price=mean(price), Power=mean(powerPS)), by=.(vehicleType,fuelType)] %>% head
## Warning in gmean(price): The sum of an integer column for a group was more
## than type 'integer' can hold so the result has been coerced to 'numeric'
## automatically for convenience.
## vehicleType fuelType Price Power
## 1: benzin 11820.443 70.14477
## 2: coupe diesel 51170.248 179.48704
auto[,sum(price<1e4),] # Count the prices lower than 10,000
## [1] 310497
auto[,mean(price<1e4),] # Proportion of prices lower than 10,000
## [1] 0.8350644
auto[,.(Power=mean(powerPS)), by=.(PriceRange=price>1e4)]
## PriceRange Power
## 1: FALSE 101.8838
## 2: TRUE 185.9029
Things to note:
• The term price<1e4 creates, on the fly, a binary vector of TRUE=1 / FALSE=0 for prices less than 10,000, and then sums/means this vector. Hence sum is actually a count, and mean is a proportion (count/total).
• Summing all prices lower than 10k is done with the command auto[price<1e4,sum(price),].
You may sort along one or more columns:
auto[order(price), price,] %>% head # Order along price, ascending
## [1] 0 0 0 0 0 0
auto[order(-price), price,] %>% head # Descending
You may apply a function to ALL columns, using a Subset of the Data with the .SD argument:
count.uniques <- function(x) length(unique(x))
auto[,lapply(.SD, count.uniques), vehicleType]
## 2: 8 35 3 51 1 5159
## 3: 8 37 3 61 1 4932
## 4: 8 38 3 68 1 7343
## 5: 8 39 3 82 1 7513
## 6: 7 38 3 70 1 5524
## 7: 8 33 3 63 1 6112
## 8: 8 38 3 75 1 7337
## 9: 8 38 3 41 1 2220
## lastSeen
## 1: 32813
## 2: 16568
## 3: 13367
## 4: 59354
## 5: 65813
## 6: 19125
## 7: 26094
## 8: 50668
## 9: 3294
Things to note:
• .SD is the data subset after splitting along the by argument.
• Recall that lapply applies the same function to all elements of a list. In this example, to all columns of .SD.
If you want to apply a function only to a subset of columns, use the .SDcols argument
auto[,lapply(.SD, count.uniques), by=vehicleType, .SDcols=price:gearbox]
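A single new variable is created, in place and by reference, with the := operator. A sketch, with a made-up variable name:
auto[, pricePerPS := price/powerPS, ] # add one new column without copying the table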
Or create multiple variables at once. The syntax c("A","B") := .(expression1, expression2) is read: “save the list of results from expression1 and expression2 using the vector of names A and B”.
auto[,c('newVar','newVar2'):=.(log(price/powerPS),price^2/powerPS),]
4.2 Join
data.table can be used for joining. A join is the operation of aligning two (or more) data frames/tables along some
index. The index can be a single variable, or a combination thereof.
Here is a simple example of aligning age and gender from two different data tables:
DT1 <- data.table(Names=c("Alice","Bob"), Age=c(29,31))
DT2 <- data.table(Names=c("Alice","Bob","Carl"), Gender=c("F","M","M"))
setkey(DT1, Names)
setkey(DT2, Names)
DT1[DT2,,]
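X[Y] here behaves like a right join: all rows of DT2 are kept, DT1’s columns are filled in where the key matches, and Carl, who has no age record, gets an NA. A sketch of the equivalent explicit merge:
merge(DT1, DT2, by='Names', all.y=TRUE)
##    Names Age Gender
## 1: Alice  29      F
## 2:   Bob  31      M
## 3:  Carl  NA      M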
The mtcars data encodes 11 characteristics of 32 types of automobiles. It is “wide” since the various characteristics
are encoded in different variables, making the data, well, simply wide.
mtcars %>% head
dimnames(mtcars) # the row names are the car models; the column names are the characteristics
## [[1]]
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
##
## [[2]]
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
mtcars$type <- rownames(mtcars)
melt(mtcars, id.vars=c("type")) %>% head
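melt moves the data from wide to long: each row now holds a single (type, variable, value) triplet. The reverse operation, long to wide, is done with dcast. A sketch of a call that would produce a wide view of the ChickWeight data, like the table below (assuming weight as the value column):
dcast(as.data.table(ChickWeight), Chick ~ Time, value.var = 'weight')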
## Chick 0 2 4 6 8 10 12 14 16 18 20 21
## 1 18 39 35 NA NA NA NA NA NA NA NA NA NA
## 2 16 41 45 49 51 57 51 54 NA NA NA NA NA
## 3 15 41 49 56 64 68 68 67 68 NA NA NA NA
## 4 13 41 48 53 60 65 67 71 70 71 81 91 96
## 5 9 42 51 59 68 85 96 90 92 93 100 100 98
## 6 20 41 47 54 58 65 73 77 89 98 107 115 117
## 7 10 41 44 52 63 74 81 89 96 101 112 120 124
## 8 8 42 50 61 71 84 93 110 116 126 134 125 NA
## 9 17 42 51 61 72 83 89 98 103 113 123 133 142
## 10 19 43 48 55 62 65 71 82 88 106 120 144 157
## 11 4 42 49 56 67 74 87 102 108 136 154 160 157
## 12 6 41 49 59 74 97 124 141 148 155 160 160 157
## 13 11 43 51 63 84 112 139 168 177 182 184 181 175
## 14 3 43 39 55 67 84 99 115 138 163 187 198 202
## 15 1 42 51 59 64 76 93 106 125 149 171 199 205
## 16 12 41 49 56 62 72 88 119 135 162 185 195 205
## 17 2 40 49 58 72 84 103 122 138 162 187 209 215
## 18 5 41 42 48 60 79 106 141 164 197 199 220 223
## 19 14 41 49 62 79 101 128 164 192 227 248 259 266
## 20 7 41 49 57 71 89 112 146 174 218 250 288 305
## 21 24 42 52 58 74 66 68 70 71 72 72 76 74
## 22 30 42 48 59 72 85 98 115 122 143 151 157 150
## 23 22 41 55 64 77 90 95 108 111 131 148 164 167
## 24 23 43 52 61 73 90 103 127 135 145 163 170 175
## 25 27 39 46 58 73 87 100 115 123 144 163 185 192
## 26 28 39 46 58 73 92 114 145 156 184 207 212 233
## 27 26 42 48 57 74 93 114 136 147 169 205 236 251
## 28 25 40 49 62 78 102 124 146 164 197 231 259 265
## 29 29 39 48 59 74 87 106 134 150 187 230 279 309
## 30 21 40 50 62 86 125 163 217 240 275 307 318 331
## 31 33 39 50 63 77 96 111 137 144 151 146 156 147
## 32 37 41 48 56 68 80 83 103 112 135 157 169 178
## 33 36 39 48 61 76 98 116 145 166 198 227 225 220
## 34 31 42 53 62 73 85 102 123 138 170 204 235 256
## 35 39 42 50 61 78 89 109 130 146 170 214 250 272
## 36 38 41 49 61 74 98 109 128 154 192 232 280 290
## 37 32 41 49 65 82 107 129 159 179 221 263 291 305
## 38 40 41 55 66 79 101 120 154 182 215 262 295 321
## 39 34 41 49 63 85 107 134 164 186 235 294 327 341
## 40 35 41 53 64 87 123 158 201 238 287 332 361 373
## 41 44 42 51 65 86 103 118 127 138 145 146 NA NA
5 https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
6 https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
7 https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html
8 https://www.r-bloggers.com/intro-to-the-data-table-package/
9 http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/
10 https://www.kaggle.com/orgesleka/used-cars-database
11 https://www.datacamp.com/courses/data-manipulation-in-r-with-datatable
Chapter 5
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a term coined by John W. Tukey1 in his seminal book (Tukey, 1977). It is also
(arguably) known as Visual Analytics, or Descriptive Statistics. It is the practice of inspecting, and exploring your
data, before stating hypotheses, fitting predictors, and other more ambitious inferential goals. It typically includes
the computation of simple summary statistics which capture some property of interest in the data, and visualization.
EDA can be thought of as an assumption free, purely algorithmic practice.
In this text we present EDA techniques along the following lines:
• How do we explore: with summary statistics, or visually?
• How many variables are analyzed simultaneously: univariate, bivariate, or multivariate?
• What type of variable: categorical or continuous?
## gender
## Boy Girl
## 10 12
table(drink)
## drink
## Coffee Coke Sprite Tea Water
## 6 5 3 7 1
table(age)
## age
1 https://en.wikipedia.org/wiki/John_Tukey
## Old Young
## 10 12
If instead of the level counts you want the proportions, you can use prop.table
prop.table(table(gender))
## gender
## Boy Girl
## 0.4545455 0.5454545
library(magrittr)
cbind(gender, drink) %>% head # bind vectors into matrix and inspect (`c` for column)
## gender drink
## [1,] "Boy" "Coke"
## [2,] "Boy" "Coke"
## [3,] "Boy" "Coke"
## [4,] "Boy" "Coke"
## [5,] "Boy" "Coke"
## [6,] "Boy" "Sprite"
table1 <- table(gender, drink) # count frequencies of bivariate combinations
table1
## drink
## gender Coffee Coke Sprite Tea Water
## Boy 2 5 3 0 0
## Girl 4 0 0 7 1
## , , age = Old
##
## drink
## gender Coffee Coke Sprite Tea Water
## Boy 0 3 1 0 0
## Girl 1 0 0 5 0
##
## , , age = Young
##
## drink
## gender Coffee Coke Sprite Tea Water
## Boy 2 2 2 0 0
## Girl 3 0 0 2 1
table.2.2 <- ftable(gender, drink, age) # A human readable table (`f` for Flat).
table.2.2
## Tea 0 0
## Water 0 0
## Girl Coffee 1 3
## Coke 0 0
## Sprite 0 0
## Tea 5 2
## Water 0 1
If you want proportions instead of counts, you need to specify the denominator, i.e., the margins. Think: what is the
margin in each of the following outputs?
prop.table(table1, margin = 1) # every *row* sums to to 1
## drink
## gender Coffee Coke Sprite Tea Water
## Boy 0.20000000 0.50000000 0.30000000 0.00000000 0.00000000
## Girl 0.33333333 0.00000000 0.00000000 0.58333333 0.08333333
prop.table(table1, margin = 2) # every *column* sums to 1
## drink
## gender Coffee Coke Sprite Tea Water
## Boy 0.3333333 1.0000000 1.0000000 0.0000000 0.0000000
## Girl 0.6666667 0.0000000 0.0000000 1.0000000 1.0000000
Definition 5.1 (Average). The mean, or average, of a sample $x := (x_1, \dots, x_n)$, denoted $\bar{x}$, is defined as
$$\bar{x} := n^{-1} \sum x_i.$$
The sample mean is non robust. A single large observation may inflate the mean indefinitely. For this reason, we
define several other summaries of location, which are more robust, i.e., less affected by “contaminations” of the data.
We start by defining the sample quantiles, themselves not a summary of location.
Definition 5.2 (Quantiles). The 𝛼 quantile of a sample 𝑥, denoted 𝑥𝛼 , is (non uniquely) defined as a value above
100𝛼% of the sample, and below 100(1 − 𝛼)%.
We emphasize that sample quantiles are non-uniquely defined. See ?quantile for the 9(!) different definitions that R
provides.
Using the sample quantiles, we can now define another summary of location, the median.
Definition 5.3 (Median). The median of a sample 𝑥, denoted 𝑥0.5 is the 𝛼 = 0.5 quantile of the sample.
Definition 5.4 (Alpha Trimmed Mean). The $\alpha$ trimmed mean of a sample $x$, denoted $\bar{x}_\alpha$, is the average of the sample after removing the $\alpha$ proportion of largest and $\alpha$ proportion of smallest observations.
The simple mean and median are instances of the alpha trimmed mean: $\bar{x}_0$ and $\bar{x}_{0.5}$, respectively.
Here are the R implementations:
x <- rexp(100) # generate some (assymetric) random data
mean(x) # simple mean
## [1] 1.017118
median(x) # median
## [1] 0.5805804
mean(x, trim = 0.2) # alpha trimmed mean with alpha=0.2
## [1] 0.7711528
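A tiny demonstration of the robustness claims above: contaminate the sample with one huge observation and compare the summaries.
x.cont <- c(x, 1e6)      # add a single contaminated observation
mean(x.cont)             # the mean explodes
median(x.cont)           # the median barely moves
mean(x.cont, trim = 0.2) # and neither does the trimmed mean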
Definition 5.5 (Standard Deviation). The standard deviation of a sample $x$, denoted $S(x)$, is defined as
$$S(x) := \sqrt{(n-1)^{-1} \sum (x_i - \bar{x})^2}.$$
Definition 5.6 (MAD). The Median Absolute Deviation from the median, denoted $MAD(x)$, is defined as
$$MAD(x) := c \, |x - x_{0.5}|_{0.5},$$
where $c$ is some constant, typically set to $c = 1.4826$ so that MAD and $S(x)$ have the same large sample limit.
Definition 5.7 (IQR). The Inter Quartile Range of a sample $x$, denoted $IQR(x)$, is defined as
$$IQR(x) := x_{0.75} - x_{0.25}.$$
Here are the R implementations:
sd(x) # standard deviation
## [1] 0.9981981
mad(x) # MAD
## [1] 0.6835045
IQR(x) # IQR
## [1] 1.337731
Definition 5.8 (Yule). The Yule measure of asymmetry, denoted $Yule(x)$, is defined as
$$Yule(x) := \frac{\tfrac{1}{2}(x_{0.75} + x_{0.25}) - x_{0.5}}{\tfrac{1}{2} IQR(x)}.$$
Here is an R implementation
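A minimal sketch of one such implementation, using the quantile function:
yule <- function(x){
  xq <- quantile(x, c(0.25, 0.5, 0.75)) # the three quartiles
  unname(((xq[1]+xq[3])/2 - xq[2]) / ((xq[3]-xq[1])/2))
}
yule(x)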
## [1] 0.2080004
Things to note:
• A perfectly symmetric sample will return 0, because the median will then lie exactly midway between the quartiles.
• It is bounded between -1 and 1 because of the denominator.
Definition 5.9 (Covariance). The covariance between two samples, $x$ and $y$, of the same length $n$, is defined as
$$Cov(x,y) := (n-1)^{-1} \sum (x_i - \bar{x})(y_i - \bar{y}).$$
We emphasize this is not the covariance you learned about in probability classes, since it is not the covariance between
two random variables but rather between two samples. For this reason, some authors call it the empirical covariance,
or sample covariance.
Definition 5.10 (Pearson’s Correlation Coefficient). Pearson’s correlation coefficient, a.k.a. Pearson’s product-moment correlation, or simply, the correlation, denoted $r(x,y)$, is defined as
$$r(x,y) := \frac{Cov(x,y)}{S(x)S(y)}.$$
If you find this definition enigmatic, just think of the correlation as the covariance between 𝑥 and 𝑦 after transforming
each to the unitless scale of z-scores.
Definition 5.11 (Z-Score). The z-scores of a sample $x$ are defined as the mean-centered, scale normalized observations:
$$z_i(x) := \frac{x_i - \bar{x}}{S(x)}.$$
cov(x,y) # covariance between x and y
## [1] 0.0203559
cor(x,y) # correlation between x and y (default is pearson)
## [1] 0.02380989
scale(x) %>% head # z-score of x
## [,1]
## [1,] 1.72293613
## [2,] 0.83367533
## [3,] 0.27703737
## [4,] -1.00110536
## [5,] 0.07671776
## [6,] -0.66044228
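And indeed, the covariance of the z-scores recovers the correlation (up to floating point error):
cov(scale(x), scale(y)) # a 1x1 matrix whose single entry equals cor(x,y)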
Definition 5.12 (Sample Covariance Matrix). Given $n$ observations on $p$ variables, denote $x_{i,j}$ the $i$'th observation of the $j$'th variable. The sample covariance matrix, denoted $\hat{\Sigma}$, is defined as
$$\hat{\Sigma}_{k,l} := (n-1)^{-1} \sum_i (x_{i,k} - \bar{x}_k)(x_{i,l} - \bar{x}_l),$$
where $\bar{x}_k := n^{-1} \sum_i x_{i,k}$. Put differently, the $k,l$'th entry in $\hat{\Sigma}$ is the sample covariance between variables $k$ and $l$.
Remark. Σ̂ is clearly non robust. How would you define a robust covariance matrix?
5.2 Visualization
Summarizing the information in a variable to a single number clearly conceals much of the story in the sample. This
is like inspecting a person using a caricature, instead of a picture. Visualizing the data, when possible, is more
informative.
barplot(table(age))
(Mosaic plot of drink by gender.)
Things to note:
• The proportion of each category is encoded in the width of the bars (more girls than boys here)
• Zero observations are marked as a line.
(Mosaic plot of drink by gender and age.)
When one of the variables is a (discrete) time variable, the plot carries a notion of dynamics in time. For this, see the Alluvial plot in 5.3.1.
If the variables represent a hierarchy, consider a Sunburst Plot:
library(sunburstR)
# read in sample visit-sequences.csv data provided in source
# https://gist.github.com/kerryrodden/7090426#file-visit-sequences-csv
sequences <- read.csv(
system.file("examples/visit-sequences.csv",package="sunburstR")
,header=F
,stringsAsFactors = FALSE
)
sunburst(sequences) # In the HTML version of the book this plot is interactive.
Unlike categorical variables, there are endlessly many ways to visualize continuous variables. The simplest way is to
look at the raw data via the stripchart.
sample1 <- rexp(10)
stripchart(sample1)
Clearly, if there are many observations, the stripchart will be a useless line of black dots. We thus bin them together,
and look at the frequency of each bin; this is the histogram. R’s histogram function has very good defaults to choose
the number of bins. Here is a histogram showing the counts of each bin.
sample1 <- rexp(100)
hist(sample1, freq=T, main='Counts')
[Figure: histogram of sample1 with counts (frequencies) on the y-axis, titled 'Counts']
The bin counts can be replaced with the proportion of each bin using the freq argument.
hist(sample1, freq=F, main='Proportion')
[Figure: histogram of sample1 with densities on the y-axis, titled 'Proportion']
Things to note:
• With freq=F the bar heights are densities, not raw proportions: each height is the bin’s proportion divided by the bin’s width. The heights may thus sum to more than 1; with a constant bin width of 0.5 they sum to 1/0.5 = 2, while the total area still sums to 1.
The bins of a histogram are non overlapping. We can adopt a sliding window approach, instead of binning. This is
the density plot which is produced with the density function, and added to an existing plot with the lines function.
The rug function adds the original data points as ticks on the axes, and is strongly recommended to detect artifacts
introduced by the binning of the histogram, or the smoothing of the density plot.
hist(sample1, freq=F, main='Frequencies')
lines(density(sample1))
rug(sample1)
[Figure: histogram of sample1 ('Frequencies') with an overlaid density estimate and a rug of the raw data]
Remark. Why would it make no sense to make a table, or a barplot, of continuous data?
One particularly useful visualization, due to John W. Tukey, is the boxplot. The boxplot is designed to capture the main phenomena in the data, and simultaneously point to outliers.
boxplot(sample1)
[Figure: boxplot of sample1]
Another way to deal with a massive amount of data points is to emphasize important points and conceal non-important ones. This is the purpose of circle-packing (example from the r-graph gallery2):
[Figure: interactive circle-packing plot of randomly labeled groups]
2 https://www.r-graph-gallery.com/308-interactive-circle-packing/
[Figure: scatter plot of x2 against x1]
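The variables x1 and x2 used in these scatter plots are synthetic; a re-creation in the spirit of the hexbin example further below (the sample size and coefficients are my own assumptions):
set.seed(1) # for reproducibility
x1 <- rexp(1000) # assumed generation, mirroring the hexbin example below
x2 <- 2 * x1 + 4 + rnorm(1000)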
A scatter-plot may be augmented with marginal univariate visualization. See, for instance, the rug function to add
the raw data on the margins:
plot(x2~x1)
rug(x1,side = 1)
rug(x2,side = 2)
[Figure: scatter plot of x2 against x1 with rugs on both margins]
[Figure: scatter plot of y against x]
Like the univariate stripchart, the scatter plot will be an uninformative mess in the presence of a lot of data. A nice
bivariate counterpart of the univariate histogram is the hexbin plot, which tessellates the plane with hexagons, and
reports their frequencies.
library(hexbin) # load required library
n <- 2e5
x1 <- rexp(n)
x2 <- 2* x1 + 4 + rnorm(n)
plot(hexbin(x = x1, y = x2))
[Figure: hexbin plot of x2 against x1, with a counts legend]
Visualizing multivariate data is a tremendous challenge given that we cannot grasp 4 dimensional spaces, nor can the
computer screen present more than 2 dimensional spaces. We thus have several options: (i) To project the data to
2D. This is discussed in the Dimensionality Reduction Section 11.1. (ii) To visualize not the raw data, but rather its
summaries, like the covariance matrix.
Our own Multinav3 package adopts an interactive approach. For each (multivariate) observation a simple univariate
summary may be computed and visualized. These summaries may be compared, and the original (multivariate)
observation inspected upon demand. Contact Efrat4 for more details.
3 https://github.com/EfratVil/MultiNav
4 http://efratvil.github.io/home/index.html
An alternative approach starts with the covariance matrix, Σ̂, which can be visualized as an image. Note the use of
the :: operator (called Double Colon Operator, for help: ?'::'), which is used to call a function from some package,
without loading the whole package. We will use the :: operator when we want to emphasize the package of origin of
a function.
covariance <- cov(longley) # The covariance of the longley dataset
correlations <- cor(longley) # The correlations of the longley dataset
lattice::levelplot(correlations)
[Figure: lattice::levelplot of the correlation matrix of the longley dataset]
If we believe the covariance has some structure, we can do better than viewing the raw correlations. In temporal, and
spatial data, we believe correlations decay as some function of distances. We can thus view correlations as a function
of the distance between observations. This is known as a variogram. Note that for a variogram to be informative, it
is implied that correlations are merely a function of distances (and not locations themselves). This is formally known
Figure 5.1: Variogram: plotting correlation as a function of spatial distance. Courtesy of Ron Sarafian.
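Empirical variograms can be computed with the gstat package; a minimal sketch on the classic meuse dataset (which is not the data shown in Figure 5.1):
library(sp)
library(gstat)
data(meuse) # heavy-metal concentrations with x, y coordinates
coordinates(meuse) <- ~ x + y # promote the data.frame to a spatial object
plot(variogram(log(zinc) ~ 1, meuse)) # empirical variogram of log zinc concentration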
The following example, from the ggalluvial package Vignette by Jason Cory Brunson6 , demonstrates the flow of
students between different majors, as semesters evolve.
library(ggalluvial)
data(majors)
majors$curriculum <- as.factor(majors$curriculum)
ggplot(majors,
aes(x = semester, stratum = curriculum, alluvium = student,
fill = curriculum, label = curriculum)) +
scale_fill_brewer(type = "qual", palette = "Set2") +
geom_flow(stat = "alluvium", lode.guidance = "rightleft",
color = "darkgray") +
geom_stratum() +
theme(legend.position = "bottom") +
ggtitle("student curricula across several semesters")
[Figure: alluvial plot titled 'student curricula across several semesters', semesters CURR1 to CURR15 on the x-axis]
Things to note:
• We used the ggalluvial package of the ggplot2 ecosystem. More on ggplot2 in the Plotting Chapter.
• Time is on the 𝑥 axis. Categories are color coded.
Remark. If the width of the lines encodes magnitude, the plot is also called a Sankey diagram.
6 https://cran.r-project.org/web/packages/ggalluvial/vignettes/ggalluvial.html
Chapter 6
Linear Models
Example 6.1 (Bottle-Cap Diameters). Consider the diameters of manufactured bottle caps, which depend on the machine’s temperature and pressure, when the goal is to set the temperature and pressure so that the caps come out with the right diameter.
Example 6.2 (Rental Prices). Consider the prediction of rental prices given an apartment’s attributes.
Both examples require some statistical model, but they are very different. The first is a causal inference problem: we want to design an intervention, so we need to recover the causal effect of temperature and pressure. The second is a prediction1 problem, a.k.a. a forecasting2 problem, in which we don’t care about the causal effects; we just want good predictions.
In this chapter we discuss the causal problem in Example 6.1. This means that when we assume a model, we assume
it is the actual data generating process, i.e., we assume the sampling distribution is well specified. In the econometric
literature, these are the structural equations3 . The second type of problems is discussed in the Supervised Learning
Chapter 10.
Here are some more examples of the types of problems we are discussing.
Example 6.3 (Plant Growth). Consider the treatment of various plants with various fertilizers to study the fertilizer’s
effect on growth.
Example 6.4 (Return to Education). Consider the study of return to education by analyzing the incomes of individuals
with different education years.
Example 6.5 (Drug Effect). Consider the study of the effect of a new drug for hemophilia, by analyzing the level of
blood coagulation after the administration of various amounts of the new drug.
Let’s present the linear model. We assume that a response4 variable is the sum of effects of some factors5 . Denoting the
response variable by 𝑦, the factors by 𝑥 = (𝑥1 , … , 𝑥𝑝 ), and the effects by 𝛽 ∶= (𝛽1 , … , 𝛽𝑝 ) the linear model assumption
implies that the expected response is the sum of the factors effects:
$$E[y] = x_1 \beta_1 + \dots + x_p \beta_p = \sum_{j=1}^{p} x_j \beta_j = x'\beta. \tag{6.1}$$
1 https://en.wikipedia.org/wiki/Prediction
2 https://en.wikipedia.org/wiki/Forecasting
3 https://en.wikipedia.org/wiki/Structural_equation_modeling
4 The “response” is also known as the “dependent” variable in the statistical literature, or the “labels” in the machine learning literature.
5 The “factors” are also known as the “independent variable”, or “the design”, in the statistical literature, and the “features”, or
“attributes” in the machine learning literature.
Clearly, there may be other factors that affect the caps’ diameters. We thus introduce an error term6, denoted by 𝜀, to capture the effects of all unmodeled factors and measurement error7. The implied generative process of a sample of 𝑖 = 1, … , 𝑛 observations is thus
$$y_i = x_i'\beta + \varepsilon_i = \sum_j x_{i,j} \beta_j + \varepsilon_i, \quad i = 1, \dots, n. \tag{6.2}$$
or in matrix notation
𝑦 = 𝑋𝛽 + 𝜀. (6.3)
Let’s demonstrate Eq.(6.2). In our bottle-caps example (6.1), we design an experiment in which we produce bottle caps at varying temperatures. Let 𝑥𝑖 be the temperature at which bottle-cap 𝑖 was manufactured, and let 𝑦𝑖 be its measured diameter. By the linear model assumption, the expected diameter varies linearly with the temperature: 𝔼[𝑦𝑖] = 𝛽0 + 𝑥𝑖𝛽1. This implies that 𝛽1 is the (expected) change in diameter due to a unit change in temperature.
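A tiny simulation of this generative process; all numbers are made up for illustration.
n <- 100
temp <- runif(n, min = 20, max = 80) # manufacturing temperatures
beta0 <- 30; beta1 <- 0.05 # assumed 'true' intercept and temperature effect
epsilon <- rnorm(n, sd = 0.2) # unmodeled factors and measurement error
diameter <- beta0 + beta1 * temp + epsilon # Eq.(6.2)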
Remark. In Galton’s8 classical regression problem, where we seek the relation between the heights of fathers and sons, we have 𝑝 = 1, 𝑦𝑖 is the height of the 𝑖’th son, and 𝑥𝑖 the height of the 𝑖’th father. This is a prediction problem, more than it is a causal-inference problem.
There are many reasons linear models are very popular:
1. Before the computer age, these were pretty much the only models that could actually be computed9 . The whole
Analysis of Variance (ANOVA) literature is an instance of linear models, that relies on sums of squares, which
do not require a computer to work with.
2. For purposes of prediction, where the actual data generating process is not of primary importance, they are
popular because they simply work. Why is that? They are simple so that they do not require a lot of data to be
computed. Put differently, they may be biased, but their variance is small enough to make them more accurate
than other models.
3. For non-continuous predictors, any functional relation can be cast as a linear model.
4. For the purpose of screening, where we only want to show the existence of an effect, and are less interested in
the magnitude of that effect, a linear model is enough.
5. If the true generative relation is not linear, but smooth enough, then the linear function is a good approximation
via Taylor’s theorem.
There are still two matters we have to attend to: (i) How to estimate 𝛽? (ii) How to perform inference?
In the simplest linear models the estimation of 𝛽 is done using the method of least squares. A linear model with least squares estimation is known as Ordinary Least Squares (OLS). The OLS problem is
$$\hat{\beta} := \operatorname{argmin}_{\beta} \left\{ \sum_i (y_i - x_i'\beta)^2 \right\} = \operatorname{argmin}_{\beta} \left\{ \|y - X\beta\|_2^2 \right\}. \tag{6.4}$$
Remark. Personally, I prefer the matrix notation because it is suggestive of the geometry of the problem. The reader
is referred to Friedman et al. (2001), Section 3.2, for more on the geometry of OLS.
Different software suits, and even different R packages, solve Eq.(6.4) in different ways so that we skip the details of
how exactly it is solved. These are discussed in Chapters 17 and 18.
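For intuition only, here is the most naive way to solve Eq.(6.4), via the normal equations, compared against lm; the mtcars data is an arbitrary example, and lm itself uses a numerically more stable QR decomposition.
X <- cbind(1, mtcars$wt) # design matrix: intercept and car weight
y <- mtcars$mpg
solve(t(X) %*% X, t(X) %*% y) # solve (X'X) beta = X'y
coef(lm(mpg ~ wt, data = mtcars)) # the same estimates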
The last matter we need to attend to is how to do inference on 𝛽̂. For that, we will need some assumptions on 𝜀. A typical set of assumptions is the following:
6 The “error term” is also known as the “noise”, or the “common causes of variability”.
7 You may philosophize if the measurement error is a mere instance of unmodeled factors or not, but this has no real implication for our
purposes.
8 https://en.wikipedia.org/wiki/Regression_toward_the_mean
9 By “computed” we mean what statisticians call “fitted”, or “estimated”, and computer scientists call “learned”.
1. Independence: we assume 𝜀𝑖 are independent of everything else. Think of them as the measurement error of
an instrument: it is independent of the measured value and of previous measurements.
2. Centered: we assume that 𝐸[𝜀] = 0, meaning there is no systematic error. This is sometimes called the “linearity assumption”.
3. Normality: we will typically assume that $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, but we will later see that this is not really required.
We emphasize that these assumptions are only needed for inference on 𝛽 ̂ and not for the estimation itself, which is
done by the purely algorithmic framework of OLS.
Given the above assumptions, we can apply some probability theory and linear algebra to get the distribution of the estimation error:
$$\hat{\beta} - \beta \sim \mathcal{N}\left(0, \, \sigma^2 (X'X)^{-1}\right). \tag{6.6}$$
The reason I am not too strict about the normality assumption above, is that Eq.(6.6) is approximately correct even
if 𝜀 is not normal, provided that there are many more observations than factors (𝑛 ≫ 𝑝).
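The model discussed below is fit on the whiteside data from the MASS package: weekly gas consumption (Gas) and outside temperature (Temp) of a house, before and after wall insulation (Insul). The following call is a sketch of what the notes below refer to; the conversion to a data.table is an assumption, inferred from the subsetting syntax used later in the chapter.
library(data.table)
whiteside <- data.table(MASS::whiteside) # Gas consumption vs. outside Temp, before/after insulation
lm.1 <- lm(Gas ~ Temp, data = whiteside[Insul == 'Before'])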
Things to note:
• We used the tilde syntax Gas~Temp, reading “gas as linear function of temperature”.
• The data argument tells R where to look for the variables Gas and Temp. We used Insul=='Before' to subset
observations before the insulation.
• The result is assigned to the object lm.1.
Like any other language, spoken or programmable, there are many ways to say the same thing. Some more elegant
than others…
lm.1 <- lm(Gas~Temp, data=whiteside, subset=Insul=='Before')
lm.1 <- lm(whiteside[Insul=='Before',Gas] ~ whiteside[Insul=='Before',Temp])
lm.1 <- whiteside[whiteside$Insul=='Before',] %>% lm(Gas~Temp, data=.)
## [1] "lm"
Objects of class lm are very complicated. They store a lot of information which may be used for inference, plotting,
etc. The str function, short for “structure”, shows us the various elements of the object.
str(lm.1)
## List of 12
## $ coefficients : Named num [1:2] 6.854 -0.393
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "Temp"
## $ residuals : Named num [1:26] 0.0316 -0.2291 -0.2965 0.1293 0.0866 ...
## ..- attr(*, "names")= chr [1:26] "1" "2" "3" "4" ...
## $ effects : Named num [1:26] -24.2203 -5.6485 -0.2541 0.1463 0.0988 ...
## ..- attr(*, "names")= chr [1:26] "(Intercept)" "Temp" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:26] 7.17 7.13 6.7 5.87 5.71 ...
## ..- attr(*, "names")= chr [1:26] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:26, 1:2] -5.099 0.196 0.196 0.196 0.196 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:26] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "Temp"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.2 1.35
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 24
## $ xlevels : Named list()
## $ call : language lm(formula = Gas ~ Temp, data = whiteside[Insul == "Before"])
## $ terms :Classes 'terms', 'formula' language Gas ~ Temp
## .. ..- attr(*, "variables")= language list(Gas, Temp)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "Gas" "Temp"
## .. .. .. ..$ : chr "Temp"
## .. ..- attr(*, "term.labels")= chr "Temp"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(Gas, Temp)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "Gas" "Temp"
## $ model :'data.frame': 26 obs. of 2 variables:
## ..$ Gas : num [1:26] 7.2 6.9 6.4 6 5.8 5.8 5.6 4.7 5.8 5.2 ...
## ..$ Temp: num [1:26] -0.8 -0.7 0.4 2.5 2.9 3.2 3.6 3.9 4.2 4.3 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language Gas ~ Temp
## .. .. ..- attr(*, "variables")= language list(Gas, Temp)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "Gas" "Temp"
## .. .. .. .. ..$ : chr "Temp"
## .. .. ..- attr(*, "term.labels")= chr "Temp"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
coef(lm.1) # the estimated coefficients
## (Intercept) Temp
## 6.8538277 -0.3932388
Things to note:
• R automatically adds an (Intercept) term. This means we estimate 𝐺𝑎𝑠 = 𝛽0 + 𝛽1 𝑇 𝑒𝑚𝑝 + 𝜀 and not 𝐺𝑎𝑠 =
𝛽1 𝑇 𝑒𝑚𝑝 + 𝜀. This makes sense because we are interested in the contribution of the temperature to the variability
of the gas consumption about its mean, and not about zero.
• The effect of temperature, i.e., 𝛽1̂ , is -0.39. The negative sign means that the higher the temperature, the less
gas is consumed. The magnitude of the coefficient means that for a unit increase in the outside temperature, the
gas consumption decreases by 0.39 units.
We can use the predict function to make predictions, but we emphasize that if the purpose of the model is to make
predictions, and not interpret coefficients, better skip to the Supervised Learning Chapter 10.
# Gas predictions (b0+b1*temperature) vs. actual Gas measurements, ideal slope should be 1.
plot(predict(lm.1)~whiteside[Insul=='Before',Gas])
# plots identity line (slope 1), lty=Line Type, 2 means dashed line.
abline(0,1, lty=2)
[Figure: predicted Gas values against the actual Gas measurements, with the dashed identity line]
The model seems to fit the data nicely. A common measure of the goodness of fit is the coefficient of determination,
more commonly known as the 𝑅2 .
$$R^2 := 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}. \tag{6.7}$$
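One way to compute Eq.(6.7) directly; a sketch, reusing the data.table syntax assumed above.
y.obs <- whiteside[Insul == 'Before', Gas]
1 - sum((y.obs - predict(lm.1))^2) / sum((y.obs - mean(y.obs))^2)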
## [1] 0.9438081
This is a nice result implying that about 94% of the variability in gas consumption can be attributed to changes in
the outside temperature.
Obviously, R does provide the means to compute something as basic as 𝑅2 , but I will let you find it for yourselves.
6.3 Inference
To perform inference on 𝛽̂, in order to test hypotheses and construct confidence intervals, we need to quantify the uncertainty in the reported 𝛽̂. This is exactly what Eq.(6.6) gives us.
Luckily, we don’t need to manipulate multivariate distributions manually, and everything we need is already imple-
mented. The most important function is summary which gives us an overview of the model’s fit. We emphasize that
fitting a model with lm is an assumption free algorithmic step. Inference using summary is not assumption free, and
requires the set of assumptions leading to Eq.(6.6).
summary(lm.1)
##
## Call:
## lm(formula = Gas ~ Temp, data = whiteside[Insul == "Before"])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62020 -0.19947 0.06068 0.16770 0.59778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.85383 0.11842 57.88 <2e-16 ***
## Temp -0.39324 0.01959 -20.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2813 on 24 degrees of freedom
## Multiple R-squared: 0.9438, Adjusted R-squared: 0.9415
## F-statistic: 403.1 on 1 and 24 DF, p-value: < 2.2e-16
Things to note:
• The estimated 𝛽 ̂ is reported in the ‘Coefficients’ table, which has point estimates, standard errors, t-statistics,
and the p-values of a two-sided hypothesis test for each coefficient 𝐻0,𝑗 ∶ 𝛽𝑗 = 0, 𝑗 = 1, … , 𝑝.
• The 𝑅2 is reported at the bottom. The “Adjusted R-squared” is a variation that compensates for the model’s
complexity.
• The original call to lm is saved in the Call section.
• Some summary statistics of the residuals (𝑦𝑖 − 𝑦𝑖̂ ) in the Residuals section.
• The “residual standard error”10 is $\sqrt{(n-p)^{-1} \sum_i (y_i - \hat{y}_i)^2}$. The denominator of this expression is the degrees of freedom, 𝑛 − 𝑝, which can be thought of as the hardness of the problem.
As the name suggests, summary is merely a summary. The full summary(lm.1) object is a monstrous object. Its various elements can be queried using str(summary(lm.1)).
Can we check the assumptions required for inference? Some of them. Let’s start with the linearity assumption. If we were wrong, and the data is not arranged around a line, the residuals will have some shape. We thus plot the residuals against the predictor and look for such a shape.
10 Sometimes known as the Root Mean Squared Error (RMSE).
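A residual plot against the predictor, in the same style as the wrong-model example that follows; the exact plotting call is an assumption.
plot(residuals(lm.1)~whiteside[Insul=='Before',Temp]); abline(0,0, lty=2)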
[Figure: residuals of lm.1 against Temp, with a dashed horizontal line at 0]
I can’t say I see any shape. Let’s fit a wrong model, just to see what “shape” means.
lm.1.1 <- lm(Gas~I(Temp^2), data=whiteside[Insul=='Before',])
plot(residuals(lm.1.1)~whiteside[Insul=='Before',Temp]); abline(0,0, lty=2)
[Figure: residuals of lm.1.1 against Temp, with a dashed horizontal line at 0]
Things to note:
• The I() function tells R to treat Temp^2 arithmetically, i.e., as the square of Temp, rather than as formula syntax.
• Unlike the residuals of lm.1, these residuals do display a shape, betraying the misspecified model.
To the next assumption. We assumed 𝜀𝑖 are independent of everything else. The residuals, 𝑦𝑖 − 𝑦𝑖̂, can be thought of as a sample of 𝜀𝑖. When diagnosing the linearity assumption, we already saw their distribution does not vary with the 𝑥’s, Temp in our case. They may be correlated with themselves; a positive departure from the model may be followed by a series of positive departures, etc. Diagnosing these auto-correlations is a real art, which is not part of our course.
The last assumption we required is normality. As previously stated, if 𝑛 ≫ 𝑝, this assumption can be relaxed. If 𝑛 is
in the order of 𝑝, we need to verify this assumption. My favorite tool for this task is the qqplot. A qqplot compares
the quantiles of the sample with the respective quantiles of the assumed distribution. If quantiles align along a line,
the assumed distribution is OK. If quantiles depart from a line, then the assumed distribution does not fit the sample.
qqnorm(resid(lm.1))
[Figure: normal Q-Q plot of the residuals of lm.1 (Sample Quantiles against Theoretical Quantiles)]
Things to note:
• The qqnorm function plots a qqplot against a normal distribution. For non-normal distributions try qqplot.
• resid(lm.1) extracts the residuals from the linear model, i.e., the vector of 𝑦𝑖 − 𝑥′𝑖 𝛽.̂
Judging from the figure, the normality assumption is quite plausible. Let’s try the same on a non-normal sample,
namely a uniformly distributed sample, to see how that would look.
qqnorm(runif(100))
[Figure: normal Q-Q plot of a uniform sample (Sample Quantiles against Theoretical Quantiles)]
summary(lm(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, data = swiss)) # call reconstructed from the Call line below
##
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education +
## Catholic + Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
summary(lm(Fertility ~ ., data = swiss)) # call reconstructed from the Call line below
##
## Call:
## lm(formula = Fertility ~ ., data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
table(twoWay$Treatment, twoWay$Age) # a balanced two-way layout
##
##            mid old young
##   medical    3   3     3
##   mental     3   3     3
##   physical   3   3     3
Since we have two factorial predictors, this multiple regression is nothing but a two way ANOVA. Let’s fit the model
and inspect it.
lm.2 <- lm(StressReduction~.,data=twoWay)
summary(lm.2)
##
## Call:
## lm(formula = StressReduction ~ ., data = twoWay)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1 -1 0 1 1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0000 0.3892 10.276 7.34e-10 ***
## Treatmentmental 2.0000 0.4264 4.690 0.000112 ***
## Treatmentphysical 1.0000 0.4264 2.345 0.028444 *
## Ageold -3.0000 0.4264 -7.036 4.65e-07 ***
## Ageyoung 3.0000 0.4264 7.036 4.65e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9045 on 22 degrees of freedom
## Multiple R-squared: 0.9091, Adjusted R-squared: 0.8926
## F-statistic: 55 on 4 and 22 DF, p-value: 3.855e-11
Things to note:
• Mid age and medical treatment are missing, hence it is implied that they are the baseline, and this model accounts
for the departure from this baseline.
• The data has 2 factors, but the coefficients table has 4 predictors besides the intercept. This is because lm noticed that Treatment and Age are factors: the numerical coding of a factor is meaningless, so R constructs a dummy variable for each non-baseline level of each factor. The names of the effects are a concatenation of the factor’s name and its level. You can inspect these dummy variables with the model.matrix command.
model.matrix(lm.2) %>% lattice::levelplot()
[Figure: lattice::levelplot of the model matrix of lm.2, one column per (Intercept) and dummy variable]
If you are more familiar with the ANOVA literature, or if you don’t want the effect of each level separately but rather the effect of all the levels of each factor, use the anova command.
anova(lm.2)
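The interaction model inspected below can be fit with a call along the following lines; this is a reconstruction from the Call line in its summary output, and the object name lm.3 is taken from the surrounding text.
lm.3 <- lm(StressReduction ~ Treatment + Age + Treatment:Age - 1, data = twoWay)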
The syntax Treatment * Age means “main effects with second order interactions”. The syntax (.)^2 means “everything with second order interactions”. This time we don’t need I() as in the temperature example, because here we want the second order interaction and not the square of each variable.
Let’s inspect the model
summary(lm.3)
##
## Call:
## lm(formula = StressReduction ~ Treatment + Age + Treatment:Age -
## 1, data = twoWay)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1 -1 0 1 1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Treatmentmedical 4.000e+00 5.774e-01 6.928 1.78e-06 ***
## Treatmentmental 6.000e+00 5.774e-01 10.392 4.92e-09 ***
## Treatmentphysical 5.000e+00 5.774e-01 8.660 7.78e-08 ***
## Ageold -3.000e+00 8.165e-01 -3.674 0.00174 **
## Ageyoung 3.000e+00 8.165e-01 3.674 0.00174 **
## Treatmentmental:Ageold 4.246e-16 1.155e+00 0.000 1.00000
## Treatmentphysical:Ageold 1.034e-15 1.155e+00 0.000 1.00000
## Treatmentmental:Ageyoung -3.126e-16 1.155e+00 0.000 1.00000
## Treatmentphysical:Ageyoung 5.128e-16 1.155e+00 0.000 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1 on 18 degrees of freedom
## Multiple R-squared: 0.9794, Adjusted R-squared: 0.9691
## F-statistic: 95 on 9 and 18 DF, p-value: 2.556e-13
Things to note:
• There are still 5 main effects, but also 4 interactions. This is because when allowing a different average response
for every 𝑇 𝑟𝑒𝑎𝑡𝑚𝑒𝑛𝑡 ∗ 𝐴𝑔𝑒 combination, we are effectively estimating 3 ∗ 3 = 9 cell means, even if they are not
parametrized as cell means, but rather as main effect and interactions.
• The interactions do not seem to be significant.
• The assumptions required for inference are clearly not met in this example, which is there just to demonstrate
R’s capabilities.
Asking if all the interactions are significant, is asking if the different age groups have the same response to different
treatments. Can we answer that based on the various interactions? We might, but it is possible that no single
interaction is significant, while the combination is. To test for all the interactions together, we can simply check if the
model without interactions is (significantly) better than a model with interactions. I.e., compare lm.2 to lm.3. This
is done with the anova command.
anova(lm.2,lm.3, test='F')
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = StressReduction ~ ., data = twoWay)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -3.0000 0.7177 -4.18 0.000389 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
Things to note:
• A contrast is a linear function of the coefficients. In our example 𝐻0 ∶ 𝛽1 −𝛽3 = 0, which justifies the construction
of my.contrast.
• We used the glht function (generalized linear hypothesis test) from the package multcomp.
• The contrast is significant, i.e., the effect of a medical treatment is different from that of a physical treatment.
Chapter 7
Generalized Linear Models
Example 7.1. Consider the relation between cigarettes smoked, and the occurrence of lung cancer. Do we expect the probability of cancer to be linear in the number of cigarettes? Probably not. Do we expect the variability of events to be constant about the trend? Probably not.
Example 7.2. Consider the relation between travel times and the distance travelled. Do you agree that the longer the distance travelled, then not only do the travel times get longer, but they also get more variable?
This model does not allow for the non-linear relations of Example 7.1, nor does it allow for the distribution of 𝜀 to change with 𝑥, as in Example 7.2. Generalized linear models (GLMs), as the name suggests, are a generalization of the linear models in Chapter 6 that allow that1.
For Example 7.1, we would like something along the lines of
𝑦|𝑥 ∼ 𝐵𝑖𝑛𝑜𝑚(1, 𝑝(𝑥))
Even more generally, for some distribution 𝐹 (𝜃), with a parameter 𝜃, we would like to assume that the data is generated
via
𝑦|𝑥 ∼ 𝐹 (𝜃(𝑥)) (7.1)
GLMs allow models of the type of Eq.(7.1), while imposing some constraints on 𝐹 and on the relation 𝜃(𝑥). GLMs
assume the data distribution 𝐹 to be in a “well-behaved” family known as the Natural Exponential Family4 of distri-
butions. This family includes the Gaussian, Gamma, Binomial, Poisson, and Negative Binomial distributions. These
five include as special cases the exponential, chi-squared, Rayleigh, Weibull, Bernoulli, and geometric distributions.
GLMs also assume that the distribution’s parameter, 𝜃, is some simple function of a linear combination of the effects. In
our cigarettes example this amounts to assuming that each cigarette has an additive effect, not on the probability of cancer itself, but rather on some simple function of it. Formally,
$$g(\theta(x)) = x'\beta,$$
and we recall that
$$x'\beta = \beta_0 + \sum_j x_j \beta_j.$$
The function 𝑔 is called the link function; its inverse, $g^{-1}$, is the mean function. We thus say that “the effect of each cigarette is linear in link scale”. This terminology will later be required to understand R’s output.
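For logistic regression the link is the logit and the mean function is its inverse; a quick sketch (the function names are mine):
logit <- function(p) log(p / (1 - p)) # the link g
inv.logit <- function(eta) 1 / (1 + exp(-eta)) # the mean function g^{-1}
inv.logit(logit(0.66)) # round trip: back to 0.66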
Before we fit such a model, we try to justify this construction, in particular, the enigmatic link function in Eq.(7.5).
Let’s look at the simplest possible case: the comparison of two groups indexed by 𝑥: 𝑥 = 0 for the first, and 𝑥 = 1 for
the second. We start with some definitions.
Definition 7.1 (Odds). The odds of a binary random variable 𝑦 are defined as the ratio between the probability of success and the probability of failure: $P(y=1)/P(y=0)$.
Odds are the same as probabilities, but instead of telling me there is a 66% chance of success, they tell me the odds of success are “2 to 1”. If you ever placed a bet, the language of “odds” should not be unfamiliar to you.
Definition 7.2 (Odds Ratio). The odds ratio between two binary random variables, 𝑦1 and 𝑦2 , is defined as the ratio
between their odds. Formally:
$$OR(y_1, y_2) := \frac{P(y_1 = 1)/P(y_1 = 0)}{P(y_2 = 1)/P(y_2 = 0)}.$$
Odds ratios (OR) compare the probabilities of two groups, only not on the probability scale but rather on the odds scale. You can also think of ORs as a measure of distance between two Bernoulli distributions. ORs have better mathematical properties than other candidate distance measures, such as 𝑃(𝑦1 = 1) − 𝑃(𝑦2 = 1).
Under the logit link assumption formalized in Eq.(7.6), the OR between two conditions indexed by 𝑦|𝑥 = 1 and 𝑦|𝑥 = 0,
returns:
$$OR(y|x=1, \, y|x=0) = \frac{P(y=1|x=1)/P(y=0|x=1)}{P(y=1|x=0)/P(y=0|x=0)} = e^{\beta_1}. \tag{7.7}$$
The last equality demystifies the choice of the link function in the logistic regression: it allows us to interpret
𝛽 of the logistic regression as a measure of change of binary random variables, namely, as the (log)
odds-ratios due to a unit increase in 𝑥.
4 https://en.wikipedia.org/wiki/Natural_exponential_family
Remark. Another popular link function is the normal quantile function, a.k.a., the Gaussian inverse CDF, leading to
probit regression instead of logistic regression.
head(PlantGrowth) # plant weights under a control and two treatment conditions
## weight group
## 1 4.17 ctrl
## 2 5.58 ctrl
## 3 5.18 ctrl
## 4 6.11 ctrl
## 5 4.50 ctrl
## 6 4.61 ctrl
We will now attach the data so that its contents are available in the workspace (don’t forget to detach afterwards, or you can expect some conflicting object names). We will also use the cut function to create a binary response variable for Light and Heavy plants (we are doing logistic regression, so we need a two-class response); notice also that cut splits according to range and not to length. As a general rule of thumb, when we discretize continuous variables, we lose information. For pedagogical reasons, however, we will proceed with this bad practice.
Look at the following output and think: how many group effects do we expect? What should be the sign of each
effect?
attach(PlantGrowth)
weight.factor<- cut(weight, 2, labels=c('Light', 'Heavy')) # binarize weights
plot(table(group, weight.factor))
table(group, weight.factor)
[Figure: mosaic plot of weight.factor by group]
glm.1 <- glm(weight.factor ~ group, family = binomial) # call reconstructed from the Call line below
summary(glm.1)
##
## Call:
## glm(formula = weight.factor ~ group, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1460 -0.6681 0.4590 0.8728 1.7941
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4055 0.6455 0.628 0.5299
## grouptrt1 -1.7918 1.0206 -1.756 0.0792 .
## grouptrt2 1.7918 1.2360 1.450 0.1471
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.054 on 29 degrees of freedom
## Residual deviance: 29.970 on 27 degrees of freedom
## AIC: 35.97
##
## Number of Fisher Scoring iterations: 4
Things to note:
• The family=binomial argument tells glm we want a logistic regression; the logit is the default link for the binomial family.
• The coefficients are reported on the link (log-odds) scale: grouptrt1 and grouptrt2 are log odds-ratios of a Heavy plant, relative to the control group.
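Following Eq.(7.7), exponentiating the coefficients reads them as odds ratios:
exp(coef(glm.1)) # e^beta: odds ratios relative to the control group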
Like in the linear models, we can use an ANOVA table to check if treatments have any effect, and not one treatment
at a time. In the case of GLMs, this is called an analysis of deviance table.
anova(glm.1, test='LRT')
Things to note:
• The anova function, like the summary function, is content-aware and produces a different output for the glm class than for the lm class. All that anova does is call anova.glm.
• In GLMs there is no canonical test (like the F test for lm). LRT implies we want an approximate Likelihood Ratio
Test. We thus specify the type of test desired with the test argument.
• The distribution of the weights of the plants does vary with the treatment given, as we may see from the
significance of the group factor.
• Readers familiar with ANOVA tables, should know that we computed the GLM equivalent of a type I sum-
of-squares. Run drop1(glm.1, test='Chisq') for a GLM equivalent of a type III sum-of-squares.
• For help see ?anova.glm.
predict(glm.1, type='response') # predicted probability of a Heavy plant, per observation
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
## 19 20 21 22 23 24 25 26 27 28 29 30
## 0.2 0.2 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9
Things to note:
• Like the summary and anova functions, the predict function is aware that its input is of glm class. All that
predict does is call predict.glm.
• In GLMs there are many types of predictions. The type argument controls which type is returned. Use type='response' for predictions on the probability scale; use type='link' for predictions on the log-odds scale.
• How do I know we are predicting the probability of a heavy plant, and not a light plant? Just run
contrasts(weight.factor) to see which of the categories of the factor weight.factor is encoded as 1, and
which as 0.
• For help see ?predict.glm.
Let’s detach the data so it is no longer in our workspace, and object names do not collide.
detach(PlantGrowth)
We gave an example with a factorial (i.e. discrete) predictor. We can do the same with multiple continuous predictors.
data('Pima.te', package='MASS') # Loads data
head(Pima.te)
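The step-wise output that follows was produced by a call along these lines; the object name glm.2 and the exact call are assumptions reconstructed from the output.
glm.2 <- glm(type ~ ., data = Pima.te, family = binomial(link = 'probit')) # probit regression on all predictors
glm.2 <- step(glm.2) # step-wise removal of non-significant predictors
summary(glm.2)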
## Start: AIC=302.41
## type ~ npreg + glu + bp + skin + bmi + ped + age
##
## Df Deviance AIC
## - bp 1 286.92 300.92
## - skin 1 286.94 300.94
## - age 1 287.74 301.74
## <none> 286.41 302.41
## - ped 1 291.06 305.06
## - npreg 1 292.55 306.55
## - bmi 1 294.52 308.52
## - glu 1 342.35 356.35
##
## Step: AIC=300.92
## type ~ npreg + glu + skin + bmi + ped + age
##
## Df Deviance AIC
## - skin 1 287.50 299.50
## - age 1 287.92 299.92
## <none> 286.92 300.92
## - ped 1 291.70 303.70
## - npreg 1 293.06 305.06
## - bmi 1 294.55 306.55
##
## Call:
## glm(formula = type ~ npreg + glu + bmi + ped, family = binomial(link = "probit"),
## data = Pima.te)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9935 -0.6487 -0.3585 0.6326 2.5791
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.445143 0.569373 -9.563 < 2e-16 ***
## npreg 0.102410 0.025607 3.999 6.35e-05 ***
## glu 0.021739 0.002988 7.276 3.45e-13 ***
## bmi 0.048709 0.012291 3.963 7.40e-05 ***
## ped 0.534366 0.250584 2.132 0.033 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 420.30 on 331 degrees of freedom
## Residual deviance: 288.47 on 327 degrees of freedom
## AIC: 298.47
##
## Number of Fisher Scoring iterations: 5
Things to note:
• We used the ~. syntax to tell R to fit a model with all the available predictors.
• Since we want to focus on significant predictors, we used the step function to perform a step-wise regression,
i.e. sequentially remove non-significant predictors. The function reports each model it has checked, and the
variable it has decided to remove at each step.
• The output of step is a single model, with the subset of selected predictors.
7.3 Poisson Regression
Poisson regression means we fit a model assuming $y|x \sim Poisson(\lambda(x))$. Put differently, we assume that for each
treatment, encoded as a combinations of predictors 𝑥, the response is Poisson distributed with a rate that depends on
the predictors.
The typical link function for Poisson regression is the logarithm: 𝑔(𝑡) = log(𝑡). This means that we assume $y|x \sim Poisson(\lambda(x) = e^{x'\beta})$. Why is this a good choice? We again resort to the two-group case, encoded by 𝑥 = 1 and 𝑥 = 0, to understand this model: $\lambda(x=1) = e^{\beta_0 + \beta_1} = e^{\beta_0} e^{\beta_1} = \lambda(x=0) \, e^{\beta_1}$. We thus see that this link function implies that a change in 𝑥 multiplies the rate of events by $e^{\beta_1}$.
For our example5 we inspect the number of infected high-school kids, as a function of the days since an outbreak.
cases <-
structure(list(Days = c(1L, 2L, 3L, 3L, 4L, 4L, 4L, 6L, 7L, 8L,
8L, 8L, 8L, 12L, 14L, 15L, 17L, 17L, 17L, 18L, 19L, 19L, 20L,
23L, 23L, 23L, 24L, 24L, 25L, 26L, 27L, 28L, 29L, 34L, 36L, 36L,
42L, 42L, 43L, 43L, 44L, 44L, 44L, 44L, 45L, 46L, 48L, 48L, 49L,
49L, 53L, 53L, 53L, 54L, 55L, 56L, 56L, 58L, 60L, 63L, 65L, 67L,
67L, 68L, 71L, 71L, 72L, 72L, 72L, 73L, 74L, 74L, 74L, 75L, 75L,
80L, 81L, 81L, 81L, 81L, 88L, 88L, 90L, 93L, 93L, 94L, 95L, 95L,
95L, 96L, 96L, 97L, 98L, 100L, 101L, 102L, 103L, 104L, 105L,
106L, 107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L),
Students = c(6L, 8L, 12L, 9L, 3L, 3L, 11L, 5L, 7L, 3L, 8L,
4L, 6L, 8L, 3L, 6L, 3L, 2L, 2L, 6L, 3L, 7L, 7L, 2L, 2L, 8L,
3L, 6L, 5L, 7L, 6L, 4L, 4L, 3L, 3L, 5L, 3L, 3L, 3L, 5L, 3L,
5L, 6L, 3L, 3L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, 5L, 4L, 4L, 3L,
5L, 4L, 3L, 5L, 3L, 4L, 2L, 3L, 3L, 1L, 3L, 2L, 5L, 4L, 3L,
0L, 3L, 3L, 4L, 0L, 3L, 3L, 4L, 0L, 2L, 2L, 1L, 1L, 2L, 0L,
2L, 1L, 1L, 0L, 0L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 0L, 0L,
0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("Days", "Students"
), class = "data.frame", row.names = c(NA, -109L))
attach(cases)
head(cases)
## Days Students
## 1 1 6
## 2 2 8
## 3 3 12
## 4 3 9
## 5 4 3
## 6 4 3
[Figure: scatter plot of STUDENTS (number of infected students) against DAYS since the outbreak]
We now fit a model to check for the change in the rate of events as a function of the days since the outbreak.
glm.3 <- glm(Students ~ Days, family = poisson)
summary(glm.3)
##
## Call:
## glm(formula = Students ~ Days, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.00482 -0.85719 -0.09331 0.63969 1.73696
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.990235 0.083935 23.71 <2e-16 ***
## Days -0.017463 0.001727 -10.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 215.36 on 108 degrees of freedom
## Residual deviance: 101.17 on 107 degrees of freedom
## AIC: 393.11
##
## Number of Fisher Scoring iterations: 5
Things to note:
• We used family=poisson in the glm function to tell R that we assume a Poisson distribution.
• The coefficients table is there as usual. When interpreting the table, we need to recall that the effects, i.e. the 𝛽̂, are multiplicative due to the assumed link function.
• Each day decreases the rate of events by a factor of about $e^{\hat{\beta}_1} = 0.98$.
• For more information see ?glm and ?family.
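The multiplicative reading of the coefficients can be obtained directly:
exp(coef(glm.3)) # each day multiplies the rate of events by about 0.983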
7.4 Extensions
As we already implied, GLMs are a very wide class of models. We do not need to use the default link function, but more importantly, we are not constrained to a Binomial or Poisson distributed response. For exponential, gamma, and other response distributions, see ?glm or the references in the Bibliographic Notes section.
Chapter 8
Linear Mixed Models
Example 8.1 (Dependent Samples on the Mean). Consider inference on a population’s mean. Supposedly, more observations imply more information on the mean. This, however, is not the case if samples are completely dependent: more observations do not add any new information. From this example one may think that dependence is a bad thing. This is a false intuition: negative correlations imply oscillations about the mean, so they are actually more informative on the mean than independent observations.
Example 8.2 (Repeated Measures). Consider a prospective study, i.e., data that originates from selecting a set of subjects and making measurements on them over time. Also assume that some subjects received some treatment, and others did not. When we want to infer on the population from which these subjects have been sampled, we need to recall that some series of observations came from the same subject. If we were to ignore the subject of origin, and treat each observation as an independent sample point, we would think we have more information in our data than we actually do. For a rough intuition, think of a case where observations within subject are perfectly dependent.
The sources of variability, i.e. noise, are known in the statistical literature as “random effects”. Specifying these sources
determines the correlation structure in our measurements. In the simplest linear models of Chapter 6, we thought of
the variability as a measurement error, independent of anything else. This, however, is rarely the case when time or
space are involved.
The variability in our data is rarely the object of interest. It is merely the source of uncertainty in our measurements.
The effects we want to infer on are assumed non-random, and are thus known as “fixed-effects”. A model which has several
sources of variability, i.e. random-effects, and several deterministic effects to study, i.e. fixed-effects, is known as a
“mixed effects” model. If the model is also linear, it is known as a linear mixed model (LMM). Here are some examples
of such models.
Example 8.3 (Fixed and Random Machine Effect). Consider the problem of testing for a change in the distribution of diameters of manufactured bottle caps. We want to study the (fixed) effect of time: before versus after. Bottle caps are produced by several machines. Clearly there is variability in the diameters within-machine and between-machines. Given many measurements on many bottle caps from many machines, we could standardize measurements by removing each machine’s average. This implies the within-machine variability is the only source of variability we care about, because the subtraction of the machine effect removed the information on the between-machine variability.
Alternatively, we could treat the between-machine variability as another source of noise/uncertainty when inferring on
the temporal fixed effect.
Example 8.4 (Fixed and Random Subject Effect). Consider an experimental design where each subject is given 2 types of diets, and his health condition is recorded. We could standardize over subjects by removing the subject-wise average, before comparing diets. This is what a paired t-test does. This also implies the within-subject variability is the only source of variability we care about. Alternatively, for inference on the population of “all subjects” we need to address the between-subject variability, and not only the within-subject variability.
The unifying theme of the above examples, is that the variability in our data has several sources. Which are the sources
of variability that need to concern us? This is a delicate matter which depends on your goals. As a rule of thumb, we
will suggest the following view: If information of an effect will be available at the time of prediction, treat
it as a fixed effect. If it is not, treat it as a random-effect.
LMMs are so fundamental, that they have earned many names:
• Mixed Effects: Because we may have both fixed effects we want to estimate and remove, and random effects
which contribute to the variability to infer against.
• Variance Components: Because as the examples show, variance has more than a single source (like in the
Linear Models of Chapter 6).
• Hierarchical Models: Because as Example 8.4 demonstrates, we can think of the sampling as hierarchical: first sample a subject, and then sample its response.
• Multilevel Analysis: For the same reasons it is also known as Hierarchical Models.
• Repeated Measures: Because we make several measurements from each unit, like in Example 8.4.
• Longitudinal Data: Because we follow units over time, like in Example 8.4.
• Panel Data: Is the term typically used in econometric for such longitudinal data.
• MANOVA: Many of the problems that may be solved with a multivariate analysis of variance (MANOVA),
may be solved with an LMM for reasons we detail in 9.
• Structured Prediction: In the machine learning literature, predicting outcomes with structure, such as cor-
related vectors, is known as Structured Learning. Because LMMs merely specify correlations, using a LMM for
making predictions may be thought of as an instance of structured prediction.
Whether we are aiming to infer on a generative model’s parameters, or to make predictions, there is no “right” nor
“wrong” approach. Instead, there is always some implied measure of error, and an algorithm may be good, or bad,
with respect to this measure (think of false and true positives, for instance). This is why we care about dependencies
in the data: ignoring the dependence structure will probably yield inefficient algorithms. Put differently, if we ignore
the statistical dependence in the data we will probably be making more errors than possible/optimal.
We now emphasize:
1. Like in previous chapters, by “model” we refer to the assumed generative distribution, i.e., the sampling distri-
bution.
2. LMMs are a way to infer against the right level of variability. Using a naive linear model (which assumes a single
source of variability) instead of a mixed effects model, probably means your inference is overly anti-conservative.
Put differently, the uncertainty in your estimates is higher than the linear model from Chapter 6 may suggest.
3. In a LMM we will specify the dependence structure via the hierarchy in the sampling scheme (e.g. caps within
machine, students within class, etc.). Not all dependency models can be specified in this way. Dependency
structures that are not hierarchical include temporal dependencies (AR1 , ARIMA2 , ARCH3 and GARCH),
spatial4 , Markov Chains5 , and more. To specify dependency structures that are no hierarchical, see Chapter 8
in (the excellent) Weiss (2005).
4. If you are using the model merely for predictions, and not for inference on the fixed effects or variance components,
then stating the generative distribution may be useful, but not necessarily. See the Supervised Learning
Chapter 10 for more on prediction problems. Also recall that machine learning from non-independent observations
(such as LMMs) is a delicate matter that is rarely treated in the literature.
𝑦|𝑥, 𝑢 = 𝑥′ 𝛽 + 𝑧 ′ 𝑢 + 𝜀 (8.1)
1 https://en.wikipedia.org/wiki/Autoregressive_model
2 https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
3 https://en.wikipedia.org/wiki/Autoregressive_conditional_heteroskedasticity
4 https://en.wikipedia.org/wiki/Spatial_dependence
5 https://en.wikipedia.org/wiki/Markov_chain
where 𝑥 are the factors with fixed effects, 𝛽, which we may want to study. The factors 𝑧, with effects 𝑢, are the random
effects which contribute to variability. In our repeated measures example (8.2) the treatment is a fixed effect, and
the subject is a random effect. In our bottle-caps example (8.3) the time (before vs. after) is a fixed effect, and the
machines may be either a fixed or a random effect (depending on the purpose of inference). In our diet example (8.4)
the diet is the fixed effect and the family is a random effect.
Notice that we state 𝑦|𝑥, 𝑧 merely as a convenient way to do inference on 𝑦|𝑥, instead of directly specifying 𝑉 𝑎𝑟[𝑦|𝑥].
This is exactly the power of LMMs: we specify the covariance not via the matrix 𝑉 𝑎𝑟[𝑦, 𝑧], but rather via the sampling
hierarchy.
Given a sample of 𝑛 observations (𝑦𝑖 , 𝑥𝑖 , 𝑧𝑖 ) from model (8.1), we will want to estimate (𝛽, 𝑢). Under some assumption
on the distribution of 𝜀 and 𝑧, we can use maximum likelihood (ML). In the context of LMMs, however, ML is typically
replaced with restricted maximum likelihood (ReML), because it returns unbiased estimates of 𝑉 𝑎𝑟[𝑦|𝑥] and ML does
not.
# Generate data
n.groups <- 4; n.each <- 2 # 4 clusters with 2 replicates each (an assumed design)
beta0 <- 2 # set global mean
z <- rep(rnorm(n.groups, sd = 10), each = n.each) # shared within-cluster effect (assumed generation)
epsilon <- rnorm(n.groups * n.each) # independent measurement error (assumed generation)
y <- beta0 + z + epsilon # generate synthetic sample
summary(lm(y ~ 1)) # a naive linear model that ignores the dependence
##
## Call:
## lm(formula = y ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.949 -7.275 1.629 8.668 10.005
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.317 3.500 0.948 0.375
##
## Residual standard error: 9.898 on 7 degrees of freedom
The summary of the mixed-model
summary.lme.5 <- summary(lme.5)
summary.lme.5
## 1 2 3 4
## -1.411024 -1.598983 -1.493730 3.052394
sd(diffs) #
## [1] 2.278119
So we see that a paired t-test infers only against the within-group variability. Q: Is this a good thing? A: It depends…
library(lme4) # for lmer and the Dyestuff data
attach(Dyestuff) # yield of dyestuff from each of 6 batches
head(Dyestuff)
## Batch Yield
## 1 A 1545
## 2 A 1440
## 3 A 1440
## 4 A 1520
## 5 A 1580
## 6 B 1540
And visually
lattice::dotplot(Yield~Batch)
[Figure: dotplot of Yield by Batch (A to F)]
If we want to do inference on the (global) mean yield, we need to account for the two sources of variability: the within-batch variability and the between-batch variability. We thus fit a mixed model, with an intercept and a random batch effect.
lme.1<- lmer( Yield ~ 1 | Batch , Dyestuff )
summary(lme.1)
Things to note:
• The syntax Yield ~ 1 | Batch tells R to fit a model with a global intercept (1) and a random Batch effect
(|Batch). More on that later.
• As usual, summary is content aware and has a different behavior for lme class objects.
• The output distinguishes between random effects (𝑢), a source of variability, and fixed effects (𝛽), which we want to study. The mean of the random effect is not reported because it is assumed to be 0.
• Were we not interested in the variance components, and only in the coefficients or predictions, an (almost)
equivalent lm formulation is lm(Yield ~ Batch).
Some utility functions let us query the lme object. The function coef will work, but will return a cumbersome output.
Better use fixef to extract the fixed effects, and ranef to extract the random effects. The model matrix (of the
fixed effects alone), can be extracted with model.matrix, and predictions made with predict. Note, however, that
predictions with mixed-effect models are better treated as prediction problems as in the Supervised Learning Chapter
10, but are a very delicate matter.
detach(Dyestuff)
In the Penicillin data, we measured the diameter of the spread of an organism, along the plate used (a to x), and the penicillin type (A to F). We will now try to infer on the diameter of a typical organism, and compute its variability over plates and penicillin types.
head(Penicillin)
xtabs(~ sample + plate, Penicillin) # one way to see the balanced design: one measurement per sample-plate pair
##        plate
## sample a b c d e f g h i j k l m n o p q r s t u v w x
## A 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## B 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## C 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## D 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## E 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## F 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
And visually:
[Figure: dotplot of diameter by plate, grouped by penicillin sample (A to F)]
Let’s fit a mixed-effects model with a random plate effect, and a random sample effect:
lme.2 <- lmer ( diameter ~ 1 + (1|plate )+(1|sample) , Penicillin )
fixef(lme.2) # Fixed effects
## (Intercept)
## 22.97222
ranef(lme.2) # Random effects
## $plate
## (Intercept)
## a 0.80454389
## b 0.80454389
## c 0.18167120
## d 0.33738937
## e 0.02595303
## f -0.44120149
## g -1.37551052
## h 0.80454389
## i -0.75263783
## j -0.75263783
## k 0.96026206
## l 0.49310755
## m 1.42741658
## n 0.49310755
## o 0.96026206
## p 0.02595303
## q -0.28548332
## r -0.28548332
## s -1.37551052
## t 0.96026206
## u -0.90835601
## v -0.28548332
## w -0.59691966
## x -1.21979235
##
## $sample
## (Intercept)
## A 2.18705819
## B -1.01047625
## C 1.93789966
## D -0.09689498
## E -0.01384214
## F -3.00374447
##
## with conditional variances for "plate" "sample"
Things to note:
• The syntax 1+ (1| plate ) + (1| sample ) fits a global intercept (mean), a random plate effect, and a random
sample effect.
• Were we not interested in the variance components, an (almost) equivalent lm formulation is lm(diameter ~
plate + sample).
• The output of ranef is somewhat controversial. Think about it: Why would we want to plot the estimates of a
random variable?
Since we have two random effects, we may compute the variability of the global mean (the only fixed effect) as we did
before. Perhaps more interestingly, we can compute the variability in the response, for a particular plate or sample
type.
random.effect.lme2 <- ranef(lme.2, condVar = TRUE)
qrr2 <- lattice::dotplot(random.effect.lme2, strip = FALSE)
[Figure: dotplot of the predicted random plate effects (panel "plate"), with their conditional variability, for plates a–x.]
Variability in response for each sample type, over the various plates:
print(qrr2[[2]])
[Figure: dotplot of the predicted random sample effects (panel "sample"), with their conditional variability.]
Things to note:
• The condVar argument of the ranef function tells R to compute the variability in response conditional on each
random effect at a time.
• The dotplot function, from the lattice package, is only there for the fancy plotting.
We used the penicillin example to demonstrate the incorporation of two random-effects. We could have, however,
compared between penicillin types. For this matter, penicillin types are fixed effects to infer on, and not part of the
uncertainty in the mean diameter. The appropriate model is the following:
lme.2.2 <- lmer( diameter ~ 1 + sample + (1|plate) , Penicillin )
I may now ask myself: does the sample, i.e. penicillin, have any effect? This is what the ANOVA table typically gives
us. The next table can be thought of as a “repeated measures ANOVA”:
anova(lme.2.2)
[Figure: Reaction time over days of sleep deprivation, one lattice panel per subject (sleepstudy data).]
We now want to estimate the (fixed) effect of the days of sleep deprivation on response time, while allowing each subject to have his/her own effect. Put differently, we want to estimate a random slope for the effect of day. The fixed Days effect can be thought of as the average slope over subjects.
lme.3 <- lmer ( Reaction ~ Days + ( Days | Subject ) , data= sleepstudy )
Things to note:
• ~Days specifies the fixed effect.
• We used the Days|Subject syntax to tell R we want to fit the model ~Days within each subject.
• Were we fitting the model for purposes of prediction only, an (almost) equivalent lm formulation is
lm(Reaction~Days*Subject).
The fixed day effect is:
fixef(lme.3)
## (Intercept) Days
## 251.40510 10.46729
The variability in the average response (intercept) and day effect is
ranef(lme.3)
## $Subject
## (Intercept) Days
## 308 2.2575329 9.1992737
## 309 -40.3942719 -8.6205161
## 310 -38.9563542 -5.4495796
## 330 23.6888704 -4.8141448
## 331 22.2585409 -3.0696766
## 332 9.0387625 -0.2720535
## 333 16.8389833 -0.2233978
## 334 -7.2320462 1.0745075
## 335 -0.3326901 -10.7524799
## 337 34.8865253 8.6290208
## 349 -25.2080191 1.1730997
## 350 -13.0694180 6.6142185
## 351 4.5777099 -3.0152825
## 352 20.8614523 3.5364062
## 369 3.2750882 0.8722876
## 370 -25.6110745 4.8222518
## 371 0.8070591 -0.9881730
## 372 12.3133491 1.2842380
##
[Figure: per-subject intercepts plotted against per-subject Days slopes from the mixed model.]
Here is a comparison of the random-day effect from lme versus a subject-wise linear model. They are not the same.
[Figure: per-subject fitted lines from within-subject regressions, the mixed model, and the population average, one panel per subject.]
detach(Penicillin)
Instead, we will show how to solve this matter using the nlme package. This is because nlme allows specifying both a block-covariance structure using the mixed-models framework, and the smooth parametric covariances we find in temporal and spatial data.
The nlme::Ovary data is panel data of the number of ovarian follicles in different mares (female horses), at various times. To fit a model with an AR(1) temporal correlation, alongside random effects, we take an example from the help of nlme::corAR1.
library(nlme)
head(nlme::Ovary)
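The fitted model and its summary are not reproduced here; the example from ?nlme::corAR1, which this passage follows, is along these lines (object names are the help file's, not necessarily those used in the original text):
fm1Ovar.lme <- lme(follicles ~ sin(2*pi*Time) + cos(2*pi*Time),
                   data = Ovary, random = pdDiag(~sin(2*pi*Time)))
# the Mare grouping is implied: Ovary is a groupedData object grouped by Mare
fm2Ovar.lme <- update(fm1Ovar.lme, correlation = corAR1()) # add AR(1) errors
summary(fm2Ovar.lme)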
Things to note:
• The fitting is done with the nlme::lme function, and not lme4::lmer (which does not allow for non-blocked covariance models).
• sin(2*pi*Time) + cos(2*pi*Time) is a fixed effect that captures seasonality.
• The temporal covariance is specified using the correlation= argument.
• AR(1) was assumed by calling correlation=corAR1(). See nlme::corClasses for a list of supported correlation structures.
• From the summary, we see that a Mare random effect has also been added. Where is it specified? It is implied
by the random= argument. Read ?lme for further details.
8.4 Extensions
8.4.1 Cluster Robust Standard Errors
As previously stated, random effects are nothing more than a convenient way to specify dependencies within a level of a random effect, i.e., within a group/cluster. This is also the motivation underlying cluster robust inference, which is immensely popular with econometricians, but less so elsewhere. What is the difference between the two? The mixed models framework is a bona-fide generalization of cluster robust inference. This author thus recommends using the lme4 and nlme mixed-models packages to deal with correlations within clusters.
For a longer comparison between the two approaches, see Michael Clark's guide15 .
15 https://m-clark.github.io/docs/clustered/
16 https://cran.r-project.org/package=plm
17 https://www.jacob-long.com/post/panelr-intro/
18 https://en.wikipedia.org/wiki/Durbin%E2%80%93Wu%E2%80%93Hausman_test
19 https://cran.r-project.org/web/packages/plm/vignettes/plm.pdf
This estimator can be viewed as a least squares estimator that accounts for correlations in the data. It is also a maximum likelihood estimator under a Gaussian error assumption. Viewed as the latter, linear mixed models under a Gaussian error assumption collapse to a GLS estimator.
Now assume that the outcome of a MANOVA is measurements of an individual at several time periods. The measurements are clearly correlated, so that MANOVA may be useful. But one may also treat the subject as a random effect, with a univariate response. We thus see that this seemingly MANOVA problem can be solved with the mixed models framework.
What MANOVA problems cannot be solved with mixed models? There may be cases where the covariance of the multivariate outcome, 𝑦, is very complicated. If the covariance in 𝑦 may not be stated using a combination of random and fixed effects, then the covariance has to be stated explicitly. It is also possible to consider mixed models with multivariate outcomes, i.e., a mixed MANOVA, or hierarchical MANOVA. The R functions we present herein permit this.
23 http://rpsychologist.com/r-guide-longitudinal-lme-lmer
24 http://m-clark.github.io/posts/2019-05-14-shrinkage-in-mixed-models/
25 https://m-clark.github.io/docs/clustered/
26 https://www.datacamp.com/courses/hierarchical-and-mixed-effects-models
Chapter 9
Multivariate Data Analysis
The term “multivariate data analysis” is so broad and so overloaded, that we start by clarifying what is discussed
and what is not discussed in this chapter. Broadly speaking, we will discuss statistical inference, and leave more
“exploratory flavored” matters like clustering, and visualization, to the Unsupervised Learning Chapter 11.
We start with an example.
Example 9.1. Consider the problem of a patient monitored in the intensive care unit. At every minute the monitor
takes 𝑝 physiological measurements: blood pressure, body temperature, etc. The total number of minutes in our data
is 𝑛, so that in total, we have 𝑛 × 𝑝 measurements, arranged in a matrix. We also know the typical measurements for
this patient when healthy: 𝜇0 .
Formally, let 𝑦 be a single (random) measurement of a 𝑝-variate random vector. Denote 𝜇 ∶= 𝐸[𝑦]. Here is the set of problems we will discuss, in order of their statistical difficulty.
• Signal detection: a.k.a. multivariate hypothesis testing, i.e., testing if 𝜇 equals 𝜇0, and for 𝜇0 = 0 in particular. In our example: "are the current measurements different from the typical ones?"
• Signal counting: Counting the number of elements in 𝜇 that differ from 𝜇0, and for 𝜇0 = 0 in particular. In our example: "how many measurements differ from their typical values?"
• Signal identification: a.k.a. multiple testing, i.e., testing which of the elements in 𝜇 differ from 𝜇0, and for 𝜇0 = 0 in particular. In the ANOVA literature, this is known as a post-hoc analysis. In our example: "which measurements differ from their typical values?"
• Signal estimation: Estimating the magnitudes of the departure of 𝜇 from 𝜇0, and for 𝜇0 = 0 in particular. If estimation follows a signal detection or signal identification stage, this is known as a selective estimation problem. In our example: "what are the values of the measurements that differ from their typical values?"
• Multivariate Regression: a.k.a. MANOVA in statistical literature, and structured learning in the machine
learning literature. In our example: “what factors affect the physiological measurements?”
Example 9.2. Consider the problem of detecting regions of cognitive function in the brain using fMRI. Each measurement is the activation level at each location in a brain region. If the region has a cognitive function, the mean activation differs from 𝜇0 = 0 when the region is evoked.
Example 9.3. Consider the problem of detecting cancer encoding regions in the genome. Each measurement is the
vector of the genetic configuration of an individual. A cancer encoding region will have a different (multivariate)
distribution between sick and healthy. In particular, 𝜇 of sick will differ from 𝜇 of healthy.
Example 9.4. Consider the problem of the simplest multiple regression. The estimated coefficients, 𝛽,̂ form a random vector. Regression theory tells us that its covariance is (𝑋 ′ 𝑋)−1 𝜎2 , and that its mean is 𝛽. We thus see that inference on the vector of regression coefficients is nothing more than a multivariate inference problem.
Remark. In the above, “signal” is defined in terms of 𝜇. It is possible that the signal is not in the location, 𝜇, but
rather in the covariance, Σ. We do not discuss these problems here, and refer the reader to Nadler (2008).
Another possible question is: does a multivariate analysis give us something we cannot get from a mass-univariate analysis (i.e., a univariate analysis of each variable separately)? In Example 9.1 we could have just performed multiple univariate tests, and sound an alarm when any of the univariate detectors was triggered. The reason we want a multivariate detector, and not multiple univariate detectors, is that it is possible that each measurement alone is borderline, but together, the signal accumulates. In our ICU example it may mean that the pulse is borderline, the body temperature is borderline, etc. Analyzed simultaneously, it is clear that the patient is in distress.
The next figure1 illustrates the idea that some bi-variate measurements may seem ordinary univariately, while very
anomalous when examined bi-variately.
Remark. The following figure may also be used to demonstrate the difference between Euclidean Distance and Mahalanobis Distance.
[Figure: bivariate scatter of x and y; some points look ordinary in each coordinate separately, but are clearly anomalous when viewed jointly.]
$$ t^2(x) := \frac{(\bar{x} - \mu_0)^2}{Var[\bar{x}]} = (\bar{x} - \mu_0) \, Var[\bar{x}]^{-1} \, (\bar{x} - \mu_0), \qquad (9.1) $$
where $Var[\bar{x}] = S^2(x)/n$, and $S^2(x)$ is the unbiased variance estimator $S^2(x) := (n-1)^{-1} \sum_i (x_i - \bar{x})^2$.
Generalizing Eq.(9.1) to the multivariate case: 𝜇0 is a 𝑝-vector, 𝑥̄ is a 𝑝-vector, and 𝑉𝑎𝑟[𝑥]̄ is the 𝑝 × 𝑝 matrix of the covariances between the 𝑝 coordinates of 𝑥.̄ When operating with vectors, the squaring becomes a quadratic form, and the division becomes a matrix inverse. We thus have
$$ T^2(x) := (\bar{x} - \mu_0)' \, Var[\bar{x}]^{-1} \, (\bar{x} - \mu_0), \qquad (9.2) $$
which is the definition of Hotelling's 𝑇 2 test statistic. We typically denote the covariance between coordinates in 𝑥 with $\hat{\Sigma}(x)$, so that $\hat{\Sigma}_{k,l}(x) := \widehat{Cov}[x_k, x_l] = (n-1)^{-1} \sum_i (x_{k,i} - \bar{x}_k)(x_{l,i} - \bar{x}_l)$. Using the $\hat{\Sigma}$ notation, Eq.(9.2) becomes
$$ T^2(x) := n \, (\bar{x} - \mu_0)' \, \hat{\Sigma}(x)^{-1} \, (\bar{x} - \mu_0), \qquad (9.3) $$
1 My thanks to Efrat Vilneski for the figure.
## [1] 100 18
lattice::levelplot(x)
[Figure: levelplot of the 100 × 18 data matrix x.]
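A minimal sketch of a one-sample Hotelling test with a 𝜒2 approximation, consistent with the output and the notes that follow (assumed; the exact implementation in the original text may differ):
hotellingOneSample <- function(x, mu0 = rep(0, ncol(x))){
  n <- nrow(x); p <- ncol(x)
  stopifnot(n > 5 * p) # crude check that the problem is low dimensional
  x.bar <- colMeans(x) # sample mean vector
  Sigma.hat <- cov(x) # sample covariance matrix
  statistic <- n * (x.bar - mu0) %*% solve(Sigma.hat) %*% (x.bar - mu0)
  pvalue <- pchisq(q = statistic, df = p, lower.tail = FALSE) # chi-squared approximation
  list(statistic = statistic, pvalue = pvalue)
}
hotellingOneSample(x)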
## $statistic
## [,1]
## [1,] 17.22438
##
## $pvalue
## [,1]
## [1,] 0.5077323
Things to note:
• stopifnot(n > 5 * p) is a little verification to check that the problem is indeed low dimensional. Otherwise,
the 𝜒2 approximation cannot be trusted.
• solve returns a matrix inverse.
• %*% is the matrix product operator (see also crossprod()).
• A function may return only a single object, so we wrap the statistic and its p-value in a list object.
Just for verification, we compare our home-made Hotelling's test to the implementation in the rrcov package. The statistic is clearly OK, but our 𝜒2 approximation of the distribution leaves something to be desired. Personally, I would never trust a Hotelling test if 𝑛 is not much greater than 𝑝, in which case I would use a high-dimensional adaptation (see Bibliography).
rrcov::T2.test(x)
##
## One-sample Hotelling test
##
## data: x
## T2 = 17.22400, F = 0.79259, df1 = 18, df2 = 82, p-value = 0.703
## alternative hypothesis: true mean vector is not equal to (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
##
## sample estimates:
## [,1] [,2] [,3] [,4] [,5]
## mean x-vector -0.01746212 0.03776332 0.1006145 -0.2083005 0.1026982
## [,6] [,7] [,8] [,9] [,10]
## mean x-vector -0.05220043 -0.009497987 -0.1139856 0.02851701 -0.03089953
## [,11] [,12] [,13] [,14] [,15]
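The Simes(x) global test whose p-value is reported below combines coordinate-wise t-test p-values; a minimal sketch (assumed, not the original implementation):
Simes <- function(x){
  p.vals <- apply(x, 2, function(y) t.test(y)$p.value) # coordinate-wise p-values
  p <- length(p.vals)
  simes <- min(sort(p.vals) * p / seq_len(p)) # Simes' combination of the p-values
  c(pvalue = min(simes, 1))
}
Simes(x)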
## pvalue
## 0.6398998
And now we verify that both tests can indeed detect signal when present. Are p-values small enough to reject the “no
signal” null hypothesis?
mu <- rep(x = 10/p,times=p) # inject signal
x <- rmvnorm(n = n, mean = mu)
hotellingOneSample(x)
## $statistic
## [,1]
## [1,] 686.8046
##
## $pvalue
## [,1]
## [1,] 3.575926e-134
Simes(x)
## pvalue
## 2.765312e-10
… yes. All p-values are very small, so that all statistics can detect the non-null distribution.
Remark. In the sparsity or multiple-testing literature, what we call "signal counting" is known as "adapting to sparsity", or "adaptivity".
In the ANOVA literature, an identification stage will typically follow a detection stage. These are known as the
omnibus F test, and post-hoc tests, respectively. In the multiple testing literature there will typically be no preliminary
detection stage. It is typically assumed that signal is present, and the only question is “where?”
The first question when approaching a multiple testing problem is “what is an error”? Is an error declaring a coordinate
in 𝜇 to be different than 𝜇0 when it is actually not? Is an error an overly high proportion of falsely identified coordinates?
The former is known as the family wise error rate (FWER), and the latter as the false discovery rate (FDR).
Remark. These types of errors have many names in many communities. See the Wikipedia entry on ROC3 for a table
of the (endless) possible error measures.
a. Because we want to make inference variable-wise, so it is natural to start with variable-wise statistics.
b. Because we want to avoid dealing with covariances if possible. Computing variable-wise p-values does not require
estimating covariances.
c. So that the identification problem is decoupled from the variable-wise inference problem, and may be applied
much more generally than in the setup we presented.
We start by generating some high-dimensional multivariate data and computing the coordinate-wise (i.e. hypothesis-wise) p-values.
library(mvtnorm)
n <- 1e1
p <- 1e2
mu <- rep(0,p)
x <- rmvnorm(n = n, mean = mu)
dim(x)
## [1] 10 100
lattice::levelplot(x)
3
80
2
1
60
column
0
40
−1
−2
20
−3
26
row
We now compute the p-values of each coordinate. We use a coordinate-wise t-test. Why a t-test? Because for the purpose of demonstration we want a simple test. In reality, you may use any test that returns valid p-values.
t.pval <- function(y) t.test(y)$p.value
p.values <- apply(X = x, MARGIN = 2, FUN = t.pval)
plot(p.values, type='h')
3 https://en.wikipedia.org/wiki/Receiver_operating_characteristic
[Figure: the 100 coordinate-wise p-values plotted against their index.]
Things to note:
We are now ready to do the identification, i.e., find which coordinates of 𝜇 differ from 𝜇0 = 0. The workflow for identification has the same structure, regardless of the desired error guarantees:
If we want 𝐹 𝑊 𝐸𝑅 ≤ 0.05, meaning that we allow a 5% probability of making any mistake, we will use the
method="holm" argument of p.adjust.
alpha <- 0.05
p.values.holm <- p.adjust(p.values, method = 'holm' )
which(p.values.holm < alpha)
## integer(0)
If we want 𝐹 𝐷𝑅 ≤ 0.05, meaning that we allow the proportion of false discoveries to be no larger than 5%, we use the
method="BH" argument of p.adjust.
alpha <- 0.05
p.values.BH <- p.adjust(p.values, method = 'BH' )
which(p.values.BH < alpha)
## integer(0)
We now inject some strong signal in 𝜇 just to see that the process works. We will artificially inject signal in the first
10 coordinates.
mu[1:10] <- 2 # inject signal in first 10 variables
x <- rmvnorm(n = n, mean = mu) # generate data
p.values <- apply(X = x, MARGIN = 2, FUN = t.pval)
p.values.BH <- p.adjust(p.values, method = 'BH' )
which(p.values.BH < alpha)
## [1] 1 2 3 4 5 6 7 9 10 55
Indeed, we are now able to detect that the first coordinates carry a signal, because their respective coordinate-wise null hypotheses have been rejected (coordinate 55 is a false discovery, which the FDR criterion permits).
a. Use Hotelling’s test to determine if 𝜇 equals 𝜇0 = 0. Can you detect the signal?
b. Perform t.test on each variable and extract the p-value. Try to identify visually the variables which depart
from 𝜇0 .
c. Use p.adjust to identify in which variables there are any departures from 𝜇0 = 0. Allow 5% probability of
making any false identification.
d. Use p.adjust to identify in which variables there are any departures from 𝜇0 = 0. Allow a 5% proportion
of errors within identifications.
2. Generate multivariate data from two groups: rmvnorm(n = 100, mean = rep(0,10)) for the first, and
rmvnorm(n = 100, mean = rep(0.1,10)) for the second.
a. Do we agree the groups differ?
b. Implement the two-group Hotelling test described in Wikipedia: (https://en.wikipedia.org/wiki/Hotelling%
27s_T-squared_distribution#Two-sample_statistic).
c. Verify that you are able to detect that the groups differ.
d. Perform a two-group t-test on each coordinate. On which coordinates can you detect signal while controlling
the FWER? On which while controlling the FDR? Use p.adjust.
3. Return to the previous problem, but set n=9. Verify that you cannot compute your Hotelling statistic.
4 You might find this shocking, but it does mean that you cannot trust the summary table of a model that was selected from a multitude
of models.
5 http://www.stat.berkeley.edu/~wfithian/
6 https://cran.r-project.org/web/views/gR.html
Chapter 10
Supervised Learning
Machine learning is very similar to statistics, but it is certainly not the same. As the name suggests, in machine learning we want machines to learn. This means that we want to replace hard-coded expert algorithms with data-driven, self-learned algorithms.
There are many learning setups, that depend on what information is available to the machine. The most common
setup, discussed in this chapter, is supervised learning. The name takes from the fact that by giving the machine data
samples with known inputs (a.k.a. features) and desired outputs (a.k.a. labels), the human is effectively supervising
the learning. If we think of the inputs as predictors, and outcomes as predicted, it is no wonder that supervised
learning is very similar to statistical prediction. When asked “are these the same?” I like to give the example of
internet fraud. If you take a sample of fraud "attacks", a statistical formulation of the problem is unlikely to be appropriate. This is because fraud events are not randomly drawn from some distribution, but rather arrive from an adversary learning the defenses and adapting to them. This instance of supervised learning is more similar to game theory than statistics.
Other types of machine learning problems include (Sammut and Webb, 2011):
• Unsupervised Learning: Where we merely analyze the inputs/features, but no desirable outcome is available
to the learning machine. See Chapter 11.
• Semi Supervised Learning: Where only part of the samples are labeled. A.k.a. co-training, learning from
labeled and unlabeled data, transductive learning.
• Active Learning: Where the machine is allowed to query the user for labels. Very similar to adaptive design
of experiments.
• Learning on a Budget: A version of active learning where querying for labels induces variable costs.
• Weak Learning: A version of supervised learning where the labels are given not by an expert, but rather
by some heuristic rule. Example: mass-labeling cyber attacks by a rule based software, instead of a manual
inspection.
• Reinforcement Learning: Similar to active learning, in that the machine may query for labels. Different from active learning, in that the machine does not receive labels, but rewards.
• Structure Learning: An instance of supervised learning where we predict objects with structure such as
dependent vectors, graphs, images, tensors, etc.
• Online Learning: An instance of supervised learning, where we need to make predictions as the data arrives in a stream.
• Transduction: An instance of supervised learning where we need to make predictions for a new set of predictors,
but which are known at the time of learning. Can be thought of as semi-supervised extrapolation.
• Covariate shift: An instance of supervised learning where we need to make predictions for a set of predictors that has a different distribution than the data generating source.
• Targeted Learning: A form of supervised learning, designed at causal inference for decision making.
115
10.1. PROBLEM SETUP CHAPTER 10. SUPERVISED LEARNING
• Co-training: An instance of supervised learning where we solve several problems, and exploit some assumed
relation between the problems.
• Manifold learning: An instance of unsupervised learning, where the goal is to reduce the dimension of the
data by embedding it into a lower dimensional manifold. A.k.a. support estimation.
• Similarity Learning: Where we try to learn how to measure similarity between objects (like faces, texts,
images, etc.).
• Metric Learning: Like similarity learning, only that the similarity has to obey the definition of a metric.
• Learning to learn: Deals with the carriage of "experience" from one learning problem to another. A.k.a. cumulative learning, knowledge transfer, and meta learning.
Example 10.1 (Spam Detection). Consider the problem of predicting if a mail is spam or not based on its attributes: length, number of exclamation marks, number of recipients, etc.
Given 𝑛 samples with inputs 𝑥 from some space 𝒳 and desired outcome, 𝑦, from some space 𝒴. In our example, 𝑦 is the spam/no-spam label, and 𝑥 is a vector of the mail's attributes. Samples, (𝑥, 𝑦), have some distribution we denote 𝑃 . We want to learn a function that maps inputs to outputs, i.e., that classifies a mail as spam given its attributes. This function is called a hypothesis, or predictor, denoted 𝑓, that belongs to a hypothesis class ℱ such that 𝑓 ∶ 𝒳 → 𝒴. We also choose some other function that fines us for erroneous prediction. This function is called the loss, and we denote it by 𝑙 ∶ 𝒴 × 𝒴 → ℝ+ .
Remark. The hypothesis in machine learning is only vaguely related to the hypothesis in statistical testing, which is quite confusing.
Remark. The hypothesis in machine learning is not a bona-fide statistical model since we don’t assume it is the data
generating process, but rather some function which we choose for its good predictive performance.
The fundamental task in supervised (statistical) learning is to recover a hypothesis that minimizes the average loss in the population, and not merely in the observed sample. This is known as the risk minimization problem.
Definition 10.1 (Risk Function). The risk function, a.k.a. generalization error, or test error, is the population average loss of a predictor 𝑓:
$$ R(f) := \mathbb{E}_P\big[ l(f(x), y) \big]. \qquad (10.2) $$
Another fundamental problem is that we do not know the distribution of all possible inputs and outputs, 𝑃 . We typically only have a sample of (𝑥𝑖 , 𝑦𝑖 ), 𝑖 = 1, … , 𝑛. We thus state the empirical counterpart of (10.2), which consists of minimizing the average loss in the sample. This is known as the empirical risk minimization problem (ERM).
Definition 10.2 (Empirical Risk). The empirical risk function, a.k.a. in-sample error, or train error, is the sample average loss of a predictor 𝑓:
$$ R_n(f) := \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i). \qquad (10.3) $$
A good candidate proxy for 𝑓 ∗ is its empirical counterpart, 𝑓,̂ known as the empirical risk minimizer:
𝑓 ̂ ∶= 𝑎𝑟𝑔𝑚𝑖𝑛𝑓 {𝑅𝑛 (𝑓)}. (10.4)
When using a linear hypothesis with squared loss, we see that the empirical risk minimization problem collapses to an
ordinary least-squares problem:
𝑓 ̂ ∶= 𝑎𝑟𝑔𝑚𝑖𝑛𝛽 {1/𝑛 ∑(𝑥′𝑖 𝛽 − 𝑦𝑖 )2 }. (10.6)
𝑖
When data samples are assumed independent, then maximum likelihood estimation is also an instance of ERM, when using the (negative) log likelihood as the loss function.
If we don’t assume any structure on the hypothesis, 𝑓, then 𝑓 ̂ from (10.4) will interpolate the data, and 𝑓 ̂ will be a
very bad predictor. We say, it will overfit the observed data, and will have bad performance on new data.
We have several ways to avoid overfitting:
1. Restrict the hypothesis class ℱ (such as linear functions).
2. Penalize for the complexity of 𝑓. The penalty denoted by ‖𝑓‖.
3. Unbiased risk estimation: 𝑅𝑛 (𝑓) is not an unbiased estimator of 𝑅(𝑓). Why? Think of estimating the mean
with the sample minimum… Because 𝑅𝑛 (𝑓) is downward biased, we may add some correction term, or compute
𝑅𝑛 (𝑓) on different data than the one used to recover 𝑓.̂
Almost all ERM algorithms consist of some combination of all the three methods above.
Collecting ideas from the above sections, a typical supervised learning pipeline will include: choosing the hypothesis class, choosing the penalty function and level, and choosing an unbiased risk estimator. We emphasize that choosing the penalty function, ‖𝑓‖, is not enough, and we need to choose how "hard" to apply it. This is known as the regularization level, denoted by 𝜆 in Eq.(10.7).
Examples of such combos include:
1. Linear regression, no penalty, train-validate test.
2. Linear regression, no penalty, AIC.
3. Linear regression, 𝑙2 penalty, V-fold CV. This combo is typically known as ridge regression.
4. Linear regression, 𝑙1 penalty, V-fold CV. This combo is typically known as LASSO regression.
5. Linear regression, 𝑙1 and 𝑙2 penalty, V-fold CV. This combo is typically known as elastic net regression.
6. Logistic regression, 𝑙2 penalty, V-fold CV.
7. SVM classification, 𝑙2 penalty, V-fold CV.
8. Deep network, no penalty, V-fold CV.
9. Unrestricted, ‖𝜕 2 𝑓‖2 , V-fold CV. This combo is typically known as a smoothing spline.
For fans of statistical hypothesis testing we will also emphasize: Testing and prediction are related, but are not the
same:
• In the current chapter, we do not claim our models, 𝑓, are generative. I.e., we do not claim that there is some
causal relation between 𝑥 and 𝑦. We only claim that 𝑥 predicts 𝑦.
• It is possible that we will want to ignore a significant predictor, and add a non-significant one (Foster and Stine,
2004).
• Some authors will use hypothesis testing as an initial screening for candidate predictors. This is a useful heuristic, but that is all it is: a heuristic. It may also fail miserably if predictors are linearly dependent (a.k.a. multicollinear).
In these examples, I will use two data sets from the ElemStatLearn package, that accompanies the seminal book by
Friedman et al. (2001). I use the spam data for categorical predictions, and prostate for continuous predictions. In
spam we will try to decide if a mail is spam or not. In prostate we will try to predict the size of a cancerous tumor.
You can now call ?prostate and ?spam to learn more about these data sets.
We also define some utility functions that we will require down the road.
l2 <- function(x) x^2 %>% sum %>% sqrt
l1 <- function(x) abs(x) %>% sum
MSE <- function(x) x^2 %>% mean
missclassification <- function(tab) sum(tab[c(2,3)])/sum(tab)
## [1] 0.4383709
# Test error:
MSE( predict(ols.1, newdata=prostate.test)- prostate.test$lcavol)
## [1] 0.5084068
Things to note:
• I use the newdata argument of the predict function to make the out-of-sample predictions required to compute
the test-error.
• The test error is larger than the train error. That is overfitting in action.
We now implement a V-fold CV, instead of our train-test approach. The assignment of each observation to each fold
is encoded in fold.assignment. The following code is extremely inefficient, but easy to read.
folds <- 10
fold.assignment <- sample(1:folds, nrow(prostate), replace = TRUE)
errors <- NULL
for (k in 1:folds){
prostate.cross.train <- prostate[fold.assignment!=k,] # train subset
prostate.cross.test <- prostate[fold.assignment==k,] # test subset
.ols <- lm(lcavol~. ,data = prostate.cross.train) # train
.predictions <- predict(.ols, newdata=prostate.cross.test)
.errors <- .predictions-prostate.cross.test$lcavol # save prediction errors in the fold
errors <- c(errors, .errors) # aggregate error over folds.
}
MSE(errors)
## [1] 0.5742128
Let’s try all possible variable subsets, and choose the best performer with respect to the Cp criterion, which is an
unbiased risk estimator. This is done with leaps::regsubsets. We see that the best performer has 3 predictors.
regfit.full <- prostate.train %>%
leaps::regsubsets(lcavol~.,data = ., method = 'exhaustive') # best subset selection
plot(regfit.full, scale = "Cp")
[Figure: leaps::regsubsets models plotted by Cp; the best subset (Cp = 2.7) uses three of the predictors (lweight, age, lbph, svi, lcp, gleason, pgg45, lpsa).]
Things to note:
• The plot shows us which is the variable combination which is the best, i.e., has the smallest Cp.
• Scanning over all variable subsets is impossible when the number of variables is large.
Instead of the Cp criterion, we now compute the train and test errors for all the possible predictor subsets2 . In the
resulting plot we can see overfitting in action.
2 Example taken from https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/ch6.html
Plotting results.
plot(train.errors, ylab = "MSE", pch = 19, type = "o")
points(val.errors, pch = 19, type = "b", col="blue")
legend("topright",
legend = c("Training", "Validation"),
col = c("black", "blue"),
pch = 19)
[Figure: train (black) and validation (blue) MSE against the number of predictors in the subset (1–8).]
Checking all possible models is computationally very hard. Forward selection is a greedy approach that adds one
variable at a time.
ols.0 <- lm(lcavol~1 ,data = prostate.train)
model.scope <- list(upper=ols.1, lower=ols.0)
step(ols.0, scope=model.scope, direction='forward', trace = TRUE)
## Start: AIC=30.1
## lcavol ~ 1
##
## Df Sum of Sq RSS AIC
## + lpsa 1 54.776 47.130 -19.570
## + lcp 1 48.805 53.101 -11.578
## + svi 1 35.829 66.077 3.071
## + pgg45 1 23.789 78.117 14.285
## + gleason 1 18.529 83.377 18.651
## + lweight 1 9.186 92.720 25.768
## + age 1 8.354 93.552 26.366
## <none> 101.906 30.097
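The dummy-coded spam fit behind the confusion matrices below is not shown; a sketch consistent with its later usage (the recoding and object names spam.train.dummy, spam.test.dummy, ols.2 are assumptions):
spam.train.dummy <- spam.train
spam.train.dummy$spam <- as.numeric(spam.train$spam == 'spam') # recode the outcome as 0/1
spam.test.dummy <- spam.test
spam.test.dummy$spam <- as.numeric(spam.test$spam == 'spam')
ols.2 <- lm(spam~., data = spam.train.dummy) # OLS on the 0/1 outcome
# make in-sample predictions, thresholded at 0.5
.predictions.train <- predict(ols.2) > 0.5
# inspect the confusion matrix
(confusion.train <- table(prediction=.predictions.train, truth=spam.train.dummy$spam))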
## truth
## prediction 0 1
## FALSE 1778 227
## TRUE 66 980
# compute the train (in sample) misclassification
missclassification(confusion.train)
3 https://en.wikipedia.org/wiki/Akaike_information_criterion
## [1] 0.09603409
# make out-of-sample prediction
.predictions.test <- predict(ols.2, newdata = spam.test.dummy) > 0.5
# inspect the confusion matrix
(confusion.test <- table(prediction=.predictions.test, truth=spam.test.dummy$spam))
## truth
## prediction 0 1
## FALSE 884 139
## TRUE 60 467
# compute the train (in sample) misclassification
missclassification(confusion.test)
## [1] 0.1283871
Things to note:
• I can use lm for categorical outcomes. lm will simply dummy-code the outcome.
• A linear predictor trained on 0's and 1's will predict numbers. Think of these numbers as the probability of 1, and my prediction is the most probable class: predict()>0.5.
• The train error is smaller than the test error. This is overfitting in action.
The glmnet package is an excellent package that provides ridge, LASSO, and elastic net regularization, for all GLMs,
so for linear models in particular.
suppressMessages(library(glmnet))
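The construction of the scaled design matrices and the ridge fit are not shown; a sketch consistent with the notes below (X.train, y.train, X.test, y.test, means, sds, and ridge.2 follow their later usage; the rest is assumed):
X.train <- prostate.train %>% subset(select = -lcavol) %>% as.matrix
y.train <- prostate.train$lcavol
X.test <- prostate.test %>% subset(select = -lcavol) %>% as.matrix
y.test <- prostate.test$lcavol
means <- colMeans(X.train) # z-scoring is learned from the train set only
sds <- apply(X.train, 2, sd)
X.train.scaled <- X.train %>%
  sweep(MARGIN = 2, STATS = means, FUN = `-`) %>%
  sweep(MARGIN = 2, STATS = sds, FUN = `/`)
ridge.2 <- glmnet(x = X.train.scaled, y = y.train, family = 'gaussian', alpha = 0) # ridge regression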
# Train error:
MSE( predict(ridge.2, newx =X.train.scaled)- y.train)
## [1] 1.006028
# Test error:
X.test.scaled <- X.test %>% sweep(MARGIN = 2, STATS = means, FUN = `-`) %>%
sweep(MARGIN = 2, STATS = sds, FUN = `/`)
MSE(predict(ridge.2, newx = X.test.scaled)- y.test)
## [1] 0.7678264
Things to note:
• The alpha=0 parameter tells R to do ridge regression. Setting 𝑎𝑙𝑝ℎ𝑎 = 1 will do LASSO, and any other value will return an elastic net with appropriate weights.
• The family='gaussian' argument tells R to fit a linear model, with least squares loss.
• Features for regularized predictors should be z-scored before learning.
• We use the sweep function to z-score the predictors: we learn the z-scoring from the train set, and apply it to
both the train and the test.
• The test error is smaller than the train error. This may happen because risk estimators are random. Their
variance may mask the overfitting.
We now use the LASSO penalty.
lasso.1 <- glmnet(x=X.train.scaled, y=y.train, , family='gaussian', alpha = 1)
# Train error:
MSE( predict(lasso.1, newx =X.train.scaled)- y.train)
## [1] 0.5525279
# Test error:
MSE( predict(lasso.1, newx = X.test.scaled)- y.test)
## [1] 0.5211263
Things to note:
• We used cv.glmnet to do an automatic search for the optimal level of regularization (the lambda argument in
glmnet) using V-fold CV.
• Just like the glm function, 'family='binomial' is used for logistic regression.
• We z-scored features so that they all have the same scale.
• We set alpha=0 for an 𝑙2 penalization of the coefficients of the logistic regression.
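The ridge-penalized logistic fit referenced in these notes is not shown; a sketch consistent with them (X.train.spam.scaled, means.spam, sds.spam, and logistic.2 follow the code below; details are assumptions):
X.train.spam <- spam.train %>% subset(select = -spam) %>% as.matrix
X.test.spam <- spam.test %>% subset(select = -spam) %>% as.matrix
means.spam <- colMeans(X.train.spam) # z-scoring learned from the train set
sds.spam <- apply(X.train.spam, 2, sd)
X.train.spam.scaled <- X.train.spam %>%
  sweep(MARGIN = 2, STATS = means.spam, FUN = `-`) %>%
  sweep(MARGIN = 2, STATS = sds.spam, FUN = `/`) %>% as.matrix
logistic.2 <- cv.glmnet(x = X.train.spam.scaled, y = spam.train$spam,
                        family = 'binomial', alpha = 0) # l2-penalized logistic regression, V-fold CV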
# Train confusion matrix:
.predictions.train <- predict(logistic.2, newx = X.train.spam.scaled, type = 'class')
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
## truth
## prediction email spam
## email 1778 167
## spam 66 1040
# Train misclassification error
missclassification(confusion.train)
## [1] 0.0763684
# Test confusion matrix:
X.test.spam.scaled <- X.test.spam %>% sweep(MARGIN = 2, STATS = means.spam, FUN = `-`) %>%
sweep(MARGIN = 2, STATS = sds.spam, FUN = `/`) %>% as.matrix
## truth
## prediction email spam
## email 885 110
## spam 59 496
# Test misclassification error:
missclassification(confusion.test)
## [1] 0.1090323
10.2.2 SVM
A support vector machine (SVM) is a linear hypothesis class with a particular loss function known as a hinge loss4 .
We learn an SVM with the svm function from the e1071 package, which is merely a wrapper for the libsvm5 C library;
the most popular implementation of SVM today.
library(e1071)
svm.1 <- svm(spam~., data = spam.train, kernel='linear')
## truth
## prediction email spam
## email 1774 106
## spam 70 1101
missclassification(confusion.train)
## [1] 0.057686
# Test confusion matrix:
.predictions.test <- predict(svm.1, newdata = spam.test)
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
## truth
## prediction email spam
## email 876 75
## spam 68 531
missclassification(confusion.test)
## [1] 0.09225806
We can also use SVM for regression.
svm.2 <- svm(lcavol~., data = prostate.train, kernel='linear')
# Train error:
MSE( predict(svm.2)- prostate.train$lcavol)
## [1] 0.4488577
# Test error:
MSE( predict(svm.2, newdata = prostate.test)- prostate.test$lcavol)
## [1] 0.5547759
Things to note:
• The use of kernel='linear' forces the predictor to be linear. Various hypothesis classes may be used by
changing the kernel argument.
4 https://en.wikipedia.org/wiki/Hinge_loss
5 https://www.csie.ntu.edu.tw/~cjlin/libsvm/
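The neural-network fit producing the errors below is not shown; a sketch of a single-hidden-layer net with the nnet package (the linout=TRUE flag and tuning values are assumptions):
library(nnet)
nnet.1 <- nnet(lcavol~., size = 5, data = prostate.train,
               linout = TRUE, decay = 5e-4, maxit = 1000, trace = FALSE) # linout=TRUE for a continuous outcome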
# Train error:
MSE( predict(nnet.1)- prostate.train$lcavol)
## [1] 1.177099
# Test error:
MSE( predict(nnet.1, newdata = prostate.test)- prostate.test$lcavol)
## [1] 1.21175
And nnet classification.
nnet.2 <- nnet(spam~., size=5, data=spam.train, rang = 0.1, decay = 5e-4, maxit = 1000, trace=FALSE)
## truth
## prediction email spam
## email 1806 59
## spam 38 1148
missclassification(confusion.train)
## [1] 0.03179285
# Test confusion matrix:
.predictions.test <- predict(nnet.2, newdata = spam.test, type='class')
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
## truth
## prediction email spam
## email 897 64
## spam 47 542
missclassification(confusion.test)
## [1] 0.0716129
6 https://cran.r-project.org/package=tensorflow
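The tree fit itself is not shown; a minimal sketch with the rpart package (assumed):
library(rpart)
tree.1 <- rpart(lcavol~., data = prostate.train) # a regression tree for lcavol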
# Train error:
MSE( predict(tree.1)- prostate.train$lcavol)
## [1] 0.4909568
# Test error:
MSE( predict(tree.1, newdata = prostate.test)- prostate.test$lcavol)
## [1] 0.5623316
We can use the rpart.plot package to visualize and interpret the predictor.
rpart.plot::rpart.plot(tree.1)
[Figure: the fitted regression tree; the root splits on lcp < 0.26, with further splits on lpsa < 2.4, lcp < 1.7, and lweight >= 3.2.]
Or the newer ggparty7 package, for trees fitted with the party8 package.
Trees are very prone to overfitting. To avoid this, we reduce a tree’s complexity by pruning it. This is done with the
rpart::prune function (not demonstrated herein).
We now fit a classification tree.
tree.2 <- rpart(spam~., data=spam.train)
## truth
## prediction email spam
## email 1785 217
## spam 59 990
missclassification(confusion.train)
## [1] 0.09046214
# Test confusion matrix:
.predictions.test <- predict(tree.2, newdata = spam.test, type='class')
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
## truth
## prediction email spam
## email 906 125
7 https://cran.r-project.org/web/packages/ggparty/vignettes/ggparty-graphic-partying.html
8 https://cran.r-project.org/package=party
## spam 38 481
missclassification(confusion.test)
## [1] 0.1051613
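The caret call that produced the summary below is not reproduced; a sketch consistent with the notes that follow (tr.control and tree.3 follow their later usage; the rest is assumed):
library(caret)
tr.control <- trainControl(method = "cv", number = 10) # 10-fold CV for choosing the pruning level
tree.3 <- train(lcavol~., data = prostate.train, method = 'rpart', trControl = tr.control)
tree.3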
## CART
##
## 67 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 61, 60, 59, 60, 60, 61, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.04682924 0.9118374 0.5026786 0.7570798
## 0.14815712 0.9899308 0.4690557 0.7972803
## 0.44497285 1.1912870 0.3264172 1.0008574
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.04682924.
# Train error:
MSE( predict(tree.3)- prostate.train$lcavol)
## [1] 0.6188435
# Test error:
MSE( predict(tree.3, newdata = prostate.test)- prostate.test$lcavol)
## [1] 0.545632
Things to note:
• A tree was trained because of the method='rpart' argument. Many other predictive models are available. See
here10 .
• The pruning of the tree was done automatically by the caret::train() function.
• The method of pruning is controlled by a control object, generated with the caret::trainControl() function.
In our case, method = "cv" for cross-validation, and number = 10 for 10-folds.
• The train error is larger than the test error. This is possible because the tree is not an ERM on the train data.
Rather, it is an ERM on the variations of the data generated by the cross-validation process.
9 http://topepo.github.io/caret/available-models.html#
10 http://topepo.github.io/caret/available-models.html
## truth
## prediction email spam
## email 856 86
## spam 88 520
missclassification(confusion.test)
## [1] 0.1122581
## truth
## prediction email spam
## email 1776 227
## spam 68 980
missclassification(confusion.train)
## [1] 0.09668961
11 https://github.com/tidymodels/parsnip
12 https://twitter.com/topepos
13 https://en.wikipedia.org/wiki/Instance-based_learning
## truth
## prediction email spam
## email 884 138
## spam 60 468
missclassification(confusion.test)
## [1] 0.1277419
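The Naive Bayes fit producing the confusion matrices below is not shown; a minimal sketch with e1071 (assumed):
nb.1 <- e1071::naiveBayes(spam~., data = spam.train)
# Train confusion matrix:
.predictions.train <- predict(nb.1, newdata = spam.train)
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))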
## truth
## prediction email spam
## email 1025 55
## spam 819 1152
missclassification(confusion.train)
## [1] 0.2864635
# Test confusion matrix:
.predictions.test <- predict(nb.1, newdata = spam.test)
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
## truth
## prediction email spam
## email 484 42
## spam 460 564
missclassification(confusion.test)
## [1] 0.323871
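The random forest fit whose caret summary follows is not shown; a sketch (assumed), tuning mtry with 10-fold CV as in the output:
rf.1 <- caret::train(lcavol~., data = prostate.train, method = 'rf',
                     trControl = caret::trainControl(method = "cv", number = 10))
rf.1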
## Random Forest
##
## 67 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 62, 59, 60, 60, 59, 61, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.7885535 0.6520820 0.6684168
## 5 0.7782809 0.6687843 0.6550590
## 8 0.7894338 0.6665277 0.6626417
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
# Train error:
MSE( predict(rf.1)- prostate.train$lcavol)
## [1] 0.1340291
# Test error:
MSE( predict(rf.1, newdata = prostate.test)- prostate.test$lcavol)
## [1] 0.5147782
Some of the many many many packages that learn random-forests include: randomForest14 , ranger15 .
10.2.9 Boosting
The fundamental idea behind Boosting is to construct a predictor as the sum of several "weak" predictors. These weak predictors are not trained on the same data. Instead, each predictor is trained on the residuals of the previous one. Think of it this way: the first predictor targets the strongest signal. The second targets what the first did not predict. Etc. At some point, the residuals cannot be predicted anymore, and the learning will stabilize. Boosting is typically, but not necessarily, implemented as a sum of trees (see the trees section above).
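As an illustration (not from the original text), a minimal boosted-trees sketch with the gbm package on the prostate data:
library(gbm)
gbm.1 <- gbm(lcavol~., data = prostate.train, distribution = "gaussian",
             n.trees = 100, interaction.depth = 2, shrinkage = 0.1) # 100 shallow trees, each fit to the residuals of the previous ones
# Test error:
MSE(predict(gbm.1, newdata = prostate.test, n.trees = 100) - prostate.test$lcavol)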
20 https://www.datacamp.com/courses/supervised-learning-in-r-classification
21 https://www.datacamp.com/courses/supervised-learning-in-r-regression
Chapter 11
Unsupervised Learning
This chapter deals with machine learning problems which are unsupervised. This means the machine has access to a set of inputs, 𝑥, but the desired outcome, 𝑦, is not available. Clearly, learning a relation between inputs and outcomes is impossible, but there are still a lot of problems of interest. In particular, we may want to find a compact representation of the inputs, be it for visualization or further processing. This is the problem of dimensionality reduction. For the same reasons we may want to group similar inputs. This is the problem of clustering.
In the statistical terminology, with some exceptions, this chapter can be thought of as multivariate exploratory
statistics. For multivariate inference, see Chapter 9.
Example 11.2. Consider the correctness of the answers to a questionnaire with 𝑝 questions. The data may seemingly
reside in a 𝑝 dimensional space, but if there is a thing such as “skill”, then given the correctness of a person’s reply to
a subset of questions, we have a good idea how he scores on the rest. If skill is indeed a one dimensional quality, then
the questionnaire data should organize around a single line in the 𝑝 dimensional cube.
Example 11.3. Consider 𝑛 microphones recording an individual. The digitized recording consists of 𝑝 samples. Are the recordings really a shapeless cloud of 𝑛 points in ℝ𝑝 ? Since they all record the same sound, one would expect the 𝑛 𝑝-dimensional points to arrange around the original, noiseless, sound: a single point in ℝ𝑝 . If microphones have different distances to the source, volumes and echoes may differ. We would thus expect the 𝑛 points to arrange about a line in ℝ𝑝 .
a score. He did not care about variance maximization, however. He simply wanted a small set of coordinates in some
(linear) space that approximates the original data well.
Before we proceed, we give an example to fix ideas. Consider the crime rate data in USArrests, which encodes reported murder events, assaults, rapes, and the urban population of each American state.
head(USArrests)
is most convenient, mathematically, to constrain the 𝑙2 norm to some constant: ‖𝑣‖22 = ∑ 𝑣𝑗2 = 1. The first "best separating score", known as the first principal component (PC), is thus
$$ v_1 := \operatorname{argmax}_{v:\,\|v\|_2 = 1} \; Var[x'v]. $$
The second PC is the same, only that it is required to be orthogonal to the first PC:
$$ v_2 := \operatorname{argmax}_{v:\,\|v\|_2 = 1,\; v'v_1 = 0} \; Var[x'v]. $$
The construction of the next PCs follows the same lines: find a linear transformation of the data that best separates observations and is orthogonal to the previous PCs.
but you may think of it as the number of free coordinates needed to navigate along the manifold.
Figure 11.1: Various embedding algorithms. No embedding of the sphere to the plane is perfect. This is obviously not
new. Maps makers have known this for centuries!
Example 11.4. Assume 𝑛 respondents answer 𝑝 quantitative questions: 𝑥𝑖 ∈ ℝ𝑝 , 𝑖 = 1, … , 𝑛. Also assume, their
responses are some linear function of a single personality attribute, 𝑠𝑖 . We can think of 𝑠𝑖 as the subject’s “intelligence”.
We thus have
𝑥𝑖 = 𝑠 𝑖 𝐴 + 𝜀 𝑖 (11.1)
And in matrix notation:
𝑋 = 𝑆𝐴 + 𝜀, (11.2)
where 𝐴 is the 𝑞 × 𝑝 matrix of factor loadings, and 𝑆 the 𝑛 × 𝑞 matrix of latent personality traits. In our particular
example where 𝑞 = 1, the problem is to recover the unobservable intelligence scores, 𝑠1 , … , 𝑠𝑛 , from the observed
answers 𝑋.
We may try to estimate 𝑆𝐴 by assuming some distribution on 𝑆 and 𝜀 and apply maximum likelihood. Under standard
assumptions on the distribution of 𝑆 and 𝜀, recovering 𝑆 from 𝑆𝐴̂ is still impossible as there are infinitely many such
solutions. In the statistical parlance we say the problem is non identifiable, and in the applied mathematics parlance
we say the problem is ill posed.
Remark. The non-uniqueness (non-identifiability) of the FA solution under variable rotation is never mentioned in the
PCA context. Why is this? This is because PCA and FA solve different problems. 𝑆 ̂ in PCA is well defined because
PCA does not seek a single 𝑆 but rather a sequence of 𝑆𝑞 with dimensions growing from 𝑞 = 1 to 𝑞 = 𝑝.
5 https://en.wikipedia.org/wiki/G_factor_(psychometrics)
local-MDS is aimed at solving the case where Euclidean distances, implied by PCA and FA, are a bad measure of distance. Instead of using the graph of Euclidean distances between any two points, 𝒢 = 𝑋 ′ 𝑋, local-MDS computes 𝒢 starting with the Euclidean distances between pairs of nearest points. Longer distances are solved as a shortest path problem8 . For instance, using the Floyd–Warshall algorithm9 , which sums distances between closest points. This is akin to computing the distance from Jerusalem to Beijing by computing Euclidean distances between Jerusalem-Bagdad, Bagdad-Teheran, Teheran-Ashgabat, Ashgabat-Tashkent, and so on. Because the geographical-distance10 between nearby cities is well approximated by the Euclidean distance, summing local distances is better than operating directly with the Euclidean distance between Jerusalem and Beijing.
6 https://stats.stackexchange.com/questions/123063/is-there-any-good-reason-to-use-pca-instead-of-efa-also-can-pca-be-a-substitut
7 Then again, it is possible that the true distances are the white matter fibers connecting regions within the cortex, in which case, Euclidean distances are more appropriate than geodesic distances. We put that aside for now.
After computing 𝒢, local-MDS ends with the usual MDS for the embedding. Because local-MDS ends with a regular
MDS, it can be seen as a non-linear embedding into a linear manifold ℳ.
11.1.4.5 t-SNE
t-SNE is a recently popularized visualization method for high dimensional data. t-SNE starts by computing a proximity graph, 𝒢. Computation of distances in the graph assumes a Gaussian decay of distances. Put differently: only the nearest observations have a non-vanishing similarity. This stage is similar (in spirit) to the growing of 𝒢 in local-MDS (11.1.4.2).
The second stage in t-SNE consists of finding a mapping to 2D (or 3D), which conserves distances in 𝒢. The uniqueness of t-SNE compared to other space embeddings is in the way distances are computed in the target 2D (or 3D) space.
What if 𝑥 is not continuous, i.e., 𝒳 ≠ ℝ𝑝 ? We could dummy-code 𝑥, and then use plain PCA. A more principled view,
when 𝑥 is categorical, is known as Correspondence Analysis.
We already saw the basics of PCA in 11.1.1. The fitting is done with the stats::prcomp function. The bi-plot is a
useful way to visualize the output of PCA.
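The objects USArrests.1 and pca.1 used below are constructed earlier; a sketch consistent with the biplot (the column selection and scaling are assumptions):
USArrests.1 <- USArrests[, c("Murder", "Assault", "Rape")] # drop UrbanPop
pca.1 <- prcomp(USArrests.1, scale. = TRUE) # PCA on the scaled crime variables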
# library(devtools)
# install_github("vqv/ggbiplot")
ggbiplot::ggbiplot(pca.1)
[Figure: biplot in the plane of standardized PC1 (78.6% explained var.) and PC2 (15.3% explained var.), with the Murder, Assault, and Rape loadings drawn as arrows.]
Things to note:
• We used the ggbiplot::ggbiplot function (available from github, but not from CRAN), because it has a nicer
output than stats::biplot.
• The data is presented in the plane of PC1 and PC2.
• The bi-plot plots the loadings as arrows. The coordinates of the arrows correspond to the weight of each of the original variables in each PC. For example, the x-value of each arrow is its loading on PC1. Since the weights of Murder, Assault, and Rape are almost the same, we conclude that PC1 captures the average crime rate in each state.
The scree plot depicts the quality of the approximation of 𝑋 as 𝑞 grows, i.e., as we increase the dimension of our new score. This is depicted using the proportion of variability in 𝑋 that is captured by each added PC. It is customary to choose 𝑞 as the first PC that has a relatively low contribution to the approximation of 𝑋. This is known as the "knee heuristic".
ggbiplot::ggscreeplot(pca.1)
[Figure: scree plot of the proportion of explained variance per PC.]
The scree plot suggests that PC1 alone captures about 0.8 of the variability in crime levels. The next plot is the classical class-room introduction to PCA. It shows that states are indeed arranged along a single line in the "Assault-Murder" plane. This line is PC1.
[Figure: states in the Murder-Assault plane, with the PC 1 and PC 2 directions overlaid.]
# Another one...
amap::acp()
11.1.5.2 FA
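The fit behind the biplot and loadings below is not shown; a sketch with psych::principal (assumed; the PC1/PC2 labels in the output match an unrotated principal-components solution):
fa.1 <- psych::principal(USArrests.1, nfactors = 2, rotate = "none") # unrotated two-component solution
biplot(fa.1, labels = rownames(USArrests.1))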
[Figure: "Biplot from fa", states plotted in the PC1-PC2 plane with the Murder, Assault, and Rape loadings.]
Numeric comparison with PCA:
fa.1$loadings
##
## Loadings:
## PC1 PC2
## Murder 0.895 -0.361
## Assault 0.934 -0.145
## Rape 0.828 0.554
##
## PC1 PC2
## SS loadings 2.359 0.458
[Figure: diagram of the two components and their loadings on Murder, Assault, and Rape.]
Let’s add a rotation (Varimax), and note that the rotation has indeed changed the loadings of the variables, thus the
interpretation of the factors.
fa.2 <- psych::principal(USArrests.1, nfactors = 2, rotate = "varimax")
fa.2$loadings
##
## Loadings:
## RC1 RC2
## Murder 0.930 0.257
## Assault 0.829 0.453
## Rape 0.321 0.943
##
## RC1 RC2
## SS loadings 1.656 1.160
## Proportion Var 0.552 0.387
## Cumulative Var 0.552 0.939
Things to note:
11.1.5.3 ICA
ica.1 <- fastICA::fastICA(USArrests.1, n.comp = 2) # Also performs projection pursuit
plot(ica.1$S)
abline(h=0, v=0, lty=2)
text(ica.1$S, pos = 4, labels = rownames(USArrests.1))
[Figure: states plotted in the plane of the two independent components, ica.1$S[,1] versus ica.1$S[,2].]
Things to note:
11.1.5.4 MDS
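The dissimilarity graph and the classical-MDS embedding plotted below are computed along these lines (the scaling of USArrests.2 is an assumption):
USArrests.2 <- USArrests.1 %>% scale # z-scored crime variables (assumed)
state.disimilarity <- dist(USArrests.2) # Euclidean dissimilarity graph
mds.1 <- cmdscale(state.disimilarity) # classical MDS embedding into the plane
plot(mds.1, pch = 19)
abline(h=0, v=0, lty=2)
text(mds.1, pos = 4, labels = rownames(USArrests.2))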
[Figure: states embedded in the plane by classical MDS (mds.1), labeled by state name.]
Things to note:
• We first compute a dissimilarity graph with stats::dist(). See cluster::daisy for a wider variety of dissimilarity measures.
• We learn the MDS embedding with stats::cmdscale.
• Embedding with the first two components of PCA is exactly the same as classical MDS with Euclidean distances.
Let’s try other strain functions for MDS, like Sammon’s stress, and compare it with PCA.
mds.2 <- MASS::sammon(state.disimilarity, trace = FALSE)
plot(mds.2$points, pch = 19)
abline(h=0, v=0, lty=2)
text(mds.2$points, pos = 4, labels = rownames(USArrests.2))
[Figure: states embedded in the plane by Sammon's-stress MDS (mds.2$points), labeled by state name.]
Things to note:
• MASS::sammon does the embedding.
• Sammon stress is different than PCA.
11.1.5.5 t-SNE
For a native R implementation: tsne package11 . For an R wrapper for C libraries: Rtsne package12 .
11 https://cran.r-project.org/web/packages/tsne/
12 https://github.com/jkrijthe/Rtsne
The MNIST13 image bank of hand-written images has its own data format. The import process is adapted from David
Dalpiaz14 :
show_digit <- function(arr784, col = gray(12:1 / 12), ...) {
image(matrix(as.matrix(arr784[-785]), nrow = 28)[, 28:1], col = col, ...)
}
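The helpers load_image_file() and load_label_file() used below come from the linked gist; a sketch along those lines (assumed, details may differ):
load_image_file <- function(filename) {
  f <- file(filename, 'rb')
  readBin(f, 'integer', n = 1, size = 4, endian = 'big') # magic number
  n <- readBin(f, 'integer', n = 1, size = 4, endian = 'big') # number of images
  nrow <- readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  ncol <- readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  x <- readBin(f, 'integer', n = n * nrow * ncol, size = 1, signed = FALSE)
  close(f)
  data.frame(matrix(x, ncol = nrow * ncol, byrow = TRUE))
}
load_label_file <- function(filename) {
  f <- file(filename, 'rb')
  readBin(f, 'integer', n = 1, size = 4, endian = 'big') # magic number
  n <- readBin(f, 'integer', n = 1, size = 4, endian = 'big') # number of labels
  y <- readBin(f, 'integer', n = n, size = 1, signed = FALSE)
  close(f)
  y
}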
# load images
train <- load_image_file("data/train-images-idx3-ubyte")
test <- load_image_file("data/t10k-images-idx3-ubyte")
# load labels
train$y = as.factor(load_label_file("data/train-labels-idx1-ubyte"))
test$y = as.factor(load_label_file("data/t10k-labels-idx1-ubyte"))
13 http://yann.lecun.com/exdb/mnist/
14 https://gist.github.com/daviddalpiaz/ae62ae5ccd0bada4b9acd6dbc9008706
(Figure: a 3x3 grid of example digit images, drawn with show_digit.)
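The t-SNE embedding in the next figure can be computed with the Rtsne package. A sketch, with illustrative parameter choices (not the author's), on a subsample for speed:
library(Rtsne)
train.sub <- train[sample(nrow(train), 1000), ]                 # subsample for speed
tsne.1 <- Rtsne(as.matrix(train.sub[, 1:784]), dims = 2, perplexity = 30, check_duplicates = FALSE)
plot(tsne.1$Y, type = 'n', xlab = 'tSNE dimension 1', ylab = 'tSNE dimension 2')
text(tsne.1$Y, labels = train.sub$y, col = as.integer(train.sub$y))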
(Figure: the t-SNE embedding of the MNIST digits; axes are tSNE dimension 1 and tSNE dimension 2, and each point is drawn as its digit label. Digits of the same class cluster together.)
15 https://rpubs.com/marwahsi/tnse
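The output below comes from a sparse PCA fit with elasticnet::spca. A sketch of such a call (the penalty values here are illustrative, not the author's):
library(elasticnet)
spca.1 <- elasticnet::spca(USArrests.1, K = 2, type = "predictor", sparse = "penalty", para = c(0.5, 0.5))
spca.1$loadings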
## PC1 PC2
## Murder -0.1626431 1
## Assault -0.8200474 0
## Rape -0.5486979 0
Things to note:
• I used the elasticnet::spca function.
• Is the solution sparse? Yes! PC2 depends on a single variable only: Murder.
Kernel PCA is available through the kernlab package:
library(kernlab)
kpc <- kpca(~.,data=as.data.frame(USArrests.1), kernel="rbfdot", kpar=list(sigma=0.2), features=2)
plot(rotated(kpc),
xlab="1st Principal Component",
ylab="2nd Principal Component")
(Figure: the kPCA scores of the states on the first two kernel principal components.)
Things to note:
• We used kernlab::kpca for kPCA.
• rotated projects the data on its principal components (the above “scores”).
• See ?'kpca-class' or ?rotated for help on available utility functions.
• kernel= governs the class of feature mappings.
• kpar=list(sigma=0.2) provides parameters specific to each type of kernel. See ?kpca.
• features=2 is the number of principal components (scores) to learn.
16 http://bl.ocks.org/eesur/be2abfb3155a38be4de4
• You may notice the "Horseshoe" pattern of the kPCA embedding, a.k.a. "Croissants", or the Guttman Effect. The horseshoe is a known phenomenon in low dimensional visualizations. See J. De Leeuw's online paper17 for more details.
17 https://rpubs.com/deleeuw/133786
11.2 Clustering
Example 11.6. Consider the tagging of your friends' pictures on Facebook. If you tagged some pictures, Facebook may try to use a supervised approach to automatically label photos. If you never tagged pictures, a supervised approach is impossible. It is still possible, however, to group similar pictures together.
Example 11.7. Consider the problem of spam detection. It would be nice if each user could label several thousand emails, so that a supervised learning approach could be applied to spam detection. This is an unrealistic demand, so a pre-clustering stage is useful: the user only needs to label a couple dozen (hopefully homogeneous) clusters, before solving the supervised learning problem.
A finite mixture is the marginal distribution of 𝐾 distinct classes, when the class variable is latent. This is useful for clustering: if we know the distribution of the sub-populations being mixed, we may simply cluster each data point to the most likely sub-group. Learning how each sub-population is distributed, when no class labels are available, is no easy task, but it is possible, for instance by means of maximum likelihood.
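This is not the author's code, but a minimal illustration of mixture-based clustering with the mclust package (assumed installed), which fits a Gaussian mixture by maximum likelihood:
library(mclust)
mix.1 <- Mclust(USArrests.1, G = 2)     # a 2-component Gaussian mixture
head(mix.1$classification)              # the most likely sub-group of each point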
The K-means algorithm works as follows:
1. Choose the number of clusters 𝐾.
2. Arbitrarily assign points to clusters.
3. While clusters keep changing:
1. Recompute the cluster centers as the average of their points.
2. Assign each point to its closest cluster center (in Euclidean distance).
4. Return cluster assignments and means.
Remark. If trained as a statistician, you may wonder: what population quantity is K-means actually estimating? The estimand of K-means is known as the K principal points. Principal points are points which are self-consistent, i.e., they are the mean of their neighbourhood.
11.2.2.2 K-Means++
K-means++ is a fast version of K-means thanks to a smart initialization.
11.2.2.3 K-Medoids
If a Euclidean distance is inappropriate for a particular set of variables, or if robustness to corrupt observations is required, or if we wish to constrain the cluster centers to be actual observations, then the K-Medoids algorithm is an adaptation of K-means that allows this. It is also known under the name partition around medoids (PAM) clustering, suggesting its relation to graph partitioning: a very important and well-studied problem in computer science.
The k-medoids algorithm works as follows.
1. Given a dissimilarity graph, 𝒢.
2. Choose the number of clusters 𝐾.
3. Arbitrarily assign points to clusters.
4. While clusters keep changing:
1. Within each cluster, set the center as the data point that minimizes the sum of distances to other points in
the cluster.
2. Assign each point to its closest cluster center.
5. Return Cluster assignments and centers.
Remark. If trained as a statistician, you may wonder: what population quantity is K-medoids actually estimating? The estimands of K-medoids are points that are the medians of their neighbourhoods. A delicate matter is that quantiles are not easy to define for multivariate variables, so that the "multivariate median" may be a more subtle quantity than you may think. See Small (1990).
11.2.3 Clustering in R
11.2.3.1 K-Means
The following code is an adaptation from David Hitchcock19 .
19 http://people.stat.sc.edu/Hitchcock/chapter6_R_examples.txt
k <- 2
kmeans.1 <- stats::kmeans(USArrests.1, centers = k)
head(kmeans.1$cluster) # cluster assignments
(Figure: a scatter-plot matrix of Murder, Assault and Rape, with each observation drawn as its K-means cluster label, 1 or 2.)
Things to note:
• The stats::kmeans function does the clustering.
• The cluster assignment is given in the cluster element of the stats::kmeans output.
• The visual inspection confirms that similar states have been assigned to the same cluster.
11.2.3.2 K-Medoids
Start by growing a distance graph with dist and then partition using pam.
state.disimilarity <- dist(USArrests.1)
kmed.1 <- cluster::pam(x= state.disimilarity, k=2)
head(kmed.1$clustering)
(Figure: the states plotted on the first two principal components, PC 1 and PC 2, illustrating the K-medoids clustering.)
Things to note:
• K-medoids starts with the computation of a dissimilarity graph, done by the dist function.
• The clustering is done by the cluster::pam function.
• Inspecting the output confirms that similar states have been assigned to the same cluster.
• Many other similarity measures can be found in proxy::dist().
• See cluster::clara() for a big-data implementation of PAM.
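The dendrogram below comes from single-linkage hierarchical clustering of the same dissimilarity graph. A sketch of the call (the object name is assumed to parallel the complete-linkage code further below):
hirar.1 <- hclust(state.disimilarity, method='single')
plot(hirar.1, labels=rownames(USArrests.1), ylab="Distance")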
(Figure: cluster dendrogram of the states from hclust(*, "single") on state.disimilarity; the y-axis is Distance.)
Things to note:
• hclust performs the agglomerative clustering, starting from the dissimilarity graph; plot draws the resulting dendrogram.
• The labels= argument names the leaves with the states.
We try other types of linkages, to verify that they indeed affect the clustering. Starting with complete linkage.
hirar.2 <- hclust(state.disimilarity, method='complete')
plot(hirar.2, labels=rownames(USArrests.1), ylab="Distance")
(Figure: cluster dendrogram of the states from hclust(*, "complete").)
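The next dendrogram uses average linkage; the call is presumably the analogous one:
hirar.3 <- hclust(state.disimilarity, method='average')
plot(hirar.3, labels=rownames(USArrests.1), ylab="Distance")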
(Figure: cluster dendrogram of the states from hclust(*, "average").)
If we know how many clusters we want, we can use stats::cutree to get the class assignments.
cut.2.2 <- cutree(hirar.2, k=2)
head(cut.2.2)
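The figure below comes from hierarchically clustering simulated data. The simulation itself is not shown; a sketch consistent with the code that follows (the names the.data.10, data.n and n.groups, and the three-level shift, are taken from that code; the number of columns is an assumption):
n.groups <- 3                                             # matches sample(c(0,10,20), ...) below
data.n <- 100                                             # number of observations
the.data.10 <- matrix(rnorm(data.n * 10), nrow = data.n)  # no separation between groups
data.disimilarity.10 <- dist(the.data.10)
hirar.10 <- hclust(data.disimilarity.10, method = "complete")
plot(hirar.10, ylab = "Distance")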
(Figure: complete-linkage cluster dendrogram of the simulated data, built from data.disimilarity.10.)
# data with strong separation between groups
the.data.11 <- the.data.10 + sample(c(0,10,20), data.n, replace=TRUE) # Shift each group
data.disimilarity.11 <- dist(the.data.11)
hirar.11 <- hclust(data.disimilarity.11, method = "complete")
plot(hirar.11, ylab="Distance", main=paste('Strong Separation Between', n.groups, 'Groups'))
(Figure: complete-linkage cluster dendrogram of the shifted data, data.disimilarity.11, showing a strong separation between the groups.)
8. Determine the color of the points to be the true species with col=iris$Species.
See DataCamp's Unsupervised Learning in R23, Cluster Analysis in R24, Dimensionality Reduction in R25, and Advanced Dimensionality Reduction in R26 for more self practice.
23 https://www.datacamp.com/courses/unsupervised-learning-in-r
24 https://www.datacamp.com/courses/cluster-analysis-in-r
25 https://www.datacamp.com/courses/dimensionality-reduction-in-r
26 https://www.datacamp.com/courses/advanced-dimensionality-reduction-in-r
Chapter 12
Plotting
Whether you are doing EDA, or preparing your results for publication, you need plots. R has many plotting mechanisms, allowing the user a tremendous amount of flexibility, while abstracting away a lot of the tedious details. To be concrete, many of the plots in R are simply impossible to produce with Excel, SPSS, or SAS, and would take a tremendous amount of work to produce with Python, Java and lower level programming languages.
In this text, we will focus on two plotting packages: the basic graphics package, distributed with the base R distribution, and the ggplot2 package.
Before going into the details of the plotting packages, we start with some philosophy. The graphics package originates
from the mainframe days. Computers had no graphical interface, and the output of the plot was immediately sent to
a printer. Once a plot has been produced with the graphics package, just like a printed output, it cannot be queried
nor changed, except for further additions.
The philosophy of R is that everything is an object. The graphics package does not adhere to this philosophy, and
indeed it was soon augmented with the grid package (R Core Team, 2016), that treats plots as objects. grid is a low
level graphics interface, and users may be more familiar with the lattice package built upon it (Sarkar, 2008).
lattice is very powerful, but soon enough, it was overtaken in popularity by the ggplot2 package (Wickham, 2009).
ggplot2 was the PhD project of Hadley Wickham1, a name to remember… Two fundamental ideas underlie ggplot2: (i) everything is an object, and (ii) plots can be described by a simple grammar, i.e., a language to describe the building blocks of the plot. The grammar in ggplot2 is the one stated by Wilkinson (2006). The objects and
grammar of ggplot2 have later evolved to allow more complicated plotting and in particular, interactive plotting.
Interactive plotting is a very important feature for EDA, and reporting. The major leap in interactive plotting was
made possible by the advancement of web technologies, such as JavaScript and D3.JS2 . Why is this? Because an
interactive plot, or report, can be seen as a web-site. Building upon the capabilities of JavaScript and your web browser to provide the interactivity greatly facilitates the development of such plots, as the programmer can rely on the web-browser's capabilities for interactivity.
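12.1 The graphics System
The figures below demonstrate the graphics system on the trees data. A sketch of the chunks that produce them (the exact type= values shown are assumptions):
attach(trees)                     # makes Girth, Height and Volume available by name
plot(Girth)                       # a simple index plot
plot(Height)
par.old <- par(mfrow=c(2,3))      # a 2x3 matrix of plots; save the old settings
plot(Girth, type='p', main="type='p'")
plot(Girth, type='l', main="type='l'")
plot(Girth, type='b', main="type='b'")
plot(Girth, type='h', main="type='h'")
plot(Girth, type='s', main="type='s'")
plot(Girth, type='n', main="type='n'")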
(Figures: plots of Girth and Height from the trees data, followed by a 2x3 matrix of plots of Girth demonstrating the different type= values.)
par(par.old)
Things to note:
• The par command controls the plotting parameters. mfrow=c(2,3) is used to produce a matrix of plots with 2
rows and 3 columns.
• The par.old object saves the original plotting setting. It is restored after plotting using par(par.old).
• The type argument controls the type of plot.
• The main argument controls the title.
• See ?plot and ?par for more options.
Control the plotting characters with the pch argument, and size with the cex argument.
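A sketch of such a chunk (the exact size is an assumption; the figure below uses the '+' character):
plot(Girth, pch='+', cex=2)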
(Figure: the Girth plot drawn with the '+' plotting character.)
Control the line's type with the lty argument, and its width with lwd.
par(mfrow=c(2,3))
plot(Girth, type='l', lty=1, lwd=2)
plot(Girth, type='l', lty=2, lwd=2)
plot(Girth, type='l', lty=3, lwd=2)
plot(Girth, type='l', lty=4, lwd=2)
plot(Girth, type='l', lty=5, lwd=2)
plot(Girth, type='l', lty=6, lwd=2)
(Figure: a 2x3 matrix of line plots of Girth, one for each of the six lty values.)
(Figure: a plot of Girth against its index.)
Points and lines can be added to an existing plot with the points and lines functions.
plot(Girth)
points(x=1:30, y=rep(12,30), cex=0.5, col='darkblue')
lines(x=rep(c(5,10), 7), y=7:20, lty=2 )
lines(x=rep(c(5,10), 7)+2, y=7:20, lty=2 )
lines(x=rep(c(5,10), 7)+4, y=7:20, lty=2 , col='darkgreen')
lines(x=rep(c(5,10), 7)+6, y=7:20, lty=4 , col='brown', lwd=4)
(Figure: the Girth plot with the added points and the dashed and colored lines.)
Things to note:
• points adds points to an existing plot; lines adds connected lines.
• col= controls the color, lty= the line type, and lwd= the line width.
(Figure: the Girth plot annotated with segments, arrows, a rectangle, a polygon, a title, and the mathematical expression α = log(fi).)
Or does it?
Things to note:
• The following functions add the elements they are named after: segments, arrows, rect, polygon, title.
• mtext adds mathematical text, which needs to be wrapped in expression(). For more information on mathematical annotation see ?plotmath.
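A sketch of how such annotations can be added (the coordinates here are illustrative, not the author's):
plot(Girth)
segments(x0=5, y0=10, x1=10, y1=15)              # a line segment
arrows(x0=15, y0=10, x1=20, y1=15, code=2)       # an arrow
rect(xleft=2, ybottom=16, xright=6, ytop=18)     # a rectangle
polygon(x=c(22,25,28), y=c(9,12,9))              # a polygon (here a triangle)
title(main='Annotated Girth')
mtext(expression(alpha==log(f[i])), side=3)      # mathematical annotation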
Add a legend.
plot(Girth, pch='G',ylim=c(8,77), xlab='Tree number', ylab='', type='b', col='blue')
points(Volume, pch='V', type='b', col='red')
legend(x=2, y=70, legend=c('Girth', 'Volume'), pch=c('G','V'), col=c('blue','red'), bg='grey')
(Figure: Girth (G) and Volume (V) against the tree number, with a legend in the top-left corner.)
(Figure: a plot of Girth over a restricted range of the index.)
Use layout for complicated plot layouts.
A<-matrix(c(1,1,2,3,4,4,5,6), byrow=TRUE, ncol=2)
layout(A,heights=c(1/14,6/14,1/14,6/14))
(Figure: the layout produced by layout(A, heights=...), with plotting boxes of different heights.)
Always detach.
detach(trees)
Export tiff.
tiff(filename='graphicExample.tiff')
plot(rnorm(100))
dev.off()
Things to note:
• The tiff function tells R to open a .tiff file, and write the output of a plot to it.
• Only a single (the last) plot is saved.
• dev.off closes the tiff device, and returns plotting to the R console (or RStudio).
If you want to produce several plots, you can use a counter in the file’s name. The counter uses the printf3 format
string.
tiff(filename='graphicExample%d.tiff') #Creates a sequence of files
plot(rnorm(100))
boxplot(rnorm(100))
hist(rnorm(100))
dev.off()
To see the list of all open devices use dev.list(). To close all devices (not only the last one), use graphics.off(). See ?pdf and ?jpeg for more info.
x = 1995:2005
y = c(81.1, 83.1, 84.3, 85.2, 85.4, 86.5, 88.3, 88.6, 90.8, 91.1, 91.3)
plot.new()
plot.window(xlim = range(x), ylim = range(y))
abline(h = -4:4, v = -4:4, col = "lightgrey")
lines(x, y, lwd = 2)
title(main = "A Line Graph Example",
xlab = "Time",
ylab = "Quality of R Graphics")
axis(1)
axis(2)
box()
(Figure: the resulting line graph, titled "A Line Graph Example", with Time on the x-axis and Quality of R Graphics on the y-axis.)
3 https://en.wikipedia.org/wiki/Printf_format_string
Things to note:
• plot.new opens a new (empty) plot, and plot.window sets its coordinate system via xlim and ylim.
• The plot is then built element by element: abline for the grid, lines for the data, title for the annotations, axis for the axes, and box for the frame.
12.1.3.2 Rosette
n = 17
theta = seq(0, 2 * pi, length = n + 1)[1:n]
x = sin(theta)
y = cos(theta)
v1 = rep(1:n, n)
v2 = rep(1:n, rep(n, n))
plot.new()
plot.window(xlim = c(-1, 1), ylim = c(-1, 1), asp = 1)
segments(x[v1], y[v1], x[v2], y[v2])
box()
12.1.3.3 Arrows
plot.new()
plot.window(xlim = c(0, 1), ylim = c(0, 1))
arrows(.05, .075, .45, .9, code = 1)
arrows(.55, .9, .95, .075, code = 2)
arrows(.1, 0, .9, 0, code = 3)
text(.5, 1, "A", cex = 1.5)
text(0, 0, "B", cex = 1.5)
text(1, 0, "C", cex = 1.5)
(Figure: the three arrows connecting the labels A, B and C.)
x = 1:10
y = runif(10) + rep(c(5, 6.5), c(5, 5))
yl = y - 0.25 - runif(10)/3
yu = y + 0.25 + runif(10)/3
plot.new()
plot.window(xlim = c(0.5, 10.5), ylim = range(yl, yu))
arrows(x, yl, x, yu, code = 3, angle = 90, length = .125)
points(x, y, pch = 19, cex = 1.5)
axis(1, at = 1:10, labels = LETTERS[1:10])
axis(2, las = 1)
box()
(Figure: points with vertical error bars at categories A through J.)
12.1.3.5 Histogram
A histogram is nothing but a bunch of rectangle elements.
plot.new()
plot.window(xlim = c(0, 5), ylim = c(0, 10))
rect(0:4, 0, 1:5, c(7, 8, 4, 3), col = "lightblue")
axis(1)
axis(2, las = 1)
(Figure: the hand-made histogram, drawn with rect.)
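The next code repeatedly rotates and shrinks a polygon. The initialization of x and y is not shown above; it is presumably a unit square, along these lines:
x = c(0, 1, 1, 0)
y = c(0, 0, 1, 1)
plot.new()
plot.window(xlim = c(0, 1), ylim = c(0, 1), asp = 1)
polygon(x, y, col = "cornsilk")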
vertex1 = c(1, 2, 3, 4)
vertex2 = c(2, 3, 4, 1)
for(i in 1:50) {
x = 0.9 * x[vertex1] + 0.1 * x[vertex2]
y = 0.9 * y[vertex1] + 0.1 * y[vertex2]
polygon(x, y, col = "cornsilk")
}
12.1.3.6 Circles
Circles are just dense polygons.
R = 1
xc = 0
yc = 0
n = 72
t = seq(0, 2 * pi, length = n)[1:(n-1)]
x = xc + R * cos(t)
y = yc + R * sin(t)
plot.new()
plot.window(xlim = range(x), ylim = range(y), asp = 1)
polygon(x, y, col = "lightblue", border = "navyblue")
12.1.3.7 Spiral
k = 5
n = k * 72
theta = seq(0, k * 2 * pi, length = n)
R = .98^(1:n - 1)
x = R * cos(theta)
y = R * sin(theta)
plot.new()
plot.window(xlim = range(x), ylim = range(y), asp = 1)
lines(x, y)
12.2 The ggplot2 System
ggplot2 provides a convenience function for many plots: qplot. We take a non-typical approach by ignoring qplot,
and presenting the fundamental building blocks. Once the building blocks have been understood, mastering qplot
will be easy.
The nlme::Milk dataset has the protein level of various cows, at various times, with various diets.
library(nlme)
data(Milk)
head(Milk)
4 http://www.ats.ucla.edu/stat/r/seminars/ggplot2_intro/ggplot2_intro.htm
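The figure below is the basic scatter plot of protein against Time; judging by the notes that follow, the chunk is presumably:
library(ggplot2)
ggplot(data = Milk, aes(x=Time, y=protein)) +
  geom_point()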
(Figure: scatter plot of protein against Time for the Milk data.)
Things to note:
• The ggplot function is the constructor of the ggplot2 object. If the object is not assigned, it is plotted.
• The aes argument tells R that the Time variable in the Milk data is the x axis, and protein is y.
• The geom_point defines the Geom, i.e., it tells R to plot the points as they are (and not lines, histograms, etc.).
• The ggplot2 object is built by compounding its various elements, separated by the + operator.
• All the variables that we will need are assumed to be in the Milk data frame. This means that (a) the data
needs to be a data frame (not a matrix for instance), and (b) we will not be able to use variables that are not
in the Milk data frame.
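The next figure colors the points by Diet; as described right after it, the chunk is presumably:
ggplot(data = Milk, aes(x=Time, y=protein, color=Diet)) +
  geom_point()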
(Figure: the same scatter plot, with points colored by Diet and a legend added.)
The color argument tells R to use the variable Diet as the coloring. A legend is added by default. If we wanted a
fixed color, and not a variable dependent color, color would have been put outside the aes function.
ggplot(data = Milk, aes(x=Time, y=protein)) +
geom_point(color="green")
(Figure: the scatter plot with all points drawn in green.)
Let’s save the ggplot2 object so we can reuse it. Notice it is not plotted.
p <- ggplot(data = Milk, aes(x=Time, y=protein)) +
geom_point()
We can change existing plots using the + operator (in Object-Oriented Programming lingo, this is known as mutating5). Here, we add a smoothing line to the plot p.
p + geom_smooth(method = 'gam')
(Figure: the scatter plot with a smooth trend line added by geom_smooth.)
Things to note:
• geom_smooth adds a smoothing layer on top of the existing plot; method='gam' selects the smoother.
To split the plot along some variable, we use faceting, done with the facet_wrap function.
p + facet_wrap(~Diet)
5 https://en.wikipedia.org/wiki/Immutable_object
(Figure: the scatter plot split into one facet per Diet.)
Instead of faceting, we can add a layer of the mean of each Diet subgroup, connected by lines.
p + stat_summary(aes(color=Diet), fun.y="mean", geom="line")
(Figure: the scatter plot with a line for the mean protein of each Diet subgroup, colored by Diet.)
Things to note:
• stat_summary computes a summary (here fun.y="mean") of protein at each Time point, and draws it with the line geom.
• The aes(color=Diet) inside stat_summary applies the summary, and the coloring, per Diet subgroup.
To demonstrate the layers added with the geom_* functions, we start with a histogram.
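Judging by the figure that follows, the chunk is presumably:
ggplot(Milk, aes(x=protein)) +
  geom_histogram()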
(Figure: a histogram of protein.)
A bar plot.
ggplot(Milk, aes(x=Diet)) +
geom_bar()
(Figure: a bar plot of the number of observations per Diet.)
A scatter plot.
tp <- ggplot(Milk, aes(x=Time, y=protein))
tp + geom_point()
(Figure: a scatter plot of protein against Time, produced from the tp object.)
(Figure: a smooth curve of the protein trend over Time.)
And now, a simple line plot, reusing the tp object, and connecting lines along Cow.
tp + geom_line(aes(group=Cow))
(Figure: protein over Time with a line connecting the observations of each Cow.)
The line plot is completely incomprehensible. It is better to look at boxplots along time (even if this omits the Cow information).
tp + geom_boxplot(aes(group=Time))
(Figure: boxplots of protein at each Time point.)
We can do some statistics for each subgroup. The following will compute the mean and standard errors of protein at
each time point.
ggplot(Milk, aes(x=Time, y=protein)) +
stat_summary(fun.data = 'mean_se')
(Figure: the mean and standard error of protein at each Time point.)
For less popular statistical summaries, we may specify the statistical function in stat_summary. The median is a first
example.
ggplot(Milk, aes(x=Time, y=protein)) +
stat_summary(fun.y="median", geom="point")
(Figure: the median of protein at each Time point.)
(Figure: another user-specified summary statistic of protein at each Time point.)
Faceting allows us to split the plotting along some variable. facet_wrap tells R to compute the number of columns and rows of plots automatically.
ggplot(Milk, aes(x=protein, color=Diet)) +
geom_density() +
facet_wrap(~Time)
(Figure: density estimates of protein, colored by Diet, in one facet per Time point.)
facet_grid forces the plot to appear along rows or columns, using the ~ syntax.
ggplot(Milk, aes(x=Time, y=protein)) +
geom_point() +
facet_grid(Diet~.) # `.~Diet` to split along columns and not rows.
(Figure: the scatter plot split along rows, one row per Diet.)
Saving plots can be done using ggplot2::ggsave, or with pdf like the graphics plots:
pdf(file = 'myplot.pdf')
print(tp) # You will need an explicit print command!
dev.off()
Remark. If you are exporting a PDF for publication, you will probably need to embed your fonts in the PDF. In this
case, use cairo_pdf() instead of pdf().
Finally, what every user of ggplot2 constantly uses, is the (excellent!) online documentation at http://docs.ggplot2.org.
12.3 Interactive Graphics
12.3.1 Plotly
You can create nice interactive graphs using plotly::plot_ly:
6 http://www.ggplot2-exts.org/gallery/
7 https://github.com/rstudio/RStartHere
8 https://d3js.org/
9 http://www.htmlwidgets.org/
10 http://gallery.htmlwidgets.org/
11 https://github.com/rstudio/RStartHere
library(plotly)
set.seed(100)
d <- diamonds[sample(nrow(diamonds), 1000), ]
plot_ly(data = d, x = ~carat, y = ~price, color = ~carat, size = ~carat, text = ~paste("Clarity: ", clarity))
More conveniently, any ggplot2 graph can be made interactive using plotly::ggplotly:
p <- ggplot(data = d, aes(x = carat, y = price)) +
geom_smooth(aes(colour = cut, fill = cut), method = 'loess') +
facet_wrap(~ cut) # make ggplot
ggplotly(p) # from ggplot to plotly
How about exporting plotly objects? Well, a plotly object is nothing more than a little web site: an HTML file. When showing a plotly figure, RStudio merely serves you as a web browser. You could, alternatively, export this HTML file to send to your colleagues as an email attachment, or embed it in a web site. To export these, use the plotly::export or the htmlwidgets::saveWidget functions.
For more on plotly see https://plot.ly/r/.
(Figure: a summary statistic of protein at each Time point.)
3. Write a function that creates a boxplot from scratch. See how I built a line graph in Section 12.1.3.
4. Export my plotly example using the RStudio interface and send it to yourself by email.
ggplot2:
1. Read about the “oats” dataset using ? MASS::oats.
1. Inspect, visually, the dependency of the yield (Y) on the Varieties (V) and the Nitrogen treatment (N).
2. Compute the mean and the standard error of the yield for every value of Varieties and Nitrogen treatment.
3. Change the axis labels to be informative with labs function and give a title to the plot with ggtitle
function.
2. Read about the “mtcars” data set using ? mtcars.
1. Inspect, visually, the dependency of the fuel consumption (mpg) on the weight (wt).
2. Inspect, visually, the assumption that the Fuel consumption also depends on the number of cylinders.
3. Is there an interaction between the number of cylinders and the weight (i.e., is the slope of the regression line different between numbers of cylinders)? Use geom_smooth.
See DataCamp’s Data Visualization with ggplot226 for more self practice.
26 https://www.datacamp.com/courses/data-visualization-with-ggplot2-1
Chapter 13
Reports
If you have ever written a report, you are probably familiar with the process of preparing your figures in some software,
say R, and then copy-pasting into your text editor, say MS Word. While very popular, this process is both tedious,
and plain painful if your data has changed and you need to update the report. Wouldn’t it be nice if you could produce
figures and numbers from within the text of the report, and everything else would be automated? It turns out it is
possible. There are actually several systems in R that allow this. We start with a brief review.
1. Sweave: LaTeX is a markup language whose source files compile, via TeX, to documents (typically PS or PDFs). If you never heard of it, it may be because you were born into the MS Windows + MS Word era. You should know, however, that LaTeX was there much earlier, when computers were mainframes with text-only graphic devices. You should also know that LaTeX is still very popular (in some communities) due to its very rich markup syntax, and beautiful output. Sweave (Leisch, 2002) is a compiler for LaTeX that allows you to insert R commands in the LaTeX source file, and get the result as part of the outputted PDF. Its name suggests just that: it allows one to weave S1 output into the document, thus, Sweave.
2. knitr: Markdown is a text editing syntax that, unlike LaTeX, is aimed to be human-readable, but is also compilable by a machine. If you ever tried to read HTML or LaTeX source files, you may understand why human-readability is a desirable property. There are many markdown compilers. One of the most popular is Pandoc, written by the Berkeley philosopher(!) John MacFarlane. The availability of Pandoc gave Yihui Xie2, a name to remember, the idea that it is time for Sweave to evolve. Yihui thus wrote knitr (Xie, 2015), which allows one to write human readable text in Rmarkdown, a superset of markdown, compile it with R, and then compile it with Pandoc. Because Pandoc can compile to PDF, but also to HTML, and DOCX, among others, this means that you can write in Rmarkdown, and get output in almost all text formats out there.
3. bookdown: Bookdown (Xie, 2016) is an evolution of knitr, also written by Yihui Xie, now working for
RStudio. The text you are now reading was actually written in bookdown. It deals with the particular needs
of writing large documents, and cross referencing in particular (which is very challenging if you want the text to
be human readable).
4. Shiny: Shiny is essentially a framework for quick web-development. It includes (i) an abstraction layer that specifies the layout of a web-site, which is our report, and (ii) the commands to start a web server that delivers the site. For more on Shiny see Chang et al. (2017).
13.1 knitr
13.1.1 Installation
To run knitr you will need to install the package.
install.packages('knitr')
It is also recommended that you use it within RStudio (version>0.96), where you can easily create a new .Rmd file.
1 Recall, S was the original software from which R evolved.
2 https://yihui.name/
- bullet
- bullet
- second level bullet
- second level bullet
Compiles into:
• bullet
• bullet
– second level bullet
– second level bullet
An enumerated list starts with an arbitrary number:
1. number
1. number
1. second level number
1. second level number
Compiles into:
1. number
2. number
1. second level number
2. second level number
For more on markdown see https://bookdown.org/yihui/bookdown/markdown-syntax.html.
13.1.3 Rmarkdown
Rmarkdown is an extension of markdown due to RStudio, that allows one to incorporate R expressions in the text; they are evaluated at the time of compilation, and their output is automatically inserted in the outputted text. The output can be a .PDF, .DOCX, .HTML or others, thanks to the power of Pandoc.
The start of a code chunk is indicated by three backticks followed by {r}, and its end by another three backticks.
Here is an example.
```{r eval=FALSE}
rnorm(10)
```
This chunk will compile to the following output (after setting eval=FALSE to eval=TRUE):
rnorm(10)
• The eval= argument is not required, since it is set to eval=TRUE by default. It does demonstrate how to set the
options of the code chunk.
```{r eval=FALSE}
plot(rnorm(10))
```
(Figure: the plot of rnorm(10) produced by the chunk above.)
You can also call R expressions inline. This is done with a single backtick and the r prefix. For instance, writing `r 1+1` in the text will output 2 in the compiled document.
13.1.4 BibTex
BibTex is both a file format and a compiler. The bibtex compiler links documents to a reference database stored in
the .bib file format.
Bibtex is typically associated with Tex and LaTex typesetting, but it also operates within the markdown pipeline.
Just store your references in a .bib file, add a bibliography: yourFile.bib in the YML preamble of your Rmarkdown
file, and call your references from the Rmarkdown text using @referencekey. Rmarkdown will take care of creating
the bibliography, and linking to it from the text.
13.1.5 Compiling
Once you have your .Rmd file written in RMarkdown, knitr will take care of the compilation for you. You can call the knitr::knit function directly from some .R file, or, more conveniently, use the RStudio (0.96) Knit button above the text editing window. The location of the output file will be presented in the console.
13.2 bookdown
As previously stated, bookdown is an extension of knitr intended for documents more complicated than simple reports, such as books. Just like knitr, the writing is done in RMarkdown. Being an extension of knitr, bookdown does allow some markdown constructs that are not supported by other compilers. In particular, it has a more powerful cross referencing system.
13.3 Shiny
Shiny (Chang et al., 2017) is different than the previous systems, because it sets up an interactive web-site, and not a static file. The power of Shiny is that the layout of the web-site, and the settings of the web-server, are made with several simple R commands, with no need for web-programming. Once you have your app up and running, you can set up your own Shiny server on the web, or publish it via Shinyapps.io3. The freemium versions of the service can deal with a small amount of traffic. If you expect a lot of traffic, you will probably need the paid versions.
13.3.1 Installation
To setup your first Shiny app, you will need the shiny package. You will probably want RStudio, which facilitates
the process.
install.packages('shiny')
Once installed, you can run an example app to get the feel of it.
library(shiny)
runExample("01_hello")
Remember to press the Stop button in RStudio to stop the web-server, and get back to RStudio.
The site's layout is specified in the ui.R file using one of the layout functions. For instance, the function sidebarLayout, as the name suggests, will create a sidebar. More layouts are detailed in the layout guide4.
The active elements in the UI that control your report are known as widgets. Each widget will have a unique inputId so that its values can be sent from the UI to the server. More about widgets, in the widget gallery5.
The inputIds on the UI are mapped to input arguments on the server side. The value of the mytext inputId can be queried by the server using input$mytext. These are called reactive values. The way the server "listens" to the UI is governed by a set of functions that must wrap the input object. These are the observe, reactive, and reactive* class of functions.
With observe the server will get triggered when any of the reactive values change. With observeEvent the server
will only be triggered by specified reactive values. Using observe is easier, and observeEvent is more prudent
programming.
A reactive function is a function that gets triggered when a reactive element changes. It is defined on the server side, and resides within an observe function.
We now analyze the 01_hello app using these ideas. Here is the ui.R file.
3 https://www.shinyapps.io/
4 http://shiny.rstudio.com/articles/layout-guide.html
5 http://shiny.rstudio.com/gallery/widget-gallery.html
library(shiny)
shinyUI(fluidPage(
titlePanel("Hello Shiny!"),
sidebarLayout(
sidebarPanel(
sliderInput(inputId = "bins",
label = "Number of bins:",
min = 1,
max = 50,
value = 30)
),
mainPanel(
plotOutput(outputId = "distPlot")
)
)
))
And here is the server.R file:
shinyServer(function(input, output) {
  output$distPlot <- renderPlot({
    x    <- faithful[, 2]   # the Old Faithful Geyser data
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    hist(x, breaks = bins, col = 'darkgray', border = 'white')
  })
})
Things to note:
• shinyUI is a (deprecated) wrapper for the UI.
• fluidPage ensures that the proportions of the elements adapt to the window side, thus, are fluid.
• The building blocks of the layout are a title, and the body. The title is governed by titlePanel, and the body
is governed by sidebarLayout. The sidebarLayout includes the sidebarPanel to control the sidebar, and the
mainPanel for the main panel.
• sliderInput calls a widget with a slider. Its inputId is bins, which is later used by the server within the
renderPlot reactive function.
• plotOutput specifies that the content of the mainPanel is a plot (textOutput for text). This expectation is
satisfied on the server side with the renderPlot function (renderText).
• shinyServer is a (deprecated) wrapper function for the server.
• The server runs a function with an input and an output. The elements of input are the inputIds from the UI.
The elements of the output will be called by the UI using their outputId.
This is the output.
Here is another example, taken from the RStudio Shiny examples6 .
6 https://github.com/rstudio/shiny-examples/tree/master/006-tabsets
ui.R:
library(shiny)
fluidPage(
titlePanel("Tabsets"),
sidebarLayout(
sidebarPanel(
radioButtons(inputId = "dist",
label = "Distribution type:",
c("Normal" = "norm",
"Uniform" = "unif",
"Log-normal" = "lnorm",
"Exponential" = "exp")),
br(), # add a break in the HTML page.
sliderInput(inputId = "n",
label = "Number of observations:",
value = 500,
min = 1,
max = 1000)
),
mainPanel(
tabsetPanel(type = "tabs",
tabPanel(title = "Plot", plotOutput(outputId = "plot")),
tabPanel(title = "Summary", verbatimTextOutput(outputId = "summary")),
tabPanel(title = "Table", tableOutput(outputId = "table"))
)
)
)
)
server.R:
library(shiny)
function(input, output) {
  d <- reactive({
    dist <- switch(input$dist,
                   norm = rnorm, unif = runif, lnorm = rlnorm, exp = rexp)
    dist(input$n)
  })
  output$plot <- renderPlot({ hist(d()) })
  output$summary <- renderPrint({ summary(d()) })
  output$table <- renderTable({ data.frame(x = d()) })
}
Things to note:
• We reused the sidebarLayout.
• As the name suggests, radioButtons is a widget that produces radio buttons, above the sliderInput widget.
Note the different inputIds.
• Different widgets are separated in sidebarPanel by commas.
• br() produces extra vertical spacing (break).
• tabsetPanel produces tabs in the main output panel. tabPanel governs the content of each panel. Notice the use
of various output functions (plotOutput,verbatimTextOutput, tableOutput) with corresponding outputIds.
• In server.R we see the usual function(input,output).
• The reactive function tells the server to trigger the function whenever input changes.
• The output object is constructed outside the reactive function. See how the elements of output correspond to
the outputIds in the UI.
This is the output:
13.3.3.1 Widgets
• actionButton Action Button.
• checkboxGroupInput A group of check boxes.
• checkboxInput A single check box.
• dateInput A calendar to aid date selection.
• dateRangeInput A pair of calendars for selecting a date range.
• fileInput A file upload control wizard.
• helpText Help text that can be added to an input form.
• numericInput A field to enter numbers.
• radioButtons A set of radio buttons.
• selectInput A box with choices to select from.
• sliderInput A slider bar.
• submitButton A submit button.
• textInput A field to enter text.
See examples here7 .
The corresponding render* functions on the server side:
• renderPlot plots.
• renderPrint any printed output.
• renderTable data frame, matrix, other table like structures.
• renderText character strings.
• renderUI a Shiny tag object or HTML.
Your Shiny app can use any R object. The things to remember:
• The working directory of the app is the location of server.R.
• The code before shinyServer is run only once.
• The code inside shinyServer is run whenever a reactive is triggered, and may thus slow things down.
To keep learning, see RStudio's tutorial8, and the Bibliographic Notes herein.
13.3.4 shinydashboard
A template for Shiny to give it a modern look.
13.4 flexdashboard
If you want to quickly write an interactive dashboard, which is simple enough to be a static HTML file and does not need an HTML server, then Shiny may be an overkill. With flexdashboard you can write your dashboard in a single .Rmd file, which will generate an interactive dashboard as a static HTML file.
See http://rmarkdown.rstudio.com/flexdashboard/ for more info.
8 http://shiny.rstudio.com/tutorial/
9 http://rmarkdown.rstudio.com/
10 https://yihui.name/knitr/
11 http://shiny.rstudio.com/tutorial/
12 http://zevross.com/blog/2016/04/19/r-powered-web-applications-with-shiny-a-tutorial-and-cheat-sheet-with-40-example-apps/
13 https://www.rstudio.com/resources/webinars/shiny-developer-conference/
Chapter 14
Sparse Representations
Analyzing "bigdata" in R is a challenge because the workspace is memory resident, i.e., all your objects are stored in RAM. As a rule of thumb, fitting models requires about 5 times the size of the data. This means that if you have 1 GB of data, you might need about 5 GB to fit a linear model. We will discuss how to compute out of RAM in the Memory Efficiency Chapter 15. In this chapter, we discuss efficient representations of your data, so that it takes less memory. The fundamental idea is that if your data is sparse, i.e., there are many zero entries in your data, then a naive data.frame or matrix will consume memory for all these zeroes. If, however, you have many recurring zeroes, it is more efficient to save only the non-zero entries.
When we say data, we actually mean the model.matrix. The model.matrix is a matrix that R grows, converting all
your factors to numeric variables that can be computed with. Dummy coding of your factors, for instance, is something
that is done in your model.matrix. If you have a factor with many levels, you can imagine that after dummy coding
it, many zeroes will be present.
The Matrix package replaces the matrix class, with several sparse representations of matrix objects.
When using sparse representation, and the Matrix package, you will need an implementation of your favorite model
fitting algorithm (e.g. lm) that is adapted to these sparse representations; otherwise, R will cast the sparse matrix
into a regular (non-sparse) matrix, and you will have saved nothing in RAM.
Remark. If you are familiar with MATLAB you should know that one of the great capabilities of MATLAB, is the
excellent treatment of sparse matrices with the sparse function.
Before we go into details, here is a simple example. We will create a factor of letters with the letters function.
Clearly, this factor can take only 26 values. This means that 25/26 of the model.matrix will be zeroes after dummy
coding. We will compare the memory footprint of the naive model.matrix with the sparse representation of the same
matrix.
library(magrittr)
reps <- 1e6 # number of samples
y<-rnorm(reps)
x<- letters %>%
sample(reps, replace=TRUE) %>%
factor
head(x)
## [1] l s q b h p
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
We dummy code x with the model.matrix function.
X.1 <- model.matrix(~x-1)
head(X.1)
## xa xb xc xd xe xf xg xh xi xj xk xl xm xn xo xp xq xr xs xt xu xv xw xx
## 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## 4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## xy xz
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
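The sparse counterpart X.2, printed below, is grown from X.1; judging by the notes that follow, presumably with:
library(Matrix)
X.2 <- as(X.1, "sparseMatrix")
X.2[1:6, ]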
##
## [1,] . . . . . . . . . . . 1 . . . . . . . . . . . . . .
## [2,] . . . . . . . . . . . . . . . . . . 1 . . . . . . .
## [3,] . . . . . . . . . . . . . . . . 1 . . . . . . . . .
## [4,] . 1 . . . . . . . . . . . . . . . . . . . . . . . .
## [5,] . . . . . . . 1 . . . . . . . . . . . . . . . . . .
## [6,] . . . . . . . . . . . . . . . 1 . . . . . . . . . .
dim(X.1)
## [1] 1000000 26
dim(X.2)
## [1] 1000000 26
The memory footprint of the matrices, given by the pryr::object_size function, are very very different.
pryr::object_size(X.1)
## 272 MB
pryr::object_size(X.2)
## 12 MB
Things to note:
• The sparse representation takes a whole lot less memory than the non sparse.
• The as(,"sparseMatrix") function grows the dummy variable representation of the factor x.
• The pryr package provides many facilities for inspecting the memory footprint of your objects and code.
With a sparse representation, we not only saved on RAM, but also on the computing time of fitting a model. Here is the timing of the non-sparse representation, compared with the sparse one:
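A sketch of one such comparison (the sparse fit here solves the normal equations with the Matrix package; this is an assumption, not necessarily the author's approach):
system.time(lm.1 <- lm.fit(x = X.1, y = y))                             # dense least squares
system.time(beta.2 <- solve(crossprod(X.2), crossprod(X.2, y)))         # sparse least squares
all.equal(unname(coef(lm.1)), as.numeric(beta.2), tolerance = 1e-6)     # same coefficients?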
## [1] TRUE
You can also visualize the non zero entries, i.e., the sparsity structure.
image(X.2[1:26,1:26])
(Figure: the sparsity pattern of X.2[1:26, 1:26]; axes are Row and Column, dimensions 26 x 26.)
will be
⎡ 1 2 𝑎2 ⎤
⎣ 2 3 𝑏3 ⎦.
Figure 14.1: The CSR data structure. From Shah and Gilbert (2004). Remember that MATLAB is written in C,
where the indexing starts at 0, and not 1.
1. Working with sparse representations requires using a function that is aware of the representation you are using.
2. A mathematician may write 𝐴𝑥 = 𝑏 ⇒ 𝑥 = 𝐴⁻¹𝑏. This is a predicate1 of 𝑥, i.e., a property that 𝑥 satisfies, which helps with its analysis. A computer, however, would never compute 𝐴⁻¹ in order to find 𝑥, but rather use one of endlessly many numerical algorithms. A computer will typically "search" various 𝑥's until it finds the one that fulfils the predicate.
1 https://en.wikipedia.org/wiki/Predicate_(mathematical_logic)
14.2 Sparse Matrices and Sparse Models in R
The following simulation, adapted from John Myles White2, compares the time it takes glmnet to fit a model with a dense matrix x, and with a sparse representation sx of the same matrix:
set.seed(1)
performance <- data.frame()
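A compact sketch of one simulation round, with illustrative sizes (it injects zeroes into 85% of the entries of x, stores a sparse copy sx, and times glmnet on both):
library(glmnet)
library(Matrix)
n <- 1e4; p <- 500
x <- matrix(rnorm(n * p), n, p)                       # a dense matrix
x[sample(length(x), size = 0.85 * length(x))] <- 0    # inject zeroes
sx <- Matrix(x, sparse = TRUE)                        # its sparse representation
y <- as.numeric(x[, 1:50] %*% rnorm(50) + rnorm(n))   # a continuous outcome
full.time   <- system.time(fit.full   <- glmnet(x,  y))
sparse.time <- system.time(fit.sparse <- glmnet(sx, y))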
2 http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/
Things to note:
• The simulation calls glmnet twice. Once with the non-sparse object x, and once with its sparse version sx.
• The degree of sparsity of sx is 85%. We know this because we “injected” zeroes in 0.85 of the locations of x.
• Because y is continuous glmnet will fit a simple OLS model. We will see later how to use it to fit GLMs and use
lasso, ridge, and elastic-net regularization.
We now inspect the computing time, and the memory footprint, only to discover that sparse representations make a
BIG difference.
suppressPackageStartupMessages(library('ggplot2'))
ggplot(performance, aes(x = Format, y = ElapsedTime, fill = Format)) +
stat_summary(fun.data = 'mean_cl_boot', geom = 'bar') +
stat_summary(fun.data = 'mean_cl_boot', geom = 'errorbar') +
ylab('Elapsed Time in Seconds')
(Figure: mean elapsed time in seconds, with bootstrap error bars, for the Sparse and Full formats.)
(Figure: matrix size in MB for the Sparse and Full formats.)
How do we perform other types of regression with the glmnet? We just need to use the family and alpha arguments
of glmnet::glmnet. The family argument governs the type of GLM to fit: logistic, Poisson, probit, or other types of
GLM. The alpha argument controls the type of regularization. Set to alpha=0 for ridge, alpha=1 for lasso, and any
value in between for elastic-net regularization.
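For example (an illustration only, with a made-up binary outcome y01), a lasso-regularized logistic regression on the sparse matrix:
y01 <- as.numeric(y > median(y))                            # a hypothetical binary outcome
fit.lasso.logistic <- glmnet(sx, y01, family = "binomial", alpha = 1)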
1. What is the CSC representation of the following matrix?
⎡ 0 1 0 ⎤
⎢ 0 0 6 ⎥
⎣ 1 0 1 ⎦
2. Write a function that takes two matrices in CSC and returns their matrix product.
7 https://cran.r-project.org/web/packages/Matrix/vignettes/Intro2Matrix.pdf
8 http://netlib.org/linalg/html_templates/node90.html
Chapter 15
Memory Efficiency
As put by Kane et al. (2013), it was quite puzzling when very few of the competitors for the Million dollar prize in the Netflix challenge1 were statisticians. This is perhaps because the statistical community historically uses SAS, SPSS, and R. The first two tools are very well equipped to deal with big data, but are very unfriendly when trying to implement a new method. R, on the other hand, is very friendly for innovation, but was not equipped to deal with the large data sets of the Netflix challenge. A lot has changed in R since 2006. This is the topic of this chapter.
As we have seen in the Sparsity Chapter 14, an efficient representation of your data in RAM will reduce computing
time, and will allow you to fit models that would otherwise require tremendous amounts of RAM. Not all problems
are sparse however. It is also possible that your data does not fit in RAM, even if sparse. There are several scenarios
to consider:
1. Your data fits in RAM, but is too big to compute with.
2. Your data does not fit in RAM, but fits in your local storage (HD, SSD, etc.)
3. Your data does not fit in your local storage.
If your data fits in RAM, but is too large to compute with, a solution is to replace the algorithm you are using. Instead
of computing with the whole data, your algorithm will compute with parts of the data, also called chunks, or batches.
These algorithms are known as external memory algorithms (EMA), or batch processing.
If your data does not fit in RAM, but fits in your local storage, you have two options. The first is to save your data in a database management system (DBMS). This will allow you to use the algorithms provided by your DBMS, or let R use an EMA while "chunking" from your DBMS. Alternatively, and preferably, you may avoid using a DBMS, and work with the data directly from your local storage by saving your data in some efficient manner.
Finally, if your data does not fit on you local storage, you will need some external storage solution such as a distributed
DBMS, or distributed file system.
Remark. If you use Linux, you may be better off than Windows users. Linux will allow you to compute with larger datasets using its swap file, which extends RAM using your HD or SSD. On the other hand, relying on the swap file is a BAD practice, since it is much slower than RAM, and you can typically do much better using the tricks of this chapter. Also, while I LOVE Linux, I would never dare to recommend switching to Linux just to deal with memory constraints.
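15.1 Efficient Computing from RAM
The biglm package fits a linear model chunk by chunk. The formula object ff used below is assumed to be the usual trees example, along the lines of:
data(trees)
ff <- log(Volume) ~ log(Girth) + log(Height)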
chunk1<-trees[1:10,]
chunk2<-trees[11:20,]
chunk3<-trees[21:31,]
library(biglm)
a <- biglm(ff,chunk1)
a <- update(a,chunk2)
a <- update(a,chunk3)
coef(a)
15.2 Computing from a Database
group_by(v1,v2) %>%
summarize(
x1 = sum(abs(x1)),
x2 = sum(abs(x2)),
x3 = sum(abs(x3))
)
)
file.remove('my_db.sqlite3')
my_db <- src_sqlite(path = "my_db.sqlite3", create = TRUE)
3 This is slowly changing. Indeed, Microsoft's SQL Server 2016 already provides in-database analytics4, and others will surely follow.
5 https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html
library(nycflights13)
flights_sqlite <- copy_to(
dest= my_db,
df= flights,
temporary = FALSE,
indexes = list(c("year", "month", "day"), "carrier", "tailnum"))
Things to note:
• src_sqlite to start an empty table, managed by SQLite, at the desired path.
• copy_to copies data from R to the database.
• Typically, setting up a DBMS like this makes no sense, since it requires loading the data into RAM, which is
precisely what we want to avoid.
We can now start querying the DBMS.
select(flights_sqlite, year:day, dep_delay, arr_delay)
6 https://cran.r-project.org/web/packages/bigmemory/index.html
15.3 Computing from Efficient File Structures
15.3.1 bigmemory
We now demonstrate the workflow of the bigmemory package. We will see that bigmemory, with its big.matrix object, is a very powerful mechanism. If you deal with big numeric matrices, you will find it very useful. If you deal with big data frames, or any other non-numeric matrix, bigmemory may not be the appropriate tool, and you should try ff, or the commercial RevoScaleR.
# download.file("http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPU
# unzip(zipfile="2010_Carrier_PUF.zip")
library(bigmemory)
x <- read.big.matrix("data/2010_BSA_Carrier_PUF.csv", header = TRUE,
backingfile = "airline.bin",
descriptorfile = "airline.desc",
type = "integer")
dim(x)
## [1] 2801660 11
pryr::object_size(x)
## 696 B
class(x)
## [1] "big.matrix"
## attr(,"package")
## [1] "bigmemory"
Things to note:
• The basic building block of the bigmemory ecosystem, is the big.matrix class, we constructed with
read.big.matrix.
• read.big.matrix handles the import to R, and the saving to a memory mapped file. The implementation is
such that at no point does R hold the data in RAM.
• The memory mapped file will be there after the session is over. It can thus be called by other R sessions using
attach.big.matrix("airline.desc"). This will be useful when parallelizing.
• pryr::object_size returns the size of the object. Since x holds only the memory mappings, it is much smaller than the 100MB of data that it holds.
We can now start computing with the data. Many statistical procedures for the big.matrix object are provided by the
biganalytics package. In particular, the biglm.big.matrix and bigglm.big.matrix functions provide an interface
from big.matrix objects to the EMA linear models in biglm::biglm and biglm::bigglm.
library(biganalytics)
biglm.2 <- bigglm.big.matrix(BENE_SEX_IDENT_CD~CAR_LINE_HCPCS_CD, data=x)
coef(biglm.2)
## (Intercept) CAR_LINE_HCPCS_CD
## 1.537848e+00 1.210282e-07
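Beyond regression, column summaries can also be computed without ever loading the data into RAM. A small sketch, assuming the column summaries documented in biganalytics for big.matrix objects:
colmean(x, na.rm = TRUE)   # per-column means, computed from the file-backed matrix
summary(x)                 # per-column minimum, maximum, mean, and NA counts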
Other packages that extend the bigmemory ecosystem include:
• bigtabulate: extends bigmemory with table, tapply, and split support for big.matrix objects.
• bigalgebra: for matrix operations.
• bigpca: principal component analysis (PCA) and singular value decomposition (SVD).
• bigFastlm: for (fast) linear models.
• biglasso: extends lasso and elastic nets.
• GHap: haplotype calling from phased SNP data.
15.3.2 bigstep
The bigstep7 package uses the bigmemory framework to perform stepwise model selection when the data cannot fit
into RAM.
TODO
15.4 ff
The ff package replaces R’s in-RAM storage mechanism with efficient on-disk storage. Unlike bigmemory, ff
supports all of R’s vector types, such as factors, and not only numerics. Unlike big.matrix, which deals with (numeric)
matrices, the ffdf class can deal with data frames.
Here is an example. First open a connection to the file, without actually importing it, using the LaF::laf_open_csv
function.
.dat <- LaF::laf_open_csv(filename = "data/2010_BSA_Carrier_PUF.csv",
                          # the two vectors below were truncated in the source; reconstructed to cover all 11 columns of this file
                          column_types = c("integer", "integer", "categorical", "categorical", "categorical",
                                           "integer", "integer", "categorical", "integer", "integer", "integer"),
                          column_names = c("sex", "age", "diagnose", "healthcare.procedure", "typeofservice",
                                           "service.count", "provider.type", "servicesprocessed", "place.served",
                                           "payment", "carrierline.count"),
                          skip = 1)
Now write the data to local storage as an ff data frame, using laf_to_ffdf.
data.ffdf <- ffbase::laf_to_ffdf(laf = .dat)
data.ffdf

##           sex age diagnose healthcare.procedure typeofservice service.count
## 1           1   1       NA                99213           M1B             1
## 2           1   1       NA                A0425           O1A             1
## 3           1   1       NA                A0425           O1A             1
## :           :   :        :                    :             :             :
## 2801659     2   6      V86                80053           T1B             1
## 2801660     2   6      V88                76856           I3B             1
## (remaining rows and columns not shown)
We can verify that the ffdf data frame has a small RAM footprint.
pryr::object_size(data.ffdf)
Summaries can also be computed directly from disk. For instance, a frequency table of the age groups (computed here with ffbase’s table.ff; the counts account for all 2,801,660 rows):
table.ff(data.ffdf$age)

##      1      2      3      4      5      6
## 517717 495315 492851 457643 419429 418705
The EMA implementations of biglm::biglm and biglm::bigglm have their ff versions.
library(biglm)
mymodel.ffdf <- biglm(payment ~ factor(sex) + factor(age) + place.served,
data = data.ffdf)
summary(mymodel.ffdf)
## Large data regression model: biglm(payment ~ factor(sex) + factor(age) + place.served, data = data.ffdf)
15.5 disk.frame
TODO: https://github.com/xiaodaigh/disk.frame
15.6 matter
The matter package provides memory-efficient reading, writing, and manipulation of structured binary data on disk, as vectors, matrices, arrays,
lists, and data frames.
TODO
15.7 iotools
A low-level facility for connecting to on-disk binary storage. Unlike ff and bigmemory, it behaves like native R
objects, with their copy-on-write policy. Unlike readr, it allows chunking. Unlike read.csv, it allows fast I/O.
iotools is thus a potentially very powerful facility. See Arnold et al. (2015) for details.
TODO
15.8 HDF5
Like ff, HDF5 is an efficient on-disk file format. The h5 package is an interface to the “HDF5” library, supporting fast
storage and retrieval of R objects such as vectors, matrices and arrays.
TODO
15.9 DelayedArray
An abstraction layer for operations on array objects, which supports various storage backends for the arrays:
• In RAM: base, Matrix, DelayedArray10 .
• On disk: HDF5Array11 , matterArray12 .
Several application packages already build upon the DelayedArray13 package, for instance DelayedMatrixStats14 .
10 https://bioconductor.org/packages/release/bioc/html/DelayedArray.html
11 https://bioconductor.org/packages/release/bioc/html/HDF5Array.html
12 https://github.com/PeteHaitch/matterArray
13 https://bioconductor.org/packages/release/bioc/html/DelayedArray.html
14 https://github.com/PeteHaitch/DelayedMatrixStats
16 http://technodocbox.com/C_and_CPP/112025624-Massive-data-shared-and-distributed-memory-and-concurrent-programming-bigmemory-and-foreach.html
17 http://www.columbia.edu/~sjm2186/EPIC_R/EPIC_R_BigData.pdf
18 https://cran.r-project.org/web/views/HighPerformanceComputing.html
19 https://cran.r-project.org/web/views/Databases.html
20 https://www.peterhickey.org/slides/2017/2017-08-01_Peter_Hickey_JSM.pdf
Chapter 16
Parallel Computing
You would think that because you have an expensive multicore computer your computations will speed up. Well, no;
unless you actively make sure of that. By default, no matter how many cores you have, the operating system will
allocate each R session to a single core.
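You can check how many cores your machine exposes with the parallel package, which ships with R:
parallel::detectCores()   # number of (logical) cores the operating system reports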
To parallelise computations, we need to distinguish between two types of parallelism:
1. Explicit parallelism: where the user handles the parallelisation.
2. Implicit parallelism: where the parallelisation is abstracted away from the user.
Clearly, implicit parallelism is more desirable. It is, however, very hard to design software that can parallelise any
algorithm, while adapting to your hardware, operating system, and the other software running on your device. A lot
of parallelisation still has to be explicit, but stay tuned for technologies like Ray1 , Apache Spark2 , Apache Flink3 ,
Chapel4 , PyTorch5 , and others, which are making great advances in handling parallelism for you. Before we can
understand what those do, we start with explicit parallelism.
Before parallelising, profile your code to detect how much RAM and CPU are consumed by each line of code. See Hadley’s guide10 .
In the best possible scenario, the amount of work you can do scales with the number of processors: with $p$ processors
you do roughly $p$ times more work per unit of time than with one. This is called perfect scaling. It is rarely observed in practice, since parallelizing incurs some computational overhead:
setting up environments, copying memory, … For this reason, the typical speedup is sub-linear. Computer scientists
call this Amdahl’s law11 .
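In formulas (a standard statement of Amdahl's law, not spelled out in the original): if a fraction $f$ of the work can be parallelised over $p$ processors, the best possible speedup is
$$\text{speedup}(p) = \frac{1}{(1-f) + f/p},$$
which is bounded by $1/(1-f)$ no matter how many processors you add.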
16.2 Terminology
Here are some terms we will be needing.
16.2.1 Hardware:
• Cluster: A collection of interconnected computers.
• Node/Machine: A single physical machine in the cluster. Components of a single node do not communicate
via the cluster’s network, but rather, via the node’s circuitry.
• Processor/Socket/CPU: The physical unit that executes instructions. A single node may host several processors
(sockets), each typically containing several cores.
• RAM: Random Access Memory. One of many types of memory in a computer. Possibly the most relevant type
of memory when computing with data.
• GPU: Graphical Processing Unit. A computing unit, separate from the CPU. Originally dedicated to graphics
and gaming, thus its name. Currently, GPUs are extremely popular for fitting and serving Deep Neural
Networks.
• TPU: Tensor Processing Unit. A computing unit dedicated to fitting and serving Deep Neural Networks.
16.2.2 Software:
• Process: A sequence of instructions in memory, with accompanying data. Different processes typically see
different locations of memory. Interpreted languages like R and Python operate on processes.
• Thread: A sub-sequence of instructions within a process. Different threads in a process may see the same
memory. Compiled languages like C and C++ may operate on threads.
• Socket: Data sent via a network interface. Put differently: information is sent between R processes as if they
were different machines in a network. The information may be structured by the particular application, and does not
need to abide by any predefined standard.
• Parallel Virtual Machine (PVM): A communication protocol and software, developed at the University of Ten-
nessee, Oak Ridge National Laboratory and Emory University, and first released in 1989. It runs on Windows and
Unix, thus allowing computation on clusters running these two operating systems. Nowadays, it is mostly replaced
by MPI. The same group responsible for PVM would later deliver Programming with Big Data in R (pbdR12 ): a
whole ecosystem of packages for running R on large computing clusters.
• Message Passing Interface (MPI): A communication protocol that has become the de-facto standard for
communication in large distributed clusters. Particularly, for heterogeneous computing clusters with varying
operating systems and hardware. The protocol has various software implementations, such as OpenMPI13 ,
MPICH14 , Deino15 , and LAM/MPI16 . Interestingly, large computing clusters use MPI, while modern BigData analysis
platforms such as Spark and Ray do not. Why is this? See Jonathan Dursi’s excellent but controversial blog
post17 .
• NetWorkSpaces (NWS): A master-slave communication protocol where the master is not an R-session, but
rather, an NWS server.
For more on inter-process communication, see Wiki18 .
16.3 Explicit Parallelism
16.3.1 parallel
The parallel package, maintained by the R-core team, was introduced in 2011 to unify two popular parallelisation
packages: snow and multicore. The multicore package was designed to parallelise using the fork mechanism, on
Linux machines. The snow package was designed to parallelise using the spawning mechanism, on all operating systems.
Servers/R-sessions started with snow will not see the parent’s data, which will have to be copied to the spawned sessions.
Forked sessions, in contrast, share the parent’s memory, so there is less data redundancy. Spawning, unlike forking, can be done on remote machines, which communicate
using MPI.
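A minimal sketch contrasting the two mechanisms using the parallel package itself (forking will not work on Windows):
library(parallel)
# Forking (Linux/Mac): child processes share the parent's memory.
mclapply(1:4, sqrt, mc.cores = 2)
# Spawning (all operating systems): workers are fresh R sessions,
# so any objects they need must be exported to them.
cl <- makeCluster(2)
parLapply(cl, 1:4, sqrt)
stopCluster(cl)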
16.3.2 Foreach
For reasons detailed in Kane et al. (2013), we recommend the foreach parallelisation package (Analytics and Weston,
2015). It allows us to:
1. Decouple our parallel algorithm from the parallelisation mechanism: we write parallelisable code once,
and can later switch between parallelisation mechanisms. Currently supported mechanisms include:
• fork: called with the doMC backend.
• MPI, PVM, NWS: called with the doSNOW or doMPI backends.
• futures: called with the doFuture backend.
• redis: called with the doRedis backend. Similar to NWS, only that data is made available to the different processes
using Redis19 .
• Future mechanisms may also be supported.
2. Combine with the big.matrix object from Chapter 15 for shared memory parallelisation: all the machines may
see the same data, so that we don’t need to export objects from machine to machine.
What do we mean by “switch the underlying parallelisation mechanism”? It means there are several packages that will
handle communication between machines. Some are very general and will work on any cluster. Some are more specific
and will work only on a single multicore machine (not a cluster) with a particular operating system. These mechanisms
include multicore, snow, parallel, and Rmpi. The compatibility between these mechanisms and foreach is provided
by another set of packages: doMC, doMPI, doRedis, doParallel, and doSNOW.
12 https://pbdr.org
13 https://en.wikipedia.org/wiki/Open_MPI
14 https://en.wikipedia.org/wiki/MPICH
15 http://mpi.deino.net/
16 https://en.wikipedia.org/wiki/LAM/MPI
17 https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html
18 https://en.wikipedia.org/wiki/Inter-process_communication
19 https://en.wikipedia.org/wiki/Redis
Remark. I personally prefer the multicore mechanism, with the doMC adapter for foreach. I will not use this
combo, however, because multicore will not work on Windows machines. I will thus use the more general snow and
doParallel combo. If you do happen to run on Linux, or Unix, you will want to replace all doParallel functionality
with doMC.
Let’s start with a simple example, taken from “Getting Started with doParallel and foreach”20 .
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
result <- foreach(i=1:3) %dopar% sqrt(i)
class(result)
## [1] "list"
result
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
Things to note:
• makeCluster creates an object with the information on our cluster. On a single machine it is very simple. On a
cluster of machines, you will need to specify the IP addresses, or other identifiers, of the machines.
• registerDoParallel is used to inform the foreach package of the presence of our cluster.
• The foreach function handles the looping. In particular, note the %dopar% operator that ensures that the looping
is done in parallel. %dopar% can be replaced by %do% if you want serial looping (like the for loop), for instance, for
debugging.
• The output of the various machines is collected by foreach to a list object.
• In this simple example, no data is shared between machines so we are not putting the shared memory capabilities
to the test.
• We can check how many workers were involved using the getDoParWorkers() function.
• We can check the parallelisation mechanism used with the getDoParName() function. Both are demonstrated below.
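For example, with the cluster registered above:
getDoParWorkers()   # number of registered workers (2 for the cluster above)
getDoParName()      # name of the registered parallel backend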
Here is a more involved example. We now try to make Bootstrap21 inference on the coefficients of a logistic regression.
Bootstrapping means that in each iteration, we resample the data, and refit the model.
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 1e4
ptime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
ptime
## elapsed
## 10.443
Things to note:
• As usual, we use the foreach function with the %dopar% operator to loop in parallel.
• The icount function generates a counter.
20 http://debian.mc.vanderbilt.edu/R/CRAN/web/packages/doParallel/vignettes/gettingstartedParallel.pdf
21 https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
• The .combine=cbind argument tells the foreach function how to combine the output of different machines, so
that the returned object is not the default list.
How long would that have taken in a simple (serial) loop? We only need to replace %dopar% with %do% to test.
stime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %do% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
stime
## elapsed
## 16.9
Yes. Parallelising is clearly faster.
Let’s see how we can combine the power of bigmemory and foreach by creating a file mapped big.matrix object,
which is shared by all machines. The following example is taken from Kane et al. (2013), and uses the big.matrix
object we created in Chapter 15.
library(bigmemory)
x <- attach.big.matrix("airline.desc")
library(foreach)
library(doSNOW)
cl <- makeSOCKcluster(rep("localhost", 4)) # make a cluster of 4 machines
registerDoSNOW(cl) # register machines for foreach()
Get a “description” of the big.matrix object that will be used to call it from each machine.
xdesc <- describe(x)
We are all set up to loop, in parallel, and compute quantiles of CAR_LINE_ICD9_DGNS_CD for each value of
BENE_AGE_CAT_CD.
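The grouping index G and the helper GetDepQuantiles come from Kane et al. (2013) and are not defined in this excerpt. A minimal sketch of what they could look like for this dataset (column names taken from the import above; the definitions in the original may differ):
# One vector of row indices per age group:
G <- split(1:nrow(x), x[, "BENE_AGE_CAT_CD"])
# Quantiles of the diagnosis-code column for a given set of rows:
GetDepQuantiles <- function(rows, data) {
  quantile(data[rows, "CAR_LINE_ICD9_DGNS_CD"],
           probs = c(0.5, 0.9, 0.99), na.rm = TRUE)
}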
qs <- foreach(g = G, .combine = rbind) %dopar% {
require("bigmemory")
x <- attach.big.matrix(xdesc)
GetDepQuantiles(rows = g, data = x)
}
qs
16.3.3 Rdsm
16.3.4 pbdR
The pbdR ecosystem targets large computing clusters, not a single machine. See https://pbdr.org.
The pbdMPI package provides S4 classes to directly interface MPI in order to support the Single Program/Multiple
Data (SPMD) parallel programming style which is particularly useful for batch parallel execution. The pbdSLAP builds
on this and uses scalable linear algebra packages (namely BLACS, PBLAS, and ScaLAPACK) in double precision based
on ScaLAPACK version 2.0.2. The pbdBASE builds on these and provides the core classes and methods for distributed
data types upon which the pbdDMAT builds to provide distributed dense matrices for “Programming with Big Data”.
The pbdNCDF4 package permits multiple processes to write to the same file (without manual synchronization) and
supports terabyte-sized files. The pbdDEMO package provides examples for these packages, and a detailed vignette.
The pbdPROF package profiles MPI communication in SPMD code via MPI profiling libraries, such as fpmpi, mpiP, or
TAU.
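To give a flavour of the SPMD style, here is a minimal pbdMPI “hello world” sketch (not from the original); it is launched from the shell, e.g. mpirun -np 2 Rscript hello_spmd.R, rather than run interactively:
# hello_spmd.R
library(pbdMPI)
init()                                    # initialise the MPI communicator
msg <- sprintf("Hello from rank %d of %d", comm.rank(), comm.size())
comm.print(msg, all.rank = TRUE)          # every rank prints its message
finalize()                                # shut down MPI cleanly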
16.4 Implicit Parallelism
R’s linear algebra is implicitly parallelised when R is linked against a multi-threaded BLAS, such as OpenBLAS24 . You can check which BLAS library your R installation uses; on the machine this book was compiled on:
extSoftVersion()["BLAS"]   # which BLAS is R linked against? (call reconstructed)
##                                BLAS
## "/usr/lib/libopenblasp-r0.2.19.so"
24 https://en.wikipedia.org/wiki/OpenBLAS
26 http://www.parallelr.com/r-with-parallel-computing/
27 https://cran.r-project.org/web/views/HighPerformanceComputing.html
28 https://www.datacamp.com/courses/parallel-programming-in-r
Chapter 17
Numerical Linear Algebra
In your algebra courses you would write $Ax = b$ and solve it via $x = A^{-1}b$. This is useful to understand the algebraic
properties of $x$, but a computer would never recover $x$ that way. Even the computation of the sample variance,
$S^2(x) = (n-1)^{-1} \sum_i (x_i - \bar{x})^2$, is not done with its textbook formula, because of numerical-accuracy and speed considerations.
In this chapter, we discuss several ways a computer solves systems of linear equations, with their application to
statistics, namely, to OLS problems.
17.1 LU Factorization
Definition 17.1 (LU Factorization). For some matrix $A$, the LU factorization is defined as
$$A = LU, \tag{17.1}$$
where $L$ is lower triangular and $U$ is upper triangular.
The LU factorization is essentially the matrix notation for the Gaussian elimination1 you did in your introductory
algebra courses.
For a square $n \times n$ matrix, the LU factorization requires about $n^3/3$ operations, and stores $n^2 + n$ elements in memory.
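In R, base solve() follows exactly this route for square systems: it computes an LU factorization via LAPACK instead of inverting the matrix. A minimal sketch:
set.seed(1)
A <- matrix(rnorm(9), 3, 3)
b <- rnorm(3)
x <- solve(A, b)             # solves Ax = b via LU; A^{-1} is never formed
all.equal(drop(A %*% x), b)  # TRUE, up to numerical error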
17.2 Cholesky Factorization
Seeing the matrix $A$ as a function, non-negative (positive semi-definite) matrices can be thought of as functions that generalize the squaring
operation.
Definition 17.3 (Cholesky Factorization). For some non-negative matrix $A$, the Cholesky factorization is defined as
$$A = T'T, \tag{17.2}$$
where $T$ is upper triangular.
For obvious reasons, the Cholesky factorization is known as the square root of a matrix.
Because Cholesky is less general than LU, it is also more efficient. It can be computed in about $n^3/6$ operations, and requires
storing $n(n+1)/2$ elements.
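A minimal sketch of the Cholesky factorization in R, and of solving a positive-definite system with it:
A <- crossprod(matrix(rnorm(9), 3, 3)) + diag(3)  # a positive-definite matrix
T <- chol(A)                                      # upper triangular, with A = T'T
all.equal(crossprod(T), A)                        # TRUE
b <- rnorm(3)
x <- backsolve(T, forwardsolve(t(T), b))          # solve T'z = b, then Tx = z
all.equal(drop(A %*% x), b)                       # TRUE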
1 https://en.wikipedia.org/wiki/Gaussian_elimination
17.3 QR Factorization
Definition 17.4 (QR Factorization). For some matrix $A$, the QR factorization is defined as
$$A = QR, \tag{17.3}$$
where $Q$ has orthonormal columns and $R$ is upper triangular.
The QR factorization is very useful for solving the OLS problem, as we will see in Section 17.6. The QR factorization takes about $2n^3/3$
operations to compute. Three major methods for computing the QR factorization exist. These rely on Householder
transformations, Givens transformations, and a (modified) Gram-Schmidt procedure (Gentle, 2012).
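A minimal sketch of the QR factorization in R (base qr() uses Householder reflections):
A <- matrix(rnorm(12), 4, 3)
qrA <- qr(A)
Q <- qr.Q(qrA)          # 4 x 3, with orthonormal columns
R <- qr.R(qrA)          # 3 x 3, upper triangular
all.equal(Q %*% R, A)   # TRUE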
For some matrix $A$, the singular value decomposition (SVD) is
$$A = U \Sigma V', \tag{17.4}$$
where $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal with non-negative entries (the singular values).
The SVD factorization is very useful for algebraic analysis, but less so for computations. This is because it is (typically)
solved via the QR factorization.
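A minimal sketch of the SVD in R:
A <- matrix(rnorm(12), 4, 3)
s <- svd(A)                                  # a list with components u, d, v
all.equal(s$u %*% diag(s$d) %*% t(s$v), A)   # TRUE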
17.6 Solving the OLS Problem
In the OLS problem we seek the $\beta$ that minimizes $\Vert y - X\beta \Vert^2$. Differentiating and equating to zero yields
$$X'X\beta = X'y. \tag{17.6}$$
Eq.(17.6) is known as the normal equations. The normal equations are the link between the OLS problem and the
matrix factorizations discussed above.
Using the QR decomposition of $X$ in the normal equations we have that
$$\hat{\beta} = R_{(1:p,1:p)}^{-1} (Q'y)_{(1:p)},$$
so the OLS coefficients are obtained by back-solving against the upper $p \times p$ block of $R$, rather than by inverting $X'X$.
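A minimal sketch comparing the QR route (which is what lm() uses internally) with the naive normal-equations route; the data are simulated for illustration:
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta <- c(1, 2, -1)
y <- drop(X %*% beta + rnorm(n))
qrX <- qr(X)                                           # QR factorization of X
beta_qr <- qr.coef(qrX, y)                             # back-solves R beta = Q'y
beta_ne <- drop(solve(crossprod(X), crossprod(X, y)))  # normal equations (less stable)
cbind(qr = beta_qr, normal.eq = beta_ne, lm = coef(lm(y ~ X - 1)))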
17.7 Numerical Libraries for Linear Algebra
17.7.1 OpenBLAS
17.7.2 MKL
2 https://en.wikipedia.org/wiki/Comparison_of_linear_algebra_libraries
3 http://dirk.eddelbuettel.com/blog/2018/04/15/#018_mkl_for_debian_ubuntu
4 https://www.r-bloggers.com/why-is-r-slow-some-explanations-and-mklopenblas-setup-to-try-to-fix-this/
5 https://www.r-bloggers.com/for-faster-r-use-openblas-instead-better-than-atlas-trivial-to-switch-to-on-ubuntu/
6 https://gist.github.com/pachamaltese/e4b819ccf537d465a8d49e6d60252d89
Chapter 18
Convex Optimization
TODO
1 https://cran.r-project.org/web/views/Optimization.html
Chapter 19
RCpp
Chapter 20
Debugging Tools
TODO. In the meanwhile, get started with Wickham (2011), and get pro with Cotton (2017).
Chapter 21
The Hadleyverse
The Hadleyverse, short for “Hadley Wickham’s universe”, is a set of packages that make it easier to handle data. If
you are developing packages, you should be careful, since using these packages may create many dependencies and
compatibility issues. If you are analyzing data, and the portability of your functions to other users, machines, and
operating systems is not a concern, you will LOVE these packages. The term Hadleyverse refers to all of Hadley’s
packages, but here we mention only a useful subset, which can be collectively installed via the tidyverse package:
21.1 readr
The readr package (Wickham et al., 2016) replaces base functions for importing and exporting data, such as
read.table. It is faster, with a cleaner syntax.
We will not go into the details, and refer the reader to the official documentation here1 and to the R for Data Science2
book.
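A small, self-contained sketch of the readr interface (function names mirror their base R counterparts):
library(readr)
tmp <- tempfile(fileext = ".csv")
write_csv(mtcars, tmp)   # like write.csv, but faster and without row names
dat <- read_csv(tmp)     # like read.csv; returns a tibble and reports the guessed column types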
21.2 dplyr
When you think of data frame operations, think dplyr (Wickham and Francois, 2016). Notable utilities in the package
include:
• right_join(x,y,by="col"): return all rows from ‘y’, and all columns from ‘x’ and ‘y’. Rows in ‘y’ with no
match in ‘x’ will have ‘NA’ values in the new columns. If there are multiple matches between ‘x’ and ‘y’, all
combinations of the matches are returned.
• anti_join(x,y,by="col"): return all rows from ‘x’ where there are no matching values in ‘y’, keeping just the
columns from ‘x’.
The following examples involve data.frame objects, but dplyr can handle other classes. In particular, data.tables
from the data.table package (Dowle and Srinivasan, 2017), which is designed for very large data sets.
dplyr can work with data stored in a database, in which case it will convert your commands to the appropriate SQL
syntax, and issue them to the database. This has the advantage that (a) you do not need to know the specific SQL
implementation of your database, and (b) you can enjoy the optimized algorithms provided by the database supplier.
For more on this, see the databases vignette3 .
The following examples are taken from Kevin Markham4 . The nycflights13::flights data has delay data for US flights.
library(nycflights13)
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
The data is of class tbl_df, which is an extension of the data.frame class, designed for large data sets. Notice that
the printing of flights is short, even without calling the head function. This is a feature of the tbl_df class (printing
a plain data.frame would try to print all the data, thus taking a long time).
class(flights) # a tbl_df is an extension of the data.frame class
library(dplyr)
filter(flights, month == 1, day == 1) #dplyr style
flights %>% filter(month == 1, day == 1) # dplyr with piping.
More filtering.
filter(flights, month == 1 | month == 2) # First OR second month.
slice(flights, 1:10) # selects first ten rows.
3 https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html
4 https://github.com/justmarkham/dplyr-tutorial/blob/master/dplyr-tutorial.Rmd
select(flights, year, month, day) # select columns year, month, and day
select(flights, year:day) # select column range
select(flights, -(year:day)) # drop columns
rename(flights, tail_num = tailnum) # rename column 'tailnum' to 'tail_num'
# simple statistics
summarise(flights,
delay = mean(dep_delay, na.rm = TRUE)
)
# random subsample
sample_n(flights, 10)
sample_frac(flights, 0.01)
We now perform operations on subgroups. We group observations along the plane’s tail number (tailnum), and
compute the count, average distance traveled, and average delay. We group with group_by, and compute subgroup
statistics with summarise.
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,    # the summarise call is missing in the source; reconstructed from the description above
                   count = n(),
                   dist = mean(distance, na.rm = TRUE),
                   delay = mean(arr_delay, na.rm = TRUE))
delay
We can group along several variables, with a hierarchy. We then collapse the hierarchy one by one.
daily <- group_by(flights, year, month, day)
per_day <- summarise(daily, flights = n())
per_month <- summarise(per_day, flights = sum(flights))
per_year <- summarise(per_month, flights = sum(flights))
Things to note:
• Every call to summarise collapses one level in the hierarchy of grouping. The output of group_by recalls the
hierarchy of aggregation, and collapses along this hierarchy.
We can use dplyr for two-table operations, i.e., joins. For this, we join the flight data with the other tables in
nycflights13: the airline data in airlines, the weather data in weather, the airplane data in planes, and the airport data in airports.
library(dplyr)
airlines
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
# select the subset of interesting flight data.
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
## Joining, by = "carrier"
## # A tibble: 336,776 x 9
## year month day hour origin dest tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA United Air Lines I~
## 2 2013 1 1 5 LGA IAH N24211 UA United Air Lines I~
## 3 2013 1 1 5 JFK MIA N619AA AA American Airlines ~
## 4 2013 1 1 5 JFK BQN N804JB B6 JetBlue Airways
## 5 2013 1 1 6 LGA ATL N668DN DL Delta Air Lines In~
## 6 2013 1 1 5 EWR ORD N39463 UA United Air Lines I~
## 7 2013 1 1 6 EWR FLL N516JB B6 JetBlue Airways
## 8 2013 1 1 6 LGA IAD N829AS EV ExpressJet Airline~
## 9 2013 1 1 6 JFK MCO N593JB B6 JetBlue Airways
## 10 2013 1 1 6 LGA ORD N3ALAA AA American Airlines ~
## # ... with 336,766 more rows
flights2 %>% left_join(weather)                 # join with the weather data (output omitted here)
flights2 %>% left_join(planes, by = "tailnum")  # join with the airplane data
## # A tibble: 336,776 x 16
## year.x month day hour origin dest tailnum carrier year.y type
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixe~
## 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixe~
## 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixe~
## 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixe~
## 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixe~
## 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixe~
## 7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixe~
## 8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixe~
## 9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixe~
## 10 2013 1 1 6 LGA ORD N3ALAA AA NA <NA>
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## # model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
# join with explicit column matching
flights2 %>% left_join(airports, by= c("dest" = "faa"))
## # A tibble: 336,776 x 15
## year month day hour origin dest tailnum carrier name lat lon
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA Geor~ 30.0 -95.3
## 2 2013 1 1 5 LGA IAH N24211 UA Geor~ 30.0 -95.3
## 3 2013 1 1 5 JFK MIA N619AA AA Miam~ 25.8 -80.3
## 4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA
## 5 2013 1 1 6 LGA ATL N668DN DL Hart~ 33.6 -84.4
## 6 2013 1 1 5 EWR ORD N39463 UA Chic~ 42.0 -87.9
## 7 2013 1 1 6 EWR FLL N516JB B6 Fort~ 26.1 -80.2
## 8 2013 1 1 6 LGA IAD N829AS EV Wash~ 38.9 -77.5
## 9 2013 1 1 6 JFK MCO N593JB B6 Orla~ 28.4 -81.3
## 10 2013 1 1 6 LGA ORD N3ALAA AA Chic~ 42.0 -87.9
## # ... with 336,766 more rows, and 4 more variables: alt <int>, tz <dbl>,
## # dst <chr>, tzone <chr>
Types of joins, with their SQL equivalents.
# Create simple data
(df1 <- data_frame(x = c(1, 2), y = 2:1))

## # A tibble: 2 x 2
##       x     y
##   <dbl> <int>
## 1     1     2
## 2     2     1

(df2 <- data_frame(x = c(1, 3), a = 10, b = "a"))

## # A tibble: 2 x 3
##       x     a b
##   <dbl> <dbl> <chr>
## 1     1    10 a
## 2     3    10 a
# Return only matched rows
df1 %>% inner_join(df2) # SELECT * FROM x JOIN y ON x.a = y.a
## Joining, by = "x"
## # A tibble: 1 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
# Return all rows in df1.
df1 %>% left_join(df2) # SELECT * FROM x LEFT JOIN y ON x.a = y.a
## Joining, by = "x"
## # A tibble: 2 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 2 1 NA <NA>
# Return all rows in df2.
df1 %>% right_join(df2) # SELECT * FROM x RIGHT JOIN y ON x.a = y.a
## Joining, by = "x"
## # A tibble: 2 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 3 NA 10 a
# Return all rows.
df1 %>% full_join(df2) # SELECT * FROM x FULL JOIN y ON x.a = y.a
## Joining, by = "x"
## # A tibble: 3 x 4
## x y a b
## <dbl> <int> <dbl> <chr>
## 1 1 2 10 a
## 2 2 1 NA <NA>
## 3 3 NA 10 a
# Keep rows of df1 that have a match in df2, returning only df1's columns
df1 %>% semi_join(df2, by = "x") # SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)
## # A tibble: 1 x 2
## x y
## <dbl> <int>
## 1 1 2
21.3 tidyr
21.4 reshape2
21.5 stringr
21.6 anytime
21.7 Bibliographic Notes
21.8 Practice Yourself
Chapter 22
Causal Inference
How come everyone in the past did not know what every kid knows these days: that cigarettes are bad for you? The
reason is the difficulty of causal inference. Scientists knew about the correlations between smoking and disease, but
no one could prove that one caused the other. These could have been nothing more than correlations, with some external
cause.
Cigarettes were declared dangerous without any direct causal evidence. It was the USA Surgeon General’s report
of 19641 that decided that, despite the impossibility of showing a direct causal relation, the circumstantial
evidence was just too strong, and declared cigarettes dangerous.
1 https://profiles.nlm.nih.gov/ps/retrieve/Narrative/NN/p-nid/60
Bibliography
Rosenblatt, J., Gilron, R., and Mukamel, R. (2016). Better-than-chance classification for signal detection. arXiv
preprint arXiv:1608.08873.
Rosenblatt, J. D. and Benjamini, Y. (2014). Selective correlations; not voodoo. NeuroImage, 103:401–410.
Rosset, S. and Tibshirani, R. J. (2018). From fixed-x to random-x regression: Bias-variance decompositions, covariance
penalties, and prediction error estimation. Journal of the American Statistical Association, (just-accepted).
Sammut, C. and Webb, G. I. (2011). Encyclopedia of machine learning. Springer Science & Business Media.
Sarkar, D. (2008). Lattice: Multivariate Data Visualization with R. Springer, New York. ISBN 978-0-387-75968-5.
Schmidberger, M., Morgan, M., Eddelbuettel, D., Yu, H., Tierney, L., and Mansmann, U. (2009). State of the art in
parallel computing with R. Journal of Statistical Software, 47(1).
Searle, S. R., Casella, G., and McCulloch, C. E. (2009). Variance components, volume 391. John Wiley & Sons.
Shah, V. and Gilbert, J. R. (2004). Sparse matrices in matlab* p: Design and implementation. In International
Conference on High-Performance Computing, pages 144–155. Springer.
Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge
university press.
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge university press.
Simes, R. J. (1986). An improved bonferroni procedure for multiple tests of significance. Biometrika, 73(3):751–754.
Small, C. G. (1990). A survey of multidimensional medians. International Statistical Review/Revue Internationale de
Statistique, pages 263–277.
Tukey, J. W. (1977). Exploratory data analysis. Reading, Mass.
Vapnik, V. (2013). The nature of statistical learning theory. Springer science & business media.
Venables, W. N. and Ripley, B. D. (2013). Modern applied statistics with S-PLUS. Springer Science & Business Media.
Venables, W. N., Smith, D. M., Team, R. D. C., et al. (2004). An introduction to R.
Wang, C., Chen, M.-H., Schifano, E., Wu, J., and Yan, J. (2015). Statistical methods and computing for big data.
arXiv preprint arXiv:1502.07989.
Weihs, C., Mersmann, O., and Ligges, U. (2013). Foundations of Statistical Algorithms: With References to R Packages.
CRC Press.
Weiss, R. E. (2005). Modeling longitudinal data. Springer Science & Business Media.
Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
Wickham, H. (2011). testthat: Get started with testing. The R Journal, 3(1):5–10.
Wickham, H. (2014). Advanced R. CRC Press.
Wickham, H. and Francois, R. (2016). dplyr: A Grammar of Data Manipulation. R package version 0.5.0.
Wickham, H., Hester, J., and Francois, R. (2016). readr: Read Tabular Data. R package version 1.0.0.
Wilcox, R. R. (2011). Introduction to robust estimation and hypothesis testing. Academic Press.
Wilkinson, G. and Rogers, C. (1973). Symbolic description of factorial models for analysis of variance. Applied
Statistics, pages 392–399.
Wilkinson, L. (2006). The grammar of graphics. Springer Science & Business Media.
Xie, Y. (2015). Dynamic Documents with R and knitr, volume 29. CRC Press.
Xie, Y. (2016). bookdown: Authoring Books and Technical Documents with R Markdown. CRC Press.