Modern Statistics With R
Måns Thulin
1 Introduction 17
1.1 Welcome to R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2 About this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2 The basics 21
2.1 Installing R and RStudio . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 A first look at RStudio . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Running R code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 R scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Variables and functions . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1 Storing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.2 What’s in a name? . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.3 Vectors and data frames . . . . . . . . . . . . . . . . . . . . . . 30
2.4.4 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.5 Mathematical operations . . . . . . . . . . . . . . . . . . . . . 35
2.5 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6 Descriptive statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.6.1 Numerical data . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.6.2 Categorical data . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.7 Plotting numerical data . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.7.1 Our first plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.7.2 Colours, shapes and axis labels . . . . . . . . . . . . . . . . . . 45
2.7.3 Axis limits and scales . . . . . . . . . . . . . . . . . . . . . . . 46
2.7.4 Comparing groups . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.7.5 Boxplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.7.6 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.8 Plotting categorical data . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.8.1 Bar charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.9 Saving your plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.10 Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6 R programming 207
6.1 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
6.1.1 Creating functions . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.1.2 Local and global variables . . . . . . . . . . . . . . . . . . . . . 209
6.1.3 Will your function work? . . . . . . . . . . . . . . . . . . . . . 211
6.1.4 More on arguments . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.1.5 Namespaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.1.6 Sourcing other scripts . . . . . . . . . . . . . . . . . . . . . . . 215
6.2 More on pipes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6.2.1 These are not pipes either . . . . . . . . . . . . . . . . . . . . . 215
6.2.2 Writing functions with pipes . . . . . . . . . . . . . . . . . . . 217
6.3 Checking conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.3.1 if and else . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.3.2 & and && . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6.3.3 ifelse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
6.3.4 switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.3.5 Failing gracefully . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.4 Iteration using loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
6.4.1 for loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
6.4.2 Loops within loops . . . . . . . . . . . . . . . . . . . . . . . . . 226
6.4.3 Keeping track of what’s happening . . . . . . . . . . . . . . . . 227
6.4.4 Loops and lists . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
6.4.5 while loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
6.5 Iteration using vectorisation and functionals . . . . . . . . . . . . . . . 231
6.5.1 A first example with apply . . . . . . . . . . . . . . . . . . . . 232
6.5.2 Variations on a theme . . . . . . . . . . . . . . . . . . . . . . . 233
6.5.3 purrr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
6.5.4 Specialised functions . . . . . . . . . . . . . . . . . . . . . . . . 235
6.5.5 Exploring data with functionals . . . . . . . . . . . . . . . . . . 236
6.5.6 Keep calm and carry on . . . . . . . . . . . . . . . . . . . . . . 238
6.5.7 Iterating over multiple variables . . . . . . . . . . . . . . . . . 238
6.6 Measuring code performance . . . . . . . . . . . . . . . . . . . . . . . 240
6.6.1 Timing functions . . . . . . . . . . . . . . . . . . . . . . . . . . 241
6.6.2 Measuring memory usage - and a note on compilation . . . . . 243
11 Debugging 421
11.1 Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
11.1.1 Find out where the error occurred with traceback . . . . . . . 422
11.1.2 Interactive debugging of functions with debug . . . . . . . . . . 423
11.1.3 Investigate the environment with recover . . . . . . . . . . . . 424
11.2 Common error messages . . . . . . . . . . . . . . . . . . . . . . . . . . 425
11.2.1 + . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
11.2.2 could not find function . . . . . . . . . . . . . . . . . . . . 425
11.2.3 object not found . . . . . . . . . . . . . . . . . . . . . . . . . 425
11.2.4 cannot open the connection and No such file or
directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
11.2.5 invalid 'description' argument . . . . . . . . . . . . . . . 426
11.2.6 missing value where TRUE/FALSE needed . . . . . . . . . . . 427
11.2.7 unexpected '=' in ... . . . . . . . . . . . . . . . . . . . . . 427
11.2.8 attempt to apply non-function . . . . . . . . . . . . . . . . 428
11.2.9 undefined columns selected . . . . . . . . . . . . . . . . . . 428
11.2.10 subscript out of bounds . . . . . . . . . . . . . . . . . . . . 428
11.2.11 Object of type ‘closure’ is not subsettable . . . . . . 429
11.2.12 $ operator is invalid for atomic vectors . . . . . . . . . 429
11.2.13 (list) object cannot be coerced to type ‘double’ . . . 429
11.2.14 arguments imply differing number of rows . . . . . . . . . 430
11.2.15 non-numeric argument to a binary operator . . . . . . . . 430
11.2.16 non-numeric argument to mathematical function . . . . . 430
11.2.17 cannot allocate vector of size ... . . . . . . . . . . . . . 431
11.2.18 Error in plot.new() : figure margins too large . . . . 431
11.2.19 Error in .Call.graphics(C_palette2, .Call(C_palette2,
NULL)) : invalid graphics state . . . . . . . . . . . . . . . 431
11.3 Common warning messages . . . . . . . . . . . . . . . . . . . . . . . . 431
11.3.1 replacement has ... rows ... . . . . . . . . . . . . . . . . . 431
11.3.2 the condition has length > 1 and only the first
element will be used . . . . . . . . . . . . . . . . . . . . . . 432
11.3.3 number of items to replace is not a multiple of
replacement length . . . . . . . . . . . . . . . . . . . . . . . 432
11.3.4 longer object length is not a multiple of shorter
object length . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
11.3.5 NAs introduced by coercion . . . . . . . . . . . . . . . . . . 433
11.3.6 package is not available (for R version 4.x.x) . . . . 433
11.4 Messages printed when installing ggplot2 . . . . . . . . . . . . . . . . 434
Bibliography 567
Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
Online resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
Index 573
To cite this book, please use the following:
• Thulin, M. (2021). Modern Statistics with R. Eos Chasma Press. ISBN
9789152701515.
Chapter 1
Introduction
1.1 Welcome to R
Welcome to the wonderful world of R!
R is not like other statistical software packages. It is free, versatile, fast, and mod-
ern. It has a large and friendly community of users that help answer questions and
develop new R tools. With more than 17,000 add-on packages available, R offers
more functions for data analysis than any other statistical software. This includes
specialised tools for disciplines as varied as political science, environmental chem-
istry, and astronomy, and new methods come to R long before they come to other
programs. R makes it easy to construct reproducible analyses and workflows that
allow you to easily repeat the same analysis more than once.
R is not like other programming languages. It was developed by statisticians as a
tool for data analysis and not by software engineers as a tool for other programming
tasks. It is designed from the ground up to handle data, and that shows. But it is
also flexible enough to be used to create interactive web pages, automated reports,
and APIs.
R is, simply put, currently the best tool there is for data analysis.
1.2 About this book
This is not a book that has been written with the intention that you should read it
back-to-back. Rather, it is intended to serve as a guide to what to do next as you
explore R. Think of it as a conversation, where you and I discuss different topics
related to data analysis and data wrangling. At times I’ll do the talking, introduce
concepts and pose questions. At times you’ll do the talking, working with exercises
and discovering all that R has to offer. The best way to learn R is to use R. You
should strive for active learning, meaning that you should spend more time with
R and less time stuck with your nose in a book. Together we will strive for an
exploratory approach, where the text guides you to discoveries and the exercises
challenge you to go further. This is how I’ve been teaching R since 2008, and I hope
that it’s a way that you will find works well for you.
The book contains more than 200 exercises. Apart from a number of open-ended
questions about ethical issues, all exercises involve R code. These exercises all have
worked solutions. It is highly recommended that you actually work with all the
exercises, as they are central to the approach to learning that this book seeks to
support: using R to solve problems is a much better way to learn the language than
to just read about how to use R to solve problems. Once you have finished an exercise
(or attempted but failed to finish it) read the proposed solution - it may differ from
what you came up with and will sometimes contain comments that you may find
interesting. Treat the proposed solutions as a part of our conversation. As you work
with the exercises and compare your solutions to those in the back of the book, you
will gain more and more experience working with R and build your own library of
examples of how problems can be solved.
Some books on R focus entirely on data science - data wrangling and exploratory
data analysis - ignoring the many great tools R has to offer for deeper data analyses.
Others focus on predictive modelling or classical statistics but ignore data-handling,
which is a vital part of modern statistical work. Many introductory books on statis-
tical methods put too little focus on recent advances in computational statistics and
advocate methods that have become obsolete. Far too few books contain discussions
of ethical issues in statistical practice. This book aims to cover all of these topics
and show you the state-of-the-art tools for all these tasks. It covers data science and
(modern!) classical statistics as well as predictive modelling and machine learning,
and deals with important topics that rarely appear in other introductory texts, such
as simulation. It is written for R 4.0 or later and will teach you powerful add-on
packages like data.table, dplyr, ggplot2, and caret.
The book is organised as follows:
Chapter 2 covers basic concepts and shows how to use R to compute descriptive
statistics and create nice-looking plots.
Chapter 3 is concerned with how to import and handle data in R, and how to perform
routine statistical analyses.
Chapter 4 covers exploratory data analysis using statistical graphics, as well as un-
Chapter 2
The basics
Let’s start from the very beginning. This chapter acts as an introduction to R. It
will show you how to install and work with R and RStudio.
After working with the material in this chapter, you will be able to:
• Create reusable R scripts,
• Store data in R,
• Use functions in R to analyse data,
• Install add-on packages adding additional features to R,
• Compute descriptive statistics like the mean and the median,
• Do mathematical calculations,
• Create nice-looking plots, including scatterplots, boxplots, histograms and bar
charts,
• Find errors in your code.
If you have a modern computer (which in this case means a computer from 2010 or
later), you should go with the 64-bit version.
You have now installed the R programming language. Working with it is easier with
an integrated development environment, or IDE for short, which allows you to easily
write, run and debug your code. This book is written for use with the RStudio IDE,
but 99.9% of it will work equally well with other IDEs, like Emacs with ESS or
Jupyter notebooks.
To download RStudio, go to the RStudio download page
https://rstudio.com/products/rstudio/download/#download
Click on the link to download the installer for your operating system, and then run
it.
4. The Script panel, used for writing code. This is where you’ll spend most of
your time working.
If you launch RStudio by opening a file with R code, the Script panel will appear,
otherwise it won’t. Don’t worry if you don’t see it at this point - you’ll learn how to
open it soon enough.
The Console panel will contain R’s startup message, which shows information about
which version of R you're running2:
R version 4.1.0 (2021-05-18) -- "Camp Pontanezen"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
2 In addition to the version number, each release of R has a nickname referencing a Peanuts comic
by Charles Schulz. The “Camp Pontanezen” nickname of R 4.1.0 is a reference to the Peanuts comic
from February 12, 1986.
You can resize the panels as you like, either by clicking and dragging their borders
or using the minimise/maximise buttons in the upper right corner of each panel.
When you exit RStudio, you will be asked if you wish to save your workspace, meaning
that the data that you’ve worked with will be stored so that it is available the next
time you run R. That might sound like a good idea, but in general, I recommend that
you don’t save your workspace, as that often turns out to cause problems down the
line. It is almost invariably a much better idea to simply rerun the code you worked
with in your next R session.
Code chunks will frequently contain multiple lines. You can select and copy multiple
lines from the digital version of this book and paste them directly into the Console:
2*2
1+2*3-5
As you can see, when you type the code into the Console panel and press Enter,
R runs (or executes) the code and returns an answer. To get you started, the first
exercise will have you write a line of code to perform a computation. You can find a
solution to this and other exercises at the end of the book, in Chapter 13.
3 The word manipulate has different meanings. Just to be perfectly clear: whenever I speak of
manipulating data in this book, I will mean handling and transforming the data, not tampering
with it.
Exercise 2.1. Use R to compute the product of the first ten integers: 1 ⋅ 2 ⋅ 3 ⋅ 4 ⋅ 5 ⋅
6 ⋅ 7 ⋅ 8 ⋅ 9 ⋅ 10.
2.3.1 R scripts
When working in the Console panel4, you can use the up arrow ↑ on your keyboard
to retrieve lines of code that you’ve previously used. There is however a much better
way of working with R code: to put it in script files. These are files containing R
code, that you can save and then run again whenever you like.
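As a sketch of what such a file might contain (the file name and values here are made up for illustration):

```r
# my_first_script.R - computes a net income from example values
income <- 100
taxes <- 20
net_income <- income - taxes
net_income
```

Saving this as my_first_script.R lets you rerun the whole computation later without retyping anything.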
In the Script panel, when you press Enter, you insert a new line instead of running the
code. That’s because the Script panel is used for writing code rather than running
it. To actually run the code, you must send it to the Console panel. This can be
done in several ways. Let’s give them a try to see which you prefer.
To run the entire script, do one of the following:
• Press the Source button in the upper right corner of the Script panel.
• Press Ctrl+Shift+Enter on your keyboard.
• Press Ctrl+Alt+Enter on your keyboard to run the code without printing the
code and its output in the Console.
To run a part of the script, first select the lines you wish to run, e.g. by highlighting
them using your mouse. Then do one of the following:
• Press the Run button at the upper right corner of the Script panel.
• Press Ctrl+Enter on your keyboard (this is how I usually do it!).
To save your script, click the Save icon, choose File > Save in the menu or press
Ctrl+S. R script files should have the file extension .R, e.g. My first R script.R.
Remember to save your work often, and to save your code for all the examples and
exercises in this book - you will likely want to revisit old examples in the future, to
see how something was done.
4 I.e. when the Console panel is active and you see a blinking text cursor in it.
x <- 4
is used to assign the value 4 to the variable x. It is read as “assign 4 to x”. The <-
part is made by writing a less than sign (<) and a hyphen (-) with no space between
them6 .
If we now type x in the Console, R will return the answer 4. Well, almost. In fact,
R returns the following rather cryptic output:
[1] 4
The meaning of the 4 is clear - it’s a 4. We’ll return to what the [1] part means
soon.
Now that we’ve created a variable, called x, and assigned a value (4) to it, x will have
the value 4 whenever we use it again. This works just like a mathematical formula,
where we for instance can insert the value 𝑥 = 4 into the formula 𝑥+1. The following
two lines of code will compute 𝑥 + 1 = 4 + 1 = 5 and 𝑥 + 𝑥 = 4 + 4 = 8:
x + 1
x + x
typed, meaning that the data type of an R variable also can change over time. This also means that
there is no need to declare variable types in R (which is either liberating or terrifying, depending
on what type of programmer you are).
6 In RStudio, you can also create the assignment operator <- by using the keyboard shortcut
Alt+- (i.e. press Alt and the - button at the same time).
x <- 1 + 2 + 3 + 4
R first evaluates the entire right-hand side, which in this case amounts to computing
1+2+3+4, and then assigns the result (10) to x. Note that the value previously
assigned to x (i.e. 4) now has been replaced by 10. After a piece of code has been
run, the values of the variables affected by it will have changed. There is no way to
revert the run and get that 4 back, save to rerun the code that generated it in the
first place.
You’ll notice that in the code above, I’ve added some spaces, for instance between
the numbers and the plus signs. This is simply to improve readability. The code
works just as well without spaces:
x<-1+2+3+4
However, you cannot place a space in the middle of the <- arrow. The following will
not assign a value to x:
x < - 1 + 2 + 3 + 4
Running that piece of code rendered the output FALSE. This is because < - with a
space has a different meaning than <- in R, one that we shall return to in the next
chapter.
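Briefly, the reason is that R reads < - as the comparison operator < followed by a minus sign. A minimal sketch of what R actually evaluates (assuming x is 10, as it is at this point):

```r
x <- 1 + 2 + 3 + 4     # x is 10
x < - 1 + 2 + 3 + 4    # parsed as x < ((-1) + 2 + 3 + 4), i.e. 10 < 8
```

which is why the output is FALSE.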
In rare cases, you may want to switch the direction of the arrow, so that the variable
name is on the right-hand side. This is called right-assignment and works just fine
too:
2 + 2 -> y
Later on, we’ll see plenty of examples where right-assignment comes in handy.
names, and for the sake of readability, it is often preferable to give your variables
more informative names. Compare the following two code chunks:
y <- 100
z <- 20
x <- y - z
and
income <- 100
taxes <- 20
net_income <- income - taxes
Both chunks will run without any errors and yield the same results, and yet there is
a huge difference between them. The first chunk is opaque - in no way does the code
help us conceive what it actually computes. On the other hand, it is perfectly clear
that the second chunk is used to compute a net income by subtracting taxes from
income. You don’t want to be a chunk-one type R user, who produces impenetrable
code with no clear purpose. You want to be a chunk-two type R user, who writes clear
and readable code where the intent of each line is clear. Take it from me - for years
I was a chunk-one guy. I managed to write a lot of useful code, but whenever I had
to return to my old code to reuse it or fix some bug, I had difficulties understanding
what each line was supposed to do. My new life as a chunk-two guy is better in every
way.
So, what’s in a name? Shakespeare’s balcony-bound Juliet would have us believe
that that which we call a rose by any other name would smell as sweet. Translated
to R practice, this means that your code will run just fine no matter what names
you choose for your variables. But when you or somebody else reads your code, it
will help greatly if you call a rose a rose and not x or my_new_variable_5.
You should note that R is case-sensitive, meaning that my_variable, MY_VARIABLE,
My_Variable, and mY_VariABle are treated as different variables. To access the data
stored in a variable, you must use its exact name - including lower- and uppercase
letters in the right places. Writing the wrong variable name is one of the most
common errors in R programming.
You’ll frequently find yourself wanting to compose variable names out of multiple
words, as we did with net_income. However, R does not allow spaces in variable
names, and so net income would not be a valid variable name. There are a few
different naming conventions that can be used to name your variables:
• snake_case, where words are separated by an underscore (_). Example:
household_net_income.
• camelCase or CamelCase, where each new word starts with a capital letter.
Example: householdNetIncome or HouseholdNetIncome.
• period.case, where each word is separated by a period (.). You’ll find this
used a lot in R, but I’d advise that you don’t use it for naming variables, as a
period in the middle of a name can have a different meaning in more advanced
cases7 . Example: household.net.income.
• concatenatedwordscase, where the words are concatenated using only lowercase
letters. Adownsidetothisconventionisthatitcanmakevariablenamesverydifficulttoreadsousethisatyourownrisk. Example: householdnetincome
• SCREAMING_SNAKE_CASE, which mainly is used in Unix shell scripts these days.
You can use it in R if you like, although you will run the risk of making others
think that you are either angry, super excited or stark staring mad8. Example:
HOUSEHOLD_NET_INCOME.
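All of these conventions produce valid R names; which one you pick only affects readability. A quick illustration (the value 80 is arbitrary):

```r
household_net_income <- 80    # snake_case
householdNetIncome <- 80      # camelCase
HOUSEHOLD_NET_INCOME <- 80    # SCREAMING_SNAKE_CASE
```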
Some characters, including spaces, -, +, *, :, =, ! and $ are not allowed in variable
names, as these all have other uses in R. The plus sign +, for instance, is used for
addition (as you would expect), and allowing it to be used in variable names would
therefore cause all sorts of confusion. In addition, variable names can’t start with
numbers. Other than that, it is up to you how you name your variables and which
convention you use. Remember, your variable will smell as sweet regardless of what
name you give it, but using a good naming convention will improve readability9 .
Another great way to improve the readability of your code is to use comments. A
comment is a piece of text, marked by #, that is ignored by R. As such, it can be used
to explain what is going on to people who read your code (including future you) and
to add instructions for how to use the code. Comments can be placed on separate
lines or at the end of a line of code. Here is an example:
#############################################################
# This lovely little code snippet can be used to compute #
# your net income. #
#############################################################
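A comment at the end of a line works the same way - everything after the # is ignored (the variables here are the ones from the earlier income example):

```r
income <- 100
taxes <- 20
net_income <- income - taxes  # this comment is ignored by R
```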
In the Script panel in RStudio, you can comment and uncomment (i.e. remove the
# symbol) a row by pressing Ctrl+Shift+C on your keyboard. This is particularly
useful if you wish to comment or uncomment several lines - simply select the lines
and press Ctrl+Shift+C.
7 Specifically, the period is used to separate methods and classes in object-oriented programming,
which is hugely important in R (although you can use R for several years without realising this).
8 I find myself using screaming snake case on occasion. Make of that what you will.
9 I recommend snake_case or camelCase, just in case that wasn’t already clear.
1. What happens if you use an invalid character in a variable name? Try e.g. the
following:
net income <- income - taxes
net-income <- income - taxes
ca$h <- income - taxes
3. What happens if you remove a line break and replace it by a semicolon ;? E.g.:
income <- 200; taxes <- 30
…but this quickly becomes awkward. A much better solution is to store the entire
list in just one variable. In R, such a list is called a vector. We can create a vector
using the following code, where c stands for combine:
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
The numbers in the vector are called elements. We can treat the vector variable age
just as we treated variables containing a single number. The difference is that the
operations will apply to all elements in the list. So for instance, if we wish to express
the ages in months rather than years, we can convert all ages to months using:
age_months <- age * 12
Most of the time, data will contain measurements of more than one quantity. In
the case of our bookstore customers, we also have information about the amount of
money they spent on their last purchase:
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)
It would be nice to combine these two vectors into a table, like we would do in a
spreadsheet software such as Excel. That would allow us to look at relationships
between the two vectors - perhaps we could find some interesting patterns? In R,
tables of vectors are called data frames. We can combine the two vectors into a data
frame as follows:
bookstore <- data.frame(age, purchase)
If you type bookstore into the Console, it will show a simply formatted table with
the values of the two vectors (and row numbers):
> bookstore
age purchase
1 28 20
2 48 59
3 47 2
4 71 12
5 22 22
6 80 160
7 48 34
8 30 34
9 31 29
A better way to look at the table may be to click on the variable name bookstore
in the Environment panel, which will open the data frame in a spreadsheet format.
You will have noticed that R tends to print a [1] at the beginning of the line when
we ask it to print the value of a variable:
> age
[1] 28 48 47 71 22 80 48 30 31
# When we enter data into a vector, we can put line breaks between
# the commas:
distances <- c(687, 5076, 7270, 967, 6364, 1683, 9394, 5712, 5206,
4317, 9411, 5625, 9725, 4977, 2730, 5648, 3818, 8241,
5547, 1637, 4428, 8584, 2962, 5729, 5325, 4370, 5989,
9030, 5532, 9623)
distances
Depending on the size of your Console panel, R will require a different number of
rows to display the data in distances. The output will look something like this:
> distances
[1] 687 5076 7270 967 6364 1683 9394 5712 5206 4317 9411 5625 9725
[14] 4977 2730 5648 3818 8241 5547 1637 4428 8584 2962 5729 5325 4370
[27] 5989 9030 5532 9623
The numbers within the square brackets - [1], [14], [27], and so on - tell us which
elements of the vector are printed first on each row. So in the example above, the
first element in the vector is 687, the 14th element is 4977, the 27th element is
5989, and so forth. Those numbers, called the indices of the elements, aren't exactly
part of your data, but as we'll see later they are useful for keeping track of it.
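For instance, the indices can be used together with square brackets to extract single elements - a construction covered in detail later, but sketched here:

```r
distances <- c(687, 5076, 7270, 967, 6364, 1683, 9394, 5712, 5206,
               4317, 9411, 5625, 9725, 4977, 2730, 5648, 3818, 8241,
               5547, 1637, 4428, 8584, 2962, 5729, 5325, 4370, 5989,
               9030, 5532, 9623)
distances[1]    # the first element: 687
distances[15]   # the 15th element: 2730
```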
This also tells you something about the inner workings of R. The fact that
x <- 4
x
returns the output [1] 4 tells us that x in fact is a vector, albeit with a single
element. Almost everything in R is a vector, in one way or another.
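You can verify this with two built-in functions:

```r
x <- 4
is.vector(x)   # TRUE - the single number is stored as a vector...
length(x)      # ...of length 1
```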
Being able to put data on multiple lines when creating vectors is hugely useful, but
can also cause problems if you forget to include the closing bracket ). Try running
the following code, where the final bracket is missing, in your Console panel:
distances <- c(687, 5076, 7270, 967, 6364, 1683, 9394, 5712, 5206,
4317, 9411, 5625, 9725, 4977, 2730, 5648, 3818, 8241,
5547, 1637, 4428, 8584, 2962, 5729, 5325, 4370, 5989,
9030, 5532, 9623
When you hit Enter, a new line starting with a + sign appears. This indicates that
R doesn’t think that your statement has finished. To finish it, type ) in the Console
and then press Enter.
Vectors and data frames are hugely important when working with data in R.
Chapters 3 and 5 are devoted to how to work with these objects.
Exercise 2.5. Try creating a vector using x <- 1:5. What happens? What hap-
pens if you use 5:1 instead? How can you use this notation to create the vector
(1, 2, 3, 4, 5, 4, 3, 2, 1)?
2.4.4 Functions
You have some data. Great. But simply having data is not enough - you want to
do something with it. Perhaps you want to draw a graph, compute a mean value or
apply some advanced statistical model to it. To do so, you will use a function.
A function is a ready-made set of instructions - code - that tells R to do something.
There are thousands of functions in R. Typically, you insert a variable into the
function, and it returns an answer. The code for doing this follows the pattern
function_name(variable_name). As a first example, consider the function mean,
which computes the mean of a variable:
# Compute the mean age of bookstore customers
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
mean(age)
Note that the code follows the pattern function_name(variable_name): the func-
tion’s name is mean and the variable’s name is age.
Some functions take more than one variable as input, and may also have additional
arguments (or parameters) that you can use to control the behaviour of the function.
One such example is cor, which computes the correlation between two variables:
# Compute the correlation between the variables age and purchase
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)
cor(age, purchase)
The answer, 0.59, means that there appears to be a fairly strong positive correlation
between age and the purchase size, which implies that older customers tend to spend
more. On the other hand, just by looking at the data we can see that the oldest
customer - aged 80 - spent much more than anybody else - 160 monetary units. It
can happen that such outliers strongly influence the computation of the correlation.
By default, cor uses the Pearson correlation formula, which is known to be sensitive
to outliers. It is therefore of interest to also perform the computation using a formula
that is more robust to outliers, such as the Spearman correlation. This can be done
by passing an additional argument to cor, telling it which method to use for the
computation:
cor(age, purchase, method = "spearman")
The resulting correlation, 0.35, is substantially lower than the previous result. Perhaps
the correlation isn’t all that strong after all.
So, how can we know what arguments to pass to a function? Luckily, we don’t
have to memorise all possible arguments for all functions. Instead, we can look at
the documentation, i.e. help file, for a function that we are interested in. This is
done by typing ?function_name in the Console panel, or doing a web search for R
function_name. To view the documentation for the cor function, type:
?cor
The first time that you look at the documentation for an R function, all this
information can be a bit overwhelming. Perhaps even more so for cor, which is a bit
unusual in that it shares its documentation page with three other (heavily related)
functions: var, cov and cov2cor. Let the section headlines guide you when you look
at the documentation. What information are you looking for? If you’re just looking
for an example of how the function is used, scroll down to Examples. If you want to
know what arguments are available, have a look at Usage and Arguments.
Finally, there are a few functions that don’t require any input at all, because they
don’t do anything with your variables. One such example is Sys.time() which prints
the current time on your system:
Sys.time()
Note that even though Sys.time doesn’t require any input, you still have to write
the parentheses (), which tells R that you want to run a function.
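As a small illustration of this point (not from the original text), compare what happens with and without the parentheses:

```r
Sys.time     # without parentheses: shows the function itself, not the time
Sys.time()   # with parentheses: runs the function, returning the current time
```

Typing a function's name without parentheses prints its definition, which can occasionally be useful, but is rarely what you want.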
Exercise 2.6. Using the data you created in Exercise 2.4, do the following:
1. Compute the mean height of the people.
2. Compute the correlation between height and weight.
• log(x): computes the logarithm of x with the natural number e as the base.
• log(x, base = a): computes the logarithm of x with the number a as the
base.
• a^x: computes a raised to the power x.
• exp(x): computes e^x.
• sin(x): computes sin(x).
• sum(x): when x is a vector (x_1, x_2, x_3, ..., x_n), computes the sum of the
elements of x: x_1 + x_2 + ... + x_n.
• prod(x): when x is a vector (x_1, x_2, x_3, ..., x_n), computes the product of
the elements of x: x_1 * x_2 * ... * x_n.
• pi: a built-in variable with value π, the ratio of the circumference of a circle
to its diameter.
• x %% a: computes x modulo a.
• factorial(x): computes x!.
• choose(n, k): computes the binomial coefficient "n choose k".
Exercise 2.9. R will return non-numerical answers if you try to perform computa-
tions where the answer is infinite or undefined. Try the following to see some possible
results:
1. Compute 1/0.
2. Compute 0/0.
3. Compute √−1.
2.5 Packages
R comes with a ton of functions, but of course these cannot cover all possible things
that you may want to do with your data. That’s where packages come in. Packages
are collections of functions and datasets that add new features to R. Do you want
to apply some obscure statistical test to your data? Plot your data on a map? Run
C++ code in R? Speed up some part of your data handling process? There are R
packages for that. In fact, with more than 17,000 packages and counting, there are
R packages for just about anything that you could possibly want to do. All packages
have been contributed by the R community - that is, by users like you and me.
Most R packages are available from CRAN, the official R repository - a network of
servers (so-called mirrors) around the world. Packages on CRAN are checked before
they are published, to make sure that they do what they are supposed to do and
don’t contain malicious components. Downloading packages from CRAN is therefore
generally considered to be safe.
In the rest of this chapter, we’ll make use of a package called ggplot2, which adds
additional graphical features to R. To install the package from CRAN, you can either
select Tools > Install packages in the RStudio menu and then write ggplot2 in the
text box in the pop-up window that appears, or use the following line of code:
install.packages("ggplot2")
A menu may appear where you are asked to select the location of the CRAN mirror
to download from. Pick the one closest to you, or just use the default option
- your choice can affect the download speed, but will in most cases not make much
difference. There may also be a message asking whether to create a folder for your
packages, which you should agree to do.
As R downloads and installs the packages, a number of technical messages are printed
in the Console panel (an example of what these messages can look like during a
successful installation is found in Section 11.4). ggplot2 depends on a number of
packages that R will install for you, so expect this to take a few minutes. A
successful installation ends with a message saying:
* DONE (ggplot2)
If the installation fails for some reason, there will usually be a (sometimes cryptic)
error message. You can read more about troubleshooting errors in Section 2.10.
There is also a list of common problems when installing packages available on the RStudio support page at https://support.rstudio.com/hc/en-us/articles/200554786-Problem-Installing-Packages.
After you’ve installed the package, you’re still not finished quite yet. The package
may have been installed, but its functions and datasets won’t be available until you
load it. This is something that you need to do each time that you start a new
R session. Luckily, it is done with a single short line of code using the library
function10, which I recommend putting at the top of your script file:
library(ggplot2)
10 The use of library causes people to erroneously refer to R packages as libraries. Think of the
library as the place where you store your packages, and calling library means that you go to your
library to fetch the package.
We'll discuss more details about installing and updating R packages in Section 10.1.
Datasets such as msleep and diamonds, along with some others, are automatically loaded as data frames when
you load ggplot2:
library(ggplot2)
To begin with, let’s explore the msleep dataset. To have a first look at it, type the
following in the Console panel:
msleep
That shows you the first 10 rows of the data, and some of its columns. It also gives
another important piece of information: 83 x 11, meaning that the dataset has 83
rows (i.e. 83 observations) and 11 columns (with each column corresponding to a
variable in the dataset).
There are however better methods for looking at the data. To view all 83 rows and
all 11 variables, use:
View(msleep)
You’ll notice that some cells have the value NA instead of a proper value. NA stands
for Not Available, and is a placeholder used by R to point out missing data. In this
case, it means that the value is unknown for the animal.
To find information about the data frame containing the data, some useful functions
are:
head(msleep)
tail(msleep)
dim(msleep)
str(msleep)
names(msleep)
dim returns the number of rows and columns of the data frame, whereas str returns
information about the 11 variables. Of particular importance are the data types of
the variables (chr and num, in this instance), which tell us what kind of data we are
dealing with (numerical, categorical, dates, or something else). We'll delve deeper
into data types in Chapter 3. Finally, names returns a vector containing the names
of the variables.
Like functions, datasets that come with packages have documentation describing
them. The documentation for msleep gives a short description of the data and its
variables. Read it to learn a bit more about the variables:
?msleep
Finally, you’ll notice that msleep isn’t listed among the variables in the Environment
panel in RStudio. To include it there, you can run:
data(msleep)
The summary function computes summary statistics for each variable in a data frame;
try running summary(msleep). For the text variables, this doesn't provide any information
at the moment. But for the numerical variables, it provides a lot of useful information.
For the variable sleep_rem, for instance, we have the following:
sleep_rem
Min. :0.100
1st Qu.:0.900
Median :1.500
Mean :1.875
3rd Qu.:2.400
Max. :6.600
NA's :22
This tells us that the mean of sleep_rem is 1.875, that the smallest value is 0.100 and
that the largest is 6.600. The first quartile11 is 0.900, the median is 1.500 and the
third quartile is 2.400. Finally, there are 22 animals for which there are no values
(missing data, represented by NA).
Sometimes we want to compute just one of these, and other times we may want
to compute summary statistics not included in summary. Let’s say that we want
to compute some descriptive statistics for the sleep_total variable. To access a
vector inside a data frame, we use a dollar sign: data_frame_name$vector_name. So
to access the sleep_total vector in the msleep data frame, we write:
11 The first quartile is a value such that 25% of the observations are smaller than it; the third
quartile is a value such that 25% of the observations are larger than it.
msleep$sleep_total
Some examples of functions that can be used to compute descriptive statistics for
this vector are:
mean(msleep$sleep_total) # Mean
median(msleep$sleep_total) # Median
max(msleep$sleep_total) # Max
min(msleep$sleep_total) # Min
sd(msleep$sleep_total) # Standard deviation
var(msleep$sleep_total) # Variance
quantile(msleep$sleep_total) # Various quantiles
To see how many animals sleep for more than 8 hours a day, we can use the following:
sum(msleep$sleep_total > 8) # Frequency (count)
mean(msleep$sleep_total > 8) # Relative frequency (proportion)
msleep$sleep_total > 8 checks whether the total sleep time of each animal is
greater than 8. We’ll return to expressions like this in Section 3.2.
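The same idea can be seen with a small made-up vector (not part of the msleep data): a comparison produces a vector of TRUE/FALSE values, and sum and mean then count and average those values.

```r
sleep <- c(6, 9, 12, 7)  # hypothetical sleep times, for illustration
sleep > 8                # FALSE TRUE TRUE FALSE
sum(sleep > 8)           # 2: the number of values above 8
mean(sleep > 8)          # 0.5: the proportion of values above 8
```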
Now, let’s try to compute the mean value for the length of REM sleep for the animals:
mean(msleep$sleep_rem)
The above call returns the answer NA. The reason is that there are NA values in the
sleep_rem vector (22 of them, as we saw before). What we actually wanted was the
mean value among the animals for which we know the REM sleep. We can have a
look at the documentation for mean to see if there is some way we can get this:
?mean
The documentation shows that mean has an argument called na.rm, which can be set
to TRUE to ignore missing values. Note that the NA values have not been removed
from msleep. Setting na.rm = TRUE simply tells R to ignore them in a particular
computation, not to delete them.
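As a minimal illustration with a toy vector (not from the original text), the na.rm argument works like this:

```r
x <- c(2.1, NA, 3.5)
mean(x)                # NA: the missing value propagates
mean(x, na.rm = TRUE)  # 2.8: the mean of the non-missing values
```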
We run into the same problem if we try to compute the correlation between
sleep_total and sleep_rem:
cor(msleep$sleep_total, msleep$sleep_rem)
A quick look at the documentation (?cor) tells us that the argument used to ignore
NA values has a different name for cor - it's not na.rm but use. The reason will
become evident later on, when we study more than two variables at a time. For now,
we set use = "complete.obs" to compute the correlation using only observations
with complete data (i.e. no missing values):
cor(msleep$sleep_total, msleep$sleep_rem, use = "complete.obs")
The table function can also be used to construct a cross table that shows the counts
for different combinations of two categorical variables:
# Counts:
table(msleep$vore, msleep$conservation)
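Here is a small sketch of how table works, using made-up vectors rather than the msleep data:

```r
# Hypothetical data, for illustration only
pets  <- c("dog", "cat", "dog", "dog", "cat")
homes <- c("flat", "flat", "house", "flat", "house")
table(pets)         # counts for one categorical variable
table(pets, homes)  # cross table: counts for each combination
```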
Exercise 2.10. Load ggplot2 using library(ggplot2) if you have not already
done so. Then do the following:
1. View the documentation for the diamonds data and read about the different
variables.
2. Check the data structures: how many observations and variables are there and
what type of variables (numeric, categorical, etc.) are there?
3. Compute summary statistics (means, median, min, max, counts for categorical
variables). Are there any missing values?
When we create plots using ggplot2, we must define what data, aesthetics and geoms
to use. If that sounds a bit strange, it will hopefully become a lot clearer once we
have a look at some examples. To begin with, we will illustrate how this works by
visualising some continuous variables in the msleep data.
Using base R, we simply do a call to the plot function in a way that is analogous to
how we’d use e.g. cor:
plot(msleep$sleep_total, msleep$sleep_rem)
[Figure: a scatterplot of sleep_rem against sleep_total.]
At this point you may ask why on earth anyone would ever want to use ggplot2
code for creating plots. It’s a valid question. The base R code looks simpler, and is
consistent with other functions that we’ve seen. The ggplot2 code looks… different.
This is because it uses the grammar of graphics, which in many ways is a language
of its own, different from how we otherwise work with R.
But, the plot created using ggplot2 also looked different. It used filled circles instead
of empty circles for plotting the points, and had a grid in the background. In both
base R graphics and ggplot2 we can change these settings, and many others. We
can create something similar to the ggplot2 plot using base R as follows, using the
pch argument and the grid function:
plot(msleep$sleep_total, msleep$sleep_rem, pch = 16)
grid()
Some people prefer the look and syntax of base R plots, while others argue that
ggplot2 graphics have a prettier default look. I can sympathise with both groups.
Some types of plots are easier to create using base R, and some are easier to create
using ggplot2. I like base R graphics for their simplicity, and prefer them for quick-
and-dirty visualisations as well as for more elaborate graphs where I want to combine
many different components. For everything in between, including exploratory data
analysis where graphics are used to explore and understand datasets, I prefer ggplot2.
In this book, we’ll use base graphics for some quick-and-dirty plots, but put more
emphasis on ggplot2 and how it can be used to explore data.
The syntax used to create the ggplot2 scatterplot was in essence ggplot(data, aes)
+ geom. All plots created using ggplot2 follow this pattern, regardless of whether
they are scatterplots, bar charts or something else. The plus sign in ggplot(data,
aes) + geom is important, as it implies that we can add more geoms to the plot,
for instance a trend line, and perhaps other things as well. We will return to that
shortly.
Unless the user specifies otherwise, the first two arguments to aes will always be
mapped to the x and y axes, meaning that we can simplify the code above by removing
the x = and y = bits (at the cost of a slight reduction in readability). Moreover, it
is considered good style to insert a line break after the + sign. The resulting code is:
ggplot(msleep, aes(sleep_total, sleep_rem)) +
geom_point()
Note that this does not change the plot in any way - the difference is merely in the
style of the code.
Exercise 2.11. Create a scatterplot with total sleeping time along the x-axis and
time awake along the y-axis (using the msleep data). What pattern do you see? Can
you explain it?
Note that the plus signs must be placed at the end of a line rather than at the
beginning of the next. To change the y-axis label, add ylab instead.
To change the colour of the points, you can set the colour in geom_point:
ggplot(msleep, aes(sleep_total, sleep_rem)) +
geom_point(colour = "red") +
xlab("Total sleep time (h)")
In addition to "red", there are a few more colours that you can choose from. You
can run colors() in the Console to see a list of the 657 colours that have names in R
(examples of which include "papayawhip", "blanchedalmond", and "cornsilk4"),
or use colour hex codes like "#FF5733".
Alternatively, you may want to use the colours of the point to separate different
categories. This is done by adding a colour argument to aes, since you are now
mapping a data variable to a visual property. For instance, we can use the variable
vore to show differences between herbivores, carnivores and omnivores:
ggplot(msleep, aes(sleep_total, sleep_rem, colour = vore)) +
geom_point() +
xlab("Total sleep time (h)")
What happens if we use a continuous variable, such as the sleep cycle length
sleep_cycle to set the colour?
ggplot(msleep, aes(sleep_total, sleep_rem, colour = sleep_cycle)) +
geom_point() +
xlab("Total sleep time (h)")
You’ll learn more about customising colours (and other parts) of your plots in Section
4.2.
Exercise 2.13. Similar to how you changed the colour of the points, you can also
change their size and shape. The arguments for this are called size and shape.
1. Change the scatterplot from Exercise 2.12 so that diamonds with different cut
qualities are represented by different shapes.
2. Then change it so that the size of each point is determined by the diamond’s
length, i.e. the variable x.
There are two animals with brains that are much heavier than the rest (African
elephant and Asian elephant). These outliers distort the plot, making it difficult to
spot any patterns. We can try changing the x-axis to only go from 0 to 1.5 by adding
xlim to the plot, to see if that improves it:
ggplot(msleep, aes(brainwt, sleep_total, colour = vore)) +
geom_point() +
xlab("Brain weight") +
ylab("Total sleep time") +
xlim(0, 1.5)
This is slightly better, but we still have a lot of points clustered near the y-axis, and
some animals are now missing from the plot. If instead we wished to change the
limits of the y-axis, we would have used ylim in the same fashion.
Another option is to rescale the x-axis by applying a log transform to the brain
weights, which we can do directly in aes:
ggplot(msleep, aes(log(brainwt), sleep_total, colour = vore)) +
geom_point() +
xlab("log(Brain weight)") +
ylab("Total sleep time")
Exercise 2.14. Using the msleep data, create a plot of log-transformed body weight
versus log-transformed brain weight. Use total sleep time to set the colours of the
points. Change the text on the axes to something informative.
Note that the x-axes and y-axes of the different plots in the grid all have the same
scale.
2.7.5 Boxplots
Another option for comparing groups is boxplots (also called box-and-whiskers plots).
Using ggplot2, we create boxplots for animal sleep times, grouped by feeding
behaviour, with geom_boxplot. Using base R, we use the boxplot function instead:
# Base R:
boxplot(sleep_total ~ vore, data = msleep)
# ggplot2:
ggplot(msleep, aes(vore, sleep_total)) +
geom_boxplot()
The boxes visualise important descriptive statistics for the different groups, similar
to what we got using summary:
• Median: the thick black line inside the box.
• First quartile: the bottom of the box.
• Third quartile: the top of the box.
• Minimum: the end of the line (“whisker”) that extends from the bottom of the
box.
• Maximum: the end of the line that extends from the top of the box.
• Outliers: observations that deviate too much12 from the rest are shown as
separate points. These outliers are not included in the computation of the
median, quartiles and the extremes.
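The statistics behind the box can also be computed directly; here is a sketch with a small made-up sample:

```r
# A small made-up sample, for illustration
x <- c(1, 3, 4, 5, 6, 7, 20)
quantile(x)  # minimum, quartiles, median and maximum
IQR(x)       # interquartile range: the height of the box
# The value 20 lies more than 1.5 times IQR(x) above the third
# quartile, so a boxplot would display it as a separate point
```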
Note that just as for a scatterplot, the code consists of three parts:
• Data: given by the first argument in the call to ggplot: msleep
12 In this case, too much means that they are more than 1.5 times the height of the box away from the box.
• Aesthetics: given by the second argument in the ggplot call: aes, where
we map the group variable vore to the x-axis and the numerical variable
sleep_total to the y-axis.
• Geoms: given by geom_boxplot, meaning that the data will be visualised
with boxplots.
2.7.6 Histograms
To show the distribution of a continuous variable, we can use a histogram, in which
the data is split into a number of bins and the number of observations in each bin is
shown by a bar. The ggplot2 code for histograms follows the same pattern as other
plots, while the base R code uses the hist function:
# Base R:
hist(msleep$sleep_total)
# ggplot2:
ggplot(msleep, aes(sleep_total)) +
geom_histogram()
# ggplot2
ggplot(msleep, aes(vore)) +
geom_bar()
To create a stacked bar chart using ggplot2, we map all groups to the same
value on the x-axis and then map the different groups to different colours. This can
be done as follows:
ggplot(msleep, aes(factor(1), fill = vore)) +
geom_bar()
If you like, you can add things to the plot, just as before:
myPlot + xlab("I forgot to add a label!")
To save your plot object as an image file, use ggsave. The width and height
arguments allow us to control the size of the figure (in inches, unless you specify
otherwise using the units argument).
ggsave("filename.pdf", myPlot, width = 5, height = 5)
If you don’t supply the name of a plot object, ggsave will save the last ggplot2 plot
you created.
In addition to pdf, you can save images e.g. as jpg, tif, eps, svg, and png files, simply
by changing the file extension in the filename. Alternatively, graphics from both
base R and ggplot2 can be saved using the pdf and png functions, using dev.off
to mark the end of the file:
pdf("filename.pdf", width = 5, height = 5)
myPlot
dev.off()
png("filename.png", width = 500, height = 500)
plot(msleep$sleep_total, msleep$sleep_rem)
dev.off()
Note that you also can save graphics by clicking on the Export button in the Plots
panel in RStudio. Using code to save your plot is usually a better idea, because
of reproducibility. At some point you’ll want to go back and make changes to an
old figure, and that will be much easier if you already have the code to export the
graphic.
You’ve now had a first taste of graphics using R. We have however only scratched
the surface, and will return to the many uses of statistical graphics in Chapter 4.
2.10 Troubleshooting
Every now and then R will throw an error message at you. Sometimes these will be
informative and useful, as in this case:
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
means(age)
where R prints:
> means(age)
Error in means(age) : could not find function "means"
This tells us that the function that we are trying to use, means, does not exist. There
are two possible reasons for this: either we haven't loaded the package in which the
function exists, or we have misspelt the function name. In our example the latter is
true: the function that we really wanted to use was of course mean and not means.
At other times interpreting the error message seems insurmountable, like in these
examples:
Error in if (str_count(string = f[[j]], pattern = "\\S+") == 1) { :
  argument is of length zero
and
Error in if (requir[y] > supply[x]) { :
  missing value where TRUE/FALSE needed
3 Transforming, summarising, and analysing data
Most datasets are stored as tables, with rows and columns. In this chapter we’ll
see how you can import and export such data, and how it is stored in R. We’ll also
discuss how you can transform, summarise, and analyse your data.
After working with the material in this chapter, you will be able to use R to:
• Distinguish between different data types,
• Import data from Excel spreadsheets and csv text files,
• Compute descriptive statistics for subgroups in your data,
• Find interesting points in your data,
• Add new variables to your data,
• Modify variables in your data,
• Remove variables from your data,
• Save and export your data,
• Work with RStudio projects,
• Run t-tests and fit linear models,
• Use %>% pipes to chain functions together.
The chapter ends with a discussion of ethical guidelines for statistical work.
For categorical data we don't. Instead we produce bar charts and display the data in tables. It
is no surprise, then, that R also treats different kinds of data differently.
In programming, a variable's data type describes what kind of object is assigned
to it. We can assign many different types of objects to the variable a: it could
for instance contain a number, text, or a data frame. In order to treat a correctly,
R needs to know what data type its assigned object has. In some programming
languages, you have to explicitly state what data type a variable has, but not in R.
This makes programming R simpler and faster, but can cause problems if a variable
turns out to have a different data type than what you thought1.
R has six basic data types. For most people, it suffices to know about the first three
in the list below:
• numeric: numbers like 1 and 16.823 (sometimes also called double).
• logical: true/false values (boolean): either TRUE or FALSE.
• character: text, e.g. "a", "Hello! I'm Ada." and "[email protected]".
• integer: integer numbers, denoted in R by the letter L: 1L, 55L.
• complex: complex numbers, like 2+3i. Rarely used in statistical work.
• raw: used to hold raw bytes. Don’t fret if you don’t know what that means.
You can have a long and meaningful career in statistics, data science, or pretty
much any other field without ever having to worry about raw bytes. We won’t
discuss raw objects again in this book.
In addition, these can be combined into special data types sometimes called data
structures, examples of which include vectors and data frames. Important data struc-
tures include factor, which is used to store categorical data, and the awkwardly
named POSIXct which is used to store date and time data.
To check what type of object a variable is, you can use the class function:
x <- 6
y <- "Scotland"
z <- TRUE
class(x)
class(y)
class(z)
class returns the data type of the elements of the vector. So what happens if we
put objects of different type together in a vector?
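Here is a toy example (the book's own example may differ) of what happens when we mix a number, a text string, and a logical value:

```r
x <- c(1, "two", TRUE)  # a number, a text string, and a logical
x                        # all elements are now text strings
class(x)                 # "character"
```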
1 And the subsequent troubleshooting makes programming R more difficult and slower.
In this case, R has coerced the objects in the vector to all be of the same type.
Sometimes that is desirable, and sometimes it is not. The lesson here is to be careful
when you create a vector from different objects. We’ll learn more about coercion and
how to change data types in Section 5.1.
# Diamonds data
View(diamonds)
Notice that all three data frames follow the same format: each column represents a
variable (e.g. age) and each row represents an observation (e.g. an individual). This
is the standard way to store data in R (as well as the standard format in statistics in
general). In what follows, we will use the terms column and variable interchangeably
to describe the columns/variables in a data frame.
This kind of table can be stored in R as different types of objects - that is, in several
different ways. As you’d expect, the different types of objects have different properties
and can be used with different functions. Here’s the run-down of four common types:
• matrix: a table where all columns must contain objects of the same type
(e.g. all numeric or all character). Uses less memory than other types and
allows for much faster computations, but is difficult to use for certain types of
data manipulation, plotting and analyses.
• data.frame: the most common type, where different columns can contain dif-
ferent types (e.g. one numeric column, one character column).
• data.table: an enhanced version of data.frame.
• tbl_df (“tibble”): another enhanced version of data.frame.
First of all, in most cases it doesn't matter which of these four you use to store
your data. In fact, they all look similar to the user. Have a look at the following
datasets (WorldPhones and airquality come with base R):
# First, an example of data stored in a matrix:
?WorldPhones
class(WorldPhones)
View(WorldPhones)
That being said, in some cases it really matters which one you use. Some functions
require that you input a matrix, while others may break or work differently from
what was intended if you input a tibble instead of an ordinary data frame. Luckily,
you can convert objects into other types:
WorldPhonesDF <- as.data.frame(WorldPhones)
class(WorldPhonesDF)
Exercise 3.1. The following tasks are all related to data types and data structures:
1. Create a text variable using e.g. a <- "A rainy day in Edinburgh". Check
that it gets the correct type. What happens if you use single quote marks
instead of double quotes when you create the variable?
2. What data types are the sums 1 + 2, 1L + 2 and 1L + 2L?
3. What happens if you add a numeric to a character, e.g. "Hello" + 1?
4. What happens if you perform mathematical operations involving a numeric
and a logical, e.g. FALSE * 2 or TRUE + 1?
Exercise 3.2. What do the functions ncol, nrow, dim, names, and row.names return
when applied to a data frame?
Exercise 3.3. matrix tables can be created from vectors using the function of the
same name. Using the vector x <- 1:6, use matrix to create the following matrices:

    1 2 3
    4 5 6

and

    1 4
    2 5
    3 6

Remember to check ?matrix to find out how to set the dimensions of the matrix,
and how it is filled with the numbers from the vector!
If we want to grab a particular element from a vector, we must use its index within
square brackets: [index]. The first element in the vector has index 1, the second
has index 2, the third index 3, and so on. To access the fifth element in the Temp
vector in the airquality data frame, we can use:
airquality$Temp[5]
The square brackets can also be applied directly to the data frame. The syntax
for this follows that used for matrices in mathematics: airquality[i, j] means
the element at the i:th row and j:th column of airquality. We can also leave out
either i or j to extract an entire row or column from the data frame. Here are some
examples:
# First, we check the order of the columns:
names(airquality)
# We see that Temp is the 4th column.
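For instance, extraction with the [i, j] notation might look like this (a sketch of the kind of examples the text describes; the book's own examples may differ):

```r
airquality[5, 4]  # the element at row 5, column 4: Temp on day 5
airquality[5, ]   # the entire 5th row
airquality[, 4]   # the entire 4th column, i.e. the Temp vector
```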
Exercise 3.4. The following tasks all involve using the [i, j] notation for
extracting data from data frames:
1. Why does airquality[, 3] not return the third row of airquality?
2. Extract the first five rows from airquality. Hint: a fast way of creating the
vector c(1, 2, 3, 4, 5) is to write 1:5.
3. Compute the correlation between the Temp and Wind vectors of airquality
without referring to them using $.
4. Extract all columns from airquality except Temp and Wind.
Perhaps there was a data entry error - the second customer was actually 18 years old
and not 48. We can assign a new value to that element by referring to it in either of
two ways:
bookstore$age[2] <- 18
# or
bookstore[2, 1] <- 18
We could also change an entire column if we like. For instance, if we wish to change
the age vector to months instead of years, we could use
bookstore$age <- bookstore$age * 12
What if we want to add another variable to the data, for instance the length of the
customers’ visits in minutes? There are several ways to accomplish this, one of which
involves the dollar sign:
bookstore$visit_length <- c(5, 2, 20, 22, 12, 31, 9, 10, 11)
bookstore
As you see, the new data has now been added to a new column in the data frame.
max(airquality$Temp)
But can we find out which day this corresponds to? We could of course manually go
through all 153 days, e.g. by using View(airquality), but that seems tiresome and
wouldn't be feasible if we had many more observations. A better
option is therefore to use the function which.max:
which.max(airquality$Temp)
which.max returns the index of the observation with the maximum value. If there is
more than one observation attaining this value, it only returns the first of these.
We’ve just used which.max to find out that day 120 was the hottest during the period.
If we want to have a look at the entire row for that day, we can use
airquality[120,]
Alternatively, we could place the call to which.max inside the brackets. Because
which.max(airquality$Temp) returns the number 120, this yields the same result
as the previous line:
airquality[which.max(airquality$Temp),]
Were we looking for the day with the lowest temperature, we’d use which.min anal-
ogously. In fact, we could use any function or computation that returns an index in
the same way, placing it inside the brackets to get the corresponding rows or columns.
This is extremely useful if we want to extract observations with certain properties,
for instance all days where the temperature was above 90 degrees. We do this using
conditions, i.e. by giving statements that we wish to be fulfilled.
As a first example of a condition, we use the following, which checks if the temperature
exceeds 90 degrees:
airquality$Temp > 90
For each element in airquality$Temp this returns either TRUE (if the condition is
fulfilled, i.e. when the temperature is greater than 90) or FALSE (if the condition
isn't fulfilled, i.e. when the temperature is 90 or lower). If we place the condition
inside brackets following the name of the data frame, we will extract only the rows
corresponding to those elements which were marked with TRUE:
airquality[airquality$Temp > 90, ]
If you prefer, you can also store the TRUE or FALSE values in a new variable:
airquality$Hot <- airquality$Temp > 90
There are several logical operators and functions which are useful when stating con-
ditions in R. Here are some examples:
a <- 3
b <- 8
a == b # Check if a equals b
a > b # Check if a is greater than b
a < b # Check if a is less than b
a >= b # Check if a is equal to or greater than b
a <= b # Check if a is equal to or less than b
a != b # Check if a is not equal to b
is.na(a) # Check if a is NA
a %in% c(1, 4, 9) # Check if a equals at least one of 1, 4, 9
When checking a condition for all elements in a vector, we can use which to get the
indices of the elements that fulfill the condition:
which(airquality$Temp > 90)
If we want to know if all elements in a vector fulfill the condition, we can use all:
all(airquality$Temp > 90)
In this case, it returns FALSE, meaning that not all days had a temperature above 90
(phew!). Similarly, if we wish to know whether at least one day had a temperature
above 90, we can use any:
any(airquality$Temp > 90)
To find out how many elements fulfill a condition, we can use sum:
sum(airquality$Temp > 90)
Why does this work? Remember that sum computes the sum of the elements in a
vector, and that when logical values are used in computations, they are treated
as 0 (FALSE) or 1 (TRUE). Because the condition returns a vector of logical values,
the sum of them becomes the number of 1’s - the number of TRUE values - i.e. the
number of elements that fulfill the condition.
To find the proportion of elements that fulfill a condition, we can count how many
elements fulfill it and then divide by how many elements are in the vector. This is
exactly what happens if we use mean:
mean(airquality$Temp > 90)
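A minimal illustration of how logical values behave in sums and means:

```r
# Logical values are treated as 0 (FALSE) and 1 (TRUE) in computations:
x <- c(TRUE, FALSE, TRUE, TRUE)
sum(x)    # number of TRUE values: 3
mean(x)   # proportion of TRUE values: 0.75
```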
Finally, we can combine conditions by using the logical operators & (AND), | (OR),
and, less frequently, xor (exclusive or, XOR). Here are some examples:
a <- 3
b <- 8
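The examples that followed here are cut off in this copy; with these values of a and b, some illustrative combinations might be:

```r
a <- 3
b <- 8
a > 2 & b > 2      # TRUE: both conditions are fulfilled
a > 5 | b > 5      # TRUE: at least one condition is fulfilled
xor(a > 5, b > 5)  # TRUE: exactly one condition is fulfilled
```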
Exercise 3.6. The following tasks all involve checking conditions for the airquality
data:
1. Which was the coldest day during the period?
2. How many days was the wind speed greater than 17 mph?
3. How many missing values are there in the Ozone vector?
4. How many days are there for which the temperature was below 70 and the wind
speed was above 10?
Exercise 3.7. The function cut can be used to create a categorical variable from
a numerical variable, by dividing it into categories corresponding to different
intervals. Read its documentation and then create a new categorical variable in the
airquality data, TempCat, which divides Temp into the three intervals (50, 70],
(70, 90], and (90, 110].3
3 The notation (50, 70] means excluding 50 but including 70; the intervals are open on
the left but closed on the right.
to load data from other sources. Two important types of files are comma-separated
value files, .csv, and Excel spreadsheets, .xlsx. .csv files are spreadsheets stored as
text files - basically Excel files stripped down to the bare minimum - no formatting,
no formulas, no macros. You can open and edit them in spreadsheet software like
LibreOffice Calc, Google Sheets or Microsoft Excel. Many devices and databases can
export data in .csv format, making it a commonly used file format that you are
likely to encounter sooner rather than later.
In RStudio, your working directory will usually be shown in the Files panel. If
you have opened RStudio by opening a .R file, the working directory will be the
directory in which the file is stored. You can change the working directory by using
the function setwd or selecting Session > Set Working Directory > Choose Directory
in the RStudio menu.
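As a quick sketch of these two functions (the path shown is just a placeholder):

```r
getwd()   # print the current working directory
# setwd("C:/Users/YourName/Documents")   # change it (placeholder path)
```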
Before we discuss paths further, let’s look at how you can import data from a file
that is in your working directory. The data files that we’ll use in examples in this
book can be downloaded from the book’s web page. They are stored in a zip file
(data.zip) - open it and copy/extract the files to the folder that is your current
working directory. Open philosophers.csv with a spreadsheet software to have a
quick look at it. Then open it in a text editor (for instance Notepad for Windows,
TextEdit for Mac or Gedit for Linux). Note how commas are used to separate the
columns of the data:
"Name","Description","Born","Deceased","Rating"
"Aristotle","Pretty influential, as philosophers go.",-384,"322 BC",
"4.8"
"Basilides","Denied the existence of incorporeal entities.",-175,
"125 BC",4
"Cercops","An Orphic poet",,,"3.2"
"Dexippus","Neoplatonic!",235,"375 AD","2.7"
"Epictetus","A stoic philosopher",50,"135 AD",5
"Favorinus","Sceptic",80,"160 AD","4.7"
Then run the following code to import the data using the read.csv function and
store it in a variable named imported_data:
imported_data <- read.csv("philosophers.csv")
…it means that philosophers.csv is not in your working directory. Either move
the file to the right directory (remember, you can use run getwd() to see what your
working directory is) or change your working directory, as described above.
The columns Name and Description both contain text, and have been imported as
character vectors4 . The Rating column contains numbers with decimals and has
been imported as a numeric vector. The column Born only contains integer values,
and has been imported as an integer vector. The missing value is represented
by an NA. The Deceased column contains years formatted like 125 BC and 135 AD.
These have been imported into a character vector - because numbers and letters
are mixed in this column, R treats it as a text string (in Chapter 5 we will see how
we can convert it to numbers or proper dates). In this case, the missing value is
represented by an empty string, "", rather than by NA.
So, what can you do in case you need to import data from a file that is not in
your working directory? This is a common problem, as many of us store script files
and data files in separate folders (or even on separate drives). One option is to use
file.choose, which opens a pop-up window that lets you choose which file to open
using a graphical interface:
imported_data2 <- read.csv(file.choose())
Another option is not to write any code at all. Instead, you can import the data using
RStudio’s graphical interface by choosing File > Import dataset > From Text (base)
and then choosing philosophers.csv. This will generate the code needed to import
the data (using read.csv) and run it in the Console window.
The latter two solutions work just fine if you just want to open a single file once. But
if you want to reuse your code or run it multiple times, you probably don’t want to
4 If you are running an older version of R (specifically, a version older than the 4.0.0 version
released in April 2020), the character vectors will have been imported as factor vectors instead.
You can change that behaviour by adding a stringsAsFactors = FALSE argument to read.csv.
have to click and select your file each time. Instead, you can specify the path to your
file in the call to read.csv.
On Linux, for example, the full path of the file might look like this:
/home/Mans/Desktop/MyData/philosophers.csv
You can copy the path of the file from your file browser: Explorer5 (Windows),
Finder6 (Mac) or Nautilus/similar7 (Linux). Once you have copied the path, you
can store it in R as a character string.
Here’s how to do this on Mac and Linux:
file_path <- "/Users/Mans/Desktop/MyData/philosophers.csv" # Mac
file_path <- "/home/Mans/Desktop/MyData/philosophers.csv" # Linux
If you’re working on a Windows system, file paths are written using backslashes, \,
like so:
C:\Users\Mans\Desktop\MyData\file.csv
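Note that backslash is an escape character in R strings, so when storing a Windows path you must either double the backslashes or use forward slashes. A sketch:

```r
# Both of these store the same Windows path:
file_path <- "C:\\Users\\Mans\\Desktop\\MyData\\file.csv"  # doubled backslashes
file_path <- "C:/Users/Mans/Desktop/MyData/file.csv"       # forward slashes
```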
5 To copy the path, navigate to the file in Explorer. Hold down the Shift key and right-click the
file, selecting Copy as path.
6 To copy the path, navigate to the file in Finder and right-click/Control+click/two-finger click
on the file. Hold down the Option key, and then select Copy “file name” as Pathname.
7 To copy the path from Nautilus, navigate to the file and press Ctrl+L to show the path, then
copy it. If you are using some other file browser or the terminal, my guess is that you’re tech-savvy
enough that you don’t need me to tell you how to find the path of a file.
If you’ve copied the path to your clipboard, you can also get the path in the second
of the formats above by using
file_path <- readClipboard() # Windows example
Once the path is stored in file_path, you can then make a call to read.csv to
import the data:
imported_data <- read.csv(file_path)
Try this with your philosophers.csv file, to make sure that you know how it works.
Finally, you can read a file directly from a URL, by giving the URL as the file path.
Here is an example with data from the WHO Global Tuberculosis Report:
# Download WHO tuberculosis burden data:
tb_data <- read.csv("https://tinyurl.com/whotbdata")
.csv files can differ slightly in how they are formatted - for instance, different symbols
can be used to delimit the columns. You will learn how to handle this in the exercises
below.
A downside to read.csv is that it is very slow when reading large (50 MB or more)
csv files. Faster functions are available in add-on packages; see Section 5.7.1. In
addition, it is also possible to import data from other statistical software packages
such as SAS and SPSS, from other file formats like JSON, and from databases. We’ll
discuss most of these in Section 5.14.
Now, download the philosophers.xlsx file from the book’s web page and save it in
a folder of your choice. Then set file_path to the path of the file, just as you did
for the .csv file. To import data from the Excel file, you can then use:
library(openxlsx)
imported_from_Excel <- read.xlsx(file_path)
View(imported_from_Excel)
str(imported_from_Excel)
As with read.csv, you can replace the file path with file.choose() in order to
select the file manually.
Exercise 3.8. The abbreviation CSV stands for Comma Separated Values, i.e. that
commas , are used to separate the data columns. Unfortunately, the .csv format is
not standardised, and .csv files can use different characters to delimit the columns.
Examples include semicolons (;) and tabs (denoted \t in strings in
R). Moreover, decimal points can be given either as points (.) or as commas (,).
Download the vas.csv file from the book’s web page. In this dataset, a number of
patients with chronic pain have recorded how much pain they experience each day
during a period, using the Visual Analogue Scale (VAS, ranging from 0 - no pain -
to 10 - worst imaginable pain). Inspect the file in a spreadsheet software and a text
editor - check which symbol is used to separate the columns and whether a decimal
point or a decimal comma is used. Then set file_path to its path and import the
data from it using the code below:
vas <- read.csv(file_path, sep = ";", dec = ",", skip = 4)
View(vas)
str(vas)
1. Why are there two variables named X and X.1 in the data frame?
2. What happens if you remove the sep = ";" argument?
3. What happens if you instead remove the dec = "," argument?
4. What happens if you instead remove the skip = 4 argument?
5. What happens if you change skip = 4 to skip = 5?
Exercise 3.9. Download the projects-email.xlsx file from the book’s web page
and have a look at it in a spreadsheet software. Note that it has three sheets:
Projects, Email, and Contact.
1. Read the documentation for read.xlsx. How can you import the data from
the second sheet, Email?
2. Some email addresses are repeated more than once. Read the documentation for
unique. How can you use it to obtain a vector containing the email addresses
without any duplicates?
Exercise 3.10. Download the vas-transposed.csv file from the book’s web page
and have a look at it in a spreadsheet software. It is a transposed version of vas.csv,
where rows represent variables and columns represent observations (instead of the
other way around, as is the case in data frames in R). How can we import this data
into R?
1. Import the data using read.csv. What does the resulting data frame look
like?
2. Read the documentation for read.csv. How can you make it read the row
names that can be found in the first column of the .csv file?
3. The function t can be applied to transpose (i.e. rotate) your data frame. Try
it out on your imported data. Is the resulting object what you were looking
for? What happens if you make a call to as.data.frame with your data after
transposing it?
# Export to .csv:
write.csv(bookstore, "bookstore.csv")
To save the objects bookstore and age in a .Rdata file, we can use the save function:
save(bookstore, age, file = "myData.RData")
When we wish to load the stored objects, we use the load function:
load(file = "myData.RData")
To create a new Project, click File > New Project in the RStudio menu. You then get
to choose whether to create a Project associated with a folder that already exists, or
to create a Project in a new folder. After you’ve created the Project, it will be saved
as an .Rproj file. You can launch RStudio with the Project folder as the working
directory by double-clicking the .Rproj file. If you already have an active RStudio
session, this will open another session in a separate window.
When working in a Project, I recommend that you store your data in a subfolder of
the Project folder. You can then use relative paths to access your data files, i.e. paths
that are relative to your working directory. For instance, if the file bookstore.csv is
in a folder in your working directory called Data, its relative path is:
file_path <- "Data/bookstore.csv"
Much simpler than having to write the entire path, isn’t it?
If instead your working directory is contained inside the folder where bookstore.csv
is stored, its relative path would be
file_path <- "../bookstore.csv"
The beauty of using relative paths is that they are simpler to write, and if you transfer
the entire project folder to another computer, your code will still run, because the
relative paths will stay the same.
The output contains a lot of useful information, including the p-value (0.53) and a
95 % confidence interval. t.test contains a number of useful arguments that we
can use to tailor the test to our taste. For instance, we can change the confidence
level of the confidence interval (to 90 %, say), use a one-sided alternative hypothesis
(“carnivores sleep more than herbivores”, i.e. the mean of the first group is greater
than that of the second group) and perform the test under the assumption of equal
variances in the two samples:
t.test(carnivores$sleep_total, herbivores$sleep_total,
conf.level = 0.90,
alternative = "greater",
var.equal = TRUE)
Let’s have a look at the relationship between gross horsepower (hp) and fuel con-
sumption (mpg):
library(ggplot2)
ggplot(mtcars, aes(hp, mpg)) +
geom_point()
The relationship doesn’t appear to be perfectly linear, but nevertheless, we can try
fitting a linear regression model to the data. This can be done using lm. We fit a
model with mpg as the response variable and hp as the explanatory variable:
m <- lm(mpg ~ hp, data = mtcars)
The first argument is a formula, saying that mpg is a function of hp, i.e.
𝑚𝑝𝑔 = 𝛽0 + 𝛽1 ⋅ ℎ𝑝.
A summary of the model is obtained using summary. Among other things, it includes
the estimated parameters, p-values and the coefficient of determination 𝑅2 .
summary(m)
We can add the fitted line to the scatterplot by using geom_abline, which lets us add
a straight line with a given intercept and slope - we take these to be the coefficients
from the fitted model, given by coef:
# Check model coefficients:
coef(m)
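The plot call the text describes might look something like this (a sketch, assuming the model m and the ggplot2 scatterplot from above; the indexing of coef(m) into intercept and slope is standard but added here for illustration):

```r
library(ggplot2)

m <- lm(mpg ~ hp, data = mtcars)

# Scatterplot with the fitted regression line added via geom_abline:
ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  geom_abline(intercept = coef(m)[1], slope = coef(m)[2])
```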
If we wish to add further variables to the model, we simply add them to the right-
hand-side of the formula in the function call:
m2 <- lm(mpg ~ hp + wt, data = mtcars)
summary(m2)
𝑚𝑝𝑔 = 𝛽0 + 𝛽1 ⋅ ℎ𝑝 + 𝛽2 ⋅ 𝑤𝑡.
There is much more to be said about linear models in R. We’ll return to them in
Section 8.1.
Exercise 3.11. Fit a linear regression model to the mtcars data, using mpg as the
response variable and hp, wt, cyl, and am as explanatory variables. Are all four
explanatory variables significant?
To begin with, let’s compute the mean temperature for each month. Using
aggregate, we do this as follows:
aggregate(Temp ~ Month, data = airquality, FUN = mean)
The first argument is a formula, similar to what we used for lm, saying that we want
a summary of Temp grouped by Month. Similar formulas are used also in other R
functions, for instance when building regression models. In the second argument,
data, we specify in which data frame the variables are found, and in the third, FUN,
we specify which function should be used to compute the summary.
It is also possible to compute summaries for multiple variables at the same time.
For instance, we can compute the standard deviations (using sd) of Temp and Wind,
grouped by Month:
aggregate(cbind(Temp, Wind) ~ Month, data = airquality, FUN = sd)
aggregate can also be used to count the number of observations in the groups. For
instance, we can count the number of days in each month. In order to do so, we put
a variable with no NA values on the left-hand side in the formula, and use length,
which returns the length of a vector:
aggregate(Temp ~ Month, data = airquality, FUN = length)
Another function that can be used to compute grouped summaries is by. The results
are the same, but the output is not as nicely formatted. Here’s how to use it to
compute the mean temperature grouped by month:
by(airquality$Temp, airquality$Month, mean)
What makes by useful is that unlike aggregate it is easy to use with functions that
take more than one variable as input. If we want to compute the correlation between
Wind and Temp grouped by month, we can do that as follows:
names(airquality) # Check that Wind and Temp are in columns 3 and 4
by(airquality[, 3:4], airquality$Month, cor)
For each month, this outputs a correlation matrix, which shows both the correlation
between Wind and Temp and the correlation of the variables with themselves (which
always is 1).
Exercise 3.12. Load the VAS pain data vas.csv from Exercise 3.8. Then do the
following:
1. Compute the mean VAS for each patient.
2. Compute the lowest and highest VAS recorded for each patient.
3. Compute the number of high-VAS days, defined as days where the VAS was at
least 7, for each patient.
Wouldn’t it be more convenient if you didn’t have to write the bookstore$ part each
time? To just say once that you are manipulating bookstore, and have R implicitly
understand that all the variables involved reside in that data frame? Yes. Yes, it
would. Fortunately, R has tools that will let you do just that.
Now, let’s say that we are interested in finding out what the mean wind speed (in
m/s rather than mph) on hot days (temperature above 80, say) in the airquality
data is, aggregated by month. We could do something like this:
# Extract hot days:
airquality2 <- airquality[airquality$Temp > 80, ]
# Convert wind speed to m/s:
airquality2$Wind <- airquality2$Wind * 0.44704
# Compute mean wind speed for each month:
hot_wind_means <- aggregate(Wind ~ Month, data = airquality2,
FUN = mean)
There is nothing wrong with this code per se. We create a copy of airquality
(because we don’t want to change the original data), change the units of the wind
speed, and then compute the grouped means. A downside is that we end up with
a copy of airquality that we maybe won’t need again. We could avoid that by
putting all the operations inside of aggregate:
# More compact:
hot_wind_means <- aggregate(Wind*0.44704 ~ Month,
data = airquality[airquality$Temp > 80, ],
FUN = mean)
The problem with this is that it is a little difficult to follow because we have to read
the code from the inside out. When we run the code, R will first extract the hot
days, then convert the wind speed to m/s, and then compute the grouped means -
so the operations happen in an order that is the opposite of the order in which we
wrote them.
magrittr introduces a new operator, %>%, called a pipe, which can be used to chain
functions together. Calls that you would otherwise write as
new_variable <- function_2(function_1(your_data))
can be written as
your_data %>% function_1 %>% function_2 -> new_variable
so that the operations are written in the order they are performed. Some prefer the
former style, which is more like mathematics, but many prefer the latter, which is
more like natural language (particularly for those of us who are used to reading from
left to right).
Three operations are required to solve the airquality wind speed problem:
1. Extract the hot days,
2. Convert the wind speed to m/s,
3. Compute the grouped means.
Where before we used assignments like airquality2$Wind <- airquality2$Wind *
0.44704, we now need functions that carry out the same operations in order to
solve this problem using pipes.
A function that lets us extract the hot days is subset:
subset(airquality, Temp > 80)
To convert the wind speed, we can use the function inset from magrittr. And
finally, aggregate can be used to compute the grouped means. We could use
these functions step-by-step:
# Extract hot days:
airquality2 <- subset(airquality, Temp > 80)
# Convert wind speed to m/s:
airquality2 <- inset(airquality2, "Wind",
value = airquality2$Wind * 0.44704)
# Compute mean wind speed for each month:
hot_wind_means <- aggregate(Wind ~ Month, data = airquality2,
FUN = mean)
But, because we have functions to perform the operations, we can instead use %>%
pipes to chain them together in a pipeline. Pipes automatically send the output from
the previous function as the first argument to the next, so that the data flows from
left to right, which make the code more concise. They also let us refer to the output
from the previous function as ., which saves even more space. The resulting code is:
airquality %>%
subset(Temp > 80) %>%
inset("Wind", value = .$Wind * 0.44704) %>%
aggregate(Wind ~ Month, data = ., FUN = mean) ->
hot_wind_means
You can read the %>% operator as then: take the airquality data, then subset it,
then convert the Wind variable, then compute the grouped means. Once you wrap
your head around the idea of reading the operations from left to right, this code is
arguably clearer and easier to read. Note that we used the right-assignment operator
-> to assign the result to hot_wind_means, to keep in line with the idea that the
data flows from the left to the right.
If you need to use binary operators like +, ^ and <, magrittr has a number of aliases
that you can use. For instance, add works as an alias for +:
x <- 2
exp(x + 2)
x %>% add(2) %>% exp
In simple cases like these it is usually preferable to use the base R solution - the
point here is that if you need to perform this kind of operation inside a pipeline, the
aliases make it easy to do so. For a complete list of aliases, see ?extract.
If the function does not take the output from the previous function as its first argu-
ment, you can use . as a placeholder, just as we did in the airquality problem.
Here is another example:
cat(paste("The current time is", Sys.time()))
Sys.time() %>% paste("The current time is", .) %>% cat
If the data only appears inside parentheses, you need to wrap the function in curly
brackets {}, or otherwise %>% will try to pass it as the first argument to the function:
airquality %>% cat("Number of rows in data:", nrow(.)) # Doesn't work
airquality %>% {cat("Number of rows in data:", nrow(.))} # Works!
In addition to the magrittr pipes, from version 4.1 R also offers a native pipe, |>,
which can be used in lieu of %>% without loading any packages. Nevertheless, we’ll use
%>% pipes in the remainder of the book, partially because they are more commonly
used (meaning that you are more likely to encounter them when looking at other
people’s code), and partially because magrittr also offers some other useful pipe
operators. You’ll see plenty of examples of how pipes can be used in Chapters 5-9,
and learn about other pipe operators in Section 6.2.
Exercise 3.14. Rewrite the following function calls using pipes, with x <- 1:8:
1. sqrt(mean(x))
2. mean(sqrt(x))
3. sort(x^2-5)[1:2]
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)
visit_length <- c(5, 2, 20, 22, 12, 31, 9, 10, 11)
bookstore <- data.frame(age, purchase, visit_length)
Add a new variable rev_per_minute which is the ratio between purchase and the
visit length, using a pipe.
much of what R has to offer. Base R is just as marvellous, and can definitely make
data science as fast, easy and fun as the tidyverse. Besides, nobody uses just base
R anyway - there are a ton of non-tidyverse packages that extend and enrich R in
exciting new ways. Perhaps “extended R” or “superpowered R” would be better
names for the non-tidyverse dialect.
Anyone who tells you to just learn one of these dialects is wrong. Both are great,
they work extremely well together, and they are similar enough that you shouldn’t
limit yourself to just mastering one of them. This book will show you both base
R and tidyverse solutions to problems, so that you can decide for yourself which is
faster, easier, and more fun.
A defining property of the tidyverse is that there are separate functions for everything,
which is perfect for code that relies on pipes. In contrast, base R uses fewer functions,
but with more parameters, to perform the same tasks. If you use tidyverse solutions
there is a good chance that there exists a function which performs exactly the task
you’re going to do with its default settings. This is great (once again, especially if
you want to use pipes), but it means that there are many more functions to master
for tidyverse users, whereas you can make do with far fewer in base R. You will
spend more time looking up function arguments when working with base R (which
fortunately is fairly straightforward using the ? documentation), but on the other
hand, looking up arguments for a function that you know the name of is easier than
finding a function that does something very specific that you don’t know the name
of. There are advantages and disadvantages to both approaches.
Similar ethical guidelines for statisticians have been put forward by the International
Statistical Institute (https://www.isi-web.org/about-isi/policies/professional-ethics),
the United Nations Statistics Division (https://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx),
and the Data Science Association (http://www.datascie
## R Markdown
When you click the **Knit** button a document will be generated that
```{r cars}
summary(cars)
```
## Including Plots
Note that the `echo = FALSE` parameter was added to the code chunk to
prevent printing of the R code that generated the plot.
Press the Knit button at the top of the Script panel to create an HTML document
using this Markdown file. It will be saved in the same folder as your Markdown file.
Once the HTML document has been created, it will open so that you can see the
results. You may have to install additional packages for this to work, in which case
RStudio will prompt you.
Now, let’s have a look at what the different parts of the Markdown document do.
The first part is called the document header or YAML header. It contains information
about the document, including its title, the name of its author, and the date on which
it was first created:
---
title: "Untitled"
author: "Måns Thulin"
date: "10/20/2020"
output: html_document
---
The part that says output: html_document specifies what type of document
should be created when you press Knit. In this case, it’s set to html_document,
meaning that an HTML document will be created. By changing this to output:
word_document you can create a .docx Word document instead. By changing it to
output: pdf_document, you can create a .pdf document using LaTeX (you’ll have
to install LaTeX if you haven’t already - RStudio will notify you if that is the case).
The second part sets the default behaviour of code chunks included in the docu-
ment, specifying that the output from running the chunks should be printed unless
otherwise specified:
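In the default template, this is a setup chunk along the following lines (reproduced here from the standard RStudio template; treat the exact wording as approximate):

````
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
````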
The third part contains a section with text and a code chunk:
When you click the **Knit** button a document will be generated that
includes both content as well as the output of any embedded R code
chunks within the document. You can embed an R code chunk like this:
```{r cars}
summary(cars)
```
The fourth and final part contains another section, this time with a figure created
using R. A setting is added to the code chunk used to create the figure, which prevents
the underlying code from being printed in the document:
## Including Plots
Note that the `echo = FALSE` parameter was added to the code chunk to
prevent printing of the R code that generated the plot.
In the next few sections, we will look at how formatting and code chunks work in R
Markdown.
$$a^2 + b^2 = c^2.$$
To add headers and subheaders, and to divide your document into sections, start a
new line with #’s as follows:
# Header text
## Sub-header text
### Sub-sub-header text
...and so on.
2. Second item
i) Sub-item 1
ii) Sub-item 2
3. Item 3
To create a table, use | and --------- as follows:
Column 1 | Column 2
--------- | ---------
Content | More content
Even more | And some here
Even more? | Yes!
which yields the following output:
Column 1 Column 2
Content More content
Even more And some here
Even more? Yes!
To include an image, use the same syntax as when creating linked text with a link
to the image path (either local or on the web), but with a ! in front:

plot(pressure)
As we can see in Figure 4.2, the relationship between temperature and pressure
resembles a banana.
In addition, you can add settings to the chunk header to control its behaviour. For
[Figure 4.2: A plot of pressure against temperature, as produced by plot(pressure).]
instance, you can include a code chunk without running it by adding eval = FALSE:
Data frames can be printed either as in the Console or as a nicely formatted table.
For example,
library(ggplot2)
It looks nice, sure, but there may be things that you’d like to change. Maybe you’d
like the plot’s background to be white instead of grey, or perhaps you’d like to use a
different font. These, and many other things, can be modified using themes.
There are several packages available that contain additional themes. Let’s download
a few:
install.packages("ggthemes")
library(ggthemes)
install.packages("hrbrthemes")
library(hrbrthemes)
You can change the colour palette using scale_colour_brewer. Three types of
colour palettes are available:
• Sequential palettes: these range from a colour to white. They are useful for
representing ordinal (i.e. ordered) categorical variables and numerical variables.
• Diverging palettes: these range from one colour to another, with white in
between. They are useful when there is a meaningful middle or 0 value (e.g.
when your variables represent temperatures or profit/loss), which can be
mapped to white.
• Qualitative palettes: these contain multiple distinct colours. They are useful
for nominal (i.e. with no natural ordering) categorical variables.
See ?scale_colour_brewer or http://www.colorbrewer2.org for a list of the avail-
able palettes. Here are some examples:
# Sequential palette:
p + scale_colour_brewer(palette = "OrRd")
# Diverging palette:
p + scale_colour_brewer(palette = "RdBu")
# Qualitative palette:
p + scale_colour_brewer(palette = "Set1")
In the last example, the vector c(0.9, 0.7) gives the relative coordinates of the
legend, with c(0, 0) representing the bottom left corner of the plot and c(1, 1) the
upper right corner. Try changing the coordinates to different values between 0 and
1 and see what happens.
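The legend-position example that this paragraph refers to is not shown above. It presumably looked something like the following sketch, where p is the plot object built earlier in the chapter:

```r
# Hypothetical reconstruction of the legend-position example the text
# refers to; p is assumed to be a ggplot object created earlier.
p + theme(legend.position = c(0.9, 0.7))
```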
theme has a lot of other settings, including for the colours of the background, the
grid, and the text in the plot. Here are a few examples that you can use as a starting
point for experimenting with the settings:
p + theme(panel.grid.major = element_line(colour = "black"),
panel.grid.minor = element_line(colour = "purple",
linetype = "dotted"),
panel.background = element_rect(colour = "red", size = 2),
plot.background = element_rect(fill = "yellow"),
axis.text = element_text(family = "mono", colour = "blue"),
axis.title = element_text(family = "serif", size = 14))
Exercise 4.1. Use the documentation for theme and the element_... functions to
change the plot object p created above as follows:
4. Change the colour of the axis ticks to orange and make them thicker.
A similar plot can be created using frequency polygons, which use lines instead of bars
to display the counts in the bins:
ggplot(diamonds, aes(carat)) +
geom_freqpoly()
An advantage of frequency polygons is that they can be used to compare groups,
e.g. diamonds with different cuts, without facetting:
ggplot(diamonds, aes(carat, colour = cut)) +
geom_freqpoly()
It is clear from this figure that there are more diamonds with ideal cuts than diamonds
with fair cuts in the data. The polygons have roughly the same shape, except perhaps
for the polygon for diamonds with fair cuts.
In some cases, we are more interested in the shape of the distribution than in the
actual counts in the different bins. Density plots are similar to frequency polygons
but show an estimate of the density function of the underlying random variable.
These estimates are smooth curves that are scaled so that the area below them is 1
(i.e. scaled to be proper density functions):
ggplot(diamonds, aes(carat, colour = cut)) +
geom_density()
From this figure, it becomes clear that low-carat diamonds tend to have better cuts,
which wasn’t obvious from the frequency polygons. However, the plot does not
provide any information about how common different cuts are. Use density plots
if you’re more interested in the shape of a variable’s distribution, and frequency
polygons if you’re more interested in counts.
Exercise 4.2. Using the density plot created above and the documentation for
geom_density, do the following:
1. Increase the smoothness of the density curves.
2. Fill the area under the density curves with the same colour as the curves them-
selves.
3. Make the colours that fill the areas under the curves transparent.
4. The figure still isn’t that easy to interpret. Install and load the ggridges
package, an extension of ggplot2 that allows you to make so-called ridge plots
(density plots that are separated along the y-axis, similar to facetting). Read
the documentation for geom_density_ridges and use it to make a ridge plot
of diamond prices for different cuts.
Maybe we could compute the average price in each bin of the histogram? In that
case, we need to extract the bin breaks from the histogram somehow. We could then
create a new categorical variable using the breaks with cut (as we did in Exercise
3.7). It turns out that extracting the bins is much easier using base graphics than
ggplot2, so let’s do that:
# Extract information from a histogram with bin width 0.01,
# which corresponds to 481 breaks:
carat_br <- hist(diamonds$carat, breaks = 481, right = FALSE,
plot = FALSE)
# Of interest to us are:
# carat_br$breaks, which contains the breaks for the bins
# carat_br$mids, which contains the midpoints of the bins
# (useful for plotting!)
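The step that creates carat_cat has been lost from the text above. A minimal sketch of what it presumably looked like, using cut() with the breaks extracted from the histogram:

```r
library(ggplot2)  # for the diamonds dataset

# Recreate the histogram breaks from the previous code block:
carat_br <- hist(diamonds$carat, breaks = 481, right = FALSE,
                 plot = FALSE)

# Assign each observation to a bin (a sketch of the omitted step):
carat_cat <- cut(diamonds$carat, breaks = carat_br$breaks,
                 right = FALSE)
```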
We now have a variable, carat_cat, that shows to which bin each observation belongs.
Next, we’d like to compute the mean for each bin. This is a grouped summary - mean
by category. After we’ve computed the bin means, we could then plot them against
the bin midpoints. Let’s try it:
means <- aggregate(price ~ carat_cat, data = diamonds, FUN = mean)
plot(carat_br$mid, means$price)
That didn’t work as intended. We get an error message when attempting to plot the
results:
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
The error message implies that the number of bins and the number of mean values
that have been computed differ. But we’ve just computed the mean for each bin,
haven’t we? So what’s going on?
By default, aggregate ignores groups for which there are no values when computing
grouped summaries. In this case, there are a lot of empty bins - there is for instance
no observation in the [4.99, 5) bin. In fact, only 272 out of the 481 bins are non-empty.
We can solve this in different ways. One way is to remove the empty bins. We can
do this using the match function, which returns the positions of matching values in
two vectors. If we use it with the bins from the grouped summary and the vector
containing all bins, we can find the indices of the non-empty bins. This requires the
use of the levels function, which you'll learn more about in Section 5.4:
means <- aggregate(price ~ carat_cat, data = diamonds, FUN = mean)
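The code computing the indices of the non-empty bins is missing above. Based on the description, it presumably used match and levels along these lines (a sketch; means and carat_cat are assumed to come from the previous steps):

```r
# Find the positions of the non-empty bins among all bin levels
# (a reconstruction of the omitted step):
id <- match(means$carat_cat, levels(carat_cat))
```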
Finally, we’ll also add some vertical lines to our plot, to call attention to multiples
of 0.25.
Using base graphics is faster here:
plot(carat_br$mids[id], means$price,
     cex = 0.5)
But we can of course stick to ggplot2 if we like:
library(ggplot2)
It appears that there are small jumps in the prices at some of the 0.25-marks. This
explains why there are more diamonds just above these marks than just below.
The above example illustrates three crucial things regarding exploratory data analysis:
• Plots (in our case, the histogram) often lead to new questions.
• Sometimes the thing that we’re trying to do doesn’t work straight away. There
is almost always a solution though (and oftentimes more than one!). The more
you work with R, the more problem-solving tricks you will learn.
Instead of using a boxplot, we can use a violin plot. Each group is represented by a
“violin”, given by a rotated and duplicated density plot:
ggplot(diamonds, aes(cut, price)) +
geom_violin()
Compared to boxplots, violin plots capture the entire distribution of the data rather
than just a few numerical summaries. If you like numerical summaries (and you
should!) you can add the median and the quartiles (corresponding to the borders of
the box in the boxplot) using the draw_quantiles argument:
ggplot(diamonds, aes(cut, price)) +
geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))
Exercise 4.4. Using the first violin plot created above, i.e. ggplot(diamonds,
aes(cut, price)) + geom_violin(), do the following:
1. Add some colour to the plot by giving different colours to each violin.
2. Because the categories are shown along the x-axis, we don’t really need the
legend. Remove it.
3. Both boxplots and violin plots are useful. Maybe we can have the best of both
worlds? Add the corresponding boxplot inside each violin. Hint: the width
and alpha arguments in geom_boxplot are useful for creating a nice-looking
figure here.
4. Flip the coordinate system to create horizontal violins and boxes instead.
To use the patchwork package, save each plot as a plot object and then add them together:
library(patchwork)
plot1 + plot2
You can also arrange the plots on multiple lines, with different numbers of plots on
each line. This is particularly useful if you are combining different types of plots in
a single plot window. In this case, you separate plots that are on the same line by |
and mark the beginning of a new line with /:
# Create two more plot objects:
plot3 <- ggplot(diamonds, aes(cut, depth, fill = cut)) +
geom_violin() +
theme(legend.position = "none")
plot4 <- ggplot(diamonds, aes(carat, fill = cut)) +
geom_density(alpha = 0.5) +
theme(legend.position = c(0.9, 0.6))
# One row with three plots and one row with a single plot:
(plot1 | plot2 | plot3) / plot4
# One column with three plots and one column with a single plot:
(plot1 / plot2 / plot3) | plot4
(You may need to enlarge your plot window for this to look good!)
There are some outliers which we may want to study further. For instance, there is
a surprisingly cheap 5 carat diamond, and some cheap 3 carat diamonds2. But how
can we identify those points?
One option is to use the plotly package to make an interactive version of the plot,
where we can hover interesting points to see more information about them. Start by
installing it:
install.packages("plotly")
To use plotly with a ggplot graphic, we store the graphic in a variable and then
use it as input to the ggplotly function. The resulting (interactive!) plot takes a
little longer than usual to load. Try hovering the points:
myPlot <- ggplot(diamonds, aes(carat, price)) +
geom_point()
library(plotly)
ggplotly(myPlot)
By default, plotly only shows the carat and price of each diamond. But we can add
more information to the box by adding a text aesthetic:
myPlot <- ggplot(diamonds, aes(carat, price, text = paste("Row:",
rownames(diamonds)))) +
geom_point()
ggplotly(myPlot)
We can now find the row numbers of the outliers visually, which is very useful when
exploring data.
Exercise 4.5. The variables x and y in the diamonds data describe the length and
width of the diamonds (in mm). Use an interactive scatterplot to identify outliers in
these variables. Check prices, carat and other information and consider whether any
of the outliers could be due to data errors.
2 Note that it is not just the prices nor just the carats of these diamonds that make them outliers, but the combination of the two.
We can count the number of missing values for each variable using:
colSums(is.na(msleep))
Here, colSums computes the sum of is.na(msleep) for each column of msleep
(remember that in summation, TRUE counts as 1 and FALSE as 0), yielding the number
of missing values for each variable. In total, there are 136 missing values in the
dataset:
sum(is.na(msleep))
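The counting trick relies on TRUE being treated as 1 and FALSE as 0 in arithmetic, which a small standalone example (with a made-up vector) illustrates:

```r
# TRUE counts as 1 and FALSE as 0 when summed:
x <- c(1, NA, 3, NA, 5)
is.na(x)       # FALSE TRUE FALSE TRUE FALSE
sum(is.na(x))  # 2 missing values
```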
You’ll notice that ggplot2 prints a warning in the Console when you create a plot
with missing data:
ggplot(msleep, aes(brainwt, sleep_total)) +
geom_point() +
scale_x_log10()
Sometimes data are missing simply because the information is not yet available (for
instance, the brain weight of the mountain beaver could be missing because no one
has ever weighed the brain of a mountain beaver). In other cases, data can be missing
because something about them is different (for instance, values for a male patient
in a medical trial can be missing because the patient died, or because some values
only were collected for female patients). Therefore, it is of interest to see if there are
any differences in non-missing variables between subjects that have missing data and
subjects that don’t.
In msleep, all animals have recorded values for sleep_total and bodywt. To check
if the animals that have missing brainwt values differ from the others, we can plot
them in a different colour in a scatterplot:
ggplot(msleep, aes(bodywt, sleep_total, colour = is.na(brainwt))) +
geom_point() +
scale_x_log10()
(If is.na(brainwt) is TRUE then the brain weight is missing in the dataset.) In this
case, there are no apparent differences between the animals with missing data and
those without.
Exercise 4.7. Create a version of the diamonds dataset where the x value is missing
for all diamonds with 𝑥 > 9. Make a scatterplot of carat versus price, with points
where the x value is missing plotted in a different colour. How would you interpret
this plot?
Exercise 4.8. Explore the flights2 dataset, focusing on delays and the amount of
time spent in the air. Are there any differences between the different carriers? Are
there missing data? Are there any outliers?
There appears to be a decreasing trend in the plot. To aid the eye, we can add a
smoothed line by adding a new geom, geom_smooth, to the figure:
ggplot(msleep, aes(brainwt, sleep_total)) +
geom_point() +
geom_smooth() +
scale_x_log10()
This technique is useful for bivariate data as well as for time series, which we’ll delve
into next.
By default, geom_smooth adds a line fitted using either LOESS3 or GAM4, as well as
the corresponding 95 % confidence interval describing the uncertainty in the estimate.
There are several useful arguments that can be used with geom_smooth. You will
explore some of these in the exercise below.
Exercise 4.9. Check the documentation for geom_smooth. Starting with the plot
of brain weight vs. sleep time created above, do the following:
1. Decrease the degree of smoothing for the LOESS line that was fitted to the
data. What is better in this case, more or less smoothing?
2. Fit a straight line to the data instead of a non-linear LOESS line.
3. Remove the confidence interval from the plot.
4. Change the colour of the fitted line to red.
The a10 dataset contains information about the monthly anti-diabetic drug sales in
Australia during the period July 1991 to June 2008. By checking its structure, we
see that it is saved as a time series object:
library(fpp2)
str(a10)
3 LOESS, LOcally Estimated Scatterplot Smoothing, is a non-parametric regression method that fits local polynomial regressions to subsets of the data.
ggplot2 requires that data be saved as a data frame in order to be plotted. To
plot the time series, we could first convert it to a data frame and then plot
each point using geom_point:
a10_df <- data.frame(time = time(a10), sales = a10)
ggplot(a10_df, aes(time, sales)) +
geom_point()
It is however usually preferable to plot time series using lines instead of points. This
is done using a different geom: geom_line:
ggplot(a10_df, aes(time, sales)) +
geom_line()
Having to convert the time series object to a data frame is a little awkward. Luckily,
there is a way around this. ggplot2 offers a function called autoplot, that auto-
matically draws an appropriate plot for certain types of data. forecast extends this
function to time series objects:
library(forecast)
autoplot(a10)
We can still add other geoms, axis labels and other things just as before. autoplot
has simply replaced the ggplot(data, aes()) + geom part that would be the first
two rows of the ggplot2 figure, and has implicitly converted the data to a data
frame.
library(forecast)
autoplot(gold)
There is a sharp spike a few weeks before day 800, which is due to an incorrect value
in the data series. We’d like to add a note about that to the plot. First, we wish
to find out on which day the spike appears. This can be done by checking the data
manually or using some code:
spike_date <- which.max(gold)
To add a circle around that point, we add a call to annotate to the plot:
autoplot(gold) +
annotate(geom = "point", x = spike_date, y = gold[spike_date],
size = 5, shape = 21,
colour = "red",
fill = "transparent")
annotate can be used to annotate the plot with both geometrical objects and text
(and can therefore be used as an alternative to geom_text).
Exercise 4.11. Using the figure created above and the documentation for annotate,
do the following:
1. Add the text “Incorrect value” next to the circle.
2. Create a second plot where the incorrect value has been removed.
3. Read the documentation for the geom geom_hline. Use it to add a red reference
line to the plot, at 𝑦 = 400.
In this case, it is probably a good idea to facet the data, i.e. to plot each series in a
different figure:
The resulting figure is quite messy. Using colour to indicate the passing of time helps
a little. For this, we need to add the day of the year to the data frame. To get the
values right, we use nrow, which gives us the number of rows in the data frame.
elecdaily2 <- as.data.frame(elecdaily)
elecdaily2$day <- 1:nrow(elecdaily2)
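The plot that the next sentence interprets is not shown in the text. It presumably mapped day to colour in a scatterplot of demand against temperature, along these lines (a sketch; the column names Demand and Temperature are taken from the elecdaily dataset):

```r
library(ggplot2)
# A sketch of the omitted plot: temperature against demand,
# with colour indicating the day of the year.
ggplot(elecdaily2, aes(Demand, Temperature, colour = day)) +
  geom_point()
```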
It becomes clear from the plot that temperatures were the highest at the beginning
of the year and lower in the winter months (July-August).
Exercise 4.13. Make the following changes to the plot you created above:
1. Decrease the size of the points.
2. Add text annotations showing the dates of the highest and lowest temperatures,
next to the corresponding points in the figure.
The first two aes arguments specify the x and y-axes, and the third specifies that
there should be one line per subject (i.e. per boy) rather than a single line interpo-
lating all points. The latter would be a rather useless figure that looks like this:
ggplot(Oxboys, aes(age, height)) +
geom_point() +
geom_line() +
ggtitle("A terrible plot")
Returning to the original plot, if we wish to be able to identify which time series
corresponds to which boy, we can add a colour aesthetic:
ggplot(Oxboys, aes(age, height, group = Subject, colour = Subject)) +
geom_point() +
geom_line()
Note that the boys are ordered by height, rather than subject number, in the legend.
Now, imagine that we wish to add a trend line describing the general growth trend
for all boys. The growth appears approximately linear, so it seems sensible to use
geom_smooth(method = "lm") to add the trend:
ggplot(Oxboys, aes(age, height, group = Subject, colour = Subject)) +
geom_point() +
geom_line() +
geom_smooth(method = "lm", colour = "red", se = FALSE)
Unfortunately, because we have specified in the aesthetics that the data should be
grouped by Subject, geom_smooth produces one trend line for each boy. The "problem"
is that when we specify an aesthetic in the ggplot call, it is used for all geoms.
Exercise 4.14. Figure out how to produce a spaghetti plot of the Oxboys data with
a single red trend line based on the data from all 26 boys.
When working with seasonal time series, it is common to decompose the series into
a seasonal component, a trend component and a remainder. In R, this is typically
done using the stl function, which uses repeated LOESS smoothing to decompose
the series. There is an autoplot function for stl objects:
autoplot(stl(a10, s.window = 365))
This plot, too, can be manipulated in the same way as other ggplot objects. You can
access the different parts of the decomposition as follows:
stl(a10, s.window = 365)$time.series[,"seasonal"]
stl(a10, s.window = 365)$time.series[,"trend"]
stl(a10, s.window = 365)$time.series[,"remainder"]
Exercise 4.15. Investigate the writing dataset from the fma package graphically.
Make a time series plot with a smoothed trend line, a seasonal plot and an stl-
decomposition plot. Add appropriate plot titles and labels to the axes. Can you see
any interesting patterns?
We can now look at some examples with the anti-diabetic drug sales data:
library(forecast)
library(fpp2)
library(changepoint)
library(ggfortify)
autoplot(cpt.meanvar(a10_ns))
As you can see, the different methods from changepoint all yield different results.
The results for changes in the mean are a bit dubious - which isn’t all that strange as
we are using a method that looks for jumps in the mean on a time series where the
increase actually is more or less continuous. The changepoint for the variance looks
more reliable - there is a clear change towards the end of the series where the sales
become more volatile. We won’t go into details about the different methods here,
but mention that the documentation at ?cpt.mean, ?cpt.var, and ?cpt.meanvar
contains descriptions of and references for the available methods.
Exercise 4.16. Are there any changepoints for variance in the Demand time series
in elecdaily? Can you explain why the behaviour of the series changes?
ggplotly(myPlot)
When you hover the mouse pointer over a point, a box appears, displaying informa-
tion about that data point. Unfortunately, the date formatting isn’t great in this
example - dates are shown as weeks with decimal points. We’ll see how to fix this in
Section 5.6.
We can visualise the monthly average temperature using lines in a Cartesian coordi-
nate system:
ggplot(Cape_Town_weather, aes(Month, Temp_C)) +
geom_line()
What this plot doesn’t show is that the 12th month and the 1st month actually are
consecutive months. If we instead use polar coordinates, this becomes clearer:
ggplot(Cape_Town_weather, aes(Month, Temp_C)) +
geom_line() +
coord_polar()
To improve the presentation, we can change the scale of the x-axis (i.e. the circular
axis) so that January and December aren’t plotted at the same angle:
ggplot(Cape_Town_weather, aes(Month, Temp_C)) +
geom_line() +
coord_polar() +
xlim(0, 12)
Exercise 4.17. In the plot that we just created, the last and first months of the year
aren't connected. You can fix this manually by adding a cleverly designed faux data
point to Cape_Town_weather. How?
What would happen if we plotted this figure in a polar coordinate system instead?
If we map the height of the bars (the y-axis of the Cartesian coordinate system) to
both the angle and the radial distance, we end up with a pie chart:
ggplot(msleep, aes(factor(1), fill = vore)) +
geom_bar() +
coord_polar(theta = "y")
There are many arguments against using pie charts for visualisations. Most boil
down to the fact that the same information is easier to interpret when conveyed as a
bar chart. This is at least partially due to the fact that most people are more used
to reading plots in Cartesian coordinates than in polar coordinates.
If we make a similar transformation of a grouped bar chart, we get a different type
of pie chart, in which the heights of the bars are mapped to both the angle and the
radial distance:
# Cartesian bar chart:
ggplot(msleep, aes(vore, fill = vore)) +
geom_bar()
The ggpairs function from the GGally package creates a scatterplot matrix that also
plots density estimates (along the diagonal) and shows the (Pearson) correlation for
each pair. Let's start by installing GGally:
install.packages("GGally")
(Enlarging your Plot window can make the figure look better.)
If we want to create a scatterplot matrix but only want to include some of the
variables in a dataset, we can do so by providing a vector with variable names. Here
is an example for the animal sleep data msleep:
ggpairs(msleep[, c("sleep_total", "sleep_rem", "sleep_cycle", "awake",
"brainwt", "bodywt")])
Alternatively, we can use a categorical variable to colour points and density estimates
using aes(colour = ...). The syntax for this follows the same pattern as that
for a standard ggplot call - ggpairs(data, aes). The only exception is that if the
categorical variable is not included in the data argument, we must specify which
data frame it belongs to:
ggpairs(msleep[, c("sleep_total", "sleep_rem", "sleep_cycle", "awake",
"brainwt", "bodywt")],
aes(colour = msleep$vore, alpha = 0.5))
As a side note, if all variables in your data frame are numeric, and you are only
looking for a quick-and-dirty scatterplot matrix without density estimates and
correlations, you can also use base R's plot:
plot(airquality)
Exercise 4.18. Create a scatterplot matrix for all numeric variables in diamonds.
Differentiate different cuts by colour. Add a suitable title to the plot. (diamonds is
a fairly large dataset, and it may take a minute or so for R to create the plot.)
4.8.2 3D scatterplots
The plotly package lets us make three-dimensional scatterplots with the plot_ly
function, which can be a useful alternative to scatterplot matrices in some cases.
Here is an example using the airquality data:
library(plotly)
plot_ly(airquality, x = ~Ozone, y = ~Wind, z = ~Temp,
color = ~factor(Month))
Note that you can drag and rotate the plot, to see it from different angles.
4.8.3 Correlograms
Scatterplot matrices are not a good choice when we have too many variables, partly
because the plot window needs to be very large to fit all variables and partly
because it becomes difficult to get a good overview of the data. In such cases, we
can instead use a correlogram, where the strength of the correlation between each pair
of variables is plotted in place of scatterplots. It is effectively a visualisation of
the correlation matrix of the data, where the strengths and signs of the correlations
are represented by different colours.
The GGally package contains the function ggcorr, which can be used to create a
correlogram:
ggcorr(msleep[, c("sleep_total", "sleep_rem", "sleep_cycle", "awake",
"brainwt", "bodywt")])
Exercise 4.19. Using the diamonds dataset and the documentation for ggcorr, do
the following:
1. Create a correlogram for all numeric variables in the dataset.
2. The Pearson correlation that ggcorr uses by default isn’t always the best choice.
A commonly used alternative is the Spearman correlation. Change the type of
correlation used to create the plot to the Spearman correlation.
3. Change the colour scale from a continuous scale to a categorical scale with 5 categories.
4. Change the colours on the scale to go from yellow (low correlation) to black
(high correlation).
The plot looks a little nicer if we increase the size of the points:
ggplot(msleep, aes(brainwt, sleep_total, shape = vore)) +
geom_point(size = 2) +
scale_x_log10()
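The bubble plot that the next paragraph describes is not shown above; it presumably mapped bodywt to point size, something like this sketch:

```r
library(ggplot2)
# A sketch of the omitted bubble plot: body weight mapped to point size.
ggplot(msleep, aes(brainwt, sleep_total, colour = vore,
                   size = bodywt)) +
  geom_point() +
  scale_x_log10()
```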
The size of each “bubble” now represents the weight of the animal. Because some
animals are much heavier (i.e. have higher bodywt values) than most others, almost
all points are quite small. There are a couple of things we can do to remedy this. First,
we can transform bodywt, e.g. using the square root transformation sqrt(bodywt), to
decrease the differences between large and small animals. This can be done by adding
scale_size(trans = "sqrt") to the plot. Second, we can also use scale_size to
control the range of point sizes (e.g. from size 1 to size 20). This will cause some points
to overlap, so we add alpha = 0.5 to the geom, to make the points transparent:
This produces a fairly nice-looking plot, but it’d look even better if we changed the
axes labels and legend texts. We can change the legend text for the size scale by
adding the argument name to scale_size. Including a \n in the text lets us create
a line break - you’ll learn more tricks like that in Section 5.5. Similarly, we can use
scale_colour_discrete to change the legend text for the colours:
ggplot(msleep, aes(brainwt, sleep_total, colour = vore,
size = bodywt)) +
geom_point(alpha = 0.5) +
xlab("log(Brain weight)") +
ylab("Sleep total (h)") +
scale_x_log10() +
scale_size(range = c(1, 20), trans = "sqrt",
name = "Square root of\nbody weight") +
scale_colour_discrete(name = "Feeding behaviour")
Exercise 4.20. Using the bubble plot created above, do the following:
1. Replace colour = vore in the aes by fill = vore and add colour =
"black", shape = 21 to geom_point. What happens?
2. Use ggplotly to create an interactive version of the bubble plot above, where
variable information and the animal name are displayed when you hover a point.
4.8.5 Overplotting
Let’s make a scatterplot of table versus depth based on the diamonds dataset:
ggplot(diamonds, aes(table, depth)) +
geom_point()
This plot is cluttered. There are too many points, which makes it difficult to see
if, for instance, high table values are more common than low table values. In this
section, we’ll look at some ways to deal with this problem, known as overplotting.
This helps a little, but now the outliers become a bit difficult to spot. We can try
changing the opacity using alpha instead:
ggplot(diamonds, aes(table, depth)) +
geom_point(alpha = 0.2)
This is also better than the original plot, but neither plot is great. Instead of plotting
each individual point, maybe we can try plotting the counts or densities in different
regions of the plot instead? Effectively, this would be a 2D version of a histogram.
There are several ways of doing this in ggplot2.
First, we bin the points and count the numbers in each bin, using geom_bin2d:
ggplot(diamonds, aes(table, depth)) +
geom_bin2d()
By default, geom_bin2d uses 30 bins. Increasing that number can sometimes give us
a better idea about the distribution of the data:
ggplot(diamonds, aes(table, depth)) +
geom_bin2d(bins = 50)
If you prefer, you can get a similar plot with hexagonal bins by using geom_hex
instead:
ggplot(diamonds, aes(table, depth)) +
geom_hex(bins = 50)
The fill = ..level.. bit above probably looks a little strange to you. It means
that an internal function (the level of the contours) is used to choose the fill colours.
It also means that we’ve reached a point where we’re reaching deep into the depths
of ggplot2!
We can use a similar approach to show a summary statistic for a third variable in a
plot. For instance, we may want to plot the average price as a function of table and
depth. This is called a tile plot:
ggplot(diamonds, aes(table, depth, z = price)) +
geom_tile(binwidth = 1, stat = "summary_2d", fun = mean)
However, it is often better to use colour rather than point size to visualise counts,
which we can do using a tile plot. First, we have to compute the counts though, using
aggregate. We now wish to have two grouping variables, color and cut, which we
can put on the right-hand side of the formula as follows:
diamonds2 <- aggregate(carat ~ cut + color, data = diamonds,
FUN = length)
diamonds2
diamonds2 is now a data frame containing the different combinations of color and
cut along with counts of how many diamonds belong to each combination (labelled
carat, because we put carat in our formula). Let’s change the name of the last
column from carat to Count:
names(diamonds2)[3] <- "Count"
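The tile plot itself is missing from the text. Given the counts computed above, it presumably looked something like this sketch:

```r
library(ggplot2)
# A sketch of the omitted tile plot: counts of diamonds for each
# combination of color and cut, shown using fill colour.
ggplot(diamonds2, aes(color, cut, fill = Count)) +
  geom_tile()
```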
library(nycflights13)
?planes
Exercise 4.24. Use graphics to answer the following questions regarding the planes
dataset:
1. What is the most common combination of manufacturer and plane type in the
dataset?
2. Which combination of manufacturer and plane type has the highest average
number of seats?
3. Do the numbers of seats on planes change over time? Which plane had the
highest number of seats?
4. Does the type of engine used change over time?
Below we read the data and convert the Variety column to a categorical factor variable (which
you'll learn more about in Section 5.4):
# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
col.names = c("Area", "Perimeter", "Compactness",
"Kernel_length", "Kernel_width", "Asymmetry",
"Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)
If we make a scatterplot matrix of all variables, it becomes evident that there are
differences between the varieties, but that no single pair of variables is enough to
separate them:
library(ggplot2)
library(GGally)
ggpairs(seeds[, -8], aes(colour = seeds$Variety, alpha = 0.2))
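The call creating the pca object is missing above. It was presumably a prcomp call on the seven measurement variables; centering and scaling are assumptions here, but scaling is standard when the variables are on different scales:

```r
# A sketch of the omitted PCA step, using the seeds data loaded above.
# Centering and scaling are assumptions, not confirmed by the text.
pca <- prcomp(seeds[, -8], center = TRUE, scale. = TRUE)
```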
To see the loadings of the components, i.e. how much each variable contributes to
the components, simply type the name of the object prcomp created:
pca
The first principal component is more or less a weighted average of all variables,
but has stronger weights on Area, Perimeter, Kernel_length, Kernel_width, and
Groove_length, all of which are measures of size. We can therefore interpret it as
a size variable. The second component has higher loadings for Compactness and
Asymmetry, meaning that it mainly measures those shape features. In Exercise 4.26
The first principal component accounts for 71.87 % of the variance, and the first
three combined account for 98.67 %.
To visualise this, we can draw a scree plot, which shows the variance of each principal
component - the total variance of the data is the sum of the variances of the principal
components:
screeplot(pca, type = "lines")
We can use this to choose how many principal components to use when visualising
or summarising our data. In that case, we look for an “elbow”, i.e. a bend after
which increasing the number of components doesn’t increase the amount of variance
explained much.
We can access the values of the principal components using pca$x. Let’s check that
the first two components really are uncorrelated:
cor(pca$x[,1], pca$x[,2])
In this case, almost all of the variance is summarised by the first two or three principal
components. It appears that we have successfully reduced the data from 7 variables to
2-3, which should make visualisation much easier. The ggfortify package contains
an autoplot function for PCA objects, that creates a scatterplot of the first two
principal components:
library(ggfortify)
autoplot(pca, data = seeds, colour = "Variety")
That is much better! The groups are almost completely separated, which shows
that the variables can be used to discriminate between the three varieties. The first
principal component accounts for 71.87 % of the total variance in the data, and the
second for 17.11 %.
If you like, you can plot other pairs of principal components than just components 1
and 2. In this case, component 3 may be of interest, as its variance is almost as high
as that of component 2. You can specify which components to plot with the x and y
arguments:
# Plot 2nd and 3rd PC:
autoplot(pca, data = seeds, colour = "Variety",
x = 2, y = 3)
Here, the separation is nowhere near as clear as in the previous figure. In this
particular example, plotting the first two principal components is the better choice.
4.10. CLUSTER ANALYSIS 127
Judging from these plots, it appears that the kernel measurements can be used to
discriminate between the three varieties of wheat. In Chapters 7 and 9 you’ll learn
how to use R to build models that can be used to do just that, e.g. by predicting which
variety of wheat a kernel comes from given its measurements. If we wanted to build
a statistical model that could be used for this purpose, we could use the original
measurements. But we could also try using the first two principal components as
the only input to the model. Principal component analysis is very useful as a pre-
processing tool, used to create simpler models based on fewer variables (or ostensibly
simpler, because the new variables are typically more difficult to interpret than the
original ones).
Exercise 4.25. Carry out a principal components analysis of the carat, x, y, z, depth, and table
variables in the diamonds data, and answer the following questions:
1. How much of the total variance does the first principal component account for?
How many components are needed to account for at least 90 % of the total
variance?
2. Judging by the loadings, what do the first two principal components measure?
3. What is the correlation between the first principal component and price?
4. Can the first two principal components be used to distinguish between dia-
monds with different cuts?
Exercise 4.26. Return to the scatterplot of the first two principal components
for the seeds data, created above. Adding the arguments loadings = TRUE and
loadings.label = TRUE to the autoplot call creates a biplot, which shows the
loadings for the principal components on top of the scatterplot. Create a biplot and
compare the result to those obtained by looking at the loadings numerically. Do the
conclusions from the two approaches agree?
We are interested in finding subgroups - clusters - of states with similar voting pat-
terns.
To find clusters of similar observations (states, in this case), we could start by assign-
ing each observation to its own cluster. We’d then start with 50 clusters, one for each
observation. Next, we could merge the two clusters that are the most similar, yielding
49 clusters: one consisting of two observations and 48 consisting of a single
observation each. We could repeat this process, merging the two most similar clusters in
each iteration until only a single cluster was left. This would give us a hierarchy of
clusters, which could be plotted in a tree-like structure, where observations from the
same cluster would be on the same branch. Like this:
clusters_agnes <- agnes(votes.repub)
plot(clusters_agnes, which = 2)
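Base R offers the same agglomerative procedure through hclust, which takes a distance matrix rather than raw data. A small self-contained sketch with made-up data (the matrix x and its labels are hypothetical, not part of the votes.repub example):

```r
# Agglomerative hierarchical clustering with base R's hclust:
set.seed(1)
x <- matrix(rnorm(20), ncol = 2,
            dimnames = list(letters[1:10], c("v1", "v2")))
d <- dist(x)                         # pairwise Euclidean distances
cl <- hclust(d, method = "average")  # average linkage, as in agnes' default
plot(cl)                             # draw the dendrogram
```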
• method = "single, single linkage, uses the smallest distance between points
from the two clusters,
• method = "complete, complete linkage, uses the largest distance between
points from the two clusters,
• method = "ward", Ward’s method, uses the within-cluster variance to compare
different possible clusterings, with the clustering with the lowest within-cluster
variance being chosen.
Regardless of which of these you use, it is often a good idea to standardise
the numeric variables in your dataset so that they all have the same variance. If
you don’t, your distance measure is likely to be dominated by variables with larger
variance, while variables with low variances will have little or no impact on the
clustering. To standardise your data, you can use scale:
# Perform clustering on standardised data:
clusters_agnes <- agnes(scale(votes.repub))
# Plot dendrogram:
plot(clusters_agnes, which = 2)
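To see concretely what scale does, here is a small self-contained example with made-up numbers. After scaling, each column has mean 0 and standard deviation 1, so no single variable dominates the distance computations:

```r
# scale() standardises each column to mean 0 and sd 1:
x <- matrix(c(1, 2, 3, 100, 200, 300), ncol = 2)
round(colMeans(scale(x)), 10)  # both columns have mean 0
apply(scale(x), 2, sd)         # both columns have sd 1
```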
At this point, we’re starting to use several functions one after another, so this looks
like a perfect job for a pipeline. To carry out the same analysis using %>% pipes, we
write:
library(magrittr)
votes.repub %>% scale() %>%
agnes() %>%
plot(which = 2)
We can now try changing the metric and clustering method used as described above.
Let’s use the Manhattan distance and complete linkage:
votes.repub %>% scale() %>%
agnes(metric = "manhattan", method = "complete") %>%
plot(which = 2)
We can change the look of the dendrogram by adding hang = -1, which causes all
observations to be placed at the same level:
votes.repub %>% scale() %>%
agnes(metric = "manhattan", method = "complete") %>%
plot(which = 2, hang = -1)
You can change the distance measure used by setting metric in the diana call.
Euclidean distance is the default.
To wrap this section up, we’ll look at two packages that are useful for plotting
the results of hierarchical clustering: dendextend and factoextra. We installed
factoextra in the previous section, but still need to install dendextend:
install.packages("dendextend")
To compare the dendrograms produced by different methods (or the same
method with different settings) in a tanglegram, where the dendrograms are plotted
against each other, we can use tanglegram from dendextend:
library(dendextend)
# Create clusters using agnes:
votes.repub %>% scale() %>%
agnes() -> clusters_agnes
# Create clusters using diana:
votes.repub %>% scale() %>%
diana() -> clusters_diana
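The tanglegram call itself appears to have been lost here; presumably something like the following, with as.dendrogram converting the clustering objects to the format that tanglegram expects:

```r
# Plot the two dendrograms against each other:
tanglegram(as.dendrogram(clusters_agnes),
           as.dendrogram(clusters_diana))
```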
Some clusters are quite similar here, whereas others are very different.
Often, we are interested in finding a comparatively small number of clusters, 𝑘. In
hierarchical clustering, we can reduce the number of clusters by “cutting” the den-
drogram tree. To do so using the factoextra package, we first use hcut to cut the
tree into 𝑘 parts, and then fviz_dend to plot the dendrogram, with each cluster
plotted in a different colour. If, for instance, we want 𝑘 = 5 clusters (5 being just an
arbitrary number here; we could of course want 4, 14, or any other number of clusters)
and want to use agnes with average linkage and Euclidean distance for the clustering,
we’d do the following:
library(factoextra)
votes.repub %>% scale() %>%
hcut(k = 5, hc_func = "agnes",
hc_method = "average",
hc_metric = "euclidean") %>%
fviz_dend()
There is no inherent meaning to the colours - they are simply a way to visually
distinguish between clusters.
Hierarchical clustering is especially suitable for data with named observations. For
other types of data, other methods may be better. We will consider some alternatives
next.
Exercise 4.27. Continue the last example above by changing the clustering method
to complete linkage with the Manhattan distance.
1. Do any of the 5 coloured clusters remain the same?
2. How can you produce a tanglegram with 5 coloured clusters, to better compare
the results from the two clusterings?
Exercise 4.28. The USArrests data contains statistics on violent crime rates in 50
US states. Perform a hierarchical cluster analysis of the data. With which states is
Maryland clustered?
You may want to increase the height of your Plot window so that the names of all
states are displayed properly. Using the default colours, low values are represented
by a light yellow and high values by a dark red. White represents missing values.
You’ll notice that dendrograms are plotted along the margins. heatmap performs
hierarchical clustering (by default, agglomerative with complete linkage) of the ob-
servations as well as of the variables. In the latter case, variables are grouped together
based on similarities between observations, creating clusters of variables. In essence,
this is just a hierarchical clustering of the transposed data matrix, but it does offer
a different view of the data, which at times can be very revealing. The rows and
columns are sorted according to the two hierarchical clusterings.
As per usual, it is a good idea to standardise the data before clustering, which can
be done using the scale argument in heatmap. There are two options for scaling,
either in the row direction (preferable if you wish to cluster variables) or the column
direction (preferable if you wish to cluster observations):
# Standardisation suitable for clustering variables:
votes.repub %>% as.matrix() %>% heatmap(scale = "row")
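Only the row-direction example survives above; the column-direction counterpart appears to have been lost in extraction, and presumably was:

```r
# Standardisation suitable for clustering observations:
votes.repub %>% as.matrix() %>% heatmap(scale = "column")
```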
Looking at the first of these plots, we can see which elections (i.e. which variables)
had similar outcomes in terms of Republican votes. For instance, we can see that the
elections in 1960, 1976, 1888, 1884, 1880, and 1876 all had similar outcomes, with
the large number of orange rows indicating that the Republicans neither did great
nor did poorly.
If you like, you can change the colour palette used. As in Section 4.2.2, you can
choose between palettes from http://www.colorbrewer2.org. heatmap is not a
ggplot2 function, so this is done in a slightly different way to what you’re used to
from other examples. Here are two examples, with the white-blue-purple sequential
palette "BuPu" and the red-white-blue diverging palette "RdBu":
library(RColorBrewer)
col_palette <- colorRampPalette(brewer.pal(8, "BuPu"))(25)
votes.repub %>% as.matrix() %>%
heatmap(scale = "row", col = col_palette)
Exercise 4.29. Draw a heatmap for the USArrests data. Have a look at Maryland
and the states with which it is clustered. Do they have high or low crime rates?
We know that there are three varieties of seeds in this dataset, but what if we didn’t?
Or what if we’d lost the labels and didn’t know which seeds are of which type? There
are no row names for this data, and plotting a dendrogram may therefore not be
that useful. Instead, we can use 𝑘-means clustering, where the points are clustered
into 𝑘 clusters based on their distances to the cluster means, or centroids.
When performing 𝑘-means clustering (using the algorithm of Hartigan & Wong (1979)
that is the default in the function that we’ll use), the data is split into 𝑘 clusters
based on their distance to the mean of all points. Points are then moved between
clusters, one at a time, based on how close they are (as measured by Euclidean
distance) to the mean of each cluster. The algorithm finishes when no point can be
moved between clusters without increasing the average distance between points and
the means of their clusters.
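The clustering call that creates seeds_cluster appears to be missing at this point; judging from identical code used later in the chapter, it was presumably:

```r
# k-means clustering with 3 clusters, on standardised data:
seeds[, -8] %>% scale() %>%
  kmeans(centers = 3) -> seeds_cluster
```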
seeds_cluster
To visualise the results, we’ll plot the first two principal components. We’ll use colour
to show the clusters. Moreover, we’ll plot the different varieties in different shapes,
to see if the clusters found correspond to different varieties:
# Compute principal components:
pca <- prcomp(seeds[,-8])
library(ggfortify)
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
shape = "Variety", size = 2, alpha = 0.75)
In this case, the clusters more or less overlap with the varieties! Of course, in a lot of
cases, we don’t know the number of clusters beforehand. What happens if we change
𝑘?
First, we try 𝑘 = 2:
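The 𝑘 = 2 code appears to have been lost to a page break; presumably it mirrored the 𝑘 = 4 version below:

```r
seeds[, -8] %>% scale() %>%
  kmeans(centers = 2) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
         shape = "Variety", size = 2, alpha = 0.75)
```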
Next, 𝑘 = 4:
seeds[, -8] %>% scale() %>%
kmeans(centers = 4) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$cluster,
shape = "Variety", size = 2, alpha = 0.75)
If it weren’t for the fact that the different varieties were shown as different shapes,
we’d have no way to say, based on this plot alone, which choice of 𝑘 is preferable
here. Before we go into methods for choosing 𝑘 though, we’ll mention pam. pam is
an alternative to 𝑘-means that works in the same way, but uses median-like points,
medoids instead of cluster means. This makes it more robust to outliers. Let’s try it
with 𝑘 = 3 clusters:
library(cluster) # pam comes from the cluster package
seeds[, -8] %>% scale() %>%
pam(k = 3) -> seeds_cluster
autoplot(pca, data = seeds, colour = seeds_cluster$clustering,
shape = "Variety", size = 2, alpha = 0.75)
For both kmeans and pam, there are visual tools that can help us choose the value of
𝑘 in the factoextra package. Let’s install it:
install.packages("factoextra")
The fviz_nbclust function in factoextra can be used to obtain plots that can
guide the choice of 𝑘. It takes three arguments as input: the data, the clustering
function (e.g. kmeans or pam) and the method used for evaluating different choices
of 𝑘. There are three options for the latter: "wss", "silhouette" and "gap_stat".
method = "wss" yields a plot that relies on the within-cluster sum of squares, WSS,
which is a measure of the within-cluster variation. The smaller this is, the more
compact are the clusters. The WSS is plotted for several choices of 𝑘, and we look
for an “elbow”, just as we did when using a scree plot for PCA. That is, we look for
the value of 𝑘 such that increasing 𝑘 further doesn’t improve the WSS much. Let’s
have a look at an example, using pam for clustering:
library(factoextra)
fviz_nbclust(scale(seeds[, -8]), pam, method = "wss")
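A passage covering the second option appears to have been lost here. method = "silhouette" plots the average silhouette width for each choice of 𝑘, a measure of how similar observations are to their own cluster compared to the nearest neighbouring cluster; we pick the 𝑘 that maximises it. Presumably the example mirrored the one above:

```r
fviz_nbclust(scale(seeds[, -8]), pam, method = "silhouette")
```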
Finally, method = "gap_stat" yields a plot of the gap statistic (Tibshirani et al.,
2001), which is based on comparing the WSS to its expected value under a null
distribution obtained using the bootstrap (Section 7.7). Higher values of the gap
statistic are preferable:
fviz_nbclust(scale(seeds[, -8]), pam, method = "gap_stat")
Note that in this plot, the shapes correspond to the clusters and not the varieties of
seeds.
Exercise 4.30. The chorSub data from cluster contains measurements of 10 chemicals
in 61 geological samples from the Kola Peninsula. Cluster this data using either
kmeans or pam (does either seem to be a better choice here?). What is a good
choice of 𝑘 here? Visualise the results.
As for kmeans and pam, we can use fviz_nbclust to determine how many clusters
to use:
seeds[, -8] %>% scale() %>%
fviz_nbclust(fanny, method = "wss")
seeds[, -8] %>% scale() %>%
fviz_nbclust(fanny, method = "silhouette")
# Producing the gap statistic plot takes a while here, so
# be prepared to wait if you run it:
seeds[, -8] %>% scale() %>%
fviz_nbclust(fanny, method = "gap_stat")
Now, let’s cluster the seeds data. The number of clusters is chosen as part of the
clustering procedure. We’ll use a function from the factoextra package for plotting the
clusters with ellipsoids, and so start by installing that:
install.packages("factoextra")
library(mclust)
seeds_cluster <- Mclust(scale(seeds[, -8]))
summary(seeds_cluster)
Gaussian finite mixture models are based on the assumption that the data is numer-
ical. For categorical data, we can use latent class analysis, which we’ll discuss in
Section 4.11.2, instead.
Exercise 4.32. Return to the chorSub data from Exercise 4.30. Cluster it using a
Gaussian finite mixture model. How many clusters do you find?
For instance, using the seeds data, we can compare the area of seeds from different
clusters:
# Cluster the seeds using k-means with k=3:
library(cluster)
seeds[, -8] %>% scale() %>%
kmeans(centers = 3) -> seeds_cluster
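The comparison itself seems to have been lost in extraction. A minimal sketch using a base-R boxplot (the book may well have used ggplot2 here instead):

```r
# Compare the distribution of kernel area across the three clusters:
boxplot(seeds$Area ~ seeds_cluster$cluster,
        xlab = "Cluster", ylab = "Area")
```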
It may be tempting to run some statistical tests (e.g. a t-test) to see if there are
differences between the clusters. Note, however, that in statistical hypothesis testing,
it is typically assumed that the hypotheses that are being tested have been generated
independently from the data. Double-dipping, where the data first is used to generate
a hypothesis (“judging from this boxplot, there seems to be a difference in means
between these two groups!” or “I found these clusters, and now I’ll run a test to see if
they are different”) and then test that hypothesis, is generally frowned upon, as that
substantially inflates the risk of a type I error. Recently, there have however been
some advances in valid techniques for testing differences in means between clusters
found using hierarchical clustering; see Gao et al. (2020).
4.11. EXPLORATORY FACTOR ANALYSIS 139
For our first example of factor analysis, we’ll be using the attitude data that comes
with R. It describes the outcome of a survey of employees at a financial organisation.
Have a look at its documentation to read about the variables in the dataset:
?attitude
attitude
To fit a factor analysis model to these data, we can use fa from psych. fa requires us
to specify the number of factors used in the model. We’ll get back to how to choose
the number of factors, but for now, let’s go with 2:
library(psych)
# Fit factor model:
attitude_fa <- fa(attitude, nfactors = 2,
rotate = "oblimin", fm = "ml")
fa does two things for us. First, it fits a factor model to the data, which yields a
table of factor loadings, i.e. the correlation between the two unobserved factors and
the observed variables. However, there is an infinite number of mathematically valid
factor models for any given dataset. Therefore, the factors are rotated according
to some rule to obtain a factor model that hopefully allows for easy and useful
interpretation. Several methods can be used to fit the factor model (set using the fm
argument in fa) and for rotating the solution (set using rotate). We’ll look at some
of the options shortly.
First, we’ll print the result, showing the factor loadings (after rotation). We’ll also
plot the resulting model using fa.diagram, showing the correlation between the
factors and the observed variables:
# Print results:
attitude_fa
# Plot results:
fa.diagram(attitude_fa, simple = FALSE)
The first factor is correlated to the variables advance, learning and raises. We can
perhaps interpret this factor as measuring the employees’ career opportunities at the
organisation. The second factor is strongly correlated to complaints and (overall)
rating, but also to a lesser degree correlated to raises, learning and privileges.
This can maybe be interpreted as measuring how the employees feel that they are
treated at the organisation.
We can also see that the two factors are correlated. In some cases, it makes sense
to expect the factors to be uncorrelated. In that case, we can change the rotation
method used, from oblimin (which yields oblique rotations, allowing for correlations
- usually a good default) to varimax, which yields uncorrelated factors:
attitude_fa <- fa(attitude, nfactors = 2,
rotate = "varimax", fm = "ml")
fa.diagram(attitude_fa, simple = FALSE)
The fm = "ml" setting means that maximum likelihood estimation of the factor
model is performed, under the assumption of a normal distribution for the data.
Maximum likelihood estimation is widely recommended for estimation of factor mod-
els, and can often work well even for non-normal data (Costello & Osborne, 2005).
However, there are cases where it fails to find useful factors. fa offers several differ-
ent estimation methods. A good alternative is minres, which often works well when
maximum likelihood fails:
attitude_fa <- fa(attitude, nfactors = 2,
rotate = "oblimin", fm = "minres")
fa.diagram(attitude_fa, simple = FALSE)
Once again, the results are similar to what we saw before. In other examples, the
results differ more. When choosing which estimation method and rotation to use,
bear in mind that in an exploratory study, there is no harm in playing around with a
few different methods. After all, your purpose is to generate hypotheses rather than
confirm them, and looking at the data in a few different ways will help you do that.
To determine the number of factors that are appropriate for a particular dataset, we
can draw a scree plot with scree. This is interpreted in the same way as for principal
components analysis (Section 4.9) and centroid-based clustering (Section 4.10.3) - we
look for an “elbow” in the plot, which tells us at which point adding more factors no
longer contributes much to the model:
scree(attitude, pc = FALSE)
Some older texts recommend that only factors with an eigenvalue (the y-axis in the
scree plot) greater than 1 be kept in the model. It is widely agreed that this so-called
Kaiser rule is inappropriate (Costello & Osborne, 2005), as it runs the risk of leaving
out important factors.
Similarly, some older texts also recommend using principal components analysis to fit
factor models. While the two are mathematically similar in that both in some sense
reduce the dimensionality of the data, PCA and factor analysis are designed to target
different problems. Factor analysis is concerned with an underlying causal structure
where the unobserved factors affect the observed variables. In contrast, PCA simply
seeks to create a small number of variables that summarise the variation in the data,
which can work well even if there are no unobserved factors affecting the variables.
Exercise 4.33. Factor analysis only relies on the covariance or correlation matrix
of your data. When using fa and other functions for factor analysis, you can input
either a data frame or a covariance/correlation matrix. Read about the ability.cov
data that comes shipped with R, and perform a factor analysis of it.
When observations from the same cluster are assumed to be uncorrelated, the result-
ing model is called latent profile analysis, which typically is handled using model-
based clustering (Section 4.10.5). The special case where the observed variables are
categorical is instead known as latent class analysis. This is common e.g. in analyses
of survey data, and we’ll have a look at such an example in this section. The package
that we’ll use for our analyses is called poLCA - let’s install it:
install.packages("poLCA")
The National Mental Health Services Survey is an annual survey collecting informa-
tion about mental health treatment facilities in the US. We’ll analyse data from the
2019 survey, courtesy of the Substance Abuse and Mental Health Data Archive, and
try to find latent classes. Download nmhss-puf-2019.csv from the book’s web page,
and set file_path to its path. We can then load and look at a summary of the data
using:
nmhss <- read.csv(file_path)
summary(nmhss)
All variables are categorical (except perhaps for the first one, which is an identifier).
According to the survey’s documentation, negative values are used to represent miss-
ing values. For binary variables, 0 means no/non-presence and 1 means yes/presence.
Next, we’ll load the poLCA package and read the documentation for the function that
we’ll use for the analysis.
library(poLCA)
?poLCA
As you can see in the description of the data argument, the observed variables
(called manifest variables here) are only allowed to contain consecutive integer values,
starting from 1. Moreover, missing values should be represented by NA, and not by
negative numbers (just as elsewhere in R!). We therefore need to make two changes
to our data:
In our example, we’ll look at variables describing what treatments are available at
the different facilities. Let’s create a new data frame for those variables:
treatments <- nmhss[, names(nmhss)[17:30]]
summary(treatments)
To make the changes to the data that we need, we can do the following:
# Change negative values to NA:
treatments[treatments < 0] <- NA
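That covers the first change. The second, recoding the binary 0/1 values to the consecutive integers 1 and 2 that poLCA requires, appears to have been lost in extraction; presumably:

```r
# Recode values to consecutive integers starting at 1
# (0 = no becomes 1, 1 = yes becomes 2):
treatments <- treatments + 1
```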
We are now ready to get started with our analysis. To begin with, we will try to find
classes based on whether or not the facilities offer the following five treatments:
• TREATPSYCHOTHRPY: The facility offers individual psychotherapy,
• TREATFAMTHRPY: The facility offers couples/family therapy,
• TREATGRPTHRPY: The facility offers group therapy,
• TREATCOGTHRPY: The facility offers cognitive behavioural therapy,
• TREATPSYCHOMED: The facility offers psychotropic medication.
The poLCA function needs three inputs: a formula describing what observed variables
to use, a data frame containing the observations, and nclass, the number of latent
classes to find. To begin with, let’s try two classes:
m <- poLCA(cbind(TREATPSYCHOTHRPY, TREATFAMTHRPY,
TREATGRPTHRPY, TREATCOGTHRPY,
TREATPSYCHOMED) ~ 1,
data = treatments, nclass = 2)
The output shows the probabilities of 1’s (no/non-presence) and 2’s (yes/presence)
for the two classes. So, for instance, from the output
$TREATPSYCHOTHRPY
Pr(1) Pr(2)
class 1: 0.6628 0.3372
class 2: 0.0073 0.9927
we gather that 34 % of facilities belonging to the first class offer individual psychother-
apy, whereas 99 % of facilities from the second class offer individual psychotherapy.
Looking at the other variables, we see that the second class always has high proba-
bilities of offering therapies, while the first class doesn’t. Interpreting this, we’d say
that the second class contains facilities that offer a wide variety of treatments, and
the first facilities that only offer some therapies. Finally, we see from the output that
88 % of the facilities belong to the second class:
Estimated class population shares
0.1167 0.8833
Just as in a cluster analysis, it is often a good idea to run the analysis with different
numbers of classes. Next, let’s try 3 classes:
m <- poLCA(cbind(TREATPSYCHOTHRPY, TREATFAMTHRPY,
TREATGRPTHRPY, TREATCOGTHRPY,
TREATPSYCHOMED) ~ 1,
data = treatments, nclass = 3)
This time, we run into numerical problems - the model estimation has failed, as
indicated by the following warning message:
ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
poLCA fits the model using a method known as the EM algorithm, which finds maxi-
mum likelihood estimates numerically. First, the observations are randomly assigned
to the classes. Step by step, the observations are then moved between classes, un-
til the optimal split has been found. It can however happen that more steps are
needed to find the optimum (by default 1,000 steps are used), or that we end up
with unfortunate initial class assignments that prevent the algorithm from finding
the optimum. To attenuate this problem, we can increase the number of steps used,
or run the algorithm multiple times, each with new initial class assignments. The
poLCA arguments for this are maxiter, which controls the number of steps (or itera-
tions) used, and nrep, which controls the number of repetitions with different initial
assignments. We’ll increase both, and see if that helps. Note that this means that
the algorithm will take longer to run:
m <- poLCA(cbind(TREATPSYCHOTHRPY, TREATFAMTHRPY,
TREATGRPTHRPY, TREATCOGTHRPY,
TREATPSYCHOMED) ~ 1,
data = treatments, nclass = 3,
maxiter = 2500, nrep = 5)
These settings should do the trick for this dataset, and you probably won’t see a
warning message this time. If you do, try increasing either number and run the code
again.
The output that you get can differ between runs - in particular, the order of the
classes can differ depending on initial assignments. Here is part of the output from
my run:
$TREATPSYCHOTHRPY
Pr(1) Pr(2)
class 1: 0.0076 0.9924
class 2: 0.0068 0.9932
class 3: 0.6450 0.3550
$TREATFAMTHRPY
Pr(1) Pr(2)
class 1: 0.1990 0.8010
class 2: 0.0223 0.9777
class 3: 0.9435 0.0565
$TREATGRPTHRPY
Pr(1) Pr(2)
class 1: 0.0712 0.9288
class 2: 0.3753 0.6247
class 3: 0.4935 0.5065
$TREATCOGTHRPY
Pr(1) Pr(2)
class 1: 0.0291 0.9709
class 2: 0.0515 0.9485
class 3: 0.5885 0.4115
$TREATPSYCHOMED
Pr(1) Pr(2)
class 1: 0.0825 0.9175
class 2: 1.0000 0.0000
class 3: 0.3406 0.6594
You can either let interpretability guide your choice of how many classes to include
in your analysis, or use model fit measures like AIC and BIC, which are printed
in the output and can be obtained from the model using:
m$aic
m$bic
If you like, you can add a covariate to your latent class analysis, which allows you to
simultaneously find classes and study their relationship with the covariate. Let’s add
the variable PAYASST (which says whether a facility offers treatment at no charge or
minimal payment to clients who cannot afford to pay) to our data, and then use that
as a covariate.
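The code for this step appears to have been lost to a page break. A sketch of what it presumably looked like, assuming PAYASST is added to the treatments data frame with its negative missing-value codes cleaned as before (as a covariate, rather than a manifest variable, it enters on the right-hand side of the formula and needs no 1/2 recoding):

```r
# Add the PAYASST covariate and recode its missing values:
treatments$PAYASST <- nmhss$PAYASST
treatments$PAYASST[treatments$PAYASST < 0] <- NA

# Latent class model with PAYASST as a covariate:
m <- poLCA(cbind(TREATPSYCHOTHRPY, TREATFAMTHRPY,
                 TREATGRPTHRPY, TREATCOGTHRPY,
                 TREATPSYCHOMED) ~ PAYASST,
           data = treatments, nclass = 3,
           maxiter = 2500, nrep = 5)
```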
The interpretation is that both class 2 and class 3 differ significantly from class 1 (the
p-values in the Pr(>|t|) column are low), with the positive coefficients for PAYASST
telling us that class 2 and 3 facilities are more likely to offer pay assistance than class
1 facilities.
Exercise 4.34. The cheating dataset from poLCA contains students’ answers to
four questions about cheating, along with their grade point averages (GPA). Perform
a latent class analysis using GPA as a covariate. What classes do you find? Does
having a high GPA increase the probability of belonging to either class?
Chapter 5
Dealing with messy data
…or, put differently, welcome to the real world. Real datasets are seldom as tidy and
clean as those you have seen in the previous examples in this book. On the contrary,
real data is messy. Things will be out of place, and formatted in the wrong way.
You’ll need to filter the rows to remove those that aren’t supposed to be used in the
analysis. You’ll need to remove some columns and merge others. You will need to
wrestle, clean, coerce, and coax your data until it finally has the right format. Only
then will you be able to actually analyse it.
This chapter contains a number of examples that serve as cookbook recipes for com-
mon data wrangling tasks. And as with any cookbook, you’ll find yourself returning
to some recipes more or less every day, until you know them by heart, while you
never find the right time to use other recipes. You definitely do not have to know all
of them by heart, and can always go back and look up a recipe when you need it.
After working with the material in this chapter, you will be able to use R to:
• Handle numeric and categorical data,
• Manipulate and find patterns in text strings,
• Work with dates and times,
• Filter, subset, sort, and reshape your data using data.table, dplyr, and
tidyr,
• Split and merge datasets,
• Scrape data from the web,
• Import data from different file formats.
148 CHAPTER 5. DEALING WITH MESSY DATA
a numeric. And if you place them together in a vector, the vector will contain two
numeric values:
TRUE + 5
v1 <- c(TRUE, 5)
v1
However, if you add a numeric to a character, the operation fails. If you put them
together in a vector, both become character strings:
"One" + 5
v2 <- c("One", 5)
v2
There is a hierarchy for data types in R: logical < integer < numeric < character.
When variables of different types are somehow combined (with addition, put in the
same vector, and so on), R will coerce both to the higher ranking type. That is why
v1 contained numeric variables (numeric is higher ranked than logical) and v2
contained character values (character is higher ranked than numeric).
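The hierarchy can be checked directly with class; a small self-contained example:

```r
# Combining types coerces to the highest-ranking type involved:
class(c(TRUE, 1L))    # "integer":   logical < integer
class(c(1L, 2.5))     # "numeric":   integer < numeric
class(c(2.5, "two"))  # "character": numeric < character
```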
Automatic coercion is often useful, but will sometimes cause problems. As an exam-
ple, a vector of numbers may accidentally be converted to a character vector, which
will confuse plotting functions. Luckily it is possible to convert objects to other data
types. The functions most commonly used for this are as.logical, as.numeric and
as.character. Here are some examples of how they can be used:
as.logical(1) # Should be TRUE
as.logical("FALSE") # Should be FALSE
as.numeric(TRUE) # Should be 1
as.numeric("2.718282") # Should be numeric 2.718282
as.character(2.718282) # Should be the string "2.718282"
as.character(TRUE) # Should be the string "TRUE"
A word of warning though - conversion only works if R can find a natural conversion
between the types. Here are some examples where conversion fails. Note that only
some of them cause warning messages:
as.numeric("two") # Should be 2
as.numeric("1+1") # Should be 2
as.numeric("2,718282") # Should be numeric 2.718282
as.logical("Vaccines cause autism") # Should be FALSE
Exercise 5.1. The following tasks are concerned with converting and checking data
types:
1. What happens if you apply as.logical to the numeric values 0 and 1? What
happens if you apply it to other numbers?
5.2 Working with lists
To access the elements in the list, we can use the same $ notation as for data frames:
my_list$my_numbers
my_list$my_data
my_list$my_text
In addition, we can access them using indices, but using double brackets:
my_list[[1]]
my_list[[2]]
my_list[[3]]
To access elements within the elements of lists, additional brackets can be added.
For instance, if you wish to access the second element of the my_numbers vector, you
can use either of these:
my_list[[1]][2]
my_list$my_numbers[2]
1 In fact, the opposite is true: under the hood, a data frame is a list of vectors of equal length.
Apart from the fact that this isn’t a very good-looking solution, this would be infeasi-
ble if we needed to split our vector into a larger number of new vectors. Fortunately,
there is a function that allows us to split the vector by month, storing the result as
a list - split:
temps <- split(airquality$Temp, airquality$Month)
temps
Note that, in breach of the rules for variable names in R, the original variable names
here were numbers (actually character variables that happened to contain numeric
characters). When accessing them using $ notation, you need to put them between
backticks (`), e.g. temps$`6`, to make it clear that 6 is a variable name and not a
number.
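For instance, to extract and summarise the June temperatures (repeating the split
call so that the example is self-contained):

```r
# Split the temperatures in airquality by month:
temps <- split(airquality$Temp, airquality$Month)
temps$`6`        # June temperatures (the element's name is the string "6")
temps[["6"]]     # double brackets with a name also work, without backticks
mean(temps$`6`)  # mean June temperature
```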
Exercise 5.2. Load the vas.csv data from Exercise 3.8. Split the VAS vector so
that you get a list containing one vector for each patient. How can you then access
the VAS values for patient 212?
5.3 Working with numbers
Moving beyond sums and means, in Section 6.5 you’ll learn how to apply any function
to the rows or columns of a data frame.
Elements 7 to 18 contain the sales for 1992. We can compute the total, highest and
smallest monthly sales up to and including each month using cumsum, cummax and
cummin:
a10[7:18]
cumsum(a10[7:18]) # Total sales
cummax(a10[7:18]) # Highest monthly sales
cummin(a10[7:18]) # Lowest monthly sales
It could be interesting to look at runs of sub-zero days, i.e. consecutive days with
sub-zero temperatures. The rle function counts the lengths of runs of equal values
in a vector. To find the length of runs of temperatures below or above zero we can
use the vector defined by the condition upp_temp < 0, the values of which are TRUE
on sub-zero days and FALSE when the temperature is 0 or higher. When we apply
rle to this vector, it returns the length and value of the runs:
2 Courtesy of the Department of Earth Sciences at Uppsala University.
rle(upp_temp < 0)
We first have a 2-day run of above zero temperatures (FALSE), then a 5-day run of
sub-zero temperatures (TRUE), then a 5-day run of above zero temperatures, and so
on.
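The Uppsala data itself isn't shown here, so here is a small made-up temperature
vector (chosen to match the runs described above) that shows the structure of rle's
output:

```r
# Made-up temperatures, not the Uppsala data:
temp <- c(1, 2, -3, -1, -2, -4, -5, 3, 2, 1, 4, 5)
runs <- rle(temp < 0)
runs$lengths                    # 2 5 5
runs$values                     # FALSE TRUE FALSE
max(runs$lengths[runs$values])  # longest sub-zero run: 5 days
```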
Scientific notation is a convenient way to display large numbers, but it’s not always
desirable. If you just want to print the number, the format function can be used to
convert it to a character, suppressing scientific notation:
format(7000000, scientific = FALSE)
If you still want your number to be a numeric (as you often do), a better choice is
to change the option for when R uses scientific notation. This can be done using the
scipen argument in the options function:
options(scipen = 1000)
7000000
0.0000007
7e+07
exp(30)
Note that this option only affects how R prints numbers, and not how they are
treated in computations.
0.33333333333333333333333333333333 … .
Clearly, the computer cannot store this number exactly, as that would require an
infinite memory. Because of this, numbers in computers are stored as floating point
numbers, which aim to strike a balance between range (being able to store both
very small and very large numbers) and precision (being able to represent numbers
accurately). Most of the time, calculations with floating points yield exactly the
results that we’d expect, but sometimes these non-exact representations of numbers
will cause unexpected problems. If we wish to compute 1.5 − 0.2 and 1.1 − 0.2, say,
we could of course use R for that. Let’s see if it gets the answers right:
1.5 - 0.2
1.5 - 0.2 == 1.3 # Check if 1.5-0.2=1.3
1.1 - 0.2
1.1 - 0.2 == 0.9 # Check if 1.1-0.2=0.9
The limitations of floating point arithmetic cause the second calculation to fail. To
see what has happened, we can use sprintf to print numbers with 30 decimals (by
default, R prints a rounded version with fewer decimals):
sprintf("%.30f", 1.1 - 0.2)
sprintf("%.30f", 0.9)
The first 15 decimals are identical, but after that the two numbers 1.1 - 0.2 and
0.9 diverge. In our other example, 1.5 − 0.2, we don’t encounter this problem - both
1.5 - 0.2 and 1.3 have the same floating point representation:
sprintf("%.30f", 1.5 - 0.2)
sprintf("%.30f", 1.3)
The order of the operations also matters in this case. The following three calculations
would all yield identical results if performed with real numbers, but in floating point
arithmetic the results differ:
1.1 - 0.2 - 0.9
1.1 - 0.9 - 0.2
1.1 - (0.9 + 0.2)
A famous example of such problems involves the US Patriot surface-to-air defence
system, which at the end of the first Gulf war missed an incoming missile due to an
error in floating point arithmetic4. It is important to be aware of the fact that
floating point arithmetic occasionally will yield incorrect results. This can happen
for numbers of any size, but is more likely
to occur when very large and very small numbers appear in the same computation.
So, 1.1 - 0.2 and 0.9 may not be the same thing in floating point arithmetic, but
at least they are nearly the same thing. The == operator checks if two numbers are
exactly equal, but there is an alternative that can be used to check if two numbers
are nearly equal: all.equal. If the two numbers are (nearly) equal, it returns TRUE,
and if they are not, it returns a description of how they differ. In order to avoid the
latter, we can use the isTRUE function to return FALSE instead:
1.1 - 0.2 == 0.9
all.equal(1.1 - 0.2, 0.9)
all.equal(1, 2)
isTRUE(all.equal(1, 2))
Exercise 5.3. These tasks showcase some problems that are commonly faced when
working with numeric data:
1. The vector props <- c(0.1010, 0.2546, 0.6009, 0.0400, 0.0035) con-
tains proportions (which, by definition, are between 0 and 1). Convert the
proportions to percentages with one decimal place.
2. Compute the highest and lowest temperatures up to and including each day in
the airquality dataset.
3. What is the longest run of days with temperatures above 80 in the airquality
dataset?
Exercise 5.4. These tasks are concerned with floating point arithmetic:
1. Very large numbers, like 10e500, are represented by Inf (infinity) in R. Try to
find out what the largest number that can be represented as a floating point
number in R is.
4 Not in R though.
Note that the last answer is invalid - No was not one of the four answers that were
allowed for the question.
You could use table to get a summary of how many answers of each type you got:
table(smoke)
But the categories are not presented in the correct order! There is a clear order
between the different categories, Never < Occasionally < Regularly < Heavy, but
table doesn’t present the results in that way. Moreover, R didn’t recognise that No
was an invalid answer, and treated it just the same as the other categories.
This is where factor variables come in. They allow you to specify which values your
variable can take, and the ordering between them (if any).
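The smoke vector itself isn't shown in this excerpt; a hypothetical version of it, and
the basic factor conversion, might look like this (the exact answers in the book may
differ):

```r
# Hypothetical survey answers, including one invalid "No":
smoke <- c("Never", "Never", "Heavy", "Never", "Occasionally",
           "Never", "Never", "Regularly", "Regularly", "No")
# With no levels argument, the levels default to the sorted unique values:
smoke2 <- factor(smoke)
```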
You can inspect the elements, and levels, i.e. the values that the categorical variable
takes, as follows:
smoke2
levels(smoke2)
So far, we have solved neither the problem of the categories being in the wrong
order nor that of the invalid No value. To fix both these problems, we can use the levels
argument in factor:
smoke2 <- factor(smoke, levels = c("Never", "Occasionally",
"Regularly", "Heavy"),
ordered = TRUE)
You can control the order in which the levels are presented by choosing the order
in which you write them in the levels argument. The ordered = TRUE argument specifies
that the order of the variables is meaningful. It can be excluded in cases where you
wish to specify the order in which the categories should be presented purely for presen-
tation purposes (e.g. when specifying whether to use the order Male/Female/Other
or Female/Male/Other). Also note that the No answer now became an NA, which in
the case of factor variables represents both missing observations and invalid obser-
vations. To find the values of smoke that became NA in smoke2 you can use which
and is.na:
smoke[which(is.na(smoke2))]
By checking the original values of the NA elements, you can see if they should be
excluded from the analysis or recoded into a proper category (No could for instance
be recoded into Never). In Section 5.5.3 you’ll learn how to replace values in larger
datasets automatically using regular expressions.
If you wish to change the name of one or more of the factor levels, you can do it
directly via the levels function. For instance, we can change the name of the NA
category, which is the 5th level of smoke2, as follows:
levels(smoke2)[5] <- "Invalid answer"
The above solution is a little brittle in that it relies on specifying the index of the
level name, which can change if we’re not careful. More robust solutions using the
data.table and dplyr packages are presented in Section 5.7.6.
Finally, if you’ve added more levels than are actually used, these can be dropped
using the droplevels function:
smoke2 <- factor(smoke, levels = c("Never", "Occasionally",
"Regularly", "Heavy",
"Constantly"),
ordered = TRUE)
levels(smoke2)
smoke2 <- droplevels(smoke2)
levels(smoke2)
Exercise 5.5. In Exercise 3.7 you learned how to create a factor variable from a
numeric variable using cut. Return to your solution (or the solution at the back of
the book) and do the following:
1. Change the category names to Mild, Moderate and Hot.
Exercise 5.6. Load the msleep data from the ggplot2 package. Note that the
categorical variable vore is stored as a character. Convert it to a factor by running
msleep$vore <- factor(msleep$vore).
1. How are the resulting factor levels ordered? Why are they ordered in that way?
2. Compute the mean value of sleep_total for each vore group.
3. Sort the factor levels according to their sleep_total means. Hint: this can
be done manually, or more elegantly using e.g. a combination of the functions
rank and match in an intermediate step.
If you check what these two strings look like, you’ll notice something funny about
text2:
text1
text2
R has put backslash characters, \, before the double quotes. The backslash is called
an escape character, which invokes a different interpretation of the character that
follows it. In fact, you can use this to put double quotes inside a string that you
define using double quotes:
text2_success <- "Another example of a so-called \"string\"."
There are a number of other special characters that can be included using a backslash:
\n for a line break (a new line) and \t for a tab (a long whitespace) being the most
important:
text3 <- "Text...\n\tWith indented text on a new line!"
To print your string in the Console in a way that shows special characters instead of
their escape character-versions, use the function cat:
cat(text3)
You can also use cat to print the string to a text file…
cat(text3, file = "new_findings.txt")
By default, cat places a single white space between the two strings, so that "This is
the beginning of a sentence" and "and this is the end." are concatenated
to "This is the beginning of a sentence and this is the end.". You can
change that using the sep argument in cat. You can also add as many strings as
you like as input:
cat(first, second, sep = "; ")
cat(first, second, sep = "\n")
cat(first, second, sep = "")
cat(first, second, "\n", "And this is another sentence.")
At other times, you want to concatenate two or more strings without printing them.
You can then use paste in exactly the same way as you’d use cat, the exception
being that paste returns a string instead of printing it.
my_sentence <- paste(first, second, sep = "; ")
my_novel <- paste(first, second, "\n",
"And this is another sentence.")
# View results:
my_sentence
my_novel
cat(my_novel)
Finally, if you wish to create a number of similar strings based on information from
other variables, you can use sprintf, which allows you to write a string using %s as
a placeholder for the values that should be pulled from other variables:
names <- c("Irma", "Bea", "Lisa")
ages <- c(5, 59, 36)
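The sprintf call itself is elided above; as a sketch, the placeholders could be filled
from these vectors like this (the exact message is our own illustration):

```r
names <- c("Irma", "Bea", "Lisa")
ages <- c(5, 59, 36)
# %s is replaced by each name, %.0f by each age with no decimals;
# sprintf is vectorised, so we get one string per person:
sprintf("%s is %.0f years old.", names, ages)
```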
There are many more uses of sprintf (we’ve already seen some in Section 5.3.5),
but this is enough for now.
If you only wish to change the case of some particular element in your string, you
can use substr, which allows you to access substrings:
months <- c("january", "february", "march", "aripl")
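The substr calls are elided in this excerpt; as a sketch, substr can both extract a
substring and, when used on the left-hand side of an assignment, replace one - here
capitalising the first letter of each month name:

```r
months <- c("january", "february", "march", "aripl")
substr(months, 1, 1)  # extract the first character of each string
# Replacement form: overwrite the first character with its upper-case version
substr(months, 1, 1) <- toupper(substr(months, 1, 1))
months  # "January" "February" "March" "Aripl"
```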
convert them to numeric, or find all strings that contain an email address and remove
said addresses (for censoring purposes, say). Regular expressions are incredibly useful,
but can be daunting. Not everyone will need them, and if this all seems a bit too
much to you, you can safely skip this section, or just skim through it, and return to
it at a later point.
To illustrate the use of regular expressions we will use a sheet from the
projects-email.xlsx file from the book’s web page. In Exercise 3.9, you explored
the second sheet in this file, but here we’ll use the third instead. Set
file_path to the path to the file, and then run the following code to import the
data:
library(openxlsx)
contacts <- read.xlsx(file_path, sheet = 3)
str(contacts)
There are now three variables in contacts. We’ll primarily be concerned with the
third one: Address. Some people have email addresses attached to them, others have
postal addresses and some have no address at all:
contacts$Address
You can find loads of guides on regular expressions online, but few of them are easy to
use with R, the reason being that regular expressions in R sometimes require escape
characters that aren’t needed in some other programming languages. In this section
we’ll take a look at regular expressions, as they are written in R.
The basic building blocks of regular expressions are patterns consisting of one or
more characters. If, for instance, we wish to find all occurrences of the letter y in a
vector of strings, the regular expression describing that “pattern” is simply "y". The
functions used to find occurrences of patterns are called grep and grepl. They differ
only in the output they return: grep returns the indices of the strings containing
the pattern, and grepl returns a logical vector with TRUE at indices matching the
patterns and FALSE at other indices.
To find all addresses containing a lowercase y, we use grep and grepl as follows:
grep("y", contacts$Address)
grepl("y", contacts$Address)
Note how both outputs contain the same information presented in different ways.
In the same way, we can look for words or substrings. For instance, we can find all
addresses containing the string "Edin":
grep("Edin", contacts$Address)
grepl("Edin", contacts$Address)
Similarly, we can also look for special characters. Perhaps we can find all email
addresses by searching for the @ symbol?
Interestingly, this includes two rows that aren’t email addresses. To separate the
email addresses from the other rows, we’ll need a more complicated regular expression,
describing the pattern of an email address in more general terms. Here are four
examples of regular expressions that’ll do the trick:
grep(".+@.+[.].+", contacts$Address)
grep(".+@.+\\..+", contacts$Address)
grep("[[:graph:]]+@[[:graph:]]+[.][[:alpha:]]+", contacts$Address)
grep("[[:alnum:]._-]+@[[:alnum:]._-]+[.][[:alpha:]]+",
contacts$Address)
To try to wrap our head around what these mean we’ll have a look at the building
blocks of regular expressions. These are:
• Patterns describing a single character.
• Patterns describing a class of characters, e.g. letters or numbers.
• Repetition quantifiers describing how many repetitions of a pattern to look for.
• Other operators.
We’ve already looked at single character expressions, as well as the multi-character
expression "Edin" which simply is a combination of four single-character expressions.
Patterns describing classes of characters, e.g. characters with certain properties, are
denoted by brackets [] (for manually defined classes) or double brackets [[]] (for
predefined classes). One example of the latter is "[[:digit:]]", which is a pattern
that matches all digits: 0 1 2 3 4 5 6 7 8 9. Let’s use it to find all addresses
containing a number:
grep("[[:digit:]]", contacts$Address)
contacts$Address[grep("[[:digit:]]", contacts$Address)]
All of these patterns can be combined with patterns describing a single character:
• gr[ea]y matches grey and gray (but not greay!),
• b[^o]g matches bag, beg, and similar strings, but not bog,
• [.]com matches .com.
When using the patterns above, you only look for a single occurrence of the pattern.
Sometimes you may want a pattern like a word of 2-4 letters or any number of digits
in a row. To create these, you add repetition patterns to your regular expression:
• ? means that the preceding pattern is matched at most once, i.e. 0 or 1 time,
• * means that the preceding pattern is matched 0 or more times,
• + means that the preceding pattern is matched at least once, i.e. 1 time or more,
• {n} means that the preceding pattern is matched exactly n times,
• {n,} means that the preceding pattern is matched at least n times, i.e. n times
or more,
• {n,m} means that the preceding pattern is matched at least n times but not
more than m times.
Here are some examples of how repetition patterns can be used:
# There are multiple ways of finding strings containing two n's
# in a row:
contacts$Address[grep("nn", contacts$Address)]
contacts$Address[grep("n{2}", contacts$Address)]
contacts$Address[grep("[[:upper:]][[:lower:]]+", contacts$Address)]
Finally, there are some other operators that you can use to create even more complex
patterns:
• | alteration, picks one of multiple possible patterns. For example, ab|bc
matches ab or bc.
• () parentheses are used to denote a subset of an expression that should be evalu-
ated separately. For example, colo|our matches colo or our while col(o|ou)r
matches color or colour.
• ^, when used outside of brackets [], means that the match should be found at
the start of the string. For example, ^a matches strings beginning with a, but
not "dad".
• $ means that the match should be found at the end of the string. For example,
a$ matches strings ending with a, but not "dad".
• \\ escape character that can be used to match special characters like ., ^ and
$ (\\., \\^, \\$).
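As a small illustration of the anchors and alternation, reusing the "dad" example
from the list above:

```r
x <- c("dad", "abba", "banana")
grepl("^a", x)      # FALSE  TRUE FALSE  (must start with a)
grepl("a$", x)      # FALSE  TRUE  TRUE  (must end with a)
grepl("b(b|a)", x)  # bb or ba: FALSE TRUE TRUE
```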
This may seem like a lot (and it is!), but there are in fact many more possibilities
when working with regular expressions. For the sake of brevity, we’ll leave it at this
for now though.
Let’s return to those email addresses. We saw four regular expressions that could
be used to find them:
grep(".+@.+[.].+", contacts$Address)
grep(".+@.+\\..+", contacts$Address)
grep("[[:graph:]]+@[[:graph:]]+[.][[:alpha:]]+", contacts$Address)
grep("[[:alnum:]._-]+@[[:alnum:]._-]+[.][[:alpha:]]+",
contacts$Address)
The first two of these both specify the same pattern: any number of any characters,
followed by an @, followed by any number of any characters, followed by a period .,
followed by any number of characters. This will match email addresses, but would
also match strings like "?=)(/x@!.a??", which isn’t a valid email address. In this
case, that’s not a big issue, as our goal was to find addresses that looked like email
addresses, and not to verify that the addresses were valid.
The third alternative has a slightly different pattern: any number of letters, digits,
and punctuation characters, followed by an @, followed by any number of letters,
digits, and punctuation characters, followed by a period ., followed by any number of
letters. This too would match "?=)(/x@!.a??" as it allows punctuation characters
that don’t usually occur in email addresses. The fourth alternative, however, won’t
match "?=)(/x@!.a??" as it only allows letters, digits and the symbols ., _ and -
in the name and domain name of the address.
5.5.4 Substitution
An important use of regular expressions is in substitutions, where the parts of strings
that match the pattern in the expression are replaced by another string. There are
two email addresses in our data that contain (a) instead of @:
contacts$Address[grep("[(]a[])]", contacts$Address)]
If we wish to replace the (a) by @, we can do so using sub and gsub. The former
replaces only the first occurrence of the pattern in the input vector, whereas the
latter replaces all occurrences.
contacts$Address[grep("[(]a[])]", contacts$Address)]
sub("[(]a[])]", "@", contacts$Address) # Replace first occurrence
gsub("[(]a[])]", "@", contacts$Address) # Replace all occurrences
emails_split is a list. In this case, it seems convenient to convert the split strings
into a matrix using unlist and matrix (you may want to have a quick look at
Exercise 3.3 to re-familiarise yourself with matrix):
emails_split <- unlist(emails_split)
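The creation of emails_split and emails_matrix is partly elided above; a
self-contained sketch of the idea, using hypothetical addresses, could look like this:

```r
emails <- c("anna@example.com", "bob@example.org")  # hypothetical data
emails_split <- strsplit(emails, "@")  # list of (username, domain) pairs
# Each address contributes two elements, so the unlisted vector can be
# reshaped into a two-column matrix, one row per address:
emails_matrix <- matrix(unlist(emails_split), ncol = 2, byrow = TRUE)
emails_matrix[, 1]  # usernames: "anna" "bob"
emails_matrix[, 2]  # domains:   "example.com" "example.org"
```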
# Extract usernames:
emails_matrix[,1]
Similarly, when working with data stored in data frames, it is sometimes desirable
to split a column containing strings into two columns. Some convenience functions
for this are discussed in Section 5.11.3.
Exercise 5.7. Download the file handkerchief.csv from the book’s web page. It
contains a short list of prices of Italian handkerchiefs from the 1769 publication Prices
in those branches of the weaving manufactory, called, the black branch, and, the fancy
branch. Load the data in a data frame in R and then do the following:
1. Read the documentation for the function nchar. What does it do? Apply it to
the Italian.handkerchief column of your data frame.
2. Use grep to find out how many rows of the Italian.handkerchief column
that contain numbers.
3. Find a way to extract the prices in shillings (S) and pence (D) from the Price
column, storing these in two new numeric variables in your data frame.
Exercise 5.8. Download the oslo-biomarkers.xlsx data from the book’s web
page. It contains data from a medical study about patients with disc herniations,
performed at the Oslo University Hospital, Ullevål (this is a modified6 version of the
data analysed by Moen et al. (2016)). Blood samples were collected from a number of
patients with disc herniations at three time points: 0 weeks (first visit at the hospital),
6 For patient confidentiality purposes.
Exercise 5.9. What patterns do the following regular expressions describe? Apply
them to the Address vector of the contacts data to check that you interpreted them
correctly.
1. "$g"
2. "^[^[[:digit:]]"
3. "a(s|l)"
4. "[[:lower:]]+[.][[:lower:]]+"
Exercise 5.10. Write code that, given a string, creates a vector containing all words
from the string, with one word in each element and no punctuation marks. Apply it
to the following string to check that it works:
the date manually. To complicate things further, what formats work automatically
will depend on your system settings. Consequently, the safest option is always to
specify the format of your dates, to make sure that the code still will run if you at
some point have to execute it on a different machine. To help describe date formats,
R has a number of tokens to describe days, months and years:
• %d - day of the month as a number (01-31).
• %m - month of the year as a number (01-12).
• %y - year without century (00-99).
• %Y - year with century (e.g. 2020).
Here are some examples of date formats, all describing 1 April 2020 - try them both
with and without specifying the format to see what happens:
as.Date("2020-04-01")
as.Date("2020-04-01", format = "%Y-%m-%d")
as.Date("4/1/20")
as.Date("4/1/20", format = "%m/%d/%y")
If the date includes month or weekday names, you can use tokens to describe that as
well:
• %b - abbreviated month name, e.g. Jan, Feb.
• %B - full month name, e.g. January, February.
• %a - abbreviated weekday, e.g. Mon, Tue.
• %A - full weekday, e.g. Monday, Tuesday.
Things become a little more complicated now though, because R will interpret the
names as if they were written in the language set in your locale, which contains a
number of settings related to your language and region. To find out what language is
in your locale, you can use:
Sys.getlocale("LC_TIME")
I’m writing this on a machine with Swedish locale settings (my output from the above
code chunk is "sv_SE.UTF-8"). The Swedish word for Wednesday is onsdag, and
therefore the following code doesn’t work on my machine:
as.Date("Wednesday 1 April 2020", format = "%A %d %B %Y")
You may at times need to make similar translations of dates. One option is to use
gsub to translate the names of months and weekdays into the correct language (see
Section 5.5.4). Alternatively, you can change the locale settings. On most systems,
the following setting will allow you to read English months and days properly:
Sys.setlocale("LC_TIME", "C")
The locale settings will revert to the defaults the next time you start R.
Conversely, you may want to extract a substring from a Date object, for instance
the day of the month. This can be done using strftime, using the same tokens as
above. Here are some examples, including one with the token %j, which can be used
to extract the day of the year:
dates <- as.Date(c("2020-04-01", "2021-01-29", "2021-02-22"),
format = "%Y-%m-%d")
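The strftime calls themselves are elided above; applied to the dates vector, they
could look like this:

```r
dates <- as.Date(c("2020-04-01", "2021-01-29", "2021-02-22"),
                 format = "%Y-%m-%d")
strftime(dates, format = "%d")  # day of month: "01" "29" "22"
strftime(dates, format = "%B")  # full month name (locale-dependent)
strftime(dates, format = "%j")  # day of year: "092" "029" "053"
```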
Should you need to, you can of course convert these objects from character to
numeric using as.numeric.
For a complete list of tokens that can be used to describe date patterns, see
?strftime.
1. Apply the functions weekdays, months and quarters to the vector. What do
they do?
2. Use the julian function to find out how many days passed between 1970-01-01
and the dates in dates.
1. What happens if you convert the three variables to Date objects using as.Date
without specifying the date format?
2. Convert time1 to a Date object and add 1 to it. What is the result?
3. Convert time3 and time1 to Date objects and subtract them. What is the
result?
4. Convert time2 and time1 to Date objects and subtract them. What is the
result?
5. What happens if you convert the three variables to POSIXct date and time
objects using as.POSIXct without specifying the date format?
6. Convert time3 and time1 to POSIXct objects and subtract them. What is the
result?
7. Convert time2 and time1 to POSIXct objects and subtract them. What is the
result?
8. Use the difftime function to repeat the calculation in task 6, but with the
result presented in hours.
Exercise 5.13. In some fields, e.g. economics, data is often aggregated on a quarter-
year level, as in these examples:
To convert qvec1 to a Date object, we can use as.yearqtr from the zoo package in
two ways:
library(zoo)
as.Date(as.yearqtr(qvec1, format = "%Y Q%q"))
as.Date(as.yearqtr(qvec1, format = "%Y Q%q"), frac = 1)
1. Describe the results. What is the difference? Which do you think is preferable?
2. Convert qvec2 and qvec3 to Date objects in the same way. Make sure that
you get the format argument, which describes the date format, right.
When you hover over the points, the formatting of the dates looks odd. We’d like to have
proper dates instead. In order to do so, we’ll use seq.Date to create a sequence of
dates, ranging from 2014-01-01 to 2014-12-31:
## Create a data frame with better formatted dates
elecdaily2 <- as.data.frame(elecdaily)
elecdaily2$Date <- seq.Date(as.Date("2014-01-01"),
as.Date("2014-12-31"),
by = "day")
seq.Date can be used analogously to create sequences where there is a week, month,
quarter or year between each element of the sequence, by changing the by argument.
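For instance (these particular sequences are our own illustrations):

```r
# First day of each month in 2014:
seq.Date(as.Date("2014-01-01"), as.Date("2014-12-01"), by = "month")
# The same start date, stepping one week at a time, five elements:
seq.Date(as.Date("2014-01-01"), by = "week", length.out = 5)
# Quarterly steps:
seq.Date(as.Date("2014-01-01"), as.Date("2014-12-31"), by = "quarter")
```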
Exercise 5.14. Return to the plot from Exercise 4.12, which was created using
library(fpp2)
autoplot(elecdaily, facets = TRUE)
You’ll notice that the x-axis shows week numbers rather than dates (the dates in
the elecdaily time series object are formatted as weeks with decimal numbers).
Make a time series plot of the Demand variable with dates (2014-01-01 to 2014-12-31)
along the x-axis (your solution is likely to rely on standard R techniques rather than
autoplot).
Exercise 5.15. Create an interactive version of the time series plot of the a10 anti-diabetic
drug sales data, as in Section 4.6.7. Make sure that the dates are correctly displayed.
There is almost always more than one way to solve a problem in R. We now know
how to access vectors and elements in data frames, e.g. to compute means. We also
know how to modify and add variables to data frames. Indeed, you can do just about
anything using the functions in base R. Sometimes, however, those solutions become
rather cumbersome, as they can require a fair amount of programming and verbose
code. data.table and the tidyverse packages offer simpler solutions and speed up
the workflow for these types of problems. Both can be used for the same tasks. You
can learn one of them or both. data.table has two main selling points. First, its
syntax is often more concise and arguably more consistent than that of dplyr (it is
in essence an extension of the [i, j] notation that we have already used for data
frames). Second, it is fast and
memory-efficient, which makes a huge difference if you are working with big data
(you’ll see this for yourself in Section 6.6). On the other hand, many people prefer
the syntax in dplyr and tidyr, which lends itself exceptionally well to use with
pipes. If you work with small or medium-sized datasets, the difference in performance
between the two packages is negligible. dplyr is also much better suited for working
directly with databases, which is a huge selling point if your data already is in a
database8 .
In the sections below, we will see how to perform different operations using both
data.table and the tidyverse packages. Perhaps you already know which one you
want to use (data.table if performance is important to you, dplyr+tidyr if
you like to use pipes or will be doing a lot of work with databases). If not, you can
use these examples to guide your choice. Or not choose at all! I regularly use both
packages myself, to harness the strength of both. There is no harm in knowing how
to use both a hammer and a screwdriver.
8 There is also a package called dtplyr, which allows you to use the fast functions from data.table
with dplyr syntax. It is useful if you are working with big data, already know dplyr and don’t want
to learn data.table. If that isn’t an accurate description of you, you can safely ignore dtplyr for
now.
CHAPTER 5. DEALING WITH MESSY DATA
aq <- as.data.table(airquality)
When importing data from csv files, you can import them as data.table objects in-
stead of data.frame objects by replacing read.csv with fread from the data.table
package. The latter function also has the benefit of being substantially faster when
importing large (several MB’s) csv files.
Note that, similar to what we saw in Section 5.2.1, variables in imported data frames
can have names that would not be allowed in base R, for instance including forbidden
characters like -. data.table and dplyr allow you to work with such variables by
wrapping their names in backticks: referring to the illegally named variable as
illegal-character-name won’t work, but `illegal-character-name` will.
Note that when using data.table, there is no explicit assignment. We don't use
<- to assign the new data frame to aq - instead aq is modified in place, because
data.table operates on objects by reference.
This means that you have to be a little bit careful, so that you don’t inadvertently
make changes to your data when playing around with it.
In this case, using data.table or dplyr doesn’t make anything easier. Where these
packages really shine is when we attempt more complicated operations. Before that
though, let’s look at a few more simple examples.
Exercise 5.16. Load the VAS pain data vas.csv from Exercise 3.8. Then do the
following:
1. Remove the columns X and X.1.
2. Add a dummy variable called highVAS that indicates whether a patient’s VAS
is 7 or greater on any given day.
Suppose that we want to change the levels’ names to abbreviated versions: Nvr, Occ,
Reg and Hvy. Here’s how to do this:
With data.table:
new_names = c("Nvr", "Occ", "Reg", "Hvy")
smoke3[.(smoke2 = levels(smoke2), to = new_names),
on = "smoke2",
smoke2 := i.to]
smoke3[, smoke2 := droplevels(smoke2)]
With dplyr:
smoke3 %>% mutate(smoke2 = recode(smoke2,
"Never" = "Nvr",
"Occasionally" = "Occ",
"Regularly" = "Reg",
"Heavy" = "Hvy"))
Next, we can combine the Occ, Reg and Hvy levels into a single level, called Yes:
With data.table:
smoke3[.(smoke2 = c("Occ", "Reg", "Hvy"), to = "Yes"),
on = "smoke2",
smoke2 := i.to]
With dplyr:
smoke3 %>% mutate(smoke2 = recode(smoke2,
"Occ" = "Yes",
"Reg" = "Yes",
"Hvy" = "Yes"))
Exercise 5.17. In Exercise 3.7 you learned how to create a factor variable from a
numeric variable using cut. Return to your solution (or the solution at the back of
the book) and do the following using data.table and/or dplyr:
1. Change the category names to Mild, Moderate and Hot.
2. Combine Moderate and Hot into a single level named Hot.
To begin with, let’s load the packages again (in case you don’t already have them
loaded), and let’s recreate the aq data.table, which we made a bit of a mess of by
removing some important columns in the previous section:
library(data.table)
library(dplyr)
aq <- data.table(airquality)
Now, let’s compute the mean temperature for each month. Both data.table and
dplyr will return a data frame with the results. In the data.table approach,
assigning a name to the summary statistic (the mean, in this case) is optional; in
dplyr it is required.
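A sketch of what the two approaches can look like (the original listing isn't part of this excerpt):

```r
library(data.table)
library(dplyr)

aq <- as.data.table(airquality)

# data.table - naming the statistic (meanTemp) is optional:
aq[, .(meanTemp = mean(Temp)), by = Month]

# dplyr - the statistic must be named:
aq %>% group_by(Month) %>%
    summarise(meanTemp = mean(Temp))
```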
You’ll recall that if we apply mean to a vector containing NA values, it will return NA:
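For instance, with the Ozone column of airquality:

```r
# Ozone contains missing values, so the plain mean is NA:
mean(airquality$Ozone)
```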
In order to avoid this, we can pass the argument na.rm = TRUE to mean, just as we
would in other contexts. To compute the mean ozone concentration for each month,
ignoring NA values:
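A sketch of both approaches:

```r
library(data.table)
library(dplyr)

aq <- as.data.table(airquality)

# data.table:
aq[, .(meanOzone = mean(Ozone, na.rm = TRUE)), by = Month]

# dplyr:
aq %>% group_by(Month) %>%
    summarise(meanOzone = mean(Ozone, na.rm = TRUE))
```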
The syntax for computing multiple grouped statistics is similar. We compute both
the mean temperature and the correlation for each month:
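For instance, the mean temperature and the correlation between temperature and wind speed (a sketch; the variables being correlated aren't specified in this excerpt, so Temp and Wind are assumed):

```r
library(data.table)
library(dplyr)

aq <- as.data.table(airquality)

# data.table:
aq[, .(meanTemp = mean(Temp),
       TWcor = cor(Temp, Wind)), by = Month]

# dplyr:
aq %>% group_by(Month) %>%
    summarise(meanTemp = mean(Temp),
              TWcor = cor(Temp, Wind))
```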
At times, you’ll want to compute summaries for all variables that share some property.
As an example, you may want to compute the mean of all numeric variables in your
data frame. In dplyr there is a convenience function called across that can be used
for this: summarise(across(where(is.numeric), mean)) will compute the mean
of all numeric variables. In data.table, we can instead utilise the apply family of
functions from base R, which we'll study in Section 6.5. To compute the mean of all
numeric variables:
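A sketch of both:

```r
library(data.table)
library(dplyr)

aq <- as.data.table(airquality)

# data.table - lapply applies mean to every column in .SD:
aq[, lapply(.SD, mean, na.rm = TRUE)]

# dplyr:
aq %>% summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```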
Both packages have special functions for counting the number of observations in
groups: .N for data.table and n() for dplyr. For instance, we can count the number
of days in each month:
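A sketch:

```r
library(data.table)
library(dplyr)

aq <- as.data.table(airquality)

aq[, .N, by = Month]                             # data.table
aq %>% group_by(Month) %>% summarise(days = n()) # dplyr
```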
Similarly, you can count the number of unique values of variables using uniqueN for
data.table and n_distinct for dplyr:
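For instance, the number of unique wind speed measurements per month (a sketch):

```r
library(data.table)
library(dplyr)

aq <- as.data.table(airquality)

aq[, uniqueN(Wind), by = Month]                            # data.table
aq %>% group_by(Month) %>% summarise(n = n_distinct(Wind)) # dplyr
```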
Exercise 5.18. Load the VAS pain data vas.csv from Exercise 3.8. Then do the
following using data.table and/or dplyr:
1. Compute the mean VAS for each patient.
2. Compute the lowest and highest VAS recorded for each patient.
3. Compute the number of high-VAS days, defined as days where the VAS
was at least 7, for each patient.
To fill the missing values with the last non-missing entry, we can now use nafill or
fill as follows:
To instead fill the missing values with the next non-missing entry:
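Sketches of both operations, applied to the Ozone column (new columns are used in the data.table calls so that the original values are kept):

```r
library(data.table)
library(tidyr)

aq <- as.data.table(airquality)

# Last non-missing entry ("last observation carried forward"):
aq[, Ozone_locf := nafill(Ozone, type = "locf")]  # data.table
aq %>% fill(Ozone, .direction = "down")           # tidyr

# Next non-missing entry:
aq[, Ozone_nocb := nafill(Ozone, type = "nocb")]  # data.table
aq %>% fill(Ozone, .direction = "up")             # tidyr
```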
Exercise 5.20. Load the VAS pain data vas.csv from Exercise 3.8. Fill the missing
values in the Visit column with the last non-missing value.
To fill in the missing values with the last non-missing entry (Section 5.7.8) and then
count the number of days in each month (Section 5.7.7), we can do as follows.
With data.table:
aq[, Month := nafill(Month, "locf")][, .N, Month]
library(data.table)
library(dplyr)
aq <- data.table(airquality)
To select rows 3 to 5:
A related task is selecting the rows with the five highest temperatures; note that
such a selection can return more than 5 rows because of ties.
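Sketches of both selections (Temp is assumed as the sorting variable in the second one):

```r
library(data.table)
library(dplyr)

aq <- as.data.table(airquality)

# Rows 3 to 5:
aq[3:5, ]          # data.table
aq %>% slice(3:5)  # dplyr

# The five rows with the highest temperatures (ties can yield
# more than 5 rows):
aq %>% slice_max(Temp, n = 5)
```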
To remove duplicate rows:
To remove rows with missing data (NA values) in at least one variable:
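Sketches of both operations:

```r
library(data.table)
library(dplyr)
library(tidyr)

aq <- as.data.table(airquality)

# Remove duplicate rows:
unique(aq)         # data.table
aq %>% distinct()  # dplyr

# Remove rows with NA values in at least one variable:
na.omit(aq)        # base R, works for both
aq %>% drop_na()   # tidyr
```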
At times, you want to filter your data based on whether the observations are con-
nected to observations in a different dataset. Such filters are known as semijoins and
antijoins, and are discussed in Section 5.12.4.
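A sketch of sampling rows at random:

```r
library(data.table)
library(dplyr)

aq <- as.data.table(airquality)

aq[sample(.N, 5)]           # data.table: 5 rows drawn at random
aq %>% slice_sample(n = 5)  # dplyr
```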
If you run the code multiple times, you will get different results each time. See
Section 7.1 for more on random sampling and how it can be used.
Exercise 5.21. Download the ucdp-onesided-191.csv data file from the book’s
web page. It contains data about international attacks on civilians by governments
and formally organised armed groups during the period 1989-2018, collected as part
of the Uppsala Conflict Data Program (Eck & Hultman, 2007; Petterson et al., 2019).
Among other things, it contains information about the actor (attacker), the fatality
rate, and attack location. Load the data and check its structure.
1. Filter the rows so that only conflicts that took place in Colombia are retained.
How many different actors were responsible for attacks in Colombia during the
period?
2. Using the best_fatality_estimate column to estimate fatalities, calculate
the number of worldwide fatalities caused by government attacks on civilians
during 1989-2018.
Exercise 5.22. Load the oslo-biomarkers.xlsx data from Exercise 5.8. Use
data.table and/or dplyr to do the following:
1. Select only the measurements from blood samples taken at 12 months.
2. Select only the measurements from the patient with ID number 6.
aq <- data.table(airquality)
To select all numeric variables (which for the aq data is all variables!):
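A sketch (the .SDcols-as-function form requires a reasonably recent version of data.table):

```r
library(data.table)
library(dplyr)

aq <- as.data.table(airquality)

aq[, .SD, .SDcols = is.numeric]   # data.table
aq %>% select(where(is.numeric))  # dplyr
```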
# data.frame:
aq <- as.data.frame(airquality)
str(aq[,2])
# data.table:
aq <- as.data.table(airquality)
str(aq[,2])
# tibble:
aq <- as_tibble(airquality)
str(aq[,2])
As you can see, aq[, 2] returns a vector, a data table or a tibble, depending on what
type of object aq is. Unfortunately, this approach is used by several R packages, and
can cause problems, because it may return the wrong type of object.
A better approach is to use aq[[2]], which works the same for data frames, data
tables and tibbles, returning a vector:
# data.frame:
aq <- as.data.frame(airquality)
str(aq[[2]])
# data.table:
aq <- as.data.table(airquality)
str(aq[[2]])
# tibble:
aq <- as_tibble(airquality)
str(aq[[2]])
5.10 Sorting
Sometimes you don’t want to filter rows, but rearrange their order according to their
values for some variable. Similarly, you may want to change the order of the columns
in your data. I often do this after merging data from different tables (as we’ll do in
Section 5.12). This is often useful for presentation purposes, but can at times also
aid in analyses.
aq <- data.table(airquality)
First of all, if you’re just looking to sort a single vector, rather than an entire data
frame, the quickest way to do so is to use sort:
sort(aq$Wind)
sort(aq$Wind, decreasing = TRUE)
sort(c("C", "B", "A", "D"))
If you’re looking to sort an entire data frame by one or more variables, you need to
move beyond sort. To sort rows by Wind (ascending order):
To sort rows, first by Temp (ascending order) and then by Wind (descending order):
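Sketches of both sorts:

```r
library(data.table)
library(dplyr)

aq <- as.data.table(airquality)

# By Wind, ascending:
aq[order(Wind)]       # data.table
aq %>% arrange(Wind)  # dplyr

# By Temp ascending, then Wind descending:
aq[order(Temp, -Wind)]
aq %>% arrange(Temp, desc(Wind))
```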
Exercise 5.24. Load the oslo-biomarkers.xlsx data from Exercise 5.8. Note that
it is not ordered in a natural way. Reorder it by patient ID instead.
Each row contains data for one country and one year, meaning that the data for
each country is spread over 12 rows. This is known as long data or long format. As
another option, we could store it in wide format, where the data is formatted so that
all observations corresponding to a country are stored on the same row:
Country Continent lifeExp1952 lifeExp1957 lifeExp1962 ...
Afghanistan Asia 28.8 30.2 32.0 ...
Albania Europe 55.2 59.3 64.8 ...
Sometimes it makes sense to spread an observation over multiple rows (long format),
and sometimes it makes more sense to spread a variable across multiple columns
(wide format). Some analyses require long data, whereas others require wide data.
9 You may need to install the package first, using install.packages("gapminder").
And if you’re unlucky, data will arrive in the wrong format for the analysis you need
to do. In this section, you’ll learn how to transform your data from long to wide,
and back again.
gm <- as.data.table(gapminder)
With tidyr:
gm %>% pivot_wider(id_cols = c(country, continent),
names_from = year,
values_from =
c(pop, lifeExp, gdpPercap)) -> gmw
With tidyr:
gmw %>% pivot_longer(names(gmw)[2:37],
names_to = "variable",
values_to = "value") -> gm
The resulting data frames are perhaps too long, with each variable (pop, lifeExp
and gdpPercap) being put on a different row. To make it look like the original
dataset, we must first split the variable variable (into a column with variable names
and column with years) and then make the data frame a little wider again. That is
the topic of the next section.
With tidyr:
gm %>% separate(variable,
into = c("variable", "year"),
sep = "_") %>%
pivot_wider(id_cols = c(country, continent, year),
names_from = variable,
values_from = value) -> gm
Finally, let's see how to merge the Day and Month columns into a new Date column.
Let's re-create the aq data.table object one last time:
library(data.table)
library(tidyr)
aq <- as.data.table(airquality)
If we wanted to create a Date column containing the year (1973), month and day for
each observation, we could use paste and as.Date:
as.Date(paste(1973, aq$Month, aq$Day, sep = "-"))
The natural data.table approach is just this, whereas tidyr offers a function called
unite to merge columns, which can be combined with mutate to paste the year to
the date. To merge the Month and Day columns with a year and convert it to a Date
object:
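A sketch of both routes:

```r
library(data.table)
library(dplyr)
library(tidyr)

aq <- as.data.table(airquality)

# data.table:
aq[, Date := as.Date(paste(1973, Month, Day, sep = "-"))]

# tidyr + dplyr, starting from the original data frame:
airquality %>%
    unite("Date", Month, Day, sep = "-") %>%
    mutate(Date = as.Date(paste(1973, Date, sep = "-")))
```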
Exercise 5.25. Load the oslo-biomarkers.xlsx data from Exercise 5.8. Then do
the following using data.table and/or dplyr/tidyr:
1. Split the PatientID.timepoint column in two parts: one with the patient ID
and one with the timepoint.
2. Sort the table by patient ID, in numeric order.
3. Reformat the data from long to wide, keeping only the IL-8 and VEGF-A
measurements.
Save the resulting data frame - you will need it again in Exercise 5.26!
str(rev_data)
View(rev_data)
str(weather_data)
View(weather_data)
5.12.1 Binds
The simplest types of merges are binds, which can be used when you have two tables
where either the rows or the columns match each other exactly. To illustrate what this
may look like, we will use data.table/dplyr to create subsets of the business revenue
data. First, we format the tables as data.table objects and the DATE columns as
Date objects:
library(data.table)
library(dplyr)
Next, we wish to extract three subsets: the revenue in January (rev_jan), the
revenue in February (rev_feb) and the weather in January (weather_jan).
With dplyr:
rev_data %>% filter(between(DATE,
as.Date("2020-01-01"),
as.Date("2020-01-31"))
) -> rev_jan
rev_data %>% filter(between(DATE,
as.Date("2020-02-01"),
as.Date("2020-02-29"))
) -> rev_feb
weather_data %>% filter(between(
DATE,
as.Date("2020-01-01"),
as.Date("2020-01-31"))
) -> weather_jan
The rows in rev_jan correspond one-to-one to the rows in weather_jan, with both
tables being sorted in exactly the same way. We could therefore bind their columns,
i.e. add the columns of weather_jan to rev_jan.
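A sketch with toy stand-ins for the two tables (the real csv files aren't part of this excerpt):

```r
library(dplyr)

# Hypothetical miniature versions of rev_jan and weather_jan:
rev_jan <- data.frame(DATE = as.Date(c("2020-01-01", "2020-01-02")),
                      REVENUE = c(100, 200))
weather_jan <- data.frame(TEMPERATURE = c(-2, 1))

cbind(rev_jan, weather_jan)      # base R
bind_cols(rev_jan, weather_jan)  # dplyr
```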
rev_jan and rev_feb contain the same columns. We could therefore bind their rows,
i.e. add the rows of rev_feb to rev_jan. To perform these operations, we can use
either base R or dplyr:
# Base R - join rows of datasets that have the same columns:
rbind(rev_jan, rev_feb)

# dplyr - join rows of datasets that have the same columns:
bind_rows(rev_jan, rev_feb)
Binding the columns of rev_data and weather_data won't work as intended,
because the two tables have different numbers of rows. If we attempt a bind, R will
produce a merged table by recycling the first few rows from rev_data - note that
the two DATE columns aren’t properly aligned:
tail(cbind(rev_data, weather_data))
Clearly, this is not the desired output! We need a way to connect the rows in
rev_data with the right rows in weather_data. Put differently, we need something
that allows us to connect the observations in one table to those in another. Variables
used to connect tables are known as keys, and must in some way uniquely identify
observations. In this case the DATE column gives us the key - each observation is
uniquely determined by its DATE. So to combine the two tables, we can combine
rows from rev_data with the rows from weather_data that have the same DATE
values. In the following sections, we’ll look at different ways of merging tables using
data.table and dplyr.
But first, a word of warning: finding the right keys for merging tables is not always
straightforward. For a more complex example, consider the nycflights13 package,
which contains five separate but connected datasets:
library(nycflights13)
?airlines # Names and carrier codes of airlines.
?airports # Information about airports.
?flights # Departure and arrival times and delay information for
# flights.
?planes # Information about planes.
?weather # Hourly meteorological data for airports.
Perhaps you want to include weather information with the flight data, to study how
weather affects delays. Or perhaps you wish to include information about the longi-
tude and latitude of airports (from airports) in the weather dataset. In airports,
each observation can be uniquely identified in three different ways: either by its
airport code faa, its name name or its latitude and longitude, lat and lon:
?airports
head(airports)
If we want to use either of these options as a key when merging the airports data
with another table, that table should also contain the same key.
The weather data requires no less than four variables to identify each observation:
origin, month, day and hour:
?weather
head(weather)
It is not perfectly clear from the documentation, but the origin variable is actually
the FAA airport code of the airport corresponding to the weather measurements. If
we wish to add longitude and latitude to the weather data, we could therefore use
faa from airports as a key.
Remember that revenue data for 2020-03-01 is missing, and weather data for 2020-
02-05, 2020-02-06, 2020-03-10, and 2020-03-29 are missing. This means that out of
the 91 days in the period, only 86 have complete data. If we perform an inner join,
the resulting table should therefore have 86 rows.
To perform an inner join of rev_data and weather_data using DATE as key:
A left join will retain the 90 dates present in rev_data. To perform a(n outer) left
join of rev_data and weather_data using DATE as key:
A right join will retain the 87 dates present in weather_data. To perform a(n outer)
right join of rev_data and weather_data using DATE as key:
A full join will retain the 91 dates present in at least one of rev_data and
weather_data. To perform a(n outer) full join of rev_data and weather_data
using DATE as key:
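Sketches of all four joins, using toy stand-ins for rev_data and weather_data (the real csv files aren't part of this excerpt):

```r
library(data.table)
library(dplyr)

# Toy tables: four dates each, with a three-date overlap:
rev_data <- data.table(DATE = as.Date("2020-01-01") + 0:3,
                       REVENUE = c(100, 200, 150, 180))
weather_data <- data.table(DATE = as.Date("2020-01-02") + 0:3,
                           TEMPERATURE = c(-2, 1, 3, 0))

# Inner join - dates present in both tables:
merge(rev_data, weather_data, by = "DATE")
inner_join(rev_data, weather_data, by = "DATE")

# Left join - all dates in rev_data:
merge(rev_data, weather_data, by = "DATE", all.x = TRUE)
left_join(rev_data, weather_data, by = "DATE")

# Right join - all dates in weather_data:
merge(rev_data, weather_data, by = "DATE", all.y = TRUE)
right_join(rev_data, weather_data, by = "DATE")

# Full join - dates present in at least one table:
merge(rev_data, weather_data, by = "DATE", all = TRUE)
full_join(rev_data, weather_data, by = "DATE")
```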
The same thing can be achieved using the filtering techniques of Section 5.8, but
semijoins and antijoins are simpler to use when the filtering relies on conditions from
another table.
Suppose that we are interested in the revenue of our business for days in February
with subzero temperatures. First, we can create a table called filter_data listing
all such days:
With data.table:
rev_data$DATE <- as.Date(rev_data$DATE)
weather_data$DATE <- as.Date(weather_data$DATE)
filter_data <- weather_data[TEMPERATURE < 0 &
DATE %between%
c("2020-02-01",
"2020-02-29"),]
With dplyr:
rev_data$DATE <- as.Date(rev_data$DATE)
weather_data$DATE <- as.Date(weather_data$DATE)
weather_data %>% filter(TEMPERATURE < 0,
between(DATE,
as.Date("2020-02-01"),
as.Date("2020-02-29"))
) -> filter_data
Next, we can use a semijoin to extract the rows of rev_data corresponding to the
days of filter_data:
If instead we wanted to find all days except the days in February with subzero tem-
peratures, we could perform an antijoin:
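Sketches of both joins, again with toy stand-ins for the two tables:

```r
library(data.table)
library(dplyr)

# Hypothetical miniature versions of rev_data and filter_data:
rev_data <- data.table(DATE = as.Date("2020-02-01") + 0:4,
                       REVENUE = c(100, 200, 150, 180, 90))
filter_data <- data.table(DATE = as.Date(c("2020-02-02", "2020-02-04")),
                          TEMPERATURE = c(-2, -5))

# Semijoin - rows of rev_data with a matching DATE in filter_data:
rev_data[DATE %in% filter_data$DATE]           # data.table
semi_join(rev_data, filter_data, by = "DATE")  # dplyr

# Antijoin - rows of rev_data without a match in filter_data:
rev_data[!filter_data, on = "DATE"]            # data.table
anti_join(rev_data, filter_data, by = "DATE")  # dplyr
```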
To get hold of the data from the table, we could perhaps select all rows, copy them
and paste them into a spreadsheet software such as Excel. But it would be much
more convenient to be able to just import the table to R straight from the HTML file.
Because tables written in HTML follow specific formats, it is possible to write code
that automatically converts them to data frames in R. The rvest package contains
a number of functions for that. Let’s install it:
install.packages("rvest")
After reading the page with rvest's read_html function, the resulting object - here
called wiki - contains all the information from the page. You can have a quick look
at it by using html_text:
html_text(wiki)
That is more information than we need. To extract all tables from wiki, we can use
html_nodes:
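A minimal self-contained sketch of the html_nodes step (using an inline HTML string in place of the downloaded page; for the real page, you would pass its URL to read_html instead):

```r
library(rvest)

# A tiny stand-in for the Wikipedia page:
wiki <- read_html("<html><body><table>
    <tr><th>Year</th><th>Laureate</th></tr>
    <tr><td>1901</td><td>Sully Prudhomme</td></tr>
  </table></body></html>")

tables <- html_nodes(wiki, "table")
tables
```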
The first of these tables is the one we are looking for. To transform it to a data
frame, we use html_table as follows:
laureates <- html_table(tables[[1]], fill = TRUE)
View(laureates)
The rvest package can also be used for extracting data from more complex website
structures using the SelectorGadget tool in the web browser Chrome. This lets you
select the page elements that you wish to scrape in your browser, and helps you
create the code needed to import them to R. For an example of how to use it, run
vignette("selectorgadget").
Exercise 5.27. Scrape the table containing different keytar models from https:
//en.wikipedia.org/wiki/List_of_keytars. Perform the necessary operations to
convert the Dates column to numeric.
Removing objects with rm can be useful, for instance if you have loaded a data
frame that is no longer needed and takes up a lot of memory. If you, for some
reason, want to wipe all
your variables, you can use ls, which returns a vector containing the names of all
variables, in combination with rm:
# Use this at your own risk! This deletes all currently loaded
# variables.
# Uncomment to run:
# rm(list = ls())
Variables are automatically deleted when you exit R (unless you choose to save your
workspace). On the rare occasions where I want to wipe all variables from memory,
I usually do a restart instead of using rm.
We’ll use the fromJSON function from jsonlite to import the data:
library(jsonlite)
url <- paste("https://opendata-download-metobs.smhi.se/api/version/",
"1.0/parameter/2/station/98210/period/latest-months/",
"data.json",
sep = "")
stockholm <- fromJSON(url)
stockholm
By design, JSON files contain lists, and so stockholm is a list object. The temper-
ature data that we were looking for is (in this particular case) contained in the list
element called value:
stockholm$value
Chapter 6
R programming
The tools in Chapters 2-5 will allow you to manipulate, summarise and visualise your
data in all sorts of ways. But what if you need to compute some statistic that there
isn’t a function for? What if you need automatic checks of your data and results?
What if you need to repeat the same analysis for a large number of files? This is where
the programming tools you’ll learn about in this chapter, like loops and conditional
statements, come in handy. And this is where you take the step from being able to
use R for routine analyses to being able to use R for any analysis.
After working with the material in this chapter, you will be able to use R to:
• Write your own R functions,
• Use several new pipe operators,
• Use conditional statements to perform different operations depending on
whether or not a condition is satisfied,
• Iterate code operations multiple times using loops,
• Iterate code operations multiple times using functionals,
• Measure the performance of your R code.
6.1 Functions
Suppose that we wish to compute the mean of a vector x. One way to do this would
be to use sum and length:
x <- 1:100
# Compute mean:
sum(x)/length(x)
Now suppose that we wish to compute the mean of several vectors. We could do this
by repeated use of sum and length:
x <- 1:100
y <- 1:200
z <- 1:300
# Compute means:
sum(x)/length(x)
sum(y)/length(y)
sum(z)/length(x)
But wait! I made a mistake when I copied the code to compute the mean of z - I
forgot to change length(x) to length(z)! This is an easy mistake to make when
you repeatedly copy and paste code. In addition, repeating the same code multiple
times just doesn’t look good. It would be much more convenient to have a single
function for computing the means. Fortunately, such a function exists - mean:
# Compute means
mean(x)
mean(y)
mean(z)
As you can see, using mean makes the code shorter and easier to read and reduces
the risk of errors induced by copying and pasting code (we only have to change the
argument of one function instead of two).
You’ve already used a ton of different functions in R: functions for computing means,
manipulating data, plotting graphics, and more. All these functions have been writ-
ten by somebody who thought that they needed to repeat a task (e.g. computing a
mean or plotting a bar chart) over and over again. And in such cases, it is much
more convenient to have a function that does that task than to have to write or copy
code every time you want to do it. This is true also for your own work - whenever
you need to repeat the same task several times, it is probably a good idea to write
a function for it. It will reduce the amount of code you have to write and lessen the
risk of errors caused by copying and pasting old code. In this section, you will learn
how to write your own functions.
In the case of our function for computing a mean, this could look like:
average <- function(x)
{
avg <- sum(x)/length(x)
return(avg)
}
This defines a function called average, that takes an object called x as input. It
computes the sum of the elements of x, divides that by the number of elements in x,
and returns the resulting mean.
If we now make a call to average(x), our function will compute the mean value of
the vector x. Let’s try it out, to see that it works:
x <- 1:100
y <- 1:200
average(x)
average(y)
Because avg is a local variable, it is only available inside of the average function.
Local variables take precedence over global variables inside the functions to which
they belong. Because we named the argument used in the function x, x becomes the
name of a local variable in average. As far as average is concerned, there is only
one variable named x, and that is whatever object that was given as input to the
function, regardless of what its original name was. Any operations performed on the
local variable therefore won't change the original object. Functions can, however,
read global variables, as the following example shows:
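The definition of y_squared isn't included in this excerpt; a plausible version (an assumption) reads the global y, squares a local copy and returns it:

```r
# Assumed definition: the right-hand side y is found in the global
# environment, the assignment creates a local y.
y_squared <- function()
{
  y <- y^2
  return(y)
}
```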
y <- 2
y_squared()
But operations performed on global variables inside functions won’t affect the global
variable:
add_to_y <- function(n)
{
y <- y + n
}
y <- 1
add_to_y(1)
y
Suppose you really need to change a global variable inside a function.¹ In that case,
you can use an alternative assignment operator, <<-, which assigns a value to the
variable in the parent of the current environment. If you use <<- for
assignment inside a function that is called from the global environment, this means
that the assignment takes place in the global environment. But if you use <<- in a
function (function 1) that is called by another function (function 2), the assignment
will take place in the environment for function 2, thus affecting a local variable in
function 2. Here is an example of a global assignment using <<-:
add_to_y_global <- function(n)
{
y <<- y + n
}
y <- 1
add_to_y_global(1)
y
1 Do you really?
We’ve already seen that it seems to work when the input x is a numeric vector. But
what happens if we input something else instead?
average(c(1, 5, 8)) # Numeric input
average(c(TRUE, TRUE, FALSE)) # Logical input
average(c("Lady Gaga", "Tool", "Dry the River")) # Character input
average(data.frame(x = c(1, 1, 1), y = c(2, 2, 1))) # Numeric df
average(data.frame(x = c(1, 5, 8), y = c("A", "B", "C"))) # Mixed type
The first two of these render the desired output (the logical values being represented
by 0’s and 1’s), but the rest don’t. Many R functions include checks that the input
is of the correct type, or checks to see which method should be applied depending on
what data type the input is. We’ll learn how to perform such checks in Section 6.3.
As a side note, it is possible to write functions that don’t end with return. In that
case, the output (i.e. what would be written in the Console if you’d run the code
there) from the last line of the function will automatically be returned. I prefer to
(almost) always use return though, as it is easy to accidentally make the function
return nothing by finishing it with a line that yields no output. Below are two
examples of how we could have written average without a call to return. The first
doesn’t work as intended, because the function’s final (and only) line doesn’t give
any output.
average_bad <- function(x)
{
avg <- sum(x)/length(x)
}
average_bad(c(1, 5, 8))
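The second example, average_ok, isn't shown in this excerpt; presumably it looked something like this, with a final line that yields output:

```r
average_ok <- function(x)
{
  sum(x)/length(x)   # the value of the last line is returned
}
```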
average_ok(c(1, 5, 8))
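The definition of power_n falls outside this excerpt; judging by the calls below, x has no default value while n does, so an assumed definition is:

```r
# Assumed definition: x has no default, n defaults to 2.
power_n <- function(x, n = 2)
{
  return(x^n)
}

power_n(2, 5)   # 32
```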
For clarity, you can specify which value corresponds to which argument:
power_n(x = 2, n = 5)
…and can then even put the arguments in the wrong order:
power_n(n = 5, x = 2)
However, if we only supply n we get an error, because there is no default value for x:
power_n(n = 5)
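The next examples rely on a helper apply_to_first2, which applies a function to the first two elements of a vector. Its listing precedes this excerpt; a minimal version (assumed) is:

```r
# Assumed early definition; a version with a ... argument
# appears further down.
apply_to_first2 <- function(x, func)
{
  result <- func(x[1:2])
  return(result)
}

x <- c(4, 5, 6)
```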
apply_to_first2(x, power_n)
But what if the function that we supply requires additional arguments? Using
apply_to_first2 with sum and the vector c(4, 5, 6) works fine:
apply_to_first2(x, sum)
But if we instead use the vector c(4, NA, 6), the function returns NA:
x <- c(4, NA, 6)
apply_to_first2(x, sum)
Perhaps we’d like to pass na.rm = TRUE to sum to ensure that we get a numeric
result, if at all possible. This can be done by adding ... to the list of arguments for
both functions, which indicates additional parameters (to be supplied by the user)
that will be passed to func:
apply_to_first2 <- function(x, func, ...)
{
result <- func(x[1:2], ...)
return(result)
}
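A quick check of the ... mechanism (repeating the definition from above so that the sketch runs on its own):

```r
apply_to_first2 <- function(x, func, ...)
{
  result <- func(x[1:2], ...)
  return(result)
}

x <- c(4, NA, 6)
apply_to_first2(x, sum)                # NA
apply_to_first2(x, sum, na.rm = TRUE)  # 4
```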
6.1.5 Namespaces
It is possible, and even likely, that you will encounter functions in packages with the
same name as functions in other packages. Or, similarly, that there are functions in
packages with the same names as those you have written yourself. This is of course a
bit of a headache, but it’s actually something that can be overcome without changing
the names of the functions. Just like variables can live in different environments, R
functions live in namespaces, usually corresponding to either the global environment
or the package they belong to. By specifying which namespace to look for the function
in, you can use multiple functions that all have the same name.
For example, let’s create a function called sqrt. There is already such a function in
the base package² (see ?sqrt), but let's do it anyway:
sqrt <- function(x)
{
return(x^10)
}
If we now apply sqrt to an object, the function that we just defined will be used:
sqrt(4)
But if we want to use the sqrt from base, we can specify that by writing the names-
pace (which almost always is the package name) followed by :: and the function
name:
base::sqrt(4)
The :: notation can also be used to call a function or object from a package without
loading the package’s namespace:
msleep # Doesn't work if ggplot2 isn't loaded
ggplot2::msleep # Works, without loading the ggplot2 namespace!
When you call a function, R will look for it in all active namespaces, following a
particular order. To see the order of the namespaces, you can use search:
search()
Note that the global environment is first in this list - meaning that the functions that
you define always will be preferred to functions in packages.
All this being said, note that it is bad practice to give your functions and variables
the same names as common functions. Don’t name them mean, c or sqrt. Nothing
good can ever come from that sort of behaviour.
Nothing.
2 base is automatically loaded when you start R, and contains core functions such as sqrt.
However, the curly brackets {} and the dots . makes this a little awkward and
difficult to read. A better option is to use the %$% pipe, which passes on the names
of all variables in your data frame instead:
airquality %>%
subset(Temp > 80) %$%
cor(Temp, Wind)
If you want to modify a variable using a pipe, you can use the compound assignment
pipe %<>%. The following three lines all yield exactly the same result:
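The three lines aren't shown in this excerpt; they presumably resembled the following, each replacing x by its square root:

```r
library(magrittr)

x <- c(1, 4, 9); x <- sqrt(x)     # standard assignment
x <- c(1, 4, 9); x <- x %>% sqrt  # pipe plus reassignment
x <- c(1, 4, 9); x %<>% sqrt      # compound assignment pipe
x
```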
As long as the first pipe in the pipeline is the compound assignment operator %<>%,
you can combine it with other pipes:
x <- 1:8
x %<>% subset(x > 5) %>% sqrt
x
Sometimes you want to do something in the middle of a pipeline, like creating a plot,
before continuing to the next step in the chain. The tee operator %T>% can be used
to execute a function without passing on its output (if any). Instead, it passes on
the output to its left. Here is an example:
airquality %>%
subset(Temp > 80) %T>%
plot %$%
cor(Temp, Wind)
Note that if we’d used an ordinary pipe %>% instead, we’d get an error:
airquality %>%
subset(Temp > 80) %>%
plot %$%
cor(Temp, Wind)
The reason is that cor looks for the variables Temp and Wind in the plot object, and
not in the data frame. The tee operator takes care of this by passing on the data
from its left side.
Remember that if you have a function where data only appears within parentheses,
you need to wrap the function in curly brackets:
airquality %>%
subset(Temp > 80) %T>%
{cat("Number of rows in data:", nrow(.), "\n")} %$%
cor(Temp, Wind)
When using the tee operator, this is true also for calls to ggplot, where you addition-
ally need to wrap the plot object in a call to print:
library(ggplot2)
airquality %>%
subset(Temp > 80) %T>%
{print(ggplot(., aes(Temp, Wind)) + geom_point())} %$%
cor(Temp, Wind)
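The definition referred to here isn't part of this excerpt. With magrittr, a pipeline whose first element is the dot placeholder becomes a function, so plot_and_cor was presumably defined along these lines:

```r
library(magrittr)

# A functional sequence: starting the pipeline with . turns it
# into a function of one argument.
plot_and_cor <- . %>%
    subset(Temp > 80) %T>%
    plot %$%
    cor(Temp, Wind)
```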
Note that we don’t have to write function(...) when defining functions with pipes!
We can now use this function just like any other:
# With the airquality data:
airquality %>% plot_and_cor
plot_and_cor(airquality)
Exercise 6.3. Write a function that takes a data frame as input and uses pipes to
print the number of NA values in the data, remove all rows with NA values and return
a summary of the remaining data.
Exercise 6.4. Pipes are operators, that is, functions that take two variables as input
and can be written without parentheses (other examples of operators are + and *).
You can define your own operators just as you would any other function. For instance,
we can define an operator called quadratic that takes two numbers a and b as input
and computes the quadratic expression (𝑎 + 𝑏)2 :
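The definition was presumably something like the following (user-defined operator names in R must be surrounded by percent signs, and backticks are needed when defining them):

```r
`%quadratic%` <- function(a, b)
{
  (a + b)^2
}

2 %quadratic% 3  # (2 + 3)^2 = 25
```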
Create an operator called %against% that takes two vectors as input and draws a
scatterplot of them.
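The general structure of if and else is sketched below (the variable x and the messages are stand-ins):

```r
x <- 2

if(x > 0)
{
  result <- "x is positive"
} else
{
  result <- "x is not positive"
}

result
```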
The condition should return a single logical value, so that it evaluates to either
TRUE or FALSE. If the condition is fulfilled, i.e. if it is TRUE, the code inside the first
pair of curly brackets will run, and if it’s not (FALSE), the code within the second
pair of curly brackets will run instead.
As a first example, assume that you want to compute the reciprocal of 𝑥, 1/𝑥, unless
𝑥 = 0, in which case you wish to print an error message:
x <- 2
if(x == 0) { cat("Error! Division by zero.") } else { 1/x }
Alternatively, we could check if 𝑥 ≠ 0 and then change the order of the segments
within the curly brackets:
x <- 0
if(x != 0) { 1/x } else { cat("Error! Division by zero.") }
You don’t have to write all of the code on the same line, but you must make sure
that the else part is on the same line as the first }:
if(x == 0)
{
cat("Error! Division by zero.")
} else
{
1/x
}
You can also choose not to have an else part at all. In that case, the code inside the
curly brackets will run if the condition is satisfied, and if not, nothing will happen:
x <- 0
if(x == 0) { cat("x is 0.") }
x <- 2
if(x == 0) { cat("x is 0.") }
Finally, if you need to check a number of conditions one after another, in order to
list different possibilities, you can do so by repeated use of if and else:
if(x == 0)
{
cat("Error! Division by zero.")
} else if(is.infinite(x))
{
cat("Error! Division by infinity.")
} else if(is.na(x))
{
cat("Error! Division by NA.")
} else
{
1/x
}
If a condition evaluates to a logical vector with more than one element, only the
first element is used by if.
Usually, if a condition evaluates to a vector, it is because you’ve made an error
in your code. Remember, if you really need to evaluate a condition regarding the
elements in a vector, you can collapse the resulting logical vector to a single value
using any or all.
Some texts recommend using the operators && and || instead of & and | in conditional
statements. These work almost like & and |, but force the condition to evaluate to a
single logical. I prefer to use & and |, because I want to be notified if my condition
evaluates to a vector - once again, that likely means that there is an error somewhere
in my code!
There is, however, one case where I much prefer && and ||. & and | always evaluate
all the conditions that you’re combining, while && and || don’t: && stops as soon as
it encounters a FALSE and || stops as soon as it encounters a TRUE. Consequently,
you can put the conditions you wish to combine in a particular order to make sure
that they can be evaluated. For instance, you may want first to check that a variable
exists, and then check a property. This can be done using exists to check whether
or not it exists - note that the variable name must be written within quotes:
# a is a variable that doesn't exist
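The example presumably resembled the following, where a is assumed not to exist. Because && evaluates its conditions from left to right and stops at the first FALSE, the right-hand condition a > 0 is never evaluated and no error is thrown:

```r
# exists("a") is FALSE, so a > 0 is never evaluated:
safe_check <- exists("a") && a > 0
safe_check
# With a > 0 & exists("a"), both sides would be evaluated,
# causing an error since a is missing.
```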
6.3.3 ifelse
It is common that you want to assign different values to a variable depending on
whether or not a condition is satisfied:
x <- 2
if(x == 0)
{
reciprocal <- "Error! Division by zero."
} else
{
reciprocal <- 1/x
}
reciprocal
In fact, this situation is so common that there is a special function for it: ifelse:
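With ifelse, the reciprocal example above becomes a single line; a sketch:

```r
x <- 2
reciprocal <- ifelse(x == 0, "Error! Division by zero.", 1/x)
reciprocal
```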
6.3.4 switch
For the sake of readability, it is usually a good idea to try to avoid chains of the type
if() {} else if() {} else if() {} else {}. One function that can be useful
for this is switch, which lets you list a number of possible results, either by position
(a number) or by name:
position <- 2
switch(position,
"First position",
"Second position",
"Third position")
You can for instance use this to decide what function should be applied to your data:
x <- 1:3
y <- c(3, 5, 4)
method <- "nonparametric2"
cor_xy <- switch(method,
parametric = cor(x, y, method = "pearson"),
nonparametric1 = cor(x, y, method = "spearman"),
nonparametric2 = cor(x, y, method = "kendall"))
cor_xy
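The calls to average below rely on a function that was presumably defined with switch, along these lines (the exact branches and message are assumptions):

```r
average <- function(x)
{
  switch(class(x)[1],
         numeric = mean(x),
         integer = mean(x),
         character = "Cannot compute the average of a character vector.")
}
```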
average(c(1, 5, 8))
average(c("Lady Gaga", "Tool", "Dry the River"))
Exercise 6.5. Which of the following conditions are TRUE? First think about the
answer, and then check it using R.
x <- 2
y <- 3
z <- -3
1. x > 2
2. x > y | x > z
3. x > y & x > z
4. abs(x*z) >= y
As a first example, let’s write a for loop that runs a block of code five times, where
the block prints the current iteration number:
for(i in 1:5)
{
cat("Iteration", i, "\n")
}
The upside is that we didn’t have to copy and edit the same code multiple times
- and as you can imagine, this benefit becomes even more pronounced if you have
more complicated code blocks.
The values for the control variable are given in a vector, and the code block will be
run once for each element in the vector - we say that we loop over the values in the
vector. The vector doesn’t have to be numeric - here is an example with a character
vector:
for(word in c("one", "two", "five hundred and fifty five"))
{
cat("Iteration", word, "\n")
}
Of course, loops are used for so much more than merely printing text on the screen.
A common use is to perform some computation and then store the result in a vector.
In this case, we must first create an empty vector to store the result in, e.g. using
vector, which creates an empty vector of a specific type and length:
squares <- vector("numeric", 5)
for(i in 1:5)
{
squares[i] <- i^2
}
squares
In this case, it would have been both simpler and computationally faster to compute
the squared values by running (1:5)^2. This is known as a vectorised solution, and
is very important in R. We’ll discuss vectorised solutions in detail in Section 6.5.
When creating the values used for the control variable, we often wish to create differ-
ent sequences of numbers. Two functions that are very useful for this are seq, which
creates sequences, and rep, which repeats patterns:
seq(0, 100)
seq(0, 100, by = 10)
seq(0, 100, length.out = 21)
rep(1, 4)
rep(c(1, 2), 4)
rep(c(1, 2), c(4, 2))
Finally, seq_along can be used to create a sequence of indices for a vector or a data
frame, which is useful if you wish to iterate some code for each element of a vector
or each column of a data frame:
seq_along(airquality) # Gives the indices of all columns of the data
# frame
seq_along(airquality$Temp) # Gives the indices of all elements of the
# vector
Here is an example of how to use seq_along to compute the mean of each column
of a data frame:
# Compute the mean for each column of the airquality data:
means <- vector("double", ncol(airquality))
for(i in seq_along(airquality))
{
means[i] <- mean(airquality[[i]], na.rm = TRUE)
}
# Check that the results agree with those from the colMeans function:
means
colMeans(airquality, na.rm = TRUE)
The line inside the loop could have read means[i] <- mean(airquality[,i],
na.rm = TRUE), but that would have caused problems if we’d used it with a
data.table or tibble object; see Section 5.9.4.
Finally, we can also change the values of the data in each iteration of the loop. Some
machine learning methods require that the data is standardised, i.e. that all columns
have mean 0 and standard deviation 1. This is achieved by subtracting the mean
from each variable and then dividing each variable by its standard deviation. We can
write a function for this that uses a loop, changing the values of a column in each
iteration:
standardise <- function(df, ...)
{
for(i in seq_along(df))
{
df[[i]] <- (df[[i]] - mean(df[[i]], ...))/sd(df[[i]], ...)
}
return(df)
}
# Try it out:
aqs <- standardise(airquality, na.rm = TRUE)
colMeans(aqs, na.rm = TRUE) # Non-zero due to floating point
# arithmetics!
sd(aqs$Wind)
1. Compute the mean temperature for each month in the airquality dataset
using a loop rather than an existing function.
2. Use a for loop to compute the maximum and minimum value of each column
of the airquality data frame, storing the results in a data frame.
3. Make your solution to the previous task reusable by writing a function that
returns the maximum and minimum value of each column of a data frame.
Exercise 6.11. The function list.files can be used to create a vector containing
the names of all files in a folder. The pattern argument can be used to supply
a regular expression describing a file name pattern. For instance, if pattern =
"\\.csv$" is used, only .csv files will be listed.
Create a loop that goes through all .csv files in a folder and prints the names of the
variables for each file.
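The cor_mat computation referred to in Exercise 6.14 below uses nested loops over the columns of a data frame; a sketch of what it may have looked like:

```r
cor_mat <- matrix(NA, nrow = ncol(airquality), ncol = ncol(airquality))

for(i in seq_along(airquality))
{
  for(j in seq_along(airquality))
  {
    cor_mat[i, j] <- cor(airquality[[i]], airquality[[j]],
                         use = "pairwise.complete")
  }
}

cor_mat
```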
Adding a progress bar is a little more complicated, because we must first start the
bar by using txtProgressBar and then update it using setTxtProgressBar:
sequence <- 1:5
pbar <- txtProgressBar(min = 0, max = max(sequence), style = 3)
for(i in sequence)
{
Sys.sleep(1) # Sleep for 1 second
setTxtProgressBar(pbar, i)
}
close(pbar)
Finally, the beepr package3 can be used to play sounds, with the function beep:
install.packages("beepr")
library(beepr)
# Play all 11 sounds available in beepr:
for(i in 1:11)
{
beep(sound = i)
Sys.sleep(2) # Sleep for 2 seconds
}
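The max_list object used below was presumably created by applying which with max to each column, for instance:

```r
# For each column, find the indices of the maximal value(s):
max_list <- lapply(airquality,
                   function(x) { which(x == max(x, na.rm = TRUE)) })
```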
# Naming the list elements will help us see which variable the maximal
# indices belong to:
names(max_list) <- names(airquality)
# Check results:
max_list
# Collapse to a vector:
extreme_days <- unlist(max_list)
(In this case, only the variables Month and Day have duplicate maximum values.)
The code block inside the loop keeps repeating until the condition i^2 <= 100 no
longer is satisfied. We have to be a little bit careful with this condition - if we set it
in such a way that it is possible that the condition always will be satisfied, the loop
will just keep running and running - creating what is known as an infinite loop. If
you’ve accidentally created an infinite loop, you can break it by pressing the Stop
As a more involved example, we can find the runs of equal values in a vector (the
vector x below is an example) using two nested while loops:
x <- c(1, 2, 2, 2, 3, 3)
run_values <- vector("numeric")
run_lengths <- vector("numeric")
i <- 1
while(i <= length(x))
{
run_length <- 1
while(i < length(x) && x[i] == x[i+1])
{
run_length <- run_length + 1
i <- i + 1
}
i <- i + 1
# Save results:
run_values <- c(run_values, x[i-1])
run_lengths <- c(run_lengths, run_length)
}
Exercise 6.12. Consider the nested while loops in the run length example above.
Go through the code and think about what happens in each step. What happens
when i is 1? When it is 5? When it is 6? Answer the following questions:
1. What does the condition for the outer while loop check? Why is it needed?
2. What does the condition for the inner while loop check? Why is it needed?
3. What does the line run_values <- c(run_values, x[i-1]) do?
Exercise 6.13. The control statements break and next can be used inside both for
and while loops to control their behaviour further. break stops a loop, and next
skips to the next iteration of it. Use these functions to modify the following piece of
code so that the loop skips to the next iteration if x[i] is 0, and breaks if x[i] is
NA:
x <- c(1, 5, 0, 3, NA, 8)   # example vector (assumed)
for(i in seq_along(x))
{
cat("Step", i, "- reciprocal is", 1/x[i], "\n")
}
Exercise 6.14. Using the cor_mat computation from Section 6.4.2, write a func-
tion that computes all pairwise correlations in a data frame, and uses next to only
compute correlations for numeric variables. Test your function by applying it to the
msleep data from ggplot2. Could you achieve the same thing without using next?
for(i in 1:5)
{
squares[i] <- i^2
}
squares
Instead, we can simply apply the ^ operator, which uses fast C code to compute the
squares:
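For a vector x, this is a one-liner:

```r
x <- 1:5
x^2
```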
4 Unlike R, C is a low-level language that allows the user to write highly specialised (and complex)
code that runs very quickly.
These types of functions and operators are called vectorised. They take a vector as
input and apply a function to all its elements, meaning that we can avoid slower
solutions utilising loops in R5 . Try to use vectorised solutions rather than loops
whenever possible - it makes your code both easier to read and faster to run.
A related concept is functionals, which are functions that contain a for loop. Instead
of writing a for loop, you can use a functional, supplying data, a function that
should be applied in each iteration of the loop, and a vector to loop over. This won’t
necessarily make your loop run faster, but it does have other benefits:
• Shorter code: functionals allow you to write more concise code. Some would
argue that they also allow you to write code that is easier to read, but that is
obviously a matter of taste.
• Efficient: functionals handle memory allocation and other small tasks effi-
ciently, meaning that you don’t have to worry about creating a vector of an
appropriate size to store the result.
• No changes to your environment: because all operations now take place in
the local environment of the functional, you don’t run the risk of accidentally
changing variables in your global environment.
• No left-overs: for leaves the control variable (e.g. i) in the environment; functionals
do not.
• Easy to use with pipes: because the loop has been wrapped in a function, it
lends itself well to being used in a %>% pipeline.
Explicit loops are preferable when:
• You think that they are easier to read and write.
• Your functions take data frames or other non-vector objects as input.
• Each iteration of your loop depends on the results from previous iterations.
In this section, we’ll see how we can apply functionals to obtain elegant alternatives
to (explicit) loops.
Using apply, we can reduce this to a single line. We wish to use the airquality
data, loop over the columns (margin 2) and apply the function mean to each column:
apply(airquality, 2, mean)
Exercise 6.15. Use apply to compute the maximum and minimum value of each
column of the airquality data frame. Can you write a function that allows you to
compute both with a single call to apply?
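The temps object used below is not defined in this excerpt; a stand-in that is consistent with the tapply call further down would be a list of temperatures split by month:

```r
# A list with one vector of temperatures per month (an assumption):
temps <- split(airquality$Temp, airquality$Month)
```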
lapply(temps, mean)
sapply(temps, mean)
vapply(temps, mean, vector("numeric", 1))
tapply(airquality$Temp, airquality$Month, mean)
There is, as that delightful proverb goes, more than one way to skin a cat.
6.5.3 purrr
If you feel enthusiastic about skinning cats using functionals instead of loops, the
tidyverse package purrr is a great addition to your toolbox. It contains a number of
specialised alternatives to the *apply functions. More importantly, it also contains
certain shortcuts that come in handy when working with functionals. For instance,
it is fairly common to define a short function inside your functional, which is useful
for instance when you don’t want the function to take up space in your environment.
This can be done a little more elegantly with purrr functions using a shortcut denoted
by ~. Let’s say that we want to standardise all variables in airquality. The map
function is the purrr equivalent of lapply. We can use it with or without the shortcut,
and with or without pipes (we mention the use of pipes now because it will be
important in what comes next):
# Base solution:
lapply(airquality, function(x) { (x-mean(x))/sd(x) })
# purrr solution:
library(purrr)
map(airquality, function(x) { (x-mean(x))/sd(x) })
Where this shortcut really shines is if you need to use multiple functionals. Let’s say
that we want to standardise the airquality variables, compute a summary and then
extract columns 2 and 5 from the summary (which contains the 1st and 3rd quartile
of the data):
# Impenetrable base solution:
lapply(lapply(lapply(airquality,
function(x) { (x-mean(x))/sd(x) }),
summary),
function(x) { x[c(2, 5)] })
# purrr solution:
airquality %>%
map(~(.-mean(.))/sd(.)) %>%
map(summary) %>%
map(~.[c(2, 5)])
Once you know the meaning of ~, the purrr solution is a lot cleaner than the base
solutions.
For instance, if you need to specify that the output should be a vector of a specific
type, you can use:
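For example, map_dbl returns a double vector instead of a list (map_int, map_chr and map_lgl work analogously):

```r
library(purrr)

# A named double vector rather than a list:
map_dbl(airquality, ~mean(., na.rm = TRUE))
```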
Another specialised function is the walk function. It works just like map, but doesn’t
return anything. This is useful if you want to apply a function with no output, such
as cat or write.csv:
# Returns a list of NULL values:
airquality %>% map(~cat("Maximum:", max(.), "\n"))
# Returns nothing:
airquality %>% walk(~cat("Maximum:", max(.), "\n"))
Exercise 6.18. Use a map_* function to simultaneously compute the monthly max-
imum and minimum temperature in the airquality data frame, returning a vector.
You can of course combine purrr functionals with functions from other packages,
e.g. to replace length(unique(.)) with a function from your favourite data manip-
ulation package:
# Using uniqueN from data.table:
library(data.table)
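The call was presumably along these lines:

```r
library(data.table)
library(magrittr)
library(purrr)

airquality %>% map_int(uniqueN)
```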
When creating summaries it can often be useful to be able to loop over both the
elements of a vector and their indices. In purrr, this is done using the usual map*
functions, but with an i (for index) in the beginning of their names, e.g. imap and
iwalk:
# Returns a list of NULL values:
imap(airquality, ~ cat(.y, ": ", median(.x), "\n", sep = ""))
# Returns nothing:
iwalk(airquality, ~ cat(.y, ": ", median(.x), "\n", sep = ""))
Note that .x is used to denote the variable, and that .y is used to denote the name
of the variable. If i* functions are used on vectors without element names, indices
are used instead. The names of elements of vectors can be set using set_names:
# Without element names:
x <- 1:5
iwalk(x, ~ cat(.y, ": ", exp(.x), "\n", sep = ""))
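With names set, the names are used instead of indices; a sketch:

```r
library(purrr)

# With element names:
x <- set_names(1:5, c("a", "b", "c", "d", "e"))
iwalk(x, ~ cat(.y, ": ", exp(.x), "\n", sep = ""))
```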
Exercise 6.19. Write a function that takes a data frame as input and returns the
following information about each variable in the data frame: variable name, number
of unique values, data type and number of missing values. The function should, as
you will have guessed, use a functional.
Exercise 6.20. In Exercise 6.11 you wrote a function that printed the names and
variables for all .csv files in a folder given by folder_path. Use purrr functionals
to do the same thing.
Note that some columns are character vectors, which will cause log to throw an
error:
log(msleep$name)
log(msleep)
lapply(msleep, log)
map(msleep, log)
Note that the error messages we get from lapply and map here don’t give any infor-
mation about which variable caused the error, making it more difficult to figure out
what’s gone wrong.
If we first wrap log with safely, we get a list containing the correct output for the
numeric variables, and error messages for the non-numeric variables:
safe_log <- safely(log)
lapply(msleep, safe_log)
map(msleep, safe_log)
Not only does this tell us where the errors occur, but it also returns the logarithms
for all variables that log actually could be applied to.
If you’d like your functional to return some default value, e.g. NA, instead of an error
message, you can use possibly instead of safely:
pos_log <- possibly(log, otherwise = NA)
map(msleep, pos_log)
library(ggplot2)
library(dplyr)
To create such a plot for all combinations of color and cut, we must first create a
data frame containing all unique combinations, which can be done using the distinct
function from dplyr:
combos <- diamonds %>% distinct(cut, color)
cuts <- combos$cut
colours <- combos$color
map2 and walk2 from purrr loop over the elements of two vectors, x and y, say. They
combine the first element of x with the first element of y, the second element of x
with the second element of y, and so on - meaning that they won’t automatically loop
over all combinations of elements. That is the reason why we use distinct above to
create two vectors where each pair (x[i], y[i]) corresponds to a combination. Apart
from the fact that we add a second vector to the call, map2 and walk2 work just like
map and walk:
# Print all pairs:
walk2(cuts, colours, ~cat(.x, .y, "\n"))
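The combos_plots list used below is not shown in this excerpt; it was presumably created with map2 along these lines (the plotted variables, carat and price, are assumptions):

```r
library(ggplot2)
library(purrr)

# cuts and colours are the vectors created from combos above:
combos_plots <- map2(cuts, colours,
                     ~ggplot(subset(diamonds, cut == .x & color == .y),
                             aes(carat, price)) +
                       geom_point() +
                       ggtitle(paste(.x, .y)))
```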
# Save all plots in a pdf file, with one plot per page:
pdf("all_combos_plots.pdf", width = 8, height = 8)
combos_plots
dev.off()
The base function mapply could also have been used here. If you need to iterate over
more than two vectors, you can use pmap or pwalk, which work analogously to map2
and walk2.
Exercise 6.21. Using the gapminder data from the gapminder package, create
scatterplots of pop and lifeExp for each combination of continent and year. Save
each plot as a separate .png file.
This isn’t the best way of measuring computational time though, and doesn’t allow
us to compare different functions easily. Instead, we’ll use the bench package, which
contains a function called mark that is very useful for measuring the execution time
of functions and blocks of code. Let’s start by installing it:
install.packages("bench")
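The average function compared to mean below is a bare-bones arithmetic mean; judging from the discussion that follows, its definition was presumably:

```r
average <- function(x)
{
  sum(x)/length(x)
}
```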
Is this faster or slower than mean? We can use mark to apply both functions to a
vector multiple times, and measure how long each execution takes:
library(bench)
x <- 1:100
bm <- mark(mean(x), average(x))
bm # Or use View(bm) if you don't want to print the results in the
# Console panel.
mark has executed both functions n_itr times each, and measured how long each
execution took to perform. The execution time varies - in the output you can see
the shortest (min) and median (median) execution times, as well as the number of
iterations per second (itr/sec). Be a little wary of the units for the execution times
so that you don’t get them confused - a millisecond (ms, 10−3 seconds) is 1,000
microseconds (µs, 1 µs is 10−6 seconds), and 1 microsecond is 1,000 nanoseconds (ns,
1 ns is 10−9 seconds).
The result here may surprise you - it appears that average is faster than mean! The
reason is that mean does a lot of things that average does not: it checks the data type
and gives error messages if the data is of the wrong type (e.g. character), and then
traverses the vector twice to lower the risk of errors due to floating point arithmetics.
All of this takes time, and makes the function slower (but safer to use).
plot(bm)
It is also possible to place blocks of code inside curly brackets, { }, in mark. Here
is an example comparing a vectorised solution for computing the squares of a vector
with a solution using a loop:
x <- 1:100
bm <- mark(x^2,
{
y <- x
for(i in seq_along(x))
{
y[i] <- x[i]*x[i]
}
y
})
bm
plot(bm)
Although the above code works, it isn’t the prettiest, and the bm table looks a bit
confusing because of the long expression for the code block. I prefer to put the code
block inside a function instead:
squares <- function(x)
{
y <- x
for(i in seq_along(x))
{
y[i] <- x[i]*x[i]
}
return(y)
}
x <- 1:100
bm <- mark(x^2, squares(x))
bm
plot(bm)
bm <- mark(squares(x),
{
y <- x
for(i in seq_along(x))
{
y[i] <- x[i]*x[i]
}
y
})
bm
Functions in R are compiled the first time they are run, which often makes them
run faster than the same code would have outside of the function. We’ll discuss this
further next.
x <- 1:100
bm <- mark(x^2, squares(x))
bm
Judging from the mem_alloc column, it appears that the squares(x) loop not only
is slower, but also uses more memory. But wait! Let's run the code again, just to be sure:
7 But only if your version of R has been compiled with memory profiling. If you are using a
standard build of R, i.e. have downloaded the base R binary from R-project.org, you should be good
to go. You can check that memory profiling is enabled by checking that capabilities("profmem")
returns TRUE. If not, you may need to reinstall R if you wish to enable memory profiling.
This time out, both functions use less memory, and squares now uses less memory
than x^2. What’s going on?
Computers can’t read code written in R or most other programming languages di-
rectly. Instead, the code must be translated to machine code that the computer’s
processor uses, in a process known as compilation. R uses just-in-time compilation
of functions and loops8 , meaning that it translates the R code for new functions and
loops to machine code during execution. Other languages, such as C, use ahead-
of-time compilation, translating the code prior to execution. The latter can make
the execution much faster, but some flexibility is lost, and the code needs to be run
through a compiler ahead of execution, which also takes time. When doing the just-
in-time compilation, R needs to use some of the computer’s memory, which causes
the memory usage to be greater the first time the function is run. However, if an
R function is run again, it has already been compiled, meaning R doesn’t have to
allocate memory for compilation.
In conclusion, if you want to benchmark the memory usage of functions, make sure
to run them once before benchmarking. Alternatively, if your function takes a long
time to run, you can compile it without running it using the cmpfun function from
the compiler package:
library(compiler)
squares <- cmpfun(squares)
squares(1:10)
Exercise 6.22. Write a function for computing the mean of a vector using a for
loop. How much slower than mean is it? Which function uses more memory?
Exercise 6.23. We have seen three different ways of filtering a data frame to only
keep rows that fulfil a condition: using base R, data.table and dplyr. Suppose
that we want to extract all flights from 1 January from the flights data in the
nycflights13 package:
library(data.table)
library(dplyr)
library(nycflights13)
# Read about the data:
8 Since R 3.4.
?flights
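The three approaches were presumably along the following lines (a sketch):

```r
library(data.table)
library(dplyr)
library(nycflights13)

# Base R:
flights[flights$month == 1 & flights$day == 1,]

# data.table:
flights_dt <- as.data.table(flights)
flights_dt[month == 1 & day == 1,]

# dplyr:
flights %>% filter(month == 1, day == 1)
```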
Compare the speed and memory usage of these three approaches. Which has the
best performance?
Chapter 7
Modern Classical Statistics
“Modern classical” may sound like a contradiction, but it is in fact anything but.
Classical statistics covers topics like estimation, quantification of uncertainty, and
hypothesis testing - all of which are at the heart of data analysis. Since the advent
of modern computers, much has happened in this field that has yet to make it to
the standard textbooks of introductory courses in statistics. This chapter attempts
to bridge part of that gap by dealing with those classical topics, but with a modern
approach that uses more recent advances in statistical theory and computational
methods. Particular focus is put on how simulation can be used for analyses and for
evaluating the properties of statistical procedures.
Whenever it is feasible, our aim in this chapter and the next is to:
• Use hypothesis tests that are based on permutations or the bootstrap rather
than tests based on strict assumptions about the distribution of the data or
asymptotic distributions,
• Complement estimates and hypothesis tests with confidence intervals based on
sound methods (including the bootstrap),
• Offer easy-to-use Bayesian methods as an alternative to frequentist tools.
As we shall soon see, simulations are also invaluable tools for evaluating statistical
methods. A key component of modern statistical work is simulation, in which we generate
artificial data that can be used both in the analysis of real data (e.g. in permutation
tests and bootstrap confidence intervals, topics that we’ll explore in this chapter) and
for assessing different methods. Simulation is possible only because we can generate
random numbers, so let’s begin by having a look at how we can generate random
numbers in R.
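For instance, sample draws elements at random from a vector; the example was presumably similar to this:

```r
# Draw 2 of the numbers 1, ..., 10 at random, without replacement:
sample(1:10, 2)
```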
Try running the above code multiple times. You’ll get different results each time,
because each time it runs the random number generator is in a different state. In
most cases, this is desirable (if the results were the same each time we used sample,
it wouldn’t be random), but not if we want to replicate a result at some later stage.
When we are concerned about reproducibility, we can use set.seed to fix the state
of the random number generator:
# Each run generates different results:
sample(1:10, 2); sample(1:10, 2)
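Setting the seed before each call makes the results identical (the seed value itself is arbitrary):

```r
# Fixing the seed gives reproducible results:
set.seed(314)
sample(1:10, 2)
set.seed(314)
sample(1:10, 2)
```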
We often want to use simulated data from a probability distribution, such as the
normal distribution. The normal distribution is defined by its mean 𝜇 and its vari-
ance 𝜎2 (or, equivalently, its standard deviation 𝜎). There are special functions for
generating data from different distributions - for the normal distribution it is called
rnorm. We specify the number of observations that we want to generate (n) and the
parameters of the distribution (the mean mu and the standard deviation sigma):
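The full call was presumably something like the following (note that the arguments of rnorm are named mean and sd):

```r
# The parameters of the distribution:
mu <- 2
sigma <- 1
rnorm(n = 10, mean = mu, sd = sigma)
```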
# A shorter version:
rnorm(10, 2, 1)
Similarly, there are functions that can be used to compute the quantile function, density
function, and cumulative distribution function (CDF) of the normal distribution.
Here are some examples for a normal distribution with mean 2 and standard deviation
1:
qnorm(0.9, 2, 1) # Upper 90 % quantile of distribution
dnorm(2.5, 2, 1) # Density function f(2.5)
pnorm(2.5, 2, 1) # Cumulative distribution function F(2.5)
Continuous uniform distribution 𝑈 (𝑎, 𝑏) on the interval (𝑎, 𝑏), with mean (𝑎 + 𝑏)/2 and
variance (𝑏 − 𝑎)²/12:
runif(n, a, b) # Generate n random numbers
qunif(0.95, a, b) # Upper 95 % quantile of distribution
dunif(x, a, b) # Density function f(x)
punif(x, a, b) # Cumulative distribution function F(X)
Lognormal distribution 𝐿𝑁 (𝜇, 𝜎²) with mean exp(𝜇 + 𝜎²/2) and variance (exp(𝜎²) −
1) exp(2𝜇 + 𝜎²):
rlnorm(n, mu, sigma) # Generate n random numbers
qlnorm(0.95, mu, sigma) # Upper 95 % quantile of distribution
dlnorm(x, mu, sigma) # Density function f(x)
plnorm(x, mu, sigma) # Cumulative distribution function F(X)
t-distribution 𝑡(𝜈) with mean 0 (for 𝜈 > 1) and variance 𝜈/(𝜈 − 2) (for 𝜈 > 2):
rt(n, nu) # Generate n random numbers
qt(0.95, nu) # Upper 95 % quantile of distribution
dt(x, nu) # Density function f(x)
pt(x, nu) # Cumulative distribution function F(X)
Exercise 7.2. Use runif and (at least) one of round, ceiling and floor to generate
observations from a discrete random variable on the integers 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
library(ggplot2)
# Compare to histogram:
ggplot(generated_data, aes(x = normal_data)) +
geom_histogram(colour = "black", aes(y = ..density..)) +
geom_function(fun = dnorm, colour = "red", size = 2,
args = list(mean = mean(generated_data$normal_data),
sd = sd(generated_data$normal_data)))
We could also add a density estimate for the generated data, to further aid the eye
here - we’d expect this to be close to the theoretical density function:
# Compare to density estimate:
ggplot(generated_data, aes(x = normal_data)) +
geom_histogram(colour = "black", aes(y = ..density..)) +
geom_density(colour = "blue", size = 2) +
geom_function(fun = dnorm, colour = "red", size = 2,
args = list(mean = mean(generated_data$normal_data),
sd = sd(generated_data$normal_data)))
Note that the values of args have changed. args should always be a list containing
values for the parameters of the distribution: mean and sd for the normal distribution
and df for the 𝜒2 distribution (the same as in Section 7.1.2).
Another option is to draw a quantile-quantile plot, or Q-Q plot for short, which
compares the theoretical quantiles of a distribution to the empirical quantiles of
the data, showing each observation as a point. If the data follows the theorised
distribution, then the points should lie more or less along a straight line.
To draw a Q-Q plot for a normal distribution, we use the geoms geom_qq and
geom_qq_line:
For all other distributions, we must provide the quantile function of the distribution
(many of which can be found in Section 7.1.2):
# Q-Q plot for the lognormal distribution:
ggplot(generated_data, aes(sample = normal_data)) +
geom_qq(distribution = qlnorm) +
geom_qq_line(distribution = qlnorm)
Q-Q-plots can be a little difficult to read. There will always be points deviating
from the line - in fact, that’s expected. So how much must they deviate before we
rule out a distributional assumption? Particularly when working with real data, I
like to compare the Q-Q-plot of my data to Q-Q-plots of simulated samples from
the assumed distribution, to get a feel for what kind of deviations can appear if the
distributional assumption holds. Here’s an example of how to do this, for the normal
distribution:
# Look at solar radiation data for May from the airquality
# dataset:
May <- airquality[airquality$Month == 5,]
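Here is a sketch of how the comparison can be done; the choice of eight simulated panels and the use of facet_wrap are mine, so the details may differ from other implementations:

```r
library(ggplot2)

# Q-Q plot for the solar radiation data:
ggplot(May, aes(sample = Solar.R)) +
      geom_qq() +
      geom_qq_line()

# Q-Q plots for eight simulated normal samples of the same size,
# with mean and standard deviation estimated from the data:
n <- sum(!is.na(May$Solar.R))
simulated <- data.frame(
      x = rnorm(8 * n, mean(May$Solar.R, na.rm = TRUE),
                       sd(May$Solar.R, na.rm = TRUE)),
      sample_id = rep(1:8, each = n))

ggplot(simulated, aes(sample = x)) +
      geom_qq() +
      geom_qq_line() +
      facet_wrap(~ sample_id, nrow = 2)
```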
You can run the code several times, to get more examples of what Q-Q-plots can
look like when the distributional assumption holds. In this case, the tail points in
the Q-Q-plot for the solar radiation data deviate from the line more than the tail
points in most simulated examples do, and personally, I’d be reluctant to assume
that the data comes from a normal distribution.
Exercise 7.3. Investigate the sleeping times in the msleep data from the ggplot2
package. Do they appear to follow a normal distribution? A lognormal distribution?
Let’s say that we are interested in computing the area under a quarter-circle. We can
highlight the area in our plot using geom_area:
ggplot(data.frame(x = seq(0, 1, 1e-4)), aes(x)) +
geom_area(aes(x = x,
y = ifelse(x^2 + circle(x)^2 <= 1, circle(x), 0)),
fill = "pink") +
geom_function(fun = circle)
To find the area, we will generate a large number of random points uniformly in the
unit square. By the law of large numbers, the proportion of points that end up under
the quarter-circle should be close to the area under the quarter-circle1 . To do this,
we generate 10,000 random values for the 𝑥 and 𝑦 coordinates of each point using
the 𝑈 (0, 1) distribution, that is, using runif:
B <- 1e4
unif_points <- data.frame(x = runif(B), y = runif(B))
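We can then plot the points on top of the shaded area (a sketch; circle is assumed to be the quarter-circle function √(1 − 𝑥²) from earlier in this section):

```r
# The quarter-circle function, assumed from earlier in the section:
circle <- function(x) { sqrt(1 - x^2) }

ggplot(unif_points, aes(x)) +
      geom_area(data = data.frame(x = seq(0, 1, 1e-4)),
                aes(y = circle(x)),
                fill = "pink") +
      geom_point(aes(y = y), size = 0.5, alpha = 0.25) +
      geom_function(fun = circle)
```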
Note the order in which we placed the geoms - we plot the points after the area so
that the pink colour won’t cover the points, and the function after the points so that
the points won’t cover the curve.
To estimate the area, we compute the proportion of points that are below the curve:
mean(unif_points$x^2 + unif_points$y^2 <= 1)
In this case, we can also compute the area exactly: ∫₀¹ √(1 − 𝑥²) 𝑑𝑥 = 𝜋/4 = 0.7853 ….
For more complicated integrals, however, numerical integration methods like Monte
Carlo integration may be required. That being said, there are better numerical inte-
gration methods for low-dimensional integrals like this one. Monte Carlo integration
1 In general, the proportion of points that fall below the curve will be proportional to the area
under the curve relative to the area of the sample space. In this case the sample space is the unit
square, which has area 1, meaning that the relative area is the same as the absolute area.
Alternatively, we could have used formula notation, as we e.g. did for the linear model
in Section 3.7. We’d then have to use the data argument in t.test to supply the
dataset.
2 Note that this is not a random sample of mammals, and so one of the fundamental assumptions
behind the t-test isn’t valid in this case. For the purpose of showing how to use the t-test, the data
is good enough though.
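A sketch of that call, mirroring the examples later in this section:

```r
library(ggplot2)

t.test(sleep_total ~ vore,
       data = subset(msleep, vore == "carni" | vore == "herbi"))
```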
Unless we are interested in keeping the vectors carnivores and herbivores for other
purposes, this latter approach is arguably more elegant.
Speaking of elegance, the data argument also makes it easy to run a t-test using
pipes. Here is an example, where we use filter from dplyr to do the subsetting:
library(dplyr)
msleep %>% filter(vore == "carni" | vore == "herbi") %>%
t.test(sleep_total ~ vore, data = .)
We could also use the magrittr pipe %$% from Section 6.2 to pass the variables from
the filtered subset of msleep, avoiding the data argument:
library(magrittr)
msleep %>% filter(vore == "carni" | vore == "herbi") %$%
t.test(sleep_total ~ vore)
There are even more options than this - the point that I’m trying to make here is
that like most functions in R, you can use functions for classical statistics in many
different ways. In what follows, I will show you one or two of these, but don’t hesitate
to try out other approaches if they seem better to you.
What we just did above was a two-sided t-test, where the null hypothesis was that
there was no difference in means between the groups, and the alternative hypoth-
esis that there was a difference. We can also perform one-sided tests using the
alternative argument. alternative = "greater" means that the alternative is
that the first group has a greater mean, and alternative = "less" means that the
first group has a smaller mean. Here is an example with the former:
t.test(sleep_total ~ vore,
data = subset(msleep, vore == "carni" | vore == "herbi"),
alternative = "greater")
By default, R uses the Welch two-sample t-test, meaning that it is not assumed that
the groups have equal variances. If you don’t want to make that assumption, you
can add var.equal = TRUE:
t.test(sleep_total ~ vore,
data = subset(msleep, vore == "carni" | vore == "herbi"),
var.equal = TRUE)
In addition to two-sample t-tests, t.test can also be used for one-sample tests and
paired t-tests. To perform a one-sample t-test, all we need to do is to supply a
single vector with observations, along with the value of the mean 𝜇 under the null
hypothesis. I usually sleep for about 7 hours each night, and so if I want to test
whether that is true for an average mammal, I’d use the following:
t.test(msleep$sleep_total, mu = 7)
As we can see from the output, your average mammal sleeps for 10.4 hours per day.
Moreover, the p-value is quite low - apparently, I sleep unusually little for a mammal!
As for paired t-tests, we can perform them by supplying two vectors (where element
1 of the first vector corresponds to element 1 of the second vector, and so on) and
the argument paired = TRUE. For instance, using the diamonds data from ggplot2,
we could run a test to see if the length x of diamonds with a fair quality of the cut
on average equals the width y:
fair_diamonds <- subset(diamonds, cut == "Fair")
t.test(fair_diamonds$x, fair_diamonds$y, paired = TRUE)
Exercise 7.5. Load the VAS pain data vas.csv from Exercise 3.8. Perform a one-
sided t-test to test the null hypothesis that the average VAS among the patients
during the time period is less than or equal to 6.
possible ways to assign 19 animals the label carnivore and 32 animals the label
herbivore. That is, look at all permutations of the labels. The probability of a
result at least as extreme as that obtained in our sample (in the direction of the
alternative), i.e. the p-value, would then be the proportion of permutations that
yield a result at least extreme as that in our sample. This is known as a permutation
test.
Permutation tests were known to the likes of Gosset and Fisher (Fisher’s exact test is
a common example), but because the number of permutations of the labels often tends
to become quite large (76,000 billion in our carnivore-herbivore example), they lacked
the means to actually use them. 76,000 billion permutations may be too many even
today, but we can obtain very good approximations of the p-values of permutation
tests using simulation.
The idea is that we look at a large number of randomly selected permutations, and
check for how many of them we obtain a test statistic that is more extreme than
the sample test statistic. The law of large numbers guarantees that this proportion
will converge to the permutation test p-value as the number of randomly selected
permutations increases.
Let’s have a go!
# Filter the data, to get carnivores and herbivores:
data <- subset(msleep, vore == "carni" | vore == "herbi")
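Here is a sketch of the simulation; the choice of test statistic (the difference in group means), the variable names, and the use of 9,999 permutations are my assumptions:

```r
# Observed test statistic: the difference in mean sleeping times:
sample_diff <- mean(data$sleep_total[data$vore == "carni"]) -
      mean(data$sleep_total[data$vore == "herbi"])

# Compute the statistic for 9,999 random permutations of the labels:
B <- 9999
perm_diff <- numeric(B)
for(i in 1:B)
{
      permuted_vore <- sample(data$vore)
      perm_diff[i] <- mean(data$sleep_total[permuted_vore == "carni"]) -
            mean(data$sleep_total[permuted_vore == "herbi"])
}

# Approximate two-sided p-value: the proportion of permutations that
# yield a statistic at least as extreme as the observed one:
mean(abs(perm_diff) >= abs(sample_diff))
```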
In this particular example, the resulting p-value is pretty close to that from the old-
school t-test. However, we will soon see examples where the two versions of the t-test
differ more.
You may ask why we used 9,999 permutations and not 10,000. The reason is that we
avoid p-values that are equal to traditional significance levels like 0.05 and 0.01 this
way. If we’d used 10,000 permutations, 500 of which yielded a statistic that had
a larger absolute value than the sample statistic, then the p-value would have been
exactly 0.05, which would cause some difficulties in trying to determine whether
or not the result was significant at the 5 % level. This cannot happen when we
use 9,999 permutations instead (500 statistics with a large absolute value yields the
p-value 0.050005 > 0.05, and 499 yields the p-value 0.0499 < 0.05).
Having to write a for loop every time we want to run a t-test seems unnecessarily
complicated. Fortunately, others have trodden this path before us. The MKinfer
package contains a function to perform (approximate) permutation t-tests, which also
happens to be faster than our implementation above. Let’s install it:
install.packages("MKinfer")
The function for the permutation t-test, perm.t.test, works exactly like t.test. In
all the examples from Section 7.2.1 we can replace t.test with perm.t.test to run
a permutation t-test instead. Like so:
library(MKinfer)
perm.t.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
Note that two p-values and confidence intervals are presented: one set from the
permutations and one from the old-school approach - so make sure that you look at
the right ones!
You may ask how many randomly selected permutations we need to get an accurate
approximation of the permutation test p-value. By default, perm.t.test uses 9,999
permutations (you can change that number using the argument R), which is widely
considered to be a reasonable number. If you are running a permutation test with a
much more complex (and computationally intensive) statistic, you may have to use
a lower number, but avoid that if you can.
We will have a closer look at the bootstrap in Section 7.7, where we will learn
how to use it for creating confidence intervals and computing p-values for any test
statistic. For now, we’ll just note that MKinfer offers a bootstrap version of the
t-test, boot.t.test :
library(MKinfer)
boot.t.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
Both perm.t.test and boot.t.test have a useful argument called symmetric, the details
of which are discussed in depth in Section 12.3.
test_result
As you can see, test_result is a list containing different parameters and vectors
for the test. To get the p-value, we can run the following:
test_result$p.value
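One way of obtaining pairwise p-values is to loop over all pairs of groups. Here is a sketch that compares the sleeping times of all vore groups in msleep; the layout of the p_values data frame is an assumption:

```r
library(ggplot2)

groups <- na.omit(unique(msleep$vore))
pairs <- combn(groups, 2)

p_values <- data.frame(group1 = pairs[1,],
                       group2 = pairs[2,],
                       p = NA)

for(i in 1:ncol(pairs))
{
      test_data <- subset(msleep, vore %in% pairs[, i])
      p_values$p[i] <- t.test(sleep_total ~ vore,
                              data = test_data)$p.value
}
```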
To view the p-values for each pairwise test, we can now run:
p_values
When we run multiple tests, the risk for a type I error increases, to the point where
we’re virtually guaranteed to get a significant result. We can reduce the risk of
false positive results and adjust the p-values for multiplicity using for instance Bon-
ferroni correction, Holm’s method (an improved version of the standard Bonferroni
approach), or the Benjamini-Hochberg approach (which controls the false discovery
rate and is useful if you for instance are screening a lot of variables for differences),
using p.adjust:
p.adjust(p_values$p, method = "bonferroni")
p.adjust(p_values$p, method = "holm")
p.adjust(p_values$p, method = "BH")
As an example, consider the airquality data. Let’s say that we want to test whether
the mean ozone, solar radiation, wind speed, and temperature differ between June
and July. We could use four separate t-tests to test this, but we could also use
Hotelling’s 𝑇² to test the null hypothesis that the mean vector, i.e. the vector
containing the four means, is the same for both months. The function used for this is
hotelling.test:
# Subset the data:
airquality_t2 <- subset(airquality, Month == 6 | Month == 7)
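hotelling.test can be found in, for instance, the Hotelling package; here is a sketch of the call (the choice of package, the . ~ group formula, and the handling of missing values are assumptions):

```r
library(Hotelling)

# Test whether the mean vector of the four variables differs between
# the two months, removing rows with missing values first:
hotelling.test(. ~ Month,
               data = na.omit(airquality_t2[, c("Ozone", "Solar.R",
                                                "Wind", "Temp",
                                                "Month")]))
```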
means, what constitutes “enough data” is usually measured by the power of the test.
The sample is large enough when the test achieves high enough power. If we are
comfortable assuming normality (and we may well be, especially as the main goal
with sample size computations is to get a ballpark figure), we can use power.t.test
to compute what power our test would achieve under different settings. For a two-
sample test with unequal variances, we can use power.welch.t.test from MKpower
instead. Both functions can be used to either find the sample size required for a
certain power, or to find out what power will be obtained from a given sample size.
power.t.test and power.welch.t.test both use delta to denote the mean difference
under the alternative hypothesis. In addition, we must supply the standard
deviation sd of the distribution. Here are some examples:
library(MKpower)
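For instance (the values of delta, sd, and n below are arbitrary illustration values, and the sd2 argument name for the second group's standard deviation is an assumption):

```r
library(MKpower)

# Sample size per group needed for 90 % power when delta = 1, sd = 2:
power.t.test(power = 0.9, delta = 1, sd = 2, sig.level = 0.05)

# Power of a Welch test with 25 observations per group and unequal
# standard deviations (the argument name sd2 is an assumption):
power.welch.t.test(n = 25, delta = 1, sd = 2, sd2 = 3)
```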
You may wonder how to choose delta and sd. If possible, it is good to base these
numbers on a pilot study or related previous work. If no such data is available, your
guess is as good as mine. For delta, some useful terminology comes from medical
statistics, where the concept of clinical significance is used increasingly often. Make
sure that delta is large enough to be clinically significant, that is, large enough to
actually matter in practice.
If we have reason to believe that the data follows a non-normal distribution, another
option is to use simulation to compute the sample size that will be required. We’ll
Exercise 7.6. Return to the one-sided t-test that you performed in Exercise 7.5.
Assume that delta is 0.5 (i.e. that the true mean is 6.5) and that the standard
deviation is 2. How large does the sample size 𝑛 have to be for the power of the
test to be 95 % at a 5 % significance level? What is the power of the test when the
sample size is 𝑛 = 2, 351?
To use a Bayesian model with a weakly informative prior to analyse the difference in
sleeping time between herbivores and carnivores, we load rstanarm and use stan_glm
in complete analogy with how we use t.test:
library(rstanarm)
library(ggplot2)
m <- stan_glm(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
There are two estimates here: an “intercept” (the average sleeping time for carni-
vores) and voreherbi (the difference between carnivores and herbivores). To plot
the posterior distribution of the difference, we can use plot:
plot(m, "dens", pars = c("voreherbi"))
p-values are not a part of Bayesian statistics, so don’t expect any. It is however
possible to perform a kind of Bayesian test of whether there is a difference by checking
whether the credible interval for the difference contains 0. If not, there is evidence
that there is a difference (Thulin, 2014c). In this case, 0 is contained in the interval,
and there is no evidence of a difference.
In most cases, Bayesian estimation is done using Monte Carlo integration (specifically,
a class of methods known as Markov Chain Monte Carlo, MCMC). To check that the
model fitting has converged, we can use a measure called 𝑅̂ (R-hat). It should be less than
1.1 if the fitting has converged:
plot(m, "rhat")
If the model fitting hasn’t converged, you may need to increase the number of itera-
tions of the MCMC algorithm, by adding the argument iter to stan_glm (the default
is 2,000).
If you want to use a custom prior for your analysis, that is of course possible too.
See ?priors and ?stan_glm for details about this, and about the default weakly
informative prior.
library(ggplot2)
carnivores <- msleep[msleep$vore == "carni",]
herbivores <- msleep[msleep$vore == "herbi",]
wilcox.test(carnivores$sleep_total, herbivores$sleep_total)
Or use a formula:
wilcox.test(sleep_total ~ vore, data =
subset(msleep, vore == "carni" | vore == "herbi"))
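To test whether two numerical variables are correlated, we can use cor.test. For instance, for the correlation between sleeping times and brain weights in msleep:

```r
library(ggplot2)

# Pearson test of correlation:
cor.test(msleep$sleep_total, msleep$brainwt,
         use = "pairwise.complete")
```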
The test we just performed uses the Pearson correlation coefficient as its test statistic.
If you prefer, you can use the nonparametric Spearman and Kendall correlation
coefficients in the test instead, by changing the value of method:
# Spearman test of correlation:
cor.test(msleep$sleep_total, msleep$brainwt,
use = "pairwise.complete",
method = "spearman")
These tests are all based on asymptotic approximations, which among other things
cause the Pearson correlation test to perform poorly for non-normal data. In Section
7.7 we will create a bootstrap version of the correlation test, which has better
performance.
7.3.3 𝜒²-tests
𝜒² (chi-squared) tests are most commonly used to test whether two categorical vari-
ables are independent. To use them, we must first construct a contingency table, i.e. a
table showing the counts for different combinations of categories, typically using
table. Here is an example with the diamonds data from ggplot2:
library(ggplot2)
table(diamonds$cut, diamonds$color)
The null hypothesis of our test is that the quality of the cut (cut) and the colour
of the diamond (color) are independent, with the alternative being that they are
dependent. We use chisq.test with the contingency table as input to run the 𝜒²
test of independence:
chisq.test(table(diamonds$cut, diamonds$color))
If both of the variables are binary, i.e. only take two values, the power of the test can
be approximated using power.prop.test. Let’s say that we have two variables, 𝑋
and 𝑌 , taking the values 0 and 1. Assume that we collect 𝑛 observations with 𝑋 = 0
and 𝑛 with 𝑋 = 1. Furthermore, let p1 be the probability that 𝑌 = 1 if 𝑋 = 0 and
p2 be the probability that 𝑌 = 1 if 𝑋 = 1. We can then use power.prop.test as
follows:
# Assume that n = 50, p1 = 0.4 and p2 = 0.5 and compute the power:
power.prop.test(n = 50, p1 = 0.4, p2 = 0.5, sig.level = 0.05)
Let’s say that we want to compute a confidence interval for the proportion of herbi-
vore mammals that sleep for more than 7 hours a day.
library(ggplot2)
herbivores <- msleep[msleep$vore == "herbi",]
# Compute the number of animals for which we know the sleeping time:
n <- sum(!is.na(herbivores$sleep_total))
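We also need x, the number of herbivores known to sleep for more than 7 hours (the variable name is chosen to match its use below):

```r
# Compute the number of animals that sleep for more than 7 hours:
x <- sum(herbivores$sleep_total > 7, na.rm = TRUE)
```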
The estimated proportion is x/n, which in this case is 0.625. We’d like to quantify
the uncertainty in this estimate by computing a confidence interval. The standard
Wald method, taught in most introductory courses, can be computed using:
library(MKinfer)
binomCI(x, n, conf.level = 0.95, method = "wald")
Don’t do that though! The Wald interval is known to be severely flawed (Brown et
al., 2001), and much better options are available. If the proportion can be expected
to be close to 0 or 1, the Clopper-Pearson interval is recommended, and otherwise
the Wilson interval is the best choice (Thulin, 2014a):
binomCI(x, n, conf.level = 0.95, method = "clopper-pearson")
binomCI(x, n, conf.level = 0.95, method = "wilson")
An excellent Bayesian credible interval is the Jeffreys interval, which uses the weakly
informative Jeffreys prior:
binomCI(x, n, conf.level = 0.95, method = "jeffreys")
The ssize.propCI function in MKpower can be used to compute the sample size
needed to obtain a confidence interval with a given width³. It relies on asymptotic
formulas that are highly accurate, as you will verify later on in Exercise 7.17.
library(MKpower)
# Compute the sample size required to obtain an interval with
# width 0.1 if the true proportion is 0.4:
ssize.propCI(prop = 0.4, width = 0.1, method = "wilson")
ssize.propCI(prop = 0.4, width = 0.1, method = "clopper-pearson")
3 Or rather, a given expected, or average, width. The width of the interval is a function of a
Exercise 7.7. The function binomDiffCI from MKinfer can be used to compute
a confidence interval for the difference of two proportions. Using the msleep data,
use it to compute a confidence interval for the difference between the proportion of
herbivores that sleep for more than 7 hours a day and the proportion of carnivores
that sleep for more than 7 hours a day.
the cancer-preventing effect of kale backed up by several papers, even though the
majority of studies actually indicated that there was no such effect.
Exercise 7.8. Discuss the following. You are helping a research team with statistical
analysis of data that they have collected. You agree on five hypotheses to test.
None of the tests turns out significant. Fearing that all their hard work won’t lead
anywhere, your collaborators then ask you to carry out five new tests. None of these
turns out significant. Your collaborators closely inspect the data and then ask you to carry
out ten more tests, two of which are significant. The team wants to publish these
significant results in a scientific journal. Should you agree to publish them? If so,
what results should be published? Should you have put your foot down and told them
not to run more tests? Does your answer depend on how long it took the research
team to collect the data? What if the team won’t get funding for new projects unless
they publish a paper soon? What if other research teams competing for the same
grants do their analyses like this?
Exercise 7.9. Discuss the following. You are working for a company that is launch-
ing a new product, a hair-loss treatment. In a small study, the product worked for
19 out of 22 participants (86 %). You compute a 95 % Clopper-Pearson confidence
interval (Section 7.3.4) for the proportion of successes and find that it is (0.65, 0.97).
Based on this, the company wants to market the product as being 97 % effective.
Is that acceptable to you? If not, how should it be marketed? Would your answer
change if the product was something else (new running shoes that make you faster,
a plastic film that protects smartphone screens from scratches, or contraceptives)?
What if the company wanted to market it as being 86 % effective instead?
Exercise 7.10. Discuss the following. You have worked long and hard on a project.
In the end, to see if the project was a success, you run a hypothesis test to check if
two variables are correlated. You find that they are not (p = 0.15). However, if you
remove three outliers, the two variables are significantly correlated (p = 0.03). What
should you do? Does your answer change if you only have to remove one outlier to
get a significant result? If you have to remove ten outliers? 100 outliers? What if the
p-value is 0.051 before removing the outliers and 0.049 after removing the outliers?
Exercise 7.11. Discuss the following. You are analysing data from an experiment
to see if there is a difference between two treatments. You estimate4 that given the
sample size and the expected difference in treatment effects, the power of the test
that you’ll be using, i.e. the probability of rejecting the null hypothesis if it is false,
is about 15 %. Should you carry out such an analysis? If not, how high does the
power need to be for the analysis to be meaningful?
4 We’ll discuss methods for producing such estimates in Section 7.5.3.
7.4.2 Reproducibility
An analysis is reproducible if it can be reproduced by someone else. By producing
reproducible analyses, we make it easier for others to scrutinise our work. We also
make all the steps in the data analysis transparent. This can act as a safeguard
against data fabrication and data dredging.
In order to make an analysis reproducible, we need to provide at least two things.
First, the data - all unedited data files in their original format. This also includes
metadata with information required to understand the data (e.g. codebooks explain-
ing variable names and codes used for categorical variables). Second, the computer
code used to prepare and analyse the data. This includes any wrangling and prelimi-
nary testing performed on the data.
As long as we save our data files and code, data wrangling and analyses in R are
inherently reproducible, in contrast to the same tasks carried out in menu-based
software such as Excel. However, if reports are created using a word processor,
there is always a risk that something will be lost along the way. Perhaps numbers
are copied by hand (which may introduce errors), or maybe the wrong version of a
figure is pasted into the document. R Markdown (Section 4.1) is a great tool for
creating completely reproducible reports, as it allows you to integrate R code for
data wrangling, analyses, and graphics in your report-writing. This reduces the risk
of manually inserting errors, and allows you to share your work with others easily.
Exercise 7.12. Discuss the following. You are working on a study at a small-town
hospital. The data involves biomarker measurements for a number of patients, and
you show that patients with a sexually transmittable disease have elevated levels of
some of the biomarkers. The data also includes information about the patients: their
names, ages, ZIP codes, heights, and weights. The research team wants to publish
your results and make the analysis reproducible. Is it ethically acceptable to share
all your data? Can you make the analysis reproducible without violating patient
confidentiality?
Next, we’ll generate some data from a 𝑁 (0, 1) distribution and compute the three
estimates:
x <- rnorm(25)
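The three estimators compared here are the sample mean, the sample median, and the average of the largest and smallest observations; the max_min_avg helper below is a sketch whose name matches its use in the simulation later in this section:

```r
# The average of the largest and smallest observations:
max_min_avg <- function(x) { (max(x) + min(x))/2 }

x_mean <- mean(x)
x_median <- median(x)
x_mma <- max_min_avg(x)

x_mean; x_median; x_mma
```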
As you can see, the estimates given by the different approaches differ, so clearly the
choice of estimator matters. We can’t determine which to use based on a single
sample though. Instead, we typically compare the long-run properties of estimators,
such as their bias and variance. The bias is the difference between the mean of the
estimator and the parameter it seeks to estimate. An estimator is unbiased if its bias
is 0, which is considered desirable at least in this setting. Among unbiased estimators,
we prefer the one that has the smallest variance. So how can we use simulation to
compute the bias and variance of estimators?
The key to using simulation here is to realise that x_mean is an observation of the
random variable 𝑋̄ = (1/25)(𝑋₁ + 𝑋₂ + ⋯ + 𝑋₂₅), where each 𝑋ᵢ is 𝑁(0, 1)-distributed.
We can generate observations of 𝑋ᵢ (using rnorm), and can therefore also generate
observations of 𝑋̄. That means that we can obtain an arbitrarily large sample of
observations of 𝑋̄, which we can use to estimate its mean and variance. Here is an
example:
# Set the parameters for the normal distribution:
mu <- 0
sigma <- 1
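Before the loop can be run, we need somewhere to store the results; using B <- 1e4 simulated samples is my choice (any large number works):

```r
# Create a data frame to store the simulation results:
B <- 1e4
res <- data.frame(x_mean = numeric(B),
                  x_median = numeric(B),
                  x_mma = numeric(B))
```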
for(i in seq_along(res$x_mean))
{
      x <- rnorm(25, mu, sigma)
      res$x_mean[i] <- mean(x)
      res$x_median[i] <- median(x)
      res$x_mma[i] <- max_min_avg(x)
}

# Estimate the bias and the variance of each estimator:
colMeans(res) - mu
apply(res, 2, var)
All three estimators appear to be unbiased (even if the simulation results aren’t
exactly 0, they are very close). The sample mean has the smallest variance (and
is therefore preferable!), followed by the median. The (𝑥ₘₐₓ + 𝑥ₘᵢₙ)/2 estimator has the
worst performance, which is unsurprising as it ignores all information not contained
in the extremes of the dataset.
In Section 7.5.5 we’ll discuss how to choose the number of simulated samples to use
in your simulations. For now, we’ll just note that the estimate of the estimators’
biases becomes more stable as the number of simulated samples increases, as can be
seen from this plot, which utilises cumsum, described in Section 5.3.3:
# Compute estimates of the bias of the sample mean for each
# iteration:
res$iterations <- 1:B
res$x_mean_bias <- cumsum(res$x_mean)/1:B - mu
ggplot(res, aes(iterations, x_mean_bias)) +
      geom_line() +
      xlab("Number of iterations") +
      ylab("Estimated bias")
Exercise 7.13. Repeat the above simulation for different sample sizes 𝑛 between
10 and 100. Plot the resulting variances as a function of 𝑛.
Exercise 7.14. Repeat the simulation in 7.13, but with a 𝑡(3) distribution instead
of the normal distribution. Which estimator is better in this case?
for(i in 1:B)
{
# Generate data:
x <- distr(n1, ...)
y <- distr(n2, ...)
# Compute p-values:
p_values[i, 1] <- t.test(x, y,
alternative = alternative)$p.value
p_values[i, 2] <- perm.t.test(x, y,
alternative = alternative,
R = 999)$perm.p.value
p_values[i, 3] <- wilcox.test(x, y,
alternative = alternative)$p.value
close(pbar)
First, let’s try it with normal data. The simulation takes a little while to run,
primarily because of the permutation t-test, so you may want to take a short break
while you wait.
simulate_type_I(20, 20, rnorm, B = 9999)
Next, let’s try it with a lognormal distribution, both with balanced and imbalanced
sample sizes. Increasing the parameter 𝜎 (sdlog) increases the skewness of the
lognormal distribution (i.e. makes it more asymmetric and therefore less similar to
the normal distribution), so let’s try that too. In case you are in a rush, the results
from my run of this code block can be found below it.
simulate_type_I(20, 20, rlnorm, B = 9999, sdlog = 1)
simulate_type_I(20, 20, rlnorm, B = 9999, sdlog = 3)
simulate_type_I(20, 30, rlnorm, B = 9999, sdlog = 1)
simulate_type_I(20, 30, rlnorm, B = 9999, sdlog = 3)
My results were:
# Normal distribution, n1 = n2 = 20:
p_t_test p_perm_t_test p_wilcoxon
0.04760476 0.04780478 0.04680468
What’s noticeable here is that the permutation t-test and the Wilcoxon-Mann-
Whitney test have type I error rates that are close to the nominal 0.05 in all five
scenarios, whereas the t-test has too low a type I error rate when the data comes
from a lognormal distribution. This makes the test too conservative in this setting.
Next, let’s compare the power of the tests.
for(i in 1:B)
{
# Generate data:
x <- distr1(n1)
y <- distr2(n2)
# Compute p-values:
p_values[i, 1] <- t.test(x, y,
alternative = alternative)$p.value
p_values[i, 2] <- perm.t.test(x, y,
alternative = alternative,
R = 999)$perm.p.value
p_values[i, 3] <- wilcox.test(x, y,
alternative = alternative)$p.value
close(pbar)
# Return power:
return(colMeans(p_values < level))
}
Let’s try this out with lognormal data, where the difference in the log means is 1:
# Balanced sample sizes:
simulate_power(20, 20, function(n) { rlnorm(n,
meanlog = 2, sdlog = 1) },
function(n) { rlnorm(n,
meanlog = 1, sdlog = 1) },
B = 9999)
# Imbalanced sample sizes:
simulate_power(20, 30, function(n) { rlnorm(n,
                              meanlog = 2, sdlog = 1) },
               function(n) { rlnorm(n,
                              meanlog = 1, sdlog = 1) },
               B = 9999)
Exercise 7.15. Repeat the simulation study of type I error rate and power for the
old school t-test, permutation t-test and the Wilcoxon-Mann-Whitney test with 𝑡(3)-
distributed data. Which test has the best performance? How much lower is the type
I error rate of the old-school t-test compared to the permutation t-test in the case of
balanced sample sizes?
arguments rx and ry are used to pass functions used to generate the random numbers,
in line with the simulate_power function that we created above.
For the t-test, we can use sim.power.t.test:
library(MKpower)
sim.power.t.test(nx = 25, rx = rnorm, rx.H0 = rnorm,
ny = 25, ry = function(x) { rnorm(x, mean = 0.8) },
ry.H0 = rnorm)
determine the number of simulations that you need to obtain a confidence interval
that is short enough for you to feel that you have a good idea about the actual power
of the test.
As an example, if a small pilot simulation indicates that the power is about 0.8 and
you want a confidence interval with width 0.01, the number of simulations needed
can be computed as follows:
library(MKpower)
ssize.propCI(prop = 0.8, width = 0.01, method = "wilson")
In this case, you’d need 24,592 iterations to obtain the desired accuracy.
for(i in 1:B)
{
# Generate bivariate data:
x <- distr(n)
# Compute p-values:
p_values[i] <- cor.test(x[,1], x[,2], ...)$p.value
close(pbar)
To find the sample size we need, we will write a new function containing a while
loop (see Section 6.4.5), that performs the simulation for increasing values of 𝑛 until
the test has achieved the desired power:
library(MASS)
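Here is a sketch of what such a function could look like; the helper names, the use of MASS::mvrnorm to generate bivariate normal data, and the default settings are all assumptions on my part:

```r
library(MASS)

# Estimate the power of the Pearson correlation test for bivariate
# normal data with correlation rho, using B simulated datasets:
simulate_power_cor <- function(n, rho, B = 999, level = 0.05)
{
      Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
      p_values <- numeric(B)
      for(i in 1:B)
      {
            x <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
            p_values[i] <- cor.test(x[,1], x[,2])$p.value
      }
      mean(p_values < level)
}

# Increase n until the estimated power reaches the target:
find_n <- function(rho, target_power = 0.9, n_start = 10, B = 999)
{
      n <- n_start
      power <- 0
      while(power < target_power)
      {
            n <- n + 1
            power <- simulate_power_cor(n, rho, B)
      }
      n
}
```

For instance, find_n(0.5) would then search for the sample size needed to detect a true correlation of 0.5 with 90 % power.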
Exercise 7.16. Modify the functions we used to compute the sample sizes for the
Pearson correlation test to instead compute sample sizes for the Spearman correlation
tests. For bivariate normal data, are the required sample sizes lower or higher than
those of the Pearson correlation test?
Exercise 7.17. In Section 7.3.4 we had a look at some confidence intervals for
proportions, and saw how ssize.propCI can be used to compute sample sizes for such
intervals using asymptotic approximations. Write a function to compute the exact
sample size needed for the Clopper-Pearson interval to achieve a desired expected
(average) width. Compare your results to those from the asymptotic approximations.
Are the approximations good enough to be useful?
7.7 Bootstrapping
The bootstrap can be used for many things, most notably for constructing confidence
intervals and running hypothesis tests. These tend to perform better than traditional
parametric methods, such as the old-school t-test and its associated confidence inter-
val, when the distributional assumptions of the parametric methods aren’t met.
Confidence intervals and hypothesis tests are always based on a statistic, i.e. a quan-
tity that we compute from the samples. The statistic could be the sample mean,
a proportion, the Pearson correlation coefficient, or something else. In traditional
parametric methods, we start by assuming that our data follows some distribution.
For different reasons, including mathematical tractability, a common assumption is
that the data is normally distributed. Under that assumption, we can then derive
the distribution of the statistic that we are interested in analytically, like Gosset did
for the t-test. That distribution can then be used to compute confidence intervals
and p-values.
When using a bootstrap method, we follow the same steps, but use the observed data
and simulation instead. Rather than making assumptions about the distribution6 ,
we use the empirical distribution of the data. Instead of analytically deriving a
formula that describes the statistic’s distribution, we find a good approximation of
the distribution of the statistic by using simulation. We can then use that distribution
to obtain confidence intervals and p-values, just as in the parametric case.
The simulation step is important. We use a process known as resampling, where
we repeatedly draw new observations with replacement from the original sample.
We draw 𝐵 samples this way, each with the same size 𝑛 as the original sample.
Each randomly drawn sample - called a bootstrap sample - will include different
observations. Some observations from the original sample may appear more than
once in a specific bootstrap sample, and some not at all. For each bootstrap sample,
we compute the statistic in which we are interested. This gives us 𝐵 observations
of this statistic, which together form what is called the bootstrap distribution of the
statistic. I recommend using 𝐵 = 9,999 or greater, but we'll use smaller 𝐵 in some
examples, to speed up the computations.
6 Well, sometimes we make assumptions about the distribution and use the bootstrap. This is
To find the bootstrap distribution of the Pearson correlation coefficient, we can use
resampling with a for loop (Section 6.4.1):
# Extract the data that we are interested in:
mydata <- na.omit(msleep[,c("sleep_total", "brainwt")])
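The resampling loop itself can then be written as follows (𝐵 and the variable names are our own choices):

```r
# Draw B bootstrap samples and compute the statistic for each:
B <- 999
boot_cor <- vector("numeric", B)
for(i in 1:B)
{
    # Draw row numbers with replacement:
    rows <- sample(nrow(mydata), nrow(mydata), replace = TRUE)
    # Compute the statistic for this bootstrap sample:
    boot_cor[i] <- cor(mydata[[1]][rows], mydata[[2]][rows])
}
# The B values in boot_cor form the bootstrap distribution:
hist(boot_cor)
```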
Because this is such a common procedure, there are R packages that let us do
resampling without having to write a for loop. In the remainder of the section, we
will use the boot package to draw bootstrap samples. It also contains convenience
functions that allow us to get confidence intervals from the bootstrap distribution
quickly. Let's install it:
install.packages("boot")
The most important function in this package is boot, which does the resampling. As
input, it takes the original data, the number 𝐵 of bootstrap samples to draw (called
R here), and a function that computes the statistic of interest. This function should
take the original data (mydata in our example above) and the row numbers of the
sampled observations for a particular bootstrap sample (row_numbers in our example)
as input.
For the correlation coefficient, the function that we input can look like this:
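A function of this form (the body below is our reconstruction; the method argument is an optional extra that lets us switch between correlation coefficients) would be:

```r
cor_boot <- function(data, row_numbers, method = "pearson")
{
    # Obtain the bootstrap sample:
    sample <- data[row_numbers,]
    # Compute and return the statistic:
    cor(sample[[1]], sample[[2]], method = method)
}
```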
To get the bootstrap distribution of the Pearson correlation coefficient for our data,
we can now use boot as follows:
library(boot)
# Base solution:
boot_res <- boot(na.omit(msleep[,c("sleep_total", "brainwt")]),
cor_boot,
999)
Next, we can plot the bootstrap distribution of the statistic computed in cor_boot:
plot(boot_res)
If you prefer, you can of course use a pipeline for the resampling instead:
library(boot)
library(dplyr)
library(tidyr)   # drop_na comes from tidyr
# With pipes:
msleep %>% select(sleep_total, brainwt) %>%
    drop_na() %>%
    boot(cor_boot, 999) -> boot_res
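We can then compute confidence intervals based on the bootstrap distribution using the boot.ci convenience function:

```r
library(boot)

# Compute bootstrap confidence intervals from the boot object
# created above:
boot.ci(boot_res)
```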
Four intervals are presented: normal, basic, percentile and BCa. The details con-
cerning how these are computed based on the bootstrap distribution are presented in
Section 12.1. It is generally agreed that the percentile and BCa intervals are prefer-
able to the normal and basic intervals; see e.g. Davison & Hinkley (1997) and Hall
(1992); but which performs the best varies.
We also receive a warning message:
7.7. BOOTSTRAPPING 287
Warning message:
In boot.ci(boot_res) : bootstrap variances needed for studentized
intervals
A fifth type of confidence interval, the studentised interval, requires bootstrap esti-
mates of the standard error of the test statistic. These are obtained by running an
inner bootstrap, i.e. by bootstrapping each bootstrap sample to get estimates of the
variance of the test statistic. Let’s create a new function that does this, and then
compute the bootstrap confidence intervals:
cor_boot_student <- function(data, i, method = "pearson")
{
    sample <- data[i,]
    # Compute the statistic:
    correlation <- cor(sample[[1]], sample[[2]], method = method)
    # Inner bootstrap to estimate the variance of the statistic:
    inner_boot <- boot(sample, cor_boot, 100)
    variance <- var(inner_boot$t)
    return(c(correlation, variance))
}
library(ggplot2)
library(boot)
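The resampling and the interval computation can then be carried out as before (a sketch reusing the data preparation from earlier in the section):

```r
mydata <- na.omit(msleep[,c("sleep_total", "brainwt")])
boot_res <- boot(mydata, cor_boot_student, 999)
# This time, the studentised interval ("stud") is included:
boot.ci(boot_res)
```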
While theoretically appealing (Hall, 1992), studentised intervals can be a little erratic
in practice. I prefer to use percentile and BCa intervals instead.
For two-sample problems, we need to make sure that the number of observations
drawn from each sample is the same as in the original data. The strata argument
in boot is used to achieve this. Let’s return to the example studied in Section 7.2,
concerning the difference in how long carnivores and herbivores sleep. Let’s say that
we want a confidence interval for the difference of two means, using the msleep data.
The simplest approach is to create a Welch-type interval, where we allow the two
populations to have different variances. We can then resample from each population
separately:
# Function that computes the difference in means between the groups:
mean_diff_msleep <- function(data, i)
{
sample1 <- subset(data[i, 1], data[i, 2] == "carni")
sample2 <- subset(data[i, 1], data[i, 2] == "herbi")
return(mean(sample1[[1]]) - mean(sample2[[1]]))
}
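We can then run the resampling, using strata to make boot draw the same number of observations from each group as in the original data (the data preparation below is our own):

```r
library(boot)

# Keep only carnivores and herbivores:
mydata <- msleep[msleep$vore %in% c("carni", "herbi"),
                 c("sleep_total", "vore")]

# strata ensures that each group keeps its original size
# in every bootstrap sample:
boot_res <- boot(mydata, mean_diff_msleep, 999,
                 strata = factor(mydata$vore))
boot.ci(boot_res, type = "perc")
```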
Exercise 7.18. Let’s continue the example with a confidence interval for the differ-
ence in how long carnivores and herbivores sleep. How can you create a confidence
interval under the assumption that the two groups have equal variances?
• The p-value of the test for the parameter 𝜃 is the smallest 𝛼 such that 𝜃 is not
contained in the corresponding 1 − 𝛼 confidence interval.
• For a test for the parameter 𝜃 with significance level 𝛼, the set of values of 𝜃
that aren’t rejected by the test (when used as the null hypothesis) is a 1 − 𝛼
confidence interval for 𝜃.
Here is an example of how we can use a while loop (Section 6.4.5) for confidence
interval inversion, in order to test the null hypothesis that the Pearson correlation be-
tween sleeping time and brain weight is 𝜌 = −0.2. It uses the studentised confidence
interval that we created in the previous section:
# Compute the studentised confidence interval:
cor_boot_student <- function(data, i, method = "pearson")
{
    sample <- data[i,]
    # Compute the statistic:
    correlation <- cor(sample[[1]], sample[[2]], method = method)
    # Inner bootstrap to estimate the variance of the statistic:
    inner_boot <- boot(sample, cor_boot, 100)
    variance <- var(inner_boot$t)
    return(c(correlation, variance))
}
library(ggplot2)
library(boot)
# Check if the null value for rho is greater than the lower
# interval bound and smaller than the upper interval bound,
# i.e. if it is contained in the interval:
in_interval <- rho_null > interval[1] & rho_null < interval[2]
}
# The loop will finish as soon as it reaches a value of alpha such
# that rho_null is not contained in the interval.
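Pieced together, the inversion can be sketched as follows (the step size for alpha and the starting values are our own choices; boot_res is assumed to be a bootstrap object created with cor_boot_student as above):

```r
rho_null <- -0.2
alpha <- 0
in_interval <- TRUE
while(in_interval)
{
    alpha <- alpha + 0.001
    # Studentised interval with confidence level 1 - alpha:
    interval <- boot.ci(boot_res, conf = 1 - alpha,
                        type = "stud")$student[4:5]
    # Check if the null value for rho is contained in the interval:
    in_interval <- rho_null > interval[1] & rho_null < interval[2]
}
# The p-value is the smallest alpha for which rho_null
# falls outside the interval:
alpha
```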
Confidence interval inversion fails in spectacular ways for certain tests for parameters
of discrete distributions (Thulin & Zwanzig, 2017), so be careful if you plan on using
this approach with count data.
Exercise 7.19. With the data from Exercise 7.18, invert a percentile confidence
interval to compute the p-value of the corresponding test of the null hypothesis that
there is no difference in means. What are the results?
library(MASS)
generate_data <- function(data, mle)
{
return(mvrnorm(nrow(data), mle[[1]], mle[[2]]))
}
library(ggplot2)
library(boot)
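The parametric resampling can then be run as follows. Note that with sim = "parametric" the statistic function takes only the data as input, so we cannot reuse cor_boot here; the statistic function and the contents of the mle list below are our assumptions:

```r
# With sim = "parametric", the statistic takes only the data:
cor_par <- function(data) { cor(data[,1], data[,2]) }

mydata <- as.data.frame(na.omit(msleep[,c("sleep_total", "brainwt")]))

boot_res <- boot(mydata, cor_par, 999,
                 sim = "parametric",
                 ran.gen = generate_data,
                 mle = list(colMeans(mydata), cov(mydata)))
boot.ci(boot_res, type = "perc")
```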
The BCa interval implemented in boot.ci is not valid for parametric bootstrap
samples, so running boot.ci(boot_res) without specifying the interval type will
render an error7 . Percentile intervals work just fine, though.
7 If you really need a BCa interval for the parametric bootstrap, you can find the formulas for it
If the purpose of your study is to describe differences between groups, you should
present a confidence interval for the difference between the groups, rather than one
confidence interval (or error bar) for each group. It is possible for the individual
confidence intervals to overlap even if there is a significant difference between the
two groups, so reporting group-wise confidence intervals will only lead to confusion.
If you are interested in the difference, then of course the difference is what you should
report a confidence interval for.
BAD: There was no significant difference between the sleeping times
of carnivores (mean 10.4, 95 % percentile bootstrap confidence interval:
8.4-12.5) and herbivores (mean 9.5, 95 % percentile bootstrap confidence
interval: 8.1-12.6).
GOOD: There was no significant difference between the sleeping times
of carnivores (mean 10.4) and herbivores (mean 9.5), with the 95 % per-
centile bootstrap confidence interval for the difference being (-1.8, 3.5).
To get the citation and version information for a package, use citation and
packageVersion as follows:
citation("ggplot2")
packageVersion("ggplot2")
Chapter 8
Regression models
Regression models, in which explanatory variables are used to model the behaviour
of a response variable, are without a doubt the most commonly used class of models
in the statistical toolbox. In this chapter, we will have a look at different types of
regression models tailored to many different sorts of data and applications.
After reading this chapter, you will be able to use R to:
• Fit and evaluate linear and generalised linear models,
• Fit and evaluate mixed models,
• Fit survival analysis models,
• Analyse data with left-censored observations,
• Create matched samples.
296 CHAPTER 8. REGRESSION MODELS
library(ggplot2)
ggplot(mtcars, aes(hp, mpg)) +
geom_point()
We had a look at some diagnostic plots given by applying plot to our fitted model
m:
plot(m)
Finally, we added another variable, the car weight wt, to the model:
m <- lm(mpg ~ hp + wt, data = mtcars)
summary(m)
Next, we’ll look at what more R has to offer when it comes to regression. Before
that though, it’s a good idea to do a quick exercise to make sure that you remember
how to fit linear models.
8.1. LINEAR MODELS 297
Exercise 8.1. The sales-weather.csv data from Section 5.12 describes the
weather in a region during the first quarter of 2020. Download the file from the
book’s web page. Fit a linear regression model with TEMPERATURE as the response
variable and SUN_HOURS as an explanatory variable. Plot the results. Is there a
connection?
You’ll return to and expand this model in the next few exercises, so make sure to
save your code.
Exercise 8.2. Fit a linear model to the mtcars data using the formula mpg ~ ..
What happens? What is ~ . a shorthand for?
Alternatively, to include the main effects of hp and wt along with the interaction
effect, we can use hp*wt as a shorthand for hp + wt + hp:wt to write the model
formula more concisely:
m <- lm(mpg ~ hp*wt, data = mtcars)
summary(m)
Note how only two categories, 6 cylinders and 8 cylinders, are shown in the summary
table. The third category, 4 cylinders, corresponds to both those dummy variables
being 0. Therefore, the coefficient estimates for cyl6 and cyl8 are relative to the
remaining reference category cyl4. For instance, compared to cyl4 cars, cyl6 cars
have a higher fuel consumption, with their mpg being 1.26 lower.
We can control which category is used as the reference category by setting the order
of the factor variable, as in Section 5.4. The first factor level is always used as the
reference, so if for instance we want to use cyl6 as our reference category, we’d do
the following:
# Make cyl a categorical variable with cyl6 as
# the reference category:
mtcars$cyl <- factor(mtcars$cyl, levels =
c(6, 4, 8))
Dummy variables are frequently used for modelling differences between different
groups. Including only the dummy variable corresponds to using different inter-
cepts for different groups. If we also include an interaction with the dummy variable,
we can get different slopes for different groups. Consider the model

𝐸(𝑦𝑖 ) = 𝛽0 + 𝛽1 𝑥𝑖1 + 𝛽2 𝑥𝑖2 + 𝛽12 𝑥𝑖1 𝑥𝑖2 ,

where 𝑥1 is numeric and 𝑥2 is a dummy variable. Then the intercept and slope
change depending on the value of 𝑥2 as follows:
𝐸(𝑦𝑖 ) = 𝛽0 + 𝛽1 𝑥𝑖1 , if 𝑥2 = 0,
𝐸(𝑦𝑖 ) = (𝛽0 + 𝛽2 ) + (𝛽1 + 𝛽12 )𝑥𝑖1 , if 𝑥2 = 1.
This yields a model where the intercept and slope differ between the two groups
that 𝑥2 represents.
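As a concrete illustration (our own example using mtcars, not taken from the book's exercises), we can let the transmission type am act as the dummy variable, giving each transmission type its own intercept and slope:

```r
# Different intercepts and slopes for automatic (am = 0) and
# manual (am = 1) cars:
m <- lm(mpg ~ wt * factor(am), data = mtcars)
summary(m)
```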
Exercise 8.3. Return to the weather model from Exercise 8.1. Create a dummy
variable for precipitation (zero precipitation or non-zero precipitation) and add it to
your model. Also include an interaction term between the precipitation dummy and
the number of sun hours. Are any of the coefficients significantly non-zero?
We could also plot the observed values against the fitted values:
n <- nrow(mtcars)
models <- data.frame(Observed = rep(mtcars$mpg, 2),
Fitted = c(predict(m1), predict(m2)),
Model = rep(c("Model 1", "Model 2"), c(n, n)))
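The plot itself can then be drawn like this (a sketch; m1 and m2 are assumed to be two previously fitted models used to build the models data frame above):

```r
library(ggplot2)
ggplot(models, aes(Fitted, Observed, colour = Model)) +
    geom_point() +
    # Points close to this line indicate a good fit:
    geom_abline(intercept = 0, slope = 1)
```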
Linear models are fitted and analysed using a number of assumptions, most of which
are assessed by looking at plots of the model residuals, 𝑦𝑖 − 𝑦𝑖̂ , where 𝑦𝑖̂ is the fitted
value for observation 𝑖. Some important assumptions are:
• The model is linear in the parameters: we check this by looking for non-linear
patterns in the residuals, or in the plot of observed against fitted values.
• The observations are independent: this can be difficult to assess visually.
We’ll look at models that are designed to handle correlated observations in
Sections 8.4 and 9.6.
• Homoscedasticity: that the random errors all have the same variance. We
check this by looking for non-constant variance in the residuals. The opposite
of homoscedasticity is heteroscedasticity.
• Normally distributed random errors: this assumption is important if we want
to use the traditional parametric p-values, confidence intervals and prediction
intervals. If we use permutation p-values or bootstrap intervals (as we will later
in this chapter), we no longer need this assumption.
Additionally, residual plots can be used to find influential points that (possibly) have
a large impact on the model coefficients (influence is measured using Cook’s distance
and potential influence using leverage). We’ve already seen that we can use plot(m)
to create some diagnostic plots. To get more and better-looking plots, we can use
the autoplot function for lm objects from the ggfortify package:
library(ggfortify)
autoplot(m1, which = 1:6, ncol = 2, label.size = 3)
• Residuals vs Fitted: look for patterns in the residuals; the trend curve should be
close to a straight line (in this case, it isn't perfectly straight, which could indicate a
mild non-linearity).
• Normal Q-Q: see if the points follow the line, which would indicate that the
residuals (which we for this purpose can think of as estimates of the random
errors) follow a normal distribution.
• Scale-Location: similar to the residuals versus fitted plot, this plot shows
whether the residuals are evenly spread for different values of the fitted val-
ues. Look for patterns in how much the residuals vary - if they e.g. vary more
for large fitted values, then that is a sign of heteroscedasticity. A horizontal
blue line is a sign of homoscedasticity.
• Cook’s distance: look for points with high values. A commonly-cited rule-of-
thumb (Cook & Weisberg, 1982) says that values above 1 indicate points with
a high influence.
• Residuals versus leverage: look for points with a high residual and high leverage.
Observations with a high residual but low leverage deviate from the fitted model
but don’t affect it much. Observations with a high residual and a high leverage
likely have a strong influence on the model fit, meaning that the fitted model
could be quite different if these points were removed from the dataset.
• Cook’s distance versus leverage: look for observations with a high Cook’s dis-
tance and a high leverage, which are likely to have a strong influence on the
model fit.
A formal test for heteroscedasticity, the Breusch-Pagan test, is available in the car
package as a complement to graphical inspection. A low p-value indicates statistical
evidence for heteroscedasticity. To run the test, we use ncvTest (where “ncv” stands
for non-constant variance):
install.packages("car")
library(car)
ncvTest(m1)
In this case, there are some highly correlated pairs, hp and disp among them. As a
numerical measure of collinearity, we can use the generalised variance inflation factor
(GVIF), given by the vif function in the car package:
library(car)
m <- lm(mpg ~ ., data = mtcars)
vif(m)
A high GVIF indicates that a variable is highly correlated with the other explanatory
variables in the dataset. Recommendations for what counts as a “high GVIF” vary,
from 2.5 to 10 or more.
You can mitigate problems related to multicollinearity by:
• Removing one or more of the correlated variables from the model (because they
are strongly correlated, they measure almost the same thing anyway!),
• Centring your explanatory variables (particularly if you include polynomial
terms),
• Using a regularised regression model (which we’ll do in Section 9.4).
Exercise 8.4. Below are two simulated datasets. One exhibits a nonlinear depen-
dence between the variables, and the other exhibits heteroscedasticity. Fit a model
with y as the response variable and x as the explanatory variable for each dataset,
and make some residual plots. Which dataset suffers from which problem?
Exercise 8.5. We continue our investigation of the weather models from Exercises
8.1 and 8.3.
1. Plot the observed values against the fitted values for the two models that you’ve
fitted. Does either model seem to have a better fit?
2. Create residual plots for the second model from Exercise 8.3. Are there any
influential points? Any patterns? Any signs of heteroscedasticity?
8.1.5 Transformations
If your data displays signs of heteroscedasticity or non-normal residuals, you can
sometimes use a Box-Cox transformation (Box & Cox, 1964) to mitigate those prob-
lems. The Box-Cox transformation is applied to your dependent variable 𝑦. What
it looks like is determined by a parameter 𝜆. The transformation is defined as
(𝑦𝑖^𝜆 − 1)/𝜆 if 𝜆 ≠ 0 and ln(𝑦𝑖 ) if 𝜆 = 0. 𝜆 = 1 corresponds to no transformation at all. The
boxcox function in MASS is useful for finding an appropriate choice of 𝜆. Choose a 𝜆
that is close to the peak (inside the interval indicated by the outer dotted lines) of
the curve plotted by boxcox:
m <- lm(mpg ~ hp + wt, data = mtcars)
library(MASS)
boxcox(m)
# Fit a model with a log-transformed response (𝜆 = 0):
m_bc <- lm(log(mpg) ~ hp + wt, data = mtcars)
library(ggfortify)
autoplot(m_bc, which = 1:6, ncol = 2, label.size = 3)
The model fit seems to have improved after the transformation. The downside is
that we are now modelling the log of mpg rather than mpg itself, which makes the
model coefficients a little difficult to interpret.
Exercise 8.6. Run boxcox with your model from Exercise 8.3. Does it indicate that
a transformation can be useful for your model?
8.1.6 Alternatives to lm
Non-normal regression errors can sometimes be an indication that you need to trans-
form your data, that your model is missing an important explanatory variable, that
there are interaction effects that aren’t accounted for, or that the relationship be-
tween the variables is non-linear. But sometimes, you get non-normal errors simply
because the errors are non-normal.
The p-values reported by summary are computed under the assumption of normally
distributed regression errors, and can be sensitive to deviations from normality. An
alternative is to use the lmp function from the lmPerm package, which provides per-
mutation test p-values instead. This doesn’t affect the model fitting in any way - the
only difference is how the p-values are computed. Moreover, the syntax for lmp is
identical to that of lm:
# First, install lmPerm:
install.packages("lmPerm")
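Since the syntax is identical to that of lm, fitting the model is straightforward (here we reuse the earlier mtcars example):

```r
library(lmPerm)

# Same formula as with lm, but the p-values in the summary
# are computed using permutation tests:
m <- lmp(mpg ~ hp + wt, data = mtcars)
summary(m)
```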
In some cases, you need to change the arguments of lmp to get reliable p-values.
We’ll have a look at that in Exercise 8.12. Relatedly, in Section 8.1.7 we’ll see how
to construct bootstrap confidence intervals for the parameter estimates.
Another option that does affect the model fitting is to use a robust regression model
based on M-estimators. Such models tend to be less sensitive to outliers, and can be
useful if you are concerned about the influence of deviating points. The rlm function
in MASS is used for this. As was the case for lmp, the syntax for rlm is identical to
that of lm:
library(MASS)
m <- rlm(mpg ~ hp + wt, data = mtcars)
summary(m)
Another option is to use Bayesian estimation, which we’ll discuss in Section 8.1.13.
Exercise 8.7. Refit your model from Exercise 8.3 using lmp. Are the two main
effects still significant?
Confidence intervals for the model coefficients can be obtained with confint:
confint(m)
I usually prefer to use bootstrap confidence intervals, which we can obtain using boot
and boot.ci, as we’ll do next. Note that the only random part in the linear model
is the error term 𝜖𝑖 . In most cases, it is therefore this term (and this term only) that
we wish to resample. The explanatory variables should remain constant throughout
the resampling process; the inference is conditioned on the values of the explanatory
variables.
To achieve this, we’ll resample from the model residuals, and add those to the values
predicted by the fitted function, which creates new bootstrap values of the response
variable. We’ll then fit a linear model to these values, from which we obtain obser-
vations from the bootstrap distribution of the model coefficients.
It turns out that the bootstrap performs better if we resample not from the original
residuals 𝑒1 , … , 𝑒𝑛 , but from scaled and centred residuals 𝑟𝑖 − 𝑟,̄ where each 𝑟𝑖 is a
scaled version of residual 𝑒𝑖 , scaled using the leverage ℎ𝑖 :

𝑟𝑖 = 𝑒𝑖 /√(1 − ℎ𝑖 ),
see Chapter 6 of Davison & Hinkley (1997) for details. The leverages can be computed
using lm.influence.
We implement this procedure in the code below (and will then have a look at conve-
nience functions that help us achieve the same thing more easily). It makes use of
formula, which can be used to extract the model formula from regression models:
library(boot)
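A sketch of the procedure (our reconstruction; names and details may differ from the book's own code):

```r
library(boot)

m <- lm(mpg ~ hp + wt, data = mtcars)

# Scaled, centred residuals:
h <- lm.influence(m)$hat
r <- residuals(m) / sqrt(1 - h)
r <- r - mean(r)

coef_boot <- function(data, i, fit = m, res = r)
{
    # Add resampled residuals to the fitted values to create
    # new bootstrap values of the response variable:
    data[, all.vars(formula(fit))[1]] <- predict(fit) + res[i]
    # Refit the model and return the coefficients:
    coef(lm(formula(fit), data = data))
}

boot_res <- boot(mtcars, coef_boot, 999)
# Percentile interval for hp (row 2 of the summary table):
boot.ci(boot_res, type = "perc", index = 2)
```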
The argument index in boot.ci should be the row number of the parameter in the
table given by summary. The intercept is on the first row, and so its index is 1, hp
is on the second row and its index is 2, and so on.
Clearly, the above code is a little unwieldy. Fortunately, the car package contains
a function called Boot that can be used to bootstrap regression models in the exact
same way:
library(car)
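For example (a sketch using Boot's residual resampling; the argument values are our own choices):

```r
library(car)

m <- lm(mpg ~ hp + wt, data = mtcars)
# Resample residuals, as in the code above:
boot_res <- Boot(m, R = 999, method = "residual")
# Percentile intervals for all coefficients:
confint(boot_res, type = "perc")
```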
Finally, the most convenient approach is to use boot_summary from the boot.pval
package. It provides a data frame with estimates, bootstrap confidence intervals,
and bootstrap p-values (computed using interval inversion) for the model coefficients.
The arguments specify what interval type and resampling strategy to use (more on
the latter in Exercise 8.9):
library(boot.pval)
boot_summary(m, type = "perc", method = "residual", R = 9999)
Exercise 8.8. Refit your model from Exercise 8.3 using a robust regression estimator
with rlm. Compute confidence intervals for the coefficients of the robust regression
model.
Exercise 8.9. In an alternative bootstrap scheme for regression models, often re-
ferred to as case resampling, the observations (or cases) (𝑦𝑖 , 𝑥𝑖1 , … , 𝑥𝑖𝑝 ) are resampled
instead of the residuals. This approach can be applied when the explanatory vari-
ables can be treated as being random (but measured without error) rather than fixed.
It can also be useful for models with heteroscedasticity, as it doesn’t rely on assump-
tions about constant variance (which, on the other hand, makes it less efficient if the
errors actually are homoscedastic).
Read the documentation for boot_summary to see how you can compute confidence in-
tervals for the coefficients in the model m <- lm(mpg ~ hp + wt, data = mtcars)
using case resampling. Do they differ substantially from those obtained using residual
resampling in this case?
How can we access the information about the model? For instance, we may want to
get the summary table from summary, but as a data frame rather than as printed
text. Here are two ways of doing this, using summary and the tidy function from
broom:
# Using base R:
summary(m)$coefficients
# Using broom:
library(broom)
tidy(m)
tidy is the better option if you want to retrieve the table as part of a pipeline.
For instance, if you want to adjust the p-values for multiplicity using Bonferroni
correction (Section 7.2.5), you could do as follows:
library(magrittr)
mtcars %>%
lm(mpg ~ hp + wt, data = .) %>%
tidy() %$%
p.adjust(p.value, method = "bonferroni")
If you prefer bootstrap p-values, you can use boot_summary from boot.pval similarly.
That function also includes an argument for adjusting the p-values for multiplicity:
library(boot.pval)
lm(mpg ~ hp + wt, data = mtcars) %>%
boot_summary(adjust.method = "bonferroni")
Another useful function in broom is glance, which lets us get some summary statistics
about the model:
glance(m)
Finally, augment can be used to add predicted values, residuals, and Cook’s distances
to the dataset used for fitting the model, which of course can be very useful for model
diagnostics:
# To get the data frame with predictions and residuals added:
augment(m)
If your main interest is prediction, then that is a completely different story. For
predictive models, it is usually recommended that variable selection and model fitting
should be done simultaneously. This can be done using regularised regression models,
to which Section 9.4 is devoted.
8.1.10 Prediction
An important use of linear models is prediction. In R, this is done using predict.
By providing a fitted model and a new dataset, we can get predictions.
Let’s use one of the models that we fitted to the mtcars data to make predictions for
two cars that aren’t from the 1970’s. Below, we create a data frame with data for a
2009 Volvo XC90 D3 AWD (with a fuel consumption of 29 mpg) and a 2019 Ferrari
Roma (15.4 mpg):
new_cars <- data.frame(hp = c(161, 612), wt = c(4.473, 3.462),
row.names = c("Volvo XC90", "Ferrari Roma"))
To get the model predictions for these new cars, we run the following:
predict(m, new_cars)
predict also lets us obtain prediction intervals for our prediction, under the as-
sumption of normality3 . To get 90 % prediction intervals, we add interval =
"prediction" and level = 0.9:
m <- lm(mpg ~ hp + wt, data = mtcars)
predict(m, new_cars,
interval = "prediction",
level = 0.9)
The lmp function that we used to compute permutation p-values does not offer con-
fidence intervals. We can however compute bootstrap prediction intervals using the
code below. Prediction intervals try to capture two sources of uncertainty:
3 Prediction intervals provide interval estimates for the new observations. They incorporate both
the uncertainty associated with our model estimates, and the fact that the new observation is likely
to deviate slightly from its expected value.
• Model uncertainty, which we will capture by resampling the data and making
predictions for the expected value of the observation,
• Random noise, i.e. that almost all observations deviate from their expected
value. We will capture this by resampling residuals from the fitted bootstrap
models.
Consequently, the value that we generate in each bootstrap replication will be the
sum of a prediction and a resampled residual (see Davison & Hinkley (1997), Section
6.3, for further details):
boot_pred <- function(data, new_data, model, i,
                      formula, predictions, residuals){
    # Resample residuals and fit new model:
    data[,all.vars(formula)[1]] <- predictions + residuals[i]
    m_boot <- lm(formula, data = data)
    # Return the prediction plus a randomly drawn residual:
    predict(m_boot, newdata = new_data) + sample(residuals, 1)
}
library(boot)
Exercise 8.10. Use your model from Exercise 8.3 to compute a bootstrap prediction
interval for the temperature on a day with precipitation but no sun hours.
We’ll make use of this approach when we study linear mixed models in Section 8.4.
8.1.12 ANOVA
Linear models are also used for analysis of variance (ANOVA) models to test whether
there are differences among the means of different groups. We’ll use the mtcars data
to give some examples of this. Let’s say that we want to investigate whether the mean
fuel consumption (mpg) of cars differs depending on the number of cylinders (cyl),
and that we want to include the type of transmission (am) as a blocking variable.
To get an ANOVA table for this problem, we must first convert the explanatory
variables to factor variables, as the variables in mtcars are all numeric (despite some
of them being categorical). We can then use aov to fit the model, and then summary:
# Convert variables to factors:
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am)
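We can then fit the model and print the ANOVA table (the formula below is our reading of the description above, with cyl as explanatory variable and am as blocking variable):

```r
# Fit the model and print the ANOVA table:
m <- aov(mpg ~ cyl + am, data = mtcars)
summary(m)
```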
(aov actually uses lm to fit the model, but by using aov we specify that we want an
ANOVA table to be printed by summary.)
When there are different numbers of observations in the groups in an ANOVA, so
that we have an unbalanced design, the sums of squares used to compute the test
statistics can be computed in at least three different ways, commonly called type I,
II and III. See Herr (1986) for an overview and discussion of this.
summary prints a type I ANOVA table, which isn’t the best choice for unbalanced
designs. We can however get type II or III tables by instead using Anova from the
car package to print the table:
library(car)
Anova(m, type = "II")
Anova(m, type = "III") # Default in SAS and SPSS.
As a guideline, for unbalanced designs, you should use type II tables if there are no
interactions, and type III tables if there are interactions. To look for interactions, we
can use interaction.plot to create a two-way interaction plot:
interaction.plot(mtcars$am, mtcars$cyl, response = mtcars$mpg)
In this case, there is no sign of an interaction between the two variables, as the lines
are more or less parallel. A type II table is therefore probably the best choice here.
We can obtain diagnostic plots the same way we did for other linear models:
library(ggfortify)
autoplot(m, which = 1:6, ncol = 2, label.size = 3)
To find which groups that have significantly different means, we can use a post hoc
test like Tukey’s HSD, available through the TukeyHSD function:
TukeyHSD(m)
We can visualise the results of Tukey’s HSD with plot, which shows 95 % confidence
intervals for the mean differences:
# When the difference isn't significant, the dashed line indicating
# "no differences" falls within the confidence interval for
# the difference:
plot(TukeyHSD(m, "am"))
Exercise 8.11. Return to the residual plots that you created with autoplot. Figure
out how you can plot points belonging to different cyl groups in different colours.
Exercise 8.12. The aovp function in the lmPerm package can be utilised to perform
permutation tests instead of the classical parametric ANOVA tests. Rerun the anal-
ysis in the example above, using aovp instead. Do the conclusions change? What
happens if you run your code multiple times? Does using summary on a model fitted
using aovp generate a type I, II or III table by default? Can you change what type
of table it produces?
Exercise 8.13. In the case of a one-way ANOVA (i.e. ANOVA with a single explana-
tory variable), the Kruskal-Wallis test can be used as a nonparametric option. It is
available in kruskal.test. Use the Kruskal-Wallis test to run a one-way ANOVA
for the mtcars data, with mpg as the response variable and cyl as an explanatory
variable.
Finally, we can use 𝑅̂ to check model convergence. It should be less than 1.1 if the
fitting has converged:
plot(m, "rhat")
Like for lm, residuals(m) provides the model residuals, which can be used for diag-
nostics. For instance, we can plot the residuals against the fitted values to look for
signs of non-linearity, adding a curve to aid the eye:
model_diag <- data.frame(Fitted = predict(m),
Residual = residuals(m))
library(ggplot2)
ggplot(model_diag, aes(Fitted, Residual)) +
geom_point() +
geom_smooth(se = FALSE)
For ANOVA models, we can instead use stan_aov with the argument prior = R2(location = 0.5).
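A sketch of such a call, using the mtcars ANOVA from earlier in the chapter (the formula is an assumption based on that example):

```r
library(rstanarm)
# Bayesian ANOVA; formula assumed from the earlier mtcars example:
m <- stan_aov(mpg ~ factor(am) + factor(cyl), data = mtcars,
              prior = R2(location = 0.5))
```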
Exercise 8.14. Discuss the following. You are tasked with analysing a study on
whether Vitamin D protects against the flu. One group of patients are given Vi-
tamin D supplements, and one group is given a placebo. You plan on fitting a
regression model to estimate the effect of the vitamin supplements, but note that
some confounding factors that you have reason to believe are of importance, such as
age and ethnicity, are missing from the data. You can therefore not include them as
explanatory variables in the model. Should you still fit the model?
Exercise 8.15. Discuss the following. You are fitting a linear regression model to a
dataset from a medical study on a new drug which potentially can have serious side
effects. The test subjects take a risk by participating in the study. Each observation
in the dataset corresponds to a test subject. Like all ordinary linear regression
models, your model gives more weight to observations that deviate from the average
(and have a high leverage or Cook’s distance). Given the risks involved for the test
subjects, is it fair to give different weight to data from different individuals? Is it
OK to remove outliers because they influence the results too much, meaning that the
risk that the subject took was for nought?
log(𝜋𝑖/(1 − 𝜋𝑖)) = 𝛽0 + 𝛽1𝑥𝑖1 + ⋯ + 𝛽𝑝𝑥𝑖𝑝,   𝑖 = 1, …, 𝑛
In linear regression models, we model the expected value of the response variable as a linear function of the explanatory variables. Here, we instead model a function of the expected value of the response variable (that is, a function of 𝜋𝑖) as a linear function of the explanatory variables. In GLM terminology, this function is known as a link function.
Logistic regression models can be fitted using the glm function. To specify what our
model is, we use the argument family = binomial:
m <- glm(type ~ pH + alcohol, data = wine, family = binomial)
summary(m)
The p-values presented in the summary table are based on a Wald test, which is known to have poor performance unless the sample size is very large (Agresti, 2013). In this
case, with a sample size of 6,497, it is probably safe to use, but for smaller sample
sizes, it is preferable to use a bootstrap test instead, which you will do in Exercise
8.18.
The coefficients of a logistic regression model aren’t as straightforward to interpret
as those in a linear model. If we let 𝛽 denote a coefficient corresponding to an
explanatory variable 𝑥, then:
• If 𝛽 is positive, then 𝜋𝑖 increases when 𝑥𝑖 increases.
• If 𝛽 is negative, then 𝜋𝑖 decreases when 𝑥𝑖 increases.
• 𝑒𝛽 is the odds ratio, which shows how much the odds 𝜋𝑖/(1 − 𝜋𝑖) change when 𝑥𝑖 is increased one step.
We can extract the coefficients and odds ratios using coef:
coef(m) # Coefficients, beta
exp(coef(m)) # Odds ratios
To find the fitted probability that an observation belongs to the second class we can
use predict(m, type = "response"):
# Check which class is the second one:
levels(wine$type)
# "white" is the second class!
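For instance, we can cross-tabulate predicted classes against the actual wine types (the 0.5 cutoff is an assumption for illustration):

```r
# Fitted probabilities of being white wine (the second level):
probs <- predict(m, type = "response")
# Predicted class using a 0.5 cutoff, compared to the actual type:
pred_class <- ifelse(probs > 0.5, "white", "red")
table(pred_class, wine$type)
```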
It turns out that the model predicts that most wines are white - even the red ones!
The reason may be that we have more white wines (4,898) than red wines (1,599)
in the dataset. Adding more explanatory variables could perhaps solve this problem.
We’ll give that a try in the next section.
Exercise 8.16. Download the sharks.csv file from the book’s web page. It contains
information about shark attacks in South Africa. Using data on attacks that occurred
in 2000 or later, fit a logistic regression model to investigate whether the age and sex
of the individual that was attacked affect the probability of the attack being fatal.
Note: save the code for your model, as you will return to it in the subsequent
exercises.
Exercise 8.17. In Section 8.1.8 we saw how some functions from the broom package
could be used to get summaries of linear models. Try using them with the wine data
model that we created above. Do the broom functions work for generalised linear
models as well?
library(boot.pval)
In the parametric approach, for each observation, the fitted success probability from
the logistic model will be used to sample new observations of the response variable.
This method can work well if the model is well-specified but tends to perform poorly
for misspecified models, so make sure to carefully perform model diagnostics (as
described in the next section) before applying it. To use the parametric approach,
we can do as follows:
library(boot)
Exercise 8.18. Use the model that you fitted to the sharks.csv data in Exercise
8.16 for the following:
1. When the MASS package is loaded, you can use confint to obtain (asymptotic)
confidence intervals for the parameters of a GLM. Use it to compute confidence
intervals for the parameters of your model for the sharks.csv data.
2. Compute parametric bootstrap confidence intervals and p-values for the pa-
rameters of your logistic regression model for the sharks.csv data. Do they
differ from the intervals obtained using confint? Note that there are a lot of
missing values for the response variable. Think about how that will affect your
bootstrap intervals and adjust your code accordingly.
3. Use the confidence interval inversion method of Section 7.7.3 to compute boot-
strap p-values for the effect of age.
Plots of raw residuals are of little use in logistic regression models. A better option is
to use a binned residual plot, in which the observations are grouped into bins based
on their fitted value. The average residual in each bin can then be computed, which
will tell us which parts of the model have a poor fit. A function for this is available
in the arm package:
install.packages("arm")
library(arm)
binnedplot(predict(m, type = "response"),
residuals(m, type = "response"))
The grey lines show confidence bounds which are supposed to contain about 95 %
of the bins. If too many points fall outside these bounds, it’s a sign that we have
a poor model fit. In this case, there are a few points outside the bounds. Most
notably, the average residuals are fairly large for the observations with the lowest
fitted values, i.e. among the observations with the lowest predicted probability of
being white wines.
Let’s compare the above plot to that for a model with more explanatory variables:
m2 <- glm(type ~ pH + alcohol + fixed.acidity + residual.sugar,
data = wine, family = binomial)
This looks much better - adding more explanatory variables appears to have improved
the model fit.
It’s worth repeating that if your main interest is hypothesis testing, you shouldn’t fit
multiple models and then pick the one that gives the best results. However, if you’re
doing an exploratory analysis or are interested in predictive modelling, you can and
should try different models. It can then be useful to do a formal hypothesis test of
the null hypothesis that m and m2 fit the data equally well, against the alternative
that m2 has a better fit. If both fit the data equally well, we’d prefer m, since it is
a simpler model. We can use anova to perform a likelihood ratio deviance test (see
Section 12.4 for details), which tests this:
anova(m, m2, test = "LRT")
The p-value is very low, and we conclude that m2 has a better model fit.
Another useful function is cooks.distance, which computes the Cook’s distance for each observation and is useful for finding influential observations. In this case, I’ve chosen to print the row numbers for the observations with a Cook’s distance greater than 0.004 - this cutoff has been chosen arbitrarily, so that only the observations with the highest Cook’s distance are highlighted.
res <- data.frame(Index = 1:length(cooks.distance(m)),
CooksDistance = cooks.distance(m))
# influential points:
ggplot(res, aes(Index, CooksDistance)) +
geom_point() +
geom_text(aes(label = ifelse(CooksDistance > 0.004,
rownames(res), "")),
hjust = 1.1)
Exercise 8.19. Investigate the residuals for your sharks.csv model. Are there any
problems with the model fit? Any influential points?
8.3.4 Prediction
Just as for linear models, we can use predict to make predictions for new obser-
vations using a GLM. To begin with, let’s randomly sample 10 rows from the wine
data and fit a model using all data except those ten observations:
# Randomly select 10 rows from the wine data:
rows <- sample(1:nrow(wine), 10)

# Fit the model using all data except those ten observations
# (formula as in the earlier model):
m <- glm(type ~ pH + alcohol, data = wine[-rows,], family = binomial)
We can now use predict to make predictions for the ten observations:
preds <- predict(m, wine[rows,])
preds
Those predictions look a bit strange though - what are they? By default, predict
returns predictions on the scale of the link function. That’s not really what we want
in most cases - instead, we are interested in the predicted probabilities. To get those,
we have to add the argument type = "response" to the call:
preds <- predict(m, wine[rows,], type = "response")
preds
Logistic regression models are often used for prediction, in what is known as classifi-
cation. Section 9.1.7 is concerned with how to evaluate the predictive performance
of logistic regression and other classification models.
We’ll use the shark attack data in sharks.csv, available on the book’s website. It contains data
about shark attacks in South Africa, downloaded from The Global Shark Attack File
(http://www.sharkattackfile.net/incidentlog.htm). To load it, we download the file
and set file_path to the path of sharks.csv:
sharks <- read.csv(file_path, sep = ";")
The number of attacks in a year is not binary but a count that, in principle, can take
any non-negative integer as its value. Are there any trends over time for the number
of reported attacks?
# Plot data from 1960-2019:
library(ggplot2)
ggplot(attacks, aes(Year, Type)) +
geom_point() +
ylab("Number of attacks")
No trend is evident. To confirm this, let’s fit a regression model with Type (the
number of attacks) as the response variable and Year as an explanatory variable.
For count data like this, a good first model to use is Poisson regression. Let 𝜇𝑖
denote the expected value of the response variable given the explanatory variables.
Given 𝑛 observations of 𝑝 explanatory variables, the Poisson regression model is:

log(𝜇𝑖) = 𝛽0 + 𝛽1𝑥𝑖1 + ⋯ + 𝛽𝑝𝑥𝑖𝑝,   𝑖 = 1, …, 𝑛
To fit it, we use glm as before, but this time with family = poisson:
m <- glm(Type ~ Year, data = attacks, family = poisson)
summary(m)
We can add the curve corresponding to the fitted model to our scatterplot as follows:
attacks_pred <- data.frame(Year = attacks$Year,
                           at_pred = predict(m, type = "response"))
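The curve can then be drawn with geom_line (a sketch, reusing the scatterplot aesthetics from above):

```r
library(ggplot2)
# Scatterplot with the fitted Poisson curve overlaid:
ggplot(attacks, aes(Year, Type)) +
  geom_point() +
  geom_line(data = attacks_pred, aes(Year, at_pred)) +
  ylab("Number of attacks")
```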
The fitted model seems to confirm our view that there is no trend over time in the
number of attacks.
For model diagnostics, we can use a binned residual plot and a plot of Cook’s distance
to find influential points:
# Binned residual plot:
library(arm)
binnedplot(predict(m, type = "response"),
residuals(m, type = "response"))
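The Cook's distance plot can be produced just as for the logistic regression model (a sketch mirroring the earlier code):

```r
# Plot of Cook's distance to find influential points:
library(ggplot2)
res <- data.frame(Index = 1:length(cooks.distance(m)),
                  CooksDistance = cooks.distance(m))
ggplot(res, aes(Index, CooksDistance)) +
  geom_point()
```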
A common problem in Poisson regression models is excess zeros, i.e. more observations
with value 0 than what is predicted by the model. To check the distribution of counts
in the data, we can draw a histogram:
ggplot(attacks, aes(Type)) +
geom_histogram(binwidth = 1, colour = "black")
If there are a lot of zeroes in the data, we should consider using another model, such
as a hurdle model or a zero-inflated Poisson regression. Both of these are available
in the pscl package.
Another common problem is overdispersion, which occurs when there is more variabil-
ity in the data than what is predicted by the GLM. A formal test of overdispersion
(Cameron & Trivedi, 1990) is provided by dispersiontest in the AER package. The
null hypothesis is that there is no overdispersion, and the alternative that there is
overdispersion:
install.packages("AER")
library(AER)
dispersiontest(m, trafo = 1)
There are several alternative models that can be considered in the case of overdisper-
sion. One of them is negative binomial regression, which uses the same link function
as Poisson regression. We can fit it using the glm.nb function from MASS:
library(MASS)
m_nb <- glm.nb(Type ~ Year, data = attacks)
summary(m_nb)
For the shark attack data, the predictions from the two models are virtually identical,
meaning that both are equally applicable in this case:
attacks_pred <- data.frame(Year = attacks$Year, at_pred =
predict(m, type = "response"))
attacks_pred_nb <- data.frame(Year = attacks$Year, at_pred =
predict(m_nb, type = "response"))
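One way to see this is to overlay the two sets of fitted values (a sketch):

```r
library(ggplot2)
# Poisson fit (solid) and negative binomial fit (dashed):
ggplot(attacks, aes(Year, Type)) +
  geom_point() +
  geom_line(data = attacks_pred, aes(Year, at_pred)) +
  geom_line(data = attacks_pred_nb, aes(Year, at_pred),
            linetype = "dashed") +
  ylab("Number of attacks")
```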
Finally, we can obtain bootstrap confidence intervals e.g. using case resampling, using
boot_summary:
library(boot.pval)
boot_summary(m_nb, type = "perc", method = "case")
Exercise 8.20. The quakes dataset, available in base R, contains information about
seismic events off Fiji. Fit a Poisson regression model with stations as the response
variable and mag as an explanatory variable. Are there signs of overdispersion? Does
using a negative binomial model improve the model fit?
In other words, we should include log(𝑁𝑖 ) on the right-hand side of our model, with
a known coefficient equal to 1. In regression, such a term is known as an offset. We
can add it to our model using the offset function.
As an example, we’ll consider the ships data from the MASS package. It describes
the number of damage incidents for different ship types operating in the 1960’s and
1970’s, and includes information about how many months each ship type was in
service (i.e. each ship type’s exposure):
library(MASS)
?ships
View(ships)
For our example, we’ll use ship type as the explanatory variable, incidents as the
response variable and service as the exposure variable. First, we remove observa-
tions with 0 exposure (by definition, these can’t be involved in incidents, and so there
is no point in including them in the analysis). Then, we fit the model using glm and
offset:
ships <- ships[ships$service != 0,]
# Fit the Poisson model with log(service) as an offset
# (formula assumed from the description above):
m <- glm(incidents ~ type + offset(log(service)),
         data = ships, family = poisson)
summary(m)
Rate models are usually interpreted in terms of the rate ratios 𝑒𝛽𝑗, which describe
the multiplicative change in the rate when 𝑥𝑗 is increased by one unit.
To compute the rate ratios for our model, we use exp:
exp(coef(m))
Exercise 8.21. Compute bootstrap confidence intervals for the rate ratios in the
model for the ships data.
We can fit a Bayesian GLM with the rstanarm package, using stan_glm in the same
way we did for linear models. Let’s look at an example with the wine data. First,
we load and prepare the data:
# Import data about white and red wines:
white <- read.csv("https://tinyurl.com/winedata1",
sep = ";")
red <- read.csv("https://tinyurl.com/winedata2",
sep = ";")
white$type <- "white"
red$type <- "red"
wine <- rbind(white, red)
wine$type <- factor(wine$type)
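With the data prepared, the model can be fitted like before, substituting glm with stan_glm (the formula mirrors the earlier frequentist model):

```r
library(rstanarm)
# Bayesian logistic regression; formula assumed from the earlier glm fit:
m <- stan_glm(type ~ pH + alcohol, data = wine, family = binomial)
```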
plot(m, "intervals",
pars = names(coef(m)),
prob = 0.95)
Finally, we can use 𝑅̂ to check model convergence. It should be less than 1.1 if the
fitting has converged:
plot(m, "rhat")
The sleepstudy dataset from lme4 contains data from a study on sleep deprivation and reaction times. The participants were restricted to 3 hours of sleep per night, and their average reaction time on a series of tests was measured each day during the 9 days that the study lasted:
library(lme4)
?sleepstudy
str(sleepstudy)
Let’s start our analysis by making boxplots showing reaction times for each subject.
We’ll also superimpose the observations for each participant on top of their boxplots:
library(ggplot2)
ggplot(sleepstudy, aes(Subject, Reaction)) +
geom_boxplot() +
geom_jitter(aes(colour = Subject),
position = position_jitter(0.1))
We are interested in finding out whether reaction times increase when the participants
have been deprived of sleep for a longer period. Let’s try plotting reaction times
against days, adding a regression line:
ggplot(sleepstudy, aes(Days, Reaction, colour = Subject)) +
geom_point() +
geom_smooth(method = "lm", colour = "black", se = FALSE)
As we saw in the boxplots, and can see in this plot too, some participants always have
comparatively high reaction times, whereas others always have low values. There are
clear differences between individuals, and the measurements for each individual will
be correlated. This violates a fundamental assumption of the traditional linear model,
namely that all observations are independent.
In addition to this, it also seems that the reaction times change in different ways for
different participants, as can be seen if we facet the plot by test subject:
ggplot(sleepstudy, aes(Days, Reaction, colour = Subject)) +
geom_point() +
theme(legend.position = "none") +
facet_wrap(~ Subject, nrow = 3) +
geom_smooth(method = "lm", colour = "black", se = FALSE)
Both the intercept and the slope of the average reaction time differ between individuals. Because of this, the fit given by a single model can be misleading. Moreover,
the fact that the observations are correlated will cause problems for the traditional
intervals and tests. We need to take this into account when we estimate the overall
intercept and slope.
One approach could be to fit a single model for each subject. That doesn’t seem very
useful though. We’re not really interested in these particular test subjects, but in
how sleep deprivation affects reaction times in an average person. It would be much
better to have a single model that somehow incorporates the correlation between
measurements made on the same individual. That is precisely what a linear mixed
regression model does.
• Fixed effects, which are non-random. These are usually the variables of primary
interest in the data. In the sleepstudy example, Days is a fixed effect.
• Random effects, which represent nuisance variables that cause measurements
to be correlated. These are usually not of interest in and of themselves, but
are something that we need to include in the model to account for correlations
between measurements. In the sleepstudy example, Subject is a random
effect.
Linear mixed models can be fitted using lmer from the lme4 package. The syntax
is the same as for lm, with the addition of random effects. These can be included in
different ways. Let’s have a look at them.
First, we can include a random intercept, which gives us a model where the intercept
(but not the slope) varies between test subjects. In our example, the formula for this
is:
library(lme4)
m1 <- lmer(Reaction ~ Days + (1|Subject), data = sleepstudy)
Alternatively, we could include a random slope in the model, in which case the slope
(but not the intercept) varies between test subjects. The formula would be:
m2 <- lmer(Reaction ~ Days + (0 + Days|Subject), data = sleepstudy)
Finally, we can include both a random intercept and random slope in the model.
This can be done in two different ways, as we can model the intercept and slope as
being correlated or uncorrelated:
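A sketch of the two variants (the formula for m4 also appears later in the chapter; the syntax for the correlated version, m3, is the standard lme4 form):

```r
library(lme4)
# Correlated random intercept and slope:
m3 <- lmer(Reaction ~ Days + (1 + Days|Subject), data = sleepstudy)
# Uncorrelated random intercept and slope:
m4 <- lmer(Reaction ~ Days + (1|Subject) + (0 + Days|Subject),
           data = sleepstudy)
```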
Which model should we choose? Are the intercepts and slopes correlated? It could
of course be the case that individuals with a high intercept have a smaller slope - or
a greater slope! To find out, we can fit different linear models to each subject, and
then make a scatterplot of their intercepts and slopes. To fit a model to each subject,
we use split and map as in Section 8.1.11:
# Collect the coefficients from each linear model:
library(purrr)
sleepstudy %>% split(.$Subject) %>%
map(~ lm(Reaction ~ Days, data = .)) %>%
map(coef) -> coefficients
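The intercepts and slopes can then be collected into a data frame, plotted, and tested for correlation (a sketch; the object names are assumptions):

```r
library(ggplot2)
# Turn the list of per-subject coefficient vectors into a data frame:
coefs <- data.frame(t(as.data.frame(coefficients)))
names(coefs) <- c("Intercept", "Days")
# Scatterplot of slopes against intercepts:
ggplot(coefs, aes(Intercept, Days)) + geom_point()
# Test for correlation between intercept and slope:
cor.test(coefs$Intercept, coefs$Days)
```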
The correlation test is not significant, and judging from the plot, there is little in-
dication that the intercept and slope are correlated. We saw earlier that both the
intercept and the slope seem to differ between subjects, and so m4 seems like the best
choice here. Let’s stick with that, and look at a summary table for the model.
summary(m4, correlation = FALSE)
I like to add correlation = FALSE here, which suppresses some superfluous output
from summary.
You’ll notice that unlike the summary table for linear models, there are no p-values!
This is a deliberate design choice from the lme4 developers, who argue that the
approximate tests available aren’t good enough for small sample sizes (Bates et al.,
2015).
Using the bootstrap, as we will do in Section 8.4.3, is usually the best approach for
mixed models. If you really want some quick p-values, you can load the lmerTest
package, which adds p-values computed using the Satterthwaite approximation
(Kuznetsova et al., 2017). This is better than the usual approximate test, but still
not perfect.
install.packages("lmerTest")
library(lmerTest)
m4 <- lmer(Reaction ~ Days + (1|Subject) + (0 + Days|Subject),
data = sleepstudy)
summary(m4, correlation = FALSE)
If we need to extract the model coefficients, we can do so using fixef (for the fixed
effects) and ranef (for the random effects):
fixef(m4)
ranef(m4)
If we want to extract the variance components from the model, we can use VarCorr:
VarCorr(m4)
Let’s add the lines from the fitted model to our facetted plot, to compare the results
of our mixed model to the lines that were fitted separately for each individual:
mixed_mod <- coef(m4)$Subject
mixed_mod$Subject <- row.names(mixed_mod)
Notice that the lines differ. The intercept and slopes have been shrunk toward the
global effects, i.e. toward the average of all lines.
Exercise 8.22. Consider the Oxboys data from the nlme package. Does a mixed
model seem appropriate here? If so, are the intercept and slope for different subjects
correlated? Fit a suitable model, with height as the response variable.
Save the code for your model, as you will return to it in the next few exercises.
Exercise 8.23. The broom.mixed package allows you to get summaries of mixed
models as data frames, just as broom does for linear and generalised linear models.
Install it and use it to get the summary table for the model for the Oxboys data that
you created in the previous exercise. How are fixed and random effects included in
the table?
# Plot residuals:
ggplot(fm4, aes(.fitted, .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
xlab("Fitted values") + ylab("Residuals")
The normality assumption appears to be satisfied, but there are some signs of het-
eroscedasticity in the boxplots of the residuals for the different subjects.
Exercise 8.24. Return to your mixed model for the Oxboys data from Exercise
8.22. Make diagnostic plots for the model. Are there any signs of heteroscedasticity
or non-normality?
8.4.3 Bootstrapping
Summary tables, including p-values, for the fixed effects are available through
boot_summary:
library(boot.pval)
boot_summary(m4, type = "perc")
library(boot)
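The boot_res object used below is not defined in this excerpt; one way to produce it is with lme4's bootMer (a sketch, with the number of resamples chosen arbitrarily):

```r
# Parametric bootstrap of the fixed effects of m4:
library(lme4)
boot_res <- bootMer(m4, fixef, nsim = 999)
```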
boot.ci(boot_res, type = "perc", index = 1) # Intercept
boot.ci(boot_res, type = "perc", index = 2) # Days
As an example, consider the Pastes data from the lme4 package. We are interested in the strength of a chemical product. There are ten delivery batches (batch), and three casks within each delivery (cask). Because of variations in
manufacturing, transportation, storage, and so on, it makes sense to include random
effects for both batch and cask in a linear mixed model. However, each cask only
appears within a single batch, which makes the cask effect nested within batch.
Models that use nested random factors are commonly known as multilevel models
(the random factors exist at different “levels”), or hierarchical models (there is a
hierarchy between the random factors). These aren’t really any different from other
mixed models, but depending on how the data is structured, we may have to be a
bit careful to get the nesting right when we fit the model with lmer.
If the two effects weren’t nested, we could fit a model using:
# Incorrect model:
m1 <- lmer(strength ~ (1|batch) + (1|cask),
data = Pastes)
summary(m1, correlation = FALSE)
However, because the casks are labelled a, b, and c within each batch, we’ve now
fitted a model where casks from different batches are treated as being equal! To
clarify that the labels a, b, and c belong to different casks in different batches, we
need to include the nesting in our formula. This is done as follows:
# Cask is nested within batch:
m2 <- lmer(strength ~ (1|batch/cask),
data = Pastes)
summary(m2, correlation = FALSE)
library(lmerTest)
# TV data:
?TVbo
# All three types of ANOVA table give the same results here:
anova(m, type = "III")
anova(m, type = "II")
anova(m, type = "I")
The interaction effect is significant at the 5 % level. As for other ANOVA models,
we can visualise this with an interaction plot:
interaction.plot(TVbo$TVset, TVbo$Picture,
response = TVbo$Colourbalance)
Exercise 8.25. Fit a mixed effects ANOVA to the TVbo data, using Coloursaturation
as the response variable, TVset and Picture as fixed effects, and Assessor as a
random effect. Does there appear to be a need to include the interaction between
Assessor and TVset as a random effect? If so, do it.
We’ll use the binary version of the response, r2, and fit a logistic mixed regression
model to the data, to see if it can be used to explain the subjects’ responses. The
formula syntax is the same as for linear mixed models, but now we’ll use glmer to fit
a GLMM. We’ll include Anger and Gender as fixed effects (we are interested in seeing
how these affect the response) and item and id as random effects with random intercepts (we believe that answers to the same item and answers from the same individual may be correlated):
m <- glmer(r2 ~ Anger + Gender + (1|item) + (1|id),
data = VerbAgg, family = binomial)
summary(m, correlation = FALSE)
We can plot the fitted random effects for item to verify that there appear to be
differences between the different items:
mixed_mod <- coef(m)$item
mixed_mod$item <- row.names(mixed_mod)
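For instance, we can make a dot plot of the fitted per-item intercepts (a sketch; the "(Intercept)" column name comes from coef):

```r
library(ggplot2)
# Per-item intercepts, ordered by size:
ggplot(mixed_mod, aes(`(Intercept)`,
                      reorder(item, `(Intercept)`))) +
  geom_point() +
  ylab("item")
```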
The situ variable, describing situation type, also appears interesting. Let’s include
it as a fixed effect. Let’s also allow different situational (random) effects for different
respondents. It seems reasonable that such responses are random rather than fixed
(as in the solution to Exercise 8.25), and we do have repeated measurements of these
responses. We’ll therefore also include situ as a random effect nested within id:
m <- glmer(r2 ~ Anger + Gender + situ + (1|item) + (1|id/situ),
data = VerbAgg, family = binomial)
summary(m, correlation = FALSE)
Finally, we’d like to obtain bootstrap confidence intervals for the fixed effects. Because
this is a fairly large dataset (𝑛 = 7,584) this can take a looong time to run, so stretch
your legs and grab a cup of coffee or two while you wait:
library(boot.pval)
boot_summary(m, type = "perc", R = 100)
# Ideally, R should be greater, but for the sake of
# this example, we'll use a low number.
Exercise 8.26. Consider the grouseticks data from the lme4 package (Elston et
al., 2001). Fit a mixed Poisson regression model to the data, with TICKS as the
response variable and YEAR and HEIGHT as fixed effects. What variables are suitable
to use for random effects? Compute a bootstrap confidence interval for the effect of
HEIGHT.
After loading rstanarm, fitting a Bayesian linear mixed model with a weakly infor-
mative prior is as simple as substituting lmer with stan_lmer:
library(lme4)
library(rstanarm)
m4 <- stan_lmer(Reaction ~ Days + (1|Subject) + (0 + Days|Subject),
data = sleepstudy)
To plot the posterior distributions for the coefficients of the fixed effects, we can use
plot, specifying which effects we are interested in using pars:
plot(m4, "dens", pars = c("(Intercept)", "Days"))
To get 95 % credible intervals for the fixed effects, we can use posterior_interval
as follows:
posterior_interval(m4,
pars = c("(Intercept)", "Days"),
prob = 0.95)
The survival package contains a number of useful methods for survival analysis.
Let’s install it:
install.packages("survival")
The survival times of the patients consist of two parts: time (the time from diagnosis
until either death or the end of the study) and status (1 if the observation is
censored, 2 if the patient died before the end of the study). To combine these so that
they can be used in a survival analysis, we must create a Surv object:
Surv(lung$time, lung$status)
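The Surv object can then be used to fit Kaplan-Meier curves, e.g. comparing males and females (a sketch; the model object m is assumed in what follows):

```r
library(survival)
# Kaplan-Meier estimates by sex:
m <- survfit(Surv(time, status) ~ sex, data = lung)
plot(m)
```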
To print the values for the survival curves at different time points, we can use
summary:
summary(m)
To test for differences between two groups, we can use the logrank test (also known
as the Mantel-Cox test), given by survdiff:
survdiff(Surv(time, status) ~ sex, data = lung)
Another option is the Peto-Peto test, which puts more weight on early events (deaths,
in the case of the lung data), and therefore is suitable when such events are of greater
interest. In contrast, the logrank test puts equal weights on all events regardless of
when they occur. The Peto-Peto test is obtained by adding the argument rho = 1:
survdiff(Surv(time, status) ~ sex, rho = 1, data = lung)
The Hmisc package contains a function for obtaining confidence intervals based on the
Kaplan-Meier estimator, called bootkm. This allows us to get confidence intervals for
the quantiles (including the median) of the survival distribution for different groups,
as well as for differences between the quantiles of different groups. First, let’s install
it:
install.packages("Hmisc")
We can now use bootkm to compute bootstrap confidence intervals for survival times
based on the lung data. We’ll compute an interval for the median survival time
for females, and one for the difference in median survival time between females and
males:
library(Hmisc)
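A sketch of how bootkm could be used for the two intervals (the subset choices are assumptions based on the coding sex=1 for males and sex=2 for females):

```r
library(survival)
library(Hmisc)
# Bootstrap replicates of the median survival time for females:
surv_f <- with(lung[lung$sex == 2,], Surv(time, status))
boot_f <- bootkm(surv_f, q = 0.5, B = 500)
quantile(boot_f, c(0.025, 0.975), na.rm = TRUE)
# ...and for the difference in medians between females and males:
surv_m <- with(lung[lung$sex == 1,], Surv(time, status))
boot_m <- bootkm(surv_m, q = 0.5, B = 500)
quantile(boot_f - boot_m, c(0.025, 0.975), na.rm = TRUE)
```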
To obtain confidence intervals for other quantiles, we simply change the argument q
in bootkm.
Exercise 8.27. Consider the ovarian data from the survival package. Plot
Kaplan-Meier curves comparing the two treatment groups. Compute a bootstrap
confidence interval for the difference in the 75 % quantile for the survival time for
the two groups.
The exponentiated coefficients show the hazard ratios, i.e. the relative increases (val-
ues greater than 1) or decreases (values below 1) of the hazard rate when a covariate
is increased one step while all others are kept fixed:
exp(coef(m))
In this case, the hazard increases with age (multiply the hazard by 1.017 for each
additional year that the person has lived), and is lower for women (sex=2) than for
men (sex=1).
The censboot_summary function from boot.pval provides a table of estimates, boot-
strap confidence intervals, and bootstrap p-values for the model coefficients. The
coef argument can be used to specify whether to print confidence intervals for the
coefficients or for the exponentiated coefficients (i.e. the hazard ratios):
# censboot_summary requires us to use model = TRUE
# when fitting our regression model:
m <- coxph(Surv(time, status) ~ age + sex,
data = lung, model = TRUE)
library(boot.pval)
# Original coefficients:
censboot_summary(m, type = "perc", coef = "raw")
# Exponentiated coefficients:
censboot_summary(m, type = "perc", coef = "exp")
As the name implies, the Cox proportional hazards model relies on the assumption
of proportional hazards, which essentially means that the effect of the explanatory
variables is constant over time. This can be assessed visually by plotting the model
residuals, using cox.zph and the ggcoxzph function from the survminer package.
Specifically, we will plot the scaled Schoenfeld (1982) residuals, which measure the
difference between the observed covariates and the expected covariates given the risk
at the time of an event. If the proportional hazards assumption holds, then there
should be no trend over time for these residuals. A trend line is added to the plots to aid the eye:
install.packages("survminer")
library(survminer)
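The residual plot itself can then be drawn as follows (a sketch, assuming the Cox model m fitted to the lung data above):

```r
# Test and plot the scaled Schoenfeld residuals:
zph <- cox.zph(m)
zph           # formal tests of the proportional hazards assumption
ggcoxzph(zph) # residual plots with trend lines
```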
In this case, there are no apparent trends over time (which is in line with the cor-
responding formal hypothesis tests), indicating that the proportional hazards model
could be applicable here.
Exercise 8.28. Consider the ovarian data from the survival package.
1. Use a Cox proportional hazards regression to test whether there is a difference
between the two treatment groups, adjusted for age.
2. Compute a bootstrap confidence interval for the hazard ratio of age.
Exercise 8.29. Consider the retinopathy data from the survival package. We
are interested in a mixed survival model, where id is used to identify patients and
type, trt, and age are fixed effects. Fit a mixed Cox proportional hazards regression
(add cluster = id to the call to coxph to include this as a random effect). Is the
assumption of proportional hazards fulfilled?
Interpreting the coefficients of accelerated failure time models is easier than inter-
preting coefficients from proportional hazards models. The exponentiated coefficients
show the relative increase or decrease in the expected survival times when a covariate
is increased one step while all others are kept fixed:
exp(coef(m_ll))
In this case, according to the log-logistic model, the expected survival time decreases
by 1.4 % (i.e. multiply by 0.986) for each additional year that the patient has lived.
The expected survival time for females (sex=2) is 61.2 % higher than for males
(multiply by 1.612).
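The log-logistic model m_ll used above can be fitted with survreg from the survival package. A sketch (the covariates here mirror those of the Cox model earlier in this section):

```r
library(survival)

# Log-logistic accelerated failure time model:
m_ll <- survreg(Surv(time, status) ~ age + sex,
                data = lung, dist = "loglogistic",
                model = TRUE)  # model = TRUE for censboot_summary

# Exponentiated coefficients:
exp(coef(m_ll))
```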
To obtain bootstrap confidence intervals and p-values for the effects, we follow the
same procedure as for the Cox model, using censboot_summary. Here is an example
for the log-logistic accelerated failure time model:
library(boot.pval)
# Original coefficients:
censboot_summary(m_ll, type = "perc", coef = "raw")
# Exponentiated coefficients:
censboot_summary(m_ll, type = "perc", coef = "exp")
Exercise 8.30. Consider the ovarian data from the survival package. Fit a log-
logistic accelerated failure time model to the data, using all available explanatory
variables. What is the estimated difference in survival times between the two treat-
ment groups?
Now, let’s have a look at how to fit a Bayesian model to the lung data from survival:
library(survival)
library(rstanarm)
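A sketch of such a model (stan_surv is, at the time of writing, available in a development version of rstanarm; the choice of covariates follows the Cox example above):

```r
# Bayesian survival model for the lung data:
m <- stan_surv(Surv(time, status) ~ age + sex,
               data = lung)
m
```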
Fitting a survival model with a random effect works similarly, and uses the same
syntax as lme4. Here is an example with the retinopathy data:
m <- stan_surv(Surv(futime, status) ~ age + type + trt + (1|id),
data = retinopathy)
m
As an example, we'll use the diabetes dataset from MultSurvTests. It contains data on two groups of patients that received two different treatments. We'll compare the two groups. The survival times (time until blindness) and censoring statuses of the two groups are put in matrices called x and y, which are used as input for the test function perm_mvlogrank:
# Survival times for the two groups:
x <- as.matrix(subset(diabetes, LASER==1)[,c(6,8)])
y <- as.matrix(subset(diabetes, LASER==2)[,c(6,8)])
We’ll assume that the treatment has no effect for the first 6 months, and that it then
has a constant effect, leading to a hazard ratio of 0.75 (so the hazard ratio is 1 if
the time in years is less than or equal to 0.5, and 0.75 otherwise). Moreover, we’ll
assume that there is a constant drop-out rate, such that 20 % of the patients can be
expected to drop out during the three years. Finally, there is no drop-in. We define
a function to simulate survival times under these conditions:
# In the functions used to define the hazard ratio, drop-out
# and drop-in, t denotes time in years:
sim_func <- Quantile2(weib_dist,
hratio = function(t) { ifelse(t <= 0.5, 1, 0.75) },
dropout = function(t) { 0.2*t/3 },
dropin = function(t) { 0 })
Next, we define a function for the censoring distribution, which is assumed to be the
same for both groups. Let’s say that each follow-up is done at a random time point
between 2 and 3 years. We’ll therefore use a uniform distribution on the interval
(2, 3) for the censoring distribution:
rcens <- function(n)
{
runif(n, 2, 3)
}
Finally, we define two helper functions required by spower and then run the simu-
lation study. The output is the simulated power using the settings that we’ve just
created.
# Define helper functions:
rcontrol <- function(n) { sim_func(n, "control") }
rinterv <- function(n) { sim_func(n, "intervention") }
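With all the pieces in place, the simulation itself is run with spower from Hmisc. A sketch, with assumed settings of 50 patients per group and 500 simulated trials:

```r
library(Hmisc)

# Simulated power of the logrank test under the scenario above:
spower(rcontrol, rinterv, rcens,
       nc = 50, ni = 50,
       test = logrank, nsim = 500)
```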
8.6.1 Estimation
The EnvStats package contains a number of functions that can be used to compute descriptive statistics and estimate parameters of distributions from data with nondetects. Let's install it:
install.packages("EnvStats")
Estimates of the mean and standard deviation of a normal distribution that take the
censoring into account in the right way can be obtained with enormCensored, which
allows us to use several different estimators (the details surrounding the available es-
timators can be found using ?enormCensored). Analogous functions are available for
other distributions, for instance elnormAltCensored for the lognormal distribution,
egammaCensored for the gamma distribution, and epoisCensored for the Poisson
distribution.
To illustrate the use of enormCensored, we will generate data from a normal distri-
bution. We know the true mean and standard deviation of the distribution, and can
compute the estimates for the generated sample. We will then pretend that there
is a detection limit for this data, and artificially left-censor about 20 % of it. This
allows us to compare the estimates for the full sample and the censored sample, to
see how the censoring affects the estimates. Try running the code below a few times:
# Generate 50 observations from a N(10, 9)-distribution:
x <- rnorm(50, 10, 3)

# Pretend that there is a detection limit, and left-censor the
# roughly 20 % of the observations that fall below it:
limit <- quantile(x, 0.2)
censored <- x < limit
x[censored] <- limit

# Naive estimates that ignore the censoring:
mean(x); sd(x)

library(EnvStats)
# Maximum likelihood estimate:
estimates_mle <- enormCensored(x, censored,
                               method = "mle")
# Bias-corrected maximum likelihood estimate:
estimates_bcmle <- enormCensored(x, censored,
                                 method = "bcmle")
# Regression on order statistics, ROS, estimate:
estimates_ros <- enormCensored(x, censored,
                               method = "ROS")
The naive estimators tend to be biased for data with nondetects (sometimes very
biased!). Your mileage may vary depending on e.g. the sample size and the amount
of censoring, but in general, the estimators that take censoring into account will fare
much better.
After we have obtained estimates for the parameters of the normal distribution, we
can plot the data against the fitted distribution to check the assumption of normality:
library(ggplot2)
# Compare to histogram, including a bar for nondetects:
ggplot(data.frame(x), aes(x)) +
geom_histogram(colour = "black", aes(y = ..density..)) +
geom_function(fun = dnorm, colour = "red", size = 2,
args = list(mean = estimates_mle$parameters[1],
sd = estimates_mle$parameters[2]))
To obtain percentile and BCa bootstrap confidence intervals for the mean, we can
add the options ci = TRUE and ci.method = "bootstrap":
# Using 999 bootstrap replicates:
enormCensored(x, censored, method = "mle",
ci = TRUE, ci.method = "bootstrap",
n.bootstraps = 999)$interval$limits
Exercise 8.31. Download the il2rb.csv data from the book’s web page. It contains
measurements of the biomarker IL-2RB made in serum samples from two groups of
patients. The values that are missing are in fact nondetects, with detection limit
0.25.
Under the assumption that the biomarker levels follow a lognormal distribution, com-
pute bootstrap confidence intervals for the mean of the distribution for the control
group. What proportion of the data is left-censored?
Exercise 8.32. Return to the il2rb.csv data from Exercise 8.31. Test the hypothesis that there is no difference in location between the two groups.
Fitting regression models where the explanatory variables are censored is more chal-
lenging. For prediction, a good option is models based on decision trees, studied in
Section 9.5. For testing whether there is a trend over time, tests based on Kendall’s
correlation coefficient can be useful. EnvStats provides two functions for this -
kendallTrendTest for testing a monotonic trend, and kendallSeasonalTrendTest
for testing a monotonic trend within seasons.
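A sketch of how kendallTrendTest can be used, with a small hypothetical example (the vectors t and y are invented here for illustration):

```r
library(EnvStats)

# Hypothetical example data: measurements y made at times t,
# with a weak increasing trend:
t <- 1:20
y <- 0.1*t + rnorm(20)

# Test for a monotonic trend in y over time:
kendallTrendTest(y ~ t)
```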
We will illustrate the use of the packages using the lalonde dataset, which is shipped with the MatchIt package:
library(MatchIt)
data(lalonde)
?lalonde
View(lalonde)
Note that the data has row names, which are useful e.g. for identifying which indi-
viduals have been paired - we can access them using rownames(lalonde).
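The matches object examined below is created with matchit. A sketch using nearest-neighbour propensity score matching (the choice of explanatory variables follows the examples later in this section):

```r
library(MatchIt)
data(lalonde)

# Propensity score matching, 1:1 nearest neighbour:
matches <- matchit(treat ~ re74 + re75 + age + educ + married,
                   data = lalonde, method = "nearest")
```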
summary(matches)
plot(matches)
plot(matches, type = "hist")
To view the values of the re78 variable of the matched pairs, use:
varName <- "re78"
resMatrix <- lalonde[row.names(matches$match.matrix), varName]
for(i in 1:ncol(matches$match.matrix))
{
resMatrix <- cbind(resMatrix, lalonde[matches$match.matrix[,i],
varName])
}
rownames(resMatrix) <- row.names(matches$match.matrix)
View(resMatrix)
To perform propensity score matching using optimal matching with 2 matches each:
matches <- matchit(treat ~ re74 + re75 + age + educ + married,
data = lalonde, method = "optimal", ratio = 2)
summary(matches)
plot(matches)
plot(matches, type = "hist")
matched_data <- match.data(matches)  # extract the matched observations
summary(matched_data)
You may also want to find all controls that match participants in the treatment group
exactly. This is called exact matching:
matches <- matchit(treat ~ re74 + re75 + age + educ + married,
data = lalonde, method = "exact")
summary(matches)
plot(matches)
plot(matches, type = "hist")
# Check results:
matched_data2 <- match.data(matches)
View(matched_data2)
Chapter 9
Predictive modelling and machine learning
In predictive modelling, we fit statistical models that use historical data to make
predictions about future (or unknown) outcomes. This practice is a cornerstone
of modern statistics, and includes methods ranging from classical parametric linear
regression to black-box machine learning models.
After reading this chapter, you will be able to use R to:
• Fit predictive models for regression and classification,
• Evaluate predictive models,
• Use cross-validation and the bootstrap for out-of-sample evaluations,
• Handle imbalanced classes in classification problems,
• Fit regularised (and possibly also generalised) linear models, e.g. using the
lasso,
• Fit a number of machine learning models, including kNN, decision trees, random forests, and boosted trees,
• Make forecasts based on time series data.
The terminology used in predictive modelling differs a little from that used in tra-
ditional statistics. For instance, explanatory variables are often called features or
predictors, and predictive modelling is often referred to as supervised learning. We
will stick with the terms used in Section 7, to keep the terminology consistent within
the book.
Predictive models can be divided into two categories:
• Regression, where we want to make predictions for a numeric variable,
• Classification, where we want to make predictions for a categorical variable.
There are many similarities between these two, but we need to use different measures
when evaluating their predictive performance. Let’s start with models for numeric
predictions, i.e. regression models.
(Recall that the formula mpg ~ . means that all variables in the dataset, except mpg,
are used as explanatory variables in the model.)
A number of measures of how well the model fits the data have been proposed.
Without going into details (it will soon be apparent why), we can mention examples
like the coefficient of determination 𝑅2 , and information criteria like 𝐴𝐼𝐶 and 𝐵𝐼𝐶.
All of these are straightforward to compute for our model:
summary(m)$r.squared # R^2
summary(m)$adj.r.squared # Adjusted R^2
AIC(m) # AIC
BIC(m) # BIC
𝑅2 is a popular tool for assessing model fit, with values close to 1 indicating a good
fit and values close to 0 indicating a poor fit (i.e. that most of the variation in the
data isn’t accounted for).
It is nice if our model fits the data well, but what really matters in predictive mod-
elling is how close the predictions from the model are to the truth. We therefore
need ways to measure the distance between predicted values and observed values -
ways to measure the size of the average prediction error. A common measure is the
root-mean-square error (RMSE). Given n observations y₁, y₂, …, yₙ for which our model makes the predictions ŷ₁, …, ŷₙ, this is defined as

RMSE = √( ( ∑ᵢ₌₁ⁿ (ŷᵢ − yᵢ)² ) / n ),

that is, as the name implies, the square root of the mean of the squared errors (ŷᵢ − yᵢ)².
Another common measure is the mean absolute error (MAE):

MAE = ( ∑ᵢ₌₁ⁿ |ŷᵢ − yᵢ| ) / n.
Let's compare the predicted values ŷᵢ to the observed values yᵢ for our mtcars model m:
rmse <- sqrt(mean((predict(m) - mtcars$mpg)^2))
mae <- mean(abs(predict(m) - mtcars$mpg))
rmse; mae
There is a problem with this computation, and it is a big one. What we just computed
was the difference between predicted values and observed values for the sample that
was used to fit the model. This doesn’t necessarily tell us anything about how well
the model will fare when used to make predictions about new observations. It is, for
instance, entirely possible that our model has overfitted to the sample, and essentially
has learned the examples therein by heart, ignoring the general patterns that we were
trying to model. This would lead to a small RMSE and MAE, and a high R², but
would render the model useless for predictive purposes.
All the computations that we've just done - R², AIC, BIC, RMSE and MAE - were
examples of in-sample evaluations of our model. There are a number of problems
associated with in-sample evaluations, all of which have been known for a long time
- see e.g. Picard & Cook (1984). In general, they tend to be overly optimistic and
overestimate how well the model will perform for new data. It is about time that we
got rid of them for good.
A fundamental principle of predictive modelling is that the model chiefly should be
judged on how well it makes predictions for new data. To evaluate its performance,
we therefore need to carry out some form of out-of-sample evaluation, i.e. to use the
model to make predictions for new data (that weren’t used to fit the model). We
can then compare those predictions to the actual observed values for those data, and
e.g. compute the RMSE or MAE to measure the size of the average prediction error.
Out-of-sample evaluations, when done right, are less overoptimistic than in-sample
evaluations, and are also better in the sense that they actually measure the right
thing.
Exercise 9.1. To see that a high R² and low p-values say very little about the
predictive performance of a model, consider the following dataset with 30 randomly
generated observations of four variables:
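The dataset can be generated along the following lines (a sketch; the book's actual code and random seed are not reproduced here, so the exact numbers below are an assumption):

```r
# 30 observations of three explanatory variables and a response
# following y = 2*x1 - x2 + x3*x2, plus noise:
set.seed(1)
x1 <- rnorm(30); x2 <- rnorm(30); x3 <- rnorm(30)
y <- 2*x1 - x2 + x3*x2 + rnorm(30)
exdata <- data.frame(x1, x2, x3, y)
```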
1. The true relationship between the variables, used to generate the y values,
is y = 2x₁ − x₂ + x₃·x₂. Plot the y values in the data against this expected
value. Does a linear model seem appropriate?
2. Fit a linear regression model with x1, x2 and x3 as explanatory variables (with-
out any interactions) using the first 20 observations of the data. Do the p-values
and 𝑅2 indicate a good fit?
3. Make predictions for the remaining 10 observations. Are the predictions accu-
rate?
4. A common (mal)practice is to remove explanatory variables that aren’t signifi-
cant from a linear model (see Section 8.1.9 for some comments on this). Remove
any variables from the regression model with a p-value above 0.05, and refit
the model using the first 20 observations. Do the p-values and 𝑅2 indicate a
good fit? Do the predictions for the remaining 10 observations improve?
5. Finally, fit a model with x1, x2 and x3*x2 as explanatory variables (i.e. a
correctly specified model) to the first 20 observations. Do the predictions for
the remaining 10 observations improve?
In most cases though, we don’t have that luxury. A popular alternative is to artifi-
cially create two sets by randomly withdrawing a part of the data, 10 % or 20 % say,
which can be used for evaluation. In machine learning lingo, model fitting is known
as training and model evaluation as testing. The set used for training (fitting) the
model is therefore often referred to as the training data, and the set used for testing
(evaluating) the model is known as the test data.
Let’s try this out with the mtcars data. We’ll use 80 % of the data for fitting our
model and 20 % for evaluating it.
# Set the sizes of the test and training samples.
# We use 20 % of the data for testing:
n <- nrow(mtcars)
ntest <- round(0.2*n)
ntrain <- n - ntest
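The split itself can then be made by sampling row indices. A sketch, continuing from the chunk above (the names mtcars_train and mtcars_test match the code that follows):

```r
# Randomly assign observations to the training and test sets
# (n and ntrain are defined in the chunk above):
train_rows <- sample(1:n, ntrain)
mtcars_train <- mtcars[train_rows, ]
mtcars_test <- mtcars[-train_rows, ]
```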
In this case, our training set consists of 26 observations and our test set of 6 obser-
vations. Let’s fit the model using the training set and use the test set for evaluation:
# Fit model to training set:
m <- lm(mpg ~ ., data = mtcars_train)
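The evaluation on the test set can then be sketched as:

```r
# Make predictions for the test set and compute RMSE and MAE:
pred <- predict(m, newdata = mtcars_test)
rmse <- sqrt(mean((pred - mtcars_test$mpg)^2))
mae <- mean(abs(pred - mtcars_test$mpg))
rmse; mae
```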
Because of the small sample sizes here, the results can vary a lot if you rerun the two
code chunks above several times (try it!). When I ran them ten times, the RMSE
varied between 1.8 and 7.6 - quite a difference on the scale of mpg! This problem is
usually not as pronounced if you have larger sample sizes, but even for fairly large
datasets, there can be a lot of variability depending on how the data happens to
be split. It is not uncommon to get a “lucky” or “unlucky” test set that either
overestimates or underestimates the model’s performance.
In general, I’d therefore recommend that you only use test-training splits of your data
as a last resort (and only use it with sample sizes of 10,000 or more). Better tools
are available in the form of the bootstrap and its darling cousin, cross-validation.
To begin with, we split the data into 𝑘 sets, where 𝑘 is equal to or less than the
number of observations 𝑛. We then put the first set aside, to use for evaluation, and
fit the model to the remaining 𝑘 − 1 sets. The model predictions are then evaluated
on the first set. Next, we put the first set back among the others and remove the
second set to use that for evaluation. And so on. This means that we fit 𝑘 models
to 𝑘 different (albeit similar) training sets, and evaluate them on 𝑘 test sets (none of
which are used for fitting the model that is evaluated on them).
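As a sketch, here is how the pred vector evaluated below can be computed with k = n, i.e. leave-one-out cross-validation:

```r
# Leave-one-out cross-validation for the mtcars model:
n <- nrow(mtcars)
pred <- numeric(n)
for(i in 1:n)
{
      # Fit the model without observation i, then predict it:
      m <- lm(mpg ~ ., data = mtcars[-i, ])
      pred[i] <- predict(m, newdata = mtcars[i, ])
}
```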
# Evaluate predictions:
rmse <- sqrt(mean((pred - mtcars$mpg)^2))
mae <- mean(abs(pred - mtcars$mpg))
rmse; mae
We will use cross-validation a lot, and so it is nice not to have to write a lot of code
each time we want to do it. To that end, we’ll install the caret package, which not
only lets us do cross-validation, but also acts as a wrapper for a large number of
packages for predictive models. That means that we won’t have to learn a ton of
functions to be able to fit different types of models. Instead, we just have to learn
a few functions from caret. Let’s install the package and some of the packages it
needs to function fully:
install.packages("caret", dependencies = TRUE)
Now, let’s see how we can use caret to fit a linear regression model and evaluate
it using cross-validation. The two main functions used for this are trainControl,
which we use to say that we want to perform a leave-one-out cross-validation (method
= "LOOCV") and train, where we state the model formula and specify that we want
to use lm for fitting the model:
library(caret)
tc <- trainControl(method = "LOOCV")
m <- train(mpg ~ .,
data = mtcars,
method = "lm",
trControl = tc)
train has now done several things in parallel. First of all, it has fitted a linear model
to the entire dataset. To see the results of the linear model we can use summary, just
as if we’d fitted it with lm:
summary(m)
Many, but not all, functions that we would apply to an object fitted using lm still
work fine with a linear model fitted using train, including predict. Others, like
coef and confint, no longer work (or work differently) - but that is not that big a
problem. We only use train when we are fitting a linear regression model with the
intent of using it for prediction - and in such cases, we are typically not interested
in the values of the model coefficients or their confidence intervals. If we need them,
we can always refit the model using lm.
What makes train great is that m also contains information about the predictive per-
formance of the model, computed, in this case, using leave-one-out cross-validation:
# Print a summary of the cross-validation:
m
Exercise 9.2. Download the estates.xlsx data from the book’s web page. It
describes the selling prices (in thousands of SEK) of houses in and near Uppsala,
Sweden, along with a number of variables describing the location, size, and standard
of the house.
Fit a linear regression model to the data, with selling_price as the response vari-
able and the remaining variables as explanatory variables. Perform an out-of-sample
evaluation of your model. What are the RMSE and MAE? Do the prediction errors
seem acceptable?
A problem with LOOCV is that it can be misleading if there are duplicated observations in
your sample, i.e. observations that are identical or nearly identical (in which case the
model for all intents and purposes already has “seen” the observation for which it is
making a prediction). It can also be quite slow if you have a large dataset, as you
need to fit 𝑛 different models, each using a lot of data.
A much faster option is 𝑘-fold cross-validation, which is the name for cross-validation
where 𝑘 is lower than 𝑛 - usually much lower, with 𝑘 = 10 being a common choice. To
run a 10-fold cross-validation with caret, we change the arguments of trainControl,
and then run train exactly as before:
tc <- trainControl(method = "cv" , number = 10)
m <- train(mpg ~ .,
data = mtcars,
method = "lm",
trControl = tc)
Like with test-training splitting, the results from a 𝑘-fold cross-validation will vary
each time it is run (unless 𝑘 = 𝑛). To reduce the variance of the estimates of the
prediction error, we can repeat the cross-validation procedure multiple times, and
average the errors from all runs. This is known as a repeated 𝑘-fold cross-validation.
To run 100 10-fold cross-validations, we change the settings in trainControl as
follows:
tc <- trainControl(method = "repeatedcv",
number = 10, repeats = 100)
m <- train(mpg ~ .,
data = mtcars,
method = "lm",
trControl = tc)
Which type of cross-validation to use for different problems remains an open question.
Several studies (e.g. Zhang & Yang (2015), and the references therein) indicate that
in most settings larger 𝑘 is better (with LOOCV being the best), but there are
exceptions to this rule - e.g. when you have a lot of twinned data. This is in contrast
to an older belief that a high 𝑘 leads to estimates with high variances, tracing its
roots back to a largely unsubstantiated claim in Efron (1983), which you still can see
repeated in many books. When 𝑛 is very large, the difference between different 𝑘 is
typically negligible.
Exercise 9.3. Return to the estates.xlsx data from the previous exercise. Refit
your linear model, but this time:
1. Use 10-fold cross-validation for the evaluation. Run it several times and check
the MAE. How much does the MAE vary between runs?
2. Run repeated 10-fold cross-validations a few times. How much does the MAE
vary between runs?
If you plan on using LOOCV, you may want to remove duplicates. We saw how to
do this in Section 5.8.2:
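A sketch of this, using base R's duplicated:

```r
# Keep only the first occurrence of each row:
mtcars_unique <- mtcars[!duplicated(mtcars), ]
```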
9.1.6 Bootstrapping
An alternative to cross-validation is to draw bootstrap samples, some of which are
used to fit models, and some to evaluate them. This has the benefit that the models
are fitted to n observations instead of ((k−1)/k)·n observations. This is in fact the default
method in trainControl. To use it for our mtcars model, with 999 bootstrap
samples, we run the following:
library(caret)
tc <- trainControl(method = "boot",
number = 999)
m <- train(mpg ~ .,
data = mtcars,
method = "lm",
trControl = tc)
m
m$results
Exercise 9.4. Return to the estates.xlsx data from the previous exercise. Refit
your linear model, but this time use the bootstrap to evaluate the model. Run it
several times and check the MAE. How much does the MAE vary between runs?
In Section 8.3.1, we fitted a logistic regression model to the data using glm:
m <- glm(type ~ pH + alcohol, data = wine, family = binomial)
summary(m)
Logistic regression models are regression models, because they give us a numeric
output: class probabilities. These probabilities can however be used for classification
- we can for instance classify a wine as being red if the predicted probability that it
is red is at least 0.5. We can therefore use logistic regression as a classifier, and refer
to it as such, although we should bear in mind that it actually is more than that.
We can use caret and train to fit the same logistic regression model, and use
cross-validation or the bootstrap to evaluate it. We should supply the arguments
method = "glm" and family = "binomial" to train to specify that we want a
logistic regression model. Let’s do that, and run a repeated 10-fold cross-validation
of the model - this takes longer to run than our mtcars example because the dataset
is larger:
library(caret)
tc <- trainControl(method = "repeatedcv",
number = 10, repeats = 100)
m <- train(type ~ pH + alcohol,
data = wine,
trControl = tc,
method = "glm",
family = "binomial")
We mentioned a little earlier that we can use logistic regression for classification by,
for instance, classifying a wine as being red if the predicted probability that it is red
is at least 0.5. It is of course possible to use another threshold as well, and classify
wines as being red if the probability is at least 0.2, or 0.3333, or 0.62. When setting
this threshold, there is a tradeoff between the occurrence of what is known as false
negatives and false positives. Imagine that we have two classes (white and red), and
that we label one of them as negative (white) and one as positive (red). Then a false positive is a white wine that is classified as red, and a false negative is a red wine that is classified as white.
In the wine example, there is little difference between these types of errors. But
in other examples, the distinction is an important one. Imagine for instance that
we, based on some data, want to classify patients as being sick (positive) or healthy
(negative). In that case it might be much worse to get a false negative (the patient
won’t get the treatment that they need) than a false positive (which just means that
the patient will have to run a few more tests). For any given threshold, we can
compute two measures of the frequency of these types of errors: the sensitivity, which is the proportion of positives that are correctly classified as positive, and the specificity, which is the proportion of negatives that are correctly classified as negative.
If we increase the threshold for at what probability a wine is classified as being red
(positive), then the sensitivity will increase, but the specificity will decrease. And if
we lower the threshold, the sensitivity will decrease while the specificity increases.
It would make sense to try several different thresholds, to see for which threshold we
get a good compromise between sensitivity and specificity. We will use the MLeval
package to visualise the result of this comparison, so let’s install that:
install.packages("MLeval")
Sensitivity and specificity are usually visualised using receiver operating characteristic curves, or ROC curves for short. We'll plot such a curve for our wine model.
The function evalm from MLeval can be used to collect the data that we need from
the cross-validations of a model m created using train. To use it, we need to set
savePredictions = TRUE and classProbs = TRUE in trainControl:
library(MLeval)
plots <- evalm(m)
# ROC:
plots$roc
The x-axis shows the false positive rate of the classifier (which is 1 minus the specificity
- we’d like this to be as low as possible) and the y-axis shows the corresponding
sensitivity of the classifier (we'd like this to be as high as possible). The red line
shows the false positive rate and sensitivity of our classifier, with each point on the
line corresponding to a different threshold. The grey line shows the performance of
a classifier that is no better than random guessing - ideally, we want the red line to
be much higher than that.
The beauty of the ROC curve is that it gives us a visual summary of how the classifier
performs for all possible thresholds. It is particularly useful if we want to compare two or
more classifiers, as you will do in Exercise 9.5.
The legend shows a summary measure, AUC, the area under the ROC curve. An
AUC of 0.5 means that the classifier is no better than random guessing, and an
AUC of 1 means that the model always makes correct predictions for all thresholds.
Getting an AUC that is lower than 0.5, meaning that the classifier is worse than
random guessing, is exceedingly rare, and can be a sign of some error in the model
fitting.
evalm also computes a 95 % confidence interval for the AUC, which can be obtained
as follows:
plots$optres[[1]][13,]
Another very important plot provided by evalm is the calibration curve. It shows
how well-calibrated the model is. If the model is well-calibrated, then the predicted
probabilities should be close to the true frequencies. As an example, this means that
among wines for which the predicted probability of the wine being red is about 20
%, about 20 % should actually be red. For a well-calibrated model, the red curve should follow the grey diagonal line closely.
Our model doesn't appear to be that well-calibrated, meaning that we can't really
trust its predicted probabilities.
If we just want to quickly print the AUC without plotting the ROC curves, we can
set summaryFunction = twoClassSummary in trainControl, after which the AUC
will be printed instead of accuracy and Cohen's kappa (although it is erroneously
called ROC instead of AUC). The sensitivity and specificity for the 0.5 threshold
are also printed:
tc <- trainControl(method = "repeatedcv",
number = 10, repeats = 100,
summaryFunction = twoClassSummary,
savePredictions = TRUE,
classProbs = TRUE)
Exercise 9.5. Fit a second logistic regression model, m2, to the wine data, that
also includes fixed.acidity and residual.sugar as explanatory variables. You
can then run
library(MLeval)
plots <- evalm(list(m, m2),
gnames = c("Model 1", "Model 2"))
to create ROC curves and calibration plots for both models. Compare their curves.
Is the new model better than the simpler model?
Another way to visualise a classifier is to plot its decision boundaries, i.e. the regions of the predictor space that the classifier associates with the different classes. Let's look at an example of this using
the model m fitted to the wine data at the end of the previous section. We’ll create a
grid of points using expand.grid and make predictions for each of them (i.e. classify
each of them). We can then use geom_contour to draw the decision boundaries:
contour_data <- expand.grid(
    pH = seq(min(wine$pH), max(wine$pH), length = 500),
    alcohol = seq(min(wine$alcohol), max(wine$alcohol), length = 500))

# Predict the class of each grid point, converted to numbers so
# that a contour can be drawn at the decision boundary:
predictions <- contour_data
predictions$type <- as.numeric(predict(m, contour_data))

library(ggplot2)
ggplot(wine, aes(pH, alcohol, colour = type)) +
    geom_point(size = 2) +
    stat_contour(aes(x = pH, y = alcohol, z = type),
                 data = predictions, colour = "black")
In this case, points to the left of the black line are classified as white, and points to
the right of the line are classified as red. It is clear from the plot (both from the
point clouds and from the decision boundaries) that the model won’t work very well,
as many wines will be misclassified.
Exercise 9.6. Discuss the following. You are working for a company that tracks
the behaviour of online users using cookies. The users have all agreed to be tracked
by clicking on an “Accept all cookies” button, but most can be expected not to have
read the terms and conditions involved. You analyse information from the cookies,
consisting of data about more or less all parts of the users’ digital lives, to serve
targeted ads to the users. Is this acceptable? Does the accuracy of your targeting
models affect your answer? What if the ads are relevant to the user 99 % of the time?
What if they are only relevant 1 % of the time?
Exercise 9.7. Discuss the following. You work for a company that has developed a
facial recognition system. In a final trial before releasing your product, you discover
that your system performs poorly for people over the age of 70 (the accuracy is 99 %
for people below 70 and 65 % for people above 70). Should you release your system
without making any changes to it? Does your answer depend on how it will be used?
What if it is used instead of keycards to access offices? What if it is used to unlock
smartphones? What if it is used for ID controls at voting stations? What if it is
used for payments?
Exercise 9.8. Discuss the following. Imagine a model that predicts how likely it
is that a suspect committed a crime that they are accused of, and that said model
is used in courts of law. The model is described as being faster, fairer, and more
impartial than human judges. It is a highly complex black-box machine learning
model built on data from previous trials. It uses hundreds of variables, and so it
isn’t possible to explain why it gives a particular prediction for a specific individual.
The model makes correct predictions 99 % of the time. Is using such a model in the
judicial system acceptable? What if an innocent person is predicted by the model
to be guilty, without an explanation of why it found them to be guilty? What if the
model makes correct predictions 90 % or 99.99 % of the time? Are there things that
the model shouldn’t be allowed to take into account, such as skin colour or income?
If so, how can you make sure that such variables aren’t implicitly incorporated into
the training data?
9.3. CHALLENGES IN PREDICTIVE MODELLING 371
Next, we fit three logistic regression models - one fitted the usual way, one with
down-sampling, and one with up-sampling. We'll use 10-fold cross-validation to
evaluate their performance.
library(caret)
Bear in mind, though, that the accuracy can be very high when you have imbalanced
classes, even if your model has overfitted to the data and always predicts that all
observations belong to the same class. Perhaps ROC curves will paint a different
picture?
library(MLeval)
plots <- evalm(list(m1, m2, m3),
               gnames = c("Imbalanced data",
                          "Down-sampling",
                          "Up-sampling"))
The three models have virtually identical performance in terms of AUC, so thus far
there doesn’t seem to be an advantage to using down-sampling or up-sampling.
Now, let’s make predictions for all the red wines that the models haven’t seen in
the training data. What are the predicted probabilities of them being red, for each
model?
When the model is fitted using the standard methods, almost all red wines get very
low predicted probabilities of being red. This isn’t the case for the models that
used down-sampling and up-sampling, meaning that m2 and m3 are much better at
correctly classifying red wines. Note that we couldn’t see any differences between the
models in the ROC curves, but that there are huge differences between them when
they are applied to new data. Problems related to class imbalance can be difficult to
detect, so always be careful when working with imbalanced data.
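To see concretely why accuracy alone can mislead here, consider a small base-R check (the class proportions below are hypothetical, chosen for illustration):

```r
# A sample with 95 % white wines and 5 % red wines:
labels <- c(rep("white", 950), rep("red", 50))

# A useless "model" that always predicts white:
predictions <- rep("white", 1000)

# The accuracy looks impressive, even though the model never
# identifies a single red wine:
mean(predictions == labels)
```

An accuracy of 0.95 here says nothing about the model's ability to find the minority class, which is exactly what the ROC curves and predicted probabilities above are meant to reveal.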
linear model. In essence, this means that variables with a lower p-value are assigned
higher importance. But the p-value is not a measure of effect size, nor the predictive
importance of a variable (see e.g. Wasserstein & Lazar (2016)). I strongly advise
against using varImp for linear models.
There are other options for computing variable importance for linear and generalised
linear models, for instance in the relaimpo package, but mostly these rely on in-
sample metrics like 𝑅2 . Since our interest is in the predictive performance of our
model, we are chiefly interested in how much the different variables affect the predic-
tions. In Section 9.5.2 we will see an example of such an evaluation, for another type
of model.
9.3.3 Extrapolation
It is always dangerous to use a predictive model with data that comes from outside the
range of the variables in the training data. We’ll use bacteria.csv as an example
of that - download that file from the book's web page and set file_path to its
path. The data has two variables, Time and OD. The first describes the time of a
measurement, and the second describes the optical density (OD) of a well containing
bacteria. The more the bacteria grow, the greater the OD. First, let’s load and plot
the data:
# Read and format data:
bacteria <- read.csv(file_path)
bacteria$Time <- as.POSIXct(bacteria$Time, format = "%H:%M:%S")
Now, let’s fit a linear model to data from hours 3-6, during which the bacteria are in
their exponential phase, where they grow faster:
# Fit model:
m <- lm(OD ~ Time, data = bacteria[45:90,])
The model fits the data that it’s been fitted to extremely well - but does very poorly
outside this interval. It overestimates the future growth and underestimates the
previous OD.
In this example, we had access to data from outside the range used for fitting the
model, which allowed us to see that the model performs poorly outside the original
data range. In most cases, however, we do not have access to such data. When
extrapolating outside the range of the training data, there is always a risk that the
patterns governing the phenomena we are studying are completely different, and
it is important to be aware of this.
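The same danger is easy to reproduce with synthetic data (a toy sketch, not the bacteria data): a straight line fitted to the middle of an exponential curve fits well there, but fails badly outside the fitting range:

```r
# Exponential growth, with a linear model fitted only to x in [3, 6]:
x <- seq(0, 10, 0.1)
y <- exp(0.5 * x)
toy <- data.frame(x = x, y = y)
m_lin <- lm(y ~ x, data = toy, subset = x >= 3 & x <= 6)

# A good fit inside the interval...
summary(m_lin)$r.squared

# ...but the prediction at x = 10 falls far short of the true value:
predict(m_lin, data.frame(x = 10))
exp(0.5 * 10)
```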
If we try to fit a model to this data, we'll get an error message about NA values.
To handle the missing values, we add na.action = na.pass to the train call:
library(caret)
tc <- trainControl(method = "repeatedcv",
                   number = 10, repeats = 100)
m <- train(mpg ~ .,
           data = mtcars_missing,
           method = "lm",
           trControl = tc,
           na.action = na.pass)
m$results
You can compare the results obtained for this model to those obtained using the
complete dataset:
m <- train(mpg ~ .,
           data = mtcars,
           method = "lm",
           trControl = tc)
m$results
Here, these are probably pretty close (we didn’t have a lot of missing data, after all),
but not identical.
The number of cores available on your machine determines how many processes can
be run in parallel. To see how many you have, use detectCores:
library(parallel)
detectCores()
You should avoid the temptation of using all available cores for your parallel com-
putation - you’ll always need to reserve at least one for running RStudio and other
applications.
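One common convention (not from the book) is to encode this advice directly, so that scripts never grab every core:

```r
library(parallel)

# Use all cores except one; fall back to 1 if the core count
# cannot be detected (detectCores can return NA):
n_workers <- max(1, detectCores() - 1, na.rm = TRUE)
n_workers
```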
To enable parallel computations, we use registerDoParallel to register the parallel
backend to be used. Here is an example where we create 3 workers (and so use 3
cores in parallel2 ):
2 If your CPU has 3 or fewer cores, you should lower this number.
library(doParallel)
registerDoParallel(3)
After this, it will likely take less time to fit your caret models, as model fitting now
will be performed using parallel computations on 3 cores. That means that you’ll
spend less time waiting and more time modelling. Hurrah! One word of warning
though: parallel computations require more memory, so you may run into problems
with RAM if you are working on very large datasets.
still performs as expected. Model evaluation is a task that lasts as long as the model
is in use.
which is known as the bias-variance decomposition of the MSE. This means that
if increasing the bias allows us to decrease the variance, it is possible to obtain an
estimator with a lower MSE than what is possible for unbiased estimators.
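A quick simulation (a sketch of the principle, not from the book) shows the trade-off in action: shrinking the sample mean towards zero introduces bias, but reduces the variance enough to lower the MSE:

```r
# Estimate mu = 1 from n = 10 normal observations, many times over:
set.seed(1)
mu <- 1
ests <- replicate(10000, mean(rnorm(10, mean = mu)))

mse_unbiased <- mean((ests - mu)^2)        # approx. Var = 1/10
mse_shrunken <- mean((0.9 * ests - mu)^2)  # Bias^2 + 0.81/10

# The biased, shrunken estimator has the lower MSE:
c(mse_unbiased, mse_shrunken)
```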
Regularised regression models are linear or generalised linear models in which a small
(typically) bias is introduced in the model fitting. Often this can lead to models with
better predictive performance. Moreover, it turns out that this also allows us to fit
models in situations where it wouldn’t be possible to fit ordinary (generalised) linear
models, for example when the number of variables is greater than the sample size.
To introduce the bias, we add a penalty term to the loss function used to fit the
regression model. In the case of linear regression, the usual loss function is the
squared ℓ2 norm, meaning that we seek the estimates 𝛽𝑖 that minimise
∑_{𝑖=1}^{𝑛} (𝑦_𝑖 − 𝛽_0 − 𝛽_1 𝑥_{𝑖1} − 𝛽_2 𝑥_{𝑖2} − ⋯ − 𝛽_𝑝 𝑥_{𝑖𝑝})².
When fitting a regularised regression model, we instead seek the 𝛽 = (𝛽_1, … , 𝛽_𝑝) that
minimises
∑_{𝑖=1}^{𝑛} (𝑦_𝑖 − 𝛽_0 − 𝛽_1 𝑥_{𝑖1} − 𝛽_2 𝑥_{𝑖2} − ⋯ − 𝛽_𝑝 𝑥_{𝑖𝑝})² + 𝑝(𝛽, 𝜆),
for some penalty function 𝑝(𝛽, 𝜆). The penalty function increases the “cost” of having
large 𝛽𝑖 , which causes the estimates to “shrink” towards 0. 𝜆 is a shrinkage parameter
used to control the strength of the shrinkage - the larger 𝜆 is, the greater the shrinkage.
It is usually chosen using cross-validation.
Regularised regression models are not invariant under linear rescalings of the explana-
tory variables, meaning that if a variable is multiplied by some number 𝑎, then this
can change the fit of the entire model in an arbitrary way. For that reason, it is
widely agreed that the explanatory variables should be standardised to have mean 0
and variance 1 before fitting a regularised regression model. Fortunately, the
functions that we will use for fitting these models do that for us, so that we don't have
to worry about it. Moreover, they then rescale the model coefficients to be on the
original scale, to facilitate interpretation of the model. We can therefore interpret the
regression coefficients in the same way as we would for any other regression model.
In this section, we’ll look at how to use regularised regression in practice. Further
mathematical details are deferred to Section 12.5. We will make use of model-fitting
functions from the glmnet package, so let’s start by installing that:
install.packages("glmnet")
We will use the mtcars data to illustrate regularised regression. We’ll begin by once
again fitting an ordinary linear regression model to the data:
library(caret)
tc <- trainControl(method = "LOOCV")
m1 <- train(mpg ~ .,
            data = mtcars,
            method = "lm",
            trControl = tc)
summary(m1)
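The call that creates the ridge model m2 discussed below can be sketched as follows (alpha = 0 selects the ridge penalty in glmnet; the λ grid is an assumption based on the description in the text):

```r
# Ridge regression via glmnet, evaluated with the LOOCV scheme in tc:
m2 <- train(mpg ~ .,
            data = mtcars,
            method = "glmnet",
            tuneGrid = expand.grid(alpha = 0,
                                   lambda = seq(0, 10, 0.1)),
            metric = "RMSE",
            trControl = tc)
```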
In the tuneGrid setting of train we specified that values of 𝜆 in the interval [0, 10]
should be evaluated. When we print the m2 object, we will see the RMSE and MAE of
the models for different values of 𝜆 (with 𝜆 = 0 being ordinary non-regularised linear
regression):
# Print the results:
m2
To only print the results for the best model, we can use:
m2$results[which(m2$results$lambda == m2$finalModel$lambdaOpt),]
Note that the RMSE is substantially lower than that for the ordinary linear
regression (m1).
In the metric setting of train, we said that we wanted RMSE to be used to
determine which value of 𝜆 gives the best model. To get the coefficients of the model
with the best choice of 𝜆, we use coef as follows:
# Check the coefficients of the best model:
coef(m2$finalModel, m2$finalModel$lambdaOpt)
Exercise 9.9. Return to the estates.xlsx data from Exercise 9.2. Refit your
linear model, but this time use ridge regression instead. Do the RMSE and MAE
improve?
Exercise 9.10. Return to the wine data from Exercise 9.5. Fitting the models below
will take a few minutes, so be prepared to wait for a little while.
1. Fit a logistic ridge regression model to the data (make sure to add family =
"binomial" so that you actually fit a logistic model and not a linear model),
using all variables in the dataset (except type) as explanatory variables. Use
5-fold cross-validation for choosing 𝜆 and evaluating the model (other options
are too computer-intensive). What metric is used when finding the optimal 𝜆?
2. Set summaryFunction = twoClassSummary in trainControl and metric =
"ROC" in train and refit the model using AUC to find the optimal 𝜆. Does
the choice of 𝜆 change for this particular dataset?
The variables that were removed from the model are marked by points (.) in the list
of coefficients. The RMSE is comparable to that from the ridge regression - and is
better than that for the ordinary linear regression, but the number of variables used
is fewer. The lasso model is more parsimonious, and therefore easier to interpret
(and present to your boss/client/supervisor/colleagues!).
If you only wish to extract the names of the variables with non-zero coefficients from
the lasso model (i.e. a list of the variables retained in the variable selection), you
can do so using the code below. This can be useful if you have a large number of
variables and quickly want to check which have non-zero coefficients:
rownames(coef(m3$finalModel, m3$finalModel$lambdaOpt))[
  coef(m3$finalModel, m3$finalModel$lambdaOpt)[,1] != 0]
Exercise 9.11. Return to the estates.xlsx data from Exercise 9.2. Refit your
linear model, but this time use the lasso instead. Do the RMSE and MAE
improve?
Exercise 9.12. To see how the lasso handles variable selection, simulate a dataset
where only the first 5 out of 200 explanatory variables are correlated with the response
variable:
1. Fit a linear model to the data (using the model formula y ~ .). What happens?
2. Fit a lasso model to this data. Does it select the correct variables? What if
you repeat the simulation several times, or change the values of n and p?
𝜆(𝛼 ∑_{𝑗=1}^{𝑝} |𝛽_𝑗| + (1 − 𝛼) ∑_{𝑗=1}^{𝑝} 𝛽_𝑗²), with 0 ≤ 𝛼 ≤ 1. 𝛼 = 0 yields the ridge estimator,
𝛼 = 1 yields the lasso, and 𝛼 between 0 and 1 yields a combination of the two.
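To make the penalty concrete, here it is written out as a small base-R function (for illustration only - glmnet computes this internally):

```r
# Elastic net penalty: a weighted mix of the lasso (l1) and
# ridge (squared l2) penalties:
enet_penalty <- function(beta, lambda, alpha) {
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) * sum(beta^2))
}

beta <- c(1, -2)
enet_penalty(beta, lambda = 2, alpha = 1)    # lasso: 2 * (1 + 2) = 6
enet_penalty(beta, lambda = 2, alpha = 0)    # ridge: 2 * (1 + 4) = 10
enet_penalty(beta, lambda = 2, alpha = 0.5)  # blend: 2 * (1.5 + 2.5) = 8
```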
When fitting an elastic net model, we search for an optimal choice of 𝛼, along with
the choice of 𝜆. To fit such a model, we can run the following:
library(caret)
tc <- trainControl(method = "LOOCV")
m4 <- train(mpg ~ .,
            data = mtcars,
            method = "glmnet",
            tuneGrid = expand.grid(alpha = seq(0, 1, 0.1),
                                   lambda = seq(0, 10, 0.1)),
            metric = "RMSE",
            trControl = tc)
In this example, the ridge regression happened to yield the best fit in terms of the
cross-validated RMSE.
Exercise 9.13. Return to the estates.xlsx data from Exercise 9.2. Refit your
linear model, but this time use the elastic net instead. Do the RMSE and MAE
improve?
• tolerance, which chooses the simplest model that has a performance within
(by default) 1.5 % of the model with the best performance.
Neither of these can be used with LOOCV, but both work for other cross-validation
schemes and for the bootstrap.
We can set the rule for selecting the “best” model using the argument
selectionFunction in trainControl. By default, it uses a function called
best that simply extracts the model with the best performance. Here are some
examples for the lasso:
library(caret)
# Choose the best model (this is the default!):
tc <- trainControl(method = "repeatedcv",
                   number = 10, repeats = 100)
m3 <- train(mpg ~ .,
            data = mtcars,
            method = "glmnet",
            tuneGrid = expand.grid(alpha = 1,
                                   lambda = seq(0, 10, 0.1)),
            metric = "RMSE",
            trControl = tc)
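A fit using the one-standard-error rule instead only requires changing the trainControl call (a sketch along the same lines as the code above):

```r
# Choose the simplest model within one standard error of the best:
tc_oneSE <- trainControl(method = "repeatedcv",
                         number = 10, repeats = 100,
                         selectionFunction = "oneSE")
m3_oneSE <- train(mpg ~ .,
                  data = mtcars,
                  method = "glmnet",
                  tuneGrid = expand.grid(alpha = 1,
                                         lambda = seq(0, 10, 0.1)),
                  metric = "RMSE",
                  trControl = tc_oneSE)
```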
In this example, the difference between the models is small - and it usually is. In
some cases, using oneSE or tolerance leads to a model that has better performance
on new data, but in other cases the model that has the best performance in the
evaluation also has the best performance for new data.
Regularised mixed models are strange birds. Mixed models are primarily used for in-
ference about the fixed effects, whereas regularisation primarily is used for predictive
purposes. The two don’t really seem to match. They can however be very useful if
our main interest is estimation rather than prediction or hypothesis testing, where
regularisation can help decrease overfitting. Similarly, it is not uncommon for linear
mixed models to be numerically unstable, with the model fitting sometimes failing to
converge. In such situations, a regularised LMM will often work better. Let’s study
an example concerning football (soccer) teams, from Groll & Tutz (2014), that shows
how to incorporate random effects and the lasso in the same model:
library(glmmLasso)
data(soccer)
?soccer
View(soccer)
We want to model the points totals for these football teams. We suspect that variables
like transfer.spendings can affect the performance of a team:
library(ggplot2)
ggplot(soccer, aes(transfer.spendings, points, colour = team)) +
  geom_point() +
  geom_smooth(method = "lm", colour = "black", se = FALSE)
Moreover, it seems likely that other non-quantitative variables also affect the
performance, which could cause the teams to all have different intercepts. Let's plot
them side-by-side:
library(ggplot2)
ggplot(soccer, aes(transfer.spendings, points, colour = team)) +
  geom_point() +
  theme(legend.position = "none") +
  facet_wrap(~ team, nrow = 3)
When we model the points totals, it seems reasonable to include a random intercept
for team. We’ll also include other fixed effects describing the crowd capacity of the
teams’ stadiums, and their playing style (e.g. ball possession and number of yellow
cards).
The glmmLasso functions won’t automatically centre and scale the data for us, which
you’ll recall is recommended to do before fitting a regularised regression model. We’ll
create a copy of the data with centred and scaled numeric explanatory variables:
soccer_scaled <- soccer
soccer_scaled[, c(4:16)] <- scale(soccer_scaled[, c(4:16)],
                                  center = TRUE,
                                  scale = TRUE)
Next, we’ll run a for loop to find the best 𝜆. Because we are interested in fitting
a model to this particular dataset rather than making predictions, we will use an
in-sample measure of model fit, 𝐵𝐼𝐶, to compare the different values of 𝜆. The code
below is partially adapted from demo("glmmLasso-soccer"):
# Number of effects used in model:
params <- 10
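The rest of the search can be sketched as follows, adapted from the package demo (the fixed-effect formula and the λ grid below are illustrative assumptions, not the book's exact choices):

```r
# Try a grid of lambda values and keep the model with the lowest BIC
# (the fixed effects below are assumed, for illustration):
lambdas <- seq(500, 0, by = -25)
bics <- numeric(length(lambdas))
models <- vector("list", length(lambdas))

for (i in seq_along(lambdas)) {
  models[[i]] <- glmmLasso(points ~ transfer.spendings + ball.possession +
                             yellow.card + capacity,
                           rnd = list(team = ~ 1),
                           data = soccer_scaled,
                           lambda = lambdas[i])
  bics[i] <- models[[i]]$bic
}

opt_m <- models[[which.min(bics)]]
summary(opt_m)
```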
Don’t pay any attention to the p-values in the summary table. Variable selection
can affect p-values in all sorts of strange ways, and because we’ve used the lasso to
select what variables to include, the p-values presented here are no longer valid.
Note that the coefficients printed by the code above are on the scale of the standard-
ised data. To make them possible to interpret, let’s finish by transforming them back
to the original scale of the variables:
sds <- sqrt(diag(cov(soccer[, c(4:16)])))
sd_table <- data.frame(1/sds)
sd_table["(Intercept)",] <- 1
coef(opt_m) * sd_table[names(coef(opt_m)),]
tree to the estates data from Exercise 9.2. We set file_path to the path to
estates.xlsx and import and clean the data as before:
library(openxlsx)
estates <- read.xlsx(file_path)
estates <- na.omit(estates)
Next, we fit a decision tree by setting method = "rpart"4, which uses functions from
the rpart package to fit the tree:
library(caret)
tc <- trainControl(method = "LOOCV")
m <- train(selling_price ~ .,
           data = estates,
           trControl = tc,
           method = "rpart",
           tuneGrid = expand.grid(cp = 0))
So, what is this? We can plot the resulting decision tree using the rpart.plot
package, so let’s install and use that:
install.packages("rpart.plot")
library(rpart.plot)
prp(m$finalModel)
What we see here is our machine learning model - our decision tree. When it is used
for prediction, the new observation is fed to the top of the tree, where a question
about the new observation is asked: "is tax_value < 1610?". If the answer is yes, the
observation continues down the line to the left, to the next question. If the answer
is no, it continues down the line to the right, to the question "is tax_value < 2720?",
and so on. After a number of questions, the observation reaches a circle - a so-called
leaf node, with a number in it. This number is the predicted selling price of the
house, which is based on observations in the training data that belong to the same
leaf. When the tree is used for classification, the predicted probability of class A is
the proportion of observations from the training data in the leaf that belong to class
A.
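The prediction logic just described amounts to nested if-statements. Hand-coding the first splits makes this explicit (the thresholds come from the tree described above, but the leaf values below are made-up placeholders - the real ones depend on the fitted tree):

```r
# Follow a new observation down the top of the tree:
predict_sketch <- function(tax_value) {
  if (tax_value < 1610) {
    1500  # hypothetical leaf value (predicted selling price)
  } else if (tax_value < 2720) {
    2600  # hypothetical leaf value
  } else {
    4200  # hypothetical leaf value
  }
}

predict_sketch(1000)  # 1500
```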
prp has a number of parameters that let us control what our tree plot looks like.
box.palette, shadow.col, nn, type, extra, and cex are all useful - read the
documentation for prp to see how they affect the plot:
4 The name rpart may seem cryptic: it is an abbreviation for Recursive Partitioning and Regression Trees.
prp(m$finalModel,
    box.palette = "RdBu",
    shadow.col = "gray",
    nn = TRUE,
    type = 3,
    extra = 1,
    cex = 0.75)
When fitting the model, rpart builds the tree from the top down. At each split, it
tries to find a question that will separate subgroups in the data as much as possible.
There is no need to standardise the data (in fact, this won’t change the shape of the
tree at all).
Exercise 9.14. Fit a classification tree model to the wine data, using pH, alcohol,
fixed.acidity, and residual.sugar as explanatory variables. Evaluate its AUC
using repeated 10-fold cross-validation.
1. Plot the resulting decision tree. It is too large to be easily understandable, and
needs to be pruned. This is done using the parameter cp. Try increasing the
value of cp in tuneGrid = expand.grid(cp = 0) to different values between
0 and 1. What happens to the tree?
2. Use tuneGrid = expand.grid(cp = seq(0, 0.01, 0.001)) to find an opti-
mal choice of cp. What is the result?
Exercise 9.15. Fit a regression tree model to the bacteria.csv data to see how
OD changes with Time, using the data from observations 45 to 90 of the data frame,
as in the example in Section 9.3.3. Then make predictions for all observations in
the dataset. Plot the actual OD values along with your predictions. Does the model
extrapolate well?
Exercise 9.16. Fit a classification tree model to the seeds data from Section 4.9,
using Variety as the response variable and Kernel_length and Compactness as
explanatory variables. Plot the resulting decision boundaries, as in Section 9.1.8. Do
they seem reasonable to you?
split only a random subset of the explanatory variables is used. The predictions
from these trees are then averaged to obtain a single prediction. While the individual
trees in the forest tend to have rather poor performance, the random forest itself often
performs better than a single decision tree fitted to all of the data using all variables.
To fit a random forest to the estates data (loaded in the same way as in Section
9.5.1), we set method = "rf", which will let us do the fitting using functions from
the randomForest package. The random forest has a parameter called mtry that
determines the number of randomly selected explanatory variables. As a rule-of-thumb,
an mtry close to √𝑝, where 𝑝 is the number of explanatory variables in your data, is
usually a good choice. When trying to find the best choice for mtry, I recommend
trying some values close to that.
For the estates data we have 11 explanatory variables, and so a value of mtry close
to √11 ≈ 3 could be a good choice. Let's try a few different values with a 10-fold
cross-validation:
library(caret)
tc <- trainControl(method = "cv",
                   number = 10)
m <- train(selling_price ~ .,
           data = estates,
           trControl = tc,
           method = "rf",
           tuneGrid = expand.grid(mtry = 2:4))
In my run, an mtry equal to 4 gave the best results. Let’s try larger values as well,
just to see if that gives a better model:
m <- train(selling_price ~ .,
           data = estates,
           trControl = tc,
           method = "rf",
           tuneGrid = expand.grid(mtry = 4:10))
For this data, a value of mtry that is a little larger than what usually is recommended
seems to give the best results. It was a good thing that we didn’t just blindly go
with the rule-of-thumb, but instead tried a few different values.
Random forests have a built-in variable importance measure, which is based on mea-
suring how much worse the model fares when the values of each variable are permuted.
This is a much more sensible measure of variable importance than that presented in
Section 9.3.2. The importance values are reported on a relative scale, with the value
for the most important variable always being 100. Let’s have a look:
dotPlot(varImp(m))
Exercise 9.17. Fit a decision tree model and a random forest to the wine data, using
all variables (except type) as explanatory variables. Evaluate their performance using
10-fold cross-validation. Which model has the best performance?
Exercise 9.18. Fit a random forest to the bacteria.csv data to see how OD changes
with Time, using the data from observations 45 to 90 of the data frame, as in the
example in Section 9.3.3. Then make predictions for all observations in the dataset.
Plot the actual OD values along with your predictions. Does the model extrapolate
well?
Exercise 9.19. Fit a random forest model to the seeds data from Section 4.9,
using Variety as the response variable and Kernel_length and Compactness as
explanatory variables. Plot the resulting decision boundaries, as in Section 9.1.8. Do
they seem reasonable to you?
The decision trees in the ensemble are built sequentially, with each new tree giving
more weight to observations for which the previous trees performed poorly. This
process is known as boosting.
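The idea can be illustrated with a minimal hand-rolled sketch of boosting with squared-error loss, where "giving more weight to poorly predicted observations" amounts to fitting each new tree to the current residuals (synthetic data; gbm's actual algorithm is more refined):

```r
library(rpart)

# Synthetic regression data:
set.seed(1)
x <- runif(200, 0, 10)
d <- data.frame(x = x, y = sin(x) + rnorm(200, sd = 0.1))

pred <- rep(0, nrow(d))  # the ensemble's prediction, initially 0
shrinkage <- 0.1         # learning rate, as in gbm

for (i in 1:100) {
  d$res <- d$y - pred    # residuals of the current ensemble
  # Fit a shallow tree (a stump) to the residuals:
  stump <- rpart(res ~ x, data = d,
                 control = rpart.control(maxdepth = 1))
  pred <- pred + shrinkage * predict(stump, d)
}

# The ensemble fits far better than any single stump would:
mean((d$y - pred)^2)
```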
When fitting a boosted trees model in caret, we set method = "gbm". There are
four parameters that we can use to find a better fit. The two most important are
interaction.depth, which determines the maximum tree depth (values greater than
√𝑝, where 𝑝 is the number of explanatory variables in your data, are discouraged)
and n.trees, which specifies the number of trees to fit (also known as the number
of boosting iterations). Both these can have a large impact on the model fit. Let’s
try a few values with the estates data (loaded in the same way as in Section 9.5.1):
library(caret)
tc <- trainControl(method = "cv",
                   number = 10)
m <- train(selling_price ~ .,
           data = estates,
           trControl = tc,
           method = "gbm",
           tuneGrid = expand.grid(
             interaction.depth = 1:3,
             n.trees = seq(20, 200, 10),
             shrinkage = 0.1,
             n.minobsinnode = 10),
           verbose = FALSE)
The setting verbose = FALSE is used to stop gbm from printing details about
each fitted tree.
We can plot the model performance for different settings:
ggplot(m)
As you can see, using more trees (a higher number of boosting iterations) seems to
lead to a better model. However, if we use too many trees, the model usually overfits,
leading to a worse performance in the evaluation:
m <- train(selling_price ~ .,
           data = estates,
           trControl = tc,
           method = "gbm",
           tuneGrid = expand.grid(
             interaction.depth = 1:3,
             n.trees = seq(25, 500, 25),
             shrinkage = 0.1,
             n.minobsinnode = 10),
           verbose = FALSE)
ggplot(m)
In many problems, boosted trees are among the best-performing models. They do
however require a lot of tuning, which can be time-consuming, both in terms of how
long it takes to run the tuning and in terms of how much time you have to spend
fiddling with the different parameters. Several different implementations of boosted
trees are available in caret. A good alternative to gbm is xgbTree from the xgboost
package. I’ve chosen not to use that for the examples here, as it often is slower to
train due to having a larger number of hyperparameters (which in turn makes it
even more flexible!).
Exercise 9.20. Fit a boosted trees model to the wine data, using all variables
(except type) as explanatory variables. Evaluate its performance using repeated 10-
fold cross-validation. What is the best AUC that you can get by tuning the model
parameters?
Exercise 9.21. Fit a boosted trees regression model to the bacteria.csv data to
see how OD changes with Time, using the data from observations 45 to 90 of the data
frame, as in the example in Section 9.3.3. Then make predictions for all observations
in the dataset. Plot the actual OD values along with your predictions. Does the
model extrapolate well?
Exercise 9.22. Fit a boosted trees model to the seeds data from Section 4.9, using
Variety as the response variable and Kernel_length and Compactness as explana-
tory variables. Plot the resulting decision boundaries, as in Section 9.1.8. Do they
seem reasonable to you?
The model trees in partykit differ from classical decision trees not only in how the
nodes are treated, but also in how the splits are determined; see Zeileis et al. (2008)
for details. To illustrate their use, we’ll return to the estates data. The model
formula for model trees has two parts. The first specifies the response variable and
what variables to use for the linear models in the nodes, and the second part specifies
what variables to use for the splits. In our example, we’ll use living_area as the sole
explanatory variable in our linear models, and location, build_year, tax_value,
and plot_area for the splits (in this particular example, there is no overlap between
the variables used for the linear models and the variables used for the splits, but it's
perfectly fine to have an overlap if you like!).
As in Section 9.5.1, we set file_path to the path to estates.xlsx and import and
clean the data. We can then fit a model tree with linear regressions in the nodes
using lmtree:
library(openxlsx)
estates <- read.xlsx(file_path)
estates <- na.omit(estates)

# Fit a model tree with linear regressions in the nodes:
library(partykit)
m <- lmtree(selling_price ~ living_area | location + build_year +
              tax_value + plot_area,
            data = estates)
Next, we plot the resulting tree - make sure that you enlarge your Plot panel so that
you can see the linear models fitted in each node:
library(ggparty)
autoplot(m)
By adding additional arguments to lmtree, we can control e.g. the amount of pruning.
You can find a list of all the available arguments by having a look at ?mob_control.
To do automated likelihood-based pruning, we can use prune = "AIC" or prune =
"BIC", which yields a slightly shorter tree:
m <- lmtree(selling_price ~ living_area | location + build_year +
              tax_value + plot_area,
            data = estates,
            prune = "BIC")
autoplot(m)
As per usual, we can use predict to make predictions from our model. Similarly
to how we used lmtree above, we can use glmtree to fit a logistic regression in
each node, which can be useful for classification problems. We can also fit Poisson
regressions in the nodes using glmtree, creating more flexible Poisson regression
models. For more information on how you can control how model trees are plotted
using ggparty, have a look at vignette("ggparty-graphic-partying").
Exercise 9.23. In this exercise, you will fit model trees to the bacteria.csv data
to see how OD changes with Time.
1. Fit a model tree and a decision tree, using the data from observations 45 to 90
of the data frame, as in the example in Section 9.3.3. Then make predictions
for all observations in the dataset. Plot the actual OD values along with your
predictions. Do the models extrapolate well?
2. Now, fit a model tree and a decision tree using the data from observations 20
to 120 of the data frame. Then make predictions for all observations in the
dataset. Does this improve the models’ ability to extrapolate?
# With a prior:
# Prior probability of a red wine is set to be 0.5.
m_with_prior <- train(type ~ pH + alcohol + fixed.acidity +
                        residual.sugar,
                      data = wine,
                      trControl = tc,
                      method = "lda",
                      metric = "ROC",
                      prior = c(0.5, 0.5))
m_no_prior
m_with_prior
When caret fits an LDA, it uses the lda function from the MASS package, which
uses the same syntax as lm. If we use lda directly, without involving caret, we can
extract the scores (linear combinations of variables) for all observations. We can then
plot these, to get something similar to a plot of the first two principal components.
There is a difference though - PCA seeks to create new variables that summarise
as much as possible of the variation in the data, whereas LDA seeks to create new
variables that can be used to discriminate between pre-specified groups.
# Run an LDA:
library(MASS)
m <- lda(Variety ~ ., data = seeds)

# Collect the class labels and the LDA scores in a data frame:
lda_preds <- data.frame(Type = seeds$Variety,
                        Score = predict(m)$x)
View(lda_preds)
# There are 3 varieties of seeds. LDA creates 1 less new variable
# than the number of categories - so 2 in this case. We can
# therefore visualise these using a simple scatterplot.
# Plot the two LDA scores for each observation to get a visual
# representation of the data:
library(ggplot2)
ggplot(lda_preds, aes(Score.LD1, Score.LD2, colour = Type)) +
geom_point()
Exercise 9.25. Fit an LDA classifier and a QDA classifier to the seeds data
from Section 4.9, using Variety as the response variable and Kernel_length and
Compactness as explanatory variables. Plot the resulting decision boundaries, as in
Section 9.1.8. Do they seem reasonable to you?
Exercise 9.26. An even more flexible version of discriminant analysis is MDA, mix-
ture discriminant analysis, which uses normal mixture distributions for classification.
That way, we no longer have to rely on the assumption of normality. It is available
through the mda package, and can be used in train with method = "mda". Fit an
MDA classifier to the seeds data from Section 4.9, using Variety as the response
variable and Kernel_length and Compactness as explanatory variables. Plot the
resulting decision boundaries, as in Section 9.1.8. Do they seem reasonable to you?
Despite the fancy mathematics, using SVMs is not that difficult. With caret, we
can fit SVMs with many different types of kernels using the kernlab package.
Let’s install it:
install.packages("kernlab")
The simplest SVM uses a linear kernel, creating a linear classification that is reminis-
cent of LDA. Let’s look at an example using the wine data from Section 9.1.7. The
parameter 𝐶 is a regularisation parameter:
library(caret)
tc <- trainControl(method = "cv",
number = 10,
summaryFunction = twoClassSummary,
savePredictions = TRUE,
classProbs = TRUE)

# Fit the SVM; the formula mirrors the earlier LDA example:
m_svm <- train(type ~ pH + alcohol + fixed.acidity + residual.sugar,
               data = wine,
               trControl = tc,
               method = "svmLinear",
               metric = "ROC")
There are a number of other nonlinear kernels that can be used, with different hyper-
parameters that can be tuned. Without going into details about the different kernels,
some important examples are:
• method = "svmPoly: polynomial kernel. The tuning parameters are degree
(the polynomial degree, e.g. 3 for a cubic polynomial), scale (scale) and C
(regularisation).
• method = "svmRadialCost: radial basis/Gaussian kernel. The only tuning
parameter is C (regularisation).
• method = "svmRadialSigma: radial basis/Gaussian kernel with tuning of 𝜎.
The tuning parameters are C (regularisation) and sigma (𝜎).
• method = "svmSpectrumString: spectrum string kernel. The tuning parame-
ters are C (regularisation) and length (length).
Exercise 9.27. Fit an SVM to the wine data, using all variables (except type) as
explanatory variables, using a kernel of your choice. Evaluate its performance using
repeated 10-fold cross-validation. What is the best AUC that you can get by tuning
the model parameters?
Exercise 9.28. In this exercise, you will fit SVM regression models to the
bacteria.csv data to see how OD changes with Time.
1. Fit an SVM, using the data from observations 45 to 90 of the data frame, as
in the example in Section 9.3.3. Then make predictions for all observations in
the dataset. Plot the actual OD values along with your predictions. Does the
model extrapolate well?
2. Now, fit an SVM using the data from observations 20 to 120 of the data frame.
Then make predictions for all observations in the dataset. Does this improve
the model’s ability to extrapolate?
Exercise 9.29. Fit SVM classifiers with different kernels to the seeds data
from Section 4.9, using Variety as the response variable and Kernel_length and
Compactness as explanatory variables. Plot the resulting decision boundaries, as in
Section 9.1.8. Do they seem reasonable to you?
An important choice in kNN is what value to use for the parameter 𝑘. If 𝑘 is too small,
we use too little information, and if 𝑘 is too large, the classifier will become prone to
classify all observations as belonging to the most common class in the training data.
𝑘 is usually chosen using cross-validation or bootstrapping. To have caret find a
good choice of 𝑘 for us (like we did with 𝜆 in regularised regression models), we use
the argument tuneLength in train, e.g. tuneLength = 15 to try 15 different values
of 𝑘.
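To make the effect of 𝑘 concrete, here is a small sketch that doesn't use caret, relying instead on the knn function from the class package (which ships with R). The iris data and the values of 𝑘 are arbitrary choices for illustration:

```r
library(class)

# Split the built-in iris data into training and test sets
# (the seed and split size are arbitrary choices):
set.seed(314)
train_rows <- sample(1:150, 100)
train_data <- iris[train_rows, 1:4]
test_data <- iris[-train_rows, 1:4]
train_labels <- iris$Species[train_rows]
test_labels <- iris$Species[-train_rows]

# Classification accuracy on the test set for increasing k:
acc <- sapply(c(1, 5, 25, 75), function(k) {
  mean(knn(train_data, test_data, train_labels, k = k) == test_labels)
})
acc
```

With 𝑘 = 75, three quarters of the training data vote on each prediction, and the classifier drifts towards the most common class in the training data.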
By now, I think you’ve seen enough examples of how to fit models in caret that you
can figure out how to fit a model with knn on your own (using the information above,
of course). In the next exercise, you will give kNN a go, using the wine data.
Exercise 9.30. Fit a kNN classification model to the wine data, using pH, alcohol,
fixed.acidity, and residual.sugar as explanatory variables. Evaluate its perfor-
mance using 10-fold cross-validation, using AUC to choose the best 𝑘. Is it better
than the logistic regression models that you fitted in Exercise 9.5?
Exercise 9.31. Fit a kNN classifier to the seeds data from Section 4.9, using
Variety as the response variable and Kernel_length and Compactness as explana-
tory variables. Plot the resulting decision boundaries, as in Section 9.1.8. Do they
seem reasonable to you?
9.6.1 Decomposition
In Section 4.6.5 we saw how time series can be decomposed into three components:
• A seasonal component, describing recurring seasonal patterns,
• A trend component, describing a trend over time,
• A remainder component, describing random variation.
Let’s have a quick look at how to do this in R, using the a10 data from fpp2:
library(forecast)
library(ggplot2)
library(fpp2)
?a10
autoplot(a10)
The stl function uses repeated LOESS smoothing to decompose the series. The
s.window parameter controls the smoothing window used to estimate the seasonal
component. Setting it to "periodic" forces the seasonal pattern to be identical
across years:
autoplot(stl(a10, s.window = "periodic"))
When modelling time series data, we usually want to remove the seasonal component,
as it makes the data structure too complicated. We can then add it back when we
use the model for forecasting. We’ll see how to do that in the following sections.
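The remove-then-add-back idea can be sketched with base R's stl (the built-in AirPassengers data stands in for a10 here):

```r
# Decompose the (log-transformed) series; stl is part of the
# stats package, so no extra packages are needed:
fit <- stl(log(AirPassengers), s.window = "periodic")

# Subtract the seasonal component to get a seasonally
# adjusted series:
seasonal_part <- fit$time.series[, "seasonal"]
adjusted <- log(AirPassengers) - seasonal_part

# 'adjusted' can now be modelled; for forecasting, the
# seasonal component is added back afterwards.
```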
For model diagnostics, we can use checkresiduals to check whether the residuals
from the model look like white noise (i.e. are uncorrelated, with zero mean and
constant variance):
In this case, the variance of the series seems to increase with time, which the model
fails to capture. We therefore see more large residuals than what is expected under
the model.
Nevertheless, let’s see how we can make a forecast for the next 24 months. The
function for this is the aptly named forecast:
# Plot the forecast (with the seasonal component added back)
# for the next 24 months:
autoplot(forecast(tsmod, h = 24))
The forecast package is designed to work well with pipes. To fit a model using
stlm and auto.arima and then plot the forecast, we could have used:
a10 %>% stlm(s.window = "periodic", modelfunction = auto.arima) %>%
forecast(h = 24, bootstrap = TRUE) %>% autoplot()
For this data, the forecasts from the two approaches are very similar.
In Section 9.3 we mentioned that a common reason for predictive models failing
in practical applications is that many processes are non-stationary, so that their
patterns change over time. ARIMA models are designed to handle some types of non-
stationarity, which can make them particularly useful for modelling such processes.
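The differencing step that ARIMA models use to handle trends can be sketched in base R (the series below is simulated purely for illustration):

```r
# A series with a linear trend is non-stationary; taking first
# differences removes the trend (this is the "I" in ARIMA):
set.seed(42)
trend_series <- 1:100 + rnorm(100)

differenced <- diff(trend_series)

# The differenced series fluctuates around the slope of the
# trend instead of growing over time:
mean(differenced)
```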
Exercise 9.32. Return to the writing dataset from the fma package, that we
studied in Exercise 4.15. Remove the seasonal component. Fit an ARIMA model
to the data and use it to plot a forecast for the next three years, with the seasonal
component added back and with bootstrap prediction intervals.
We’ll illustrate how this works with a simple example. First, let’s install plumber:
install.packages("plumber")
Next, assume that we’ve fitted a model (we’ll use the linear regression model for
mtcars that we’ve used several times before). We can use this model to make pre-
dictions:
m <- lm(mpg ~ hp + wt, data = mtcars)
We would like to make these predictions available to other systems. That is, we’d like
to allow other systems to send values of hp and wt to our model, and get predictions
in return. To do so, we start by writing a function for the predictions:
# A function that returns predictions from the model:
predictions <- function(hp, wt) {
  predict(m, data.frame(hp = hp, wt = wt))
}

predictions(150, 2)
To make this accessible to other systems, we save this function in a script called
mtcarsAPI.R (make sure to save it in your working directory), which looks as follows:
# Fit the model:
m <- lm(mpg ~ hp + wt, data = mtcars)

#* Return a prediction from the model
#* @param hp Gross horsepower
#* @param wt Weight (1000 lbs)
#* @get /predictions
function(hp, wt) {
  predict(m, data.frame(hp = as.numeric(hp),
                        wt = as.numeric(wt)))
}
The only changes that we have made are some additional special comments (#*),
which specify what input is expected (parameters hp and wt) and that the function is
called predictions. plumber uses this information to create the API. The functions
made available in an API are referred to as endpoints.
The function will now be available on port 8000 of your computer. To access it, you
can open your browser and go to the following URL:
• http://localhost:8000/predictions?hp=150&wt=2
Try changing the values of hp and wt and see how the returned value changes.
That’s it! As long as you leave your R session running with plumber, other systems
will be able to access the model using the URL. Typically, you would run this on a
server and not on your personal computer.
#* Print a message
#* @param name Your name
#* @get /message
function(name = "") {
list(message = paste("Hello", name, "- I'm happy to see you!"))
}
After you’ve saved the file in your working directory, run the following to create the
API:
library(plumber)
pr("mtcarsAPI.R") %>% pr_run(port = 8000)
Advanced topics
This chapter contains brief descriptions of more advanced uses of R. First, we cover
more details surrounding packages. We then deal with two topics that are important
for computational speed: parallelisation and matrix operations. Finally, there are
some tips for how to play well with others (which in this case means using R in
combination with programming languages like Python and C++).
After reading this chapter, you will know how to:
• Update and remove R packages,
• Install R packages from other repositories than CRAN,
• Run computations in parallel,
• Perform matrix computations using R,
• Integrate R with other programming languages.
Alternatively, you can use require to load packages. This will only display a warning,
but won’t cause your code to stop executing, which usually would be a problem if
the rest of the code depends on the package¹!
¹ And why else would you load it…?
However, require also returns a logical: TRUE if the package is installed, and FALSE
otherwise. This is useful if you want to load a package, and automatically install it
if it doesn’t exist.
To load the beepr package, and install it if it doesn’t already exist, we can use
require inside an if condition, as in the code chunk below. If the package exists,
the package will be loaded (by require) and TRUE will be returned, and otherwise
FALSE will be returned. By using ! to turn FALSE into TRUE and vice versa, we can
make R install the package if it is missing:
if(!require("beepr")) { install.packages("beepr"); library(beepr) }
beep(4)
If you make a major update of R, you may have to update most or all of your packages.
To update all your packages, you simply run update.packages(). If this fails, you
can try the following instead:
pkgs <- installed.packages()
pkgs <- pkgs[is.na(pkgs[, "Priority"]), 1]
install.packages(pkgs)
To install packages from GitHub, you need the devtools package. You can install
it using:
install.packages("devtools")
If you for instance want to install the development version of dplyr (which you can
find at https://github.com/tidyverse/dplyr), you can then run the following:
library(devtools)
install_github("tidyverse/dplyr")
Using development versions of packages can be great, because it gives you the most
up-to-date version of packages. Bear in mind that they are development versions
though, which means that they can be less stable and have more bugs.
To install packages from Bioconductor, you can start by running this code chunk,
which installs the BiocManager package that is used to install Bioconductor packages:
install.packages("BiocManager")
# Install core packages:
library(BiocManager)
install()
To see how many cores are available on your machine, you can use detectCores:
library(parallel)
detectCores()
It is unwise to use all available cores for your parallel computation - you’ll always
need to reserve at least one for running RStudio and other applications.
To run the steps of a for loop in parallel, we must first use registerDoParallel
to register the parallel backend to be used. Here is an example where we create 3
workers (and so use 3 cores in parallel³) using registerDoParallel. When we then
use foreach to create a for loop, these three workers will execute different steps of
the loop in parallel. Note that this wouldn’t work if each step of the loop depended
on output from the previous step. foreach returns the output created at the end of
each step of the loop in a list (Section 5.2):
library(doParallel)
registerDoParallel(3)

# Each iteration returns one value; foreach collects them in a list:
loop_output <- foreach(i = 1:9) %dopar% { sqrt(i) }

loop_output
unlist(loop_output) # Convert the list to a vector
If the output created at the end of each iteration is a vector, we can collect the output
in a matrix object as follows:
library(doParallel)
registerDoParallel(3)

# Each iteration returns a vector of length 2:
loop_output <- foreach(i = 1:9) %dopar% { c(i, sqrt(i)) }

loop_output
matrix(unlist(loop_output), 9, 2, byrow = TRUE)
If you have nested loops, you should run the outer loop in parallel, but not the inner
loops. The reason for this is that parallelisation only really helps if each step of the
³ If your CPU has 3 or fewer cores, you should lower this number.
loop takes a comparatively long time to run. In fact, there is a small overhead cost
associated with assigning different iterations to different cores, meaning that parallel
loops can be slower than regular loops if each iteration runs quickly.
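A quick illustration of this overhead, using the base parallel package (the cluster size and the task are arbitrary choices):

```r
library(parallel)

# Start a cluster with 2 workers:
cl <- makeCluster(2)

# For a trivial task, the fixed cost of shipping work to the
# workers usually makes the parallel version slower:
serial_time <- system.time(lapply(1:100, sqrt))
parallel_time <- system.time(parLapply(cl, 1:100, sqrt))

stopCluster(cl)

serial_time["elapsed"]
parallel_time["elapsed"]
```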
An example where each step often takes a while to run is simulation studies. Let’s
rewrite the simulation we used to compute the type I error rates of different versions
of the t-test in Section 7.5.2 using a parallel for loop instead. First, we define the
function as in Section 7.5.2 (minus the progress bar):
# Load package used for permutation t-test:
library(MKinfer)

simulate_type_I <- function(n1, n2, distr, level = 0.05,
                            B = 999, alternative = "two.sided", ...)
{
  # Create a matrix to store the p-values in:
  p_values <- matrix(NA, B, 3)

  for(i in 1:B)
  {
    # Generate data:
    x <- distr(n1, ...)
    y <- distr(n2, ...)

    # Compute p-values:
    p_values[i, 1] <- t.test(x, y,
                          alternative = alternative)$p.value
    p_values[i, 2] <- perm.t.test(x, y,
                          alternative = alternative,
                          R = 999)$perm.p.value
    p_values[i, 3] <- wilcox.test(x, y,
                          alternative = alternative)$p.value
  }

  # Estimate the type I error rates:
  colMeans(p_values < level)
}
# In the parallel version, each foreach iteration generates data
# and returns the three p-values, collected row-wise with rbind:
p_values <- foreach(i = 1:B, .combine = rbind,
                    .packages = "MKinfer") %dopar% {
  x <- distr(n1, ...)
  y <- distr(n2, ...)

  # Compute p-values:
  p_val1 <- t.test(x, y,
              alternative = alternative)$p.value
  p_val2 <- perm.t.test(x, y,
              alternative = alternative,
              R = 999)$perm.p.value
  p_val3 <- wilcox.test(x, y,
              alternative = alternative)$p.value
  c(p_val1, p_val2, p_val3)
}
We can now compare how long the two functions take to run using the tools from
Section 6.6 (we’ll not use mark in this case, as it requires both functions to yield
identical output, which won’t be the case for a simulation):
time1 <- system.time(simulate_type_I(20, 20, rlnorm,
B = 999, sdlog = 3))
time2 <- system.time(simulate_type_I_parallel(20, 20, rlnorm,
B = 999, sdlog = 3))
# Compare results:
time1
time2
As you can see, the parallel function is considerably faster. If you have more cores,
you can try increasing the value in registerDoParallel and see how that affects
the results.
Similarly, the furrr package lets us run purrr functionals in parallel. It relies on
a package called future. Let’s install them both:
install.packages(c("future", "furrr"))
To run functionals in parallel, we load the furrr package and use plan to set the
number of parallel workers:
library(furrr)
# Use 3 workers:
plan(multisession, workers = 3)
We can then run parallel versions of functions like map and imap, by using functions
from furrr with the same names, only with future_ added at the beginning. Here
is the first example from Section 6.5.3, run in parallel:
library(magrittr)
airquality %>% future_map(~(.-mean(.))/sd(.))
Just as for for loops, parallelisation of functionals only really helps if each iteration
of the functional takes a comparatively long time to run (and so there is no benefit
to using parallelisation in this particular example).
Matrix operations require the dimension of the matrices involved to match. To check
the dimension of a matrix, we can use dim:
A <- matrix(c(2, -1, 3, 1, -2, 4), 3, 2)
dim(A)
To create a unit matrix (all 1's) or a zero matrix (all 0's), we use matrix with a
single value in the first argument:
# Create a 3x3 unit matrix:
matrix(1, 3, 3)

# Create a 3x3 zero matrix:
matrix(0, 3, 3)
The diag function has three uses. First, it can be used to create a diagonal matrix
(if we supply a vector as input). Second, it can be used to create an identity matrix
(if we supply a single number as input). Third, it can be used to extract the diagonal
from a square matrix (if we supply a matrix as input). Let's give it a go:

# Create a diagonal matrix from a vector:
diag(c(1, 2, 3))

# Create a 3x3 identity matrix:
diag(3)

# Extract the diagonal from a square matrix:
diag(matrix(1:9, 3, 3))
Matrix contains additional classes for e.g. symmetric sparse matrices and triangular
matrices. See vignette("Introduction", "Matrix") for further details.
# Vectors:
a <- 1:9 # Length 9
b <- 1:9 # Length 9
d <- 9:1 # Length 9
Given the vectors a, b, and d defined above, we can compute the outer product 𝑎 ⊗ 𝑏
using %o% and the dot product 𝑎 ⋅ 𝑑 by using %*% and t in the right manner:
a %o% b # Outer product
a %*% t(b) # Alternative way of getting the outer product
t(a) %*% d # Dot product
To find the inverse of a square matrix, we can use solve. To find the Moore-Penrose
generalised inverse of any matrix, we can use ginv from MASS:
solve(A)
solve(B) # Doesn't work because B isn't square
library(MASS)
ginv(A) # Same as solve(A), because A is non-singular and square
ginv(B)
solve can also be used to solve systems of equations. To solve the equation 𝐴𝑥 = 𝑦:
solve(A, y)
The eigenvalues and eigenvectors of a square matrix can be found using eigen:
eigen(A)
eigen(A)$values # Eigenvalues only
eigen(A)$vectors # Eigenvectors only
As a P.S., I’ll also mention the matlab package, which contains functions for running
computations using MATLAB-like function calls. This is useful if you want to reuse
MATLAB code in R without translating it line by line. Incidentally, this also brings
us nicely into the next section.
Some care has to be taken when sending data back and forth between R and Python.
In R, NA is used to represent missing data and NaN (not a number) is used to represent
things that should be numbers but aren’t (e.g. the result of computing 0/0). Perfectly
reasonable! However, for reasons unknown to humanity, popular Python packages
like Pandas, NumPy and SciKit-Learn use NaN instead of NA to represent missing
data - but only for double (numeric) variables. integer and logical variables
have no way to represent missing data in Pandas. Tread gently if there are NA or NaN
values in your data.
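On the R side, the distinction looks like this:

```r
# NA represents a missing value; NaN represents a failed
# numerical computation, such as 0/0:
x <- c(1, NA, 0/0)

is.na(x)   # FALSE  TRUE  TRUE (NaN also counts as missing)
is.nan(x)  # FALSE FALSE  TRUE (but NA is not NaN)
```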
Like in C++, the indexing of vectors (and similar objects) in Python starts with 0.
Debugging
In Section 2.10, I gave some general advice about what to do when there is an error
in your R code:
1. Read the error message carefully and try to decipher it. Have you seen it
before? Does it point to a particular variable or function? Check Section 11.2
of this book, which deals with common error messages in R.
2. Check your code. Have you misspelt any variable or function names? Are there
missing brackets, strange commas or invalid characters?
3. Copy the error message and do a web search using the message as your search
term. It is more than likely that somebody else has encountered the same
problem, and that you can find a solution to it online. This is a great shortcut
for finding solutions to your problem. In fact, this may well be the single
most important tip in this entire book.
4. Read the documentation for the function causing the error message, and look at
some examples of how to use it (both in the documentation and online, e.g. in
blog posts). Have you used it correctly?
5. Use the debugging tools presented in Chapter 11, or try to simplify the example
that you are working with (e.g. removing parts of the analysis or the data) and
see if that removes the problem.
6. If you still can’t find a solution, post a question at a site like Stack Overflow
or the RStudio community forums. Make sure to post your code and describe
the context in which the error message appears. If at all possible, post a
reproducible example, i.e. a piece of code that others can run, that causes the
error message. This will make it a lot easier for others to help you.
The debugging tools mentioned in point 5 are an important part of your toolbox,
particularly if you’re doing more advanced programming with R.
In this chapter you will learn how to:
• Debug R code,
• Recognise and resolve common errors in R code,
• Interpret and resolve common warning messages in R.
11.1 Debugging
Debugging is the process of finding and removing bugs in your scripts. R and RStudio
have several functions that can be used for this purpose. We’ll have a closer look at
some of them here.
Why is the function is.data.frame throwing an error? We were using cor, not
is.data.frame!
Functions often make calls to other functions, which in turn make calls to yet other
functions, and so on. When you get an error message, the error can have occurred in
any one of these functions. To find out in which function the error occurred, you can
run traceback, which shows the sequence of calls that led to the error:
traceback()
What this tells you is that cor makes a call to is.data.frame, and that that is
where the error occurs. This can help you understand why a function that you
weren’t aware that you were calling (is.data.frame in this case) is throwing an
error, but won’t tell you why there was an error. To find out, you can use debug,
which we’ll discuss next.
As a side note, if you’d like to know why and when cor called is.data.frame you
can print the code for cor in the Console by typing the function name without
parentheses:
cor
Reading the output, you can see that it makes a call to is.data.frame on the 10th
line:
1 function (x, y = NULL, use = "everything", method = c("pearson",
2 "kendall", "spearman"))
3 {
4 na.method <- pmatch(use, c("all.obs", "complete.obs",
5 "pairwise.complete.obs",
6 "everything", "na.or.complete"))
7 if (is.na(na.method))
8 stop("invalid 'use' argument")
9 method <- match.arg(method)
10 if (is.data.frame(y))
11 y <- as.matrix(y)
...
# Activate the debugger for the function, then call it:
debug(transform_number)

transform_number(2)
transform_number(-1)
Two things happen. First, a tab with the code for transform_number opens. Second,
a browser is initialised in the Console panel. This allows you to step through the
code, by typing one of the following and pressing Enter:
• n to run the next line,
• c to run the function until it finishes or an error occurs,
• a variable name to see the current value of that variable (useful for checking
that variables have the intended values),
• Q to quit the browser and stop the debugging.
If you either use n a few times, or c, you can see that the error occurs on line number
4 of the function:
if(x >= 0) { logx <- log(x) } else { stop("x must be positive") }
Because this function was so short, you could probably see that already, but for longer
and more complex functions, debug is an excellent way to find out where exactly the
error occurs.
The browser will continue to open for debugging each time transform_number is run.
To turn it off, use undebug:
undebug(transform_number)
This gives you the same list of function calls as traceback (called the function stack),
and you can select which of these that you’d like to investigate (in this case there is
only one, which you access by writing 1 and pressing Enter). The environment for
that call shows up in the Environment panel, which in this case shows you that the
local variable x has been assigned the value NA (which is what causes an error when
the condition x >= 0 is checked).
11.2.1 +
If there is a + sign at the beginning of the last line in the Console, and it seems that
your code doesn’t run, that is likely due to missing brackets or quotes. Here is an
example where a bracket is missing:
> 1 + 2*(3 + 2
+
Type ) in the Console to finish the expression, and your code will run. The same
problem can occur if a quote is missing:
> myString <- "Good things come in threes
+
Type " in the Console to finish the expression, and your code will run.
This error is either due to a misspelling (in which case you should fix the spelling)
or due to attempting to use a function from a package that hasn’t been loaded (in
which case you should load the package using library(package_name)). If you
are unsure which package the function belongs to, doing a quick web search for “R
function_name” usually does the trick.
This error may be due to a spelling error, so check the spelling of the variable name.
It is also commonly encountered if you return to an old R script and try to run just
a part of it - if the variable is created on an earlier line that hasn’t been run, R won’t
find it because it hasn’t been created yet.
Check the spelling of the file name, and that you have given the correct path to it
(see Section 3.3). If you are unsure about the path, you can use
read.csv(file.choose())
and
Error: Evaluation error: zip file 'C:\Users\mans\Data\some_file.xlsx' cannot be opened
These usually appear if you have the file open in Excel at the same time that you’re
trying to import data from it in R. Excel temporarily locks the file so that R can’t
open it. Close Excel and then import the data.
which yields:
> x <- c(8, 5, 9, NA)
> for(i in seq_along(x))
+ {
+ if(x[i] > 7) { cat(i, "\n") }
+ }
1
3
Error in if (x[i] > 7) { : missing value where TRUE/FALSE needed
The error occurs when i is 4, because the expression x[i] > 7 becomes NA > 7,
which evaluates to NA. if statements require that the condition evaluates to either
TRUE or FALSE. When this error occurs, you should investigate why you get an NA
instead.
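One way to guard against this is isTRUE, which returns FALSE (rather than NA) for anything that isn't a single TRUE value; a small sketch:

```r
x <- c(8, 5, 9, NA)

for(i in seq_along(x))
{
  # isTRUE(NA > 7) is FALSE rather than NA, so the missing
  # element is skipped instead of causing an error:
  if(isTRUE(x[i] > 7)) { cat(i, "\n") }
}
```

This prints 1 and 3, silently skipping the NA element; whether silently skipping is the right behaviour depends on your application.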
which yields:
> x <- c(8, 5, 9, NA)
> for(i in seq_along(x))
+ {
+ if(x[i] = 5) { cat(i, "\n") }
Error: unexpected '=' in:
"{
if(x[i] ="
> }
Error: unexpected '}' in "}"
Replace the = by == and your code should run as intended. If you really intended
to assign a value to a variable inside the if condition, you should probably rethink
that.
In this case, we need to put a multiplication symbol * between 2 and ( to make the
code run:
> 1+2*(2+3)
[1] 11
If we attempt to access the third column of the data, we get the error message:
> bookstore[,3]
Error in `[.data.frame`(bookstore, , 3) : undefined columns selected
Check that you really have the correct column number. It is common to get this
error if you have removed columns from your data.
Check that you really have the correct column number. It is common to get this
error if you have removed columns from your data, or if you are running a for loop
accessing element [i, j] of your data frame, where either i or j is greater than the
number of rows and columns of your data.
You probably meant to use parentheses ( ) instead. Or perhaps you wanted to use
the square brackets on the object returned by the function:
> sqrt(x)[2]
[1] 2.236068
If you need to access the element named a, you can do so using bracket notation:
> x["a"]
a
2
If you really need to create an object with different numbers of rows for different
columns, create a list instead:
x <- list(a = 1:3, b = 6:9)
Make sure that the data you are inputting doesn’t contain character variables.
You can fix this e.g. by changing the numbers of rows to place the data in:
x[3:4,] <- y
11.3.2 the condition has length > 1 and only the first element will be used
This warning is thrown when the condition in a conditional statement is a vector
rather than a single value. Here is an example:
> x <- 1:3
> if(x == 2) { cat("Two!") }
Warning message:
In if (x == 2) { :
the condition has length > 1 and only the first element will be used
Only the first element of the vector is used for evaluating the condition. See if you
can change the condition so that it doesn’t evaluate to a vector. If you actually want
to evaluate the condition for all elements of the vector, either collapse it using any
or all or wrap it in a loop:
x <- 1:3
if(any(x == 2)) { cat("Two!") }
for(i in seq_along(x))
{
if(x[i] == 2) { cat("Two!") }
}
Don’t try to squeeze more values than can fit into a single element! Instead, do
something like this:
x[4:5] <- c(5, 7)
a <- c(1, 2, 3)
b <- c(4, 5, 6)
a + b
R does element-wise addition, i.e. adds the first element of a to the first element of
b, and so on.
But what happens if we try to add two vectors of different lengths together?
a <- c(1, 2, 3)
b <- c(4, 5, 6, 7)
a + b
R recycles the numbers in a in the addition, so that the first element of a is added
to the fourth element of b. Was that really what you wanted? Maybe. But probably
not.
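If you do want recycling, you can make it explicit with rep_len, as in this small sketch:

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6, 7)

# Explicitly recycle a to the length of b, rather than relying
# on silent automatic recycling:
rep_len(a, length(b)) + b  # 5 7 9 8
```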
This can be either due to the fact that you’ve misspelt the package name or that the
package isn’t available for your version of R, either because you are using an out-of-
date version or because the package was developed for an older version of R. In the
former case, consider updating to a newer version of R. In the latter case, if you really
need the package you can find and download older versions of R at R-project.org - on
Windows it is relatively easy to have multiple versions of R installed side-by-side.
Mathematical appendix
Now, consider a family of two-sided tests with p-values 𝜆(𝜃, x), for 𝜃 ∈ Θ. For such
a family we can define an inverted rejection region
𝑄𝛼(x) = {𝜃 ∈ Θ ∶ 𝜆(𝜃, x) ≤ 𝛼}.
For any fixed 𝜃0 , 𝐻0 (𝜃0 ) is rejected if x ∈ 𝑅𝛼 (𝜃0 ), which happens if and only if
𝜃0 ∈ 𝑄𝛼 (x), that is,
x ∈ 𝑅𝛼 (𝜃0 ) ⇔ 𝜃0 ∈ 𝑄𝛼 (x).
If the test is based on a test statistic with a completely specified absolutely continuous
null distribution, then 𝜆(𝜃0 , X) ∼ U(0, 1) under 𝐻0 (𝜃0 ) (Liero & Zwanzig, 2012).
Then
P𝜃0 (X ∈ 𝑅𝛼 (𝜃0 )) = P𝜃0 (𝜆(𝜃0 , X) ≤ 𝛼) = 𝛼.
Since this holds for any 𝜃0 ∈ Θ and since the equivalence relation x ∈ 𝑅𝛼(𝜃0) ⇔
𝜃0 ∈ 𝑄𝛼(x) implies that
P𝜃0(𝜃0 ∈ 𝑄𝛼(X)) = P𝜃0(X ∈ 𝑅𝛼(𝜃0)) = 𝛼,
it follows that the random set 𝑄𝛼(x) covers the true parameter 𝜃0 with probability
𝛼. Consequently, letting 𝑄𝛼ᶜ(x) denote the complement of 𝑄𝛼(x), for all 𝜃0 ∈ Θ we
have
P𝜃0(𝜃0 ∈ 𝑄𝛼ᶜ(X)) = 1 − 𝛼,
Figure 12.1: The equivalence between confidence intervals and hypothesis tests.
In other cases, the choice matters more. Below, we will discuss the difference between
the two approaches.
Let 𝑇 (X) be a test statistic on which a two-sided test of the point-null hypothesis
that 𝜃 = 𝜃0 is based, and let 𝜆(𝜃0 , x) denote its p-value. Assume for simplicity that
𝑇 (x) < 0 implies that 𝜃 < 𝜃0 and that 𝑇 (x) > 0 implies that 𝜃 > 𝜃0 . We’ll call
the symmetric = FALSE scenario the twice-the-smaller-tail approach to computing
p-values. In it, the first step is to check whether 𝑇 (x) < 0 or 𝑇 (x) > 0. “At least
as extreme as the observed” is in a sense redefined as “at least as extreme as the
observed, in the observed direction”. If the median of the null distribution of 𝑇 (X)
is 0, then, for 𝑇 (x) > 0,
P𝜃0 (𝑇 (X) ≥ 𝑇 (x)|𝑇 (x) > 0) = 2 ⋅ P𝜃0 (𝑇 (X) ≥ 𝑇 (x)),
i.e. twice the unconditional probability that 𝑇 (X) ≥ 𝑇 (x). Similarly, for 𝑇 (x) < 0,
P𝜃0 (𝑇 (X) ≤ 𝑇 (x)|𝑇 (x) < 0) = 2 ⋅ P𝜃0 (𝑇 (X) ≤ 𝑇 (x)).
Moreover,
P𝜃0 (𝑇 (X) ≥ 𝑇 (x)) < P𝜃0 (𝑇 (X) ≤ 𝑇 (x)) when 𝑇 (x) > 0
and
P𝜃0 (𝑇 (X) ≥ 𝑇 (x)) > P𝜃0 (𝑇 (X) ≤ 𝑇 (x)) when 𝑇 (x) < 0.
This definition of the p-value is frequently used also in situations where the median
of the null distribution of 𝑇 (X) is not 0, despite the fact that the interpretation of
the p-value as being conditioned on whether 𝑇 (x) < 0 or 𝑇 (x) > 0 is lost.
At the level 𝛼, if 𝑇(x) > 0 the test rejects the hypothesis 𝜃 = 𝜃0 if
P𝜃0(𝑇(X) ≥ 𝑇(x)) ≤ 𝛼/2.
This happens if and only if the one-sided test of 𝜃 ≤ 𝜃0 , also based on 𝑇 (X), rejects its
null hypothesis at the 𝛼/2 level. By the same reasoning, the rejection region of a level 𝛼 twice-the-smaller-tail test is always the union of the rejection regions of two level 𝛼/2 one-sided tests of 𝜃 ≤ 𝜃0 and 𝜃 ≥ 𝜃0, respectively. The test puts equal weight on the two types of type I error: false rejection in each of the two directions. The corresponding confidence interval is therefore also equal-tailed, in the sense that the non-coverage probability is 𝛼/2 on each side of the interval.
Twice-the-smaller-tail p-values are in a sense computed by looking only at one tail
of the null distribution. In the alternative approach, symmetric = TRUE, we use
strictly two-sided p-values. Such a p-value is computed using both tails, as follows:
𝜆𝑆𝑇𝑇(𝜃0, x) = P𝜃0(|𝑇(X)| ≥ |𝑇(x)|) = P𝜃0({X ∶ 𝑇(X) ≤ −|𝑇(x)|} ∪ {X ∶ 𝑇(X) ≥ |𝑇(x)|}).
Under this approach, the directional type I error rates will in general not be equal
to 𝛼/2, so that the test might be more prone to falsely reject 𝐻0 (𝜃0 ) in one direction
than in another. On the other hand, the rejection region of a strictly two-sided test is typically smaller than its twice-the-smaller-tail counterpart. The corresponding confidence interval 𝐼𝛼(X) = (𝐿𝛼(X), 𝑈𝛼(X)) therefore satisfies

P𝜃0(𝜃0 < 𝐿𝛼(X)) + P𝜃0(𝜃0 > 𝑈𝛼(X)) = 𝛼,

with no guarantee that each of the two non-coverage probabilities equals 𝛼/2.
For parameters of discrete distributions, strictly two-sided hypothesis tests and confidence intervals can behave very erratically (Thulin & Zwanzig, 2017). Twice-the-smaller-tail methods are therefore always preferable when working with count data.
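As a small illustration (not from the book's examples) of how the two types of p-values differ computationally, both can be computed from a simulated null distribution of a test statistic. The statistic, its observed value, and the null distribution below are all hypothetical:

```r
# Sketch: the two types of two-sided p-values, computed from a
# simulated null distribution. All values here are made up.
set.seed(1)
T_null <- rnorm(10000)  # simulated null distribution of T(X)
T_obs <- 1.8            # hypothetical observed value of T(x)

# Twice-the-smaller-tail p-value: double the smaller of the two tails
p_tst <- 2 * min(mean(T_null >= T_obs), mean(T_null <= T_obs))

# Strictly two-sided p-value: total probability beyond |T_obs| in both tails
p_sts <- mean(abs(T_null) >= abs(T_obs))

p_tst
p_sts
```

Because this particular null distribution is symmetric about 0, the two p-values nearly coincide here; with a skewed null distribution they can differ substantially.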
It is also worth noting that if the null distribution of 𝑇(X) is symmetric about 0, the two approaches coincide and yield the same p-value.

In regularised estimation, the estimates are obtained by maximising the penalised loglikelihood

ℓ(𝛽) − 𝜆 ∑_{𝑖=1}^{𝑝} |𝛽𝑖|^𝑞,

where ℓ(𝛽) is the loglikelihood of 𝛽 and ∑_{𝑖=1}^{𝑝} |𝛽𝑖|^𝑞 is the 𝐿𝑞-norm, with 𝑞 ≥ 0. This is equivalent to maximising ℓ(𝛽) under the constraint that ∑_{𝑖=1}^{𝑝} |𝛽𝑖|^𝑞 ≤ 1/ℎ(𝜆), for some increasing positive function ℎ.
In Bayesian estimation, a prior distribution 𝑝(𝛽) for the parameters 𝛽𝑖 is used. The estimates are then computed from the conditional distribution of the 𝛽𝑖 given the data x, called the posterior distribution. Using Bayes' theorem, we find that

𝑝(𝛽|x) = 𝑝(x|𝛽)𝑝(𝛽)/𝑝(x) ∝ 𝑝(x|𝛽)𝑝(𝛽),

i.e. that the posterior distribution is proportional to the likelihood times the prior.
The Bayesian maximum a posteriori (MAP) estimator is found by maximising the above expression (i.e. finding the mode of the posterior). This is equivalent to computing the estimates from a regularised frequentist model with penalty function log 𝑝(𝛽), meaning
that regularised regression can be motivated both from a frequentist and a Bayesian
perspective.
When the 𝐿2 penalty is used, the regularised model is called ridge regression, for which we maximise

ℓ(𝛽) − 𝜆 ∑_{𝑖=1}^{𝑝} 𝛽𝑖².
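To connect the formula to code, here is a minimal base-R sketch of ridge regression via its closed-form solution for the linear model (where maximising the penalised loglikelihood is equivalent to minimising the penalised residual sum of squares). The simulated data are hypothetical:

```r
# Ridge regression via its closed-form solution:
#   beta_hat(lambda) = (X'X + lambda*I)^(-1) X'y
# The data below are simulated purely for illustration.
set.seed(1)
n <- 50
X <- matrix(rnorm(n * 3), n, 3)
y <- X %*% c(2, -1, 0.5) + rnorm(n)

ridge_coef <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}

ridge_coef(X, y, lambda = 0)   # lambda = 0: ordinary least squares
ridge_coef(X, y, lambda = 10)  # larger lambda shrinks the coefficients
```

As 𝜆 grows, the 𝐿2 norm of the coefficient vector shrinks towards 0, which is exactly the effect of the penalty term above.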
Solutions to exercises
Chapter 2
Exercise 2.1
Type the following code into the Console window:
1 * 2 * 3 * 4 * 5 * 6 * 7 * 8 * 9 * 10
Exercise 2.2
1. To compute the sum and assign it to a, we use:
a <- 924 + 124
Exercise 2.3
1. When an invalid character is used in a variable name, an error message is
displayed in the Console window. Different characters will render different
error messages. For instance, net-income <- income - taxes yields the er-
ror message Error in net - income <- income - taxes : object 'net'
not found. This may seem a little cryptic (and it is!), but what it means
is that R is trying to compute the difference between the variables net and
income, because that is how R interprets net-income, and fails because the
variable net does not exist. As you become more experienced with R, the error
messages will start making more and more sense (at least in most cases).
2. If you put R code as a comment, it will be treated as a comment, meaning that
it won’t run. This is actually hugely useful, for instance when you’re looking
for errors in your code - you can comment away lines of code and see if the rest
of the code runs without them.
3. Semicolons can be used to write multiple commands on a single line - both will
run as if they were on separate lines. If you like, you can add more semicolons
to run even more commands.
4. The value to the right is assigned to both variables. Note, however, that any
operations you perform on one variable won’t affect the other. For instance, if
you change the value of one of them, the other will remain unchanged:
income2 <- taxes2 <- 100
income2; taxes2 # Check that both are 100
taxes2 <- 30 # income2 doesn't change
income2; taxes2 # Check values
Exercise 2.4
1. To create the vectors, use c:
height <- c(158, 170, 172, 181, 196)
weight <- c(45, 80, 62, 75, 115)
Exercise 2.5
The vector created using:
x <- 5:1
gives us the same vector in reverse order: (5, 4, 3, 2, 1). To create the vector
(1, 2, 3, 4, 5, 4, 3, 2, 1) we can therefore use:
x <- c(1:5, 4:1)
Exercise 2.6
1. To compute the mean height, use the mean function:
mean(height)
Exercise 2.7
1. length computes the length (i.e. the number of elements) of a vector.
length(height) returns the value 5, because the vector is 5 elements long.
2. sort sorts a vector. The parameter decreasing can be used to decide whether
the elements should be sorted in ascending (sort(weight, decreasing =
FALSE)) or descending (sort(weight, decreasing = TRUE)) order. To sort
the weights in ascending order, we can use sort(weight). Note, however, that
the resulting sorted vector won’t be stored in the variable weight unless we
write weight <- sort(weight)!
Exercise 2.8
1. √𝜋 = 1.772454…:
sqrt(pi)
2. 𝑒² ⋅ log(4) = 10.24341…:
exp(2)*log(4)
Exercise 2.9
1. The expression 1/𝑥 tends to infinity as 𝑥 → 0, and so R returns Inf as the
answer in this case:
1/0
2. The division 0/0 is undefined, and R returns NaN, which stands for Not a
Number:
0/0
3. √−1 is undefined (as long as we stick to real numbers), and so R returns NaN.
The sqrt function also issues a warning message saying that NaNs were
produced.
sqrt(-1)
If you want to use complex numbers for some reason, you can write the complex
number 𝑎 + 𝑏𝑖 as complex(1, a, b). Using complex numbers, the square root of −1
is 𝑖:
sqrt(complex(1, -1, 0))
Exercise 2.10
1. View the documentation, where the data is described:
?diamonds
This shows you the number of observations (53,940) and variables (10), and the vari-
able types. There are three different data types here: num (numerical), Ord.factor
(ordered factor, i.e. an ordered categorical variable) and int (integer, a numerical
variable that only takes integer values).
3. To compute the descriptive statistics, we can use:
summary(diamonds)
In the summary, missing values show up as NA’s. There are no NA’s here, and hence
no missing values.
Exercise 2.11
The points follow a declining line. The reason for this is that at any given time,
an animal is either awake or asleep, so the total sleep time plus the awake time is
always 24 hours for all animals. Consequently, the points lie on the line given by
awake=24-sleep_total.
Exercise 2.12
1.
ggplot(diamonds, aes(carat, price, colour = cut)) +
geom_point() +
xlab("Weight of diamond (carat)") +
ylab("Price (USD)")
Exercise 2.13
1. To set different shapes for different values of cut we use:
ggplot(diamonds, aes(carat, price, colour = cut, shape = cut)) +
geom_point(alpha = 0.25) +
xlab("Weight of diamond (carat)") +
ylab("Price (USD)")
2. We can then change the size of the points as follows. The resulting figure is
unfortunately not that informative in this case.
ggplot(diamonds, aes(carat, price, colour = cut,
shape = cut, size = x)) +
geom_point(alpha = 0.25) +
xlab("Weight of diamond (carat)") +
ylab("Price (USD)")
Exercise 2.14
Using the scale_x_log10 and scale_y_log10 functions:
ggplot(msleep, aes(bodywt, brainwt, colour = sleep_total)) +
geom_point() +
xlab("Body weight (logarithmic scale)") +
ylab("Brain weight (logarithmic scale)") +
scale_x_log10() +
scale_y_log10()
Exercise 2.15
1. We use facet_wrap(~ cut) to create the facetting:
ggplot(diamonds, aes(carat, price)) +
geom_point() +
facet_wrap(~ cut)
Exercise 2.16
1.
ggplot(diamonds, aes(cut, price)) +
geom_boxplot()
2. To change the colours of the boxes, we add colour (outline colour) and fill
(box colour) arguments to geom_boxplot:
ggplot(diamonds, aes(cut, price)) +
geom_boxplot(colour = "magenta", fill = "turquoise")
Exercise 2.17
1.
ggplot(diamonds, aes(price)) +
geom_histogram() +
facet_wrap(~ cut)
Exercise 2.18
1.
ggplot(diamonds, aes(cut)) +
geom_bar()
2. To set different colours for the bars, we can use fill, either to set the colours
manually or using default colours (by adding a colour aesthetic):
# Set colours manually:
ggplot(diamonds, aes(cut)) +
geom_bar(fill = c("red", "yellow", "blue", "green", "purple"))
# Use defaults:
ggplot(diamonds, aes(cut, fill = cut)) +
geom_bar()
Exercise 2.19
To save the png file, we first create the plot object and then use ggsave:
myPlot <- ggplot(msleep, aes(sleep_total, sleep_rem)) +
geom_point()
ggsave("myPlot.png", myPlot) # any file name ending in .png works
Chapter 3
Exercise 3.1
1. Both approaches render a character object with the text A rainy day in
Edinburgh:
a <- "A rainy day in Edinburgh"
a
class(a)
That is, you are free to choose whether to use single or double quotation marks. I
tend to use double quotation marks, because I was raised to believe that double
quotation marks are superior in every way (well, that, and the fact that I think that
they make code easier to read simply because they are easier to notice).
2. The first two sums are numeric whereas the third is an integer:
class(1 + 2) # numeric
class(1L + 2) # numeric
class(1L + 2L) # integer
If we mix numeric and integer variables, the result is a numeric. But as long as we
stick to just integer variables, the result is usually an integer. There are exceptions
though - computing 2L/3L won’t result in an integer because… well, because it’s
not an integer.
3. When we run "Hello" + 1 we receive an error message:
> "Hello" + 1
Error in "Hello" + 1 : non-numeric argument to binary operator
In R, binary operators are mathematical operators like +, -, * and / that takes two
numbers and returns a number. Because "Hello" is a character and not a numeric,
it fails in this case. So, in English the error message reads Error in "Hello" + 1 :
trying to perform addition with something that is not a number. Maybe
you know a bit of algebra and want to say hey, we can add characters together – but in R, combining strings is done with the paste function rather than with +.
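To actually combine text and numbers in R, the paste functions are used instead of + (a small illustration, not part of the original solution):

```r
# Strings are combined with paste()/paste0() rather than +:
paste("Hello", 1)   # inserts a space between the pieces
paste0("Hello", 1)  # no separator
```

paste coerces its arguments to character, which is why mixing text and numbers works here but not with +.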
Exercise 3.2
The functions return information about the data frame:
ncol(airquality) # Number of columns of the data frame
nrow(airquality) # Number of rows of the data frame
dim(airquality) # Number of rows, followed by number of columns
names(airquality) # The name of the variables in the data frame
row.names(airquality) # The name of the rows in the data frame
# (indices unless the rows have been named)
Exercise 3.3
To create the matrices, we need to set the number of rows nrow, the number of
columns ncol and whether to use the elements of the vector x to fill the matrix by
rows or by columns (byrow). To create
1 2 3
4 5 6
we use:
x <- 1:6
matrix(x, nrow = 2, ncol = 3, byrow = TRUE)
And to create
1 4
2 5
3 6
we use:
x <- 1:6
matrix(x, nrow = 3, ncol = 2)
Exercise 3.4
1. In the [i, j] notation, i is the row number and j is the column number. In
this case, airquality[, 3], we have j=3 and are therefore asking for the 3rd
column, not the 3rd row. To get the third row, we'd use airquality[3,] instead.
2. To extract the first five rows, we can use:
airquality[1:5,]
# or
airquality[c(1, 2, 3, 4, 5),]
4. To extract all columns except Temp and Wind, we use a minus sign - and a
vector containing their indices:
airquality[, -c(3, 4)]
Exercise 3.5
1. To add the new variable, we can use:
bookstore$rev_per_minute <- bookstore$purchase / bookstore$visit_length
Note that the value of rev_per_minute isn't automatically updated when we change
purchase or visit_length. We therefore need to compute it again, to update its value:
bookstore$rev_per_minute <- bookstore$purchase / bookstore$visit_length
Exercise 3.6
1. The coldest day was the day with the lowest temperature:
airquality[which.min(airquality$Temp),]
We see that the 5th day in the period, May 5, was the coldest, with a temperature
of 56 degrees Fahrenheit.
2. To find out how many days the wind speed was greater than 17 mph, we use
sum:
sum(airquality$Wind > 17)
Because there are so few days fulfilling this condition, we could also easily have solved
this by just looking at the rows for those days and counting them:
airquality[airquality$Wind > 17,]
4. In this case, we need to use an ampersand & sign to combine the two conditions:
sum(airquality$Temp < 70 & airquality$Wind > 10)
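This works because logical values are coerced to numbers when summed: TRUE counts as 1 and FALSE as 0 (a small illustration, not part of the original solution):

```r
# Logical vectors are coerced to 0/1 in arithmetic:
sum(c(TRUE, FALSE, TRUE))          # number of TRUEs
mean(c(TRUE, FALSE, TRUE, FALSE))  # proportion of TRUEs
```

The same coercion is what makes sum(airquality$Wind > 17) count the days fulfilling the condition.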
Exercise 3.7
We should use the breaks argument to set the interval bounds in cut:
airquality$TempCat <- cut(airquality$Temp,
breaks = c(50, 70, 90, 110))
Exercise 3.8
1. The variable X represents the empty column between Visit and VAS. In the X.1
column the researchers have made comments on two rows (rows 692 and 1153),
causing R to read this otherwise empty column. If we wish, we can remove
these columns from the data using the syntax from Section 3.2.1:
vas <- vas[, -c(4, 6)]
read.csv reads the data without any error messages, but now VAS has become a
character vector. By default, read.csv assumes that the file uses decimal points
rather than decimal commas. When we don't specify that the file has decimal
commas, read.csv interprets 0,4 as text rather than a number.
4. Next, we remove the skip = 4 argument:
vas <- read.csv(file_path, sep = ";", dec = ",")
str(vas)
names(vas)
read.csv looks for column names on the first row that it reads. skip = 4 tells the
function to skip the first 4 rows of the .csv file (which in this case were blank or
contain other information about the data). When it doesn’t skip those lines, the only
text on the first row is Data updated 2020-04-25. This then becomes the name of
the first column, and the remaining columns are named X, X.1, X.2, and so on.
5. Finally, we change skip = 4 to skip = 5:
vas <- read.csv(file_path, sep = ";", dec = ",", skip = 5)
str(vas)
names(vas)
In this case, read.csv skips the first 5 rows, which includes row 5, on which the
variable names are given. It still looks for variable names on the first row that it
reads though, meaning that the data values from the first observation become variable
names instead of data points. An X is added at the beginning of the variable names,
because variable names in R cannot begin with a number.
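This name-repair rule can be seen directly with the make.names function (a small illustration with made-up names):

```r
# Syntactically invalid names get an "X" prepended by R's name repair:
make.names(c("1", "2.5", "valid_name"))
```

read.csv applies the same repair when a data value ends up being used as a column name.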
Exercise 3.9
1. First, set file_path to the path to projects-email.xlsx. Then we can use
read.xlsx from the openxlsx package. The argument sheet lets us select
which sheet to read:
library(openxlsx)
emails <- read.xlsx(file_path, sheet = 2)
View(emails)
str(emails)
Exercise 3.10
1. We set file_path to the path to vas-transposed.csv and then read it:
vast <- read.csv(file_path)
dim(vast)
View(vast)
This data frame only contains 2365 variables, because the leftmost column is now
the row names and not a variable.
3. t lets us rotate the data into the format that we are used to. If we only apply
t though, the resulting object is a matrix and not a data.frame. If we want
it to be a data.frame, we must also make a call to as.data.frame:
vas <- as.data.frame(t(vast))
Exercise 3.11
We fit the model and use summary to print estimates and p-values:
m <- lm(mpg ~ hp + wt + cyl + am, data = mtcars)
summary(m)
hp and wt are significant at the 5 % level, but cyl and am are not.
Exercise 3.12
1. We set file_path to the path for vas.csv and read the data as in Exercise 3.8:
vas <- read.csv(file_path, sep = ";", dec = ",", skip = 4)
2. Next, we compute the lowest and highest VAS recorded for each patient:
aggregate(VAS ~ ID, data = vas, FUN = min)
aggregate(VAS ~ ID, data = vas, FUN = max)
3. Finally, we compute the number of high-VAS days for each patient. One way
to do this is to create a logical vector using the condition VAS >= 7 and then
compute its sum:
aggregate((VAS >= 7) ~ ID, data = vas, FUN = sum)
Exercise 3.13
First we load and inspect the data:
library(datasauRus)
View(datasaurus_dozen)
Clearly, the datasets are very different! This is a great example of how simply
computing summary statistics is not enough. They tell a part of the story, yes,
but only a part.
Exercise 3.14
Exercise 3.15
library(magrittr)
bookstore %>% inset("rev_per_minute",
value = .$purchase / .$visit_length)
Chapter 4
Exercise 4.1
1. We change the background colour of the entire plot to lightblue.
p + theme(panel.background = element_rect(fill = "lightblue"),
plot.background = element_rect(fill = "lightblue"))
4. Finally, we change the colour of the axis ticks to orange and increase their
width:
p + theme(panel.background = element_rect(fill = "lightblue"),
plot.background = element_rect(fill = "lightblue"),
legend.text = element_text(family = "serif"),
legend.title = element_text(family = "serif"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_line(colour = "orange", size = 2))
It doesn’t look all that great, does it? Let’s just stick to the default theme in the
remaining examples.
Exercise 4.2
1. We can use the bw argument to control the smoothness of the curves:
ggplot(diamonds, aes(carat, colour = cut)) +
geom_density(bw = 0.2)
2. We can fill the areas under the density curves by adding fill to the aes:
ggplot(diamonds, aes(carat, colour = cut, fill = cut)) +
geom_density(bw = 0.2)
3. Because the densities overlap, it’d be better to make the fill colours slightly
transparent. We add alpha to the geom:
ggplot(diamonds, aes(carat, colour = cut, fill = cut)) +
geom_density(bw = 0.2, alpha = 0.2)
Exercise 4.3
We use xlim to set the boundaries of the x-axis and binwidth to decrease the bin
width:
ggplot(diamonds, aes(carat)) +
geom_histogram(binwidth = 0.01) +
xlim(0, 3)
It appears that carat values that are just above multiples of 0.25 are more common
than other values. We’ll explore that next.
Exercise 4.4
1. We set the colours using the fill aesthetic:
ggplot(diamonds, aes(cut, price, fill = cut)) +
geom_violin()
3. We then overlay a narrow boxplot, moving the fill aesthetic to geom_violin
so that the boxplots use the default colours instead of different colours for each
category:
ggplot(diamonds, aes(cut, price)) +
geom_violin(aes(fill = cut), width = 1.25) +
geom_boxplot(width = 0.1, alpha = 0.5) +
theme(legend.position = "none")
4. Finally, we can create a horizontal version of the figure in the same way we did
for boxplots in Section 2.18: by adding coord_flip() to the plot:
ggplot(diamonds, aes(cut, price)) +
geom_violin(aes(fill = cut), width = 1.25) +
geom_boxplot(width = 0.1, alpha = 0.5) +
theme(legend.position = "none") +
coord_flip()
Exercise 4.5
We can create an interactive scatterplot using:
myPlot <- ggplot(diamonds, aes(x, y,
text = paste("Row:", rownames(diamonds)))) +
geom_point()
ggplotly(myPlot)
There are outliers along the y-axis on rows 24,068 and 49,190. There are also some
points for which 𝑥 = 0. Examples include rows 11,183 and 49,558. It isn’t clear
from the plot, but in total there are 8 such points, 7 of which have both 𝑥 = 0 and
𝑦 = 0. To view all such diamonds, you can use filter(diamonds, x==0). These
observations must be due to data errors, since diamonds can’t have 0 width. The high
𝑦-values also seem suspicious - carat is a measure of diamond weight, and if these
diamonds really were 10 times longer than others then we would probably expect
them to have unusually high carat values as well (which they don’t).
Exercise 4.6
The two outliers are the only observations for which 𝑦 > 20, so we use that as our
condition:
ggplot(diamonds, aes(x, y)) +
geom_point() +
geom_text(aes(label = ifelse(y > 20, rownames(diamonds), "")),
hjust = 1.1)
Exercise 4.7
In this plot, we see that virtually all high carat diamonds have missing x values. This
seems to indicate that there is a systematic pattern to the missing data (which of
course is correct in this case!), and we should proceed with any analyses of x with
caution.
Exercise 4.8
The code below is an example of what your analysis can look like, with some remarks
as comments:
# Investigate missing data
colSums(is.na(flights2))
# Not too much missing data in this dataset!
View(flights2[is.na(flights2$air_time),])
# Flights with missing data tend to have several missing variables.
Exercise 4.9
2. We can use the method argument in geom_smooth to fit a straight line using
lm instead of LOESS:
ggplot(msleep, aes(brainwt, sleep_total)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Brain weight (logarithmic scale)") +
ylab("Total sleep time") +
scale_x_log10()
4. Finally, we can change the colour of the smoothing line using the colour argu-
ment:
ggplot(msleep, aes(brainwt, sleep_total)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, colour = "red") +
xlab("Brain weight (logarithmic scale)") +
ylab("Total sleep time") +
scale_x_log10()
Exercise 4.10
1. Adding the geom_smooth geom with the default settings produces a trend line
that does not capture seasonality:
autoplot(a10) +
geom_smooth()
4. The colour argument can be passed to autoplot to change the colour of the
time series line:
autoplot(a10, colour = "red") +
geom_smooth() +
xlab("Year") +
ylab("Sales ($ million)")
Exercise 4.11
1. The text can be added by using annotate(geom = "text", ...). In order
not to draw the text on top of the circle, you can shift the x-value of the text
(the appropriate shift depends on the size of your plot window):
autoplot(gold) +
annotate(geom = "point", x = spike_date, y = gold[spike_date],
size = 5, shape = 21, colour = "red",
fill = "transparent") +
annotate(geom = "text", x = spike_date - 100,
y = gold[spike_date],
label = "Incorrect value!")
2. We can remove the erroneous value by replacing it with NA in the time series:
gold[spike_date] <- NA
autoplot(gold)
Exercise 4.12
1. We can specify which variables to include in the plot as follows:
autoplot(elecdaily[, c("Demand", "Temperature")], facets = TRUE)
This produces a terrible-looking label for the y-axis, which we can remove by setting
the y-label to NULL:
autoplot(elecdaily[, c("Demand", "Temperature")], facets = TRUE) +
ylab(NULL)
Exercise 4.13
1. We set the size of the points using geom_point(size):
2. To add annotations, we use annotate and some code to find the days of the
lowest and highest temperatures:
## Lowest temperature
lowest <- which.min(elecdaily2$Temperature)
## Highest temperature
highest <- which.max(elecdaily2$Temperature)
Exercise 4.14
We can specify aes(group) for a particular geom only as follows:
ggplot(Oxboys, aes(age, height, colour = Subject)) +
geom_point() +
geom_line(aes(group = Subject)) +
geom_smooth(method = "lm", colour = "red", se = FALSE)
Subject is now used for grouping the points used to draw the lines (i.e. for
geom_line), but not for geom_smooth, which now uses all the points to create a
trend line showing the average height of the boys over time.
Exercise 4.15
Code for producing the three plots is given below:
library(fma)
# Time series plot:
autoplot(writing) +
geom_smooth() +
ylab("Sales (francs)") +
ggtitle("Sales of printing and writing paper")
# Seasonal plot
ggseasonplot(writing, year.labels = TRUE, year.labels.left = TRUE) +
ylab("Sales (francs)") +
ggtitle("Seasonal plot of sales of printing and writing paper")
# There is a huge dip in sales in August, when many French offices are
# closed due to holidays.
# stl-decomposition
autoplot(stl(writing, s.window = 365)) +
ggtitle("Seasonal decomposition of paper sales time series")
Exercise 4.16
We use the cpt.var function with the default settings:
library(forecast)
library(fpp2)
library(changepoint)
library(ggfortify)
autoplot(cpt.var(elecdaily[, "Demand"]))
The variance is greater in the beginning of the year, and then appears to be more or
less constant. Perhaps this can be explained by temperature?
# Plot the time series:
autoplot(elecdaily[,"Temperature"])
We see that the high-variance period coincides with peaks and large oscillations in
temperature, which would cause the energy demand to increase and decrease more
than usual, making the variance greater.
Exercise 4.17
By adding a copy of the observation for month 12, with the Month value replaced by
0, we can connect the endpoints to form a continuous curve:
Exercise 4.18
As for all ggplot2 plots, we can use ggtitle to add a title to the plot:
ggpairs(diamonds[, which(sapply(diamonds, class) == "numeric")],
aes(colour = diamonds$cut, alpha = 0.5)) +
ggtitle("Numeric variables in the diamonds dataset")
Exercise 4.19
1. We create the correlogram using ggcorr as follows:
ggcorr(diamonds[, which(sapply(diamonds, class) == "numeric")])
4. low and high can be used to control the colours at the endpoints of the scale:
ggcorr(diamonds[, which(sapply(diamonds, class) == "numeric")],
method = c("pairwise", "spearman"),
nbreaks = 5,
low = "yellow", high = "black")
Exercise 4.20
1. We replace colour = vore in the aes by fill = vore and add colour =
"black", shape = 21 to geom_point. The points now get black borders,
which makes them a bit sharper:
2. We can use ggplotly to create an interactive version of the plot. Adding text
to the aes allows us to include more information when hovering points:
library(plotly)
myPlot <- ggplot(msleep, aes(brainwt, sleep_total, fill = vore,
size = bodywt, text = name)) +
geom_point(alpha = 0.5, colour = "black", shape = 21) +
xlab("log(Brain weight)") +
ylab("Sleep total (h)") +
scale_x_log10() +
scale_size(range = c(1, 20), trans = "sqrt",
name = "Square root of\nbody weight") +
scale_color_discrete(name = "Feeding behaviour")
ggplotly(myPlot)
Exercise 4.21
1. We create the tile plot using geom_tile. By setting fun = max we obtain the
highest price in each bin:
ggplot(diamonds, aes(table, depth, z = price)) +
geom_tile(binwidth = 1, stat = "summary_2d", fun = max) +
ggtitle("Highest prices for diamonds with different depths
and tables")
Diamonds with carat around 0.3 and price around 1000 have the highest bin counts.
Exercise 4.22
1. VS2 and Ideal is the most common combination:
2. As for continuous variables, we can use geom_tile with the arguments stat =
"summary_2d", fun = mean to display the average prices for different combi-
nations. SI2 and Premium is the combination with the highest average price:
ggplot(diamonds, aes(clarity, cut, z = price)) +
geom_tile(binwidth = 1, stat = "summary_2d", fun = mean) +
ggtitle("Mean prices for diamonds with different
clarities and cuts")
Exercise 4.23
1. We create the scatterplot using:
library(gapminder)
library(GGally)
ggplotly(myPlot)
Exercise 4.24
1. Fixed wing multi engine Boeings are the most common planes:
library(nycflights13)
library(ggplot2)
2. The fixed wing multi engine Airbus has the highest average number of seats:
ggplot(planes, aes(type, manufacturer, z = seats)) +
geom_tile(binwidth = 1, stat = "summary_2d", fun = mean) +
ggtitle("Number of seats for different planes")
3. The number of seats seems to have increased in the 1980’s, and then reached a
plateau:
ggplot(planes, aes(year, seats)) +
geom_point(aes(colour = engine)) +
geom_smooth()
The plane with the largest number of seats is not an Airbus, but a Boeing 747-451. It
can be found using planes[which.max(planes$seats),] or visually using plotly:
myPlot <- ggplot(planes, aes(year, seats,
text = paste("Tail number:", tailnum,
"<br>Manufacturer:",
manufacturer))) +
geom_point(aes(colour = engine)) +
geom_smooth()
ggplotly(myPlot)
4. Finally, we can investigate what engines were used during different time periods
in several ways, for instance by differentiating engines by colour in our previous
plot:
ggplot(planes, aes(year, seats)) +
geom_point(aes(colour = engine)) +
geom_smooth()
Exercise 4.25
First, we compute the principal components:
library(ggplot2)
library(ggfortify)
pca <- prcomp(diamonds[, which(sapply(diamonds, class) == "numeric")],
              center = TRUE, scale. = TRUE)
summary(pca)
The first PC accounts for 65.5 % of the total variance. The first two account for
86.9 % and the first three account for 98.3 % of the total variance, meaning that 3
components are needed to account for at least 90 % of the total variance.
2. To see the loadings, we type:
pca
The first PC appears to measure size: it is dominated by carat, x, y and z, which all
are size measurements. The second PC is dominated by depth and table
and is therefore a summary of those measures.
3. To compute the correlation, we use cor:
cor(pca$x[,1], diamonds$price)
The (Pearson) correlation is 0.89, which is fairly high. Size is clearly correlated to
price!
4. To see if the first two principal components can be used to distinguish between
diamonds with different cuts, we make a scatterplot:
autoplot(pca, data = diamonds, colour = "cut")
The points are mostly gathered in one large cloud. Apart from the fact that very
large or very small values of the second PC indicates that a diamond has a Fair cut,
the first two principal components seem to offer little information about a diamond’s
cut.
Exercise 4.26
We create the scatterplot with the added arguments:
seeds <- read.table("https://tinyurl.com/seedsdata",
col.names = c("Area", "Perimeter", "Compactness",
"Kernel_length", "Kernel_width", "Asymmetry",
"Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)
library(ggfortify)
pca <- prcomp(seeds[, -8], center = TRUE, scale. = TRUE)
autoplot(pca, data = seeds, colour = "Variety",
loadings = TRUE, loadings.label = TRUE)
Exercise 4.27
We change the hc_method and hc_metric arguments to use complete linkage and
the Manhattan distance:
library(cluster)
library(factoextra)
votes.repub %>% scale() %>%
hcut(k = 5, hc_func = "agnes",
hc_method = "complete",
hc_metric = "manhattan") %>%
fviz_dend()
fviz_dend produces ggplot2 plots. We can save the plots from both approaches
and then plot them side-by-side using patchwork as in Section 4.3.4:
votes.repub %>% scale() %>%
hcut(k = 5, hc_func = "agnes",
hc_method = "average",
hc_metric = "euclidean") %>%
fviz_dend() -> dendro1
votes.repub %>% scale() %>%
hcut(k = 5, hc_func = "agnes",
hc_method = "complete",
hc_metric = "manhattan") %>%
fviz_dend() -> dendro2
library(patchwork)
dendro1 / dendro2
Alaska and Vermont are clustered together in both cases. The red leftmost cluster
is similar but not identical, including Alabama, Georgia and Louisiana.
To compare the two dendrograms in a different way, we can use tanglegram. Setting
k_labels = 5 and k_branches = 5 gives us 5 coloured clusters:
votes.repub %>% scale() %>%
hcut(k = 5, hc_func = "agnes",
hc_method = "average",
hc_metric = "euclidean") -> clust1
votes.repub %>% scale() %>%
hcut(k = 5, hc_func = "agnes",
hc_method = "complete",
hc_metric = "manhattan") -> clust2
library(dendextend)
tanglegram(as.dendrogram(clust1),
as.dendrogram(clust2),
k_labels = 5,
k_branches = 5)
Note that the colours of the lines connecting the two dendrograms are unrelated to
the colours of the clusters.
Exercise 4.28
Using the default settings in agnes, we can do the clustering using:
library(cluster)
library(magrittr)
USArrests %>% scale() %>%
agnes() %>%
plot(which = 2)
Maryland is clustered with New Mexico, Michigan and Arizona, in that order.
Exercise 4.29
We draw a heatmap, with the data standardised in the column direction because we
wish to cluster the observations rather than the variables:
library(cluster)
library(magrittr)
USArrests %>% as.matrix() %>% heatmap(scale = "col")
You may want to increase the height of your Plot window so that the names of all
states are displayed properly.
The heatmap shows that Maryland and the states similar to it have higher crime
rates than most other states. There are a few other states with high crime rates in
other clusters, but those tend to only have a high rate for one crime (e.g. Georgia,
which has a very high murder rate), whereas states in the cluster that Maryland is
in have high rates for all or almost all types of violent crime.
Exercise 4.30
First, we inspect the data:
library(cluster)
?chorSub
# Scatterplot matrix:
library(GGally)
ggpairs(chorSub)
There are a few outliers, so it may be a good idea to use pam as it is less affected by
outliers than kmeans. Next, we draw some plots to help us choose 𝑘:
library(factoextra)
library(magrittr)
chorSub %>% scale() %>%
fviz_nbclust(pam, method = "wss")
chorSub %>% scale() %>%
fviz_nbclust(pam, method = "silhouette")
chorSub %>% scale() %>%
fviz_nbclust(pam, method = "gap")
There is no pronounced elbow in the WSS plot, although slight changes appear to
occur at 𝑘 = 3 and 𝑘 = 7. Judging by the silhouette plot, 𝑘 = 3 may be a good
choice, while the gap statistic indicates that 𝑘 = 7 would be preferable. Let’s try
both values:
# k = 3:
chorSub %>% scale() %>%
pam(k = 3) -> kola_cluster
fviz_cluster(kola_cluster, geom = "point")
# k = 7:
chorSub %>% scale() %>%
pam(k = 7) -> kola_cluster
fviz_cluster(kola_cluster, geom = "point")
The plot for 𝑘 = 7 may look a little strange, with two largely overlapping clusters.
Bear in mind though, that the clustering algorithm uses all 10 variables and not just
the first two principal components, which are what is shown in the plot. The differences
between the two clusters aren't captured by the first two principal components.
Exercise 4.31
First, we try to find a good number of clusters:
library(factoextra)
library(magrittr)
USArrests %>% scale() %>%
fviz_nbclust(fanny, method = "wss")
USArrests %>% scale() %>%
fviz_nbclust(fanny, method = "silhouette")
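The clustering step itself is missing from the solution above. A sketch of what it
presumably looked like, assuming that we settle on 𝑘 = 2 clusters based on the
silhouette plot (the object name USAclusters matches the membership call below):

```r
# Fuzzy clustering with fanny, using k = 2 clusters:
library(cluster)
USArrests %>% scale() %>%
      fanny(k = 2) -> USAclusters
```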
# Show memberships:
USAclusters$membership
Maryland is mostly associated with the first cluster. Its neighbouring state New
Jersey is equally associated with both clusters.
Exercise 4.32
We do the clustering and plot the resulting clusters:
library(cluster)
library(mclust)
kola_cluster <- Mclust(scale(chorSub))
summary(kola_cluster)
Three clusters, that overlap substantially when the first two principal components
are plotted, are found.
Exercise 4.33
First, we have a look at the data:
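The 2-factor model discussed below is missing from the code. It can be fitted in the
same way as the 3-factor model:

```r
# 2-factor model:
library(psych)
ab_fa2 <- fa(ability.cov$cov, nfactors = 2,
             rotate = "oblimin", fm = "ml")
fa.diagram(ab_fa2, simple = FALSE)
```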
?ability.cov
ability.cov
We can imagine several different latent variables that could explain how well the
participants performed in these tests: general ability, visual ability, verbal ability,
and so on. Let’s use a scree plot to determine how many factors to use:
library(psych)
scree(ability.cov$cov, pc = FALSE)
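The scree plot suggests trying a small number of factors. A sketch of the 2-factor
model discussed below, fitted with the same settings as the 3-factor model:

```r
# 2-factor model (fitted analogously to the 3-factor model):
ab_fa2 <- fa(ability.cov$cov, nfactors = 2,
             rotate = "oblimin", fm = "ml")
fa.diagram(ab_fa2, simple = FALSE)
```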
# 3-factor model:
ab_fa3 <- fa(ability.cov$cov, nfactors = 3,
rotate = "oblimin", fm = "ml")
fa.diagram(ab_fa3, simple = FALSE)
In the 2-factor model, one factor is primarily associated with the visual variables
(which we interpret as the factor describing visual ability), whereas the other is
primarily associated with reading and vocabulary (verbal ability). Both are associated
with the measure of general intelligence.
In the 3-factor model, there is still a factor associated with reading and vocabulary.
There are two factors associated with the visual tests: one with block design and
mazes and one with picture completion and general intelligence.
Exercise 4.34
First, we have a look at the data:
library(poLCA)
?cheating
View(cheating)
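The model-fitting step is not shown above. A sketch of how a 2-class latent class
model could be fitted with poLCA; the formula below, including GPA as a covariate,
is an assumption based on the discussion of GPA that follows:

```r
# Fit a 2-class latent class model with GPA as a covariate
# (hypothetical sketch -- the exact formula used is an assumption):
f <- cbind(LIEEXAM, LIEPAPER, FRAUD, COPYEXAM) ~ GPA
cheat_lca <- poLCA(f, data = cheating, nclass = 2)
cheat_lca
```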
The two classes roughly correspond to cheaters and non-cheaters. From the table
showing the relationship with GPA, we see that students with high GPAs are less
likely to be cheaters.
Chapter 5
Exercise 5.1
1. as.logical returns FALSE for 0 and TRUE for all other numbers:
as.logical(0)
as.logical(1)
as.logical(14)
as.logical(-8.889)
as.logical(pi^2 + exp(18))
2. When the as. functions are applied to vectors, they convert all values in the
vector:
as.character(c(1, 2, 3, pi, sqrt(2)))
3. The is. functions return a logical: TRUE if the variable is of the type and
FALSE otherwise:
is.numeric(27)
is.numeric("27")
is.numeric(TRUE)
4. The is. functions show that NA in fact is a (special type of) logical. This is
also verified by the documentation for NA:
is.logical(NA)
is.numeric(NA)
is.character(NA)
?NA
Exercise 5.2
We set file_path to the path for vas.csv and load the data as in Exercise 3.8:
vas <- read.csv(file_path, sep = ";", dec = ",", skip = 4)
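The splitting step is missing here. A sketch, assuming the data should be split by
patient ID (the column name ID is an assumption):

```r
# Split the data into one data frame per patient:
vas_split <- split(vas, vas$ID)
```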
To access the values for patient 212, either of the following works:
vas_split$`212`
vas_split[[12]]
Exercise 5.3
1. To convert the proportions to percentages with one decimal place, we must first
multiply them by 100 and then round them:
props <- c(0.1010, 0.2546, 0.6009, 0.0400, 0.0035)
round(100 * props, 1)
2. The cumulative maxima and minima are computed using cummax and cummin:
cummax(airquality$Temp)
cummin(airquality$Temp)
The minimum during the period occurs on the 5th day, whereas the maximum occurs
during day 120.
3. To find runs of days with temperatures above 80, we use rle:
runs <- rle(airquality$Temp > 80)
To find runs with temperatures above 80, we extract the length of the runs for which
runs$values is TRUE:
runs$lengths[runs$values == TRUE]
Exercise 5.4
1. On virtually all systems, the largest number that R can represent as a floating
point is 1.797693e+308. You can find this by gradually trying larger and larger
numbers:
1e+100
# ...
1e+308
1e+309 # The largest number must be between 1e+308 and 1e+309!
# ...
1.797693e+308
1.797694e+308
Exercise 5.5
We re-use the solution from Exercise 3.7:
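The code is missing here; the Exercise 3.7 solution, which also appears in Exercise
5.17, is:

```r
# Bin the temperatures into categories:
airquality$TempCat <- cut(airquality$Temp,
                          breaks = c(50, 70, 90, 110))
```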
Exercise 5.6
1. We start by converting the vore variable to a factor:
library(ggplot2)
str(msleep) # vore is a character vector!
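The conversion and the computation of the group means are missing from the
solution. A sketch (the object name means matches the rank/match code that
follows; using aggregate here is an assumption):

```r
# Convert vore to a factor:
msleep$vore <- factor(msleep$vore)

# Compute the mean sleeping time for each vore group:
means <- aggregate(sleep_total ~ vore, data = msleep, FUN = mean)
```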
# Alternatively, rank and match can be used to get the new order of
# the levels:
?rank
?match
ranks <- rank(means$sleep_total)
new_order <- match(1:4, ranks)
Exercise 5.7
First, we set file_path to the path to handkerchiefs.csv and import it to the
data frame pricelist:
pricelist <- read.csv(file_path)
2. We can use grep and a regular expression to see that there are 2 rows of the
Italian.handkerchief column that contain numbers:
grep("[[:digit:]]", pricelist$Italian.handkerchiefs)
3. To extract the prices in shillings (S) and pence (D) from the Price column and
store these in two new numeric variables in our data frame, we use strsplit,
unlist and matrix as follows:
# Split strings at the space between the numbers and the letters:
Price_split <- strsplit(pricelist$Price, " ")
Price_split <- unlist(Price_split)
Price_matrix <- matrix(Price_split, nrow = length(Price_split)/4,
ncol = 4, byrow = TRUE)
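The final step, adding the numbers to the data frame, is missing. A sketch,
assuming that columns 1 and 3 of the matrix hold the shilling and pence values
(the new variable names are assumptions):

```r
# Store the shillings and pence as numeric variables:
pricelist$Shillings <- as.numeric(Price_matrix[, 1])
pricelist$Pence <- as.numeric(Price_matrix[, 3])
```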
Exercise 5.8
We set file_path to the path to oslo-biomarkers.xlsx and load the data:
library(openxlsx)
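The import step is missing; with openxlsx it presumably looked like:

```r
# Import the Excel file:
oslo <- read.xlsx(file_path)
```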
To find out how many patients were included in the study, we use strsplit to split
the ID-timepoint string, and then unique:
oslo_id <- unlist(strsplit(oslo$"PatientID.timepoint", "-"))
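The matrix step used below is missing. A sketch, following the same
strsplit/matrix pattern as in Exercise 5.7:

```r
# Arrange the IDs and timepoints in a two-column matrix,
# with one row per measurement:
oslo_id_matrix <- matrix(oslo_id, nrow = length(oslo_id)/2,
                         ncol = 2, byrow = TRUE)
```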
unique(oslo_id_matrix[,1])
length(unique(oslo_id_matrix[,1]))
Exercise 5.9
1. "g$" matches strings ending with g:
contacts$Address[grep("g$", contacts$Address)]
Exercise 5.10
We want to extract all words, i.e. segments of characters separated by white spaces.
First, let’s create the string containing example sentences:
x <- "This is an example of a sentence, with 10 words. Here are 4 more!"
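The splitting step is missing here; we split the string at each space:

```r
# Split the string at the white spaces:
x_split <- strsplit(x, " ")
```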
Note that x_split is a list. To turn this into a vector, we use unlist:
x_split <- unlist(x_split)
Finally, we can use gsub to remove the punctuation marks, so that only the words
remain:
gsub("[[:punct:]]", "", x_split)
Exercise 5.11
1. The functions are used to extract the weekday, month and quarter for each
date:
weekdays(dates)
months(dates)
quarters(dates)
2. julian can be used to compute the number of days from a specific date
(e.g. 1970-01-01) to each date in the vector:
julian(dates, origin = as.Date("1970-01-01", format = "%Y-%m-%d"))
Exercise 5.12
1. On most systems, converting the three variables to Date objects using as.Date
yields correct dates without times:
as.Date(c(time1, time2, time3))
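Step 2 of the solution is missing. Judging from the result discussed below, it
involved adding 1 to a Date object (assuming that time1 represents 2020-04-01):

```r
# 2. Add 1 to a Date object:
as.Date(time1) + 1
```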
The result is 2020-04-02, i.e. adding 1 to the Date object has added 1 day to it.
3. We convert time3 and time1 to Date objects and subtract them:
as.Date(time3) - as.Date(time1)
The result is a difftime object, printed as Time difference of 2 days. Note that
the times are ignored, just as before.
4. We convert time2 and time1 to Date objects and subtract them:
as.Date(time2) - as.Date(time1)
The result is printed as Time difference of 0 days, because the difference in time
is ignored.
5. We convert the three variables to POSIXct date and time objects using
as.POSIXct without specifying the date format:
as.POSIXct(c(time1, time2, time3))
as.POSIXct(time3) - as.POSIXct(time1)
This time, the time of day is included when the difference is computed, and the output is
Time difference of 2.234722 days.
7. We convert time2 and time1 to POSIXct objects and subtract them:
as.POSIXct(time2) - as.POSIXct(time1)
Exercise 5.13
1. Using the first option, the Date becomes the first day of the quarter. Using
the second option, it becomes the last day of the quarter instead. Both can be
useful for presentation purposes - which you prefer is a matter of taste.
2. To convert the quarter-observations to the first day of their respective quarters,
we use as.yearqtr as follows:
library(zoo)
as.Date(as.yearqtr(qvec2, format = "Q%q/%y"))
as.Date(as.yearqtr(qvec3, format = "Q%q-%Y"))
%q, %y, and %Y are date tokens. The other letters and symbols in the format argument
simply describe other characters included in the format.
Exercise 5.14
The x-axis of the data can be changed in multiple ways. A simple approach is the
following:
## Create a new data frame with the correct dates and the demand data:
dates <- seq.Date(as.Date("2014-01-01"), as.Date("2014-12-31"),
by = "day")
elecdaily2 <- data.frame(dates = dates, demand = elecdaily[,1])
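The plotting step is not shown. A sketch using ggplot2:

```r
# Plot the demand against the correct dates:
library(ggplot2)
ggplot(elecdaily2, aes(dates, demand)) +
      geom_line() +
      xlab("Date") + ylab("Demand")
```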
A more elegant approach relies on the xts package for time series:
library(xts)
## Convert time series to an xts object:
dates <- seq.Date(as.Date("2014-01-01"), as.Date("2014-12-31"),
by = "day")
elecdaily3 <- xts(elecdaily, order.by = dates)
autoplot(elecdaily3[,"Demand"])
Exercise 5.15
Exercise 5.16
We set file_path to the path for vas.csv and read the data as in Exercise 3.8 and
convert it to a data.table (the last step being optional if we’re only using dplyr
for this exercise):
library(data.table)
vas <- read.csv(file_path, sep = ";", dec = ",", skip = 4)
vas <- as.data.table(vas)
A better option is to achieve the same result in a single line by using the fread
function from data.table:
vas <- fread(file_path, sep = ";", dec = ",", skip = 4)
Exercise 5.17
We re-use the solution from Exercise 3.7:
airquality$TempCat <- cut(airquality$Temp,
breaks = c(50, 70, 90, 110))
aq <- data.table(airquality)
Exercise 5.18
We set file_path to the path for vas.csv and read the data as in Exercise 3.8 using
fread to import it as a data.table:
vas <- fread(file_path, sep = ";", dec = ",", skip = 4)
2. Next, we compute the lowest and highest VAS recorded for each patient:
3. Finally, we compute the number of high-VAS days for each patient. We can
compute the sum directly:
Alternatively, we can do this by first creating a dummy variable for high-VAS days:
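The code for steps 2 and 3 is missing above. A data.table sketch, assuming that
the patient ID column is named ID and that "high-VAS days" means days with a VAS
of at least 7:

```r
# 2. Lowest and highest VAS per patient:
vas[, .(lowest = min(VAS, na.rm = TRUE),
        highest = max(VAS, na.rm = TRUE)), by = ID]

# 3. Number of high-VAS days per patient, computed directly:
vas[, .(high_days = sum(VAS >= 7, na.rm = TRUE)), by = ID]

# ...or via a dummy variable:
vas[, HighVAS := VAS >= 7]
vas[, .(high_days = sum(HighVAS, na.rm = TRUE)), by = ID]
```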
Exercise 5.19
First we load the data and convert it to a data.table (the last step being optional
if we’re only using dplyr for this exercise):
library(datasauRus)
dd <- as.data.table(datasaurus_dozen)
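The computation and plotting steps are missing. A sketch of what they presumably
looked like:

```r
# Summary statistics are nearly identical across the 13 datasets:
dd[, .(mean_x = mean(x), mean_y = mean(y),
       sd_x = sd(x), sd_y = sd(y),
       cor_xy = cor(x, y)), by = dataset]

# ...but plotting the data reveals very different patterns:
library(ggplot2)
ggplot(dd, aes(x, y)) +
      geom_point() +
      facet_wrap(~ dataset)
```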
Clearly, the datasets are very different! This is a great example of how simply
computing summary statistics is not enough. They tell a part of the story, yes,
but only a part.
Exercise 5.20
We set file_path to the path for vas.csv and read the data as in Exercise 3.8 using
fread to import it as a data.table:
library(data.table)
vas <- fread(file_path, sep = ";", dec = ",", skip = 4)
Exercise 5.21
We set file_path to the path to ucdp-onesided-191.csv and load the data as a
data.table using fread:
library(dplyr)
library(data.table)
1. First, we filter the rows so that only conflicts that took place in Colombia are
retained.
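The import and filtering code is not shown. A sketch, assuming that the country is
stored in a column named location:

```r
# Import the data and keep only conflicts in Colombia:
ucdp <- fread(file_path)
colombia <- ucdp %>% filter(location == "Colombia")
```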
To list the number of different actors responsible for attacks, we can use unique:
unique(colombia$actor_name)
We see that there were attacks by 7 different actors during the period.
2. To find the number of fatalities caused by government attacks on civilians, we
first filter the data to only retain rows where the actor name contains the word
government:
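A sketch of the filtering step, assuming (as in part 1) that actor names are stored
in the actor_name column:

```r
# Keep only rows where the actor name contains "Government":
gov <- colombia %>%
      filter(grepl("Government", actor_name))
```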
To estimate the number of fatalities caused by these attacks, we sum the fatalities
from each attack:
sum(gov$best_fatality_estimate)
Exercise 5.22
We set file_path to the path to oslo-biomarkers.xlsx and load the data:
library(dplyr)
library(data.table)
library(openxlsx)
1. First, we select only the measurements from blood samples taken at 12 months.
These are the only observations where the PatientID.timepoint column con-
tains the word months:
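The filtering code is missing. A dplyr sketch, assuming the data have been
imported as oslo with read.xlsx(file_path):

```r
# Keep only the 12-month measurements:
oslo %>% filter(grepl("months", PatientID.timepoint))
```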
2. Second, we select only the measurements from the patient with ID number
6. Note that we cannot simply search for strings containing a 6, as we then
also would find measurements from other patients taken at 6 weeks, as well as
patients with a 6 in their ID number, e.g. patient 126. Instead, we search for
strings beginning with 6-:
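A sketch of the corresponding filter, again assuming the data have been imported
as oslo:

```r
# Keep only rows for the patient with ID number 6:
oslo %>% filter(grepl("^6-", PatientID.timepoint))
```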
Exercise 5.23
We set file_path to the path to ucdp-onesided-191.csv and load the data as a
data.table using fread:
library(dplyr)
library(data.table)
Exercise 5.24
We set file_path to the path to oslo-biomarkers.xlsx and load the data:
library(dplyr)
library(data.table)
library(openxlsx)
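The import and sorting code is not shown. A sketch of what it presumably looked
like:

```r
# Import the data and sort it by the ID-timepoint column:
oslo <- read.xlsx(file_path)
oslo <- oslo %>% arrange(PatientID.timepoint)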
Note that because PatientID.timepoint is a character column, the rows are now
ordered in alphabetical order, meaning that patient 1 is followed by 100, 101, 102,
and so on. To order the patients in numerical order, we must first split the ID and
timepoints into two different columns. We’ll see how to do that in the next section,
and try it out on the oslo data in Exercise 5.25.
Exercise 5.25
We set file_path to the path to oslo-biomarkers.xlsx and load the data:
library(dplyr)
library(tidyr)
library(data.table)
library(openxlsx)
3. Finally, we reformat the data from long to wide, keeping the IL-8 and VEGF-A
measurements. We store it as oslo2, knowing that we’ll need it again in Exercise
5.26.
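The code for the steps is missing. A sketch of how the ID and timepoint could be
separated and the data reformatted to wide format; the exact biomarker column
names (IL-8 and VEGF-A) are assumptions:

```r
# Import the data, split the ID-timepoint column, and
# reformat from long to wide:
oslo <- read.xlsx(file_path)
oslo %>%
      separate(PatientID.timepoint,
               into = c("PatientID", "timepoint"),
               sep = "-") %>%
      pivot_wider(id_cols = PatientID,
                  names_from = timepoint,
                  values_from = c(`IL-8`, `VEGF-A`)) -> oslo2
```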
Exercise 5.26
We use the oslo2 data frame that we created in Exercise 5.25. In addition, we set
file_path to the path to oslo-covariates.xlsx and load the data:
library(dplyr)
library(data.table)
library(openxlsx)
1. First, we merge the wide data frame from Exercise 5.25 with the
oslo-covariates.xlsx data, using patient ID as key. A left join, where we
only keep data for patients with biomarker measurements, seems appropriate
here. We see that both datasets have a column named PatientID, which we
can use as our key.
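The join itself is not shown. A sketch, assuming that the covariates have been
imported as oslo_covariates with read.xlsx:

```r
# Left join: keep all patients with biomarker measurements:
oslo_covariates <- read.xlsx(file_path)
oslo_joined <- left_join(oslo2, oslo_covariates,
                         by = "PatientID")
```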
2. Next, we use the oslo-covariates.xlsx data to select data for smokers from
the wide data frame using a semijoin. The Smoker.(1=yes,.2=no) column
contains information about smoking habits. First we create a table for filtering:
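The filtering table and the semijoin are missing. A sketch, assuming the covariate
data have been imported as oslo_covariates; the column name is copied from the
text, and backticks are needed because of its special characters:

```r
# Table of smokers, used for filtering:
smokers <- oslo_covariates %>%
      filter(`Smoker.(1=yes,.2=no)` == 1) %>%
      select(PatientID)

# Semijoin: keep measurements for smokers only:
semi_join(oslo2, smokers, by = "PatientID")
```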
Exercise 5.27
Noting that the first four characters in each element of the vector contain the year,
we can use substr to only keep those characters. Finally, we use as.numeric to
convert the text to numbers:
keytars$Dates <- substr(keytars$Dates, 1, 4)
keytars$Dates <- as.numeric(keytars$Dates)
keytars$Dates
Chapter 6
Exercise 6.1
Exercise 6.2
1. We want our function to take a vector as input and return a vector containing
its minimum and maximum, without using min and max:
minmax <- function(x)
{
# Sort x so that the minimum becomes the first element
# and the maximum becomes the last element:
sorted_x <- sort(x)
min_x <- sorted_x[1]
max_x <- sorted_x[length(sorted_x)]
return(c(min_x, max_x))
}
2. We want a function that computes the mean of the squared values of a vector
using mean, and that takes additional arguments that it passes on to mean
(e.g. na.rm):
mean2 <- function(x, ...)
{
return(mean(x^2, ...))
}
# With NA:
x <- c(3, 2, NA)
mean2(x) # Should be NA
mean2(x, na.rm = TRUE) # Should be 13/2=6.5
Exercise 6.3
We use cat to print a message about missing values, sum(is.na(.)) to compute
the number of missing values, na.omit to remove rows with missing data and then
summary to print the summary:
na_remove <- . %T>% {cat("Missing values:", sum(is.na(.)), "\n")} %>%
na.omit %>% summary
na_remove(airquality)
Exercise 6.4
The following operator allows us to plot y against x:
`%against%` <- function(y, x) { plot(x, y) }
Exercise 6.5
1. FALSE: x is not greater than 2.
2. TRUE: | means that at least one of the conditions needs to be satisfied, and x is
greater than z.
3. FALSE: & means that both conditions must be satisfied, and x is not greater
than y.
4. TRUE: the absolute value of x*z is 6, which is greater than y.
Exercise 6.6
There are two errors: the variable name in exists is not in quotes, and x > 0
evaluates to a vector and not a single value. The goal is to check that all values in x
are positive, so all can be used to collapse the logical vector x > 0:
x <- c(1, 2, pi, 8)
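The corrected conditional is missing here. A sketch of what it presumably looked
like after fixing both errors (the message text is an assumption):

```r
# Corrected check: quote the name passed to exists, and collapse
# the logical vector with all:
if(exists("x") & all(x > 0))
{
      cat("All elements of x are positive.\n")
}
```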
Exercise 6.7
1. To compute the mean temperature for each month in the airquality dataset
using a loop, we loop over the 6 months:
months <- unique(airquality$Month)
meanTemp <- vector("numeric", length(months))
for(i in seq_along(months))
{
# Extract data for month[i]:
aq <- airquality[airquality$Month == months[i],]
# Compute mean temperature:
meanTemp[i] <- mean(aq$Temp)
}
2. Next, we use a for loop to compute the maximum and minimum value of each
column of the airquality data frame, storing the results in a data frame:
results <- data.frame(min = vector("numeric", ncol(airquality)),
max = vector("numeric", ncol(airquality)))
for(i in seq_along(airquality))
{
results$min[i] <- min(airquality[,i], na.rm = TRUE)
results$max[i] <- max(airquality[,i], na.rm = TRUE)
}
results
for(i in seq_along(df))
{
results$min[i] <- min(df[,i], ...)
results$max[i] <- max(df[,i], ...)
return(results)
}
Exercise 6.8
1. We can create 0.25 0.5 0.75 1 in two different ways using seq:
seq(0.25, 1, length.out = 4)
seq(0.25, 1, by = 0.25)
Exercise 6.9
We could create the same sequences using 1:ncol(airquality) and 1:length(airquality$Temp),
but if we accidentally apply those solutions to objects with zero length, we would
run into trouble! Let’s see what happens:
x <- c()
length(x)
Even though there are no elements in the vector, two iterations are run when we use
1:length(x) to set the values of the control variable:
for(i in 1:length(x)) { cat("Element", i, "of the vector\n") }
The reason is that 1:length(x) yields the vector 1 0, providing two values for the
control variable.
If we use seq_along instead, no iterations will be run, because seq_along(x) returns
zero values:
for(i in seq_along(x)) { cat("Element", i, "of the vector\n") }
This is the desired behaviour - if there are no elements in the vector then the loop
shouldn’t run! seq_along is the safer option, but 1:length(x) is arguably less
opaque and therefore easier for humans to read, which also has its benefits.
Exercise 6.10
To normalise the variable, we need to map the smallest value to 0 and the largest to
1:
normalise <- function(df, ...)
{
for(i in seq_along(df))
{
df[,i] <- (df[,i] - min(df[,i], ...))/(max(df[,i], ...) -
min(df[,i], ...))
}
return(df)
}
Exercise 6.11
We set folder_path to the path of the folder (making sure that the path ends with
/ (or \\ on Windows)). We can then loop over the .csv files in the folder and print
the names of their variables as follows:
files <- list.files(folder_path, pattern = "\\.csv$")
for(file in files)
{
csv_data <- read.csv(paste(folder_path, file, sep = ""))
cat(file, "\n")
cat(names(csv_data))
cat("\n\n")
}
Exercise 6.12
1. The condition in the outer loop, i < length(x), is used to check that the
element x[i+1] used in the inner loop actually exists. If i is equal to the
length of the vector (i.e. if x[i] is the last element of the vector) then there is no
element x[i+1] and consequently the run cannot go on. If this condition wasn’t
included, we would end up with an infinite loop.
2. The condition in the inner loop, x[i+1] == x[i] & i < length(x), is used to
check if the run continues. If x[i+1] == x[i] is TRUE then the next element of
x is the same as the current, meaning that the run continues. As in the previous
condition, i < length(x) is included to make sure that we don’t start looking
for elements outside of x, which could create an infinite loop.
3. The line run_values <- c(run_values, x[i-1]) creates a vector combining
the existing elements of run_values with x[i-1]. This allows us to store the
results in a vector without specifying its size in advance. Note, however, that
this approach is slower than specifying the vector size in advance, and that you
therefore should avoid it when using for loops.
Exercise 6.13
We modify the loop so that it skips to the next iteration if x[i] is 0, and breaks if
x[i] is NA:
x <- c(1, 5, 8, 0, 20, 0, 3, NA, 18, 2)
for(i in seq_along(x))
{
if(is.na(x[i])) { break }
if(x[i] == 0) { next }
cat("Step", i, "- reciprocal is", 1/x[i], "\n")
}
Exercise 6.14
We can put a conditional statement inside each of the loops, to check that both
variables are numeric:
cor_func <- function(df)
{
cor_mat <- matrix(NA, nrow = ncol(df), ncol = ncol(df))
for(i in seq_along(df))
{
if(!is.numeric(df[[i]])) { next }
for(j in seq_along(df))
{
if(!is.numeric(df[[j]])) { next }
cor_mat[i, j] <- cor(df[[i]], df[[j]],
use = "pairwise.complete")
}
}
return(cor_mat)
}
A (nicer?) alternative would be to check which columns are numeric and loop over
those:
cor_func <- function(df)
{
cor_mat <- matrix(NA, nrow = ncol(df), ncol = ncol(df))
indices <- which(sapply(df, class) == "numeric")
for(i in indices)
{
for(j in indices)
{
cor_mat[i, j] <- cor(df[[i]], df[[j]],
use = "pairwise.complete")
}
}
return(cor_mat)
}
Exercise 6.15
We could also write a function that computes both the minimum and the maximum
and returns both, and use that with apply:
minmax <- function(x, ...)
{
return(c(min = min(x, ...), max = max(x, ...)))
}
Exercise 6.16
We can for instance make use of the minmax function that we created in Exercise
6.15:
minmax <- function(x, ...)
{
return(c(min = min(x, ...), max = max(x, ...)))
}
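The first alternative is missing before the "# Or:" comment. A plausible sketch,
using by (that by was the missing option is a guess):

```r
# Apply minmax to the temperatures of each month:
by(airquality$Temp, airquality$Month, minmax)
```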
# Or:
tapply(airquality$Temp, airquality$Month, minmax)
Exercise 6.17
To compute minima and maxima, we can use:
minmax <- function(x, ...)
{
return(c(min = min(x, ...), max = max(x, ...)))
}
This time, we want to apply this function to two variables: Temp and Wind. We
can do this using apply:
minmax2 <- function(x, ...)
{
return(apply(x, 2, minmax))
}
tw <- split(airquality[,c("Temp", "Wind")], airquality$Month)
lapply(tw, minmax2)
Exercise 6.18
We can for instance make use of the minmax function that we created in Exercise
6.15:
library(purrr)
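The first purrr solution is missing after the library call. A sketch combining the
minmax function from Exercise 6.15 with split and map:

```r
# The minmax function from Exercise 6.15:
minmax <- function(x, ...)
{
      return(c(min = min(x, ...), max = max(x, ...)))
}

# Apply it to the temperatures of each month:
map(split(airquality$Temp, airquality$Month), minmax)
```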
We can also use a single pipe chain to split the data and apply the functional:
airquality %>% split(.$Month) %>% map(~minmax(.$Temp))
Exercise 6.19
Because we want to use both the variable names and their values, an imap_* function
is appropriate here:
data_summary <- function(df)
{
df %>% imap_dfr(~(data.frame(variable = .y,
unique_values = length(unique(.x)),
class = class(.x),
missing_values = sum(is.na(.x)) )))
}
Exercise 6.20
We combine map and imap to get the desired result. folder_path is the path to the
folder containing the .csv files. We must use set_names to set the file names as
element names, otherwise only the index of each file (in the file name vector) will be
printed:
list.files(folder_path, pattern = "\\.csv$") %>%
paste(folder_path, ., sep = "") %>%
set_names() %>%
map(read.csv) %>%
imap(~cat(.y, "\n", names(.x), "\n\n"))
Exercise 6.21
First, we load the data and create vectors containing all combinations
library(gapminder)
combos <- gapminder %>% distinct(continent, year)
continents <- combos$continent
years <- combos$year
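The iteration step is missing. A sketch using map2 to loop over the combinations;
computing the mean life expectancy for each continent-year pair is an assumption
about what the exercise asked for:

```r
library(purrr)
library(dplyr)

# Compute the mean life expectancy for each combination:
map2_dfr(continents, years,
         ~ gapminder %>%
              filter(continent == .x, year == .y) %>%
              summarise(continent = .x, year = .y,
                        mean_lifeExp = mean(lifeExp)))
```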
Exercise 6.22
First, we write a function for computing the mean of a vector with a loop:
mean_loop <- function(x)
{
m <- 0
n <- length(x)
for(i in seq_along(x))
{
m <- m + x[i]/n
}
return(m)
}
x <- 1:10000
mean_loop(x)
mean(x)
library(bench)
mark(mean(x), mean_loop(x))
mean_loop is several times slower than mean. The memory usage of both functions
is negligible.
Exercise 6.23
We can compare the three solutions as follows:
library(data.table)
library(dplyr)
library(nycflights13)
library(bench)
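The benchmarking code itself is not shown. A sketch, assuming the task was to
compare grouped means of arrival delays in the flights data; check = FALSE is
used because the three solutions return differently shaped objects:

```r
flights_dt <- as.data.table(flights)

mark(
  base = tapply(flights$arr_delay, flights$carrier,
                mean, na.rm = TRUE),
  dplyr = flights %>% group_by(carrier) %>%
      summarise(delay = mean(arr_delay, na.rm = TRUE)),
  data.table = flights_dt[, .(delay = mean(arr_delay, na.rm = TRUE)),
                          by = carrier],
  check = FALSE
)
```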
We see that dplyr is substantially faster and more memory efficient than the base R
solution, but that data.table beats them both by a clear margin.
Chapter 7
Exercise 7.1
The parameter replace controls whether or not replacement is used. To draw 5
random numbers with replacement, we use:
sample(1:10, 5, replace = TRUE)
Exercise 7.2
As an alternative to sample(1:10, n, replace = TRUE) we could use runif to
generate random numbers from 1:10. This can be done in at least three different
ways.
1. Generating (decimal) numbers between 0 and 10 and rounding up to the nearest
integer:
n <- 10 # Generate 10 numbers
ceiling(runif(n, 0, 10))
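The second approach is missing from the solution; presumably it was generating
numbers between 1 and 11 and rounding down:

```r
# 2. Generating (decimal) numbers between 1 and 11 and rounding
# down to the nearest integer:
floor(runif(n, 1, 11))
```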
3. Generating (decimal) numbers between 0.5 and 10.5 and rounding to the nearest
integer:
round(runif(n, 0.5, 10.5))
Exercise 7.3
First, we compare the histogram of the data to the normal density function:
library(ggplot2)
ggplot(msleep, aes(x = sleep_total)) +
geom_histogram(colour = "black", aes(y = ..density..)) +
geom_density(colour = "blue", size = 2) +
geom_function(fun = dnorm, colour = "red", size = 2,
args = list(mean = mean(msleep$sleep_total),
sd = sd(msleep$sleep_total)))
The density estimate is fairly similar to the normal density, but there appear to be
too many low values in the data.
Then a normal Q-Q plot:
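The Q-Q plot code is not shown; following the pattern used later in this solution:

```r
# Normal Q-Q plot of the total sleep times:
ggplot(msleep, aes(sample = sleep_total)) +
      geom_qq() + geom_qq_line()
```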
There are some small deviations from the line, but no large deviations. To decide
whether these deviations are large enough to be a concern, it may be a good idea to
compare this Q-Q-plot to Q-Q-plots from simulated normal data:
# Create a Q-Q-plot for the total sleep data, and store
# it in a list:
qqplots <- list(ggplot(msleep, aes(sample = sleep_total)) +
geom_qq() + geom_qq_line() + ggtitle("Actual data"))
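The code generating the simulated Q-Q-plots is missing. A sketch, assuming eight
simulated normal samples of the same size, mean, and standard deviation as the
data, plotted side by side; the use of patchwork here is an assumption:

```r
# Add Q-Q-plots for eight simulated normal samples:
for(i in 2:9)
{
      simdata <- data.frame(x = rnorm(nrow(msleep),
                                      mean(msleep$sleep_total),
                                      sd(msleep$sleep_total)))
      qqplots[[i]] <- ggplot(simdata, aes(sample = x)) +
            geom_qq() + geom_qq_line() + ggtitle("Simulated data")
}

# Plot them in a 3-by-3 grid:
library(patchwork)
(qqplots[[1]] | qqplots[[2]] | qqplots[[3]]) /
(qqplots[[4]] | qqplots[[5]] | qqplots[[6]]) /
(qqplots[[7]] | qqplots[[8]] | qqplots[[9]])
```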
The Q-Q-plot for the real data is pretty similar to those from the simulated samples.
We can’t rule out the normal distribution.
# Q-Q plot:
The right tail of the distribution differs greatly from the data. If we have to choose
between these two distributions, then the normal distribution seems to be the better
choice.
Exercise 7.4
1. The documentation for shapiro.test shows that it takes a vector containing
the data as input. So to apply it to the sleeping times data, we use:
library(ggplot2)
shapiro.test(msleep$sleep_total)
The p-value is 0.21, meaning that we can’t reject the null hypothesis of normality -
the test does not indicate that the data is non-normal.
2. Next, we generate data from a 𝜒²(100) distribution, and compare its
distribution to a normal density function:
generated_data <- data.frame(x = rchisq(2000, 100))
ggplot(generated_data, aes(x)) +
geom_histogram(colour = "black", aes(y = ..density..)) +
geom_density(colour = "blue", size = 2) +
geom_function(fun = dnorm, colour = "red", size = 2,
args = list(mean = mean(generated_data$x),
sd = sd(generated_data$x)))
The fit is likely to be very good - the data is visually very close to the normal
distribution. Indeed, it is rare in practice to find real data that is closer to the
normal distribution than this.
However, the Shapiro-Wilk test probably tells a different story:
shapiro.test(generated_data$x)
The lesson here is that if the sample size is large enough, the Shapiro-Wilk test (and
any other test for normality, for that matter) is likely to reject normality even if the
deviation from normality is tiny. When the sample size is too large, the power of the
test is close to 1 even for very small deviations. On the other hand, if the sample size
is small, the power of the Shapiro-Wilk test is low, meaning that it can’t be used to
detect non-normality.
In summary, you probably shouldn’t use formal tests for normality at all. And I say
that as someone who has written two papers introducing new tests for normality!
Exercise 7.5
As in Section 3.3, we set file_path to the path to vas.csv and load the data using
the code from Exercise 3.8:
vas <- read.csv(file_path, sep = ";", dec = ",", skip = 4)
The null hypothesis is that the mean 𝜇 is less than or equal to 6, meaning that the
alternative is that 𝜇 is greater than 6. To perform the test, we run:
t.test(vas$VAS, mu = 6, alternative = "greater")
The average VAS isn’t much higher than 6 - it’s 6.4 - but because the sample size is
fairly large (𝑛 = 2, 351) we are still able to detect that it indeed is greater.
Exercise 7.6
First, we assume that delta is 0.5 and that the standard deviation is 2, and want to
find the 𝑛 required to achieve 95 % power at a 5 % significance level:
power.t.test(power = 0.95, delta = 0.5, sd = 2, sig.level = 0.05,
type = "one.sample", alternative = "one.sided")
The actual sample size for this dataset was 𝑛 = 2,351. Let's see what power that
gives us:
power.t.test(n = 2351, delta = 0.5, sd = 2, sig.level = 0.05,
type = "one.sample", alternative = "one.sided")
The power is 1 (or rather, very close to 1). We’re more or less guaranteed to find
statistical evidence that the mean is greater than 6 if the true mean is 6.5!
Exercise 7.7
First, let’s compute the proportion of herbivores and carnivores that sleep for more
than 7 hours a day:
library(ggplot2)
herbivores <- msleep[msleep$vore == "herbi",]
n1 <- sum(!is.na(herbivores$sleep_total))
x1 <- sum(herbivores$sleep_total > 7, na.rm = TRUE)
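The corresponding computation for carnivores is missing; mirroring the herbivore
code:

```r
# The same computation for carnivores:
carnivores <- msleep[msleep$vore == "carni",]
n2 <- sum(!is.na(carnivores$sleep_total))
x2 <- sum(carnivores$sleep_total > 7, na.rm = TRUE)
```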
The proportions are 0.625 and 0.68, respectively. To obtain a confidence interval for
the difference of the two proportions, we use binomDiffCI as follows:
library(MKinfer)
binomDiffCI(x1, x2, n1, n2, method = "wilson")
Exercise 7.13
To run the same simulation for different 𝑛, we will write a function for the simulation,
with the sample size n as an argument:
# Function for our custom estimator:
max_min_avg <- function(x)
{
return((max(x)+min(x))/2)
}
for(i in seq_along(res$x_mean))
{
      x <- rnorm(n, mu, sigma)
      res$x_mean[i] <- mean(x)
      res$x_median[i] <- median(x)
      res$x_mma[i] <- max_min_avg(x)
}
We could write a for loop to perform the simulation for different values of 𝑛. Alter-
natively, we can use a function, as in Section 6.5. Here are two examples of how this
can be done:
# Create a vector of samples sizes:
n_vector <- seq(10, 100, 10)
# Using base R:
res <- apply(data.frame(n = n_vector), 1, simulate_estimators)
# Using purrr:
library(purrr)
res <- map(n_vector, simulate_estimators)
Next, we want to plot the results. We need to extract the results from the list res
and store them in a data frame, so that we can plot them using ggplot2.
simres <- matrix(unlist(res), 10, 7, byrow = TRUE)
simres <- data.frame(simres)
names(simres) <- names(unlist(res))[1:7]
simres
Transforming the data frame from wide to long format (Section 5.11) makes plotting
easier.
We can do this using data.table:
library(data.table)
simres2 <- data.table(melt(simres, id.vars = c("n"),
measure.vars = 2:7))
simres2[, c("measure", "estimator") := tstrsplit(variable,
".", fixed = TRUE)]
library(ggplot2)
# Plot the bias, with a reference line at 0:
ggplot(subset(simres2, measure == "bias"), aes(n, value,
col = estimator)) +
geom_line() +
geom_hline(yintercept = 0, linetype = "dashed") +
ggtitle("Bias")
All three estimators have a bias close to 0 for all values of 𝑛 (indeed, we can verify
analytically that they are unbiased). The mean has the lowest variance for all 𝑛, with
the median as a close competitor. Our custom estimator has a higher variance, that
also has a slower decrease as 𝑛 increases. In summary, based on bias and variance,
the mean is the best estimator for the mean of a normal distribution.
Exercise 7.14
To perform the same simulation with 𝑡(3)-distributed data, we can reuse the same
code as in Exercise 7.13, only replacing three lines:
• The arguments of simulate_estimators (mu and sigma are replaced by the
degrees of freedom df of the 𝑡-distribution),
• The line where the data is generated (rt replaces rnorm),
• The line where the bias is computed (the mean of the 𝑡-distribution is always 0).
# Function for our custom estimator:
max_min_avg <- function(x)
{
return((max(x)+min(x))/2)
}
for(i in seq_along(res$x_mean))
{
      x <- rt(n, df)
      res$x_mean[i] <- mean(x)
      res$x_median[i] <- median(x)
      res$x_mma[i] <- max_min_avg(x)
}
To perform the simulation, we can then e.g. run the following, which has been copied
from the solution to the previous exercise.
# Create a vector of samples sizes:
n_vector <- seq(10, 100, 10)
geom_line() +
geom_hline(yintercept = 0, linetype = "dashed") +
ggtitle("Bias, t(3)-distribution")
Exercise 7.15
We will use the functions that we created to simulate the type I error rates and
powers of the three tests in Section 7.5.3 and the preceding section on type I error
rates. Also, we must make sure to load the MKinfer package, which contains
perm.t.test.
To compare the type I error rates, we only need to supply the function rt for
generating data and the parameter df = 3 to clarify that a 𝑡(3)-distribution should
be used:
simulate_type_I(20, 20, rt, B = 9999, df = 3)
simulate_type_I(20, 30, rt, B = 9999, df = 3)
The old-school t-test appears to be a little conservative, with an actual type I error
rate close to 0.043. We can use binomDiffCI from MKinfer to get a confidence
interval for the difference in type I error rate between the old-school t-test and the
permutation t-test:
B <- 9999
binomDiffCI(B*0.04810481, B*0.04340434, B, B, method = "wilson")
The confidence interval is (−0.001, 0.010). Even though the old-school t-test appeared
to have a lower type I error rate, we cannot say for sure, as a difference of 0 is included
in the confidence interval. Increasing the number of simulated samples to, say, 99,999
might be required to detect any differences between the tests.
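To see why more simulations may be needed, a back-of-the-envelope calculation (my addition, not from the book) of the Monte Carlo standard error is helpful:

```r
# Standard error of an estimated rejection rate p based on B simulations:
mc_se <- function(p, B) sqrt(p * (1 - p) / B)
mc_se(0.05, 9999)   # about 0.0022
mc_se(0.05, 99999)  # about 0.0007
```

With B = 9999, the uncertainty in each estimated type I error rate is of the same order as the differences we are trying to detect, which is why a larger B can help.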
Next, we compare the power of the tests. For the function used to simulate data for
the second sample, we add a + 1 to shift the distribution to the right (so that the
mean difference is 1):
# Balanced sample sizes:
simulate_power(20, 20, function(n) { rt(n, df = 3) },
function(n) { rt(n, df = 3) + 1 },
B = 9999)
Exercise 7.16
This means that we can pass the argument method = "spearman" to use the
functions to compute the sample size for the Spearman correlation test. Let's try it:
power.cor.test(n_start = 10, rho = 0.5, power = 0.9,
method = "spearman")
power.cor.test(n_start = 10, rho = 0.2, power = 0.8,
method = "spearman")
In my runs, the Pearson correlation test required the sample sizes 𝑛 = 45 and 𝑛 = 200,
whereas the Spearman correlation test required larger sample sizes: 𝑛 = 50 and
𝑛 = 215.
516 CHAPTER 13. SOLUTIONS TO EXERCISES
Exercise 7.17
First, we create a function that simulates the expected width of the Clopper-Pearson
interval for a given 𝑛 and 𝑝:
cp_width <- function(n, p, level = 0.05, B = 999)
{
  widths <- rep(NA, B)
  for(i in 1:B)
  {
    # Generate binomial data:
    x <- rbinom(1, n, p)
    # Compute the width of the Clopper-Pearson interval:
    interval <- binom.test(x, n, conf.level = 1 - level)$conf.int
    widths[i] <- interval[2] - interval[1]
  }
  return(mean(widths))
}
Next, we create a function with a while loop that finds the sample sizes required to
achieve a desired expected width:
cp_ssize <- function(n_start = 10, p, n_incr = 5, level = 0.05,
                     width = 0.1, B = 999)
{
  # Set initial values:
  n <- n_start
  width_now <- 1
  # Increase n until the expected width is below the target:
  while(width_now > width)
  {
    width_now <- cp_width(n, p, level, B)
    n <- n + n_incr
  }
  return(n - n_incr)
}
Finally, we run our simulation for 𝑝 = 0.1 (with expected width 0.01) and 𝑝 = 0.3
(expected width 0.05) and compare the results to the asymptotic answer:
# p = 0.1
# Asymptotic answer:
ssize.propCI(prop = 0.1, width = 0.01, method = "clopper-pearson")
As you can see, the asymptotic results are very close to those obtained from the
simulation, and so using ssize.propCI is preferable in this case, as it is much faster.
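As a rough cross-check (my addition), the usual Wald formula gives the sample size needed for a given interval width; the Clopper-Pearson interval is a little wider, so it requires a slightly larger n:

```r
# Wald approximation: a width w requires n of about 4 z^2 p (1 - p) / w^2
wald_ssize <- function(p, width, level = 0.05) {
  z <- qnorm(1 - level / 2)
  ceiling(4 * z^2 * p * (1 - p) / width^2)
}
wald_ssize(0.1, 0.01)  # 13830
wald_ssize(0.3, 0.05)  # 1291
```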
Exercise 7.18
If we want to assume that the two populations have equal variances, we first have
to create a centred dataset, where both groups have mean 0. We can then draw
observations from this sample, and shift them by the two group means:
library(ggplot2)
boot_data <- na.omit(subset(msleep,
vore == "carni" | vore == "herbi")[,c("sleep_total",
"vore")])
# Compute group means and sizes:
group_means <- aggregate(sleep_total ~ vore,
library(boot)
# Do the resampling:
boot_res <- boot(boot_data,
mean_diff_msleep,
9999)
The resulting percentile interval is close to that which we obtained without assuming
equal variances. The BCa interval is however very different.
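The centring step is the key idea here. It can be sketched with a toy data frame (my example, base R only):

```r
# Two groups with different means:
d <- data.frame(y = c(1, 2, 3, 10, 11, 12),
                g = rep(c("a", "b"), each = 3))
# Subtract each observation's own group mean:
d$y_centred <- d$y - ave(d$y, d$g)
tapply(d$y_centred, d$g, mean)  # both group means are now 0
```

Resampling from y_centred and adding back the group means then mimics two populations that share a common error distribution but have different means.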
Exercise 7.19
We use the percentile confidence interval from the previous exercise to compute
p-values as follows (the null hypothesis is that the parameter is 0):
library(boot.pval)
boot.pval(boot_res, type = "perc", theta_null = 0)
The p-value is approximately 0.52, and we cannot reject the null hypothesis.
Chapter 8
Exercise 8.1
We set file_path to the path of sales-weather.csv. To load the data, fit the
model and plot the results, we do the following:
# Load the data:
weather <- read.csv(file_path, sep = ";")
View(weather)
# Fit model:
m <- lm(TEMPERATURE ~ SUN_HOURS, data = weather)
summary(m)
The coefficient for SUN_HOURS is not significantly non-zero at the 5 % level. The R²
value is 0.035, which is very low. There is little evidence of a connection between the
number of sun hours and the temperature during this period.
Exercise 8.2
We fit a model using the formula:
m <- lm(mpg ~ ., data = mtcars)
summary(m)
What we’ve just done is to create a model where all variables from the data frame
(except mpg) are used as explanatory variables. This is the same model as we’d have
obtained using the following (much longer) code:
m <- lm(mpg ~ cyl + disp + hp + drat + wt +
qsec + vs + am + gear + carb, data = mtcars)
The ~ . shorthand is very useful when you want to fit a model with a lot of
explanatory variables.
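A related shorthand, worth knowing (not used in the book's solution): variables can be excluded from the dot with a minus sign:

```r
# Fit the full model and a model with carb excluded:
m_full <- lm(mpg ~ ., data = mtcars)
m_drop <- lm(mpg ~ . - carb, data = mtcars)
length(coef(m_full)) - length(coef(m_drop))  # 1: exactly one term removed
```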
Exercise 8.3
First, we create the dummy variable:
weather$prec_dummy <- factor(weather$PRECIPITATION > 0)
Then, we fit the new model and have a look at the results. We won't centre the
SUN_HOURS variable, as the model is easy to interpret without centring. The
intercept corresponds to the expected temperature on a day with 0 SUN_HOURS and
no precipitation.
m <- lm(TEMPERATURE ~ SUN_HOURS*prec_dummy, data = weather)
summary(m)
Both SUN_HOURS and the dummy variable are significantly non-zero. In the next
section, we’ll have a look at how we can visualise the results of this model.
Exercise 8.4
We run the code to create the two data frames. We then fit a model to the first
dataset exdata1, and make some plots:
m1 <- lm(y ~ x, data = exdata1)
# Residual plots:
library(ggfortify)
autoplot(m1, which = 1:6, ncol = 2, label.size = 3)
There are clear signs of nonlinearity here, which can be seen both in the scatterplot
and in the residuals versus fitted plot.
Next, we do the same for the second dataset:
m2 <- lm(y ~ x, data = exdata2)
# Residual plots:
library(ggfortify)
autoplot(m2, which = 1:6, ncol = 2, label.size = 3)
Exercise 8.5
1. First, we plot the observed values against the fitted values for the two models.
# The two models:
m1 <- lm(TEMPERATURE ~ SUN_HOURS, data = weather)
m2 <- lm(TEMPERATURE ~ SUN_HOURS*prec_dummy, data = weather)
n <- nrow(weather)
models <- data.frame(Observed = rep(weather$TEMPERATURE, 2),
Fitted = c(predict(m1), predict(m2)),
Model = rep(c("Model 1", "Model 2"), c(n, n)))
The first model only predicts values within a fairly narrow interval. The second
model does a somewhat better job of predicting high temperatures.
2. Next, we create residual plots for the second model.
library(ggfortify)
autoplot(m2, which = 1:6, ncol = 2, label.size = 3)
There are no clear trends or signs of heteroscedasticity. There are some deviations
from normality in the tails of the residual distribution. A few observations (57, 76
and 83) have fairly high Cook’s distances. Observation 76 also has a very high
leverage. Let’s have a closer look at them:
weather[c(57, 76, 83),]
Exercise 8.6
We run boxcox to find a suitable Box-Cox transformation for our model:
m <- lm(TEMPERATURE ~ SUN_HOURS*prec_dummy, data = weather)
library(MASS)
boxcox(m)
The boxcox method can only be used for non-negative response variables. We can
solve this e.g. by transforming the temperature (which currently is in degrees Celsius)
to the Kelvin scale, where all values are positive:
m <- lm(TEMPERATURE + 273.15 ~ SUN_HOURS*prec_dummy, data = weather)
boxcox(m)
The value 𝜆 = 1 is inside the interval indicated by the dotted lines. This corresponds
to no transformation at all, meaning that there is no indication that we should
transform our response variable.
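To see boxcox point towards a transformation, we can feed it data that actually needs one (a standalone sketch, my addition; the data-generating model here is made up):

```r
library(MASS)
set.seed(1)
x <- runif(100, 1, 10)
y <- exp(0.5 + 0.1 * x + rnorm(100, sd = 0.2))  # linear on the log scale
bc <- boxcox(lm(y ~ x), plotit = FALSE)
bc$x[which.max(bc$y)]  # close to 0, i.e. a log transformation is suggested
```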
Exercise 8.7
We refit the model using:
library(lmPerm)
m <- lmp(TEMPERATURE ~ SUN_HOURS*prec_dummy, data = weather)
summary(m)
Exercise 8.8
The easiest way to do this is to use boot_summary:
library(MASS)
m <- rlm(TEMPERATURE ~ SUN_HOURS*prec_dummy, data = weather)
library(boot.pval)
boot_summary(m, type = "perc", method = "residual")
Next, we compute the confidence intervals using boot and boot.ci (note that we
use rlm inside the coefficients function!):
library(boot)
Using the connection between hypothesis tests and confidence intervals, to see
whether an effect is significant at the 5 % level, you can check whether 0 is contained
in the confidence interval. If not, then the effect is significant.
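The duality can be demonstrated with a built-in test (my example; for t.test the interval and the p-value are derived from the same statistic, so the equivalence is exact):

```r
set.seed(42)
x <- rnorm(30, mean = 0.2)
tt <- t.test(x)  # two-sided test, 95 % confidence interval
rejects <- tt$p.value < 0.05
excludes_zero <- tt$conf.int[1] > 0 | tt$conf.int[2] < 0
rejects == excludes_zero  # TRUE, whatever the data
```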
Exercise 8.9
We fit the model and then use boot_summary with method = "case":
m <- lm(mpg ~ hp + wt, data = mtcars)
library(boot.pval)
boot_summary(m, type = "perc", method = "case", R = 9999)
boot_summary(m, type = "perc", method = "residual", R = 9999)
In this case, the resulting confidence intervals are similar to what we obtained with
residual resampling.
Exercise 8.10
First, we prepare the model and the data:
m <- lm(TEMPERATURE ~ SUN_HOURS*prec_dummy, data = weather)
new_data <- data.frame(SUN_HOURS = 0, prec_dummy = "TRUE")
library(boot)
Exercise 8.11
autoplot uses standard ggplot2 syntax, so by adding colour = mtcars$cyl to
autoplot, we can plot different groups in different colours:
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am)
library(ggfortify)
autoplot(m, which = 1:6, ncol = 2, label.size = 3,
colour = mtcars$cyl)
Exercise 8.12
We rerun the analysis:
# Convert variables to factors:
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am)
Unfortunately, if you run this multiple times, the p-values will vary a lot. To fix
that, you need to increase the maximum number of iterations allowed, by increasing
maxIter, and changing the condition for the accuracy of the p-value by lowering Ca:
m <- aovp(mpg ~ cyl + am, data = mtcars,
perm = "Prob",
Ca = 1e-3,
maxIter = 1e6)
summary(m)
According to ?aovp, the seqs argument controls which type of table is produced.
It’s perhaps not perfectly clear from the documentation, but the default seqs =
FALSE corresponds to a type III table, whereas seqs = TRUE corresponds to a type
I table:
# Type I table:
m <- aovp(mpg ~ cyl + am, data = mtcars,
seqs = TRUE,
perm = "Prob",
Ca = 1e-3,
maxIter = 1e6)
summary(m)
Exercise 8.13
We can run the test using the usual formula notation:
The p-value is very low, and we conclude that the fuel consumption differs between
the three groups.
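The test code itself is missing from this excerpt. Assuming, as the wording suggests, that it is the Kruskal-Wallis test comparing fuel consumption across the three cylinder groups in mtcars, it would read:

```r
# Kruskal-Wallis test of mpg across the cyl groups (4, 6 and 8 cylinders):
kruskal.test(mpg ~ cyl, data = mtcars)
```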
Exercise 8.16
We set file_path to the path of shark.csv and then load and inspect the data:
sharks <- read.csv(file_path, sep = ";")
View(sharks)
We need to convert the Age variable to a numeric, which will cause us to lose informa-
tion (“NAs introduced by coercion”) about the age of the persons involved in some
attacks, i.e. those with values like 20's and 25 or 28, which cannot be automatically
coerced into numbers:
sharks$Age <- as.numeric(sharks$Age)
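What the coercion does can be seen in miniature (my example, with made-up values):

```r
ages <- c("25", "20's", "25 or 28", "40")
# Well-formed numbers are converted; everything else becomes NA:
suppressWarnings(as.numeric(ages))  # 25 NA NA 40
```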
Judging from the p-values, there is no evidence that sex and age affect the probability
of an attack being fatal.
Exercise 8.17
We use the same logistic regression model for the wine data as before:
m <- glm(type ~ pH + alcohol, data = wine, family = binomial)
The broom functions also work for generalised linear models. As for linear models,
tidy gives the table of coefficients and p-values, glance gives some summary
statistics, and augment adds fitted values and residuals to the original dataset:
library(broom)
tidy(m)
glance(m)
augment(m)
Exercise 8.18
Using the model m from the other exercise, we can now do the following.
1. Compute asymptotic confidence intervals:
library(MASS)
confint(m)
If you prefer writing your own bootstrap code, you could proceed as follows:
library(boot)
# Compute p-values:
Exercise 8.19
We draw a binned residual plot for our model:
m <- glm(Fatal..Y.N. ~ Age + Sex., data = sharks, family = binomial)
library(arm)
binnedplot(predict(m, type = "response"),
residuals(m, type = "response"))
There are a few points outside the interval, but not too many. There is no trend,
i.e. there is for instance no sign that the model performs worse when it predicts a
larger probability of a fatal attack.
Next, we plot the Cook’s distances of the observations:
res <- data.frame(Index = 1:length(cooks.distance(m)),
CooksDistance = cooks.distance(m))
There are a few points with a high Cook’s distance. Let’s investigate point 116, which
has the highest distance:
sharks[116,]
This observation corresponds to the oldest person in the dataset, and a fatal attack.
Being an extreme observation, we’d expect it to have a high Cook’s distance.
Exercise 8.20
First, we have a look at the quakes data:
?quakes
View(quakes)
We then fit a Poisson regression model with stations as response variable and mag
as explanatory variable:
m <- glm(stations ~ mag, data = quakes, family = poisson)
summary(m)
We plot the fitted values against the observed values, create a binned residual plot,
and perform a test of overdispersion:
# Plot observed against fitted:
res <- data.frame(Observed = quakes$stations,
Fitted = predict(m, type = "response"))
# Test overdispersion
library(AER)
dispersiontest(m, trafo = 1)
Visually, the fit is pretty good. As indicated by the test, there are however signs of
overdispersion. Let’s try a negative binomial regression instead.
# Fit NB regression:
library(MASS)
m2 <- glm.nb(stations ~ mag, data = quakes)
summary(m2)
The difference between the models is tiny. We’d probably need to include more
variables to get a real improvement in the model.
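A quick supplementary check that doesn't require the AER package (my addition): for a well-fitting Poisson model, the residual deviance should be close to its degrees of freedom, so their ratio gives a crude dispersion estimate:

```r
m <- glm(stations ~ mag, data = quakes, family = poisson)
m$deviance / m$df.residual  # well above 1, again suggesting overdispersion
```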
Exercise 8.21
We can get confidence intervals for the 𝛽𝑗 using boot_summary, as in previous sections.
To get bootstrap confidence intervals for the rate ratios exp(𝛽𝑗), we exponentiate the
confidence intervals for the 𝛽𝑗:
library(boot.pval)
boot_table <- boot_summary(m, type = "perc", method = "case")
boot_table
Exercise 8.22
First, we load the data and have a quick look at it:
library(nlme)
?Oxboys
View(Oxboys)
Both intercepts and slopes seem to vary between individuals. Are they correlated?
# Collect the coefficients from each linear model:
library(purrr)
Oxboys %>% split(.$Subject) %>%
map(~ lm(height ~ age, data = .)) %>%
map(coef) -> coefficients
There is a strong indication that the intercepts and slopes have a positive correlation.
We’ll therefore fit a linear mixed model with correlated random intercepts and slopes:
m <- lmer(height ~ age + (1 + age|Subject), data = Oxboys)
summary(m, correlation = FALSE)
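The per-subject fits can also be collected without purrr (base R plus nlme, which ships with R), which makes the claimed correlation easy to verify:

```r
library(nlme)  # provides the Oxboys data
fits <- lapply(split(Oxboys, Oxboys$Subject),
               function(d) coef(lm(height ~ age, data = d)))
co <- do.call(rbind, fits)
cor(co[, "(Intercept)"], co[, "age"])  # positive: taller boys grow faster
```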
Exercise 8.23
We’ll use the model that we fitted to the Oxboys data in the previous exercise:
library(lme4)
library(nlme)
m <- lmer(height ~ age + (1 + age|Subject), data = Oxboys)
As you can see, fixed and random effects are shown in the same table. However,
different information is displayed for the two types of variables (just as when we use
summary).
Note that if we fit the model after loading the lmerTest package, the tidy table also
includes p-values:
library(lmerTest)
m <- lmer(height ~ age + (1 + age|Subject), data = Oxboys)
tidy(m)
Exercise 8.24
We use the same model as in the previous exercise:
library(nlme)
m <- lmer(height ~ age + (1 + age|Subject), data = Oxboys)
# Plot residuals (broom.mixed::augment extracts fitted values and residuals):
library(broom.mixed)
fm <- augment(m)
ggplot(fm, aes(.fitted, .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
xlab("Fitted values") + ylab("Residuals")
Overall, the fit seems very good. There may be some heteroscedasticity, but nothing
too bad. Some subjects have a larger spread in their residuals, which is to be expected
in this case - growth in children is non-constant, and a large negative residual is
therefore likely to be followed by a large positive residual, and vice versa. The
regression errors and random effects all appear to be normally distributed.
Exercise 8.25
To look for an interaction between TVset and Assessor, we draw an interaction plot:
library(lmerTest)
interaction.plot(TVbo$Assessor, TVbo$TVset,
response = TVbo$Coloursaturation)
The lines overlap and follow different patterns, so there appears to be an interaction.
There are two ways in which we could include this. Which we choose depends on
what we think our clusters of correlated measurements are. If only the assessors are
clusters, we’d include this as a random slope:
m <- lmer(Coloursaturation ~ TVset*Picture + (1 + TVset|Assessor),
data = TVbo)
m
anova(m)
In this case, we think that there is a fixed interaction between each pair of assessor
and TV set.
However, if we think that the interaction is random and varies between repetitions,
the situation is different. In this case the combination of assessor and TV set are
clusters of correlated measurements (which could make sense here, because we have
repeated measurements for each assessor-TV set pair). We can then include the
interaction as a nested random effect:
m <- lmer(Coloursaturation ~ TVset*Picture + (1|Assessor/TVset),
data = TVbo)
m
anova(m)
In either case, the results are similar, and all fixed effects are significant at the 5 %
level.
Exercise 8.26
BROOD, INDEX (subject ID number) and LOCATION all seem like they could cause
measurements to be correlated, and so are good choices for random effects. To keep
the model simple, we’ll only include random intercepts. We fit a mixed Poisson
regression using glmer:
library(lme4)
m <- glmer(TICKS ~ YEAR + HEIGHT + (1|BROOD) + (1|INDEX) + (1|LOCATION),
data = grouseticks, family = poisson)
summary(m, correlation = FALSE)
To compute the bootstrap confidence interval for the effect of HEIGHT, we use
boot_summary:
library(boot.pval)
boot_summary(m, type = "perc", R = 100)
Exercise 8.27
The ovarian data comes from a randomised trial comparing two treatments for
ovarian cancer:
library(survival)
?ovarian
str(ovarian)
The parametric confidence intervals overlap a lot. Let’s compute a bootstrap
confidence interval for the difference in the 75 % quantile of the survival times. We
set the quantile level using the q argument in bootkm:
library(Hmisc)
Exercise 8.28
1. First, we fit a Cox regression model. From ?ovarian we see that the
survival/censoring times are given by futime and the censoring status by fustat.
library(survival)
m <- coxph(Surv(futime, fustat) ~ age + rx,
data = ovarian, model = TRUE)
summary(m)
According to the p-value in the table, which is 0.2, there is no significant difference
between the two treatment groups. Put differently, there is no evidence that the
hazard ratio for treatment isn’t equal to 1.
To assess the assumption of proportional hazards, we plot the Schoenfeld residuals:
library(survminer)
ggcoxzph(cox.zph(m), var = 1)
ggcoxzph(cox.zph(m), var = 2)
There is no clear trend over time, and the assumption appears to hold.
2. To compute a bootstrap confidence interval for the hazard ratio for age, we
follow the same steps as in the lung example, using censboot_summary:
library(boot.pval)
censboot_summary(m)
All values in the confidence interval are positive, meaning that we are fairly sure that
the hazard increases with age.
Exercise 8.29
First, we fit the model:
m <- coxph(Surv(futime, status) ~ age + type + trt,
cluster = id, data = retinopathy)
summary(m)
As there are no trends over time, there is no evidence against the assumption of
proportional hazards.
Exercise 8.30
We fit the model using survreg:
library(survival)
m <- survreg(Surv(futime, fustat) ~ ., data = ovarian,
dist = "loglogistic")
exp(coef(m))
According to the model, the survival time increases 1.8 times for patients in treatment
group 2, compared to patients in treatment group 1. Running summary(m) shows that
the p-value for rx is 0.05, meaning that the result isn’t significant at the 5 % level
(albeit with the smallest possible margin!).
Exercise 8.31
We set file_path to the path to il2rb.csv and then load the data (note that it
uses a decimal comma!):
biomarkers <- read.csv(file_path, sep = ";", dec = ",")
Next, we check which measurements are nondetects, and impute the detection
limit 0.25:
censored <- is.na(biomarkers$IL2RB)
biomarkers$IL2RB[censored] <- 0.25
Exercise 8.32
We set file_path to the path to il2rb.csv and then load and prepare the data:
biomarkers <- read.csv(file_path, sep = ";", dec = ",")
censored <- is.na(biomarkers$IL2RB)
biomarkers$IL2RB[censored] <- 0.25
Based on the recommendations in Zhang et al. (2009), we can now run a
Wilcoxon-Mann-Whitney test. Because we’ve imputed the LoD for the nondetects,
all observations are included in the test:
wilcox.test(IL2RB ~ Group, data = biomarkers)
The p-value is 0.42, and we do not reject the null hypothesis that there is no difference
in location.
Chapter 9
Exercise 9.1
1. We load the data and compute the expected values using the formula 𝑦 =
2𝑥1 − 𝑥2 + 𝑥3 ⋅ 𝑥2 :
exdata <- data.frame(x1 = c(0.87, -1.03, 0.02, -0.25, -1.09, 0.74,
0.09, -1.64, -0.32, -0.33, 1.40, 0.29, -0.71, 1.36, 0.64,
-0.78, -0.58, 0.67, -0.90, -1.52, -0.11, -0.65, 0.04,
-0.72, 1.71, -1.58, -1.76, 2.10, 0.81, -0.30),
x2 = c(1.38, 0.14, 1.46, 0.27, -1.02, -1.94, 0.12, -0.64,
0.64, -0.39, 0.28, 0.50, -1.29, 0.52, 0.28, 0.23, 0.05,
3.10, 0.84, -0.66, -1.35, -0.06, -0.66, 0.40, -0.23,
-0.97, -0.78, 0.38, 0.49, 0.21),
x3 = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1),
y = c(3.47, -0.80, 4.57, 0.16, -1.77, -6.84, 1.28, -0.52,
1.00, -2.50, -1.99, 1.13, -4.26, 1.16, -0.69, 0.89, -1.01,
7.56, 2.33, 0.36, -1.11, -0.53, -1.44, -0.43, 0.69, -2.30,
-3.55, 0.99, -0.50, -1.67))
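The expected values follow directly from the formula; for instance, for the first three rows (a quick check by hand):

```r
x1 <- c(0.87, -1.03, 0.02)
x2 <- c(1.38, 0.14, 1.46)
x3 <- c(1, 0, 0)
2 * x1 - x2 + x3 * x2  # 1.74 -2.20 -1.42
```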
The points seem to follow a straight line, and a linear model seems appropriate.
The R²-value is pretty high: 0.91. x1 and x2 both have low p-values, as does the
F-test for the regression. We can check the model fit by comparing the fitted values
to the actual values. We add a red line that the points should follow if we have a
good fit:
ggplot(exdata[1:20,], aes(y, predict(m))) +
geom_point() +
geom_abline(intercept = 0, slope = 1, col = "red")
The model seems to be pretty good! Now let’s see how well it does when faced with
new data.
We can plot the results for the last 10 observations, which weren’t used when we
fitted the model:
ggplot(exdata[21:30,], aes(y, predictions)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, col = "red")
The results are much worse than before! The correlation between the predicted values
and the actual values is very low:
cor(exdata[21:30,]$y, exdata[21:30,]$predictions)
Despite the good in-sample performance (as indicated e.g. by the high R²), the model
doesn’t seem to be very useful for prediction.
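The phenomenon is easy to reproduce with pure noise (my example): with many predictors and few observations, the in-sample R² is high even when there is nothing to predict:

```r
set.seed(1)
d <- as.data.frame(matrix(rnorm(30 * 16), 30))  # 30 rows of pure noise
names(d)[1] <- "y"
m <- lm(y ~ ., data = d[1:20, ])                # 15 predictors, 20 rows
summary(m)$r.squared                            # high, despite y being noise
cor(d$y[21:30], predict(m, d[21:30, ]))         # out of sample: near 0
```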
4. Perhaps you noted that the effect of x3 wasn’t significant in the model. Perhaps
the performance will improve if we remove it? Let’s try!
m <- lm(y ~ x1 + x2, data = exdata[1:20,])
summary(m)
The p-values and R² still look very promising. Let’s make predictions for the new
observations and check the results:
exdata$predictions <- predict(m, exdata)
cor(exdata[21:30,]$y, exdata[21:30,]$predictions)
The predictions are no better than before - indeed, the correlation between the actual
and predicted values is even lower this time out!
5. Finally, we fit a correctly specified model and evaluate the results:
m <- lm(y ~ x1 + x2 + x3*x2, data = exdata[1:20,])
summary(m)
exdata$predictions <- predict(m, exdata)
cor(exdata[21:30,]$y, exdata[21:30,]$predictions)
The predictive performance of the model remains low, which shows that model mis-
specification wasn’t the (only) reason for the poor performance of the previous mod-
els.
Exercise 9.2
We set file_path to the path to estates.xlsx and then load the data:
library(openxlsx)
estates <- read.xlsx(file_path)
View(estates)
There are a lot of missing values, which can cause problems when fitting the model,
so let’s remove those:
estates <- na.omit(estates)
Next, we fit a linear model and evaluate it with LOOCV using caret and train:
library(caret)
tc <- trainControl(method = "LOOCV")
m <- train(selling_price ~ .,
data = estates,
method = "lm",
trControl = tc)
The RMSE is 547 and the MAE is 395 kSEK. The average selling price in the
data (mean(estates$selling_price)) is 2843 kSEK, meaning that the MAE is
approximately 14 % of the mean selling price. This is not unreasonably high for
this application. Prediction errors are definitely expected here, given that
we have relatively few variables - the selling price can be expected to depend on
several things not captured by the variables in our data (proximity to schools, access
to public transport, and so on). Moreover, houses in Sweden are not sold at fixed
prices, but are subject to bidding, which can cause prices to fluctuate a lot. All in all,
an MAE of 395 is pretty good, and, at the very least, the model seems useful for
getting a ballpark figure for the price of a house.
Exercise 9.3
We set file_path to the path to estates.xlsx and then load and clean the data:
library(openxlsx)
estates <- read.xlsx(file_path)
estates <- na.omit(estates)
In my runs, the MAE ranged from 391 to 405. Not a massive difference on the
scale of the data, but there is clearly some variability in the results.
In my runs, the MAE varied between 396.0 and 397.4. There is still some variability,
but it is much smaller than for a simple 10-fold cross-validation.
Exercise 9.4
We set file_path to the path to estates.xlsx and then load and clean the data:
library(openxlsx)
estates <- read.xlsx(file_path)
estates <- na.omit(estates)
In my runs, the MAE varied between 410.0 and 411.8, meaning that the variability is
similar to that with repeated 10-fold cross-validation. When I increased the number
of bootstrap samples to 9,999, the MAE stabilised around 411.7.
Exercise 9.5
We load and format the data as in the beginning of Section 9.1.7. We can then fit
the two models using train:
library(caret)
tc <- trainControl(method = "repeatedcv",
number = 10, repeats = 100,
savePredictions = TRUE,
classProbs = TRUE)
To compare the models, we use evalm to plot ROC and calibration curves:
library(MLeval)
plots <- evalm(list(m, m2),
gnames = c("Model 1", "Model 2"))
# ROC:
plots$roc
# Calibration curves:
plots$cc
Model 2 performs much better, both in terms of AUC and calibration. Adding two
more variables has both increased the predictive performance of the model (a much
higher AUC) and led to a better-calibrated model.
Exercise 9.9
First, we load and clean the data:
library(openxlsx)
estates <- read.xlsx(file_path)
estates <- na.omit(estates)
Next, we fit a ridge regression model and evaluate it with LOOCV using caret and
train:
library(caret)
tc <- trainControl(method = "LOOCV")
m <- train(selling_price ~ .,
data = estates,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0,
lambda = seq(0, 10, 0.1)),
trControl = tc)
Noticing that the 𝜆 that gave the best RMSE was 10, which was the maximal 𝜆
that we investigated, we rerun the code, allowing for higher values of 𝜆:
m <- train(selling_price ~ .,
data = estates,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0,
lambda = seq(10, 120, 1)),
trControl = tc)
The RMSE is 549 and the MAE is 399. In this case, ridge regression did not improve
the performance of the model compared to an ordinary linear regression.
Exercise 9.10
We load and format the data as in the beginning of Section 9.1.7.
1. We can now fit the models using train, making sure to add family =
"binomial":
library(caret)
tc <- trainControl(method = "cv",
number = 10,
savePredictions = TRUE,
classProbs = TRUE)
m1 <- train(type ~ pH + alcohol + fixed.acidity + residual.sugar,
data = wine,
method = "glmnet",
family = "binomial",
tuneGrid = expand.grid(alpha = 0,
lambda = seq(0, 10, 0.1)),
trControl = tc)
m1
m2
The best value for 𝜆 is still 0. For this dataset, both accuracy and AUC happened
to give the same 𝜆, but that isn’t always the case.
Exercise 9.11
First, we load and clean the data:
library(openxlsx)
estates <- read.xlsx(file_path)
estates <- na.omit(estates)
Next, we fit a lasso model and evaluate it with LOOCV using caret and train:
library(caret)
tc <- trainControl(method = "LOOCV")
m <- train(selling_price ~ .,
data = estates,
method = "glmnet",
tuneGrid = expand.grid(alpha = 1,
lambda = seq(0, 10, 0.1)),
trControl = tc)
The RMSE is 545 and the MAE is 394. Both are a little lower than for the ordinary
linear regression, but the difference is small in this case. To see which variables have
been removed, we can use:
coef(m$finalModel, m$finalModel$lambdaOpt)
Note that this data isn’t perfectly suited to the lasso, because most variables are
useful in explaining the selling price. Where the lasso really shines is in problems
where a lot of the variables, perhaps even most, aren’t useful in explaining the
response variable. We’ll see an example of that in the next exercise.
Exercise 9.12
1. We try fitting a linear model to the data:
m <- lm(y ~ ., data = simulated_data)
summary(m)
There are no error messages, but summary reveals that there were problems:
Coefficients: (101 not defined because of singularities) and for half
the variables we don’t get estimates of the coefficients. It is not possible to fit
ordinary linear models when there are more variables than observations (there is no
unique solution to the least squares equations from which we obtain the coefficient
estimates), which leads to this strange-looking output.
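The singularity is easy to reproduce in miniature (my example):

```r
set.seed(1)
d <- as.data.frame(matrix(rnorm(10 * 12), 10))  # 10 rows: y plus 11 predictors
names(d)[1] <- "y"
m <- lm(y ~ ., data = d)
sum(is.na(coef(m)))  # 2: twelve coefficients, but only ten observations
```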
2. Lasso models can be used even when the number of variables is greater than
the number of observations - regularisation ensures that there will be a unique
solution. We fit a lasso model using caret and train:
library(caret)
tc <- trainControl(method = "LOOCV")
m <- train(y ~ .,
data = simulated_data,
method = "glmnet",
tuneGrid = expand.grid(alpha = 1,
lambda = seq(0, 10, 0.1)),
trControl = tc)
Your mileage may vary (try running the simulation more than once!), but it is likely
that the lasso will have picked at least the first four of the explanatory variables,
probably along with some additional variables. Try changing the ratio between n
and p in your experiment, or the size of the coefficients used when generating y, and
see what happens.
Exercise 9.13
Next, we fit an elastic net model and evaluate it with LOOCV using caret and
train:
library(caret)
tc <- trainControl(method = "LOOCV")
m <- train(selling_price ~ .,
data = estates,
method = "glmnet",
tuneGrid = expand.grid(alpha = seq(0, 1, 0.2),
lambda = seq(10, 20, 1)),
trControl = tc)
Exercise 9.14
We load and format the data as in the beginning of Section 9.1.7. We can then fit
the model using train. We set summaryFunction = twoClassSummary and metric
= "ROC" to use the AUC to find the optimal cp.
library(caret)
tc <- trainControl(method = "repeatedcv",
number = 10, repeats = 100,
summaryFunction = twoClassSummary,
savePredictions = TRUE,
classProbs = TRUE)
The tree is pretty large. The parameter cp, called a complexity parameter, can be
used to prune the tree, i.e. to make it smaller. Let’s try setting a larger value for cp:
m <- train(type ~ pH + alcohol + fixed.acidity + residual.sugar,
data = wine,
trControl = tc,
method = "rpart",
metric = "ROC",
tuneGrid = expand.grid(cp = 0.1))
prp(m$finalModel)
That was way too much pruning - now the tree is too small! Try a value somewhere
in-between:
m <- train(type ~ pH + alcohol + fixed.acidity + residual.sugar,
data = wine,
trControl = tc,
method = "rpart",
metric = "ROC",
tuneGrid = expand.grid(cp = 0.01))
prp(m$finalModel)
That seems like a good compromise. The tree is small enough for us to understand
and discuss, but hopefully large enough that it still has a high AUC.
2. For presentation and interpretability purposes we can experiment with
manually setting different values of cp. We can also let train find an optimal
value of cp for us, maximising for instance the AUC. We’ll use tuneGrid =
expand.grid(cp = seq(0, 0.01, 0.001)) to find a good choice of cp somewhere
between 0 and 0.01:
m <- train(type ~ pH + alcohol + fixed.acidity + residual.sugar,
data = wine,
trControl = tc,
method = "rpart",
metric = "ROC",
tuneGrid = expand.grid(cp = seq(0, 0.01, 0.001)))
m
prp(m$finalModel)
In some cases, increasing cp can increase the AUC, but not here - a cp of 0 turns
out to be optimal in this instance.
Finally, to visually evaluate the model, we use evalm to plot ROC and calibration
curves:
library(MLeval)
plots <- evalm(m, gnames = "Decision tree")
# ROC:
plots$roc
# Calibration curves:
plots$cc
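Since AUC is used as the tuning metric throughout, it may help to see what it measures: the probability that a randomly chosen observation from the positive class receives a higher predicted score than a randomly chosen observation from the negative class (ties counting as 1/2). A base-R computation with hypothetical scores and labels:

```r
# Hypothetical predicted probabilities and true labels (1 = positive class):
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(1, 1, 0, 1, 1, 0, 0, 0)

pos <- scores[labels == 1]
neg <- scores[labels == 0]
# Compare every positive score to every negative score:
auc <- mean(outer(pos, neg, function(p, q) (p > q) + 0.5 * (p == q)))
auc  # 0.875
```

An AUC of 1 means perfect separation of the classes, while 0.5 corresponds to random guessing.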
Exercise 9.15
We set file_path to the path of bacteria.csv, then load and format the data as
in Section 9.3.3:
bacteria <- read.csv(file_path)
bacteria$Time <- as.POSIXct(bacteria$Time, format = "%H:%M:%S")
library(caret)
tc <- trainControl(method = "LOOCV")
Finally, we make predictions for the entire dataset and compare the results to the
actual outcomes:
bacteria$Predicted <- predict(m, bacteria)
library(ggplot2)
ggplot(bacteria, aes(Time, OD)) +
geom_line() +
geom_line(aes(Time, Predicted), colour = "red")
Regression trees are unable to extrapolate beyond the training data. By design, they
will make constant predictions whenever the values of the explanatory variables go
beyond those in the training data. Bear this in mind if you use tree-based models
for predictions!
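The constant-prediction behaviour is easy to see in a hand-rolled regression stump (a tree with a single split); the training data below are made up:

```r
x <- 1:10
y <- c(1, 2, 1, 2, 8, 9, 8, 9, 8, 9)

# Find the split point that minimises the within-leaf sum of squares:
sse <- sapply(1:9, function(s) {
  sum((y[x <= s] - mean(y[x <= s]))^2) + sum((y[x > s] - mean(y[x > s]))^2)
})
cut_point <- which.min(sse)
left_mean <- mean(y[x <= cut_point])
right_mean <- mean(y[x > cut_point])

# Predictions are one of two constants, no matter how far outside
# the training range [1, 10] the new x values are:
predict_stump <- function(xnew) ifelse(xnew <= cut_point, left_mean, right_mean)
predict_stump(c(-100, 5, 1000))  # 1.5 8.5 8.5
```

A deeper tree has more leaves, but each leaf still predicts a constant, so the same flat extrapolation occurs.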
Exercise 9.16
First, we load the data as in Section 4.9:
# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
col.names = c("Area", "Perimeter", "Compactness",
"Kernel_length", "Kernel_width", "Asymmetry",
"Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)
library(ggplot2)
ggplot(seeds, aes(Kernel_length, Compactness, colour = Variety)) +
geom_point(size = 2) +
stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
data = predictions, colour = "black")
The decision boundaries seem pretty good - most points in the lower left part belong
to variety 3, most in the middle to variety 1, and most to the right to variety 2.
Exercise 9.17
We load and format the data as in the beginning of Section 9.1.7. We can then fit
the models using train (fitting m2 takes a while):
library(caret)
tc <- trainControl(method = "cv",
number = 10,
summaryFunction = twoClassSummary,
savePredictions = TRUE,
classProbs = TRUE)
m1 <- train(type ~ .,
data = wine,
trControl = tc,
method = "rpart",
metric = "ROC",
tuneGrid = expand.grid(cp = c(0, 0.1, 0.01)))
m2 <- train(type ~ .,
data = wine,
trControl = tc,
method = "rf",
metric = "ROC",
tuneGrid = expand.grid(mtry = 2:6))
552 CHAPTER 13. SOLUTIONS TO EXERCISES
# ROC:
plots$roc
# Calibration curves:
plots$cc
The calibration curves may look worrisome, but the main reason that they deviate
from the straight line is that almost all observations have predicted probabilities
close to either 0 or 1. To see this, we can have a quick look at the histogram of the
predicted probabilities that the wines are white:
hist(predict(m2, type = "prob")[,2])
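The calibration curves drawn by evalm can also be computed by hand: bin the predicted probabilities and compare the mean prediction in each bin to the observed frequency of the outcome. A base-R sketch with simulated predictions that are well calibrated by construction:

```r
set.seed(1)
p <- runif(1000)               # predicted probabilities
outcome <- rbinom(1000, 1, p)  # outcomes drawn with exactly those probabilities

bins <- cut(p, breaks = seq(0, 1, 0.1), include.lowest = TRUE)
predicted <- tapply(p, bins, mean)        # mean prediction per bin
observed  <- tapply(outcome, bins, mean)  # observed frequency per bin
round(cbind(predicted, observed), 2)
# For a well-calibrated model the two columns are close in every bin.
```

When most predictions pile up near 0 or 1, as in the histogram above, the middle bins contain few observations, which is why the curve can look erratic there without the model actually being poorly calibrated.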
Exercise 9.18
We set file_path to the path of bacteria.csv, then load and format the data as
in Section 9.3.3:
bacteria <- read.csv(file_path)
bacteria$Time <- as.POSIXct(bacteria$Time, format = "%H:%M:%S")
Finally, we make predictions for the entire dataset and compare the results to the
actual outcomes:
bacteria$Predicted <- predict(m, bacteria)
library(ggplot2)
ggplot(bacteria, aes(Time, OD)) +
geom_line() +
geom_line(aes(Time, Predicted), colour = "red")
The model does very well for the training data, but fails to extrapolate beyond it.
Because random forests are based on decision trees, they give constant predictions
whenever the values of the explanatory variables go beyond those in the training
data.
Exercise 9.19
First, we load the data as in Section 4.9:
# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
col.names = c("Area", "Perimeter", "Compactness",
"Kernel_length", "Kernel_width", "Asymmetry",
"Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)
Next, we fit a random forest model with Kernel_length and Compactness as ex-
planatory variables:
library(caret)
tc <- trainControl(method = "LOOCV")
library(ggplot2)
ggplot(seeds, aes(Kernel_length, Compactness, colour = Variety)) +
geom_point(size = 2) +
stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
data = predictions, colour = "black")
The decision boundaries are much more complex and flexible than those for the
decision tree of Exercise 9.16. Perhaps they are too flexible, and the model has
overfitted to the training data?
Exercise 9.20
We load and format the data as in the beginning of Section 9.1.7. We can then fit
the model using train. Try a large number of parameter values to see if you can get
a high AUC. You can try using a simple 10-fold cross-validation to find reasonable
candidate values for the parameters, and then rerun the tuning with a replicated
10-fold cross-validation with parameter values close to those that were optimal in
your first search.
library(caret)
tc <- trainControl(method = "cv",
number = 10,
summaryFunction = twoClassSummary,
savePredictions = TRUE,
classProbs = TRUE)
ggplot(m)
Exercise 9.21
We set file_path to the path of bacteria.csv, then load and format the data as
in Section 9.3.3:
Finally, we make predictions for the entire dataset and compare the results to the
actual outcomes:
bacteria$Predicted <- predict(m, bacteria)
library(ggplot2)
ggplot(bacteria, aes(Time, OD)) +
geom_line() +
geom_line(aes(Time, Predicted), colour = "red")
The model does OK for the training data, but fails to extrapolate beyond it. Because
boosted tree models are based on decision trees, they give constant predictions
whenever the values of the explanatory variables go beyond those in the training
data.
Exercise 9.22
First, we load the data as in Section 4.9:
# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
col.names = c("Area", "Perimeter", "Compactness",
"Kernel_length", "Kernel_width", "Asymmetry",
"Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)
Next, we fit a boosted trees model with Kernel_length and Compactness as ex-
planatory variables:
library(caret)
tc <- trainControl(method = "LOOCV")
m <- train(Variety ~ Kernel_length + Compactness,
data = seeds,
trControl = tc,
method = "gbm",
verbose = FALSE)
library(ggplot2)
ggplot(seeds, aes(Kernel_length, Compactness, colour = Variety)) +
geom_point(size = 2) +
stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
data = predictions, colour = "black")
The decision boundaries are much more complex and flexible than those for the
decision tree of Exercise 9.16, but the model does not appear to have overfitted
like the random forest in Exercise 9.19.
Exercise 9.23
1. We set file_path to the path of bacteria.csv, then load and format the data
as in Section 9.3.3:
bacteria <- read.csv(file_path)
bacteria$Time <- as.numeric(as.POSIXct(bacteria$Time, format = "%H:%M:%S"))
Next, we fit a model tree using rows 45 to 90. The only explanatory variable available
to us is Time, and we want to use that both for the models in the nodes and for the
splits:
library(partykit)
m2 <- lmtree(OD ~ Time | Time, data = bacteria[45:90,])
library(ggparty)
autoplot(m2)
Next, we make predictions for the entire dataset and compare the results to the
actual outcomes. We plot the predictions from the decision tree in red and those
from the model tree in blue:
bacteria$Predicted_dt <- predict(m, bacteria)
bacteria$Predicted_mt <- predict(m2, bacteria)
library(ggplot2)
ggplot(bacteria, aes(Time, OD)) +
geom_line() +
geom_line(aes(Time, Predicted_dt), colour = "red") +
geom_line(aes(Time, Predicted_mt), colour = "blue")
As we can see from the plot of the model tree, it (correctly!) identifies different time
phases in which the bacteria grow at different speeds. It therefore also manages to
make better extrapolations than the decision tree, which predicts no growth as Time
moves beyond the range of the training data.
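The reason the model tree extrapolates better is that its leaves contain linear regressions rather than constants, so each leaf's trend continues outside the training range. A base-R sketch with made-up two-phase data and a known split:

```r
# Two growth phases: flat until x = 10, then linear growth.
x <- 1:20
y <- c(rep(1, 10), 1 + 0.5 * (1:10))

# A model tree fits a regression in each leaf; here the split is known:
m_left  <- lm(y ~ x, data = data.frame(x = x[1:10],  y = y[1:10]))
m_right <- lm(y ~ x, data = data.frame(x = x[11:20], y = y[11:20]))

# Outside the training range the right-hand leaf keeps its slope,
# whereas a regression tree would predict a constant:
predict(m_right, newdata = data.frame(x = 30))  # 11
```

lmtree automates both steps, choosing the split points and fitting the leaf regressions at the same time.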
Exercise 9.24
We load and format the data as in the beginning of Section 9.1.7. We can then fit
the model using train as follows:
library(caret)
tc <- trainControl(method = "repeatedcv",
number = 10, repeats = 100,
summaryFunction = twoClassSummary,
savePredictions = TRUE,
classProbs = TRUE)
m <- train(type ~ .,
data = wine,
trControl = tc,
method = "qda",
metric = "ROC")
# ROC:
plots$roc
# Calibration curves:
plots$cc
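The method = "repeatedcv" setting used above reshuffles the fold assignment on every repeat. The index bookkeeping can be sketched in base R; the values of n, k and repeats below are arbitrary:

```r
set.seed(1)
n <- 20; k <- 10; repeats <- 3

# One column of fold labels per repeat, each a fresh random permutation:
folds <- replicate(repeats, sample(rep(1:k, length.out = n)))
dim(folds)         # one fold label per observation and repeat
table(folds[, 1])  # within a repeat, every fold holds n / k observations
```

Averaging the performance over many such reshufflings reduces the variance of the cross-validation estimate, at the cost of many more model fits.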
Exercise 9.25
First, we load the data as in Section 4.9:
# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
col.names = c("Area", "Perimeter", "Compactness",
"Kernel_length", "Kernel_width", "Asymmetry",
"Groove_length", "Variety"))
559
Next, we fit LDA and QDA models with Kernel_length and Compactness as ex-
planatory variables:
library(caret)
tc <- trainControl(method = "LOOCV")
Next, we plot the decision boundaries in the same scatterplot (LDA is black and
QDA is orange):
contour_data <- expand.grid(
Kernel_length = seq(min(seeds$Kernel_length), max(seeds$Kernel_length), length = 500),
Compactness = seq(min(seeds$Compactness), max(seeds$Compactness), length = 500))
library(ggplot2)
ggplot(seeds, aes(Kernel_length, Compactness, colour = Variety)) +
geom_point(size = 2) +
stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
data = predictions1, colour = "black") +
stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
data = predictions2, colour = "orange")
The decision boundaries are fairly similar and seem pretty reasonable. QDA offers
more flexible non-linear boundaries, but the difference isn’t huge.
Exercise 9.26
Next, we fit the MDA model with Kernel_length and Compactness as explanatory
variables:
library(caret)
tc <- trainControl(method = "LOOCV")
library(ggplot2)
ggplot(seeds, aes(Kernel_length, Compactness, colour = Variety)) +
geom_point(size = 2) +
stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
data = predictions, colour = "black")
Exercise 9.27
We load and format the data as in the beginning of Section 9.1.7. We’ll go with a
polynomial kernel and compare polynomials of degree 2 and 3. We can fit the model
using train as follows:
library(caret)
tc <- trainControl(method = "cv",
number = 10,
summaryFunction = twoClassSummary,
savePredictions = TRUE,
classProbs = TRUE)
m <- train(type ~ .,
data = wine,
trControl = tc,
method = "svmPoly",
tuneGrid = expand.grid(C = 1,
degree = 2:3,
scale = 1),
metric = "ROC")
# ROC:
plots$roc
# Calibration curves:
plots$cc
Exercise 9.28
1. We set file_path to the path of bacteria.csv, then load and format the data
as in Section 9.3.3:
bacteria <- read.csv(file_path)
bacteria$Time <- as.POSIXct(bacteria$Time, format = "%H:%M:%S")
Finally, we make predictions for the entire dataset and compare the results to the
actual outcomes:
library(ggplot2)
ggplot(bacteria, aes(Time, OD)) +
geom_line() +
geom_line(aes(Time, Predicted), colour = "red")
Similar to the linear model in Section 9.3.3, the SVM model does not extrapolate
too well outside the training data. Unlike tree-based models, however, it does not
yield constant predictions for values of the explanatory variable that are outside the
range in the training data. Instead, the fitted function is assumed to follow the same
shape as in the training data.
2. Next, we repeat the same steps using the data from rows 20 to 120:
library(caret)
tc <- trainControl(method = "LOOCV")
The results are disappointing. Using a different kernel could improve the results
though, so go ahead and give that a try!
Exercise 9.29
First, we load the data as in Section 4.9:
# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
col.names = c("Area", "Perimeter", "Compactness",
"Kernel_length", "Kernel_width", "Asymmetry",
"Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)
Next, we fit two different SVM models with Kernel_length and Compactness as
explanatory variables:
library(caret)
tc <- trainControl(method = "cv",
number = 10)
Next, we plot the decision boundaries in the same scatterplot (the polynomial kernel
is black and the radial basis kernel is orange):
contour_data <- expand.grid(
Kernel_length = seq(min(seeds$Kernel_length), max(seeds$Kernel_length), length = 500),
Compactness = seq(min(seeds$Compactness), max(seeds$Compactness), length = 500))
library(ggplot2)
ggplot(seeds, aes(Kernel_length, Compactness, colour = Variety)) +
geom_point(size = 2) +
stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
data = predictions1, colour = "black") +
stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
data = predictions2, colour = "orange")
It is likely the case that the polynomial kernel gives results similar to e.g. MDA,
whereas the radial basis kernel gives more flexible decision boundaries.
Exercise 9.30
We load and format the data as in the beginning of Section 9.1.7. We can then fit
the model using train. We set summaryFunction = twoClassSummary and metric
= "ROC" to use AUC to find the optimal k. We make sure to add a preProcess
argument to train, to standardise the data:
library(caret)
tc <- trainControl(method = "cv",
number = 10,
summaryFunction = twoClassSummary,
savePredictions = TRUE,
classProbs = TRUE)
To visually evaluate the model, we use evalm to plot ROC and calibration curves:
library(MLeval)
plots <- evalm(m, gnames = "kNN")
# ROC:
plots$roc
# Calibration curves:
plots$cc
The performance is as good as, or a little better than, the best logistic regression
model from Exercise 9.5. We shouldn’t make too much of any differences though,
as the models were evaluated in different ways - we used repeated 10-fold cross-
validation for the logistic regression models and a simple 10-fold cross-validation
here (because repeated cross-validation would be too slow in this case).
Exercise 9.31
First, we load the data as in Section 4.9:
# The data is downloaded from the UCI Machine Learning Repository:
# http://archive.ics.uci.edu/ml/datasets/seeds
seeds <- read.table("https://tinyurl.com/seedsdata",
col.names = c("Area", "Perimeter", "Compactness",
"Kernel_length", "Kernel_width", "Asymmetry",
"Groove_length", "Variety"))
seeds$Variety <- factor(seeds$Variety)
Next, we fit two different kNN models with Kernel_length and Compactness as
explanatory variables:
library(caret)
tc <- trainControl(method = "LOOCV")
library(ggplot2)
ggplot(seeds, aes(Kernel_length, Compactness, colour = Variety)) +
geom_point(size = 2) +
stat_contour(aes(x = Kernel_length, y = Compactness, z = Variety),
data = predictions, colour = "black")
The decision boundaries are quite “wiggly”, which will always be the case when there
are enough points in the sample.
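The wiggliness comes from kNN's local nature: every prediction is a majority vote among the k nearest training points, so the boundary bends around individual observations. A bare-bones base-R implementation on toy data illustrates the mechanics:

```r
# A minimal k-nearest-neighbours classifier (Euclidean distance):
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  d <- sqrt(rowSums((train_x - matrix(new_x, nrow(train_x), ncol(train_x),
                                      byrow = TRUE))^2))
  nearest <- order(d)[1:k]
  names(which.max(table(train_y[nearest])))  # majority vote among the k
}

# Toy training data with two well-separated classes:
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_y <- c("a", "a", "a", "b", "b", "b")
knn_predict(train_x, train_y, c(0.5, 0.5))  # "a"
knn_predict(train_x, train_y, c(5.5, 5.5))  # "b"
```

Increasing k averages the vote over more neighbours, which smooths the decision boundary.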
Exercise 9.32
We start by plotting the time series:
library(forecast)
library(fma)
autoplot(writing) +
ylab("Sales (francs)") +
ggtitle("Sales of printing and writing paper")
Finally, we make a forecast for the next 36 months, adding the seasonal component
back and using bootstrap prediction intervals:
autoplot(forecast(tsmod, h = 36, bootstrap = TRUE))
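The idea of modelling a seasonally adjusted series and then adding the seasonal component back can be illustrated with base R's decompose and the built-in AirPassengers data (a different series than the one above, used purely for illustration):

```r
d <- decompose(AirPassengers)
adjusted <- AirPassengers - d$seasonal  # remove the seasonal component

# The adjusted series keeps the trend but not the monthly pattern;
# adding d$seasonal back recovers the original series exactly:
max(abs(AirPassengers - (adjusted + d$seasonal)))  # 0
plot(adjusted)
```

forecast's tooling does the equivalent adjustment and back-transformation automatically when producing the plot above.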
Bibliography
Further reading
Below is a list of some highly recommended books that either partially overlap with
the content in this book or serve as a natural next step after you finish reading this
book. All of these are available for free online.
• Deep Learning with R (https://livebook.manning.com/book/deep-learning-
with-r/) by Chollet & Allaire (2018) delves into neural networks and deep
learning, including computer vision and generative models.
Online resources
• A number of reference cards and cheat sheets can be found online. I like the
one at https://cran.r-project.org/doc/contrib/Short-refcard.pdf
• R-bloggers (https://www.r-bloggers.com/) collects blog posts related to R. A
great place to discover new tricks and see how others are using R.
• RSeek (http://rseek.org/) provides a custom Google search with the aim of
only returning pages related to R.
• Stack Overflow (https://stackoverflow.com/questions/tagged/r) and its
sister-site Cross Validated (https://stats.stackexchange.com/) are questions-
and-answers sites. They are great places for asking questions, and in addition,
they already contain a ton of useful information about all things R-related.
The RStudio Community (https://community.rstudio.com/) is another good
option.
• The R Journal (https://journal.r-project.org/) is an open-access peer-reviewed
journal containing papers on R, mainly describing new add-on packages and
their functionality.
References
Agresti, A. (2013). Categorical Data Analysis. Wiley.
Bates, D., Mächler, M., Bolker, B., Walker, S. (2015). Fitting linear mixed-effects
models using lme4. Journal of Statistical Software, 67, 1.
Boehmke, B., Greenwell, B. (2019). Hands-On Machine Learning with R. CRC Press.
Box, G.E., Cox, D.R. (1964). An analysis of transformations. Journal of the Royal
Statistical Society: Series B (Methodological), 26(2), 211-243.
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A. (1984). Classification and
Regression Trees. CRC press.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Brown, L.D., Cai, T.T., DasGupta, A. (2001). Interval estimation for a binomial
proportion. Statistical Science, 16(2), 101-117.
Buolamwini, J., Gebru, T. (2018). Gender shades: Intersectional accuracy disparities
in commercial gender classification. Proceedings of Machine Learning Research, 81,
1-15.
Franks, B. (Ed.) (2020). 97 Things About Ethics Everyone in Data Science Should
Know. O’Reilly Media.
Friedman, J.H. (2002). Stochastic Gradient Boosting, Computational Statistics and
Data Analysis, 38(4), 367-378.
Gao, L.L., Bien, J., Witten, D. (2020). Selective inference for hierarchical clustering.
Pre-print, arXiv:2012.02936.
Groll, A., Tutz, G. (2014). Variable selection for generalized linear mixed models by
L1-penalized estimation. Statistics and Computing, 24(2), 137-154.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer Science & Busi-
ness Media.
Hartigan, J.A., Wong, M.A. (1979). Algorithm AS 136: A k-means clustering algo-
rithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1),
100-108.
Henderson, H.V., Velleman, P.F. (1981). Building multiple regression models inter-
actively. Biometrics, 37, 391–411.
Herr, D.G. (1986). On the history of ANOVA in unbalanced, factorial designs: the
first 30 years. The American Statistician, 40(4), 265-270.
Hoerl, A.E., Kennard, R.W. (1970). Ridge regression: Biased estimation for
nonorthogonal problems. Technometrics, 12(1), 55-67.
Hyndman, R. J., Athanasopoulos, G. (2018). Forecasting: Principles and Practice.
OTexts.
James, G., Witten, D., Hastie, T., Tibshirani, R. (2013). An Introduction to Statis-
tical Learning with Applications in R. Springer.
Kuznetsova, A., Brockhoff, P. B., Christensen, R. H. (2017). lmerTest package: tests
in linear mixed effects models. Journal of Statistical Software, 82(13), 1-26.
Liero, H., Zwanzig, S. (2012). Introduction to the Theory of Statistical Inference.
CRC Press.
Long, J.D., Teetor, P. (2019). The R Cookbook. O’Reilly Media.
Moen, A., Lind, A.L., Thulin, M., Kamali-Moghaddam, M., Roe, C., Gjerstad, J.,
Gordh, T. (2016). Inflammatory serum protein profiling of patients with lumbar
radicular pain one year after disc herniation. International Journal of Inflammation,
2016, Article ID 3874964.
Persson, I., Arnroth, L., Thulin, M. (2019). Multivariate two-sample permutation
tests for trials with multiple time-to-event outcomes. Pharmaceutical Statistics, 18(4),
476-485.
Petterson, T., Högbladh, S., Öberg, M. (2019). Organized violence, 1989-2018 and
peace agreements. Journal of Peace Research, 56(4), 589-603.
Picard, R.R., Cook, R.D. (1984). Cross-validation of regression models. Journal of
the American Statistical Association, 79(387), 575–583.
Recht, B., Roelofs, R., Schmidt, L., Shankar, V. (2019). Do imagenet classifiers
generalize to imagenet?. arXiv preprint arXiv:1902.10811.
Schoenfeld, D. (1982). Partial residuals for the proportional hazards regression model.
Biometrika, 69(1), 239-241.
Scrucca, L., Fop, M., Murphy, T.B., Raftery, A.E. (2016). mclust 5: clustering,
classification and density estimation using Gaussian finite mixture models. The R
Journal, 8(1), 289.
Smith, G. (2018). Step away from stepwise. Journal of Big Data, 5(1), 32.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
Tibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of clusters in
a data set via the gap statistic. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 63(2), 411-423.
Thulin, M. (2014a). The cost of using exact confidence intervals for a binomial
proportion. Electronic Journal of Statistics, 8, 817-840.
Thulin, M. (2014b). On Confidence Intervals and Two-Sided Hypothesis Testing.
PhD thesis. Department of Mathematics, Uppsala University.
Thulin, M. (2014c). Decision-theoretic justifications for Bayesian hypothesis testing
using credible sets. Journal of Statistical Planning and Inference, 146, 133-138.
Thulin, M. (2016). Two‐sample tests and one‐way MANOVA for multivariate
biomarker data with nondetects. Statistics in Medicine, 35(20), 3623-3644.
Thulin, M., Zwanzig, S. (2017). Exact confidence intervals and hypothesis tests for
parameters of discrete distributions. Bernoulli, 23(1), 479-502.
Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econo-
metrica, 26, 24-36.
Wasserstein, R.L., Lazar, N.A. (2016). The ASA statement on p-values: context,
process, and purpose. The American Statistician, 70(2), 129-133.
Wei, L.J. (1992). The accelerated failure time model: a useful alternative to the Cox
regression model in survival analysis. Statistics in Medicine, 11(14‐15), 1871-1879.
Wickham, H. (2019). Advanced R. CRC Press.
Wickham, H., Bryan, J. (forthcoming). R Packages.