CHAPMAN & HALL/CRC DATA SCIENCE SERIES
Reflecting the interdisciplinary nature of the field, this book series brings together researchers, practitio-
ners, and instructors from statistics, computer science, machine learning, and analytics. The series will
publish cutting-edge research, industry applications, and textbooks in data science.
The inclusion of concrete examples, applications, and methods is highly encouraged. The scope of the
series includes titles in the areas of machine learning, pattern recognition, predictive analytics, business
analytics, Big Data, visualization, programming, software, learning analytics, data wrangling, interactive
graphics, and reproducible research.
Published Titles
Data Analytics
A Small Data Approach
Shuai Huang and Houtao Deng
Data Science
A First Introduction
Tiffany Timbers, Trevor Campbell, and Melissa Lee
Tree-Based Methods
A Practical Introduction with Applications in R
Brandon M. Greenwell
Urban Informatics
Using Big Data to Understand and Serve Communities
Daniel T. O’Brien
Tiffany Timbers
Trevor Campbell
Melissa Lee
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and publisher
cannot assume responsibility for the validity of all materials or the consequences of their use. The authors
and publishers have attempted to trace the copyright holders of all material reproduced in this publica-
tion and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any future
reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, trans-
mitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying, microfilming, and recording, or in any information storage or retrieval
system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or
contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-
8400. For works that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used
only for identification and explanation without intent to infringe.
Names: Timbers, Tiffany, author. | Campbell, Trevor, author. | Lee, Melissa, author.
Title: Data science : a first introduction / Tiffany Timbers, Trevor
Campbell and Melissa Lee.
Description: First edition. | Boca Raton : CRC Press, 2022. | Series:
Statistics | Includes bibliographical references and index.
Identifiers: LCCN 2021054754 (print) | LCCN 2021054755 (ebook) | ISBN
9780367532178 (hardback) | ISBN 9780367524685 (paperback) | ISBN
9781003080978 (ebook)
Subjects: LCSH: Mathematical statistics--Data processing--Textbooks. | R
(Computer program language)--Textbooks. | Quantitative research--Data
processing--Textbooks.
Classification: LCC QA276.45.R3 T56 2022 (print) | LCC QA276.45.R3
(ebook) | DDC 519.50285/5133--dc23/eng20220301
LC record available at https://lccn.loc.gov/2021054754
LC ebook record available at https://lccn.loc.gov/2021054755
DOI: 10.1201/9781003080978
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
For my husband Curtis and daughter Rowan. Thank-you for your love
(and patience with my late night writing).
– Tiffany
To mom and dad: here’s a book. Pretty neat, eh? Love you guys.
– Trevor
To mom and dad, thank you for all your love and support.
– Melissa
Contents
Foreword
Preface
Acknowledgments
9 Clustering
9.1 Overview
9.2 Chapter learning objectives
9.3 Clustering
9.4 K-means
9.4.1 Measuring cluster quality
9.4.2 The clustering algorithm
9.4.3 Random restarts
9.4.4 Choosing K
9.5 Data pre-processing for K-means
9.6 K-means in R
9.7 Exercises
9.8 Additional resources
Bibliography
Index
Foreword
Roger D. Peng
Johns Hopkins Bloomberg School of Public Health
2022-01-04
The field of data science has expanded and grown significantly in recent years,
attracting excitement and interest from many different directions. The demand
for introductory educational materials has grown concurrently with the growth
of the field itself, leading to a proliferation of textbooks, courses, blog posts,
and tutorials. This book is an important contribution to this fast-growing
literature, but given the wide availability of materials, a reader should be
inclined to ask, “What is the unique contribution of this book?” In order to
answer that question it is useful to step back for a moment and consider the
development of the field of data science over the past few years.
When thinking about data science, it is important to consider two questions:
“What is data science?” and “How should one do data science?” The former
question is under active discussion amongst a broad community of researchers
and practitioners and there does not appear to be much consensus to date.
However, there seems a general understanding that data science focuses on the
more “active” elements—data wrangling, cleaning, and analysis—of answering
questions with data. These elements are often highly problem-specific and may
seem difficult to generalize across applications. Nevertheless, over time we have
seen some core elements emerge that appear to repeat themselves as useful
concepts across different problems. Given the lack of clear agreement over the
definition of data science, there is a strong need for a book like this one to
propose a vision for what the field is and what the implications are for the
activities in which members of the field engage.
The first important concept addressed by this book is tidy data, which is
a format for tabular data formally introduced to the statistical community
in a 2014 paper by Hadley Wickham. The tidy data organization strategy
has proven a powerful abstract concept for conducting data analysis, in large
part because of the vast toolchain implemented in the Tidyverse collection
of R packages. The second key concept is the development of workflows for
reproducible and auditable data analyses. Modern data analyses have only
grown in complexity due to the availability of data and the ease with which
we can implement complex data analysis procedures. Furthermore, these data
analyses are often part of decision-making processes that may have significant
impacts on people and communities. Therefore, there is a critical need to
build reproducible analyses that can be studied and repeated by others in a
reliable manner. Statistical methods clearly represent an important element of
data science for building prediction and classification models and for making
inferences about unobserved populations. Finally, because a field can succeed
only if it fosters an active and collaborative community, it has become clear
that being fluent in the tools of collaboration is a core element of data science.
This book takes these core concepts and focuses on how one can apply them
to do data science in a rigorous manner. Students who learn from this book
will be well-versed in the techniques and principles behind producing reliable
evidence from data. This book is centered around the use of the R program-
ming language within the tidy data framework, and as such employs the most
recent advances in data analysis coding. The use of Jupyter notebooks for
exercises immediately places the student in an environment that encourages
auditability and reproducibility of analyses. The integration of git and GitHub
into the course is a key tool for teaching about collaboration and community,
key concepts that are critical to data science.
The demand for training in data science continues to increase. The availability
of large quantities of data to answer a variety of questions, the computational
power available to many more people than ever before, and the public aware-
ness of the importance of data for decision-making have all contributed to the
need for high-quality data science work. This book provides a sophisticated
first introduction to the field of data science and provides a balanced mix of
practical skills along with generalizable principles. As we continue to intro-
duce students to data science and train them to confront an expanding array
of data science problems, they will be well-served by the ideas presented here.
Preface
Figure 1 summarizes what you will learn in each chapter of this book. Through-
out, you will learn how to use the R programming language [R Core Team,
2021] to perform all the tasks associated with data analysis. You will spend
the first four chapters learning how to use R to load, clean, wrangle (i.e., re-
structure the data into a usable format) and visualize data while answering
descriptive and exploratory data analysis questions. In the next six chapters,
you will learn how to answer predictive, exploratory, and inferential data anal-
ysis questions with common methods in data science, including classification,
regression, clustering, and estimation. In the final chapters (11–13), you will
learn how to combine R code, formatted text, and images in a single coherent
document with Jupyter, use version control for collaboration, and install and
configure the software needed for data science on your own computer. If you
are reading this book as part of a course that you are taking, the instructor
may have set up all of these tools already for you; in this case, you can continue
on through the book reading the chapters in order. But if you are reading this
independently, you may want to jump to these last three chapters early before
going on to make sure your computer is set up in such a way that you can try
out the example code that we include throughout the book.
Each chapter in the book has an accompanying worksheet that provides exer-
cises to help you practice the concepts you will learn. We strongly recommend
that you work through the worksheet when you finish reading each chapter
before moving on to the next chapter. All of the worksheets are available at
https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme; the
“Exercises” section at the end of each chapter points you to the right worksheet
for that chapter. For each worksheet, you can either launch an interactive ver-
sion of the worksheet in your browser by clicking the “launch binder” button,
or preview a non-interactive version of the worksheet by clicking “view work-
sheet.” If you instead decide to download the worksheet and run it on your
own machine, make sure to follow the instructions for computer setup found
in Chapter 13. This will ensure that the automated feedback and guidance
that the worksheets provide will function as intended.
Acknowledgments
We’d like to thank everyone that has contributed to the development of Data
Science: A First Introduction (https://datasciencebook.ca). This is an open source textbook that began as
a collection of course readings for DSCI 100, a new introductory data science
course at the University of British Columbia (UBC). Several faculty members
in the UBC Department of Statistics were pivotal in shaping the direction of
that course, and as such, contributed greatly to the broad structure and list of
topics in this book. We would especially like to thank Matías Salibián-Barrera
for his mentorship during the initial development and roll-out of both DSCI
100 and this book. His door was always open when we needed to chat about
how to best introduce and teach data science to our first-year students.
We would also like to thank all those who contributed to the process of pub-
lishing this book. In particular, we would like to thank all of our reviewers
for their feedback and suggestions: Rohan Alexander, Isabella Ghement, Vir-
gilio Gómez Rubio, Albert Kim, Adam Loy, Maria Prokofieva, Emily Riederer,
and Greg Wilson. The book was improved substantially by their insights. We
would like to give special thanks to Jim Zidek for his support and encourage-
ment throughout the process, and to Roger Peng for graciously offering to
write the Foreword.
Finally, we owe a debt of gratitude to all of the students of DSCI 100 over the
past few years. They provided invaluable feedback on the book and worksheets;
they found bugs for us (and stood by very patiently in class while we frantically
fixed those bugs); and they brought a level of enthusiasm to the class that
sustained us during the hard work of creating a new course and writing a
textbook. Our interactions with them taught us how to teach data science,
and that learning is reflected in the content of this book.
About the authors
1 R and the Tidyverse
1.1 Overview
This chapter provides an introduction to data science and the R programming
language. The goal here is to get your hands dirty right from the start! We will
walk through an entire data analysis, and along the way introduce different
types of data analysis question, some fundamental programming concepts in
R, and the basics of loading, cleaning, and visualizing data. In the following
chapters, we will dig into each of these steps in much more detail; but for now,
let’s jump in to see how much we can do with data science!
often unique to Canada and not spoken anywhere else in the world [Statistics
Canada, 2018]. Sadly, colonization has led to the loss of many of these lan-
guages. For instance, generations of children were not allowed to speak their
mother tongue (the first language an individual learns in childhood) in Cana-
dian residential schools. Colonizers also renamed places they had “discovered”
[Wilson, 2018]. Acts such as these have significantly harmed the continuity
of Indigenous languages in Canada, and some languages are considered “en-
dangered” as few people report speaking them. To learn more, please see
Canadian Geographic’s article, “Mapping Indigenous Languages in Canada”
[Walker, 2017], They Came for the Children: Canada, Aboriginal peoples, and
Residential Schools [Truth and Reconciliation Commission of Canada, 2012]
and the Truth and Reconciliation Commission of Canada’s Calls to Action
[Truth and Reconciliation Commission of Canada, 2015].
The data set we will study in this chapter is taken from the canlang R data
package (https://ttimbers.github.io/canlang/) [Timbers, 2020], which has population language data collected during
the 2016 Canadian census [Statistics Canada, 2016a]. In this data, there are
214 languages recorded, each having six different properties: category, language,
mother_tongue, most_at_home, most_at_work, and lang_known (these are the column
names you will see when we load the data later in this chapter).
Note: Data science cannot be done without a deep understanding of the data
and problem domain. In this book, we have simplified the data sets used in
our examples to concentrate on methods and fundamental concepts. But in
real life, you cannot and should not do data science without a domain expert.
Alternatively, it is common to practice data science in your own domain of
expertise! Remember that when you work with data, it is essential to think
about how the data were collected, which affects the conclusions you can draw.
If your data are biased, then your results will be biased!
TABLE 1.1: Types of data analysis question [Leek and Peng, 2015, Peng
and Matsui, 2015].
In this book, you will learn techniques to answer the first four types of question:
descriptive, exploratory, predictive, and inferential; causal and mechanistic
questions are beyond the scope of this book. In particular, you will learn how
to apply the following analysis tools:
Table 1.1. For example, you might use visualization to answer the
following question: Is there any relationship between race time and
age for runners in this data set? This is covered in detail in Chapter
4, but again appears regularly throughout the book.
3. Classification: predicting a class or category for a new observation.
Classification is used to answer predictive questions. For example,
you might use classification to answer the following question: Given
measurements of a tumor’s average cell area and perimeter, is the
tumor benign or malignant? Classification is covered in Chapters 5
and 6.
4. Regression: predicting a quantitative value for a new observation.
Regression is also used to answer predictive questions. For example,
you might use regression to answer the following question: What will
be the race time for a 20-year-old runner who weighs 50kg? Regression
is covered in Chapters 7 and 8.
5. Clustering: finding previously unknown/unlabeled subgroups in a
data set. Clustering is often used to answer exploratory questions. For
example, you might use clustering to answer the following question:
What products are commonly bought together on Amazon? Clustering
is covered in Chapter 9.
6. Estimation: taking measurements for a small number of items from
a large group and making a good guess for the average or proportion
for the large group. Estimation is used to answer inferential questions.
For example, you might use estimation to answer the following ques-
tion: Given a survey of cellphone ownership of 100 Canadians, what
proportion of the entire Canadian population own Android phones?
Estimation is covered in Chapter 10.
many different forms! Perhaps the most common form of data set that you
will find in the wild, however, is tabular data. Think spreadsheets in Microsoft
Excel: tabular data are rectangular-shaped and spreadsheet-like, as shown in
Figure 1.1. In this book, we will focus primarily on tabular data.
Since we are using R for data analysis in this book, the first step for us is to
load the data into R. When we load tabular data into R, it is represented as
a data frame object. Figure 1.1 shows that an R data frame is very similar
to a spreadsheet. We refer to the rows as observations; these are the things
that we collect the data on, e.g., voters, cities, etc. We refer to the columns
as variables; these are the characteristics of those observations, e.g., voters’
political affiliations, cities’ populations, etc.
The first kind of data file that we will learn how to load into R as a data
frame is the comma-separated values format (.csv for short). These files have
names ending in .csv, and can be opened and saved using common spreadsheet
programs like Microsoft Excel and Google Sheets. For example, the .csv file
named can_lang.csv is included with the code for this book (https://github.com/UBC-DSCI/introduction-to-datascience/tree/master/data). If we were to open
this data in a plain text editor (a program like Notepad that just shows text
with no formatting), we would see each row on its own line, and each entry in
the table separated by a comma:
category,language,mother_tongue,most_at_home,most_at_work,lang_known
Aboriginal languages,"Aboriginal languages, n.o.s.",590,235,30,665
Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415
Non-Official & Non-Aboriginal languages,"Afro-Asiatic languages, n.i.e.",1150,445,10,2775
Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150
Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930
Aboriginal languages,"Algonquian languages, n.i.e.",45,10,0,120
Aboriginal languages,Algonquin,1260,370,40,2480
Non-Official & Non-Aboriginal languages,American Sign Language,2685,3020,1145,21930
Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670
To load this data into R so that we can do things with it (e.g., perform analyses
or create data visualizations), we will need to use a function. A function is a
special word in R that takes instructions (we call these arguments) and does
something. The function we will use to load a .csv file into R is called read_csv.
In its most basic use-case, read_csv expects that the data file:
• has column names (or headers),
• uses a comma (,) to separate the columns, and
• does not have row names.
Below you’ll see the code used to load the data into R using the read_csv func-
tion. Note that the read_csv function is not included in the base installation
of R, meaning that it is not one of the primary functions ready to use when
you install R. Therefore, you need to load it from somewhere else before you
can use it. The place from which we will load it is called an R package. An R
package is a collection of functions that can be used in addition to the built-in
R package functions once loaded. The read_csv function, in particular, can be
made accessible by loading the tidyverse R package3 [Wickham, 2021b, Wick-
ham et al., 2019] using the library function. The tidyverse package contains
many functions that we will use throughout this book to load, clean, wrangle,
and visualize data.
library(tidyverse)
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Note: You may have noticed that we got some extra output from R saying
Attaching packages and Conflicts below our code line. These are examples of
messages in R, which give the user more information that might be handy
to know. The Attaching packages message is natural when loading tidyverse,
since tidyverse actually automatically causes other packages to be imported
too, such as dplyr. In the future, when we load tidyverse in this book, we will
silence these messages to help with the readability of the book. The Conflicts
message is also totally normal in this circumstance. This message tells you if
functions from different packages share the same name, which is confusing to
R. For example, in this case, the dplyr package and the stats package both
provide a function called filter. The message above (dplyr::filter() masks
stats::filter()) is R telling you that it is going to default to the dplyr package
version of this function. So if you use the filter function, you will be using
the dplyr version. In order to use the stats version, you need to use its full
name stats::filter. Messages are not errors, so generally you don’t need to
take action when you see a message; but you should always read the message
and critically think about what it means and whether you need to do anything
about it.
After loading the tidyverse package, we can call the read_csv function and
pass it a single argument: the name of the file, "can_lang.csv". We have to
put quotes around file names and other letters and words that we use in our
code to distinguish them from the special words (like functions!) that make up
the R programming language. The file’s name is the only argument we need
to provide because our file satisfies everything else that the read_csv function
expects in the default use-case. Figure 1.2 describes how we use the read_csv function
to read data into R.
read_csv("data/can_lang.csv")
## # A tibble: 214 x 6
## category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aboriginal la~ Aboriginal~ 590 235 30 665
## 2 Non-Official ~ Afrikaans 10260 4785 85 23415
## 3 Non-Official ~ Afro-Asiat~ 1150 445 10 2775
## 4 Non-Official ~ Akan (Twi) 13460 5985 25 22150
## 5 Non-Official ~ Albanian 26895 13135 345 31930
## 6 Aboriginal la~ Algonquian~ 45 10 0 120
## 7 Aboriginal la~ Algonquin 1260 370 40 2480
## 8 Non-Official ~ American S~ 2685 3020 1145 21930
## 9 Non-Official ~ Amharic 22465 12785 200 33670
## 10 Non-Official ~ Arabic 419890 223535 5585 629055
## # ... with 204 more rows
Note: There is another function that also loads csv files named read.csv. We
will always use read_csv in this book, as it is designed to play nicely with all of
the other tidyverse functions, which we will use extensively. Be careful not to
accidentally use read.csv, as it can cause some tricky errors to occur in your
code that are hard to track down!
The way to assign a name to a value in R is via the assignment symbol <-.
On the left side of the assignment symbol you put the name that you want
to use, and on the right side of the assignment symbol you put the value that
you want the name to refer to. Names can be used to refer to almost anything
in R, such as numbers, words (also known as strings of characters), and data
frames! Below, we set my_number to 3 (the result of 1+2) and we set name to the
string "Alice".
my_number <- 1 + 2
name <- "Alice"
Note that when we name something in R using the assignment symbol, <-,
we do not need to surround the name we are creating with quotes. This is
because we are formally telling R that this special word denotes the value
of whatever is on the right-hand side. Only characters and words that act as
values on the right-hand side of the assignment symbol—e.g., the file name
"data/can_lang.csv" that we specified before, or "Alice" above—need to be sur-
rounded by quotes.
After making the assignment, we can use the special name words we have
created in place of their values. For example, if we want to do something with
the value 3 later on, we can just use my_number instead. Let’s try adding 2 to
my_number; you will see that R just interprets this as adding 3 and 2:
my_number + 2
## [1] 5
na + me <- 1
There are certain conventions for naming objects in R. When naming an object
we suggest using only lowercase letters, numbers and underscores _ to separate
the words in a name. R is case sensitive, which means that Letter and letter
would be two different objects in R. You should also try to give your objects
meaningful names. For instance, you can name a data frame x. However, using
more meaningful terms, such as language_data, will help you remember what
each name in your code represents. We recommend following the Tidyverse
naming conventions outlined in the Tidyverse Style Guide [Wickham, 2020].
Let’s now use the assignment symbol to give the name can_lang to the 2016
Canadian census language data frame that we get from read_csv.
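In code, that assignment would look like the following (a minimal sketch combining the assignment symbol with the read_csv call shown earlier):
can_lang <- read_csv("data/can_lang.csv")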
Wait a minute, nothing happened this time! Where’s our data? Actually, some-
thing did happen: the data was loaded in and now has the name can_lang as-
sociated with it. And we can use that name to access the data frame and do
things with it. For example, we can type the name of the data frame to print
the first few rows on the screen. You will also see at the top that the number
of observations (i.e., rows) and variables (i.e., columns) are printed. Printing
the first few rows of a data frame like this is a handy way to get a quick sense
for what is contained in a data frame.
can_lang
## # A tibble: 214 x 6
## category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aboriginal la~ Aboriginal~ 590 235 30 665
## 2 Non-Official ~ Afrikaans 10260 4785 85 23415
## 3 Non-Official ~ Afro-Asiat~ 1150 445 10 2775
## 4 Non-Official ~ Akan (Twi) 13460 5985 25 22150
## 5 Non-Official ~ Albanian 26895 13135 345 31930
## 6 Aboriginal la~ Algonquian~ 45 10 0 120
## 7 Aboriginal la~ Algonquin 1260 370 40 2480
## 8 Non-Official ~ American S~ 2685 3020 1145 21930
## 9 Non-Official ~ Amharic 22465 12785 200 33670
## 10 Non-Official ~ Arabic 419890 223535 5585 629055
## # ... with 204 more rows
Now that we’ve loaded our data into R, we can start wrangling the data to
find the ten Aboriginal languages that were most often reported in 2016 as
mother tongues in Canada. In particular, we will construct a table with the ten
Aboriginal languages that have the largest counts in the mother_tongue column.
The filter and select functions from the tidyverse package will help us here.
The filter function allows you to obtain a subset of the rows with specific
values, while the select function allows you to obtain a subset of the columns.
Therefore, we can filter the rows to extract the Aboriginal languages in the
data set, and then use select to obtain only the columns we want to include
in our table.
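A minimal sketch of that filter step (the object name aboriginal_lang matches how the result is referred to later in this section; the logical statement keeps only rows whose category column equals "Aboriginal languages"):
aboriginal_lang <- filter(can_lang, category == "Aboriginal languages")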
With these arguments, filter returns a data frame that has all the columns
of the input data frame, but only those rows we asked for in our logical filter
statement.
## # A tibble: 67 x 6
## category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aboriginal ~ Aboriginal l~ 590 235 30 665
## 2 Aboriginal ~ Algonquian l~ 45 10 0 120
## 3 Aboriginal ~ Algonquin 1260 370 40 2480
## 4 Aboriginal ~ Athabaskan l~ 50 10 0 85
## 5 Aboriginal ~ Atikamekw 6150 5465 1100 6645
## 6 Aboriginal ~ Babine (Wets~ 110 20 10 210
## 7 Aboriginal ~ Beaver 190 50 0 340
## 8 Aboriginal ~ Blackfoot 2815 1110 85 5645
## 9 Aboriginal ~ Carrier 1025 250 15 2100
## 10 Aboriginal ~ Cayuga 45 10 10 125
## # ... with 57 more rows
It’s good practice to check the output after using a function in R. We can
see the original can_lang data set contained 214 rows with multiple kinds of
category. The data frame aboriginal_lang contains only 67 rows, and looks
like it only contains languages with “Aboriginal languages” in the category
column. So it looks like the function gave us the result we wanted!
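The select step that produces the next output would be a sketch along these lines (keeping only the language and mother_tongue columns, and assuming the object name selected_lang used later in this section):
selected_lang <- select(aboriginal_lang, language, mother_tongue)
selected_lang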
## # A tibble: 67 x 2
## language mother_tongue
## <chr> <dbl>
## 1 Aboriginal languages, n.o.s. 590
## 2 Algonquian languages, n.i.e. 45
## 3 Algonquin 1260
## 4 Athabaskan languages, n.i.e. 50
## 5 Atikamekw 6150
## 6 Babine (Wetsuwet'en) 110
## 7 Beaver 190
## 8 Blackfoot 2815
## 9 Carrier 1025
## 10 Cayuga 45
## # ... with 57 more rows
1.7.3 Using arrange to order and slice to select rows by index number
We have used filter and select to obtain a table with only the Aboriginal
languages in the data set and their associated counts. However, we want to
know the ten languages that are spoken most often. As a next step, we could
order the mother_tongue column from greatest to least and then extract only
the top ten rows. This is where the arrange and slice functions come to the
rescue!
The arrange function allows us to order the rows of a data frame by the values
of a particular column. Figure 1.5 details what arguments we need to specify to
use the arrange function. We need to pass the data frame as the first argument
to this function, and the variable to order by as the second argument. Since we
want to choose the ten Aboriginal languages most often reported as a mother
tongue language, we will use the arrange function to order the rows in our
selected_lang data frame by the mother_tongue column. We want to arrange the
rows in descending order (from largest to smallest), so we pass the column to
the desc function before using it as an argument.
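A sketch of that arrange step (the object name arranged_lang is illustrative):
arranged_lang <- arrange(selected_lang, desc(mother_tongue))
arranged_lang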
## # A tibble: 67 x 2
## language mother_tongue
## <chr> <dbl>
## 1 Cree, n.o.s. 64050
## 2 Inuktitut 35210
## 3 Ojibway 17885
## 4 Oji-Cree 12855
## 5 Dene 10700
## 6 Montagnais (Innu) 10235
## 7 Mi'kmaq 6690
## 8 Atikamekw 6150
## 9 Plains Cree 3065
## 10 Stoney 3025
## # ... with 57 more rows
Next we will use the slice function, which selects rows according to their row
number. Since we want to choose the most common ten languages, we will
indicate we want the rows 1 to 10 using the argument 1:10.
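A sketch of that slice step (the object name ten_lang is illustrative and is reused in the plotting sketches below):
ten_lang <- slice(arranged_lang, 1:10)
ten_lang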
## # A tibble: 10 x 2
## language mother_tongue
## <chr> <dbl>
## 1 Cree, n.o.s. 64050
## 2 Inuktitut 35210
## 3 Ojibway 17885
## 4 Oji-Cree 12855
## 5 Dene 10700
## 6 Montagnais (Innu) 10235
## 7 Mi'kmaq 6690
## 8 Atikamekw 6150
## 9 Plains Cree 3065
## 10 Stoney 3025
We have now answered our initial question by generating this table! Are we
done? Well, not quite; tables are almost never the best way to present the
result of your analysis to your audience. Even the simple table above with
only two columns presents some difficulty: for example, you have to scrutinize
the table quite closely to get a sense for the relative numbers of speakers of each
language. When you move on to more complicated analyses, this issue only gets
worse. In contrast, a visualization would convey this information in a much
more easily understood format. Visualizations are a great tool for summarizing
information to help you effectively communicate with your audience.
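A minimal ggplot2 sketch that produces a bar plot like the one in Figure 1.7 (assuming the ten_lang data frame from above; aes maps language to the x axis and mother_tongue to the y axis, and geom_bar(stat = "identity") draws bar heights directly from the data):
ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
  geom_bar(stat = "identity")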
FIGURE 1.7: Bar plot of the ten Aboriginal languages most often reported
by Canadian residents as their mother tongue. Note that this visualization is
not done yet; there are still improvements to be made.
Note: The vast majority of the time, a single expression in R must be con-
tained in a single line of code. However, there are a small number of situations
in which you can have a single R expression span multiple lines. Above is one
such case: here, R knows that a line cannot end with a + symbol, and so it
keeps reading the next line to figure out what the right-hand side of the +
symbol should be. We could, of course, put all of the added layers on one
line of code, but splitting them across multiple lines helps a lot with code
readability.
should replace this default with a more informative label. For the example
above, R uses the column name mother_tongue as the label for the y axis, but
most people will not know what that is. And even if they did, they will not
know how we measured this variable, or the group of people on which the
measurements were taken. An axis label that reads “Mother Tongue (Number
of Canadian Residents)” would be much more informative.
Adding additional layers to our visualizations that we create in ggplot is one
common and easy way to improve and refine our data visualizations. New
layers are added to ggplot objects using the + symbol. For example, we can use
the xlab (short for x axis label) and ylab (short for y axis label) functions to
add layers where we specify meaningful and informative labels for the x and
y axes. Again, since we are specifying words (e.g., "Mother Tongue (Number of
Canadian Residents)") as arguments to xlab and ylab, we surround them with
double quotation marks. We can add many more layers to format the plot
further, and we will explore these in Chapter 4.
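A sketch that adds these label layers to the earlier plot, producing something like Figure 1.8:
ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
  geom_bar(stat = "identity") +
  xlab("Language") +
  ylab("Mother Tongue (Number of Canadian Residents)")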
FIGURE 1.8: Bar plot of the ten Aboriginal languages most often reported
by Canadian residents as their mother tongue with x and y labels. Note that
this visualization is not done yet; there are still improvements to be made.
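The long language names crowd the x axis in Figures 1.7 and 1.8. One way to obtain the horizontal layout shown in Figure 1.9 is to swap the x and y mappings (a sketch, not necessarily the exact code behind the figure):
ggplot(ten_lang, aes(x = mother_tongue, y = language)) +
  geom_bar(stat = "identity") +
  xlab("Mother Tongue (Number of Canadian Residents)") +
  ylab("Language")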
FIGURE 1.9: Horizontal bar plot of the ten Aboriginal languages most often
reported by Canadian residents as their mother tongue. There are no more
serious issues with this visualization, but it could be refined further.
Another big step forward, as shown in Figure 1.9! There are no more serious
issues with the visualization. Now comes time to refine the visualization to
make it even more well-suited to answering the question we asked earlier in
this chapter. For example, the visualization could be made more transparent by
organizing the bars according to the number of Canadian residents reporting
each language, rather than in alphabetical order. We can reorder the bars
using the reorder function, which orders a variable (here language) based on
the values of the second variable (mother_tongue).
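A sketch using reorder to produce the ordering shown in Figure 1.10:
ggplot(ten_lang, aes(x = mother_tongue,
                     y = reorder(language, mother_tongue))) +
  geom_bar(stat = "identity") +
  xlab("Mother Tongue (Number of Canadian Residents)") +
  ylab("Language")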
FIGURE 1.10: Bar plot of the ten Aboriginal languages most often reported
by Canadian residents as their mother tongue with bars reordered.
Figure 1.10 provides a very clear and well-organized answer to our original
question; we can see what the ten most often reported Aboriginal languages
were, according to the 2016 Canadian census, and how many people speak
each of them. For instance, we can see that the Aboriginal language most
often reported was Cree n.o.s. with over 60,000 Canadian residents reporting
it as their mother tongue.
Note: “n.o.s.” means “not otherwise specified”, so Cree n.o.s. refers to indi-
viduals who reported Cree as their mother tongue but did not specify a par-
ticular Cree language. In this data set, the Cree
languages include the following categories: Cree n.o.s., Swampy Cree, Plains
Cree, Woods Cree, and a ‘Cree not included elsewhere’ category (which in-
cludes Moose Cree, Northern East Cree and Southern East Cree) [Statistics
Canada, 2016b].
library(tidyverse)
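# A sketch that collects the steps from this chapter into one analysis
# (the intermediate object names are illustrative, not necessarily those
# used to produce Figure 1.11):
can_lang <- read_csv("data/can_lang.csv")
aboriginal_lang <- filter(can_lang, category == "Aboriginal languages")
selected_lang <- select(aboriginal_lang, language, mother_tongue)
arranged_lang <- arrange(selected_lang, desc(mother_tongue))
ten_lang <- slice(arranged_lang, 1:10)
ggplot(ten_lang, aes(x = mother_tongue,
                     y = reorder(language, mother_tongue))) +
  geom_bar(stat = "identity") +
  xlab("Mother Tongue (Number of Canadian Residents)") +
  ylab("Language")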
FIGURE 1.11: Putting it all together: bar plot of the ten Aboriginal lan-
guages most often reported by Canadian residents as their mother tongue.
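In R, you can bring up a function’s documentation with the ? operator; for example, for the filter function (a quick sketch):
?filter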
Figure 1.12 shows the documentation that will pop up, including a high-level
description of the function, its arguments, a description of each, and more.
Note that you may find some of the text in the documentation a bit too tech-
nical right now (for example, what is dbplyr, and what is grouped data?). Fear
not: as you work through this book, many of these terms will be introduced
to you, and slowly but surely you will become more adept at understanding
and navigating documentation like that shown in Figure 1.12. But do keep in
mind that the documentation is not written to teach you about a function; it
is just there as a reference to remind you about the different arguments and
usage of functions that you have already learned about elsewhere.
FIGURE 1.12: The documentation for the filter function, including a high-
level description, a list of arguments and their meanings, and more.
1.10 Exercises
Practice exercises for the material covered in this chapter can be found in
the accompanying worksheets repository (https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme) in the “R and the tidyverse” row.
You can launch an interactive version of the worksheet in your browser by
clicking the “launch binder” button. You can also preview a non-interactive
version of the worksheet by clicking “view worksheet.” If you instead decide to
download the worksheet and run it on your own machine, make sure to follow
the instructions for computer setup found in Chapter 13. This will ensure
that the automated feedback and guidance that the worksheets provide will
function as intended.
2 Reading in data locally and from the web
2.1 Overview
In this chapter, you’ll learn to read tabular data of various formats into R from
your local device (e.g., your laptop) and the web. “Reading” (or “loading”) is
the process of converting data (stored as plain text, a database, HTML, etc.)
into an object (e.g., a data frame) that R can easily access and manipulate.
Thus reading data is the gateway to any data analysis; you won’t be able to
analyze data unless you’ve loaded it first. And because there are many ways
to store data, there are similarly many ways to read data into R. The more
time you spend upfront matching the data reading method to the type of data
you have, the less time you will have to devote to re-formatting, cleaning and
wrangling your data (the second step to all data analyses). It’s like making
sure your shoelaces are tied well before going for a run so that you don’t trip
later on!
Suppose our computer’s filesystem looks like the picture in Figure 2.1, and we
are working in a file titled worksheet_02.ipynb. If we want to read the .csv file
named happiness_report.csv into R, we could do this using either a relative or
an absolute path. We show both choices below.
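For example (a sketch, assuming the tidyverse is loaded; happy_data is an illustrative name, and the absolute path assumes a hypothetical location for the worksheet_02 folder in Figure 2.1):
# relative path (from the folder containing worksheet_02.ipynb)
happy_data <- read_csv("data/happiness_report.csv")
# absolute path (assuming a hypothetical location for that folder)
happy_data <- read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")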
So which one should you use? Generally speaking, you should use relative
paths. Using a relative path helps ensure that your code can be run on a
different computer (and as an added bonus, relative paths are often shorter—
easier to type!). This is because a file’s relative path is often the same across
different computers, while a file’s absolute path (the names of all of the folders
between the computer’s root, represented by /, and the file) isn’t usually the
same across different computers. For example, suppose Fatima and Jayden are
working on a project together on the happiness_report.csv data. Fatima’s file
is stored at
/home/Fatima/project/data/happiness_report.csv, while Jayden’s file is stored at
/home/Jayden/project/data/happiness_report.csv.
Even though Fatima and Jayden stored their files in the same place on their
computers (in their home folders), the absolute paths are different due to their
different usernames. If Jayden has code that loads the happiness_report.csv
data using an absolute path, the code won’t work on Fatima’s computer. But
the relative path from inside the project folder (data/happiness_report.csv) is
the same on both computers; any code that uses relative paths will work on
both! In the additional resources section, we include a link to a short video on
the difference between absolute and relative paths. You can also check out the
here package, which provides methods for finding and constructing file paths
in R.
Beyond files stored on your computer (i.e., locally), we also need a way to
locate resources stored elsewhere on the internet (i.e., remotely). For this pur-
pose we use a Uniform Resource Locator (URL), i.e., a web address that looks
something like https://datasciencebook.ca/. URLs indicate the location of a
resource on the internet and help us retrieve that resource.
Before we jump into the cases where the data aren’t in the expected default
format for tidyverse and read_csv, let’s revisit the more straightforward case
where the defaults hold, and the only argument we need to give to the func-
tion is the path to the file, data/can_lang.csv. The can_lang data set contains
language data from the 2016 Canadian census. We put data/ before the file’s
name when we are loading the data set because this data set is located in a
sub-folder, named data, relative to where we are running our R code. Here is
what the text in the file data/can_lang.csv looks like.
category,language,mother_tongue,most_at_home,most_at_work,lang_known
Aboriginal languages,"Aboriginal languages, n.o.s.",590,235,30,665
Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415
Non-Official & Non-Aboriginal languages,"Afro-Asiatic languages, n.i.e.",1150,445,10,2775
Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150
Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930
Aboriginal languages,"Algonquian languages, n.i.e.",45,10,0,120
Aboriginal languages,Algonquin,1260,370,40,2480
Non-Official & Non-Aboriginal languages,American Sign Language,2685,3020,1145,21930
Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670
And here is a review of how we can use read_csv to load it into R. First we
load the tidyverse package to gain access to useful functions for reading the
data.
library(tidyverse)
Next we use read_csv to load the data into R, and in that call we specify the
relative path to the file.
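That call would look like the following (a minimal sketch using the canlang_data name that we print a little further below):
canlang_data <- read_csv("data/can_lang.csv")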
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note: It is also normal and expected that a message is printed out after using
the read_csv and related functions. This message lets you know the data types
of each of the columns that R inferred while reading the data into R. In the
future when we use this and related functions to load data in this book, we
will silence these messages to help with the readability of the book.
Finally, to view the first 10 rows of the data frame, we must call it:
canlang_data
## # A tibble: 214 x 6
## category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aboriginal la~ Aboriginal~ 590 235 30 665
## 2 Non-Official ~ Afrikaans 10260 4785 85 23415
## 3 Non-Official ~ Afro-Asiat~ 1150 445 10 2775
## 4 Non-Official ~ Akan (Twi) 13460 5985 25 22150
## 5 Non-Official ~ Albanian 26895 13135 345 31930
## 6 Aboriginal la~ Algonquian~ 45 10 0 120
## 7 Aboriginal la~ Algonquin 1260 370 40 2480
## 8 Non-Official ~ American S~ 2685 3020 1145 21930
## 9 Non-Official ~ Amharic 22465 12785 200 33670
## 10 Non-Official ~ Arabic 419890 223535 5585 629055
## # ... with 204 more rows
With this extra information being present at the top of the file, using read_csv
as we did previously does not allow us to correctly load the data into R. In
the case of this file we end up only reading in one column of the data set:
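# a sketch; the file name below is hypothetical and stands for a copy of the
# data with the extra metadata lines at the top
canlang_data <- read_csv("data/can_lang_meta-data.csv")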
Note: In contrast to the normal and expected messages above, this time R
printed out a warning for us indicating that there might be a problem with
how our data is being read in.
canlang_data
## # A tibble: 217 x 1
## `Data source: https://ttimbers.github.io/canlang/`
## <chr>
## 1 "Data originally published in: Statistics Canada Census of Population 2016."
## 2 "Reproduced and distributed on an as-is basis with their permission."
## 3 "category,language,mother_tongue,most_at_home,most_at_work,lang_known"
## 4 "Aboriginal languages,\"Aboriginal languages, n.o.s.\",590,235,30,665"
## 5 "Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415"
## 6 "Non-Official & Non-Aboriginal languages,\"Afro-Asiatic languages, n.i.e.\",~
## 7 "Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150"
## 8 "Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930"
## 9 "Aboriginal languages,\"Algonquian languages, n.i.e.\",45,10,0,120"
## 10 "Aboriginal languages,Algonquin,1260,370,40,2480"
## # ... with 207 more rows
To successfully read data like this into R, the skip argument can be useful to
tell R how many lines to skip before it should start reading in the data. In the
example above, we would set this value to 3.
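A sketch of the call with the skip argument added (again, the file name is hypothetical):
canlang_data <- read_csv("data/can_lang_meta-data.csv", skip = 3)
canlang_data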
## # A tibble: 214 x 6
## category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aboriginal la~ Aboriginal~ 590 235 30 665
## 2 Non-Official ~ Afrikaans 10260 4785 85 23415
## 3 Non-Official ~ Afro-Asiat~ 1150 445 10 2775
## 4 Non-Official ~ Akan (Twi) 13460 5985 25 22150
## 5 Non-Official ~ Albanian 26895 13135 345 31930
## 6 Aboriginal la~ Algonquian~ 45 10 0 120
## 7 Aboriginal la~ Algonquin 1260 370 40 2480
## 8 Non-Official ~ American S~ 2685 3020 1145 21930
## 9 Non-Official ~ Amharic 22465 12785 200 33670
## 10 Non-Official ~ Arabic 419890 223535 5585 629055
## # ... with 204 more rows
How did we know to skip three lines? We looked at the data! The first three
lines of the data had information we didn’t need to import:
Data source: https://ttimbers.github.io/canlang/
Data originally published in: Statistics Canada Census of Population 2016.
Reproduced and distributed on an as-is basis with their permission.
The column names began at line 4, so we skipped the first three lines.
To read in this type of data, we can use the read_tsv function to read in .tsv
(tab-separated values) files.
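A sketch of that call (the file name data/can_lang.tsv is hypothetical):
canlang_data <- read_tsv("data/can_lang.tsv")
canlang_data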
## # A tibble: 214 x 6
## category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aboriginal la~ Aboriginal~ 590 235 30 665
## 2 Non-Official ~ Afrikaans 10260 4785 85 23415
## 3 Non-Official ~ Afro-Asiat~ 1150 445 10 2775
## 4 Non-Official ~ Akan (Twi) 13460 5985 25 22150
## 5 Non-Official ~ Albanian 26895 13135 345 31930
## 6 Aboriginal la~ Algonquian~ 45 10 0 120
## 7 Aboriginal la~ Algonquin 1260 370 40 2480
## 8 Non-Official ~ American S~ 2685 3020 1145 21930
## 9 Non-Official ~ Amharic 22465 12785 200 33670
## 10 Non-Official ~ Arabic 419890 223535 5585 629055
## # ... with 204 more rows
Let’s compare the data frame here to the resulting data frame in Section 2.4.1
after using read_csv. Notice anything? They look the same: the same number
of rows and columns, and the same column names! So although we needed a
different tool depending on the file format, the resulting table (canlang_data)
was the same in both cases.
To get this into R using the read_delim function, we specify the first argument
as the path to the file (as done with read_csv), and then provide values to
the delim argument (here a tab, which we represent by "\t") and the col_names
argument (here we specify that there are no column names to assign, and
give it the value of FALSE). read_csv, read_tsv and read_delim have a col_names
argument and the default is TRUE.
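A sketch of that call (the file name data/can_lang_no_names.tsv is hypothetical; it stands for a tab-separated copy of the data without column names):
canlang_data <- read_delim("data/can_lang_no_names.tsv",
                           delim = "\t",
                           col_names = FALSE)
canlang_data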
## # A tibble: 214 x 6
## X1 X2 X3 X4 X5 X6
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aboriginal languages Aborigina~ 590 235 30 665
## 2 Non-Official & Non-Aboriginal languages Afrikaans 10260 4785 85 23415
## 3 Non-Official & Non-Aboriginal languages Afro-Asia~ 1150 445 10 2775
## 4 Non-Official & Non-Aboriginal languages Akan (Twi) 13460 5985 25 22150
## 5 Non-Official & Non-Aboriginal languages Albanian 26895 13135 345 31930
## 6 Aboriginal languages Algonquia~ 45 10 0 120
## 7 Aboriginal languages Algonquin 1260 370 40 2480
## 8 Non-Official & Non-Aboriginal languages American ~ 2685 3020 1145 21930
## 9 Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670
## 10 Non-Official & Non-Aboriginal languages Arabic 419890 223535 5585 629055
Data frames in R need to have column names. Thus if you read in data that
don’t have column names, R will assign names automatically. In the example
above, R assigns each column a name of X1, X2, X3, X4, X5, X6.
It is best to rename your columns to help differentiate between them (e.g., X1,
X2, etc., are not very descriptive names and will make it more confusing as
you code). To rename your columns, you can use the rename function from the
dplyr R package (https://dplyr.tidyverse.org/) [Wickham et al., 2021b] (one of the packages loaded with
tidyverse, so we don’t need to load it separately). The first argument is the
data set, and in the subsequent arguments you write new_name = old_name for
the selected variables to rename. We rename the X1, X2, ..., X6 columns in
the canlang_data data frame to more descriptive names below.
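A sketch of that rename call (mapping the automatic names X1 through X6 to the descriptive names that appear in the output below):
canlang_data <- rename(canlang_data,
                       category = X1,
                       language = X2,
                       mother_tongue = X3,
                       most_at_home = X4,
                       most_at_work = X5,
                       lang_known = X6)
canlang_data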
## # A tibble: 214 x 6
## category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aboriginal la~ Aboriginal~ 590 235 30 665
## 2 Non-Official ~ Afrikaans 10260 4785 85 23415
## 3 Non-Official ~ Afro-Asiat~ 1150 445 10 2775
## 4 Non-Official ~ Akan (Twi) 13460 5985 25 22150
## 5 Non-Official ~ Albanian 26895 13135 345 31930
## 6 Aboriginal la~ Algonquian~ 45 10 0 120
## 7 Aboriginal la~ Algonquin 1260 370 40 2480
## 8 Non-Official ~ American S~ 2685 3020 1145 21930
## 9 Non-Official ~ Amharic 22465 12785 200 33670
## 10 Non-Official ~ Arabic 419890 223535 5585 629055
## # ... with 204 more rows
canlang_data
## # A tibble: 214 x 6
## category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aboriginal la~ Aboriginal~ 590 235 30 665
## 2 Non-Official ~ Afrikaans 10260 4785 85 23415
## 3 Non-Official ~ Afro-Asiat~ 1150 445 10 2775
## 4 Non-Official ~ Akan (Twi) 13460 5985 25 22150
## 5 Non-Official ~ Albanian 26895 13135 345 31930
## 6 Aboriginal la~ Algonquian~ 45 10 0 120
## 7 Aboriginal la~ Algonquin 1260 370 40 2480
## 8 Non-Official ~ American S~ 2685 3020 1145 21930
## 9 Non-Official ~ Amharic 22465 12785 200 33670
## 10 Non-Official ~ Arabic 419890 223535 5585 629055
## # ... with 204 more rows
This type of file representation allows Excel files to store additional things that
you cannot store in a .csv file, such as fonts, text formatting, graphics, multiple
sheets and more. And despite looking odd in a plain text editor, we can read
Excel spreadsheets into R using the readxl package developed specifically for
this purpose.
library(readxl)
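# a sketch of reading the Excel version of the data
# (the file name data/can_lang.xlsx is hypothetical)
canlang_data <- read_excel("data/can_lang.xlsx")
canlang_data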
## # A tibble: 214 x 6
## category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aboriginal la~ Aboriginal~ 590 235 30 665
## 2 Non-Official ~ Afrikaans 10260 4785 85 23415
## 3 Non-Official ~ Afro-Asiat~ 1150 445 10 2775
## 4 Non-Official ~ Akan (Twi) 13460 5985 25 22150
## 5 Non-Official ~ Albanian 26895 13135 345 31930
## 6 Aboriginal la~ Algonquian~ 45 10 0 120
## 7 Aboriginal la~ Algonquin 1260 370 40 2480
## 8 Non-Official ~ American S~ 2685 3020 1145 21930
## 9 Non-Official ~ Amharic 22465 12785 200 33670
## 10 Non-Official ~ Arabic 419890 223535 5585 629055
## # ... with 204 more rows
If the .xlsx file has multiple sheets, you have to use the sheet argument to
specify the sheet number or name. You can also specify cell ranges using
the range argument. This functionality is useful when a single sheet contains
multiple tables (a sad thing that happens to many Excel spreadsheets since
this makes reading in data more difficult).
As with plain text files, you should always explore the data file before import-
ing it into R. Exploring the data beforehand helps you decide which arguments
you need to load the data into R successfully. If you do not have the Excel
program on your computer, you can use other programs to preview the file.
Examples include Google Sheets and Libre Office.
In Table 2.1 we summarize the read_* functions we covered in this chapter. We
also include the read_csv2 function for data separated by semicolons ;, which
you may run into with data sets where the decimal is represented by a comma
instead of a period (as with some data sets from European countries).
Note: readr is a part of the tidyverse package so we did not need to load this
package separately since we loaded tidyverse.
library(DBI)
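# a sketch of opening a connection to an SQLite database with dbConnect;
# the database file name data/can_lang.db is hypothetical, and the
# connection name conn_lang_data is used throughout the rest of this section
conn_lang_data <- dbConnect(RSQLite::SQLite(), "data/can_lang.db")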
Often relational databases have many tables; thus, in order to retrieve data
from a database, you need to know the name of the table in which the data
is stored. You can get the names of all the tables in the database using the
dbListTables function:
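# a sketch of that call, using the conn_lang_data connection created above
dbListTables(conn_lang_data)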
## [1] "lang"
The dbListTables function returned only one name, which tells us that there
is only one table in this database. To reference a table in the database (so
that we can perform operations like selecting columns and filtering rows), we
use the tbl function from the dbplyr package. The object returned by the tbl
function allows us to work with data stored in databases as if they were just
regular data frames; but secretly, behind the scenes, dbplyr is turning your
function calls (e.g., select and filter) into SQL queries!
library(dbplyr)
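# a sketch of creating that table reference with tbl; the object name lang_db
# is used later in this section, and printing it would show the
# database-backed table described in the next paragraphs
lang_db <- tbl(conn_lang_data, "lang")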
Although it looks like we just got a data frame from the database, we didn’t!
It’s a reference; the data is still stored only in the SQLite database. The dbplyr
package works this way because databases are often more efficient at selecting,
filtering and joining large data sets than R. And typically the database will
not even be stored on your computer, but rather a more powerful machine
somewhere on the web. So R is lazy and waits to bring this data into memory
until you explicitly tell it to using the collect function. Figure 2.2 highlights
the difference between a tibble object in R and the output we just created.
Notice in the table on the right, the first two lines of the output indicate the
source is SQL. The last line doesn’t show how many rows there are (R is trying
to avoid performing expensive query operations), whereas the output for the
tibble object does.
We can look at the SQL commands that are sent to the database when we
write tbl(conn_lang_data, "lang") in R with the show_query function from the
dbplyr package.
show_query(tbl(conn_lang_data, "lang"))
## <SQL>
## SELECT *
## FROM `lang`
The output above shows the SQL code that is sent to the database. When
we write tbl(conn_lang_data, "lang") in R, in the background, the function is
translating the R code into SQL, sending that SQL to the database, and then
translating the response for us. So dbplyr does all the hard work of translating
from R to SQL and back for us; we can just stick with R!
With our lang_db table reference for the 2016 Canadian Census data in hand,
we can mostly continue onward as if it were a regular data frame. For example,
we can use the filter function to obtain only certain rows. Below we filter the
data to include only Aboriginal languages.
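A sketch of that filtering step; the object name aboriginal_lang_db matches the later code, and the exact category string is assumed from the output below:
# keep only the rows in the Aboriginal languages category
aboriginal_lang_db <- filter(lang_db, category == "Aboriginal languages")
aboriginal_lang_db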
Above you can again see the hints that this data is not actually stored in R
yet: the source is a lazy query [?? x 6] and the output says ... with more rows
at the end (both indicating that R does not know how many rows there are in
total!), and a database type sqlite 3.36.0 is listed. In order to actually retrieve
this data in R as a data frame, we use the collect function. Below you will see
that after running collect, R knows that the retrieved data has 67 rows, and
there is no database listed any more.
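A sketch of that step (the name of the collected data frame is an assumption):
# bring the filtered result into R as a regular data frame
aboriginal_lang_data <- collect(aboriginal_lang_db)
aboriginal_lang_data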
## # A tibble: 67 x 6
## category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aboriginal ~ Aboriginal l~ 590 235 30 665
## 2 Aboriginal ~ Algonquian l~ 45 10 0 120
## 3 Aboriginal ~ Algonquin 1260 370 40 2480
## 4 Aboriginal ~ Athabaskan l~ 50 10 0 85
## 5 Aboriginal ~ Atikamekw 6150 5465 1100 6645
## 6 Aboriginal ~ Babine (Wets~ 110 20 10 210
## 7 Aboriginal ~ Beaver 190 50 0 340
## 8 Aboriginal ~ Blackfoot 2815 1110 85 5645
## 9 Aboriginal ~ Carrier 1025 250 15 2100
## 10 Aboriginal ~ Cayuga 45 10 10 125
## # ... with 57 more rows
Aside from knowing the number of rows, the data looks pretty similar in both
outputs shown above. And dbplyr provides many more functions (not just
filter) that you can use to directly feed the database reference (lang_db) into
downstream analysis functions (e.g., ggplot2 for data visualization). But dbplyr
does not provide every function that we need for analysis; we do eventually
need to call collect. For example, look what happens when we try to use nrow
to count rows in a data frame:
nrow(aboriginal_lang_db)
## [1] NA
tail(aboriginal_lang_db)
Additionally, some operations will not work to extract columns or single values
from the reference given by the tbl function. Thus, once you have finished your
data wrangling of the tbl database reference object, it is advisable to bring
it into R as a data frame using collect. But be very careful using collect:
databases are often very big, and reading an entire table into R might take a
long time to run or even possibly crash your machine. So make sure you use
filter and select on the database table to reduce the data to a reasonable size
before using collect to read it into R!
Additionally, we must use the RPostgres package instead of RSQLite in the
dbConnect function call. Below we demonstrate how to connect to a version of the
can_mov_db database, which contains information about Canadian movies. Note
that the host (fakeserver.stat.ubc.ca), user (user0001), and password (abc123) be-
low are not real; you will not actually be able to connect to a database using
this information.
library(RPostgres)
conn_mov_data <- dbConnect(RPostgres::Postgres(), dbname = "can_mov_db",
                           host = "fakeserver.stat.ubc.ca", port = 5432,
                           user = "user0001", password = "abc123")
After opening the connection, everything looks and behaves almost identically
to when we were using an SQLite database in R. For example, we can again
use dbListTables to find out what tables are in the can_mov_db database:
dbListTables(conn_mov_data)
We see that there are 10 tables in this database. Let’s first look at the "ratings"
table to find the lowest rating that exists in the can_mov_db database. To do this,
we first need to extract the average_rating column using select:
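The code for these two steps was lost in extraction; a minimal sketch, with the intermediate object names assumed for illustration:
# reference the "ratings" table, then keep only the average_rating column
ratings_db <- tbl(conn_mov_data, "ratings")
avg_rating_db <- select(ratings_db, average_rating)
avg_rating_db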
5 7.0
6 5.5
7 5.6
8 5.3
9 5.8
10 6.9
# … with more rows
min(avg_rating_db)
[1] 1
We see the lowest rating given to a movie is 1, indicating that it must have
been a really bad movie…
conducted daily in 2021 [Real Time Statistics Project, 2021]. Can you imagine
if Google stored all of the data from those searches in a single .csv file!?
Chaos would ensue!
Note: This section is not required reading for the remainder of the textbook.
It is included for those readers interested in learning a little bit more about
how to obtain different types of data from the web.
Data doesn’t just magically appear on your computer; you need to get it from
somewhere. Earlier in the chapter we showed you how to access data stored
in a plain text, spreadsheet-like format (e.g., comma- or tab-separated) from
a web URL using one of the read_* functions from the tidyverse. But as time
goes on, it is increasingly uncommon to find data (especially large amounts of
data) in this format available for download from a URL. Instead, websites now
often offer something known as an application programming interface (API),
which provides a programmatic way to ask for subsets of a data set. This
allows the website owner to control who has access to the data, what portion
of the data they have access to, and how much data they can access. Typically,
the website owner will give you a token (a secret string of characters somewhat
like a password) that you have to provide when accessing the API.
Another interesting thought: websites themselves are data! When you type a
URL into your browser window, your browser asks the web server (another
computer on the internet whose job it is to respond to requests for the website)
to give it the website’s data, and then your browser translates that data into
something you can see. If the website shows you some information that you’re
interested in, you could create a data set for yourself by copying and pasting
that information into a file. This process of taking information directly from
what a website displays is called web scraping (or sometimes screen scrap-
ing). Now, of course, copying and pasting information manually is a painstak-
ing and error-prone process, especially when there is a lot of information to
gather. So instead of asking your browser to translate the information that
the web server provides into something you can see, you can collect that data
programmatically—in the form of hypertext markup language (HTML) and
cascading style sheet (CSS) code—and process it to extract useful informa-
tion. HTML provides the basic structure of a site and tells the webpage how
to display the content (e.g., titles, paragraphs, bullet lists etc.), whereas CSS
helps style the content and tells the webpage how the HTML elements should
be presented (e.g., colors, layouts, fonts etc.).
This subsection will show you the basics of both web scraping with the rvest
R package2 [Wickham, 2021a] and accessing the Twitter API using the rtweet
R package3 [Kearney, 2019].
When you enter a URL into your browser, your browser connects to the web
server at that URL and asks for the source code for the website. This is the
data that the browser translates into something you can see; so if we are go-
ing to create our own data by scraping a website, we have to first understand
what that data looks like! For example, let’s say we are interested in know-
ing the average rental price (per square foot) of the most recently available
one-bedroom apartments in Vancouver on Craigslist4. When we visit the Van-
couver Craigslist website and search for one-bedroom apartments, we should
see something similar to Figure 2.3.
2. https://rvest.tidyverse.org/
3. https://github.com/ropensci/rtweet
4. https://vancouver.craigslist.org
Based on what our browser shows us, it’s pretty easy to find the size and
price for each apartment listed. But we would like to be able to obtain that
information using R, without any manual human effort or copying and pasting.
We do this by examining the source code that the web server actually sent our
browser to display for us. We show a snippet of it below; the entire source is
included with the code for this book5 :
<span class="result-meta">
<span class="result-price">$800</span>
<span class="housing">
1br -
</span>
<span class="result-tags">
<span class="maptag" data-pid="6786042973">map</span>
</span>
5. https://github.com/UBC-DSCI/introduction-to-datascience/blob/master/img/website_source.txt
</span>
</p>
</li>
<li class="result-row" data-pid="6788463837">
<a href="https://vancouver.craigslist.org/nvn/apa/d/north-vancouver-luxu
<span class="result-price">$2285</span>
</a>
Oof…you can tell that the source code for a web page is not really designed
for humans to understand easily. However, if you look through it closely, you
will find that the information we’re interested in is hidden among the muck.
For example, near the top of the snippet above you can see a line that looks
like
<span class="result-price">$800</span>
That is definitely storing the price of a particular apartment. With some more
investigation, you should be able to find things like the date and time of the
listing, the address of the listing, and more. So this source code most likely
contains all the information we are interested in!
Let’s dig into that line above a bit more. You can see that that bit of code
has an opening tag (words between < and >, like <span>) and a closing tag
(the same with a slash, like </span>). HTML source code generally stores its
data between opening and closing tags like these. Tags are keywords that
tell the web browser how to display or format the content. Above you can
see that the information we want ($800) is stored between an opening and
closing tag (<span> and </span>). In the opening tag, you can also see a very
useful “class” (a special word that is sometimes included with opening tags):
class="result-price". Since we want R to programmatically sort through all of
the source code for the website to find apartment prices, maybe we can look
for all the tags with the "result-price" class, and grab the information between
the opening and closing tag. Indeed, take a look at another line of the source
snippet above:
<span class="result-price">$2285</span>
It’s yet another price for an apartment listing, and the tags surrounding it
have the "result-price" class. Wonderful! Now that we know what pattern
we are looking for—a dollar amount between opening and closing tags that
have the "result-price" class—we should be able to use code to pull out all of
the matching patterns from the source code to obtain our data. This sort of
“pattern” is known as a CSS selector (where CSS stands for cascading style
sheet).
The above was a simple example of “finding the pattern to look for”; many
websites are quite a bit larger and more complex, and so is their website
source code. Fortunately, there are tools available to make this process easier.
For example, SelectorGadget6 is an open-source tool that simplifies generating
and finding CSS selectors. At the end of the chapter in the
additional resources section, we include a link to a short video on how to
install and use the SelectorGadget tool to obtain CSS selectors for use in
web scraping. After installing and enabling the tool, you can click the website
element for which you want an appropriate selector. For example, if we click
the price of an apartment listing, we find that SelectorGadget shows us the
selector .result-price in its toolbar, and highlights all the other apartment
prices that would be obtained using that selector (Figure 2.4).
If we then click the size of an apartment listing, SelectorGadget shows us the
span selector, and highlights many of the lines on the page; this indicates that
the span selector is not specific enough to capture only apartment sizes (Figure
2.5).
To narrow the selector, we can click one of the highlighted elements that we
do not want. For example, we can deselect the “pic/map” links, resulting in
only the data we want highlighted using the .housing selector (Figure 2.6).
So to scrape information about the square footage and rental price of apart-
ment listings, we need to use the two CSS selectors .housing and .result-price,
respectively. The selector gadget returns them to us as a comma-separated list
(here .housing , .result-price), which is exactly the format we need to provide
to R if we are using more than one CSS selector.
6. https://selectorgadget.com/
Stop! Are you allowed to scrape that website? Before scraping data
from the web, you should always check whether or not you are allowed to
scrape it! There are two documents that are important for this: the robots.txt
file and the Terms of Service document. If we take a look at Craigslist’s Terms
of Service document7 , we find the following text: “You agree not to copy/collect
CL content via robots, spiders, scripts, scrapers, crawlers, or any automated
or manual equivalent (e.g., by hand).” So unfortunately, without explicit per-
mission, we are not allowed to scrape the website.
What to do now? Well, we could ask the owner of Craigslist for permission
to scrape. However, we are not likely to get a response, and even if we did
they would not likely give us permission. The more realistic answer is that we
simply cannot scrape Craigslist. If we still want to find data about rental prices
in Vancouver, we must go elsewhere. To continue learning how to scrape data
from the web, let’s instead scrape data on the population of Canadian cities
from Wikipedia. We have checked the Terms of Service document8 , and it does
not mention that web scraping is disallowed. We will use the SelectorGadget
tool to pick elements that we are interested in (city names and population
counts) and deselect others to indicate that we are not interested in them
(province names), as shown in Figure 2.7.
7. https://www.craigslist.org/about/terms.of.use
8. https://foundation.wikimedia.org/wiki/Terms_of_Use/en
We include a link to a short video tutorial on this process at the end of the
chapter in the additional resources section. SelectorGadget provides in its
toolbar the following list of CSS selectors to use:
td:nth-child(5),
td:nth-child(7),
.infobox:nth-child(122) td:nth-child(1),
.infobox td:nth-child(3)
Now that we have the CSS selectors that describe the properties of the ele-
ments that we want to target, we can use them to find certain elements in web
pages and extract data.
Using rvest
Now that we have our CSS selectors we can use the rvest R package to scrape
our desired data from the website. We start by loading the rvest package:
library(rvest)
Next, we tell R what page we want to scrape by providing the webpage’s URL
in quotations to the function read_html:
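A sketch of that call; the exact URL is an assumption, and the result is stored in a variable called page, as described below:
page <- read_html("https://en.wikipedia.org/wiki/Canada")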
The read_html function directly downloads the source code for the page at the
URL you specify, just like your browser would if you navigated to that site. But
instead of displaying the website to you, the read_html function just returns the
HTML source code itself, which we have stored in the page variable. Next, we
send the page object to the html_nodes function, along with the CSS selectors we
obtained from the SelectorGadget tool. Make sure to surround the selectors
with quotation marks; the html_nodes function expects that argument to be a
string. The html_nodes function then selects nodes from the HTML document
that match the CSS selectors you specified. A node is an HTML tag pair (e.g.,
<td> and </td> which defines the cell of a table) combined with the content
stored between the tags. For our CSS selector td:nth-child(5), an example
node that would be selected would be:
<td style="text-align:left;background:#f0f0f0;">
<a href="/wiki/London,_Ontario" title="London, Ontario">London</a>
</td>
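The selection step itself was lost in extraction; a minimal sketch combining the CSS selectors obtained from SelectorGadget (the object name population_nodes matches the note that follows):
# select the nodes matching the CSS selectors, then peek at the first few
selectors <- "td:nth-child(5), td:nth-child(7), .infobox:nth-child(122) td:nth-child(1), .infobox td:nth-child(3)"
population_nodes <- html_nodes(page, selectors)
head(population_nodes)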
## {xml_nodeset (6)}
## [1] <td style="text-align:left;background:#f0f0f0;"><a href="/wiki/London,_On .
## [2] <td style="text-align:right;">543,551\n</td>
## [3] <td style="text-align:left;background:#f0f0f0;"><a href="/wiki/Halifax,_N .
## [4] <td style="text-align:right;">465,703\n</td>
Note: head is a function that is often useful for viewing only a short summary
of an R object, rather than the whole thing (which may be quite a lot to
look at). For example, here head shows us only the first 6 items in the
population_nodes object. Note that some R objects by default print only a small
summary. For example, tibble data frames only show you the first 10 rows. But
not all R objects do this, and that’s where the head function helps summarize
things for you.
Next we extract the meaningful data—in other words, we get rid of the HTML
code syntax and tags—from the nodes using the html_text function. In the case
of the example node above, the html_text function returns "London".
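A sketch of that extraction step (the object name is an assumption):
# strip the HTML tags, keeping only the text stored between them
population_text <- html_text(population_nodes)
head(population_text)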
Fantastic! We seem to have extracted the data of interest from the raw HTML
source code. But we are not quite done; the data is not yet in an optimal format
for data analysis. Both the city names and population are encoded as char-
acters in a single vector, instead of being in a data frame with one character
column for city and one numeric column for population (like a spreadsheet).
Additionally, the populations contain commas (not useful for programmati-
cally dealing with numbers), and some even contain a line break character at
the end (\n). In Chapter 3, we will learn more about how to wrangle data such
as this into a more useful format for data analysis using R.
book, with the hope that it gives you enough of a basic idea that you can
learn how to use another API if needed.
In particular, in this book we will show you the basics of how to use the rtweet
package in R to access data from the Twitter API. One nice feature of this
particular API is that you don’t need a special token to access it; you simply
need to make an account with them. Your access to the data will then be
authenticated and controlled through your account username and password.
If you have a Twitter account already (or are willing to make one), you can
follow along with the examples that we show here. To get started, load the
rtweet package:
library(rtweet)
This package provides an extensive set of functions to search Twitter for tweets,
users, their followers, and more. Let’s construct a small data set of the last 400
tweets and retweets from the @tidyverse9 account. A few of the most recent
tweets are shown in Figure 2.8.
Stop! Think about your API usage carefully!
When you access an API, you are initiating a transfer of data from a web
server to your computer. Web servers are expensive to run and do not have
infinite resources. If you try to ask for too much data at once, you can use
up a huge amount of the server’s bandwidth. If you try to ask for data too
frequently—e.g., if you make many requests to the server in quick succession—
you can also bog the server down and make it unable to talk to anyone else.
Most servers have mechanisms to revoke your access if you are not careful, but
you should try to prevent issues from happening in the first place by being
extra careful with how you write and run your code. You should also keep
in mind that when a website owner grants you API access, they also usually
specify a limit (or quota) of how much data you can ask for. Be careful not
to overrun your quota! In this example, we should take a look at the Twitter
website10 to see what limits we should abide by when using the API.
Using rtweet
After checking the Twitter website, it seems like asking for 400 tweets one
time is acceptable. So we can use the get_timelines function to ask for the last
400 tweets from the @tidyverse11 account.
9. https://twitter.com/tidyverse
10. https://developer.twitter.com/en/docs/twitter-api/rate-limits
11. https://twitter.com/tidyverse
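A sketch of the request described above, with the handle and count taken from the text:
tidyverse_tweets <- get_timelines("tidyverse", n = 400)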
When you call the get_timelines function for the first time (or any other rtweet function
that accesses the API), you will see a browser pop-up that looks something
like Figure 2.9.
This is the rtweet package asking you to provide your own Twitter account’s
login information. When rtweet talks to the Twitter API, it uses your account
information to authenticate requests; Twitter then can keep track of how much
data you’re asking for, and how frequently you’re asking. If you want to follow
along with this example using your own Twitter account, you should read over
the list of permissions you are granting rtweet very carefully and make sure
you are comfortable with it. Note that rtweet can be used to manage most
aspects of your account (make posts, follow others, etc.), which is why rtweet
asks for such extensive permissions. If you decide to allow rtweet to talk to the
Twitter API using your account information, then input your username and
password and hit “Sign In.” Twitter will probably send you an email to say
that there was an unusual login attempt on your account, and in that case
you will have to take the one-time code they send you and provide that to the
rtweet login page too.
Note: Every API has its own way to authenticate users when they try to
access data. Many APIs require you to sign up to receive a token, which is a
secret password that you input into the R package (like rtweet) that you are
using to access the API.
With the authentication setup out of the way, let’s run the get_timelines func-
tion again to actually access the API and take a look at what was returned:
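A sketch of that call (as above, the handle and the count of 400 come from the text):
tidyverse_tweets <- get_timelines("tidyverse", n = 400)
tidyverse_tweets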
## # A tibble: 293 x 71
## created_at reply_to_status_id quoted_created_at reply_to_user_id
## <dttm> <lgl> <dttm> <lgl>
## 1 2021-04-29 11:59:04 NA NA NA
## 2 2021-04-26 17:05:47 NA NA NA
## 3 2021-04-24 09:13:12 NA NA NA
## 4 2021-04-18 06:06:21 NA NA NA
## 5 2021-04-12 05:48:33 NA NA NA
## 6 2021-04-08 17:45:34 NA NA NA
## 7 2021-04-01 05:01:38 NA NA NA
## 8 2021-03-25 06:05:49 NA NA NA
## 9 2021-03-18 17:16:21 NA NA NA
## 10 2021-03-12 19:12:49 NA NA NA
## # ... with 283 more rows, and 67 more variables: reply_to_screen_name <lgl>,
## # is_quote <lgl>, is_retweet <lgl>, quoted_verified <lgl>,
## # retweet_verified <lgl>, protected <lgl>, verified <lgl>,
## # account_lang <lgl>, profile_background_url <lgl>, user_id <dbl>,
The data has quite a few variables! (Notice that the output above shows that
we have a data table with 293 rows and 71 columns). Let’s reduce this down
to a few variables of interest: created_at, retweet_screen_name, is_retweet, and
text.
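A sketch of that reduction step, keeping only the four columns named above:
tidyverse_tweets <- select(tidyverse_tweets, created_at, retweet_screen_name, is_retweet, text)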
tidyverse_tweets
## # A tibble: 293 x 4
## created_at retweet_screen_name is_retweet text
## <dttm> <chr> <lgl> <chr>
## 1 2021-04-29 11:59:04 yutannihilat_en TRUE "Just curious, after the ~
## 2 2021-04-26 17:05:47 statwonk TRUE "List columns provide a t~
## 3 2021-04-24 09:13:12 topepos TRUE "A new release of the {{r~
## 4 2021-04-18 06:06:21 ninarbrooks TRUE "Always typing `? pivot_l~
## 5 2021-04-12 05:48:33 rfunctionaday TRUE "If you are fluent in {dp~
## 6 2021-04-08 17:45:34 RhesusMaCassidy TRUE "R-Ladies of Göttingen! T~
## 7 2021-04-01 05:01:38 dvaughan32 TRUE "I am ridiculously excite~
## 8 2021-03-25 06:05:49 rdpeng TRUE "New book out on using ti~
## 9 2021-03-18 17:16:21 SolomonKurz TRUE "The 0.2.0 version of my ~
## 10 2021-03-12 19:12:49 hadleywickham TRUE "rvest 1.0.0 out now! — h~
## # ... with 283 more rows
If you look back up at the image of the @tidyverse12 Twitter page, you will
recognize the text of the most recent few tweets in the above data frame. In
other words, we have successfully created a small data set using the Twitter
API—neat! This data is also quite different from what we obtained from web
scraping; it is already well-organized into a tidyverse data frame (although not
every API will provide data in such a nice format). From this point onward,
the tidyverse_tweets data frame is stored on your machine, and you can play
with it to your heart’s content. For example, you can use write_csv to save it
to a file and read_csv to read it into R again later; and after reading the next
12. https://twitter.com/tidyverse
few chapters you will have the skills to compute the percentage of retweets
versus tweets, find the most oft-retweeted account, make visualizations of the
data, and much more! If you decide that you want to ask the Twitter API
for more data (see the rtweet page13 for more examples of what is possible),
just be mindful as usual about how much data you are requesting and how
frequently you are making requests.
2.9 Exercises
Practice exercises for the material covered in this chapter can be found in the
accompanying worksheets repository14 in the “Reading in data locally and
from the web” row. You can launch an interactive version of the worksheet in
your browser by clicking the “launch binder” button. You can also preview a
non-interactive version of the worksheet by clicking “view worksheet.” If you
instead decide to download the worksheet and run it on your own machine,
make sure to follow the instructions for computer setup found in Chapter 13.
This will ensure that the automated feedback and guidance that the work-
sheets provide will function as intended.
2.10 Additional resources
• The here R package17 [Müller, 2020] provides a way for you to construct or
find your files’ paths.
• The readxl documentation18 provides more details on reading data from
Excel, such as reading in data with multiple sheets, or specifying the cells
to read in.
• The rio R package19 [Leeper, 2021] provides an alternative set of tools for
reading and writing data in R. It aims to be a “Swiss army knife” for data
reading/writing/converting, and supports a wide variety of data types (in-
cluding data formats generated by other statistical software like SPSS and
SAS).
• A video20 from the Udacity course Linux Command Line Basics provides a
good explanation of absolute versus relative paths.
• If you read the subsection on obtaining data from the web via scraping and
APIs, we provide two companion tutorial video links for how to use the
SelectorGadget tool to obtain desired CSS selectors for:
– extracting the data for apartment listings on Craigslist21 , and
– extracting Canadian city names and 2016 populations from
Wikipedia22 .
• The polite R package23 [Perepolkin, 2021] provides a set of tools for respon-
sibly scraping data from websites.
17. https://here.r-lib.org/
18. https://readxl.tidyverse.org/
19. https://github.com/leeper/rio
20. https://www.youtube.com/embed/ephId3mYu9o
21. https://www.youtube.com/embed/YdIWI6K64zo
22. https://www.youtube.com/embed/O9HKbdhqYzk
23. https://dmi3kno.github.io/polite/
3 Cleaning and wrangling data
3.1 Overview
This chapter is centered around defining tidy data—a data format that is
suitable for analysis—and the tools needed to transform raw data into this
format. This will be presented in the context of a real-world data science
application, providing more practice working through a whole case study.
• Recall and use the following operators for their intended data wrangling
tasks:
– ==
– %in%
– !
– &
– |
– |> and %>%
FIGURE 3.1: A data frame storing data regarding the population of various
regions in Canada. In this example data frame, the row that corresponds to
the observation for the city of Vancouver is colored yellow, and the column
that corresponds to the population variable is colored blue.
R stores the columns of a data frame as either lists or vectors. For example,
the data frame in Figure 3.2 has three vectors whose names are region, year
and population. The next two sections will explain what lists and vectors are.
Note: Technically, these objects are called “atomic vectors.” In this book we
have chosen to call them “vectors,” which is how they are most commonly
referred to in the R community. To be totally precise, “vector” is an umbrella
term that encompasses both atomic vector and list objects in R. But this cre-
ates a confusing situation where the term “vector” could mean “atomic vector”
or “the umbrella term for atomic vector and list,” depending on context. Very
confusing indeed! So to keep things simple, in this book we always use the
term “vector” to refer to “atomic vector.” We encourage readers who are en-
thusiastic to learn more to read the Vectors chapter of Advanced R [Wickham,
2019].
It is important in R to make sure you represent your data with the correct type.
Many of the tidyverse functions we use in this book treat the various data types
differently. You should use integers and double types (which both fall under
the “numeric” umbrella type) to represent numbers and perform arithmetic.
Doubles are more common than integers in R, though; for instance, a double
data type is the default when you create a vector of numbers using c(), and
when you read in whole numbers via read_csv. Characters are used to represent
data that should be thought of as “text”, such as words, names, paths, URLs,
and more. Factors help us encode variables that represent categories; a factor
variable takes one of a discrete set of values known as levels (one for each
category). The levels can be ordered or unordered. Even though factors can
sometimes look like characters, they are not used to represent text, words,
names, and paths in the way that characters are; in fact, R internally stores
factors using integers! There are other basic data types in R, such as raw and
complex, but we do not use these in this textbook.
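For instance, a quick illustrative check of these types (a small example of our own, not from the original text):
typeof(c(1, 5, 12))        # "double": the default for numbers created with c()
typeof(c(1L, 5L, 12L))     # "integer": the L suffix requests integers
typeof(c("a", "b", "c"))   # "character"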
Not all columns in a data frame need to be of the same type. Figure 3.5 shows
a data frame where the columns are vectors of different types. But remember:
because the columns in this example are vectors, the elements must be the
same data type within each column. On the other hand, if our data frame had
list columns, there would be no such requirement. It is generally much more
common to use vector columns, though, as the values for a single variable are
usually all of the same type.
The functions from the tidyverse package that we use often give us a special
class of data frame called a tibble. Tibbles have some additional features and
benefits over the built-in data frame object. These include the ability to add
useful attributes (such as grouping, which we will discuss later) and more
predictable type preservation when subsetting. Because a tibble is just a data
frame with some added features, we will collectively refer to both built-in R
data frames and tibbles as data frames in this book.
Note: You can use the function class on a data object to assess whether a
data frame is a built-in R data frame or a tibble. If the data object is a data
frame, class will return "data.frame". If the data object is a tibble it will return
"tbl_df" "tbl" "data.frame". You can easily convert built-in R data frames to
tibbles using the tidyverse as_tibble function. For example we can check the
class of the Canadian languages data set, can_lang, we worked with in the
previous chapters and we see it is a tibble.
class(can_lang)
Vectors, data frames and lists are basic types of data structure in R, which
are core to most data analyses. We summarize them in Table 3.2. There are
several other data structures in the R programming language (e.g., matrices),
but these are beyond the scope of this book.
There are many good reasons for making sure your data are tidy as a first step
in your analysis. The most important is that it is a single, consistent format
that nearly every function in the tidyverse recognizes. No matter what the
variables and observations in your data represent, as long as the data frame
is tidy, you can manipulate it, plot it, and analyze it using the same tools.
If your data is not tidy, you will have to write special bespoke code in your
analysis that will not only be error-prone, but hard for others to understand.
Beyond making your analysis more accessible to others and less error-prone,
tidy data is also typically easy for humans to interpret. Given these benefits,
it is well worth spending the time to get your data into a tidy format upfront.
Note: Is there only one shape for tidy data for a given data set? Not nec-
essarily! It depends on the statistical question you are asking and what the
variables are for that question. For tidy data, each variable should be its own
column. So, just as it’s essential to match your statistical question with the ap-
propriate data analysis tool, it’s important to match your statistical question
with the appropriate variables and ensure they are represented as individual
columns to make the data tidy.
One task that is commonly performed to get data into a tidy format is to
combine values that are stored in separate columns, but are really part of
the same variable, into one. Data is often stored this way because this format
is sometimes more intuitive for human readability and understanding, and
humans create data sets. In Figure 3.7, the table on the left is in an untidy,
“wide” format because the year values (2006, 2011, 2016) are stored as column
names. And as a consequence, the values for population for the various cities
over these years are also split across several columns.
For humans, this table is easy to read, which is why you will often find data
stored in this wide format. However, this format is difficult to work with when
performing data visualization or statistical analysis using R. For example, if
we wanted to find the latest year it would be challenging because the year
values are stored as column names instead of as values in a single column. So
before we could apply a function to find the latest year (for example, by using
max), we would have to first extract the column names to get them as a vector
and then apply a function to extract the latest year. The problem only gets
worse if you would like to find the value for the population for a given region
for the latest year. Both of these tasks are greatly simplified once the data is
tidied.
Another problem with data in this format is that we don’t know what the
numbers under each year actually represent. Do those numbers represent pop-
ulation size? Land area? It’s not clear. To solve both of these problems, we
can reshape this data set to a tidy data format by creating a column called
“year” and a column called “population.” This transformation—which makes
the data “longer”—is shown as the right table in Figure 3.7.
We can achieve this effect in R using the pivot_longer function from the tidy-
verse package. The pivot_longer function combines columns, and is usually
used during tidying data when we need to make the data frame longer and
narrower. To learn how to use pivot_longer, we will work through an example
with the region_lang_top5_cities_wide.csv data set. This data set contains the
counts of how many Canadians cited each language as their mother tongue
for five major Canadian cities (Toronto, Montréal, Vancouver, Calgary and
Edmonton) from the 2016 Canadian census. To get started, we will load the
tidyverse package and use read_csv to load the (untidy) data.
library(tidyverse)
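A sketch of reading the file named above (the data/ folder and the object name lang_wide are assumptions for illustration):
lang_wide <- read_csv("data/region_lang_top5_cities_wide.csv")
lang_wide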
## # A tibble: 214 x 7
## category language Toronto Montréal Vancouver Calgary Edmonton
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aboriginal langua~ Aboriginal la~ 80 30 70 20 25
## 2 Non-Official & No~ Afrikaans 985 90 1435 960 575
## 3 Non-Official & No~ Afro-Asiatic ~ 360 240 45 45 65
## 4 Non-Official & No~ Akan (Twi) 8485 1015 400 705 885
## 5 Non-Official & No~ Albanian 13260 2450 1090 1365 770
What is wrong with the untidy format above? The table on the left in Figure
3.8 represents the data in the “wide” (messy) format. From a data analysis
perspective, this format is not ideal because the values of the variable region
(Toronto, Montréal, Vancouver, Calgary and Edmonton) are stored as column
names. Thus they are not easily accessible to the data analysis functions we
will apply to our data set. Additionally, the mother tongue variable values are
spread across multiple columns, which will prevent us from doing any desired
visualization or statistical tasks until we combine them into one column. For
instance, suppose we want to know the languages with the highest number
of Canadians reporting it as their mother tongue among all five regions. This
question would be tough to answer with the data in its current format. We
could find the answer with the data in this format, though it would be much
easier to answer if we tidy our data first. If mother tongue were instead stored
as one column, as shown in the tidy data on the right in Figure 3.8, we could
simply use the max function in one line of code to get the maximum value.
FIGURE 3.8: Going from wide to long with the pivot_longer function.
Figure 3.9 details the arguments that we need to specify in the pivot_longer
function to accomplish this data transformation.
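A sketch of the call that Figure 3.9 describes, assuming the wide data frame was stored as lang_wide:
# combine the five city columns into a "region" column and a "mother_tongue" column
lang_mother_tidy <- pivot_longer(lang_wide,
  cols = Toronto:Edmonton,
  names_to = "region",
  values_to = "mother_tongue")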
lang_mother_tidy
## # A tibble: 1,070 x 4
## category language region mother_tongue
## <chr> <chr> <chr> <dbl>
## 1 Aboriginal languages Aboriginal lan~ Toronto 80
## 2 Aboriginal languages Aboriginal lan~ Montré~ 30
## 3 Aboriginal languages Aboriginal lan~ Vancou~ 70
## 4 Aboriginal languages Aboriginal lan~ Calgary 20
## 5 Aboriginal languages Aboriginal lan~ Edmont~ 25
## 6 Non-Official & Non-Aboriginal languages Afrikaans Toronto 985
## 7 Non-Official & Non-Aboriginal languages Afrikaans Montré~ 90
Note: In the code above, the call to the pivot_longer function is split across
several lines. This is allowed in certain cases; for example, when calling a
function as above, as long as the line ends with a comma , R knows to keep
reading on the next line. Splitting long lines like this across multiple lines is
encouraged as it helps significantly with code readability. Generally speaking,
you should limit each line of code to about 80 characters.
The data above is now tidy because all three criteria for tidy data have now
been met:
1. All the variables (category, language, region and mother_tongue) are now
their own columns in the data frame.
2. Each observation (i.e., each language in a region) is in a single row.
3. Each value is a single cell, i.e., its row, column position in the data
frame is not shared with another value.
To tidy this type of data in R, we can use the pivot_wider function. The
pivot_wider function generally increases the number of columns (widens) and
decreases the number of rows in a data set. To learn how to use pivot_wider, we
will work through an example with the region_lang_top5_cities_long.csv data
set. This data set contains the number of Canadians reporting the primary lan-
guage at home and work for five major cities (Toronto, Montréal, Vancouver,
Calgary and Edmonton).
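A sketch of reading the file named above (the data/ folder and the object name lang_long are assumptions for illustration):
lang_long <- read_csv("data/region_lang_top5_cities_long.csv")
lang_long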
## # A tibble: 2,140 x 5
## region category language type count
## <chr> <chr> <chr> <chr> <dbl>
## 1 Montréal Aboriginal languages Aboriginal languages, n.o.s. most_at_home 15
## 2 Montréal Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0
## 3 Toronto Aboriginal languages Aboriginal languages, n.o.s. most_at_home 50
## 4 Toronto Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0
## 5 Calgary Aboriginal languages Aboriginal languages, n.o.s. most_at_home 5
## 6 Calgary Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0
## 7 Edmonton Aboriginal languages Aboriginal languages, n.o.s. most_at_home 10
What makes the data set shown above untidy? In this example, each obser-
vation is a language in a region. However, each observation is split across
multiple rows: one where the count for most_at_home is recorded, and the other
where the count for most_at_work is recorded. Suppose the goal with this data
was to visualize the relationship between the number of Canadians reporting
their primary language at home and work. Doing that would be difficult with
this data in its current form, since these two variables are stored in the same
column. Figure 3.11 shows how this data will be tidied using the pivot_wider
function.
FIGURE 3.11: Going from long to wide with the pivot_wider function.
Figure 3.12 details the arguments that we need to specify in the pivot_wider
function.
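A sketch of the call that Figure 3.12 describes (the object names are assumptions):
# spread the "type" values into their own columns, filled with the counts
lang_home_tidy <- pivot_wider(lang_long,
  names_from = type,
  values_from = count)
lang_home_tidy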
## # A tibble: 1,070 x 5
## region category language most_at_home most_at_work
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Montréal Aboriginal languages Aboriginal langu~ 15 0
## 2 Toronto Aboriginal languages Aboriginal langu~ 50 0
## 3 Calgary Aboriginal languages Aboriginal langu~ 5 0
## 4 Edmonton Aboriginal languages Aboriginal langu~ 10 0
## 5 Vancouver Aboriginal languages Aboriginal langu~ 15 0
## 6 Montréal Non-Official & Non-Abo~ Afrikaans 10 0
## 7 Toronto Non-Official & Non-Abo~ Afrikaans 265 0
## 8 Calgary Non-Official & Non-Abo~ Afrikaans 505 15
## 9 Edmonton Non-Official & Non-Abo~ Afrikaans 300 0
## 10 Vancouver Non-Official & Non-Abo~ Afrikaans 520 10
## # ... with 1,060 more rows
The data above is now tidy! We can go through the three criteria again to
check that this data is a tidy data set.
1. All the statistical variables are their own columns in the data frame
(i.e., most_at_home, and most_at_work have been separated into their own
columns in the data frame).
2. Each observation (i.e., each language in a region) is in a single row.
3. Each value is a single cell (i.e., its row, column position in the data
frame is not shared with another value).
You might notice that we have the same number of columns in the tidy data
set as we did in the messy one. Therefore pivot_wider didn’t really “widen”
the data, as the name suggests. This is just because the original type column
only had two categories in it. If it had more than two, pivot_wider would have
created more columns, and we would see the data set “widen.”
## # A tibble: 214 x 7
## category language Toronto Montréal Vancouver Calgary Edmonton
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Aboriginal langu~ Aboriginal la~ 50/0 15/0 15/0 5/0 10/0
## 2 Non-Official & N~ Afrikaans 265/0 10/0 520/10 505/15 300/0
## 3 Non-Official & N~ Afro-Asiatic ~ 185/10 65/0 10/0 15/0 20/0
## 4 Non-Official & N~ Akan (Twi) 4045/20 440/0 125/10 330/0 445/0
## 5 Non-Official & N~ Albanian 6380/215 1445/20 530/10 620/25 370/10
## 6 Aboriginal langu~ Algonquian la~ 5/0 0/0 0/0 0/0 0/0
## 7 Aboriginal langu~ Algonquin 0/0 10/0 0/0 0/0 0/0
## 8 Non-Official & N~ American Sign~ 720/245 70/0 300/140 85/25 190/85
## 9 Non-Official & N~ Amharic 3820/55 315/0 540/10 2730/50 1695/35
## 10 Non-Official & N~ Arabic 45025/1~ 72980/1~ 8680/275 11010/~ 10590/3~
## # ... with 204 more rows
First we’ll use pivot_longer to create two columns, region and value, similar
to what we did previously. The new region column will contain the region
names, and the new column value will be a temporary holding place for the
data that we need to further separate, i.e., the number of Canadians reporting
their primary language at home and work.
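A sketch of that step, assuming the messy source data frame was read in as lang_messy:
lang_messy_longer <- pivot_longer(lang_messy,
  cols = Toronto:Edmonton,
  names_to = "region",
  values_to = "value")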
lang_messy_longer
## # A tibble: 1,070 x 4
## category language region value
## <chr> <chr> <chr> <chr>
## 1 Aboriginal languages Aboriginal languages,~ Toronto 50/0
## 2 Aboriginal languages Aboriginal languages,~ Montréal 15/0
## 3 Aboriginal languages Aboriginal languages,~ Vancouv~ 15/0
## 4 Aboriginal languages Aboriginal languages,~ Calgary 5/0
## 5 Aboriginal languages Aboriginal languages,~ Edmonton 10/0
## 6 Non-Official & Non-Aboriginal languages Afrikaans Toronto 265/0
## 7 Non-Official & Non-Aboriginal languages Afrikaans Montréal 10/0
## 8 Non-Official & Non-Aboriginal languages Afrikaans Vancouv~ 520/~
## 9 Non-Official & Non-Aboriginal languages Afrikaans Calgary 505/~
## 10 Non-Official & Non-Aboriginal languages Afrikaans Edmonton 300/0
## # ... with 1,060 more rows
Next we’ll use separate to split the value column into two columns. One column
will contain only the counts of Canadians that speak each language most at
home, and the other will contain the counts of Canadians that speak each
language most at work for each region. Figure 3.13 outlines what we need to
specify to use separate.
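A sketch of the call that Figure 3.13 describes:
# split the "value" column at the "/" into the home and work counts
tidy_lang <- separate(lang_messy_longer,
  col = value,
  into = c("most_at_home", "most_at_work"),
  sep = "/")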
tidy_lang
## # A tibble: 1,070 x 5
## category language region most_at_home most_at_work
## <chr> <chr> <chr> <chr> <chr>
## 1 Aboriginal languages Aboriginal langua~ Toronto 50 0
## 2 Aboriginal languages Aboriginal langua~ Montré~ 15 0
## 3 Aboriginal languages Aboriginal langua~ Vancou~ 15 0
## 4 Aboriginal languages Aboriginal langua~ Calgary 5 0
## 5 Aboriginal languages Aboriginal langua~ Edmont~ 10 0
## 6 Non-Official & Non-Abor~ Afrikaans Toronto 265 0
## 7 Non-Official & Non-Abor~ Afrikaans Montré~ 10 0
## 8 Non-Official & Non-Abor~ Afrikaans Vancou~ 520 10
## 9 Non-Official & Non-Abor~ Afrikaans Calgary 505 15
## 10 Non-Official & Non-Abor~ Afrikaans Edmont~ 300 0
## # ... with 1,060 more rows
Is this data set now tidy? If we recall the three criteria for tidy data:
• each row is a single observation,
• each column is a single variable, and
• each value is a single cell.
We can see that this data now satisfies all three criteria, making it easier to
analyze. But we aren’t done yet! Notice in the table above that the word <chr>
appears beneath each of the column names. The word under the column name
indicates the data type of each column. Here all of the variables are “character”
data types. Recall, character data types are letter(s) or digit(s) surrounded by
quotes. In the previous example in Section 3.4.2, the most_at_home
and most_at_work variables were <dbl> (double)—you can verify this by looking
at the tables in the previous sections—which is a type of numeric data. This
change is due to the delimiter (/) when we read in this messy data set. R
read these columns in as character types, and by default, separate will return
columns as character data types.
It makes sense for region, category, and language to be stored as a character (or
perhaps factor) type. However, suppose we want to apply any functions that
treat the most_at_home and most_at_work columns as a number (e.g., finding rows
above a numeric threshold of a column). In that case, it won’t be possible to
do if the variable is stored as a character. Fortunately, the separate function
provides a natural way to fix problems like this: we can set convert = TRUE to
convert the most_at_home and most_at_work columns to the correct data type.
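A sketch of the same call with convert = TRUE added:
tidy_lang <- separate(lang_messy_longer,
  col = value,
  into = c("most_at_home", "most_at_work"),
  sep = "/",
  convert = TRUE)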
tidy_lang
## # A tibble: 1,070 x 5
## category language region most_at_home most_at_work
## <chr> <chr> <chr> <int> <int>
## 1 Aboriginal languages Aboriginal langua~ Toronto 50 0
## 2 Aboriginal languages Aboriginal langua~ Montré~ 15 0
## 3 Aboriginal languages Aboriginal langua~ Vancou~ 15 0
## 4 Aboriginal languages Aboriginal langua~ Calgary 5 0
## 5 Aboriginal languages Aboriginal langua~ Edmont~ 10 0
## 6 Non-Official & Non-Abor~ Afrikaans Toronto 265 0
## 7 Non-Official & Non-Abor~ Afrikaans Montré~ 10 0
## 8 Non-Official & Non-Abor~ Afrikaans Vancou~ 520 10
## 9 Non-Official & Non-Abor~ Afrikaans Calgary 505 15
## 10 Non-Official & Non-Abor~ Afrikaans Edmont~ 300 0
Now we see <int> appears under the most_at_home and most_at_work columns,
indicating they are integer data types (i.e., numbers)!
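The select call that produced the next output was lost in extraction; a minimal sketch, writing out each column name as the following paragraph describes:
select(tidy_lang, language, region, most_at_home, most_at_work)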
## # A tibble: 1,070 x 4
## language region most_at_home most_at_work
## <chr> <chr> <int> <int>
## 1 Aboriginal languages, n.o.s. Toronto 50 0
## 2 Aboriginal languages, n.o.s. Montréal 15 0
## 3 Aboriginal languages, n.o.s. Vancouver 15 0
## 4 Aboriginal languages, n.o.s. Calgary 5 0
## 5 Aboriginal languages, n.o.s. Edmonton 10 0
## 6 Afrikaans Toronto 265 0
## 7 Afrikaans Montréal 10 0
## 8 Afrikaans Vancouver 520 10
## 9 Afrikaans Calgary 505 15
## 10 Afrikaans Edmonton 300 0
## # ... with 1,060 more rows
Here we wrote out the names of each of the columns. However, this method
is time-consuming, especially if you have a lot of columns! Another approach
is to use a “select helper”. Select helpers are operators that make it easier
for us to select columns. For instance, we can use a select helper to choose
a range of columns rather than typing each column name out. To do this,
we use the colon (:) operator to denote the range. For example, to get all
the columns in the tidy_lang data frame from language to most_at_work we pass
language:most_at_work as the second argument to the select function.
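A sketch of that call:
select(tidy_lang, language:most_at_work)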
## # A tibble: 1,070 x 4
## language region most_at_home most_at_work
## <chr> <chr> <int> <int>
## 1 Aboriginal languages, n.o.s. Toronto 50 0
## 2 Aboriginal languages, n.o.s. Montréal 15 0
## 3 Aboriginal languages, n.o.s. Vancouver 15 0
## 4 Aboriginal languages, n.o.s. Calgary 5 0
## 5 Aboriginal languages, n.o.s. Edmonton 10 0
## 6 Afrikaans Toronto 265 0
## 7 Afrikaans Montréal 10 0
## 8 Afrikaans Vancouver 520 10
## 9 Afrikaans Calgary 505 15
## 10 Afrikaans Edmonton 300 0
## # ... with 1,060 more rows
Notice that we get the same output as we did above, but with less (and clearer!)
code. This type of operator is especially handy for large data sets.
Suppose instead we wanted to extract columns that followed a particular pat-
tern rather than just selecting a range. For example, let’s say we wanted only
to select the columns most_at_home and most_at_work. There are other helpers
that allow us to select variables based on their names. In particular, we can
use the select helper starts_with to choose only the columns that start with
the word “most”:
select(tidy_lang, starts_with("most"))
## # A tibble: 1,070 x 2
## most_at_home most_at_work
## <int> <int>
## 1 50 0
## 2 15 0
## 3 15 0
## 4 5 0
## 5 10 0
## 6 265 0
## 7 10 0
## 8 520 10
## 9 505 15
## 10 300 0
## # ... with 1,060 more rows
We can also use the contains select helper to choose columns whose names contain a particular string; here, the columns whose names contain an underscore:
select(tidy_lang, contains("_"))
## # A tibble: 1,070 x 2
## most_at_home most_at_work
## <int> <int>
## 1 50 0
## 2 15 0
## 3 15 0
## 4 5 0
## 5 10 0
## 6 265 0
## 7 10 0
## 8 520 10
## 9 505 15
## 10 300 0
## # ... with 1,060 more rows
There are many different select helpers that select variables based on certain
criteria. The additional resources section at the end of this chapter provides
a comprehensive resource on select helpers.
an in-depth treatment of the variety of logical statements one can use in the
filter function to select subsets of rows.
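The filter call that produced the following output was lost in extraction; a minimal sketch, assuming the object name official_langs that the rest of the chapter uses:
official_langs <- filter(tidy_lang, category == "Official languages")
official_langs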
## # A tibble: 10 x 5
## category language region most_at_home most_at_work
## <chr> <chr> <chr> <int> <int>
## 1 Official languages English Toronto 3836770 3218725
## 2 Official languages English Montréal 620510 412120
## 3 Official languages English Vancouver 1622735 1330555
## 4 Official languages English Calgary 1065070 844740
## 5 Official languages English Edmonton 1050410 792700
## 6 Official languages French Toronto 29800 11940
## 7 Official languages French Montréal 2669195 1607550
## 8 Official languages French Vancouver 8630 3245
## 9 Official languages French Calgary 8630 2140
## 10 Official languages French Edmonton 10950 2520
What if we want all the other language categories in the data set except for
those in the "Official languages" category? We can accomplish this with the !=
operator, which means “not equal to”. So if we want to find all the rows where
the category does not equal "Official languages" we write the code below.
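A sketch of that code:
filter(tidy_lang, category != "Official languages")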
## # A tibble: 1,060 x 5
## category language region most_at_home most_at_work
Suppose now we want to look at only the rows for the French language in
Montréal. To do this, we need to filter the data set to find rows that satisfy
multiple conditions simultaneously. We can do this with the comma symbol
(,), which in the case of filter is interpreted by R as “and”. We write the
code as shown below to filter the official_langs data frame to subset the rows
where region == "Montréal" and the language == "French".
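A sketch of that call:
filter(official_langs, region == "Montréal", language == "French")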
## # A tibble: 1 x 5
## category language region most_at_home most_at_work
## <chr> <chr> <chr> <int> <int>
## 1 Official languages French Montréal 2669195 1607550
We can also use the ampersand (&) logical operator, which gives us cases
where both one condition and another condition are satisfied. You can use
either comma (,) or ampersand (&) in the filter function interchangeably.
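For example, a sketch of the equivalent call using &:
filter(official_langs, region == "Montréal" & language == "French")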
## # A tibble: 1 x 5
## category language region most_at_home most_at_work
## <chr> <chr> <chr> <int> <int>
## 1 Official languages French Montréal 2669195 1607550
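The explanation and code for the next example were lost in extraction; judging from the rows shown, it keeps the rows for either Calgary or Edmonton, for instance with the | (“or”) operator:
filter(official_langs, region == "Calgary" | region == "Edmonton")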
## # A tibble: 4 x 5
## category language region most_at_home most_at_work
## <chr> <chr> <chr> <int> <int>
## 1 Official languages English Calgary 1065070 844740
## 2 Official languages English Edmonton 1050410 792700
## 3 Official languages French Calgary 8630 2140
## 4 Official languages French Edmonton 10950 2520
Next, suppose we want to see the populations of our five cities. Let’s read
in the region_data.csv file that comes from the 2016 Canadian census, as it
contains statistics for number of households, land area, population and number
of dwellings for different regions.
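A sketch of reading that file (the data/ folder and the object name region_data are assumptions):
region_data <- read_csv("data/region_data.csv")
region_data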
## # A tibble: 35 x 5
## region households area population dwellings
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Belleville 43002 1355. 103472 45050
## 2 Lethbridge 45696 3047. 117394 48317
## 3 Thunder Bay 52545 2618. 121621 57146
## 4 Peterborough 50533 1637. 121721 55662
## 5 Saint John 52872 3793. 126202 58398
## 6 Brantford 52530 1086. 134203 54419
## 7 Moncton 61769 2625. 144810 66699
## 8 Guelph 59280 604. 151984 63324
## 9 Trois-Rivières 72502 1053. 156042 77734
To get the population of the five cities we can filter the data set using the %in%
operator. The %in% operator is used to see if an element belongs to a vector.
Here we are filtering for rows where the value in the region column matches any
of the five cities we are interested in: Toronto, Montréal, Vancouver, Calgary,
and Edmonton.
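A sketch of that filtering step (the vector name city_names is an assumption):
city_names <- c("Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton")
filter(region_data, region %in% city_names)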
## # A tibble: 5 x 5
## region households area population dwellings
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Edmonton 502143 9858. 1321426 537634
## 2 Calgary 519693 5242. 1392609 544870
## 3 Vancouver 960894 3040. 2463431 1027613
## 4 Montréal 1727310 4638. 4098927 1823281
## 5 Toronto 2135909 6270. 5928040 2235145
Note: What’s the difference between == and %in%? Suppose we have two vec-
tors, vectorA and vectorB. If you type vectorA == vectorB into R it will compare
the vectors element by element. R checks if the first element of vectorA equals
the first element of vectorB, the second element of vectorA equals the second el-
ement of vectorB, and so on. On the other hand, vectorA %in% vectorB compares
the first element of vectorA to all the elements in vectorB. Then the second el-
ement of vectorA is compared to all the elements in vectorB, and so on. Notice
the difference between == and %in% in the example below.
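An illustrative comparison (a small example of our own, not necessarily the one shown in the original):
c("Vancouver", "Toronto") == c("Toronto", "Vancouver")    # FALSE FALSE (compared element by element)
c("Vancouver", "Toronto") %in% c("Toronto", "Vancouver")  # TRUE TRUE (membership check)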
3.6.6 Extracting rows above or below a threshold using > and <
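For example, we can find which official languages are reported as a primary language at home by more people than French in Montréal (2,669,195 people) by comparing the most_at_home column against that value. A minimal sketch of such a filter call:

filter(official_langs, most_at_home > 2669195)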
## # A tibble: 1 x 5
## category language region most_at_home most_at_work
## <chr> <chr> <chr> <int> <int>
## 1 Official languages English Toronto 3836770 3218725
filter returns a data frame with only one row, indicating that when consider-
ing the official languages, only English in Toronto is reported by more people
as their primary language at home than French in Montréal according to the
2016 Canadian census.
3.7 Using mutate to modify or add columns
official_langs_chr
## # A tibble: 10 x 5
## category language region most_at_home most_at_work
## <chr> <chr> <chr> <chr> <chr>
## 1 Official languages English Toronto 3836770 3218725
## 2 Official languages English Montréal 620510 412120
## 3 Official languages English Vancouver 1622735 1330555
## 4 Official languages English Calgary 1065070 844740
## 5 Official languages English Edmonton 1050410 792700
## 6 Official languages French Toronto 29800 11940
## 7 Official languages French Montréal 2669195 1607550
## 8 Official languages French Vancouver 8630 3245
## 9 Official languages French Calgary 8630 2140
## 10 Official languages French Edmonton 10950 2520
To use mutate, again we first specify the data set in the first argument, and
in the following arguments, we specify the name of the column we want to
modify or create (here most_at_home and most_at_work), an = sign, and then
the function we want to apply (here as.numeric). In the function we want
to apply, we refer directly to the column name upon which we want it to
act (here most_at_home and most_at_work). In our example, we are naming the
columns the same names as columns that already exist in the data frame
(“most_at_home”, “most_at_work”) and this will cause mutate to overwrite
those columns (also referred to as modifying those columns in-place). If we
were to give the columns a new name, then mutate would create new columns
with the names we specified. mutate’s general syntax is detailed in Figure 3.14.
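A sketch of this mutate call:

official_langs_numeric <- mutate(official_langs_chr,
                                 most_at_home = as.numeric(most_at_home),
                                 most_at_work = as.numeric(most_at_work))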
official_langs_numeric
## # A tibble: 10 x 5
## category language region most_at_home most_at_work
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Official languages English Toronto 3836770 3218725
## 2 Official languages English Montréal 620510 412120
## 3 Official languages English Vancouver 1622735 1330555
## 4 Official languages English Calgary 1065070 844740
## 5 Official languages English Edmonton 1050410 792700
## 6 Official languages French Toronto 29800 11940
## 7 Official languages French Montréal 2669195 1607550
## 8 Official languages French Vancouver 8630 3245
## 9 Official languages French Calgary 8630 2140
## 10 Official languages French Edmonton 10950 2520
Now we see <dbl> appears under the most_at_home and most_at_work columns,
indicating they are double data types (which is a numeric data type)!
To create a vector containing the population values for the five cities (Toronto,
Montréal, Vancouver, Calgary, Edmonton), we will use the c function (recall
that c stands for “concatenate”):
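A sketch of this step (the object name city_pops is an assumption):

# populations of Toronto, Montréal, Vancouver, Calgary, and Edmonton,
# in the same order as the cities listed above
city_pops <- c(5928040, 4098927, 2463431, 1392609, 1321426)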
And next, we will filter the official_langs data frame so that we only keep the
rows where the language is English. We will name the new data frame we get
from this english_langs:
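A sketch of this step:

english_langs <- filter(official_langs, language == "English")
english_langs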
## # A tibble: 5 x 5
## category language region most_at_home most_at_work
## <chr> <chr> <chr> <int> <int>
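The proportion column can then be computed by dividing most_at_home by the population vector. A sketch (the new column name and the use of the city_pops vector from above are assumptions):

english_langs <- mutate(english_langs,
                        most_at_home_proportion = most_at_home / city_pops)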
english_langs
## # A tibble: 5 x 6
## category language region most_at_home most_at_work most_at_home_pr~
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 Official languages English Toronto 3836770 3218725 0.647
## 2 Official languages English Montré~ 620510 412120 0.151
## 3 Official languages English Vancou~ 1622735 1330555 0.659
## 4 Official languages English Calgary 1065070 844740 0.765
## 5 Official languages English Edmont~ 1050410 792700 0.795
Note: In more advanced data wrangling, one might solve this problem in a less
error-prone way by using a technique called “joins.” We link to resources
that discuss this in the additional resources at the end of this chapter.
3.8 Combining functions using the pipe operator, |>
One way of performing these three steps is to just write multiple lines of code,
storing temporary objects as you go:
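The specific wrangling steps below are just placeholders to illustrate the pattern:

output_1 <- mutate(official_langs, most_at_home_thousands = most_at_home / 1000)
output_2 <- filter(output_1, region == "Vancouver")
final_result <- select(output_2, language, most_at_home_thousands)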
This is difficult to understand for multiple reasons. The reader may be tricked
into thinking the named output_1 and output_2 objects are important for some
reason, while they are just temporary intermediate computations. Further, the
reader has to look through and find where output_1 and output_2 are used in
each subsequent line.
Another option for doing this would be to compose the functions:
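Using the same placeholder steps as above:

final_result <- select(
  filter(
    mutate(official_langs, most_at_home_thousands = most_at_home / 1000),
    region == "Vancouver"),
  language, most_at_home_thousands)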
Code like this can also be difficult to understand. Functions compose (reading
from left to right) in the opposite order in which they are computed by R
(above, mutate happens first, then filter, then select). It is also just a really
long line of code to read in one go.
The pipe operator (|>) solves this problem, resulting in cleaner and easier-to-
follow code. |> is built into R so you don’t need to load any packages to use
it. You can think of the pipe as a physical pipe. It takes the output from the
function on the left-hand side of the pipe, and passes it as the first argument to
the function on the right-hand side of the pipe. The code below accomplishes
the same thing as the previous two code blocks:
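Again with the same placeholder steps:

final_result <- official_langs |>
  mutate(most_at_home_thousands = most_at_home / 1000) |>
  filter(region == "Vancouver") |>
  select(language, most_at_home_thousands)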
Note: You might also have noticed that we split the function calls across
lines after the pipe, similar to when we did this earlier in the chapter for long
function calls. Again, this is allowed and recommended, especially when the
piped function calls create a long line of code. Doing this makes your code
more readable. When you do this, it is important to end each line with the
pipe operator |> to tell R that your code is continuing onto the next line.
Note: In this textbook, we will be using the base R pipe operator syntax,
|>. This base R |> pipe operator was inspired by a previous version of the
pipe operator, %>%. The %>% pipe operator is not built into R and is from the
magrittr R package. The tidyverse metapackage imports the %>% pipe operator
via dplyr (which in turn imports the magrittr R package). There are some
other differences between %>% and |> related to more advanced R uses, such
as sharing and distributing code as R packages; however, these are beyond
the scope of this textbook. We have this note in the book to make the reader
aware that %>% exists as it is still commonly used in data analysis code and in
many data science books and other resources. In most cases these two pipes
are interchangeable and either can be used.
Let’s work with the tidy tidy_lang data set from Section 3.4.3, which contains
the number of Canadians reporting their primary language at home and work
for five major cities (Toronto, Montréal, Vancouver, Calgary, and Edmonton):
tidy_lang
## # A tibble: 1,070 x 5
## category language region most_at_home most_at_work
## <chr> <chr> <chr> <int> <int>
Suppose we want to create a subset of the data with only the languages and
counts of each language spoken most at home for the city of Vancouver. To do
this, we can use the functions filter and select. First, we use filter to create
a data frame called van_data that contains only values for Vancouver.
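A sketch of this step:

van_data <- filter(tidy_lang, region == "Vancouver")
van_data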
## # A tibble: 214 x 5
## category language region most_at_home most_at_work
## <chr> <chr> <chr> <int> <int>
## 1 Aboriginal languages Aboriginal languag~ Vancou~ 15 0
## 2 Non-Official & Non-Abo~ Afrikaans Vancou~ 520 10
## 3 Non-Official & Non-Abo~ Afro-Asiatic langu~ Vancou~ 10 0
## 4 Non-Official & Non-Abo~ Akan (Twi) Vancou~ 125 10
## 5 Non-Official & Non-Abo~ Albanian Vancou~ 530 10
## 6 Aboriginal languages Algonquian languag~ Vancou~ 0 0
## 7 Aboriginal languages Algonquin Vancou~ 0 0
## 8 Non-Official & Non-Abo~ American Sign Lang~ Vancou~ 300 140
## 9 Non-Official & Non-Abo~ Amharic Vancou~ 540 10
## 10 Non-Official & Non-Abo~ Arabic Vancou~ 8680 275
## # ... with 204 more rows
We then use select on this data frame to keep only the variables we want:
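For example:

select(van_data, language, most_at_home)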
## # A tibble: 214 x 2
## language most_at_home
## <chr> <int>
## 1 Aboriginal languages, n.o.s. 15
## 2 Afrikaans 520
## 3 Afro-Asiatic languages, n.i.e. 10
## 4 Akan (Twi) 125
## 5 Albanian 530
## 6 Algonquian languages, n.i.e. 0
## 7 Algonquin 0
## 8 American Sign Language 300
## 9 Amharic 540
## 10 Arabic 8680
## # ... with 204 more rows
Although this is valid code, there is a more readable approach we could take
by using the pipe, |>. With the pipe, we do not need to create an intermediate
object to store the output from filter. Instead, we can directly send the output
of filter to the input of select:
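A sketch of the piped version:

van_data_selected <- filter(tidy_lang, region == "Vancouver") |>
  select(language, most_at_home)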
van_data_selected
## # A tibble: 214 x 2
## language most_at_home
## <chr> <int>
## 1 Aboriginal languages, n.o.s. 15
## 2 Afrikaans 520
## 3 Afro-Asiatic languages, n.i.e. 10
## 4 Akan (Twi) 125
## 5 Albanian 530
## 6 Algonquian languages, n.i.e. 0
## 7 Algonquin 0
## 8 American Sign Language 300
## 9 Amharic 540
## 10 Arabic 8680
## # ... with 204 more rows
But wait…Why do the select function calls look different in these two exam-
ples? Remember: when you use the pipe, the output of the first function is
automatically provided as the first argument for the function that comes after
it. Therefore you do not specify the first argument in that function call. In
the code above, the pipe passes the left-hand side (the output of filter) to
the first argument of the function on the right (select), so in the select func-
tion you only see the second argument (and beyond). As you can see, both of
these approaches—with and without pipes—give us the same output, but the
second approach is clearer and more readable.
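For example, we can build a data frame of the region-language pairs with large counts of primary languages spoken at home. A sketch of such a pipeline (the 10,000 threshold and the arrange step are inferred from the output below and are assumptions):

large_region_lang <- filter(tidy_lang, most_at_home > 10000) |>
  select(region, language, most_at_home) |>
  arrange(most_at_home)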
large_region_lang
## # A tibble: 67 x 3
## region language most_at_home
## <chr> <chr> <int>
## 1 Edmonton Arabic 10590
## 2 Montréal Tamil 10670
## 3 Vancouver Russian 10795
## 4 Edmonton Spanish 10880
## 5 Edmonton French 10950
## 6 Calgary Arabic 11010
## 7 Calgary Urdu 11060
## 8 Vancouver Hindi 11235
## 9 Montréal Armenian 11835
You will notice above that we passed tidy_lang as the first argument of the
filter function. We can also pipe the data frame into the same sequence of
functions rather than using it as the first argument of the first function. These
two choices are equivalent, and we get the same result.
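A sketch of the piped-in version (same assumptions as above):

large_region_lang <- tidy_lang |>
  filter(most_at_home > 10000) |>
  select(region, language, most_at_home) |>
  arrange(most_at_home)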
large_region_lang
## # A tibble: 67 x 3
## region language most_at_home
## <chr> <chr> <int>
## 1 Edmonton Arabic 10590
## 2 Montréal Tamil 10670
## 3 Vancouver Russian 10795
## 4 Edmonton Spanish 10880
## 5 Edmonton French 10950
## 6 Calgary Arabic 11010
## 7 Calgary Urdu 11060
## 8 Vancouver Hindi 11235
## 9 Montréal Armenian 11835
## 10 Toronto Romanian 12200
## # ... with 57 more rows
Now that we’ve shown you the pipe operator as an alternative to storing
temporary objects and composing code, does this mean you should never store
temporary objects or compose code? Not necessarily! There are times when
you will still want to do these things. For example, you might store a temporary
object before feeding it into a plot function so you can iteratively change the
plot without having to redo all of your data transformations. Additionally,
piping many functions can be overwhelming and difficult to debug; you may
want to store a temporary object midway through to inspect your result before
moving on with further steps.
3.9 Aggregating data with summarize and map
region_lang
## # A tibble: 7,490 x 7
## region category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 St. Jo~ Aborigi~ Aborigin~ 5 0 0 0
## 2 Halifax Aborigi~ Aborigin~ 5 0 0 0
## 3 Moncton Aborigi~ Aborigin~ 0 0 0 0
## 4 Saint ~ Aborigi~ Aborigin~ 0 0 0 0
## 5 Saguen~ Aborigi~ Aborigin~ 5 5 0 0
## 6 Québec Aborigi~ Aborigin~ 0 5 0 20
## 7 Sherbr~ Aborigi~ Aborigin~ 0 0 0 0
## 8 Trois-~ Aborigi~ Aborigin~ 0 0 0 0
summarize(region_lang,
min_most_at_home = min(most_at_home),
max_most_at_home = max(most_at_home))
## # A tibble: 1 x 2
## min_most_at_home max_most_at_home
## <dbl> <dbl>
## 1 0 3836770
From this we see that there are some languages in the data set that no one
speaks as their primary language at home. We also see that the most commonly
spoken primary language at home is spoken by 3,836,770 people.
In data frames in R, the value NA is often used to denote missing data. Many
of the base R statistical summary functions (e.g., max, min, mean, sum, etc) will
return NA when applied to columns containing NA values. Usually that is not
what we want to happen; instead, we would usually like R to ignore the missing
entries and calculate the summary statistic using all of the other non-NA values
in the column. Fortunately many of these functions provide an argument na.rm
that lets us tell the function what to do when it encounters NA values. In
particular, if we specify na.rm = TRUE, the function will ignore missing values
and return a summary of all the non-missing entries. We show an example of
this combined with summarize below.
First we create a new version of the region_lang data frame, named re-
gion_lang_na, that has a seemingly innocuous NA in the first row of the
most_at_home column:
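One way to construct such a data frame (a sketch; the original construction may differ):

region_lang_na <- region_lang |>
  mutate(most_at_home = replace(most_at_home, 1, NA))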
region_lang_na
## # A tibble: 7,490 x 7
## region category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
Now if we apply the summarize function as above, we see that we no longer get
the minimum and maximum returned, but just an NA instead!
summarize(region_lang_na,
min_most_at_home = min(most_at_home),
max_most_at_home = max(most_at_home))
## # A tibble: 1 x 2
## min_most_at_home max_most_at_home
## <dbl> <dbl>
## 1 NA NA
summarize(region_lang_na,
min_most_at_home = min(most_at_home, na.rm = TRUE),
max_most_at_home = max(most_at_home, na.rm = TRUE))
## # A tibble: 1 x 2
## min_most_at_home max_most_at_home
## <dbl> <dbl>
## 1 0 3836770
The group_by function takes at least two arguments. The first is the data frame
that will be grouped, and the second and onwards are columns to use in the
grouping. Here we use only one column for grouping (region), but more than
one can also be used. To do this, list additional columns separated by commas.
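A sketch of grouping by region and then computing the same summaries as before:

group_by(region_lang, region) |>
  summarize(min_most_at_home = min(most_at_home),
            max_most_at_home = max(most_at_home))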
## # A tibble: 35 x 3
## region min_most_at_home max_most_at_home
## <chr> <dbl> <dbl>
## 1 Abbotsford - Mission 0 137445
## 2 Barrie 0 182390
## 3 Belleville 0 97840
## 4 Brantford 0 124560
## 5 Calgary 0 1065070
## 6 Edmonton 0 1050410
## 7 Greater Sudbury 0 133960
## 8 Guelph 0 130950
## 9 Halifax 0 371215
## 10 Hamilton 0 630380
## # ... with 25 more rows
Notice that group_by on its own doesn’t change the way the data looks. In the
output below, the grouped data set looks the same, and it doesn’t appear to
be grouped by region. Instead, group_by simply changes how other functions
work with the data, as we saw with summarize above.
group_by(region_lang, region)
## # A tibble: 7,490 x 7
## # Groups: region [35]
## region category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 St. Jo~ Aborigi~ Aborigin~ 5 0 0 0
## 2 Halifax Aborigi~ Aborigin~ 5 0 0 0
## 3 Moncton Aborigi~ Aborigin~ 0 0 0 0
## 4 Saint ~ Aborigi~ Aborigin~ 0 0 0 0
## 5 Saguen~ Aborigi~ Aborigin~ 5 5 0 0
## 6 Québec Aborigi~ Aborigin~ 0 5 0 20
## 7 Sherbr~ Aborigi~ Aborigin~ 0 0 0 0
## 8 Trois-~ Aborigi~ Aborigin~ 0 0 0 0
## 9 Montré~ Aborigi~ Aborigin~ 30 15 0 10
## 10 Kingst~ Aborigi~ Aborigin~ 0 0 0 0
## # ... with 7,480 more rows
To summarize statistics across many columns, we can use the summarize func-
tion we have just recently learned about. However, in such a case, using summa-
rize alone means that we have to type out the name of each column we want
to summarize. To do this more efficiently, we can pair summarize with across
and use a colon : to specify a range of columns we would like to perform the
statistical summaries on. Here we demonstrate finding the maximum value of
each of the numeric columns of the region_lang data set.
region_lang |>
summarize(across(mother_tongue:lang_known, max))
## # A tibble: 1 x 4
## mother_tongue most_at_home most_at_work lang_known
## <dbl> <dbl> <dbl> <dbl>
## 1 3061820 3836770 3218725 5600480
region_lang_na |>
summarize(across(mother_tongue:lang_known, max, na.rm = TRUE))
## # A tibble: 1 x 4
## mother_tongue most_at_home most_at_work lang_known
## <dbl> <dbl> <dbl> <dbl>
## 1 3061820 3836770 3218725 5600480
The map function takes two arguments: the object whose elements you
want to apply the function to, and the function that you would like to apply
to each column. Note that map does not have an argument to specify which
columns to apply the function to. Therefore, we will use the select function
before calling map to choose the columns for which we want the maximum.
region_lang |>
select(mother_tongue:lang_known) |>
map(max)
## $mother_tongue
## [1] 3061820
##
## $most_at_home
## [1] 3836770
##
## $most_at_work
## [1] 3218725
##
## $lang_known
## [1] 5600480
Note: The map function comes from the purrr package. But since purrr is part
of the tidyverse, once we call library(tidyverse) we do not need to load the
purrr package separately.
The output looks a bit weird… we passed in a data frame, but the output
doesn’t look like a data frame. As it so happens, it is not a data frame, but
rather a plain list:
region_lang |>
select(mother_tongue:lang_known) |>
map(max) |>
typeof()
## [1] "list"
There are quite a few map functions to choose from; they all work similarly, but their name reflects the type of
output you want from the mapping operation. Table 3.3 lists the commonly
used map functions as well as their output type.
Let’s get the columns’ maximums again, but this time use the map_dfr function
to return the output as a data frame:
region_lang |>
select(mother_tongue:lang_known) |>
map_dfr(max)
## # A tibble: 1 x 4
## mother_tongue most_at_home most_at_work lang_known
## <dbl> <dbl> <dbl> <dbl>
## 1 3061820 3836770 3218725 5600480
Note: Similar to when we use base R statistical summary functions (e.g., max,
min, mean, sum, etc.) with summarize, map functions paired with base R statistical
summary functions also return NA values when we apply them to columns that
contain NA values.
To avoid this, again we need to add the argument na.rm = TRUE. When we use
this with map, we do so by adding a comma followed by na.rm = TRUE after specifying
the function, as illustrated below:
region_lang_na |>
select(mother_tongue:lang_known) |>
map_dfr(max, na.rm = TRUE)
## # A tibble: 1 x 4
The map functions are generally quite useful for solving many problems involv-
ing repeatedly applying functions in R. Additionally, their use is not limited to
columns of a data frame; map family functions can be used to apply functions
to elements of a vector, or a list, and even to lists of (nested!) data frames. To
learn more about the map functions, see the additional resources section at the
end of this chapter.
3.10 Apply functions across many columns with mutate and across
FIGURE 3.18: mutate and across is useful for applying functions across many
columns. The darker, top row of each table represents the column headers.
For example, imagine that we wanted to convert all the numeric columns in the
region_lang data frame from double type to integer type using the as.integer
function. When we revisit the region_lang data frame, we can see that this
would be the columns from mother_tongue to lang_known.
region_lang
## # A tibble: 7,490 x 7
## region category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 St. Jo~ Aborigi~ Aborigin~ 5 0 0 0
## 2 Halifax Aborigi~ Aborigin~ 5 0 0 0
To accomplish such a task, we can use mutate paired with across. This works
in a similar way for column selection, as we saw when we used summarize +
across earlier. As we did above, we again use across to specify the columns
using select syntax as well as the function we want to apply on the specified
columns. However, a key difference here is that we are using mutate, which
means that we get back a data frame with the same number of columns and
rows. The only thing that changes is the transformation we applied to the
specified columns (here mother_tongue to lang_known).
region_lang |>
mutate(across(mother_tongue:lang_known, as.integer))
## # A tibble: 7,490 x 7
## region category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 St. Jo~ Aborigi~ Aborigin~ 5 0 0 0
## 2 Halifax Aborigi~ Aborigin~ 5 0 0 0
## 3 Moncton Aborigi~ Aborigin~ 0 0 0 0
## 4 Saint ~ Aborigi~ Aborigin~ 0 0 0 0
## 5 Saguen~ Aborigi~ Aborigin~ 5 5 0 0
## 6 Québec Aborigi~ Aborigin~ 0 5 0 20
## 7 Sherbr~ Aborigi~ Aborigin~ 0 0 0 0
## 8 Trois-~ Aborigi~ Aborigin~ 0 0 0 0
## 9 Montré~ Aborigi~ Aborigin~ 30 15 0 10
## 10 Kingst~ Aborigi~ Aborigin~ 0 0 0 0
## # ... with 7,480 more rows
3.11 Apply functions across columns within one row with rowwise
and mutate
What if you want to apply a function across columns but within one row? We
illustrate such a data transformation in Figure 3.19.
FIGURE 3.19: rowwise and mutate is useful for applying functions across
columns within one row. The darker, top row of each table represents the
column headers.
region_lang |>
select(mother_tongue:lang_known)
## # A tibble: 7,490 x 4
## mother_tongue most_at_home most_at_work lang_known
## <dbl> <dbl> <dbl> <dbl>
## 1 5 0 0 0
## 2 5 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 5 5 0 0
## 6 0 5 0 20
## 7 0 0 0 0
## 8 0 0 0 0
## 9 30 15 0 10
## 10 0 0 0 0
## # ... with 7,480 more rows
Now we apply rowwise before mutate, to tell R that we would like the mutate
function to be applied across, and within, a row, as opposed to being applied
on a column (which is the default behavior of mutate):
region_lang |>
select(mother_tongue:lang_known) |>
rowwise() |>
mutate(maximum = max(c(mother_tongue,
most_at_home,
most_at_work,
lang_known)))
## # A tibble: 7,490 x 5
## # Rowwise:
## mother_tongue most_at_home most_at_work lang_known maximum
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 0 0 0 5
## 2 5 0 0 0 5
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 5 5 0 0 5
## 6 0 5 0 20 20
## 7 0 0 0 0 0
## 8 0 0 0 0 0
## 9 30 15 0 10 30
## 10 0 0 0 0 0
## # ... with 7,480 more rows
We see that we get an additional column added to the data frame, named
maximum, which is the maximum value between mother_tongue, most_at_home,
most_at_work and lang_known for each language and region.
Note what happens if we do not include rowwise before mutate: the max function is then applied to the entire columns at once, rather than within each row, so every row ends up with the same overall maximum value.
region_lang |>
select(mother_tongue:lang_known) |>
mutate(maximum = max(c(mother_tongue,
most_at_home,
most_at_work,
lang_known)))
## # A tibble: 7,490 x 5
## mother_tongue most_at_home most_at_work lang_known maximum
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 0 0 0 5600480
## 2 5 0 0 0 5600480
## 3 0 0 0 0 5600480
## 4 0 0 0 0 5600480
## 5 5 5 0 0 5600480
## 6 0 5 0 20 5600480
## 7 0 0 0 0 5600480
## 8 0 0 0 0 5600480
## 9 30 15 0 10 5600480
## 10 0 0 0 0 5600480
## # ... with 7,480 more rows
3.12 Summary
Cleaning and wrangling data can be a very time-consuming process. However,
it is a critical step in any data analysis. We have explored many different
functions for cleaning and wrangling data into a tidy format. Table 3.4 sum-
marizes some of the key wrangling functions we learned in this chapter. In the
following chapters, you will learn how you can take this tidy data and do so
much more with it to answer your burning data science questions!
Function Description
across allows you to apply function(s) to multiple columns
filter subsets rows of a data frame
group_by allows you to apply function(s) to groups of rows
mutate adds or modifies columns in a data frame
map general iteration function
pivot_longer generally makes the data frame longer and narrower
pivot_wider generally makes a data frame wider and decreases the number of rows
rowwise applies functions across columns within one row
separate splits up a character column into multiple columns
select subsets columns of a data frame
summarize calculates summaries of inputs
3.13 Exercises
Practice exercises for the material covered in this chapter can be found in the
accompanying worksheets repository1 in the “Cleaning and wrangling data”
row. You can launch an interactive version of the worksheet in your browser
by clicking the “launch binder” button. You can also preview a non-interactive
version of the worksheet by clicking “view worksheet.” If you instead decide to
download the worksheet and run it on your own machine, make sure to follow
the instructions for computer setup found in Chapter 13. This will ensure
that the automated feedback and guidance that the worksheets provide will
function as intended.
1 https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme
3.14 Additional resources
• To learn more about the functions in this chapter and meet a few more useful
functions, we recommend you check out Chapters 5-9 of the STAT545 online
notes2 of the data wrangling, exploration, and analysis with R book.
• The dplyr R package documentation3 [Wickham et al., 2021b] is another
resource to learn more about the functions in this chapter, the full set of
arguments you can use, and other related functions. The site also provides a
very nice cheat sheet that summarizes many of the data wrangling functions
from this chapter.
• Check out the tidyselect R package page4 [Henry and Wickham, 2021] for
a comprehensive list of select helpers. These helpers can be used to choose
columns in a data frame when paired with the select function (and other
functions that use the tidyselect syntax, such as pivot_longer). The docu-
mentation for select helpers5 is a useful reference to find the helper you
need for your particular problem.
• R for Data Science [Wickham and Grolemund, 2016] has a few chapters
related to data wrangling that go into more depth than this book. For ex-
ample, the tidy data chapter6 covers tidy data, pivot_longer/pivot_wider and
separate, but also covers missing values and additional wrangling functions
(like unite). The data transformation chapter7 covers select, filter, arrange,
mutate, and summarize. And the map functions chapter8 provides more about
the map functions.
• You will occasionally encounter a case where you need to iterate over items
in a data frame, but none of the above functions are flexible enough to do
what you want. In that case, you may consider using a for loop9 .
2 https://stat545.com/
3 https://dplyr.tidyverse.org/
4 https://tidyselect.r-lib.org/index.html
5 https://tidyselect.r-lib.org/reference/select_helpers.html
6 https://r4ds.had.co.nz/tidy-data.html
7 https://r4ds.had.co.nz/transform.html
8 https://r4ds.had.co.nz/iteration.html#the-map-functions
9 https://r4ds.had.co.nz/iteration.html#iteration
4
Effective data visualization
4.1 Overview
This chapter will introduce concepts and tools relating to data visualization
beyond what we have seen and practiced so far. We will focus on guiding
principles for effective data visualization and explaining visualizations inde-
pendent of any particular tool or programming language. In the process, we
will cover some specifics of creating visualizations (scatter plots, bar plots, line
plots, and histograms) for data using R.
• Use the ggplot2 package in R to create and refine the above visualizations
using:
– geometric objects: geom_point, geom_line, geom_histogram, geom_bar,
geom_vline, geom_hline
– scales: xlim, ylim
– aesthetic mappings: x, y, fill, color, shape
– labeling: xlab, ylab, labs
– font control and legend positioning: theme
– subplots: facet_grid
• Describe the difference in raster and vector output formats.
• Use ggsave to save visualizations in .png and .svg format.
4.3 Choosing the visualization
FIGURE 4.1: Examples of scatter, line and bar plots, as well as histograms.
All types of visualization have their (mis)uses, but three kinds are usually
hard to understand or are easily replaced with an oft-better alternative. In
particular, you should avoid pie charts; it is generally better to use bars, as
it is easier to compare bar heights than pie slice sizes. You should also not use
3-D visualizations, as they are typically hard to understand when converted
to a static 2-D image format. Finally, do not use tables to make numerical
comparisons; humans are much better at quickly processing visual information
than text and math. Bar plots are again typically a better alternative.
Just being able to make a visualization in R (or any other language, for that
matter) doesn’t mean that it effectively communicates your message to others.
Once you have selected a broad type of visualization to use, you will have
to refine it to suit your particular need. Some rules of thumb for doing this
are listed below. They generally fall into two classes: you want to make your
visualization convey your message, and you want to reduce visual noise as much
as possible. Humans have limited cognitive ability to process information; both
of these types of refinement aim to reduce the mental load on your audience
when viewing your visualization, making it easier for them to understand and
remember your message quickly.
Convey the message
• Make sure the visualization answers the question you have asked as simply
and plainly as possible.
• Use legends and labels so that your visualization is understandable without
reading the surrounding text.
• Ensure the text, symbols, lines, etc., on your visualization are big enough to
be easily read.
• Ensure the data are clearly visible; don’t hide the shape/distribution of the
data behind other objects (e.g., a bar).
• Make sure to use color schemes that are understandable by those with color-
blindness (a surprisingly large fraction of the overall population—from about
1% to 10%, depending on sex and ancestry [Deeb, 2005]). For example, Col-
orBrewer1 and the RColorBrewer R package2 [Neuwirth, 2014]
provide the ability to pick such color schemes, and you can check your visu-
alizations after you have created them by uploading to online tools such as
a color blindness simulator3 .
• Redundancy can be helpful; sometimes conveying the same message in mul-
tiple ways reinforces it for the audience.
Minimize noise
• Use colors sparingly. Too many different colors can be distracting, create
false patterns, and detract from the message.
• Be wary of overplotting. Overplotting is when marks that represent the data
overlap, and is problematic as it prevents you from seeing how many data
points are represented in areas of the visualization where this occurs. If your
plot has too many dots or lines and starts to look like a mess, you need to
do something different.
1 https://colorbrewer2.org
2 https://cran.r-project.org/web/packages/RColorBrewer/index.html
3 https://www.color-blindness.com/coblis-color-blindness-simulator/
• Only make the plot area (where the dots, lines, bars are) as big as needed.
Simple plots can be made small.
• Don’t adjust the axes to zoom in on small differences. If the difference is
small, show that it’s small!
4.5 Creating visualizations with ggplot2
This section will cover examples of how to choose and refine a visualization
given a data set and a question that you want to answer, and then how to
create the visualization in R using the ggplot2 R package. Given that the
ggplot2 package is loaded by the tidyverse metapackage, we only need to load
the tidyverse package:
library(tidyverse)
4.5.1 Scatter plots and line plots: the Mauna Loa CO2 data set
The Mauna Loa CO2 data set4 , curated by Dr. Pieter Tans, NOAA/GML
and Dr. Ralph Keeling, Scripps Institution of Oceanography, records the at-
mospheric concentration of carbon dioxide (CO2, in parts per million) at the
Mauna Loa research station in Hawaii from 1959 onward [Tans and Keeling,
2020]. For this book, we are going to focus on the last 40 years of the data set,
1980-2020.
Question: Does the concentration of atmospheric CO2 change over time, and
are there any interesting patterns to note?
To get started, we will read and inspect the data:
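A sketch of this step (the file name is an assumption):

co2_df <- read_csv("data/mauna_loa_data.csv")
co2_df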
4 https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html
## # A tibble: 484 x 2
## date_measured ppm
## <date> <dbl>
## 1 1980-02-01 338.
## 2 1980-03-01 340.
## 3 1980-04-01 341.
## 4 1980-05-01 341.
## 5 1980-06-01 341.
## 6 1980-07-01 339.
## 7 1980-08-01 338.
## 8 1980-09-01 336.
## 9 1980-10-01 336.
## 10 1980-11-01 337.
## # ... with 474 more rows
We see that there are two columns in the co2_df data frame; date_measured and
ppm. The date_measured column holds the date the measurement was taken, and
is of type date. The ppm column holds the value of CO2 in parts per million
that was measured on each date, and is type double.
Note: read_csv was able to parse the date_measured column into the date vector
type because it was entered in the international standard date format, called
ISO 8601, which lists dates as year-month-day. date vectors are double vectors
with special properties that allow them to handle dates correctly. For example,
date type vectors allow functions like ggplot to treat them as numeric dates
and not as character vectors, even though they contain non-numeric characters
(e.g., in the date_measured column in the co2_df data frame). This means R will
not accidentally plot the dates in the wrong order (i.e., not alphanumerically
as would happen if it was a character vector). An in-depth study of dates and
times is beyond the scope of the book, but interested readers may consult the
Dates and Times chapter of R for Data Science [Wickham and Grolemund,
2016]; see the additional resources at the end of this chapter.
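We can visualize the co2_df data frame as a scatter plot using ggplot and the geom_point geometric object; a minimal sketch:

co2_scatter <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
  geom_point()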
co2_scatter
Certainly, the visualization in Figure 4.3 shows a clear upward trend in the
atmospheric concentration of CO2 over time. This plot answers the first part
of our question in the affirmative, but that appears to be the only conclusion
one can make from the scatter visualization.
One important thing to note about this data is that one of the variables we
are exploring is time. Time is a special kind of quantitative variable because
it forces additional structure on the data—the data points have a natural
order. Specifically, each observation in the data set has a predecessor and
a successor, and the order of the observations matters; changing their order
alters their meaning. In situations like this, we typically use a line plot to
visualize the data. Line plots connect the sequence of x and y coordinates of
the observations with line segments, thereby emphasizing their order.
We can create a line plot in ggplot using the geom_line function. Let’s now try
to visualize the co2_df as a line plot with just the default arguments:
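A minimal sketch:

co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
  geom_line()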
co2_line
co2_line
Note: The theme function is quite complex and has many arguments that can
be specified to control many non-data aspects of a visualization. An in-depth
discussion of the theme function is beyond the scope of this book. Interested
readers may consult the theme function documentation; see the additional re-
sources section at the end of this chapter.
Finally, let’s see if we can better understand the oscillation by changing the
visualization slightly. Note that it is totally fine to use a small number of
visualizations to answer different aspects of the question you are trying to
answer. We will accomplish this by using scales, another important feature of
ggplot2 that makes it easy to transform variables and set axis limits. We scale
the horizontal axis using the xlim function, and the vertical axis with the ylim
function. In particular, here, we will use the xlim function to zoom in on just
five years of data (say, 1990-1994). xlim takes a vector of length two to specify
the upper and lower bounds to limit the axis. We can create that using the c
function. Note that it is important that the vector given to xlim must be of the
same type as the data that is mapped to that axis. Here, we have mapped a
date to the x-axis, and so we need to use the date function (from the tidyverse
lubridate R package [Spinu et al., 2021, Grolemund and Wickham, 2011]) to
convert the character strings we provide to c to date vectors.
library(lubridate)
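A sketch of the zoomed-in plot; the exact date limits, axis labels, and font size are assumptions:

co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
  geom_line() +
  xlab("Year") +
  ylab("Atmospheric CO2 (ppm)") +
  xlim(c(date("1990-01-01"), date("1994-01-01"))) +
  theme(text = element_text(size = 12))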
co2_line
Interesting! It seems that each year, the atmospheric CO2 increases until it
reaches its peak somewhere around April, decreases until around late Septem-
ber, and finally increases again until the end of the year. In Hawaii, there are
two seasons: summer from May through October, and winter from November
through April. Therefore, the oscillating pattern in CO2 matches up fairly
closely with the two seasons.
As you might have noticed from the code used to create the final visualization
of the co2_df data frame, we construct the visualizations in ggplot with layers.
New layers are added with the + operator, and we can really add as many as
we would like! A useful analogy to constructing a data visualization is painting
a picture. We start with a blank canvas, and the first thing we do is prepare
the surface for our painting by adding primer. In our data visualization this
is akin to calling ggplot and specifying the data set we will be using. Next,
we sketch out the background of the painting. In our data visualization, this
would be when we map data to the axes in the aes function. Then we add our
key visual subjects to the painting. In our data visualization, this would be
the geometric objects (e.g., geom_point, geom_line, etc.). And finally, we work
on adding details and refinements to the painting. In our data visualization
this would be when we fine tune axis labels, change the font, adjust the point
size, and do other related things.
4.5.2 Scatter plots: the Old Faithful eruption time data set
The faithful data set contains measurements of the waiting time between erup-
tions and the subsequent eruption duration (in minutes) of the Old Faithful
geyser in Yellowstone National Park, Wyoming, United States. The faithful
data set is available in base R as a data frame, so it does not need to be loaded.
We convert it to a tibble to take advantage of the nicer print output these
specialized data frames provide.
Question: Is there a relationship between the waiting time before an eruption
and the duration of the eruption?
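A sketch of this step:

faithful <- as_tibble(faithful)
faithful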
## # A tibble: 272 x 2
## eruptions waiting
## <dbl> <dbl>
## 1 3.6 79
## 2 1.8 54
## 3 3.33 74
## 4 2.28 62
## 5 4.53 85
## 6 2.88 55
## 7 4.7 88
## 8 3.6 85
## 9 1.95 51
## 10 4.35 85
## # ... with 262 more rows
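We can visualize the relationship with a scatter plot of eruption duration against waiting time; a minimal sketch using geom_point:

faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point()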
faithful_scatter
We can see in Figure 4.7 that the data tend to fall into two groups: one with
short waiting and eruption times, and one with long waiting and eruption
times. Note that in this case, there is no overplotting: the points are generally
nicely visually separated, and the pattern they form is clear. In order to refine
the visualization, we need only to add axis labels and make the font more
readable:
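A sketch of the refined plot (the font size is an assumption):

faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point() +
  xlab("Waiting Time (mins)") +
  ylab("Eruption Duration (mins)") +
  theme(text = element_text(size = 12))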
faithful_scatter
FIGURE 4.8: Scatter plot of waiting time and eruption time with clearer
axes and labels.
4.5.3 Axis transformation and colored scatter plots: the Canadian lan-
guages data set
Recall the can_lang data set [Timbers, 2020] from Chapters 1, 2, and 3, which
contains counts of languages from the 2016 Canadian census.
Question: Is there a relationship between the percentage of people who speak
a language as their mother tongue and the percentage for whom that is the
primary language spoken at home? And is there a pattern in the strength
of this relationship in the higher-level language categories (Official languages,
Aboriginal languages, or non-official and non-Aboriginal languages)?
To get started, we will read and inspect the data:
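A sketch of this step (the file path is an assumption):

can_lang <- read_csv("data/can_lang.csv")
can_lang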
## # A tibble: 214 x 6
## category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Aboriginal la~ Aboriginal~ 590 235 30 665
## 2 Non-Official ~ Afrikaans 10260 4785 85 23415
## 3 Non-Official ~ Afro-Asiat~ 1150 445 10 2775
## 4 Non-Official ~ Akan (Twi) 13460 5985 25 22150
## 5 Non-Official ~ Albanian 26895 13135 345 31930
## 6 Aboriginal la~ Algonquian~ 45 10 0 120
## 7 Aboriginal la~ Algonquin 1260 370 40 2480
## 8 Non-Official ~ American S~ 2685 3020 1145 21930
## 9 Non-Official ~ Amharic 22465 12785 200 33670
## 10 Non-Official ~ Arabic 419890 223535 5585 629055
## # ... with 204 more rows
We will begin with a scatter plot of the mother_tongue and most_at_home columns
from our data frame. The resulting plot is shown in Figure 4.9.
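A minimal sketch of this scatter plot:

ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
  geom_point()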
FIGURE 4.10: Scatter plot of number of Canadians reporting a language
as their mother tongue vs the primary language at home with x and y labels.
Okay! The axes and labels in Figure 4.10 are much more readable and inter-
pretable now. However, the scatter points themselves could use some work;
most of the 214 data points are bunched up in the lower left-hand side of
the visualization. The data is clumped because many more people in Canada
speak English or French (the two points in the upper right corner) than
other languages. In particular, the most common mother tongue language has
19,460,850 speakers, while the least common has only 10. That’s a 6-decimal-
place difference in the magnitude of these two numbers! We can confirm that
the two points in the upper right-hand corner correspond to Canada’s two
official languages by filtering the data:
can_lang |>
filter(language == "English" | language == "French")
## # A tibble: 2 x 6
## category language mother_tongue most_at_home most_at_work lang_known
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Official languages English 19460850 22162865 15265335 29748265
## 2 Official languages French 7166700 6943800 3825215 10242945
Recall that our question about this data pertains to all languages; so to prop-
erly answer our question, we will need to adjust the scale of the axes so that
we can clearly see all of the scatter points. In particular, we will improve
the plot by adjusting the horizontal and vertical axes so that they are on a
logarithmic (or log) scale. Log scaling is useful when your data take both
very large and very small values, because it helps space out small values and
squishes larger values together. For example, log10(1) = 0, log10(10) = 1,
log10(100) = 2, and log10(1000) = 3; on the logarithmic scale, the values 1,
10, 100, and 1000 are all the same distance apart! So we see that applying this
function is moving big values closer together and moving small values farther
apart. Note that if your data can take the value 0, logarithmic scaling may not
be appropriate (since log10(0) = -Inf in R). There are other ways to transform
the data in such a case, but these are beyond the scope of the book.
We can accomplish logarithmic scaling in a ggplot visualization using the
scale_x_log10 and scale_y_log10 functions. Given that the x and y axes have
large numbers, we should also format the axis labels to put commas in these
numbers to increase their readability. We can do this in R by passing the
label_comma function (from the scales package) to the labels argument of the
scale_x_log10 and scale_y_log10 functions.
library(scales)
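A sketch of the log-scaled plot; the axis labels and font size are assumptions:

ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
  geom_point() +
  xlab("Language spoken most at home \n (number of Canadian residents)") +
  ylab("Mother tongue \n (number of Canadian residents)") +
  theme(text = element_text(size = 12)) +
  scale_x_log10(labels = label_comma()) +
  scale_y_log10(labels = label_comma())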
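We can also convert the raw counts to percentages of the 2016 Canadian census population (35,151,728 people), which makes the values easier to interpret. A sketch of this computation (the helper object name is an assumption):

canadian_population <- 35151728
can_lang <- can_lang |>
  mutate(mother_tongue_percent = 100 * mother_tongue / canadian_population,
         most_at_home_percent = 100 * most_at_home / canadian_population)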
can_lang |>
select(mother_tongue_percent, most_at_home_percent)
## # A tibble: 214 x 2
## mother_tongue_percent most_at_home_percent
## <dbl> <dbl>
## 1 0.00168 0.000669
## 2 0.0292 0.0136
## 3 0.00327 0.00127
## 4 0.0383 0.0170
## 5 0.0765 0.0374
## 6 0.000128 0.0000284
## 7 0.00358 0.00105
## 8 0.00764 0.00859
## 9 0.0639 0.0364
## 10 1.19 0.636
## # ... with 204 more rows
Finally, we will edit the visualization to use the percentages we just computed
(and change our axis labels to reflect this change in units). Figure 4.12 displays
the final result.
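A sketch of the updated plot (the labels and font size are assumptions):

ggplot(can_lang, aes(x = most_at_home_percent, y = mother_tongue_percent)) +
  geom_point() +
  xlab("Language spoken most at home \n (percentage of Canadian residents)") +
  ylab("Mother tongue \n (percentage of Canadian residents)") +
  theme(text = element_text(size = 12)) +
  scale_x_log10(labels = comma) +
  scale_y_log10(labels = comma)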
Figure 4.12 is the appropriate visualization to use to answer the first question
in this section, i.e., whether there is a relationship between the percentage
of people who speak a language as their mother tongue and the percentage
for whom that is the primary language spoken at home. To fully answer the
question, we need to use Figure 4.12 to assess a few key characteristics of the
data:
• Direction: if the y variable tends to increase when the x variable increases,
then y has a positive relationship with x. If y tends to decrease when x
increases, then y has a negative relationship with x. If y does not meaning-
fully increase or decrease as x increases, then y has little or no relationship
with x.
• Strength: if the y variable reliably increases, decreases, or stays flat as
x increases, then the relationship is strong. Otherwise, the relationship is
weak. Intuitively, the relationship is strong when the scatter points are close
together and look more like a “line” or “curve” than a “cloud.”
• Shape: if you can draw a straight line roughly through the data points, the
relationship is linear. Otherwise, it is nonlinear.
In Figure 4.12, we see that as the percentage of people who have a language
as their mother tongue increases, so does the percentage of people who speak
that language at home. Therefore, there is a positive relationship between
these two variables. Furthermore, because the points in Figure 4.12 are fairly
close together, and the points look more like a “line” than a “cloud”, we can
say that this is a strong relationship. And finally, because drawing a straight
line through these points in Figure 4.12 would fit the pattern we observe quite
well, we say that the relationship is linear.
Onto the second part of our exploratory data analysis question! Recall that
we are interested in knowing whether the strength of the relationship we un-
covered in Figure 4.12 depends on the higher-level language category (Official
languages, Aboriginal languages, and non-official, non-Aboriginal languages).
One common way to explore this is to color the data points on the scatter
plot we have already created by group. For example, given that we have the
higher-level language category for each language recorded in the 2016 Cana-
dian census, we can color the points in our previous scatter plot to represent
each language’s higher-level language category.
Here we want to distinguish the values according to the category group with
which they belong. We can add an argument to the aes function, specifying
that the category column should color the points. Adding this argument will
color the points according to their group and add a legend at the side of the
plot.
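A sketch of the colored scatter plot (the labels and font size are assumptions):

ggplot(can_lang, aes(x = most_at_home_percent,
                     y = mother_tongue_percent,
                     color = category)) +
  geom_point() +
  xlab("Language spoken most at home \n (percentage of Canadian residents)") +
  ylab("Mother tongue \n (percentage of Canadian residents)") +
  theme(text = element_text(size = 12)) +
  scale_x_log10(labels = comma) +
  scale_y_log10(labels = comma)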
FIGURE 4.13: Scatter plot of percentage of Canadians reporting a language
as their mother tongue vs the primary language at home colored by language
category.
The legend in Figure 4.13 takes up valuable plot area. We can improve this
by moving the legend title using the legend.position and legend.direction ar-
guments of the theme function. Here we set legend.position to ”top” to put
the legend above the plot and legend.direction to ”vertical” so that the legend
items remain vertically stacked on top of each other. When the legend.position
is set to either ”top” or ”bottom” the default direction is to stack the legend
items horizontally. However, that will not work well for this particular visual-
ization because the legend labels are quite long and would run off the page if
displayed this way.
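A sketch of the layers that precede the two scale lines shown below (the mappings, labels, and theme settings are assumptions based on the surrounding text):

ggplot(can_lang, aes(x = most_at_home_percent,
                     y = mother_tongue_percent,
                     color = category)) +
  geom_point() +
  xlab("Language spoken most at home \n (percentage of Canadian residents)") +
  ylab("Mother tongue \n (percentage of Canadian residents)") +
  theme(text = element_text(size = 12),
        legend.position = "top",
        legend.direction = "vertical") +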
scale_x_log10(labels = comma) +
scale_y_log10(labels = comma)
In Figure 4.14, the points are colored with the default ggplot2 color palette. But
what if you want to use different colors? In R, two packages that provide al-
ternative color palettes are RColorBrewer [Neuwirth, 2014] and ggthemes [Arnold,
2019]; in this book we will cover how to use RColorBrewer. You can visualize the
list of color palettes that RColorBrewer has to offer with the display.brewer.all
function. You can also print a list of color-blind friendly palettes by adding
colorblindFriendly = TRUE to the function.
library(RColorBrewer)
display.brewer.all(colorblindFriendly = TRUE)
FIGURE 4.15: Color palettes available from the RColorBrewer R package.
From Figure 4.15, we can choose the color palette we want to use in our plot.
To change the color palette, we add the scale_color_brewer layer indicating the
palette we want to use. You can use this color blindness simulator7 to check if
your visualizations are color-blind friendly. Below we pick the ”Set2” palette,
with the result shown in Figure 4.16. We also set the shape aesthetic mapping
to the category variable as well; this makes the scatter point shapes different
for each category. This kind of visual redundancy—i.e., conveying the same
information with both scatter point color and shape—can further improve the
clarity and accessibility of your visualization.
7 https://www.color-blindness.com/coblis-color-blindness-simulator/
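A sketch of the plot with the Set2 palette and the shape mapping added (the other layers carried over from the previous plot are assumptions):

ggplot(can_lang, aes(x = most_at_home_percent,
                     y = mother_tongue_percent,
                     color = category,
                     shape = category)) +
  geom_point() +
  xlab("Language spoken most at home \n (percentage of Canadian residents)") +
  ylab("Mother tongue \n (percentage of Canadian residents)") +
  theme(text = element_text(size = 12),
        legend.position = "top",
        legend.direction = "vertical") +
  scale_color_brewer(palette = "Set2") +
  scale_x_log10(labels = comma) +
  scale_y_log10(labels = comma)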
From the visualization in Figure 4.16, we can now clearly see that the vast
majority of Canadians reported one of the official languages as their mother
tongue and as the language they speak most often at home. What do we see
when considering the second part of our exploratory question? Do we see a
difference in the relationship between languages spoken as a mother tongue
and as a primary language at home across the higher-level language categories?
Based on Figure 4.16, there does not appear to be much of a difference. For
each higher-level language category, there appears to be a strong, positive, and
linear relationship between the percentage of people who speak a language as
their mother tongue and the percentage who speak it as their primary language
at home. The relationship looks similar regardless of the category.
Does this mean that this relationship is positive for all languages in the world?
And further, can we use this data visualization on its own to predict how many
people have a given language as their mother tongue if we know how many
people speak it as their primary language at home? The answer to both these
questions is “no!” However, with exploratory data analysis, we can create
new hypotheses, ideas, and questions (like the ones at the beginning of this
paragraph). Answering those questions often involves doing more complex
analyses, and sometimes even gathering additional data. We will see more of
such complex analyses later on in this book.
# islands data
islands_df <- read_csv("data/islands.csv")
islands_df
## # A tibble: 48 x 3
## landmass size landmass_type
## <chr> <dbl> <chr>
## 1 Africa 11506 Continent
## 2 Antarctica 5500 Continent
## 3 Asia 16988 Continent
## 4 Australia 2968 Continent
Here, we have a data frame of Earth’s landmasses, and are trying to compare
their sizes. The right type of visualization to answer this question is a bar plot.
In a bar plot, the height of the bar represents the value of a summary statistic
(usually a size, count, proportion or percentage). They are particularly useful
for comparing summary statistics between different groups of a categorical
variable.
We specify that we would like to use a bar plot via the geom_bar function in
ggplot2. However, by default, geom_bar sets the heights of bars to the number
of times a value appears in a data frame (its count); here, we want to plot
exactly the values in the data frame, i.e., the landmass sizes. So we have to
pass the stat = ”identity” argument to geom_bar. The result is shown in Figure
4.17.
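A sketch of this bar plot:

islands_bar <- ggplot(islands_df, aes(x = landmass, y = size)) +
  geom_bar(stat = "identity")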
islands_bar
FIGURE 4.17: Bar plot of all Earth’s landmasses’ size with squished labels.
Alright, not bad! The plot in Figure 4.17 is definitely the right kind of visu-
alization, as we can clearly see and compare sizes of landmasses. The major
issues are that the smaller landmasses’ sizes are hard to distinguish, and the
names of the landmasses are obscuring each other as they have been squished
into too little space. But remember that the question we asked was only about
the largest landmasses; let’s make the plot a little bit clearer by keeping only
the largest 12 landmasses. We do this using the slice_max function. Then to
help us make sure the labels have enough space, we’ll use horizontal bars
instead of vertical ones. We do this by swapping the x and y variables:
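A sketch of these steps (the intermediate object name islands_top12 is an assumption):

islands_top12 <- slice_max(islands_df, size, n = 12)
islands_bar <- ggplot(islands_top12, aes(x = size, y = landmass)) +
  geom_bar(stat = "identity")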
islands_bar
FIGURE 4.18: Bar plot of size for Earth’s largest 12 landmasses.
The plot in Figure 4.18 is definitely clearer now, and allows us to answer our
question (“are the top 7 largest landmasses continents?”) in the affirmative.
But the question could be made clearer from the plot by organizing the bars
not by alphabetical order but by size, and to color them based on whether they
are a continent. The data for this is stored in the landmass_type column. To
use this to color the bars, we add the fill argument to the aesthetic mapping
and set it to landmass_type.
To organize the landmasses by their size variable, we will use the tidyverse
fct_reorder function in the aesthetic mapping. The first argument passed to
fct_reorder is the name of
the factor column whose levels we would like to reorder (here, landmass). The
second argument is the column name that holds the values we would like to
use to do the ordering (here, size). The fct_reorder function uses ascending
order by default, but this can be changed to descending order by setting .desc
= TRUE. We do this here so that the largest bar will be closest to the axis line,
which is more visually appealing.
To label the x and y axes, we will use the labs function instead of the xlab and
ylab functions from earlier in this chapter. The labs function is more general;
we are using it in this case because we would also like to change the legend
label. The default label is the name of the column being mapped to fill. Here
that would be landmass_type; however landmass_type is not proper English (and
so is less readable). Thus we use the fill argument inside labs to change that
to “Type.” Finally, we again use the theme function to change the font size.
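A sketch of the resulting plot code, assuming the islands_top12 data frame from the earlier sketch; the axis and legend labels are taken from Figure 4.19:
islands_bar <- ggplot(islands_top12,
                      aes(x = size,
                          y = fct_reorder(landmass, size, .desc = TRUE),
                          fill = landmass_type)) +
  geom_bar(stat = "identity") +
  labs(x = "Size (1000 square mi)", y = "Landmass", fill = "Type") +
  theme(text = element_text(size = 12))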
islands_bar
FIGURE 4.19: Bar plot of size for Earth's largest 12 landmasses, colored by whether each is a continent, with clearer axes and labels.
The plot in Figure 4.19 is now a very effective visualization for answering our
original questions. Landmasses are organized by their size, and continents are
colored differently than other landmasses, making it quite clear that continents
are the largest seven landmasses.
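The next example uses Michelson's 1879 speed of light measurements, which ship with base R as the morley data frame. The code that prints the data frame below is not included here; a minimal sketch (converting to a tibble for nicer printing is an assumption) is:
morley <- as_tibble(morley)
morley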
## # A tibble: 100 x 3
## Expt Run Speed
## <int> <int> <int>
## 1 1 1 850
## 2 1 2 740
## 3 1 3 900
## 4 1 4 1070
## 5 1 5 930
## 6 1 6 850
## 7 1 7 950
## 8 1 8 980
## 9 1 9 980
## 10 1 10 880
## # ... with 90 more rows
In this experimental data, Michelson was trying to measure just a single quan-
titative number (the speed of light). The data set contains many measurements
of this single quantity. To tell how accurate the experiments were, we need to
visualize the distribution of the measurements (i.e., all their possible values
and how often each occurs). We can do this using a histogram. A histogram
helps us visualize how a particular variable is distributed in a data set by
separating the data into bins, and then using vertical bars to show how many
data points fell in each bin.
To create a histogram in ggplot2 we will use the geom_histogram geometric object,
setting the x axis to the Speed measurement variable. As usual, let’s use the
default arguments just to see how things look.
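A minimal sketch of this first histogram, assuming the morley data from above:
morley_hist <- ggplot(morley, aes(x = Speed)) +
  geom_histogram()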
morley_hist
FIGURE 4.20: Histogram of the Speed measurements in the Michelson data, using default settings.
Figure 4.20 is a great start. However, we cannot tell how accurate the measure-
ments are using this visualization unless we can see the true value. In order to
visualize the true speed of light, we will add a vertical line with the geom_vline
function. To draw a vertical line with geom_vline, we need to specify where on
the x-axis the line should be drawn. We can do this by setting the xintercept
argument. Here we set it to 792.458, which is the true value of light speed
minus 299,000; this ensures it is coded the same way as the measurements in
the morley data frame. We would also like to fine tune this vertical line, styling it so that it stands out from the histogram bars (for example, by making it dashed).
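A sketch of the updated plot; the exact styling arguments are assumptions (a dashed line of width 1):
morley_hist <- ggplot(morley, aes(x = Speed)) +
  geom_histogram() +
  geom_vline(xintercept = 792.458, linetype = "dashed", size = 1)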
morley_hist
FIGURE 4.21: Histogram of Speed with a vertical line indicating the true speed of light.
In Figure 4.21, we still cannot tell which experiments (denoted in the Expt
column) led to which measurements; perhaps some experiments were more
accurate than others. To fully answer our question, we need to separate the
measurements from each other visually. We can try to do this using a colored
histogram, where counts from different experiments are stacked on top of each
other in different colors. We can create a histogram colored by the Expt variable
by adding it to the fill aesthetic mapping. We make sure the different colors
can be seen (despite them all sitting on top of each other) by setting the alpha
argument in geom_histogram to 0.5 to make the bars slightly translucent. We
also specify position = "identity" in geom_histogram to ensure the histograms for each experiment are overlaid on top of one another, instead of stacked (stacking is the default for bar plots and histograms that are colored by another categorical variable).
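A sketch of the colored histogram described above:
morley_hist <- ggplot(morley, aes(x = Speed, fill = Expt)) +
  geom_histogram(alpha = 0.5, position = "identity")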
morley_hist
FIGURE 4.22: Histogram of Speed with the Expt variable mapped to fill.
Alright great, Figure 4.22 looks…wait a second! The histogram is still all the
same color! What is going on here? Well, if you recall from Chapter 3, the
data type you use for each variable can influence how R and the tidyverse treat
it. Here, we indeed have an issue with the data types in the morley data frame.
In particular, the Expt column is currently an integer (you can see the label
<int> underneath the Expt column in the printed data frame at the start of
this section). But we want to treat it as a category, i.e., there should be one
category per type of experiment.
To fix this issue we can convert the Expt variable into a factor by passing it
to as_factor in the fill aesthetic mapping. Recall that factor is a data type
in R that is often used to represent categories. By writing as_factor(Expt) we
are ensuring that R will treat this variable as a factor, and the color will be
mapped discretely.
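A sketch of the corrected plot, converting Expt to a factor inside the aesthetic mapping:
morley_hist <- ggplot(morley, aes(x = Speed, fill = as_factor(Expt))) +
  geom_histogram(alpha = 0.5, position = "identity")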
morley_hist
FIGURE 4.23: Histogram of Speed colored by experiment, with Expt converted to a factor.
Note: Factors impact plots in two ways: (1) they ensure that a variable is mapped to a discrete color scale where appropriate (as in this example), and (2) they determine the ordering of levels in a plot. ggplot takes into account the order of the factor levels, as opposed to the order of the data in your data frame. Learning how to reorder your factor levels will therefore also help you reorder the labels of a factor on a plot.
An alternative is to create a separate subplot for each experiment using the facet_grid function. When we add facet_grid to our plot, we need to specify whether to split the plot into subplots, and how to split them (i.e., into rows or columns).
If the plot is to be split horizontally, into rows, then the rows argument is used.
If the plot is to be split vertically, into columns, then the columns argument is
used. Both the rows and columns arguments take the column names on which to
split the data when creating the subplots. Note that the column names must
be surrounded by the vars function. This function allows the column names
to be correctly evaluated in the context of the data frame.
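A sketch of the faceted histogram, splitting the plot into one row per experiment:
morley_hist <- ggplot(morley, aes(x = Speed, fill = as_factor(Expt))) +
  geom_histogram() +
  facet_grid(rows = vars(Expt))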
morley_hist
FIGURE 4.24: Histogram of Speed split into one subplot per experiment.
The visualization in Figure 4.24 now makes it quite clear how accurate the
different experiments were with respect to one another. The most variable
measurements came from Experiment 1. There the measurements ranged from
about 650–1050 km/sec. The least variable measurements came from Exper-
iment 2. There, the measurements ranged from about 750–950 km/sec. The
most different experiments still obtained quite similar results!
There are two finishing touches to make this visualization even clearer. First
and foremost, we need to add informative axis labels using the labs function,
and increase the font size to make it readable using the theme function. Second,
and perhaps more subtly, even though it is easy to compare the experiments
on this plot to one another, it is hard to get a sense of just how accurate all
the experiments were overall. For example, how accurate is the value 800 on
the plot, relative to the true speed of light? To answer this question, we’ll use
the mutate function to transform our data into a relative measure of accuracy
rather than absolute measurements:
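A sketch of this step; the names morley_rel and relative_accuracy are assumptions, and the axis and legend labels are taken from Figure 4.25. Relative accuracy here is the percent difference between each measurement (after adding back the 299,000 km/sec offset) and the true speed of light:
morley_rel <- mutate(morley,
                     relative_accuracy = 100 *
                       ((Speed + 299000) - 299792.458) / 299792.458)

morley_hist <- ggplot(morley_rel,
                      aes(x = relative_accuracy, fill = as_factor(Expt))) +
  geom_histogram() +
  facet_grid(rows = vars(Expt)) +
  labs(x = "Relative Accuracy (%)",
       y = "# Measurements",
       fill = "Experiment ID") +
  theme(text = element_text(size = 12))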
morley_hist
FIGURE 4.25: Histogram of relative accuracy split vertically by experiment
with clearer axes and labels.
Wow, impressive! These measurements of the speed of light from 1879 had
errors around 0.05% of the true speed. Figure 4.25 shows you that even though
experiments 2 and 5 were perhaps the most accurate, all of the experiments
did quite an admirable job given the technology available at the time.
When you create a histogram in R, the default number of bins used is 30.
Naturally, this is not always the right number to use. You can set the number
of bins yourself by using the bins argument in the geom_histogram geometric
object. You can also set the width of the bins using the binwidth argument in
the geom_histogram geometric object. But what number of bins, or bin width,
is the right one to use?
Unfortunately there is no hard rule for what the right bin number or width is. It
depends entirely on your problem; the right number of bins or bin width is the
one that helps you answer the question you asked. Choosing the correct setting
for your problem is something that commonly takes iteration. We recommend
setting the bin width (not the number of bins) because it often more directly
corresponds to values in your problem of interest. For example, if you are
looking at a histogram of human heights, a bin width of 1 inch would likely
be reasonable, while the number of bins to use is not immediately clear. It’s
usually a good idea to try out several bin widths to see which one most clearly
captures your data in the context of the question you want to answer.
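For example, a sketch of how one panel of Figure 4.26 could be produced by fixing a particular bin width (assuming morley_rel from the earlier sketch); only the binwidth value changes between panels:
ggplot(morley_rel, aes(x = relative_accuracy, fill = as_factor(Expt))) +
  geom_histogram(binwidth = 0.01) +
  facet_grid(rows = vars(Expt))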
To get a sense for how different bin widths affect visualizations, let’s exper-
iment with the histogram that we have been working on in this section. In
Figure 4.26, we compare the default setting with three other histograms where
we set the binwidth to 0.001, 0.01 and 0.1. In this case, we can see that both
the default number of bins and the binwidth of 0.01 are effective for helping
answer our question. On the other hand, the bin widths of 0.001 and 0.1 are
too small and too big, respectively.
FIGURE 4.26: Effect of varying bin width on histograms.
One of the powerful features of ggplot is that you can continue to iterate on a
single plot object, adding and refining one layer at a time. If you stored your
plot as a named object using the assignment symbol (<-), you can add to it
using the + operator. For example, if we wanted to add a title to the last plot
we created (morley_hist), we can use the + operator to add a title layer with
the ggtitle function. The result is shown in Figure 4.27.
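A sketch of this step, assuming morley_hist is the faceted relative accuracy histogram from above; the title text here is only an example:
morley_hist_title <- morley_hist +
  ggtitle("Speed of light experiments in 1879 were accurate to about 0.05%")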
morley_hist_title
FIGURE 4.27: Histogram of relative accuracy split vertically by experiment
with a descriptive title highlighting the take home message of the visualization.
Note: Good visualization titles clearly communicate the take home message
to the audience. Typically, that is the answer to the question you posed before
making the visualization.
4.6 Explaining the visualization
Typically, your visualization will not be shown entirely on its own, but rather
it will be part of a larger presentation. Further, visualizations can provide
supporting information for any aspect of a presentation, from opening to con-
clusion. For example, you could use an exploratory visualization in the opening
of the presentation to motivate your choice of a more detailed data analysis /
model, a visualization of the results of your analysis to show what your anal-
ysis has uncovered, or even one at the end of a presentation to help suggest
directions for future work.
Regardless of where it appears, a good way to discuss your visualization is as
a story:
1) Establish the setting and scope, and describe why you did what you
did.
2) Pose the question that your visualization answers. Justify why the
question is important to answer.
3) Answer the question using your visualization. Make sure you describe
all aspects of the visualization (including describing the axes). But
you can emphasize different aspects based on what is important to
answer your question:
•trends (lines): Does a line describe the trend well? If so, the
trend is linear, and if not, the trend is nonlinear. Is the trend
increasing, decreasing, or neither? Is there a periodic oscillation
(wiggle) in the trend? Is the trend noisy (does the line “jump
around” a lot) or smooth?
•distributions (scatters, histograms): How spread out are
the data? Where are they centered, roughly? Are there any ob-
vious “clusters” or “subgroups”, which would be visible as mul-
tiple bumps in the histogram?
•distributions of two variables (scatters): Is there a clear relationship between the two variables (i.e., do the points fall along a distinct pattern), or is the relationship weak or absent (the points look like noise)?
4) Summarize your findings, and suggest what the answer implies or what could be investigated next.
Below are two examples of how one might take these four steps in describing
the example visualizations that appeared earlier in this chapter. Each of the
steps is denoted by its numeral in parentheses, e.g. (3).
Mauna Loa Atmospheric CO2 Measurements: (1) Many current forms
of energy generation and conversion—from automotive engines to natural gas
power plants—rely on burning fossil fuels and produce greenhouse gases, typ-
ically primarily carbon dioxide (CO2 ), as a byproduct. Too much of these
gases in the Earth’s atmosphere will cause it to trap more heat from the sun,
leading to global warming. (2) In order to assess how quickly the atmospheric
concentration of CO2 is increasing over time, we (3) used a data set from
the Mauna Loa observatory in Hawaii, consisting of CO2 measurements from
1980 to 2020. We plotted the measured concentration of CO2 (on the vertical
axis) over time (on the horizontal axis). From this plot, you can see a clear,
increasing, and generally linear trend over time. There is also a periodic os-
cillation that occurs once per year and aligns with Hawaii’s seasons, with an
amplitude that is small relative to the growth in the overall trend. This shows
that atmospheric CO2 is clearly increasing over time, and (4) it is perhaps
worth investigating more into the causes.
Michelson Light Speed Experiments: (1) Our modern understanding of
the physics of light has advanced significantly from the late 1800s, when Michelson and Morley's experiments first measured its speed with high precision. We
now know, based on modern experiments, that it moves at roughly 299,792.458
kilometers per second. (2) But how accurately were we first able to measure
this fundamental physical constant, and did certain experiments produce more
accurate results than others? (3) To better understand this, we plotted data
from 5 experiments by Michelson in 1879, each with 20 trials, as histograms
stacked on top of one another. The horizontal axis shows the accuracy of the
measurements relative to the true speed of light as we know it today, expressed
as a percentage. From this visualization, you can see that most results had
relative errors of at most 0.05%. You can also see that experiments 1 and 3
had measurements that were the farthest from the true value, and experiment
5 tended to provide the most consistently accurate result. (4) It would be worth investigating further why some of the experiments produced more accurate results than others.
4.7 Saving the visualization
Just as there are many ways to store data sets, there are many ways to store
visualizations and images. Which one you choose can depend on several factors,
such as file size/type limitations (e.g., if you are submitting your visualization
as part of a conference paper or to a poster printing shop) and where it will be
displayed (e.g., online, in a paper, on a poster, on a billboard, in talk slides).
Generally speaking, images come in two flavors: raster formats and vector
formats.
Raster images are represented as a 2-D grid of square pixels, each with its
own color. Raster images are often compressed before storing so they take
up less space. A compressed format is lossy if the image cannot be perfectly
re-created when loading and displaying, with the hope that the change is not
noticeable. Lossless formats, on the other hand, allow a perfect display of the
original image.
• Common file types:
– JPEG8 (.jpg, .jpeg): lossy, usually used for photographs
– PNG9 (.png): lossless, usually used for plots / line drawings
– BMP10 (.bmp): lossless, raw image data, no compression (rarely used)
– TIFF11 (.tif, .tiff): typically lossless, no compression, used mostly in
graphic arts, publishing
• Open-source software: GIMP12
Vector images are represented as a collection of mathematical objects (lines,
surfaces, shapes, curves). When the computer displays the image, it redraws
all of the elements using their mathematical formulas.
• Common file types:
– SVG13 (.svg): general-purpose use
– EPS14 (.eps): general-purpose use
• Open-source software: Inkscape15
8 https://en.wikipedia.org/wiki/JPEG
9 https://en.wikipedia.org/wiki/Portable_Network_Graphics
10 https://en.wikipedia.org/wiki/BMP_file_format
11 https://en.wikipedia.org/wiki/TIFF
12 https://www.gimp.org/
13 https://en.wikipedia.org/wiki/Scalable_Vector_Graphics
Note: The portable document format PDF16 (.pdf) is commonly used to store
both raster and vector formats. If you try to open a PDF and it’s taking a
long time to load, it may be because there is a complicated vector graphics
image that your computer is rendering.
Let’s learn how to save plot images to these different file formats using a scatter
plot of the Old Faithful data set17 [Hardle, 1991], shown in Figure 4.28.
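The code that builds faithful_plot is not reproduced here; a sketch consistent with the axis labels in Figure 4.28, assuming the faithful data frame built into R, is:
faithful_plot <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point() +
  labs(x = "Waiting time to next eruption \n (minutes)",
       y = "Eruption time \n (minutes)")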
faithful_plot
14 https://en.wikipedia.org/wiki/Encapsulated_PostScript
15 https://inkscape.org/
16 https://en.wikipedia.org/wiki/PDF
17 https://www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat
FIGURE 4.28: Scatter plot of waiting time and eruption time.
Now that we have a named ggplot plot object, we can use the ggsave function
to save a file containing this image. ggsave works by taking a file name to
create for the image as its first argument. This can include the path to the
directory where you would like to save the file (e.g., img/filename.png to save a
file named filename to the img directory), and the name of the plot object to
save as its second argument. The kind of image to save is specified by the file
extension. For example, to create a PNG image file, we specify that the file
extension is .png. Below we demonstrate how to save PNG, JPG, BMP, TIFF
and SVG file types for the faithful_plot:
ggsave("img/faithful_plot.png", faithful_plot)
ggsave("img/faithful_plot.jpg", faithful_plot)
ggsave("img/faithful_plot.bmp", faithful_plot)
ggsave("img/faithful_plot.tiff", faithful_plot)
ggsave("img/faithful_plot.svg", faithful_plot)
Take a look at the file sizes in Table 4.1. Wow, that’s quite a difference! Notice
that for such a simple plot with few graphical elements (points), the vector
graphics format (SVG) is over 100 times smaller than the uncompressed raster
images (BMP, TIFF). Also, note that the JPG format is twice as large as the
PNG format since the JPG compression algorithm is designed for natural
images (not plots).
TABLE 4.1: File sizes of the scatter plot of the Old Faithful data set when
saved as different file formats.
In Figure 4.29, we also show what the images look like when we zoom in to a
rectangle with only 2 data points. You can see why vector graphics formats are
so useful: because they’re just based on mathematical formulas, vector graph-
ics can be scaled up to arbitrary sizes. This makes them great for presentation
media of all sizes, from papers to posters to billboards.
FIGURE 4.29: Zoomed in faithful, raster (PNG, left) and vector (SVG,
right) formats.
4.8 Exercises
Practice exercises for the material covered in this chapter can be found in
the accompanying worksheets repository18 in the “Effective data visualization”
row. You can launch an interactive version of the worksheet in your browser
by clicking the “launch binder” button. You can also preview a non-interactive
version of the worksheet by clicking “view worksheet.” If you instead decide to
download the worksheet and run it on your own machine, make sure to follow
the instructions for computer setup found in Chapter 13. This will ensure
that the automated feedback and guidance that the worksheets provide will
function as intended.
4.9 Additional resources
• R for Data Science [Wickham and Grolemund, 2016] has a chapter on dates
and times22 . This chapter is where you should look if you want to learn about
date vectors, including how to create them, and how to use them to effectively
handle durations, periods and intervals using the lubridate package.
22 https://r4ds.had.co.nz/dates-and-times.html
5 Classification I: training & predicting
5.1 Overview
In previous chapters, we focused solely on descriptive and exploratory data
analysis questions. This chapter and the next together serve as our first foray
into answering predictive questions about data. In particular, we will focus on
classification, i.e., using one or more variables to predict the value of a cate-
gorical variable of interest. This chapter will cover the basics of classification,
how to preprocess data to make it suitable for use in a classifier, and how to
use our observed data to make predictions. The next chapter will focus on
how to evaluate how accurate the predictions from our classifier are, as well
as how to improve our classifier (where possible) to maximize its accuracy.
5.4 Exploring a data set
library(tidyverse)
In this case, the file containing the breast cancer data set is a .csv file with
headers. We’ll use the read_csv function with no additional arguments, and
then inspect its contents:
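A sketch of that step; the file path data/wdbc.csv is an assumption:
cancer <- read_csv("data/wdbc.csv")
cancer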
## # A tibble: 569 x 12
## ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity
Below we use glimpse to preview the data frame. This function can make it
easier to inspect the data when we have a lot of columns, as it prints the data
such that the columns go down the page (instead of across).
glimpse(cancer)
## Rows: 569
## Columns: 12
## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786~
## $ Class <chr> ”M”, ”M”, ”M”, ”M”, ”M”, ”M”, ”M”, ”M”, ”M”, ”M”, ”M~
## $ Radius <dbl> 1.0960995, 1.8282120, 1.5784992, -0.7682333, 1.74875~
## $ Texture <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.150~
## $ Perimeter <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.7~
## $ Area <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.8~
## $ Smoothness <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.2~
## $ Compactness <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.5~
## $ Concavity <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.3~
## $ Concave_Points <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42~
## $ Symmetry <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, ~
## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0~
From the summary of the data above, we can see that Class is of type char-
acter (denoted by <chr>). Since we will be working with Class as a categorical
statistical variable, we will convert it to a factor using the function as_factor.
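A sketch of the conversion, which produces the glimpse output below:
cancer <- cancer |>
  mutate(Class = as_factor(Class))
glimpse(cancer)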
## Rows: 569
## Columns: 12
## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786~
## $ Class <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M~
## $ Radius <dbl> 1.0960995, 1.8282120, 1.5784992, -0.7682333, 1.74875~
## $ Texture <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.150~
## $ Perimeter <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.7~
## $ Area <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.8~
## $ Smoothness <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.2~
## $ Compactness <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.5~
## $ Concavity <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.3~
## $ Concave_Points <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42~
Recall that factors have what are called “levels”, which you can think of as
categories. We can verify the levels of the Class column by using the levels
function. This function should return the name of each category in that column.
Given that we only have two different values in our Class column (B for benign
and M for malignant), we only expect to get two names back. Note that the
levels function requires a vector argument; so we use the pull function to
extract a single column (Class) and pass that into the levels function to see
the categories in the Class column.
cancer |>
pull(Class) |>
levels()
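The levels call above should print the two category names. The code that computes the class counts and percentages shown below is not included here; a sketch (num_obs is an assumed helper holding the total number of observations) is:
num_obs <- nrow(cancer)
cancer |>
  group_by(Class) |>
  summarize(count = n(),
            percentage = n() / num_obs * 100)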
## # A tibble: 2 x 3
## Class count percentage
## <fct> <int> <dbl>
## 1 M 212 37.3
## 2 B 357 62.7
Next, let’s draw a scatter plot to visualize the relationship between the perime-
ter and concavity variables. Rather than use ggplot's default palette, we select
our own colorblind-friendly colors—”orange2” for light orange and ”steelblue2”
for light blue—and pass them as the values argument to the scale_color_manual
function. We also make the category labels (“B” and “M”) more readable by
changing them to “Benign” and “Malignant” using the labels argument.
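A sketch of the scatter plot code, with axis and legend labels taken from Figure 5.1:
cancer |>
  ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
  geom_point(alpha = 0.6) +
  labs(x = "Perimeter (standardized)",
       y = "Concavity (standardized)",
       color = "Diagnosis") +
  scale_color_manual(labels = c("Malignant", "Benign"),
                     values = c("orange2", "steelblue2"))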
FIGURE 5.1: Scatter plot of concavity versus perimeter colored by diagnosis
label.
In Figure 5.1, we can see that malignant observations typically fall in the
upper right-hand corner of the plot area. By contrast, benign observations
typically fall in the lower left-hand corner of the plot. In other words, benign
observations tend to have lower concavity and perimeter values, and malignant
ones tend to have larger values. Suppose we obtain a new observation not in
the current data set that has all the variables measured except the label (i.e.,
an image without the physician’s diagnosis for the tumor class). We could
compute the standardized perimeter and concavity values, resulting in values
of, say, 1 and 1. Could we use this information to classify that observation
as benign or malignant? Based on the scatter plot, how might you classify
that new observation? If the standardized concavity and perimeter values are
1 and 1 respectively, the point would lie in the middle of the orange cloud of
malignant points and thus we could probably classify it as malignant. Based
on our visualization, it seems like the prediction of an unobserved label might
be possible.
5.5 Classification with 𝐾-nearest neighbors
FIGURE 5.2: Scatter plot of concavity versus perimeter with new observa-
tion represented as a red diamond.
FIGURE 5.3: Scatter plot of concavity versus perimeter. The new observa-
tion is represented as a red diamond with a line to the one nearest neighbor,
which has a malignant label.
FIGURE 5.4: Scatter plot of concavity versus perimeter. The new observa-
tion is represented as a red diamond with a line to the one nearest neighbor,
which has a benign label.
FIGURE 5.5: Scatter plot of concavity versus perimeter with three nearest
neighbors.
FIGURE 5.6: Scatter plot of concavity versus perimeter with new observa-
tion represented as a red diamond.
new_obs_Perimeter <- 0
new_obs_Concavity <- 3.5
cancer |>
select(ID, Perimeter, Concavity, Class) |>
mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
(Concavity - new_obs_Concavity)^2)) |>
arrange(dist_from_new) |>
slice(1:5) # take the first 5 rows
## # A tibble: 5 x 5
## ID Perimeter Concavity Class dist_from_new
## <dbl> <dbl> <dbl> <fct> <dbl>
## 1 86409 0.241 2.65 B 0.881
## 2 887181 0.750 2.87 M 0.980
## 3 899667 0.623 2.54 M 1.14
## 4 907914 0.417 2.31 M 1.26
## 5 8710441 -1.16 4.04 B 1.28
In Table 5.1 we show in mathematical detail how the mutate step was used to
compute the dist_from_new variable (the distance to the new observation) for
each of the 5 nearest neighbors in the training data.
TABLE 5.1: Evaluating the distances from the new observation to each of
its 5 nearest neighbors
The result of this computation shows that 3 of the 5 nearest neighbors to our
new observation are malignant (M); since this is the majority, we classify our
new observation as malignant. These 5 neighbors are circled in Figure 5.7.
FIGURE 5.7: Scatter plot of concavity versus perimeter with 5 nearest neigh-
bors circled.
the distance between points. Suppose we have 𝑚 predictor variables for two
observations 𝑎 and 𝑏, i.e., 𝑎 = (𝑎1 , 𝑎2 , … , 𝑎𝑚 ) and 𝑏 = (𝑏1 , 𝑏2 , … , 𝑏𝑚 ).
The distance formula becomes

Distance = √((𝑎1 − 𝑏1)² + (𝑎2 − 𝑏2)² + ⋯ + (𝑎𝑚 − 𝑏𝑚)²).
Let’s calculate the distances between our new observation and each of the
observations in the training set to find the 𝐾 = 5 neighbors when we have
these three predictors.
new_obs_Perimeter <- 0
new_obs_Concavity <- 3.5
new_obs_Symmetry <- 1
cancer |>
select(ID, Perimeter, Concavity, Symmetry, Class) |>
mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
(Concavity - new_obs_Concavity)^2 +
(Symmetry - new_obs_Symmetry)^2)) |>
arrange(dist_from_new) |>
slice(1:5) # take the first 5 rows
## # A tibble: 5 x 6
## ID Perimeter Concavity Symmetry Class dist_from_new
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 907914 0.417 2.31 0.837 M 1.27
## 2 90439701 1.33 2.89 1.10 M 1.47
## 3 925622 0.470 2.08 1.15 M 1.50
## 4 859471 -1.37 2.81 1.09 B 1.53
## 5 899667 0.623 2.54 2.06 M 1.56
1. Compute the distance between the new observation and each obser-
vation in the training set.
2. Sort the data table in ascending order according to the distances.
3. Choose the top 𝐾 rows of the sorted table.
4. Classify the new observation based on a majority vote of the neighbor classes.
5.6 𝐾-nearest neighbors with tidymodels
library(tidymodels)
## # A tibble: 569 x 3
## Class Perimeter Concavity
## <fct> <dbl> <dbl>
## 1 M 1.27 2.65
## 2 M 1.68 -0.0238
## 3 M 1.57 1.36
## 4 M -0.592 1.91
## 5 M 1.78 1.37
## 6 M -0.387 0.866
## 7 M 1.14 0.300
## 8 M -0.0728 0.0610
## 9 M -0.184 1.22
## 10 M -0.329 1.74
## # ... with 559 more rows
In order to fit the model on the breast cancer data, we need to pass the model
specification and the data set to the fit function. We also need to specify what
variables to use as predictors and what variable to use as the target. Below,
the Class ~ Perimeter + Concavity argument specifies that Class is the target
variable (the one we want to predict), and both Perimeter and Concavity are to
be used as the predictors.
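The model specification and fitting code is not included here; a sketch consistent with the settings reported in the fitted summary (a rectangular weight function and 5 neighbors), fitting directly on the cancer data with a formula that uses only the two predictors, is:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- knn_spec |>
  fit(Class ~ Perimeter + Concavity, data = cancer)

knn_fit  # print the trained model summary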
Here you can see the final trained model summary. It confirms that the com-
putational engine used to train the model was kknn::train.kknn. It also shows
the fraction of errors made by the nearest neighbor model, but we will ignore
this for now and discuss it in more detail in the next chapter. Finally, it shows
(somewhat confusingly) that the “best” weight function was “rectangular” and
“best” setting of 𝐾 was 5; but since we specified these earlier, R is just repeat-
ing those settings to us here. In the next chapter, we will actually let R find
the value of 𝐾 for us.
Finally, we make the prediction on the new observation by calling the predict
function, passing both the fit object we just created and the new observation
itself. As above, when we ran the 𝐾-nearest neighbors classification algorithm
manually, the knn_fit object classifies the new observation as malignant (“M”).
Note that the predict function outputs a data frame with a single variable
named .pred_class.
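A sketch of that call; new_obs is an assumed name for a one-row data frame holding the standardized perimeter and concavity values used earlier:
new_obs <- tibble(Perimeter = 0, Concavity = 3.5)
predict(knn_fit, new_obs)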
## # A tibble: 1 x 1
## .pred_class
## <fct>
## 1 M
Is this predicted malignant label the true class for this observation? Well, we
don’t know because we do not have this observation’s diagnosis— that is what
we were trying to predict! The classifier’s prediction is not necessarily correct,
but in the next chapter, we will learn ways to quantify how accurate we think
our predictions are.
5.7 Data preprocessing with tidymodels
5.7.1 Centering and scaling
To scale and center our data, we need to find our variables’ mean (the average,
which quantifies the “central” value of a set of numbers) and standard devi-
ation (a number quantifying how spread out values are). For each observed
value of the variable, we subtract the mean (i.e., center the variable) and di-
vide by the standard deviation (i.e., scale the variable). When we do this, the
data is said to be standardized, and all variables in a data set will have a mean
of 0 and a standard deviation of 1. To illustrate the effect that standardization
can have on the 𝐾-nearest neighbor algorithm, we will read in the original,
unstandardized Wisconsin breast cancer data set; we have been using a stan-
dardized version of the data set up until now. To keep things simple, we will
just use the Area, Smoothness, and Class variables:
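A sketch of loading that data, using the unscaled_wdbc.csv file mentioned later in the chapter:
unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") |>
  mutate(Class = as_factor(Class)) |>
  select(Class, Area, Smoothness)
unscaled_cancer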
## # A tibble: 569 x 3
## Class Area Smoothness
## <fct> <dbl> <dbl>
## 1 M 1001 0.118
## 2 M 1326 0.0847
## 3 M 1203 0.110
## 4 M 386. 0.142
## 5 M 1297 0.100
## 6 M 477. 0.128
## 7 M 1040 0.0946
## 8 M 578. 0.119
## 9 M 520. 0.127
## 10 M 476. 0.119
## # ... with 559 more rows
Looking at the unscaled and uncentered data above, you can see that the
differences between the values for area measurements are much larger than
those for smoothness. Will this affect predictions? In order to find out, we will
create a scatter plot of these two predictors (colored by diagnosis) for both
the unstandardized data we just loaded, and the standardized version of that
same data. But first, we need to standardize the unscaled_cancer data set with
tidymodels.
In the tidymodels framework, we can do this using a recipe from the recipes R package5 [Kuhn and Wickham, 2021]. Here we will initialize a recipe for the unscaled_cancer data above, specifying that the Class variable is the target, and all other variables are predictors:
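A sketch of that initialization; the name uc_recipe is an assumption:
uc_recipe <- recipe(Class ~ ., data = unscaled_cancer)
uc_recipe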
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 2
So far, there is not much in the recipe; just a statement about the num-
ber of targets and predictors. Let’s add scaling (step_scale) and centering
(step_center) steps for all of the predictors so that they each have a mean
of 0 and standard deviation of 1. Note that tidymodels actually provides
step_normalize, which does both centering and scaling in a single recipe step;
in this book we will keep step_scale and step_center separate to emphasize con-
ceptually that there are two steps happening. The prep function finalizes the
recipe by using the data (here, unscaled_cancer) to compute anything necessary
to run the recipe (in this case, the column means and standard deviations):
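A sketch of adding the two steps and preparing the recipe:
uc_recipe <- uc_recipe |>
  step_scale(all_predictors()) |>
  step_center(all_predictors()) |>
  prep()
uc_recipe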
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 2
##
## Training data contained 569 data points and no missing data.
5 https://recipes.tidymodels.org/
##
## Operations:
##
## Scaling for Area, Smoothness [trained]
## Centering for Area, Smoothness [trained]
You can now see that the recipe includes a scaling and centering step for all
predictor variables. Note that when you add a step to a recipe, you must
specify what columns to apply the step to. Here we used the all_predictors()
function to specify that each step should be applied to all predictor variables.
However, there are a number of different arguments one could use here, as well
as naming particular columns with the same syntax as the select function. For
example:
• all_nominal() and all_numeric(): specify all categorical or all numeric vari-
ables
• all_predictors() and all_outcomes(): specify all predictor or all target vari-
ables
• Area, Smoothness: specify both the Area and Smoothness variable
• -Class: specify everything except the Class variable
You can find a full set of all the steps and variable selection functions on the recipes reference page6.
At this point, we have calculated the required statistics based on the data input
into the recipe, but the data are not yet scaled and centered. To actually scale
and center the data, we need to apply the bake function to the unscaled data.
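A sketch of that step; scaled_cancer is an assumed name for the result:
scaled_cancer <- bake(uc_recipe, unscaled_cancer)
scaled_cancer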
## # A tibble: 569 x 3
## Area Smoothness Class
## <dbl> <dbl> <fct>
## 1 0.984 1.57 M
## 2 1.91 -0.826 M
## 3 1.56 0.941 M
## 4 -0.764 3.28 M
## 5 1.82 0.280 M
## 6 -0.505 2.24 M
## 7 1.09 -0.123 M
## 8 -0.219 1.60 M
6 https://recipes.tidymodels.org/reference/index.html
## 9 -0.384 2.20 M
## 10 -0.509 1.58 M
## # ... with 559 more rows
It may seem redundant that we had to both bake and prep to scale and center
the data. However, we do this in two steps so we can specify a different data
set in the bake step if we want. For example, we may want to specify new data
that were not part of the training set.
You may wonder why we are doing so much work just to center and scale
our variables. Can’t we just manually scale and center the Area and Smooth-
ness variables ourselves before building our 𝐾-nearest neighbor model? Well,
technically yes; but doing so is error-prone. In particular, we might acciden-
tally forget to apply the same centering / scaling when making predictions,
or accidentally apply a different centering / scaling than what we used while
training. Proper use of a recipe helps keep our code simple, readable, and
error-free. Furthermore, note that using prep and bake is required only when
you want to inspect the result of the preprocessing steps yourself. You will see
further on in Section 5.8 that tidymodels provides tools to automatically apply
prep and bake as necessary without additional coding effort.
Figure 5.9 shows the two scatter plots side-by-side—one for unscaled_cancer and
one for scaled_cancer. Each has the same new observation annotated with its
𝐾 = 3 nearest neighbors. In the original unstandardized data plot, you can see
some odd choices for the three nearest neighbors. In particular, the “neighbors”
are visually well within the cloud of benign observations, and the neighbors are
all nearly vertically aligned with the new observation (which is why it looks
like there is only one black line on this plot). Figure 5.10 shows a close-up
of that region on the unstandardized plot. Here the computation of nearest
neighbors is dominated by the much larger-scale area variable. The plot for
standardized data on the right in Figure 5.9 shows a much more intuitively
reasonable selection of nearest neighbors. Thus, standardizing the data can
change things in an important way when we are using predictive algorithms.
Standardizing your data should be a part of the preprocessing you do before
predictive modeling and you should always think carefully about your problem
domain and whether you need to standardize your data.
5.7.2 Balancing
Another potential issue in a data set for a classifier is class imbalance, i.e.,
when one label is much more common than another. Since classifiers like the
𝐾-nearest neighbor algorithm use the labels of nearby points to predict the
label of a new point, if there are many more data points with one label overall,
the algorithm is more likely to pick that label in general (even if the “pattern”
FIGURE 5.9: Comparison of the 𝐾 = 3 nearest neighbors to a new observation for the unstandardized (left) and standardized (right) data.
FIGURE 5.10: Close-up of the unstandardized data plot, where the nearest neighbors are dominated by the large-scale Area variable.
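The text and code that build this artificially imbalanced data set are not included here. A sketch of one way to construct it (the names rare_cancer and rare_plot are assumptions; keeping only 3 malignant observations is consistent with the 360 training points reported by the upsampling recipe below):
rare_cancer <- bind_rows(
  filter(cancer, Class == "B"),
  cancer |> filter(Class == "M") |> slice_head(n = 3)
) |>
  select(Class, Perimeter, Concavity)

rare_plot <- rare_cancer |>
  ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
  geom_point(alpha = 0.5) +
  labs(x = "Perimeter (standardized)",
       y = "Concavity (standardized)",
       color = "Diagnosis") +
  scale_color_manual(labels = c("Malignant", "Benign"),
                     values = c("orange2", "steelblue2"))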
rare_plot
FIGURE 5.11: Imbalanced data.
FIGURE 5.12: Imbalanced data with 7 nearest neighbors to a new observa-
tion highlighted.
FIGURE 5.13: Imbalanced data with the background color indicating the decision of the classifier and the points representing the labeled data.
library(themis)
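The code that creates the upsampling recipe is not included here; a sketch using the step_upsample function from themis, assuming the rare_cancer data frame from the sketch above, is:
ups_recipe <- recipe(Class ~ ., data = rare_cancer) |>
  step_upsample(Class, over_ratio = 1, skip = FALSE) |>
  prep()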
ups_recipe
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 2
##
## Training data contained 360 data points and no missing data.
##
## Operations:
##
## Up-sampling based on Class [trained]
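To obtain the rebalanced data itself, the recipe is baked; a sketch consistent with the upsampled_cancer object used below:
upsampled_cancer <- bake(ups_recipe, rare_cancer)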
upsampled_cancer |>
group_by(Class) |>
summarize(n = n())
## # A tibble: 2 x 2
## Class n
## <fct> <int>
## 1 M 357
## 2 B 357
Now suppose we train our 𝐾-nearest neighbor classifier with 𝐾 = 7 on this bal-
anced data. Figure 5.14 shows what happens now when we set the background
color of each area of our scatter plot to the decision the 𝐾-nearest neigh-
bor classifier would make. We can see that the decision is more reasonable;
when the points are close to those labeled malignant, the classifier predicts
a malignant tumor, and vice versa when they are closer to the benign tumor
observations.
FIGURE 5.14: Upsampled data with background color indicating the deci-
sion of the classifier.
5.8 Putting it together in a workflow
The tidymodels package collection also provides the workflow, a way to chain
together multiple data analysis steps without a lot of otherwise necessary code
for intermediate steps. To illustrate the whole pipeline, let’s start from scratch
with the unscaled_wdbc.csv data. First we will load the data, create a model,
and specify a recipe for how the data should be preprocessed:
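A sketch of these three steps; apart from knn_spec (referred to below), the object names are assumptions, and the settings match the fitted output shown afterward (7 neighbors, rectangular weights, Area and Smoothness as predictors):
# load the unscaled data and convert the label to a factor
unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") |>
  mutate(Class = as_factor(Class))

# K-nearest neighbors model specification
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
  set_engine("kknn") |>
  set_mode("classification")

# recipe that centers and scales the two predictors
uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())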
Note that each of these steps is exactly the same as earlier, except for one major
difference: we did not use the select function to extract the relevant variables
from the data frame, and instead simply specified the relevant variables to
use via the formula Class ~ Area + Smoothness (instead of Class ~ .) in the
recipe. You will also notice that we did not call prep() on the recipe; this is
unnecessary when it is placed in a workflow.
We will now place these steps in a workflow using the add_recipe and add_model
functions, and finally we will use the fit function to run the whole workflow
on the unscaled_cancer data. Note another difference from earlier here: we do
not include a formula in the fit function. This is again because we included
the formula in the recipe, so there is no need to respecify it:
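A sketch of the workflow, assuming the recipe and model specification from the sketch above:
knn_fit <- workflow() |>
  add_recipe(uc_recipe) |>
  add_model(knn_spec) |>
  fit(data = unscaled_cancer)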
knn_fit
##
## Type of response variable: nominal
## Minimal misclassification: 0.112478
## Best kernel: rectangular
## Best k: 7
As before, the fit object lists the function that trains the model as well as
the “best” settings for the number of neighbors and weight function (for now,
these are just the values we chose manually when we created knn_spec above).
But now the fit object also includes information about the overall workflow,
including the centering and scaling preprocessing steps. In other words, when
we use the predict function with the knn_fit object to make a prediction for a
new observation, it will first apply the same recipe steps to the new observation.
As an example, we will predict the class label of two new observations: one
with Area = 500 and Smoothness = 0.075, and one with Area = 1500 and Smoothness
= 0.1.
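A sketch of that prediction; new_observation is an assumed name for the two-row data frame of new observations:
new_observation <- tibble(Area = c(500, 1500), Smoothness = c(0.075, 0.1))
prediction <- predict(knn_fit, new_observation)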
prediction
## # A tibble: 2 x 1
## .pred_class
## <fct>
## 1 B
## 2 M
The classifier predicts that the first observation is benign (“B”), while the
second is malignant (“M”). Figure 5.15 visualizes the predictions that this
trained 𝐾-nearest neighbor model will make on a large range of new observa-
tions. Although you have seen colored prediction map visualizations like this a
few times now, we have not included the code to generate them, as it is a little
bit complicated. For the interested reader who wants a learning challenge, we
now include it below. The basic idea is to create a grid of synthetic new obser-
vations using the expand.grid function, predict the label of each, and visualize
the predictions with a colored scatter having a very high transparency (low
alpha value) and large point radius. See if you can figure out what each line is
doing!
Note: Understanding this code is not required for the remainder of the text-
book. It is included for those readers who would like to use similar visualiza-
tions in their own data analyses.
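The first half of the code block (creating the grid of synthetic observations and predicting their labels) is not reproduced below; a sketch, assuming the hypothetical names are_grid, smo_grid, asgrid, and knnPredGrid, that produces the prediction_table used in the plotting code:
# create a grid of Area / Smoothness values covering the range of the data
are_grid <- seq(min(unscaled_cancer$Area), max(unscaled_cancer$Area), length.out = 100)
smo_grid <- seq(min(unscaled_cancer$Smoothness), max(unscaled_cancer$Smoothness), length.out = 100)
asgrid <- as_tibble(expand.grid(Area = are_grid, Smoothness = smo_grid))

# use the trained workflow to predict the label of each grid point
knnPredGrid <- predict(knn_fit, asgrid)

# bind the predictions to the grid points; rename so the color mapping below works
prediction_table <- bind_cols(knnPredGrid, asgrid) |>
  rename(Class = .pred_class)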
# plot:
# 1. the colored scatter of the original data
# 2. the faded colored scatter for the grid points
wkflw_plot <-
ggplot() +
geom_point(data = unscaled_cancer,
mapping = aes(x = Area,
y = Smoothness,
color = Class),
alpha = 0.75) +
geom_point(data = prediction_table,
mapping = aes(x = Area,
y = Smoothness,
color = Class),
alpha = 0.02,
size = 5) +
labs(color = "Diagnosis",
x = "Area (standardized)",
y = "Smoothness (standardized)") +
scale_color_manual(labels = c("Malignant", "Benign"),
values = c("orange2", "steelblue2")) +
theme(text = element_text(size = 12))
wkflw_plot
FIGURE 5.15: Scatter plot of smoothness versus area where background
color indicates the decision of the classifier.
5.9 Exercises
Practice exercises for the material covered in this chapter can be found in
the accompanying worksheets repository7 in the “Classification I: training and
predicting” row. You can launch an interactive version of the worksheet in your
browser by clicking the “launch binder” button. You can also preview a non-
interactive version of the worksheet by clicking “view worksheet.” If you instead
7 https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme
decide to download the worksheet and run it on your own machine, make sure
to follow the instructions for computer setup found in Chapter 13. This will
ensure that the automated feedback and guidance that the worksheets provide
will function as intended.
6 Classification II: evaluation & tuning
6.1 Overview
This chapter continues the introduction to predictive modeling through classi-
fication. While the previous chapter covered training and data preprocessing,
this chapter focuses on how to evaluate the accuracy of a classifier, as well as
how to improve the classifier (where possible) to maximize its accuracy.
6.3 Evaluating accuracy
How do we measure how “good” our classifier is? Let’s revisit the breast cancer images data1 [Street
et al., 1993] and think about how our classifier will be used in practice. A
biopsy will be performed on a new patient’s tumor, the resulting image will
be analyzed, and the classifier will be asked to decide whether the tumor is
benign or malignant. The key word here is new: our classifier is “good” if it
provides accurate predictions on data not seen during training. But then, how
can we evaluate our classifier without visiting the hospital to collect more
tumor images?
The trick is to split the data into a training set and test set (Figure 6.1)
and use only the training set when building the classifier. Then, to evaluate
the accuracy of the classifier, we first set aside the true labels from the test
set, and then use the classifier to predict the labels in the test set. If our
predictions match the true labels for the observations in the test set, then
we have some confidence that our classifier might also accurately predict the
class labels for new observations without known class labels.
Note: If there were a golden rule of machine learning, it might be this: you
cannot use the test data to build the model! If you do, the model gets to
“see” the test data in advance, making it look more accurate than it really is.
Imagine how bad it would be to overestimate your classifier’s accuracy when
predicting whether a patient’s tumor is malignant or benign!
How exactly can we assess how well our predictions match the true labels for
the observations in the test set? One way we can do this is to calculate the
prediction accuracy. This is the fraction of examples for which the classifier
made the correct prediction. To calculate this, we divide the number of correct
predictions by the number of predictions made.
The process for assessing if our predictions match the true labels in the test set
is illustrated in Figure 6.2. Note that there are other measures for how well
classifiers perform, such as precision and recall; these will not be discussed
here, but you will likely encounter them in other more advanced books on this
topic.
1 https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
FIGURE 6.1: Splitting the data into training and testing sets.
FIGURE 6.2: Process for splitting the data and finding the prediction accu-
racy.
6.4 Randomness and seeds
set.seed(1)
random_numbers1 <- sample(0:9, 10, replace=TRUE)
random_numbers1
## [1] 8 3 6 0 1 6 1 2 0 4
You can see that random_numbers1 is a list of 10 numbers from 0 to 9 that, from
all appearances, looks random. If we run the sample function again, we will get
a fresh batch of 10 numbers that also look random.
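The code for this second batch is not shown here; presumably it is another call to sample, stored under the name random_numbers2 used below:
random_numbers2 <- sample(0:9, 10, replace = TRUE)
random_numbers2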
## [1] 4 9 5 9 6 8 4 4 8 8
set.seed(1)
random_numbers1_again <- sample(0:9, 10, replace=TRUE)
random_numbers1_again
## [1] 8 3 6 0 1 6 1 2 0 4
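Similarly, the next chunk presumably regenerates the second batch under the name random_numbers2_again:
random_numbers2_again <- sample(0:9, 10, replace = TRUE)
random_numbers2_again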
## [1] 4 9 5 9 6 8 4 4 8 8
Notice that after setting the seed, we get the same two sequences of numbers
in the same order. random_numbers1 and random_numbers1_again produce the same
sequence of numbers, and the same can be said about random_numbers2 and
random_numbers2_again. And if we choose a different value for the seed—say,
4235—we obtain a different sequence of random numbers.
set.seed(4235)
random_numbers <- sample(0:9, 10, replace=TRUE)
random_numbers
## [1] 8 3 1 4 6 8 8 4 1 7
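The code producing the next batch is presumably another call to sample without resetting the seed, which is why the numbers differ from those above:
random_numbers <- sample(0:9, 10, replace = TRUE)
random_numbers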
## [1] 3 7 8 2 8 8 6 3 3 8
If you do not set the seed, your results will likely not be reproducible. And finally, be careful to set the
seed only once at the beginning of a data analysis. Each time you set the seed,
you are inserting your own human input, thereby influencing the analysis. If
you use set.seed many times throughout your analysis, the randomness that
R uses will not look as random as it should.
In summary: if you want your analysis to be reproducible, i.e., produce the
same result each time you run it, make sure to use set.seed exactly once at
the beginning of the analysis. Different argument values in set.seed lead to
different patterns of randomness, but as long as you pick the same argument
value your result will be the same. In the remainder of the textbook, we will
set the seed once at the beginning of each chapter.
6.5 Evaluating accuracy with tidymodels
Back to evaluating classifiers now! In R, we can use the tidymodels package not
only to perform 𝐾-nearest neighbors classification, but also to assess how well
our classification worked. Let’s work through an example of how to use tools
from tidymodels to evaluate a classifier using the breast cancer data set from
the previous chapter. We begin the analysis by loading the packages we re-
quire, reading in the breast cancer data, and then making a quick scatter plot
visualization of tumor cell concavity versus smoothness colored by diagnosis
in Figure 6.3. You will also notice that we set the random seed here at the be-
ginning of the analysis using the set.seed function, as described in Section 6.4.
# load packages
library(tidyverse)
library(tidymodels)
# load data
cancer <- read_csv("data/unscaled_wdbc.csv") |>
# convert the character Class variable to the factor datatype
mutate(Class = as_factor(Class))
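The code that builds the perim_concav scatter plot is not included here; a sketch consistent with the axis and legend labels in Figure 6.3:
perim_concav <- cancer |>
  ggplot(aes(x = Smoothness, y = Concavity, color = Class)) +
  geom_point(alpha = 0.5) +
  labs(x = "Smoothness", y = "Concavity", color = "Diagnosis") +
  scale_color_manual(labels = c("Malignant", "Benign"),
                     values = c("orange2", "steelblue2"))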
perim_concav
FIGURE 6.3: Scatter plot of tumor cell concavity versus smoothness colored
by diagnosis label.
The initial_split function from tidymodels, which we use to create the training and testing sets, does two things to ensure that the accuracy estimates from the test data are reasonable. First, it
shuffles the data before splitting, which ensures that any ordering present in
the data does not influence the data that ends up in the training and testing
sets. Second, it stratifies the data by the class label, to ensure that roughly
the same proportion of each class ends up in both the training and testing sets.
For example, in our data set, roughly 63% of the observations are from the
benign class (B), and 37% are from the malignant class (M), so initial_split
ensures that roughly 63% of the training data are benign, 37% of the training
data are malignant, and the same proportions exist in the testing data.
Let’s use the initial_split function to create the training and testing sets. We
will specify that prop = 0.75 so that 75% of our original data set ends up in
the training set. We will also set the strata argument to the categorical label
variable (here, Class) to ensure that the training and testing subsets contain
the right proportions of each category of observation. The training and testing
functions then extract the training and testing data sets into two separate data
frames. Note that the initial_split function uses randomness, but since we
set the seed earlier in the chapter, the split will be reproducible.
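A sketch of the split; cancer_split is an assumed name for the split object:
cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
cancer_train <- training(cancer_split)
cancer_test <- testing(cancer_split)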
glimpse(cancer_train)
## Rows: 426
## Columns: 12
## $ ID <dbl> 8510426, 8510653, 8510824, 857373, 857810, 858477, 8~
## $ Class <fct> B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B~
## $ Radius <dbl> 13.540, 13.080, 9.504, 13.640, 13.050, 8.618, 10.170~
## $ Texture <dbl> 14.36, 15.71, 12.44, 16.34, 19.31, 11.79, 14.88, 20.~
## $ Perimeter <dbl> 87.46, 85.63, 60.34, 87.21, 82.61, 54.34, 64.55, 54.~
## $ Area <dbl> 566.3, 520.0, 273.9, 571.8, 527.2, 224.5, 311.9, 221~
## $ Smoothness <dbl> 0.09779, 0.10750, 0.10240, 0.07685, 0.08060, 0.09752~
## $ Compactness <dbl> 0.08129, 0.12700, 0.06492, 0.06059, 0.03789, 0.05272~
## $ Concavity <dbl> 0.066640, 0.045680, 0.029560, 0.018570, 0.000692, 0.~
## $ Concave_Points <dbl> 0.047810, 0.031100, 0.020760, 0.017230, 0.004167, 0.~
## $ Symmetry <dbl> 0.1885, 0.1967, 0.1815, 0.1353, 0.1819, 0.1683, 0.27~
## $ Fractal_Dimension <dbl> 0.05766, 0.06811, 0.06905, 0.05953, 0.05501, 0.07187~
glimpse(cancer_test)
## Rows: 143
## Columns: 12
## $ ID <dbl> 84501001, 846381, 84799002, 849014, 852763, 853401, ~
## $ Class <fct> M, M, M, M, M, M, M, B, M, M, M, B, B, B, B, B, B, M~
## $ Radius <dbl> 12.460, 15.850, 14.540, 19.810, 14.580, 18.630, 16.7~
## $ Texture <dbl> 24.04, 23.95, 27.54, 22.15, 21.53, 25.11, 21.59, 18.~
## $ Perimeter <dbl> 83.97, 103.70, 96.73, 130.00, 97.41, 124.80, 110.10,~
## $ Area <dbl> 475.9, 782.7, 658.8, 1260.0, 644.8, 1088.0, 869.5, 5~
## $ Smoothness <dbl> 0.11860, 0.08401, 0.11390, 0.09831, 0.10540, 0.10640~
## $ Compactness <dbl> 0.23960, 0.10020, 0.15950, 0.10270, 0.18680, 0.18870~
## $ Concavity <dbl> 0.22730, 0.09938, 0.16390, 0.14790, 0.14250, 0.23190~
## $ Concave_Points <dbl> 0.085430, 0.053640, 0.073640, 0.094980, 0.087830, 0.~
## $ Symmetry <dbl> 0.2030, 0.1847, 0.2303, 0.1582, 0.2252, 0.2183, 0.18~
## $ Fractal_Dimension <dbl> 0.08243, 0.05338, 0.07077, 0.05395, 0.06924, 0.06197~
We can see from glimpse in the code above that the training set contains 426
observations, while the test set contains 143 observations. This corresponds to
a train / test split of 75% / 25%, as desired. Recall from Chapter 5 that we
use the glimpse function to view data with a large number of columns, as it
prints the data such that the columns go down the page (instead of across).
We can use group_by and summarize to find the percentage of malignant and
benign classes in cancer_train and we see about 63% of the training data are
benign and 37% are malignant, indicating that our class proportions were
roughly preserved when we split the data.
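A sketch of that calculation, with column names chosen to match the printed output, is:
# count each class and convert the counts to percentages
cancer_proportions <- cancer_train |>
  group_by(Class) |>
  summarize(n = n()) |>
  mutate(percent = 100 * n / nrow(cancer_train))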
cancer_proportions
## # A tibble: 2 x 3
## Class n percent
## <fct> <int> <dbl>
## 1 M 159 37.3
## 2 B 267 62.7
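A minimal sketch of the preprocessing recipe, model specification, and fitted workflow, assuming Smoothness and Concavity as the predictors (consistent with Figure 6.3) and 𝐾 = 3 (consistent with the kknn output below), is:
# standardize the predictors
cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-nearest neighbors specification with K = 3
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

# combine the recipe and model in a workflow and fit on the training data
knn_fit <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  fit(data = cancer_train)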
knn_fit
##
## * step_scale()
## * step_center()
##
## -- Model ------------------------------------------------------------------
-----
##
## Call:
## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(3, data, 5),
## kernel = ~"rectangular")
##
## Type of response variable: nominal
## Minimal misclassification: 0.1150235
## Best kernel: rectangular
## Best k: 3
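A sketch of producing the test-set predictions printed below, following the usual predict-then-bind pattern, is:
# predict on the test set and attach the predictions to the test data
cancer_test_predictions <- predict(knn_fit, cancer_test) |>
  bind_cols(cancer_test)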
cancer_test_predictions
## # A tibble: 143 x 13
## .pred_class ID Class Radius Texture Perimeter Area Smoothness
## <fct> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 M 84501001 M 12.5 24.0 84.0 476. 0.119
## 2 B 846381 M 15.8 24.0 104. 783. 0.0840
## 3 M 84799002 M 14.5 27.5 96.7 659. 0.114
## 4 M 849014 M 19.8 22.2 130 1260 0.0983
## 5 M 852763 M 14.6 21.5 97.4 645. 0.105
## 6 M 853401 M 18.6 25.1 125. 1088 0.106
## 7 B 854253 M 16.7 21.6 110. 870. 0.0961
## 8 B 854941 B 13.0 18.4 82.6 524. 0.0898
## 9 M 855138 M 13.5 20.8 88.4 559. 0.102
## 10 B 855167 M 13.4 21.6 86.2 563 0.0816
## # ... with 133 more rows, and 5 more variables: Compactness <dbl>,
## # Concavity <dbl>, Concave_Points <dbl>, Symmetry <dbl>,
## # Fractal_Dimension <dbl>
cancer_test_predictions |>
metrics(truth = Class, estimate = .pred_class) |>
filter(.metric == "accuracy")
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.860
In the metrics data frame, we filtered the .metric column since we are interested
in the accuracy row. Other entries involve more advanced metrics that are
beyond the scope of this book. Looking at the value of the .estimate variable
shows that the estimated accuracy of the classifier on the test data was 86%.
We can also look at the confusion matrix for the classifier, which shows the
table of predicted labels and correct labels, using the conf_mat function:
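A sketch of that call is:
cancer_test_predictions |>
  conf_mat(truth = Class, estimate = .pred_class)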
## Truth
## Prediction M B
## M 39 6
## B 14 84
cancer_proportions
## # A tibble: 2 x 3
## Class n percent
## <fct> <int> <dbl>
## 1 M 159 37.3
## 2 B 267 62.7
Since the benign class represents the majority of the training data, the majority
classifier would always predict that a new observation is benign. The estimated
accuracy of the majority classifier is usually fairly close to the majority class
proportion in the training data. In this case, we would suspect that the major-
ity classifier will have an accuracy of around 63%. The 𝐾-nearest neighbors
classifier we built does quite a bit better than this, with an accuracy of 86%.
This means that from the perspective of accuracy, the 𝐾-nearest neighbors
classifier improved quite a bit on the basic majority classifier. Hooray! But we
still need to be cautious; in this application, it is likely very important not
to misdiagnose any malignant tumors to avoid missing patients who actually
need medical care. The confusion matrix above shows that the classifier does,
indeed, misdiagnose a significant number of malignant tumors as benign (14
out of 53 malignant tumors, or 26%!). Therefore, even though the accuracy
improved upon the majority classifier, our critical analysis suggests that this
classifier may not have appropriate performance for the application.
6.6.1 Cross-validation
The first step in choosing the parameter 𝐾 is to be able to evaluate the
classifier using only the training data. If this is possible, then we can compare
the classifier’s performance for different values of 𝐾—and pick the best—using
only the training data. As suggested at the beginning of this section, we will
accomplish this by splitting the training data, training on one subset, and
evaluating on the other. The subset of training data used for evaluation is
often called the validation set.
There is, however, one key difference from the train/test split that we per-
formed earlier. In particular, we were forced to make only a single split of
the data. This is because at the end of the day, we have to produce a sin-
gle classifier. If we had multiple different splits of the data into training and
testing data, we would produce multiple different classifiers. But while we are
tuning the classifier, we are free to create multiple classifiers based on multiple
splits of the training data, evaluate them, and then choose a parameter value
based on all of the different results. If we just split our overall training data
once, our best parameter choice will depend strongly on whatever data was
lucky enough to end up in the validation set. Perhaps using multiple different
train/validation splits, we’ll get a better estimate of accuracy, which will lead
to a better choice of the number of neighbors 𝐾 for the overall set of training
data.
Let’s investigate this idea in R! In particular, we will generate five differ-
ent train/validation splits of our overall training data, train five different 𝐾-
nearest neighbors models, and evaluate their accuracy. We will start with just
a single split.
# create the 75/25 split of the training data into training and validation
cancer_split <- initial_split(cancer_train, prop = 0.75, strata = Class)
cancer_subtrain <- training(cancer_split)
cancer_validation <- testing(cancer_split)
# fit the knn model (we can reuse the old knn_spec model from before)
knn_fit <- workflow() |>
add_recipe(cancer_recipe) |>
add_model(knn_spec) |>
fit(data = cancer_subtrain)
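A sketch of the validation-set accuracy computation, assuming the object names used above, is:
# predict on the validation set and compute the accuracy estimate
validation_predicted <- predict(knn_fit, cancer_validation) |>
  bind_cols(cancer_validation)

acc <- validation_predicted |>
  metrics(truth = Class, estimate = .pred_class) |>
  filter(.metric == "accuracy") |>
  pull(.estimate)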
acc
## [1] 0.8878505
The accuracy estimate using this split is 88.8%. Now we repeat the above code
4 more times, which generates 4 more splits. Therefore we get five different
shuffles of the data, and therefore five different values for accuracy: 88.8%,
86.9%, 83.2%, 88.8%, 87.9%. None of these values are necessarily “more cor-
rect” than any other; they’re just five estimates of the true, underlying accu-
racy of our classifier built using our overall training data. We can combine the
estimates by taking their average (here 87%) to try to get a single assessment
of our classifier’s accuracy; this has the effect of reducing the influence of any
one (un)lucky validation set on the estimate.
In practice, we don’t use random splits, but rather use a more structured split-
ting procedure so that each observation in the data set is used in a validation
set only a single time. The name for this strategy is cross-validation. In
cross-validation, we split our overall training data into 𝐶 evenly sized
chunks. Then, we iteratively use 1 chunk as the validation set and combine the
remaining 𝐶 − 1 chunks as the training set. This procedure is shown in Fig-
ure 6.4. Here, 𝐶 = 5 different chunks of the data set are used, resulting in 5
different choices for the validation set; we call this 5-fold cross-validation.
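A minimal sketch of creating the folds, assuming 5 folds stratified by Class (consistent with the resampling output below), is:
# split the training data into 5 folds, stratified by the class label
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)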
Then, when we create our data analysis workflow, we use the fit_resamples
function instead of the fit function for training. This runs cross-validation on
each train/validation split.
# fit the knn model (we can reuse the old knn_spec model from before)
knn_fit <- workflow() |>
add_recipe(cancer_recipe) |>
add_model(knn_spec) |>
fit_resamples(resamples = cancer_vfold)
knn_fit
## # Resampling results
## # 5-fold cross-validation using stratification
## # A tibble: 5 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [340/86]> Fold1 <tibble [2 x 4]> <tibble [0 x 1]>
## 2 <split [340/86]> Fold2 <tibble [2 x 4]> <tibble [0 x 1]>
## 3 <split [341/85]> Fold3 <tibble [2 x 4]> <tibble [0 x 1]>
## 4 <split [341/85]> Fold4 <tibble [2 x 4]> <tibble [0 x 1]>
## 5 <split [342/84]> Fold5 <tibble [2 x 4]> <tibble [0 x 1]>
The collect_metrics function is used to aggregate the mean and standard error
of the classifier’s validation accuracy across the folds. You will find results
related to the accuracy in the row with accuracy listed under the .metric column.
You should consider the mean (mean) to be the estimated accuracy, while the
standard error (std_err) is a measure of how uncertain we are in the mean
value. A detailed treatment of this is beyond the scope of this chapter; but
roughly, if your estimated mean is 0.87 and standard error is 0.02, you can
expect the true average accuracy of the classifier to be somewhere roughly
between 85% and 89% (although it may fall outside this range). You may
ignore the other columns in the metrics data frame, as they do not provide
any additional insight. You can also ignore the entire second row with roc_auc
in the .metric column, as it is beyond the scope of this book.
knn_fit |>
collect_metrics()
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.871 5 0.0212 Preprocessor1_Model1
## 2 roc_auc binary 0.905 5 0.0188 Preprocessor1_Model1
We can choose any number of folds, and typically the more we use the better
our accuracy estimate will be (lower standard error). However, we are limited
by computational power: the more folds we choose, the more computation it
takes, and hence the more time it takes to run the analysis. So when you
do cross-validation, you need to consider the size of the data, the speed of
the algorithm (e.g., 𝐾-nearest neighbors), and the speed of your computer.
In practice, this is a trial-and-error process, but typically 𝐶 is chosen to be
either 5 or 10. Here we will try 10-fold cross-validation to see if we get a lower
standard error:
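A sketch of the 10-fold computation, assuming the same recipe and model specification as before, is:
# 10 folds instead of 5
cancer_vfold <- vfold_cv(cancer_train, v = 10, strata = Class)

vfold_metrics <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  fit_resamples(resamples = cancer_vfold) |>
  collect_metrics()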
vfold_metrics
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.887 10 0.0207 Preprocessor1_Model1
## 2 roc_auc binary 0.917 10 0.0134 Preprocessor1_Model1
In this case, using 10-fold instead of 5-fold cross validation did reduce the
standard error, although by only an insignificant amount. In fact, due to the
randomness in how the data are split, sometimes you might even end up with
a higher standard error when increasing the number of folds! We can make the
reduction in standard error more dramatic by increasing the number of folds
by a large amount. The following output shows the result when 𝐶 = 50;
picking such a large number of folds often takes a long time to run in practice,
so we usually stick to 5 or 10.
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.881 50 0.0158 Preprocessor1_Model1
## 2 roc_auc binary 0.911 50 0.0154 Preprocessor1_Model1
Then instead of using fit or fit_resamples, we will use the tune_grid function
to fit the model for each value in a range of parameter values. In particular, we
first create a data frame with a neighbors variable that contains the sequence
of values of 𝐾 to try; below we create the k_vals data frame with the neighbors
variable containing values from 1 to 100 (stepping by 5) using the seq function.
Then we pass that data frame to the grid argument of tune_grid.
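A sketch of this tuning step, with grid values chosen to match the printed results (1 to 96 in steps of 5), is:
# redefine the model specification so that the number of neighbors is tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# values of K to try
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

knn_results <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = cancer_vfold, grid = k_vals) |>
  collect_metrics()

# keep only the accuracy rows
accuracies <- knn_results |>
  filter(.metric == "accuracy")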
accuracies
## # A tibble: 20 x 7
## neighbors .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 accuracy binary 0.850 10 0.0209 Preprocessor1_Model01
## 2 6 accuracy binary 0.868 10 0.0216 Preprocessor1_Model02
## 3 11 accuracy binary 0.873 10 0.0201 Preprocessor1_Model03
## 4 16 accuracy binary 0.880 10 0.0175 Preprocessor1_Model04
## 5 21 accuracy binary 0.882 10 0.0179 Preprocessor1_Model05
## 6 26 accuracy binary 0.887 10 0.0168 Preprocessor1_Model06
## 7 31 accuracy binary 0.887 10 0.0180 Preprocessor1_Model07
## 8 36 accuracy binary 0.884 10 0.0166 Preprocessor1_Model08
## 9 41 accuracy binary 0.892 10 0.0152 Preprocessor1_Model09
## 10 46 accuracy binary 0.889 10 0.0156 Preprocessor1_Model10
## 11 51 accuracy binary 0.889 10 0.0155 Preprocessor1_Model11
## 12 56 accuracy binary 0.889 10 0.0155 Preprocessor1_Model12
## 13 61 accuracy binary 0.882 10 0.0174 Preprocessor1_Model13
## 14 66 accuracy binary 0.887 10 0.0170 Preprocessor1_Model14
## 15 71 accuracy binary 0.882 10 0.0167 Preprocessor1_Model15
## 16 76 accuracy binary 0.882 10 0.0167 Preprocessor1_Model16
## 17 81 accuracy binary 0.877 10 0.0161 Preprocessor1_Model17
## 18 86 accuracy binary 0.875 10 0.0188 Preprocessor1_Model18
## 19 91 accuracy binary 0.882 10 0.0158 Preprocessor1_Model19
## 20 96 accuracy binary 0.871 10 0.0165 Preprocessor1_Model20
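A sketch of the plotting code behind Figure 6.5, with axis labels taken from the figure, is:
accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate")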
accuracy_vs_k
0.89
Accuracy Estimate
0.88
0.87
0.86
0.85
0 25 50 75 100
Neighbors
FIGURE 6.5: Plot of estimated accuracy versus the number of neighbors.
• changing the value to a nearby one (e.g., adding or subtracting a small number) doesn’t decrease accuracy too much, so that our choice is reliable
in the presence of uncertainty;
• the cost of training the model is not prohibitive (e.g., in our situation, if 𝐾
is too large, predicting becomes expensive!).
We know that 𝐾 = 41 provides the highest estimated accuracy. Further, Figure
6.5 shows that the estimated accuracy changes by only a small amount if we
increase or decrease 𝐾 near 𝐾 = 41. And finally, 𝐾 = 41 does not create
a prohibitively expensive computational cost of training. Considering these
three points, we would indeed select 𝐾 = 41 for the classifier.
6.6.3 Under/Overfitting
To build a bit more intuition, what happens if we keep increasing the number
of neighbors 𝐾? In fact, the accuracy actually starts to decrease! Let’s specify
a much larger range of values of 𝐾 to try in the grid argument of tune_grid.
Figure 6.6 shows a plot of estimated accuracy as we vary 𝐾 from 1 to almost
the number of observations in the data set.
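A sketch of this wider search (the exact grid is an assumption) is:
# a much larger (assumed) grid of K values
k_lots <- tibble(neighbors = seq(from = 1, to = 380, by = 10))

accuracies_lots <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = cancer_vfold, grid = k_lots) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

accuracy_vs_k_lots <- ggplot(accuracies_lots, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate")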
accuracy_vs_k_lots
[FIGURE 6.6: Estimated accuracy versus the number of neighbors for a wide range of 𝐾.]
[FIGURE 6.7: Scatter plots of concavity versus smoothness showing the classifier for 𝐾 = 1, 7, 20, and 300.]
Both overfitting and underfitting are problematic and will lead to a model
that does not generalize well to new data. When fitting a model, we need to
strike a balance between the two. You can see these two effects in Figure 6.7,
which shows how the classifier changes as we set the number of neighbors 𝐾
to 1, 7, 20, and 300.
6.7 Summary
Classification algorithms use one or more quantitative variables to predict
the value of another categorical variable. In particular, the 𝐾-nearest neigh-
bors algorithm does this by first finding the 𝐾 points in the training data
nearest to the new observation, and then returning the majority class vote
from those training observations. We can evaluate a classifier by splitting the
data randomly into a training and test data set, using the training set to
build the classifier, and using the test set to estimate its accuracy. Finally,
we can tune the classifier (e.g., select the number of neighbors 𝐾 in 𝐾-NN)
by maximizing estimated accuracy via cross-validation. The overall process is
summarized in Figure 6.8.
1. Use the initial_split function to split the data into a training and
test set. Set the strata argument to the class label variable. Put the
test set aside for now.
2. Use the vfold_cv function to split up the training data for cross-
validation.
3. Create a recipe that specifies the class label and predictors, as well
as preprocessing steps for all variables. Pass the training data as the
data argument of the recipe.
4. Create a nearest_neighbor model specification, with neighbors =
tune().
5. Add the recipe and model specification to a workflow(), and use the tune_grid function on the cross-validation folds to estimate the classifier accuracy for a range of values of the number of neighbors.
6.8 Predictor variable selection
Note: This section is not required reading for the remainder of the textbook.
It is included for those readers interested in learning how irrelevant variables
can influence the performance of a classifier, and how to pick a subset of useful
variables to include as predictors.
In principle, you could use every variable in your data; the 𝐾-nearest neighbors algorithm accepts any number
of predictors. However, it is not the case that using more predictors always
yields better predictions! In fact, sometimes including irrelevant predictors can
actually negatively affect classifier performance.
cancer_irrelevant |>
select(Class, Smoothness, Concavity, Perimeter, Irrelevant1, Irrelevant2)
## # A tibble: 569 x 6
## Class Smoothness Concavity Perimeter Irrelevant1 Irrelevant2
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 M 0.118 0.300 123. 1 0
## 2 M 0.0847 0.0869 133. 0 0
## 3 M 0.110 0.197 130 0 0
## 4 M 0.142 0.241 77.6 0 1
## 5 M 0.100 0.198 135. 0 0
## 6 M 0.128 0.158 82.6 1 0
## 7 M 0.0946 0.113 120. 0 1
## 8 M 0.119 0.0937 90.2 1 0
## 9 M 0.127 0.186 87.5 0 0
## 10 M 0.119 0.227 84.0 1 1
## # ... with 559 more rows
The more irrelevant predictors are included, the more influence they have, and the more they corrupt the set of nearest neighbors
that vote on the class of the new observation to predict.
[FIGURE 6.9: Estimated accuracy versus the number of irrelevant predictors.]
Although the accuracy decreases as expected, one surprising thing about Fig-
ure 6.9 is that it shows that the method still outperforms the baseline majority
classifier (with about 63% accuracy) even with 40 irrelevant variables. How
could that be? Figure 6.10 provides the answer: the tuning procedure for the
𝐾-nearest neighbors classifier combats the extra randomness from the irrele-
vant variables by increasing the number of neighbors. Of course, because of all
the extra noise in the data from the irrelevant variables, the number of neigh-
bors does not increase smoothly; but the general trend is increasing. Figure
6.11 corroborates this evidence; if we fix the number of neighbors to 𝐾 = 3,
the accuracy falls off more quickly.
[FIGURE 6.10: Tuned number of neighbors versus the number of irrelevant predictors.]
FIGURE 6.11: Accuracy versus number of irrelevant predictors for tuned
and untuned number of neighbors.
Say you have 𝑚 total predictors to work with. In the first iteration, you have to
make 𝑚 candidate models, each with 1 predictor. Then in the second iteration,
you have to make 𝑚 − 1 candidate models, each with 2 predictors (the one you
chose before and a new one). This pattern continues for as many iterations as
you want. If you run the method all the way until you run out of predictors
to choose, you will end up training ½𝑚(𝑚 + 1) separate models. This is a big
improvement from the 2^𝑚 − 1 models that best subset selection requires you
to train! For example, while best subset selection requires training over 1000
candidate models with 𝑚 = 10 predictors, forward selection requires training
only 55 candidate models. Therefore we will continue the rest of this section
using forward selection.
Note: One word of caution before we move on. Every additional model that
you train increases the likelihood that you will get unlucky and stumble on
a model that has a high cross-validation accuracy estimate, but a low true
accuracy on the test data and other future observations. Since forward se-
lection involves training a lot of models, you run a fairly high risk of this
happening. To keep this risk low, only use forward selection when you have a
large amount of data and a relatively small total number of predictors. More
advanced methods do not suffer from this problem as much; see the additional
resources at the end of this chapter for where to learn more about advanced
predictor selection methods.
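A sketch of building that subset, consistent with the printed columns, is:
# keep the class label, three meaningful predictors, and three irrelevant ones
cancer_subset <- cancer_irrelevant |>
  select(Class, Smoothness, Concavity, Perimeter,
         Irrelevant1, Irrelevant2, Irrelevant3)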
cancer_subset
## # A tibble: 569 x 7
## Class Smoothness Concavity Perimeter Irrelevant1 Irrelevant2 Irrelevant3
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 M 0.118 0.300 123. 1 0 1
## 2 M 0.0847 0.0869 133. 0 0 0
## 3 M 0.110 0.197 130 0 0 0
## 4 M 0.142 0.241 77.6 0 1 0
## 5 M 0.100 0.198 135. 0 0 0
## 6 M 0.128 0.158 82.6 1 0 1
## 7 M 0.0946 0.113 120. 0 1 1
## 8 M 0.119 0.0937 90.2 1 0 0
## 9 M 0.127 0.186 87.5 0 0 1
## 10 M 0.119 0.227 84.0 1 1 0
## # ... with 559 more rows
The key idea of the forward selection code is to use the paste function (which
concatenates strings separated by spaces) to create a model formula for each
subset of predictors for which we want to build a model. The collapse argument
tells paste what to put between the items in the list; to make a formula, we
need to put a + symbol between each variable. As an example, let’s make a
model formula for all the predictors, which should output something like Class
~ Smoothness + Concavity + Perimeter + Irrelevant1 + Irrelevant2 + Irrelevant3:
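A sketch of that construction (a helper variable called names is an assumption) is:
# all candidate predictor names, joined into a single formula string
names <- colnames(cancer_subset |> select(-Class))
paste("Class", "~", paste(names, collapse = " + "))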
Finally, we need to write some code that performs the task of sequentially
finding the best predictor to add to the model. If you recall the end of the
wrangling chapter, we mentioned that sometimes one needs more flexible forms
of iteration than what we have used earlier, and in these cases one typically
resorts to a for loop; see the chapter on iteration2 in R for Data Science [Wick-
ham and Grolemund, 2016]. Here we will use two for loops: one over increasing
predictor set sizes (where you see for (i in 1:length(names)) below), and an-
other to check which predictor to add in each round (where you see for (j in
1:length(names)) below). For each set of predictors to try, we construct a model
formula, pass it into a recipe, build a workflow that tunes a 𝐾-NN classifier
using 5-fold cross-validation, and finally record the estimated accuracy.
2 https://r4ds.had.co.nz/iteration.html
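A sketch of the forward selection loop just described follows; the object names, fold object, and tuning grid are illustrative assumptions, and the accuracy table printed below is the kind of result such a loop produces.
# candidate predictors and cross-validation folds for the selection procedure
names <- colnames(cancer_subset |> select(-Class))
subset_vfold <- vfold_cv(cancer_subset, v = 5, strata = Class)

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# table to store the best model found at each size
accuracy_tbl <- tibble(size = integer(),
                       model_string = character(),
                       accuracy = numeric())
selected <- c()

for (i in 1:length(names)) {
  accs <- numeric(length(names))
  models <- character(length(names))
  for (j in 1:length(names)) {
    # formula with the already-selected predictors plus one candidate
    preds_new <- c(selected, names[j])
    model_string <- paste("Class", "~", paste(preds_new, collapse = "+"))

    # recipe, tuned K-NN workflow, and 5-fold CV accuracy for this formula
    cancer_recipe <- recipe(as.formula(model_string), data = cancer_subset) |>
      step_scale(all_predictors()) |>
      step_center(all_predictors())

    acc <- workflow() |>
      add_recipe(cancer_recipe) |>
      add_model(knn_spec) |>
      tune_grid(resamples = subset_vfold, grid = 10) |>
      collect_metrics() |>
      filter(.metric == "accuracy") |>
      summarize(mx = max(mean)) |>
      pull()

    accs[j] <- acc
    models[j] <- model_string
  }
  # keep the predictor whose addition gave the best accuracy this round
  jstar <- which.max(accs)
  accuracy_tbl <- add_row(accuracy_tbl,
                          size = i,
                          model_string = models[jstar],
                          accuracy = accs[jstar])
  selected <- c(selected, names[jstar])
  names <- names[-jstar]
}
accuracy_tbl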
## # A tibble: 6 x 3
## size model_string accuracy
## <int> <chr> <dbl>
## 1 1 Class ~ Perimeter 0.896
## 2 2 Class ~ Perimeter+Concavity 0.916
## 3 3 Class ~ Perimeter+Concavity+Smoothness 0.931
## 4 4 Class ~ Perimeter+Concavity+Smoothness+Irrelevant1 0.928
## 5 5 Class ~ Perimeter+Concavity+Smoothness+Irrelevant1+Irrelevant3 0.924
## 6 6 Class ~ Perimeter+Concavity+Smoothness+Irrelevant1+Irrelevant3~ 0.902
Interesting! The forward selection procedure first added the three meaningful
variables Perimeter, Concavity, and Smoothness, followed by the irrelevant vari-
ables. Figure 6.12 visualizes the accuracy versus the number of predictors in
the model. You can see that as meaningful predictors are added, the estimated
accuracy increases substantially; and as you add irrelevant variables, the ac-
curacy either exhibits small fluctuations or decreases as the model attempts
to tune the number of neighbors to account for the extra noise. In order to
pick the right model from the sequence, you have to balance high accuracy
and model simplicity (i.e., having fewer predictors and a lower chance of over-
fitting). The way to find that balance is to look for the elbow in Figure 6.12,
i.e., the place on the plot where the accuracy stops increasing dramatically
and levels off or begins to decrease. The elbow in Figure 6.12 appears to occur
at the model with 3 predictors; after that point the accuracy levels off. So
here the right trade-off of accuracy and number of predictors occurs with 3
variables: Class ~ Perimeter + Concavity + Smoothness. In other words, we have
successfully removed irrelevant predictors from the model! It is always worth
remembering, however, that what cross-validation gives you is an estimate of
the true accuracy; you have to use your judgement when looking at this plot
to decide where the elbow occurs, and whether adding a variable provides a
meaningful increase in accuracy.
FIGURE 6.12: Estimated accuracy versus the number of predictors for the
sequence of models built using forward selection.
6.9 Exercises
Practice exercises for the material covered in this chapter can be found in the
accompanying worksheets repository3 in the “Classification II: evaluation and
tuning” row. You can launch an interactive version of the worksheet in your
browser by clicking the “launch binder” button. You can also preview a non-
interactive version of the worksheet by clicking “view worksheet.” If you instead
decide to download the worksheet and run it on your own machine, make sure
to follow the instructions for computer setup found in Chapter 13. This will
ensure that the automated feedback and guidance that the worksheets provide
will function as intended.
3 https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme
7
Regression I: K-nearest neighbors
7.1 Overview
This chapter continues our foray into answering predictive questions. Here we
will focus on predicting numerical variables and will use regression to perform
this task. This is unlike the past two chapters, which focused on predicting
categorical variables via classification. However, regression does have many
similarities to classification: for example, just as in the case of classification,
we will split our data into training, validation, and test sets, we will use tidy-
models workflows, we will use a K-nearest neighbors (KNN) approach to make
predictions, and we will use cross-validation to choose K. Because of how sim-
ilar these procedures are, make sure to read Chapters 5 and 6 before reading
this one—we will move a little bit faster here with the concepts that have al-
ready been covered. This chapter will primarily focus on the case where there
is a single predictor, but the end of the chapter shows how to perform regres-
sion with more than one predictor variable, i.e., multivariable regression. It is
important to note that regression can also be used to answer inferential and
causal questions; however, that is beyond the scope of this book.
• Evaluate KNN regression prediction accuracy in R using a test data set and
the root mean squared prediction error (RMSPE).
• In the context of KNN regression, compare and contrast goodness of fit and
prediction properties (namely RMSE vs RMSPE).
• Describe the advantages and disadvantages of K-nearest neighbors regres-
sion.
library(tidyverse)
library(tidymodels)
library(gridExtra)
set.seed(5)
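A sketch of reading the data (the file path is an assumption) is:
# read the Sacramento housing data and preview it
sacramento <- read_csv("data/sacramento.csv")
sacramento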
## # A tibble: 932 x 9
## city zip beds baths sqft type price latitude longitude
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
The scientific question guides our initial exploration: the columns in the data
that we are interested in are sqft (house size, in livable square feet) and price
(house sale price, in US dollars (USD)). The first step is to visualize the data
as a scatter plot where we place the predictor variable (house size) on the x-
axis, and we place the target/response variable that we want to predict (sale
price) on the y-axis.
Note: Given that the y-axis unit is dollars in Figure 7.1, we format the axis
labels to put dollar signs in front of the house prices, as well as commas to
increase the readability of the larger numbers. We can do this in R by passing
the dollar_format function (from the scales package) to the labels argument of
the scale_y_continuous function.
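A sketch of the plotting code behind Figure 7.1 is:
# scatter plot of price versus house size, with dollar-formatted y-axis labels
eda <- ggplot(sacramento, aes(x = sqft, y = price)) +
  geom_point(alpha = 0.4) +
  labs(x = "House size (square feet)", y = "Price (USD)") +
  scale_y_continuous(labels = scales::dollar_format())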
eda
The plot is shown in Figure 7.1. We can see that in Sacramento, CA, as the
size of a house increases, so does its sale price. Thus, we can reason that we
may be able to use the size of a not-yet-sold house (for which we don’t know
the sale price) to predict its final sale price. Note that we do not suggest here
that a larger house size causes a higher sale price; just that house price tends
FIGURE 7.1: Scatter plot of price (USD) versus house size (square feet).
to increase with house size, and that we may be able to use the latter to
predict the former.
Next let’s say we come across a 2,000 square-foot house in Sacramento that we are interested in purchasing, with an advertised list price of $350,000.
small_plot
[FIGURE 7.2: Scatter plot of price (USD) versus house size (square feet), with the new 2,000 square-foot house of interest indicated.]
We will employ the same intuition from the classification chapter, and use the
neighboring points to the new point of interest to suggest/predict what its
sale price might be. For the example shown in Figure 7.2, we find and label
the 5 nearest neighbors to our observation of a house that is 2,000 square feet.
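A sketch of finding those neighbors, consistent with the diff column in the printed output, is:
# the 5 houses whose size is closest to 2,000 square feet
nearest_neighbors <- sacramento |>
  mutate(diff = abs(2000 - sqft)) |>
  slice_min(diff, n = 5)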
nearest_neighbors
## # A tibble: 5 x 10
## city zip beds baths sqft type price latitude longitude diff
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ROSEVILLE z95661 3 2 2049 Residenti~ 395500 38.7 -121. 49
## 2 ANTELOPE z95843 4 3 2085 Residenti~ 408431 38.7 -121. 85
## 3 SACRAMENTO z95823 4 2 1876 Residenti~ 299940 38.5 -121. 124
## 4 ROSEVILLE z95747 3 2.5 1829 Residenti~ 306500 38.8 -121. 171
## 5 SACRAMENTO z95825 4 2 1776 Multi_Fam~ 221250 38.6 -121. 224
[FIGURE 7.3: Scatter plot of price (USD) versus house size (square feet), highlighting the 5 nearest neighbors to the 2,000 square-foot house of interest.]
Figure 7.3 illustrates the difference between the house sizes of the 5 nearest
neighbors (in terms of house size) to our new 2,000 square-foot house of in-
terest. Now that we have obtained these nearest neighbors, we can use their
values to predict the sale price for the new home. Specifically, we can take the
mean (or average) of these 5 values as our predicted value, as illustrated by
the red point in Figure 7.4.
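A sketch of that calculation is:
# average the neighbors' sale prices to obtain the prediction
prediction <- nearest_neighbors |>
  summarize(predicted = mean(price))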
prediction
## # A tibble: 1 x 1
## predicted
## <dbl>
## 1 326324.
[FIGURE 7.4: Scatter plot of price (USD) versus house size (square feet), with the predicted price of the 2,000 square-foot house shown as a red point.]
Our predicted price is $326,324 (shown as a red point in Figure 7.4), which
is much less than $350,000; perhaps we might want to offer less than the list
price at which the house is advertised. But this is only the very beginning of
the story. We still have all the same unanswered questions here with KNN
regression that we had with KNN classification: which 𝐾 do we choose, and is
our model any good at making predictions? In the next few sections, we will
address these questions in the context of KNN regression.
One strength of the KNN regression algorithm that we would like to draw
attention to at this point is its ability to work well with non-linear relationships
(i.e., if the relationship is not a straight line). This stems from the use of nearest
neighbors to predict values. The algorithm really has very few assumptions
about what the data must look like for it to work.
$$\text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where:
• $n$ is the number of observations,
• $y_i$ is the observed value for the $i$th observation, and
• $\hat{y}_i$ is the forecasted/predicted value for the $i$th observation.
In other words, we compute the squared difference between the predicted and
true response value for each observation in our test (or validation) set, compute
the average, and then finally take the square root. The reason we use the
squared difference (and not just the difference) is that the differences can be
positive or negative, i.e., we can overshoot or undershoot the true response
value. Figure 7.5 illustrates both positive and negative differences between predicted and true response values.
[FIGURE 7.5: Scatter plot of price (USD) versus house size (square feet), illustrating positive and negative differences between predicted and true response values.]
Note: When using many code packages (tidymodels included), the evaluation
output we will get to assess the prediction quality of our KNN regression
models is labeled “RMSE”, or “root mean squared error”. Why is this so, and
why not RMSPE? In statistics, we try to be very precise with our language to
indicate whether we are calculating the prediction error on the training data
(in-sample prediction) versus on the testing data (out-of-sample prediction).
When predicting and evaluating prediction quality on the training data, we
say RMSE. By contrast, when predicting and evaluating prediction quality on
the testing or validation data, we say RMSPE. The equation for calculating
RMSE and RMSPE is exactly the same; all that changes is whether the 𝑦s
are training or testing data. But many people just use RMSE for both, and
rely on context to denote which data the root mean squared error is being
calculated on.
Now that we know how we can assess how well our model predicts a numeri-
cal value, let’s use R to perform cross-validation and to choose the optimal 𝐾.
First, we will create a recipe for preprocessing our data. Note that we include
standardization in our preprocessing to build good habits, but since we only
have one predictor, it is technically not necessary; there is no risk of comparing
two predictors of different scales. Next we create a model specification for K-
nearest neighbors regression. Note that we use set_mode(”regression”) now in
the model specification to denote a regression problem, as opposed to the classi-
fication problems from the previous chapters. The use of set_mode(”regression”)
essentially tells tidymodels that we need to use different metrics (RMSPE, not
accuracy) for tuning and evaluation. Then we create a 5-fold cross-validation
object, and put the recipe and model specification together in a workflow.
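A sketch of those steps, assuming a training set named sacramento_train from an earlier train/test split, is:
# recipe that centers and scales the single predictor
sacr_recipe <- recipe(price ~ sqft, data = sacramento_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN regression specification with K to be tuned
sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# 5-fold cross-validation object (stratification by price is an assumption)
sacr_vfold <- vfold_cv(sacramento_train, v = 5, strata = price)

# combine the recipe and model specification in a workflow
sacr_wkflw <- workflow() |>
  add_recipe(sacr_recipe) |>
  add_model(sacr_spec)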
sacr_wkflw
## == Workflow ==================================================================
## Preprocessor: Recipe
## Model: nearest_neighbor()
##
## -- Preprocessor -----------------------------------------------------------
-----
## 2 Recipe Steps
##
## * step_scale()
## * step_center()
##
## -- Model ------------------------------------------------------------------
-----
## K-Nearest Neighbor Model Specification (regression)
##
## Main Arguments:
## neighbors = tune()
## weight_func = rectangular
##
## Computational engine: kknn
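A sketch of the tuning step whose collected metrics are printed below (the grid of 1 to 199 in steps of 3 is an assumption chosen to match the 67 rows) is:
# grid of K values and the cross-validated RMSPE for each
gridvals <- tibble(neighbors = seq(from = 1, to = 200, by = 3))

sacr_results <- sacr_wkflw |>
  tune_grid(resamples = sacr_vfold, grid = gridvals) |>
  collect_metrics() |>
  filter(.metric == "rmse")
sacr_results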
## # A tibble: 67 x 7
## neighbors .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 rmse standard 113086. 5 996. Preprocessor1_Model01
## 2 4 rmse standard 93911. 5 3078. Preprocessor1_Model02
## 3 7 rmse standard 87395. 5 2882. Preprocessor1_Model03
## 4 10 rmse standard 86203. 5 3014. Preprocessor1_Model04
FIGURE 7.6: Effect of the number of neighbors on the RMSPE.
Figure 7.6 visualizes how the RMSPE varies with the number of neighbors
𝐾. We take the minimum RMSPE to find the best setting for the number of
neighbors:
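A sketch of that filter is:
# the row with the smallest cross-validated RMSPE
sacr_min <- sacr_results |>
  filter(mean == min(mean))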
sacr_min
## # A tibble: 1 x 7
## neighbors .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 37 rmse standard 85227. 5 2177. Preprocessor1_Model13
FIGURE 7.7: Predicted values for house price (represented as a blue line)
from KNN regression models for six different values for 𝐾.
We would like a model that (1) follows the overall “trend” in the training data,
so the model actually uses the training data to learn something useful, and (2)
does not follow the noisy fluctuations, so that we can be confident that our
model will transfer/generalize well to other new data. If we explore the other
values for 𝐾, in particular 𝐾 = 37 (as suggested by cross-validation), we can
see it achieves this goal: it follows the increasing trend of house price versus
house size, but is not influenced too much by the idiosyncratic variations in
price. All of this is similar to how the choice of 𝐾 affects K-nearest neighbors
classification, as discussed in the previous chapter.
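A sketch of retraining with the chosen 𝐾 and evaluating on the held-out test set (sacramento_test and the other object names are assumptions) is:
# refit on the full training set using the best K, then evaluate on the test set
kmin <- sacr_min |> pull(neighbors)

sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = kmin) |>
  set_engine("kknn") |>
  set_mode("regression")

sacr_fit <- workflow() |>
  add_recipe(sacr_recipe) |>
  add_model(sacr_spec) |>
  fit(data = sacramento_train)

sacr_summary <- sacr_fit |>
  predict(sacramento_test) |>
  bind_cols(sacramento_test) |>
  metrics(truth = price, estimate = .pred) |>
  filter(.metric == "rmse")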
sacr_summary
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 89279.
Our final model’s test error as assessed by RMSPE is $89,279. Note that RM-
SPE is measured in the same units as the response variable. In other words, on
new observations, we expect the error in our prediction to be roughly $89,279.
From one perspective, this is good news: this is about the same as the cross-
validation RMSPE estimate of our tuned model (which was $85,227), so we can
say that the model appears to generalize well to new data that it has never
seen before. However, much like in the case of KNN classification, whether
this value for RMSPE is good—i.e., whether an error of around $89,279 is
acceptable—depends entirely on the application. In this application, this er-
ror is not prohibitively large, but it is not negligible either; $89,279 might
represent a substantial fraction of a home buyer’s budget, and could make or
break whether or not they could afford to put an offer on a house.
Finally, Figure 7.8 shows the predictions that our final model makes across the
range of house sizes we might encounter in the Sacramento area—from 500 to
5000 square feet. You have already seen a few plots like this in this chapter,
but here we also provide the code that generated it as a learning challenge.
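The exact code is not reproduced here; a sketch along those lines (the prediction grid and object names are assumptions) is:
# predictions across a grid of house sizes, overlaid on the data
sqft_grid <- tibble(sqft = seq(from = 500, to = 5000, by = 10))

sacr_preds <- sacr_fit |>
  predict(sqft_grid) |>
  bind_cols(sqft_grid)

plot_final <- ggplot(sacramento, aes(x = sqft, y = price)) +
  geom_point(alpha = 0.4) +
  geom_line(data = sacr_preds, aes(x = sqft, y = .pred), color = "blue") +
  labs(x = "House size (square feet)", y = "Price (USD)", title = "K = 37") +
  scale_y_continuous(labels = scales::dollar_format())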
plot_final
FIGURE 7.8: Predicted values of house price (blue line) for the final KNN
regression model.
FIGURE 7.9: Scatter plot of the sale price of houses versus the number of
bedrooms.
Figure 7.9 shows that as the number of bedrooms increases, the house sale
price tends to increase as well, but that the relationship is quite weak. Does
adding the number of bedrooms to our model improve our ability to predict
price? To answer that question, we will have to create a new KNN regression
model using house size and number of bedrooms, and then we can compare it
to the model we previously came up with that only used house size. Let’s do
that now!
First we’ll build a new model specification and recipe for the analysis. Note
that we use the formula price ~ sqft + beds to denote that we have two
predictors, and set neighbors = tune() to tell tidymodels to tune the number of
neighbors for us.
Next, we’ll use 5-fold cross-validation to choose the number of neighbors via
the minimum RMSPE:
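A sketch of that tuning step (the grid of 1 to 200 neighbors is an assumption consistent with the printed .config label) is:
# recipe and specification with two predictors, tuned via 5-fold CV
sacr_recipe_mult <- recipe(price ~ sqft + beds, data = sacramento_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

sacr_spec_mult <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

sacr_multi <- workflow() |>
  add_recipe(sacr_recipe_mult) |>
  add_model(sacr_spec_mult) |>
  tune_grid(resamples = sacr_vfold, grid = tibble(neighbors = seq(from = 1, to = 200))) |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  filter(mean == min(mean))
sacr_multi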
## # A tibble: 1 x 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 12 rmse standard 82648. 5 2365. Preprocessor1_Model012
Here we see that the smallest estimated RMSPE from cross-validation occurs
when 𝐾 = 12. If we want to compare this multivariable KNN regression model
to the model with only a single predictor as part of the model tuning process
(e.g., if we are running forward selection as described in the chapter on eval-
uating and tuning classification models), then we must compare the RMSPE
estimated using only the training data via cross-validation. Looking back, the
estimated cross-validation RMSPE for the single-predictor model was 85,227.
The estimated cross-validation RMSPE for the multivariable model is 82,648.
Thus in this case, we did not improve the model by a large amount by adding
this additional predictor.
Regardless, let’s continue the analysis to see how we can make predictions
with a multivariable KNN regression model and evaluate its performance on
test data. We first need to re-train the model on the entire training data set
with 𝐾 = 12, and then use that model to make predictions on the test data.
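A sketch of that retraining and evaluation is:
# retrain with K = 12 on the full training set and evaluate on the test set
sacr_spec_mult <- nearest_neighbor(weight_func = "rectangular", neighbors = 12) |>
  set_engine("kknn") |>
  set_mode("regression")

knn_mult_fit <- workflow() |>
  add_recipe(sacr_recipe_mult) |>
  add_model(sacr_spec_mult) |>
  fit(data = sacramento_train)

knn_mult_mets <- knn_mult_fit |>
  predict(sacramento_test) |>
  bind_cols(sacramento_test) |>
  metrics(truth = price, estimate = .pred) |>
  filter(.metric == "rmse")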
knn_mult_mets
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 90953.
This time, when we performed KNN regression on the same data set, but also
included number of bedrooms as a predictor, we obtained a RMSPE test error
of 90,953. Figure 7.10 visualizes the model’s predictions overlaid on top of the
data. This time the predictions are a surface in 3D space, instead of a line in
2D space, as we have 2 predictors instead of 1.
We can see that the predictions in this case, where we have 2 predictors, form
a surface instead of a line. Because the newly added predictor (number of
bedrooms) is related to price (as price changes, so does number of bedrooms)
and is not totally determined by house size (our other predictor), we get
additional and useful information for making our predictions. For example,
in this model we would predict that the cost of a house with a size of 2,500
square feet generally increases slightly as the number of bedrooms increases.
Without having the additional predictor of number of bedrooms, we would
predict the same price for these two houses.
7.11 Exercises
Practice exercises for the material covered in this chapter can be found in the
accompanying worksheets repository2 in the “Regression I: K-nearest neigh-
bors” row. You can launch an interactive version of the worksheet in your
browser by clicking the “launch binder” button. You can also preview a non-
interactive version of the worksheet by clicking “view worksheet.” If you instead
decide to download the worksheet and run it on your own machine, make sure
to follow the instructions for computer setup found in Chapter 13. This will
ensure that the automated feedback and guidance that the worksheets provide
will function as intended.
2 https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme
8
Regression II: linear regression
8.1 Overview
Up to this point, we have solved all of our predictive problems—both classi-
fication and regression—using K-nearest neighbors (KNN)-based approaches.
In the context of regression, there is another commonly used method known as
linear regression. This chapter provides an introduction to the basic concept
of linear regression, shows how to use tidymodels to perform linear regression
in R, and characterizes its strengths and weaknesses compared to KNN regres-
sion. The focus is, as usual, on the case where there is a single predictor and
single response variable of interest; but the chapter concludes with an example
using multivariable linear regression when there is more than one predictor.
Note: Although we did not cover it in earlier chapters, there is another pop-
ular method for classification called logistic regression (it is used for clas-
sification even though the name, somewhat confusingly, has the word “re-
gression” in it). In logistic regression—similar to linear regression—you “fit”
the model to the training data and then “look up” the prediction for each
new observation. Logistic regression and KNN classification have an advan-
tage/disadvantage comparison similar to that of linear regression and KNN
regression. It is useful to have a good understanding of linear regression before
learning about logistic regression. After reading this chapter, see the “Addi-
tional Resources” section at the end of the classification chapters to learn
more about logistic regression.
Let’s return to the Sacramento housing data from Chapter 7 to learn how to
apply linear regression and compare it to KNN regression. For now, we will
consider a smaller version of the housing data to help make our visualizations
clear. Recall our predictive question: can we use the size of a house in the
Sacramento, CA area to predict its sale price? In particular, recall that we have
come across a new 2,000 square-foot house we are interested in purchasing with
an advertised list price of $350,000. Should we offer the list price, or is that
over/undervalued? To answer this question using simple linear regression, we
use the data we have to draw the straight line of best fit through our existing
data points. The small subset of data as well as the line of best fit are shown
in Figure 8.1.
[FIGURE 8.1: Scatter plot of price (USD) versus house size (square feet) for a small sample of the data, with the line of best fit.]
[FIGURE 8.2: Scatter plot of price (USD) versus house size (square feet), with the line of best fit and the predicted price of $295,564 for a 2,000 square-foot house.]
By using simple linear regression on this small data set to predict the sale
price for a 2,000 square-foot house, we get a predicted value of $295,564. But
wait a minute… how exactly does simple linear regression choose the line of
best fit? Many different lines could be drawn through the data points. Some
plausible examples are shown in Figure 8.3.
Simple linear regression chooses the straight line of best fit by choosing the
line that minimizes the average squared vertical distance between itself
and each of the observed data points in the training data. Figure 8.4 illustrates
these vertical distances as red lines. Finally, to assess the predictive accuracy
of a simple linear regression model, we use RMSPE—the same measure of
predictive performance we used with KNN regression.
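In symbols (a standard way to state this criterion, not a quotation from the text): writing the candidate line as $\beta_0 + \beta_1 x$ for a house of size $x$, simple linear regression chooses the coefficients $\beta_0$ and $\beta_1$ that minimize

$$\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2,$$

where $x_i$ and $y_i$ are the size and sale price of the $i$th house in the training data.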
[FIGURE 8.3: Scatter plot of price (USD) versus house size (square feet), with several plausible lines drawn through the data points.]
[FIGURE 8.4: Scatter plot of price (USD) versus house size (square feet), showing the vertical distances between a line and the observed data points.]
library(tidyverse)
library(tidymodels)
set.seed(1234)
Now that we have our training data, we will create the model specification
and recipe, and fit our simple linear regression model:
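A sketch of those steps, assuming a training set named sacramento_train, is:
# linear regression specification, recipe, and fitted workflow
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

lm_recipe <- recipe(price ~ sqft, data = sacramento_train)

lm_fit <- workflow() |>
  add_recipe(lm_recipe) |>
  add_model(lm_spec) |>
  fit(data = sacramento_train)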
lm_fit
Note: An additional difference that you will notice here is that we do not stan-
dardize (i.e., scale and center) our predictors. In K-nearest neighbors models,
recall that the model fit changes depending on whether we standardize first
or not. In linear regression, standardization does not affect the fit (it does
affect the coefficients in the equation, though!). So you can standardize if you
want—it won’t hurt anything—but if you leave the predictors in their original
form, the best fit coefficients are usually easier to interpret afterward.
Our coefficients are (intercept) 𝛽0 = 12292 and (slope) 𝛽1 = 140. This means
that the equation of the line of best fit is

house sale price = 12292 + 140 ⋅ (house size).
In other words, the model predicts that houses start at $12,292 for 0 square
feet, and that every extra square foot increases the cost of the house by $140.
Finally, we predict on the test data set to assess how well our model does:
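A sketch of that evaluation, assuming a test set named sacramento_test, is:
lm_test_results <- lm_fit |>
  predict(sacramento_test) |>
  bind_cols(sacramento_test) |>
  metrics(truth = price, estimate = .pred)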
lm_test_results
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 82342.
## 2 rsq standard 0.596
## 3 mae standard 60555.
Our final model’s test error as assessed by RMSPE is 82,342. Remember that
this is in units of the target/response variable, and here that is US Dollars
(USD). Does this mean our model is “good” at predicting house sale price
based off of the predictor of home size? Again, answering this is tricky and
requires knowledge of how you intend to use the prediction.
To visualize the simple linear regression model, we can plot the predicted house
sale price across all possible house sizes we might encounter superimposed on
a scatter plot of the original housing price data. There is a plotting function
in the tidyverse, geom_smooth, that allows us to add a layer on our plot with
the simple linear regression predicted line of best fit. By default geom_smooth
adds some other information to the plot that we are not interested in at this
point; we provide the argument se = FALSE to tell geom_smooth not to show that
information. Figure 8.5 displays the result.
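A sketch of the plotting code behind Figure 8.5 is:
lm_plot_final <- ggplot(sacramento, aes(x = sqft, y = price)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "House size (square feet)", y = "Price (USD)") +
  scale_y_continuous(labels = scales::dollar_format())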
lm_plot_final
FIGURE 8.5: Scatter plot of sale price versus size with line of best fit for
the full Sacramento housing data.
We can extract the coefficients from our model by accessing the fit object that
is output by the fit function; we first have to extract it from the workflow using
the pull_workflow_fit function, and then apply the tidy function to convert the
result into a data frame:
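A sketch of that extraction is:
coeffs <- lm_fit |>
  pull_workflow_fit() |>
  tidy()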
coeffs
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 12292. 9141. 1.34 1.79e- 1
## 2 sqft 140. 5.00 28.0 4.06e-108
[FIGURE 8.6: Comparison of simple linear regression and KNN regression predictions of price (USD) versus house size (square feet), with the test-set RMSPE printed for each.]
When the relationship between the response and the predictor is far from linear, a simple linear regression model will have a quite high RMSE when assessing model goodness of fit on the training
data and a quite high RMSPE when assessing model prediction quality on a
test data set. On such a data set, KNN regression may fare better. Additionally,
there are other types of regression you can learn about in future books that
may do even better at predicting with such data.
How do these two models compare on the Sacramento house prices data set?
In Figure 8.6, we also printed the RMSPE as calculated from predicting on
the test data set that was not used to train/fit the models. The RMSPE for
the simple linear regression model is slightly lower than the RMSPE for the
KNN regression model. Considering that the simple linear regression model
is also more interpretable, if we were comparing these in practice we would
likely choose to use the simple linear regression model.
Finally, note that the KNN regression model becomes “flat” at the left and
right boundaries of the data, while the linear model predicts a constant slope.
Predicting outside the range of the observed data is known as extrapolation;
KNN and linear models behave quite differently when extrapolating. Depend-
ing on the application, the flat or constant slope trend may make more sense.
For example, if our housing data were slightly different, the linear model may
have actually predicted a negative price for a small house (if the intercept 𝛽0
was negative), which obviously does not match reality. On the other hand, the
trend of increasing house size corresponding to increasing house price probably
continues for large houses, so the “flat” extrapolation of KNN likely does not
match reality.
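A sketch of fitting the multivariable linear regression discussed in this section (object names are assumptions) is:
# multivariable linear regression: house size and number of bedrooms as predictors
mlm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

mlm_recipe <- recipe(price ~ sqft + beds, data = sacramento_train)

mlm_fit <- workflow() |>
  add_recipe(mlm_recipe) |>
  add_model(mlm_spec) |>
  fit(data = sacramento_train)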
mlm_fit
And finally, we make predictions on the test data set to assess the quality of
our model:
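A sketch of that evaluation is:
lm_mult_test_results <- mlm_fit |>
  predict(sacramento_test) |>
  bind_cols(sacramento_test) |>
  metrics(truth = price, estimate = .pred)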
lm_mult_test_results
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 81418.
## 2 rsq standard 0.606
## 3 mae standard 59310.
Our model’s test error as assessed by RMSPE is 81,418. In the case of two
predictors, the predictions made by our linear regression form a plane of best
fit, as shown in Figure 8.7.
FIGURE 8.7: Linear regression plane of best fit overlaid on top of the data
(using price, house size, and number of bedrooms as predictors). Note that
in general we recommend against using 3D visualizations; here we use a 3D
visualization only to illustrate what the regression plane looks like for learning
purposes.
We see that the predictions from linear regression with two predictors form
a flat plane. This is the hallmark of linear regression, and differs from the
wiggly, flexible surface we get from other methods such as KNN regression.
As discussed, this can be advantageous in one aspect, which is that for each
predictor, we can get slopes/intercept from linear regression, and thus describe
the plane mathematically. We can extract those slope values from our model
object as shown below:
mcoeffs
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 63475. 13426. 4.73 2.88e- 6
## 2 sqft 166. 7.03 23.6 1.07e-85
## 3 beds -28761. 5628. -5.11 4.43e- 7
And then use those slopes to write a mathematical equation to describe the
prediction plane:

house sale price = 𝛽0 + 𝛽1 ⋅ (house size) + 𝛽2 ⋅ (number of bedrooms),
where:
• 𝛽0 is the vertical intercept of the hyperplane (the price when both house size
and number of bedrooms are 0)
• 𝛽1 is the slope for the first predictor (how quickly the price changes as you
increase house size, holding number of bedrooms constant)
• 𝛽2 is the slope for the second predictor (how quickly the price changes as
you increase the number of bedrooms, holding house size constant)
Finally, we can fill in the values for 𝛽0 , 𝛽1 and 𝛽2 from the model output above
to create the equation of the plane of best fit to the data:
house sale price = 63475 + 166 ⋅ (house size) − 28761 ⋅ (number of bedrooms)
This model is more interpretable than the multivariable KNN regression model;
we can write a mathematical equation that explains how each predictor is af-
fecting the predictions. But as always, we should question how well multivari-
able linear regression is doing compared to the other tools we have, such as
simple linear regression and multivariable KNN regression. If this comparison
is part of the model tuning process—for example, if we are trying out many dif-
ferent sets of predictors for multivariable linear and KNN regression—we must
perform this comparison using cross-validation on only our training data. But
if we have already decided on a small number (e.g., 2 or 3) of tuned candidate
models and we want to make a final comparison, we can do so by comparing
the prediction error of the methods on the test data.
lm_mult_test_results
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 81418.
## 2 rsq standard 0.606
## 3 mae standard 59310.
8.7.1 Outliers
Outliers are data points that do not follow the usual pattern of the rest of the
data. In the setting of linear regression, these are points that have a vertical
distance to the line of best fit that is either much higher or much lower than
you might expect based on the rest of the data. The problem with outliers is
that they can have too much influence on the line of best fit. In general, it
is very difficult to judge accurately which data are outliers without advanced
techniques that are beyond the scope of this book.
But to illustrate what can happen when you have outliers, Figure 8.8 shows
a small subset of the Sacramento housing data again, except we have added a
single data point (highlighted in red). This house is 5,000 square feet in size,
and sold for only $50,000. Unbeknownst to the data analyst, this house was
sold by a parent to their child for an absurdly low price. Of course, this is not
representative of the real housing market values that the other data points
follow; the data point is an outlier. In blue we plot the original line of best
fit, and in red we plot the new line of best fit including the outlier. You can
see how different the red line is from the blue line, which is entirely caused by
that one extra outlier data point.
Fortunately, if you have enough data, the inclusion of one or two outliers—as
long as their values are not too wild—will typically not have a large effect
on the line of best fit. Figure 8.9 shows how that same outlier data point
from earlier influences the line of best fit when we are working with the entire
original Sacramento training data. You can see that with this larger data set,
the line changes much less when adding the outlier. Nevertheless, it is still
important when working with linear regression to critically think about how
much any individual data point is influencing the model.
FIGURE 8.9: Scatter plot of the full data, with outlier highlighted in red.
8.7.2 Multicollinearity
The second, and much more subtle, issue can occur when performing multi-
variable linear regression. In particular, if you include multiple predictors that
are strongly linearly related to one another, the coefficients that describe the
plane of best fit can be very unreliable—small changes to the data can result
in large changes in the coefficients. Consider an extreme example using the
Sacramento housing data, where each house’s size was measured twice by two different people.
Since the two people are each slightly inaccurate, the two measurements might
not agree exactly, but they are very strongly linearly related to each other, as
shown in Figure 8.10.
If we again fit the multivariable linear regression model on this data, then the
plane of best fit has regression coefficients that are very sensitive to the exact
values in the data. For example, if we change the data ever so slightly—e.g.,
by running cross-validation, which splits up the data randomly into different
chunks—the coefficients vary by large amounts:
Best Fit 1: house sale price = 3682 + (−43) ⋅ (house size 1 (ft²)) + (182) ⋅ (house size 2 (ft²)).

Best Fit 2: house sale price = 20596 + (312) ⋅ (house size 1 (ft²)) + (−172) ⋅ (house size 2 (ft²)).

Best Fit 3: house sale price = 6673 + (37) ⋅ (house size 1 (ft²)) + (104) ⋅ (house size 2 (ft²)).
Therefore, when performing multivariable linear regression, it is important to
avoid including very linearly related predictors. However, techniques for doing
so are beyond the scope of this book; see the list of additional resources at the
end of this chapter to find out where you can learn more.
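To see this instability for yourself, the following small simulation (not the book’s Sacramento analysis; all names and numbers here are illustrative) fits the same two-predictor model on two random halves of a synthetic data set in which the two size measurements are nearly identical:

library(tidyverse)
set.seed(2022)

n <- 100
size1 <- runif(n, 500, 3000)              # first measurement of house size
size2 <- size1 + rnorm(n, sd = 10)        # second, slightly noisy measurement
price <- 50000 + 150 * size1 + rnorm(n, sd = 30000)
houses <- tibble(price, size1, size2)

# fit the same model on two random halves and compare the coefficients;
# they typically differ wildly because size1 and size2 are almost collinear
coef(lm(price ~ size1 + size2, data = slice_sample(houses, prop = 0.5)))
coef(lm(price ~ size1 + size2, data = slice_sample(houses, prop = 0.5)))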
8.8 Designing new predictors
housing market and homeowner ice cream preferences). In cases like these, the
only option is to obtain measurements of more useful variables.
There are, however, a wide variety of cases where the predictor variables do
have a meaningful relationship with the response variable, but that relation-
ship does not fit the assumptions of the regression method you have chosen.
For example, a data frame df with two variables—x and y—that have a nonlinear
relationship will not be fully captured by simple linear regression, as shown
in Figure 8.11.
df
## # A tibble: 100 x 2
## x y
## <dbl> <dbl>
## 1 0.102 0.0720
## 2 0.800 0.532
## 3 0.478 0.148
## 4 0.972 1.01
## 5 0.846 0.677
## 6 0.405 0.157
## 7 0.879 0.768
## 8 0.130 0.0402
## 9 0.852 0.576
## 10 0.180 0.0847
## # ... with 90 more rows
One way to handle this is to engineer a new predictor that captures the nonlinearity.
For example, we can create a transformed predictor z = x^3 and use it in place of x:
df <- df |>
  mutate(z = x^3)
Then we can perform linear regression for y using the predictor variable z, as
shown in Figure 8.12. Here you can see that the transformed predictor z helps
the linear regression model make more accurate predictions. Note that none
of the y response values have changed between Figures 8.11 and 8.12; the only
change is that the x values have been replaced by z values.
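The regression on the transformed predictor is not shown above; a minimal sketch, using base R’s lm() for brevity (the tidymodels workflow used elsewhere in this chapter would work just as well), is:

# fit a simple linear regression of y on the transformed predictor z
lm_z_fit <- lm(y ~ z, data = df)
summary(lm_z_fit)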
The process of transforming predictors (and potentially combining multiple
predictors in the process) is known as feature engineering.
Note: Feature engineering is part of tuning your model, and as such you must
not use your test data to evaluate the quality of the features you produce.
You are free to use cross-validation, though!
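One way to respect this rule is to express the transformation as a preprocessing step inside a tidymodels recipe, so that it is re-applied within each cross-validation fold rather than computed once on all the data. A sketch (the names cubic_recipe and df_train are hypothetical):

library(tidymodels)

# create the engineered feature z = x^3 inside a recipe so that it becomes
# part of the tuning workflow and never touches the test data
cubic_recipe <- recipe(y ~ x, data = df_train) |>
  step_mutate(z = x^3) |>
  step_rm(x)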
8.10 Exercises
Practice exercises for the material covered in this chapter can be found in the
accompanying worksheets repository1 in the “Regression II: linear regression”
row. You can launch an interactive version of the worksheet in your browser
by clicking the “launch binder” button. You can also preview a non-interactive
version of the worksheet by clicking “view worksheet.” If you instead decide to
download the worksheet and run it on your own machine, make sure to follow
the instructions for computer setup found in Chapter 13. This will ensure
that the automated feedback and guidance that the worksheets provide will
function as intended.
1 https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme
9
Clustering
9.1 Overview
As part of exploratory data analysis, it is often helpful to see if there are
meaningful subgroups (or clusters) in the data. This grouping can be used
for many purposes, such as generating new questions or improving predictive
analyses. This chapter provides an introduction to clustering using the K-
means algorithm, including techniques to choose the number of clusters.
9.3 Clustering
Clustering is a data analysis technique involving separating a data set into
subgroups of related data. For example, we might use clustering to separate
a data set of documents into groups that correspond to topics, a data set
of human genetic information into groups that correspond to ancestral sub-
populations, or a data set of online customers into groups that correspond to
purchasing behaviors. Once the data are separated, we can, for example, use
the subgroups to generate new questions about the data and follow up with
a predictive modeling exercise. In this course, clustering will be used only for
exploratory analysis, i.e., uncovering patterns in the data.
Note that clustering is a fundamentally different kind of task than classifica-
tion or regression. In particular, both classification and regression are super-
vised tasks where there is a response variable (a category label or value), and
we have examples of past data with labels/values that help us predict those of
future data. By contrast, clustering is an unsupervised task, as we are trying to
understand and examine the structure of data without any response variable
labels or values to help us. This approach has both advantages and disad-
vantages. Clustering requires no additional annotation or input on the data.
For example, while it would be nearly impossible to annotate all the articles
on Wikipedia with human-made topic labels, we can cluster the articles with-
out this information to find groupings corresponding to topics automatically.
However, given that there is no response variable, it is not as easy to evaluate
the “quality” of a clustering. With classification, we can use a test data set to
assess prediction performance. In clustering, there is not a single good choice
for evaluation. In this book, we will use visualization to ascertain the quality
of a clustering, and leave rigorous evaluation for more advanced courses.
As in the case of classification, there are many possible methods that we
could use to cluster our observations to look for subgroups. In this book, we
will focus on the widely used K-means algorithm [Lloyd, 1982]. In your future
studies, you might encounter hierarchical clustering, principal component anal-
ysis, multidimensional scaling, and more; see the additional resources section
at the end of this chapter for where to begin learning more about these other
methods.
Note: There are also so-called semisupervised tasks, where only some of the
data come with response variable labels/values, but the vast majority don’t.
The goal is to try to uncover underlying structure in the data that allows one
to guess the missing labels. This sort of task is beneficial, for example, when
one has an unlabeled data set that is too large to manually label, but one is
willing to provide a few informative example labels as a “seed” to guess the
labels for all the data.
An illustrative example
Here we will present an illustrative example using a data set from the
palmerpenguins R package [Horst et al., 2020]. This data set was collected by Dr. Kris-
ten Gorman and the Palmer Station, Antarctica Long Term Ecological Re-
search Site, and includes measurements for adult penguins found near there
[Gorman et al., 2014]. We have modified the data set for use in this chapter.
Here we will focus on using two variables—penguin bill and flipper length,
both in millimeters—to determine whether there are distinct types of pen-
guins in our data. Understanding this might help us with species discovery
and classification in a data-driven way.
Before we get started, we will load the tidyverse metapackage as well as set
a random seed. This will ensure we have access to the functions we need and
that our analysis will be reproducible. As we will learn in more detail later in
the chapter, setting the seed here is important because the K-means clustering
algorithm uses random numbers.
library(tidyverse)
set.seed(1)
## # A tibble: 18 x 2
## flipper_length_standardized bill_length_standardized
## <dbl> <dbl>
## 1 -0.190 -0.641
## 2 -1.33 -1.14
## 3 -0.922 -1.52
## 4 -0.922 -1.11
## 5 -1.41 -0.847
## 6 -0.678 -0.641
## 7 -0.271 -1.24
## 8 -0.434 -0.902
## 9 1.19 0.720
## 10 1.36 0.646
## 11 1.36 0.963
## 12 1.76 0.440
## 13 1.11 1.21
## 14 0.786 0.123
## 15 -0.271 0.627
## 16 -0.271 0.757
## 17 -0.108 1.78
## 18 -0.759 0.776
Next, we can create a scatter plot using this data set to see if we can detect
subtypes or groups in our data set.
FIGURE 9.2: Scatter plot of standardized bill length versus standardized
flipper length.
Based on the visualization in Figure 9.2, we might suspect there are a few
subtypes of penguins within our data set. We can see roughly 3 groups of
observations in Figure 9.2, including:
1. a group with small flipper length and small bill length,
2. a group with small flipper length and large bill length, and
3. a group with large flipper length and large bill length.
FIGURE 9.3: Scatter plot of standardized bill length versus standardized
flipper length with colored groups.
What are the labels for these groups? Unfortunately, we don’t have any. K-
means, like almost all clustering algorithms, just outputs meaningless “cluster
labels” that are typically whole numbers: 1, 2, 3, etc. But in a simple case like
this, where we can easily visualize the clusters on a scatter plot, we can give
human-made labels to the groups using their positions on the plot:
• small flipper length and small bill length (orange cluster),
• small flipper length and large bill length (blue cluster), and
• large flipper length and large bill length (yellow cluster).
Once we have made these determinations, we can use them to inform our
species classifications or ask further questions about our data. For example, we
might be interested in understanding the relationship between flipper length
and bill length, and that relationship may differ depending on the type of
penguin we have.
9.4 K-means
9.4.1 Measuring cluster quality
The K-means algorithm is a procedure that groups data into K clusters. It
starts with an initial clustering of the data, and then iteratively improves it
by making adjustments to the assignment of data to clusters until it cannot
improve any further. But how do we measure the “quality” of a clustering,
and what does it mean to improve it? In K-means clustering, we measure
the quality of a cluster by its within-cluster sum-of-squared-distances (WSSD).
Computing this involves two steps. First, we find the cluster centers by com-
puting the mean of each variable over data points in the cluster. For example,
suppose we have a cluster containing four observations, and we are using two
variables, 𝑥 and 𝑦, to cluster the data. Then we would compute the coordinates,
𝜇𝑥 and 𝜇𝑦 , of the cluster center via
𝜇𝑥 = (𝑥1 + 𝑥2 + 𝑥3 + 𝑥4 )/4,    𝜇𝑦 = (𝑦1 + 𝑦2 + 𝑦3 + 𝑦4 )/4.
In the first cluster from the example, there are 4 data points. These are
shown with their cluster center (flipper_length_standardized = -0.35 and
bill_length_standardized = 0.99) highlighted in Figure 9.4.
The second step in computing the WSSD is to add up the squared distance
between each point in the cluster and the cluster center. We use the straight-
line / Euclidean distance formula that we learned about in Chapter 5. In the
4-observation cluster example above, we would compute the WSSD 𝑆 2 via

𝑆 2 = (𝑥1 − 𝜇𝑥 )2 + (𝑦1 − 𝜇𝑦 )2 + (𝑥2 − 𝜇𝑥 )2 + (𝑦2 − 𝜇𝑦 )2 + (𝑥3 − 𝜇𝑥 )2 + (𝑦3 − 𝜇𝑦 )2 + (𝑥4 − 𝜇𝑥 )2 + (𝑦4 − 𝜇𝑦 )2 .
These distances are denoted by lines in Figure 9.5 for the first cluster of the
penguin data example.
FIGURE 9.5: Cluster 1 from the penguin_data data set example. Observations
are in blue, with the cluster center highlighted in red. The distances from the
observations to the cluster center are represented as black lines.
The larger the value of 𝑆 2 , the more spread out the cluster is, since large 𝑆 2
means that points are far from the cluster center. Note, however, that “large”
is relative to both the scale of the variables for clustering and the number of
points in the cluster. A cluster where points are very close to the center might
still have a large 𝑆 2 if there are many data points in the cluster.
After we have calculated the WSSD for all the clusters, we sum them together
to get the total WSSD. For our example, this means adding up all the squared
distances for the 18 observations. These distances are denoted by black lines
in Figure 9.6.
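A small sketch of this computation with dplyr, assuming the standardized measurements and a cluster label live in a data frame called clustered_data with columns flipper_length_standardized, bill_length_standardized, and .cluster (all of these names are assumptions for illustration):

library(tidyverse)

wssds <- clustered_data |>
  group_by(.cluster) |>
  mutate(
    center_flipper = mean(flipper_length_standardized),  # cluster center, x coordinate
    center_bill    = mean(bill_length_standardized)      # cluster center, y coordinate
  ) |>
  summarize(
    wssd = sum((flipper_length_standardized - center_flipper)^2 +
               (bill_length_standardized - center_bill)^2)
  )

# sum the per-cluster WSSDs to obtain the total WSSD
total_wssd <- sum(wssds$wssd)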
FIGURE 9.6: All clusters from the penguin_data data set example. Observa-
tions are in orange, blue, and yellow with the cluster center highlighted in red.
The distances from the observations to each of the respective cluster centers
are represented as black lines.
These two steps are repeated until the cluster assignments no longer change.
We show what the first four iterations of K-means would look like in
Figure 9.8. There each row corresponds to an iteration, where the left column
depicts the center update, and the right column depicts the reassignment of
data to clusters.
Note that at this point, we can terminate the algorithm since none of the
assignments changed in the fourth iteration; both the centers and labels will
remain the same from this point onward.
FIGURE 9.7: Random initialization of labels.
What kind of data is suitable for K-means clustering? In the simplest version
of K-means clustering that we have presented here, the straight-line distance
is used to measure the distance between observations and cluster centers. This
means that only quantitative data should be used with this algorithm. There
are variants on the K-means algorithm, as well as other clustering algorithms
entirely, that use other distance metrics to allow for non-quantitative data to
be clustered. These, however, are beyond the scope of this book.
FIGURE 9.9: Random initialization of labels.
Figure 9.10 shows what the iterations of K-means would look like with the
unlucky random initialization shown in Figure 9.9.
This looks like a relatively bad clustering of the data, but K-means cannot
improve it. To solve this problem when clustering data using K-means, we
should randomly re-initialize the labels a few times, run K-means for each
initialization, and pick the clustering that has the lowest final total WSSD.
9.4.4 Choosing K
In order to cluster data using K-means, we also have to pick the number
of clusters, K. But unlike in classification, we have no response variable and
no notion of prediction error to tell us whether a particular choice of K is a
good one; instead, we can examine how the total WSSD changes as K varies.
FIGURE 9.11: Clustering of the penguin data for K clusters ranging from 1
to 9. Cluster centers are indicated by larger points that are outlined in black.
If we set K less than 3, then the clustering merges separate groups of data;
this causes a large total WSSD, since the cluster center is not close to any of
the data in the cluster. On the other hand, if we set K greater than 3, the
clustering subdivides subgroups of data; this does indeed still decrease the
total WSSD, but by only a diminishing amount. If we plot the total WSSD
versus the number of clusters, we see that the decrease in total WSSD levels
off (or forms an “elbow shape”) when we reach roughly the right number of
clusters (Figure 9.12).
FIGURE 9.12: Total WSSD for K clusters ranging from 1 to 9.
9.5 Data pre-processing for K-means
## # A tibble: 18 x 2
## bill_length_mm flipper_length_mm
## <dbl> <dbl>
## 1 39.2 196
## 2 36.5 182
## 3 34.5 187
## 4 36.7 187
## 5 38.1 181
## 6 39.2 190
## 7 36 195
## 8 37.8 193
## 9 46.5 213
## 10 46.1 215
## 11 47.8 215
## 12 45 220
## 13 49.1 212
## 14 43.3 208
## 15 46 195
## 16 46.7 195
## 17 52.2 197
## 18 46.8 189
And then we apply the scale function to every column in the data frame using
mutate and across.
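The standardization code itself is not shown; a minimal sketch, assuming the unscaled measurements above are stored in a data frame (here called penguin_data, a name chosen for illustration), is:

# apply scale() to every column, producing standardized measurements
standardized_data <- penguin_data |>
  mutate(across(everything(), scale))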
standardized_data
## # A tibble: 18 x 2
## bill_length_mm[,1] flipper_length_mm[,1]
## <dbl> <dbl>
## 1 -0.641 -0.190
## 2 -1.14 -1.33
## 3 -1.52 -0.922
## 4 -1.11 -0.922
## 5 -0.847 -1.41
## 6 -0.641 -0.678
## 7 -1.24 -0.271
## 8 -0.902 -0.434
## 9 0.720 1.19
## 10 0.646 1.36
## 11 0.963 1.36
## 12 0.440 1.76
## 13 1.21 1.11
## 14 0.123 0.786
## 15 0.627 -0.271
## 16 0.757 -0.271
## 17 1.78 -0.108
## 18 0.776 -0.759
9.6 K-means in R
To perform K-means clustering in R, we use the kmeans function. It takes at
least two arguments: the data frame containing the data you wish to cluster,
and K, the number of clusters (here we choose K = 3). Note that the K-means
algorithm uses a random initialization of assignments; but since we set the
random seed earlier, the clustering will be reproducible.
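The call itself is not displayed above; based on this description, it would look something like the following, using the standardized_data data frame from the previous section:

# run K-means with 3 clusters on the standardized penguin data
penguin_clust <- kmeans(standardized_data, centers = 3)
penguin_clust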
## Available components:
##
## [1] ”cluster” ”centers” ”totss” ”withinss” ”tot.withinss”
## [6] ”betweenss” ”size” ”iter” ”ifault”
As you can see above, the clustering object returned by kmeans has a lot of
information that can be used to visualize the clusters, pick K, and evaluate
the total WSSD. To obtain this information in a tidy format, we will call
in help from the broom package. Let’s start by visualizing the clustering as a
colored scatter plot. To do that, we use the augment function, which takes in
the model and the original data frame, and returns a data frame with the data
and the cluster assignments for each point:
library(broom)
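A sketch of that call (the name clustered_data is ours, chosen for illustration):

# attach the cluster assignment (.cluster) to each observation
clustered_data <- augment(penguin_clust, standardized_data)
clustered_data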
## # A tibble: 18 x 3
## bill_length_mm[,1] flipper_length_mm[,1] .cluster
## <dbl> <dbl> <fct>
## 1 -0.641 -0.190 2
## 2 -1.14 -1.33 2
## 3 -1.52 -0.922 2
## 4 -1.11 -0.922 2
## 5 -0.847 -1.41 2
## 6 -0.641 -0.678 2
## 7 -1.24 -0.271 2
## 8 -0.902 -0.434 2
## 9 0.720 1.19 3
## 10 0.646 1.36 3
## 11 0.963 1.36 3
## 12 0.440 1.76 3
## 13 1.21 1.11 3
## 14 0.123 0.786 3
## 15 0.627 -0.271 1
## 16 0.757 -0.271 1
## 17 1.78 -0.108 1
## 18 0.776 -0.759 1
Now that we have this information in a tidy data frame, we can make a
visualization of the cluster assignments for each point, as shown in Figure
9.13.
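The plotting code is not shown above; a sketch that builds the cluster_plot object from the augmented data frame in the previous sketch (keeping the scaled column names as they appear in the output) might be:

cluster_plot <- ggplot(clustered_data,
                       aes(x = flipper_length_mm,
                           y = bill_length_mm,
                           color = .cluster)) +
  geom_point() +
  labs(x = "Flipper Length (standardized)",
       y = "Bill Length (standardized)",
       color = "Cluster") +
  theme(text = element_text(size = 12))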
cluster_plot
FIGURE 9.13: The data colored by the cluster assignments returned by
K-means.
glance(penguin_clust)
## # A tibble: 1 x 4
## totss tot.withinss betweenss iter
## <dbl> <dbl> <dbl> <int>
## 1 34 4.47 29.5 1
To calculate the total WSSD for a variety of Ks, we will create a data frame
with a column named k with rows containing each value of K we want to run
K-means with (here, 1 to 9).
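The code for this step is not shown; a minimal sketch consistent with the output below is:

penguin_clust_ks <- tibble(k = 1:9)
penguin_clust_ks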
## # A tibble: 9 x 1
## k
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
Then we use rowwise + mutate to apply the kmeans function within each row to
each K. However, given that the kmeans function returns a model object to us
(not a vector), we will need to store the results as a list column. This works
because both vectors and lists are legitimate data structures for data frame
columns. To make this work, we have to put each model object in a list using
the list function. We demonstrate how to do this below:
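A sketch of this step, reusing standardized_data from above:

penguin_clust_ks <- penguin_clust_ks |>
  rowwise() |>
  mutate(penguin_clusts = list(kmeans(standardized_data, centers = k)))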
If we take a look at our data frame penguin_clust_ks now, we see that it has
two columns: one with the value for K, and the other holding the clustering
model object in a list column.
penguin_clust_ks
## # A tibble: 9 x 2
## # Rowwise:
## k penguin_clusts
## <int> <list>
## 1 1 <kmeans>
## 2 2 <kmeans>
## 3 3 <kmeans>
## 4 4 <kmeans>
## 5 5 <kmeans>
## 6 6 <kmeans>
## 7 7 <kmeans>
## 8 8 <kmeans>
## 9 9 <kmeans>
If we wanted to get one of the clusterings out of the list column in the data
frame, we could use a familiar friend: pull. pull will return to us a data frame
column as a simpler data structure; here, that would be a list. And then to
extract the first item of the list, we can use the pluck function. We pass
it the index for the element we would like to extract (here, 1).
penguin_clust_ks |>
pull(penguin_clusts) |>
pluck(1)
Next, we use mutate again to apply glance to each of the K-means clustering
objects to get the clustering statistics (including WSSD). The output of glance
is a data frame, and so we need to create another list column (using list) for
this to work. This results in a complex data frame with 3 columns, one for K,
one for the K-means clustering objects, and one for the clustering statistics:
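A sketch of this step:

penguin_clust_ks <- penguin_clust_ks |>
  # the data frame is still rowwise, so penguin_clusts refers to a single
  # kmeans object here, and glance() produces a one-row tibble for each
  mutate(glanced = list(glance(penguin_clusts)))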
penguin_clust_ks
## # A tibble: 9 x 3
## # Rowwise:
## k penguin_clusts glanced
## <int> <list> <list>
## 1 1 <kmeans> <tibble [1 x 4]>
## 2 2 <kmeans> <tibble [1 x 4]>
## 3 3 <kmeans> <tibble [1 x 4]>
## 4 4 <kmeans> <tibble [1 x 4]>
## 5 5 <kmeans> <tibble [1 x 4]>
## 6 6 <kmeans> <tibble [1 x 4]>
## 7 7 <kmeans> <tibble [1 x 4]>
## 8 8 <kmeans> <tibble [1 x 4]>
## 9 9 <kmeans> <tibble [1 x 4]>
Finally we extract the total WSSD from the column named glanced. Given
that each item in this list column is a data frame, we will need to use the
unnest function to unpack the data frames into simpler column data types.
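A sketch of this step:

# unpack the one-row glance() tibbles into ordinary columns
clustering_statistics <- penguin_clust_ks |>
  unnest(glanced)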
clustering_statistics
## # A tibble: 9 x 6
## k penguin_clusts totss tot.withinss betweenss iter
## <int> <list> <dbl> <dbl> <dbl> <int>
## 1 1 <kmeans> 34 34 7.11e-15 1
## 2 2 <kmeans> 34 10.9 2.31e+ 1 1
## 3 3 <kmeans> 34 4.47 2.95e+ 1 1
## 4 4 <kmeans> 34 3.54 3.05e+ 1 1
## 5 5 <kmeans> 34 2.23 3.18e+ 1 2
## 6 6 <kmeans> 34 2.15 3.19e+ 1 3
## 7 7 <kmeans> 34 1.53 3.25e+ 1 2
## 8 8 <kmeans> 34 2.46 3.15e+ 1 1
## 9 9 <kmeans> 34 0.843 3.32e+ 1 2
Now that we have tot.withinss and k as columns in a data frame, we can make
a line plot (Figure 9.14) and search for the “elbow” to find which value of K
to use.
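The plotting code is not shown; a minimal sketch that would produce a line plot like Figure 9.14:

elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
  geom_point() +
  geom_line() +
  labs(x = "K", y = "Total within-cluster sum of squares") +
  scale_x_continuous(breaks = 1:9)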
elbow_plot
FIGURE 9.14: A plot showing the total WSSD versus the number of clusters.
It looks like 3 clusters is the right choice for this data. But why is there a
“bump” in the total WSSD plot here? Shouldn’t total WSSD always decrease
as we add more clusters? Technically yes, but remember: K-means can get
“stuck” in a bad solution. Unfortunately, for K = 8 we had an unlucky ini-
tialization and found a bad clustering! We can help prevent finding a bad
clustering by trying a few different random initializations via the nstart ar-
gument (Figure 9.15 shows a setup where we use 10 restarts). When we do
this, K-means clustering will be performed the number of times specified by
the nstart argument, and R will return to us the best clustering from this.
The more times we perform K-means clustering, the more likely we are to find
a good clustering (if one exists). What value should you choose for nstart?
The answer is that it depends on many factors: the size and characteristics
of your data set, as well as how powerful your computer is. The larger the
nstart value the better from an analysis perspective, but there is a trade-off
that doing many clusterings could take a long time. So this is something that
needs to be balanced.
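A sketch of the earlier rowwise workflow with 10 random restarts added via the nstart argument:

penguin_clust_ks <- tibble(k = 1:9) |>
  rowwise() |>
  mutate(penguin_clusts = list(kmeans(standardized_data, centers = k, nstart = 10)),
         glanced = list(glance(penguin_clusts)))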
elbow_plot
FIGURE 9.15: A plot showing the total WSSD versus the number of clusters
when K-means is run with 10 restarts.
9.7 Exercises
Practice exercises for the material covered in this chapter can be found in
the accompanying worksheets repository2 in the “Clustering” row. You can
launch an interactive version of the worksheet in your browser by clicking the
“launch binder” button. You can also preview a non-interactive version of the
worksheet by clicking “view worksheet.” If you instead decide to download the
worksheet and run it on your own machine, make sure to follow the instructions
for computer setup found in Chapter 13. This will ensure that the automated
feedback and guidance that the worksheets provide will function as intended.
9.8 Additional resources
hierarchical clustering for when you expect there to be subgroups, and then
subgroups within subgroups, etc., in your data. In the realm of more gen-
eral unsupervised learning, it covers principal components analysis (PCA),
which is a very popular technique for reducing the number of predictors in
a dataset.
10
Statistical inference
10.1 Overview
A typical data analysis task in practice is to draw conclusions about some un-
known aspect of a population of interest based on observed data sampled from
that population; we typically do not get data on the entire population. Data
analysis questions regarding how summaries, patterns, trends, or relationships
in a data set extend to the wider population are called inferential questions.
This chapter will start with the fundamental ideas of sampling from popula-
tions and then introduce two common techniques in statistical inference: point
estimation and interval estimation.
tion of studio apartment rentals that cost more than $1000 per month. The
question we want to answer will help us determine the parameter we want to
estimate. If we were somehow able to observe the whole population of studio
apartment rental offerings in Vancouver, we could compute each of these num-
bers exactly; therefore, these are all population parameters. There are many
kinds of observations and population parameters that you will run into in
practice, but in this chapter, we will focus on two settings: estimating a
population proportion using a categorical variable, and estimating a population
mean using a quantitative variable.
library(tidyverse)
set.seed(123)
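The code that reads in the Airbnb listings data is not displayed; a sketch, where the file name listings.csv is hypothetical:

airbnb <- read_csv("listings.csv")
airbnb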
## # A tibble: 4,594 x 8
## id neighbourhood room_type accommodates bathrooms bedrooms beds price
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Downtown Entire hom~ 5 2 baths 2 2 150
## 2 2 Downtown Easts~ Entire hom~ 4 2 baths 2 2 132
## 3 3 West End Entire hom~ 2 1 bath 1 1 85
## 4 4 Kensington-Ced~ Entire hom~ 2 1 bath 1 0 146
## 5 5 Kensington-Ced~ Entire hom~ 4 1 bath 1 2 110
## 6 6 Hastings-Sunri~ Entire hom~ 4 1 bath 2 3 195
1 http://insideairbnb.com/
airbnb |>
summarize(
n = sum(room_type == ”Entire home/apt”),
proportion = sum(room_type == ”Entire home/apt”) / nrow(airbnb)
)
## # A tibble: 1 x 2
## n proportion
## <int> <dbl>
## 1 3434 0.747
We can see that the proportion of Entire home/apt listings in the data set is
0.747. This value, 0.747, is the population parameter. Remember, this param-
eter value is usually unknown in real data analysis problems, as it is typically
not possible to make measurements for an entire population.
Instead, perhaps we can approximate it with a small subset of data! To inves-
tigate this idea, let’s try randomly selecting 40 listings (i.e., taking a random
sample of size 40 from our population), and computing the proportion for that
sample. We will use the rep_sample_n function from the infer package to take
the sample. The arguments of rep_sample_n are (1) the data frame to sample
from, and (2) the size of the sample to take.
library(infer)
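The code that draws the sample and computes airbnb_sample_1 is not shown above; a sketch consistent with the output below:

airbnb_sample_1 <- airbnb |>
  rep_sample_n(size = 40) |>
  summarize(
    n = sum(room_type == "Entire home/apt"),
    prop = sum(room_type == "Entire home/apt") / 40
  )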
airbnb_sample_1
## # A tibble: 1 x 3
## replicate n prop
## <int> <int> <dbl>
## 1 1 28 0.7
airbnb_sample_2
## # A tibble: 1 x 3
## replicate n prop
## <int> <int> <dbl>
## 1 1 35 0.875
Confirmed! We get a different value for our estimate this time. That means that
our point estimate might be unreliable. Indeed, estimates vary from sample to
sample due to sampling variability. But just how much should we expect
the estimates of our random samples to vary? Or in other words, how much
can we really trust our point estimate based on a single sample?
To understand this, we will simulate many samples (much more than just two)
of size 40 from our population of listings and calculate the proportion of entire
home/apartment listings in each sample. This simulation will create many
sample proportions, which we can visualize using a histogram. The distribution
of the estimate for all possible samples of a given size (which we commonly
refer to as 𝑛) from a population is called a sampling distribution. The
sampling distribution will help us see how much we would expect our sample
proportions from this population to vary for samples of size 40.
We again use the rep_sample_n to take samples of size 40 from our population
of Airbnb listings. But this time we set the reps argument to 20,000 to specify
that we want to take 20,000 samples of size 40.
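A sketch of that call:

samples <- rep_sample_n(airbnb, size = 40, reps = 20000)
samples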
## # A tibble: 800,000 x 9
## # Groups: replicate [20,000]
## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds
## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 1 4403 Downtown Entire h~ 2 1 bath 1 1
## 2 1 902 Kensington-C~ Private ~ 2 1 shared~ 1 1
## 3 1 3808 Hastings-Sun~ Entire h~ 6 1.5 baths 1 3
## 4 1 561 Kensington-C~ Entire h~ 6 1 bath 2 2
## 5 1 3385 Mount Pleasa~ Entire h~ 4 1 bath 1 1
## 6 1 4232 Shaughnessy Entire h~ 6 1.5 baths 2 2
## 7 1 1169 Downtown Entire h~ 3 1 bath 1 1
## 8 1 959 Kitsilano Private ~ 1 1.5 shar~ 1 1
## 9 1 2171 Downtown Entire h~ 2 1 bath 1 1
## 10 1 1258 Dunbar South~ Entire h~ 4 1 bath 2 2
## # ... with 799,990 more rows, and 1 more variable: price <dbl>
Notice that the column replicate indicates the replicate, or sample, to which
each listing belongs. Above, since by default R only prints the first few rows, it
looks like all of the listings have replicate set to 1. But you can check the last
few entries using the tail() function to verify that we indeed created 20,000
samples (or replicates).
tail(samples)
## # A tibble: 6 x 9
## # Groups: replicate [1]
## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds
Now that we have obtained the samples, we need to compute the proportion of
entire home/apartment listings in each sample. We first group the data by the
replicate variable—to group the set of listings in each sample together—and
then use summarize to compute the proportion in each sample. We print both
the first and last few entries of the resulting data frame below to show that
we end up with 20,000 point estimates, one for each of the 20,000 samples.
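A sketch of this computation:

sample_estimates <- samples |>
  group_by(replicate) |>
  summarize(sample_proportion = sum(room_type == "Entire home/apt") / 40)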
sample_estimates
## # A tibble: 20,000 x 2
## replicate sample_proportion
## <int> <dbl>
## 1 1 0.85
## 2 2 0.85
## 3 3 0.65
## 4 4 0.7
## 5 5 0.75
## 6 6 0.725
## 7 7 0.775
## 8 8 0.775
## 9 9 0.7
## 10 10 0.675
## # ... with 19,990 more rows
tail(sample_estimates)
## # A tibble: 6 x 2
## replicate sample_proportion
## <int> <dbl>
## 1 19995 0.75
## 2 19996 0.675
## 3 19997 0.625
## 4 19998 0.75
## 5 19999 0.875
## 6 20000 0.65
sampling_distribution
FIGURE 10.2: Sampling distribution of the sample proportion for sample
size 40.
sample_estimates |>
summarize(mean = mean(sample_proportion))
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 0.747
We notice that the sample proportions are centered around the population pro-
portion value, 0.747! In general, the mean of the sampling distribution should
be equal to the population proportion. This is great news because it means
that the sample proportion is neither an overestimate nor an underestimate
of the population proportion. In other words, if you were to take many sam-
ples as we did above, there is no tendency towards over or underestimating
the population proportion. In a real data analysis setting where you just have
access to your single sample, this implies that you would suspect that your
sample point estimate is roughly equally likely to be above or below the true
population proportion.
population_distribution
FIGURE 10.3: Population distribution of price per night (Canadian dollars)
for all Airbnb listings in Vancouver, Canada.
In Figure 10.3, we see that the population distribution has one peak. It is
also skewed (i.e., is not symmetric): most of the listings are less than $250
per night, but a small number of listings cost much more, creating a long tail
on the histogram’s right side. Along with visualizing the population, we can
calculate the population mean, the average price per night for all the Airbnb
listings.
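A sketch of this computation:

population_parameters <- airbnb |>
  summarize(pop_mean = mean(price))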
population_parameters
## # A tibble: 1 x 1
## pop_mean
## <dbl>
## 1 154.51
the case!), yet we wanted to estimate the mean price per night. We could
answer this question by taking a random sample of as many Airbnb listings as
our time and resources allow. Let’s say we could do this for 40 listings. What
would such a sample look like? Let’s take advantage of the fact that we do
have access to the population data and simulate taking one random sample of
40 listings in R, again using rep_sample_n.
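A sketch of drawing the sample and computing its mean (the exact bookkeeping columns kept may differ slightly from the printed outputs below):

one_sample <- airbnb |>
  rep_sample_n(size = 40)

estimates <- one_sample |>
  summarize(sample_mean = mean(price))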
sample_distribution
FIGURE 10.4: Distribution of price per night (Canadian dollars) for sample
of 40 Airbnb listings.
estimates
## # A tibble: 1 x 2
## replicate sample_mean
## <int> <dbl>
## 1 1 155.80
The average value of the sample of size 40 is $155.8. This number is a point
estimate for the mean of the full population. Recall that the population mean
was $154.51. So our estimate was fairly close to the population parameter: the
mean was about 0.8% off. Note that we usually cannot compute the estimate’s
accuracy in practice since we do not have access to the population parameter;
if we did, we wouldn’t need to estimate it!
Also, recall from the previous section that the point estimate can vary; if we
took another random sample from the population, our estimate’s value might
change. So then, did we just get lucky with our point estimate above? How
much does our estimate vary across different samples of size 40 in this example?
Again, since we have access to the population, we can take many samples and
plot the sampling distribution of sample means for samples of size 40 to get a
sense for this variation. In this case, we’ll use 20,000 samples of size 40.
## # A tibble: 800,000 x 9
## # Groups: replicate [20,000]
## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds
## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 1 1177 Downtown Entire h~ 4 2 baths 2 2
## 2 1 4063 Downtown Entire h~ 2 1 bath 1 1
## 3 1 2641 Kitsilano Private ~ 1 1 shared~ 1 1
## 4 1 1941 West End Entire h~ 2 1 bath 1 1
## 5 1 2431 Mount Pleasa~ Entire h~ 2 1 bath 1 1
## 6 1 1871 Arbutus Ridge Entire h~ 4 1 bath 2 2
## 7 1 2557 Marpole Private ~ 3 1 privat~ 1 2
## 8 1 3534 Downtown Entire h~ 2 1 bath 1 1
## 9 1 4379 Downtown Entire h~ 4 1 bath 1 0
## 10 1 2161 Downtown Entire h~ 4 2 baths 2 2
## # ... with 799,990 more rows, and 1 more variable: price <dbl>
Now we can calculate the sample mean for each replicate and plot the sampling
distribution of sample means for samples of size 40.
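A sketch, assuming (as before) that the 20,000 samples of size 40 shown above are stored in samples:

sample_estimates <- samples |>
  group_by(replicate) |>
  summarize(sample_mean = mean(price))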
sample_estimates
## # A tibble: 20,000 x 2
## replicate sample_mean
## <int> <dbl>
## 1 1 160.06
## 2 2 173.18
## 3 3 131.20
## 4 4 176.96
## 5 5 125.65
## 6 6 148.84
## 7 7 134.82
## 8 8 137.26
## 9 9 166.11
## 10 10 157.81
## # ... with 19,990 more rows
sampling_distribution_40
FIGURE 10.5: Sampling distribution of the sample means for sample size
of 40.
In Figure 10.5, the sampling distribution of the mean has one peak and is bell-
shaped. Most of the estimates are between about $140 and $170; but there are
a good fraction of cases outside this range (i.e., where the point estimate was
not close to the population parameter). So it does indeed look like we were
quite lucky when we estimated the population mean with only 0.8% error.
Let’s visualize the population distribution, distribution of the sample, and the
sampling distribution on one plot to compare them in Figure 10.6. Comparing
these three distributions, the centers of the distributions are all around the
same price (around $150). The original population distribution has a long
right tail, and the sample distribution has a similar shape to that of the
population distribution. However, the sampling distribution is not shaped like
the population or sample distribution. Instead, it has a bell shape, and it
has a lower spread than the population or sample distributions. The sample
means vary less than the individual observations because there will be some
high values and some small values in any random sample, which will keep the
average from being too extreme.
Given that there is quite a bit of variation in the sampling distribution of the
sample mean—i.e., the point estimate that we obtain is not very reliable—is
there any way to improve the estimate? One way to improve a point estimate
is to take a larger sample. To illustrate what effect this has, we will take many
FIGURE 10.6: Comparison of population distribution, sample distribution,
and sampling distribution.
samples of size 20, 50, 100, and 500, and plot the sampling distribution of the
sample mean. We indicate the mean of the sampling distribution with a red
vertical line.
Based on the visualization in Figure 10.7, three points about the sample mean
become clear. First, the mean of the sample mean (across samples) is equal
to the population mean. In other words, the sampling distribution is centered
at the population mean. Second, increasing the size of the sample decreases
the spread (i.e., the variability) of the sampling distribution. Therefore, a
larger sample size results in a more reliable point estimate of the population
parameter. And third, the distribution of the sample mean is roughly bell-
shaped.
FIGURE 10.7: Sampling distributions of the sample mean price per night (Canadian dollars) for sample sizes of 20, 50, 100, and 500. The mean of each sampling distribution (indicated by a red vertical line) is 154.3, 154.6, 154.5, and 154.5, respectively.
Note: You might notice that in the n = 20 case in Figure 10.7, the distribution
is not quite bell-shaped. There is a bit of skew towards the right! You might
also notice that in the n = 50 case and larger, that skew seems to disappear.
In general, the sampling distribution—for both means and proportions—only
becomes bell-shaped once the sample size is large enough. How large is “large
enough?” Unfortunately, it depends entirely on the problem at hand. But as
a rule of thumb, often a sample size of at least 20 will suffice.
10.4.3 Summary
1. A point estimate is a single value computed using a sample from a
population (e.g., a mean or proportion).
2. The sampling distribution of an estimate is the distribution of the
estimate for all possible samples of a fixed size from the same popu-
lation.
3. The shape of the sampling distribution is usually bell-shaped with
one peak and centered at the population mean or proportion.
10.5 Bootstrapping
10.5.1 Overview
Why all this emphasis on sampling distributions?
We saw in the previous section that we could compute a point estimate of
a population parameter using a sample of observations from the population.
And since we constructed examples where we had access to the population,
we could evaluate how accurate the estimate was, and even get a sense of
how much the estimate would vary for different samples from the population.
But in real data analysis settings, we usually have just one sample from our
population and do not have access to the population itself. Therefore we cannot
construct the sampling distribution as we did in the previous section. And as
we saw, our sample estimate’s value can vary significantly from the population
parameter. So reporting the point estimate from a single sample alone may
not be enough. We also need to report some notion of uncertainty in the value
of the point estimate.
Unfortunately, we cannot construct the exact sampling distribution without
full access to the population. However, if we could somehow approximate what
the sampling distribution would look like for a sample, we could use that
approximation to then report how uncertain our sample point estimate is
(as we did above with the exact sampling distribution). There are several
methods to accomplish this; in this book, we will use the bootstrap. We will
discuss interval estimation and construct confidence intervals using just
a single sample from a population. A confidence interval is a range of plausible
values for our population parameter.
Here is the key idea. First, if you take a big enough sample, it looks like the
population. Notice the histograms’ shapes for samples of different sizes taken
from the population in Figure 10.8. We see that the sample’s distribution looks
like that of the population for a large enough sample.
In the previous section, we took many samples of the same size from our
population to get a sense of the variability of a sample estimate. But if our
sample is big enough that it looks like our population, we can pretend that our
sample is the population, and take more samples (with replacement) of the
FIGURE 10.8: Comparison of histograms of price per night (Canadian dollars) for samples of size 10, 20, 50, 100, and 200 with the population distribution.
same size from it instead! This very clever technique is called the bootstrap.
Note that by taking many samples from our single, observed sample, we do
not obtain the true sampling distribution, but rather an approximation that
we call the bootstrap distribution.
Note: We must sample with replacement when using the bootstrap. Other-
wise, if we had a sample of size 𝑛, and obtained a sample from it of size 𝑛
without replacement, it would just return our original sample!
This section will explore how to create a bootstrap distribution from a single
sample using R. The process is visualized in Figure 10.9. For a sample of size
𝑛, you would do the following:
1. Randomly select an observation from the original sample, which was drawn from the population.
2. Record the observation’s value.
3. Replace that observation back into the original sample.
4. Repeat steps 1–3 until you have 𝑛 observations, which form a bootstrap sample.
5. Calculate the bootstrap point estimate (e.g., mean, median, proportion, slope, etc.) of the 𝑛 observations in your bootstrap sample.
10.5.2 Bootstrapping in R
Let’s continue working with our Airbnb example to illustrate how we might
create and use a bootstrap distribution using just a single sample from the
population. Once again, suppose we are interested in estimating the population
mean price per night of all Airbnb listings in Vancouver, Canada, using a
single sample size of 40. Recall our point estimate was $155.8. The histogram
of prices in the sample is displayed in Figure 10.10.
one_sample
## # A tibble: 40 x 8
## id neighbourhood room_type accommodates bathrooms bedrooms beds price
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 3928 Marpole Private r~ 2 1 shared~ 1 1 58
## 2 3013 Kensington-Ceda~ Entire ho~ 4 1 bath 2 2 112
## 3 3156 Downtown Entire ho~ 6 2 baths 2 2 151
## 4 3873 Dunbar Southlan~ Private r~ 5 1 bath 2 3 700
## 5 3632 Downtown Eastsi~ Entire ho~ 6 2 baths 3 3 157
## 6 296 Kitsilano Private r~ 1 1 shared~ 1 1 100
## 7 3514 West End Entire ho~ 2 1 bath 1 1 110
## 8 594 Sunset Entire ho~ 5 1 bath 3 3 105
## 9 3305 Dunbar Southlan~ Entire ho~ 4 1 bath 1 2 196
## 10 938 Downtown Entire ho~ 7 2 baths 2 3 269
## # ... with 30 more rows
one_sample_dist
FIGURE 10.10: Histogram of price per night (Canadian dollars) for one
sample of size 40.
The histogram for the sample is skewed, with a few observations out to the
right. The mean of the sample is $155.8. Remember, in practice, we usually
only have this one sample from the population. So this sample and estimate
are the only data we can work with.
We now perform steps 1–5 listed above to generate a single bootstrap sample
in R and calculate a point estimate from that bootstrap sample. We will use
the rep_sample_n function as we did when we were creating our sampling distri-
bution. But critically, note that we now pass one_sample—our single sample of
size 40—as the first argument. And since we need to sample with replacement,
we change the argument for replace from its default value of FALSE to TRUE.
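A sketch of drawing one bootstrap sample and computing its mean:

boot1 <- one_sample |>
  rep_sample_n(size = 40, replace = TRUE, reps = 1)

boot1 |>
  summarize(mean = mean(price))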
boot1_dist
FIGURE 10.11: Bootstrap distribution.
## # A tibble: 1 x 2
## replicate mean
## <int> <dbl>
## 1 1 164.20
Notice in Figure 10.11 that the histogram of our bootstrap sample has a similar
shape to the original sample histogram. Though the shapes of the distributions
are similar, they are not identical. You’ll also notice that the original sample
mean and the bootstrap sample mean differ. How might that happen? Remem-
ber that we are sampling with replacement from the original sample, so we
don’t end up with the same sample values again. We are pretending that our
single sample is close to the population, and we are trying to mimic drawing
another sample from the population by drawing one from our original sample.
Let’s now take 20,000 bootstrap samples from the original sample (one_sample)
using rep_sample_n, and calculate the means for each of those replicates. Recall
that this assumes that one_sample looks like our original population; but since
we do not have access to the population itself, this is often the best we can
do.
boot20000
## # A tibble: 800,000 x 9
## # Groups: replicate [20,000]
## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds
## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 1 1276 Hastings-Sun~ Entire h~ 2 1 bath 1 1
## 2 1 3235 Hastings-Sun~ Entire h~ 2 1 bath 1 1
## 3 1 1301 Oakridge Entire h~ 12 2 baths 2 12
## 4 1 118 Grandview-Wo~ Entire h~ 4 1 bath 2 2
## 5 1 2550 Downtown Eas~ Private ~ 2 1.5 shar~ 1 1
## 6 1 1006 Grandview-Wo~ Entire h~ 5 1 bath 3 4
## 7 1 3632 Downtown Eas~ Entire h~ 6 2 baths 3 3
## 8 1 1923 West End Entire h~ 4 2 baths 2 2
## 9 1 3873 Dunbar South~ Private ~ 5 1 bath 2 3
## 10 1 2349 Kerrisdale Private ~ 2 1 shared~ 1 1
## # ... with 799,990 more rows, and 1 more variable: price <dbl>
tail(boot20000)
## # A tibble: 6 x 9
## # Groups: replicate [1]
## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds
## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 20000 1949 Kitsilano Entire h~ 3 1 bath 1 1
## 2 20000 1025 Kensington-Ce~ Entire h~ 3 1 bath 1 1
## 3 20000 3013 Kensington-Ce~ Entire h~ 4 1 bath 2 2
## 4 20000 2868 Downtown Entire h~ 2 1 bath 1 1
## 5 20000 3156 Downtown Entire h~ 6 2 baths 2 2
## 6 20000 1923 West End Entire h~ 4 2 baths 2 2
## # ... with 1 more variable: price <dbl>
Let’s take a look at histograms of the first six replicates of our bootstrap
samples.
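The code that selects those six replicates is not shown; a sketch:

six_bootstrap_samples <- boot20000 |>
  filter(replicate <= 6)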
ggplot(six_bootstrap_samples, aes(price)) +
geom_histogram(fill = ”dodgerblue3”, color = ”lightgrey”) +
labs(x = ”Price per night (Canadian dollars)”, y = ”Count”) +
facet_wrap(~replicate) +
theme(text = element_text(size = 12))
FIGURE 10.12: Histograms of price per night (Canadian dollars) for the first six bootstrap samples (replicates 1 through 6).
We see in Figure 10.12 how the bootstrap samples differ. We can also calculate
the sample mean for each of these six replicates.
six_bootstrap_samples |>
group_by(replicate) |>
summarize(mean = mean(price))
## # A tibble: 6 x 2
## replicate mean
## <int> <dbl>
## 1 1 177.2
## 2 2 131.45
## 3 3 179.10
## 4 4 171.35
## 5 5 191.32
## 6 6 170.05
We can see that the bootstrap sample distributions and the sample means are
different. They are different because we are sampling with replacement. We will
now calculate point estimates for our 20,000 bootstrap samples and generate
a bootstrap distribution of our point estimates. The bootstrap distribution
(Figure 10.13) suggests how we might expect our point estimate to behave if
we took another sample.
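A sketch of this computation:

boot20000_means <- boot20000 |>
  group_by(replicate) |>
  summarize(mean = mean(price))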
boot20000_means
## # A tibble: 20,000 x 2
## replicate mean
## <int> <dbl>
## 1 1 177.2
## 2 2 131.45
## 3 3 179.10
## 4 4 171.35
## 5 5 191.32
## 6 6 170.05
## 7 7 178.83
## 8 8 154.78
## 9 9 163.85
## 10 10 209.28
## # ... with 19,990 more rows
tail(boot20000_means)
## # A tibble: 6 x 2
## replicate mean
## <int> <dbl>
## 1 19995 130.40
## 2 19996 189.18
## 3 19997 168.98
## 4 19998 168.23
## 5 19999 155.73
## 6 20000 136.95
boot_est_dist
FIGURE 10.13: Distribution of the bootstrap sample means.
willing to take of being wrong based on the implications of being wrong for
our application. In general, we choose confidence levels to be comfortable with
our level of uncertainty but not so strict that the interval is unhelpful. For in-
stance, if our decision impacts human life and the implications of being wrong
are deadly, we may want to be very confident and choose a higher confidence
level.
To calculate a 95% percentile bootstrap confidence interval, we will do the
following:
1. Arrange the observations in the bootstrap distribution in ascending order.
2. Find the value such that 2.5% of observations fall below it (the 2.5% percentile). Use that value as the lower bound of the interval.
3. Find the value such that 97.5% of observations fall below it (the 97.5% percentile). Use that value as the upper bound of the interval.
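In R, one way to obtain these bounds from the bootstrap sample means is with the quantile function; a sketch consistent with the output below:

bounds <- quantile(boot20000_means$mean, c(0.025, 0.975))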
bounds
## 2.5% 97.5%
## 119 204
Our interval, $119.28 to $203.63, captures the middle 95% of the sample mean
prices in the bootstrap distribution. We can visualize the interval on our dis-
tribution in Figure 10.16.
FIGURE 10.16: Distribution of the bootstrap sample means, with the lower and upper bounds of the 95% percentile confidence interval indicated (2.5th percentile = 119.28, 97.5th percentile = 203.63).
practice, we would not know whether our interval captured the population
parameter or not because we usually only have a single sample, not the entire
population. This is the best we can do when we only have one sample!
This chapter is only the beginning of the journey into statistical inference.
We can extend the concepts learned here to do much more than report point
estimates and confidence intervals, such as testing for real differences between
populations, tests for associations between variables, and so much more. We
have just scratched the surface of statistical inference; however, the material
presented here will serve as the foundation for more advanced statistical tech-
niques you may learn about in the future!
10.6 Exercises
Practice exercises for the material covered in this chapter can be found in the
accompanying worksheets repository2 in the two “Statistical inference” rows.
You can launch an interactive version of each worksheet in your browser by
clicking the “launch binder” button. You can also preview a non-interactive
version of each worksheet by clicking “view worksheet.” If you instead decide
to download the worksheets and run them on your own machine, make sure
to follow the instructions for computer setup found in Chapter 13. This will
ensure that the automated feedback and guidance that the worksheets provide
will function as intended.
11
Combining code and text with Jupyter
11.1 Overview
A typical data analysis involves not only writing and executing code, but also
writing text and displaying images that help tell the story of the analysis. In
fact, ideally, we would like to interleave these three media, with the text and
images serving as narration for the code and its output. In this chapter we will
show you how to accomplish this using Jupyter notebooks, a common coding
platform in data science. Jupyter notebooks do precisely what we need: they
let you combine text, images, and (executable!) code in a single document. In
this chapter, we will focus on the use of Jupyter notebooks to program in R
and write text via a web interface. These skills are essential to getting your
analysis running; think of it like getting dressed in the morning! Note that we
assume that you already have Jupyter set up and ready to use. If that is not
the case, please first read Chapter 13 to learn how to install and configure
Jupyter on your own computer.
11.3 Jupyter
Jupyter is a web-based interactive development environment for creating, edit-
ing, and executing documents called Jupyter notebooks. Jupyter notebooks
are documents that contain a mix of computer code (and its output) and
formattable text. Given that they combine these two analysis artifacts in a
single document—code is not separate from the output or written report—
notebooks are one of the leading tools to create reproducible data analyses.
A reproducible data analysis is one in which you can reliably and easily re-create
the same results when analyzing the same data. Although this sounds like
something that should always be true of any data analysis, in reality, this
is not often the case; one needs to make a conscious effort to perform data
analysis in a reproducible manner. An example of what a Jupyter notebook
looks like is shown in Figure 11.1.
authentication to gain access. For example, if you are reading this book as part
of a course, your instructor may have a JupyterHub already set up for you to
use! Jupyter can also be installed on your own computer; see Chapter 13 for
instructions.
FIGURE 11.2: A code cell in Jupyter that has not yet been executed.
Once a code cell has been activated (Figure 11.4), the cell can be run either by pressing the Run button in the toolbar, or by using the keyboard shortcut Shift + Enter.
FIGURE 11.4: An activated cell that is ready to be run. The blue rectangle
to the cell’s left (annotated by a red arrow) indicates that it is ready to be
run. The cell can be run by clicking the run button (circled in red).
To execute all of the code cells in an entire notebook, you have three options:
1. Select Run » Run All Cells from the menu.
2. Select Kernel » Restart Kernel and Run All Cells… from the menu.
3. Click the restart kernel and run all cells button in the toolbar.
All of these commands result in all of the code cells in a notebook being run.
However, there is a slight difference between them. In particular, only options
2 and 3 above will restart the R session before running all of the cells; option
1 will not restart the session. Restarting the R session means that all objects created by previously run cells will be deleted. In other words, restarting the session and then running all cells
(options 2 or 3) emulates how your notebook code would run if you completely
restarted Jupyter before executing your entire notebook.
First, save your work by clicking File at the top left of your screen, then Save Notebook.
Next, if you are accessing Jupyter using a JupyterHub server, from
the File menu click Hub Control Panel. Choose Stop My Server
to shut it down, then the My Server button to start it back up. If
you are running Jupyter on your own computer, from the File menu
click Shut Down, then start Jupyter again. Finally, navigate back
to the notebook you were working on.
FIGURE 11.6: New cells can be created by clicking the + button, and are
by default code cells.
FIGURE 11.7: A Markdown cell in Jupyter that has not yet been rendered
and can be edited.
FIGURE 11.8: A Markdown cell in Jupyter that has been rendered and
exhibits rich text formatting.
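As an illustrative sketch (not the exact cell shown in Figures 11.7 and 11.8), a Markdown cell might contain plain text with a few formatting marks like the following, which render as a heading, emphasized text, and a bulleted list:
## Exploratory data analysis
We use the **tidyverse** to *explore* the data:
- load the data
- visualize the distributions of key variables
- compute summary statistics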
FIGURE 11.9: New cells are by default code cells. To create Markdown cells,
the cell format must be changed.
Suppose you write some R code in a cell that creates an R object, say a variable named y. When you
execute that cell and create y, it will continue to exist until it is deliberately
deleted with R code, or when the Jupyter notebook R session (i.e., kernel) is
stopped or restarted. It can also be referenced in another distinct code cell
(Figure 11.10). Together, this means that you could then write a code cell
further above in the notebook that references y and execute it without error
in the current session (Figure 11.11). This could also be done successfully in
future sessions if, and only if, you run the cells in the same unconventional
order. However, it is difficult to remember this unconventional order, and
it is not the order that others would expect your code to be executed in.
Thus, in the future, this would lead to errors when the notebook is run in the
conventional linear order (Figure 11.12).
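As a small sketch of this situation (the hypothetical cells below are not the exact ones shown in the figures), imagine a notebook whose cells were written in this order:
# Cell 1 (appears first in the notebook, but depends on the cell below it):
print(y + 1)
# Cell 2 (appears second, and is where y is actually created):
y <- 10
Running Cell 2 and then Cell 1 works in the current session; but after restarting the kernel and running all cells from top to bottom, Cell 1 stops with an error because y does not exist yet.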
FIGURE 11.10: Code that was written out of order, but not yet executed.
FIGURE 11.11: Code that was written out of order, and was executed using
the run button in a nonlinear order without error. The order of execution can
be traced by following the numbers to the left of the code cells; their order
indicates the order in which the cells were executed.
FIGURE 11.12: Code that was written out of order, and was executed in a linear order using “Restart Kernel and Run All Cells…” This resulted in an error at the execution of the second code cell, and not all of the code cells in the notebook were run.
These events may not negatively affect the current R session when the code
is being written; but as you might now see, they will likely lead to errors
when that notebook is run in a future session. Regularly executing the entire
notebook in a fresh R session will help guard against this. If you restart your
session and new errors seem to pop up when you run all of your cells in linear
order, you can at least be aware that there is an issue. Knowing this sooner
rather than later will allow you to fix the issue and ensure your notebook can
be run linearly from start to finish.
As a best practice, we recommend running the entire notebook in a fresh R session at least 2–3 times within any period of work. Critically, you must do this in a fresh R session by restarting your kernel. We recommend
using either the Kernel » Restart Kernel and Run All Cells… command
from the menu or the button in the toolbar. Note that the Run » Run All
Cells menu item will not restart the kernel, and so it is not sufficient to guard
against these errors.
You can create a new Jupyter notebook by clicking the + button at the top of the Jupyter file explorer (Figure 11.15).
FIGURE 11.15: Clicking on the R icon under the Notebook heading will
create a new Jupyter notebook with an R kernel.
Once you have created a new Jupyter notebook, be sure to give it a descriptive
name, as the default file name is Untitled.ipynb. You can rename files by first
right-clicking on the file name of the notebook you just created, and then
clicking Rename. This will make the file name editable. Use your keyboard
to change the name. Pressing Enter or clicking anywhere else in the Jupyter
interface will save the changed file name.
We recommend not using white space or non-standard characters in file names. Doing so will not prevent you from using that file in Jupyter, but such names become troublesome as you start to do more advanced data science projects that involve repetition and automation. We recommend naming files using lowercase characters and separating words with a dash (-) or an underscore (_); for example, a file name like airbnb-analysis.ipynb is preferable to Airbnb Analysis (final).ipynb.
1 https://jupyterlab.readthedocs.io/en/latest/
2 https://commonmark.org/help/
3 https://commonmark.org/help/tutorial/
12
Collaboration with version control
12.1 Overview
This chapter will introduce the concept of using version control systems to
track changes to a project over its lifespan, to share and edit code in a collab-
orative team, and to distribute the finished project to its intended audience.
This chapter will also introduce how to use the two most common version con-
trol tools: Git for local version control, and GitHub for remote version control.
We will focus on the most common version control operations used day-to-day
in a standard data science project. There are many user interfaces for Git; in
this chapter we will cover the Jupyter Git interface.
Additionally, the iterative nature of data analysis projects means that most of
the time, the final version of the analysis that is shared with the audience is
only a fraction of what was explored during the development of that analysis.
Changes in data visualizations and modeling approaches, as well as some neg-
ative results, are often not observable from reviewing only the final, polished
analysis. When these parts of the analysis development are not visible, others may end up repeating approaches that did not work, rather than learning from them and using that knowledge as a springboard to new, more fruitful approaches.
Finally, data analyses are typically completed by a team of people rather than
a single person. This means that files need to be shared across multiple com-
puters, and multiple people often end up editing the project simultaneously.
In such a situation, determining who has the latest version of the project—and
how to resolve conflicting edits—can be a real challenge.
Version control helps solve these challenges. Version control is the process of
keeping a record of changes to documents, including when the changes were
made and who made them, throughout the history of their development. It also
provides the means both to view earlier versions of the project and to revert
changes. Version control is most commonly used in software development, but
can be used for any electronic files for any type of project, including data
analyses. Being able to record and view the history of a data analysis project
is important for understanding how and why decisions to use one method
or another were made, among other things. Version control also facilitates
collaboration via tools to share edits with others and resolve conflicting edits.
But even if you’re working on a project alone, you should still use version
control. It helps you keep track of what you’ve done, when you did it, and
what you’re planning to do next!
To version control a project, you generally need two things: a version control
system and a repository hosting service. The version control system is the soft-
ware responsible for tracking changes, sharing changes you make with others,
obtaining changes from others, and resolving conflicting edits. The reposi-
tory hosting service is responsible for storing a copy of the version-controlled
project online (a repository), where you and your collaborators can access it
remotely, discuss issues and bugs, and distribute your final product. For both
of these items, there is a wide variety of choices. In this textbook we’ll use
Git for version control, and GitHub for repository hosting, because both are
currently the most widely used platforms. In the additional resources section
at the end of the chapter, we list many of the common version control systems
and repository hosting services in use today.
Note: Technically you don’t have to use a repository hosting service. You
can, for example, version control a project that is stored only in a folder on
your computer—never sharing it on a repository hosting service. But using
a repository hosting service provides a few big benefits, including managing
collaborator access permissions, tools to discuss and track bugs, and the ability
to have external collaborators contribute work, not to mention the safety
of having your work backed up in the cloud. Since most repository hosting
services now offer free accounts, there are not many situations in which you
wouldn’t want to use one for your project.
Typically, a version-controlled project has two copies of its repository. One copy, commonly referred to as the local repository, lives on our own computer or laptop, but can also exist within a workspace on a server (e.g., JupyterHub). The other copy is typically stored in a repository hosting service (e.g., GitHub), where we can easily share it with our collaborators. This copy is commonly referred to as the remote repository.
Both copies of the repository have a working directory where you can create,
store, edit, and delete files (e.g., analysis.ipynb in Figure 12.1). Both copies of
the repository also maintain a full project history (Figure 12.1). This history
is a record of all versions of the project files that have been created. The
repository history is not automatically generated; Git must be explicitly told
when to record a version of the project. These records are called commits.
They are a snapshot of the file contents, as well as metadata about the repository, at the time the record was created (who made the commit, when it was
made, etc.). In the local and remote repositories shown in Figure 12.1, there
are two commits represented as gray circles. Each commit can be identified by
a human-readable message, which you write when you make a commit, and
a commit hash that Git automatically adds for you.
The purpose of the message is to contain a brief, rich description of what work
was done since the last commit. Messages act as a very useful narrative of the
changes to a project over its lifespan. If you ever want to view or revert to an
earlier version of the project, the message can help you identify which commit
to view or revert to. In Figure 12.1, you can see two such messages, one for
each commit: Created README.md and Added analysis draft.
The hash is a string of characters consisting of about 40 letters and numbers.
Its purpose is to serve as a unique identifier for the commit, and it is used by Git to index the project history. Although hashes are quite long—imagine
having to type out 40 precise characters to view an old project version!—Git
is able to work with shorter versions of hashes. In Figure 12.1, you can see
two of these shortened hashes, one for each commit: Daa29d6 and 884c7ce.
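Although this chapter interacts with Git through the Jupyter Git extension, the same information (each commit's shortened hash paired with its message) can also be viewed at the command line; a sketch is shown below, where the lines beginning with # show example output that pairs the hashes and messages above in the order they are listed, most recent commit first:
git log --oneline
# 884c7ce Added analysis draft
# Daa29d6 Created README.md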
The typical version control workflow boils down to three things:
1. Tell Git when to make a commit of your own changes in the local
repository.
2. Tell Git when to send your new commits to the remote GitHub repos-
itory.
3. Tell Git when to retrieve any new changes (that others made) from
the remote GitHub repository.
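This chapter carries out these three steps using the Jupyter Git extension, but for orientation, a rough command-line sketch of the same workflow is shown below (the file names and the commit message are simply the examples used later in this section):
# 1. Commit your own changes in the local repository
git add analysis.ipynb README.md        # stage the files to include in the snapshot
git commit -m "Message about changes"   # record the staged files with a message
# 2. Send your new commits to the remote GitHub repository
git push
# 3. Retrieve new changes that others made from the remote GitHub repository
git pull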
Before making a commit, you first need to tell Git which of your changed files to include in the snapshot. We call this step adding the files to the staging area. Note that
the staging area is not a real physical location on your computer; it is instead
a conceptual placeholder for these files until they are committed. The benefit
of the Git version control system using a staging area is that you can choose
to commit changes in only certain files. For example, in Figure 12.3, we add
only the two files that are important to the analysis project (analysis.ipynb
and README.md) and not our personal scratch notes for the project (notes.txt).
Once the files we wish to commit have been added to the staging area, we can
then commit those files to the repository history (Figure 12.4). When we do
this, we are required to include a helpful commit message to tell collaborators
(which often includes future you!) about the changes that were made. In Figure
12.4, the message is Message about changes...; in your work you should make
sure to replace this with an informative message about what changed. It is also
important to note here that these changes are only being committed to the
FIGURE 12.3: Adding modified files to the staging area in the local reposi-
tory.
local repository’s history. The remote repository on GitHub has not changed,
and collaborators are not yet able to see your new changes.
FIGURE 12.4: Committing the modified files in the staging area to the local
repository history, with an informative message about what changed.
Once you have committed your changes locally, the next step is to push those commits to the remote repository on GitHub (Figure 12.5). This updates the history in the remote repository (i.e., GitHub)
to match what you have in your local repository. Now when collaborators
interact with the remote repository, they will be able to see the changes you
made. And you can also take comfort in the fact that your work is now backed
up in the cloud!
FIGURE 12.5: Pushing the commit to send the changes to the remote repos-
itory on GitHub.
If you are working on a project with collaborators, they will also be making their own commits and pushing them to the remote GitHub repository to share them with
you. When they push their changes, those changes will only initially exist in
the remote GitHub repository and not in your local repository (Figure 12.6).
To obtain the new changes from the remote repository on GitHub, you will
need to pull those changes to your own local repository. By pulling changes,
you synchronize your local repository to what is present on GitHub (Figure
12.7). Additionally, until you pull changes from the remote repository, you will
not be able to push any more changes yourself (though you will still be able
to work and make commits in your own local repository).
FIGURE 12.7: Pulling changes from the remote GitHub repository to syn-
chronize your local repository.
A newly created public repository with a README.md template file should look
something like what is shown in Figure 12.10.
FIGURE 12.11: Clicking on the pen tool opens a text box for editing plain
text files.
After you are done with your edits, they can be “saved” by committing your
changes. When you commit a file in a repository, the version control system
takes a snapshot of what the file looks like. As you continue working on the
project, over time you will possibly make many commits to a single file; this
generates a useful version history for that file. On GitHub, if you click the
FIGURE 12.12: The text box where edits can be made after clicking on the
pen tool.
green “Commit changes” button, it will save the file and then make a commit
(Figure 12.13).
Recall from Section 12.5.1 that you normally have to add files to the staging
area before committing them. Why don’t we have to do that when we work
directly on GitHub? Behind the scenes, when you click the green “Commit
changes” button, GitHub is adding that one file to the staging area prior to
committing it. But note that on GitHub you are limited to committing changes
to only one file at a time. When you work in your own local repository, you
can commit changes to multiple files simultaneously. This is especially useful
when one “improvement” to the project involves modifying multiple files. You
can also do things like run code when working in a local repository, which you
cannot do on GitHub. In general, editing on GitHub is reserved for small edits
to plain text files.
FIGURE 12.13: Saving changes using the pen tool requires committing
those changes, and an associated commit message.
FIGURE 12.14: New plain text files can be created directly on GitHub.
A page will open with a small text box for the file name to be entered, and
a larger text box where the desired file content text can be entered. Note the
two tabs, “Edit new file” and “Preview”. Toggling between them lets you enter
and edit text and view what the text will look like when rendered, respectively
(Figure 12.15). Note that GitHub understands and renders .md files using a flavor of Markdown; see the Markdown cheatsheet1 for an overview of the syntax.
FIGURE 12.15: New plain text files require a file name in the text box
circled in red, and file content entered in the larger text box (red arrow).
Save and commit your changes by clicking the green “Commit changes” button
at the bottom of the page (Figure 12.16).
You can also upload files that you have created on your local machine by using
the “Add file” drop-down menu and selecting “Upload files” (Figure 12.17). To
select the files from your local computer to upload, you can either drag and
drop them into the gray box area shown below, or click the “choose your files”
link to access a file browser dialog. Once the files you want to upload have
been selected, click the green “Commit changes” button at the bottom of the
page (Figure 12.18).
1 https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf
FIGURE 12.18: Specify files to upload by dragging them into the GitHub
website (red circle) or by clicking on “choose your files.” Uploaded files are
also required to be committed along with an associated commit message.
Note that Git and GitHub are designed to track changes in individual files.
Do not upload your whole project in an archive file (e.g., .zip). If you do,
then Git can only keep track of changes to the entire .zip file, which will not
be human-readable. Committing one big archive defeats the whole purpose of
using version control: you won’t be able to see, interpret, or find changes in
the history of any of the actual content of your project!
FIGURE 12.19: The “Generate new token” button used to initiate the cre-
ation of a new personal access token. It is found in the “Personal access tokens”
section of the “Developer settings” page in your account settings.
Finally, you will be taken to a page where you will be able to see and copy
the personal access token you just generated (Figure 12.21). Since it provides
access to certain parts of your account, you should treat this token like a
password; for example, you should consider securely storing it (and your other
passwords and tokens, too!) using a password manager. Note that this page
will only display the token to you once, so make sure you store it in a safe
place right away. If you accidentally forget to store it, though, do not fret—you
can delete that token by clicking the “Delete” button next to your token, and
generate a new one from scratch. To learn more about GitHub authentication,
see the additional resources section at the end of this chapter.
FIGURE 12.22: The green “Code” drop-down menu contains the remote
address (URL) corresponding to the location of the remote GitHub repository.
Open Jupyter, and click the Git+ icon on the file browser tab (Figure 12.23).
Paste the URL of the GitHub project repository you created and click the
blue “CLONE” button (Figure 12.24).
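If you ever need to do this outside of Jupyter, the equivalent command-line operation is git clone followed by the repository URL; a sketch with a placeholder URL is:
git clone https://github.com/your_username/your_repository.git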
On the file browser tab, you will now see a folder for the repository. Inside
this folder will be all the files that existed on GitHub (Figure 12.25).
FIGURE 12.25: Cloned GitHub repositories can been seen and accessed via
the Jupyter file browser.
This opens the Jupyter Git graphical user interface pane. Next, click the plus
sign (+) beside the file(s) that you want to “add” (Figure 12.27). Note that
because this is the first change for this file, it falls under the “Untracked”
heading. However, next time you edit this file and want to add the changes,
you will find it under the “Changed” heading.
You will also see an eda-checkpoint.ipynb file under the “Untracked” heading.
This is a temporary “checkpoint file” created by Jupyter when you work on
eda.ipynb. You generally do not want to add auto-generated files to Git repos-
itories; only add the files you directly create and edit.
FIGURE 12.27: eda.ipynb is added to the staging area via the plus sign (+).
Clicking the plus sign (+) moves the file from the “Untracked” heading to the
“Staged” heading, so that Git knows you want a snapshot of its current state as
a commit (Figure 12.28). Now you are ready to “commit” the changes. Make
sure to include a (clear and helpful!) message about what was changed so that
your collaborators (and future you) know what happened in this commit.
FIGURE 12.29: A commit message must be added into the Jupyter Git
extension commit text box before the blue Commit button can be used to
record the commit.
After “committing” the file(s), you will see there are 0 “Staged” files. You are
now ready to push your changes to the remote repository on GitHub (Figure
12.30).
FIGURE 12.30: After recording a commit, the staging area should be empty.
FIGURE 12.31: The Jupyter Git extension “push” button (circled in red).
You will then be prompted to enter your GitHub username and the personal
access token that you generated earlier (not your account password!). Click
the blue “OK” button to initiate the push (Figure 12.32).
FIGURE 12.32: Enter your Git credentials to authorize the push to the
remote repository.
If the files were successfully pushed to the project repository on GitHub, you
will be shown a success message (Figure 12.33). Click “Dismiss” to continue
working in Jupyter.
If you visit the remote repository on GitHub, you will see that the changes
now exist there too (Figure 12.34)!
FIGURE 12.34: The GitHub web interface shows a preview of the commit
message, and the time of the most recently pushed commit for each file.
12.8 Collaboration
12.8.1 Giving collaborators access to your project
As mentioned earlier, GitHub allows you to control who has access to your
project. The default for both public and private projects is that only the person who created the GitHub repository has permission to create, edit, and delete files (write access). To give your collaborators write access to the project, navigate to the “Settings” tab (Figure 12.35).
FIGURE 12.36: The “Manage access” tab on the GitHub web interface.
Next, click the button to add a collaborator (Figure 12.37).
Type in the collaborator’s GitHub username or email, and select their name
when it appears (Figure 12.38).
Finally, click the green “Add to this repository” button (Figure 12.39).
After this, you should see your newly added collaborator listed under the “Man-
age access” tab. They should receive an email invitation to join the GitHub
repository as a collaborator. They need to accept this invitation to enable
write access.
FIGURE 12.40: The GitHub interface indicates the name of the last person
to push a commit to the remote repository, a preview of the associated commit
message, the unique commit identifier, and how long ago the commit was
snapshotted.
You can tell Git to “pull” by clicking on the cloud icon with the down arrow
in Jupyter (Figure 12.41).
Once the files are successfully pulled from GitHub, you need to click “Dismiss”
to keep working (Figure 12.42).
FIGURE 12.42: The prompt after changes have been successfully pulled
from a remote repository.
And then when you open (or refresh) the files whose changes you just pulled,
you should be able to see them (Figure 12.43).
It can be very useful to review the history of the changes to your project. You
can do this directly in Jupyter by clicking “History” in the Git tab (Figure
12.44).
FIGURE 12.44: Version control repository history viewed using the Jupyter
Git extension.
It is good practice to pull any changes at the start of every work session
before you start working on your local copy. If you do not do this, and your
collaborators have pushed some changes to the project to GitHub, then you
will be unable to push your changes to GitHub until you pull. This situation
can be recognized by the error message shown in Figure 12.45.
FIGURE 12.45: Error message that indicates that there are changes on the
remote repository that you do not have locally.
Usually, getting out of this situation is not too troublesome. First you need
to pull the changes that exist on GitHub that you do not yet have in the
local repository. Usually when this happens, Git can automatically merge the
changes for you, even if you and your collaborators were working on different
parts of the same file!
If, however, you and your collaborators made changes to the same line of the
same file, Git will not be able to automatically merge the changes—it will not
know whether to keep your version of the line(s), your collaborator's version
of the line(s), or some blend of the two. When this happens, Git will tell you
that you have a merge conflict in certain file(s) (Figure 12.46).
FIGURE 12.46: Error message that indicates you and your collaborators
made changes to the same line of the same file and that Git will not be able
to automatically merge the changes.
FIGURE 12.47: How to open a Jupyter notebook as a plain text file view
in Jupyter.
The beginning of the merge conflict is marked by <<<<<<< HEAD, and the end of the merge conflict is marked by >>>>>>>. Between these markings, Git also
inserts a separator (=======). The version of the change before the separator is
your change, and the version that follows the separator was the change that
existed on GitHub. In Figure 12.48, you can see that in your local repository
there is a line of code that calls scale_color_manual with three color values
(deeppink2, cyan4, and purple1). It looks like your collaborator made an edit to
that line too, except with different colors (blue3, red3, and black)!
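Putting this together, the conflicted portion of the plain text file would look roughly like the sketch below (Git may also append an identifier for the incoming commit after the closing >>>>>>> marker):
<<<<<<< HEAD
scale_color_manual(values = c("deeppink2", "cyan4", "purple1"))
=======
scale_color_manual(values = c("blue3", "red3", "black"))
>>>>>>>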
Once you have decided which version of the change (or what combination!) to
keep, you need to use the plain text editor to remove the special marks that
Git added (Figure 12.49).
The file must be saved, added to the staging area, and then committed before
you will be able to push your changes to GitHub.
FIGURE 12.51: The “New issue” button on the GitHub web interface.
Add an issue title (which acts like an email subject line), and then put the
body of the message in the larger text box. Finally, click “Submit new issue”
to post the issue to share with others (Figure 12.52).
FIGURE 12.52: Dialog boxes and submission button for creating new
GitHub issues.
You can reply to an issue that someone opened by adding your written response
to the large text box and clicking comment (Figure 12.53).
When a conversation is resolved, you can click “Close issue”. The closed issue
can be later viewed by clicking the “Closed” header link in the “Issue” tab
(Figure 12.54).
FIGURE 12.54: The “Closed” issues tab on the GitHub web interface.
12.9 Exercises
Practice exercises for the material covered in this chapter can be found in
the accompanying worksheets repository2 in the “Collaboration with version
control” row. You can launch an interactive version of the worksheet in your
browser by clicking the “launch binder” button. You can also preview a non-
interactive version of the worksheet by clicking “view worksheet.” If you instead
decide to download the worksheet and run it on your own machine, make sure
to follow the instructions for computer setup found in Chapter 13. This will
ensure that the automated feedback and guidance that the worksheets provide
will function as intended.
Happy Git and GitHub for the useR personal access tokens chapter11 are
both excellent additional resources to consult if you need additional help
generating and using personal access tokens.
11 https://happygitwithr.com/https-pat.html
13
Setting up your computer
13.1 Overview
In this chapter, you’ll learn how to install all of the software needed to do the
data science covered in this book on your own computer.
13.3.1 Git
As shown in Chapter 12, Git is a very useful tool for version controlling your
projects, as well as sharing your work with others. Here’s how to install Git
on the following operating systems:
Windows: To install Git on Windows, go to https://git-scm.com/download/win
and download the Windows version of Git. Once the download has finished,
run the installer and accept the default configuration for all pages.
MacOS: To install Git on MacOS, open the terminal (how-to video1) and
type the following command:
xcode-select --install
Ubuntu: To install Git on Ubuntu, open the terminal and type the following
commands:
sudo apt update
sudo apt install git
13.3.2 Miniconda
To run Jupyter notebooks on your computer, you will need to install the web-
based platform JupyterLab. But JupyterLab relies on Python, so we need to
install Python first. We can install Python via the miniconda Python package
distribution2.
Windows: To install miniconda on Windows, download the latest Python
64-bit version from here3. Once the download has finished, run the installer
and accept the default configuration for all pages. After installation, you can
open the Anaconda Prompt by opening the Start Menu and searching for the
program called “Anaconda Prompt (miniconda3)”. When this opens, you will
see a prompt similar to (base) C:\Users\your_name.
MacOS: To install miniconda on MacOS, you will need to use a different
installation method depending on the type of processor chip your computer
has.
If your Mac computer has an Intel x86 processor chip, you can download the latest Python 64-bit version from here4. After the download has finished, run
the installer and accept the default configuration for all pages.
1 https://youtu.be/5AJbWEWwnbY
2 https://docs.conda.io/en/latest/miniconda.html
3 https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe
4 https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.pkg
If your Mac computer has an Apple M1 processor chip, you can download the latest Python 64-bit version from here5. After the download has finished, you
need to run the downloaded script in the terminal using a command like:
bash path/to/Miniconda3-latest-MacOSX-arm64.sh
Make sure to replace path/to/ with the path of the folder containing the down-
loaded script. Most computers will save downloaded files to the Downloads folder.
If this is the case for your computer, you can run the script in the terminal
by typing:
bash Downloads/Miniconda3-latest-MacOSX-arm64.sh
The instructions for the installation will then appear. Follow the prompts and
agree to accepting the license, the default installation location, and to running
conda init, which makes conda available from the terminal.
Ubuntu: To install miniconda on Ubuntu, download the latest Python 64-bit installer script from here6. After the download has finished, you need to run the downloaded script in the terminal using a command like:
bash path/to/Miniconda3-latest-Linux-x86_64.sh
Make sure to replace path/to/ with the path of the folder containing the downloaded script. Most often this file will be downloaded to the Downloads folder. If this is the case for your computer, you can run the script in the terminal by typing:
bash Downloads/Miniconda3-latest-Linux-x86_64.sh
The instructions for the installation will then appear. Follow the prompts and
agree to accepting the license, the default installation location, and to running
conda init, which makes conda available from the terminal.
13.3.3 JupyterLab
With miniconda set up, we can now install JupyterLab and the Jupyter Git
extension. Type the following into the Anaconda Prompt (Windows) or the
terminal (MacOS and Ubuntu) and press enter:
conda install -c conda-forge -y jupyterlab
conda install -y nodejs
pip install --upgrade jupyterlab-git
5 https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh
6 https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
To test that your JupyterLab installation is functional, you can type jupyter lab into the Anaconda Prompt (Windows) or terminal (MacOS and Ubuntu)
and press enter. This should open a new tab in your default browser with the
JupyterLab interface. To exit out of JupyterLab you can click File -> Shutdown,
or go to the terminal from which you launched JupyterLab, hold Ctrl, and
press C twice.
To improve the experience of using R in JupyterLab, you should also add an
extension that allows you to set up keyboard shortcuts for inserting text. By
default, this extension creates shortcuts for inserting two of the most common
R operators: <- and |>. Type the following in the Anaconda Prompt (Windows)
or terminal (MacOS and Ubuntu) and press enter:
jupyter labextension install @techrah/text-shortcuts
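To set up R and the R packages used in this book, the worksheets repository provides an environment.yml file. One common way to install an environment described by such a file with conda is sketched below; this is an assumption for illustration (the exact command the authors provide may differ), and it assumes you have downloaded environment.yml into your current directory:
# Sketch (assumption): install/update a conda environment from environment.yml
conda env update --file environment.yml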
This command installs the specific R and package versions specified in the environment.yml file found in the worksheets repository7. We will always keep the versions in the environment.yml file updated so that they are compatible with the exercise worksheets that accompany the book.
7 https://ubc-dsci.github.io/data-science-a-first-intro-worksheets
You can also install the latest version of R and the R packages used in this book
by typing the commands shown below in the Anaconda Prompt (Windows) or
terminal (MacOS and Ubuntu) and pressing enter. Be careful though: this
may install package versions that are incompatible with the worksheets that
accompany the book; the automated exercise feedback might tell you your
answers are not correct even though they are!
conda install -c conda-forge -y \
r-base \
r-cowplot \
r-ggally \
r-gridextra \
r-irkernel \
r-kknn \
r-rpostgres \
r-rsqlite \
r-scales \
r-testthat \
r-tidymodels \
r-tidyverse \
r-tinytex \
unixodbc
13.3.5 LaTeX
To be able to render .ipynb files to .pdf, you need to install a LaTeX distribution. These can be quite large, so we will opt to use tinytex, a light-weight, cross-platform, portable, and easy-to-maintain LaTeX distribution based on TeX
Live.
MacOS: To install tinytex we need to make sure that /usr/local/bin is writable.
To do this, type the following in the terminal:
sudo chown -R $(whoami):admin /usr/local/bin
# Tail of a LaTeX package installation call (e.g., tinytex::tlmgr_install(c(...)));
# the earlier package names in the call are not reproduced here.
"parskip",
"pgf",
"rsfs",
"tcolorbox",
"titling",
"trimspaces",
"ucs",
"ulem",
"upquote"))
Ubuntu: To append the TinyTeX executables to our PATH, we need to edit our .bashrc file. The TinyTeX executables are usually installed in ~/bin. Thus, add a line like the one sketched below to the bottom of your .bashrc file (which you can open by typing nano ~/.bashrc in the terminal), and save the file:
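A typical line for this purpose is sketched below; adjust the directory if your TinyTeX executables are installed somewhere other than ~/bin:
# prepend the TinyTeX executable directory to the PATH
export PATH="$HOME/bin:$PATH"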
Note: If you used nano to open your .bashrc file, follow the keyboard shortcuts
at the bottom of the nano text editor to save and close the file.
Once you unzip the downloaded file, you can open the folder and run each worksheet using Jupyter. See Chapter 11 for instructions on how to use Jupyter.
Index
underfitting
classification, 227
regression, 256, 276
unsupervised, 290
URL, 30
reading from, 38
warning, 32, 33
web scraping, 50
permission, 55
Wikipedia, 55
within-cluster sum-of-squared-distances, see WSSD
workflow, see tidymodels, 253
working directory, 366
write function
write_csv, 49
WSSD, 295
total, 297, 306