Data Science With R
R is a programming language used to analyze data. That’s all it’s good for, and it does this one job very, very
well. Any analysis you can think of can be done with R. Because of these features, R has become a very
popular tool for data science.
So what’s “data science”? Long story short, it’s a buzzword used to describe the relatively new field of
applied data analysis. Our data often has valuable information to tell us. Data science is the “science” of
extracting these worthwhile insights from our data. This tutorial aims to teach the basics of data science:
loading data, performing statistics, and conveying this information in a useful manner.
Setup
Before you start, make sure you have both R and RStudio installed and ready to go. RStudio is an
integrated development environment (IDE) for R.
You may also wish to install the tidyverse packages with install.packages("tidyverse") beforehand.
We’ll be using them a lot.
Basic Syntax
The most basic use of R is as a simple calculator:
5 + 4
## [1] 9
1 - 3
## [1] -2
4 * -2
## [1] -8
5 / 6
## [1] 0.8333333
print('hello world!')
## [1] "hello world!"
R also gives us access to more complex mathematical functions. For instance, log() gives us the natural log
of a number, and exp() does the inverse.
log(10)
## [1] 2.302585
exp(log(10))
## [1] 10
But do we know all of this off the top of our heads? Of course not; there’s lots of useful
documentation to leaf through. Importantly, you’ll never stop looking through the docs - it’s a fact of life.
We can look up what a function does by prefixing its name with a ? or by placing our cursor on it in
RStudio and pressing F1.
?log
log package:base R Documentation
Logarithms and Exponentials
Description:
'log' computes logarithms, by default natural logarithms, 'log10'
computes common (i.e., base 10) logarithms, and 'log2' computes
binary (i.e., base 2) logarithms. The general form 'log(x, base)'
computes logarithms with base 'base'.
'log1p(x)' computes log(1+x) accurately also for |x| << 1.
'exp' computes the exponential function.
'expm1(x)' computes exp(x) - 1 accurately also for |x| << 1.
Usage:
log(x, base = exp(1))
logb(x, base = exp(1))
log10(x)
log2(x)
Importantly, the bottom of a help page usually contains executable examples. You can copy and
paste them into your R console and they should work. (A package won’t build properly if the examples don’t
work!) So now that we know where to go for help, we know all we need to know about R, right? ;)
We can also save a value for later by assigning it to a variable.
a <- 5
a
## [1] 5
a * 7
## [1] 35
You may have noticed that we used the <- sign to assign a variable earlier. This is different from most
languages, which use the = sign. We can actually still use the = for assignment in R if we want to.
b = 10
b
## [1] 10
Note that using = is frowned upon in R. You’ll hear that phrase a lot in this lesson… “such and such is
frowned upon”, “so and so is better”. Over the course of this lesson, we’re going to actually try to explain
why we do things each way. In this particular case, both methods are the same in all but one rarely-used
edge case (inline variable assignment). As such, you are free to use whichever operator you choose, but by
convention, R programmers will typically use <- (RStudio hotkey: Alt + -). If you’re interested in doing
things according to the “official style”, you can open up Tools -> Global Options -> Code -> Diagnostics and
tell it to check code style.
One important thing to note is when variables are modified. Let’s demonstrate this via example.
weight_kg <- 55
weight_lb <- 2.2 * weight_kg
weight_lb
## [1] 121
Ok, everyone should be with me so far. But what if we modify the value of weight_kg?
Does weight_lb change as well?
weight_kg <- 9000
weight_lb
## [1] 121
It does not. Variables only update when we explicitly assign them with <-.
weight_lb <- 2.2 * weight_kg
weight_lb
## [1] 19800
We can contrast this with how variable assignment works in Python and other languages, where there is a
distinction between objects and primitives, and assigning an object to a new name just creates a link to the
original set of data. R does not treat any variables differently from others. In R, whenever you assign a
variable under a new name, it creates a copy of the original.
Copy-on-modify
Or, that’s almost how it works. R uses a “copy-on-modify” behavior. Essentially whenever you assign a
variable under a new name, it points at the old variable, and does not take up any extra space. However, as
soon as you modify anything in the new variable, R will create a new copy.
Let’s demonstrate this by example. If you want to follow along for this part, you will need to install
the pryr package (install.packages("pryr")). Also, this will introduce a new function, library(), which
is used to load extra bits of functionality that are not included in the base R programming language.
var1 <- 10
var2 <- var1
Alright, we’ve assigned two variables, var1 and var2. var2 is a copy of var1. Is there a way to check if I’m
telling the truth about the copy-on-modify behavior? Are these the same object?
Fortunately, the pryr package lets us take a closer look at R’s internal workings. The address function can
be used to see the memory address of an R object.
library(pryr)
address(var1)
## [1] "0x537184f4a8"
address(var2)
## [1] "0x537184f4a8"
Without going into too much detail on how memory addresses and allocation work (it’s not important for
writing R code), we can see that both var1 and var2 have the same address: they are stored in the same
spot in your computer. Now let’s modify var2 and check again.
var2 <- var2 + 1
address(var1)
## [1] "0x537184f4a8"
address(var2)
## [1] "0x53716d8198"
The address has changed. After modifying var2, R realized that it could no longer store both variables in
the same spot and made a new copy to store var2 in.
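The same behavior can be seen without pryr. Here is a minimal sketch in base R; the names original and copy are made up for this illustration. tracemem() only works when R was built with memory profiling, so that part is guarded with a check.

```r
# Two names for the same data: no copy is made yet
original <- c(10, 20, 30)
copy <- original

# tracemem() reports when R duplicates an object; it needs an R build with
# memory profiling enabled, so we check for that first
if (capabilities("profmem")) {
  tracemem(original)
}

# Modifying `copy` forces R to duplicate the data first
copy[1] <- 99

if (capabilities("profmem")) {
  untracemem(original)
}

# The original is untouched; only the copy changed
print(original)
print(copy)
```

Where tracemem() is available, it prints a message at the moment the duplication happens, which is the same event the changing address shows.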
This copy-on-modify behavior has important implications for how we write R code. Don’t reassign variable
names just because you want a new name for something. Every time you reassign a variable and modify it (even
slightly), you force R to make a new copy, doubling its memory use. A smarter way of doing things
(especially if you just need to modify a variable) is to do things “in-place”, or simply overwrite the original
variable with its modified value.
Good example
var1 <- 10
var1 <- var1 * 2
Bad example
var1 <- 10
var2 <- var1 * 2
Vectors and indexing
R has a special data structure called a vector. A vector is a 1D set of the same type of object. Most often, a
vector will simply be a sequence of numbers. We can create a sequence of numbers using the : operator.
numbers <- 1:10
numbers
## [1] 1 2 3 4 5 6 7 8 9 10
Note that vectors are treated the same way as a single element. Anything that works on a single number
is applied to every element of a vector, one by one. This is called vectorization.
numbers + 5
## [1] 6 7 8 9 10 11 12 13 14 15
2 ^ numbers
sin(numbers)
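To make the idea concrete, here is a small sketch comparing a vectorized expression with the explicit loop it replaces. The names vectorized and looped are just for this illustration.

```r
numbers <- 1:10

# Vectorized: one expression operates on every element at once
vectorized <- numbers + 5

# The equivalent explicit loop, written out element by element
looped <- numeric(length(numbers))
for (i in seq_along(numbers)) {
  looped[i] <- numbers[i] + 5
}

# Both approaches give the same values
print(all(vectorized == looped))
```

The vectorized form is shorter and is the style you will see in most R code.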
We can also create a vector with the c() function (c stands for concatenate, in case you are wondering).
concat <- c(4, 17, -1, 55, 2)
concat
## [1] 4 17 -1 55 2
Indexing
Often, we don’t want to get the entire vector. Perhaps we only want a single element, or a set of specific
elements. We do this by indexing, which uses square brackets ([]).
To get the first element of a vector, we could do the following. In R, array indexes start at 1 - the 1st
element is at index 1. This is different from 0-based languages like C, Python, or Java where the first
element is at index 0.
concat[1]
## [1] 4
To get other elements, we could do the following:
concat[2]
## [1] 17
concat[length(concat)]
## [1] 2
Notice that for the second example, we put a function inside the square brackets. In this case, length() is
used to get the length of a vector, and since the length of the vector will be equal to the index of its last
element, this is a nifty way of getting the last element of a vector.
We can actually put anything inside the square brackets. Putting another vector inside the brackets gives
us multiple values, for instance.
concat[1:4]
## [1] 4 17 -1 55
concat[c(3, 5)]
## [1] -1 2
We can use this technique to reassign certain values inside the vector. For instance, we could change the
3rd and 5th values to 76 with the following code:
concat[c(3, 5)] <- 76
concat
## [1] 4 17 76 55 76
It’s even possible to index outside the bounds of a vector. Notice that R “fills in the blanks”
with NA values. NA is R’s placeholder for “no data” (since 0 often shows up in real data).
concat
DON’T EVER DO THIS. Though this is actually valid code in R (indexing outside of a vector’s size is an
error in most other languages), it comes at a performance cost: R has to allocate a new, longer vector and
copy the existing values into it.
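Here is a minimal sketch of what assigning past the end of a vector does, and of the usual alternative when the final size is known in advance. The vectors values and squares are hypothetical and not part of the lesson’s data.

```r
# `values` is a hypothetical vector used only for this illustration
values <- c(4, 17, -1)

# Assigning past the end grows the vector; the skipped slots become NA
values[6] <- 100
print(values)

# If the final size is known in advance, allocating it once avoids
# repeatedly growing (and copying) the vector
squares <- numeric(5)
for (i in 1:5) {
  squares[i] <- i^2
}
print(squares)
```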
Matrices
A matrix is a two-dimensional vector. Let’s create a matrix with the matrix() function. One important
note here: functions often have optional, “extra” arguments that are specified with name=value notation. In
this case, we are creating a matrix with 2 rows and 5 columns.
mat <- matrix(1:10, nrow=2, ncol=5)
mat
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
All of the same operations that work on vectors also work on matrices.
mat + 20
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   21   23   25   27   29
## [2,]   22   24   26   28   30
dim(mat)
## [1] 2 5
length(mat)
## [1] 10
However, indexing a matrix is slightly different from indexing in a vector. We now have not only one, but
two dimensions to choose from. When indexing using an object with multiple dimensions, we use a , to
separate them. In R, rows are on the left side of the comma, and columns are on the right (a third
dimension would be after the second comma, and so on…). Note that R is actually trying to help us out
here. When we printed our matrix, the rows have a [#,] next to them, and the columns have a [,#]. This
actually shows us the exact syntax we need to get each element.
So using the matrix output from above, mat[1,] should give us the first row, and mat[,4] should give us
the fourth column. Let’s check this:
mat[1,]
## [1] 1 3 5 7 9
mat[,4]
## [1] 7 8
We can index both rows and columns at the same time to get a specific element (or set of elements).
mat[1, 1]
## [1] 1
mat[2, 1:3]
## [1] 2 4 6
"this is a string"
numbers
## [1] 1 4 9 10
numbers
Again, this conversion has a massive performance hit, especially for a large vector or matrix. R needs to
create an entire new vector from scratch, and then copy over and convert every item.
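A short sketch of the conversion being described: a vector can only hold one type, so putting text into a numeric vector converts every element. The vector readings is hypothetical and used only for this illustration.

```r
# `readings` is a hypothetical vector used only for this illustration
readings <- c(1, 4, 9, 10)
print(class(readings))

# A vector holds a single type, so assigning text converts every element
readings[2] <- "four"
print(class(readings))
print(readings)
```

After the assignment, the remaining numbers are stored as text too, so arithmetic on them no longer works.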
So let’s learn about R’s different data types. There are a few more types that we either won’t cover here or
will cover later (like factors, for instance). These are simply the most common data types we will encounter.
Numeric
Numeric variables can hold any decimal number, positive or negative. In other languages these are called
floats. Note that this is the default type of number in R.
1
## [1] 1
-3.5
## [1] -3.5
We can convert another variable to numeric with the as.numeric() function. This only works with values
that can be easily converted.
as.numeric("456")
## [1] 456
as.numeric("not a number")
## [1] NA
Integers
Integers represent any whole number, positive or negative. They cannot hold decimal numbers.
15L
## [1] 15
We can convert a set of data to integer with the as.integer() function.
as.integer(c("65.3", "4"))
## [1] 65 4
Characters (strings)
As mentioned earlier, strings are sets of text. We can turn something into a string with
the as.character() function.
as.character(TRUE)
## [1] "TRUE"
Logical (booleans)
As usual, we can turn something into a logical/boolean value with the as.logical() function. One
important thing to note is that this very nicely demonstrates what happens when we turn one data type
into another. If one data type cannot hold extra information from another, that information is lost during
conversion.
fifty <- as.logical(50)
fifty
## [1] TRUE
as.numeric(fifty)
## [1] 1
We can check what type a value is with the class() function.
class(14)
## [1] "numeric"
class(1L)
## [1] "integer"
class(TRUE)
## [1] "logical"
class("text")
## [1] "character"
class(NA)
## [1] "logical"
Note that NAs can be of any type, and are often used as placeholders for missing data.
Dataframes
Vectors and matrices are super cool. However they don’t address an important issue: holding multiple
types of data and working with them at the same time. Dataframes are another special data structure that
lets you handle large amounts and different types of data together. Because of this, they are generally the
tool-of-choice for doing analyses in R.
We are going to focus on working with dataframes using the dplyr package. dplyr comes as part of
the tidyverse package bundle; you can install it with install.packages("tidyverse"). It can take
a while to install this on Linux, so perhaps start the command in another window while we go through the
non-dplyr parts.
A small example
In a text editor, create the following example CSV file. We’ll call it cats.csv.
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
Once we’ve saved it in the same directory we’re working in, we can load it with read.csv().
cats <- read.csv('cats.csv')
cats
##     coat weight likes_string
## 1 calico    2.1            1
## 2  black    5.0            0
## 3  tabby    3.2            1
read.csv() always returns a dataframe, and R autodetects the type of each column when it imports the
data. Let’s verify this for ourselves:
class(cats)
## [1] "data.frame"
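If you would rather not create the file by hand, the same steps can be sketched entirely from R. This assumes a temporary file instead of cats.csv in the working directory, and the names csv_path and cats_demo are made up for this example.

```r
# Write the example file to a temporary location (the path is an assumption;
# the lesson saves cats.csv in the working directory instead)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("coat,weight,likes_string",
             "calico,2.1,1",
             "black,5.0,0",
             "tabby,3.2,1"),
           csv_path)

cats_demo <- read.csv(csv_path)

# One row per cat, one column per field in the header
print(dim(cats_demo))
print(class(cats_demo))

# The type R detected for each column
print(sapply(cats_demo, class))
```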
So, we’ve got a dataframe with multiple types of values. How do we work with it? Fortunately, everything
we know about vectors also applies to dataframes.
Each column of a dataframe can be used as a vector. We use the $ operator to specify which column we
want.
cats$weight + 34
class(cats$weight)
## [1] "numeric"
cats$coat
We can also reassign columns as if they were variables. The cats$likes_string column likely represents a set of
boolean values, so let’s update that column to reflect this fact.
class(cats$likes_string) # before
## [1] "integer"
cats$likes_string <- as.logical(cats$likes_string)
class(cats$likes_string) # after
## [1] "logical"
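We can also add a new column by assigning to a column name that doesn’t exist yet. Let’s try adding an
age column with 4 values:
cats$age <- c(1, 6, 4, 2)
Notice how it won’t let us do that: R stops with an error because the replacement has 4 rows and the data
has 3. The reason is that dataframes must have the same number of elements in every column. If each column
only has 3 rows, we can’t add another column with 4 rows. Let’s try that again with the proper number of
elements.
cats$age <- c(1, 6, 4)
cats
##     coat weight likes_string age
## 1 calico    2.1         TRUE   1
## 2  black    5.0        FALSE   6
## 3  tabby    3.2         TRUE   4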
Note that we don’t have to call class() on every single column to figure out what they are. There are a
number of useful summary functions to get information about our dataframe.
str() reports on the structure of your dataframe. It is an extremely useful function - use it on everything if
you’ve loaded a dataset for the first time.
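str(cats)
## 'data.frame':	3 obs. of  4 variables:
##  $ coat        : Factor w/ 3 levels "black","calico",..: 2 1 3
##  $ weight      : num  2.1 5 3.2
##  $ likes_string: logi  TRUE FALSE TRUE
##  $ age         : num  1 6 4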
As with matrices, we can use dim() to know how many rows and columns we’re working with.
dim(cats)
## [1] 3 4
nrow(cats)
## [1] 3
ncol(cats)
## [1] 4
Factors
When we ran str(cats), you might have noticed something weird. cats$coat is listed as a “factor”. A
factor is a special type of data that’s almost a string.
cats$coat
But it’s not a string! The output of str(cats) gives us a clue to what’s actually happening behind-the-
scenes.
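str(cats)
## 'data.frame':	3 obs. of  4 variables:
##  $ coat        : Factor w/ 3 levels "black","calico",..: 2 1 3
##  $ weight      : num  2.1 5 3.2
##  $ likes_string: logi  TRUE FALSE TRUE
##  $ age         : num  1 6 4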
str() reports that the first values are 2, 1, 3 (and not text). Let’s use as.numeric() to reveal its true form!
as.numeric(cats$coat)
## [1] 2 1 3
cats$coat
A factor has two components, its levels and its values. Levels represent all possible values for a column. In
this case, there are only 3 possibilities: black, calico and tabby.
The actual values are 2, 1, and 3. Each value matches up to a specific level. So in our example, the first
value is 2, which corresponds to the second level, calico. The second value is 1, which matches up with
the first level, black.
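The relationship between levels and values can be checked directly. This is a small sketch that builds the factor by hand with factor() rather than reading it from the CSV file; the variable name coat is just for this illustration.

```r
coat <- factor(c("calico", "black", "tabby"))

# Levels are the distinct values, sorted alphabetically by default
print(levels(coat))

# Internally, each element is stored as an index into the levels
print(as.integer(coat))

# Looking the indexes up in the levels recovers the original text
print(levels(coat)[as.integer(coat)])
```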
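Factors in R are a method of storing text information as one of several possible “levels”. In versions of R
before 4.0.0, read.csv() converts text to factors automatically when we import data, like from a CSV file
(since R 4.0.0, text stays as character unless we pass stringsAsFactors=TRUE). We’ve got several options here:
Convert the column back to text ourselves with as.character():
cats$coat <- as.character(cats$coat)
class(cats$coat)
## [1] "character"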
Tell R to simply not convert things to factors when we import it (as.is=TRUE is the R equivalent of “don’t
touch my stuff!”):
new_cats <- read.csv('cats.csv', as.is=TRUE)
class(new_cats$coat)
## [1] "character"
Use the read_csv() function from the readr package. readr is part of the tidyverse and has a number of
ways of reading/writing data with more sensible defaults.
library(tidyverse)
even_newer_cats <- read_csv('cats.csv')
## Parsed with column specification:
## cols(
## coat = col_character(),
## weight = col_double(),
## likes_string = col_integer()
## )
class(even_newer_cats$coat)
## [1] "character"
Data analysis with dplyr
About the rest of this seminar
There are a million different ways to do things in R. This isn’t Python, where solutions on StackOverflow
get ranked on how “Pythonic” they are. If there’s something you like about another workflow in R, there’s
nothing stopping you from using it! Broadly, there are three popular approaches:
• “Base R” - “Base R” means using only functions and stuff built into your base R installation. No
external packages or fancy stuff. The focus here is on stability from version to version - your
code will never break from an update, but performance and usability aren’t always as great.
• data.table - data.table is a dataframe manipulation package known to have very good
performance.
• “The tidyverse” - The “tidyverse” is a collection of packages that overhauls just about
everything in R to use a consistent API. Has comparable performance with data.table.
For much of the rest of this tutorial, we’ll focus on doing things the “tidyverse” way (with a few
exceptions). The biggest reason is that everything follows a consistent API - everything in the tidyverse
works well together. You can often guess how to use a new function because you’ve used others like it. It’s
also got pretty great performance. When you use stuff from the tidyverse, you can be reasonably confident
that someone has already taken a look at optimizing things to speed things along.
Logical indexing
So far, we’ve covered how to extract certain pieces of data via indexing. But what we’ve shown so far only
works if we know the exact index of the data we want (vector[42], for example). There is a neat trick to
extract certain pieces of data in R known as “logical indexing”. It relies on comparisons that return TRUE or
FALSE: == tests whether two values are equal, and ! negates a logical value.
1 == 1
## [1] TRUE
!TRUE
## [1] FALSE
1 != 1
## [1] FALSE
We can also make comparisons with the greater than > and less than < symbols. Pairing these with an
equals sign means “greater than or equal to” (>=) or “less than or equal to” (<=).
4 < 5
## [1] TRUE
5 <= 5
## [1] TRUE
9 > 999
## [1] FALSE
FALSE < TRUE
## [1] TRUE
The last example worked because TRUE and FALSE are equal to 1 and 0, respectively.
TRUE == 1
## [1] TRUE
FALSE == 1
## [1] FALSE
"a" == "a"
## [1] TRUE
"a" != "b"
## [1] TRUE
This trick also works with vectors, returning TRUE or FALSE for every element in the vector.
example <- 1:7
example >= 4
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
another_example <- c("apple", "banana", "banana")
another_example == "banana"
## [1] FALSE  TRUE  TRUE
This trick is extremely useful for getting specific elements. Watch what happens when we index a vector
using a set of boolean values. Using our example from above:
example
## [1] 1 2 3 4 5 6 7
greater_than_3 <- example > 3
greater_than_3
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
example[greater_than_3]
## [1] 4 5 6 7
This can be turned into a one-liner by putting the boolean expression inside the square brackets.
example[example > 3]
## [1] 4 5 6 7
We can also get the elements which were not greater than 3 by adding an ! in front.
example[!example > 3]
## [1] 1 2 3
Using this info, make the following return a number as a result instead of NA.
mean(ugly_data)
## [1] NA
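One way to approach this, sketched with hypothetical values because the contents of ugly_data are not shown here: is.na() returns TRUE for every missing element, so its negation can be used as a logical index.

```r
# Hypothetical values; the lesson's own `ugly_data` is not shown here
ugly_data <- c(1, NA, 5, 7, NA, 30)

# is.na() returns TRUE for every missing element
missing <- is.na(ugly_data)

# Keep only the elements that are not missing, then take the mean
clean_mean <- mean(ugly_data[!missing])
print(clean_mean)
```

mean(ugly_data, na.rm=TRUE) gives the same answer, but the indexing version shows what is actually happening.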
Retrieving rows from dataframes
Let’s try this out on a bigger dataset. nycflights13 is an example dataset containing all outbound flights
from NYC in 2013. You can get this dataset with install.packages("nycflights13").
Let’s take a look at the dataset and see what we’ve got.
library(nycflights13)
head(flights)
## # A tibble: 6 x 19
str(flights)
dim(flights)
## [1] 336776 19
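class(flights)
## [1] "tbl_df"     "tbl"        "data.frame"
flights is a tbl_df (a “tibble”): the tidyverse’s version of a dataframe, which prints only the rows and
columns that fit on screen. To force a tbl_df to print all columns, you can use print(some_tbl_df, width=Inf).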
If we ever get annoyed with a tbl_df, we can turn it back into a dataframe with as.data.frame().
class(as.data.frame(flights))
## [1] "data.frame"
The flights table clocks in at several hundred thousand rows. That’s a fair sized chunk of data.
Nevertheless, our tricks from before work just the same.
Using the same technique from before, let’s retrieve all of the flights that went to Los Angeles (LAX).
rows_with_lax <- flights$dest == "LAX"
flights[rows_with_lax, ]
## # A tibble: 16,174 x 19
# and the same, but in one line
result <- flights[flights$dest == "LAX", ]
unique(result$dest)
## [1] "LAX"
nrow(result)
## [1] 16174
Breaking things apart, we look for all instances where the column dest was equal to “LAX”. We end up
with a vector of whether or not “LAX” was found in each row. We can then use the square brackets to
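extract every row where the vector is true. Note the addition of a comma in our square
brackets. flights has 2 dimensions, so our indexing needs to as well! If we leave the comma out, R tries to
use the vector to select columns instead, and fails:
flights[flights$dest == "LAX"]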
Error: Length of logical index vector must be 1 or 19 (the number of rows), not 336776
One other issue - what happens if we want to grab the flights to either LAX or SEA (Seattle)? Let’s try the
following:
result <- flights[flights$dest == c("LAX", "SEA"), ]
unique(result$dest)
nrow(result)
## [1] 10060
Though in both cases we got results corresponding to the cities we wanted, it looks like something went
wrong. Before, we got 16174 results for just “LAX”. Now we only get 10060, and we even added an extra
city worth of flights! So what’s happening here?
When R compares two vectors of different length, it “recycles” the shorter vector until it matches the
length of the longer one!
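short <- c(1, 2)
long <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
long == short
## Warning in long == short: longer object length is not a multiple of shorter
## object length
## [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE
# what R actually compared against: short, repeated to the length of long
short_recycled <- c(1, 2, 1, 2, 1, 2, 1, 2, 1)
long == short_recycled
## [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE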
This is not what we want. We want to know if elements in the long vector were found “in” the shorter
vector, not whether or not the two are equal at every point. Fortunately, there is a special %in% operator
that does just that.
long[long %in% short]
## [1] 1 1 1 2 2 2
If we take the %in% operator and apply it to our issue, we get the correct number of rows.
res <- flights[flights$dest %in% c("LAX", "SEA"), ]
nrow(res)
## [1] 20097
# our results contain the same number of flights bound for LAX
nrow(res[res$dest == "LAX", ])
## [1] 16174
Let’s try things the “tidyverse” way using dplyr (dplyr is a package that comes as part of
the tidyverse package bundle).
To keep only the rows that match a condition, we use the filter() function. The syntax of this
function is a bit unusual:
library(tidyverse)
results <- filter(flights, dest == "LAX")
nrow(results)
## [1] 16174
Notice how we just used dest all by itself. filter() is smart enough to figure out that dest is a column
name in the flights dataframe.
We can also filter multiple things at once using the & (AND) and | (OR) operators. & checks if both
conditions are true, | checks if at least one condition is true:
TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
TRUE | FALSE
## [1] TRUE
Here is an example that uses this with filter() to fetch all the flights to LAX in February:
filter(flights, dest == "LAX" & month == 2)
## # A tibble: 1,030 x 19
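The %>% pipe (loaded with the tidyverse) passes the result on its left as the first argument to the function
on its right. Repeating an earlier example, we can retrieve the number of flights that went to LAX with:
# earlier example:
nrow(filter(flights, dest == "LAX"))
# the same thing, using the pipe:
flights %>% filter(dest == "LAX") %>% nrow()
## [1] 16174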
Our analysis now flows from left to right, instead of inside out. Makes things quite a bit more readable.
Many people also put each step on a new line. That way if you want to exclude a step, you can just
comment it out.
flights %>%
  filter(dest == "LAX") %>%
  nrow()
## [1] 16174
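To see that the nested and piped forms are the same computation, here is a small sketch on a made-up table. trips is a hypothetical stand-in for flights, so it runs without nycflights13; it does assume dplyr is installed.

```r
library(dplyr)

# Hypothetical stand-in for `flights`
trips <- data.frame(dest = c("LAX", "SEA", "LAX", "ORD"),
                    stringsAsFactors = FALSE)

# Nested calls are read from the inside out
nested <- nrow(filter(trips, dest == "LAX"))

# The pipe passes the left-hand side as the first argument of the next call
piped <- trips %>%
  filter(dest == "LAX") %>%
  nrow()

print(nested == piped)
```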
Controlling output
dplyr also has its own function for selecting columns: select(). To grab certain columns from a
dataframe, we supply their names to select() as arguments.
## # A tibble: 336,776 x 3
We can also sort rows using arrange(). arrange() sorts a dataset by whatever column names you
specify.
## # A tibble: 336,776 x 19
## 5 2013 1 5 458 500 -2 640
## 6 2013 1 6 458 500 -2 718
## 7 2013 1 7 454 500 -6 637
## 8 2013 1 8 454 500 -6 625
## 9 2013 1 9 457 500 -3 647
## 10 2013 1 10 450 500 -10 634
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
To sort in descending order, we can add the desc() function into the mix.
## # A tibble: 336,776 x 19
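Since the calls themselves are easier to follow on a small table, here is a sketch that combines select(), arrange(), and desc(). The trips dataframe and its values are hypothetical; only the function names come from dplyr.

```r
library(dplyr)

# A small hypothetical table standing in for `flights`
trips <- data.frame(
  dest      = c("LAX", "SEA", "LAX", "ORD"),
  dep_delay = c(12, -3, 45, 0),
  distance  = c(2475, 2422, 2475, 733),
  stringsAsFactors = FALSE
)

# Keep two columns, then sort rows from longest to shortest delay
sorted <- trips %>%
  select(dest, dep_delay) %>%
  arrange(desc(dep_delay))

print(sorted)
```

Dropping desc() would sort the same rows from the shortest delay to the longest.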
Data analysis
So far we’ve learned how to rearrange and select parts of our data. What about actually analyzing it?
The group_by() and summarize() functions allow us to group by a certain column (say, city or airline),
and then perform an operation on every group.
A simple example might be grouping by month and then summarizing by the number of flights (rows) in
each group.
flights %>%
group_by(month) %>%
summarize(length(month)) # number of records in a group
## # A tibble: 12 x 2
## month `length(month)`
## <int> <int>
## 1 1 27004
## 2 2 24951
## 3 3 28834
## 4 4 28330
## 5 5 28796
## 6 6 28243
## 7 7 29425
## 8 8 29327
## 9 9 27574
## 10 10 28889
## 11 11 27268
## 12 12 28135
We can also perform multiple “summarizations” at once and name our columns something informative.
flights %>%
group_by(month) %>%
summarize(num_flights=length(month),
avg_flight_time=mean(air_time, na.rm=TRUE))
## # A tibble: 12 x 3
We can also simply add on a column to a dataset with the mutate() function. This is the equivalent
of cats$age <- c(1, 6, 4), like we did earlier.
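A minimal sketch of mutate() on a hand-built version of the cats table. cats_demo and the weight_lb column are made up for this illustration; it assumes dplyr is installed.

```r
library(dplyr)

# Hand-built stand-in for the cats dataframe
cats_demo <- data.frame(
  coat   = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  stringsAsFactors = FALSE
)

# mutate() can add a column from new values or compute one from existing columns
cats_demo <- cats_demo %>%
  mutate(age = c(1, 6, 4),
         weight_lb = weight * 2.2)

print(cats_demo)
```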
colnames(flights)
colnames(flights)
To do this, group our data by carrier, get the average arrival delay for each group, then sort in descending
order so that the worst offenders are at the top.
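The steps just described can be sketched as one pipeline. This uses a tiny made-up table (flights_demo) in place of the real flights data, so the numbers are illustrative only.

```r
library(dplyr)

# Tiny hypothetical stand-in for `flights`
flights_demo <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "DL"),
  arr_delay = c(10, 30, 5, NA, -4),
  stringsAsFactors = FALSE
)

worst_first <- flights_demo %>%
  group_by(carrier) %>%                                        # one group per carrier
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>% # average per group
  arrange(desc(avg_arr_delay))                                 # worst offenders first

print(worst_first)
```

na.rm=TRUE matters here: without it, any carrier with a missing arrival delay would get an average of NA.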
head(airports)
## # A tibble: 6 x 8
## 5 09J Jekyll Island Airport 31.07447 -81.42778 11 -5
## 6 0A9 Elizabethton Municipal Airport 36.37122 -82.17342 1593 -5
## # ... with 2 more variables: dst <chr>, tzone <chr>
In order for this information to be useful to us, we need to match it up and “join” it to our flights table.
This is a pretty complex operation in base R, but dplyr makes it relatively easy.
There are a lot of different types of joins that put together data in different ways. In this case, we’re going
to do what’s called a “left join”: one table is on the left side, and we’ll keep all of its data. However, on the
right side (the table we are joining), we’ll only match up and add each entry if there is a corresponding
entry on the left side.
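Here is a small sketch of that behavior with two hand-built tables. flights_demo, airports_demo, and the destination code "XXX" are hypothetical; the point is only to show which rows survive a left join.

```r
library(dplyr)

# Left table: every row here is kept
flights_demo <- data.frame(flight = c(1, 2, 3),
                           dest = c("SEA", "LAX", "XXX"),
                           stringsAsFactors = FALSE)

# Right table: rows are only used when they match something on the left
airports_demo <- data.frame(faa = c("SEA", "LAX", "ORD"),
                            name = c("Seattle", "Los Angeles", "Chicago"),
                            stringsAsFactors = FALSE)

# Match the `dest` column on the left to the `faa` column on the right
joined_demo <- left_join(flights_demo, airports_demo, by = c("dest" = "faa"))
print(joined_demo)
```

The unmatched "XXX" flight is kept with NA for its name, while the "ORD" airport is dropped because no flight refers to it.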
colnames(flights)
colnames(airports)
# join syntax:
# the "by" argument controls which columns in each table are matched up
joined <- left_join(flights, airports, by = c("dest" = "faa"))
Let’s check our work. SEA should show up as Seattle-Tacoma International Airport. Note: we can use . as a
placeholder to represent the entire object passed to the summarize function (instead of using just a column
name, for instance).
joined %>%
  filter(dest == "SEA") %>%
  select(name) %>%
  head(n=1)
## # A tibble: 1 x 1
## name
## <chr>
Writing functions
Being able to group by and summarize data is great. But so far all we know how to do is use canned
functions - ones that come with base R or one of the packages we’ve covered. We’ll need to write our own
functions eventually.
Functions in R are defined almost the same as variables. The general syntax looks like this:
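function_name <- function(arg1, arg2) {
  # do stuff
  return(result)
}
For example, here is a function that adds two numbers together:
adder <- function(num1, num2) {
  result <- num1 + num2
  return(result)
}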
adder(5, 6)
## [1] 11
We also have the ability to specify optional arguments. Optional arguments are just ones where we’ve
given it a default. In this case, we’ll make our adder function just add 10 if a second number is not
specified. Notice that we’ve also eliminated saving the results variable and do everything in one line.
adder <- function(num1, num2 = 10) {
  return(num1 + num2)
}
adder(5)
## [1] 15
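A short sketch of the different ways a function with a default can be called. The definition repeats the adder function described above so that the block runs on its own.

```r
adder <- function(num1, num2 = 10) {
  return(num1 + num2)
}

# Relying on the default for num2
print(adder(5))

# Overriding the default by position
print(adder(5, 2))

# Naming the arguments lets us pass them in any order
print(adder(num2 = 1, num1 = 5))
```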
Conditional expressions
Sometimes we need to have functions do things differently if a certain condition is met. For this, we use
if/else statements. if executes a block of code if some condition was met.
number <- 5
if (number > 4) {
  print("number was greater than 4!")
}
## [1] "number was greater than 4!"
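number <- 3
if (number > 4) {
  print("number was greater than 4!")
} else {
  print("number was not greater than 4!")
}
number <- 3
if (number > 4) {
  print("number was greater than 4!")
} else if (number == 4) {
  print("number was equal to 4!")
} else {
  print("number was less than 4!")
}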
Though we could use a for loop like this, there is a more efficient way of doing things:
for (i in 1:10) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
purrr provides a set of functions to provide map/reduce-style functionality. “Map” means to apply a
function to every piece of a dataset. “Reduce” means to calculate some kind of summary statistic on a large
group of data.
Let’s demonstrate with several examples using map(). The function requires two arguments, something to
iterate over, and a function to apply to each piece.
library(tidyverse)
mapped <- map(1:10, sqrt)
mapped
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
##
## [[4]]
## [1] 2
##
## [[5]]
## [1] 2.236068
##
## [[6]]
## [1] 2.44949
##
## [[7]]
## [1] 2.645751
##
## [[8]]
## [1] 2.828427
##
## [[9]]
## [1] 3
##
## [[10]]
## [1] 3.162278
Notice how we get back a weird datastructure as a result. map returns a list by default. Lists are a special
datastructure that can contain any type or size of element.
One note on indexing lists: to extract individual elements we need to use two square brackets ([[]])
instead of one ([]). Using single brackets just returns a one-element list.
example_list <- list(1, "two", 3, TRUE)
example_list[1]
## [[1]]
## [1] 1
example_list[[1]]
## [1] 1
The advantage of using map() is that it can operate on a list as if it were a vector, which is normally hard to
do. In this case, we’ll retrieve what type of data is contained in each element. Just as with other tidyverse
functions, we can still use the pipe as well!
example_list %>% map(class)
## [[1]]
## [1] "numeric"
##
## [[2]]
## [1] "character"
##
## [[3]]
## [1] "numeric"
##
## [[4]]
## [1] "logical"
However, we usually don’t want a list back as output. There are a number of extra map functions that let us
specify what type of output we want.
Note that purrr’s map functions will complain if you ask to get output where type coercion would result in
loss of data. This is a useful feature called type-safety.
Execution halted
That error prevents this kind of stuff from happening (without you knowing about it!).
as.integer(sqrt(7))
## [1] 2
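A sketch of the typed map variants and of the type-safety check. It assumes purrr is installed (it comes with the tidyverse); the names roots, types, and outcome are made up, and tryCatch() is used only so the example can capture the error instead of stopping.

```r
library(purrr)

# map_dbl() returns a plain numeric vector instead of a list
roots <- map_dbl(1:4, sqrt)
print(roots)

# map_chr() does the same for text output
types <- map_chr(list(1, "a", TRUE), class)
print(types)

# Converting sqrt(2) to an integer would drop its decimals, so map_int()
# raises an error instead; tryCatch() lets us capture it
outcome <- tryCatch(
  map_int(c(2, 3), sqrt),
  error = function(e) "map_int() refused the conversion"
)
print(outcome)
```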
Here is an example. In this case we want to get the standard error of the mean (SEM), but it is not already
defined in R. We’ll use an anonymous function to define our SEM function in-line.
# any list of numeric vectors works here
map(list(1:4, 1:3, c(1, 3)), function(var) {
  return(sd(var) / sqrt(length(var)))
})
## [[1]]
## [1] 0.6454972
##
## [[2]]
## [1] 0.5773503
##
## [[3]]
## [1] 1
purrr also provides a shortcut to define an anonymous function. We can use ~ to replace
the function keyword, and .x to replace the variable name. Using this shortcut might look like this:
map(list(1:4, 1:3, c(1, 3)), ~ sd(.x) / sqrt(length(.x)))
## [[1]]
## [1] 0.6454972
##
## [[2]]
## [1] 0.5773503
##
## [[3]]
## [1] 1
class(starwars$films)
## [1] "list"
Use map_int() and an anonymous function to determine how many films each character appeared in (from
the starwars$films column).
Operating on matrices
When it comes to working with rows/columns of matrices, the base R apply() function is still
the most convenient tool. Its second argument picks the dimension to work along: 1 for rows, 2 for columns.
apply(mat, 1, sum)
apply(mat, 2, sum)
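A self-contained sketch of both calls, assuming mat is the 2-by-5 matrix of 1 to 10 used in the matrices section. The result names are made up for this illustration.

```r
mat <- matrix(1:10, nrow = 2, ncol = 5)

# MARGIN = 1 applies the function to each row
row_totals <- apply(mat, 1, sum)
print(row_totals)

# MARGIN = 2 applies the function to each column
col_totals <- apply(mat, 2, sum)
print(col_totals)

# Any function works, including an anonymous one
row_ranges <- apply(mat, 1, function(row) max(row) - min(row))
print(row_ranges)
```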
Pretty plots with ggplot2
Though this is technically a course on high-performance computing, I would be doing newcomers a
disservice if we did not at least quickly cover how plotting works in R. We’ll also be profiling
some ggplot2 code in the next section as an example.
ggplot2 is a plotting framework that is (relatively) easy to use, powerful, AND it looks good.
library(ggplot2)
data <- msleep
str(data)
It’s sleep data of some kind. Anyhow, let’s start. ggplot2 revolves around a certain kind of variable: the
ggplot2 object. What is a ggplot2 object? Basically it is your data + information on how to interpret it + the
actual geometry it uses to plot it.
You can add as much data in the initial function call as you want. All of these work, but the final version is
the only “complete” object that fully specifies the data used for the plot.
ggplot()
ggplot(data)
ggplot(data, aes(x = bodywt, y = sleep_total))
To store an object (to add to it later/plot it on demand), just give it a reference. Simply typing the reference
will display the plot (if you’ve provided enough information to make it.)
ref <- ggplot(data, aes(x = bodywt, y = sleep_total))
ref
As you can see, we haven’t specified everything we need yet. There are 3 components to making a plot
with a ggplot object: your data, the aesthetic mappings of your data, and the geometry. If you are missing
one, you won’t get a functional plot.
Your data should be a dataframe with everything you want to plot. Note that it is possible to put data from
multiple sources (ie. different dataframes) in the same plot, but it’s easier if everything is in the same 2-
dimensional dataframe.
The aesthetic mappings tell ggplot2 how to interpret your data. Which values in your dataframe are the y-
values, x-values, what should be used for colors, etc.
The geometry is the actual stuff that goes on the plot. You can specify any geometry as long as you have
supplied the values it needs. If you’ve specified the required aesthetic mappings (which data corresponds to
x, y, etc.), all you need to do is tell ggplot2 to create a certain geometry- for instance a scatterplot.
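Put together, the three components look like this in a single call. This sketch assumes the data is the msleep dataset that ships with ggplot2, which is a guess based on the column names used in this section; sleep_plot is a made-up name.

```r
library(ggplot2)

# msleep ships with ggplot2; using it here is an assumption based on the
# column names that appear in this section
sleep_plot <- ggplot(msleep,                            # 1. the data
                     aes(x = bodywt, y = sleep_total)) + # 2. the aesthetic mappings
  geom_point()                                          # 3. the geometry

# Printing the object draws the plot
print(sleep_plot)
```

Leaving out geom_point() would still give a valid object, but it would draw empty axes with nothing on them.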
Just add the geometry you want to your object. In this case, we are making a scatterplot.
ref
All you need to do to add more information to your plot or change things is add on more elements. Let’s add a
logarithmic scale on the x axis.
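A sketch of adding the scale, again assuming the msleep data and the scatterplot built so far:

```r
library(ggplot2)
data <- ggplot2::msleep  # assumption: the tutorial's `data` is msleep

ref <- ggplot(data, aes(x = bodywt, y = sleep_total)) + geom_point()

# scale_x_log10() transforms the x axis to a base-10 log scale.
ref <- ref + scale_x_log10()
```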
ref
Let’s add a smoothed mean.
ref + geom_smooth()
You can also specify aesthetics inside the call that creates the geometry.
Why didn’t that work? When we specify aesthetics inside a geometry call, they apply only to that
layer (only geom_point got the x and y values). The only aesthetics that get passed to every
geometry call are the ones specified in the initial creation of the ggplot object.
ggplot(data) + scale_x_log10() +
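The call above is cut off in this copy. Here is a minimal sketch that reproduces the problem being described, assuming data is msleep and that the missing layers were geom_point() with its own aes() followed by geom_smooth():

```r
library(ggplot2)
data <- ggplot2::msleep  # assumption: the tutorial's `data` is msleep

# x and y are mapped only inside geom_point(), so geom_smooth() never sees them.
broken <- ggplot(data) +
  scale_x_log10() +
  geom_point(aes(x = bodywt, y = sleep_total)) +
  geom_smooth()

# The error appears when the plot is built (which printing does for us).
# Capture it here so the script keeps running.
result <- tryCatch(
  {
    ggplot_build(broken)
    "built without error"
  },
  error = function(e) "error: geom_smooth() has no x and y to work with"
)
result
```

Building the plot should fail because geom_smooth() is missing its required x and y aesthetics.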
It’s important to note that a geometry will automatically use any aesthetic mappings that it understands and
ignore the ones it doesn’t. So if you specify as many mappings as you can in the initial call, it’ll save
you work.
Like this:
ggplot(data, aes(x = bodywt, y = sleep_total)) + scale_x_log10() + geom_point() + geom_smooth()
Hint: conservation status is data$conservation and time awake is data$awake. To make a scatterplot,
use geom_point().
Let’s follow up with a few very common plot/geometry types and mappings you might be interested in:
These x and y mappings (and the log scale on the x axis) will be used for all later plots.
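The definition of plot is missing from this copy. Going by the sentence above, a reasonable guess (assuming data is msleep) is:

```r
library(ggplot2)
data <- ggplot2::msleep  # assumption: the tutorial's `data` is msleep

# Base object reused by the examples that follow: x/y mappings plus a log x axis.
# Note that naming it `plot` masks base R's plot() function in this session.
plot <- ggplot(data, aes(x = bodywt, y = sleep_total)) + scale_x_log10()
```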
plot + geom_point()
First, let’s add color based on what things eat. Note that ggplot2 automatically adds a legend.
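The code for this step is not shown. A sketch, assuming "what things eat" is the vore column of msleep and that plot is the base object described earlier:

```r
library(ggplot2)
data <- ggplot2::msleep  # assumption: the tutorial's `data` is msleep
plot <- ggplot(data, aes(x = bodywt, y = sleep_total)) + scale_x_log10()

# Map point color to diet type; ggplot2 builds the legend for us.
by_diet <- plot + geom_point(aes(color = vore))
by_diet
```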
We used a categorical variable there, but we can also use a continuous variable for color.
plot + geom_point(aes(color = log(brainwt)))
Set the limits of a scale
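The example for this heading is missing, so the following is only an illustration. It assumes the msleep data and fixes the y axis to the 0 to 24 hour range using the limits argument of scale_y_continuous():

```r
library(ggplot2)
data <- ggplot2::msleep  # assumption: the tutorial's `data` is msleep
plot <- ggplot(data, aes(x = bodywt, y = sleep_total)) + scale_x_log10()

# limits = c(min, max) fixes the range of the scale.
# Points that fall outside the limits are dropped with a warning.
limited <- plot +
  geom_point(aes(color = log(brainwt))) +
  scale_y_continuous(limits = c(0, 24))
limited
```

The same limits argument is accepted by the other continuous scale functions, such as scale_color_continuous().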
Changing size and alpha works the same way:
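The original example is not shown. As a sketch, assuming the msleep data, and with sleep_rem and sleep_cycle picked purely for illustration:

```r
library(ggplot2)
data <- ggplot2::msleep  # assumption: the tutorial's `data` is msleep
plot <- ggplot(data, aes(x = bodywt, y = sleep_total)) + scale_x_log10()

# Map marker size and transparency to columns, just like color.
# The choice of sleep_rem and sleep_cycle is an assumption for this example.
sized <- plot + geom_point(aes(size = sleep_rem, alpha = sleep_cycle))
sized
```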
If we want to simply change a plot value like marker shape or size without mapping it to data, just specify
it outside the call to aesthetics.
plot + geom_point(aes(shape = vore), size = 6, color = "orange")
Exercise - Another plot example
How would I make a scatterplot of conservation status (data$conservation) vs time awake (data$awake),
with the color mapped to vore (data$vore) and the size mapped to the log of brain weight
(log(data$brainwt))? Bonus points if you add axis labels and a title.
Boxplot
Note that the summary statistics (median, quartiles, and outliers) are computed automatically.
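The boxplot call itself is missing from this copy. A minimal sketch, assuming the msleep data and a comparison of total sleep across diet types:

```r
library(ggplot2)
data <- ggplot2::msleep  # assumption: the tutorial's `data` is msleep

# One box per diet type. geom_boxplot() computes the median, quartiles,
# and outliers from the raw sleep_total values.
box <- ggplot(data, aes(x = vore, y = sleep_total, fill = vore)) +
  geom_boxplot()
box
```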
Line plot with different groups
ggplot(data, aes(x = bodywt, y = brainwt, group = vore, color = vore)) +
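The call above ends at a dangling +, so the layer that followed is unknown. Given the heading, one plausible completion (an assumption, not the original code) is a geom_line() layer:

```r
library(ggplot2)
data <- ggplot2::msleep  # assumption: the tutorial's `data` is msleep

# group = vore draws one line per diet type; color = vore tells them apart.
# geom_line() is a guess at the layer that was cut off.
lines <- ggplot(data, aes(x = bodywt, y = brainwt, group = vore, color = vore)) +
  geom_line()
lines
```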
1D density
ggplot(data, aes(x = sleep_total, fill = vore)) + geom_density(alpha = 0.5)
Violin plot
ggplot(data, aes(x = vore, y = sleep_total)) + geom_violin()
Bar plot
ggplot(data, aes(x = vore)) + geom_bar()
Note that geom_bar() automatically counts the number of rows for each value of “vore”. To get a bar plot
that simply plots the values you feed it, use geom_bar(stat = "identity") (geom_col() is a shortcut for the same thing).
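To see the difference, here is a sketch that counts the rows by hand and then plots those counts as they are. It assumes data is msleep; the counts dataframe is made up for this example:

```r
library(ggplot2)
data <- ggplot2::msleep  # assumption: the tutorial's `data` is msleep

# Pre-compute the number of animals per diet type (NA values are dropped).
counts <- as.data.frame(table(vore = data$vore))

# stat = "identity" plots the Freq column as given instead of counting rows.
bars <- ggplot(counts, aes(x = vore, y = Freq)) +
  geom_bar(stat = "identity")
bars
```

This should draw the same bar heights as geom_bar() on the raw data, minus the bar for missing vore values.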
Hint: you will need the following aesthetics: x, y, and fill. The geometry you want to use
is geom_boxplot()
Reports with R Markdown
Making plots and being able to write output files is important. However, the most important part of any
analysis is communicating your findings effectively! The rmarkdown package is one of the best and easiest
ways of doing this. It lets you create a fancy-looking report without ever leaving RStudio. You can
interleave your code and plots together with explanation text and formatting you might normally do in a
tool like Microsoft Word. You can even make websites with it (like this one!)!
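Let’s get started with R Markdown and make ourselves a sweet report. In RStudio, create a new R Markdown
document (File > New File > R Markdown...) and accept the defaults.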
You should see a file open with a lot of text and code already there. This is the default R Markdown
document and comes with a lot of pointers about how to do things. For now, save the file as example.Rmd,
and hit the “Knit” button.
If everything went well, you should see a nicely formatted webpage/report pop up in RStudio’s built-in
browser. Congratulations, you’ve made your first report!
Basic formatting
Now that we know everything’s working, let’s start from scratch and learn how R Markdown works step-
by-step. The first thing we should do is delete everything except for the following:
---
title: "test"
output: html_document
---
The stuff between the ---s is called the YAML header. YAML is a configuration format; here it describes what
kind of document to make, who wrote it, and other technical bits that aren’t actual content in your report.
If you want, you can modify it, but let’s leave it be for now.
At the bottom of the document, let’s start typing some text. Everything that follows is example markdown
output and syntax:
Normal characters result in normal text.
# A header
## A smaller header
----------------------
* Item one
* Item two
* Item three
    * indenting four spaces makes a nested list
1. first
2. second
3. third
`some code`
```
a block of text
```

5 + 6
## [1] 11
If you want to display data, you have a couple of options. The first is just to print out data the way you
normally would.
library(gapminder)
head(gapminder)
However, there is a special function knitr::kable() that inserts a nice-looking table into your report.
Depending on what your output format is, it sometimes is even interactive!
knitr::kable(head(gapminder))
country      continent  year  lifeExp       pop  gdpPercap
Afghanistan  Asia       1952   28.801   8425333   779.4453
Afghanistan  Asia       1957   30.332   9240934   820.8530
Afghanistan  Asia       1962   31.997  10267083   853.1007
Afghanistan  Asia       1967   34.020  11537966   836.1971
Afghanistan  Asia       1972   36.088  13079460   739.9811
Afghanistan  Asia       1977   38.438  14880372   786.1134
You can put code inline in text by typing `r some code`: the letter r, a space, and then your code, all
wrapped in a single pair of backticks.
library(ggplot2)
geom_boxplot()