Uploaded by

Deepanshu Kumar

Data Science with R

R is a programming language used to analyze data. That’s all it’s good for, and it does this one job very, very
well. Any analysis you can think of can be done with R. Because of these features, R has become a very
popular tool for data science.

So what’s “data science”? Long story short, it’s a buzzword used to describe the relatively new field of
applied data analysis. Our data often has valuable information to tell us. Data science is the “science” of
extracting these worthwhile insights from our data. This tutorial aims to teach the basics of data science:
loading data, performing statistics, and conveying this information in a useful manner.

At the end of this webinar, you’ll know how to:

• Write and run R code.


• Analyze data with dplyr and purrr.
• Make sweet plots with ggplot2.
• Write reports using R Markdown.

Setup
Before you start, make sure you have both R and RStudio installed and ready to go. RStudio is an
integrated development environment (IDE) for R.

You may also wish to install the tidyverse packages with install.packages("tidyverse") beforehand.
We’ll be using them a lot.

Basic Syntax
The most basic use of R is as a simple calculator:

5 + 4

## [1] 9

1 - 3

## [1] -2

4 * -2

## [1] -8

5 / 6

## [1] 0.8333333

A function in R follows the syntax function_name(argument1, argument2). Functions perform
operations on their arguments and return a result. One of the most basic functions is print().

print('hello world!')

## [1] "hello world!"

R also gives us access to more complex mathematical functions. For instance, log() gives us the natural log
of a number, and exp() does the inverse.

log(10)

## [1] 2.302585

exp(log(10))

## [1] 10

But do we know all of this off the top of our heads? Of course not; there’s lots of useful
documentation to leaf through. Importantly, you’ll never stop looking through the docs - it’s a fact of life.
We can look up what a function does by prefixing it with a ? or hovering over it with our cursor in
RStudio and pressing F1.

?log()
log package:base R Documentation
Logarithms and Exponentials
Description:
'log' computes logarithms, by default natural logarithms, 'log10'
computes common (i.e., base 10) logarithms, and 'log2' computes

binary (i.e., base 2) logarithms. The general form 'log(x, base)'
computes logarithms with base 'base'.
'log1p(x)' computes log(1+x) accurately also for |x| << 1.
'exp' computes the exponential function.
'expm1(x)' computes exp(x) - 1 accurately also for |x| << 1.

Usage:
log(x, base = exp(1))
logb(x, base = exp(1))
log10(x)
log2(x)

# further output omitted for brevity
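The Usage section above shows that log() accepts an optional base argument. As a small sketch of how that plays out, using only functions named in the help page (the input values are arbitrary illustrations):

```r
# The base argument defaults to exp(1), i.e. the natural logarithm.
log(exp(1))           # 1

# Passing base explicitly changes the logarithm being computed.
log(100, base = 10)   # 2

# log10() and log2() are shortcuts for the two most common bases.
log10(100)            # 2
log2(8)               # 3
```

Arguments can be given by name (base = 10) or by position (log(100, 10)); naming them tends to be easier to read.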

Importantly, the bottom of a help page usually contains executable examples. You can copy and
paste them into your R console and they should work. (A package won’t pass R’s standard checks if its
examples fail!) So now that we know where to go for help, we know all we need to know about R, right? ;)

Variables and assignment


We’ll probably want to save our answers at some point. We do this by assigning a variable. A variable is a
name for a saved piece of data - let’s show some examples:

Assign the value 5 to the variable “a”:

a <- 5

a

## [1] 5

Note that a and its value can be used interchangeably:

a * 7

## [1] 35

You may have noticed that we used the <- sign to assign a variable earlier. This is different from most
languages, which use the = sign. We can actually still use the = for assignment in R if we want to.

b = 10

b

## [1] 10

Note that using = is frowned upon in R. You’ll hear that phrase a lot in this lesson… “such and such is
frowned upon”, “so and so is better”. Over the course of this lesson, we’re going to actually try to explain
why we do things each way. In this particular case, both methods are the same in all but one rarely-used
edge case (inline variable assignment). As such, you are free to use whichever operator you choose, but by
convention, R programmers will typically use <- (RStudio hotkey: Alt + -). If you’re interested in doing
things according to the “official style”, you can open up Tools -> Global Options -> Code -> Diagnostics and
tell it to check code style.
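To make the "inline assignment" edge case concrete, here is a minimal sketch. Inside a function call, = names an argument, while <- really does create a variable. The scratch environment and the use of median() are illustrative choices, not part of the original lesson; the environment is only there so the check does not depend on what is already in your workspace.

```r
# A fresh, empty environment to run the two calls in.
scratch <- new.env()

# `=` inside a call only names the argument x; no variable is created.
evalq(median(x = 1:10), scratch)
exists("x", envir = scratch, inherits = FALSE)   # FALSE

# `<-` inside a call assigns y first, then passes its value along.
evalq(median(y <- 1:10), scratch)
exists("y", envir = scratch, inherits = FALSE)   # TRUE
```

Outside of function calls, the two operators behave the same way.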

One important thing to note is when variables are modified. Let’s demonstrate this via example.

weight_kg <- 55

weight_lb <- weight_kg * 2.2

weight_lb

## [1] 121

Ok, everyone should be with me so far. But what about if we modify the value of weight_kg,
does weight_lb change as well?

weight_kg <- 9000

weight_lb

## [1] 121

It does not. Variables only update when we explicitly assign them with <-.

weight_lb <- weight_kg * 2.2

weight_lb

## [1] 19800

Contrast this with how assignment works in Python and some other languages, where there is a
distinction between objects and primitives, and assigning an object to a new name just creates a link to the
original data. R does not treat any variables differently from others. In R, whenever you assign a
variable under a new name, it creates a copy of the original.

Copy-on-modify
Or, that’s almost how it works. R uses a “copy-on-modify” behavior. Essentially whenever you assign a
variable under a new name, it points at the old variable, and does not take up any extra space. However, as
soon as you modify anything in the new variable, R will create a new copy.

Let’s demonstrate this by example. If you want to follow along for this part, you will need to install
the pryr package (install.packages("pryr")). Also, this will introduce a new function, library(), which
is used to load extra bits of functionality that are not included in the base R programming language.

var1 <- 10

var2 <- var1

Alright, we’ve assigned two variables, var1 and var2. var2 is a copy of var1. Is there a way to check if I’m
telling the truth about the copy-on-modify behavior? Are these the same object?

Fortunately, the pryr package lets us take a closer look at R’s internal workings. The address function can
be used to see the memory address of an R object.

library(pryr)

address(var1)

## [1] "0x537184f4a8"

address(var2)

## [1] "0x537184f4a8"

Without going into too much detail on how memory addresses and allocation works (it’s not important for
writing R code), we can see that both var1 and var2 have the same address: they are stored in the same
spot in your computer.

What happens if we modify var2?

var2 <- var1 + 1

address(var1)

## [1] "0x537184f4a8"

address(var2)

## [1] "0x53716d8198"

The address has changed. After modifying var2, R realized that it could no longer store both variables in
the same spot and made a new copy to store var2 in.
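If you don't have pryr installed, the visible effect of copy-on-modify can still be seen in base R. This is only a sketch with made-up variable names: it shows that changing the second name never alters the first, but it does not inspect memory addresses.

```r
original <- c(10, 20, 30)

# No data is copied here; both names refer to the same values.
alias <- original

# Changing one element of alias makes R copy the data first,
# so original is left untouched.
alias[2] <- 99

original   # 10 20 30
alias      # 10 99 30
```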

This copy-on-modify behavior has important implications for how we write R code. Don’t reassign a variable
just because you want a new name for it. Every time you reassign a variable and modify it (even
slightly), you force R to make a new copy, doubling its memory use. A smarter way of doing things
(especially if you just need to modify a variable) is to do things “in-place”, or simply overwrite the original
variable with its modified value.

Good example

var1 <- 10

var1 <- var1 + 1

Bad example

var1 <- 10

var1_modified <- var1 + 1

Vectors and indexing
R has a special data structure called a vector. A vector is a 1D set of the same type of object. Most often, a
vector will simply be a sequence of numbers. We can create a sequence of numbers using the : operator.

numbers <- 1:10

numbers

## [1] 1 2 3 4 5 6 7 8 9 10

Note that vectors are treated the same way as a single element. Anything that works on a single number
works the same way on a vector. This is called vectorization.

numbers + 5

## [1] 6 7 8 9 10 11 12 13 14 15

2 ^ numbers

## [1] 2 4 8 16 32 64 128 256 512 1024

sin(numbers)

## [1] 0.8414710 0.9092974 0.1411200 -0.7568025 -0.9589243 -0.2794155

## [7] 0.6569866 0.9893582 0.4121185 -0.5440211
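To see what vectorization saves us, here is a rough sketch of the same "add 5" operation written as an explicit loop and then as a single vectorized expression. The variable names are illustrative only.

```r
numbers <- 1:10

# The long way: visit each element in turn.
looped <- numeric(length(numbers))
for (i in seq_along(numbers)) {
  looped[i] <- numbers[i] + 5
}

# The vectorized way: one expression handles every element.
vectorized <- numbers + 5

identical(looped, vectorized)   # TRUE
```

Both give the same answer; the vectorized form is shorter and is generally the faster of the two in R.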

We can also create a vector with the c() function (c stands for concatenate, in case you are wondering).

concat <- c(4, 17, -1, 55, 2)

concat

## [1] 4 17 -1 55 2

Indexing
Often, we don’t want to get the entire vector. Perhaps we only want a single element, or a set of specific
elements. We do this by indexing, which uses square brackets ([]).

To get the first element of a vector, we could do the following. In R, indexes start at 1: the 1st
element is at index 1. This is different from 0-based languages like C, Python, or Java, where the first
element is at index 0.

concat[1]

## [1] 4

To get other elements, we could do the following:

concat[2] # second element

## [1] 17

concat[length(concat)] # last element

## [1] 2

Notice that for the second example, we put a function inside the square brackets. In this case, length() is
used to get the length of a vector, and since the length of the vector will be equal to the index of its last
element, this is a nifty way of getting the last element of a vector.

We can actually put anything inside the square brackets. Putting another vector inside the brackets gives
us multiple values, for instance.

concat[1:4]

## [1] 4 17 -1 55

concat[c(3, 5)]

## [1] -1 2

We can use this technique to reassign certain values inside the vector. For instance, we could change the
3rd and 5th values to 76 with the following code:

concat[c(3, 5)] <- 76

concat

## [1] 4 17 76 55 76

It’s even possible to index outside the bounds of a vector. Notice that R “fills in the blanks”
with NA values. NA is R’s placeholder for “no data” (since 0 often shows up in real data).

concat[10] <- 4.3

concat

## [1] 4.0 17.0 76.0 55.0 76.0 NA NA NA NA 4.3

DON’T EVER DO THIS. Though this is actually valid code in R (indexing outside of a vector’s size is an
error in most other languages), it comes at a performance cost.
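A common alternative to growing a vector by assigning past its end is to create it at its final size up front and then fill it in. The following is a sketch under that assumption; the size of 10 and the squaring are arbitrary choices for illustration.

```r
n <- 10

# Allocate all n slots once (filled with zeros).
squares <- numeric(n)

# Every assignment now lands inside the existing vector.
for (i in seq_len(n)) {
  squares[i] <- i^2
}

squares   # 1 4 9 16 25 36 49 64 81 100
```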

Matrices
A matrix is a two-dimensional vector. Let’s create a matrix with the matrix() function. One important
note here: functions often have optional, “extra” arguments that are specified with name=value notation. In
this case, we are creating a matrix with 2 rows and 5 columns.

mat <- matrix(1:10, nrow=2, ncol=5)

mat

## [,1] [,2] [,3] [,4] [,5]

## [1,] 1 3 5 7 9

## [2,] 2 4 6 8 10

All of the same operations that work on vectors also work on matrices.

mat + 20

## [,1] [,2] [,3] [,4] [,5]

## [1,] 21 23 25 27 29

## [2,] 22 24 26 28 30

dim(mat) # grab dimensions of a matrix

## [1] 2 5

length(mat) # number of elements in a matrix

## [1] 10

However, indexing a matrix is slightly different from indexing in a vector. We now have not only one, but
two dimensions to choose from. When indexing using an object with multiple dimensions, we use a , to
separate them. In R, rows are on the left side of the comma, and columns are on the right (a third
dimension would be after the second comma, and so on…). Note that R is actually trying to help us out
here. When we printed our matrix, the rows have a [#,] next to them, and the columns have a [,#]. This
actually shows us the exact syntax we need to get each element.

So using the matrix output from above, mat[1,] should give us the first row, and mat[,4] should give us
the fourth column. Let’s check this:

mat[1,]

## [1] 1 3 5 7 9

mat[,4]

## [1] 7 8

We can index both rows and columns at the same time to get a specific element.

mat[1, 1] # grab first row, first column

## [1] 1

mat[2, 1:3] # elements 1-3 of the second row

## [1] 2 4 6
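The "one comma per extra dimension" rule mentioned above extends beyond matrices. As a brief sketch, base R's array() function builds an object with three dimensions; the sizes used here are arbitrary.

```r
# 2 rows, 3 columns, 4 "slices", filled column by column with 1:24.
arr <- array(1:24, dim = c(2, 3, 4))

dim(arr)        # 2 3 4

# Row 1, column 2, slice 3.
arr[1, 2, 3]    # 15

# Leaving a position blank keeps everything along that dimension:
# this is the whole first slice, a 2 x 3 matrix.
arr[, , 1]
```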

Exercise - Grabbing unrelated elements


Try getting the columns 1, 4, and 5 of the first row in one command.

Exercise - Reading documentation


Create an 8x5 matrix using the numbers 1:40. See if you can get it to fill by row instead of by column.

Hint: you should check the documentation for matrix().

Different types of data


In most programming languages, text is called a string. To create text in R, we simply surround it with
double (") or single (') quotes. We’ve actually seen an example of a string already
(print('hello world!')).

"this is a string"

## [1] "this is a string"

paste('we can combine strings', 'with the paste() function')

## [1] "we can combine strings with the paste() function"

What happens when we add a string to a vector of numbers?

numbers <- c(1, 4, 9, 10)

numbers

## [1] 1 4 9 10

numbers[3] <- "testing..."

numbers

## [1] "1" "4" "testing..." "10"


Our entire vector of numbers was turned into strings! This is an important property of vectors and
matrices: they can only hold one type of data! If we try putting a different type into a vector, R will
convert the entire vector to the new datatype.

Again, this conversion has a massive performance hit, especially for a large vector or matrix. R needs to
create an entire new vector from scratch, and then copy over and convert every item.
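The conversion follows a predictable order: logical values become integers, integers become decimal numbers, and anything becomes a string. A short sketch using c() and class() (a base R function that reports the type of an object):

```r
class(c(TRUE, 2L))       # "integer":   TRUE becomes 1
class(c(2L, 3.5))        # "numeric":   2 becomes 2.0
class(c(3.5, "text"))    # "character": 3.5 becomes "3.5"

# Mixing everything ends at the most general type, character.
mixed <- c(TRUE, 2L, 3.5, "text")
mixed                    # "TRUE" "2" "3.5" "text"
```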

So let’s learn about R’s different data types. There are a few more types that we either don’t cover here or
cover later (like factors, for instance). These are simply the most common data types we will encounter.

Numeric
Numeric variables can hold any decimal number, positive or negative. In other languages these are called
floats. Note that this is the default type of number in R.

To create a numeric value, all we have to do is type it normally.

1

## [1] 1

-3.5

## [1] -3.5

We can convert another variable to numeric with the as.numeric() function. This only works with values
that can be easily converted.

as.numeric("456") # this works

## [1] 456

as.numeric("seven") # this does not

## Warning: NAs introduced by coercion

## [1] NA

Integers
Integers represent any whole number, positive or negative. They cannot hold decimal numbers.

To create a number explicitly as an integer, add an L after it.

15L

## [1] 15

We can convert a set of data to integer with the as.integer() function.

as.integer(c("65.3", "4"))

## [1] 65 4

Characters (strings)
As mentioned earlier, strings are sets of text. We can turn something into a string with
the as.character() function.

as.character(TRUE)

## [1] "TRUE"

Logical (boolean) values


There are just two logical values: TRUE and FALSE. These are used, quite literally, to represent whether or
not a statement is true or false.

As usual, we can turn something into a logical/boolean value with the as.logical() function. One
important thing to note is that this very nicely demonstrates what happens when we turn one data type
into another. If one data type cannot hold extra information from another, that information is lost during
conversion.

fifty <- as.logical(50)

fifty

## [1] TRUE

as.numeric(fifty)

## [1] 1

Determining data types


If something isn’t working the way it’s supposed to, a great technique is to check what type of data it is
with the class() function:

class(14)

## [1] "numeric"

class(1L)

## [1] "integer"

class(TRUE)

## [1] "logical"

class("text")

## [1] "character"

class(NA)

## [1] "logical"

Note that NA’s can be of any type, and are often used as placeholders for missing data.
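A plain NA is logical, but R also provides typed versions, and an NA placed inside a vector takes on that vector's type. A small sketch (the values are arbitrary):

```r
class(NA)               # "logical"
class(NA_integer_)      # "integer"
class(NA_character_)    # "character"

# Inside a vector, NA adapts to the type of its neighbours.
with_gap <- c(1.5, NA, 3)
class(with_gap)         # "numeric"

# is.na() reports which positions are missing.
is.na(with_gap)         # FALSE TRUE FALSE
```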

Exercise - Converting data types


See if you can make the following expressions work:

5 + "10" - Get 15 as a result.

3 + 3.7 - Get 6 as a result.

1 + 1 - Your answer should equal “11”.

"4" + 10 - Get 5 as a result.

Dataframes
Vectors and matrices are super cool. However they don’t address an important issue: holding multiple
types of data and working with them at the same time. Dataframes are another special data structure that
lets you handle large amounts and different types of data together. Because of this, they are generally the
tool-of-choice for doing analyses in R.

We are going to focus on working with dataframes using the dplyr package. dplyr comes as part of
the tidyverse package bundle, which you can install with install.packages("tidyverse"). It can take
a while to install this on Linux, so perhaps start the command in another window while we go through the
non-dplyr parts.

A small example
In a text editor, create the following example CSV file. We’ll call it cats.csv.

coat,weight,likes_string

calico,2.1,1

black,5.0,0

tabby,3.2,1

Once we’ve saved it in the same directory we’re working in, we can load it with read.csv().

cats <- read.csv('cats.csv')

cats

## coat weight likes_string

## 1 calico 2.1 1

## 2 black 5.0 0

## 3 tabby 3.2 1
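If you would rather not leave R to create the file, the same CSV can be written from R itself. This is a sketch that assumes a temporary file is acceptable; csv_path and cats_from_temp are names made up for the illustration, and the lesson itself uses a file called cats.csv in the working directory.

```r
# A throwaway file location managed by R.
csv_path <- tempfile(fileext = ".csv")

# Write the same four lines shown above, one string per line.
writeLines(c(
  "coat,weight,likes_string",
  "calico,2.1,1",
  "black,5.0,0",
  "tabby,3.2,1"
), csv_path)

# Read it back exactly as we did with cats.csv.
cats_from_temp <- read.csv(csv_path)
cats_from_temp

# Clean up the temporary file when done.
unlink(csv_path)
```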

Whenever we import a dataset with multiple types of values, R will autodetect this and make the output a
dataframe. Let’s verify this for ourselves:

class(cats)

## [1] "data.frame"

So, we’ve got a dataframe with multiple types of values. How do we work with it? Fortunately, everything
we know about vectors also applies to dataframes.

Each column of a dataframe can be used as a vector. We use the $ operator to specify which column we
want.

cats$weight + 34

## [1] 36.1 39.0 37.2

class(cats$weight)

## [1] "numeric"

cats$coat

## [1] calico black tabby

## Levels: black calico tabby

We can also reassign columns as if they were variables. The cats$likes_string column likely represents a set of
boolean values, so let’s update that column to reflect this fact.

class(cats$likes_string) # before

## [1] "integer"

cats$likes_string <- as.logical(cats$likes_string)

class(cats$likes_string)

## [1] "logical"

We can even add a column if we want!

cats$age <- c(1, 6, 4, 2.5)

Error in `$<-.data.frame`(`*tmp*`, age, value = c(1, 6, 4, 2.5)) :

replacement has 4 rows, data has 3

Notice how it won’t let us do that. The reason is that dataframes must have the same number of elements
in every column. If each column only has 3 rows, we can’t add another column with 4 rows. Let’s try that
again with the proper number of elements.

cats$age <- c(1, 6, 4)

cats

## coat weight likes_string age

## 1 calico 2.1 TRUE 1

## 2 black 5.0 FALSE 6

## 3 tabby 3.2 TRUE 4

Note that we don’t have to call class() on every single column to figure out what they are. There are a
number of useful summary functions to get information about our dataframe.

str() reports on the structure of your dataframe. It is an extremely useful function - use it on everything if
you’ve loaded a dataset for the first time.

str(cats)

## 'data.frame': 3 obs. of 4 variables:

## $ coat : Factor w/ 3 levels "black","calico",..: 2 1 3

## $ weight : num 2.1 5 3.2

## $ likes_string: logi TRUE FALSE TRUE

## $ age : num 1 6 4

As with matrices, we can use dim() to know how many rows and columns we’re working with.

dim(cats)

## [1] 3 4

nrow(cats) # number of rows only

## [1] 3

ncol(cats) # number of columns only

## [1] 4

Factors
When we ran str(cats), you might have noticed something weird. cats$coat is listed as a “factor”. A
factor is a special type of data that’s almost a string.

It prints like a string (sort of):

cats$coat

## [1] calico black tabby

## Levels: black calico tabby

It can be used like a string:

paste("The cat is", cats$coat)


## [1] "The cat is calico" "The cat is black" "The cat is tabby"

But it’s not a string! The output of str(cats) gives us a clue to what’s actually happening behind the
scenes.

str(cats)

## 'data.frame': 3 obs. of 4 variables:

## $ coat : Factor w/ 3 levels "black","calico",..: 2 1 3

## $ weight : num 2.1 5 3.2

## $ likes_string: logi TRUE FALSE TRUE

## $ age : num 1 6 4

str() reports that the first values are 2, 1, 3 (and not text). Let’s use as.numeric() to reveal its true form!

as.numeric(cats$coat)

## [1] 2 1 3

cats$coat

## [1] calico black tabby

## Levels: black calico tabby

A factor has two components, its levels and its values. Levels represent all possible values for a column. In
this case, there are only 3 possibilities: black, calico and tabby.

The actual values are 2, 1, and 3. Each value matches up to a specific level. So in our example, the first
value is 2, which corresponds to the second level, calico. The second value is 1, which matches up with
the first level, black.
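The link between levels and values can be reproduced without a dataframe. A minimal sketch, building a factor by hand with base R's factor() function (the variable name coat is just for illustration):

```r
coat <- factor(c("calico", "black", "tabby"))

# Levels are the distinct values, sorted alphabetically by default.
levels(coat)                      # "black" "calico" "tabby"

# The stored values are positions within the levels.
as.integer(coat)                  # 2 1 3

# Looking those positions up in the levels recovers the text.
levels(coat)[as.integer(coat)]    # "calico" "black" "tabby"
```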

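Factors in R are a method of storing text information as one of several possible “levels”. In versions of R
before 4.0.0, read.csv() converts text to factors automatically when we import data, like from a CSV file;
that is what produced the output above. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so
text columns stay as character vectors unless you pass stringsAsFactors = TRUE. If you do end up with
factors you don’t want, we’ve got several options here: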

Convert the factor to a character vector ourselves:

cats$coat <- as.character(cats$coat)

class(cats$coat)

## [1] "character"

Tell R to simply not convert things to factors when we import it (as.is=TRUE is the R equivalent of “don’t
touch my stuff!”):

new_cats <- read.csv('cats.csv', as.is=TRUE)

class(new_cats$coat)

## [1] "character"
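Whether text columns arrive as factors can also be controlled explicitly with read.csv()'s stringsAsFactors argument, which removes any dependence on the R version in use. The sketch below passes the CSV contents through the text argument instead of a file so it runs on its own; csv_text, as_factor and as_text are illustrative names.

```r
csv_text <- "coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1"

# Ask for factors explicitly (the default before R 4.0.0).
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_factor$coat)   # "factor"

# Ask for plain character columns (the default from R 4.0.0).
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_text$coat)     # "character"
```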

Use the read_csv() function from the readr package. readr is part of the tidyverse and has a number of
ways of reading/writing data with more sensible defaults.

library(tidyverse)

## Loading tidyverse: ggplot2

## Loading tidyverse: tibble

## Loading tidyverse: tidyr

## Loading tidyverse: readr

## Loading tidyverse: purrr

## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats

## lag(): dplyr, stats

even_newer_cats <- read_csv('cats.csv')

## Parsed with column specification:

## cols(

## coat = col_character(),

## weight = col_double(),

## likes_string = col_integer()

## )

class(even_newer_cats$coat)

## [1] "character"

Data analysis with dplyr
About the rest of this seminar
There are a million different ways to do things in R. This isn’t Python, where solutions on StackOverflow
get ranked on how “Pythonic” they are. If there’s something you like about another workflow in R, there’s
nothing stopping you from using it!

In this case, there are three main camps on analyzing dataframes in R:

• “Base R” - “Base R” means using only functions and stuff built into your base R installation. No
external packages or fancy stuff. The focus here is on stability from version to version - your
code will never break from an update, but performance and usability aren’t always as great.
• data.table - data.table is a dataframe manipulation package known to have very good
performance.
• “The tidyverse” - The “tidyverse” is a collection of packages that overhauls just about
everything in R to use a consistent API. Has comparable performance with data.table.

For much of the rest of this tutorial, we’ll focus on doing things the “tidyverse” way (with a few
exceptions). The biggest reason is that everything follows a consistent API - everything in the tidyverse
works well together. You can often guess how to use a new function because you’ve used others like it. It’s
also got pretty great performance. When you use stuff from the tidyverse, you can be reasonably confident
that someone has already taken a look at optimizing things to speed things along.

Logical indexing
So far, we’ve covered how to extract certain pieces of data via indexing. But what we’ve shown so far only
works if we know the exact index of the data we want (vector[42], for example). There is a neat trick to
extract certain pieces of data in R known as “logical indexing”.

Before we start, we need to know a little about comparing things.

== is the equality operator in R.

1 == 1

## [1] TRUE

! means “not”. Not TRUE is FALSE.

!TRUE

## [1] FALSE

Likewise we can check if something is not equal to something else with !=


TRUE != TRUE

## [1] FALSE

We can also make comparisons with the greater than > and less than < symbols. Pairing these with an
equals sign means “greater than or equal to” (>=) or “less than or equal to” (<=).

4 < 5

## [1] TRUE

5 <= 5

## [1] TRUE

9 > 999

## [1] FALSE

TRUE >= FALSE

## [1] TRUE

The last example worked because TRUE and FALSE are equal to 1 and 0, respectively.

TRUE == 1

## [1] TRUE

FALSE == 1

## [1] FALSE

We can even compare strings:

"a" == "a"

## [1] TRUE

"a" != "b"

## [1] TRUE

This trick also works with vectors, returning TRUE or FALSE for every element in the vector.

example <- 1:7

example >= 4

## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE

another_example <- c("apple", "banana", "banana")

another_example == "banana"

## [1] FALSE TRUE TRUE

This trick is extremely useful for getting specific elements. Watch what happens when we index a vector
using a set of boolean values. Using our example from above:

example

## [1] 1 2 3 4 5 6 7

greater_than_3 <- example > 3

greater_than_3

## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE

example[greater_than_3]

## [1] 4 5 6 7

This can be turned into a one-liner by putting the boolean expression inside the square brackets.

example[example > 3]

## [1] 4 5 6 7

We can also get the elements which were not greater than 3 by adding an ! in front.

example[!example > 3]

## [1] 1 2 3
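Because TRUE and FALSE behave like 1 and 0, a logical vector can also be counted and averaged directly. A short sketch with made-up numbers (delays is a hypothetical vector of delays in minutes):

```r
delays <- c(-2, 15, 0, 42, -7)

# TRUE wherever the value is greater than zero.
late <- delays > 0

sum(late)       # how many are late: 2
mean(late)      # what fraction are late: 0.4
delays[late]    # which values those were: 15 42
```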

Exercise - Removing NAs from a dataset


Logical indexing is also a pretty neat trick for removing NAs from a vector. Many functions will refuse to
work on data with NAs present. The is.na() function returns TRUE or FALSE depending on if a value is NA.

Using this info, make the following return a number as a result instead of NA.

ugly_data <- c(1, NA, 5, 7, NA, NA)

mean(ugly_data)

## [1] NA

Exercise - The `na.rm` argument


Many functions have an na.rm argument used to ignore NA values. Does this work for mean() in the
previous example?

Retrieving rows from dataframes
Let’s try this out on a bigger dataset. nycflights13 is an example dataset containing all outbound flights
from NYC in 2013. You can get this dataset with install.packages("nycflights13").

Let’s take a look at the dataset and see what we’ve got.

library(nycflights13)

head(flights) # shows the top few rows of a dataset

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## 3  2013     1     1      542            540         2      923
## 4  2013     1     1      544            545        -1     1004
## 5  2013     1     1      554            600        -6      812
## 6  2013     1     1      554            558        -4      740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

str(flights)

## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 19 variables:
##  $ year          : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr "UA" "UA" "AA" "B6" ...
##  $ flight        : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num 1400 1416 1089 1576 762 ...
##  $ hour          : num 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

dim(flights)

## [1] 336776 19

A note about tbl_dfs


flights is an example of a “tibble” or tbl_df. tbl_dfs are identical to dataframes for most purposes, but
they print out differently (notice how we didn’t get all of the columns!).

class(flights)

## [1] "tbl_df" "tbl" "data.frame"

To force a tbl_df to print all columns, you can use print(some_tbl_df, width=Inf)

If we ever get annoyed with a tbl_df, we can turn it back into a dataframe with as.data.frame().

class(as.data.frame(flights))

## [1] "data.frame"

The flights table clocks in at several hundred thousand rows. That’s a fair sized chunk of data.
Nevertheless, our tricks from before work just the same.

Using the same technique from before, let’s retrieve all of the flights that went to Los Angeles (LAX).

rows_with_lax <- flights$dest == "LAX"

flights[rows_with_lax, ]

## # A tibble: 16,174 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      558            600        -2      924
##  2  2013     1     1      628            630        -2     1016
##  3  2013     1     1      658            700        -2     1027
##  4  2013     1     1      702            700         2     1058
##  5  2013     1     1      743            730        13     1107
##  6  2013     1     1      828            823         5     1150
##  7  2013     1     1      829            830        -1     1152
##  8  2013     1     1      856            900        -4     1226
##  9  2013     1     1      859            900        -1     1223
## 10  2013     1     1      921            900        21     1237
## # ... with 16,164 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

# and the same, but in one line

result <- flights[flights$dest == "LAX", ]

# checking our work... we should only see "LAX" here

unique(result$dest)

## [1] "LAX"

# how many results did we get

nrow(result)

## [1] 16174

Breaking things apart, we look for all instances where the column dest was equal to “LAX”. We end up
with a vector of whether or not “LAX” was found in each row. We can then use the square brackets to
extract every row where the vector is true. Note the addition of a comma in our square
brackets. flights has 2 dimensions, so our indexing needs to as well!

If we don’t add the comma, R gets upset:

flights[flights$dest == "LAX"]

Error: Length of logical index vector must be 1 or 19 (the number of rows), not 336776

One other issue - what happens if we want to grab the flights to either LAX or SEA (Seattle)? Let’s try the
following:

result <- flights[flights$dest == c("LAX", "SEA"), ]

unique(result$dest)

## [1] "LAX" "SEA"

nrow(result)

## [1] 10060

Though in both cases we got results corresponding to the cities we wanted, it looks like something went
wrong. Before, we got 16174 results for just “LAX”. Now we only get 10060, and we even added an extra
city’s worth of flights! So what’s happening here?

When R compares two vectors of different length, it “recycles” the shorter vector until it matches the
length of the longer one!

Using a smaller example, this is what just happened:

long <- c(1, 1, 1, 2, 2, 2, 3)

short <- c(1, 2)

long == short

## Warning in long == short: longer object length is not a multiple of shorter

## object length

## [1] TRUE FALSE TRUE TRUE FALSE TRUE FALSE

# what R is really doing behind the scenes

short_recycled <- c(1, 2, 1, 2, 1, 2, 1)

long == short_recycled

## [1] TRUE FALSE TRUE TRUE FALSE TRUE FALSE
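The warning above only appears because 7 is not a multiple of 2. When the longer length divides evenly by the shorter one, as with the 336776 rows of flights and a two-element vector, R recycles without saying anything. A small sketch with illustrative names (values, targets):

```r
values <- c(1, 1, 2, 2)
targets <- c(1, 2)

# targets is silently recycled to 1, 2, 1, 2 before comparing.
values == targets               # TRUE FALSE FALSE TRUE

# No warning, because 4 is an exact multiple of 2 ...
length(values) %% length(targets)    # 0

# ... and so is the number of rows in flights.
336776 %% 2                     # 0
```

This is why the LAX/SEA comparison earlier returned a wrong answer without any warning.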

This is not what we want. We want to know if elements in the long vector were found “in” the shorter
vector, not whether or not the two are equal at every point. Fortunately, there is a special %in% operator
that does just that.

long %in% short

## [1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE

# and using that to subset values

long[long %in% short]

## [1] 1 1 1 2 2 2

If we take the %in% operator and apply it to our issue, we get the correct number of rows.

res <- flights[flights$dest %in% c("SEA", "LAX"), ]

nrow(res)

## [1] 20097

# our results contain the same number of flights bound for LAX

nrow(res[res$dest == "LAX", ])

## [1] 16174

Filtering rows with dplyr


Up to this point, we’ve done everything using base R. Our code has a lot of crazy symbols in it, and isn’t
that readable for the average person. It’s also not that fun to type out.

Let’s try things the “tidyverse” way using dplyr (dplyr is a package that comes as part of
the tidyverse package bundle).

To keep only the rows that match a condition, we use the filter() function. The syntax of this
function is a bit unusual:

library(tidyverse)

## Loading tidyverse: ggplot2

## Loading tidyverse: tibble

## Loading tidyverse: tidyr

## Loading tidyverse: readr

## Loading tidyverse: purrr

## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats

## lag(): dplyr, stats

results <- filter(flights, dest == "LAX")

nrow(results)

## [1] 16174

Notice how we just used dest all by itself. filter() is smart enough to figure out that dest is a column
name in the flights dataframe.

We can also filter on multiple things at once using the & (AND) and | (OR) operators. & checks if both
conditions are true, | checks if at least one condition is true:

TRUE & TRUE

## [1] TRUE

TRUE & FALSE

## [1] FALSE

TRUE | FALSE

## [1] TRUE

Using this in an example with filter() to fetch all the flights to LAX in February:

filter(flights, dest == "LAX" & month == 2)

## # A tibble: 1,030 x 19

## year month day dep_time sched_dep_time dep_delay arr_time


## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 2 1 554 601 -7 920
## 2 2013 2 1 654 700 -6 1032
## 3 2013 2 1 657 705 -8 1027
## 4 2013 2 1 658 700 -2 1018
## 5 2013 2 1 722 705 17 1040
## 6 2013 2 1 807 730 37 1134
## 7 2013 2 1 826 830 -4 1206
## 8 2013 2 1 857 900 -3 1225
## 9 2013 2 1 859 900 -1 1251
## 10 2013 2 1 901 905 -4 1230
## # ... with 1,020 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
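The | operator and %in% work inside filter() too. Here is a small sketch on a made-up table (toy_flights and its values are invented for illustration, they are not taken from nycflights13):

```r
library(dplyr)

# A tiny made-up table standing in for flights
toy_flights <- tibble(
  dest  = c("LAX", "SEA", "ORD", "LAX", "SEA"),
  month = c(2, 2, 3, 3, 1)
)

# OR: keep rows where at least one condition is TRUE
either <- filter(toy_flights, dest == "LAX" | month == 1)

# %in%: keep rows whose dest appears anywhere in the vector
west_coast <- filter(toy_flights, dest %in% c("LAX", "SEA"))

nrow(either)
nrow(west_coast)
```

Separating conditions with a comma, as in filter(toy_flights, dest == "LAX", month == 2), is treated the same as joining them with &.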

Exercise - Filtering data


Let’s do several more examples to make sure you’re super comfortable with filtering data:

• How many flights left before 6 AM?


• How many flights went to Toronto (YYZ)? Is there anything weird about this dataset?
• What is a typical flight time (air time) when traveling from New York to Chicago O’Hare
(ORD)?

Using the “pipe”


The tidyverse heavily encourages the use of a special pipe operator (%>%). The pipe sends the output of the
last command to the first argument of the next (this will probably be a familiar concept for users of bash, the
Linux shell). This is a great tool for making our analyses more readable (read: good).

Repeating an earlier example, we can retrieve the number of flights that went to LAX with:

# earlier example:

# nrow(filter(flights, dest == "LAX"))

flights %>% filter(dest == "LAX") %>% nrow

## [1] 16174

Our analysis now flows from left to right instead of inside out, which makes things quite a bit more readable.
Many people also put each step on a new line. That way if you want to exclude a step, you can just
comment it out.

flights %>%
  filter(dest == "LAX") %>%
  nrow()

## [1] 16174
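If you are running R 4.1 or newer, base R also ships its own pipe, |>, which behaves much the same way for simple chains like this one. A minimal sketch (the numbers are made up, and no packages are needed):

```r
# Base R's native pipe (requires R >= 4.1)
values <- c(1, 4, 9, 16)

# Read left to right: take square roots, then add them up
total <- values |> sqrt() |> sum()
total
```

One difference to keep in mind: the native pipe always needs the parentheses on the right-hand side, so values |> sum is an error even though values %>% sum works.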

Controlling output
dplyr also has its own function for selecting columns: select(). To grab certain columns from a
dataframe, we supply their names to select() as arguments.

flights %>% select(flight, dest, air_time)

## # A tibble: 336,776 x 3

## flight dest air_time


## <int> <chr> <dbl>
## 1 1545 IAH 227
## 2 1714 IAH 227
## 3 1141 MIA 160
## 4 725 BQN 183
## 5 461 ATL 116
## 6 1696 ORD 150
## 7 507 FLL 158
## 8 5708 IAD 53
## 9 79 MCO 140
## 10 301 ORD 138
## # ... with 336,766 more rows

We can also sort rows using arrange(). arrange() sorts a dataset by whatever column names you
specify.

flights %>% arrange(sched_dep_time)

## # A tibble: 336,776 x 19

## year month day dep_time sched_dep_time dep_delay arr_time


## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 27 NA 106 NA NA
## 2 2013 1 2 458 500 -2 703
## 3 2013 1 3 458 500 -2 650
## 4 2013 1 4 456 500 -4 631

## 5 2013 1 5 458 500 -2 640
## 6 2013 1 6 458 500 -2 718
## 7 2013 1 7 454 500 -6 637
## 8 2013 1 8 454 500 -6 625
## 9 2013 1 9 457 500 -3 647
## 10 2013 1 10 450 500 -10 634
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>

To sort in descending order, we can add the desc() function into the mix.

flights %>% arrange(desc(sched_dep_time))

## # A tibble: 336,776 x 19

## year month day dep_time sched_dep_time dep_delay arr_time


## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 2353 2359 -6 425
## 2 2013 1 1 2353 2359 -6 418
## 3 2013 1 1 2356 2359 -3 425
## 4 2013 1 2 42 2359 43 518
## 5 2013 1 2 2351 2359 -8 427
## 6 2013 1 2 2354 2359 -5 413
## 7 2013 1 3 32 2359 33 504
## 8 2013 1 3 235 2359 156 700
## 9 2013 1 3 2349 2359 -10 434
## 10 2013 1 4 25 2359 26 505
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
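arrange() also accepts several columns at once, using the later ones to break ties. A quick sketch on made-up data (toy_delays is an invented table, not part of nycflights13):

```r
library(dplyr)

# Made-up data: two carriers with a couple of delays each
toy_delays <- tibble(
  carrier   = c("UA", "AA", "UA", "AA"),
  arr_delay = c(5, 30, 12, -3)
)

# Sort by carrier (A to Z), then by delay from largest to smallest
sorted <- toy_delays %>% arrange(carrier, desc(arr_delay))
sorted
```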

Data analysis
So far we’ve learned how to rearrange and select parts of our data. What about actually analyzing it?
The group_by() and summarize() functions allow us to group by a certain column (say, city or airline)
and then perform an operation on every group.

A simple example might be grouping by month and then summarizing by the number of flights (rows) in
each group.

flights %>%
  group_by(month) %>%
  summarize(length(month)) # number of records in a group

## # A tibble: 12 x 2

## month `length(month)`
## <int> <int>
## 1 1 27004
## 2 2 24951
## 3 3 28834
## 4 4 28330
## 5 5 28796
## 6 6 28243
## 7 7 29425
## 8 8 29327
## 9 9 27574
## 10 10 28889
## 11 11 27268
## 12 12 28135

We can also perform multiple “summarizations” at once and name our columns something informative.

flights %>%
  group_by(month) %>%
  summarize(num_flights=length(month),
            avg_flight_time=mean(air_time, na.rm=TRUE))

## # A tibble: 12 x 3

## month num_flights avg_flight_time


## <int> <int> <dbl>
## 1 1 27004 154.1874
## 2 2 24951 151.3464
## 3 3 28834 149.0770
## 4 4 28330 153.1011
## 5 5 28796 145.7275
## 6 6 28243 150.3252
## 7 7 29425 146.7283
## 8 8 29327 148.1604
## 9 9 27574 143.4712
## 10 10 28889 148.8861
## 11 11 27268 155.4686
## 12 12 28135 162.5914
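Counting rows per group is common enough that dplyr has a dedicated helper for it, n(), which can stand in for length(month). A minimal sketch on a made-up table (the values in toy_flights are invented):

```r
library(dplyr)

# Made-up data: one missing air_time to show na.rm at work
toy_flights <- tibble(
  month    = c(1, 1, 2, 2, 2),
  air_time = c(100, NA, 200, 220, 240)
)

monthly <- toy_flights %>%
  group_by(month) %>%
  summarize(num_flights = n(),   # rows in each group
            avg_flight_time = mean(air_time, na.rm = TRUE))
monthly
```

Note that n() counts every row in the group, including the one where air_time is NA, while mean(..., na.rm = TRUE) skips it.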

We can also simply add on a column to a dataset with the mutate() function. This is the equivalent
of cats$age <- c(1, 3, 4) like we did earlier.

colnames(flights)

## [1] "year" "month" "day" "dep_time"


## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"

new_flights <- flights %>%
  mutate(plane_speed = distance / air_time)

colnames(new_flights)

## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour" "plane_speed"
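mutate() can add several columns in one call, and later columns may use ones created earlier in the same call. It also returns a new table rather than changing the original. A small sketch with made-up numbers (toy_flights, miles_per_min and mph are illustrative names):

```r
library(dplyr)

toy_flights <- tibble(
  distance = c(1400, 2475),   # miles
  air_time = c(200, 330)      # minutes
)

with_speed <- toy_flights %>%
  mutate(miles_per_min = distance / air_time,
         mph = miles_per_min * 60)   # uses the column created just above

colnames(toy_flights)   # the original table is untouched
colnames(with_speed)    # the result has the two new columns
```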

Exercise - Finding the worst airline


Which airline has the worst record in terms of delays?

To do this, group our data by carrier, get the average arrival delay for each group, then sort in descending
order so that the worst offenders are at the top.

Exercise - Picking an analysis method


Get the maximum arrival delay in the dataset. You’ll want to use the max() function. Did you need to
use dplyr?

Putting dataframes together


The flights table doesn’t contain everything we might want to know! What if we wanted to match up the
destination airport codes to their details (like the airports’ full names)? This data is actually in another
table: airports.

head(airports)

## # A tibble: 6 x 8

## faa name lat lon alt tz


## <chr> <chr> <dbl> <dbl> <int> <dbl>
## 1 04G Lansdowne Airport 41.13047 -80.61958 1044 -5
## 2 06A Moton Field Municipal Airport 32.46057 -85.68003 264 -6
## 3 06C Schaumburg Regional 41.98934 -88.10124 801 -6
## 4 06N Randall Airport 41.43191 -74.39156 523 -5

## 5 09J Jekyll Island Airport 31.07447 -81.42778 11 -5
## 6 0A9 Elizabethton Municipal Airport 36.37122 -82.17342 1593 -5
## # ... with 2 more variables: dst <chr>, tzone <chr>

In order for this information to be useful to us, we need to match it up and “join” it to our flights table.
This is a pretty complex operation in base R, but dplyr makes it relatively easy.

There are a lot of different types of joins that put together data in different ways. In this case, we’re going
to do what’s called a “left join”: one table is on the left side, and we’ll keep all of its rows. From the
right side (the table we are joining), an entry only gets matched up and added if there is a corresponding
entry on the left side. Left-side rows with no match on the right are kept, with NA in the new columns.

colnames(flights)

## [1] "year" "month" "day" "dep_time"


## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"

colnames(airports)

## [1] "faa" "name" "lat" "lon" "alt" "tz" "dst" "tzone"

# join syntax:

# left_join(left_table, right_table, by=c("left_colname" = "right_colname"))

# the "by" argument controls which columns in each table are matched up

joined <- left_join(flights, airports, by=c("dest" = "faa"))

colnames(joined) # joined now contains columns from both

## [1] "year" "month" "day" "dep_time"


## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour" "name"
## [21] "lat" "lon" "alt" "tz"
## [25] "dst" "tzone"

Let’s check our work. SEA should show up as Seattle-Tacoma International Airport. Note: inside a pipe, we
can use . as a placeholder to represent the entire object being passed along (instead of using just a column
name, for instance).

joined %>%
  filter(dest == "SEA") %>%
  select(name) %>%
  head(n=1)

## # A tibble: 1 x 1

## name

## <chr>

## 1 Seattle Tacoma Intl

Looks like our join worked!
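To see what a left join does with rows that don’t match, here is a small sketch on two made-up tables (toy_flights, toy_airports and the "XXX" code are invented for illustration):

```r
library(dplyr)

# Left table: one destination ("XXX") has no entry on the right
toy_flights <- tibble(dest = c("SEA", "LAX", "XXX"))

# Right table: one airport ("ORD") never appears on the left
toy_airports <- tibble(
  faa  = c("SEA", "LAX", "ORD"),
  name = c("Seattle Tacoma Intl", "Los Angeles Intl", "Chicago Ohare Intl")
)

toy_joined <- left_join(toy_flights, toy_airports, by = c("dest" = "faa"))
toy_joined
```

All three left-side rows survive. The "XXX" row gets NA for name, and the unmatched "ORD" airport is simply dropped.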


Exercise - Worst airline, part II
Find the name of the airline with the biggest arrival delays. You will need to join the airlines table to
the flights table. A suggested workflow is shown below (feel free to reuse code from earlier).

• Calculate the average arrival delays by airline.


• Sort the result by average delay in descending order.
• Find which columns match up between the airlines and flights tables. Remember, you can
use print(table_name, width=Inf) to show all columns!
• Join the airlines table to the flights table based upon their common column.
• The top value is your answer.

Exercise - Writing output


Write your results from the last problem to a file. Use write_csv() to write the table to a CSV file. You
can use ?write_csv to look up how to use this function.

Writing functions
Being able to group by and summarize data is great. But so far all we know how to do is use canned
functions - ones that come with base R or one of the packages we’ve covered. We’ll need to write our own
functions eventually.

Functions in R are defined almost the same as variables. The general syntax looks like this:

name <- function(reqd_arg, optional_arg=42) {
  # do stuff
  return(result)
}

Let’s create a function that adds two numbers together as an example.

adder <- function(num1, num2) {
  result <- num1 + num2
  return(result)
}

We can now use our function just like any other.

adder(5, 6)

## [1] 11

We also have the ability to specify optional arguments. Optional arguments are just ones that we’ve
given a default value. In this case, we’ll make our adder function add 10 if a second number is not
specified. Notice that we’ve also stopped saving the result variable and now do everything in one line.

adder <- function(num1, num2=10) {
  return(num1 + num2)
}

adder(5)

## [1] 15
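Arguments can also be passed by name, which is handy for overriding a default. A short sketch that re-defines the same adder function so it runs on its own:

```r
adder <- function(num1, num2 = 10) {
  return(num1 + num2)
}

# Override the default by naming the argument
by_name <- adder(5, num2 = 1)

# Named arguments can be given in any order
any_order <- adder(num2 = 1, num1 = 5)

by_name
any_order
```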

Exercise - Writing our own functions


Write a function that converts feet to meters. 1 foot equals 0.3048 meters.

Exercise - Applying our own functions


The airports table from nycflights13 is using feet for altitude instead of meters. Add a
column alt_meters to correct this mistake. You’ll need to use your function from the last exercise.

Conditional expressions
Sometimes we need to have functions do things differently if a certain condition is met. For this, we use
if/else statements. if executes a block of code if some condition was met.

number <- 5

if (number > 4) {
  print('number was greater than 4!')
}

## [1] "number was greater than 4!"

else statements are executed if the condition is not met.

number <- 3

if (number > 4) {
  print('number was greater than 4!')
} else {
  print('number was not greater than 4')
}

## [1] "number was not greater than 4"

We can add an else if statement to check a second condition.

number <- 3

if (number > 4) {
  print('number was greater than 4!')
} else if (number == 4) {
  print('number was equal to 4')
} else {
  print('number was not greater than 4')
}

## [1] "number was not greater than 4"

These if/else statements are especially useful when writing functions.
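As a rough sketch of what that can look like, here is a hypothetical helper (describe_delay and its 15-minute cutoff are made up for illustration) that turns a delay in minutes into a label:

```r
# Hypothetical helper: label a single delay value (in minutes)
describe_delay <- function(minutes) {
  if (minutes > 15) {
    return("late")
  } else if (minutes < 0) {
    return("early")
  } else {
    return("on time")
  }
}

describe_delay(42)
describe_delay(-5)
describe_delay(3)
```

Keep in mind that if only looks at a single TRUE/FALSE value. For a whole vector or column, the vectorized ifelse() is the usual tool.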

Running functions on stuff besides dataframes


The dplyr package is very cool, but what if we want to perform analyses on stuff besides dataframes (like
vectors and matrices!)? We’ll need to use the purrr package (also bundled with tidyverse).

Though we could use a for loop like this, there is a more concise way of doing things:

for (i in 1:10) {
  print(i)
}

## [1] 1

## [1] 2

## [1] 3

## [1] 4

## [1] 5

## [1] 6

## [1] 7

## [1] 8

## [1] 9

## [1] 10

purrr provides a set of functions to provide map/reduce-style functionality. “Map” means to apply a
function to every piece of a dataset. “Reduce” means to calculate some kind of summary statistic on a large
group of data.

Let’s demonstrate with several examples using map(). The function requires two arguments: something to
iterate over, and a function to apply to each piece.

library(tidyverse)

## Loading tidyverse: ggplot2

## Loading tidyverse: tibble

## Loading tidyverse: tidyr

## Loading tidyverse: readr

## Loading tidyverse: purrr

## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats

## lag(): dplyr, stats

# iterate over 1:10, apply sqrt() to each number

mapped <- map(1:10, sqrt)

mapped

## [[1]]

## [1] 1

##

## [[2]]

## [1] 1.414214

##

## [[3]]

## [1] 1.732051

##

## [[4]]

## [1] 2

##

## [[5]]

## [1] 2.236068

##

## [[6]]

## [1] 2.44949

##

## [[7]]

## [1] 2.645751

##

## [[8]]

## [1] 2.828427

##

## [[9]]

## [1] 3

##
## [[10]]

## [1] 3.162278

Notice how we get back a weird data structure as a result. map() returns a list by default. Lists are a special
data structure that can contain elements of any type or size.

example_list <- list(1, "a", c(5, 9, 10), TRUE)

One note on indexing lists: to extract individual elements we need to use two square brackets ([[]])
instead of one ([]). Using single brackets just returns a one-element list.

example_list[1]

## [[1]]

## [1] 1

example_list[[1]]

## [1] 1
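The same rule applies when list elements have names. A tiny sketch using a made-up list (pet and its contents are invented for illustration):

```r
# Elements of a list can be given names
pet <- list(name = "Tom", ages = c(1, 3, 4))

as_list <- pet["name"]      # single brackets: still a one-element list
as_value <- pet[["name"]]   # double brackets: the element itself
shorthand <- pet$name       # $ is shorthand for [[ ]] with a name

class(as_list)
class(as_value)
```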

Exercise - Retrieving nested values


Retrieve the 10 in example_list via indexing. You’ll need to use multiple sets of square brackets (and
index twice).

The advantage of using map() is that it can operate on lists as if they were vectors, which is normally hard to
do. In this case, we’ll retrieve what type of data is contained in each element. Just as with other tidyverse
functions, we can still use the pipe as well!

example_list %>% map(class)

## [[1]]

## [1] "numeric"

##

## [[2]]

## [1] "character"

##

## [[3]]

## [1] "numeric"

##

## [[4]]

## [1] "logical"

However, we usually don’t want a list back as output. There are a number of extra map functions that let us
specify what type of output we want.

# return a numerical vector

1:10 %>% map_dbl(sqrt)

## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751

## [8] 2.828427 3.000000 3.162278

# return a character vector

1:10 %>% map_chr(sqrt)

## [1] "1.000000" "1.414214" "1.732051" "2.000000" "2.236068" "2.449490"

## [7] "2.645751" "2.828427" "3.000000" "3.162278"

Note that purrr’s map functions will complain if you ask to get output where type coercion would result in
loss of data. This is a useful feature called type-safety.

1:10 %>% map_int(sqrt)

Error: Can't coerce element 1 from a double to a integer

Execution halted

That error prevents this kind of stuff from happening (without you knowing about it!).

as.integer(sqrt(7))

## [1] 2
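The typed variants work fine whenever the function really does return the type we asked for. A brief sketch, assuming only that purrr is installed (mixed, is_num and n_chars are made-up names):

```r
library(purrr)

mixed <- list(1, "a", c(5, 9, 10), TRUE)

# map_lgl() expects exactly one TRUE/FALSE per element
is_num <- map_lgl(mixed, is.numeric)
is_num

# map_int() is happy here because nchar() returns integers
n_chars <- map_int(c("LAX", "Seattle"), nchar)
n_chars
```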

Anonymous (lambda) functions


Sometimes it can be a bit of a pain to define a function only to use it once. In these scenarios, we can use
what’s called an anonymous function - we never give the function a name, using it immediately.

Here is an example. In this case we want to get the standard error of the mean (SEM), but it is not already
defined in R. We’ll use an anonymous function to define our SEM function in-line.

data <- list(1:4, 10:8, 50:60)

data %>% map(function(var) {
  return(sd(var) / sqrt(length(var)))
})

## [[1]]

## [1] 0.6454972

##

## [[2]]

## [1] 0.5773503

##

## [[3]]

## [1] 1

purrr also provides a shortcut to define an anonymous function. We can use ~ to replace
the function keyword, and .x to replace the variable name. Using this shortcut might look like this:

data %>% map(~sd(.x) / sqrt(length(.x)))

## [[1]]

## [1] 0.6454972

##

## [[2]]

## [1] 0.5773503

##

## [[3]]

## [1] 1
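From R 4.1 onward, base R has its own shorthand for anonymous functions as well: \(var) means the same thing as function(var). A sketch of the same SEM calculation, assuming R 4.1 or newer and using map_dbl() to get a plain numeric vector back (samples is a made-up name for the same data):

```r
library(purrr)

samples <- list(1:4, 10:8, 50:60)

# \(var) is base R shorthand for function(var), available from R 4.1
sems <- map_dbl(samples, \(var) sd(var) / sqrt(length(var)))
sems
```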

Exercise - Iterating through rows of a dataframe


The starwars dataset has a list of Star Wars characters. However, some of the columns are a little funny -
they are lists!

class(starwars$films)

## [1] "list"

Use map_int() and an anonymous function to determine how many films each character appeared in (from
the starwars$films column).

Operating on matrices
When it comes to working with rows/columns of matrices, the base R apply() function is still
superior to everything else.

mat <- matrix(1:50, nrow=5, ncol=10)

# operate on rows with margin=1

apply(mat, 1, sum)

## [1] 235 245 255 265 275

# operate on columns with margin=2

apply(mat, 2, sum)

## [1] 15 40 65 90 115 140 165 190 215 240
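For plain sums, base R also has the dedicated helpers rowSums() and colSums(). apply() is the tool to reach for when there is no such shortcut. A small sketch on the same matrix (row_ranges is a made-up example of a custom per-row calculation):

```r
mat <- matrix(1:50, nrow = 5, ncol = 10)

# Dedicated base R helpers for the most common case
row_totals <- rowSums(mat)
col_totals <- colSums(mat)

# apply() with an anonymous function: max minus min of each row
row_ranges <- apply(mat, 1, function(row) max(row) - min(row))
row_ranges
```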

Pretty plots with ggplot2
Though this is technically a course on high-performance computing, I would be doing newcomers a
disservice if we did not at least quickly cover how plotting works in R. We’ll also be profiling
some ggplot2 code in the next section as an example.

ggplot2 is a plotting framework that is (relatively) easy to use, powerful, AND it looks good.

library(ggplot2)

# Load the example data

data <- msleep

str(data)

## Classes 'tbl_df', 'tbl' and 'data.frame': 83 obs. of 11 variables:


## $ name : chr "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
## $ genus : chr "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
## $ vore : chr "carni" "omni" "herbi" "omni" ...
## $ order : chr "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
## $ conservation: chr "lc" NA "nt" "lc" ...
## $ sleep_total : num 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
## $ sleep_rem : num NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
## $ sleep_cycle : num NA NA NA 0.133 0.667 ...
## $ awake : num 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
## $ brainwt : num NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
## $ bodywt : num 50 0.48 1.35 0.019 600 ...

It’s sleep data of some kind. Anyhow, let’s start. ggplot2 revolves around a certain kind of variable: the
ggplot2 object. What is a ggplot2 object? Basically it is your data + information on how to interpret it + the
actual geometry it uses to plot it.

How to create ggplot2 objects:

You can add as much data in the initial function call as you want. All of these work, but the final version is
the only “complete” object that fully specifies the data used for the plot.

ref <- ggplot()

ref <- ggplot(data)

ref <- ggplot(data, aes(x = bodywt, y = sleep_total))

# This gives the same plot as the line above, but writing data$ inside aes() is discouraged:

ref <- ggplot(data, aes(x = data$bodywt, y = data$sleep_total))

To store an object (to add to it later/plot it on demand), just give it a reference. Simply typing the reference
will display the plot (if you’ve provided enough information to make it.)

ref

As you can see, we haven’t specified everything we need yet. There are 3 components to making a plot
with a ggplot object: your data, the aesthetic mappings of your data, and the geometry. If you are missing
one, you won’t get a functional plot.

Your data should be a dataframe with everything you want to plot. Note that it is possible to put data from
multiple sources (i.e. different dataframes) in the same plot, but it’s easier if everything is in the same
2-dimensional dataframe.

ref <- ggplot(data)

The aesthetic mappings tell ggplot2 how to interpret your data. Which values in your dataframe are the y-
values, x-values, what should be used for colors, etc.

ref <- ggplot(data, aes(x = bodywt, y = sleep_total))

The geometry is the actual stuff that goes on the plot. You can specify any geometry as long as you have
supplied the values it needs. If you’ve specified the required aesthetic mappings (which data corresponds to
x, y, etc.), all you need to do is tell ggplot2 to create a certain geometry, for instance a scatterplot.

Just add the geometry you want to your object. In this case, we are making a scatterplot.

ref <- ggplot(data, aes(x = bodywt, y = sleep_total)) + geom_point()

ref

All you need to do to add more information to your plot/change things is add on more elements. Let’s add a
logarithmic scale on the x axis.

ref <- ggplot(data, aes(x = bodywt, y = sleep_total)) + geom_point() + scale_x_log10()

ref

Let’s add a smoothed mean.

ref + geom_smooth()

## `geom_smooth()` using method = 'loess'

You can also specify aesthetics inside the call to create geometry.

ggplot(data) + geom_point(aes(x = bodywt, y = sleep_total)) + scale_x_log10() + geom_smooth()

Error: stat_smooth requires the following missing aesthetics: x, y

Why didn’t that work? When we specify aesthetics inside a call to a geometry, they only apply
to that layer (only geom_point got the x and y values). The only information that gets passed to all
geometry calls is the aesthetics specified in the initial creation of the ggplot object.

So if we wanted that to work, we’d have to do this:

ggplot(data) + scale_x_log10() +
  geom_point(aes(x = bodywt, y = sleep_total)) +
  geom_smooth(aes(x = bodywt, y = sleep_total))

## `geom_smooth()` using method = 'loess'

It’s important to note that geometry will automatically use any aesthetic mappings that it understands, and
ignore ones it doesn’t. So if you specify as much as you can in the initial call, it’ll save
you work.

Like this:

ggplot(data, aes(x = bodywt, y = sleep_total)) + scale_x_log10() + geom_point() + geom_smooth()

## `geom_smooth()` using method = 'loess'
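The two approaches can be mixed: shared mappings go in ggplot(), and a single layer can add its own on top. A minimal sketch using the same msleep data (layered is just an illustrative name):

```r
library(ggplot2)

# x and y are shared by every layer; color is mapped for the points only
layered <- ggplot(msleep, aes(x = bodywt, y = sleep_total)) +
  scale_x_log10() +
  geom_point(aes(color = vore)) +   # inherits x and y, adds color
  geom_smooth()                     # inherits x and y only: one trend line

layered
```

Had we put color = vore in the initial ggplot() call instead, geom_smooth() would have picked it up too and drawn a separate trend line for each group.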

Exercise - Creating a plot of our own


Make a scatterplot of conservation status vs. time spent awake.

Hint: conservation status is data$conservation and time awake is data$awake. To make a scatterplot,
use geom_point().

Let’s follow up with a few very common plot/geometry types and mappings you might be interested in:

These x and y mappings (and the log scale on the x axis) will be used for all later plots.

plot <- ggplot(data, aes(x = bodywt, y = sleep_total)) + scale_x_log10()

plot + geom_point()

First, let’s add color based on what things eat. Note that it automatically adds a legend.

plot + geom_point(aes(color = vore))

We used a categorical variable there, but we can use a continuous variable for color as well.

plot + geom_point(aes(color = log(brainwt)))

We can also swap in a different color scale, which changes the colors used in the plot and its legend.

plot + geom_point(aes(color = log(brainwt))) + scale_color_gradient2()

We can also set the limits of a scale:

plot + geom_point() + scale_y_continuous(limits = c(5, 15))

## Warning: Removed 23 rows containing missing values (geom_point).

Changing size and alpha works the same way:

plot + geom_point(aes(size = sleep_rem, alpha = sleep_rem)) +
  xlab("this is our x axis") + ylab("this is our y axis") + ggtitle("title") + scale_alpha("our legend")

## Warning: Removed 22 rows containing missing values (geom_point).

If we want to simply change a plot value like marker shape or size without mapping it to data, just specify
it outside the call to aesthetics.

plot + geom_point(aes(shape = vore), size = 6, color = "orange")

Let’s facet our data by a factor:

plot + geom_point() + facet_wrap(~vore)

Exercise - Another plot example
How would I make a scatterplot of conservation status (data$conservation) vs time awake (data$awake),
with the color mapped to vore (data$vore) and the size mapped to the log of brain weight
(log(data$brainwt))? Bonus points if you add axis labels and a title.

Other types of plots


All other types of plots work identically to the scatterplot - let’s see a few examples…

Boxplot
Note that stats are automatically performed.

ggplot(data, aes(x = vore, y = sleep_total, fill = vore)) + geom_boxplot()

Line plot with different groups
ggplot(data, aes(x = bodywt, y = brainwt, group = vore, color = vore)) +

geom_line() + scale_x_log10() + scale_y_log10()

## Warning: Removed 8 rows containing missing values (geom_path).

1D density
ggplot(data, aes(x = sleep_total, fill = vore)) + geom_density(alpha = 0.5)

Violin plot
ggplot(data, aes(x = vore, y = sleep_total)) + geom_violin()

Bar plot
ggplot(data, aes(x = vore)) + geom_bar()

Note that it is automatically counting the number of rows for each value of “vore”. To get a bar plot to simply
plot the values you feed it, use geom_bar(stat = "identity").
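As a sketch of that second case, here is a bar plot built from a made-up table of counts (diet_counts and its numbers are invented, not computed from msleep). It uses geom_col(), which ggplot2 provides as shorthand for geom_bar(stat = "identity"):

```r
library(ggplot2)

# Made-up, already-summarized counts
diet_counts <- data.frame(
  vore  = c("carni", "herbi", "omni"),
  count = c(3, 7, 5)
)

# geom_col() takes bar heights straight from the y values
bars <- ggplot(diet_counts, aes(x = vore, y = count)) + geom_col()
bars
```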

Exercise - making other types of plots


Make a box plot of the amount of sleep (data$sleep_total) per conservation status (data$conservation).
Fill in the box plot colors by conservation status (data$conservation).

Hint: you will need the following aesthetics: x, y, and fill. The geometry you want to use
is geom_boxplot()

Reports with R Markdown
Making plots and being able to write output files is important. However, the most important part of any
analysis is communicating your findings effectively! The rmarkdown package is one of the best and easiest
ways of doing this. It lets you create a fancy-looking report without ever leaving RStudio. You can
interleave your code and plots together with explanation text and formatting you might normally do in a
tool like Microsoft Word. You can even make websites with it (like this one!)!

Let’s get started with R Markdown and make ourselves a sweet report.

Creating an R Markdown notebook


In RStudio, go to File -> New File -> R Markdown. It will ask you what you want to call the file - leave the
output as HTML for now.

You should see a file open with a lot of text and code already there. This is the default R Markdown
document and comes with a lot of pointers about how to do things. For now, save the file as example.Rmd,
and hit the “Knit” button.

If everything went well, you should see a nicely formatted webpage/report pop up in RStudio’s built-in
browser. Congratulations, you’ve made your first report!

Basic formatting
Now that we know everything’s working, let’s start from scratch and learn how R Markdown works step-
by-step. The first thing we should do is delete everything except for the following:

---

title: "test"

author: "Jeff Stafford"

date: "August 20, 2017"

output: html_document

---

The stuff between the ---s is called YAML - it’s a markup format that describes what kind of document to
make, who wrote it, and other technical bits that aren’t actual content in your report. If you want, you can
modify it, but let’s leave it be for now.

At the bottom of the document, let’s start typing some text. Everything that follows is example markdown
output and syntax:
Normal characters result in normal text.

*this makes italics*

**big bad bold text**

***italic AND bold***

# A header

## A smaller header

### An even smaller header


Horizontal lines can be made with lots of dashes. Like this:

----------------------

To start a list, we can use asterisks:

* Item one
* Item two
* Item three
    * indenting four spaces makes a nested list

We can make a numbered list by just typing 1. 2. 3.

1. first
2. second
3. third

To make a block of code-y looking text, we use backticks:

`some code`

```

Three backticks makes

a block of text

look the same way.

```

A hyperlink looks like [link text](www.some-website.ca)

Images are just a link with an ! in front:

![image text](www.some-image-site/image.png)

Some notes on formatting


Text on subsequent lines gets treated as the same line. To have text appear in a separate line/paragraph, you
need a blank line in between.

Including code, data, and plots


To include code in a report, we just change the “code” styling slightly. Add a {r} after the triple backticks
to have that block of text get treated as R code. The top of the code header should look like this: ```{r}

In practice, it looks something like this:

5 + 6

## [1] 11

print("The code output gets put directly below")

## [1] "The code output gets put directly below"

If you want to display data, you have a couple of options. The first is just to print out data the way you
normally would.

library(gapminder)

head(gapminder)

## country continent year lifeExp pop gdpPercap


## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134

However, there is a special function knitr::kable() that inserts a nice-looking table into your report.
Depending on what your output format is, it sometimes is even interactive!

knitr::kable(head(gapminder))

country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453
Afghanistan Asia 1957 30.332 9240934 820.8530
Afghanistan Asia 1962 31.997 10267083 853.1007
Afghanistan Asia 1967 34.020 11537966 836.1971
Afghanistan Asia 1972 36.088 13079460 739.9811
Afghanistan Asia 1977 38.438 14880372 786.1134

You can put code inline in text by typing `r some code`, with the r and the code together inside one pair of backticks.

Plots are done the same way as code:

library(ggplot2)

ggplot(gapminder, aes(x=year, y=lifeExp, group=year)) +

geom_boxplot()

Exercise - Writing your own report


Using any dataset you want, find something interesting and write a report on it in R Markdown
(nycflights13 is a good starting point).

