02 Tidyverse
Tyson S. Barrett
Summer 2017
Utah State University
Introduction
Tidy Methods
A Walk-Through
Conclusions
Introduction
The Newest and Brightest
Tidyverse
• To manipulate your data in the cleanest, most up-to-date
manner, we are going to use the “tidyverse” group of methods.
• The tidyverse is a group of packages that provide a simple
syntax for many basic (and complex) data manipulations.
• The group of packages can be installed and loaded via:
install.packages("tidyverse")
library(tidyverse)
Tidyverse
library(awesome)
library(amazing)
Conflicts
awesome::make_really_great(arg)
That’s a bit of an aside, but know that you can always get at a
function even if it is “masked” from your current session.
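As a minimal sketch of masking, here is the same idea with two real packages rather than what appear to be placeholder names above. Loading dplyr masks filter() from the stats package, yet both versions stay reachable with the package::function form. The vector x and the data frame toy_df are made up for illustration.

```r
library(dplyr)  # its filter() masks stats::filter() for the session

# Made-up inputs for illustration
x <- c(1, 2, 3, 4, 5)
toy_df <- data.frame(var1 = c(1, 2, 1))

# The masked stats version is still available: a 3-point moving average
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# Being explicit about the dplyr version also works: keep rows where var1 is 1
kept <- dplyr::filter(toy_df, var1 == 1)
```

Writing the package name explicitly is also a handy way to make code unambiguous when two loaded packages share a function name.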
Tidy Methods
The Tidy Data Way
Methods for Tidying
1. Piping
2. Selecting and Filtering
3. Grouping and Summarizing
4. Reshaping
5. Joining (merging)
To help illustrate each aspect, we are going to use real data from
the National Health and Nutrition Examination Survey (NHANES).
I’ve provided this data at
https://tysonstanley.github.io/assets/Data/NHANES.zip. I’ve
cleaned it up somewhat already.
A Walk-Through
Example: NHANES
Import
First, we will set our working directory with setwd. This tells R
where to look for files, including your data files.
setwd("~/Dropbox/GitHub/blog_rstats/assets/Data/")
library(foreign)
dem_df <- read.xport("NHANES_demographics_11.xpt")
med_df <- read.xport("NHANES_MedHeath_11.xpt")
men_df <- read.xport("NHANES_MentHealth_11.xpt")
act_df <- read.xport("NHANES_PhysActivity_11.xpt")
A line such as names(dem_df) <- tolower(names(dem_df)) takes the
names of the data frame (on the right hand side), changes them to
lower case, and then reassigns them to the names of the data frame.
Note: these are not particularly helpful names, but they are the names
provided in the original data source. If you have questions about the data, visit
http://wwwn.cdc.gov/Nchs/Nhanes/Search/Nhanes11_12.aspx.
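A minimal sketch of the renaming step described above, using a small made-up data frame (toy_df and its values are hypothetical) so it runs without the NHANES files:

```r
# Toy stand-in for an imported file with upper-case column names
toy_df <- data.frame(SEQN = c(1, 2, 3), CITIZEN = c(1, 2, 1))

# Right hand side: get the names and lower-case them.
# Left hand side: assign the result back as the names of the data frame.
names(toy_df) <- tolower(names(toy_df))

names(toy_df)
```

The same line can be repeated for each imported data frame (dem_df, med_df, men_df and act_df).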
Piping
%>% is the pipe “operator”. It takes what is on the left hand side
and puts it in the right hand side’s function.
So code such as dem_df %>% summary() takes the data frame dem_df
and puts it into the summary function. This does the same thing as
summary(dem_df).
In this simple case, it doesn’t really make the code more readable,
but in more complex situations it can really help.
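To make the pipe concrete, here is a small sketch on a made-up data frame (toy_df and its columns are hypothetical). It shows the piped and the nested form of the same call, plus a short chain where the pipe starts to pay off in readability.

```r
library(dplyr)  # makes the %>% operator available

# Made-up data for illustration
toy_df <- data.frame(age = c(23, 35, 47), citizen = c(1, 1, 2))

# These two lines do the same thing
piped  <- toy_df %>% summary()
nested <- summary(toy_df)

# A chain reads left to right: take the ages, average them, round the result
mean_age <- toy_df$age %>% mean() %>% round(1)
```

Without the pipe, the last line would be written inside out as round(mean(toy_df$age), 1).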
1. Selecting Variables
2. Filtering Rows
The following slides show the base R way and the tidyverse way of
subsetting.
Selecting Variables
Here both do the same thing. The first, using [, is the “base R”
way of selecting variables. The second, using the pipe, is the
tidyverse way. Both work great so the choice is yours.
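The two approaches might look like the following sketch. The data frame df and the columns var1, var2 and var3 are made up for illustration.

```r
library(dplyr)

# Made-up data for illustration
df <- data.frame(var1 = c(1, 2, 1),
                 var2 = c("a", "b", "c"),
                 var3 = c(TRUE, FALSE, TRUE))

# Base R way: index the columns by name with [
base_way <- df[, c("var1", "var2")]

# Tidyverse way: pipe the data frame into select()
tidy_way <- df %>%
  select(var1, var2)
```

Both results keep only var1 and var2 and drop var3.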
Filtering Rows
df[df$var1 == 1, ]
df %>%
  filter(var1 == 1)
Again, both do the same thing. The first, using [, is the “base R”
way of filtering rows so that you only keep the ones where “var1” in
df is equal to 1. The second is the tidyverse way. Use whichever
you like. (One difference: if var1 has missing values, the base R
version returns those rows filled with NA, while filter() drops them.)
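Here is a runnable sketch of both filters on made-up data. The toy data frame deliberately includes one missing value in var1, since that is the one case where the two approaches give different results.

```r
library(dplyr)

# Made-up data with one missing value in var1
df <- data.frame(var1 = c(1, 2, NA, 1),
                 var2 = c("a", "b", "c", "d"))

# Base R: the NA comparison yields an extra row that is all NA (3 rows)
base_way <- df[df$var1 == 1, ]

# Tidyverse: filter() drops rows where the condition is NA (2 rows)
tidy_way <- df %>%
  filter(var1 == 1)

# Base R with which() also drops the NA row, matching filter()
base_no_na <- df[which(df$var1 == 1), ]
```

When the filtering variable has no missing values, all three versions return the same rows.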
1. Data
2. Group by
3. Summarize
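The three steps above can be sketched as one piped chain. This is an illustration on made-up data (toy_df is hypothetical), assuming the counts come from n() inside summarize().

```r
library(dplyr)

# 1. Data: a made-up factor with a placeholder code (7) and a missing value
toy_df <- data.frame(citizen = factor(c(1, 1, 2, 1, 7, NA)))

counts <- toy_df %>%
  group_by(citizen) %>%    # 2. Group by the variable of interest
  summarize(N = n())       # 3. Summarize: count the rows in each group

counts
```

The result has one row per level of citizen, including a row for NA, with the group size in the N column.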
# A tibble: 4 × 2
  citizen     N
   <fctr> <int>
1       1  8685
2       2  1040
3       7    26
4      NA     5
• The first column is the grouping variable and the second is the
N (number of individuals) by group.
• We can quickly see that there are four levels, currently, to the
citizen variable.
• After some reading of the documentation we see that 1 =
Citizen and 2 = Not a Citizen.
• A value of 7, it turns out, is a placeholder value for missing.
• And finally we have an NA category.
• It’s unlikely that we want those to be included in any analyses,
unless we are particularly interested in the missingness on this
variable.
• So let’s do some simple cleaning to get this where we want it.
To do this, we will use the furniture package.
install.packages("furniture")
library(furniture)
## Changes all 7's to NA's
dem_df$citizen <- washer(dem_df$citizen, 7)
## Changes all 2's to 0's
dem_df$citizen <- washer(dem_df$citizen, 2, value=0)
# A tibble: 3 × 2
  citizen     N
    <chr> <int>
1       0  1040
2       1  8685
3    <NA>    31

# A tibble: 3 × 3
  citizen     N      Age
    <chr> <int>    <dbl>
1       0  1040 37.31635
2       1  8685 30.66252
3    <NA>    31 40.35484
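A table like the second one above, with a count and a mean age per group, can be sketched by putting two summaries in one summarize() call. The data and the column name age are made up here; the actual NHANES age variable is not shown in this section.

```r
library(dplyr)

# Made-up data; `age` is a hypothetical column name
toy_df <- data.frame(citizen = c("1", "1", "0", "0", NA),
                     age     = c(30, 40, 20, 50, 60))

by_citizen <- toy_df %>%
  group_by(citizen) %>%
  summarize(N   = n(),                       # group size
            Age = mean(age, na.rm = TRUE))   # mean age, ignoring missing ages

by_citizen
```

Each extra argument to summarize() adds one more summary column to the output.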
Reshaping
This is a big part of working with data. Unfortunately, it is also a
difficult topic to understand without much practice. In general,
two data formats exist:
1. Wide form
2. Long form
Wide Form
Wide form generally has one unit (e.g., an individual) per row. It
typically looks like:
   ID    Var_Time1  Var_Time2
1   1  1.138688557 0.67206981
2   2 -0.926541315 0.30853689
3   3 -0.007108554 0.55613005
4   4  0.533288410 0.23545637
5   5 -0.909166260 0.01326606
6   6  1.396866039 0.73015902
7   7  1.748336183 0.66249056
8   8  0.100194424 0.36643398
9   9  0.511294922 0.08342045
10 10 -0.585448865 0.56180077
Long Form
In contrast, long form has the lowest nested unit as a single row.
This means that a single ID can span multiple rows, usually with a
unique time point for each row, like so:
  ID Time       Var
1  1    1 0.4722128
2  1    2 0.1303989
3  1    3 0.7835221
4  1    4 0.4007190
5  2    1 0.1882725
6  2    2 0.8000024
7  3    1 0.7557883
8  3    2 0.1840514
9  3    3 0.9533038
Quick Sidetrack from NHANES: Reshaping
Wide to Long
With a fake data set, we’ll go from wide to long.
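A minimal sketch of the wide-to-long step, assuming gather() from the tidyr package is the function used (the original code is not shown here). The data frame wide and its values are made up.

```r
library(tidyr)
library(dplyr)

# Made-up wide data: one row per ID, one column per time point
wide <- data.frame(ID        = 1:3,
                   Var_Time1 = c(1.1, -0.9, 0.0),
                   Var_Time2 = c(0.7, 0.3, 0.6))

# Stack the two Var_Time columns: their names go into `Time`,
# their values go into `Var`
long <- wide %>%
  gather(key = "Time", value = "Var", Var_Time1, Var_Time2)

long
```

The result has two rows per ID, one for each time point. Newer versions of tidyr also offer pivot_longer() for the same job.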
Long to Wide
Now we will go from long to wide using spread() from the same
package (tidyr).
Here, we provided the name of the column that held the value labels
(Time) and the name of the column that contained the values
themselves (Var).
With a little bit of code we can move data around without any
error-prone copy-pasting.
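The long-to-wide step might look like this sketch, with Time as the key column and Var as the value column. The data frame long and its values are made up for illustration.

```r
library(tidyr)
library(dplyr)

# Made-up long data: one row per ID and time point
long <- data.frame(ID   = c(1, 1, 2, 2, 3, 3),
                   Time = rep(c("Time1", "Time2"), times = 3),
                   Var  = c(0.47, 0.13, 0.19, 0.80, 0.76, 0.18))

# Each distinct value of `Time` becomes a column, filled with the
# matching values of `Var`
wide <- long %>%
  spread(key = Time, value = Var)

wide
```

The result is back to one row per ID, with columns ID, Time1 and Time2.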
Joining (merging)
The final topic in the chapter is joining data sets.
We currently have 4 data sets that have mostly the same people in
them but with different variables. One tells us about the
demographics; another gives us information on mental health. We
may have questions that ask whether a demographic characteristic
is related to a mental health factor. This means we need to merge,
or join, our data sets.
Note: this is different from adding new rows but not new variables.
Merging requires that we have at least some overlap of individuals in both data
sets.
When we merge data sets, we combine them based on some ID
variable(s). Here, this is simple since each individual is given a
unique identifier in the variable seqn. Within the dplyr package
there are four main joining functions: inner_join, left_join,
right_join and full_join. Each join combines the data in
slightly different ways.
Let’s first load dplyr:
library(dplyr)
Inner Join
Here, only those individuals that are in both data sets that you are
combining will remain. So if person “A” is in data set 1 and not in
data set 2 then he/she will not be included.
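A small sketch of an inner join on two made-up data frames (df1, df2 and their columns are hypothetical), joined by an ID column named seqn as in the NHANES data:

```r
library(dplyr)

# Made-up data: IDs 2 and 3 appear in both data frames
df1 <- data.frame(seqn = c(1, 2, 3), age   = c(30, 40, 50))
df2 <- data.frame(seqn = c(2, 3, 4), score = c(5, 7, 9))

# Keep only the individuals found in both df1 and df2
inner <- inner_join(df1, df2, by = "seqn")

inner
```

IDs 1 and 4 are dropped because each appears in only one of the two data frames.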
Left or Right Join
This is similar to inner join but now if the individual is in data set 1
then left_join will keep them even if they aren’t in data set 2.
right_join means if they are in data set 2 then they will be kept
whether or not they are in data set 1.
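The difference between the two can be sketched on made-up data (df1, df2 and their columns are hypothetical):

```r
library(dplyr)

# Made-up data: ID 1 is only in df1, ID 4 is only in df2
df1 <- data.frame(seqn = c(1, 2, 3), age   = c(30, 40, 50))
df2 <- data.frame(seqn = c(2, 3, 4), score = c(5, 7, 9))

# Keep everyone in df1; ID 1 gets NA for score
left <- left_join(df1, df2, by = "seqn")

# Keep everyone in df2; ID 4 gets NA for age
right <- right_join(df1, df2, by = "seqn")
```

Swapping the order of the data frames turns one into the other: right_join(df1, df2) keeps the same individuals as left_join(df2, df1).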
Full Join
This one simply keeps all individuals that are in either data set 1 or
data set 2.
Each of the left, right and full joins will have missing values placed
in the variables where that individual wasn’t found. For example, if
person “A” was not in df2, then in a full join they would have
missing values in the df2 variables.
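A sketch of a full join on made-up data (df1, df2 and their columns are hypothetical), showing where the missing values end up:

```r
library(dplyr)

# Made-up data: ID 1 is only in df1, ID 4 is only in df2
df1 <- data.frame(seqn = c(1, 2, 3), age   = c(30, 40, 50))
df2 <- data.frame(seqn = c(2, 3, 4), score = c(5, 7, 9))

# Keep everyone who appears in either data frame
full <- full_join(df1, df2, by = "seqn")

full
```

ID 1 was not in df2, so its score (a df2 variable) is NA. ID 4 was not in df1, so its age (a df1 variable) is NA.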
For our NHANES example, we will use full_join to get all the
data sets together. Note that in the code below we do all the
joining in the same overall step.
So now df is the joined data set of all four. We started with
dem_df, joined it with med_df by seqn, then joined that joined data
set with men_df by seqn, and so on.
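The chained join described above might look like the following sketch. To keep it runnable without the NHANES files, it uses four tiny made-up stand-ins (dem_toy, med_toy, men_toy, act_toy, with hypothetical columns) in place of dem_df, med_df, men_df and act_df.

```r
library(dplyr)

# Made-up stand-ins for the four NHANES data frames
dem_toy <- data.frame(seqn = c(1, 2, 3), age     = c(30, 40, 50))
med_toy <- data.frame(seqn = c(1, 2),    asthma  = c(0, 1))
men_toy <- data.frame(seqn = c(2, 3, 4), dep     = c(3, 0, 2))
act_toy <- data.frame(seqn = c(1, 4),    minutes = c(20, 45))

# Each full_join() adds one more data set to the running result
df <- dem_toy %>%
  full_join(med_toy, by = "seqn") %>%
  full_join(men_toy, by = "seqn") %>%
  full_join(act_toy, by = "seqn")

df
```

With the real data, the same chain would start from dem_df and join med_df, men_df and act_df in turn.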
Conclusions
In This Chapter:
For analyses in the later chapters, we will use the new df object
that we ended with, which contains the NHANES data.
Also, you’ll see that many of these methods apply to more than just
manipulating data. As you learn one method, you’ll begin to see
how easily you can use it in other situations.