Lecture 9: Data Wrangling With Dplyr: Kevin Lee
Kevin Lee
Department of Statistics
Western Michigan University
Happy families are all alike; every unhappy family is unhappy in its
own way.
– Leo Tolstoy
Tidy datasets are all alike, but every messy dataset is messy in its
own way.
– Hadley Wickham
Five main dplyr functions allow you to solve the majority of your
data-manipulation challenges:
filter(), pick observations by their values
arrange(), reorder the rows
select(), pick variables by their names
mutate(), create new variables with functions of existing variables
summarize(), collapse many values down to a single summary
To use filtering effectively, you have to know how to select the observations
that you want using the comparison operators and logical operators in R.
Comparison operators in R:
< # less than
> # greater than
== # equal to
<= # less than or equal to
>= # greater than or equal to
!= # not equal to
Logical operators in R:
& # logical “and”
| # logical “or”
! # logical “not”
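The operators above can be combined inside filter(). The following is a minimal sketch, assuming the dplyr package is installed; the flights_toy table and its columns are invented for illustration and are not part of the lecture data.

```r
library(dplyr)

# Hypothetical toy data, made up for this example
flights_toy <- tibble(
  month     = c(1, 1, 2, 12, 12),
  day       = c(1, 15, 3, 24, 25),
  dep_delay = c(5, 20, 130, -2, 45)
)

# Comparison operators with logical "and":
# rows where month equals 1 and day is at most 10
jan_early <- filter(flights_toy, month == 1 & day <= 10)

# Logical "or": rows from January or December
jan_or_dec <- filter(flights_toy, month == 1 | month == 12)

# Logical "not": rows whose delay is not greater than 60
not_long <- filter(flights_toy, !(dep_delay > 60))
```

Each call returns a new table containing only the rows for which the condition is TRUE; the original flights_toy is left unchanged.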
Kevin Lee (WMU) Lecture 9 (9/30/2019) September 30, 2019 6 / 12
arrange()
If you provide more than one column name, each additional column is
used to break ties in the values of the preceding columns.
Use desc() to reorder by a column in descending order.
Missing values are always sorted at the end, even when desc() is used.
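The three points about arrange() can be sketched as follows, assuming dplyr is installed; the scores table is a hypothetical example, not data from the lecture.

```r
library(dplyr)

# Hypothetical toy data with one missing value
scores <- tibble(
  team   = c("b", "a", "a", "b", "c"),
  points = c(10, 7, NA, 3, 7)
)

# Sort by team; rows with the same team are then ordered by points
by_team <- arrange(scores, team, points)

# desc() sorts points from largest to smallest
by_points_desc <- arrange(scores, desc(points))

# Ascending sort: the NA row still ends up last
by_points_asc <- arrange(scores, points)
```

In both by_points_desc and by_points_asc the row with the missing points value is the last row.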
Below are some helper functions you can use within select():
starts_with("abc") matches names that begin with "abc".
ends_with("xyz") matches names that end with "xyz".
num_range("x", 1:3) matches x1, x2, and x3.
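A short sketch of these helpers in use, assuming dplyr is installed; the survey table and its column names are invented so that each helper has something to match.

```r
library(dplyr)

# Hypothetical toy data; column names chosen to exercise each helper
survey <- tibble(
  abc_id    = 1:2,
  score_xyz = c(0.5, 0.7),
  x1 = c(1, 2),
  x2 = c(3, 4),
  x3 = c(5, 6),
  other = c("u", "v")
)

# Columns whose names begin with "abc"
begins <- select(survey, starts_with("abc"))

# Columns whose names end with "xyz"
ends <- select(survey, ends_with("xyz"))

# Columns x1, x2, and x3
numbered <- select(survey, num_range("x", 1:3))
```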
mutate() allows you to add new columns that are functions of existing
columns.
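As an illustration, the sketch below adds two columns computed from existing ones. It assumes dplyr is installed; the trips table, its columns, and its units are hypothetical.

```r
library(dplyr)

# Hypothetical toy data
trips <- tibble(
  distance = c(100, 250, 40),  # miles
  air_time = c(30, 60, 15)     # minutes
)

# New columns are appended at the end; a column created earlier in
# the same mutate() call (hours) can be used by a later one (speed)
trips_speed <- mutate(
  trips,
  hours = air_time / 60,
  speed = distance / hours     # miles per hour
)
```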
Below are some summary functions you can use within summarize():
Measures of location: mean(), median()
Measures of variation: var(), sd(), IQR()
Measures of rank: min(), max(), quantile()
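The summary functions listed above can be combined in a single summarize() call. This is a minimal sketch, assuming dplyr is installed; the delays table is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data
delays <- tibble(delay = c(2, 4, 4, 10, 30))

# Collapse the whole column into one row of summaries
delay_summary <- summarize(
  delays,
  mean_delay   = mean(delay),    # location
  median_delay = median(delay),  # location
  sd_delay     = sd(delay),      # variation
  iqr_delay    = IQR(delay),     # variation
  min_delay    = min(delay),     # rank
  max_delay    = max(delay),     # rank
  q90_delay    = quantile(delay, 0.9, names = FALSE)  # rank
)
```

The result is a table with a single row and one column per summary.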