
Lecture 9: Data Wrangling With Dplyr: Kevin Lee

This document summarizes a lecture on data wrangling using the dplyr package in R. It introduces the concept of tidy data and describes five main functions in dplyr - filter(), arrange(), select(), mutate(), and summarize() - to manipulate and transform data frames. It also discusses working with relational data through inner, full, left and right joins. The overall purpose is to provide an overview of how to use dplyr to solve common data manipulation challenges.

Lecture 9: Data Wrangling with dplyr

Kevin Lee

Department of Statistics
Western Michigan University

September 30, 2019

Kevin Lee (WMU) Lecture 9 (9/30/2019) September 30, 2019 1 / 12


Tidy Data

Happy families are all alike; every unhappy family is unhappy in its
own way.
– Leo Tolstoy

Tidy datasets are all alike, but every messy dataset is messy in its
own way.
– Hadley Wickham



Tidy Data

Tidying your data means storing it in a consistent form that matches
the semantics of the dataset.

There are three interrelated rules which make a dataset tidy:

1 Each variable must have its own column.
2 Each observation must have its own row.
3 Each value must have its own cell.
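
As a rough illustration of the three rules, the sketch below reshapes a small messy table into tidy form. The data frame `messy` and its values are hypothetical, and `pivot_longer()` comes from the tidyr package, which is an assumption here since the slides only name dplyr.

```r
library(tidyr)

# Hypothetical messy table: the variable "year" is spread across column names.
messy <- data.frame(
  country = c("A", "B"),
  `1999`  = c(10, 30),
  `2000`  = c(20, 40),
  check.names = FALSE
)

# Tidy version: one column per variable (country, year, cases),
# one row per observation (a country in a year), one value per cell.
tidy <- pivot_longer(messy, cols = c(`1999`, `2000`),
                     names_to = "year", values_to = "cases")
tidy
```

After reshaping, each of the four country-year observations sits in its own row.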



Data Transformation with dplyr

Five main dplyr functions allow you to solve the majority of your
data-manipulation challenges:
filter(), pick observations by their values
arrange(), reorder the rows
select(), pick variables by their names
mutate(), create new variables with functions of existing variables
summarize(), collapse many values down to a single summary



Data Transformation with dplyr

All functions work similarly:


1 The first argument is a data frame.
2 The subsequent arguments describe what to do with the data frame,
using the variable names.
3 The result is a new data frame.
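
The shared pattern can be sketched with a single call. The data frame `scores` and its columns are hypothetical and used only for illustration.

```r
library(dplyr)

# Hypothetical data frame used only for illustration.
scores <- data.frame(
  student = c("Ann", "Ben", "Cal", "Dee"),
  section = c("A", "A", "B", "B"),
  score   = c(90, 72, 85, 60)
)

# 1. First argument: the data frame.
# 2. Later arguments: what to do, written with bare column names.
# 3. Result: a new data frame; `scores` itself is not modified.
high <- filter(scores, score > 80)

nrow(scores)  # number of rows in the original
nrow(high)    # number of rows in the result
```

Because the input is never changed in place, the result has to be assigned to a name if you want to keep it.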



filter()
filter() allows you to subset observations based on their values.

filter(data frame, condition)

To use filtering effectively, you have to know how to select the observations
that you want using the comparison operators and logical operators in R.
Comparison operators in R:
< # less than
> # greater than
== # equal to
<= # less than or equal to
>= # greater than or equal to
!= # not equal to
Logical operators in R:
& # logical “and”
| # logical “or”
! # logical “not”
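
A minimal sketch of how the comparison and logical operators combine inside filter(), assuming a hypothetical data frame `scores`:

```r
library(dplyr)

# Hypothetical data frame used only for illustration.
scores <- data.frame(
  student = c("Ann", "Ben", "Cal", "Dee"),
  section = c("A", "A", "B", "B"),
  score   = c(90, 72, 85, 60)
)

# "and": rows in section A with a score of at least 80
both <- filter(scores, section == "A" & score >= 80)

# "or": rows that are in section B or have a score below 75
either <- filter(scores, section == "B" | score < 75)

# "not": rows that are not in section A
not_a <- filter(scores, !(section == "A"))
```

Conditions separated by commas, as in filter(scores, section == "A", score >= 80), are combined with "and".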
arrange()

arrange() allows you to change the order of the observations.

arrange(data frame, column name)

If you provide more than one column name, each additional column will be
used to break ties in the values of preceding columns.
Use desc() to reorder by a column in descending order.
Missing values are always sorted at the end.
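
The ordering rules above can be sketched as follows, again with a hypothetical data frame `scores` that contains one tie and one missing value:

```r
library(dplyr)

# Hypothetical data frame with a tie (72) and a missing score.
scores <- data.frame(
  student = c("Ann", "Ben", "Cal", "Dee"),
  section = c("B", "A", "A", "A"),
  score   = c(72, 90, NA, 72)
)

# Sort by score; the tie at 72 is broken by section.
by_score <- arrange(scores, score, section)

# Sort by score in descending order; the NA row still comes last.
by_score_desc <- arrange(scores, desc(score))
```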



select()

select() allows you to zoom in on a useful subset using operations based
on the names of the variables.

select(data frame, column name)

Below are some helper functions you can use within select():
starts_with("abc") matches names that begin with "abc".
ends_with("xyz") matches names that end with "xyz".
num_range("x", 1:3) matches x1 , x2 , and x3 .
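
A short sketch of select() with these helpers. The data frame `df` and its column names are hypothetical, chosen so that each helper has something to match.

```r
library(dplyr)

# Hypothetical data frame; column names are chosen to exercise each helper.
df <- data.frame(
  id        = 1:2,
  abc_total = c(5, 6),
  size_xyz  = c(0.1, 0.2),
  x1        = c(1, 2),
  x2        = c(3, 4),
  x3        = c(5, 6)
)

by_name   <- select(df, id, x1)                 # pick columns by name
by_prefix <- select(df, starts_with("abc"))     # names beginning with "abc"
by_suffix <- select(df, ends_with("xyz"))       # names ending with "xyz"
by_range  <- select(df, num_range("x", 1:3))    # x1, x2, x3
```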



mutate()

mutate() allows you to add new columns that are functions of existing
columns.

mutate(data frame, new column = f(column name))

If you only want to keep the new variables, use transmute().
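
The difference between mutate() and transmute() can be sketched as follows. The data frame `people` and the derived column `bmi` are hypothetical examples, not part of the lecture.

```r
library(dplyr)

# Hypothetical data frame used only for illustration.
people <- data.frame(
  name      = c("Ann", "Ben"),
  height_cm = c(160, 180),
  weight_kg = c(64, 81)
)

# mutate() keeps the existing columns and adds the new one.
with_bmi <- mutate(people, bmi = weight_kg / (height_cm / 100)^2)

# transmute() keeps only the columns created in the call.
only_bmi <- transmute(people, bmi = weight_kg / (height_cm / 100)^2)
```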



summarize()

summarize() collapses a data frame to a single row.

summarize(data frame, R function(column name))

Below are some summary functions you can use within summarize():
Measures of location: mean(), median()
Measures of variation: var(), sd(), IQR()
Measures of rank: min(), max(), quantile()

summarize() becomes really useful when used with group_by(): it then
returns one row per group instead of a single row.

group_by() is used to group data by one or more variables.
group_by(data frame, column name)
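
A minimal sketch of summarize() with and without group_by(), assuming a hypothetical data frame `scores`:

```r
library(dplyr)

# Hypothetical data frame used only for illustration.
scores <- data.frame(
  student = c("Ann", "Ben", "Cal", "Dee"),
  section = c("A", "A", "B", "B"),
  score   = c(90, 72, 85, 60)
)

# Without grouping: the whole data frame collapses to one row.
overall <- summarize(scores, mean_score = mean(score), sd_score = sd(score))

# With grouping: one summary row per section.
by_section  <- group_by(scores, section)
per_section <- summarize(by_section, mean_score = mean(score), n = n())
```

Here n() is the dplyr helper that counts the rows in each group.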



Relational Data with dplyr

It is rare that a data analysis involves only a single table of data.
Typically you have many tables of data, and you must combine them
to answer the questions that you are interested in.

Multiple tables of data are called relational data because it is the
relations, not just the individual datasets, that are important.



Relational Data with dplyr

inner_join(x, y), keeps only observations whose keys appear in both x and y.
full_join(x, y), keeps all observations in x and y.
left_join(x, y), keeps all observations in x.
right_join(x, y), keeps all observations in y.
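
The four joins can be compared on two small tables. The tables `x` and `y`, their shared column `key`, and the values are hypothetical and serve only to show which rows each join keeps.

```r
library(dplyr)

# Hypothetical tables sharing a "key" column; keys 1 and 2 appear in both.
x <- data.frame(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- data.frame(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

inner <- inner_join(x, y, by = "key")  # keys 1, 2
full  <- full_join(x, y, by = "key")   # keys 1, 2, 3, 4
left  <- left_join(x, y, by = "key")   # keys 1, 2, 3
right <- right_join(x, y, by = "key")  # keys 1, 2, 4
```

Rows that have no match in the other table are kept by the outer joins with NA in the columns that come from that table.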

