Tutorial-Introduction To Dplyr

The document discusses the dplyr package in R which provides functions for data manipulation. It introduces common verbs like filter(), arrange(), select(), distinct(), mutate(), and summarise(). It explains how these can be used to manipulate and transform data frames. It also discusses using group_by() to perform grouped operations, applying verbs by groups within the data.

Uploaded by

Trisna Yulia Junita


Courtesy

This PowerPoint file summarizes the
"Introduction to dplyr" vignette at
https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html,
which teaches the dplyr package step by step.
dplyr is very useful for manipulating data.
Compiling that webpage into slides makes the
material easier to introduce in a lecture.
When working with data you must
Figure out what you want to do
Describe those tasks in the form of a
computer program
Execute the program
The dplyr package makes these steps
fast and easy
By constraining your options, it simplifies how
you can think about common data
manipulation tasks
It provides simple verbs, functions that
correspond to the most common data
manipulation tasks, to help you translate
those thoughts into code.
It uses efficient data storage backends, so you
spend less time waiting for the computer.
dplyr's basic set of tools
Databases
Besides in-memory data frames, dplyr also connects to
out-of-memory, remote databases. And by translating your
R code into the appropriate SQL, it allows you to work with
both types of data using the same set of tools.
benchmark-baseball
see how dplyr compares to other tools for data
manipulation on a realistic use case.
window-functions
a window function is a variation on an aggregation
function. Where an aggregate function uses n inputs to
produce 1 output, a window function uses n inputs to
produce n outputs.
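The difference between aggregate and window functions can be sketched on a small vector. The values below are made up for illustration; cumsum() is base R, while min_rank() and lag() are window functions that dplyr provides.

```r
library(dplyr)

x <- c(10, 20, 5, 20)  # made-up input values

# Aggregate functions: n inputs -> 1 output
total   <- sum(x)
average <- mean(x)

# Window functions: n inputs -> n outputs
running_total <- cumsum(x)       # cumulative sum, one value per input
ranks         <- min_rank(x)     # rank of each value, ties share the lowest rank
previous      <- dplyr::lag(x)   # value at the previous position, NA for the first

length(total)          # one value
length(running_total)  # same length as x
```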
Data: nycflights13
To explore the basic data manipulation verbs
of dplyr, we'll start with the built-in
nycflights13 data frame. This dataset contains
all 336776 flights that departed from New
York City in 2013. The data comes from the US
Bureau of Transportation Statistics, and is
documented in ?nycflights13
Prepare your nycflights13 data
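install.packages(c("dplyr", "nycflights13"))  # one-time installation
library(dplyr)         # the verbs used in the rest of the slides
library(nycflights13)  # provides the flights data frame
dim(flights)           # number of rows and columns
head(flights)          # first six rows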
Large data: tbl_df
dplyr can work with data frames as is, but if
you're dealing with large data, it's worthwhile
to convert them to a tbl_df: this is a wrapper
around a data frame that won't accidentally
print a lot of data to the screen.
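A minimal sketch of the conversion, assuming a reasonably recent dplyr: the tbl_df() function is deprecated there, and as_tibble() (re-exported by dplyr from the tibble package) does the same job. The built-in mtcars data frame is used only as a stand-in.

```r
library(dplyr)

# mtcars is a plain data.frame that ships with R
class(mtcars)

# Wrap it in a tibble (class "tbl_df")
cars_tbl <- as_tibble(mtcars)
class(cars_tbl)

# Printing a tibble shows only the first rows and the columns that fit
cars_tbl
```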
SINGLE TABLE VERBS
Single table verbs
dplyr aims to provide a function for each basic
verb of data manipulation:
filter() (and slice())
arrange()
select() (and rename())
distinct()
mutate() (and transmute())
summarise()
sample_n() and sample_frac()
If you've used plyr before, many of these will be
familiar.
Filter rows with filter()
filter() allows you to select a subset of rows in
a data frame. The first argument is the name
of the data frame. The second and subsequent
arguments are the expressions that filter the
data frame:
For example, we can select all flights on
January 1st with:
Filter rows with filter()
filter(flights, month == 1, day == 1)

This is equivalent to the more verbose code in base R:

flights[flights$month == 1 & flights$day == 1, ]
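The comparison with base R can be tried on a toy data frame (made-up values, not the real flights data). One difference worth knowing: when a condition evaluates to NA, filter() drops the row, while base subsetting returns a row of NAs. The flights example is unaffected because month and day have no missing values.

```r
library(dplyr)

# Toy stand-in for flights; the last row has a missing month
toy <- data.frame(
  month = c(1, 1, 2, NA),
  day   = c(1, 2, 1, 1),
  id    = c("a", "b", "c", "d")
)

with_filter <- filter(toy, month == 1, day == 1)
with_base   <- toy[toy$month == 1 & toy$day == 1, ]

nrow(with_filter)  # only the row with id "a"
nrow(with_base)    # the row with id "a" plus an all-NA row
```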
filter() vs. subset()
filter() works similarly to subset() except that
you can give it any number of filtering
conditions, which are joined together with &
(not && which is easy to do accidentally!). You
can also use other boolean operators:
filter(flights, month == 1 | month == 2)
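A short sketch of how the conditions combine, again on made-up data. It also shows %in%, a base R operator that is often a more compact way to write a chain of | comparisons.

```r
library(dplyr)

toy <- data.frame(month = c(1, 2, 3, 1), day = c(1, 1, 1, 15))

# Comma-separated conditions are joined with &
a <- filter(toy, month == 1, day == 1)
b <- filter(toy, month == 1 & day == 1)
isTRUE(all.equal(a, b))

# | keeps rows that match either condition; %in% is an alternative spelling
c1 <- filter(toy, month == 1 | month == 2)
c2 <- filter(toy, month %in% c(1, 2))
isTRUE(all.equal(c1, c2))
```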
slice()
To select rows by position, use slice():
slice(flights, 1:10)
Arrange rows with arrange()
arrange() works similarly to filter() except that
instead of filtering or selecting rows, it
reorders them
It takes a data frame, and a set of column
names (or more complicated expressions) to
order by
If you provide more than one column name,
each additional column will be used to break
ties in the values of preceding columns:
arrange()
arrange(flights, year, month, day)
arrange() and desc()
Use desc() to order a column in descending
order:
arrange(flights, desc(arr_delay))
The two arrange() calls above are equivalent to this base R code:
flights[order(flights$year, flights$month, flights$day), ]
flights[order(flights$arr_delay, decreasing = TRUE), ]
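Tie-breaking and descending order can be checked on a small made-up data frame. The sketch also shows that arrange() places rows with a missing sort value last.

```r
library(dplyr)

toy <- data.frame(
  month     = c(2, 1, 1, 2),
  day       = c(1, 9, 3, 1),
  arr_delay = c(5, NA, 40, -3)
)

# Ties in month are broken by day
by_date <- arrange(toy, month, day)
by_date

# Largest delay first; the row with a missing arr_delay comes last
by_delay <- arrange(toy, desc(arr_delay))
by_delay$arr_delay
```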
Select columns with select()
Often you work with large datasets with many
columns but only a few are actually of interest
to you
select() allows you to rapidly zoom in on a
useful subset using operations that usually
only work on numeric variable positions:
select()
# Select columns by name
select(flights, year, month, day)
select()
# Select all columns between year and day (inclusive)
select(flights, year:day)
select()
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
There are a number of helper functions you
can use within select(), like starts_with(),
ends_with(), matches() and contains()
These let you quickly match larger blocks of
variables that meet some criterion.
See ?select for more details.
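The helper functions can be sketched on a one-row data frame whose column names mimic a few flights columns (the values are made up):

```r
library(dplyr)

toy <- data.frame(
  dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11,
  carrier = "UA", tailnum = "N14228"
)

names(select(toy, starts_with("dep")))      # names beginning with "dep"
names(select(toy, ends_with("delay")))      # names ending with "delay"
names(select(toy, contains("time")))        # names containing "time"
names(select(toy, matches("^(dep|arr)_")))  # names matching a regular expression
```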
Rename columns with rename()
You can rename variables with rename() by
using named arguments:
rename(flights, tail_num = tailnum)
Extract distinct (unique) rows
A common use of select() is to find the values
of a set of variables. This is particularly useful
in conjunction with the distinct() verb which
only returns the unique values in a table.
distinct(select(flights, tailnum))
distinct(select(flights, origin, dest))

This is very similar to base::unique() but should be much faster.
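A small sketch comparing distinct() with base::unique() on made-up origin/destination pairs. Speed is not measured here; the point is only that both return the same unique combinations.

```r
library(dplyr)

toy <- data.frame(
  origin = c("JFK", "JFK", "LGA", "JFK"),
  dest   = c("LAX", "LAX", "ORD", "SFO")
)

pairs_dplyr <- distinct(select(toy, origin, dest))
pairs_base  <- unique(toy[, c("origin", "dest")])

nrow(pairs_dplyr)  # number of unique origin/dest pairs
nrow(pairs_base)   # same count from base R
```

distinct() also accepts the column names directly, so distinct(toy, origin, dest) gives the same rows without the inner select().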
Add new columns with mutate()
Besides selecting sets of existing columns, it's
often useful to add new columns that are
functions of existing columns. This is the job of
mutate():
mutate()
mutate(flights, gain = arr_delay - dep_delay,
speed = distance / air_time * 60)
mutate() allows you to refer to columns that
you've just created:
mutate(flights, gain = arr_delay - dep_delay,
gain_per_hour = gain / (air_time / 60) )
transmute()
If you only want to keep the new variables,
use transmute():
transmute(flights, gain = arr_delay - dep_delay,
gain_per_hour = gain / (air_time / 60) )
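The difference between mutate() and transmute() is easiest to see from the column names of the result. A sketch on two made-up flights:

```r
library(dplyr)

toy <- data.frame(
  arr_delay = c(11, -18),
  dep_delay = c(2, -6),
  air_time  = c(120, 90)   # minutes
)

# mutate() keeps the existing columns and appends the new ones
m <- mutate(toy,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)  # refers to gain defined just above
)
names(m)

# transmute() returns only the newly defined columns
tr <- transmute(toy,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
names(tr)
```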
Summarise values with summarise()
The last verb is summarise(). It collapses a
data frame to a single row:
summarise(flights, delay = mean(dep_delay,
na.rm = TRUE))
Randomly sample rows with
sample_n() and sample_frac()
You can use sample_n() and sample_frac() to
take a random sample of rows
use sample_n() for a fixed number and
sample_frac() for a fixed fraction.
sample_n(flights, 10)
sample_frac(flights, 0.01)

Use replace = TRUE to perform a bootstrap sample. If needed,
you can weight the sample with the weight argument.
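A sketch of the sampling functions on a made-up 100-row data frame. set.seed() is added here only to make the draw reproducible. In recent dplyr versions slice_sample() is the newer interface for the same task, but sample_n() and sample_frac() still work.

```r
library(dplyr)

toy <- data.frame(id = 1:100)

set.seed(42)                                        # reproducible draws
fixed_n    <- sample_n(toy, 10)                     # exactly 10 rows
fixed_frac <- sample_frac(toy, 0.05)                # 5% of the rows
bootstrap  <- sample_n(toy, 100, replace = TRUE)    # same size, drawn with replacement

nrow(fixed_n)
nrow(fixed_frac)
nrow(bootstrap)
```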
Commonalities
You may have noticed that the syntax and
function of all these verbs are very similar:
The first argument is a data frame.
The subsequent arguments describe what to do with
the data frame. Notice that you can refer to columns
in the data frame directly without using $.
The result is a new data frame
Together these properties make it easy to chain
together multiple simple steps to achieve a
complex result.
At the most basic level, you can only alter a tidy data
frame in five useful ways: you can
reorder the rows (arrange())
pick observations and variables of interest (filter() and
select())
add new variables that are functions of existing variables
(mutate())
collapse many values to a summary (summarise())
The remainder of the language comes from applying
the five functions to different types of data. For
example, how these functions work with grouped data.
GROUPED OPERATIONS
These verbs are useful on their own, but they become
really powerful when you apply them to groups of
observations within a dataset.
In dplyr, you do this with the group_by() function.
It breaks down a dataset into specified groups of rows.
When you then apply the verbs above on the resulting
object they'll be automatically applied by group.
Most importantly, all this is achieved by using the same
exact syntax you'd use with an ungrouped object.
In the following example, we split the
complete dataset into individual planes and
then summarise each plane by counting the
number of flights (count = n()) and computing
the average distance (dist = mean(distance,
na.rm = TRUE)) and arrival delay (delay =
mean(arr_delay, na.rm = TRUE)). We then use
ggplot2 to display the output.
library(ggplot2)  # needed for the plot at the end
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
# Keep planes with more than 20 flights and an average distance under 2000
delay <- filter(delay, count > 20, dist < 2000)
# Interestingly, the average delay is only slightly related to the
# average distance flown by a plane.
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()
You use summarise() with aggregate functions,
which take a vector of values and return a single
number. There are many useful examples of such
functions in base R like min(), max(), mean(),
sum(), sd(), median(), and IQR(). dplyr provides a
handful of others:
n(): the number of observations in the current group
n_distinct(x): the number of unique values in x.
first(x), last(x) and nth(x, n) - these work similarly to
x[1], x[length(x)], and x[n] but give you more control
over the result if the value is missing.
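These summary functions can be sketched together on a small made-up data frame. The result is stored in res, a name chosen here for illustration.

```r
library(dplyr)

toy <- data.frame(
  dest    = c("LAX", "LAX", "LAX", "ORD"),
  tailnum = c("N1", "N1", "N2", "N3"),
  delay   = c(5, 20, -3, 12)
)

res <- summarise(group_by(toy, dest),
  flights      = n(),                  # rows in the current group
  planes       = n_distinct(tailnum),  # unique tail numbers in the group
  first_delay  = first(delay),
  last_delay   = last(delay),
  second_delay = nth(delay, 2)         # NA when the group has fewer than 2 rows
)
res
```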
For example, we could use these to find the
number of planes and the number of flights
that go to each possible destination:
destinations <- group_by(flights, dest)
summarise(destinations,
planes = n_distinct(tailnum),
flights = n()
)
You can save the result in a new variable.

Check the type of the t object.

To see all of the data, convert the t object to a
data.frame.
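The slides do not show the code for these three steps, so the following is only a guess at what they looked like, run on made-up data. The variable name t comes from the slide text; class() and as.data.frame() are assumptions about how the type was checked and the conversion done.

```r
library(dplyr)

toy <- data.frame(
  dest    = c("LAX", "LAX", "ORD"),
  tailnum = c("N1", "N2", "N3")
)

# Save the summary in a variable (the slides call it t)
t <- summarise(group_by(toy, dest),
  planes  = n_distinct(tailnum),
  flights = n()
)

class(t)          # the result is a tibble (class includes "tbl_df")
as.data.frame(t)  # a plain data.frame prints every row
```

Naming a variable t works, but it shares its name with the base R transpose function t(), so a different name may be less confusing in real code.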
When you group by multiple variables, each
summary peels off one level of the grouping.
That makes it easy to progressively roll-up a
dataset:
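The roll-up can be sketched as follows. The data frame is a made-up stand-in for flights with only the date columns. group_vars() is used to show which grouping variables remain after each summarise().

```r
library(dplyr)

toy <- data.frame(
  year  = 2013,
  month = c(1, 1, 1, 2),
  day   = c(1, 1, 2, 1)
)

daily     <- group_by(toy, year, month, day)
per_day   <- summarise(daily, flights = n())               # drops the day level
per_month <- summarise(per_day, flights = sum(flights))    # drops the month level
per_year  <- summarise(per_month, flights = sum(flights))  # no grouping left

group_vars(per_day)    # grouping that remains after the first summary
group_vars(per_month)  # grouping that remains after the second summary
per_year$flights       # total number of rows in toy
```

Recent dplyr versions print a message about the remaining grouping after each summarise() call; it is informational and can be ignored here.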
Chaining
The dplyr API is functional in the sense that
function calls don't have side-effects
You must always save their results
This doesn't lead to particularly elegant code,
especially if you want to do many operations
at once
You either have to do it step-by-step:
Or if you don't want to save the intermediate
results, you need to wrap the function calls
inside each other:
This is difficult to read because the order of
the operations is from inside to out
Thus, the arguments are a long way away from
the function
To get around this problem, dplyr provides the
%>% operator
x %>% f(y) turns into f(x, y) so you can use it to
rewrite multiple operations that you can read
left-to-right, top-to-bottom:
New Expression by %>%
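The three styles described above (step by step, nested, and piped) can be compared side by side. This is a sketch on made-up data, not the code from the original slides. All three produce the same result.

```r
library(dplyr)

toy <- data.frame(
  day       = c(1, 1, 2, 2),
  arr_delay = c(40, 50, 5, NA),
  dep_delay = c(35, 45, 0, 10)
)

# 1. Step by step, saving each intermediate result
a1 <- group_by(toy, day)
a2 <- summarise(a1,
  arr = mean(arr_delay, na.rm = TRUE),
  dep = mean(dep_delay, na.rm = TRUE)
)
step_by_step <- filter(a2, arr > 30 | dep > 30)

# 2. Nested calls, read from the inside out
nested <- filter(
  summarise(
    group_by(toy, day),
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)

# 3. Piped with %>%, read left-to-right, top-to-bottom
piped <- toy %>%
  group_by(day) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

piped
```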
