DSF 11-12

The document provides an overview of data wrangling in R, focusing on the dplyr package, which offers a grammar for data manipulation through functions like select, filter, mutate, arrange, and summarize. It explains how to use these functions to manipulate data frames effectively, emphasizing the importance of tidy data. Additionally, it covers the use of the pipe operator for chaining operations to improve code readability.

DATA SCIENCE FUNDAMENTALS
DSC293
Lecture 11-12
Dr. Hufsa Mohsin
A GRAMMAR FOR DATA WRANGLING
• The data frame is a key data structure in statistics and in R.
• The basic structure of a data frame is that there is one observation per row, and each column represents a variable, measure, feature, or characteristic of that observation.
THE DPLYR PACKAGE
• The dplyr package was developed by Hadley Wickham of RStudio and is an optimized and distilled version of his plyr package.
• One important contribution of the dplyr package is that it provides a “grammar” (in particular, verbs) for data manipulation and for operating on data frames.
• With this grammar, you can communicate what you are doing to a data frame in a way that other people can understand (assuming they also know the grammar).
DPLYR GRAMMAR
• select: return a subset of the columns of a data frame, using a flexible notation
• filter: extract a subset of rows from a data frame based on logical conditions
• arrange: reorder rows of a data frame
• rename: rename variables in a data frame
• mutate: add new variables/columns or transform existing variables
• summarize: generate summary statistics of different variables in the data frame, possibly within strata
• %>%: the “pipe” operator is used to connect multiple verb actions together into a pipeline
COMMON DPLYR FUNCTION PROPERTIES
• 1. The first argument is a data frame.
• 2. The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names).
• 3. The return result of a function is a new data frame.
• 4. Data frames must be properly formatted and annotated for this to all be useful. In particular, the data must be tidy.
• In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.
• To install the package, run install.packages("dplyr").
• After installing the package, it is important that you load it into your R session with the library() function: library(dplyr)
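The properties above can be seen in a minimal sketch. The `presidents` data frame here is an assumption: six real presidential terms typed in by hand to stand in for the lecture's data, which is not reproduced in these notes.

```r
library(dplyr)

# Hypothetical stand-in for the lecture's data: six real presidential terms.
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1989-01-20", "1993-01-20")),
  end   = as.Date(c("1974-08-09", "1977-01-20", "1981-01-20",
                    "1989-01-20", "1993-01-20", "2001-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic")
)

# Base R: the data frame name is repeated and the column is reached with `$`.
base_result <- presidents[presidents$party == "Republican", ]

# dplyr: the data frame comes first, then a condition naming the column directly.
dplyr_result <- filter(presidents, party == "Republican")

# The result is a new data frame; `presidents` itself is unchanged.
is.data.frame(dplyr_result)
nrow(presidents)
```

Because every verb returns a new data frame rather than modifying its input, the result has to be assigned to a name if it is needed later.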
SELECT()
• The select() function can be used to select columns of a data frame that you want to focus on. Often you’ll have a large data frame containing “all” of the data, but any given analysis might only use a subset of variables or observations.
SELECT()…
• The select() function allows you to get the few columns you might need. Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could, for example, use numerical indices. But we can also use the names directly.
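As a sketch of the two approaches just described, the following selects the first three columns by position and by name. The `presidents` data frame is a hypothetical stand-in (six real terms entered by hand), not the data set used on the original slide.

```r
library(dplyr)

# Hypothetical stand-in data: six real presidential terms.
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1989-01-20", "1993-01-20")),
  end   = as.Date(c("1974-08-09", "1977-01-20", "1981-01-20",
                    "1989-01-20", "1993-01-20", "2001-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic")
)

# By numerical index: columns 1 through 3.
by_index <- select(presidents, 1:3)

# By name: list the columns explicitly ...
by_name <- select(presidents, name, start, end)

# ... or give a range of adjacent columns.
by_range <- select(presidents, name:end)

names(by_index)
```

Selecting by name tends to be more robust than selecting by index, since it keeps working if the column order changes.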
SELECT()…
• To retrieve only the names and party affiliations of these presidents, we would use select(). The first argument to the function is the data frame, followed by an arbitrarily long list of column names, separated by commas.
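The call described above might look like the following sketch. The slide's own code is not preserved, so the `presidents` data frame is an assumption: a hand-entered stand-in with six real terms.

```r
library(dplyr)

# Hypothetical stand-in data: six real presidential terms.
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1989-01-20", "1993-01-20")),
  end   = as.Date(c("1974-08-09", "1977-01-20", "1981-01-20",
                    "1989-01-20", "1993-01-20", "2001-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic")
)

# Data frame first, then the columns to keep, separated by commas.
names_and_parties <- select(presidents, name, party)

names_and_parties
```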

FILTER()
• Figure caption: At left, a data frame that contains matching entries in a certain column for only a subset of the rows. At right, the resulting data frame after filtering.
FILTER()…
• The first argument to filter() is a data frame, and subsequent arguments are logical conditions that are evaluated on any involved columns.
• If we want to retrieve only those rows that pertain to Republican presidents, we need to specify that the value of the party variable is equal to "Republican".
• Note that == is a test for equality. If we were to use only a single equal sign here, we would be asserting that the value of party was "Republican".
• This would result in an error. The quotation marks around "Republican" are necessary here, since "Republican" is a literal value, and not a variable name.
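A minimal sketch of the filtering described above, assuming a hypothetical `presidents` data frame (six real terms entered by hand) in which the party values are spelled "Republican" and "Democratic". String comparison in R is case-sensitive, so the literal must match the stored spelling.

```r
library(dplyr)

# Hypothetical stand-in data: six real presidential terms.
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1989-01-20", "1993-01-20")),
  end   = as.Date(c("1974-08-09", "1977-01-20", "1981-01-20",
                    "1989-01-20", "1993-01-20", "2001-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic")
)

# One logical condition: keep rows whose party equals the quoted literal.
republicans <- filter(presidents, party == "Republican")

# Several conditions separated by commas are combined with a logical AND.
later_republicans <- filter(presidents,
                            party == "Republican",
                            start > as.Date("1980-01-01"))

republicans
later_republicans
```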
COMBINING FILTER() AND SELECT()
Find which Democratic presidents served since Watergate.
The filter() operation is nested inside the select(). Each of the five verbs takes and returns a data frame, which makes this type of nesting possible.
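The nested call might be sketched as follows. Two assumptions are made here: the `presidents` data frame is a hand-entered stand-in with six real terms, and "since Watergate" is taken to mean a term that started after 1973.

```r
library(dplyr)

# Hypothetical stand-in data: six real presidential terms.
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1989-01-20", "1993-01-20")),
  end   = as.Date(c("1974-08-09", "1977-01-20", "1981-01-20",
                    "1989-01-20", "1993-01-20", "2001-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic")
)

# The inner filter() runs first; its result is the first argument of select().
# Assumption: "since Watergate" means the term started after 1973.
recent_democrats <- select(
  filter(presidents,
         start > as.Date("1973-12-31"),
         party == "Democratic"),
  name
)

recent_democrats
```

Nested calls have to be read from the inside out, which becomes harder to follow as more verbs are combined.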
PIPELINE
• Pipe-forwarding is an alternative to nesting that yields code that can be easily read from top to bottom. With the pipe, we can write the same expression as above in this more readable syntax.
• In general, dataframe %>% filter(condition) is equivalent to filter(dataframe, condition).
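To illustrate the equivalence, the sketch below writes one query both ways and compares the results. The `presidents` data frame (six real terms entered by hand) and the cutoff date are assumptions made for this example.

```r
library(dplyr)

# Hypothetical stand-in data: six real presidential terms.
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1989-01-20", "1993-01-20")),
  end   = as.Date(c("1974-08-09", "1977-01-20", "1981-01-20",
                    "1989-01-20", "1993-01-20", "2001-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic")
)

# Nested form: read from the inside out.
nested <- select(
  filter(presidents, start > as.Date("1973-12-31"), party == "Democratic"),
  name
)

# Piped form: each step passes its result on as the first argument of the next.
piped <- presidents %>%
  filter(start > as.Date("1973-12-31"), party == "Democratic") %>%
  select(name)

# Both forms select the same names.
all(piped$name == nested$name)
```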
MUTATE()
While we have the raw data on when each of these presidents took and relinquished office, we don’t actually have a numeric variable giving the length of each president’s term.
Of course, we can derive this information from the dates given, and add the result as a new column to our data frame.
This date arithmetic is made easier through the use of the lubridate package, which we use to compute the number of years (dyears()) that elapsed during the interval() from the start until the end of each president’s term.
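A sketch of this computation, under two assumptions: the `presidents` data frame is a hand-entered stand-in with six real terms, and the new column is called `term.length` (the slide's own column name is not preserved).

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in data: six real presidential terms.
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1989-01-20", "1993-01-20")),
  end   = as.Date(c("1974-08-09", "1977-01-20", "1981-01-20",
                    "1989-01-20", "1993-01-20", "2001-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic")
)

# interval() spans each term; dividing by dyears(1) expresses it in years.
# `term.length` is an assumed column name.
with_length <- presidents %>%
  mutate(term.length = interval(start, end) / dyears(1))

select(with_length, name, term.length)
```

The result is an ordinary numeric column, so it can be used in later steps such as sorting or averaging.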
MUTATE()…
• The mutate() function can also be used to modify the data in an existing column.
• Suppose that we wanted to add to our data frame a variable containing the year in which each president was elected.
• Our first (naïve) attempt might assume that every president was elected in the year before he took office.
• mutate() returns a data frame, so if we want to modify our existing data frame, we need to overwrite it with the results.
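The naive attempt, and a correction to the resulting column, might be sketched as follows. The `presidents` data frame (six real terms entered by hand) and the column name `elected` are assumptions. The year is extracted with base R's format() to keep the example dependency-free, which may differ from the slide's code. The correction relies on the fact that Ford took office in 1974 without having been elected.

```r
library(dplyr)

# Hypothetical stand-in data: six real presidential terms.
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1989-01-20", "1993-01-20")),
  end   = as.Date(c("1974-08-09", "1977-01-20", "1981-01-20",
                    "1989-01-20", "1993-01-20", "2001-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic")
)

# Naive assumption: elected in the year before taking office.
# Overwrite `presidents` so that the new column is kept.
presidents <- presidents %>%
  mutate(elected = as.integer(format(start, "%Y")) - 1L)

# Modify the existing column: there was no election in 1973, so mark it missing.
presidents <- presidents %>%
  mutate(elected = if_else(elected == 1973L, NA_integer_, elected))

select(presidents, name, elected)
```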
RENAME()
• It is considered bad practice to use “.” in the names of functions, data frames, and variables in R.
• A period in a name could also conflict with R’s use of generic functions (i.e., R’s mechanism for method overloading).
• Thus, we should change the name of such a column with the rename() function.
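A sketch of the renaming, assuming the offending column is called `term.length` and the replacement is `term_length`. The `presidents` data frame is a hand-entered stand-in with six real terms, and the term length is approximated here with base R date subtraction.

```r
library(dplyr)

# Hypothetical stand-in data: six real presidential terms.
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1989-01-20", "1993-01-20")),
  end   = as.Date(c("1974-08-09", "1977-01-20", "1981-01-20",
                    "1989-01-20", "1993-01-20", "2001-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic")
)

# A column with a period in its name (assumed name), in approximate years.
presidents <- presidents %>%
  mutate(term.length = as.numeric(end - start) / 365.25)

# rename() takes new_name = old_name pairs.
presidents <- presidents %>%
  rename(term_length = term.length)

names(presidents)
```

Only the column name changes; the values and the column order stay the same.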
ARRANGE()
• The function sort() will sort a vector but not a data frame. The function that will sort a data frame is called arrange().
• In order to apply arrange() on a data frame, you have to specify the data frame and the column by which you want it to be sorted.
• You can also specify the direction in which you want it to be sorted: ascending order is the default, and wrapping a column in desc() sorts it in descending order. Specifying multiple sort conditions will help break ties.
• To sort our presidential data frame by the length of each president’s term, we specify that we want the column term_length in descending order.
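The sort described above can be sketched as follows, assuming a hypothetical `presidents` data frame (six real terms entered by hand) with `term_length` approximated by base R date subtraction. The second sort key, `party`, is an illustrative choice for breaking ties.

```r
library(dplyr)

# Hypothetical stand-in data: six real presidential terms.
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1989-01-20", "1993-01-20")),
  end   = as.Date(c("1974-08-09", "1977-01-20", "1981-01-20",
                    "1989-01-20", "1993-01-20", "2001-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic")
) %>%
  mutate(term_length = as.numeric(end - start) / 365.25)

# Longest terms first.
by_length <- presidents %>%
  arrange(desc(term_length))

# Equal term lengths are then ordered by party (ascending, the default).
by_length_then_party <- presidents %>%
  arrange(desc(term_length), party)

select(by_length_then_party, name, party, term_length)
```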
SUMMARIZE()
• summarize() is nearly always used in conjunction with group_by().
• The previous four verbs provided us with means to manipulate a data frame in powerful and flexible ways.
• But the extent of the analysis we can perform with these four verbs alone is limited.
• On the other hand, summarize() with group_by() enables us to make comparisons.
SUMMARIZE()…
• When used alone, summarize() collapses a data frame into a single row. Critically, we have to specify how we want to reduce an entire column of data into a single value.
• The first argument is a data frame, followed by a list of variables that will appear in the output.
• Note that every variable in the output is defined by operations performed on vectors, not on individual values.
• This is essential, since if the specification of an output variable is not an operation on a vector, there is no way for R to know how to collapse each column.
• In this example, the function n() simply counts the number of rows.
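Since the slide's example is not preserved, here is a minimal sketch of summarize() used alone. The `presidents` data frame (six real terms entered by hand), the approximate `term_length` column, and the output variable names are all assumptions.

```r
library(dplyr)

# Hypothetical stand-in data: six real presidential terms.
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1989-01-20", "1993-01-20")),
  end   = as.Date(c("1974-08-09", "1977-01-20", "1981-01-20",
                    "1989-01-20", "1993-01-20", "2001-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic")
) %>%
  mutate(term_length = as.numeric(end - start) / 365.25)

# Each output variable reduces a whole column (a vector) to a single value.
term_summary <- presidents %>%
  summarize(
    N = n(),                                          # number of rows
    first_year = min(as.integer(format(start, "%Y"))),
    num_dems = sum(party == "Democratic"),
    avg_term_length = mean(term_length)
  )

term_summary
```

Functions such as min(), sum(), and mean() each take a vector and return one value, which is what lets summarize() collapse the data frame to a single row.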
SUMMARIZE()…
• Suppose we want to know whether Democratic or Republican presidents served a longer average term during this time period.
• To figure this out, we can just execute summarize() again, but this time, instead of passing the data frame as the first argument, we first specify that the rows of the data frame should be grouped by the values of the party variable.
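A sketch of the grouped summary, assuming a hypothetical `presidents` data frame (six real terms entered by hand, so the averages describe only these six rows) and an approximate `term_length` column computed with base R date subtraction.

```r
library(dplyr)

# Hypothetical stand-in data: six real presidential terms.
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1989-01-20", "1993-01-20")),
  end   = as.Date(c("1974-08-09", "1977-01-20", "1981-01-20",
                    "1989-01-20", "1993-01-20", "2001-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic")
) %>%
  mutate(term_length = as.numeric(end - start) / 365.25)

# group_by() marks the groups; summarize() then returns one row per party.
by_party <- presidents %>%
  group_by(party) %>%
  summarize(
    N = n(),
    avg_term_length = mean(term_length)
  )

by_party
```

The same summarize() call that produced a single row now produces one row per group, which is what makes the comparison between parties possible.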
