DSF 11-12

The document provides an overview of data wrangling in R, focusing on the dplyr package, which offers a grammar for data manipulation through functions like select, filter, mutate, arrange, and summarize. It explains how to use these functions to manipulate data frames effectively, emphasizing the importance of tidy data. Additionally, it covers the use of the pipe operator for chaining operations to improve code readability.

DATA SCIENCE FUNDAMENTALS
DSC293
Lecture 11-12
Dr. Hufsa Mohsin
A GRAMMAR FOR DATA WRANGLING
• The data frame is a key data structure in statistics and in R.
• The basic structure of a data frame is one observation per row, with each column representing a variable: a measure, feature, or characteristic of that observation.
THE DPLYR PACKAGE
• The dplyr package was developed by Hadley Wickham of RStudio and is an optimized and distilled version of his plyr package.
• One important contribution of the dplyr package is that it provides a “grammar” (in particular, verbs) for data manipulation and for operating on data frames.
• With this grammar, you can communicate what you are doing to a data frame in a way that other people can understand (assuming they also know the grammar).
DPLYR GRAMMAR
• select: return a subset of the columns of a data frame, using a flexible notation
• filter: extract a subset of rows from a data frame based on logical conditions
• arrange: reorder rows of a data frame
• rename: rename variables in a data frame
• mutate: add new variables/columns or transform existing variables
• summarize: generate summary statistics of different variables in the data frame, possibly within strata
• %>%: the “pipe” operator is used to connect multiple verb actions together into a pipeline
COMMON DPLYR FUNCTION PROPERTIES
1. The first argument is a data frame.
2. The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names).
3. The return result of a function is a new data frame.
4. Data frames must be properly formatted and annotated for all of this to be useful. In particular, the data must be tidy: there should be one observation per row, and each column should represent a feature or characteristic of that observation.
• Install the package with install.packages("dplyr").
• After installing the package, load it into your R session with the library() function: library(dplyr)
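The properties above can be seen in a minimal sketch. The data frame presidents is a small hand-typed stand-in (an assumption, not the data set used in the lecture), and filter() is used only as a representative verb.

```r
# install.packages("dplyr")  # run once if dplyr is not yet installed
library(dplyr)

# Hypothetical tidy data frame: one president per row, one variable per column
presidents <- data.frame(
  name  = c("Carter", "Reagan", "Clinton"),
  party = c("Democratic", "Republican", "Democratic"),
  start = as.Date(c("1977-01-20", "1981-01-20", "1993-01-20")),
  stringsAsFactors = FALSE
)

# Property 1: the data frame is the first argument.
# Property 2: columns are referred to by bare name (no presidents$party).
democrats <- filter(presidents, party == "Democratic")

# Property 3: the result is a new data frame; the input is left unchanged.
nrow(democrats)
nrow(presidents)
```

Because the input is never modified in place, the result has to be assigned to a name if it is needed later.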
SELECT()
• The select() function can be used to select the columns of a data frame that you want to focus on. Often you’ll have a large data frame containing “all” of the data, but any given analysis might only use a subset of variables or observations.
SELECT()…
• The select() function allows you to get the few columns you might need. Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could, for example, use numerical indices, but we can also use the names directly.
SELECT()…
• To retrieve only the names and party affiliations of these presidents, we would use select(). The first argument to the function is the data frame, followed by an arbitrarily long list of column names, separated by commas.
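As a rough illustration of the select() calls described above, the sketch below uses a small hand-typed presidents data frame (an assumption standing in for the lecture's data) and shows selection by name, by numerical index, and by a range of names.

```r
library(dplyr)

# Hypothetical stand-in data: three presidents, four columns
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20")),
  end   = as.Date(c("1974-08-09", "1977-01-20", "1981-01-20")),
  party = c("Republican", "Republican", "Democratic"),
  stringsAsFactors = FALSE
)

# By name: data frame first, then the columns to keep, separated by commas
by_name <- select(presidents, name, party)

# By numerical index: the first three columns
by_index <- select(presidents, 1:3)

# By a range of names: every column from name through end
by_range <- select(presidents, name:end)

names(by_name)
names(by_index)
```

Selecting by name is usually the safer choice, since it keeps working if the column order changes.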

FILTER()
• At left, a data frame that contains matching entries in a certain column for only a subset of the rows. At right, the resulting data frame after filtering.
FILTER()…
• The first argument to filter() is a data frame, and subsequent arguments are logical conditions that are evaluated on any involved columns.
• If we want to retrieve only those rows that pertain to Republican presidents, we need to specify that the value of the party variable is equal to "Republican".
• Note that == is a test for equality. If we were to use only a single equal sign here, we would be asserting that the value of party was "Republican", which would result in an error.
• The quotation marks around "Republican" are necessary here, since Republican is a literal value, and not a variable name.
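A minimal sketch of this filter() call follows, assuming a small hand-typed presidents data frame in which party is stored as the strings "Republican" and "Democratic".

```r
library(dplyr)

# Hypothetical stand-in data
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan"),
  party = c("Republican", "Republican", "Democratic", "Republican"),
  stringsAsFactors = FALSE
)

# == tests each row's party against the quoted literal value
republicans <- filter(presidents, party == "Republican")
republicans$name

# A single = would be read as a named argument rather than a comparison;
# current dplyr versions stop with an error that suggests using == instead:
# filter(presidents, party = "Republican")
```

Without the quotation marks, R would look for a variable called Republican instead of comparing against the text value.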
COMBINING FILTER() AND SELECT()
• Find which Democratic presidents served since Watergate.
• The filter() operation is nested inside the select(). Each of the five verbs takes and returns a data frame, which makes this type of nesting possible.
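One way the nested call might look is sketched below. The data frame is a hand-typed stand-in, and treating "since Watergate" as "took office after 1 January 1973" is an assumption made for this example.

```r
library(dplyr)

# Hypothetical stand-in data
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1993-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Democratic"),
  stringsAsFactors = FALSE
)

# Assumed cutoff for "since Watergate"
cutoff <- as.Date("1973-01-01")

# The inner filter() keeps the matching rows;
# the outer select() then keeps only the name column.
recent_democrats <- select(
  filter(presidents, start > cutoff & party == "Democratic"),
  name
)
recent_democrats
```

The nested form has to be read from the inside out, which becomes harder as more verbs are added.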
PIPELINE
• Pipe-forwarding is an alternative to nesting that yields code that can be easily read from top to bottom. With the pipe, we can write a nested expression, such as a filter() inside a select(), in a more readable syntax.
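The equivalence between nesting and piping can be sketched as follows, using a hand-typed presidents data frame and an assumed cutoff date (both are illustrative, not taken from the lecture).

```r
library(dplyr)

# Hypothetical stand-in data
presidents <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1993-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Democratic"),
  stringsAsFactors = FALSE
)
cutoff <- as.Date("1973-01-01")  # assumed cutoff date

# Nested form: read from the inside out
nested <- select(
  filter(presidents, start > cutoff & party == "Democratic"),
  name
)

# Piped form: x %>% f(y) is evaluated as f(x, y), so each step
# passes its result on as the first argument of the next verb
piped <- presidents %>%
  filter(start > cutoff & party == "Democratic") %>%
  select(name)

identical(nested$name, piped$name)
```

The piped form works because every verb takes a data frame as its first argument and returns a data frame.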
MUTATE()
• While we have the raw data on when each of these presidents took and relinquished office, we don’t actually have a numeric variable giving the length of each president’s term.
• Of course, we can derive this information from the dates given, and add the result as a new column to our data frame.
• This date arithmetic is made easier through the use of the lubridate package, which we use to compute the number of years (dyears()) that elapsed during the interval() from the start until the end of each president’s term.
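A minimal sketch of this calculation is given below. It assumes the lubridate package is installed, uses a hand-typed two-row data frame, and names the new column term.length purely for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in data with start and end dates of each term
presidents <- data.frame(
  name  = c("Carter", "Reagan"),
  start = as.Date(c("1977-01-20", "1981-01-20")),
  end   = as.Date(c("1981-01-20", "1989-01-20")),
  stringsAsFactors = FALSE
)

# interval() builds the time span from start to end;
# dividing by dyears(1) expresses that span as a number of years
with_length <- mutate(
  presidents,
  term.length = interval(start, end) / dyears(1)
)

# Roughly 4 years for Carter and 8 years for Reagan
with_length$term.length
```

The result is an ordinary numeric column, so it can be sorted, averaged, or plotted like any other variable.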
MUTATE()…
• The mutate() function can also be used to modify the data in an existing column.
• Suppose that we wanted to add to our data frame a variable containing the year in which each president was elected.
• Our first (naïve) attempt might assume that every president was elected in the year before he took office.
• mutate() returns a data frame, so if we want to modify our existing data frame, we need to overwrite it with the results.
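The two uses of mutate() described above might be sketched like this. The data frame is a hand-typed stand-in, and the correction step (setting the value to NA for Johnson and Ford, who took office mid-term without being elected the year before) is an illustrative assumption about how the naïve column could be repaired.

```r
library(dplyr)

# Hypothetical stand-in data
presidents <- data.frame(
  name  = c("Johnson", "Nixon", "Ford", "Carter"),
  start = as.Date(c("1963-11-22", "1969-01-20",
                    "1974-08-09", "1977-01-20")),
  stringsAsFactors = FALSE
)

# Naive attempt: election year = year of taking office minus one.
# The result is assigned back to presidents, otherwise it would be lost.
presidents <- mutate(
  presidents,
  elected = as.integer(format(start, "%Y")) - 1L
)

# Modify the existing column: 1962 and 1973 are artifacts of the naive
# rule (Johnson and Ford succeeded mid-term), so replace them with NA.
presidents <- mutate(
  presidents,
  elected = ifelse(elected %in% c(1962L, 1973L), NA, elected)
)
presidents
```

Using the same column name on the left-hand side overwrites the column instead of adding a new one.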
RENAME()
• It is considered bad practice to use “.” in the names of functions, data frames, and variables in R.
• A “.” in a name could also conflict with R’s use of generic functions (i.e., R’s mechanism for method overloading).
• Thus, we should change the name of any such column with the rename() function.
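A short sketch of rename() follows, assuming a hypothetical column called term.length that should become term_length. Note that the new name goes on the left of the equal sign and the old name on the right.

```r
library(dplyr)

# Hypothetical stand-in data with a "." in one column name
presidents <- data.frame(
  name        = c("Carter", "Reagan"),
  term.length = c(4, 8),
  stringsAsFactors = FALSE
)

# rename(data, new_name = old_name); other columns are left untouched
presidents <- rename(presidents, term_length = term.length)
names(presidents)
```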
ARRANGE()
• The function sort() will sort a vector but not a data frame. The function that will sort a data frame is called arrange().
• In order to apply arrange() to a data frame, you have to specify the data frame and the column by which you want it to be sorted.
• You can also specify the direction in which you want it to be sorted: the default is ascending order, and wrapping a column in desc() gives descending order. Specifying multiple sort conditions will help break ties.
• To sort our presidential data frame by the length of each president’s term, we specify that we want the column term_length in descending order.
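The sorting described above could look like the sketch below. The data frame is hand-typed, with term_length given in whole years for presidents who served complete terms; using party as the tie-breaker is an assumption for illustration.

```r
library(dplyr)

# Hypothetical stand-in data; term_length in years
presidents <- data.frame(
  name        = c("Carter", "Reagan", "G. H. W. Bush", "Clinton"),
  party       = c("Democratic", "Republican", "Republican", "Democratic"),
  term_length = c(4, 8, 4, 8),
  stringsAsFactors = FALSE
)

# Longest terms first; ties in term_length are broken by party (ascending)
sorted <- arrange(presidents, desc(term_length), party)
sorted$name
```

Dropping desc() would sort the shortest terms first, since ascending order is the default.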
SUMMARIZE()
• The summarize() function is nearly always used in conjunction with group_by().
• The previous four verbs provided us with means to manipulate a data frame in powerful and flexible ways.
• But the extent of the analysis we can perform with these four verbs alone is limited.
• On the other hand, summarize() with group_by() enables us to make comparisons.
SUMMARIZE()…
• When used alone, summarize() collapses a data frame into a single row. Critically, we have to specify how we want to reduce an entire column of data into a single value.
• The first argument is a data frame, followed by a list of variables that will appear in the output.
• Note that every variable in the output is defined by operations performed on vectors, not on individual values.
• This is essential, since if the specification of an output variable is not an operation on a vector, there is no way for R to know how to collapse each column.
• For example, the function n() simply counts the number of rows.
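A minimal sketch of summarize() used on its own is shown below, with a hand-typed data frame and output column names (N, avg_term_length, longest_term) chosen only for illustration.

```r
library(dplyr)

# Hypothetical stand-in data; term_length in years
presidents <- data.frame(
  name        = c("Carter", "Reagan", "G. H. W. Bush", "Clinton"),
  term_length = c(4, 8, 4, 8),
  stringsAsFactors = FALSE
)

# Each output variable reduces a whole column (a vector) to one value
overall <- summarize(
  presidents,
  N               = n(),               # number of rows
  avg_term_length = mean(term_length), # mean over the whole column
  longest_term    = max(term_length)   # maximum over the whole column
)
overall
```

The result is still a data frame, here with one row and one column per summary.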
SUMMARIZE()…
• Suppose we want to know whether Democratic or Republican presidents served a longer average term during this time period.
• To figure this out, we can just execute summarize() again, but this time, instead of the first argument being the data frame itself, we specify that the rows of the data frame should be grouped by the values of the party variable.
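The grouped comparison can be sketched as follows. The data frame is a hand-typed subset of presidents who served complete terms, so the averages it produces are illustrative only and are not the lecture's results.

```r
library(dplyr)

# Hypothetical stand-in data; term_length in years
presidents <- data.frame(
  name        = c("Eisenhower", "Carter", "Reagan", "G. H. W. Bush",
                  "Clinton", "G. W. Bush", "Obama"),
  party       = c("Republican", "Democratic", "Republican", "Republican",
                  "Democratic", "Republican", "Democratic"),
  term_length = c(8, 4, 8, 4, 8, 8, 8),
  stringsAsFactors = FALSE
)

# group_by() marks the groups; summarize() then collapses each group
# to a single row instead of collapsing the whole data frame
by_party <- presidents %>%
  group_by(party) %>%
  summarize(
    N               = n(),
    avg_term_length = mean(term_length)
  )
by_party
```

The output has one row per party, which makes the two averages directly comparable.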
