DSF 11-12
FUNDAMENTALS
DSC293
Lecture 11-12
Dr. Hufsa Mohsin
A GRAMMAR FOR DATA WRANGLING
The data frame is a key data structure in statistics and in R.
The basic structure of a data frame is that there is one observation per row
and each column represents a variable, a measure, feature, or
characteristic of that observation.
THE DPLYR PACKAGE
The dplyr package was developed by Hadley Wickham of RStudio and is an
optimized and distilled version of his plyr package.
One important contribution of the dplyr package is that it provides a
“grammar” (in particular, verbs) for data manipulation and for operating on
data frames.
With this grammar, you can describe what you are doing to a data frame
in a way that other people can understand (assuming they also know the
grammar).
DPLYR GRAMMAR
select: return a subset of the columns of a data frame, using a flexible
notation
filter: extract a subset of rows from a data frame based on logical
conditions
arrange: reorder rows of a data frame
rename: rename variables in a data frame
mutate: add new variables/columns or transform existing variables
summarize: generate summary statistics of different variables in the data
frame, possibly within strata
%>%: the “pipe” operator is used to connect multiple verb actions
together into a pipeline
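The verbs listed above can be chained into one pipeline. The following is a minimal sketch under stated assumptions: the readings data frame and its columns are hypothetical, and group_by() (not in the list above) is used here as one way to summarize "within strata".

```r
library(dplyr)

# Hypothetical toy data: one temperature reading per row
readings <- data.frame(
  city   = c("A", "A", "B", "B"),
  temp_c = c(21.5, 19.0, 25.0, 27.0),
  stringsAsFactors = FALSE
)

by_city <- readings %>%
  rename(temperature = temp_c) %>%              # rename(new_name = old_name)
  group_by(city) %>%                            # define the strata
  summarize(mean_temp = mean(temperature)) %>%  # one summary row per city
  arrange(desc(mean_temp))                      # highest mean first

by_city
```

Each step receives the data frame produced by the previous step, so the pipeline reads top to bottom in the order the operations happen.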
COMMON DPLYR FUNCTION PROPERTIES
1. The first argument is a data frame.
2. The subsequent arguments describe what to do with the data frame
specified in the first argument, and you can refer to columns in the data
frame directly without using the $ operator (just use the column names).
3. The return result of a function is a new data frame.
4. Data frames must be properly formatted and annotated for all of this to
be useful. In particular, the data must be tidy.
In short, there should be one observation per row, and each column should
represent a feature or characteristic of that observation.
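The first three properties can be seen in a single call. This is a small sketch with a hypothetical scores data frame; the base R line is included only for comparison.

```r
library(dplyr)

# Hypothetical tidy data: one student per row, one variable per column
scores <- data.frame(
  student = c("x", "y", "z"),
  marks   = c(55, 72, 90),
  stringsAsFactors = FALSE
)

# Base R: columns are reached through the $ operator
passed_base <- scores[scores$marks >= 60, ]

# dplyr: the data frame comes first, then columns are used by bare name
passed <- filter(scores, marks >= 60)

is.data.frame(passed)  # the result is itself a data frame
nrow(scores)           # the original data frame is left unchanged
```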
install.packages("dplyr")
After installing the package, it is important that you load it into your R
session with the library() function.
library(dplyr)
SELECT()
The select() function can be used to select columns of a data frame that
you want to focus on. Often you’ll have a large data frame containing “all”
of the data, but any given analysis might only use a subset of variables or
observations.
SELECT()…
The select() function allows you to get the few columns you might need.
Suppose we wanted to take the first 3 columns only. There are a few ways
to do this. We could for example use numerical indices. But we can also
use the names directly.
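As an illustration of the two approaches just described, here is a sketch using a hypothetical measurements data frame (the column names are made up for the example).

```r
library(dplyr)

# Hypothetical data frame with four columns
measurements <- data.frame(
  id    = 1:3,
  day   = c("Mon", "Tue", "Wed"),
  value = c(2.5, 3.1, 4.8),
  note  = c("ok", "ok", "check"),
  stringsAsFactors = FALSE
)

by_index <- select(measurements, 1:3)             # numerical indices
by_range <- select(measurements, id:value)        # a range of column names
by_name  <- select(measurements, id, day, value)  # names listed one by one

names(by_index)
```

Selecting by name is usually the more readable choice, since the code keeps working if the column order changes.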
SELECT()…
To retrieve only the names and party
affiliations of these presidents, we would
use select(). The first argument to
the function is the data frame, followed by
an arbitrarily long list of column names,
separated by commas.
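A sketch of that call follows. The slides do not show the data frame itself, so pres_terms below is a hypothetical stand-in built by hand, with column names name, start, end, and party assumed for the example.

```r
library(dplyr)

# Hypothetical stand-in for the presidents data frame used in the slides
pres_terms <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1989-01-20", "1993-01-20")),
  end   = as.Date(c("1974-08-09", "1977-01-20", "1981-01-20",
                    "1989-01-20", "1993-01-20", "2001-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic"),
  stringsAsFactors = FALSE
)

# Data frame first, then the column names separated by commas
names_and_parties <- select(pres_terms, name, party)
names_and_parties
```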
FILTER()
At left, a data frame that contains matching entries in a certain column for
only a subset of the rows. At right, the resulting data frame after filtering.
FILTER()…
The first argument to filter() is a data frame, and
subsequent arguments are logical conditions
that are evaluated on any involved columns.
If we want to retrieve only those rows that
pertain to Republican presidents, we need to
specify that the value of the party variable is
equal to "Republican": party == "Republican".
Note that == is a test for equality. If we
were to use only a single equal sign here, we
would be asserting that the value of party was
"Republican" rather than testing it.
This would result in an error. The quotation
marks around "Republican" are necessary here,
since it is a literal value, and not a
variable name.
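The equality test can be sketched as follows. The pres_terms data frame is a hypothetical stand-in for the one on the slides, and the capitalized party labels are an assumption of this example.

```r
library(dplyr)

# Hypothetical stand-in for the presidents data frame
pres_terms <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic"),
  stringsAsFactors = FALSE
)

# == tests each row's party against the quoted literal value
republicans <- filter(pres_terms, party == "Republican")
republicans

# A single = is treated as a named argument, and dplyr stops with an error:
# filter(pres_terms, party = "Republican")
```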
COMBINING FILTER() AND SELECT()
Find which Democratic presidents
served since Watergate.
Nesting filter() inside select() is equivalent to
chaining the two verbs with the %>% operator.
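One possible way to write this query in both forms is sketched below. The pres_terms data frame is a hypothetical stand-in, and treating "since Watergate" as a term starting after 1 January 1973 is an assumption made for the example.

```r
library(dplyr)

# Hypothetical stand-in for the presidents data frame
pres_terms <- data.frame(
  name  = c("Nixon", "Ford", "Carter", "Reagan", "Bush", "Clinton"),
  start = as.Date(c("1969-01-20", "1974-08-09", "1977-01-20",
                    "1981-01-20", "1989-01-20", "1993-01-20")),
  party = c("Republican", "Republican", "Democratic",
            "Republican", "Republican", "Democratic"),
  stringsAsFactors = FALSE
)

cutoff <- as.Date("1973-01-01")  # assumed cutoff for "since Watergate"

# Nested form: read from the inside out
nested <- select(filter(pres_terms, start > cutoff, party == "Democratic"), name)

# Piped form: read from top to bottom
piped <- pres_terms %>%
  filter(start > cutoff, party == "Democratic") %>%
  select(name)

identical(nested, piped)
```

Several conditions passed to filter() are combined with a logical AND, so only rows meeting both are kept.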
MUTATE()
While we have the raw data on when
each of these presidents took and
relinquished office, we don’t actually
have a numeric variable giving the
length of each president’s term.
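A numeric term length can be added with mutate(). This is a minimal sketch: pres_terms is a hypothetical two-row stand-in, the new column names are made up, and dividing by 365.25 is only an approximate conversion from days to years.

```r
library(dplyr)

# Hypothetical stand-in with start and end dates stored as Date values
pres_terms <- data.frame(
  name  = c("Carter", "Reagan"),
  start = as.Date(c("1977-01-20", "1981-01-20")),
  end   = as.Date(c("1981-01-20", "1989-01-20")),
  stringsAsFactors = FALSE
)

with_length <- mutate(
  pres_terms,
  term_days  = as.numeric(difftime(end, start, units = "days")),
  term_years = term_days / 365.25  # approximate; uses the column created above
)
with_length
```

The original columns are kept, and the new columns are appended to the right of the returned data frame.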