What Is Dplyr
select( ) Function
It is used to select only desired variables.
select() syntax : select(data , ....)
data : Data Frame
.... : Variables by name or by function
Example 6 : Selecting Variables (or Columns)
Suppose you are asked to select only a few variables. The code below
selects the variable "Index" and all columns from "State" through "Y2008".
mydata2 = select(mydata, Index, State:Y2008)
Example 7 : Dropping Variables
The minus sign before a variable tells R to drop the variable.
mydata = select(mydata, -Index, -State)
The above code can also be written like :
mydata = select(mydata, -c(Index,State))
Example 8 : Selecting or Dropping Variables that start
with 'Y'
The starts_with() function is used to select variables whose names start
with a given prefix.
mydata3 = select(mydata, starts_with("Y"))
Adding a negative sign before starts_with() drops the
variables whose names start with 'Y'.
mydata33 = select(mydata, -starts_with("Y"))
The following functions help you to select variables
based on their names.
Helpers         Description
starts_with()   Starts with a prefix
ends_with()     Ends with a suffix
contains()      Contains a literal string
matches()       Matches a regular expression
num_range()     Numerical range like x01, x02, x03.
one_of()        Variables in character vector.
everything()    All variables.
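The helpers in the table can be tried on a small data frame. The sketch below is only an illustration: the data frame `toy` and its column names are hypothetical and are not part of the tutorial's dataset.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Index = c("A", "C"),
                  State = c("Alabama", "Colorado"),
                  Y2002 = c(10, 20),
                  Y2003 = c(30, 40),
                  x01 = c(1, 2),
                  x02 = c(3, 4))

starts_y  <- names(select(toy, starts_with("Y")))      # names beginning with "Y"
ends_3    <- names(select(toy, ends_with("3")))        # names ending with "3"
has_ta    <- names(select(toy, contains("ta")))        # names containing "ta"
regex_y   <- names(select(toy, matches("^Y[0-9]+$")))  # regular-expression match
x_range   <- names(select(toy, num_range("x", 1:2, width = 2)))  # x01, x02
reordered <- names(select(toy, State, everything()))   # State first, then the rest

print(list(starts_y, ends_3, has_ta, regex_y, x_range, reordered))
```

Note that `everything()` is handy for reordering: naming one column first and then `everything()` moves that column to the front without dropping the others.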
rename( ) Function
It is used to change variable name.
rename() syntax : rename(data , new_name = old_name)
data : Data Frame
new_name : New variable name you want to keep
old_name : Existing Variable Name
Example 11 : Rename Variables
The rename function can be used to rename variables.
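The code for this example is not shown here, so the following is only a minimal sketch of the `rename(data, new_name = old_name)` syntax on a hypothetical data frame `toy`. The new name `Index1` is an assumption chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Index = c("A", "C"), Y2002 = c(10, 20))

# new_name = old_name: the column Index becomes Index1
renamed <- rename(toy, Index1 = Index)
print(names(renamed))
```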
filter( ) Function
It is used to subset data with matching logical conditions.
filter() syntax : filter(data , ....)
data : Data Frame
.... : Logical Condition
Example 12 : Filter Rows
Suppose you need to subset data. You want to filter rows and retain
only those rows in which Index is equal to "A".
mydata7 = filter(mydata, Index == "A")
  Index State    Y2002   Y2003   Y2004   Y2005   Y2006   Y2007   Y2008   Y2009
1 A     Alabama  1296530 1317711 1118631 1492583 1107408 1440134 1945229 1944173
2 A     Alaska   1170302 1960378 1818085 1447852 1861639 1465841 1551826 1436541
3 A     Arizona  1742027 1968140 1377583 1782199 1102568 1109382 1752886 1554330
4 A     Arkansas 1485531 1994927 1119299 1947979 1669191 1801213 1188104 1628980
Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
1 1237582 1440756 1186741 1852841 1558906 1916661
2 1629616 1230866 1512804 1985302 1580394 1979143
3 1300521 1130709 1907284 1363279 1525866 1647724
4 1669295 1928238 1216675 1591896 1360959 1329341
Example 13 : Multiple Selection Criteria
The %in% operator can be used to select multiple items. In the
following program, we are telling R to select rows whose value in
column 'Index' is either 'A' or 'C'.
mydata7 = filter(mydata6, Index %in% c("A", "C"))
Example 14 : 'AND' Condition in Selection Criteria
Suppose you need to apply an 'AND' condition. In this case, we are
picking data for 'A' and 'C' in the column 'Index' and income greater
than or equal to 1.3 million in Year 2002.
mydata8 = filter(mydata6, Index %in% c("A", "C") & Y2002 >= 1300000 )
Example 15 : 'OR' Condition in Selection Criteria
The '|' denotes OR in the logical condition. It means that either of the two
conditions may hold.
mydata9 = filter(mydata6, Index %in% c("A", "C") | Y2002 >= 1300000)
Example 16 : NOT Condition
The "!" sign is used to reverse the logical condition.
mydata10 = filter(mydata6, !Index %in% c("A", "C"))
Example 17 : CONTAINS Condition
The grepl function tests whether a pattern matches each value. In the
following code, we are looking for records in which the
column State contains 'Ar' in its name (the match is case sensitive).
mydata10 = filter(mydata6, grepl("Ar", State))
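The filter conditions from Examples 13 to 17 can be compared side by side on a tiny data frame. This is a sketch under assumptions: `toy` and its values are made up for illustration and do not come from the tutorial's data.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Index = c("A", "A", "C", "D"),
                  State = c("Alabama", "Arkansas", "Colorado", "Delaware"),
                  Y2002 = c(1200000, 1500000, 1400000, 1600000))

both   <- filter(toy, Index %in% c("A", "C") & Y2002 >= 1300000)  # AND
either <- filter(toy, Index %in% c("A", "C") | Y2002 >= 1300000)  # OR
others <- filter(toy, !Index %in% c("A", "C"))                    # NOT
has_ar <- filter(toy, grepl("Ar", State))                         # pattern match

print(c(nrow(both), nrow(either), nrow(others), nrow(has_ar)))
```

The AND version keeps only rows that satisfy both conditions, so it can never return more rows than the OR version on the same data.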
summarise( ) Function
It is used to summarize data.
summarise() syntax : summarise(data , ....)
data : Data Frame
..... : Summary Functions such as mean, median etc
Example 18 : Summarize selected variables
In the example below, we are calculating mean and median for the
variable Y2015.
summarise(mydata, Y2015_mean = mean(Y2015),
Y2015_med=median(Y2015))
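To see what summarise() returns, here is a small worked sketch on a hypothetical three-row data frame `toy` (an assumption for illustration only). The result is a data frame with one row and one column per summary.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Y2015 = c(10, 20, 60))

res <- summarise(toy,
                 Y2015_mean = mean(Y2015),    # (10 + 20 + 60) / 3
                 Y2015_med  = median(Y2015))  # middle value
print(res)
```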
arrange() function
Use : Sort data
Syntax
arrange(data_frame, variable(s)_to_sort)
or
data_frame %>% arrange(variable(s)_to_sort)
To sort a variable in descending order, use desc(x).
Example 23 : Sort Data by Multiple Variables
The default sorting order of arrange() function is ascending. In this
example, we are sorting data by multiple variables.
arrange(mydata, Index, Y2011)
Suppose you need to sort one variable in descending order and the other
variable in ascending order.
arrange(mydata, desc(Index), Y2011)
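The effect of the two calls above can be checked on a small example. The data frame `toy` below is hypothetical and only meant to make the ordering visible.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Index = c("A", "C", "A", "C"),
                  Y2011 = c(30, 10, 20, 40))

asc   <- arrange(toy, Index, Y2011)        # both keys ascending
mixed <- arrange(toy, desc(Index), Y2011)  # Index descending, Y2011 ascending

print(asc)
print(mixed)
```

In both cases the second variable only breaks ties within equal values of the first variable.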
group_by() function
Use : Group data by categorical variable
Syntax :
group_by(data, variables)
or
data %>% group_by(variables)
Example 24 : Summarise Data by Categorical Variable
We are calculating count and mean of variables Y2011 and Y2012 by
variable Index.
t = summarise_at(group_by(mydata, Index), vars(Y2011, Y2012), funs(n(),
mean(., na.rm = TRUE)))
The same pattern can be written with the pipe operator. The version below covers Y2011 through Y2015, which is what the output shows.
t = mydata %>% group_by(Index) %>%
summarise_at(vars(Y2011:Y2015), funs(n(), mean(., na.rm = TRUE)))
Index Y2011_n Y2012_n Y2013_n Y2014_n Y2015_n Y2011_mean Y2012_mean
A 4 4 4 4 4 1432642 1455876
C 3 3 3 3 3 1750357 1547326
D 2 2 2 2 2 1336059 1981868
F 1 1 1 1 1 1497051 1131928
G 1 1 1 1 1 1851245 1850111
H 1 1 1 1 1 1902816 1695126
I 4 4 4 4 4 1690171 1687056
K 2 2 2 2 2 1489353 1899773
L 1 1 1 1 1 1210385 1234234
M 8 8 8 8 8 1582714 1586091
N 8 8 8 8 8 1448351 1470316
O 3 3 3 3 3 1882111 1602463
P 1 1 1 1 1 1483292 1290329
R 1 1 1 1 1 1781016 1909119
S 2 2 2 2 2 1381724 1671744
T 2 2 2 2 2 1724080 1865787
U 1 1 1 1 1 1288285 1108281
V 2 2 2 2 2 1482143 1488651
W 4 4 4 4 4 1711341 1660192
Since dplyr version 1.0.0, summarise() may print the following informational messages (they are messages, not warnings).
#`summarise()` ungrouping output (override with `.groups` argument)
#`summarise()` regrouping output by xxx (override with `.groups` argument)
To suppress these messages, you can use the following command.
options(dplyr.summarise.inform = FALSE)
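As an alternative to the global option, the `.groups` argument named in the message can be set per call. The sketch below also uses across() with a named list of functions in place of funs(), which is deprecated in recent dplyr releases. It assumes dplyr 1.0.0 or later, and the data frame `toy` is hypothetical.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Index = c("A", "A", "C"),
                  Y2011 = c(10, 30, 50),
                  Y2012 = c(20, NA, 60))

res <- toy %>%
  group_by(Index) %>%
  summarise(across(c(Y2011, Y2012),
                   list(n = ~ n(), mean = ~ mean(.x, na.rm = TRUE))),
            .groups = "drop")  # return an ungrouped result, no message

print(res)
```

The output columns are named by combining the column name and the function name, for example `Y2011_n` and `Y2011_mean`.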
do() function
Use : Compute within groups
Syntax :
do(data_frame, expressions_to_apply_to_each_group)
Note : The dot (.) is required to refer to a data frame.
Example 25 : Filter Data within a Categorical Variable
Suppose you need to pull top 2 rows from 'A', 'C' and 'I' categories of
variable Index.
t = mydata %>% filter(Index %in% c("A", "C","I")) %>% group_by(Index) %>%
do(head( . , 2))
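In dplyr 1.0.0 and later, do() is marked as superseded, and the same "top rows per group" result can be sketched with slice_head(). The example below is an assumption-laden illustration on a hypothetical data frame `toy`, not the tutorial's own code.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Index = c("A", "A", "A", "C", "C", "I"),
                  Y2015 = c(1, 2, 3, 4, 5, 6))

top2 <- toy %>%
  filter(Index %in% c("A", "C", "I")) %>%
  group_by(Index) %>%
  slice_head(n = 2) %>%  # first two rows within each group
  ungroup()

print(top2)
```

Groups with fewer than two rows, such as 'I' here, simply return all the rows they have.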
mutate() function
Use :Creates new variables
Syntax :
mutate(data_frame, expression(s) )
or
data_frame %>% mutate(expression(s))
Example 28 : Create a new variable
The following code divides Y2015 by Y2014 and names the result
"change".
mydata1 = mutate(mydata, change=Y2015/Y2014)
Example 29 : Multiply all the variables by 1000
It creates new variables and names them with the suffix "_new".
mydata11 = mutate_all(mydata, funs("new" = .* 1000))
Output (truncated in the original due to the high number of variables)
Warning messages:
1: In Ops.factor(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 5L, 6L, :
‘*’ not meaningful for factors
2: In Ops.factor(1:51, 1000) : ‘*’ not meaningful for factors
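The warnings above arise because the multiplication is also attempted on factor columns. One way to avoid them, assuming dplyr 1.0.0 or later, is to restrict the operation to numeric columns with across() and where(). The data frame `toy` below is hypothetical.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(State = c("Alabama", "Alaska"),
                  Y2014 = c(1, 2),
                  Y2015 = c(3, 4))

scaled <- toy %>%
  mutate(across(where(is.numeric), ~ .x * 1000, .names = "{.col}_new"))

print(scaled)
```

The `.names` argument keeps the original columns and appends the "_new" suffix to the created ones, mirroring the naming used in Example 29.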
By default, min_rank() assigns 1 to the smallest value and the highest
rank to the largest value. If you need to assign rank 1 to the
largest value of a variable, use min_rank(desc(.))
mydata13 = mutate_at(mydata, vars(Y2008:Y2010),
funs(Rank=min_rank(desc(.))))
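The difference between the two rank directions is easy to see on a short vector with a tie. This is a minimal sketch, and the vector `x` is made up for illustration.

```r
library(dplyr)

x <- c(10, 30, 20, 30)

asc_rank  <- min_rank(x)        # smallest value gets rank 1
desc_rank <- min_rank(desc(x))  # largest value gets rank 1

print(asc_rank)
print(desc_rank)
```

Tied values share the lowest rank of the tie, and the next rank is skipped, which is why one rank is missing from each result.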
Example 31 : Select the State with the highest
income within each 'Index' group
out = mydata %>% group_by(Index) %>%
filter(min_rank(desc(Y2015)) == 1) %>%
select(Index, State, Y2015)
Index State Y2015
1 A Alaska 1979143
2 C Connecticut 1718072
3 D Delaware 1627508
4 F Florida 1170389
5 G Georgia 1725470
6 H Hawaii 1150882
7 I Idaho 1757171
8 K Kentucky 1913350
9 L Louisiana 1403857
10 M Missouri 1996005
11 N New Hampshire 1963313
12 O Oregon 1893515
13 P Pennsylvania 1668232
14 R Rhode Island 1611730
15 S South Dakota 1136443
16 T Texas 1705322
17 U Utah 1729273
18 V Virginia 1850394
19 W Wyoming 1853858
Example 32 : Cumulative Income within each 'Index' group
The cumsum function calculates the cumulative sum of a variable.
With the mutate function, we insert a new variable called 'Total' which
contains the cumulative income within each category of the variable Index.
out2 = mydata %>% group_by(Index) %>% mutate(Total=cumsum(Y2015)) %>%
select(Index, Y2015, Total)
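Because the data is grouped before mutate(), the running total restarts in every group. The sketch below shows this on a hypothetical data frame `toy`.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Index = c("A", "A", "C", "C"),
                  Y2015 = c(1, 2, 10, 20))

out <- toy %>%
  group_by(Index) %>%
  mutate(Total = cumsum(Y2015)) %>%  # running sum within each Index group
  ungroup()

print(out)
```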
join functions
Use : Join two datasets
Syntax :
inner_join(x, y, by = )
left_join(x, y, by = )
right_join(x, y, by = )
full_join(x, y, by = )
semi_join(x, y, by = )
anti_join(x, y, by = )
x, y - datasets (or tables) to merge / join
by - common variable (primary key) to join by.
Example 33 : Common rows in both the tables
df1 = data.frame(ID = c(1, 2, 3, 4, 5),
w = c('a', 'b', 'c', 'd', 'e'),
x = c(1, 1, 0, 0, 1),
y=rnorm(5),
z=letters[1:5])
df2 = data.frame(ID = c(1, 7, 3, 6, 8),
a = c('z', 'b', 'k', 'd', 'l'),
b = c(1, 2, 3, 0, 4),
c =rnorm(5),
d =letters[2:6])
INNER JOIN returns rows when there is a match in both tables. In this
example, we are merging df1 and df2 with ID as common variable
(primary key).
df3 = inner_join(df1, df2, by = "ID")
If the primary key does not have same name in both the tables, try the
following way:
inner_join(df1, df2, by = c("ID"="ID1"))
Example 34 : Applying LEFT JOIN
LEFT JOIN : It returns all rows from the left table, even if there are no
matches in the right table.
left_join(df1, df2, by = "ID")
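To compare the join types from the syntax list, the sketch below runs them on two small tables that reuse the ID values from Example 33. The names `left_tbl` and `right_tbl` are hypothetical and chosen only for this illustration.

```r
library(dplyr)

# Hypothetical tables reusing the ID values from Example 33
left_tbl  <- data.frame(ID = c(1, 2, 3, 4, 5), w = c("a", "b", "c", "d", "e"))
right_tbl <- data.frame(ID = c(1, 7, 3, 6, 8), a = c("z", "b", "k", "d", "l"))

inner <- inner_join(left_tbl, right_tbl, by = "ID")  # IDs present in both
lft   <- left_join(left_tbl, right_tbl, by = "ID")   # all rows of left_tbl
full  <- full_join(left_tbl, right_tbl, by = "ID")   # all IDs from either table
semi  <- semi_join(left_tbl, right_tbl, by = "ID")   # left rows with a match
anti  <- anti_join(left_tbl, right_tbl, by = "ID")   # left rows without a match

print(c(nrow(inner), nrow(lft), nrow(full), nrow(semi), nrow(anti)))
```

semi_join() and anti_join() only filter the left table, so their results never contain columns from the right table.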
Nested IF ELSE
Multiple IF ELSE statements can be written by nesting the if_else() function.
See the example below -
mydf =data.frame(x = c(1:5,NA))
mydf %>% mutate(flag = if_else(is.na(x),"I am missing",
if_else(x==1,"I am one",
if_else(x==2,"I am two",
if_else(x==3,"I am three","Others")))))
Output
x flag
1 1 I am one
2 2 I am two
3 3 I am three
4 4 Others
5 5 Others
6 NA I am missing
SQL-Style CASE WHEN Statement
The case_when() function replaces nested if-else logic with a flat list of
condition ~ value pairs. Variables can be referenced directly inside
case_when(). Conditions are evaluated in order, and TRUE acts as the ELSE statement.
mydf %>% mutate(flag = case_when(is.na(x) ~ "I am missing",
x == 1 ~ "I am one",
x == 2 ~ "I am two",
x == 3 ~ "I am three",
TRUE ~ "Others"))
Important Point
Make sure you put the is.na() condition first in a nested if_else().
Otherwise, a comparison such as x == 1 evaluates to NA for missing values, so the is.na() branch is never reached for those rows.
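The effect of the condition order can be demonstrated directly. The following is a minimal sketch with a hypothetical vector `x` that contains one missing value.

```r
library(dplyr)

x <- c(1, 2, NA)

# is.na() checked first: the missing value gets its own label
na_first <- if_else(is.na(x), "I am missing",
                    if_else(x == 1, "I am one", "Others"))

# is.na() checked last: x == 1 is NA for the missing value,
# so if_else() returns NA before the inner check is consulted
na_last <- if_else(x == 1, "I am one",
                   if_else(is.na(x), "I am missing", "Others"))

print(na_first)
print(na_last)
```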
Example 39 : Apply ROW WISE Operation
Suppose you want to find the maximum value in each row across the variables
Y2012, Y2013, Y2014 and Y2015. The rowwise() function allows you to apply
functions to rows.
df = mydata %>%
rowwise() %>% mutate(Max= max(Y2012,Y2013,Y2014,Y2015)) %>%
select(Y2012:Y2015,Max)
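The sketch below shows why rowwise() matters here: without it, max() returns a single maximum over all rows. For this particular task, base R's vectorised pmax() gives the same per-row result. The data frame `toy` is hypothetical.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Y2012 = c(1, 9), Y2013 = c(5, 2), Y2014 = c(3, 4))

by_row  <- toy %>% rowwise() %>%
  mutate(Max = max(Y2012, Y2013, Y2014)) %>%
  ungroup()
overall <- toy %>% mutate(Max = max(Y2012, Y2013, Y2014))   # no rowwise()
by_pmax <- toy %>% mutate(Max = pmax(Y2012, Y2013, Y2014))  # vectorised

print(by_row$Max)
print(overall$Max)
print(by_pmax$Max)
```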
Example 40 : Combine Data Frames
Suppose you are asked to combine two data frames. Let's first create
two sample datasets.
df1=data.frame(ID = 1:6, x=letters[1:6])
df2=data.frame(ID = 7:12, x=letters[7:12])
The bind_rows() function combines two datasets by rows. So the
combined dataset would contain 12 rows (6+6) and 2 columns.
xy = bind_rows(df1,df2)
It is equivalent to base R function rbind.
xy = rbind(df1,df2)
The bind_cols() function combines two datasets by columns. So the
combined dataset would contain 4 columns and 6 rows.
xy = bind_cols(df1,df2)
or
xy = cbind(df1,df2)
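One practical difference from rbind() is worth a short sketch: bind_rows() matches columns by name and fills columns missing from one input with NA. The data frames `a` and `b` below are hypothetical.

```r
library(dplyr)

# Hypothetical data frames with only partly matching columns
a <- data.frame(ID = 1:2, x = c("a", "b"))
b <- data.frame(ID = 3:4, y = c(TRUE, FALSE))

stacked <- bind_rows(a, b)  # columns ID, x, y; gaps filled with NA
print(stacked)
```

With identical column sets, as in df1 and df2 above, the result has the same rows and columns as rbind().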
filter_df <- function(df, colname, val){
filter(df, !!sym(colname) == val)
}
filter_df(iris,"Species", "setosa")
Output : 50 rows
enquo() is used to quote its argument. Here the user supplies the
variable name without quotes.
filter_df <- function(df, colname, val){
colname = enquo(colname)
filter(df, !!colname == val)
}
filter_df(iris, Species, "setosa")
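A shorter way to write the same function is the embrace operator `{{ }}`, which combines the enquo() and !! steps. This is a sketch that assumes a dplyr and rlang installation recent enough to support the operator (rlang 0.4.0 or later). The function name `filter_df2` is hypothetical.

```r
library(dplyr)

# Hypothetical variant of filter_df() using the embrace operator
filter_df2 <- function(df, colname, val) {
  filter(df, {{ colname }} == val)  # quote and unquote in one step
}

res <- filter_df2(iris, Species, "setosa")
print(nrow(res))
```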
Example 49 : How to use SQL rank() over(partition by)
In SQL, rank() over(partition by) is used to compute rank by a
grouping variable. In dplyr, it can be achieved very easily with a single
line of code. See the example below. Here we are calculating rank of
variable Y2015 by variable Index.
t = mydata %>% select(Index, Y2015) %>%
group_by(Index) %>%
mutate(rank = min_rank(desc(Y2015)))%>%
arrange(Index, rank)
In dplyr, there are several functions to compute rank other than min_rank().
These include dense_rank(), row_number() and percent_rank().
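The ranking functions differ mainly in how they treat ties. The sketch below applies each of them to the same hypothetical vector `x`.

```r
library(dplyr)

x <- c(10, 20, 20, 40)

r_min     <- min_rank(x)      # ties share the lowest rank, gap afterwards
r_dense   <- dense_rank(x)    # ties share a rank, no gap afterwards
r_row     <- row_number(x)    # ties broken by position
r_percent <- percent_rank(x)  # (min_rank - 1) / (n - 1), between 0 and 1

print(r_min)
print(r_dense)
print(r_row)
print(r_percent)
```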
across() function
The across() function was added in dplyr version 1.0. It helps the
analyst perform the same operation on multiple columns. Let's take the
sample data.frame mtcars and calculate the mean of the variables from 'mpg'
through 'qsec' by 'carb'.
Alternative to summarise_at function
mtcars %>%
group_by(carb) %>%
summarise(across(mpg:qsec, mean))
Alternative to summarise_if function
The code below calculates average on numeric variables. It identifies
numeric variables using where() function.
mtcars %>%
group_by(carb) %>%
summarise(across(where(is.numeric), mean))
Multiple across() function
Here we are using two summary statistics - mean and number of distinct
values - on two different sets of variables.
mtcars %>%
group_by(carb) %>%
summarise(across(mpg:qsec, mean), across(vs:gear, n_distinct))
across() can also be applied with mutate function
mtcars %>%
group_by(carb) %>%
mutate(across(where(is.numeric), mean))