Manipulating Data in R
John Muschelli
January 7, 2016
Overview
Data wrangling cheat sheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Load the packages/libraries
library(dplyr)
## The following objects are masked from 'package:stats': filter, lag
library(tidyr)
Data used: Charm City Circulator
http://www.aejaffe.com/winterR_2016/data/Charm_City_Circulator_Ridership.csv
Let’s read in the Charm City Circulator data:
ex_data = read.csv("http://www.aejaffe.com/winterR_2016/dat
head(ex_data, 2)
[1] 0
head(ex_data$date)
class(ex_data$date)
library(stringr)
cn = colnames(ex_data)
cn = cn %>%
  str_replace("Board", ".Board") %>%
  str_replace("Alight", ".Alight") %>%
  str_replace("Average", ".Average")
colnames(ex_data) = cn
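To see what this renaming does, here is a minimal sketch that applies the same replacements to a few made-up column names. The vector toy_cn is hypothetical and only mimics the naming pattern assumed for the Circulator columns (line name followed by Boardings, Alightings or Average).

```r
library(stringr)

# Hypothetical column names that mimic the assumed pattern of the ridership data
toy_cn = c("day", "date", "orangeBoardings", "orangeAlightings", "orangeAverage")

# str_replace changes only the first match in each string and
# leaves strings without a match untouched
toy_cn = str_replace(toy_cn, "Board", ".Board")
toy_cn = str_replace(toy_cn, "Alight", ".Alight")
toy_cn = str_replace(toy_cn, "Average", ".Average")

print(toy_cn)
```

The inserted period gives every measurement column a consistent separator between the line and the measurement type, which is convenient when the names are split later.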
Removing the daily ridership
ex_data$daily = NULL
Reshaping data from wide (fat) to long (tall)
See http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/
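As an illustration of the wide-to-long idea, here is a small sketch using gather and separate from tidyr on a made-up data frame. The data frame toy_wide and its values are hypothetical, and the column names var, number, line and type are chosen only for this example. In tidyr 1.0.0 and later, pivot_longer is the recommended replacement for gather, but gather still works.

```r
library(tidyr)

# Hypothetical wide data: one row per date, one column per line/measurement
toy_wide = data.frame(
  date = c("01/11/2010", "01/12/2010"),
  orange.Boardings = c(877, 777),
  purple.Boardings = c(0, 0),
  stringsAsFactors = FALSE
)

# Stack every column except date into key/value pairs
toy_long = gather(toy_wide, key = "var", value = "number", -date)

# Split "orange.Boardings" into line = "orange" and type = "Boardings";
# sep is a regular expression, so the period is wrapped in brackets
toy_long = separate(toy_long, var, into = c("line", "type"), sep = "[.]")

print(toy_long)
```

Each date now appears once per line, so the long data has four rows instead of two.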
Reshaping data from wide (fat) to long (tall): base R
table(long$line)
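The same kind of reshaping can be sketched in base R with reshape. This is only an illustration on made-up data (toy_wide and its values are hypothetical), not necessarily the call used to build long above.

```r
# Hypothetical wide data with one Average column per line
toy_wide = data.frame(
  date = c("01/11/2010", "01/12/2010"),
  orange.Average = c(952, 796),
  purple.Average = c(0, 0),
  stringsAsFactors = FALSE
)

toy_long = reshape(
  toy_wide,
  direction = "long",
  varying = c("orange.Average", "purple.Average"),  # columns to stack
  v.names = "Average",                              # name of the stacked value column
  timevar = "line",                                 # new column identifying the source column
  times = c("orange", "purple"),                    # values placed in that new column
  idvar = "date"                                    # column identifying each wide row
)
rownames(toy_long) = NULL

# Each line should appear once per date
table(toy_long$line)
```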
Now we can filter only the good rows and delete the good column.
  id      Age
1  1 55.00000
2  2 55.55556
  id visit  Outcome
1  1     1 10.00000
2  2     2 11.73913
Merging
dim(merged.data)
[1] 24 4
Merging
dim(all.data)
[1] 26 4
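The difference between the two row counts comes from which ids are kept. A minimal sketch with made-up data (toy_base and toy_visits are hypothetical stand-ins for the baseline and visit data sets): by default merge keeps only ids found in both data sets, while all = TRUE keeps every id and fills the gaps with NA.

```r
# Hypothetical baseline data: one row per id
toy_base = data.frame(id = c(1, 2, 3, 4), Age = c(55, 60, 65, 70))

# Hypothetical visit data: id 1 has two visits, id 5 has no baseline record
toy_visits = data.frame(
  id = c(1, 1, 2, 5),
  visit = c(1, 2, 1, 1),
  Outcome = c(10, 12, 11, 9)
)

# Default merge: only ids present in both data sets (ids 1 and 2)
inner = merge(toy_base, toy_visits, by = "id")

# all = TRUE: every id from either data set, with NA where information is missing
outer = merge(toy_base, toy_visits, by = "id", all = TRUE)

dim(inner)
dim(outer)
```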
Joining in dplyr
dim(lj)
[1] 26 4
tail(lj)
dim(rj)
[1] 24 4
tail(rj)
dim(fj)
[1] 26 4
tail(fj)
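A small sketch of the three joins on made-up data may help in reading the dimensions above. The data frames toy_base and toy_visits are hypothetical; left_join keeps all rows of the first data set, right_join keeps all rows of the second, and full_join keeps rows from both.

```r
library(dplyr)

# Hypothetical baseline data: one row per id
toy_base = data.frame(id = c(1, 2, 3, 4), Age = c(55, 60, 65, 70))

# Hypothetical visit data: id 1 has two visits, id 5 has no baseline record
toy_visits = data.frame(
  id = c(1, 1, 2, 5),
  visit = c(1, 2, 1, 1),
  Outcome = c(10, 12, 11, 9)
)

toy_lj = left_join(toy_base, toy_visits, by = "id")   # all ids in toy_base
toy_rj = right_join(toy_base, toy_visits, by = "id")  # all ids in toy_visits
toy_fj = full_join(toy_base, toy_visits, by = "id")   # all ids in either

dim(toy_lj)
dim(toy_rj)
dim(toy_fj)
```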
args(tapply)
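The main arguments of tapply are X (the values), INDEX (the grouping variable) and FUN (the function applied within each group). A minimal sketch on made-up data (the data frame toy is hypothetical):

```r
# Hypothetical ridership averages for two lines
toy = data.frame(
  line = c("orange", "orange", "purple", "purple"),
  Average = c(900, 1100, 400, 600),
  stringsAsFactors = FALSE
)

# Mean of Average within each line; the result is a named vector
toy_means = tapply(toy$Average, toy$line, mean)
print(toy_means)
```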
gb = group_by(wide, line)
summarize(gb, mean_avg = mean(Average))
    line  mean_avg
   (chr)     (dbl)
1 banner  827.2685
2  green 1957.7814
3 orange 3033.1611
4 purple 4016.9345
Perform Operations By Groups: dplyr with piping
Using piping, this is:
wide %>%
group_by(line) %>%
summarise(mean_avg = mean(Average))
    line  mean_avg
   (chr)     (dbl)
1 banner  827.2685
2  green 1957.7814
3 orange 3033.1611
4 purple 4016.9345
Perform Operations By Multiple Groups: dplyr
This can easily be extended using group_by with multiple groups.
Let’s define the year of riding:
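One way this could look, as a sketch on made-up data: the data frame toy is hypothetical, and the example assumes date has already been converted to class Date. The year is extracted with base R's format here; a date-handling package would work equally well.

```r
library(dplyr)

# Hypothetical long-format data with a Date column
toy = data.frame(
  date = as.Date(c("2010-01-11", "2010-06-01", "2011-01-10", "2011-06-01")),
  line = "orange",
  Average = c(900, 1100, 1500, 1700),
  stringsAsFactors = FALSE
)

# Define the year of riding from the date
toy = mutate(toy, year = as.numeric(format(date, "%Y")))

# Group by two variables: one summary row per line/year combination
by_year = toy %>%
  group_by(line, year) %>%
  summarize(mean_avg = mean(Average))

print(by_year)
```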
library(ggplot2)
ggplot(aes(x = date, y = Average,
colour = line), data = wide) + geom_line()
[Figure: line plot of Average (y-axis, roughly 2000 to 8000) against date, coloured by line: banner, green, orange, purple]
Perform Operations By Multiple Groups: dplyr
Let’s create the middle of the month (the 15th for example), and
name it mon.
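A sketch of how mon might be built, on made-up data: the data frame toy is hypothetical, date is assumed to be of class Date, and the date parts are extracted with base R's format. Only the names mon, mid_month and mean_avg are taken from the slides.

```r
library(dplyr)

# Hypothetical long-format data with a Date column
toy = data.frame(
  date = as.Date(c("2010-01-11", "2010-01-25", "2010-02-01", "2010-02-15")),
  line = "orange",
  Average = c(900, 1100, 1500, 1700),
  stringsAsFactors = FALSE
)

# Extract year and month as text, e.g. "2010" and "01"
toy = mutate(toy,
             year = format(date, "%Y"),
             month = format(date, "%m"))

# Monthly mean per line, then a date on the 15th to place each month on the x-axis
mon = toy %>%
  group_by(line, month, year) %>%
  summarize(mean_avg = mean(Average)) %>%
  ungroup() %>%
  mutate(mid_month = as.Date(paste(year, month, "15", sep = "-")))

print(mon)
```

Using a real date for mid_month, instead of a month label, lets ggplot2 space the points correctly along a continuous time axis.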
ggplot(aes(x = mid_month,
y = mean_avg,
colour = line), data = mon) + geom_line()
[Figure: line plot of mean_avg (y-axis, roughly 2000 to 5000) against mid_month, coloured by line: banner, green, orange, purple]
Bonus! Points with a smoother!
ggplot(aes(x = date, y = Average, colour = line),
data = wide) + geom_smooth(se = FALSE) +
geom_point(size = .5)
[Figure: points with a smoothed curve of Average (y-axis, roughly 2000 to 8000) against date, coloured by line: banner, green, orange, purple]