Study Guide Data Manipulation With R
003 Software Tools — Data Science Afshine Amidi & Shervine Amidi
Exploring the data – The table below summarizes the main functions used to get a complete overview of the data:
Category  Action  Command
Data types – The table below sums up the main data types that can be contained in columns:
Data preprocessing
Chaining – The symbol %>%, also called the "pipe", chains operations together and improves legibility. Here are its different interpretations:
• f(arg_1, arg_2, ..., arg_n) is equivalent to arg_1 %>% f(arg_2, arg_3, ..., arg_n), and also to:
  – arg_1 %>% f(., arg_2, ..., arg_n)
  – arg_2 %>% f(arg_1, ., arg_3, ..., arg_n)
  – arg_n %>% f(arg_1, ..., arg_n-1, .)
Filtering – We can filter rows according to some conditions as follows:
R
df %>%
  filter(some_col some_operation some_value_or_list_or_col)
where some_operation is one of the following:
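As a concrete illustration of the pipe equivalences and of filter(), here is a minimal sketch. The data frame df, its columns, and the two operators used (> and %in%) are assumptions chosen for the example, not taken from the guide.

```r
library(dplyr)

# Hypothetical data frame used only for illustration
df <- data.frame(
  name  = c("a", "b", "c", "d"),
  score = c(10, 25, 40, 55),
  stringsAsFactors = FALSE
)

# Three equivalent ways of calling head(df, 2)
direct    <- head(df, 2)
piped     <- df %>% head(2)      # arg_1 %>% f(arg_2)
piped_pos <- 2 %>% head(df, .)   # arg_2 %>% f(arg_1, .)

# Filtering rows with a comparison and with a membership test
high   <- df %>% filter(score > 20)
chosen <- df %>% filter(name %in% c("a", "d"))
```

The dot form is mostly useful when the piped value is not the first argument of the function being called.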
Remark: we can filter columns with the select_if command.
where format is a string describing the structure of the field and using the commands summarized in the table below:
Category  Command             Description                                       Example
Year      '%Y' / '%y'         With / without century                            2020 / 20
Month     '%B' / '%b' / '%m'  Full / abbreviated / numerical                    August / Aug / 08
Weekday   '%A' / '%a'         Full / abbreviated                                Sunday / Sun
          '%u' / '%w'         Number (1-7) / Number (0-6)                       7 / 0
Day       '%d' / '%j'         Of the month / of the year                        09 / 222
Time      '%H' / '%M'         Hour / minute                                     09 / 40
Timezone  '%Z' / '%z'         String / offset from UTC (hours and minutes)      EDT / -0400
Remark: data frames only accept datetime in POSIXct format.
Changing columns – The table below summarizes the main column operations:
Action                                       Command
Add new columns on top of old ones           df %>% mutate(new_col = operation(other_cols))
Add new columns and discard old ones         df %>% transmute(new_col = operation(other_cols))
Modify several columns in-place              df %>% mutate_at(vars, funs)
Modify all columns in-place                  df %>% mutate_all(funs)
Modify columns fitting a specific condition  df %>% mutate_if(condition, funs)
Unite columns                                df %>% unite(new_merged_col, old_cols_list)
Separate columns                             df %>% separate(col_to_separate, new_cols_list)
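The column operations above can be sketched on a small example. The data frame, its column names and the computed bmi column are hypothetical and chosen only for illustration; the sketch assumes the dplyr and tidyr packages are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical data frame used only for illustration
df <- data.frame(
  first  = c("Ada", "Alan"),
  last   = c("Lovelace", "Turing"),
  height = c(1.65, 1.78),
  weight = c(60, 70),
  stringsAsFactors = FALSE
)

# mutate() adds a column and keeps the existing ones
with_bmi <- df %>% mutate(bmi = weight / height^2)

# transmute() keeps only the columns it creates
only_bmi <- df %>% transmute(bmi = weight / height^2)

# mutate_if() modifies in place every column matching a condition
rounded <- df %>% mutate_if(is.numeric, round)

# unite() merges columns into one; separate() splits it back
united   <- df %>% unite(full_name, first, last, sep = " ")
restored <- united %>% separate(full_name, c("first", "last"), sep = " ")
```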
Conditional column – A column can take different values with respect to a particular set of conditions with the case_when() command as follows:
R
case_when(condition_1 ~ value_1,  # If condition_1 then value_1
          condition_2 ~ value_2,  # If condition_2 then value_2
          ...
          TRUE ~ value_n)         # Otherwise, value_n
Date properties – In order to extract a date-related property from a datetime object, the following command is used:
R
format(datetime_object, format)
where format follows the same convention as in the table above.
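To tie format() and case_when() together, here is a minimal sketch that parses strings into POSIXct, extracts date properties, and derives a conditional column from them. The timestamps, column names, the UTC timezone and the weekend/weekday labels are all assumptions made for the example.

```r
library(dplyr)

# Hypothetical timestamps stored as strings
df <- data.frame(
  ts_string = c("2020-08-08 09:40:00", "2020-08-10 18:05:00"),
  stringsAsFactors = FALSE
)

df <- df %>%
  mutate(
    # Parse the strings to POSIXct; the timezone is fixed to UTC (an assumption)
    ts = as.POSIXct(ts_string, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    # Extract date properties with format()
    year        = format(ts, "%Y"),
    weekday_num = format(ts, "%u"),  # "1" (Monday) to "7" (Sunday)
    # Conditional column derived from the weekday number
    day_type = case_when(
      weekday_num %in% c("6", "7") ~ "weekend",
      TRUE                         ~ "weekday"
    )
  )
```

Note that format() returns character strings, which is why the weekday numbers are compared against "6" and "7" rather than numeric values.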
Datetime conversion – Fields containing datetime values can be stored in two different POSIXt data types:
Wide to long
  gather(
    df, key = 'key',
    value = 'value',
    c(key_1, ..., key_n)
  )
Right join  all.y = TRUE
Full join   all = TRUE
Case  Fields are equal  Different field names
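The join options all.y = TRUE and all = TRUE are arguments of base R's merge(). A minimal sketch follows, with two hypothetical data frames; the by.x / by.y arguments used for the "different field names" case are standard merge() parameters but do not appear in the text above, so treat that part as an assumption about what the guide intended.

```r
# Hypothetical data frames sharing the key column "id"
customers <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
orders    <- data.frame(id = c(2, 3, 4), amount = c(20, 35, 50))

inner <- merge(customers, orders, by = "id")                # matching ids only
right <- merge(customers, orders, by = "id", all.y = TRUE)  # every row of orders
full  <- merge(customers, orders, by = "id", all = TRUE)    # every row of both

# Key stored under different names on each side: by.x / by.y
orders2 <- data.frame(customer_id = c(2, 3, 4), amount = c(20, 35, 50))
right2  <- merge(customers, orders2,
                 by.x = "id", by.y = "customer_id", all.y = TRUE)
```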
Remark: if the by parameter is not specified, merge joins on the columns that the two data frames have in common; when they share none, the result is a cross join.
Concatenation – The table below summarizes the different ways data frames can be concatenated:
Rows     rbind(df_1, ..., df_n)
Columns  cbind(df_1, ..., df_n)
Row operations – The following actions are used to make operations on rows of the data frame:
Action                                          Command
Sort with respect to columns                    df %>% arrange(col_1, ..., col_n)
Dropping duplicates                             df %>% unique()
Drop rows with at least one missing (NA) value  df %>% na.omit()
Remark: by default, the arrange command sorts in ascending order. To sort in descending order, put - before a column.
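The concatenation and row operations above can be combined in a short sketch. The two data frames and their columns are hypothetical and exist only to make the example runnable.

```r
library(dplyr)

# Hypothetical data frames with identical columns
df_1 <- data.frame(id = c(3, 1), score = c(0.5, NA))
df_2 <- data.frame(id = c(2, 3), score = c(0.9, 0.5))

stacked <- rbind(df_1, df_2)             # 4 rows, one of them duplicated

by_id      <- stacked %>% arrange(id)    # ascending order
by_id_desc <- stacked %>% arrange(-id)   # descending order
deduped    <- stacked %>% unique()       # drops the repeated (3, 0.5) row
complete   <- stacked %>% na.omit()      # drops the row with a missing score
```

The - prefix only works on numeric columns; for other column types, dplyr's desc() serves the same purpose.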
Common transformations – The common data frame transformations are summarized in the table below:
Aggregations
Grouping data – Aggregate metrics are computed across groups as follows:
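A minimal sketch of grouped aggregation, assuming dplyr's group_by() followed by summarize(); the sales data frame and the metric names are hypothetical.

```r
library(dplyr)

# Hypothetical data frame used only for illustration
sales <- data.frame(
  region = c("north", "north", "south", "south", "south"),
  amount = c(10, 20, 5, 15, 40),
  stringsAsFactors = FALSE
)

by_region <- sales %>%
  group_by(region) %>%        # one group per region
  summarize(
    n_sales = n(),            # number of rows in the group
    total   = sum(amount),    # aggregate metric
    average = mean(amount)
  )
```

Unlike mutate(), summarize() collapses each group to a single row, so by_region has one row per region.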
Window functions
Definition – A window function computes a metric over groups and has the following structure:
R
df %>%                                         # Ungrouped data frame
  group_by(col_1, ..., col_n) %>%              # Group by some columns
  mutate(win_metric = window_function(col))    # Window function
Remark: applying a window function will not change the initial number of rows of the data frame.
Row numbering – The table below summarizes the main commands that rank each row across specified groups, ordered by a specific field:
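The window-function structure can be sketched with a few ranking helpers. The choice of dplyr's row_number(), min_rank() and dense_rank() is an assumption made for this example, as are the scores data frame and its columns.

```r
library(dplyr)

# Hypothetical data frame used only for illustration
scores <- data.frame(
  team   = c("a", "a", "a", "b", "b"),
  points = c(30, 10, 30, 7, 9),
  stringsAsFactors = FALSE
)

ranked <- scores %>%
  group_by(team) %>%
  mutate(
    row_num    = row_number(desc(points)),  # ties broken by order of appearance
    rnk        = min_rank(desc(points)),    # ties share a rank, gaps follow
    dense_rnk  = dense_rank(desc(points)),  # ties share a rank, no gaps
    team_total = sum(points)                # group aggregate repeated on each row
  ) %>%
  ungroup()
```

The result keeps all five input rows, which is the behaviour that distinguishes a window function from a grouped summarize().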