Study Guide Data Manipulation With R

The document provides an overview of useful R functions and commands for data manipulation and preprocessing. It summarizes functions for exploring, filtering, and changing data types and columns. Pipes enable chained operations like applying multiple transformations sequentially to a dataframe.
15.003 Software Tools - Data Science    Afshine Amidi & Shervine Amidi

Study Guide: Data Manipulation with R

Afshine Amidi and Shervine Amidi
August 21, 2020

Main concepts

Exploring the data – The table below summarizes the main functions used to get a complete overview of the data:

Category      Action                            Command
Look at data  Select columns of interest        df %>% select(col_list)
              Remove unwanted columns           df %>% select(-col_list)
              Look at n first rows / last rows  df %>% head(n) / df %>% tail(n)
              Summary statistics of columns     df %>% summary()
Data types    Data types of columns             df %>% str()
              Number of rows / columns          df %>% NROW() / df %>% NCOL()

File management – The table below summarizes the useful commands to make sure the working directory is correctly set:

Category  Action                                       Command
Paths     Change directory to another path             setwd(path)
          Get current working directory                getwd()
          Join paths                                   file.path(path_1, ..., path_n)
Files     List files and folders in a given directory  list.files(path, include.dirs = TRUE)
          Check if path is a file / folder             file_test('-f', path) / file_test('-d', path)
          Read / write csv file                        read.csv(path_to_csv_file) / write.csv(df, path_to_csv_file)

Data types – The table below sums up the main data types that can be contained in columns:

Data type  Description                                                Example
character  String-related data                                        'teddy bear'
factor     String-related data that can be put in buckets, or ordered  'high'
numeric    Numerical data                                             24.0
integer    Numerical data that are integers                           24L
Date       Dates                                                      '2020-01-01'
POSIXct    Timestamps                                                 '2020-01-01 00:01:00'
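As an illustration of the data types listed above, the sketch below builds a small data frame with one column per type and checks the class of each column. The data frame df, its column names and its values are made up for this example; only base R is used.

```r
# Hypothetical data frame with one column per data type from the table above
df <- data.frame(
  name     = c("teddy bear", "robot"),                               # character
  priority = factor(c("high", "low"), levels = c("low", "high")),    # factor
  price    = c(24.0, 12.5),                                          # numeric
  quantity = c(24L, 3L),                                             # integer
  day      = as.Date(c("2020-01-01", "2020-01-02")),                 # Date
  seen_at  = as.POSIXct(c("2020-01-01 00:01:00", "2020-01-02 10:30:00"),
                        tz = "UTC"),                                 # POSIXct
  stringsAsFactors = FALSE
)

# First class of each column (a POSIXct column carries two classes,
# POSIXct and POSIXt, so only the first one is kept)
col_types <- sapply(df, function(col) class(col)[1])
print(col_types)

# str() shows the same information together with a preview of the values
str(df)
```

The L suffix is what makes 24L an integer rather than a numeric value.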

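The file management and exploration commands from the tables above can be combined into a short round trip: write a data frame to a CSV file, check the path, read the file back and inspect it. The file name toys.csv and the data are hypothetical, and a temporary directory is used so that nothing is written to a real project folder.

```r
# Work inside a temporary directory
base_dir <- tempdir()
csv_path <- file.path(base_dir, "toys.csv")   # hypothetical file name

# Write a small, made-up data frame to disk
# (row.names = FALSE avoids an extra column of row names in the file)
toys <- data.frame(name  = c("teddy bear", "robot", "kite"),
                   price = c(24.0, 12.5, 8.0))
write.csv(toys, csv_path, row.names = FALSE)

# Check that the path is a file and that its parent is a folder
is_file   <- file_test("-f", csv_path)
is_folder <- file_test("-d", base_dir)

# List the directory content and read the file back
listed    <- list.files(base_dir, include.dirs = TRUE)
toys_back <- read.csv(csv_path)

# Quick look at the result
print(head(toys_back, 2))
str(toys_back)
print(c(rows = NROW(toys_back), cols = NCOL(toys_back)))
```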
Chaining – The symbol %>%, also called "pipe", enables chained operations and improves legibility. Here are its different interpretations:
• f(arg_1, arg_2, ..., arg_n) is equivalent to arg_1 %>% f(arg_2, arg_3, ..., arg_n), and also to:
  – arg_1 %>% f(., arg_2, ..., arg_n)
  – arg_2 %>% f(arg_1, ., arg_3, ..., arg_n)
  – arg_n %>% f(arg_1, ..., arg_n-1, .)
• A common use of the pipe is when a dataframe df gets first modified by some_operation_1, then some_operation_2, until some_operation_n in a sequential way. It is done as follows:

R
# df gets some_operation_1, then some_operation_2, ...,
# then some_operation_n
df %>%
  some_operation_1 %>%
  some_operation_2 %>%
  ... %>%
  some_operation_n

Data preprocessing

Filtering – We can filter rows according to some conditions as follows:

R
df %>%
  filter(some_col some_operation some_value_or_list_or_col)

where some_operation is one of the following:

Category  Operation                Command
Basic     Equality / non-equality  == / !=
          Inequalities             <, <=, >=, >
          And / or                 & / |
Advanced  Check for missing value  is.na()
          Belonging                %in% c(val_1, ..., val_n)
          Pattern matching         %like% 'val' (provided by the data.table package)
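To make the pipe equivalences and the filter operations above concrete, here is a minimal sketch. It assumes the dplyr package is installed; the toys data frame and its values are made up for illustration.

```r
library(dplyr)  # provides %>% and filter()

toys <- data.frame(
  name  = c("teddy bear", "robot", "kite", "puzzle"),
  price = c(24.0, NA, 8.0, 15.0),
  stringsAsFactors = FALSE
)

# The two calls below are equivalent: the pipe passes its left-hand side
# as the first argument of the function on its right-hand side
first_two_classic <- head(toys, 2)
first_two_piped   <- toys %>% head(2)

# The dot places the left-hand side at another argument position:
# this evaluates paste("price:", 24)
label <- 24 %>% paste("price:", .)

# Chained filters: keep rows with a known price, then rows that are
# either cheap or in a given list of names
selected <- toys %>%
  filter(!is.na(price)) %>%
  filter(price <= 10 | name %in% c("teddy bear", "robot"))
print(selected)
```

The robot row is dropped by the first filter because its price is missing, so only the teddy bear and the kite remain.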

Massachusetts Institute of Technology    https://www.mit.edu/~amidi



Remark: we can filter columns with the select_if command.

Changing columns – The table below summarizes the main column operations:

Action                                       Command
Add new columns on top of old ones           df %>% mutate(new_col = operation(other_cols))
Add new columns and discard old ones         df %>% transmute(new_col = operation(other_cols))
Modify several columns in-place              df %>% mutate_at(vars, funs)
Modify all columns in-place                  df %>% mutate_all(funs)
Modify columns fitting a specific condition  df %>% mutate_if(condition, funs)
Unite columns                                df %>% unite(new_merged_col, old_cols_list)
Separate columns                             df %>% separate(col_to_separate, new_cols_list)

In the datetime commands, format is a string describing the structure of the field, using the codes summarized in the table below:

Category  Command             Description                            Example
Year      '%Y' / '%y'         With / without century                 2020 / 20
Month     '%B' / '%b' / '%m'  Full / abbreviated / numerical         August / Aug / 08
Weekday   '%A' / '%a'         Full / abbreviated                     Sunday / Sun
          '%u' / '%w'         Number (1-7) / Number (0-6)            7 / 0
Day       '%d' / '%j'         Of the month / of the year             09 / 222
Time      '%H' / '%M'         Hour / minute                          09 / 40
Timezone  '%Z' / '%z'         String / offset from UTC (hours and minutes)  EDT / -0400

Remark: data frames only accept datetime in POSIXct format.
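As an illustration of the format codes above, the sketch below parses a timestamp string and extracts a few properties from it. The timestamp and the UTC time zone are made up for this example; only base R is used, and only numeric codes are shown because names such as those produced by '%B' or '%A' depend on the locale.

```r
# Parse a timestamp string; the format string mirrors the layout of the text
ts <- as.POSIXct("2020-08-09 09:40:00",
                 format = "%Y-%m-%d %H:%M:%S",
                 tz = "UTC")

# Extract properties with format(); each result is a character string
year        <- format(ts, "%Y")      # year with century
month_num   <- format(ts, "%m")      # month as a two-digit number
day_of_year <- format(ts, "%j")      # day of the year
weekday_num <- format(ts, "%u")      # weekday number, Monday is 1
hour_minute <- format(ts, "%H:%M")   # hour and minute

print(c(year, month_num, day_of_year, weekday_num, hour_minute))
```

Note that format is passed by name to as.POSIXct(); passed positionally, the second argument would be read as the time zone.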

Date properties – In order to extract a date-related property from a datetime object, the following command is used:

R
format(datetime_object, format)

where format follows the same convention as in the table above.

Conditional column – A column can take different values with respect to a particular set of conditions with the case_when() command as follows:

R
case_when(condition_1 ~ value_1,   # If condition_1 then value_1
          condition_2 ~ value_2,   # If condition_2 then value_2
          ...
          TRUE ~ value_n)          # Otherwise, value_n

Remark: the ifelse(condition, value_if_true, value_other) command can be used and is easier to manipulate if there is only one condition.
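The column operations and the conditional column described above can be sketched together as follows. This assumes dplyr is installed; the toys data frame, the price thresholds and the 20% tax rate are made up for illustration.

```r
library(dplyr)  # mutate(), transmute(), case_when()

toys <- data.frame(
  name  = c("teddy bear", "robot", "kite"),
  price = c(24, 60, 8),
  stringsAsFactors = FALSE
)

# mutate() keeps the old columns and adds the new ones
with_tier <- toys %>%
  mutate(
    price_with_tax = price * 1.2,              # made-up 20% tax rate
    tier = case_when(
      price < 10 ~ "cheap",                    # first matching condition wins
      price < 50 ~ "standard",
      TRUE       ~ "premium"                   # fallback for all other rows
    ),
    is_cheap = ifelse(price < 10, "yes", "no") # single condition: ifelse() is enough
  )

# transmute() keeps only the columns it creates
only_tier <- toys %>%
  transmute(tier = ifelse(price < 50, "regular", "premium"))

print(with_tier)
print(only_tier)
```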
Mathematical operations – The table below sums up the main mathematical operations that can be performed on columns:

Operation  Command
√x         sqrt(x)
⌊x⌋        floor(x)
⌈x⌉        ceiling(x)

Datetime conversion – Fields containing datetime values can be stored in two different POSIXt data types:

Action                                                 Command
Converts to datetime with seconds since origin         as.POSIXct(col, format = format)
Converts to datetime with attributes (e.g. time zone)  as.POSIXlt(col, format = format)

Data frame transformation

Merging data frames – We can merge two data frames by a given field as follows:

R
merge(df_1, df_2, join_field, join_type)

where join_field indicates the fields on which the join needs to happen:

Case     Fields are equal  Different field names
Command  by = 'field'      by.x = 'field_1', by.y = 'field_2'

and where join_type indicates the join type, and is one of the following:


Join type   Option
Inner join  default
Left join   all.x = TRUE
Right join  all.y = TRUE
Full join   all = TRUE

Common transformations – The common data frame transformations are summarized in the table below:

Type          Command
Long to wide  spread(df, key = 'key', value = 'value')
Wide to long  gather(df, key = 'key', value = 'value', c(key_1, ..., key_n))
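The long to wide and wide to long transformations above can be sketched as a round trip. This assumes the tidyr package is installed; the data frame and the column names toy, metric and amount are made up for illustration. The tidyr documentation marks spread() and gather() as superseded by pivot_wider() and pivot_longer(), but both still work.

```r
library(tidyr)  # spread() and gather()

# Made-up long-format data: one row per (toy, metric) pair
long_df <- data.frame(
  toy    = c("robot", "robot", "kite", "kite"),
  metric = c("price", "stock", "price", "stock"),
  amount = c(60, 3, 8, 12),
  stringsAsFactors = FALSE
)

# Long to wide: one column per distinct value of `metric`
wide_df <- spread(long_df, key = metric, value = amount)
print(wide_df)

# Wide to long: stack the price and stock columns back into two columns,
# one holding the former column names and one holding the values
long_again <- gather(wide_df, key = "metric", value = "amount", c(price, stock))
print(long_again)
```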

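Going back to merge(), the join types and the by, by.x and by.y options listed above behave as in the following sketch. The toys and orders tables are made up for illustration; only base R is used.

```r
# Made-up tables sharing the key column `toy_id`
toys   <- data.frame(toy_id = c(1, 2, 3), name = c("robot", "kite", "puzzle"))
orders <- data.frame(toy_id = c(2, 3, 3, 4), quantity = c(5, 1, 2, 7))

inner <- merge(toys, orders, by = "toy_id")                # keys present in both tables
left  <- merge(toys, orders, by = "toy_id", all.x = TRUE)  # every toy is kept
full  <- merge(toys, orders, by = "toy_id", all = TRUE)    # every key is kept

# Different key names on each side: by.x / by.y
orders_renamed <- data.frame(id = c(2, 3), quantity = c(5, 1))
renamed <- merge(toys, orders_renamed, by.x = "toy_id", by.y = "id")

print(inner)
print(left)
print(full)
print(renamed)
```

Rows without a match on the other side are filled with NA, as for toy 1 in the left join.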
Remark: if the by parameter is not specified, merge() joins on the columns that the two data frames have in common. If they have no column name in common, the merge will be a cross join.

Concatenation – The table below summarizes the different ways data frames can be concatenated:

Type     Command
Rows     rbind(df_1, ..., df_n)
Columns  cbind(df_1, ..., df_n)

Row operations – The following actions are used to make operations on rows of the data frame:

Action                                    Command
Sort with respect to columns              df %>% arrange(col_1, ..., col_n)
Dropping duplicates                       df %>% unique()
Drop rows with at least one missing value  df %>% na.omit()

Remark: by default, the arrange command sorts in ascending order. If we want to sort in descending order, a numeric column needs to be prefixed with -, or the column wrapped in desc().
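The concatenation and row operations above fit naturally into one short pipeline: stack two data frames, drop duplicates and incomplete rows, then sort. The sketch assumes dplyr is installed; batch_1, batch_2 and their values are made up for illustration.

```r
library(dplyr)  # arrange(), desc() and %>%

# Two made-up batches with one duplicated row and one missing price
batch_1 <- data.frame(name = c("robot", "kite"), price = c(60, 8))
batch_2 <- data.frame(name = c("kite", "puzzle", "drone"), price = c(8, NA, 120))

# Stack rows (both data frames must have the same column names)
all_rows <- rbind(batch_1, batch_2)

cleaned <- all_rows %>%
  unique() %>%            # drop the duplicated kite row
  na.omit() %>%           # drop the puzzle row, whose price is NA
  arrange(desc(price))    # most expensive first

# cbind() puts columns side by side (row counts must match)
with_position <- cbind(cleaned, position = seq_len(NROW(cleaned)))
print(with_position)
```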

Aggregations

Grouping data – Aggregate metrics are computed across groups as follows:


The R command is as follows:

R
df %>%                                                   # Ungrouped data frame
  group_by(col_1, ..., col_n) %>%                        # Group by some columns
  summarize(agg_metric = some_aggregation(some_cols))    # Aggregation step

Aggregate functions – The table below summarizes the main aggregate functions that can be used in an aggregation query:

Category    Action                                             Command
Properties  Count of observations                              n()
Values      Sum of values of observations                      sum()
            Max / min of values of observations                max() / min()
            Mean / median of values of observations            mean() / median()
            Standard deviation / variance across observations  sd() / var()

Window functions

Definition – A window function computes a metric over groups. The R command is as follows:

R
df %>%                                         # Ungrouped data frame
  group_by(col_1, ..., col_n) %>%              # Group by some columns
  mutate(win_metric = window_function(col))    # Window function

Remark: applying a window function will not change the initial number of rows of the data frame.

Row numbering – The table below summarizes the main commands that rank each row across specified groups, ordered by a specific field:

Command        Description                                               Example
row_number(x)  Ties are given different ranks                            1, 2, 3, 4
rank(x)        Ties are given the same (averaged) rank and skip numbers  1, 2.5, 2.5, 4
dense_rank(x)  Ties are given the same rank and do not skip numbers      1, 2, 2, 3

Values – The following window functions make it possible to keep track of specific types of values with respect to the group:

Command     Description
first(x)    Takes the first value of the column
last(x)     Takes the last value of the column
lag(x, n)   Takes the nth previous value of the column
lead(x, n)  Takes the nth following value of the column
nth(x, n)   Takes the nth value of the column
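The difference between an aggregation and a window function can be seen by running both on the same grouped data, as in the sketch below. It assumes dplyr is installed; the sales data frame and its column names are made up for illustration.

```r
library(dplyr)

sales <- data.frame(
  shop   = c("A", "A", "A", "B", "B"),
  day    = c(1, 2, 3, 1, 2),
  amount = c(10, 30, 30, 5, 20),
  stringsAsFactors = FALSE
)

# Aggregation: one row per group
per_shop <- sales %>%
  group_by(shop) %>%
  summarize(n_sales = n(), total = sum(amount), average = mean(amount))
print(per_shop)

# Window functions: same number of rows as the input
windowed <- sales %>%
  group_by(shop) %>%
  arrange(day, .by_group = TRUE) %>%     # order rows by day within each shop
  mutate(
    row_in_shop = row_number(),          # position within the shop
    amount_rank = dense_rank(amount),    # ties share a rank, no gaps
    previous    = lag(amount, 1),        # NA on the first row of each shop
    first_sale  = first(amount)          # first amount of the shop
  ) %>%
  ungroup()
print(windowed)
```

The aggregated result has two rows, one per shop, while the windowed result keeps all five rows of the input.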
