
Data Transformation 1 Reviewed

Here are the steps to solve this exercise: 1. Rank flights by arrival delay from most to least delayed: ranked_flights <- flights %>% mutate(arr_delay_rank = min_rank(-arr_delay)) 2. Select only the specified fields: ranked_flights <- ranked_flights %>% select(year, month, day, dest, arr_delay, sched_dep_time, arr_delay_rank) 3. Arrange by the new ranking field: ranked_flights <- ranked_flights %>% arrange(arr_delay_rank) 4. Add a cumulative arrival delay variable: ranked_flights <- ranked

Uploaded by

JORDI MASFERRER

[IN007]

Data Analysis Tools

Data Transformation

IN007 – Data Analysis Tools Pág. 1


Basics
■ Coding basics

■ Tibbles and data frames basics
■ Flights ➔ library(nycflights13)

■ View(flights)
■ glimpse(flights)
■ ?flights

■ Data types

■ Data transformations – introduction

– The verbs covered in this section (filter(), arrange(), select(), mutate() and summarise()) can all
be used in conjunction with group_by(), which changes the scope of each function from operating on
the entire dataset to operating on it group by group. Together, these six functions provide the verbs
for a language of data manipulation.

■ Make sure you have installed and loaded the dplyr library


■ library(dplyr)


■ <VERB> (<data frame>, <manipulations>)

– …and remember to assign the result to a variable if you want to make some use of it.
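
The pattern above can be sketched as follows. This is only an illustration: flights_demo is a hypothetical miniature table with invented values, standing in for the real flights data.

```r
library(dplyr)

# Hypothetical miniature stand-in for the flights table (values invented)
flights_demo <- data.frame(
  month     = c(1, 1, 2, 12),
  day       = c(1, 2, 1, 31),
  dep_delay = c(2, -3, 45, 10),
  dest      = c("IAH", "MIA", "ORD", "LAX")
)

# <VERB>(<data frame>, <manipulations>), assigned so the result can be reused
january  <- filter(flights_demo, month == 1)      # rows where month is 1
by_delay <- arrange(flights_demo, dep_delay)      # rows sorted by dep_delay
delays   <- select(flights_demo, dest, dep_delay) # only two columns
```

None of these calls modifies flights_demo itself; each one returns a new data frame.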

Filter
■ Filter
■ Allows you to subset observations based on their values.
■ The first argument is the name of the data frame (you already know that)
■ The second and subsequent arguments are the expressions that filter the data frame.
■ For example, filter(flights, month == 1, day == 1) selects all flights on January 1st.

■ Just type the statement to see the results

■ Assign it to a variable if you want to keep the results for further use

■ Assign and wrap the whole statement in parentheses to do both at the same time:
■ (jan1 <- filter(flights, month == 1, day == 1))

■ Comparison operators:
■ ==, >, >=, <, <=, !=
■ For floating-point numbers, near(<parameter 1>, <parameter 2>) is safer than ==
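
A small sketch of why near() can be preferable to == when comparing computed numbers. The variable names are illustrative only.

```r
library(dplyr)

# Floating-point arithmetic is approximate, so == can fail unexpectedly
exact  <- sqrt(2)^2 == 2       # FALSE: the result differs from 2 by a tiny amount
approx <- near(sqrt(2)^2, 2)   # TRUE: equal within a small tolerance
```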

■ Logical operators


■ Can you tell the result of the following commands?


■ filter(flights, month == 11 | month == 12)
■ filter(flights, month != 11 & month != 12)
■ filter(flights, month != 11 | month != 12)
■ filter(flights, month == 11 & month == 12)
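
One way to check your answers is to run the same conditions on a tiny table. The sketch below uses a hypothetical months_demo table with one row per month, not the real flights data.

```r
library(dplyr)

# Hypothetical table with one row per month (1 to 12)
months_demo <- data.frame(month = 1:12)

nov_or_dec  <- filter(months_demo, month == 11 | month == 12)  # November and December
neither     <- filter(months_demo, month != 11 & month != 12)  # January to October
always_true <- filter(months_demo, month != 11 | month != 12)  # every row: no month equals both
never_true  <- filter(months_demo, month == 11 & month == 12)  # no rows: a month cannot be both

# %in% is a compact equivalent of the first condition
nov_or_dec2 <- filter(months_demo, month %in% c(11, 12))
```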

■ Obtain a list of the flights that were NOT delayed by more than an hour on either
arrival or departure.

■ Obtain a list of the flights that were scheduled to depart in February between 5:00am and
9:59am

■ Any ideas on how to check the results?
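
A possible approach, sketched on a hypothetical flights_demo table with invented values. It assumes, as in flights, that delays are in minutes and that sched_dep_time is stored as HHMM (so 5:00am is 500).

```r
library(dplyr)

# Hypothetical miniature stand-in for flights (values invented)
flights_demo <- data.frame(
  month          = c(2, 2, 2, 3, 2),
  sched_dep_time = c(459, 500, 959, 700, 1000),
  dep_delay      = c(5, 70, 10, 0, NA),
  arr_delay      = c(61, 20, -5, 30, 15)
)

# Not delayed by more than 60 minutes on departure or on arrival
on_time <- filter(flights_demo, dep_delay <= 60, arr_delay <= 60)

# Scheduled to depart in February between 5:00am and 9:59am
feb_morning <- filter(flights_demo, month == 2,
                      sched_dep_time >= 500, sched_dep_time <= 959)

# One way to check the result: inspect the range of the filtered column
checked <- range(feb_morning$sched_dep_time)
```

Note that the row with a missing dep_delay is dropped from on_time, because the condition evaluates to NA rather than TRUE.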

■ Always beware of null / unknown / missing / Not Available / NA values

■ is.na always returns TRUE (1) or FALSE (0). Nothing else.

■ filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA
values. If you want to preserve missing values, ask for them explicitly.
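
A minimal sketch of this behaviour on a three-row table (df and its values are made up for illustration):

```r
library(dplyr)

df <- data.frame(x = c(1, NA, 3))

flags   <- is.na(df$x)                   # FALSE TRUE FALSE
kept    <- filter(df, x > 1)             # only the row with x = 3; the NA row is dropped
with_na <- filter(df, is.na(x) | x > 1)  # explicitly keeps the NA row as well
```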

■ Filtering NAs

Arrange
■ arrange() works similarly to filter(). The goal is to reorder the rows (the observations),
not the columns, in ascending order by default.

■ arrange(flights, year, month, day) → sort first by year, then by month, then by day

■ arrange(flights, desc(dep_delay)) → sort by dep_delay in descending order

■ Missing values are always sorted at the end → how can we change this?

■ arrange(flights, desc(is.na(dep_time)), dep_time)


■ Remember is.na returns TRUE (1) or FALSE (0) only, so if you sort descending you will
get the TRUE first, meaning you will get the NA values first.
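
The effect can be sketched on a tiny hypothetical table with one missing dep_time:

```r
library(dplyr)

df <- data.frame(dep_time = c(900, NA, 530))

default_order <- arrange(df, dep_time)        # 530, 900, NA
descending    <- arrange(df, desc(dep_time))  # 900, 530, NA (NA still last)

# Sort on is.na() first so that the missing values come first
na_first <- arrange(df, desc(is.na(dep_time)), dep_time)  # NA, 530, 900
```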

Select
■ Just like “filter” generates a subset of observations (rows), “select” generates a subset
of variables (columns)

■ Rename
■ Keeps all the variables not explicitly mentioned

■ select(…, everything()) → useful, among other things, to reorder variables (e.g. bring
some of them to the beginning of the data frame)
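
A short sketch of rename() and of select() with everything(), using a hypothetical one-row table. The new name tail_num is chosen only for illustration.

```r
library(dplyr)

# Hypothetical one-row stand-in for flights (values invented)
flights_demo <- data.frame(year = 2013, month = 1,
                           tailnum = "N14228", air_time = 227)

# rename(new_name = old_name) keeps every other column untouched
renamed <- rename(flights_demo, tail_num = tailnum)

# Bring air_time to the front; everything() fills in the remaining columns
reordered <- select(flights_demo, air_time, everything())
```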

■ Different ways to use Select
■ Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and
arr_delay from flights.
■ select(flights, dep_time, dep_delay, arr_time, arr_delay)
■ select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")
■ select(flights, 4, 6, 7, 9) #column numbers
■ select(flights, all_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
■ select(flights, any_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
■ select(flights, starts_with("dep_"), starts_with("arr_"))
■ vars <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, any_of(vars))
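
all_of() and any_of() differ in how they treat names that are not in the data frame. The sketch below shows this on a hypothetical table that deliberately lacks arr_delay.

```r
library(dplyr)

# Hypothetical table without an arr_delay column (values invented)
flights_demo <- data.frame(dep_time = 517, dep_delay = 2, arr_time = 830)
vars <- c("dep_time", "dep_delay", "arr_time", "arr_delay")

# any_of() silently ignores names that do not exist
lenient <- select(flights_demo, any_of(vars))

# all_of() raises an error if any name is missing
strict_failed <- inherits(
  try(select(flights_demo, all_of(vars)), silent = TRUE),
  "try-error"
)
```

all_of() is the stricter choice when a missing column should be treated as a mistake.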

Mutate
■ Add new variables (columns) at the end of your data frame

■ You can also add CONSTANTS!!!

■ You can refer to variables you just created

■ Using transmute instead of mutate lets you keep only the newly created variables:
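
The mutate() features described in this section can be sketched together on a hypothetical two-row table. The column names gain, hours, gain_per_hour and dataset are illustrative choices.

```r
library(dplyr)

# Hypothetical stand-in for flights (values invented)
flights_demo <- data.frame(dep_delay = c(10, -5),
                           arr_delay = c(4, -20),
                           air_time  = c(120, 60))

with_new <- mutate(flights_demo,
  gain          = dep_delay - arr_delay,  # new column: 6, 15
  hours         = air_time / 60,          # new column: 2, 1
  gain_per_hour = gain / hours,           # refers to the columns created just above
  dataset       = "demo"                  # a constant, repeated on every row
)

# transmute() keeps only the columns it creates
only_new <- transmute(flights_demo,
  gain  = dep_delay - arr_delay,
  hours = air_time / 60
)
```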

■ Some useful operations (remember that mutate and transmute do not aggregate data, the
number of observations does not change!):
■ Arithmetic operators: +, -, *, /, ^.
■ Modular arithmetic: %/% (integer division) and %% (remainder), where
x == y * (x %/% y) + (x %% y)

■ Logical comparisons, <, <=, >, >=, !=, and ==


■ Find the "previous" (lag()) or "next" (lead()) values in a vector. Useful for comparing
values behind or ahead of the current values.
■ Cumulative and rolling aggregates: running sums, products, mins and maxes:
cumsum(), cumprod(), cummin(), cummax(); and cummean() for cumulative means.
■ Ranking: there are a number of ranking functions, but you should start with min_rank().
It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives the
smallest values the first ranks; use desc(x) to give the largest values the first ranks instead.
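
A few of these operations applied to small vectors. The vectors are invented, and dep_time is assumed to be stored as HHMM, as in flights.

```r
library(dplyr)

# Modular arithmetic: split an HHMM time into hours and minutes
dep_time <- c(517, 1545)
hour     <- dep_time %/% 100   # 5, 15
minute   <- dep_time %% 100    # 17, 45

x <- c(10, 20, 20, 40)

previous <- lag(x)             # NA, 10, 20, 20
following <- lead(x)           # 20, 20, 40, NA
running  <- cumsum(x)          # 10, 30, 50, 90

rank_asc  <- min_rank(x)        # 1, 2, 2, 4 (smallest value ranked first)
rank_desc <- min_rank(desc(x))  # 4, 2, 2, 1 (largest value ranked first)
```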

■ Exercises (in order!)

■ Add a new variable to the flights data frame: a ranking from most to least delayed
flights. Consider the arrival delay. Keep only the date fields, destination, arrival
delay and scheduled departure time

■ Arrange the data frame by the newly created field. In case of ties, sort by
scheduled departure time

■ Add a cumulative delay field.

■ Exercises (in order!): solutions

■ Add a new variable to the flights data frame: a ranking from most to least delayed
flights. Consider the arrival delay. Keep only the date fields, destination, arrival
delay and scheduled departure time
■ a <- transmute(flights, year, month, day, sched_dep_time, dest, arr_delay,
most_delayed = min_rank(desc(arr_delay)))

■ Arrange the data frame by the newly created field. In case of ties, sort by
scheduled departure time
■ b <- arrange(a, most_delayed, sched_dep_time)

■ Add a cumulative delay field.


■ (c <- mutate(b, cumulated_delay = cumsum(arr_delay)))

Summarise
■ Summarise collapses a data frame into a summary of data
■ Used alone, it creates one single row:

■ Generally, we use it with “group_by” to summarize information in groups
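
A minimal sketch of both uses, on a hypothetical four-row table. The summary column name delay is an illustrative choice.

```r
library(dplyr)

# Hypothetical stand-in for flights (values invented)
flights_demo <- data.frame(month     = c(1, 1, 2, 2),
                           dep_delay = c(10, NA, 30, 50))

# Used alone: the whole table collapses into a single row
overall <- summarise(flights_demo, delay = mean(dep_delay, na.rm = TRUE))

# With group_by(): one row per group
by_month <- summarise(group_by(flights_demo, month),
                      delay = mean(dep_delay, na.rm = TRUE))
```

Without na.rm = TRUE, any group containing a missing value would produce NA.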

■ Count: useful to check that you’re not drawing conclusions based on very small
amounts of data
■ n() → Count the number of observations
■ sum(!is.na(x)) → Count the number of non-missing values
■ n_distinct(x) → Count the number of distinct values
■ Counts and proportions of logical values: sum(x > 10), mean(y == 0)
– summarise(flights, n_early = sum(dep_time < 500, na.rm = TRUE))
» Counts the observations for which dep_time is earlier than 5:00am (dep_time < 500);
na.rm = TRUE skips flights with a missing dep_time

■ Exercise: get the number of flights per day.
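
One possible way to approach the exercise, sketched on a hypothetical table and combined with the other counting functions listed above. The output column names are illustrative.

```r
library(dplyr)

# Hypothetical stand-in for flights (values invented)
flights_demo <- data.frame(
  year     = 2013,
  month    = c(1, 1, 1, 2),
  day      = c(1, 1, 2, 1),
  dep_time = c(455, 600, NA, 459),
  dest     = c("IAH", "IAH", "MIA", "ORD")
)

per_day <- summarise(group_by(flights_demo, year, month, day),
  n          = n(),                             # flights per day
  n_departed = sum(!is.na(dep_time)),           # non-missing departure times
  n_dest     = n_distinct(dest),                # distinct destinations
  n_early    = sum(dep_time < 500, na.rm = TRUE)  # departures before 5:00am
)

# count() is a shortcut for group_by() followed by summarise(n = n())
per_day_short <- count(flights_demo, year, month, day)
```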

■ Exercise: what is the relationship between the distance of the destinations and the
average delay for each of them?

■ Tips:
1. Understand the statement
2. Imagine and mentally visualize what you’re trying to achieve
3. Start thinking about the necessary steps to get there
1. Which data will you require? How is it structured?
2. What kind of visualization can help you?

■ 1. Group flights by destination.

■ 2. Summarise to compute distance, average delay, and number of flights.

■ 3. Filter to remove noisy points if necessary (hint: avoid, at least, destinations with
fewer than 20 flights)

■ 4. Plot the results

■ 5. Any outliers you would like to ignore? How could you have detected them? How
can you filter them?

■ 6. Conclusions??
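
Steps 1 to 3 might look like the sketch below. It runs on a hypothetical table with three invented destinations (AAA, BBB, CCC), and the summary column names count, dist and delay are illustrative choices.

```r
library(dplyr)

# Hypothetical stand-in for flights: 25, 30 and 3 flights to three destinations
flights_demo <- data.frame(
  dest      = rep(c("AAA", "BBB", "CCC"), times = c(25, 30, 3)),
  distance  = rep(c(200, 1000, 2500),     times = c(25, 30, 3)),
  arr_delay = c(rep(10, 25), rep(4, 29), NA, rep(90, 3))
)

# 1. Group flights by destination
by_dest <- group_by(flights_demo, dest)

# 2. Summarise: number of flights, distance and average arrival delay
delay <- summarise(by_dest,
  count = n(),
  dist  = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)

# 3. Drop destinations with fewer than 20 flights
delay <- filter(delay, count >= 20)
```

The resulting table has one row per destination and can then be plotted, for example as a scatter plot of dist against delay.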

■ Some of the most common operations with summarise include, but are not limited to:

■ mean(variable) ➔ we’ve seen it before. Computes the mean (average) of the values for each group

■ sum(variable) ➔ sums all the values. The variable must be numeric (or logical), or it won't work.

■ max(variable), min(variable) ➔ gets the maximum or the minimum value

■ The “counts” we saw before (n, n_distinct…)

■ Any other operation you’ve seen before

■ Tourist flats exercise: obtain a table with the number of tourist flats per district and
sort it in descending order. Filter out any results you consider errors. Represent the resulting
information in a bar chart

■ Census exercise: create new columns expressing the % of men, women and foreigners
(ESTRANGERS) for each section. Arrange the resulting table by % of women

■ d<-`2022_09_TAULA_MAP_SCENSAL` ➔ note that the name is wrapped in backticks ` (not ’, not ´)

■ Plot the relationship between the different age ranges and the number of foreigners for
each section. In which age ranges can you find a more positive / negative relationship?
Can you find any explanation for that?
■ If there are any outliers, DELETE THEM from your dataset:

Combining operations with the pipe
■ No pipe:

■ Pipe:
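
The contrast can be sketched as follows, on a hypothetical three-row table. Both versions compute the same summary; the pipe version avoids naming the intermediate result.

```r
library(dplyr)

# Hypothetical stand-in for flights (values invented)
flights_demo <- data.frame(dest      = c("IAH", "IAH", "MIA"),
                           arr_delay = c(10, 20, 30))

# No pipe: each step is stored in an intermediate variable
by_dest <- group_by(flights_demo, dest)
no_pipe <- summarise(by_dest,
                     count = n(),
                     delay = mean(arr_delay, na.rm = TRUE))

# Pipe: x %>% f(y) is read as "take x, then do f(x, y)"
piped <- flights_demo %>%
  group_by(dest) %>%
  summarise(count = n(),
            delay = mean(arr_delay, na.rm = TRUE))
```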

■ Some more examples:
■ Average delays (previous exercise), after first removing cancelled flights

■ Planes that have the highest average delays:
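
Both examples might be written as in the sketch below. It uses a hypothetical table and assumes that a cancelled flight is one whose departure or arrival delay is missing, which is an assumption rather than something stated above.

```r
library(dplyr)

# Hypothetical stand-in for flights (values invented)
flights_demo <- data.frame(
  tailnum   = c("N1", "N1", "N2", "N2", "N3"),
  dep_delay = c(5, NA, 60, 80, 0),
  arr_delay = c(10, NA, 50, 90, -5)
)

# Assumption: a flight with missing delays was cancelled
not_cancelled <- flights_demo %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))

# Average arrival delay per plane, highest first
plane_delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(n = n(), delay = mean(arr_delay)) %>%
  arrange(desc(delay))
```

Removing the cancelled flights first means na.rm = TRUE is no longer needed in every summary.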

■ Always keep in mind the data frame you’re generating!
■ At this point you have a table with three variables: tailnum, n, delay

■ Plot your results in order to understand them!!

■ Does this make sense? Why?

Summarise
■ Measures of spread: sd(x), IQR(x), mad(x). The root mean squared deviation, or
standard deviation sd(x), is the standard measure of spread. The interquartile range
IQR(x) and median absolute deviation mad(x) are robust equivalents that may be more
useful if you have outliers.

■ Measures of rank: min(x), quantile(x, 0.25)
■ Measures of position: first(x), nth(x, 2), last(x)
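
A small sketch of these summary functions on an invented vector with one outlier (100). The same calls can be used inside summarise().

```r
library(dplyr)

x <- c(2, 4, 4, 5, 100)

# Measures of spread: sd() is pulled up by the outlier, IQR() and mad() much less so
spread_sd  <- sd(x)
spread_iqr <- IQR(x)   # 1
spread_mad <- mad(x)

# Measures of rank
lowest   <- min(x)             # 2
lower_q  <- quantile(x, 0.25)  # value below which 25% of the data lie
highest  <- max(x)             # 100

# Measures of position
first_value  <- first(x)    # 2
second_value <- nth(x, 2)   # 4
last_value   <- last(x)     # 100
```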
