Data Transformation 1 Reviewed
Data Transformation
■ View(flights)
■ glimpse(flights)
■ ?flights
– The core dplyr verbs (filter(), arrange(), select(), mutate() and summarise()) can all be used in
conjunction with group_by(), which changes the scope of each function from operating on the entire
dataset to operating on it group-by-group. Together, these six functions provide the verbs for
a language of data manipulation.
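To illustrate how group_by() changes the scope of a verb, here is a minimal sketch. The data frame `toy` and its values are made up for illustration; it only mimics two columns of flights.

```r
library(dplyr)

# Hypothetical stand-in for two columns of flights (values are invented)
toy <- tibble(
  month     = c(1, 1, 2, 2),
  dep_delay = c(10, 20, 0, 40)
)

# Without group_by(): one summary row for the whole dataset
overall <- summarise(toy, mean_delay = mean(dep_delay))

# With group_by(): one summary row per month
by_month <- summarise(group_by(toy, month), mean_delay = mean(dep_delay))
```

The same summarise() call returns a single row in the first case and one row per group in the second.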
– Remember to assign the result to a variable if you want to keep it for further use.
■ To assign and print at the same time, wrap the whole assignment in parentheses:
■ (jan1 <- filter(flights, month == 1, day == 1))
■ Comparison operators:
■ ==, >, >=, <, <=, !=
■ For floating-point numbers, use near(<parameter 1>, <parameter 2>) instead of ==, since
near() compares within a small tolerance
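A short sketch of why near() can be preferable to == when comparing computed numbers (the variable names are only illustrative):

```r
library(dplyr)

# Floating-point arithmetic is approximate, so == can fail unexpectedly
exact  <- sqrt(2)^2 == 2        # FALSE: the product is not exactly 2
approx <- near(sqrt(2)^2, 2)    # TRUE: equal within a small tolerance
```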
■ Obtain a list of flights that were scheduled to depart in February between 5:00am and
9:59am
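One possible way to approach this exercise, sketched on a small made-up data frame (`toy` is hypothetical). It relies on sched_dep_time being stored as an integer in HHMM form, as it is in flights, so 5:00am is 500 and 9:59am is 959.

```r
library(dplyr)

# Hypothetical rows mimicking two columns of flights
toy <- tibble(
  month          = c(2, 2, 2, 3),
  sched_dep_time = c(459, 500, 959, 600)
)

# Multiple conditions in filter() are combined with "and"
feb_morning <- filter(toy, month == 2,
                      sched_dep_time >= 500, sched_dep_time <= 959)
```

On the real data, replacing `toy` with flights should give the requested list; between(sched_dep_time, 500, 959) is an equivalent way to write the two time conditions.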
■ filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA
values. If you want to preserve missing values, ask for them explicitly.
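A minimal sketch of how missing values behave in filter(), using a tiny made-up data frame:

```r
library(dplyr)

df <- tibble(x = c(1, NA, 3))

# NA > 1 evaluates to NA, so the NA row is dropped along with the FALSE row
without_na <- filter(df, x > 1)

# Asking for missing values explicitly keeps them
with_na <- filter(df, is.na(x) | x > 1)
```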
■ arrange(flights, year, month, day) → sort first by year, then by month, then by day
■ Missing values are always sorted at the end → how can we change this?
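One possible answer to the question above, as a sketch on a made-up data frame: sort first on whether the value is missing, then on the value itself.

```r
library(dplyr)

df <- tibble(x = c(5, 2, NA))

# Default behaviour: NA goes last, for ascending and descending order alike
na_last_asc  <- arrange(df, x)
na_last_desc <- arrange(df, desc(x))

# is.na(x) is TRUE for missing rows; sorting it in descending order puts them first
na_first <- arrange(df, desc(is.na(x)), x)
```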
■ Add a new variable to the flights data frame: a ranking from most to least delayed
flights, based on arrival delay. Keep only the date fields, destination, arrival
delay and scheduled departure time
■ a <- transmute(flights, year, month, day, sched_dep_time, dest, arr_delay,
most_delayed = min_rank(desc(arr_delay)))
■ Arrange the data frame by the newly created field. Break ties by scheduled
departure time
■ b <- arrange(a, most_delayed, sched_dep_time)
■ Tips:
1. Understand the statement
2. Imagine and mentally visualize what you’re trying to achieve
3. Start thinking about the necessary steps to get there
1. Which data will you require? How is it structured?
2. What kind of visualization can help you?
■ 3. Filter to remove noisy points if necessary (hint: exclude, at least, destinations with
fewer than 20 flights)
■ 5. Any outliers you would like to ignore? How could you have detected them? How
can you filter them?
■ 6. Conclusions?
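The filtering hint in step 3 could be sketched as follows. The exercise statement is not reproduced in these slides, so this assumes a per-destination summary of arrival delay; the data frame `toy`, the destination codes and the column `mean_delay` are hypothetical.

```r
library(dplyr)

# Hypothetical data: one busy destination and one with very few flights
toy <- tibble(
  dest      = c(rep("AAA", 25), rep("BBB", 3)),
  arr_delay = c(rep(10, 25), 300, NA, 0)
)

by_dest <- toy %>%
  group_by(dest) %>%
  summarise(
    count      = n(),                           # flights per destination
    mean_delay = mean(arr_delay, na.rm = TRUE)  # ignore missing delays
  ) %>%
  filter(count >= 20)                           # drop destinations with fewer than 20 flights
```

Keeping the count next to the summary makes it easy to see how much data each point is based on before plotting it.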
■ sum(variable) ➔ sums all the values. The variable must be numeric (or logical, in which case each TRUE counts as 1); other types, such as character, raise an error.
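A brief sketch of sum() on a made-up vector, including how missing values affect it:

```r
values <- c(1, 2, NA, 4)

total_na <- sum(values)                 # NA: a missing value makes the sum missing
total    <- sum(values, na.rm = TRUE)   # 7: missing values are dropped first
n_big    <- sum(values > 1, na.rm = TRUE)  # 2: counts the TRUE values

# Character input is not summable and raises an error
char_fails <- inherits(tryCatch(sum("a"), error = function(e) e), "error")
```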
■ Plot the relationship between the different age ranges and the number of foreigners for
each section. In which age ranges can you find a more positive / negative relationship?
Can you find any explanation for that?
■ If there are any outliers, DELETE THEM from your dataset:
■ Pipe:
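As an illustration of the pipe, the ranking exercise earlier in these slides (the intermediate variables a and b) could be written as a single chain. This is a sketch on a made-up data frame; `toy`, its values and the destination codes are hypothetical.

```r
library(dplyr)

# Hypothetical rows mimicking the flights columns used in the exercise
toy <- tibble(
  year = 2013, month = 1, day = 1,
  sched_dep_time = c(600, 515, 700),
  dest           = c("AAA", "BBB", "CCC"),
  arr_delay      = c(5, 40, 40)
)

# x %>% f(y) is equivalent to f(x, y), so each step feeds the next
ranked <- toy %>%
  transmute(year, month, day, sched_dep_time, dest, arr_delay,
            most_delayed = min_rank(desc(arr_delay))) %>%
  arrange(most_delayed, sched_dep_time)
```

The chain reads top to bottom in the order the steps are applied, and no intermediate variables are needed.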