The document discusses the dplyr package in R which provides functions for data manipulation. It introduces common verbs like filter(), arrange(), select(), distinct(), mutate(), and summarise(). It explains how these can be used to manipulate and transform data frames. It also discusses using group_by() to perform grouped operations, applying verbs by groups within the data.
Introduction to dplyr

This material is adapted from https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html, a step-by-step introduction to the dplyr package, which is very useful for manipulating data. Compiling that webpage into PowerPoint slides makes the material easier to present in a lecture.

When working with data you must:
- Figure out what you want to do.
- Describe those tasks in the form of a computer program.
- Execute the program.

The dplyr package makes these steps fast and easy:
- By constraining your options, it simplifies how you can think about common data manipulation tasks.
- It provides simple "verbs", functions that correspond to the most common data manipulation tasks, to help you translate those thoughts into code.
- It uses efficient data storage backends, so you spend less time waiting for the computer.

dplyr's basic set of tools

Databases: besides in-memory data frames, dplyr also connects to out-of-memory, remote databases. By translating your R code into the appropriate SQL, it allows you to work with both types of data using the same set of tools.

benchmark-baseball: see how dplyr compares to other tools for data manipulation on a realistic use case.

window-functions: a window function is a variation on an aggregation function. Where an aggregate function uses n inputs to produce 1 output, a window function uses n inputs to produce n outputs.

Data: nycflights13

To explore the basic data manipulation verbs of dplyr, we'll start with the flights data frame from the nycflights13 package. This dataset contains all 336776 flights that departed from New York City in 2013.
The data comes from the US Bureau of Transportation Statistics, and is documented in ?nycflights13.

Prepare your nycflights13 data

    install.packages("nycflights13")
    library(nycflights13)
    library(dplyr)   # needed for the verbs used below
    dim(flights)
    head(flights)

Large data: tbl_df

dplyr can work with data frames as is, but if you're dealing with large data, it's worthwhile to convert them to a tbl_df: this is a wrapper around a data frame that won't accidentally print a lot of data to the screen.

SINGLE TABLE VERBS

dplyr aims to provide a function for each basic verb of data manipulation:
- filter() (and slice())
- arrange()
- select() (and rename())
- distinct()
- mutate() (and transmute())
- summarise()
- sample_n() and sample_frac()

If you've used plyr before, many of these will be familiar.

Filter rows with filter()

filter() allows you to select a subset of rows in a data frame. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on January 1st with:

    filter(flights, month == 1, day == 1)
This is equivalent to the more verbose code in base R:
    flights[flights$month == 1 & flights$day == 1, ]

filter() vs. subset()

filter() works similarly to subset() except that you can give it any number of filtering conditions, which are joined together with & (not &&, which is easy to type accidentally!). You can also use other boolean operators:

    filter(flights, month == 1 | month == 2)

slice()

To select rows by position, use slice():

    slice(flights, 1:10)

Arrange rows with arrange()

arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column is used to break ties in the values of the preceding columns:

    arrange(flights, year, month, day)

arrange() and desc()

Use desc() to order a column in descending order:

    arrange(flights, desc(arr_delay))

The two arrange() calls above are equivalent to the following base R code:

    flights[order(flights$year, flights$month, flights$day), ]
    flights[order(flights$arr_delay, decreasing = TRUE), ]

Select columns with select()

Often you work with large datasets with many columns but only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:

    # Select columns by name
    select(flights, year, month, day)
    # Select all columns between year and day (inclusive)
    select(flights, year:day)
    # Select all columns except those from year to day (inclusive)
    select(flights, -(year:day))

There are a number of helper functions you can use within select(), like starts_with(), ends_with(), matches() and contains(). These let you quickly match larger blocks of variables that meet some criterion. See ?select for more details.
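The select() helpers named above can be tried on a small example. The following is a minimal sketch, not part of the original slides: the toy data frame is hypothetical and only borrows a few column names from flights, with made-up values, so that it runs without nycflights13.

```r
library(dplyr)

# Hypothetical toy data frame (assumption): column names mimic flights,
# values are invented for illustration only.
toy <- data.frame(
  year      = 2013L,
  month     = 1L,
  day       = 1:3,
  dep_time  = c(517L, 533L, 542L),
  dep_delay = c(2, 4, 2),
  arr_time  = c(830L, 850L, 923L),
  arr_delay = c(11, 20, 33),
  tailnum   = c("N14228", "N24211", "N619AA")
)

# Columns whose names end with "delay"
delay_cols <- select(toy, ends_with("delay"))

# Columns whose names start with "dep"
dep_cols <- select(toy, starts_with("dep"))

# Columns whose names contain the literal string "time"
time_cols <- select(toy, contains("time"))

# Columns whose names match a regular expression
regex_cols <- select(toy, matches("^(dep|arr)_"))

names(delay_cols)
names(regex_cols)
```

Each helper returns the matching columns in their original order, so the same calls should carry over to flights unchanged.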
Rename columns with rename()

You can rename variables with rename() by using named arguments:

    rename(flights, tail_num = tailnum)

Extract distinct (unique) rows

A common use of select() is to find the values of a set of variables. This is particularly useful in conjunction with the distinct() verb, which returns only the unique values in a table.

    distinct(select(flights, tailnum))
    distinct(select(flights, origin, dest))
This is very similar to base::unique() but should be much faster.

Add new columns with mutate()

Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of mutate():

    mutate(flights,
      gain = arr_delay - dep_delay,
      speed = distance / air_time * 60)

mutate() allows you to refer to columns that you've just created:

    mutate(flights,
      gain = arr_delay - dep_delay,
      gain_per_hour = gain / (air_time / 60)
    )

transmute()

If you only want to keep the new variables, use transmute():

    transmute(flights,
      gain = arr_delay - dep_delay,
      gain_per_hour = gain / (air_time / 60)
    )

Summarise values with summarise()

The last verb is summarise(). It collapses a data frame to a single row:

    summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

Randomly sample rows with sample_n() and sample_frac()

You can use sample_n() and sample_frac() to take a random sample of rows: use sample_n() for a fixed number and sample_frac() for a fixed fraction.

    sample_n(flights, 10)
    sample_frac(flights, 0.01)
Use replace = TRUE to perform a bootstrap sample. If needed, you can weight the sample with the weight argument.

Commonalities

You may have noticed that the syntax and function of all these verbs are very similar:
- The first argument is a data frame.
- The subsequent arguments describe what to do with the data frame. Notice that you can refer to columns in the data frame directly without using $.
- The result is a new data frame.

Together these properties make it easy to chain together multiple simple steps to achieve a complex result.

At the most basic level, you can only alter a tidy data frame in five useful ways:
- reorder the rows (arrange())
- pick observations and variables of interest (filter() and select())
- add new variables that are functions of existing variables (mutate())
- collapse many values to a summary (summarise())

The remainder of the language comes from applying the five functions to different types of data, for example to grouped data.

GROUPED OPERATIONS

These verbs are useful on their own, but they become really powerful when you apply them to groups of observations within a dataset. In dplyr, you do this with the group_by() function. It breaks down a dataset into specified groups of rows. When you then apply the verbs above to the resulting object, they are automatically applied by group. Most importantly, all this is achieved by using exactly the same syntax you'd use with an ungrouped object.

In the following example, we split the complete dataset into individual planes and then summarise each plane by counting the number of flights (count = n()) and computing the average distance (dist = mean(distance, na.rm = TRUE)) and arrival delay (delay = mean(arr_delay, na.rm = TRUE)). We then use ggplot2 to display the output.
    library(ggplot2)   # needed for the plot below

    by_tailnum <- group_by(flights, tailnum)
    delay <- summarise(by_tailnum,
      count = n(),
      dist = mean(distance, na.rm = TRUE),
      delay = mean(arr_delay, na.rm = TRUE))
    delay <- filter(delay, count > 20, dist < 2000)

    # Interestingly, the average delay is only slightly related to the
    # average distance flown by a plane.
    ggplot(delay, aes(dist, delay)) +
      geom_point(aes(size = count), alpha = 1/2) +
      geom_smooth() +
      scale_size_area()

You use summarise() with aggregate functions, which take a vector of values and return a single number. There are many useful examples of such functions in base R, like min(), max(), mean(), sum(), sd(), median(), and IQR(). dplyr provides a handful of others:
- n(): the number of observations in the current group.
- n_distinct(x): the number of unique values in x.
- first(x), last(x) and nth(x, n): these work similarly to x[1], x[length(x)], and x[n] but give you more control over the result if the value is missing.

For example, we could use these to find the number of planes and the number of flights that go to each possible destination:

    destinations <- group_by(flights, dest)
    summarise(destinations,
      planes = n_distinct(tailnum),
      flights = n()
    )

You can save the result in a new variable.
Check the type of the t object.
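Saving and inspecting a grouped summary might look like the sketch below. It is an illustration under stated assumptions rather than the code from the original slides: the toy data frame is hypothetical with made-up values, and t is simply the variable name the walkthrough uses for the saved result.

```r
library(dplyr)

# Hypothetical toy data (assumption) standing in for flights.
toy <- data.frame(
  dest    = c("ATL", "ATL", "BOS", "BOS", "BOS"),
  tailnum = c("N1", "N2", "N3", "N3", "N4")
)

destinations <- group_by(toy, dest)

# Save the summary in a new variable instead of only printing it.
t <- summarise(destinations,
  planes  = n_distinct(tailnum),
  flights = n()
)

# Inspect the type of the saved object; summarise() returns a tibble,
# so the class vector includes "tbl_df" as well as "data.frame".
class(t)

# Convert to a plain data.frame, which prints every row.
as.data.frame(t)
```

Naming the variable t shadows the name of base R's transpose function, though calls such as t(m) still resolve to the function; a more descriptive name may be preferable outside a short demo.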
To see all of the data, convert the t object to a data.frame.

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset.

Chaining

The dplyr API is functional in the sense that function calls don't have side-effects. You must always save their results. This doesn't lead to particularly elegant code, especially if you want to do many operations at once. You either have to do it step-by-step, saving each intermediate result, or, if you don't want to save the intermediate results, wrap the function calls inside each other. The nested form is difficult to read because the order of the operations is from inside to out, so the arguments are a long way away from the function.

To get around this problem, dplyr provides the %>% operator. x %>% f(y) turns into f(x, y), so you can use it to rewrite multiple operations as a new expression that reads left-to-right, top-to-bottom.
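The three styles discussed in the chaining section can be compared side by side. This is a minimal sketch under stated assumptions: the toy data frame is hypothetical (made-up values, column names borrowed from flights), and the 30-minute threshold is chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data (assumption); values are invented.
toy <- data.frame(
  year      = 2013L,
  month     = 1L,
  day       = c(1L, 1L, 2L, 2L),
  arr_delay = c(40, 50, 5, NA),
  dep_delay = c(35, 45, 0, 10)
)

# 1. Step by step, saving every intermediate result.
a1 <- group_by(toy, year, month, day)
a2 <- summarise(a1,
  arr = mean(arr_delay, na.rm = TRUE),
  dep = mean(dep_delay, na.rm = TRUE)
)
stepwise <- filter(a2, arr > 30 | dep > 30)

# 2. Nested calls: no intermediates, but read from the inside out.
nested <- filter(
  summarise(
    group_by(toy, year, month, day),
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)

# 3. Piped with %>%: x %>% f(y) is f(x, y), so the steps read top to bottom.
piped <- toy %>%
  group_by(year, month, day) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

piped
```

All three forms compute the same per-day averages and keep only the days above the threshold; the piped version differs only in how the code reads. Depending on the dplyr version, summarise() may print an informational message about the grouping of its output, which does not affect the result.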