Lab4 Instructions
Preprocessing
Dr. Derrick L. Cogburn
2024-09-14
Table of contents
Lab Overview
    Technical Learning Objectives
    Business Learning Objectives
Assignment Overview
Pre-Lab Instructions
    Creating Your Lab 4 Project in RStudio
    Installing Packages, Importing and Exploring Data
    Load the required libraries
Lab Instructions
    Lab Instructions (to be completed before, during, or after the synchronous class)
Lab Overview
Technical Learning Objectives
1. Deepen your understanding of conducting data wrangling and data analysis using the
tidyverse.
2. Understand how data wrangling fits into the ETL workflow.
3. Understand how to reshape, combine, and subset data, including grouping and making
new variables.
4. Deal with missing data.
Business Learning Objectives
1. Understand how to transform data for analytics using the major functions used for data
manipulation.
2. Understand how ETL fits into big data and analytics in business decisions.
3. Understand the importance of data wrangling to data analytics.
4. Understand when and how to preprocess and prepare data for analysis, modeling, and
visualization.
Assignment Overview
For this lab, we will focus on data wrangling, the process of transforming and mapping data
from one form into another. The goal of data wrangling, also known as data munging, is to
make your data more appropriate for your analysis, and more valuable for a variety of
downstream applications. There are four main parts to the lab:
- Part 1 will focus on the six key dplyr verbs for data manipulation.
- Part 2 will spend a little time exploring how dplyr handles missing data.
- Part 3 will apply these techniques to portions of real text mining projects.
- Part 4 will provide an overview of data wrangling in Python.
Pre-Lab Instructions
Creating Your Lab 4 Project in RStudio
In RStudio, create a project for Lab 4. Create a new Quarto document with a relevant title for
the lab, for example: “Lab 4: Data Wrangling: Cleaning, Preparation, and Tidying Data”.
Now begin working your way through the Lab 4 instructions, or wait until class on Wednesday.
As you work through the instructions, I continue to encourage you to take a literate programming
approach, using the space preceding a code chunk to explain, in narrative terms, what
you are doing in the code chunk below.
Also, please remember that this and all subsequent labs need to be submitted as rendered .pdf
files, with your Quarto YAML header set to:
echo: true
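A minimal YAML header consistent with this setting might look like the following (the title here is just a placeholder for your own):

---
title: "Lab 4: Data Wrangling"
format: pdf
execute:
  echo: true
---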
This setting will show your work (code, output, and plots). The Canvas assignment submission
for Lab 4 and all subsequent labs will be restricted to a .pdf file format only (and that must be
a rendered .pdf file of your Quarto document, not an html file saved as a pdf). If you are
having problems with the .pdf rendering, please let me know. If you are facing the deadline for
submitting the assignment, you may comment out (with #) the sections of your file that are causing
the rendering problems, and submit the assignment with a note about the rendering issue.
For this lab you will be installing one new R package, nycflights13. In your lab Quarto
document, please install the package nycflights13 in an R code chunk below.
#install.packages("nycflights13")
Then load the required libraries: rvest, tm, readr, tm.plugin.mail, Rcrawler, RSelenium,
xml2, tidyverse, tidytext, and nycflights13, as in the sketch below.
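A minimal library-loading chunk for these packages (assuming they were all installed in earlier labs) might look like:

library(rvest)
library(tm)
library(readr)
library(tm.plugin.mail)
library(Rcrawler)
library(RSelenium)
library(xml2)
library(tidyverse)
library(tidytext)
library(nycflights13)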
The following object is masked from 'package:rvest':
guess_encoding
Please review the Posit cheat sheet on data wrangling in dplyr and tidyr:
https://rstudio.github.io/cheatsheets/tidyr.pdf
and the dplyr Cheat Sheet:
https://rstudio.github.io/cheatsheets/data-transformation.pdf.
*** End of Pre-Lab ***
Lab Instructions
Lab Instructions (to be completed before, during, or after the synchronous class):
In this lab we are going to focus heavily on data wrangling in the tidyverse environment,
including tidytext. We will start by illustrating the key verbs of data manipulation using
numeric data, and then move on to applying those concepts to textual data.

Part 1 will focus on the six key verbs of data manipulation found within the dplyr package.
The dplyr package, developed by RStudio, is an extremely powerful ecosystem for data
manipulation. The main dplyr functions for data manipulation are:

1. filter() - pick observations by their values
2. arrange() - reorder rows
3. select() - pick variables by name
4. mutate() - create new variables
5. summarize() - collapse many values down to a single summary
6. group_by() - can be used in conjunction with each of these five functions to change the
scope from operating on the entire dataset, to operating on it group by group.
These verbs/functions all work similarly:
1. The first argument is a data frame.
2. Subsequent arguments describe what to do with the data frame, referencing variable names without quotes.
3. The result is a new data frame.
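For example, the common pattern looks like this (a minimal sketch using filter() on the built-in iris data):

filter(iris, Species == "setosa")  # data frame first, then unquoted variables; returns a new data frame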
Let's start by reviewing how these functions allow you to manipulate data. We will start with
a popular built-in dataset called iris. One note: as you work, keep R resources handy, such
as the Data Wrangling with dplyr and tidyr cheat sheets.
Now use the as_tibble function to convert the built-in iris data into a tibble.
as_tibble(iris)
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# i 140 more rows
Now that you have converted the iris data into a tibble, you can use the dplyr functions to
manipulate the data.
Use the glimpse() function to get an information rich summary of tbl data like iris.
glimpse(iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.~
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.~
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.~
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.~
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s~
To view the entire dataset, you may use the View() function in the console (or in an R script),
but you should not use the View() function in any Quarto or RMarkdown document.
Deliverable 1: Call the iris dataset, then use the group_by() function to group the
iris data by the variable Species, then use the summarize() function with (avg =
mean(Sepal.Width)) in the argument, and then arrange by the average using the
arrange() function with avg in the argument.
iris %>%
group_by(Species) %>%
summarise(avg = mean(Sepal.Width)) %>%
arrange(avg)
# A tibble: 3 x 2
Species avg
<fct> <dbl>
1 versicolor 2.77
2 virginica 2.97
3 setosa 3.43
You may extract rows that meet logical criteria. Use the filter() function to filter iris for data
with a sepal length greater than 7. Hint: use (iris, Sepal.Length > 7) in the argument.
filter(iris, Sepal.Length > 7)
6 7.7 2.6 6.9 2.3 virginica
7 7.7 2.8 6.7 2.0 virginica
8 7.2 3.2 6.0 1.8 virginica
9 7.2 3.0 5.8 1.6 virginica
10 7.4 2.8 6.1 1.9 virginica
11 7.9 3.8 6.4 2.0 virginica
12 7.7 3.0 6.1 2.3 virginica
Use the distinct() function to remove duplicate rows from the iris dataset.
distinct(iris)
30 4.7 3.2 1.6 0.2 setosa
31 4.8 3.1 1.6 0.2 setosa
32 5.4 3.4 1.5 0.4 setosa
33 5.2 4.1 1.5 0.1 setosa
34 5.5 4.2 1.4 0.2 setosa
35 4.9 3.1 1.5 0.2 setosa
36 5.0 3.2 1.2 0.2 setosa
37 5.5 3.5 1.3 0.2 setosa
38 4.9 3.6 1.4 0.1 setosa
39 4.4 3.0 1.3 0.2 setosa
40 5.1 3.4 1.5 0.2 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
43 4.4 3.2 1.3 0.2 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
47 5.1 3.8 1.6 0.2 setosa
48 4.6 3.2 1.4 0.2 setosa
49 5.3 3.7 1.5 0.2 setosa
50 5.0 3.3 1.4 0.2 setosa
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
56 5.7 2.8 4.5 1.3 versicolor
57 6.3 3.3 4.7 1.6 versicolor
58 4.9 2.4 3.3 1.0 versicolor
59 6.6 2.9 4.6 1.3 versicolor
60 5.2 2.7 3.9 1.4 versicolor
61 5.0 2.0 3.5 1.0 versicolor
62 5.9 3.0 4.2 1.5 versicolor
63 6.0 2.2 4.0 1.0 versicolor
64 6.1 2.9 4.7 1.4 versicolor
65 5.6 2.9 3.6 1.3 versicolor
66 6.7 3.1 4.4 1.4 versicolor
67 5.6 3.0 4.5 1.5 versicolor
68 5.8 2.7 4.1 1.0 versicolor
69 6.2 2.2 4.5 1.5 versicolor
70 5.6 2.5 3.9 1.1 versicolor
71 5.9 3.2 4.8 1.8 versicolor
72 6.1 2.8 4.0 1.3 versicolor
73 6.3 2.5 4.9 1.5 versicolor
74 6.1 2.8 4.7 1.2 versicolor
75 6.4 2.9 4.3 1.3 versicolor
76 6.6 3.0 4.4 1.4 versicolor
77 6.8 2.8 4.8 1.4 versicolor
78 6.7 3.0 5.0 1.7 versicolor
79 6.0 2.9 4.5 1.5 versicolor
80 5.7 2.6 3.5 1.0 versicolor
81 5.5 2.4 3.8 1.1 versicolor
82 5.5 2.4 3.7 1.0 versicolor
83 5.8 2.7 3.9 1.2 versicolor
84 6.0 2.7 5.1 1.6 versicolor
85 5.4 3.0 4.5 1.5 versicolor
86 6.0 3.4 4.5 1.6 versicolor
87 6.7 3.1 4.7 1.5 versicolor
88 6.3 2.3 4.4 1.3 versicolor
89 5.6 3.0 4.1 1.3 versicolor
90 5.5 2.5 4.0 1.3 versicolor
91 5.5 2.6 4.4 1.2 versicolor
92 6.1 3.0 4.6 1.4 versicolor
93 5.8 2.6 4.0 1.2 versicolor
94 5.0 2.3 3.3 1.0 versicolor
95 5.6 2.7 4.2 1.3 versicolor
96 5.7 3.0 4.2 1.2 versicolor
97 5.7 2.9 4.2 1.3 versicolor
98 6.2 2.9 4.3 1.3 versicolor
99 5.1 2.5 3.0 1.1 versicolor
100 5.7 2.8 4.1 1.3 versicolor
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica
104 6.3 2.9 5.6 1.8 virginica
105 6.5 3.0 5.8 2.2 virginica
106 7.6 3.0 6.6 2.1 virginica
107 4.9 2.5 4.5 1.7 virginica
108 7.3 2.9 6.3 1.8 virginica
109 6.7 2.5 5.8 1.8 virginica
110 7.2 3.6 6.1 2.5 virginica
111 6.5 3.2 5.1 2.0 virginica
112 6.4 2.7 5.3 1.9 virginica
113 6.8 3.0 5.5 2.1 virginica
114 5.7 2.5 5.0 2.0 virginica
115 5.8 2.8 5.1 2.4 virginica
116 6.4 3.2 5.3 2.3 virginica
117 6.5 3.0 5.5 1.8 virginica
118 7.7 3.8 6.7 2.2 virginica
119 7.7 2.6 6.9 2.3 virginica
120 6.0 2.2 5.0 1.5 virginica
121 6.9 3.2 5.7 2.3 virginica
122 5.6 2.8 4.9 2.0 virginica
123 7.7 2.8 6.7 2.0 virginica
124 6.3 2.7 4.9 1.8 virginica
125 6.7 3.3 5.7 2.1 virginica
126 7.2 3.2 6.0 1.8 virginica
127 6.2 2.8 4.8 1.8 virginica
128 6.1 3.0 4.9 1.8 virginica
129 6.4 2.8 5.6 2.1 virginica
130 7.2 3.0 5.8 1.6 virginica
131 7.4 2.8 6.1 1.9 virginica
132 7.9 3.8 6.4 2.0 virginica
133 6.4 2.8 5.6 2.2 virginica
134 6.3 2.8 5.1 1.5 virginica
135 6.1 2.6 5.6 1.4 virginica
136 7.7 3.0 6.1 2.3 virginica
137 6.3 3.4 5.6 2.4 virginica
138 6.4 3.1 5.5 1.8 virginica
139 6.0 3.0 4.8 1.8 virginica
140 6.9 3.1 5.4 2.1 virginica
141 6.7 3.1 5.6 2.4 virginica
142 6.9 3.1 5.1 2.3 virginica
143 6.8 3.2 5.9 2.3 virginica
144 6.7 3.3 5.7 2.5 virginica
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
Deliverable 2: Randomly select a fraction of 0.5 rows from the iris dataset
Use the sample_frac() function to randomly select a fraction of 0.5 rows from the iris dataset.
In the argument, use replace = TRUE.
sample_frac(iris, 0.5, replace = TRUE)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.0 2.0 3.5 1.0 versicolor
2 5.7 2.8 4.1 1.3 versicolor
3 5.8 2.6 4.0 1.2 versicolor
4 6.4 2.7 5.3 1.9 virginica
5 6.0 2.2 5.0 1.5 virginica
6 5.7 2.8 4.5 1.3 versicolor
7 6.0 3.4 4.5 1.6 versicolor
8 7.9 3.8 6.4 2.0 virginica
9 6.8 3.2 5.9 2.3 virginica
10 5.8 2.7 4.1 1.0 versicolor
11 6.0 2.2 4.0 1.0 versicolor
12 7.6 3.0 6.6 2.1 virginica
13 4.6 3.4 1.4 0.3 setosa
14 6.5 3.0 5.8 2.2 virginica
15 6.4 2.8 5.6 2.2 virginica
16 5.6 2.8 4.9 2.0 virginica
17 5.1 2.5 3.0 1.1 versicolor
18 6.4 3.2 5.3 2.3 virginica
19 6.2 2.8 4.8 1.8 virginica
20 6.3 2.3 4.4 1.3 versicolor
21 5.8 2.7 5.1 1.9 virginica
22 6.0 3.0 4.8 1.8 virginica
23 5.8 2.7 5.1 1.9 virginica
24 7.7 2.8 6.7 2.0 virginica
25 5.1 3.8 1.6 0.2 setosa
26 5.9 3.0 5.1 1.8 virginica
27 4.9 3.1 1.5 0.2 setosa
28 4.8 3.1 1.6 0.2 setosa
29 6.4 2.8 5.6 2.2 virginica
30 4.8 3.0 1.4 0.3 setosa
31 5.9 3.0 5.1 1.8 virginica
32 5.8 4.0 1.2 0.2 setosa
33 5.6 3.0 4.5 1.5 versicolor
34 5.8 2.7 5.1 1.9 virginica
35 5.7 3.8 1.7 0.3 setosa
36 6.7 3.1 5.6 2.4 virginica
37 6.2 2.8 4.8 1.8 virginica
38 6.4 2.9 4.3 1.3 versicolor
39 5.2 3.4 1.4 0.2 setosa
40 5.6 3.0 4.5 1.5 versicolor
41 7.9 3.8 6.4 2.0 virginica
42 5.2 3.4 1.4 0.2 setosa
43 5.0 3.2 1.2 0.2 setosa
44 4.6 3.6 1.0 0.2 setosa
45 6.9 3.1 5.1 2.3 virginica
46 6.0 2.2 4.0 1.0 versicolor
47 4.9 3.1 1.5 0.1 setosa
48 4.4 2.9 1.4 0.2 setosa
49 5.5 2.4 3.8 1.1 versicolor
50 6.8 3.0 5.5 2.1 virginica
51 6.1 2.8 4.0 1.3 versicolor
52 5.9 3.0 4.2 1.5 versicolor
53 5.4 3.0 4.5 1.5 versicolor
54 6.2 3.4 5.4 2.3 virginica
55 5.7 2.5 5.0 2.0 virginica
56 4.8 3.0 1.4 0.3 setosa
57 5.1 3.5 1.4 0.3 setosa
58 5.4 3.9 1.3 0.4 setosa
59 5.8 2.7 5.1 1.9 virginica
60 6.4 2.7 5.3 1.9 virginica
61 6.0 3.4 4.5 1.6 versicolor
62 5.0 3.2 1.2 0.2 setosa
63 5.0 2.0 3.5 1.0 versicolor
64 5.2 3.5 1.5 0.2 setosa
65 5.1 3.8 1.9 0.4 setosa
66 6.9 3.1 5.1 2.3 virginica
67 4.4 3.2 1.3 0.2 setosa
68 5.0 2.3 3.3 1.0 versicolor
69 6.7 3.1 4.4 1.4 versicolor
70 5.0 2.0 3.5 1.0 versicolor
71 5.7 2.8 4.1 1.3 versicolor
72 4.6 3.1 1.5 0.2 setosa
73 6.0 2.7 5.1 1.6 versicolor
74 6.3 2.9 5.6 1.8 virginica
75 5.1 3.5 1.4 0.3 setosa
Use the sample_n() function to randomly select a specified (n) number of rows in iris. In the
argument, use replace = TRUE.
sample_n(iris, 10, replace = TRUE)
3 6.2 2.2 4.5 1.5 versicolor
4 6.6 3.0 4.4 1.4 versicolor
5 6.5 3.2 5.1 2.0 virginica
6 4.9 3.1 1.5 0.1 setosa
7 7.3 2.9 6.3 1.8 virginica
8 4.6 3.1 1.5 0.2 setosa
9 6.3 2.8 5.1 1.5 virginica
10 5.5 2.5 4.0 1.3 versicolor
Use the slice() function to select rows in iris by position in the index. For example, use position
10:15 in the argument.
slice(iris, 10:15)
Use the top_n() function to select and order a specified number (n) of top entries in the storms
object. For example try 2 and day in the argument.
top_n(storms, 2, day)
# A tibble: 397 x 13
name year month day hour lat long status category wind pressure
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
1 Caroline 1975 8 31 0 24 -97 hurrica~ 3 100 973
2 Caroline 1975 8 31 6 24.1 -97.5 hurrica~ 3 100 963
3 Caroline 1975 8 31 12 24.3 -97.8 hurrica~ 2 90 963
4 Caroline 1975 8 31 18 24.8 -98 tropica~ NA 55 993
5 Doris 1975 8 31 0 34.9 -46.3 hurrica~ 1 65 990
6 Doris 1975 8 31 6 34.8 -45.7 hurrica~ 1 65 990
7 Doris 1975 8 31 12 34.7 -45.2 hurrica~ 1 70 990
8 Doris 1975 8 31 18 34.6 -44.9 hurrica~ 1 70 990
9 Emmy 1976 8 31 12 35.1 -44.9 hurrica~ 2 85 977
10 Frances 1976 8 31 0 21 -54.9 hurrica~ 1 65 980
# i 387 more rows
# i 2 more variables: tropicalstorm_force_diameter <int>,
# hurricane_force_diameter <int>
Use the summarize() function to collapse a data frame into a single summary. For example,
compute the average sepal length:

summarize(iris, avg = mean(Sepal.Length))

       avg
1 5.843333

Applying a summary function such as mean() across every column instead repeats each
column's mean on every row, with NA for the non-numeric Species column (output abbreviated):

    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1       5.843333    3.057333        3.758    1.199333      NA
...
150     5.843333    3.057333        3.758    1.199333      NA
Use the count() function with a weighting variable to sum Sepal.Length within each Species:

count(iris, Species, wt = Sepal.Length)

     Species     n
1     setosa 250.3
2 versicolor 296.8
3  virginica 329.4
Now, take a deeper dive into the syntax of dplyr for data manipulation using the nycflights13
dataset.
Examine and view the flights dataset from the nycflights13 package. Remember, to access an
object directly from within a specific package, use the double colon, for example: nycflights13::.
To examine the entire dataset you may then use the View() function in the console (but again,
remember not to use the View() function in a Quarto or RMarkdown code chunk).
nycflights13::flights
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 336,766 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
You may use the filter() function to filter rows meeting certain criteria. So, use the filter()
function to identify all flights on January 1st. Remember, in order to indicate "equals" you
need the double ==. Hint: use month == 1 and day == 1.
filter(flights, month == 1, day == 1)
# A tibble: 842 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 832 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Since dplyr functions never modify their inputs, if you want to save the results you'll need
to create an object containing them. So repeat what you just did using filter(), but save the
results in an object called jan1.

jan1 <- filter(flights, month == 1, day == 1)

Use the filter() function to extract all flights on December 25 and save them to an object called
dec25. Surround the entire assignment in parentheses to simultaneously print out the results.

(dec25 <- filter(flights, month == 12, day == 25))
# A tibble: 719 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 12 25 456 500 -4 649 651
2 2013 12 25 524 515 9 805 814
3 2013 12 25 542 540 2 832 850
4 2013 12 25 546 550 -4 1022 1027
5 2013 12 25 556 600 -4 730 745
6 2013 12 25 557 600 -3 743 752
7 2013 12 25 557 600 -3 818 831
8 2013 12 25 559 600 -1 855 856
9 2013 12 25 559 600 -1 849 855
10 2013 12 25 600 600 0 850 846
# i 709 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Filtering on the month alone returns all 27,004 flights in January:

filter(flights, month == 1)
# A tibble: 27,004 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 26,994 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Logical Operators
Multiple arguments to filter() are combined with "and": every expression must be true in order
for a row to be included in the output. You may also use Boolean operators.
Try to identify all the flights that departed in November or December. Hint: use | for "or".

filter(flights, month == 11 | month == 12)
# A tibble: 55,403 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 11 1 5 2359 6 352 345
2 2013 11 1 35 2250 105 123 2356
3 2013 11 1 455 500 -5 641 651
4 2013 11 1 539 545 -6 856 827
5 2013 11 1 542 545 -3 831 855
6 2013 11 1 549 600 -11 912 923
7 2013 11 1 550 600 -10 705 659
8 2013 11 1 554 600 -6 659 701
9 2013 11 1 554 600 -6 826 827
10 2013 11 1 554 600 -6 749 751
# i 55,393 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
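An equivalent, shorter way to write this filter uses the %in% operator (a small sketch; x %in% y keeps every row where x is one of the values in y):

nov_dec <- filter(flights, month %in% c(11, 12))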
The arrange() function works similarly to filter(), except that instead of selecting rows, it
changes their order. Use the arrange() function on the flights object, and in the argument give
it the columns to order by: year, month, and day.

arrange(flights, year, month, day)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 336,766 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Use the descending desc() function within the arrange() function to reorder by the arr_delay
column in descending order.
arrange(flights, desc(arr_delay))
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 9 641 900 1301 1242 1530
2 2013 6 15 1432 1935 1137 1607 2120
3 2013 1 10 1121 1635 1126 1239 1810
4 2013 9 20 1139 1845 1014 1457 2210
5 2013 7 22 845 1600 1005 1044 1815
6 2013 4 10 1100 1900 960 1342 2211
7 2013 3 17 2321 810 911 135 1020
8 2013 7 22 2257 759 898 121 1026
9 2013 12 5 756 1700 896 1058 2020
10 2013 5 3 1133 2055 878 1250 2215
# i 336,766 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Use select() to identify the variables you are most interested in at the moment. Here, use the
select() function to select three columns from flights: year, month, and day.
select(flights, year, month, day)
# A tibble: 336,776 x 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# i 336,766 more rows
Now, use the select() function to select all columns between year and day (inclusive). Hint:
use the : operator.
select(flights, year:day)
# A tibble: 336,776 x 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# i 336,766 more rows
Use the - sign in the argument to select all columns except those from year to day (inclusive).
To accomplish this, use the select() function with flights, -(year:day) in the argument.
select(flights, -(year:day))
# A tibble: 336,776 x 16
dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
<int> <int> <dbl> <int> <int> <dbl> <chr>
1 517 515 2 830 819 11 UA
2 533 529 4 850 830 20 UA
3 542 540 2 923 850 33 AA
4 544 545 -1 1004 1022 -18 B6
5 554 600 -6 812 837 -25 DL
6 554 558 -4 740 728 12 UA
7 555 600 -5 913 854 19 B6
8 557 600 -3 709 723 -14 EV
9 557 600 -3 838 846 -8 B6
10 558 600 -2 753 745 8 AA
# i 336,766 more rows
# i 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Helper functions to use with the select() function:
- starts_with("abc") matches names that begin with "abc"
- ends_with("xyz") matches names that end with "xyz"
- contains("ijk") matches names that contain "ijk"
- matches("(.)\\1") selects variables that match a regular expression
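A quick sketch of these helpers applied to flights (the patterns chosen here are just illustrative):

select(flights, starts_with("dep"))   # dep_time, dep_delay
select(flights, ends_with("delay"))   # dep_delay, arr_delay
select(flights, contains("time"))     # every column with "time" in its name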
The rename() function will keep all variables that aren't explicitly mentioned. Here, use
the rename() function on the flights object to rename the tailnum variable/column to
tail_num.

rename(flights, tail_num = tailnum)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 336,766 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tail_num <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Use the mutate() function to create and add new variables to a dataset.
To practice this on the flights dataset and be able to see the new variables easily, let's first
create a condensed version of the dataset called flights_sml containing the variables year
through day, the columns that end with the word "delay", distance, and air_time.
Then, from that reduced dataset, create two new variables: 1. gain, which consists of
arr_delay - dep_delay, and 2. speed, which consists of distance divided by air_time, times
60. Hint: for the gain variable, in the argument to the mutate() function you would use gain
= arr_delay - dep_delay.

flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)
mutate(flights_sml, gain = arr_delay - dep_delay, speed = distance / air_time * 60)
# A tibble: 336,776 x 9
year month day dep_delay arr_delay distance air_time gain speed
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227 9 370.
2 2013 1 1 4 20 1416 227 16 374.
3 2013 1 1 2 33 1089 160 31 408.
4 2013 1 1 -1 -18 1576 183 -17 517.
5 2013 1 1 -6 -25 762 116 -19 394.
6 2013 1 1 -4 12 719 150 16 288.
7 2013 1 1 -5 19 1065 158 24 404.
8 2013 1 1 -3 -14 229 53 -11 259.
9 2013 1 1 -3 -8 944 140 -5 405.
10 2013 1 1 -2 8 733 138 10 319.
# i 336,766 more rows
You may now use the variables you just created. Try out this line of code:
mutate(flights_sml, gain = arr_delay - dep_delay, hours = air_time/60, gain_per_hour =
gain/hours)
mutate(flights_sml, gain = arr_delay - dep_delay, hours = air_time/60, gain_per_hour = gain/hours)
# A tibble: 336,776 x 10
year month day dep_delay arr_delay distance air_time gain hours
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227 9 3.78
2 2013 1 1 4 20 1416 227 16 3.78
3 2013 1 1 2 33 1089 160 31 2.67
4 2013 1 1 -1 -18 1576 183 -17 3.05
5 2013 1 1 -6 -25 762 116 -19 1.93
6 2013 1 1 -4 12 719 150 16 2.5
7 2013 1 1 -5 19 1065 158 24 2.63
8 2013 1 1 -3 -14 229 53 -11 0.883
9 2013 1 1 -3 -8 944 140 -5 2.33
10 2013 1 1 -2 8 733 138 10 2.3
# i 336,766 more rows
# i 1 more variable: gain_per_hour <dbl>
The transmute() function will keep only the new variables:

transmute(flights, gain = arr_delay - dep_delay, hours = air_time/60, gain_per_hour = gain/hours)
# A tibble: 336,776 x 3
gain hours gain_per_hour
<dbl> <dbl> <dbl>
1 9 3.78 2.38
2 16 3.78 4.23
3 31 2.67 11.6
4 -17 3.05 -5.57
5 -19 1.93 -9.83
6 16 2.5 6.4
7 24 2.63 9.11
8 -11 0.883 -12.5
9 -5 2.33 -2.14
10 10 2.3 4.35
# i 336,766 more rows
Use the summarize() function to collapse a data frame down to a single row; for example, the
mean departure delay:
summarize(flights, delay = mean(dep_delay, na.rm=TRUE))
# A tibble: 1 x 1
delay
<dbl>
1 12.6
Now, let's increase the functionality by pairing summarize() with the group_by() function.
This changes the unit of analysis from the complete dataset to individual groups.
So now, let's apply the same code grouped by date.

by_day <- group_by(flights, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
# A tibble: 365 x 4
# Groups: year, month [12]
year month day delay
<int> <int> <int> <dbl>
1 2013 1 1 11.5
2 2013 1 2 13.9
3 2013 1 3 11.0
4 2013 1 4 8.95
5 2013 1 5 5.73
6 2013 1 6 7.15
7 2013 1 7 5.42
8 2013 1 8 2.55
9 2013 1 9 2.28
10 2013 1 10 2.84
# i 355 more rows
Using group_by() and summarize() together provides one of the most commonly used
capabilities in dplyr: grouped summaries.
Now, let's introduce the pipe to combine multiple operations.
Let's explore the relationship between the distance and average delay for each location.
This first analysis is without using the pipe. There are three steps to prepare this data: 1.
Group flights by destination; 2. Summarize to compute distance, average delay, and number
of flights; and 3. Filter to remove noisy points and the Honolulu airport, which is almost twice
as far away as the next closest airport.

by_dest <- group_by(flights, dest)
delay <- summarize(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dest != "HNL")
[Scatterplot of average arrival delay (delay) against distance (dist) for each destination, with point size mapped to count.]
Deliverable 5: Use the pipe operator to create an object called delays which 1. groups
flights by destination; 2. summarizes and computes distance, average delay, and number
of flights; and 3. filters to remove noisy points and the Honolulu airport.

delays <- flights %>%
  group_by(dest) %>%
  summarize(count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(count > 20, dest != "HNL")

If you instead group the flights by year, month, and day and summarize with n(), you get a
count of flights per day. Save the grouped data in an object called daily:

daily <- group_by(flights, year, month, day)
(per_day <- summarize(daily, flights = n()))
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
# A tibble: 365 x 4
# Groups: year, month [12]
year month day flights
<int> <int> <int> <int>
1 2013 1 1 842
2 2013 1 2 943
3 2013 1 3 914
4 2013 1 4 915
5 2013 1 5 720
6 2013 1 6 832
7 2013 1 7 933
8 2013 1 8 899
9 2013 1 9 902
10 2013 1 10 932
# i 355 more rows
If you need to remove the grouping and return to operations on ungrouped data, use ungroup():
daily %>%
ungroup() %>%
summarize(flights=n())
# A tibble: 1 x 1
flights
<int>
1 336776
Demonstrate what happens to our flights dataset when we group by year, month, and day,
and then summarize by mean departure delay, but do not use the na.rm argument to remove
missing values.
flights %>%
group_by(year, month, day) %>%
summarize(mean=mean(dep_delay))
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
# A tibble: 365 x 4
# Groups: year, month [12]
year month day mean
<int> <int> <int> <dbl>
1 2013 1 1 NA
2 2013 1 2 NA
3 2013 1 3 NA
4 2013 1 4 NA
5 2013 1 5 NA
6 2013 1 6 NA
7 2013 1 7 NA
8 2013 1 8 NA
9 2013 1 9 NA
10 2013 1 10 NA
# i 355 more rows
You see lots of missing values. This happens because aggregation functions follow the
usual rule of missing values: if there is any missing value in the input, the output will also be a
missing value. That is why all aggregation functions have an na.rm argument, which removes
the missing values prior to computation.
flights %>%
group_by(year, month, day) %>%
summarize(mean=mean(dep_delay, na.rm=TRUE))
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
# A tibble: 365 x 4
# Groups: year, month [12]
year month day mean
<int> <int> <int> <dbl>
1 2013 1 1 11.5
2 2013 1 2 13.9
3 2013 1 3 11.0
4 2013 1 4 8.95
5 2013 1 5 5.73
6 2013 1 6 7.15
7 2013 1 7 5.42
8 2013 1 8 2.55
9 2013 1 9 2.28
10 2013 1 10 2.84
# i 355 more rows
Now, please practice some of these data wrangling techniques on real text data. In most cases,
I am providing the complete code for you. I have provided some text data for you to use
during the lab, but you should be thinking about how you can apply these techniques to your
own text data for your final projects.
impeachtidy <- read_tsv("impeach.tab")
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Tokenize the TEXT column into one word per row with unnest_tokens(), saving the result in
an object called impeach_words:

impeach_words <- impeachtidy %>%
  unnest_tokens(word, TEXT)

You may review the new impeach_words object to see how it tokenized the column TEXT
into one word per row in a new column called word.
# A tibble: 376,436 x 5
HEARING SPEAKER `MAIN SPEAKER` ROLE word
<date> <chr> <chr> <chr> <chr>
1 2019-11-20 Adam Schiff D-Schiff Democrat your
2 2019-11-20 Adam Schiff D-Schiff Democrat interest
3 2019-11-20 Adam Schiff D-Schiff Democrat in
4 2019-11-20 Adam Schiff D-Schiff Democrat being
5 2019-11-20 Adam Schiff D-Schiff Democrat here
6 2019-11-20 Adam Schiff D-Schiff Democrat in
7 2019-11-20 Adam Schiff D-Schiff Democrat turn
8 2019-11-20 Adam Schiff D-Schiff Democrat we
9 2019-11-20 Adam Schiff D-Schiff Democrat ask
10 2019-11-20 Adam Schiff D-Schiff Democrat for
# i 376,426 more rows
You will see R returns a note telling you this object is a tibble (the tidyverse data structure
for a data.frame) that is now 376,436 x 5 (remember, previously the dataset had 10,987
observations).
Load the tidytext stopword dictionary and explore its contents. You will notice the dictionary
draws its words (n = 1,149) from three different lexicons (SMART, Snowball, and onix). Use the
data(), head(), and tail() functions to review the stop_words dictionary.
data(stop_words)
head(stop_words)
tail(stop_words)
# A tibble: 6 x 2
word lexicon
<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART
# A tibble: 6 x 2
word lexicon
<chr> <chr>
1 you onix
2 young onix
3 younger onix
4 youngest onix
5 your onix
6 yours onix
Remove the stop words from impeach_words using an anti_join() with the stop_words
dictionary, saving the result in an object called impeach_clean:

impeach_clean <- impeach_words %>%
  anti_join(stop_words)

Let's take a quick look at the now "clean" dataset.
impeach_clean
# A tibble: 133,884 x 5
HEARING SPEAKER `MAIN SPEAKER` ROLE word
<date> <chr> <chr> <chr> <chr>
1 2019-11-20 Adam Schiff D-Schiff Democrat respect
2 2019-11-20 Adam Schiff D-Schiff Democrat proceed
3 2019-11-20 Adam Schiff D-Schiff Democrat hearing
4 2019-11-20 Adam Schiff D-Schiff Democrat intention
5 2019-11-20 Adam Schiff D-Schiff Democrat committee
6 2019-11-20 Adam Schiff D-Schiff Democrat proceed
7 2019-11-20 Adam Schiff D-Schiff Democrat disruptions
8 2019-11-20 Adam Schiff D-Schiff Democrat chairman
9 2019-11-20 Adam Schiff D-Schiff Democrat ll
10 2019-11-20 Adam Schiff D-Schiff Democrat steps
# i 133,874 more rows
impeach_clean %>%
count(word, sort = TRUE)
# A tibble: 9,176 x 2
word n
<chr> <int>
1 president 5049
2 ukraine 1872
3 ambassador 1802
4 trump 1632
5 call 1210
6 zelensky 1130
7 correct 1096
8 meeting 889
9 time 805
10 sondland 795
# i 9,166 more rows
What are the top ten words in order from this clean dataset?
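One way to pull exactly the top ten programmatically (a sketch using dplyr's slice_max(); slicing the head of the sorted counts would work equally well):

impeach_clean %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 10)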
Deliverable 10: Visualize this count using the ggplot2 package. Create a bar chart of all
the words occurring more than 600 times in the dataset (you can adjust that by
changing the filter() parameter).
impeach_clean %>%
count(word, sort = TRUE) %>%
filter(n>600) %>%
mutate(word=reorder(word,n)) %>%
ggplot(aes(word,n)) +
geom_col() +
xlab(NULL) +
coord_flip()
[Horizontal bar chart of the counts for every word occurring more than 600 times: president, ukraine, ambassador, trump, call, zelensky, correct, meeting, time, sondland, security, house, don, investigations, people, ve, ukrainian, impeachment.]
Deliverable 11: Combining all the steps using the pipe capabilities of dplyr.
As I mentioned, in the tidyverse these steps could all be nested using the %>% operator, as
below. Let's use the broom icons to remove all the objects we created in both the Environment
and Plots panes, and recreate them by highlighting all the lines below and running them. Could
you combine this even further and achieve the same result?
impeachtidy <- read_tsv("impeach.tab")

impeach_words <- impeachtidy %>%
  unnest_tokens(word, TEXT)

impeach_clean <- impeach_words %>%
  anti_join(stop_words)

impeach_clean %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
[The same horizontal bar chart of words occurring more than 600 times, recreated by the pipeline above.]
Analyze the impeach_words dataset, get a count of how many words were spoken by each
speaker, and then visualize them.
Deliverable 12: Group by speaker, then explore the object and visualize the results.

total_impeach <- impeach_words %>%
  group_by(SPEAKER) %>%
  summarize(total = n()) %>%
  arrange(desc(total))
total_impeach
# A tibble: 75 x 2
SPEAKER total
<chr> <int>
1 Daniel Goldman 35478
2 Adam Schiff 30222
3 Stephen Castor 29646
4 Devin Nunes 19602
5 Kurt Volker 13404
6 Fiona Hill 13245
7 Doug Collins 13197
8 Gordon Sondland 12558
9 Bill Taylor 11998
10 M. Yovanovitch 11513
# i 65 more rows
You will see that four people stand out more than the others, three perhaps expected and one
somewhat surprising: Daniel Goldman, Adam Schiff, Stephen Castor, and Devin Nunes.
total_impeach %>%
ggplot(aes(SPEAKER,total)) +
geom_col() +
xlab(NULL) +
ylab(NULL) +
coord_flip()
You may click on Zoom to bring up the chart viewer and make the chart more readable.
Here, the coord_flip() function flips the Cartesian coordinates and allows you to see the
speakers' names on the y axis. If you remove that function, the chart will still plot, but it is
much harder to read.
total_impeach %>%
ggplot(aes(SPEAKER,total)) +
geom_col() +
xlab(NULL) +
ylab(NULL)
[Bar chart of total words per speaker without coord_flip(); the speaker names overlap illegibly on the x axis.]
Import the collection of IGF transcripts, create a corpus called igfbali, and inspect and summarize
the first two cases of the corpus. This is an example of bringing a collection of text
files into R using the tm package.
class(igfbali)
igfbali
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 63
All of these pre-processing steps are optional. I suggest making a first pass through the data
without using them (e.g., commenting them out), and then experimenting with using them
selectively, as sketched below.
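A sketch of what those optional cleaning steps typically look like in tm (the particular steps and their order here are illustrative, not a required recipe):

igfbali <- tm_map(igfbali, content_transformer(tolower))  # lowercase
igfbali <- tm_map(igfbali, removePunctuation)             # strip punctuation
igfbali <- tm_map(igfbali, removeNumbers)                 # strip numbers
igfbali <- tm_map(igfbali, stripWhitespace)               # collapse extra whitespace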
Remove the stopwords from this IGF corpus using the tm built-in stopword dictionary, and
then stem the corpus with stemDocument():

igfbali <- tm_map(igfbali, removeWords, stopwords("english"))
tm_map(igfbali, stemDocument)
Deliverable 15: Create a Document Term Matrix (DTM) of the igfbali corpus.
This approach will allow you to exploit some very interesting and powerful R functions (like
clustering, classification, etc.).

dtm <- DocumentTermMatrix(igfbali)

Use the findFreqTerms() function to identify the key terms in the dataset that occur at least x
(n) times; explore different options until you find a useful grouping.
findFreqTerms(dtm, 500)
Sparse terms: remove sparse terms using the removeSparseTerms() function.

inspect(removeSparseTerms(dtm, sparse=0.4))

Using the findAssocs() function, find the terms that correlate, with at least a 0.8 correlation,
with two specified terms you would like to explore in this Internet Governance Forum dataset:

findAssocs(dtm, c("activists", "cybersecurity"), 0.8)
$activists
moral coalitions. hackers, tunisia.
0.83 0.82 0.82 0.80
$cybersecurity
terrorism bleeds increasing, nation's
0.94 0.91 0.91 0.91
combating cybersecurity, norm, cybercrime,
0.90 0.88 0.87 0.83
chris, infrastructures spam, malicious
0.83 0.83 0.82 0.82
spam spamming "spam" '04
0.81 0.81 0.81 0.81
'05, '06, '06. '17,
0.81 0.81 0.81 0.81
'990s (beep) (beep) -- (security):
0.81 0.81 0.81 0.81
1770-something 1770-something, 2016 2016?
0.81 0.81 0.81 0.81
24--7 5:00, >>k. abcs
0.81 0.81 0.81 0.81
accede accomplish, acdc, acm,
0.81 0.81 0.81 0.81
acronym adamant adopt. adopt --
0.81 0.81 0.81 0.81
advertisements, affiliates, agencies' agency;
0.81 0.81 0.81 0.81
ain't analytics, analyzed, answered?
0.81 0.81 0.81 0.81
anti-abuse anti-phishing anti-spam antispam
0.81 0.81 0.81 0.81
anyone -- apcert, arises? article.
0.81 0.81 0.81 0.81
aspects; aspects? assault assist,
0.81 0.81 0.81 0.81
attuned audience -- auscert authenticating
0.81 0.81 0.81 0.81
avail back: backs. batnet
0.81 0.81 0.81 0.81
beep body? botnet, botnet-like
0.81 0.81 0.81 0.81
botnets, boundaries, box, boyer
0.81 0.81 0.81 0.81
boyer: branding. brian building --
0.81 0.81 0.81 0.81
burst buttons c-level caller
0.81 0.81 0.81 0.81
calling, canspam. capability. certs?
0.81 0.81 0.81 0.81
chair's chance. characteristic charities.
0.81 0.81 0.81 0.81
chris? chris -- circus citizen networks
0.81 0.81 0.81 0.81
classified clean. click, clogging
0.81 0.81 0.81 0.81
closed. closer, closes commercials
0.81 0.81 0.81 0.81
commercial -- commonwealth. communities -- complete?
0.81 0.81 0.81 0.81
components. computer -- conflating congratulations,
0.81 0.81 0.81 0.81
construed contents. cooperate -- counterparts --
0.81 0.81 0.81 0.81
counterterrorism, counterterrorism. country; crimes
0.81 0.81 0.81 0.81
cure. cured. cyberattacks cybercapacity
0.81 0.81 0.81 0.81
cybercrime-related cybercrime; cybercrime -- cyberevent
0.81 0.81 0.81 0.81
cyberlaw. cyberthreats, dangerous, daniel
0.81 0.81 0.81 0.81
debated, deeds defenses define --
0.81 0.81 0.81 0.81
degradation destroying diplomat, discern
0.81 0.81 0.81 0.81
discussion. dismissed disposal. disservice,
0.81 0.81 0.81 0.81
dominican donations, done -- doorstep
0.81 0.81 0.81 0.81
doorstep. dors driver. drops,
0.81 0.81 0.81 0.81
drove drunk. earlier -- educated.
0.81 0.81 0.81 0.81
employer enabler. enablers. enentire
0.81 0.81 0.81 0.81
enforcement; enriched, enrichment enter,
0.81 0.81 0.81 0.81
eu-funded european -- ex-colleagues except,
0.81 0.81 0.81 0.81
executive expressions, extra-territorial faso hassan.
0.81 0.81 0.81 0.81
fernando, fierce fighting. fines
0.81 0.81 0.81 0.81
fining firs, floated -- florida
0.81 0.81 0.81 0.81
follow-. four -- frameworks: frameworks:
0.81 0.81 0.81 0.81
fraud? ftc, gain, gambling,
0.81 0.81 0.81 0.81
getting, gideon gideon, give.
0.81 0.81 0.81 0.81
glove. government -- grass-root grass-roots
0.81 0.81 0.81 0.81
gsa, hacking-related hacks, haming
0.81 0.81 0.81 0.81
handed, hands- harmonization, headphones
0.81 0.81 0.81 0.81
hijack idea -- ills impinges
0.81 0.81 0.81 0.81
implement. inconvenience. increasing -- ineffective
0.81 0.81 0.81 0.81
infect infection infections infections.
0.81 0.81 0.81 0.81
infections? infects innovation-based instructor
0.81 0.81 0.81 0.81
integration. internationally -- internationals. interoperable,
0.81 0.81 0.81 0.81
interrelated, investigating, irritating jammed,
0.81 0.81 0.81 0.81
jay jayantha? jobs? johnson
0.81 0.81 0.81 0.81
johnson. jpcert judiciary, jurists,
0.81 0.81 0.81 0.81
karen, keshted labeled, last --
0.81 0.81 0.81 0.81
law-based leapfrog legislator legislators.
0.81 0.81 0.81 0.81
lepris, liaisons litany maawg
0.81 0.81 0.81 0.81
maawg, maawg. maawg -- mail.
0.81 0.81 0.81 0.81
mailbox, mailboxes makarim, makarim:
0.81 0.81 0.81 0.81
malware, malware. married mayu fumo,
0.81 0.81 0.81 0.81
merged messaging. mexico's mic).
0.81 0.81 0.81 0.81
microphones, misconduct, mismatch mobiles,
0.81 0.81 0.81 0.81
moderately month. montreal, mood
0.81 0.81 0.81 0.81
motivation. mpasa, mulberry, mulberry.
0.81 0.81 0.81 0.81
mulberry: must -- national-level natris,
0.81 0.81 0.81 0.81
natris: ncic nefarious netterlands
0.81 0.81 0.81 0.81
non-south nonsolicited nonstate normal," one
0.81 0.81 0.81 0.81
note -- notifying nuisance nuisance.
0.81 0.81 0.81 0.81
oddly offenses, offenses. omnibus
0.81 0.81 0.81 0.81
one? -- onwards onwards, open--shut,
0.81 0.81 0.81 0.81
opt- opted osc, outfits
0.81 0.81 0.81 0.81
outlining overlap, overwhelmed painter
0.81 0.81 0.81 0.81
painter. painter: panel -- partners. --
0.81 0.81 0.81 0.81
pass. pcs perspective? perspective --
0.81 0.81 0.81 0.81
pillar. pipes, plaintiffs. policymakers.
0.81 0.81 0.81 0.81
possible? postgraduate preference -- presenters,
0.81 0.81 0.81 0.81
pretended preventative privacy-sensitive profitable,
0.81 0.81 0.81 0.81
promote. promoting -- promotion, promptly
0.81 0.81 0.81 0.81
pronounced pronounced, propaganda proportion,
0.81 0.81 0.81 0.81
proportions, prpt psace put --
0.81 0.81 0.81 0.81
python's quantity question -- raising?
0.81 0.81 0.81 0.81
rater, realisation receiver receptive
0.81 0.81 0.81 0.81
reduction. reevaluate regarded -- regarding --
0.81 0.81 0.81 0.81
regardless, region. remember? remit.
0.81 0.81 0.81 0.81
remote -- reorganisation requiring, resnick.
0.81 0.81 0.81 0.81
resources? revolutionary rican router,
0.81 0.81 0.81 0.81
routes. sadowski. saturate, schedules
0.81 0.81 0.81 0.81
scheme. segueing self-aid self-governance
0.81 0.81 0.81 0.81
self-regulation, senders servers -- shalt
0.81 0.81 0.81 0.81
sharing? significantly. siphoned sketch
0.81 0.81 0.81 0.81
socialize -- spam. spam? spamed,
0.81 0.81 0.81 0.81
spammer spammers spamming, spam --
0.81 0.81 0.81 0.81
speak -- spear-phishing spear-phishing, specialists,
0.81 0.81 0.81 0.81
standards-based stated, statutory stopped,
0.81 0.81 0.81 0.81
streamlined subjects, subject -- succeed,
0.81 0.81 0.81 0.81
sufficient. summaries surprising, tailor-made
0.81 0.81 0.81 0.81
takeaway, takedowns talks. targeted.
0.81 0.81 0.81 0.81
tasks. technology-based teed territory.
0.81 0.81 0.81 0.81
terrorists, theft, therefore -- thou
0.81 0.81 0.81 0.81
thought- tiarma. tighten tong
0.81 0.81 0.81 0.81
toolkit, top -- tout tradition,
0.81 0.81 0.81 0.81
traditions. trainings? transborder tween.
0.81 0.81 0.81 0.81
ugandan ult uncharacteristic uncontinueed
0.81 0.81 0.81 0.81
undisputed unidentifying unsolicited variety,
0.81 0.81 0.81 0.81
vehicle -- vep waas wanteded
0.81 0.81 0.81 0.81
wcit. website.. wild wout
0.81 0.81 0.81 0.81
wout, wout. after system
0.81 0.81 0.81 0.81
efforts, minimize
0.80 0.80
inspect(DocumentTermMatrix(igfbali, list(dictionary = c("multistakeholder", "freedom", "development"))))
Docs
10 OPENNESS HUMAN RIGHTS FREEDOM OF EXPRESSION AND FREE FLOW OF INFORMATION ON THE INTERNET
14 INTERNET_GOVERNANCE_PRINCIPLES.txt
15 OPENING CEREMONY AND OPENING SESSION.txt
26 WS 44 FREEDOM ONLINE COALITION OPEN FORU1.txt
27 WS 44 FREEDOM ONLINE COALITION OPEN FORUM.txt
33 WS 57 MAKING MULTISTAKEHOLDERISM MORE EQUITABLE AND TRANSPARENT.txt
38 WS 357 THE INTERNET AS AN ENGINE FOR GROWTH AND ADVANCEMENT.txt
45 WS-297_PROTECTING_JOURNALISTS_BLOGGERS_AND_MEDIA_ACTORS_IN_DIGITAL_AGE.txt
6 BUILDING BRIDGES – ENHANCING MULTI-STAKEHOLDER COOPERATION FOR GROWTH AND SUSTAINABLE.txt
60 WS 300 DEVELOPING A STRATEGIC VISION FOR INTERNET GOVERNANCE.txt
Docs
10 OPENNESS HUMAN RIGHTS FREEDOM OF EXPRESSION AND FREE FLOW OF INFORMATION ON THE INTERNET
14 INTERNET_GOVERNANCE_PRINCIPLES.txt
15 OPENING CEREMONY AND OPENING SESSION.txt
26 WS 44 FREEDOM ONLINE COALITION OPEN FORU1.txt
27 WS 44 FREEDOM ONLINE COALITION OPEN FORUM.txt
33 WS 57 MAKING MULTISTAKEHOLDERISM MORE EQUITABLE AND TRANSPARENT.txt
38 WS 357 THE INTERNET AS AN ENGINE FOR GROWTH AND ADVANCEMENT.txt
45 WS-297_PROTECTING_JOURNALISTS_BLOGGERS_AND_MEDIA_ACTORS_IN_DIGITAL_AGE.txt
6 BUILDING BRIDGES – ENHANCING MULTI-STAKEHOLDER COOPERATION FOR GROWTH AND SUSTAINABLE.txt
60 WS 300 DEVELOPING A STRATEGIC VISION FOR INTERNET GOVERNANCE.txt
Docs
10 OPENNESS HUMAN RIGHTS FREEDOM OF EXPRESSION AND FREE FLOW OF INFORMATION ON THE INTERNET
14 INTERNET_GOVERNANCE_PRINCIPLES.txt
15 OPENING CEREMONY AND OPENING SESSION.txt
26 WS 44 FREEDOM ONLINE COALITION OPEN FORU1.txt
27 WS 44 FREEDOM ONLINE COALITION OPEN FORUM.txt
33 WS 57 MAKING MULTISTAKEHOLDERISM MORE EQUITABLE AND TRANSPARENT.txt
38 WS 357 THE INTERNET AS AN ENGINE FOR GROWTH AND ADVANCEMENT.txt
45 WS-297_PROTECTING_JOURNALISTS_BLOGGERS_AND_MEDIA_ACTORS_IN_DIGITAL_AGE.txt
6 BUILDING BRIDGES – ENHANCING MULTI-STAKEHOLDER COOPERATION FOR GROWTH AND SUSTAINABLE.txt
60 WS 300 DEVELOPING A STRATEGIC VISION FOR INTERNET GOVERNANCE.txt
Now we will practice some basic data wrangling in Python using the nltk package. We will use
the Reuters corpus, which is a collection of news documents. We will do some basic cleaning
and tokenization of the documents.
Deliverable 18: Importing the Reuters Corpus and Basic Data Cleaning and Inspection
Let’s begin by importing the nltk package and using the nltk.download() function to download
the Reuters corpus.
You may use the following sample code:
import nltk
nltk.download('reuters')
True
Now we will use the nltk.corpus module to import the Reuters corpus and inspect the
categories and the number of documents.
You may use the following sample code:

from nltk.corpus import reuters
print("Categories:", reuters.categories())
Categories: ['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'co
Now we will do some basic cleaning of this dataset. We will remove punctuation and numbers,
and make everything lowercase. You may use the following sample code:
Selecting a document:
doc_id = reuters.fileids(categories="crude")[0]
doc_text = reuters.raw(doc_id)
import string
cleaned_text = doc_text.translate(str.maketrans('', '', string.punctuation))
cleaned_text = ' '.join(cleaned_text.split())
print(cleaned_text)
JAPAN TO REVISE LONGTERM ENERGY DEMAND DOWNWARDS The Ministry of International Trade and Indu
Tokenization is the process of splitting a string into a list of words. We will use the
word_tokenize() function from the nltk library to tokenize the cleaned text.
You may try the following sample code:
# word_tokenize may first require: nltk.download('punkt')
# stopwords may first require: nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

tokens = word_tokenize(cleaned_text)
tokens = [word for word in tokens if word not in stopwords.words('english')]
print(tokens)
['JAPAN', 'TO', 'REVISE', 'LONGTERM', 'ENERGY', 'DEMAND', 'DOWNWARDS', 'The', 'Ministry', 'In
# PorterStemmer and WordNetLemmatizer live in nltk.stem
# (the lemmatizer may first require: nltk.download('wordnet'))
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed = [stemmer.stem(word) for word in tokens]
lemmatized = [lemmatizer.lemmatize(word) for word in tokens]
print("Stemmed:", stemmed)
print("Lemmatized:", lemmatized)
Stemmed: ['japan', 'to', 'revis', 'longterm', 'energi', 'demand', 'downward', 'the', 'ministr
Lemmatized: ['JAPAN', 'TO', 'REVISE', 'LONGTERM', 'ENERGY', 'DEMAND', 'DOWNWARDS', 'The', 'Mi
Deliverable 20: Conducting a Basic Parts of Speech Tagging of the Reuters Corpus
# pos_tag may first require: nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag

tagged_tokens = pos_tag(tokens)
print(tagged_tokens)
[('JAPAN', 'NNP'), ('TO', 'NNP'), ('REVISE', 'NNP'), ('LONGTERM', 'NNP'), ('ENERGY', 'NNP'),
Deliverable 21: Full Text Processing Pipeline for the Reuters Corpus
Finally, we will practice using a Python text processing pipeline applied to the Reuters dataset.
This example assumes the imports and the lemmatizer object from the previous steps are
already available.
You may use the following sample code:
def preprocess_pipeline(text):
    # lowercase and strip punctuation
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    # collapse runs of whitespace
    text = ' '.join(text.split())
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stopwords.words('english')]
    lemmatized = [lemmatizer.lemmatize(word) for word in tokens]
    tagged = pos_tag(lemmatized)
    return tagged
doc_text = reuters.raw(reuters.fileids(categories='crude')[0])
processed = preprocess_pipeline(doc_text)
print(processed)
[('japan', 'NN'), ('revise', 'NN'), ('longterm', 'JJ'), ('energy', 'NN'), ('demand', 'NN'), (