
Lab 4 - Data Wrangling - Cleaning - Preprocessing
Dr. Derrick L. Cogburn

2024-09-14

Table of contents

Lab Overview
Technical Learning Objectives
Business Learning Objectives

Assignment Overview

Pre-Lab Instructions
Creating Your Lab 4 Project in RStudio
Installing Packages, Importing and Exploring Data
Load the required libraries

Lab Instructions

Part 1: Primary Data Wrangling Verbs in dplyr and tidyr

Part 2: Handling Missing Values with dplyr

Part 3: Practicing Data Wrangling on Real Text Mining Projects

Part 4: Introduction to Data Wrangling in Python

Lab Overview

Technical Learning Objectives

1. Deepen your understanding of conducting data wrangling and data analysis using the
tidyverse.
2. Understand how data wrangling fits into the ETL workflow.
3. Understand how to reshape, combine, and subset data, including grouping and making
new variables.
4. Understand how to deal with missing data.

Business Learning Objectives

1. Understand how to transform data for analytics using the major functions for data
manipulation.
2. Understand how ETL fits into big data and analytics in business decisions.
3. Understand the importance of data wrangling to data analytics.
4. Understand when and how to preprocess and prepare data for analysis, modeling, and
visualization.

Assignment Overview

For this lab, we will focus on data wrangling, the process of transforming and mapping data
from one form into another. The goal of data wrangling, also known as data munging,
is to make your data more appropriate for your analysis, and more valuable for a variety of
downstream applications. There are four main parts to the lab:
- Part 1 will focus on the six key dplyr verbs for data manipulation.
- Part 2 will spend a little time exploring how dplyr handles missing data.
- Part 3 will apply these techniques to portions of real text mining projects.
- Part 4 will provide an overview of data wrangling in Python.

Pre-Lab Instructions

Pre-Lab Instructions (to be completed before class):

Creating Your Lab 4 Project in RStudio

In RStudio, create a project for Lab 4. Create a new Quarto document with a relevant title for
the lab, for example: "Lab 4: Data Wrangling: Cleaning, Preparation, and Tidying Data".
Now begin working your way through the Lab 4 instructions, or wait until class on Wednesday.
As you work through the instructions, I continue to encourage you to take a literate programming
approach, and use the space preceding a code chunk to explain in narrative terms what
you are doing in the code chunk below.
Also, please remember that this and all subsequent labs need to be submitted as knitted pdf
files, with your Quarto YAML header set to:

echo: true

This setting will show your work (code, output, and plots). The Canvas assignment submission
for Lab 4 and all subsequent labs will be restricted to a .pdf file format only (and it must be
a rendered .pdf file of your Quarto document, not an html file saved as a pdf). If you are
having problems with the .pdf rendering, please let me know. If you are facing the deadline for
submitting the assignment, you may comment out (with #) the sections of your file that are
causing the rendering problems, and submit the assignment with a note about the rendering issue.

Installing Packages, Importing and Exploring Data

For this lab you will be installing one new R package, nycflights13. In your lab Quarto
document, please install the package nycflights13 in an r code chunk below.

#install.packages("nycflights13")

Load the required libraries

rvest, tm, readr, tm.plugin.mail, Rcrawler, RSelenium, xml2, tidyverse, tidytext, nycflights13

Loading required package: NLP

Attaching package: 'readr'

The following object is masked from 'package:rvest':

guess_encoding

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --


v dplyr 1.1.4 v purrr 1.0.2
v forcats 1.0.0 v stringr 1.5.1
v ggplot2 3.5.1 v tibble 3.2.1
v lubridate 1.9.3 v tidyr 1.3.1
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x ggplot2::annotate() masks NLP::annotate()
x dplyr::filter() masks stats::filter()
x readr::guess_encoding() masks rvest::guess_encoding()
x dplyr::lag() masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Please review the Posit cheat sheet on data wrangling in dplyr and tidyr:
https://rstudio.github.io/cheatsheets/tidyr.pdf
and the dplyr Cheat Sheet:
https://rstudio.github.io/cheatsheets/data-transformation.pdf.
*** End of Pre-Lab ***

Lab Instructions

Lab Instructions (to be completed before, during, or after the synchronous class):

Part 1: Primary Data Wrangling Verbs in dplyr and tidyr

Part 1 will focus on the six key verbs of data manipulation found within the dplyr package.
In this lab we are going to focus heavily on data wrangling in the tidyverse environment,
including tidytext. We will start by illustrating the key verbs of data manipulation using
numeric data, and then move on to applying those concepts to textual data. The dplyr package,
developed by RStudio, is an extremely powerful ecosystem for data manipulation. Its main
functions for data manipulation are:

1. filter() - pick observations by their values

2. arrange() - reorder rows
3. select() - pick variables by name
4. mutate() - create new variables
5. summarize() - collapse many values down to a single summary
6. group_by() - can be used in conjunction with each of these five functions to change the
scope from operating on the entire dataset, to operating on it group by group.

These verbs/functions all work similarly: 1. The first argument is a data frame. 2. Subsequent
arguments describe what to do with the data frame, referencing variable names without quotes.
3. The result is a new data frame.
Let's start by reviewing how these functions allow you to manipulate data. We will start with
a popular built-in dataset called iris. One note: as you work, keep R resources handy, such as
the Data Wrangling with dplyr and tidyr cheat sheets.
Now use the as_tibble function to convert the built-in iris data into a tibble.

as_tibble(iris)

# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# i 140 more rows

Now that you have converted the iris data into a tibble, you can use the dplyr functions to
manipulate the data.
Use the glimpse() function to get an information rich summary of tbl data like iris.

glimpse(iris)

Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.~
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.~
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.~
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.~
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s~

To view the entire dataset, you may use the View() function in the console (or in an r script),
but you should not use the View() function in any Quarto or RMarkdown document.

Deliverable 1: Call the iris dataset, then use the group_by() function to group the
iris data by the variable Species, then use the summarize() function with (avg =
mean(Sepal.Width)) in the argument, and finally order the result by the average using the
arrange() function with avg in the argument.

iris %>%
group_by(Species) %>%
summarise(avg = mean(Sepal.Width)) %>%
arrange(avg)

# A tibble: 3 x 2
Species avg
<fct> <dbl>
1 versicolor 2.77
2 virginica 2.97
3 setosa 3.43

You may extract rows that meet logical criteria. Use the filter() function to filter iris for data
with a sepal length greater than 7. Hint: use (iris, Sepal.Length >7) in the argument.

filter(iris, Sepal.Length >7)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species


1 7.1 3.0 5.9 2.1 virginica
2 7.6 3.0 6.6 2.1 virginica
3 7.3 2.9 6.3 1.8 virginica
4 7.2 3.6 6.1 2.5 virginica
5 7.7 3.8 6.7 2.2 virginica

6 7.7 2.6 6.9 2.3 virginica
7 7.7 2.8 6.7 2.0 virginica
8 7.2 3.2 6.0 1.8 virginica
9 7.2 3.0 5.8 1.6 virginica
10 7.4 2.8 6.1 1.9 virginica
11 7.9 3.8 6.4 2.0 virginica
12 7.7 3.0 6.1 2.3 virginica

Use the distinct() function to remove duplicate rows from the iris dataset.

distinct(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species


1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
21 5.4 3.4 1.7 0.2 setosa
22 5.1 3.7 1.5 0.4 setosa
23 4.6 3.6 1.0 0.2 setosa
24 5.1 3.3 1.7 0.5 setosa
25 4.8 3.4 1.9 0.2 setosa
26 5.0 3.0 1.6 0.2 setosa
27 5.0 3.4 1.6 0.4 setosa
28 5.2 3.5 1.5 0.2 setosa
29 5.2 3.4 1.4 0.2 setosa

30 4.7 3.2 1.6 0.2 setosa
31 4.8 3.1 1.6 0.2 setosa
32 5.4 3.4 1.5 0.4 setosa
33 5.2 4.1 1.5 0.1 setosa
34 5.5 4.2 1.4 0.2 setosa
35 4.9 3.1 1.5 0.2 setosa
36 5.0 3.2 1.2 0.2 setosa
37 5.5 3.5 1.3 0.2 setosa
38 4.9 3.6 1.4 0.1 setosa
39 4.4 3.0 1.3 0.2 setosa
40 5.1 3.4 1.5 0.2 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
43 4.4 3.2 1.3 0.2 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
47 5.1 3.8 1.6 0.2 setosa
48 4.6 3.2 1.4 0.2 setosa
49 5.3 3.7 1.5 0.2 setosa
50 5.0 3.3 1.4 0.2 setosa
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
56 5.7 2.8 4.5 1.3 versicolor
57 6.3 3.3 4.7 1.6 versicolor
58 4.9 2.4 3.3 1.0 versicolor
59 6.6 2.9 4.6 1.3 versicolor
60 5.2 2.7 3.9 1.4 versicolor
61 5.0 2.0 3.5 1.0 versicolor
62 5.9 3.0 4.2 1.5 versicolor
63 6.0 2.2 4.0 1.0 versicolor
64 6.1 2.9 4.7 1.4 versicolor
65 5.6 2.9 3.6 1.3 versicolor
66 6.7 3.1 4.4 1.4 versicolor
67 5.6 3.0 4.5 1.5 versicolor
68 5.8 2.7 4.1 1.0 versicolor
69 6.2 2.2 4.5 1.5 versicolor
70 5.6 2.5 3.9 1.1 versicolor
71 5.9 3.2 4.8 1.8 versicolor
72 6.1 2.8 4.0 1.3 versicolor

73 6.3 2.5 4.9 1.5 versicolor
74 6.1 2.8 4.7 1.2 versicolor
75 6.4 2.9 4.3 1.3 versicolor
76 6.6 3.0 4.4 1.4 versicolor
77 6.8 2.8 4.8 1.4 versicolor
78 6.7 3.0 5.0 1.7 versicolor
79 6.0 2.9 4.5 1.5 versicolor
80 5.7 2.6 3.5 1.0 versicolor
81 5.5 2.4 3.8 1.1 versicolor
82 5.5 2.4 3.7 1.0 versicolor
83 5.8 2.7 3.9 1.2 versicolor
84 6.0 2.7 5.1 1.6 versicolor
85 5.4 3.0 4.5 1.5 versicolor
86 6.0 3.4 4.5 1.6 versicolor
87 6.7 3.1 4.7 1.5 versicolor
88 6.3 2.3 4.4 1.3 versicolor
89 5.6 3.0 4.1 1.3 versicolor
90 5.5 2.5 4.0 1.3 versicolor
91 5.5 2.6 4.4 1.2 versicolor
92 6.1 3.0 4.6 1.4 versicolor
93 5.8 2.6 4.0 1.2 versicolor
94 5.0 2.3 3.3 1.0 versicolor
95 5.6 2.7 4.2 1.3 versicolor
96 5.7 3.0 4.2 1.2 versicolor
97 5.7 2.9 4.2 1.3 versicolor
98 6.2 2.9 4.3 1.3 versicolor
99 5.1 2.5 3.0 1.1 versicolor
100 5.7 2.8 4.1 1.3 versicolor
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica
104 6.3 2.9 5.6 1.8 virginica
105 6.5 3.0 5.8 2.2 virginica
106 7.6 3.0 6.6 2.1 virginica
107 4.9 2.5 4.5 1.7 virginica
108 7.3 2.9 6.3 1.8 virginica
109 6.7 2.5 5.8 1.8 virginica
110 7.2 3.6 6.1 2.5 virginica
111 6.5 3.2 5.1 2.0 virginica
112 6.4 2.7 5.3 1.9 virginica
113 6.8 3.0 5.5 2.1 virginica
114 5.7 2.5 5.0 2.0 virginica
115 5.8 2.8 5.1 2.4 virginica

116 6.4 3.2 5.3 2.3 virginica
117 6.5 3.0 5.5 1.8 virginica
118 7.7 3.8 6.7 2.2 virginica
119 7.7 2.6 6.9 2.3 virginica
120 6.0 2.2 5.0 1.5 virginica
121 6.9 3.2 5.7 2.3 virginica
122 5.6 2.8 4.9 2.0 virginica
123 7.7 2.8 6.7 2.0 virginica
124 6.3 2.7 4.9 1.8 virginica
125 6.7 3.3 5.7 2.1 virginica
126 7.2 3.2 6.0 1.8 virginica
127 6.2 2.8 4.8 1.8 virginica
128 6.1 3.0 4.9 1.8 virginica
129 6.4 2.8 5.6 2.1 virginica
130 7.2 3.0 5.8 1.6 virginica
131 7.4 2.8 6.1 1.9 virginica
132 7.9 3.8 6.4 2.0 virginica
133 6.4 2.8 5.6 2.2 virginica
134 6.3 2.8 5.1 1.5 virginica
135 6.1 2.6 5.6 1.4 virginica
136 7.7 3.0 6.1 2.3 virginica
137 6.3 3.4 5.6 2.4 virginica
138 6.4 3.1 5.5 1.8 virginica
139 6.0 3.0 4.8 1.8 virginica
140 6.9 3.1 5.4 2.1 virginica
141 6.7 3.1 5.6 2.4 virginica
142 6.9 3.1 5.1 2.3 virginica
143 6.8 3.2 5.9 2.3 virginica
144 6.7 3.3 5.7 2.5 virginica
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

Deliverable 2: Randomly select a fraction of 0.5 rows from the iris dataset

Use the sample_frac() function to randomly select a fraction of 0.5 rows from the iris dataset.
In the argument, use replace = TRUE.

sample_frac(iris, 0.5, replace = TRUE)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.0 2.0 3.5 1.0 versicolor
2 5.7 2.8 4.1 1.3 versicolor
3 5.8 2.6 4.0 1.2 versicolor
4 6.4 2.7 5.3 1.9 virginica
5 6.0 2.2 5.0 1.5 virginica
6 5.7 2.8 4.5 1.3 versicolor
7 6.0 3.4 4.5 1.6 versicolor
8 7.9 3.8 6.4 2.0 virginica
9 6.8 3.2 5.9 2.3 virginica
10 5.8 2.7 4.1 1.0 versicolor
11 6.0 2.2 4.0 1.0 versicolor
12 7.6 3.0 6.6 2.1 virginica
13 4.6 3.4 1.4 0.3 setosa
14 6.5 3.0 5.8 2.2 virginica
15 6.4 2.8 5.6 2.2 virginica
16 5.6 2.8 4.9 2.0 virginica
17 5.1 2.5 3.0 1.1 versicolor
18 6.4 3.2 5.3 2.3 virginica
19 6.2 2.8 4.8 1.8 virginica
20 6.3 2.3 4.4 1.3 versicolor
21 5.8 2.7 5.1 1.9 virginica
22 6.0 3.0 4.8 1.8 virginica
23 5.8 2.7 5.1 1.9 virginica
24 7.7 2.8 6.7 2.0 virginica
25 5.1 3.8 1.6 0.2 setosa
26 5.9 3.0 5.1 1.8 virginica
27 4.9 3.1 1.5 0.2 setosa
28 4.8 3.1 1.6 0.2 setosa
29 6.4 2.8 5.6 2.2 virginica
30 4.8 3.0 1.4 0.3 setosa
31 5.9 3.0 5.1 1.8 virginica
32 5.8 4.0 1.2 0.2 setosa
33 5.6 3.0 4.5 1.5 versicolor
34 5.8 2.7 5.1 1.9 virginica
35 5.7 3.8 1.7 0.3 setosa
36 6.7 3.1 5.6 2.4 virginica
37 6.2 2.8 4.8 1.8 virginica
38 6.4 2.9 4.3 1.3 versicolor
39 5.2 3.4 1.4 0.2 setosa
40 5.6 3.0 4.5 1.5 versicolor
41 7.9 3.8 6.4 2.0 virginica
42 5.2 3.4 1.4 0.2 setosa

43 5.0 3.2 1.2 0.2 setosa
44 4.6 3.6 1.0 0.2 setosa
45 6.9 3.1 5.1 2.3 virginica
46 6.0 2.2 4.0 1.0 versicolor
47 4.9 3.1 1.5 0.1 setosa
48 4.4 2.9 1.4 0.2 setosa
49 5.5 2.4 3.8 1.1 versicolor
50 6.8 3.0 5.5 2.1 virginica
51 6.1 2.8 4.0 1.3 versicolor
52 5.9 3.0 4.2 1.5 versicolor
53 5.4 3.0 4.5 1.5 versicolor
54 6.2 3.4 5.4 2.3 virginica
55 5.7 2.5 5.0 2.0 virginica
56 4.8 3.0 1.4 0.3 setosa
57 5.1 3.5 1.4 0.3 setosa
58 5.4 3.9 1.3 0.4 setosa
59 5.8 2.7 5.1 1.9 virginica
60 6.4 2.7 5.3 1.9 virginica
61 6.0 3.4 4.5 1.6 versicolor
62 5.0 3.2 1.2 0.2 setosa
63 5.0 2.0 3.5 1.0 versicolor
64 5.2 3.5 1.5 0.2 setosa
65 5.1 3.8 1.9 0.4 setosa
66 6.9 3.1 5.1 2.3 virginica
67 4.4 3.2 1.3 0.2 setosa
68 5.0 2.3 3.3 1.0 versicolor
69 6.7 3.1 4.4 1.4 versicolor
70 5.0 2.0 3.5 1.0 versicolor
71 5.7 2.8 4.1 1.3 versicolor
72 4.6 3.1 1.5 0.2 setosa
73 6.0 2.7 5.1 1.6 versicolor
74 6.3 2.9 5.6 1.8 virginica
75 5.1 3.5 1.4 0.3 setosa
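
Because sample_frac() draws rows at random, your output will differ from the rows shown
above. For a reproducible sample you can set a seed first; a minimal sketch:

set.seed(42) # any integer works; fixes the random draw so results are reproducible
sample_frac(iris, 0.5, replace = TRUE)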

Use the sample_n() function to randomly select a specified (n) number of rows in iris. In the
argument, use replace = TRUE.

sample_n(iris, 10, replace = TRUE)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species


1 6.4 2.8 5.6 2.1 virginica
2 4.4 3.0 1.3 0.2 setosa

3 6.2 2.2 4.5 1.5 versicolor
4 6.6 3.0 4.4 1.4 versicolor
5 6.5 3.2 5.1 2.0 virginica
6 4.9 3.1 1.5 0.1 setosa
7 7.3 2.9 6.3 1.8 virginica
8 4.6 3.1 1.5 0.2 setosa
9 6.3 2.8 5.1 1.5 virginica
10 5.5 2.5 4.0 1.3 versicolor

Use the slice() function to select rows in iris by position in the index. For example, use position
10:15 in the argument.

slice(iris, 10:15)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species


1 4.9 3.1 1.5 0.1 setosa
2 5.4 3.7 1.5 0.2 setosa
3 4.8 3.4 1.6 0.2 setosa
4 4.8 3.0 1.4 0.1 setosa
5 4.3 3.0 1.1 0.1 setosa
6 5.8 4.0 1.2 0.2 setosa

Use the top_n() function to select and order a specified number (n) of top entries in the storms
dataset (built into dplyr). For example, try 2 and day in the argument.

top_n(storms, 2, day)

# A tibble: 397 x 13
name year month day hour lat long status category wind pressure
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
1 Caroline 1975 8 31 0 24 -97 hurrica~ 3 100 973
2 Caroline 1975 8 31 6 24.1 -97.5 hurrica~ 3 100 963
3 Caroline 1975 8 31 12 24.3 -97.8 hurrica~ 2 90 963
4 Caroline 1975 8 31 18 24.8 -98 tropica~ NA 55 993
5 Doris 1975 8 31 0 34.9 -46.3 hurrica~ 1 65 990
6 Doris 1975 8 31 6 34.8 -45.7 hurrica~ 1 65 990
7 Doris 1975 8 31 12 34.7 -45.2 hurrica~ 1 70 990
8 Doris 1975 8 31 18 34.6 -44.9 hurrica~ 1 70 990
9 Emmy 1976 8 31 12 35.1 -44.9 hurrica~ 2 85 977
10 Frances 1976 8 31 0 21 -54.9 hurrica~ 1 65 980
# i 387 more rows

# i 2 more variables: tropicalstorm_force_diameter <int>,
# hurricane_force_diameter <int>
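
Note: in current dplyr, top_n() has been superseded by slice_max(). A minimal equivalent of
the call above (slice_max() also keeps ties by default):

slice_max(storms, day, n = 2) # top 2 values of day, with ties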

Deliverable 3: Summarize the Data in the iris dataset


Use the summarize() function to collapse many values down to a single summary. For example,
use the mean of Sepal.Length in the argument.

summarize(iris, avg = mean(Sepal.Length))

mutate_each(iris, funs = mean)

count(iris, Species, wt = Sepal.Length)

avg
1 5.843333

Warning: `mutate_each()` was deprecated in dplyr 0.7.0.


i Please use `across()` instead.

Warning: There was 1 warning in `mutate()`.


i In argument: `Species = (function (x, ...) ...`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA

Sepal.Length Sepal.Width Petal.Length Petal.Width Species


1 5.843333 3.057333 3.758 1.199333 NA
2 5.843333 3.057333 3.758 1.199333 NA
3 5.843333 3.057333 3.758 1.199333 NA
4 5.843333 3.057333 3.758 1.199333 NA
5 5.843333 3.057333 3.758 1.199333 NA
6 5.843333 3.057333 3.758 1.199333 NA
7 5.843333 3.057333 3.758 1.199333 NA
8 5.843333 3.057333 3.758 1.199333 NA
9 5.843333 3.057333 3.758 1.199333 NA
10 5.843333 3.057333 3.758 1.199333 NA
11 5.843333 3.057333 3.758 1.199333 NA
12 5.843333 3.057333 3.758 1.199333 NA
13 5.843333 3.057333 3.758 1.199333 NA
14 5.843333 3.057333 3.758 1.199333 NA
15 5.843333 3.057333 3.758 1.199333 NA

16 5.843333 3.057333 3.758 1.199333 NA
17 5.843333 3.057333 3.758 1.199333 NA
18 5.843333 3.057333 3.758 1.199333 NA
19 5.843333 3.057333 3.758 1.199333 NA
20 5.843333 3.057333 3.758 1.199333 NA
21 5.843333 3.057333 3.758 1.199333 NA
22 5.843333 3.057333 3.758 1.199333 NA
23 5.843333 3.057333 3.758 1.199333 NA
24 5.843333 3.057333 3.758 1.199333 NA
25 5.843333 3.057333 3.758 1.199333 NA
26 5.843333 3.057333 3.758 1.199333 NA
27 5.843333 3.057333 3.758 1.199333 NA
28 5.843333 3.057333 3.758 1.199333 NA
29 5.843333 3.057333 3.758 1.199333 NA
30 5.843333 3.057333 3.758 1.199333 NA
31 5.843333 3.057333 3.758 1.199333 NA
32 5.843333 3.057333 3.758 1.199333 NA
33 5.843333 3.057333 3.758 1.199333 NA
34 5.843333 3.057333 3.758 1.199333 NA
35 5.843333 3.057333 3.758 1.199333 NA
36 5.843333 3.057333 3.758 1.199333 NA
37 5.843333 3.057333 3.758 1.199333 NA
38 5.843333 3.057333 3.758 1.199333 NA
39 5.843333 3.057333 3.758 1.199333 NA
40 5.843333 3.057333 3.758 1.199333 NA
41 5.843333 3.057333 3.758 1.199333 NA
42 5.843333 3.057333 3.758 1.199333 NA
43 5.843333 3.057333 3.758 1.199333 NA
44 5.843333 3.057333 3.758 1.199333 NA
45 5.843333 3.057333 3.758 1.199333 NA
46 5.843333 3.057333 3.758 1.199333 NA
47 5.843333 3.057333 3.758 1.199333 NA
48 5.843333 3.057333 3.758 1.199333 NA
49 5.843333 3.057333 3.758 1.199333 NA
50 5.843333 3.057333 3.758 1.199333 NA
51 5.843333 3.057333 3.758 1.199333 NA
52 5.843333 3.057333 3.758 1.199333 NA
53 5.843333 3.057333 3.758 1.199333 NA
54 5.843333 3.057333 3.758 1.199333 NA
55 5.843333 3.057333 3.758 1.199333 NA
56 5.843333 3.057333 3.758 1.199333 NA
57 5.843333 3.057333 3.758 1.199333 NA
58 5.843333 3.057333 3.758 1.199333 NA

59 5.843333 3.057333 3.758 1.199333 NA
60 5.843333 3.057333 3.758 1.199333 NA
61 5.843333 3.057333 3.758 1.199333 NA
62 5.843333 3.057333 3.758 1.199333 NA
63 5.843333 3.057333 3.758 1.199333 NA
64 5.843333 3.057333 3.758 1.199333 NA
65 5.843333 3.057333 3.758 1.199333 NA
66 5.843333 3.057333 3.758 1.199333 NA
67 5.843333 3.057333 3.758 1.199333 NA
68 5.843333 3.057333 3.758 1.199333 NA
69 5.843333 3.057333 3.758 1.199333 NA
70 5.843333 3.057333 3.758 1.199333 NA
71 5.843333 3.057333 3.758 1.199333 NA
72 5.843333 3.057333 3.758 1.199333 NA
73 5.843333 3.057333 3.758 1.199333 NA
74 5.843333 3.057333 3.758 1.199333 NA
75 5.843333 3.057333 3.758 1.199333 NA
76 5.843333 3.057333 3.758 1.199333 NA
77 5.843333 3.057333 3.758 1.199333 NA
78 5.843333 3.057333 3.758 1.199333 NA
79 5.843333 3.057333 3.758 1.199333 NA
80 5.843333 3.057333 3.758 1.199333 NA
81 5.843333 3.057333 3.758 1.199333 NA
82 5.843333 3.057333 3.758 1.199333 NA
83 5.843333 3.057333 3.758 1.199333 NA
84 5.843333 3.057333 3.758 1.199333 NA
85 5.843333 3.057333 3.758 1.199333 NA
86 5.843333 3.057333 3.758 1.199333 NA
87 5.843333 3.057333 3.758 1.199333 NA
88 5.843333 3.057333 3.758 1.199333 NA
89 5.843333 3.057333 3.758 1.199333 NA
90 5.843333 3.057333 3.758 1.199333 NA
91 5.843333 3.057333 3.758 1.199333 NA
92 5.843333 3.057333 3.758 1.199333 NA
93 5.843333 3.057333 3.758 1.199333 NA
94 5.843333 3.057333 3.758 1.199333 NA
95 5.843333 3.057333 3.758 1.199333 NA
96 5.843333 3.057333 3.758 1.199333 NA
97 5.843333 3.057333 3.758 1.199333 NA
98 5.843333 3.057333 3.758 1.199333 NA
99 5.843333 3.057333 3.758 1.199333 NA
100 5.843333 3.057333 3.758 1.199333 NA
101 5.843333 3.057333 3.758 1.199333 NA

102 5.843333 3.057333 3.758 1.199333 NA
103 5.843333 3.057333 3.758 1.199333 NA
104 5.843333 3.057333 3.758 1.199333 NA
105 5.843333 3.057333 3.758 1.199333 NA
106 5.843333 3.057333 3.758 1.199333 NA
107 5.843333 3.057333 3.758 1.199333 NA
108 5.843333 3.057333 3.758 1.199333 NA
109 5.843333 3.057333 3.758 1.199333 NA
110 5.843333 3.057333 3.758 1.199333 NA
111 5.843333 3.057333 3.758 1.199333 NA
112 5.843333 3.057333 3.758 1.199333 NA
113 5.843333 3.057333 3.758 1.199333 NA
114 5.843333 3.057333 3.758 1.199333 NA
115 5.843333 3.057333 3.758 1.199333 NA
116 5.843333 3.057333 3.758 1.199333 NA
117 5.843333 3.057333 3.758 1.199333 NA
118 5.843333 3.057333 3.758 1.199333 NA
119 5.843333 3.057333 3.758 1.199333 NA
120 5.843333 3.057333 3.758 1.199333 NA
121 5.843333 3.057333 3.758 1.199333 NA
122 5.843333 3.057333 3.758 1.199333 NA
123 5.843333 3.057333 3.758 1.199333 NA
124 5.843333 3.057333 3.758 1.199333 NA
125 5.843333 3.057333 3.758 1.199333 NA
126 5.843333 3.057333 3.758 1.199333 NA
127 5.843333 3.057333 3.758 1.199333 NA
128 5.843333 3.057333 3.758 1.199333 NA
129 5.843333 3.057333 3.758 1.199333 NA
130 5.843333 3.057333 3.758 1.199333 NA
131 5.843333 3.057333 3.758 1.199333 NA
132 5.843333 3.057333 3.758 1.199333 NA
133 5.843333 3.057333 3.758 1.199333 NA
134 5.843333 3.057333 3.758 1.199333 NA
135 5.843333 3.057333 3.758 1.199333 NA
136 5.843333 3.057333 3.758 1.199333 NA
137 5.843333 3.057333 3.758 1.199333 NA
138 5.843333 3.057333 3.758 1.199333 NA
139 5.843333 3.057333 3.758 1.199333 NA
140 5.843333 3.057333 3.758 1.199333 NA
141 5.843333 3.057333 3.758 1.199333 NA
142 5.843333 3.057333 3.758 1.199333 NA
143 5.843333 3.057333 3.758 1.199333 NA
144 5.843333 3.057333 3.758 1.199333 NA

145 5.843333 3.057333 3.758 1.199333 NA
146 5.843333 3.057333 3.758 1.199333 NA
147 5.843333 3.057333 3.758 1.199333 NA
148 5.843333 3.057333 3.758 1.199333 NA
149 5.843333 3.057333 3.758 1.199333 NA
150 5.843333 3.057333 3.758 1.199333 NA

Species n
1 setosa 250.3
2 versicolor 296.8
3 virginica 329.4
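
As the warning above indicates, mutate_each() was deprecated in dplyr 0.7.0. A minimal
modern equivalent uses across(); restricting it to the numeric columns also avoids the warning
caused by taking the mean of the Species factor:

# across() applies a function to a selection of columns inside mutate()
mutate(iris, across(where(is.numeric), mean))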

Now, take a deeper dive into the syntax of dplyr for data manipulation using the nycflights13
dataset.
Examine and view the flights dataset from the nycflights13 package. Remember, to access an
object directly from within a specific package, use the double colon, for example: nycflights13::flights.
To examine the entire dataset you may then use the View() function in the console (but again,
remember not to use the View() function in an RMarkdown code chunk).

nycflights13::flights

# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 336,766 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>

You may use the filter() function to filter rows meeting certain criteria. So, use the filter()
function to identify all flights on January 1st. Remember in order to indicate “equals” you
need the double ==. Hint, use month == , and day==.

filter(flights, month == 1, day ==1)

# A tibble: 842 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 832 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>

Since dplyr functions never modify their inputs (the input data), if you want to save the results
you'll need to create an object containing them. So repeat what you just did using the
filter() function, but save the results in an object called jan1.

jan1 <- filter(flights, month == 1, day ==1)

Deliverable 3: Identify Christmas Flights

Use the filter() function to extract all flights on December 25 and save them to an object called
dec25. Surround the entire expression in parentheses to simultaneously print out the results.

(dec25 <- filter(flights, month == 12, day == 25))

# A tibble: 719 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 12 25 456 500 -4 649 651
2 2013 12 25 524 515 9 805 814
3 2013 12 25 542 540 2 832 850
4 2013 12 25 546 550 -4 1022 1027

5 2013 12 25 556 600 -4 730 745
6 2013 12 25 557 600 -3 743 752
7 2013 12 25 557 600 -3 818 831
8 2013 12 25 559 600 -1 855 856
9 2013 12 25 559 600 -1 849 855
10 2013 12 25 600 600 0 850 846
# i 709 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>

How many flights departed on Christmas Day?


Now practice filtering using the comparison operators in R: > (greater than); >= (greater
than or equal to); < (less than); <= (less than or equal to); != (not equal); == (equal)
Try out this line of code. You will receive an error message. Interpret the error message and
then fix the code.

filter(flights, month = 1)

Now, fix the error
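
The fix is to use the comparison operator == rather than =, which filter() interprets as a
(misplaced) argument assignment:

filter(flights, month == 1)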

# A tibble: 27,004 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 26,994 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>

Logical Operators
Multiple arguments to filter() are combined with “and”; every expression must be true in order
for a row to be included in the output. You may also use Boolean operations.
Try to identify all the flights that departed in November or December. Hint, use | for or.

filter(flights, month == 11 | month == 12)

# A tibble: 55,403 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 11 1 5 2359 6 352 345
2 2013 11 1 35 2250 105 123 2356
3 2013 11 1 455 500 -5 641 651
4 2013 11 1 539 545 -6 856 827
5 2013 11 1 542 545 -3 831 855
6 2013 11 1 549 600 -11 912 923
7 2013 11 1 550 600 -10 705 659
8 2013 11 1 554 600 -6 659 701
9 2013 11 1 554 600 -6 826 827
10 2013 11 1 554 600 -6 749 751
# i 55,393 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>

How many flights departed in either November or December?


Practicing X %in% y. Create an object called nov_dec by using the filter() function on the
flights object, with month %in% c(11,12)) in the argument.

nov_dec <- filter(flights, month %in% c(11,12))

The arrange() function works similarly to filter() except that instead of selecting rows, it
changes their order. Use the arrange() function on the flights object, and in the argument give
it the columns to sort by: year, month, and day.

arrange(flights, year, month, day)

# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 336,766 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>

Use the descending desc() function within the arrange() function to reorder by the arr_delay
column in descending order.

arrange(flights, desc(arr_delay))

# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 9 641 900 1301 1242 1530
2 2013 6 15 1432 1935 1137 1607 2120
3 2013 1 10 1121 1635 1126 1239 1810
4 2013 9 20 1139 1845 1014 1457 2210
5 2013 7 22 845 1600 1005 1044 1815
6 2013 4 10 1100 1900 960 1342 2211
7 2013 3 17 2321 810 911 135 1020
8 2013 7 22 2257 759 898 121 1026
9 2013 12 5 756 1700 896 1058 2020
10 2013 5 3 1133 2055 878 1250 2215
# i 336,766 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>

Use select() to identify the variables you are most interested in at the moment. Here, use the
select() function to select the following three columns from flights: year, month, and day.

select(flights, year, month, day)

# A tibble: 336,776 x 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# i 336,766 more rows

Now, use the select() function to select all columns between year and day (inclusive). Hint,
use the : operator.

select(flights, year:day)

# A tibble: 336,776 x 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# i 336,766 more rows

Use the - sign in the argument to select all columns except those from year to day (inclusive).
To accomplish this, use the select() function with flights, -(year:day) in the argument.

select(flights, -(year:day))

# A tibble: 336,776 x 16
dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
<int> <int> <dbl> <int> <int> <dbl> <chr>
1 517 515 2 830 819 11 UA
2 533 529 4 850 830 20 UA
3 542 540 2 923 850 33 AA
4 544 545 -1 1004 1022 -18 B6
5 554 600 -6 812 837 -25 DL
6 554 558 -4 740 728 12 UA
7 555 600 -5 913 854 19 B6
8 557 600 -3 709 723 -14 EV
9 557 600 -3 838 846 -8 B6
10 558 600 -2 753 745 8 AA
# i 336,766 more rows
# i 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Helper functions to use with the select() function; starts_with(“abc”) - matches names that
begin with “abc”; ends_with(“xyz”) - matches names that end with “xyz”; contains(“ijk”)
matches names that contain “ijk”; matches(“(.)\1”) selects variables that match a regular
expression.
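
A quick sketch of these helpers applied to the flights columns:

select(flights, starts_with("dep")) # dep_time, dep_delay
select(flights, ends_with("delay")) # dep_delay, arr_delay
select(flights, contains("time")) # dep_time, sched_dep_time, air_time, time_hour, ...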
The rename() function will keep all variables that aren't explicitly mentioned. Here, use
the rename() function on the flights object, and rename the tailnum variable/column to
tail_num.

rename(flights, tail_num = tailnum)

# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723

9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 336,766 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tail_num <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>

Deliverable 4: Use the mutate()

Use the mutate() function to create and add new variables to a dataset.
To practice this on the flights dataset and be able to see the result easily, let's first create a
condensed version of the dataset called flights_sml containing the columns year through
day, the columns ending with the word "delay", distance, and air_time.

flights_sml <- select(flights, year:day, ends_with("delay"),distance, air_time)

Then, from that reduced dataset, create two new variables: 1. "gain", which consists of
arr_delay - dep_delay, and 2. "speed", which consists of distance divided by air_time, times
60. Hint: for the gain variable in the argument to the mutate() function, you would use gain
= arr_delay - dep_delay.

mutate(flights_sml, gain = arr_delay - dep_delay, speed = distance/air_time*60)

# A tibble: 336,776 x 9
year month day dep_delay arr_delay distance air_time gain speed
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227 9 370.
2 2013 1 1 4 20 1416 227 16 374.
3 2013 1 1 2 33 1089 160 31 408.
4 2013 1 1 -1 -18 1576 183 -17 517.
5 2013 1 1 -6 -25 762 116 -19 394.
6 2013 1 1 -4 12 719 150 16 288.
7 2013 1 1 -5 19 1065 158 24 404.
8 2013 1 1 -3 -14 229 53 -11 259.
9 2013 1 1 -3 -8 944 140 -5 405.
10 2013 1 1 -2 8 733 138 10 319.
# i 336,766 more rows

You may now use the variables you just created, referring to them later in the same mutate()
call:

mutate(flights_sml, gain = arr_delay - dep_delay, hours = air_time/60, gain_per_hour = gain/hours)

# A tibble: 336,776 x 10
year month day dep_delay arr_delay distance air_time gain hours
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227 9 3.78
2 2013 1 1 4 20 1416 227 16 3.78
3 2013 1 1 2 33 1089 160 31 2.67
4 2013 1 1 -1 -18 1576 183 -17 3.05
5 2013 1 1 -6 -25 762 116 -19 1.93
6 2013 1 1 -4 12 719 150 16 2.5
7 2013 1 1 -5 19 1065 158 24 2.63
8 2013 1 1 -3 -14 229 53 -11 0.883
9 2013 1 1 -3 -8 944 140 -5 2.33
10 2013 1 1 -2 8 733 138 10 2.3
# i 336,766 more rows
# i 1 more variable: gain_per_hour <dbl>

The transmute() function works like mutate() but keeps only the new variables you create:

transmute(flights, gain = arr_delay - dep_delay, hours = air_time/60, gain_per_hour = gain/hours)

# A tibble: 336,776 x 3
gain hours gain_per_hour
<dbl> <dbl> <dbl>
1 9 3.78 2.38
2 16 3.78 4.23
3 31 2.67 11.6
4 -17 3.05 -5.57
5 -19 1.93 -9.83
6 16 2.5 6.4
7 24 2.63 9.11
8 -11 0.883 -12.5
9 -5 2.33 -2.14
10 10 2.3 4.35
# i 336,766 more rows

Useful Creation Functions

Grouped summaries with the summarize() function: this collapses a data frame down to a
single row.

summarize(flights, delay = mean(dep_delay, na.rm=TRUE))

# A tibble: 1 x 1
delay
<dbl>
1 12.6

Now, let’s increase the functionality by pairing the summarize() with the group_by() function.
This changes the unit of analysis from the complete dataset to individual groups.
So now, let’s apply the same code group by date.

by_day <- group_by(flights, year, month, day)

summarize(by_day, delay = mean(dep_delay, na.rm=TRUE))

`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.

# A tibble: 365 x 4
# Groups: year, month [12]
year month day delay
<int> <int> <int> <dbl>
1 2013 1 1 11.5
2 2013 1 2 13.9
3 2013 1 3 11.0
4 2013 1 4 8.95
5 2013 1 5 5.73
6 2013 1 6 7.15
7 2013 1 7 5.42
8 2013 1 8 2.55
9 2013 1 9 2.28
10 2013 1 10 2.84
# i 355 more rows
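
The message above notes that the result is still grouped by year and month. You can control
this with the .groups argument to summarize(); for example, .groups = "drop" returns a fully
ungrouped result:

summarize(by_day, delay = mean(dep_delay, na.rm = TRUE), .groups = "drop")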

Using group_by() with summarize() is one of the most commonly used idioms in dplyr:
grouped summaries.
Now, let's introduce the pipe to combine multiple operations.
Let's explore the relationship between the distance and average delay for each location.

This first analysis is without using the pipe. There are three steps to prepare this data: 1.
Group flights by destination; 2. Summarize to compute distance, average delay, and number
of flights; and 3. Filter to remove noisy points and Honolulu airport, which is almost twice as
far away as the next closest airport.

by_dest <- group_by(flights, dest)

delay <- summarize(by_dest, count=n(), dist=mean(distance, na.rm=TRUE),
delay=mean(arr_delay, na.rm=TRUE))

delay <- filter(delay, count >20, dest != "HNL")

ggplot(data = delay, mapping = aes(x=dist, y=delay))+
geom_point(aes(size=count), alpha = 1/3)+
geom_smooth(se=FALSE)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

[Scatter plot: dist on the x axis, delay on the y axis, point size mapped to count, with a loess smoothing line.]

Deliverable 5: Use the pipe operator to create an object called delays which 1. groups
flights by destination; 2. summarizes to compute distance, average delay, and number
of flights; and 3. filters to remove noisy points and Honolulu airport.

delays <- flights %>%
group_by(dest) %>%
summarize(
count=n(),
dist=mean(distance, na.rm=TRUE),
delay=mean(arr_delay, na.rm=TRUE)) %>%
filter(count > 20, dest != "HNL")

Group by Multiple Variables

daily <- group_by(flights, year, month, day)


(per_day <- summarize(daily, flights=n()))

`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.

# A tibble: 365 x 4
# Groups: year, month [12]
year month day flights
<int> <int> <int> <int>
1 2013 1 1 842
2 2013 1 2 943
3 2013 1 3 914
4 2013 1 4 915
5 2013 1 5 720
6 2013 1 6 832
7 2013 1 7 933
8 2013 1 8 899
9 2013 1 9 902
10 2013 1 10 932
# i 355 more rows

To remove the grouping

daily %>%
ungroup() %>%
summarize(flights=n())

# A tibble: 1 x 1
flights
<int>
1 336776

Part 2: Handling Missing Values with dplyr

Deliverable 6: Practicing group_by

Demonstrate what happens to our flights dataset when we group by year, month, and day,
and then summarize by mean departure delay, but do not use the na.rm argument to remove
missing values.

flights %>%
group_by(year, month, day) %>%
summarize(mean=mean(dep_delay))

`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.

# A tibble: 365 x 4
# Groups: year, month [12]
year month day mean
<int> <int> <int> <dbl>
1 2013 1 1 NA
2 2013 1 2 NA
3 2013 1 3 NA
4 2013 1 4 NA
5 2013 1 5 NA
6 2013 1 6 NA
7 2013 1 7 NA
8 2013 1 8 NA
9 2013 1 9 NA
10 2013 1 10 NA
# i 355 more rows

You see lots of missing values. This happens because any aggregation function follows the
usual rule for missing values: if there is any missing value in the input, the output will be a
missing value. That is why all aggregation functions have the na.rm argument, which removes
the missing values prior to computation.

flights %>%
group_by(year, month, day) %>%
summarize(mean=mean(dep_delay, na.rm=TRUE))

`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.

# A tibble: 365 x 4
# Groups: year, month [12]
year month day mean
<int> <int> <int> <dbl>
1 2013 1 1 11.5
2 2013 1 2 13.9
3 2013 1 3 11.0
4 2013 1 4 8.95
5 2013 1 5 5.73
6 2013 1 6 7.15
7 2013 1 7 5.42
8 2013 1 8 2.55
9 2013 1 9 2.28
10 2013 1 10 2.84
# i 355 more rows
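
As an alternative to na.rm, you could drop the rows with missing departure delays (cancelled
flights) before summarizing; a minimal sketch:

flights %>%
filter(!is.na(dep_delay)) %>% # keep only rows where dep_delay is recorded
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay))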

Now, please practice some of these data wrangling techniques on real text data. In most cases,
I am providing the complete code for you. I have provided some text data for you to use
during the lab, but you should be thinking about how you can apply these techniques to your
own text data for your final projects.

Part 3: Practicing Data Wrangling on Real Text Mining Projects

IMPORT THE DATA


Now, let’s return to the impeachment dataset we started on in Lab 3. Import the tab-delimited
data using the tidyverse approach, using the read_tsv() from readr, and create an object called
“impeachtidy”. After you create this object, take a quick look at it with the View() function
in the console (reminder, not in an RMarkdown chunk.

impeachtidy <- read_tsv("impeach.tab")

Rows: 10987 Columns: 5


-- Column specification --------------------------------------------------------
Delimiter: "\t"
chr (4): SPEAKER, MAIN SPEAKER, ROLE, TEXT
date (1): HEARING

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.

PREPROCESS THE DATA

Deliverable 7: Tokenize the impeachtidy dataset using the unnest_tokens() function on
the "TEXT" variable/column to separate the text, so that it has one token per row, and
store that output in a new object called impeach_words.

impeach_words <- impeachtidy %>%
unnest_tokens(word, TEXT)

You may review the new impeach_words object to see how it tokenized the column TEXT
into one word per row in a new column called word.
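
impeach_words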

# A tibble: 376,436 x 5
HEARING SPEAKER `MAIN SPEAKER` ROLE word
<date> <chr> <chr> <chr> <chr>
1 2019-11-20 Adam Schiff D-Schiff Democrat your
2 2019-11-20 Adam Schiff D-Schiff Democrat interest
3 2019-11-20 Adam Schiff D-Schiff Democrat in
4 2019-11-20 Adam Schiff D-Schiff Democrat being
5 2019-11-20 Adam Schiff D-Schiff Democrat here
6 2019-11-20 Adam Schiff D-Schiff Democrat in
7 2019-11-20 Adam Schiff D-Schiff Democrat turn
8 2019-11-20 Adam Schiff D-Schiff Democrat we
9 2019-11-20 Adam Schiff D-Schiff Democrat ask
10 2019-11-20 Adam Schiff D-Schiff Democrat for
# i 376,426 more rows

You will see R returns a note telling you this object is a tibble (the tidyverse data structure
for a data frame) that is now 376,436 x 5 (remember, previously the dataset had 10,987
observations).
Load the tidytext stopword dictionary and explore its contents. You will notice the dictionary
draws its words (n=1,149) from three different lexicons (SMART, snowball, onix). Use the
data(), head(), and tail() functions to review the stop_words dictionary.

data(stop_words)
head(stop_words)
tail(stop_words)

# A tibble: 6 x 2
word lexicon
<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART

# A tibble: 6 x 2
word lexicon
<chr> <chr>
1 you onix
2 young onix
3 younger onix
4 youngest onix
5 your onix
6 yours onix
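
To see how many words each lexicon contributes, you can count the dictionary by lexicon (a
small sketch; the three totals sum to 1,149):

stop_words %>%
count(lexicon)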

Deliverable 8: Apply the built-in stopwords dictionary to our impeach_words dataset
using the anti_join() function. Use the pipe capabilities %>% of the tidyverse.

impeach_clean <- impeach_words %>%
anti_join(stop_words)

Joining with `by = join_by(word)`

Let’s take a quick look at the now “clean” dataset

impeach_clean

# A tibble: 133,884 x 5
HEARING SPEAKER `MAIN SPEAKER` ROLE word
<date> <chr> <chr> <chr> <chr>
1 2019-11-20 Adam Schiff D-Schiff Democrat respect
2 2019-11-20 Adam Schiff D-Schiff Democrat proceed
3 2019-11-20 Adam Schiff D-Schiff Democrat hearing
4 2019-11-20 Adam Schiff D-Schiff Democrat intention
5 2019-11-20 Adam Schiff D-Schiff Democrat committee
6 2019-11-20 Adam Schiff D-Schiff Democrat proceed
7 2019-11-20 Adam Schiff D-Schiff Democrat disruptions
8 2019-11-20 Adam Schiff D-Schiff Democrat chairman
9 2019-11-20 Adam Schiff D-Schiff Democrat ll
10 2019-11-20 Adam Schiff D-Schiff Democrat steps
# i 133,874 more rows

ANALYZE THE DATA

Deliverable 9: Count the most frequently occurring words in the dataset.

impeach_clean %>%
count(word, sort = TRUE)

# A tibble: 9,176 x 2
word n
<chr> <int>
1 president 5049
2 ukraine 1872
3 ambassador 1802
4 trump 1632
5 call 1210
6 zelensky 1130
7 correct 1096
8 meeting 889
9 time 805
10 sondland 795
# i 9,166 more rows

What are the top ten words in order from this clean dataset?

Deliverable 10: Visualize this count using the ggplot2 package. Create a bar chart of all
the words occurring more than 600 times in the dataset (you could adjust that by
changing the filter() parameter).

impeach_clean %>%
count(word, sort = TRUE) %>%
filter(n>600) %>%
mutate(word=reorder(word,n)) %>%
ggplot(aes(word,n)) +
geom_col() +
xlab(NULL) +
coord_flip()

[Bar chart: horizontal bars of all words occurring more than 600 times, ordered by frequency, from president (just over 5,000) down through ukraine, ambassador, trump, call, zelensky, correct, meeting, time, sondland, security, house, don, investigations, people, ve, ukrainian, and impeachment.]

Deliverable 11: Combining all the steps using the pipe capabilities of dplyr.

As I mentioned, in the tidyverse these steps could all be nested using the %>% capabilities, as
below. Let's use the broom icons in RStudio to clear all the objects we created in both the
Environment and Plots panes, and recreate them by highlighting all the lines below and running
them. Could you combine this even further and achieve the same result?

impeachtidy <- read_tsv("impeach.tab")

impeach_words <- impeachtidy %>%
unnest_tokens(word, TEXT) %>%
anti_join(stop_words)

impeach_clean <- impeach_words %>%
anti_join(stop_words)

impeach_clean %>%
count(word, sort = TRUE) %>%
filter(n>600) %>%
mutate(word=reorder(word,n)) %>%
ggplot(aes(word,n)) +
geom_col() +
xlab(NULL) +
coord_flip()

Rows: 10987 Columns: 5


-- Column specification --------------------------------------------------------
Delimiter: "\t"
chr (4): SPEAKER, MAIN SPEAKER, ROLE, TEXT
date (1): HEARING

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`

[The same bar chart as above: words occurring more than 600 times, ordered by frequency, from president down to impeachment.]

Analyze the impeach_words dataset, get a count of how many words were spoken by each
speaker, and then visualize them.

impeach_words <- impeachtidy %>%
unnest_tokens(word, TEXT) %>%
count(SPEAKER, word, sort=TRUE) %>%
ungroup()

Deliverable 12: Group by speaker then explore the object and visualize the results.

total_impeach <- impeach_words %>%
group_by(SPEAKER) %>%
summarize(total=sum(n)) %>%
arrange(desc(total))

total_impeach

# A tibble: 75 x 2
SPEAKER total
<chr> <int>
1 Daniel Goldman 35478

2 Adam Schiff 30222
3 Stephen Castor 29646
4 Devin Nunes 19602
5 Kurt Volker 13404
6 Fiona Hill 13245
7 Doug Collins 13197
8 Gordon Sondland 12558
9 Bill Taylor 11998
10 M. Yovanovitch 11513
# i 65 more rows

You will see that four people stand out more than the others, three perhaps expected and
one somewhat surprising: Daniel Goldman, Adam Schiff, and Stephen Castor; and Devin
Nunes.

total_impeach %>%
ggplot(aes(SPEAKER,total)) +
geom_col() +
xlab(NULL) +
ylab(NULL) +
coord_flip()

[Bar chart: total words per speaker, with speaker names on the y axis via coord_flip(); Daniel Goldman, Adam Schiff, Stephen Castor, and Devin Nunes have the longest bars.]

You may click on Zoom to bring up the chart viewer and make the chart more readable.

Here, the coord_flip() function flips the Cartesian coordinates and allows you to see the
speakers' names on the y axis. If you remove that layer, the chart will still plot, but it is much
harder to read.

total_impeach %>%
ggplot(aes(SPEAKER,total)) +
geom_col() +
xlab(NULL) +
ylab(NULL)

[Bar chart: total words per speaker with names on the x axis; without coord_flip() the speaker labels overlap and are unreadable.]

IMPORT A FOLDER OF .TXT FILES USING TM: IGF BALI TRANSCRIPTS

Deliverable 13: Exploring .txt files using tm package

Import the collection of IGF transcripts, create a corpus called igfbali, and inspect and
summarize the first two cases of the corpus. This is an example of bringing a collection of text
files into R using the tm package.

igfbali <- Corpus(DirSource('txt_data'), readerControl=list(reader=readPlain))

Review the class of the igfbali object.

class(igfbali)

igfbali

[1] "SimpleCorpus" "Corpus"

<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 63
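
The deliverable also asks you to inspect the first two cases; a minimal sketch using tm's
inspect() on a corpus subset:

inspect(igfbali[1:2]) # prints metadata and content for documents 1 and 2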

All of these pre-processing steps are optional. I suggest making a first pass through the data
without using them (e.g., commenting them out), and then experimenting with using them
selectively.

Deliverable 14: Pre-processing the igfbali corpus

Remove the stopwords from this igf corpus using the tm built in stopword dictionary

igfbali <- tm_map(igfbali, removeWords, stopwords("english"))

Create more StopWords (essentially adding to an Exclusion List)

more.stop.words <- c("transcript", "transcripts")

Now remove those new words (the exclusion list) from the corpus. Note that removeWords
takes the character vector itself, not a quoted name:

igfbali <- tm_map(igfbali, removeWords, more.stop.words)

STEMMING: Will apply the stemming algorithm to the corpus

tm_map(igfbali, stemDocument)
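
Note that tm_map() returns a new corpus rather than modifying igfbali in place; to keep the
stemmed corpus for the steps below, you would assign it back:

igfbali <- tm_map(igfbali, stemDocument)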

ANALYZE YOUR CORPUS

CREATING A TERM-DOCUMENT MATRIX: Create a term-document matrix from a
corpus using either:
TermDocumentMatrix: Terms are in "rows" and Documents are in "columns"
DocumentTermMatrix: Documents are in "rows" and Terms are in "columns" * my preference *
NB: It is critical that you remember which one you created, and refer to tdm or dtm
accurately.

Deliverable 15: Create a Document Term Matrix (DTM) of the igfbali corpus.

This approach will allow you to exploit some very interesting and powerful R functions (like
clustering, classifications, etc).

dtm <- DocumentTermMatrix(igfbali)

or if you prefer to use the TermDocumentMatrix function:

tdm <- TermDocumentMatrix(igfbali)

Deliverable 16: Exploring the Document Term Matrix (DTM)

Use the function findFreqTerms to identify the key terms in the dataset that occur at least x
(n) times; explore different options until you find a useful grouping.

findFreqTerms(dtm, 500)

[1] ". " "actually" "also" "and"


[5] "around" "back" "big" "can"
[9] "come" "countries" "data" "different"
[13] "even" "first" "give" "going"
[17] "good" "governance" "government" "i'm"
[21] "igf" "important" "information" "internet"
[25] "issues" "just" "kind" "know"
[29] "know," "last" "like" "look"
[33] "lot" "make" "many" "may"
[37] "maybe" "much" "need" "new"
[41] "now" "one" "part" "people"
[45] "point" "policy" "question" "really"
[49] "right" "say" "see" "something"
[53] "take" "talk" "talking" "technical"
[57] "terms" "thank" "that" "that's"
[61] "the" "there" "they" "thing"
[65] "things" "think" "this" "time"
[69] "two" "use" "way" "will"
[73] "work" "working" "world" "you"
[77] "access" "but" "community" "content"
[81] "freedom" "get" "human" "it's"
[85] "local" "online" "rights" "want"
[89] "{oops/}" " " " " " "
[93] " " " "

SPARSE TERMS: Remove sparse terms using the function removeSparseTerms. The sparse
argument sets the maximum allowed sparsity: sparse=0.4 keeps only terms that appear in at
least 60% of the documents.

inspect(removeSparseTerms(dtm, sparse=0.4))

Deliverable 17: Finding Word Associations in the DTM

Using the function findAssocs(), find the terms that correlate, with a correlation of at least
0.8, with at least two specified terms you would like to explore in this Internet Governance
Forum dataset.

findAssocs(dtm, "activists", 0.8)

findAssocs(dtm, "cybersecurity", 0.8)

$activists
moral coalitions. hackers, tunisia.
0.83 0.82 0.82 0.80

$cybersecurity
terrorism bleeds increasing, nation's
0.94 0.91 0.91 0.91
combating cybersecurity, norm, cybercrime,
0.90 0.88 0.87 0.83
chris, infrastructures spam, malicious
0.83 0.83 0.82 0.82
spam spamming "spam" '04
0.81 0.81 0.81 0.81
'05, '06, '06. '17,
0.81 0.81 0.81 0.81
'990s (beep) (beep) -- (security):
0.81 0.81 0.81 0.81
1770-something 1770-something, 2016 2016?
0.81 0.81 0.81 0.81
24--7 5:00, >>k. abcs
0.81 0.81 0.81 0.81
accede accomplish, acdc, acm,
0.81 0.81 0.81 0.81
acronym adamant adopt. adopt --
0.81 0.81 0.81 0.81
advertisements, affiliates, agencies' agency;
0.81 0.81 0.81 0.81

ain't analytics, analyzed, answered?
0.81 0.81 0.81 0.81
anti-abuse anti-phishing anti-spam antispam
0.81 0.81 0.81 0.81
anyone -- apcert, arises? article.
0.81 0.81 0.81 0.81
aspects; aspects? assault assist,
0.81 0.81 0.81 0.81
attuned audience -- auscert authenticating
0.81 0.81 0.81 0.81
avail back: backs. batnet
0.81 0.81 0.81 0.81
beep body? botnet, botnet-like
0.81 0.81 0.81 0.81
botnets, boundaries, box, boyer
0.81 0.81 0.81 0.81
boyer: branding. brian building --
0.81 0.81 0.81 0.81
burst buttons c-level caller
0.81 0.81 0.81 0.81
calling, canspam. capability. certs?
0.81 0.81 0.81 0.81
chair's chance. characteristic charities.
0.81 0.81 0.81 0.81
chris? chris -- circus citizen networks
0.81 0.81 0.81 0.81
classified clean. click, clogging
0.81 0.81 0.81 0.81
closed. closer, closes commercials
0.81 0.81 0.81 0.81
commercial -- commonwealth. communities -- complete?
0.81 0.81 0.81 0.81
components. computer -- conflating congratulations,
0.81 0.81 0.81 0.81
construed contents. cooperate -- counterparts --
0.81 0.81 0.81 0.81
counterterrorism, counterterrorism. country; crimes
0.81 0.81 0.81 0.81
cure. cured. cyberattacks cybercapacity
0.81 0.81 0.81 0.81
cybercrime-related cybercrime; cybercrime -- cyberevent
0.81 0.81 0.81 0.81
cyberlaw. cyberthreats, dangerous, daniel

0.81 0.81 0.81 0.81
debated, deeds defenses define --
0.81 0.81 0.81 0.81
degradation destroying diplomat, discern
0.81 0.81 0.81 0.81
discussion. dismissed disposal. disservice,
0.81 0.81 0.81 0.81
dominican donations, done -- doorstep
0.81 0.81 0.81 0.81
doorstep. dors driver. drops,
0.81 0.81 0.81 0.81
drove drunk. earlier -- educated.
0.81 0.81 0.81 0.81
employer enabler. enablers. enentire
0.81 0.81 0.81 0.81
enforcement; enriched, enrichment enter,
0.81 0.81 0.81 0.81
eu-funded european -- ex-colleagues except,
0.81 0.81 0.81 0.81
executive expressions, extra-territorial faso hassan.
0.81 0.81 0.81 0.81
fernando, fierce fighting. fines
0.81 0.81 0.81 0.81
fining firs, floated -- florida
0.81 0.81 0.81 0.81
follow-. four -- frameworks: frameworks:
0.81 0.81 0.81 0.81
fraud? ftc, gain, gambling,
0.81 0.81 0.81 0.81
getting, gideon gideon, give.
0.81 0.81 0.81 0.81
glove. government -- grass-root grass-roots
0.81 0.81 0.81 0.81
gsa, hacking-related hacks, haming
0.81 0.81 0.81 0.81
handed, hands- harmonization, headphones
0.81 0.81 0.81 0.81
hijack idea -- ills impinges
0.81 0.81 0.81 0.81
implement. inconvenience. increasing -- ineffective
0.81 0.81 0.81 0.81
infect infection infections infections.
0.81 0.81 0.81 0.81

infections? infects innovation-based instructor
0.81 0.81 0.81 0.81
integration. internationally -- internationals. interoperable,
0.81 0.81 0.81 0.81
interrelated, investigating, irritating jammed,
0.81 0.81 0.81 0.81
jay jayantha? jobs? johnson
0.81 0.81 0.81 0.81
johnson. jpcert judiciary, jurists,
0.81 0.81 0.81 0.81
karen, keshted labeled, last --
0.81 0.81 0.81 0.81
law-based leapfrog legislator legislators.
0.81 0.81 0.81 0.81
lepris, liaisons litany maawg
0.81 0.81 0.81 0.81
maawg, maawg. maawg -- mail.
0.81 0.81 0.81 0.81
mailbox, mailboxes makarim, makarim:
0.81 0.81 0.81 0.81
malware, malware. married mayu fumo,
0.81 0.81 0.81 0.81
merged messaging. mexico's mic).
0.81 0.81 0.81 0.81
microphones, misconduct, mismatch mobiles,
0.81 0.81 0.81 0.81
moderately month. montreal, mood
0.81 0.81 0.81 0.81
motivation. mpasa, mulberry, mulberry.
0.81 0.81 0.81 0.81
mulberry: must -- national-level natris,
0.81 0.81 0.81 0.81
natris: ncic nefarious netterlands
0.81 0.81 0.81 0.81
non-south nonsolicited nonstate normal," one
0.81 0.81 0.81 0.81
note -- notifying nuisance nuisance.
0.81 0.81 0.81 0.81
oddly offenses, offenses. omnibus
0.81 0.81 0.81 0.81
one? -- onwards onwards, open--shut,
0.81 0.81 0.81 0.81
opt- opted osc, outfits

0.81 0.81 0.81 0.81
outlining overlap, overwhelmed painter
0.81 0.81 0.81 0.81
painter. painter: panel -- partners. --
0.81 0.81 0.81 0.81
pass. pcs perspective? perspective --
0.81 0.81 0.81 0.81
pillar. pipes, plaintiffs. policymakers.
0.81 0.81 0.81 0.81
possible? postgraduate preference -- presenters,
0.81 0.81 0.81 0.81
pretended preventative privacy-sensitive profitable,
0.81 0.81 0.81 0.81
promote. promoting -- promotion, promptly
0.81 0.81 0.81 0.81
pronounced pronounced, propaganda proportion,
0.81 0.81 0.81 0.81
proportions, prpt psace put --
0.81 0.81 0.81 0.81
python's quantity question -- raising?
0.81 0.81 0.81 0.81
rater, realisation receiver receptive
0.81 0.81 0.81 0.81
reduction. reevaluate regarded -- regarding --
0.81 0.81 0.81 0.81
regardless, region. remember? remit.
0.81 0.81 0.81 0.81
remote -- reorganisation requiring, resnick.
0.81 0.81 0.81 0.81
resources? revolutionary rican router,
0.81 0.81 0.81 0.81
routes. sadowski. saturate, schedules
0.81 0.81 0.81 0.81
scheme. segueing self-aid self-governance
0.81 0.81 0.81 0.81
self-regulation, senders servers -- shalt
0.81 0.81 0.81 0.81
sharing? significantly. siphoned sketch
0.81 0.81 0.81 0.81
socialize -- spam. spam? spamed,
0.81 0.81 0.81 0.81
spammer spammers spamming, spam --
0.81 0.81 0.81 0.81

speak -- spear-phishing spear-phishing, specialists,
0.81 0.81 0.81 0.81
standards-based stated, statutory stopped,
0.81 0.81 0.81 0.81
streamlined subjects, subject -- succeed,
0.81 0.81 0.81 0.81
sufficient. summaries surprising, tailor-made
0.81 0.81 0.81 0.81
takeaway, takedowns talks. targeted.
0.81 0.81 0.81 0.81
tasks. technology-based teed territory.
0.81 0.81 0.81 0.81
terrorists, theft, therefore -- thou
0.81 0.81 0.81 0.81
thought- tiarma. tighten tong
0.81 0.81 0.81 0.81
toolkit, top -- tout tradition,
0.81 0.81 0.81 0.81
traditions. trainings? transborder tween.
0.81 0.81 0.81 0.81
ugandan ult uncharacteristic uncontinueed
0.81 0.81 0.81 0.81
undisputed unidentifying unsolicited variety,
0.81 0.81 0.81 0.81
vehicle -- vep waas wanteded
0.81 0.81 0.81 0.81
wcit. website.. wild wout
0.81 0.81 0.81 0.81
wout, wout. after system
0.81 0.81 0.81 0.81
efforts, minimize
0.80 0.80
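
findAssocs() also accepts a vector of terms with a matching vector of correlation limits, so the two queries above can be combined into one call:

findAssocs(dtm, c("activists", "cybersecurity"), c(0.80, 0.80))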

BUILDING A SIMPLE DICTIONARY TO INSPECT YOUR CORPUS


A Dictionary is a (multi-)set of strings, often used to denote relevant terms in text mining.
We represent a dictionary with a character vector, which may be passed to the DTM/TDM
constructor. The created matrix is then tabulated against the dictionary: only terms from the
dictionary appear in the matrix. This allows you to restrict the dimensions of the matrix a
priori and focus on distinct text mining contexts.

inspect(DocumentTermMatrix(igfbali, list(dictionary = c("multistakeholder", "freedom", "development"))))

<<DocumentTermMatrix (documents: 63, terms: 3)>>

Non-/sparse entries: 137/52
Sparsity : 28%
Maximal term length: 16
Weighting : term frequency (tf)
Sample :

Docs
10 OPENNESS HUMAN RIGHTS FREEDOM OF EXPRESSION AND FREE FLOW OF INFORMATION ON THE INTERNET
14 INTERNET_GOVERNANCE_PRINCIPLES.txt
15 OPENING CEREMONY AND OPENING SESSION.txt
26 WS 44 FREEDOM ONLINE COALITION OPEN FORU1.txt
27 WS 44 FREEDOM ONLINE COALITION OPEN FORUM.txt
33 WS 57 MAKING MULTISTAKEHOLDERISM MORE EQUITABLE AND TRANSPARENT.txt
38 WS 357 THE INTERNET AS AN ENGINE FOR GROWTH AND ADVANCEMENT.txt
45 WS-297_PROTECTING_JOURNALISTS_BLOGGERS_AND_MEDIA_ACTORS_IN_DIGITAL_AGE.txt
6 BUILDING BRIDGES – ENHANCING MULTI-STAKEHOLDER COOPERATION FOR GROWTH AND SUSTAINABLE.txt
60 WS 300 DEVELOPING A STRATEGIC VISION FOR INTERNET GOVERNANCE.txt
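
You can also define the dictionary vector once and reuse it across matrices; a minimal sketch, where the added term "rights" is an illustrative choice rather than part of the lab's dictionary:

ig.dictionary <- c("multistakeholder", "freedom", "development", "rights")
dtm.dict <- DocumentTermMatrix(igfbali, control = list(dictionary = ig.dictionary))
inspect(dtm.dict)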

Part 4: Introduction to Data Wrangling in Python

Now we will practice some basic data wrangling in Python using the nltk package. We will use
the Reuters corpus, which is a collection of news documents. We will do some basic cleaning
and tokenization of the documents.

Deliverable 18: Importing the Reuters Corpus and Basic Data Cleaning and Inspection

Let’s begin by importing the nltk package and using the nltk.download() function to download
the Reuters corpus.
You may use the following sample code:

import nltk
nltk.download('reuters')

True
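
The later deliverables also rely on the tokenizer models, the stopword list, WordNet, and the part-of-speech tagger, which NLTK ships as separate resources. A hedged one-time setup (resource names as of NLTK 3.x; very recent releases may additionally ask for 'punkt_tab'):

import nltk

# Safe to re-run; nltk.download() skips resources that are already present.
for resource in ['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger']:
    nltk.download(resource)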

Now we will import the Reuters corpus reader from nltk.corpus and inspect the categories and
the number of documents.
You may use the following sample code:

from nltk.corpus import reuters

print("Categories:", reuters.categories())
print("Number of documents:", len(reuters.fileids()))

Categories: ['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'co

Number of documents: 10788
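
You can also scope these counts to a single category; for example:

print("Documents in 'crude':", len(reuters.fileids(categories='crude')))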

Now we will do some basic cleaning of this dataset: removing punctuation and collapsing extra
whitespace (lowercasing and digit removal are covered in the extension sketch below). You may
use the following sample code:
For Selecting a Document

doc_id = reuters.fileids(categories="crude")[0]
doc_text = reuters.raw(doc_id)

For Basic Cleaning:

import string

# Strip all punctuation characters, then collapse runs of whitespace:
cleaned_text = doc_text.translate(str.maketrans('', '', string.punctuation))
cleaned_text = ' '.join(cleaned_text.split())
print(cleaned_text)

JAPAN TO REVISE LONGTERM ENERGY DEMAND DOWNWARDS The Ministry of International Trade and Indu
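
If you also want to drop digits and lowercase the text at this stage, here is a hedged extension of the same translate() approach:

import string

# Remove digits along with punctuation, then lowercase and collapse whitespace:
cleaned_text = doc_text.translate(str.maketrans('', '', string.punctuation + string.digits))
cleaned_text = ' '.join(cleaned_text.lower().split())
print(cleaned_text)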

Deliverable 19: Tokenization, Stemming, and Lemmatization of the Reuters Corpus

Tokenization is the process of splitting a string into a list of words (tokens). We will use the
word_tokenize() function from the nltk library to tokenize the cleaned text, and then filter
out common English stopwords.
You may try the following sample code:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

tokens = word_tokenize(cleaned_text)
# Note: the stopword list is lowercase, so uppercase tokens such as 'TO' and
# 'The' survive this filter; lowercase the text first if you want them removed.
tokens = [word for word in tokens if word not in stopwords.words('english')]
print(tokens)

['JAPAN', 'TO', 'REVISE', 'LONGTERM', 'ENERGY', 'DEMAND', 'DOWNWARDS', 'The', 'Ministry', 'In

Apply Stemming and Lemmatization


Now we will use the PorterStemmer and WordNetLemmatizer classes to stem and lemmatize
the tokens.
You may use the following sample code:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed = [stemmer.stem(word) for word in tokens]

lemmatized = [lemmatizer.lemmatize(word) for word in tokens]

print("Stemmed:", stemmed)
print("Lemmatized:", lemmatized)

Stemmed: ['japan', 'to', 'revis', 'longterm', 'energi', 'demand', 'downward', 'the', 'ministr

Lemmatized: ['JAPAN', 'TO', 'REVISE', 'LONGTERM', 'ENERGY', 'DEMAND', 'DOWNWARDS', 'The', 'Mi
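
Notice the difference in case between the two outputs: the Porter stemmer lowercases as it stems, while the lemmatizer returns tokens it cannot find in WordNet (including all-uppercase forms) unchanged. A minimal sketch that lowercases before lemmatizing:

# Lemmatization is more effective on lowercased tokens:
lemmatized_lower = [lemmatizer.lemmatize(word.lower()) for word in tokens]
print("Lemmatized (lowercased):", lemmatized_lower)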

Deliverable 20: Conducting Basic Part-of-Speech Tagging of the Reuters Corpus

Part-of-Speech Tagging of Reuters Corpus


Now we will do Part-of-Speech tagging using the pos_tag() function. You may use the following
sample code:

from nltk import pos_tag

tagged_tokens = pos_tag(tokens)
print(tagged_tokens)

[('JAPAN', 'NNP'), ('TO', 'NNP'), ('REVISE', 'NNP'), ('LONGTERM', 'NNP'), ('ENERGY', 'NNP'),
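
For a quick profile of which parts of speech dominate the document, you can count the tags; a sketch using only the standard library:

from collections import Counter

# Tally the POS tags across all tagged tokens and show the ten most common:
tag_counts = Counter(tag for _, tag in tagged_tokens)
print(tag_counts.most_common(10))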

Deliverable 21: Full Text Processing Pipeline for the Reuters Corpus

Finally, we will practice using a Python text processing pipeline applied to the Reuters dataset.
This example assumes the imports, stopword list, and lemmatizer object from the previous
steps are already available.
You may use the following sample code:

def preprocess_pipeline(text):
    # Lowercase, strip punctuation, and collapse whitespace:
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    text = ' '.join(text.split())
    # Tokenize, drop English stopwords, lemmatize, then tag parts of speech:
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stopwords.words('english')]
    lemmatized = [lemmatizer.lemmatize(word) for word in tokens]
    tagged = pos_tag(lemmatized)
    return tagged

doc_text = reuters.raw(reuters.fileids(categories='crude')[0])

processed = preprocess_pipeline(doc_text)

print(processed)

[('japan', 'NN'), ('revise', 'NN'), ('longterm', 'JJ'), ('energy', 'NN'), ('demand', 'NN'), (
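
The same function can be mapped over a whole category; a sketch that processes the first five 'crude' documents (the slice keeps the run fast and is an arbitrary choice):

crude_ids = reuters.fileids(categories='crude')
processed_docs = [preprocess_pipeline(reuters.raw(fid)) for fid in crude_ids[:5]]
print(len(processed_docs), "documents processed")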

*** End of Lab ***


Please render your Quarto file to pdf and submit to the assignment for Lab 4 within Canvas.

