
Lab 2: Word and N-Gram Frequencies: In R and Python
Dr. Derrick L. Cogburn

2024-09-07

Table of contents

Lab Overview
  Technical Learning Objectives
  Business Learning Objectives
Assignment Overview
Pre-Lab Instructions
  Creating Your Lab 2 Project in RStudio
  Installing Packages, Importing and Exploring Data
  Outline of the CRISP-DM Approach
Lab Instructions
  Lab Instructions (to be completed before, during, or after the synchronous class)
  There are four main parts to the lab, three in R and one in Python
Part 1: Importing and Exploring Text Data in BaseR
Part 2: Extract Features from Data with the tm Package
Part 3: Introduction to the Tidyverse and Tidytext
Part 4: Word and N-Gram Frequencies using Python
End of Lab!

Lab Overview

Technical Learning Objectives

1. Understand how to collect or import data into R and Python from a wide variety of
sources.
2. Understand decision-making processes and steps in data preparation, including how to
use string manipulation features.
3. Understand how to use the tm package in R, and the NLTK package in Python.
4. Understand how to create a Document-Term-Matrix (DTM) and its role in text mining.
5. Understand the tidyverse, how to prepare a tibble, how to pipe data, and how to manipulate data using tidytext.

Business Learning Objectives

1. Understand the role of exploratory (inductive) text mining techniques.


2. Understand the major data structures in R and Python and explore their impact on data
analysis.
3. Understand the major differences between the “bag of words” and NLP approaches to text mining.
4. Understand how the tidyverse is transforming data science in R.
5. Understand when and how to use word frequencies and other inductive techniques.

Assignment Overview

In this lab, you will expand on what you have learned so far about R, Python, RStudio, and Quarto. You will begin to focus on inductive (or exploratory) text mining techniques. There are 28 deliverables in the lab. You will gain more experience with installing and loading packages, importing and preparing data, and with exploratory text mining. These are some of the key foundational steps to begin your text mining journey. You will also get a brief introduction to the tidyverse, an important modern approach to data science in R. The tidyverse has transformed data science in R and has also enhanced how we conduct text mining in R via tidytext. You will learn about the numerous file types that may be imported into R and Python for analysis, including structured and unstructured text files.

Pre-Lab Instructions

Pre-Lab Instructions (to be completed before class):

Creating Your Lab 2 Project in RStudio

You should create a project like this for each of your subsequent lab assignments. This will
help you keep your work organized and make it easier for you to submit your work.
In RStudio, create a project for Lab 2. To do so, follow these steps:
Step 1:
Create a folder on your computer called Lab2. Go to the GitHub repo for this course and
download to this folder the following files: 1. The rendered Quarto file as a pdf of the lab
instructions, and 2. A data file for this lab (oct_delta.csv).
Step 2:
Now, within RStudio, create a new Project by going to File > New Project > Existing Direc-
tory. Then, select the option to create a new project from an existing directory, and point it
to the folder you just created called Lab2. This will create a new RStudio project for Lab 2.
You will notice this name in the upper right corner of your RStudio window.
Step 3:
Within your RStudio project for Lab 2, create a new Quarto file. You can do this by going
to File > New File > Quarto Document. You can name this file Lab2.qmd. This will create
a new Quarto file in your RStudio project for Lab 2. In addition to the lab instructions, you
have a data file of Tweets to use for this lab (oct_delta.csv).
Now begin working your way through the Pre-Lab Instructions for Lab 2. You can do this by
following the instructions in the Lab 2 instructions pdf file you downloaded from the course
GitHub repo. You may begin to work on the rest of the lab instructions if you want, or you
can also wait until class on Wednesday or when you watch the video to work through the lab
instructions.
Remember, all labs need to be submitted as rendered pdf files, with echo in your Quarto YAML header set to true:

execute:
echo: true

This will show your work in your rendered document (both code output and plots). The Canvas assignment submission for Lab 2 and all subsequent labs will be restricted to a .pdf file format only (and it must be a rendered .pdf file, not an html file saved as a pdf).
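If it helps, a minimal Quarto YAML header with these settings might look like the following (the title is just a placeholder):

---
title: "Lab 2"
format: pdf
execute:
  echo: true
---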

Installing Packages, Importing and Exploring Data

For this lab you will be installing eleven R packages (and their dependent packages) and later several Python packages (most of these packages were installed in Lab 1). In the Quarto file for this lab, please create an R code chunk and install the following packages. Remember that package names are strings (“they must be enclosed by quotation marks”) and that they are case sensitive. Also remember that you must be connected to the Internet when you are installing packages:

1. stringi
2. stringr
3. rJava
4. ggthemes
5. gutenbergr
6. janeaustenr
7. tm
8. tidyr
9. ggplot2
10. scales
11. tidytext

You may install these packages one at a time using this example:

install.packages("stringi")

Or, you may install multiple packages at once using the concatenate “c” function as below:

install.packages(c("stringi","stringr","rJava","ggthemes","gutenbergr",
"janeaustenr","tm","tidyr","ggplot2","scales","tidytext"))

However, when installing packages, you may want to consider installing them one at a time (at least initially). This approach helps you identify and troubleshoot any problems with each package as you install it, and gives you confidence that the packages you need are installed properly. Once you have these packages installed, you may comment out the install lines (using #), in case you want to render or source the entire script at once to reproduce your project.
If you receive any error messages during the package installation, try to interpret the messages and overcome them. Remember, it is perfectly fine to use Stack Overflow or generative AI tools to support your troubleshooting efforts.

Outline of the CRISP-DM Approach

Sometimes it is helpful to write out your task using narrative in Quarto (or commenting in an
R script), and then later go back and put in the actual code needed to accomplish those tasks
(e.g. in an R or Python code chunk in Quarto). As a reminder, here are the six steps of the
CRISP-DM approach. Please add these steps to your Quarto file.

• CRISP-DM 1. Define the Problem


• CRISP-DM 2. Availability of Textual Data
• CRISP-DM 3. Prepare the Textual Data
• CRISP-DM 4. Extract Features
• CRISP-DM 5. Analyze
• CRISP-DM 6. Reach an Insight or Recommendation

Within each of these sections you are going to start writing your R code to accomplish the
objectives of the lab.

1. Define the Problem

The dataset we will be working with for this case study is a CSV file containing tweets from Delta Airlines. Our goal in this case study is to understand Delta Airlines’ customer service tweets. We want to learn more about this particular domain, and need to answer the following questions:
1. What is the average length of a social customer service reply?
2. What links were referenced most often?
3. How many people should be on a social media customer service team?
4. How many social replies are reasonable for a customer service representative to handle?
Add these questions to a comment in the appropriate section of your R script.

2. Availability and Nature of the Data

This analysis will focus on Twitter data, but it could be expanded to include online forums, Facebook, Instagram, and other social media sources. The dataset for this case study is called “oct_delta.csv”; using your file structure, please place it into the folder where your project is located. It is a collection of Delta tweets pulled from the Twitter API from 1 October to 15 October 2015, and it has been cleaned up for easier analysis. Please update the appropriate comments section of your code to reflect this understanding.

3. Prepare the Text

For this case study, this step has already been done for you: the Twitter text has already been organized, and its many parameters have been reduced to a smaller CSV file containing only the text you want to use in this study. Keep in mind that in most studies you will have to conduct these steps yourself.

4. Extract Features

For this analysis we will use basic string manipulation techniques, apply bag-of-words text cleaning functions, and extract features. Given our four main research questions, these are the modeling steps we will take and the functions we will use:
1. What is the average length of a social customer service reply? We will use the nchar() function to address this question.
2. What links were referenced most often? We will use three functions, grep(), grepl(), and summary(), to address this question (a small preview sketch appears at the end of this outline).
3. How many people should be on a social media customer service team? We will analyze Delta agent signature files.
4. How many social replies are reasonable for a customer service representative to handle? We will look at the data as a time series to focus on this insight.

5. Analyze

In this step, we analyze the results from our modeling in step 4 and attempt to answer our questions and address our overall problem.
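As a preview of the second question, a minimal sketch (assuming the text.df object you will create in Part 1; the has.link name is just illustrative) could combine grepl() and summary() like this:

# Flag tweets whose text contains a link (any "http" substring)
has.link <- grepl("http", text.df$text, ignore.case = TRUE)
summary(has.link) # the TRUE/FALSE counts show how many tweets reference a link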

6. Reach an Insight or Recommendation

Once we answer our study questions, we will be able to make specific recommendations to
Delta about how to best structure its social media customer service team.
End of Pre-Lab

Lab Instructions

Lab Instructions (to be completed before, during, or after the synchronous class):

There are four main parts to the lab, three in R and one in Python.

There are four required parts to the lab. Each of these parts includes instructions for completing your required deliverables in R and Python. Labs are graded on best effort. This means they are not expected to be perfect. You will receive full credit for completing the lab as long as you have made a good faith effort to complete as much as you can. For example, this may mean you do not complete all of the Python examples, but you do finish all the deliverables in R. As long as you demonstrate a good faith effort and submit your lab files on time (as a rendered PDF), you will receive credit. However, remember the goal of the lab is to give you working code and techniques you can transfer to your own final project. So it benefits you in the long run to do as much of the lab as possible.
- Part 1: Importing and Exploring Text Data in BaseR: Here in Part 1, we will focus on exploring text data using functions in BaseR. We will load all the required packages and then start our exploration.
- Part 2: Extract Features from Data with the tm Package: Next in Part 2, we move from BaseR to use the tm package for text mining.
- Part 3: Introduction to the Tidyverse and Tidytext: In Part 3, we leave the tm package and begin to explore text mining using the tidyverse ecosystem.
- Part 4: Word and N-Gram Frequencies using Python: Finally, we will continue our introduction to text mining in Python.

Part 1: Importing and Exploring Text Data in BaseR

Set Options and Load Libraries


For text mining with the tm package, you will want to use the options() function to set “stringsAsFactors” to FALSE. This will tell R to keep treating your text (strings) as text, and not to treat them as “factors”. The code you may use for these options and settings is below:

options(stringsAsFactors = FALSE)

You will also want to set your locale as follows:

Sys.setlocale("LC_ALL","C")

What this does is set your locale to the default “C” locale. This is important for text mining
in R, as it will help you avoid any issues with encoding.

[1] "C/C/C/C/C/en_US.UTF-8"

Now use the library() function to load all the libraries installed during the pre-lab:
stringi; stringr; rJava; ggthemes; gutenbergr; janeaustenr; tm; ggplot2; tidytext; tidyr; dplyr;
scales.
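A single R code chunk along these lines would load them all (a minimal sketch; note that rJava assumes a working Java installation on your machine):

library(stringi)
library(stringr)
library(rJava)      # requires a Java installation
library(ggthemes)
library(gutenbergr)
library(janeaustenr)
library(tm)
library(ggplot2)
library(tidytext)
library(tidyr)
library(dplyr)
library(scales)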

Loading required package: NLP

Attaching package: 'ggplot2'

The following object is masked from 'package:NLP':

annotate

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

filter, lag

The following objects are masked from 'package:base':

intersect, setdiff, setequal, union

If you would like, you may double-check to confirm that these packages are now installed in your RStudio. You will see them listed in your collection of packages under the “Packages” tab of the Files, Plots, Packages, Help, Viewer pane. Once you have loaded a library, you will see a tick mark in the box next to that package. That lets you know the package has been loaded (you may also load it via the GUI by clicking that tick mark; but where is the fun in that!).

Explore Datasets Built Into BaseR

R (like Python) includes a number of built-in datasets. To review a description of the datasets
built into BaseR, use the following function:

data()

Deliverable 1: Read the Delta Tweets Data into R

First, let’s get the Delta Tweets into R. This dataset represents a “structured” text dataset.
Since the dataset is saved as a csv (comma separated values) file, we can use the BaseR
function:

read.csv()

to bring that data into R. As with every function, it consists of the name of the function (in this case read.csv), given to it by the developers when they created it, followed by an open and closed parenthesis (). Between the () you need to insert an appropriate “argument” for that function. To get a hint on what is expected in the argument, in addition to reading the documentation, you may start typing the function and then hover your cursor over its name. As an example, try this with the read.csv() function.
When you use the read.csv() function, include in the argument the name of the file we want to read into R, in this case: oct_delta.csv. We need to put the path to the file in quotation marks. The quotation marks may be single or double. Here is my code:

read.csv('oct_delta.csv')

Notice, this helps to illustrate the value of using a project structure and “relative” pathnames.
If you were trying to reach this file outside of a project structure (or without changing your
working directory) you would have to use an “absolute” pathname instead, which would look
something like the following (on a Mac):

read.csv("/Users/johndoe/maindirectory/Advanced Text Analytics -


Fall 2024/Labs Fall 2024/Lab 2 - Word and ngram Frequencies/oct_delta.csv")

The absolute pathname is the full path to the file, starting from the root directory. This is
why it is so much easier to use a project structure and relative pathnames.
So, use the read.csv() function from BaseR to read in the oct_delta.csv data and take a look at what you produce. You can see all the key variables and should have a better sense of the data structure.
Now that you know you can bring in the dataset, let’s do it again, but this time when we bring it in, we will create an object, so we may use it in our work.
For now, let’s name this object “text.df” to remind us that this is now a text data frame in R. Remember there are four main data structures in R: (1) Data Frame; (2) Matrix; (3) Array; and (4) List. Remember, there are a number of ways of getting data, both textual and numeric, into R. The read.csv() function is just one way to do so in BaseR.

Deliverable 2: Create an Object from the Delta Tweets

In an R code chunk below, create an object called text.df by using the <- assignment operator
and the BaseR read.csv() function to bring the oct_delta.csv file into R.

text.df <- read.csv('oct_delta.csv')

NB: Remember to make sure the path in the argument is contained within single or double
quotation marks, and that you substitute whatever your path is to the data file.
When you have done this, you will notice in your Global Environment pane a new object. If
you named yours as mine, the new object will be called text.df.
The environment pane should indicate the object has 1377 obs. of 5 variables. This means
there are 1377 rows and 5 columns in the dataset.

Deliverable 4: List the Variable Names in the Delta Tweets Dataset

In your Quarto file, list the five variables in the text.df dataset and what they represent. It is important that you understand what each variable in a dataset represents.
variable 1:
variable 2:
variable 3:
variable 4:
variable 5:
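If you would like a quick way to check, the BaseR names() and str() functions will list the variable names and structure:

names(text.df) # the five column (variable) names
str(text.df)   # structure: variable names, types, and a preview of the values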

Deliverable 5: Examine the Delta Tweets Dataframe

Now, take a look at the dataset by using the head(), tail(), class(), and summary() functions to examine the first six rows, the last six rows, the class, and a summary of the text.df object:

head(text.df)
tail(text.df)
class(text.df)
summary(text.df)

weekday month date year


1 Thu Oct 1 2015
2 Thu Oct 1 2015
3 Thu Oct 1 2015
4 Thu Oct 1 2015
5 Thu Oct 1 2015
6 Thu Oct 1 2015

1 @mjdout I know that can be frustrating..we hope to have you parked a

2 @rmarkerm Terribly sorry for the inconvenience. If we can b
3 @checho85 I can check, pls
4 @nealaa ...Ale
5 @nealaa ...advisory has only been issued for the Bahamas, but that could change. To c
6 @nealaa Hi. Our meteorologist team is aware of Hurricane Joaquin &amp; monitors weather con

weekday month date year


1372 Thu Oct 15 2015
1373 Thu Oct 15 2015
1374 Thu Oct 15 2015
1375 Thu Oct 15 2015
1376 Thu Oct 15 2015
1377 Thu Oct 15 2015

1372 @mmmeincke Hi there. My apologies for the delay. I see the equipment is arr
1373 @satijp Woohoo! Way to go Marl
1374 @lukenbaugh1 You'
1375 @jeffcarp If you do not make your connection, the gate agent will advise of other optio
1376
1377 @sv

[1] "data.frame"

weekday month date year


Length:1377 Length:1377 Min. : 1.000 Min. :2015
Class :character Class :character 1st Qu.: 4.000 1st Qu.:2015
Mode :character Mode :character Median : 8.000 Median :2015
Mean : 7.991 Mean :2015
3rd Qu.:12.000 3rd Qu.:2015
Max. :15.000 Max. :2015
text
Length:1377
Class :character
Mode :character

As a little “sanity check” (we use this phrase to mean: does this make sense, is this really working?), in your Quarto file, list the number of the last row of this dataset.
Last row number:
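If you would like to confirm that number with code, nrow() returns the number of rows:

nrow(text.df) # number of rows; should match the 1377 obs. shown in the Environment pane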

Those variables, variable names, and the columns in which they are located will become very important to you, as you might want to look at or use data in only one column (or even grab just one individual piece of data). We do this through the use of an index.

Deliverable 6: Examine the Text of all the Tweets

So now, let’s explore a little more about indexing. In a code chunk below, use the nchar() function and the head() function nested together to identify the number of characters in the first six rows of the column labeled “text” (this is the actual text content of the tweet).

nchar(head(text.df$text))

The outer function, nchar(), counts the number of characters in the object you supply as its argument. Inside that argument, you use another function, head(), to ask R to look at only the first six rows of the dataset. You tell it the dataset (here, text.df), but with the inclusion of the $ you ask R to look only at the column/variable “text” rather than the entire dataset.

[1] 119 110 78 65 137 142

This code returns the following result: [1] 119 110 78 65 137 142
In your Quarto file, list the number of characters in the first six tweets in the Delta Tweets
dataset.
tweet 1:
tweet 2:
tweet 3:
tweet 4:
tweet 5:
tweet 6:

Deliverable 7: Interpret the Results of the nchar analysis.

Interpret this result. What do these numbers mean?

Deliverable 8: Create an Index Example

Now, let’s talk more about indexing. R indexes dataframes from 1 to i (not 0 to i-1, as in Python).
For example, in a code chunk below, use the following line of code to create an object called
index_example containing the values between 1 and 50. Then, take a look at the object.

index_example <- 1:50
index_example

R then displays those values:

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

Deliverable 9: Identify the Number of Characters in a Single Specified Tweet

When we look at the output, we can tell that each of those values (1, 2, 3, 4, ...) is indexed according to that same number: when the line wraps, the bracketed number at the start of the next line (here [26]) tells us the index position of the first value on that line. This will be very helpful when we are trying to understand how to identify a particular item in our dataset.
So now, based on that understanding, in a code chunk ask R to apply the nchar() function to only the fourth row and the fifth column, which will tell us the number of characters in a single tweet. Hint: use [4,5] to identify the fourth row, fifth column.

nchar(text.df[4,5])

[1] 65

Part 2: Extract Features from Data with the tm Package

While this lab is using a csv file as data, as we have discussed, you may import many different
structured and unstructured file types into R to use as data. The tm package includes functions
to increase the types of text files you may read into R. These functions include the following:

readPlain():  # Reads in files as plain text, ignoring metadata
readPDF():    # Reads in PDF documents
readDOC():    # Reads in MS Word documents
readHTML():   # Reads in simply structured HTML documents
readXML():    # Reads in XML documents
readCorpus(): # Reads in a corpus

Deliverable 10: Extract Mean Characters in Tweet Text

Now, let’s use what we’ve learned to answer our first research question. In a code chunk, write
a line of code that uses the mean() function and the nchar() function to provide us with the
average number of characters in a tweet contained in the text.df dataset.

mean(nchar(text.df$text))

[1] 92.16412

So, what is the average length of a Delta social customer service reply in October?
What insights do we gain from this finding?

Deliverable 11: Create a Dataframe of Just the Tweet Text

Let’s create a dataframe of just the tweets in the Delta dataset. To do that, we are going to create an object called “tweets” and use the data.frame() function. The entire suggested argument for this function is below:
(ID=seq(1:nrow(text.df)), text=text.df$text)
Look at that argument carefully to make sure you understand what each element is doing.

tweets <- data.frame(ID=seq(1:nrow(text.df)), text=text.df$text)
tweets

Optional: If you want to apply some of the text pre-processing steps we have discussed, you may use and/or modify the following code. However, I strongly recommend that you complete the lab first, before building a code chunk to explore these pre-processing steps.
There is already a built-in stopword dictionary called “stopwords”; for the English version, you add “english” to the argument of the stopwords() function. You may apply just this existing stopword dictionary, or you may create a custom stopwords dictionary by adding additional words to the default. You do this by creating a custom.stopwords object, based on a concatenation c() of stopwords(“english”) and any additional stopwords you would like to add, each enclosed in quotation marks. For example, you might add “lol”, “smh”, “delta”, and “amp” to the custom dictionary.
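For example, a minimal sketch of that custom stopword vector, using the example words above, would be:

custom.stopwords <- c(stopwords("english"), "lol", "smh", "delta", "amp")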
Then, you could use the function() function to create a function called clean.corpus. You give your function a name, such as “clean.corpus”, and assign it the results of the function() function. For example, if you wanted your function to bring together several text processing functions from the tm package, it could look something like the following:

clean.corpus <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(tryTolower)) # tryTolower is a custom lower-casing wrapper; content_transformer(tolower) also works if you have not defined it
  corpus <- tm_map(corpus, removeWords, custom.stopwords)   # remove default + custom stopwords
  corpus <- tm_map(corpus, removePunctuation)               # strip punctuation
  corpus <- tm_map(corpus, stripWhitespace)                 # collapse extra whitespace
  corpus <- tm_map(corpus, removeNumbers)                   # remove digits
  return(corpus)
}

Then, to implement this function you have just created, you would use it like any other function, by invoking its name and supplying it with the required argument (in this case, a corpus object to clean, such as the one you will create in the next deliverable):

clean.corpus(corpus)

Once you have completed the lab, you may want to explore these pre-processing steps, and
see how they affect your analysis.

Deliverable 12: Create and Inspect Corpus for Your Analysis

Now, let’s create the corpus we will use for our analysis, and subsequently to create a Document Term Matrix (DTM), one of the most useful data formats for text mining.
Create an object called corpus and assign it the results of the VCorpus() function. A VCorpus is a “volatile” corpus held in memory, one that is not planned to persist after you end your R session. Within the argument of the VCorpus() function, you need to tell R what the VectorSource() is; in this case, our object called tweets. You also need to supply a readerControl argument. The full VCorpus() call could look something like the following, which you may use to create the “corpus” object you will need for your DTM:

corpus <- VCorpus(VectorSource(tweets), readerControl = readDataframe(tweets, "en", id = ID))

Now, use the inspect() function combined with your understanding of indexing to identify and
examine specific tweets within your object named “corpus”. For example [1:2], or [[2]].

inspect(corpus[1:2])
inspect(corpus[[2]])

Deliverable 13: Create a Document Term Matrix (DTM)

Now that we have a corpus (which is required for a DTM), let’s create a DTM in an object simply named dtm. We use the DocumentTermMatrix() function and supply a corpus (in this case named corpus) and the remainder of the elements required in the argument. Suggested code is below for you to explore:

dtm <- DocumentTermMatrix(corpus,control=list(weighting=weightTf))

One key element of this argument is where we indicate the weighting of the terms in the DTM. In this case we are using “weightTf”, which represents term frequency: simply the frequency with which the term appears in the dataset. In contrast, we could use “weightTfIdf” to apply the popular analytical heuristic “term frequency-inverse document frequency.”
This approach leads us to create a Document Term Matrix (DTM), one of the most useful data formats. In a DTM we create a numerical representation of our textual dataset, with each document represented in a row and the terms represented in columns. If we want the opposite representation of our dataset, we can create a Term Document Matrix (TDM) using the TermDocumentMatrix() function. There, the terms are represented in the rows of the dataset, and the documents are represented in the columns.
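For example, a TDM built from the same corpus (a minimal sketch, assuming the corpus object created in Deliverable 12; the tdm name is just illustrative) would be:

tdm <- TermDocumentMatrix(corpus, control = list(weighting = weightTf))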
As we learn more about the tidyverse, we will also discuss how to move between a DTM/TDM
and a “tibble” (the tidyverse data.frame format) by “casting” our data back and forth.
If you want, you may try a bit more data wrangling with your new dtm by using the following
sample code:

dtm.tweets.m <- as.matrix(dtm)
term.freq <- rowSums(dtm.tweets.m)
freq.df <- data.frame(word=names(term.freq), frequency=term.freq)
freq.df <- freq.df[order(freq.df[,2], decreasing = T),]

Deliverable 14: Create an Object to Represent Frequency

With this code, we have created an object called freq.df: a dataframe representing the frequency of words within the Delta tweets. Now, we will use ggplot2 to visualize this data. Before we do that, the unique words have to be changed from a string to a factor with unique levels.
Deliverable 15: Use ggplot to Visualize Frequency

Create a bar chart to visualize the frequently occurring words in the corpus. At this point I will supply suggested code you may use to create your code chunk. Try to interpret what each line of code is doing; but don’t worry, we will return to these details later, especially when building data visualizations via ggplot.

freq.df$word <- factor(freq.df$word, levels = unique(as.character(freq.df$word)))

ggplot(freq.df[1:20,], aes(x=word, y=frequency)) +
  geom_bar(stat="identity", fill="darkred") +
  coord_flip() +
  theme_gdocs() +
  geom_text(aes(label=frequency), colour="white", hjust=1.25, size=5.0)

Warning: Removed 18 rows containing missing values or values outside the scale range
(`geom_bar()`).

Warning: Removed 18 rows containing missing values or values outside the scale range
(`geom_text()`).

[Bar chart output from the suggested code: the top terms in freq.df, with frequency on the x-axis and word on the y-axis.]

Part 3: Introduction to the Tidyverse and Tidytext

Until now, we have been focusing on using the tm package, which stores its data in a document term matrix (or DTM). Historically, this approach is the most standard for text mining. However, there is a more contemporary approach, developed in 2014 by Hadley Wickham, a former professor at Rice University and now chief scientist at Posit (formerly RStudio).
This newer approach is called the tidyverse, and it uses an approach to text mining called tidytext. I believe you will find the tidyverse to be an elegant and powerful ecosystem for your data science broadly speaking, and for text mining in particular.
However, to gain this power, the tidyverse requires its data to be in a specific form of dataframe called a tibble: each variable is a column, each observation is a row, and each type of observational unit is a table. For tidytext, this means a dataset with one token (e.g. word) per row. While the tidyverse has these specific requirements, you will see how valuable it is as a data wrangling approach. Also, you will see later that you can “cast” the data back and forth from a DTM to a tibble and back again.
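As a preview, a minimal sketch of that casting (assuming the dtm object from Part 2 and the tidytext and dplyr packages loaded; the tidy_dtm and dtm_again names are just illustrative) could look like this:

tidy_dtm <- tidy(dtm)                                       # DTM -> tibble with one document-term-count row per entry
dtm_again <- tidy_dtm %>% cast_dtm(document, term, count)   # tibble -> DTM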

Deliverable 16: Create an Object Consisting of Jane Austen Books

In an R code chunk below, create an object called original_books made by calling the austen_books() function. We have created objects like this before. But now, we will go further by introducing the powerful concept introduced by the tidyverse called “piping” data. To “pipe” data, we need to use the pipe operator, which is made by inserting the greater-than symbol between two percent symbols (%>%). It is the R package dplyr that uses the pipe so effectively to help you wrangle your data. So after the line of code creating original_books, insert the pipe operator. In your mind, read the pipe operator as “and then”. Essentially you are telling R to do “something” and then do “something else”, and so on until you complete what you are trying to do in that discrete action. For example, in an R code chunk below, try the following lines of code:

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
  ungroup()

Here we are using several dplyr verbs that we will explore in much more detail later, such as mutate() to create a new variable and group_by() to subset our data. We are also using a regular expression, abbreviated regex, to further wrangle this data. We will spend much more time on these concepts later. For now, just give it a try.
Then, take a look at the object:

original_books
class(original_books)

# A tibble: 73,422 x 4
text book linenumber chapter
<chr> <fct> <int> <int>
1 "SENSE AND SENSIBILITY" Sense & Sensibility 1 0
2 "" Sense & Sensibility 2 0
3 "by Jane Austen" Sense & Sensibility 3 0
4 "" Sense & Sensibility 4 0
5 "(1811)" Sense & Sensibility 5 0
6 "" Sense & Sensibility 6 0
7 "" Sense & Sensibility 7 0
8 "" Sense & Sensibility 8 0
9 "" Sense & Sensibility 9 0
10 "CHAPTER 1" Sense & Sensibility 10 0
# i 73,412 more rows

[1] "tbl_df" "tbl" "data.frame"

Deliverable 17: Create a Tidy Version of “original_books”

Now, we need to restructure the dataset into a “tidy” version appropriate for analyzing with tidytext. We could use the same object name and overwrite it in this new format, but instead let’s create a new object called tidy_books. To get there, assign tidy_books the result of calling our original_books object, and then (inserting the pipe at the end of the line) using a tidytext function called unnest_tokens(). For the argument, we want to unnest our tokens from the text column and output them into a new column called word. This new “tidy_books” object uses a one-token-per-row format (a tibble).
Remember, the unnest_tokens() function comes from the tidytext package. If you do not yet have the tidytext package installed and loaded, please do so now; you may comment out the install line afterwards.

tidy_books <- original_books %>%
  unnest_tokens(word, text)

Then, take a look at the object:

tidy_books
class(tidy_books)

# A tibble: 725,055 x 4
book linenumber chapter word
<fct> <int> <int> <chr>
1 Sense & Sensibility 1 0 sense
2 Sense & Sensibility 1 0 and
3 Sense & Sensibility 1 0 sensibility
4 Sense & Sensibility 3 0 by
5 Sense & Sensibility 3 0 jane
6 Sense & Sensibility 3 0 austen
7 Sense & Sensibility 5 0 1811
8 Sense & Sensibility 10 0 chapter
9 Sense & Sensibility 10 0 1
10 Sense & Sensibility 13 0 the
# i 725,045 more rows

[1] "tbl_df" "tbl" "data.frame"

Deliverable 18: Apply Stopword Dictionary to tidy_books

Now, let’s apply the tidytext built-in stopword dictionary to our text. Start by calling data(stop_words). This will load the stop_words dictionary into your environment. Then, take a look at the stop_words object.

data(stop_words)
stop_words

Then, use the pipe to create a new version of the tidy_books object by calling tidy_books
and then using the anti_join() function with the “stop_words” dictionary supplied in the
argument.

tidy_books <- tidy_books %>%
  anti_join(stop_words)
tidy_books

# A tibble: 1,149 x 2
word lexicon

<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART
7 accordingly SMART
8 across SMART
9 actually SMART
10 after SMART
# i 1,139 more rows

Joining with `by = join_by(word)`

# A tibble: 217,609 x 4
book linenumber chapter word
<fct> <int> <int> <chr>
1 Sense & Sensibility 1 0 sense
2 Sense & Sensibility 1 0 sensibility
3 Sense & Sensibility 3 0 jane
4 Sense & Sensibility 3 0 austen
5 Sense & Sensibility 5 0 1811
6 Sense & Sensibility 10 0 chapter
7 Sense & Sensibility 10 0 1
8 Sense & Sensibility 13 0 family
9 Sense & Sensibility 13 0 dashwood
10 Sense & Sensibility 13 0 settled
# i 217,599 more rows

Deliverable 19: Count Words in tidy_books

Now that we have a clean tibble, let’s use it for our analysis. Begin with the tidy_books object and then use the count() function to count the number of times each word appears in the dataset. In the argument for count(), insert word, and then sort = TRUE.

tidy_books %>%
count(word, sort = TRUE)

# A tibble: 13,914 x 2
word n

<chr> <int>
1 miss 1855
2 time 1337
3 fanny 862
4 dear 822
5 lady 817
6 sir 806
7 day 797
8 emma 787
9 sister 727
10 house 699
# i 13,904 more rows

Deliverable 19: Visualize Words in tidy_books

Now, we can really take advantage of the powerful visualization in ggplot by visualizing the results with a bar chart of the most common words in Austen’s novels. Below, I provide sample code for you to use. Again, we will go through this in detail in both our data wrangling and data visualization sessions, but for now, see how the pipe (read “and then”) allows you to build an elegant code block accomplishing a great deal. You will also see how ggplot does not yet truly use the pipe, but instead uses a + sign to accomplish a somewhat similar purpose. My understanding is that the next iteration of ggplot will switch to using the pipe, but for now the + sign is what we use to indicate the layers of a ggplot data visualization. In a code chunk below, we want to take the tidy_books object, and then count() by word with sort = TRUE, and then use the filter() function to identify words that occur more than 600 times, and then use mutate() with word = reorder(word, n) to reorder the word variable, and then move on to the ggplot() function with the argument aes(word, n), then with a + use geom_col(), and with + xlab(NULL) use the default label for the x axis, ending with the coord_flip() function (no need to put anything in the argument).

tidy_books %>%
count(word, sort = TRUE) %>%
filter(n > 600) %>%
mutate(word=reorder(word, n)) %>%
ggplot(aes(word,n)) +
geom_col() +
xlab(NULL) +
coord_flip()

[Bar chart output: the most common words in Austen’s novels that appear more than 600 times (miss, time, fanny, dear, lady, sir, day, emma, sister, house, elizabeth, elinor, hope), with counts (n) on the x-axis.]

Deliverable 20: Create hgwells Object Using Project Gutenberg

Now that we’ve used the janeaustenr package, let’s look at the gutenbergr package. It provides access to public domain works from the Project Gutenberg collection. We will make good use of the gutenberg_download() function, which downloads one or more works from Project Gutenberg by ID.
Let’s start by looking at some works of science fiction and fantasy by H.G. Wells. Create an object called “hgwells” by assigning it the results of using the gutenberg_download() function to get the following books: The Time Machine, The War of the Worlds, The Invisible Man, and The Island of Doctor Moreau. These books have respective Project Gutenberg IDs of 35, 36, 5230, and 159. In the argument for the gutenberg_download() function you may combine these with the concatenate c() function.

hgwells <- gutenberg_download(c(35, 36, 5230, 159))
hgwells

Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest

Using mirror http://aleph.gutenberg.org

Warning: ! Could not download a book at http://aleph.gutenberg.org/1/5/159/159.zip.
i The book may have been archived.
i Alternatively, You may need to select a different mirror.
> See https://www.gutenberg.org/MIRRORS.ALL for options.

# A tibble: 15,303 x 2
gutenberg_id text
<int> <chr>
1 35 "The Time Machine"
2 35 ""
3 35 "An Invention"
4 35 ""
5 35 "by H. G. Wells"
6 35 ""
7 35 ""
8 35 "CONTENTS"
9 35 ""
10 35 " I Introduction"
# i 15,293 more rows

Deliverable 21: Create Tidy Version of hgwells

Now create a tidy version of the hgwells dataset by creating an object called tidy_hgwells, which takes the hgwells object and then uses the unnest_tokens() function with “word, text” as the argument.

Deliverable 22: Apply Stopwords to tidy_hgwells

Then, clean it by using the anti_join() function with stop_words in the argument.
Finally, once you have created it, take a look at the object.
The sample code below helps to illustrate the power of the pipe.

tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
tidy_hgwells

Joining with `by = join_by(word)`

Deliverable 23: Count Words in tidy_hgwells

Now, use the count() function to determine the most commonly occurring words in the novels
of H.G. Wells. Use “word, sort = TRUE” as the argument for the count() function.

tidy_hgwells %>%
count(word, sort = TRUE)

# A tibble: 10,320 x 2
word n
<chr> <int>
1 time 396
2 people 249
3 door 224
4 kemp 213
5 invisible 197
6 black 178
7 stood 174
8 night 168
9 heard 167
10 hall 165
# i 10,310 more rows

Deliverable 24: Create bronte Object

Now, we will return to Project Gutenberg to get some works from the Bronte sisters. Create an object called “bronte” and use the gutenberg_download() function to assign it the books with the following project IDs: 1260, 768, 969, 9182, and 767, using the concatenate function c() in the argument.

bronte <- gutenberg_download(c(1260,768,969,9182,767))

Deliverable 25: Tidy the bronte Object

Now, create a tidy version of the bronte object. Create an object called tidy_bronte, assigning it the results of taking the bronte object, and then the unnest_tokens() function with “word, text” in the argument, and then cleaning it by using the anti_join() function with stop_words in the argument.

tidy_bronte <- bronte %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)

Joining with `by = join_by(word)`

Deliverable 26: Identify Frequent Words in tidy_bronte

Now, identify the most commonly occurring words in the novels of the Bronte sisters. Use
the tidy_bronte object, and then, use the count() function, with “word, sort = TRUE” in the
argument.

tidy_bronte %>%
count(word, sort = TRUE)

# A tibble: 23,213 x 2
word n
<chr> <int>
1 "time" 1065
2 "miss" 854
3 "day" 825
4 "don\u2019t" 780
5 "hand" 767
6 "eyes" 714
7 "night" 648
8 "heart" 638
9 "looked" 601
10 "door" 591
# i 23,203 more rows

Deliverable 27: Calculate Word Frequency Amongst Three Objects

Now, let’s calculate the frequency for each word in the works of Jane Austen, the Bronte sisters, and H.G. Wells by binding the data frames together. We can use the spread() and gather() functions from tidyr to reshape our dataframe.
The suggested sample code is below:

frequency <- bind_rows(mutate(tidy_bronte, author = "Bronte Sisters"),
                       mutate(tidy_hgwells, author = "H.G. Wells"),
                       mutate(tidy_books, author = "Jane Austen")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  spread(author, proportion) %>%
  gather(author, proportion, "Bronte Sisters":"H.G. Wells")

Deliverable 28: Visualize Word Frequency Amongst Three Objects

Now, let’s plot using suggested code below (expect a warning about rows with missing values
being removed):

ggplot(frequency, aes(x = proportion, y = `Jane Austen`,
                      color = abs(`Jane Austen` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position = "none") +
  labs(y = "Jane Austen", x = NULL)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_text()`).

[Scatter plot output: word frequency proportions compared against Jane Austen, faceted into “Bronte Sisters” and “H.G. Wells” panels, with log-scaled percentage axes.]

Part 4: Word and N-Gram Frequencies using Python

Now, let’s explore some of the same techniques using Python. In this lab, we will analyze
Twitter data using Python. We will use the Natural Language Toolkit (nltk) to analyze the
text data. We will also use the scikit-learn library to create a bag-of-words model and a term
frequency-inverse document frequency (TF-IDF) model. Finally, we will use the matplotlib
library to visualize the results.
Step 1: Setting Up Your Environment

1. Install Relevant Python Packages:

You will need to make sure you have the following relevant Python packages installed: nltk, pandas, matplotlib, and scikit-learn. My recommendation is to install these packages within your Anaconda Navigator.

2. Import Necessary Libraries:

Once you have the packages installed, import them into your Python environment. To do so, open a Python code chunk below and use import statements with the following abbreviations or aliases: pd for pandas, nltk for nltk, and plt for matplotlib.pyplot. Try the sample code below in a Python code chunk:

import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt

Step 2: Analyzing Twitter Data

1. Read in Twitter Data:

We are assuming the “oct_delta.csv” file is still in your working directory. If not, please change
the path to the file. We are also assuming the csv file has a column named ‘text’, containing
the text of the tweets from Delta Airlines.
In a Python code chunk below, read in the csv file and assign it to a variable called tweets_df.
Remember in Python the assignment operator is = (as opposed to the <- operator in R).
Try the sample code below in a Python code chunk:

tweets_df = pd.read_csv('oct_delta.csv')

2. Basic Text Analysis

Character count in each tweet. Now let’s create a new column in the tweets_df dataframe
called ‘char_count’ that contains the character count in each tweet. Try the sample code
below in a Python code chunk:

tweets_df['char_count'] = tweets_df['text'].apply(len)

Visualize the character count in each tweet. Now let’s visualize the character count in each
tweet. Try the sample code below in a Python code chunk:

tweets_df['char_count'].plot.hist(bins=20)
plt.show()

[Histogram output: distribution of tweet character counts, with character count (roughly 20 to 140) on the x-axis and frequency on the y-axis.]

3. Word Count (Term Frequency) in Each Tweet

Now, we will calculate the term frequency or word count of each word in each tweet. Try the
sample code below in a Python code chunk:

cv = CountVectorizer()
tf = cv.fit_transform(tweets_df['text'])
tf_feature_names = cv.get_feature_names_out()

Then, you can convert to DataFrame for easier handling with the following code:

tf_df = pd.DataFrame(tf.toarray(), columns=tf_feature_names)

Visualize the word count in each tweet. Now let’s visualize the word count in each tweet.
Try the sample code below in a Python code chunk:

tf_df.sum().sort_values(ascending=False).head(20).plot.bar()
plt.show()

[Bar chart output: the 20 most frequent terms by raw count (to, you, the, your, for, and, we, please, be, this, can, hi, sorry, dm, our, that, will, follow, pls, is).]
4. Word Count (TF-IDF) in Each Tweet

Now, we will calculate the TF-IDF of each word in each tweet.


Try the sample code below in a Python code chunk:

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(tweets_df['text'])
tfidf_feature_names = tfidf.get_feature_names_out()

You may now convert to DataFrame for easier handling with the following code:

tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_feature_names)

Visualize the word count (TF-IDF) in each tweet. Now let’s visualize the word count (TF-IDF)
in each tweet.
Try the sample code below in a Python code chunk:

tfidf_df.sum().sort_values(ascending=False).head(20).plot.bar()
plt.show()

[Bar chart output: the 20 terms with the highest total TF-IDF weight (for, and, be, can, you, to, your, the, please, dm, we, hi, this, follow, will, that, hear, pls, sorry, confirmation).]
5. Most Common Words

Now, let’s find the most common words, and identify the top 20 common words using the
head() function.
Try the sample code below in a Python code chunk:

tf_df.sum().sort_values(ascending=False).head(20)

to 734
you 683
the 568
your 545
for 511
and 247
we 222
please 221
be 217
this 198
can 195
hi 185
sorry 182
dm 181

our 168
that 162
will 159
follow 155
pls 154
is 154
dtype: int64

Visualize the most common words. Now let’s visualize the most common words.
Try the sample code below in a Python code chunk:

tf_df.sum().sort_values(ascending=False).head(20).plot.bar()
plt.show()

[Bar chart output: the same 20 most common terms as listed above, plotted by raw count.]

6. Most Common Phrases (Bigrams and Trigrams)

Now, let’s find the most common phrases. We will create a bigram matrix and a trigram
matrix. You may use the sample code below in a Python code chunk:

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = bigram_vectorizer.fit_transform(tweets_df['text'])

This is how you can create a trigram matrix:

trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))
trigram_matrix = trigram_vectorizer.fit_transform(tweets_df['text'])

7. Plotting Most Common Terms

Now, let’s plot the most common terms. You may use the sample code below in a Python
code chunk:

def plot_most_common_words(count_data, count_vectorizer, top_n=20):
    words = count_vectorizer.get_feature_names_out()
    total_counts = count_data.sum(axis=0).tolist()[0]
    count_dict = zip(words, total_counts)
    count_dict = sorted(count_dict, key=lambda x: x[1], reverse=True)[0:top_n]
    words, counts = zip(*count_dict)
    plt.figure(figsize=(10, 5))
    plt.bar(words, counts)
    plt.xticks(rotation=45)
    plt.show()

plot_most_common_words(tf, cv)

[Bar chart output: the 20 most common terms plotted by plot_most_common_words(), matching the counts listed above.]
Step 3: Analyzing Jane Austen’s Novels

1. Creating Dataset

We will use the nltk package and download the Jane Austen novels from its collection of texts.
We will then create a dataset containing the text of each novel and the author.
Let’s first download the Jane Austen novels from the nltk package. You may use the sample
code below in a Python code chunk:

nltk.download('gutenberg')
from nltk.corpus import gutenberg

austen_texts = gutenberg.raw(fileids=[f for f in gutenberg.fileids() if 'austen' in f])

You may want to comment out the nltk.download('gutenberg') line after the first run to avoid downloading the same data again.

2. Preprocessing the Data

We will use the nltk package to tokenize the text and remove stopwords.
In a Python code chunk, try the sample code below:

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
austen_words = nltk.word_tokenize(austen_texts)
filtered_words = [word for word in austen_words if word.lower() not in stop_words]

Again, you may want to comment out the nltk.download(‘stopwords’) line after the first run
to avoid downloading the same data again.

3. Analyze TF and TF-IDF

Now analyze the text of the Jane Austen novels using both term frequency and tf-idf. You
may use the sample code below in a Python code chunk:

cv = CountVectorizer()
tf = cv.fit_transform(filtered_words)
tf_feature_names = cv.get_feature_names_out()

Now, let’s analyze the text using tf-idf. You may use the sample code below in a Python code
chunk:

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(filtered_words)
tfidf_feature_names = tfidf.get_feature_names_out()

4. Visualization

Now, let’s visualize the most common words and phrases. You may use the sample code below
in a Python code chunk:

tf_df.sum().sort_values(ascending=False).head(20)
plot_most_common_words(tf, cv)
plot_most_common_words(tfidf_matrix, tfidf)

to 734
you 683
the 568
your 545
for 511
and 247
we 222
please 221
be 217
this 198
can 195
hi 185
sorry 182
dm 181
our 168
that 162
will 159
follow 155
pls 154
is 154
dtype: int64

[Bar chart outputs: the most common words in the Jane Austen corpus by term frequency and by TF-IDF (e.g. could, would, mr, mrs, must, said, one, much, miss, every, emma, well, think, might, never, know, little, elinor, good, time).]

End of Lab!
Submit your lab rendered as PDF into Canvas. Make sure to include all the code and output.
