02 Lab 2 Instructions
Text Mining in R and Python
Dr. Derrick L. Cogburn
2024-09-07
Table of contents
Lab Overview
Technical Learning Objectives
Business Learning Objectives
Assignment Overview
Pre-Lab Instructions
Creating Your Lab 2 Project in RStudio
Installing Packages, Importing and Exploring Data
Outline of the CRISP-DM Approach
Lab Instructions
Lab Instructions (to be completed before, during, or after the synchronous class)
There are four main parts to the lab, three in R and one in Python.
End of Lab!
Lab Overview
1. Understand how to collect or import data into R and Python from a wide variety of
sources.
2. Understand decision-making processes and steps in data preparation, including how to
use string manipulation features.
3. Understand how to use the tm package in R, and the NLTK package in Python.
4. Understand how to create a Document-Term-Matrix (DTM) and its role in text mining.
5. Understand the tidyverse, how to prepare a tibble, how to pipe data, and how to manipulate data using tidytext.
Assignment Overview
In this lab, you will expand on what you have learned so far about R, Python, RStudio, and Quarto. You will begin to focus on inductive (or exploratory) text mining techniques. There are 28 deliverables in the lab. You will gain more experience with installing and loading packages, importing and preparing data, and with exploratory text mining. These are some of the key foundational steps to begin your text mining journey. You will also get a brief introduction to the tidyverse, an important modern approach to data science in R. The tidyverse has transformed data science in R and has also enhanced how we conduct text mining in R via tidytext. You will learn about the numerous file types that may be imported into R and Python for analysis, including structured and unstructured text files.
Pre-Lab Instructions
You should create a project like this for each of your subsequent lab assignments. This will
help you keep your work organized and make it easier for you to submit your work.
In RStudio, create a project for Lab 2. To do so, follow these steps:
Step 1:
Create a folder on your computer called Lab2. Go to the GitHub repo for this course and
download to this folder the following files: 1. The rendered Quarto file as a pdf of the lab
instructions, and 2. A data file for this lab (oct_delta.csv).
Step 2:
Now, within RStudio, create a new Project by going to File > New Project > Existing Direc-
tory. Then, select the option to create a new project from an existing directory, and point it
to the folder you just created called Lab2. This will create a new RStudio project for Lab 2.
You will notice this name in the upper right corner of your RStudio window.
Step 3:
Within your RStudio project for Lab 2, create a new Quarto file. You can do this by going
to File > New File > Quarto Document. You can name this file Lab2.qmd. This will create
a new Quarto file in your RStudio project for Lab 2. In addition to the lab instructions, you
have a data file of Tweets to use for this lab (oct_delta.csv).
Now begin working your way through the Pre-Lab Instructions for Lab 2. You can do this by
following the instructions in the Lab 2 instructions pdf file you downloaded from the course
GitHub repo. You may begin to work on the rest of the lab instructions if you want, or you
can also wait until class on Wednesday or when you watch the video to work through the lab
instructions.
Remember, all labs need to be submitted as rendered pdf files, with echo set to true in your Quarto YAML header:
execute:
echo: true
This will show your work in your rendered document (both code and output, including plots). The Canvas assignment submission for Lab 2 and all subsequent labs will be restricted to the .pdf file format only (and it must be a rendered .pdf file, not an html file saved as a pdf).
Installing Packages, Importing and Exploring Data
For this lab you will be installing eleven R packages (and their dependent packages) and, later, five Python packages (most of these packages were installed in Lab 1). In the Quarto file for this lab, please create an R code chunk and install the following packages. Remember that package names are strings ("they must be enclosed by quotation marks"), and they are case sensitive. Also remember that you must be connected to the Internet when you are installing packages:
1. stringi
2. stringr
3. rJava
4. ggthemes
5. gutenbergr
6. janeaustenr
7. tm
8. tidyr
9. ggplot2
10. scales
11. tidytext
You may install these packages one at a time using this example:
install.packages("stringi")
Or, you may install multiple packages at once using the concatenate “c” function as below:
install.packages(c("stringi","stringr","rJava","ggthemes","gutenbergr",
"janeaustenr","tm","tidyr","ggplot2","scales","tidytext"))
However, when installing packages, you may want to consider installing them one at a time (at least initially). This approach helps you identify and troubleshoot any problems with each package as you install it, and it lets you be confident that the packages you need are installed properly. Once you have these packages installed, you may comment the install lines out (using #), in case you want to render or source the entire script at once to reproduce your project.
If you receive any error messages during the package installation, try to interpret the messages and overcome them. Remember, it is perfectly fine to use Stack Overflow or generative AI tools to support your troubleshooting efforts.
Outline of the CRISP-DM Approach
Sometimes it is helpful to write out your task using narrative in Quarto (or commenting in an
R script), and then later go back and put in the actual code needed to accomplish those tasks
(e.g. in an R or Python code chunk in Quarto). As a reminder, here are the six steps of the
CRISP-DM approach. Please add these steps to your Quarto file.
Within each of these sections you are going to start writing your R code to accomplish the
objectives of the lab.
The data set we will be working on for this case study is a CSV file containing tweets from Delta Airlines. Our goal in this case study is to understand Delta Airlines' customer service tweets. We want to learn more about this particular domain, and need to answer the following questions:
1. What is the average length of a social customer service reply?
2. What links were referenced most often?
3. How many people should be on a social media customer service team?
4. How many social replies are reasonable for a customer service representative to handle?
Add these questions to a comment in the appropriate section of your R script.
This analysis will focus on Twitter data, but it could be expanded to include online forums, Facebook, Instagram, and other social media sources. The dataset is called "oct_delta.csv"; please place it into the folder where your project is located. It is a collection of Delta tweets pulled from the Twitter API from 1 October to 15 October 2015, and it has been cleaned up for easier analysis. Please update the appropriate comments section of your code to reflect this understanding.
3. Prepare the Text
For this case study, this step has already been done for you: the Twitter text has already been organized, and its many parameters have been reduced to a smaller CSV file containing only the text you want to use in this study. Keep in mind that in most studies you will have to conduct these steps yourself.
4. Extract Features
For this analysis we will use basic string manipulation techniques, apply bag-of-words text cleaning functions, and extract features. Given our four main research questions, these are the modeling steps we will take and the functions we will use:
1. What is the average length of a social customer service reply? We will use the function nchar() to address this question.
2. What links were referenced most often? We will use three functions, grep(), grepl(), and summary(), to address this question.
3. How many people should be on a social media customer service team? We will analyze Delta agent signature files.
4. How many social replies are reasonable for a customer service representative to handle? We will look at the data as a time series to focus on this insight.
5. Analyze
In this step, we analyze the results from our modeling in step 4 and attempt to answer our questions and address our overall problem. Once we answer our study questions, we will be able to make specific recommendations to Delta about how to best structure its social media customer service team.
End of Pre-Lab
Lab Instructions
Lab Instructions (to be completed before, during, or after the synchronous class):
There are four main parts to the lab, three in R and one in Python.
There are four required parts to the lab. Each of these parts includes instructions for completing your required deliverables in R and Python. Labs are graded on best effort. This means they are not expected to be perfect. You will receive full credit for completing the lab as long as you have made a good faith effort to complete as much as you can. For example, this may mean you do not complete all of the Python examples, but you do finish all the deliverables in R. As long as you demonstrate a good faith effort and submit your lab files on time (as a rendered PDF), you will receive credit. However, remember that the goal of the lab is to give you working code and techniques you can transfer to your own final project. So, it benefits you in the long run to try to do as much of the lab as possible.
-Part 1: Importing and Exploring Data in BaseR: Here in Part 1, we will focus on
exploring text data using functions in BaseR. We will load all the required packages and then
start our exploration.
-Part 2: Extract Features from Data with the tm Package: Next in Part 2, we move
from BaseR to use the tm package for text mining.
-Part 3: Introduction to the Tidyverse and Tidytext: In Part 3, we leave the tm
package, and begin to explore text mining using the tidyverse ecosystem.
-Part 4: Introductory Text Mining in Python: Finally, we will continue our introduction
to text mining in Python.
options(stringsAsFactors = FALSE)
Sys.setlocale("LC_ALL","C")
What this does is set your locale to the default “C” locale. This is important for text mining
in R, as it will help you avoid any issues with encoding.
[1] "C/C/C/C/C/en_US.UTF-8"
Now use the library() function to load all the libraries installed during the pre-lab:
stringi; stringr; rJava; ggthemes; gutenbergr; janeaustenr; tm; ggplot2; tidytext; tidyr; dplyr;
scales.
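One way to do this is shown below (a sketch; load them in any order, one library() call per package):
library(stringi)
library(stringr)
library(rJava)
library(ggthemes)
library(gutenbergr)
library(janeaustenr)
library(tm)
library(ggplot2)
library(tidytext)
library(tidyr)
library(dplyr)
library(scales)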
When you load these packages you will see some start-up messages noting that certain objects are masked (for example annotate, and filter and lag from dplyr); these messages are normal and can be ignored.
If you would like, you may double-check to confirm that these packages are now installed in your RStudio. You will see them listed in your collection of packages under the "Packages" tab of the Files, Plots, Packages, Help, Viewer pane. Once you have loaded a library, you will see a tick mark in the box next to that package. That lets you know the package has been loaded (you may also load it via the GUI by clicking that tick mark, but where is the fun in that!).
R (like Python) includes a number of built-in datasets. To review a description of the datasets
built into BaseR, use the following function:
data()
First, let’s get the Delta Tweets into R. This dataset represents a “structured” text dataset.
Since the dataset is saved as a csv (comma separated values) file, we can use the BaseR
function:
read.csv()
to bring that data into R. As with every function, it consists of the name of the function (in this case read.csv, the name the developers gave it when they created the function) and an open and closed parenthesis (). Between the () you need to insert an appropriate "argument" for that function. To get a hint about what is expected in the argument, in addition to reading the documentation, you may start typing the function and then hover your cursor over its name. As an example, try this with the read.csv() function.
When you use the read.csv() function, in the argument, include the name of the file we want
to read into R, in this case: oct_delta.csv. We need to put the path to the name of the file in
quotation marks. The quotation marks may be single or double marks. Here is my code:
read.csv('oct_delta.csv')
Notice, this helps to illustrate the value of using a project structure and “relative” pathnames.
If you were trying to reach this file outside of a project structure (or without changing your
working directory) you would have to use an “absolute” pathname instead, which would look
something like the following (on a Mac):
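For example (a purely hypothetical path; yours will differ):
read.csv('/Users/yourname/Documents/Lab2/oct_delta.csv')  # hypothetical absolute path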
The absolute pathname is the full path to the file, starting from the root directory. This is
why it is so much easier to use a project structure and relative pathnames.
So, use the read.csv() function from BaseR to read in the oct_delta.csv data and take a look at what you produce. You can see all the key variables and should have a better sense of the data structure.
Now that you know you can bring in the dataset, let's do it again, but this time when we bring it in, we can create an object, so we may use it in our work.
For now, let’s name this object “text.df” to remind us that this is now a text data frame
in R. Remember there are four main data structures in R: (1) Data Frame; (2) Matrix; (3)
Array; and (4) List. Remember, there are a number of ways of getting data, both textual and
numeric, into R. The read.csv function is just one way to do so in BaseR.
Deliverable 2: Create an Object from the Delta Tweets
In an R code chunk below, create an object called text.df by using the <- assignment operator
and the BaseR read.csv() function to bring the oct_delta.csv file into R.
NB: Remember to make sure the path in the argument is contained within single or double
quotation marks, and that you substitute whatever your path is to the data file.
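A minimal sketch (assuming oct_delta.csv sits in your project folder):
text.df <- read.csv('oct_delta.csv')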
When you have done this, you will notice a new object in your Global Environment pane. If you named yours as I did, the new object will be called text.df.
The environment pane should indicate the object has 1377 obs. of 5 variables. This means
there are 1377 rows and 5 columns in the dataset.
In your Quarto file, list the five variables in the text.df dataset and what they represent. It is important that you understand what each variable in a dataset represents.
variable 1:
variable 2:
variable 3:
variable 4:
variable 5:
Now, take a look at the dataset by using the head(), tail(), class(), and summary() functions to examine the first six rows, the last six rows, the class, and a summary of the text.df object:
head(text.df)
tail(text.df)
class(text.df)
summary(text.df)
(The head() and tail() output prints the first and last six tweets, truncated here to the console width, and class() returns "data.frame".)
As a little "sanity check" (we use this phrase to mean: does this make sense, is this really working?), in your Quarto file, list the number of the last row of this dataset.
Last row number:
Those variables, variable names, and the columns in which they are located will become very important to you, as you might want to look at or use data in only one column (or even grab just one individual piece of data). We do this through the use of an index.
So, let's explore indexing a little more. In a code chunk below, use the nchar() function and the head() function nested together to identify the number of characters in the top six rows of the column labeled "text" (this is the actual text content of the tweet).
nchar(head(text.df$text))
The outer function, nchar(), counts the number of characters in whatever object you supply as its argument. Inside that argument, you are using another function, head(), to ask R to look at only the first six rows of the dataset. You tell it the dataset (here, text.df), but instead of looking at the entire dataset, by including the $ you are asking R to look only at the column/variable "text".
This code returns the following result: [1] 119 110 78 65 137 142
In your Quarto file, list the number of characters in the first six tweets in the Delta Tweets dataset.
tweet 1:
tweet 2:
tweet 3:
tweet 4:
tweet 5:
tweet 6:
Now, let's talk more about indexing. R indexes dataframes from 1 to i (not from 0 to i - 1 as in Python). For example, in a code chunk below, use the following line of code to create an object called index_example containing the values from 1 to 50. Then, take a look at the object.
Then, ask R to show us those values.
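A minimal sketch (the object name index_example comes from the instructions above):
index_example <- 1:50   # the values 1 through 50
index_example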
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
When we look at the output, we can tell that each of those values, 1, 2, 3, 4, and so on, is indexed according to that same number (when the line wraps after 25, the [26] tells us the next item shown is the 26th value in the index, which is 26). This will be very helpful when we are trying to identify a particular item in our dataset.
So, now, based on that understanding, in a code chunk ask R to apply the nchar() function
to only the fourth row and the fifth column, which will tell us the number of characters in a
single tweet. Hint: use [4,5] to identify the fourth row, fifth column.
nchar(text.df[4,5])
[1] 65
While this lab is using a csv file as data, as we have discussed, you may import many different
structured and unstructured file types into R to use as data. The tm package includes functions
to increase the types of text files you may read into R. These functions include the following:
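The original list of reader functions is not reproduced here; readers such as readPlain(), readPDF(), and readDOC() ship with tm, and you can print the full set available in your installed version with getReaders():
getReaders()   # lists the reader functions available in your tm installation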
Now, let’s use what we’ve learned to answer our first research question. In a code chunk, write
a line of code that uses the mean() function and the nchar() function to provide us with the
average number of characters in a tweet contained in the text.df dataset.
mean(nchar(text.df$text))
[1] 92.16412
So, what is the average length of a Delta social customer service reply in October?
What insights do we gain from this finding?
Let's create a dataframe of just the tweets in the Delta dataset. To do that, we are going to create an object called "tweets" and use the data.frame() function. The full suggested call is below:
tweets <- data.frame(ID = seq(1:nrow(text.df)), text = text.df$text)
Make sure you look at that argument carefully to make sure you understand what each element
is doing.
Optional: If you want to apply some of the text pre-processing steps we have discussed, you
may use and/or modify the following code. However, I strongly recommend that you complete
the lab first, before building a code chunk to explore these pre-processing steps.
There is already a built-in stopword dictionary called stopwords; for the English version you add "english" to the argument of the stopwords() function. You may apply just this existing stopword dictionary, or you may create a custom stopword dictionary by adding additional words to the default one. You do this by creating a custom.stopwords object based on a concatenation c() of stopwords("english") and any additional stopwords you would like to add, each enclosed in quotation marks. For example, you might add "lol", "smh", "delta", and "amp" to the custom dictionary, as shown below.
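A minimal sketch of that custom dictionary:
custom.stopwords <- c(stopwords("english"), "lol", "smh", "delta", "amp")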
Then, you could use the function() function to create a function called clean.corpus. When you create a function, you give it a name, such as clean.corpus, and assign it the result of the function() call. For example, if you wanted your function to bring together several text processing functions from the tm package, it could look something like the following:
clean.corpus <- function(corpus){
corpus <- tm_map(corpus, content_transformer(tryTolower))
corpus <- tm_map(corpus, removeWords, custom.stopwords)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
return(corpus)
}
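Note that tryTolower() is not part of base R or the tm package; it is a small helper you would define first (essentially a tryCatch() wrapper around tolower(), so that problem characters return NA instead of stopping the loop), or you may simply substitute content_transformer(tolower). A minimal sketch of such a helper:
tryTolower <- function(x){
  y <- NA
  try_error <- tryCatch(tolower(x), error = function(e) e)   # catch encoding errors
  if (!inherits(try_error, "error")) y <- tolower(x)
  return(y)
}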
Then, to implement the function you have just created, you would use it like any other function you have used, by invoking its name and supplying it with the required argument. Note that tm_map() expects a tm corpus, so you would apply it to the corpus object you will create below, not to the raw tweets data frame:
clean.corpus(corpus)
Once you have completed the lab, you may want to explore these pre-processing steps, and
see how they affect your analysis.
Now, let's create the corpus we will use for our analysis, and subsequently use to create a Document Term Matrix (DTM), one of the most useful data formats for text mining.
Create an object called corpus and assign it the results of the VCorpus() function. A VCorpus is a "volatile" (in-memory) corpus, one that is not planned to persist after you end your R session. Within the argument of the VCorpus() function, you need to tell R what the VectorSource() is; in this case, our object called tweets. You also need to supply a readerControl argument. The full VCorpus() call could look something like the following, which you may use to create the corpus object you will need for your DTM:
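A sketch of that call (an assumption about the exact arguments; it draws the text column from the tweets data frame and sets the reader language):
corpus <- VCorpus(VectorSource(tweets$text),
                  readerControl = list(language = "en"))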
Now, use the inspect() function combined with your understanding of indexing to identify and
examine specific tweets within your object named “corpus”. For example [1:2], or [[2]].
inspect(corpus[1:2])
inspect(corpus[[2]])
Deliverable 13: Create a Document Term Matrix (DTM)
Now that we have a corpus (which is required for a DTM), let's create a DTM in an object simply named dtm. We use the DocumentTermMatrix() function and supply a corpus (in this case named corpus) plus the remaining elements required in the argument. Suggested code is below for you to explore:
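A minimal sketch (assuming the corpus object created above):
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))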
One key element of this argument is where we indicate the weighting of the terms in the DTM. In this case we are using weightTf, which represents term frequency: simply the frequency with which each term appears in the dataset. In contrast, we could use weightTfIdf to use the popular analytical heuristic "term frequency by inverse document frequency."
This approach leads us to create a Document Term Matrix (DTM), one of the most useful data formats. In a DTM we create a numerical representation of our textual dataset, with each document represented in a row and the terms across the documents represented in columns. If we want to create the opposite representation of our dataset, we can create a Term Document Matrix (TDM) using the TermDocumentMatrix() function. Here, the terms are represented in the rows, and the documents are represented in the columns.
As we learn more about the tidyverse, we will also discuss how to move between a DTM/TDM
and a “tibble” (the tidyverse data.frame format) by “casting” our data back and forth.
If you want, you may try a bit more data wrangling with your new dtm by using the following
sample code:
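One possible version of that wrangling (a sketch, assuming the dtm object created above):
dtm.m <- as.matrix(dtm)                      # convert the DTM to a plain matrix
term.freq <- colSums(dtm.m)                  # total frequency of each term
freq.df <- data.frame(word = names(term.freq), frequency = term.freq)
freq.df <- freq.df[order(freq.df$frequency, decreasing = TRUE), ]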
With this code, we have created an object called freq.df, a dataframe representing the frequency of words within the Delta tweets. Now, we will use ggplot2 to visualize this data. Before we do that, the unique words have to be changed from a string to a factor with unique levels.
Deliverable 15: Use ggplot to Visualize Frequency
Create a bar chart to visualize the most frequently occurring words in the corpus. At this point I will supply suggested code you may use to create your code chunk. Try to interpret what each line of code is doing; but don't worry, we will return to these details later, especially building data visualization via ggplot.
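A sketch along those lines (it first converts the words to a factor with unique levels, as mentioned above, then plots the 20 most frequent terms; the exact code used to produce the output below may differ):
freq.df$word <- factor(freq.df$word, levels = unique(as.character(freq.df$word)))

ggplot(freq.df[1:20, ], aes(x = word, y = frequency)) +
  geom_bar(stat = "identity", fill = "darkred") +
  coord_flip() +
  theme_gdocs() +
  geom_text(aes(label = frequency), colour = "white", hjust = 1.25, size = 3)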
Warning: Removed 18 rows containing missing values or values outside the scale range
(`geom_bar()`).
Warning: Removed 18 rows containing missing values or values outside the scale range
(`geom_text()`).
Part 3: Introduction to the Tidyverse and Tidytext
Until now, we have been focusing on the tm package, whose core data structure is the document term matrix (DTM). Historically, this approach has been the standard for text mining. However, there is a more contemporary approach, developed in 2014 by Hadley Wickham, a former professor at Rice University and now chief scientist at Posit (formerly RStudio).
This newer approach is called the tidyverse, and it uses an approach to text mining called tidytext. I believe you will find the tidyverse to be an elegant and powerful ecosystem for your data science broadly speaking, and for text mining in particular.
However, to gain this power, the tidyverse requires its data to be in a specific form of dataframe called a tibble: each variable is a column, each observation is a row, and each type of observational unit is a table. For tidytext, this means a table with one token (e.g. a word) per row. While the tidyverse has these specific requirements, you will see how valuable it is as a data wrangling approach. Also, you will see later that you can "cast" the data back and forth between a DTM and a tibble.
In an R code chunk below, create an object called original_books made by calling the austen_books() function. We have created objects like this before. But now, we will go further by introducing the powerful concept introduced by the tidyverse called "piping" data. To "pipe" data, we use the pipe operator, which is made by inserting the greater-than symbol between two percent signs (%>%). It is the R package dplyr that uses the pipe so effectively to help you wrangle your data. So, after the line of code creating original_books, insert the pipe operator. In your mind, read the pipe operator as "and then". Essentially you are telling R to do "something", and then do "something else", and so on until you complete what you are trying to do in that discrete action. For example, in an R code chunk below, try the following lines of code:
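A sketch following the standard tidytext workflow (it assumes dplyr and stringr are loaded; the regular expression flags chapter headings so each line can be tagged with its chapter number):
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup()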
Here we are using several dplyr verbs that we will explore in much more detail later, such as mutate() to create a new variable and group_by() to group our data. We are also using a regular expression, abbreviated regex, to further wrangle this data. We will spend much more time on these concepts later. For now, just give it a try.
Then, take a look at the object:
original_books
class(original_books)
# A tibble: 73,422 x 4
text book linenumber chapter
<chr> <fct> <int> <int>
1 "SENSE AND SENSIBILITY" Sense & Sensibility 1 0
2 "" Sense & Sensibility 2 0
3 "by Jane Austen" Sense & Sensibility 3 0
4 "" Sense & Sensibility 4 0
5 "(1811)" Sense & Sensibility 5 0
6 "" Sense & Sensibility 6 0
7 "" Sense & Sensibility 7 0
8 "" Sense & Sensibility 8 0
9 "" Sense & Sensibility 9 0
10 "CHAPTER 1" Sense & Sensibility 10 0
# i 73,412 more rows
Now, we need to restructure the dataset into a "tidy" version appropriate for analyzing with tidytext. We could use the same object name and overwrite it in this new format, but instead let's create a new object called tidy_books. To get there, assign tidy_books the result of calling our original_books object and then (inserting the pipe at the end of the line) using a tidytext function called unnest_tokens(). In the argument we name the output column (word) first and the input column (text) second, so that each token from the text column ends up in its own row. This new tidy_books object uses a one-token-per-row format (a tibble).
Remember, the unnest_tokens() function comes from the tidytext package. If you do not yet have the tidytext package installed and loaded, please do so now; you may comment the install line out afterwards.
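A minimal sketch of that step:
tidy_books <- original_books %>%
  unnest_tokens(word, text)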
tidy_books
class(tidy_books)
# A tibble: 725,055 x 4
book linenumber chapter word
<fct> <int> <int> <chr>
1 Sense & Sensibility 1 0 sense
2 Sense & Sensibility 1 0 and
3 Sense & Sensibility 1 0 sensibility
4 Sense & Sensibility 3 0 by
5 Sense & Sensibility 3 0 jane
6 Sense & Sensibility 3 0 austen
7 Sense & Sensibility 5 0 1811
8 Sense & Sensibility 10 0 chapter
9 Sense & Sensibility 10 0 1
10 Sense & Sensibility 13 0 the
# i 725,045 more rows
Now, let's apply the tidytext built-in stopword dictionary to our text. Start by calling data(stop_words). This will load the stop_words dictionary into your environment. Then, take a look at the stop_words object.
data(stop_words)
stop_words
Then, use the pipe to create a new version of the tidy_books object by calling tidy_books
and then using the anti_join() function with the “stop_words” dictionary supplied in the
argument.
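A minimal sketch of that step:
tidy_books <- tidy_books %>%
  anti_join(stop_words)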
# A tibble: 1,149 x 2
word lexicon
<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART
7 accordingly SMART
8 across SMART
9 actually SMART
10 after SMART
# i 1,139 more rows
# A tibble: 217,609 x 4
book linenumber chapter word
<fct> <int> <int> <chr>
1 Sense & Sensibility 1 0 sense
2 Sense & Sensibility 1 0 sensibility
3 Sense & Sensibility 3 0 jane
4 Sense & Sensibility 3 0 austen
5 Sense & Sensibility 5 0 1811
6 Sense & Sensibility 10 0 chapter
7 Sense & Sensibility 10 0 1
8 Sense & Sensibility 13 0 family
9 Sense & Sensibility 13 0 dashwood
10 Sense & Sensibility 13 0 settled
# i 217,599 more rows
Now that we have a clean tibble, let's use it for our analysis. Begin by calling the tidy_books object and then use the count() function to count the number of times each word appears in the dataset. In the argument for count(), insert word, and then sort = TRUE.
tidy_books %>%
count(word, sort = TRUE)
# A tibble: 13,914 x 2
word n
<chr> <int>
1 miss 1855
2 time 1337
3 fanny 862
4 dear 822
5 lady 817
6 sir 806
7 day 797
8 emma 787
9 sister 727
10 house 699
# i 13,904 more rows
Now we can really take advantage of the powerful visualization in ggplot by visualizing the results with a bar chart of the most common words in Austen's novels. Below, I provide sample code for you to use. Again, we will go through this in detail in both our data wrangling and data visualization sessions, but for now, see how the pipe (read "and then") allows you to build an elegant code block accomplishing a great deal. You will also see that ggplot does not yet truly use the pipe, but instead uses a + sign for a somewhat similar purpose. My understanding is that the next iteration of ggplot will switch to using the pipe, but for now the + sign is what we use to indicate the layers of a ggplot data visualization. In a code chunk below, we want to take the tidy_books object, and then count() by word with sort = TRUE, and then use the filter() function to identify words that occur more than 600 times, and then use mutate() with word = reorder(word, n) to reorder the word variable, and then move on to the ggplot() function with aes(word, n) as the argument, then with a + add geom_col(), with another + add xlab(NULL) to use the default label for the x axis, and end with the coord_flip() function (no need to put anything in its argument).
tidy_books %>%
count(word, sort = TRUE) %>%
filter(n > 600) %>%
mutate(word=reorder(word, n)) %>%
ggplot(aes(word,n)) +
geom_col() +
xlab(NULL) +
coord_flip()
(The resulting bar chart shows the words appearing more than 600 times: miss, time, fanny, dear, lady, sir, day, emma, sister, house, elizabeth, elinor, and hope, with counts n on the horizontal axis.)
Now that we've used the janeaustenr package, let's look at the gutenbergr package. It provides access to public domain works from the Project Gutenberg collection. We will make good use of the gutenberg_download() function, which downloads one or more works from Project Gutenberg by ID.
Let's start by looking at some books of science fiction and fantasy by H.G. Wells. Create an object called "hgwells" by assigning it the results of using the gutenberg_download() function to get the following books: The Time Machine, The War of the Worlds, The Invisible Man, and The Island of Doctor Moreau. These books have the respective Project Gutenberg IDs 35, 36, 5230, and 159. In the argument for the gutenberg_download() function you may combine these with the concatenate c() function.
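A minimal sketch of that call:
hgwells <- gutenberg_download(c(35, 36, 5230, 159))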
Warning: ! Could not download a book at http://aleph.gutenberg.org/1/5/159/159.zip.
i The book may have been archived.
i Alternatively, You may need to select a different mirror.
> See https://www.gutenberg.org/MIRRORS.ALL for options.
# A tibble: 15,303 x 2
gutenberg_id text
<int> <chr>
1 35 "The Time Machine"
2 35 ""
3 35 "An Invention"
4 35 ""
5 35 "by H. G. Wells"
6 35 ""
7 35 ""
8 35 "CONTENTS"
9 35 ""
10 35 " I Introduction"
# i 15,293 more rows
Now create a tidy version of the hgwells dataset by creating an object called tidy_hgwells, which takes the hgwells object, and then uses the unnest_tokens() function with "word, text" as the argument, and then uses the anti_join() function with stop_words in the argument.
Finally, once you create it, take a look at the object.
The sample code below helps to illustrate the power of the pipe.
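A sketch of that pipeline:
tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

tidy_hgwells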
Deliverable 23: Count Words in tidy_hgwells
Now, use the count() function to determine the most commonly occurring words in the novels
of H.G. Wells. Use “word, sort = TRUE” as the argument for the count() function.
tidy_hgwells %>%
count(word, sort = TRUE)
# A tibble: 10,320 x 2
word n
<chr> <int>
1 time 396
2 people 249
3 door 224
4 kemp 213
5 invisible 197
6 black 178
7 stood 174
8 night 168
9 heard 167
10 hall 165
# i 10,310 more rows
Now we will return to Project Gutenberg to get some works from the Bronte sisters. Create an object called "bronte" and use the gutenberg_download() function to assign it the books with the following Project Gutenberg IDs: 1260, 768, 969, 9182, and 767, using the concatenate function c() in the argument.
Now, create a tidy version of the bronte object. Create an object called tidy_bronte, assigning
it the results of taking bronte object, and then, the unnest_tokens() function, with “word,
text” in the argument, and then cleaning it by using the anti_join() function with stop_words
in the argument.
tidy_bronte <- bronte %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
Now, identify the most commonly occurring words in the novels of the Bronte sisters. Use
the tidy_bronte object, and then, use the count() function, with “word, sort = TRUE” in the
argument.
tidy_bronte %>%
count(word, sort = TRUE)
# A tibble: 23,213 x 2
word n
<chr> <int>
1 "time" 1065
2 "miss" 854
3 "day" 825
4 "don\u2019t" 780
5 "hand" 767
6 "eyes" 714
7 "night" 648
8 "heart" 638
9 "looked" 601
10 "door" 591
# i 23,203 more rows
Now, let’s calculate the frequency for each word in the works of Jane Austen, the Bronte
sisters, and HG Wells, by binding the data frames together. We can use spread() and gather()
functions from tidyr to reshape our dataframe.
The suggested sample code is below:
frequency <- bind_rows(mutate(tidy_bronte, author="Bronte Sisters"),
mutate(tidy_hgwells, author="H.G. Wells"),
mutate(tidy_books, author="Jane Austen")) %>%
mutate(word = str_extract(word, "[a-z']+")) %>%
count(author, word) %>%
group_by(author) %>%
mutate(proportion = n / sum(n)) %>%
select(-n) %>%
spread(author, proportion) %>%
gather(author, proportion, "Bronte Sisters":"H.G. Wells")
Now, let’s plot using suggested code below (expect a warning about rows with missing values
being removed):
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_text()`).
(The resulting faceted scatter plot compares word proportions in the Bronte Sisters and H.G. Wells panels against Jane Austen, with percentage scales on both axes.)
Part 4: Introductory Text Mining in Python
Now, let's explore some of the same techniques using Python. In this part of the lab, we will analyze Twitter data using Python. We will use the Natural Language Toolkit (nltk) to analyze the text data. We will also use the scikit-learn library to create a bag-of-words model and a term frequency-inverse document frequency (TF-IDF) model. Finally, we will use the matplotlib library to visualize the results.
Step 1: Setting Up Your Environment
You will need to make sure you have the following relevant Python packages installed: nltk, pandas, matplotlib, and scikit-learn. My recommendation is to install these packages within your Anaconda Navigator.
Once you have the packages installed, import them into your Python environment. To do so, open a Python code chunk below and use import statements with the following abbreviations or aliases: pd for pandas, nltk for nltk, and plt for matplotlib.pyplot. Try the sample code below in a Python code chunk:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
We are assuming the “oct_delta.csv” file is still in your working directory. If not, please change
the path to the file. We are also assuming the csv file has a column named ‘text’, containing
the text of the tweets from Delta Airlines.
In a Python code chunk below, read in the csv file and assign it to a variable called tweets_df.
Remember in Python the assignment operator is = (as opposed to the <- operator in R).
Try the sample code below in a Python code chunk:
tweets_df = pd.read_csv('oct_delta.csv')
Character count in each tweet. Now let’s create a new column in the tweets_df dataframe
called ‘char_count’ that contains the character count in each tweet. Try the sample code
below in a Python code chunk:
tweets_df['char_count'] = tweets_df['text'].apply(len)
Visualize the character count in each tweet. Now let’s visualize the character count in each
tweet. Try the sample code below in a Python code chunk:
tweets_df['char_count'].plot.hist(bins=20)
plt.show()
(The histogram shows the distribution of character counts per tweet, ranging roughly from 20 to 140 characters.)
Now, we will calculate the term frequency or word count of each word in each tweet. Try the
sample code below in a Python code chunk:
cv = CountVectorizer()
tf = cv.fit_transform(tweets_df['text'])
tf_feature_names = cv.get_feature_names_out()
Then, you can convert to DataFrame for easier handling with the following code:
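A minimal sketch of that conversion (the name tf_df is assumed from its use below):
tf_df = pd.DataFrame(tf.toarray(), columns=tf_feature_names)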
Visualize the word count in each tweet. Now let’s visualize the word count in each tweet.
Try the sample code below in a Python code chunk:
tf_df.sum().sort_values(ascending=False).head(20).plot.bar()
plt.show()
(The bar chart shows the 20 most frequent terms, including to, you, the, your, for, please, dm, hi, pls, and sorry.)
4. Word Count (TF-IDF) in Each Tweet
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(tweets_df['text'])
tfidf_feature_names = tfidf.get_feature_names_out()
You may now convert to DataFrame for easier handling with the following code:
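A minimal sketch of that conversion (the name tfidf_df is assumed from its use below):
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_feature_names)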
Visualize the word count (TF-IDF) in each tweet. Now let’s visualize the word count (TF-IDF)
in each tweet.
Try the sample code below in a Python code chunk:
tfidf_df.sum().sort_values(ascending=False).head(20).plot.bar()
plt.show()
(The bar chart shows the 20 terms with the highest total TF-IDF weight, including to, you, the, your, please, dm, hear, pls, sorry, and confirmation.)
5. Most Common Words
Now, let’s find the most common words, and identify the top 20 common words using the
head() function.
Try the sample code below in a Python code chunk:
tf_df.sum().sort_values(ascending=False).head(20)
to 734
you 683
the 568
your 545
for 511
and 247
we 222
please 221
be 217
this 198
can 195
hi 185
sorry 182
dm 181
our 168
that 162
will 159
follow 155
pls 154
is 154
dtype: int64
Visualize the most common words. Now let’s visualize the most common words.
Try the sample code below in a Python code chunk:
tf_df.sum().sort_values(ascending=False).head(20).plot.bar()
plt.show()
(The bar chart visualizes the same top-20 word counts listed above.)
Now, let’s find the most common phrases. We will create a bigram matrix and a trigram
matrix. You may use the sample code below in a Python code chunk:
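One hedged way to build a bigram count matrix with scikit-learn's ngram_range parameter (the object names here are assumptions, not the original code):
bigram_cv = CountVectorizer(ngram_range=(2, 2))          # count two-word phrases
bigram_tf = bigram_cv.fit_transform(tweets_df['text'])
bigram_df = pd.DataFrame(bigram_tf.toarray(),
                         columns=bigram_cv.get_feature_names_out())
bigram_df.sum().sort_values(ascending=False).head(20)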
This is how you can create a trigram matrix:
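And similarly for a trigram matrix (again, the names are assumptions):
trigram_cv = CountVectorizer(ngram_range=(3, 3))         # count three-word phrases
trigram_tf = trigram_cv.fit_transform(tweets_df['text'])
trigram_df = pd.DataFrame(trigram_tf.toarray(),
                          columns=trigram_cv.get_feature_names_out())
trigram_df.sum().sort_values(ascending=False).head(20)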
Now, let’s plot the most common terms. You may use the sample code below in a Python
code chunk:
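The helper plot_most_common_words() is not shown in these pages; a minimal sketch of what such a helper might look like (an assumption, not the original definition):
def plot_most_common_words(count_matrix, vectorizer, top_n=20):
    # Sum each term's counts across all documents and plot the top_n as a bar chart
    counts = count_matrix.toarray().sum(axis=0)
    words = vectorizer.get_feature_names_out()
    pd.Series(counts, index=words).sort_values(ascending=False).head(top_n).plot.bar()
    plt.show()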
plot_most_common_words(tf, cv)
(The resulting bar chart again shows the 20 most frequent terms in the tweets.)
1. Creating Dataset
We will use the nltk package and download the Jane Austen novels from its collection of texts.
We will then create a dataset containing the text of each novel and the author.
Let’s first download the Jane Austen novels from the nltk package. You may use the sample
code below in a Python code chunk:
nltk.download('gutenberg')
from nltk.corpus import gutenberg
You may want to comment out the nltk.download('gutenberg') line after the first run to avoid downloading the same data again.
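The code that builds the austen_texts object used below is not shown in these pages; one hedged way to construct it from NLTK's Gutenberg corpus (the variable names are assumptions):
# Concatenate the Jane Austen works included in NLTK's Gutenberg corpus
austen_fileids = [f for f in gutenberg.fileids() if f.startswith('austen')]
austen_texts = ' '.join(gutenberg.raw(f) for f in austen_fileids)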
We will use the nltk package to tokenize the text and remove stopwords.
In a Python code chunk, try the sample code below:
nltk.download('stopwords')
nltk.download('punkt')   # tokenizer models required by nltk.word_tokenize()
stop_words = set(stopwords.words('english'))
austen_words = nltk.word_tokenize(austen_texts)
filtered_words = [word for word in austen_words if word.lower() not in stop_words]
Again, you may want to comment out the nltk.download(‘stopwords’) line after the first run
to avoid downloading the same data again.
Now analyze the text of the Jane Austen novels using both term frequency and tf-idf. You
may use the sample code below in a Python code chunk:
cv = CountVectorizer()
tf = cv.fit_transform(filtered_words)
tf_feature_names = cv.get_feature_names_out()
Now, let’s analyze the text using tf-idf. You may use the sample code below in a Python code
chunk:
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(filtered_words)
tfidf_feature_names = tfidf.get_feature_names_out()
4. Visualization
Now, let’s visualize the most common words and phrases. You may use the sample code below
in a Python code chunk:
tf_df.sum().sort_values(ascending=False).head(20)
plot_most_common_words(tf, cv)
plot_most_common_words(tfidf_matrix, tfidf)
to 734
you 683
the 568
your 545
for 511
and 247
we 222
please 221
be 217
this 198
can 195
hi 185
sorry 182
dm 181
our 168
that 162
will 159
follow 155
pls 154
is 154
dtype: int64
(Two bar charts show the most common words in the Austen corpus by raw count and by TF-IDF weight; words such as could, would, mr, mrs, must, said, one, much, miss, every, emma, and think dominate both.)
End of Lab!
Submit your lab rendered as a PDF to Canvas. Make sure to include all the code and output.