
Lab 5 - Categorization Models, Dictionaries, and Sentiment Analysis in R and Python

Dr. Derrick L. Cogburn

2024-09-22

Table of contents

Lab Overview
  Technical Learning Objectives
  Business Learning Objectives

Assignment Overview

Pre-Lab Instructions
  Installing Packages, Importing and Exploring Data
  Load the required libraries

Lab Instructions
  Lab Instructions (to be completed before, during, or after the synchronous class)

Part 1: Data Preparation, Text Mining and Dictionary Development in tm

Part 2: Understanding Tidyverse Dictionary Construction and Sentiment Analysis

Part 3: Text Mining with quanteda, Including Variable Creation and Dictionaries

Part 4: Using nltk and TextBlob to conduct sentiment analysis in Python

Lab Overview

Technical Learning Objectives

1. Understand the structure of a text mining categorization model, also known as a dictionary.

2. Learn how to identify and use an existing dictionary to explore specific interests in your
dataset.
3. Learn how to build your own dictionary in tm, tidyverse, and quanteda.
4. Learn how to use dictionaries to conduct sentiment analysis in tm, tidyverse, and
quanteda.
5. Learn how to conduct sentiment analysis in Python using the Natural Language Toolkit
(nltk) and TextBlob packages.

Business Learning Objectives

1. Understand the opportunities and challenges of using dictionaries.


2. Understand a popular application of dictionaries, known as sentiment analysis.
3. Understand the substantial advances made in the development and use of dictionaries in the quanteda package.

Assignment Overview

For this lab, we will build your skills in deductive (confirmatory) approaches to analyzing your
text. You will learn how to build and use categorization models, also known as dictionaries
(or lexicons), to test your hypotheses and concepts, and how they can be used to conduct
sentiment analysis. The lab begins with a quick look at very basic dictionaries using the tm
package. Then, the lab moves to introduce you to using the ecosystem in R known as quanteda
(Quantitative Analysis of Textual Data). The development of quanteda was supported by the
European Union, and it is a comprehensive approach to text mining in R, which rivals the
tidyverse ecosystem using tidytext. While I encourage you to understand all three of the major
ecosystems for text mining in R, you may decide you find one more suited to your needs or
interests.
There are four main parts to the lab:
- Part 1 demonstrates basic data preparation, text mining, and dictionary development using tm.
- Part 2 helps you understand dictionary construction and a specific, popular application of a dictionary (sentiment analysis) using the tidyverse approach.
- Part 3 introduces you to a full text mining tutorial using quanteda, including its easy creation of document variables and use of dictionaries.
- Part 4 introduces you to conducting sentiment analysis in Python using the nltk and TextBlob packages.

Pre-Lab Instructions

Pre-Lab Instructions (to be completed before class):


Create Your Lab 5 Project in RStudio following the processes I have described previously. Create a new Quarto document with a relevant title for the lab, for example: "Lab 5: Categorization Models, Dictionaries, and Sentiment Analysis in R and Python". Download the zipped Lab 5 file from the course GitHub repo and save it in your Lab 5 project folder. The zipped file contains the Lab 5 instructions, the Lab 5 data, and an existing dictionary file called "policy_agendas_english.lcd". Unzip the file and save the files in your Lab 5 project folder.
Now begin working your way through the Lab 5 instructions, taking a literate programming approach. You can complete the pre-lab instructions before class, or wait until class on Wednesday.
Also, please remember that this and all subsequent labs need to be submitted as knitted pdf
files, with your Quarto YAML header set to:

echo: true
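
In a Quarto YAML header, this option sits under execute:. A minimal sketch of a header is shown below; the title and format values are just examples, not requirements:

---
title: "Lab 5: Categorization Models, Dictionaries, and Sentiment Analysis in R and Python"
format: pdf
execute:
  echo: true
---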

Remember, my recommended best practice is to render your document to html first, as you
are working through the lab, and then render it to pdf once you are finished and when you
are ready to submit to Canvas.
However, as you are working through the lab, if you come to any sections that produce very long output, they may cause problems with rendering to pdf. If a section does create rendering problems, you may set that specific code chunk's options to suppress its output (for example, output: false); that will suppress the printed output of that chunk when you render the document.
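
For example, a chunk option like the one below keeps the code visible but leaves its printed output out of the rendered file. This is a hypothetical chunk; inspect(dtm2) stands in for any call that prints a very long matrix:

#| output: false
# The code in this chunk still runs and is echoed, but its long
# printed output is suppressed in the rendered document.
inspect(dtm2)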
As always, if you are having problems rendering your lab and are facing the deadline for submitting the assignment, you may comment out (using #) the sections of your file that are causing the rendering problems, and submit the assignment with a note about the rendering issue.

Installing Packages, Importing and Exploring Data

For this lab you will be working in the quanteda ecosystem, along with the tm package. If
you have not already, you will need to install quanteda and these three related packages:
quanteda.textmodels, quanteda.textstats, quanteda.textplots.
Later, if you really like quanteda, you may install the following packages from their GitHub
site:
quanteda.sentiment, quanteda.tidy.
To install packages from a GitHub repo, you may use the following function:

remotes::install_github("quanteda/quanteda.sentiment")
remotes::install_github("quanteda/quanteda.tidy")

Also, if you haven't already, please install: textdata, wordcloud, and readtext.

install.packages("quanteda")
install.packages("quanteda.textmodels")
install.packages("quanteda.textstats")
install.packages("quanteda.textplots")
remotes::install_github("quanteda/quanteda.sentiment")
remotes::install_github("quanteda/quanteda.tidy")
install.packages("textdata")
install.packages("wordcloud")
install.packages("readtext")

Load the required libraries

Now, load the libraries required for this analysis: tm, tidyverse, tidytext, quanteda, quanteda.textmodels, quanteda.textplots, quanteda.textstats, quanteda.sentiment, quanteda.tidy, textdata, wordcloud, readtext, reshape2, janeaustenr.

library(tm)
library(tidyverse)
library(tidytext)
library(quanteda)
library(quanteda.textmodels)
library(quanteda.textplots)
library(quanteda.textstats)
library(quanteda.sentiment)
library(quanteda.tidy)
library(textdata)
library(wordcloud)
library(readtext)
library(reshape2)
library(janeaustenr)

Loading required package: NLP

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --


v dplyr 1.1.4 v readr 2.1.5

v forcats 1.0.0 v stringr 1.5.1
v ggplot2 3.5.1 v tibble 3.2.1
v lubridate 1.9.3 v tidyr 1.3.1
v purrr 1.0.2
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x ggplot2::annotate() masks NLP::annotate()
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Package version: 4.1.0
Unicode version: 14.0
ICU version: 71.1

Parallel computing: disabled

See https://quanteda.io for tutorials and examples.

Attaching package: 'quanteda'

The following object is masked from 'package:tm':

stopwords

The following objects are masked from 'package:NLP':

meta, meta<-

Attaching package: 'quanteda.sentiment'

The following object is masked from 'package:quanteda':

data_dictionary_LSD2015

Attaching package: 'quanteda.tidy'

The following object is masked from 'package:stats':

filter

Loading required package: RColorBrewer

Attaching package: 'readtext'

The following object is masked from 'package:quanteda':

texts

Attaching package: 'reshape2'

The following object is masked from 'package:tidyr':

smiths

Please note, loading the packages in this order leads to the stopwords() function from the tm package being masked by quanteda. You will want to watch for this masking each time you load a package. So, when we want to call that function, remember we need to include a reference to the tm package. As with any package, you may call a function within a specific package by using two colons ::, preceded by the package name and followed by the function name. Example code is below:

tm::stopwords

*** End of Pre-Lab ***

Lab Instructions

Lab Instructions (to be completed before, during, or after the synchronous class):

This lab has 32 deliverables. Follow and complete these Lab Instructions before, during, or
after the synchronous class.

Part 1: Data Preparation, Text Mining and Dictionary Development in tm

In this part of the lab, we will use the tm package to prepare our data, conduct text mining,
and develop a dictionary. We will use the Reuters dataset built into the tm package to create
a VCorpus object. We will then use the VCorpus object to create a DocumentTermMatrix
object. We will then use the DocumentTermMatrix object to create a dictionary.

Deliverable 1: Get your working directory and paste below:

getwd()

Deliverable 2: Create Files For Use from Reuters

The tm package ships with a set of sample Reuters-21578 documents. From this dataset let's pull some text files to use as our initial data. Create an object called reut21578 and use the system.file() function, with the following argument: ("texts", "crude", package = "tm")
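
If you need to, a minimal sketch of this step, built directly from the argument described above, is:

reut21578 <- system.file("texts", "crude", package = "tm")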

Deliverable 3: Create VCorpus Object

Now that we have the Reuters texts to use for our data, let's use the VCorpus() function from the tm package to create an object called reuters. In the argument for the VCorpus() function, use the following line of code to tell it the directory source, the mode, and which reader control to use:
(DirSource(reut21578, mode = "binary"), readerControl = list(reader = readReut21578XMLasPlain))
If you need to, you may use the sample code below:

reuters <- VCorpus(DirSource(reut21578, mode = "binary"), readerControl = list(reader = readReut21578XMLasPlain))

reuters

<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 20

Deliverable 4: Prepare and Preprocess the Corpus

Now let’s walk through several preprocessing steps in tm, starting with stripping out the
whitespace, then transforming the text to lowercase, and then removing the stopwords.
Preprocess - Strip Whitespace
Here, we are going to take the reuters object we created, and assign it the results of using the
tm_map() function, with the corpus object reuters, and the stripWhitespace function. The
argument for the tm_map() function would look like the following:
(reuters, stripWhitespace)
If you need to, you may use the sample code below:

reuters <- tm_map(reuters, stripWhitespace)

Preprocess - Transform to Lowercase


Now, take the reuters object and assign it the results of using the tm_map() function, with
the corpus object reuters, and the content_transformer() function, selecting: tolower as the
argument. The argument for the tm_map() function would look like the following:
(reuters, content_transformer(tolower))
If you need to, you may use the sample code below:

reuters <- tm_map(reuters, content_transformer(tolower))

Preprocess - Remove Stopwords


Finally, take the reuters object and assign it the results of using the tm_map() function, with
the corpus object reuters, and the removeWords() function, selecting: stopwords(“english”) as
the argument. The argument for the tm_map() function would look like the following:
(reuters, removeWords, stopwords(“english”))
If you need to, you may use the sample code below:

reuters <- tm_map(reuters, removeWords, tm::stopwords("english"))

Here are the preprocessing steps all in one block, using tf-idf weighting. The code below will remove punctuation, numbers, and stopwords, and will also stem the words.

myStopwords = c(tm::stopwords(), "")
tdm3 = TermDocumentMatrix(reuters,
                          control = list(weighting = weightTfIdf,
                                         stopwords = myStopwords,
                                         removePunctuation = T,
                                         removeNumbers = T,
                                         stemming = T))
inspect(tdm3)

<<TermDocumentMatrix (terms: 781, documents: 20)>>


Non-/sparse entries: 1501/14119
Sparsity : 90%
Maximal term length: 16
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Sample :
Docs
Terms 144 211 237 242 246 273
billion 0.00000000 0.00000000 0.000000000 0.00000000 0.12501880 0.00000000
crude 0.00000000 0.00000000 0.000000000 0.00000000 0.00000000 0.02380172
grade 0.00000000 0.00000000 0.000000000 0.00000000 0.00000000 0.00000000
januari 0.00000000 0.00000000 0.000000000 0.00000000 0.00000000 0.05490790
mln 0.01752096 0.04266678 0.004430781 0.00000000 0.00000000 0.04284309
opec 0.05703422 0.00000000 0.003846154 0.02298851 0.01075269 0.02066116
post 0.00000000 0.00000000 0.000000000 0.00000000 0.00000000 0.00000000
reserv 0.00000000 0.12899601 0.000000000 0.00000000 0.01248348 0.00000000
saudi 0.00000000 0.00000000 0.000000000 0.02298851 0.00000000 0.05785124
west 0.00000000 0.00000000 0.000000000 0.00000000 0.00000000 0.00000000
Docs
Terms 368 502 704 708
billion 0 0.00000000 0 0.19540753
crude 0 0.00000000 0 0.03388244
grade 0 0.00000000 0 0.00000000
januari 0 0.00000000 0 0.39081507
mln 0 0.03170651 0 0.06776489
opec 0 0.00000000 0 0.00000000
post 0 0.00000000 0 0.00000000
reserv 0 0.06390628 0 0.00000000
saudi 0 0.00000000 0 0.00000000
west 0 0.00000000 0 0.00000000

Deliverable 5: Create Document Term Matrix with TF and TF*IDF

Use the DocumentTermMatrix() function to create an object called dtm from the reuters
corpus. Then use the inspect() function to explore your dtm.
If you need to, you may use the sample code below:

dtm <- DocumentTermMatrix(reuters)


inspect(dtm)

<<DocumentTermMatrix (documents: 20, terms: 1183)>>


Non-/sparse entries: 1908/21752
Sparsity : 92%
Maximal term length: 17
Weighting : term frequency (tf)
Sample :
Terms
Docs crude dlrs last mln oil opec prices reuter said saudi
144 0 0 1 4 11 10 3 1 9 0
236 1 2 4 4 7 6 2 1 6 0
237 0 1 3 1 3 1 0 1 0 0
242 0 0 0 0 3 2 1 1 3 1
246 0 0 2 0 4 1 0 1 4 0
248 0 3 1 3 9 6 7 1 5 5
273 5 2 7 9 5 5 4 1 5 7
489 0 1 0 2 4 0 2 1 2 0
502 0 1 0 2 4 0 2 1 2 0
704 0 0 0 0 3 0 2 1 3 0

Remember, there is a helpful alternative weighting to explore, which is term frequency by inverse document frequency (TF*IDF). To try it, create another object called dtm2 and, after you call for the reuters corpus, use the following code in the argument: control = list(weighting = weightTfIdf)
If you need to, you may use the sample code below:

dtm2 <- DocumentTermMatrix(reuters, control = list(weighting=weightTfIdf))

inspect(dtm2)

<<DocumentTermMatrix (documents: 20, terms: 1183)>>


Non-/sparse entries: 1868/21792

Sparsity : 92%
Maximal term length: 17
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Sample :
Terms
Docs 1.50 billion crude january mln opec posted
144 0 0.00000000 0.000000000 0.0000000 0.017258473 0.037453184 0
211 0 0.00000000 0.000000000 0.0000000 0.041891022 0.000000000 0
236 0 0.00000000 0.004314618 0.0000000 0.017258473 0.022471910 0
237 0 0.00000000 0.000000000 0.0000000 0.004314618 0.003745318 0
242 0 0.00000000 0.000000000 0.0000000 0.000000000 0.020618557 0
246 0 0.09770377 0.000000000 0.0000000 0.000000000 0.004901961 0
349 0 0.00000000 0.034909185 0.0000000 0.000000000 0.015151515 0
368 0 0.00000000 0.000000000 0.0000000 0.000000000 0.000000000 0
704 0 0.00000000 0.000000000 0.0000000 0.000000000 0.000000000 0
708 0 0.14764125 0.025600069 0.2952825 0.051200137 0.000000000 0
Terms
Docs power saudi west
144 0.000000 0.00000000 0
211 0.000000 0.00000000 0
236 0.000000 0.00000000 0
237 0.000000 0.00000000 0
242 0.000000 0.02061856 0
246 0.000000 0.00000000 0
349 0.000000 0.03030303 0
368 0.261935 0.00000000 0
704 0.000000 0.00000000 0
708 0.000000 0.00000000 0

Deliverable 6: Find the Most Frequent Terms

Now use the findFreqTerms() function to find the most frequently occurring terms in your dtm object. In the argument, select the dtm object, followed by the number of times we want a word to occur in order for it to count. For example, if you include the number 5 after the comma in the argument, the result would include all words in the dtm that occur at least five times.
If you need to, you may use the sample code below:

findFreqTerms(dtm,5)

[1] "15.8" "abdul-aziz" "ability" "accord"

[5] "agency" "agreement" "ali" "also"
[9] "analysts" "arab" "arabia" "barrel."
[13] "barrels" "billion" "bpd" "budget"
[17] "company" "crude" "daily" "demand"
[21] "dlrs" "economic" "emergency" "energy"
[25] "exchange" "expected" "exports" "futures"
[29] "government" "group" "gulf" "help"
[33] "hold" "industry" "international" "january"
[37] "kuwait" "last" "market" "may"
[41] "meeting" "minister" "mln" "month"
[45] "nazer" "new" "now" "nymex"
[49] "official" "oil" "one" "opec"
[53] "output" "pct" "petroleum" "plans"
[57] "posted" "present" "price" "prices"
[61] "prices," "prices." "production" "quota"
[65] "quoted" "recent" "report" "research"
[69] "reserve" "reuter" "said" "said."
[73] "saudi" "sell" "sheikh" "sources"
[77] "study" "traders" "u.s." "united"
[81] "west" "will" "world"

Deliverable 7: Find Terms Associated with a Specific Term

Now, let’s use the findAssocs() function to look for terms in the dtm object that are closely
associated with another specific term of interest, in this case let’s use the term “opec”. In the
argument for the findAssocs() function, after you select the dtm object, add the term “opec”,
and then a decimal point correlation value you are searching for. For example, if you added
0.8 to the argument, you would be searching for terms that have a 0.8 correlation with the
term opec.
You may use the sample code below:

findAssocs(dtm, "opec", 0.8)

$opec
meeting emergency oil 15.8 analysts buyers said ability
0.88 0.87 0.87 0.85 0.85 0.83 0.82 0.80

And now, compare the default tf weighting in the dtm object with the tf-idf weighting in the dtm2 object.
You may use the sample code below:

findAssocs(dtm2, "opec", 0.8)

$opec
emergency meeting analysts quota
0.85 0.85 0.84 0.81

Which do you find more useful?

Deliverable 8: Remove Sparse Terms

Now, to make our overall dataset smaller and more manageable, let's use the inspect() function, and as the argument, let's include the removeSparseTerms() function, with an argument that includes the dtm object we are using and the maximum allowed sparsity of the terms we want to keep. Perhaps start with 0.4. With that argument, we are removing from the dtm terms that are sparser than 0.4, that is, terms that are missing from more than 40% of the documents. When inspecting this sparse matrix, you will also see that we are using the default weighting of term frequency (tf); since we are not adding anything else to the argument, it uses the default.
If you need to, you may use the sample code below:

inspect(removeSparseTerms(dtm, 0.4))

<<DocumentTermMatrix (documents: 20, terms: 3)>>


Non-/sparse entries: 58/2
Sparsity : 3%
Maximal term length: 6
Weighting : term frequency (tf)
Sample :
Terms
Docs oil reuter said
127 5 1 1
144 11 1 9
236 7 1 6
242 3 1 3
246 4 1 4
248 9 1 5
273 5 1 5
352 5 1 1
489 4 1 2
502 4 1 2

Try exploring the difference with the tf*idf weighted object dtm2.

inspect(removeSparseTerms(dtm2, 0.4))

<<DocumentTermMatrix (documents: 20, terms: 1)>>


Non-/sparse entries: 18/2
Sparsity : 10%
Maximal term length: 4
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Sample :
Terms
Docs said
144 0.005123700
191 0.003800077
194 0.003234108
211 0.008291078
236 0.003415800
242 0.004701127
248 0.003518590
368 0.004606154
489 0.003415800
543 0.005241486

Deliverable 9: Develop a Simple Dictionary in tm

The tm package is not the best ecosystem for developing dictionaries, but it does work. For example, let's say the concept in which we are interested is crude oil prices. We can use the argument dictionary = c("prices", "crude", "oil") as part of the list() function, included in the call to inspect(DocumentTermMatrix()) where we inspect the reuters corpus. This creates the simple dictionary and applies it to the corpus. You will see which documents contain what numbers of those three terms. You could then pull those documents and explore them further (see the sketch after the output below).
If you need to, you may use the sample code below:

inspect(DocumentTermMatrix(reuters, list(dictionary = c("prices","crude","oil"))))

<<DocumentTermMatrix (documents: 20, terms: 3)>>


Non-/sparse entries: 41/19
Sparsity : 32%
Maximal term length: 6

Weighting : term frequency (tf)
Sample :
Terms
Docs crude oil prices
127 2 5 3
144 0 11 3
236 1 7 2
248 0 9 7
273 5 5 4
352 0 5 4
353 2 4 1
489 0 4 2
502 0 4 2
543 2 2 2
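
To read one of the matching documents in full, a minimal sketch is below. The document ID "127" is just one of the IDs shown in the matrix above; substitute any ID of interest:

# Pull a single document from the corpus by its ID and print its text
# ("127" is illustrative; use any document ID from the matrix above)
writeLines(as.character(reuters[["127"]]))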

Part 2: Understanding Tidyverse Dictionary Construction and Sentiment Analysis

In Part 2, we will introduce you to dictionary development and use in the tidyverse ecosystem. The tidyverse offers a more sophisticated way to approach dictionary construction, and we will use it for a popular application of dictionaries: conducting a sentiment analysis. Of course, it will require the use of tidy data.
We can use our understanding of the emotional intent of words to infer whether a section of
text is positive or negative, or even more nuanced emotions like surprise or disgust. Let’s start
by viewing the built-in sentiments dictionary from the tidytext package. The way dictionaries
are constructed is you have categories, and then words associated with that category. In the
case of sentiment dictionaries, you have the “categories” of sentiment, and then the words
associated with that sentiment. Also, in this case, the sentiment dictionary is composed of three "lexicons" (AFINN, Bing, and NRC), each with its own score or "weight" for a word if that lexicon has one. Below, call the sentiments object, then review its head, tail, and class.
You may use the sample code below:

sentiments
head(sentiments)
tail(sentiments)
class(sentiments)

# A tibble: 6,786 x 2
word sentiment

<chr> <chr>
1 2-faces negative
2 abnormal negative
3 abolish negative
4 abominable negative
5 abominably negative
6 abominate negative
7 abomination negative
8 abort negative
9 aborted negative
10 aborts negative
# i 6,776 more rows

# A tibble: 6 x 2
word sentiment
<chr> <chr>
1 2-faces negative
2 abnormal negative
3 abolish negative
4 abominable negative
5 abominably negative
6 abominate negative

# A tibble: 6 x 2
word sentiment
<chr> <chr>
1 zealous negative
2 zealously negative
3 zenith positive
4 zest positive
5 zippy positive
6 zombie negative

[1] "tbl_df" "tbl" "data.frame"

If you would like to take a look at the entire sentiments object, use the View() function in the
console (not in a code chunk).

Deliverable 10: Download Individual Lexicons within Sentiments

We can also use the get_sentiments() function from tidytext to extract the three separate lexicons in the sentiments dictionary and store them in a tidy dataframe. Each time you use the get_sentiments() function, use one of the following names of sentiment lexicons: "afinn", "bing", and "nrc".
There is an issue with the afinn lexicon (it initially stopped me from knitting because of the required interactivity with the download question). It now needs to be downloaded, and you will be presented with an option to accept the license: 1 for yes, 2 for no. You should choose 1 for yes. I'm exploring whether that only needs to be done once. NB: If it does not work, I can remove the get_sentiments("afinn") line.
You may use the sample code below:

get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")

# A tibble: 2,477 x 2
word value
<chr> <dbl>
1 abandon -2
2 abandoned -2
3 abandons -2
4 abducted -2
5 abduction -2
6 abductions -2
7 abhor -3
8 abhorred -3
9 abhorrent -3
10 abhors -3
# i 2,467 more rows

# A tibble: 6,786 x 2
word sentiment
<chr> <chr>
1 2-faces negative
2 abnormal negative
3 abolish negative
4 abominable negative
5 abominably negative
6 abominate negative

7 abomination negative
8 abort negative
9 aborted negative
10 aborts negative
# i 6,776 more rows

# A tibble: 13,872 x 2
word sentiment
<chr> <chr>
1 abacus trust
2 abandon fear
3 abandon negative
4 abandon sadness
5 abandoned anger
6 abandoned fear
7 abandoned negative
8 abandoned sadness
9 abandonment anger
10 abandonment fear
# i 13,862 more rows

These results help you to visualize different ways dictionaries can be constructed, and how you
can develop a customized dictionary. What are the key differences between how these three
lexicons are constructed?
afinn:
bing:
nrc:
Now, let's return to the Jane Austen books and perform a specific type of sentiment analysis, identifying how much "joy" is in the book Emma. We will do this by applying the "joy" elements from the nrc lexicon to Emma. This allows us to see the positive and "joyful" words in the book.

Deliverable 11: Create an object called tidy_books from the janeaustenr package

Create an object called tidy_books by assigning it the results of the austen_books() function.
In the argument, add the pipe operator, and then group_by(book), mutate(linenumber =
row_number(), chapter = cumsum(str_detect(text, regex(“^chapter [\divxlc]”, ignore_case
= TRUE))), ungroup(), and unnest_tokens(word, text).
You may use the sample code below:

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

tidy_books

# A tibble: 725,055 x 4
book linenumber chapter word
<fct> <int> <int> <chr>
1 Sense & Sensibility 1 0 sense
2 Sense & Sensibility 1 0 and
3 Sense & Sensibility 1 0 sensibility
4 Sense & Sensibility 3 0 by
5 Sense & Sensibility 3 0 jane
6 Sense & Sensibility 3 0 austen
7 Sense & Sensibility 5 0 1811
8 Sense & Sensibility 10 1 chapter
9 Sense & Sensibility 10 1 1
10 Sense & Sensibility 13 1 the
# i 725,045 more rows

Deliverable 12: Create nrcjoy Sentiment Dictionary

Create an object called nrcjoy by assigning it the results of the get_sentiments() function. In
the argument, add “nrc” to select that library, and then %>% use the filter() function, with
the sentiment == “joy” in the argument. This will create a nrcjoy dictionary made from the
sentiment “Joy” in the nrc lexicon. Then explore the nrcjoy dictionary.
You may use the sample code below:

nrcjoy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

nrcjoy

# A tibble: 687 x 2
word sentiment
<chr> <chr>
1 absolution joy

2 abundance joy
3 abundant joy
4 accolade joy
5 accompaniment joy
6 accomplish joy
7 accomplished joy
8 achieve joy
9 achievement joy
10 acrobat joy
# i 677 more rows

Deliverable 13: Applying NRC Joy Extract to Emma

Now, apply the nrcjoy extract from the nrc sentiment dictionary to the Emma book using the inner_join() function (similar to how we removed stop words with anti_join(), but working in the opposite direction).
You may use the sample code below:

tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrcjoy) %>%
count(word, sort = TRUE)

Joining with `by = join_by(word)`

# A tibble: 301 x 2
word n
<chr> <int>
1 good 359
2 friend 166
3 hope 143
4 happy 125
5 love 117
6 deal 92
7 found 92
8 present 89
9 kind 82
10 happiness 76
# i 291 more rows

The output above shows you the top ten "joyful" words in Emma.
This result is interesting, but how does the book Emma compare to other books by Jane Austen on the specific sentiment of joy?

Deliverable 14: Sentiment Analysis of Jane Austen Books

To answer this question, we will now create an object called janeaustensentiment and conduct a sentiment analysis of all the Jane Austen books. To do so, assign the janeaustensentiment object the results of the following: call tidy_books, and then use the inner_join() function with the bing lexicon, as shown below.
You may use the sample code below:

janeaustensentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

Joining with `by = join_by(word)`

Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
i Row 435434 of `x` matches multiple rows in `y`.
i Row 5051 of `y` matches multiple rows in `x`.
i If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.

Deliverable 15: Visualize Jane Austen Sentiment

Now we can visualize that sentiment analysis of the Jane Austen books using ggplot2. To
begin, use the ggplot() function, and in the argument add the janeaustensentiment object,
then add the following code to the rest of the argument:
You may use the sample code below:

ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

[Figure: faceted bar charts of net sentiment (positive minus negative) across 80-line index blocks for Sense & Sensibility, Pride & Prejudice, Mansfield Park, Emma, Northanger Abbey, and Persuasion; y-axis: sentiment, x-axis: index.]

In the resultant figure, you are seeing a ggplot2 function called facet_wrap() at work. It allows
you to compare a number of cases on selected variables. It is a very powerful visualization
technique.

Deliverable 16: Calculate and Visualize Sentiment and Words

Now we can calculate and visualize the overall positive vs. negative sentiment in the books, and see which words contribute to each. Since we still have a way to go on understanding the details of ggplot2 visualizations, I am going to provide you with the complete code below. However, I strongly encourage you to retype the code rather than cut and paste.
You may use the sample code below:

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts

View(bing_word_counts)

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()

Joining with `by = join_by(word)`

Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
i Row 435434 of `x` matches multiple rows in `y`.
i Row 5051 of `y` matches multiple rows in `x`.
i If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.

# A tibble: 2,585 x 3
word sentiment n
<chr> <chr> <int>
1 miss negative 1855
2 well positive 1523
3 good positive 1380
4 great positive 981
5 like positive 725
6 better positive 639
7 enough positive 613
8 happy positive 534
9 love positive 495
10 pleasure positive 462
# i 2,575 more rows

Selecting by n

[Figure: horizontal bar charts of the words contributing most to each sentiment (x-axis: "Contribution to sentiment"). Negative panel: miss, poor, doubt, object, sorry, impossible, afraid, scarcely, bad, anxious. Positive panel: well, good, great, like, better, enough, happy, love, pleasure, happiness.]

In the resulting figure above, we notice that the word "miss" is listed as a "negative" word, which in the context of these books is not accurate. So, let's take that word out of the dictionary.

Deliverable 17: Create a Custom Stopword Dictionary

Now, create and apply a custom stop word dictionary (or rather, add the word "miss" to the existing stop word dictionary).
You may use the sample code below:

custom_stop_words <- bind_rows(tibble(word = c("miss"), lexicon = c("custom")), stop_words)

custom_stop_words

# A tibble: 1,150 x 2
word lexicon
<chr> <chr>
1 miss custom
2 a SMART
3 a's SMART
4 able SMART

5 about SMART
6 above SMART
7 according SMART
8 accordingly SMART
9 across SMART
10 actually SMART
# i 1,140 more rows

Deliverable 18: Apply Custom Stopword Dictionary

Now apply the custom stopword dictionary and then re-run analysis and the visualization.
You may use the sample code below:

bing_word_counts %>%
anti_join(custom_stop_words) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot() +
geom_col(aes(word, n, fill = sentiment), show.legend = F) +
labs(title = "Sentiment Analysis of Jane Austen's Works",
subtitle = "Separated by Sentiment",
x = "",
y = "Contribution to Sentiment") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Set1") +
facet_wrap(~sentiment, scales = "free") +
coord_flip()

Joining with `by = join_by(word)`


Selecting by n

[Figure: "Sentiment Analysis of Jane Austen's Works", subtitled "Separated by Sentiment" (x-axis: "Contribution to Sentiment"). With "miss" removed, the negative panel shows poor, doubt, object, impossible, afraid, scarcely, bad, anxious, pain, and spite; the positive panel shows happy, love, pleasure, happiness, comfort, affection, perfectly, glad, pretty, and agreeable.]

So, the two visualizations of this analysis look different.

Deliverable 19: Data Visualization with WordClouds

Now, let’s look at the popular data visualization the WordCloud. You may encounter some
warnings here.
You may use the sample code below:

tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))

Joining with `by = join_by(word)`

Warning in wordcloud(word, n, max.words = 100): catherine could not be fit on page. It will not be plotted.

(The same warning is printed for several other words: home, feelings, subject, lady, dear, marianne, captain, sister, elinor, elizabeth, word, miss, poor, mother, edmund, replied, looked, cried, coming, return, woodhouse, day, fanny, brother, and anne.)

[Figure: word cloud of the 100 most frequent non-stopword terms across the Jane Austen novels; visible terms include emma, time, jane, house, hope, friend, heart, letter, and family.]
And, we can reshape that visualization and add some indications of positive/negative values.
You may use the sample code below:

tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20","gray80"), max.words = 100)

Joining with `by = join_by(word)`

Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
i Row 435434 of `x` matches multiple rows in `y`.
i Row 5051 of `y` matches multiple rows in `x`.
i If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.

Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words = 100): excellent could not be fit on page. It will not be plotted.

(The same warning is printed for admiration, superior, and gratitude.)

[Figure: comparison word cloud contrasting negative words (miss, poor, doubt, impossible, afraid, sorry, bad) with positive words (good, well, great, happy, love, pleasure, better).]

Part 3: Text Mining with quanteda, Including Variable Creation and Dictionaries

In Part 3, we will introduce you to text mining in the quanteda ecosystem. This part will give you the final major leg of your text mining stool for this course.
First, create an object called global_path containing a path to your UN General Assembly speech data. Remember, the path to the UN-data folder should be either a relative path if you are using the recommended project structure, or an absolute path to the folder.

global_path <- "UN-data/"

Deliverable 20: Create an Object for the UNGD Speeches

Now, use the readtext() function to go through the subfolders, named for the date of the speeches, and create an object called UNGDspeeches containing the text of all the speeches. quanteda can create metadata for each document. The argument tells R to use the filenames to find the document variables, and to create the variables country, session, and year based on the structure of the filenames. Notice how the argument pastes the global path onto the "*/*.txt" pattern, which matches every .txt file in every subfolder.
You may use the sample code below:

UNGDspeeches <- readtext(
  paste0(global_path, "*/*.txt"),
  docvarsfrom = "filenames",
  docvarnames = c("country", "session", "year"),
  dvsep = "_",
  encoding = "UTF-8"
)

UNGDspeeches
class(UNGDspeeches)

readtext object consisting of 2329 documents and 3 docvars.


# A data frame: 2,329 x 5
doc_id text country session year
<chr> <chr> <chr> <int> <int>
1 AFG_26_1971.txt "\"82.\tMr. Pr\"..." AFG 26 1971
2 ALB_26_1971.txt "\"110.\t Thi\"..." ALB 26 1971
3 ARG_26_1971.txt "\"33.\t On be\"..." ARG 26 1971
4 AUS_26_1971.txt "\"38.\t I sh\"..." AUS 26 1971
5 AUT_26_1971.txt "\"112.\t Mr.\"..." AUT 26 1971
6 BDI_26_1971.txt "\"1.\tMr. Pre\"..." BDI 26 1971
# i 2,323 more rows

[1] "readtext" "data.frame"

Deliverable 21: Generate a Corpus from UNGDspeeches

Now, generate corpus from the UNGDspeeches object and then add unique identifiers to each
document.
You may use the sample code below:

mycorpus <- corpus(UNGDspeeches)

Assigns a unique identifier to each text.

You may use the sample code below:

docvars(mycorpus, "Textno") <-
  sprintf("%02d", 1:ndoc(mycorpus))

mycorpus

Corpus consisting of 2,329 documents and 4 docvars.


AFG_26_1971.txt :
"82. Mr. President, at the outset, I wish to congratulate you..."

ALB_26_1971.txt :
"110. This session of the General Assembly is meeting at a ..."

ARG_26_1971.txt :
"33. On behalf of the Argentine Government I wish to congrat..."

AUS_26_1971.txt :
"38. I should like, on behalf of Australia,, to extend my c..."

AUT_26_1971.txt :
"112. Mr. President. I am happy to convey to you our sincer..."

BDI_26_1971.txt :
"1. Mr. President, this great Assembly made a very wise choic..."

[ reached max_ndoc ... 2,323 more documents ]

Save statistics in “mycorpus.stats”


You may use the sample code below:

mycorpus.stats <- summary(mycorpus)

head(mycorpus.stats, n=10)

Text Types Tokens Sentences country session year Textno
1 AFG_26_1971.txt 1180 4475 181 AFG 26 1971 01
2 ALB_26_1971.txt 1804 8687 263 ALB 26 1971 02
3 ARG_26_1971.txt 1495 5344 227 ARG 26 1971 03
4 AUS_26_1971.txt 1086 3857 180 AUS 26 1971 04
5 AUT_26_1971.txt 1104 3616 154 AUT 26 1971 05
6 BDI_26_1971.txt 1825 6420 266 BDI 26 1971 06
7 BEL_26_1971.txt 1312 4543 190 BEL 26 1971 07
8 BEN_26_1971.txt 781 2184 81 BEN 26 1971 08
9 BFA_26_1971.txt 1319 5035 195 BFA 26 1971 09
10 BGR_26_1971.txt 1158 4505 182 BGR 26 1971 10

Notice the variable names that were created. What are the values for all eight variables for
the country with the largest number of sentences in their speech?
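
If you would like to check your answer with code, one possible (not required) way to pull that row from the summary statistics is sketched below:

# Show the row of the corpus summary with the most sentences
# (this prints all eight variables for that speech)
mycorpus.stats[which.max(mycorpus.stats$Sentences), ]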

Deliverable 22: Preprocess the Text

Now, preprocess the text by removing stopwords, punctuation, and numbers. You will use the
tokens function to create a tokens object. It is important to remove stopwords, punctuation,
and numbers to ensure that the text is clean and ready for analysis. The tokens object is a
key object in the quanteda package that allows you to perform text analysis. It contains the
preprocessed text that will be used in the next steps to create a document-feature matrix.
You may use the sample code below:

token <-
tokens(
mycorpus,
split_hyphens = TRUE,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_url = TRUE,
include_docvars = TRUE
)

Now clean up tokens created by OCR. These were combinations of digits and characters introduced by the OCR process.
You may use the sample code below:

token_ungd <- tokens_select(
token,
c("[\\d-]", "[[:punct:]]", "^.{1,2}$"),
selection = "remove",
valuetype = "regex",
verbose = TRUE
)

Apply tokens_remove(): changed from 6,943,345 tokens (2,329 documents) to 5,481,398 tokens (2,329 documents)

Deliverable 23: Tokenize the Dataset by N-Grams

Now tokenize the dataset by n-grams, in this case finding all phrases 2-4 words in length, using the tokens_ngrams() function and specifying n = 2:4 in the argument.
You may use the sample code below:

toks_ngram <- tokens_ngrams(token, n = 2:4)

head(toks_ngram[[1]], 30)
tail(toks_ngram[[1]], 30)

[1] "Mr_President" "President_at" "at_the"


[4] "the_outset" "outset_I" "I_wish"
[7] "wish_to" "to_congratulate" "congratulate_you"
[10] "you_whole" "whole_heartedly" "heartedly_on"
[13] "on_your" "your_election" "election_as"
[16] "as_President" "President_of" "of_the"
[19] "the_General" "General_Assembly" "Assembly_the"
[22] "the_most" "most_esteemed" "esteemed_and"
[25] "and_highest" "highest_international" "international_post"
[28] "post_Our" "Our_congratulations" "congratulations_do"

[1] "inside_and_outside_the" "and_outside_the_United"


[3] "outside_the_United_Nations" "the_United_Nations_Only"
[5] "United_Nations_Only_then" "Nations_Only_then_will"
[7] "Only_then_will_mankind" "then_will_mankind_be"
[9] "will_mankind_be_confident" "mankind_be_confident_enough"
[11] "be_confident_enough_to" "confident_enough_to_look"
[13] "enough_to_look_forward" "to_look_forward_hopefully"
[15] "look_forward_hopefully_to" "forward_hopefully_to_seeing"

[17] "hopefully_to_seeing_a" "to_seeing_a_world"
[19] "seeing_a_world_united" "a_world_united_in"
[21] "world_united_in_order" "united_in_order_to"
[23] "in_order_to_achieve" "order_to_achieve_its"
[25] "to_achieve_its_common" "achieve_its_common_goals"
[27] "its_common_goals_of" "common_goals_of_peace"
[29] "goals_of_peace_and" "of_peace_and_prosperity"

Deliverable 24: Create a Document Feature Matrix

Now create a Document Feature Matrix using the dfm() function (the quanteda equivalent of a DTM), preprocessing the corpus in the process. The dfm() function creates a document-feature matrix from a tokens object; the resulting dfm is the key object on which most quanteda analyses, including the dictionary lookups below, are performed.
You may use the sample code below:

mydfm <- dfm(token_ungd,
             tolower = TRUE)
mydfm <- dfm_remove(mydfm, pattern = stopwords("english"))
mydfm <- dfm_wordstem(mydfm)

Deliverable 25: Trim the DFM

Now we can trim the text in the dfm. We are going to keep only features that appear in at least 7.5% and at most 90% of the documents. This approach works because we have a sufficiently large corpus.
You may use the sample code below:

mydfm.trim <-
dfm_trim(
mydfm,
min_docfreq = 0.075,
max_docfreq = 0.90,
docfreq_type = "prop"
)

Now, let's get a look at the DFM by printing the first 10 documents, sorted by frequency:
You may use the sample code below:

head(dfm_sort(mydfm.trim, decreasing = TRUE, margin = "both"),
n = 10,
nf = 10)

Warning: nf argument is not used.

Document-feature matrix of: 10 documents, 1,960 features (51.71% sparse) and 4 docvars.
features
docs problem region conflict africa global council hope situat
CUB_34_1979.txt 36 16 1 13 3 0 8 23
BFA_29_1974.txt 25 20 1 15 0 4 20 9
PRY_38_1983.txt 30 7 12 0 3 3 10 16
LBY_64_2009.txt 5 1 3 8 2 76 3 5
LUX_35_1980.txt 21 19 13 11 10 12 9 15
DEU_38_1983.txt 11 12 11 7 12 3 10 6
features
docs resolut relat
CUB_34_1979.txt 10 18
BFA_29_1974.txt 10 8
PRY_38_1983.txt 21 5
LBY_64_2009.txt 13 1
LUX_35_1980.txt 13 17
DEU_38_1983.txt 4 23
[ reached max_ndoc ... 4 more documents, reached max_nfeat ... 1,950 more features ]

Which country refers most to the economy in this snapshot of the data?

Deliverable 26: Text Classification Using a Dictionary

Text classification using a dictionary is appropriate when the categories are known in advance. A dictionary allows us to create lists of words that correspond to different categories. In this example, we will use the "LexiCoder Policy Agenda" dictionary.
You may use the sample code below:

dict <- dictionary(file = "policy_agendas_english.lcd")

Deliverable 27: Apply Dictionary

Now we can apply the dictionary to filter the share of each country’s speeches on immigration,
international affairs, and defense.
You may use the sample code below:

# Step 1: Create the DFM without grouping or applying the dictionary
mydfm.un <- dfm(mydfm.trim)

# Step 2: Apply the dictionary using dfm_lookup()
mydfm.un <- dfm_lookup(mydfm.un, dictionary = dict)

# Step 3: Group the DFM by "country" using dfm_group()
mydfm.un <- dfm_group(mydfm.un, groups = docvars(mydfm.un, "country"))

Deliverable 28: Convert the DFM to a Data Frame

Now we can convert the DFM to a data frame for further analysis and visualization. We will
reshape the data frame to have one row per country and columns for each topic (immigration,
international affairs, defense).
You may use the sample code below:

un.topics.pa <- convert(mydfm.un, "data.frame") %>%
  dplyr::rename(country = doc_id) %>%
  select(country, immigration, intl_affairs, defence) %>%
  tidyr::gather(immigration:defence, key = "Topic", value = "Share") %>%
  group_by(country) %>%
  mutate(Share = Share / sum(Share)) %>%
  mutate(Topic = haven::as_factor(Topic))

Deliverable 29: Visualize the Results

Now we can visualize the results using a bar plot. This will show the distribution of policy
agenda topics in the UN General Debate corpus.
You may use the sample code below:

un.topics.pa %>%
  ggplot(aes(country, Share, colour = Topic, fill = Topic)) +
  geom_bar(stat = "identity") +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Pastel1") +
  ggtitle("Distribution of PA topics in the UN General Debate corpus") +
  xlab("") +
  ylab("Topic share (%)") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

[Figure: stacked bar chart titled "Distribution of PA topics in the UN General Debate corpus", showing the per-country shares of the immigration, intl_affairs, and defence topics; y-axis: "Topic share (%)", with country labels suppressed on the x-axis.]
# A tibble: 600 x 3
# Groups: country [200]
country Topic Share
<chr> <fct> <dbl>
1 AFG immigration 0.0635
2 AGO immigration 0.0484
3 ALB immigration 0.0482
4 AND immigration 0.0804
5 ARE immigration 0.0759
6 ARG immigration 0.048
7 ARM immigration 0.0952

8 ATG immigration 0.0296
9 AUS immigration 0.0361
10 AUT immigration 0.0137
# i 590 more rows

Part 4: Using nltk and TextBlob to conduct sentiment analysis in Python

In Part 4, we will introduce you to conducting sentiment analysis in the nltk package in
Python.
This part of the lab will be conducted in Python. It will assume you have already set up
your Python environment and have installed the necessary packages. If you have not done
so, please refer to the instructions in Lab 1. Remember, we are recommending that you use
Anaconda to manage your Python environment. Please make sure you have the nltk package
installed. It contains the built-in text dataset we will use for part of the sentiment analysis.
We will first explore lexicon and rule-based sentiment analysis, and then explore the use of
VADER for sentiment analysis.
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is fully open-sourced under the MIT License. VADER reports not only whether a text is positive or negative, but also how strongly positive or negative the sentiment is. It is built on top of the NLTK package, is well-suited for analyzing social media texts, and is faster than many other sentiment analysis techniques.
To begin, let’s focus on building a custom lexicon for sentiment analysis and apply it to a
short dataset of sample sentences.

Deliverable 30: Creating a Custom Lexicon and Applying it to a Sample Dataset

We will create a custom lexicon to analyze the sentiment of a small dataset of sample sentences.
We will use the SentimentIntensityAnalyzer from the nltk.sentiment package to conduct the
sentiment analysis. We will also import the pandas package as pd to work with the data.
First, we will define five categories we want to include in our custom lexicon. We will then
define a list of words for each category. We will then create a dictionary with the words and
their associated categories. We will then create a dataframe from the dictionary.
You may use the following sample code to create the custom lexicon (be sure to enclose your
categories and lists in curly braces)

custom_lexicon = {
    'positive': ['good', 'great', 'awesome', 'fantastic', 'terrific'],
    'negative': ['bad', 'terrible', 'awful', 'dreadful', 'horrible'],
    'neutral': ['okay', 'alright', 'fine', 'decent', 'satisfactory'],
    'uncertain': ['maybe', 'perhaps', 'possibly', 'probably', 'likely'],
    'conjunctions': ['and', 'but', 'or', 'so', 'yet']
}

If you run into an error, you may need to download ‘punkt_tab’ or ‘punkt’ from nltk. You
can do this by running the following code:

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

True

Now we will create a simple function to preprocess and tokenize the text.
You may use the following sample code:

def preprocess_and_tokenize(text):
    text = text.lower()
    tokens = text.split()
    return tokens

Now we will create a function to apply the custom lexicon to the text.
You may use the following sample code:

def categorize_text(text, lexicon):
    tokens = preprocess_and_tokenize(text)
    categories = {category: 0 for category in lexicon}

    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                categories[category] += 1
    return categories

Now, let’s apply the custom lexicon to a small dataset of sample sentences.
You may use the following sample code:

sample_texts = [
'The movie was good and the acting was great.',
'The movie was terrible and the acting was dreadful.',
'The movie was okay and the acting was satisfactory.',
'The movie was perhaps good and the acting was probably great.',
'The movie was fine and the acting was decent.',
'The movie was good but the acting was terrible.',
'The movie was good or the acting was bad.',
'The movie was good so the acting was bad.',
'The movie was good yet the acting was bad.'
]

for text in sample_texts:
    categorize = categorize_text(text, custom_lexicon)
    print(categorize)

{'positive': 1, 'negative': 0, 'neutral': 0, 'uncertain': 0, 'conjunctions': 1}


{'positive': 0, 'negative': 1, 'neutral': 0, 'uncertain': 0, 'conjunctions': 1}
{'positive': 0, 'negative': 0, 'neutral': 1, 'uncertain': 0, 'conjunctions': 1}
{'positive': 1, 'negative': 0, 'neutral': 0, 'uncertain': 2, 'conjunctions': 1}
{'positive': 0, 'negative': 0, 'neutral': 1, 'uncertain': 0, 'conjunctions': 1}
{'positive': 1, 'negative': 0, 'neutral': 0, 'uncertain': 0, 'conjunctions': 1}
{'positive': 1, 'negative': 0, 'neutral': 0, 'uncertain': 0, 'conjunctions': 1}
{'positive': 1, 'negative': 0, 'neutral': 0, 'uncertain': 0, 'conjunctions': 1}
{'positive': 1, 'negative': 0, 'neutral': 0, 'uncertain': 0, 'conjunctions': 1}

Deliverable 31: Adding N-Grams to the Custom Lexicon

Now we will add n-grams to the custom lexicon. We will add bigrams and trigrams to the
custom lexicon. We will then apply the custom lexicon to the sample sentences. To do so,
just add the n-grams to the appropriate category lists alongside the unigrams (or individual
words).
To add the n-grams to our initial custom lexicon, we will adjust it by adding bigrams to
the lists of words in the categories. We will then apply the custom lexicon to our sample
sentences.
You may use the following sample code (remember to enclose your categories and lists in curly
braces):

custom_lexicon = {
    'positive': ['good', 'great', 'awesome', 'fantastic', 'terrific', 'good and', 'great and'],
    'negative': ['bad', 'terrible', 'awful', 'dreadful', 'horrible', 'bad and', 'terrible and'],
    'neutral': ['okay', 'alright', 'fine', 'decent', 'satisfactory', 'okay and', 'alright and'],
    'uncertain': ['maybe', 'perhaps', 'possibly', 'probably', 'likely', 'maybe and', 'perhaps and'],
    'conjunctions': ['and', 'but', 'or', 'so', 'yet', 'but and', 'or and', 'so and', 'yet and']
}

Deliverable 32: Applying the Custom Lexicon with N-Grams to the Sample Sentences

When analyzing your text for the presence of both single words (unigrams) and n-grams, you
will need a tokenization step that produces the n-grams as distinct tokens. The updated
preprocess_and_tokenize() function below generates unigrams, bigrams, and trigrams, and the
updated categorize_text() function then applies the custom lexicon to all of them.
You may use the following sample code (remember that the dictionary comprehension category: 0
for category in lexicon must be enclosed in curly braces):

from nltk.util import ngrams

def preprocess_and_tokenize(text):
    # Convert to lowercase
    text = text.lower()
    # Tokenize by splitting on whitespace
    tokens = text.split()
    # Generate n-grams (up to trigrams in this example)
    all_tokens = tokens + [' '.join(gram) for gram in ngrams(tokens, 2)] + \
                 [' '.join(gram) for gram in ngrams(tokens, 3)]
    return all_tokens

def categorize_text(text, lexicon):
    tokens = preprocess_and_tokenize(text)
    # Initialize counts
    categories = {category: 0 for category in lexicon}
    for token in tokens:
        for category, phrases in lexicon.items():
            if token in phrases:
                categories[category] += 1
    return categories
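
To complete the deliverable, you can re-apply the categorizer to the same sample sentences.
This short sketch assumes the sample_texts list and the bigram-augmented custom_lexicon
defined above are still in your session:

# Re-run the categorization, now counting unigrams, bigrams, and trigrams.
for text in sample_texts:
    print(categorize_text(text, custom_lexicon))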

Now, let’s move to exploring VADER for sentiment analysis.

Deliverable 33: Downloading NLTK Data and Preparing the Dataset

We will use the movie_reviews dataset from the nltk package, which contains 2,000 movie
reviews, each labeled as positive or negative. Because VADER is a lexicon- and rule-based
analyzer, we do not train a model on these reviews; instead, we will use the
SentimentIntensityAnalyzer from the nltk.sentiment package to score each review and compare
the results with the labels. We will import the nltk package, the movie_reviews dataset from
nltk.corpus, the SentimentIntensityAnalyzer from nltk.sentiment, and the pandas package as pd
to work with the data.
However, before we try this on the complete dataset, let's first run a sanity check of the VADER
Sentiment Intensity Analyzer (SIA) on a short sample sentence. You may try the sample code
below:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download VADER lexicon
nltk.download('vader_lexicon')

# Initialize VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Sample text
text = "I love this product! It's absolutely amazing :)"

# Get sentiment scores
sentiment = sia.polarity_scores(text)
print(sentiment)

True

{'neg': 0.0, 'neu': 0.252, 'pos': 0.748, 'compound': 0.9163}
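
The compound value is VADER's normalized overall score, ranging from -1 (most negative) to
+1 (most positive). A commonly cited convention, which the lab itself does not use (the code
later in this deliverable simply splits at zero), is to treat compound scores of at least 0.05 as
positive and at most -0.05 as negative, with everything in between as neutral. A minimal
illustrative helper, assuming the sentiment dictionary from the sanity check above, might look
like this:

# Illustrative helper (not part of the lab code): label a compound score using
# the commonly cited 0.05 / -0.05 thresholds.
def label_compound(compound, threshold=0.05):
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'

print(label_compound(sentiment['compound']))  # 'pos' for the sample sentence above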

Now that we can see how well VADER can predict sentiment on a small sample, we can move
on to larger datasets.
You may use the following sample code to import the necessary packages and load the data.

import nltk
from nltk.corpus import movie_reviews
from nltk.sentiment import SentimentIntensityAnalyzer
import pandas as pd

Loading and Preparing the Dataset
Now we will load the movie_reviews data, which is built into the nltk package, and the
vader_lexicon.
You may use the following sample code:

nltk.download('movie_reviews')
nltk.download('vader_lexicon')

True

True

For our primary text data, we are using the movie_reviews dataset from the nltk package.
This dataset contains 2000 movie reviews, each labeled as positive or negative. We will now
set up the data for analysis.
You may use the following sample code:

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

Now we want to convert the data into a pandas dataframe.
You may use the following sample code:

reviews = pd.DataFrame(documents, columns=['text', 'sentiment'])
reviews['text'] = reviews['text'].apply(lambda x: ' '.join(x))

Deliverable 34: Display the first Five Rows of the Reviews Dataframe

Now display the first and last five rows of the reviews dataframe, using the print() function
together with the head() and tail() methods.
You may use the following sample code:

print(reviews.head())
print(reviews.tail())

text sentiment
0 plot : two teen couples go to a church party ,... neg
1 the happy bastard ' s quick movie review damn ... neg
2 it is movies like these that make a jaded movi... neg
3 " quest for camelot " is warner bros . ' first... neg
4 synopsis : a mentally unstable man undergoing ... neg

text sentiment
1995 wow ! what a movie . it ' s everything a movie... pos
1996 richard gere can be a commanding actor , but h... pos
1997 glory -- starring matthew broderick , denzel w... pos
1998 steven spielberg ' s second epic film on world... pos
1999 truman ( " true - man " ) burbank is the perfe... pos
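
Before scoring the reviews, it can be useful to confirm the class balance. The movie_reviews
corpus should contain 1,000 positive and 1,000 negative reviews; a quick optional check is:

# Count the labeled reviews in each class (expect 1000 'pos' and 1000 'neg').
print(reviews['sentiment'].value_counts())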

Deliverable 35: Sentiment Analysis with VADER

You may use the following sample code to conduct the sentiment analysis of the Movie Review
dataset with VADER:

# Score each review with VADER, extract the compound score, and derive a pos/neg label
sid = SentimentIntensityAnalyzer()
reviews['scores'] = reviews['text'].apply(lambda review: sid.polarity_scores(review))
reviews['compound'] = reviews['scores'].apply(lambda score_dict: score_dict['compound'])
reviews['comp_score'] = reviews['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')

Now we will display the sentiment analysis results. We will display the text, sentiment, com-
pound, and comp_score columns of the reviews dataframe.
You may use the following sample code:

print(reviews[['text', 'sentiment', 'compound', 'comp_score']].head())

text ... comp_score
0 plot : two teen couples go to a church party ,... ... pos
1 the happy bastard ' s quick movie review damn ... ... pos
2 it is movies like these that make a jaded movi... ... pos
3 " quest for camelot " is warner bros . ' first... ... neg
4 synopsis : a mentally unstable man undergoing ... ... pos

[5 rows x 4 columns]
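
Because the original labels are still in the dataframe, you can also get a rough sense of how often
VADER's comp_score agrees with them. This is an optional check, not a lab deliverable:

# Optional: fraction of reviews where VADER's comp_score matches the original label.
accuracy = (reviews['comp_score'] == reviews['sentiment']).mean()
print(f"Agreement with the labeled sentiment: {accuracy:.3f}")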

Deliverable 36: Quick Exploration of Sentiment Analysis in TextBlob

TextBlob is another popular library for sentiment analysis. It is built on top of NLTK and
provides a simple API for common natural language processing (NLP) tasks. TextBlob uses a
sentiment lexicon to assign polarity scores to text.
If you have not done so already, you will need to install TextBlob by running the following
command in your terminal:

conda install -c conda-forge textblob

or by installing TextBlob using the Anaconda Navigator GUI.
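
If you manage Python packages with pip rather than conda, the equivalent command is:

pip install textblob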


You may use the following sample code to conduct sentiment analysis with TextBlob:

import nltk
from textblob import TextBlob
nltk.download('gutenberg')

# Load text
from nltk.corpus import gutenberg
text = gutenberg.raw('austen-emma.txt')

# Split into sentences
sentences = nltk.sent_tokenize(text)

# Analyze sentiments of first 25 sentences
for sentence in sentences[:25]:
    blob = TextBlob(sentence)
    print(f"Sentence: {sentence}\nPolarity: {blob.sentiment.polarity}\n")

True

Sentence: [Emma by Jane Austen 1816]

VOLUME I

CHAPTER I

Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.
Polarity: 0.3872395833333333

Sentence: She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.
Polarity: 0.315

Sentence: Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.
Polarity: 0.2525

Sentence: Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.
Polarity: 0.06666666666666667

Sentence: Between _them_ it was more the intimacy
of sisters.
Polarity: 0.5

Sentence: Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness of her temper had hardly allowed
her to impose any restraint; and the shadow of authority being
now long passed away, they had been living together as friend and
friend very mutually attached, and Emma doing just what she liked;
highly esteeming Miss Taylor's judgment, but directed chiefly by
her own.
Polarity: 0.20305555555555554

Sentence: The real evils, indeed, of Emma's situation were the power of having
rather too much her own way, and a disposition to think a little
too well of herself; these were the disadvantages which threatened
alloy to her many enjoyments.
Polarity: 0.2625

Sentence: The danger, however, was at present
so unperceived, that they did not by any means rank as misfortunes
with her.
Polarity: -0.4

Sentence: Sorrow came--a gentle sorrow--but not at all in the shape of any
disagreeable consciousness.--Miss Taylor married.
Polarity: 0.225

Sentence: It was Miss
Taylor's loss which first brought grief.
Polarity: -0.275

Sentence: It was on the wedding-day
of this beloved friend that Emma first sat in mournful thought
of any continuance.
Polarity: 0.475

Sentence: The wedding over, and the bride-people gone,
her father and herself were left to dine together, with no prospect
of a third to cheer a long evening.
Polarity: -0.016666666666666666

Sentence: Her father composed himself
to sleep after dinner, as usual, and she had then only to sit
and think of what she had lost.
Polarity: -0.125

Sentence: The event had every promise of happiness for her friend.
Polarity: 0.7

Sentence: Mr. Weston
was a man of unexceptionable character, easy fortune, suitable age,
and pleasant manners; and there was some satisfaction in considering
with what self-denying, generous friendship she had always wished
and promoted the match; but it was a black morning's work for her.
Polarity: 0.3875

Sentence: The want of Miss Taylor would be felt every hour of every day.
Polarity: 0.0

Sentence: She recalled her past kindness--the kindness, the affection of sixteen
years--how she had taught and how she had played with her from five
years old--how she had devoted all her powers to attach and amuse
her in health--and how nursed her through the various illnesses
of childhood.
Polarity: -0.125

Sentence: A large debt of gratitude was owing here; but the
intercourse of the last seven years, the equal footing and perfect
unreserve which had soon followed Isabella's marriage, on their
being left to each other, was yet a dearer, tenderer recollection.
Polarity: 0.18154761904761904

Sentence: She had been a friend and companion such as few possessed: intelligent,
well-informed, useful, gentle, knowing all the ways of the family,
interested in all its concerns, and peculiarly interested in herself,
in every pleasure, every scheme of hers--one to whom she could speak
every thought as it arose, and who had such an affection for her
as could never find fault.
Polarity: 0.2

Sentence: How was she to bear the change?--It was true that her friend was
going only half a mile from them; but Emma was aware that great must
be the difference between a Mrs. Weston, only half a mile from them,
and a Miss Taylor in the house; and with all her advantages,
natural and domestic, she was now in great danger of suffering
from intellectual solitude.
Polarity: 0.20606060606060606

Sentence: She dearly loved her father, but he
was no companion for her.
Polarity: 0.7

Sentence: He could not meet her in conversation,
rational or playful.
Polarity: 0.0

Sentence: The evil of the actual disparity in their ages (and Mr. Woodhouse had
not married early) was much increased by his constitution and habits;
for having been a valetudinarian all his life, without activity
of mind or body, he was a much older man in ways than in years;
and though everywhere beloved for the friendliness of his heart
and his amiable temper, his talents could not have recommended him
at any time.
Polarity: 0.005952380952380947

Sentence: Her sister, though comparatively but little removed by matrimony,
being settled in London, only sixteen miles off, was much beyond
her daily reach; and many a long October and November evening must
be struggled through at Hartfield, before Christmas brought the next
visit from Isabella and her husband, and their little children,
to fill the house, and give her pleasant society again.
Polarity: 0.11203703703703703

Sentence: Highbury, the large and populous village, almost amounting to a town,
to which Hartfield, in spite of its separate lawn, and shrubberies,
and name, did really belong, afforded her no equals.
Polarity: 0.20714285714285713

Deliverable 37: Sentiment Analysis on the UN Data with TextBlob

Now, let’s try the TextBlob sentiment analysis on the UN data. We will analyze the sentiment
of each text file in the ‘UN-data’ folder and print the sentiment polarity and subjectivity.
You may use the following sample code:

import os
from textblob import TextBlob

# Path to the folder containing the text files
folder_path = 'UN-data'

# List to store the results
results = []

# Walk through all subdirectories and files
for root, dirs, files in os.walk(folder_path):
    for filename in files:
        if filename.endswith('.txt'):
            file_path = os.path.join(root, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                text = file.read()
            # Create a TextBlob object
            blob = TextBlob(text)
            # Get sentiment polarity and subjectivity
            polarity = blob.sentiment.polarity
            subjectivity = blob.sentiment.subjectivity
            # Store the result
            results.append({
                'file_path': file_path,
                'polarity': polarity,
                'subjectivity': subjectivity
            })

# Print the results from the first five files
for result in results[:5]:
    print(f"File: {result['file_path']}")
    print(f"Polarity: {result['polarity']}")
    print(f"Subjectivity: {result['subjectivity']}")
    print('---')

File: UN-data/Session 73 - 2018/BRB_73_2018.txt
Polarity: 0.056829403519977284
Subjectivity: 0.42787665886026544
---
File: UN-data/Session 73 - 2018/IND_73_2018.txt
Polarity: 0.12719839684125395
Subjectivity: 0.48171411921411905
---
File: UN-data/Session 73 - 2018/ARG_73_2018.txt
Polarity: 0.10939538239538237
Subjectivity: 0.39115632515632504
---
File: UN-data/Session 73 - 2018/JOR_73_2018.txt
Polarity: 0.07273518148518145
Subjectivity: 0.4263486513486513
---
File: UN-data/Session 73 - 2018/SWE_73_2018.txt
Polarity: 0.09259146767211283
Subjectivity: 0.39663748079877126
---
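
If you want to compare the speeches more systematically, one option (not required by the lab) is
to load the results list into a pandas dataframe and sort it by polarity:

import pandas as pd

# Collect the TextBlob results and show the five most positive speeches.
results_df = pd.DataFrame(results)
print(results_df.sort_values('polarity', ascending=False).head())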

*** End of Lab ***


Please render your Quarto file to pdf and submit to the assignment for Lab 5 within Canvas.
