
Lab 8: Natural Language Processing (NLP), Parts of Speech Tagging (POS), and Named Entity Recognition (NER)
Dr. Derrick L. Cogburn

2024-10-20

Table of contents

1 Lab Overview
1.1 Technical Learning Objectives
1.2 Business Learning Objectives
2 Assignment Overview
3 Pre-Lab Instructions
4 Lab Instructions
4.1 Deliverable 1: Get your working directory and paste below:
5 Part 1: Parts of Speech (POS) Tagging in openNLP
5.1 Deliverable 2: Create Some Sample Text
5.2 Deliverable 3: Create Sentence and Word Token Annotations
5.3 Deliverable 3: Extract Tokens/POS Pairs
6 Part 2: Applying Named Entity Recognition (NER) to the Clinton Email Dataset
6.1 Deliverable 4: Import Data and Organize the Clinton Email Dataset
6.2 Deliverable 5: Examine the Clinton Email Dataset
6.3 Deliverable 6: Create a Custom Function to Clean the Emails
6.4 Deliverable 7: Apply the Cleaning Function to the Clinton Email Dataset
6.5 Deliverable 8: Apply POS Tagging to the Clinton Email Dataset
6.6 Deliverable 9: Annotate the Clinton Email Dataset Using a For Loop
6.7 Deliverable 10: Extract Terms from the Clinton Email Dataset
6.8 Deliverable 11: Subset the Clinton Email Dataset
6.9 Deliverable 12: Using the Annotate Entities Process
6.10 Deliverable 13: Annotate Entities with OpenNLP
6.11 Deliverable 14: Apply the Annotation Function
7 Part 3: Conducting NLP Analysis with spacyr
7.1 Deliverable 15: Load the spacyr Package
7.2 Deliverable 16: Review the Parsed Text
7.3 Deliverable 17: Parse the txt Object with spacyr
7.4 Deliverable 18: Tokenize Text into a Data Frame
7.5 Deliverable 19: Extract Named Entities From Parsed Text
7.6 Deliverable 20: Extract Extended Entity Set
7.7 Deliverable 21: Consolidate Named Entities
7.8 Deliverable 22: Extract Noun Phrases
7.9 Deliverable 23: Extract Additional Attributes
7.10 Deliverable 24: Integrating spacyr Output with quanteda
7.11 Deliverable 25: Converting spacyr Tokens for quanteda
7.12 Deliverable 26: Using spacyr with tidytext
7.13 Deliverable 27: POS Filtering
7.14 Deliverable 28: Finalizing the SpaCy Connection
8 Part 4: Conducting NLP Analysis in Python
8.1 Deliverable 29: Importing the nltk Package and Downloading Necessary POS Taggers
8.2 Deliverable 30: Downloading the Necessary Datasets
8.3 Deliverable 31: Tokenizing the Text
8.4 Deliverable 32: Perform POS Tagging and NER
8.5 Deliverable 33: Perform NER
8.6 Deliverable 34: Visualizing the Results

1 Lab Overview

1.1 Technical Learning Objectives

1. Understand text parsing and Parts of Speech (POS) tagging


2. Understand the concept of Maximum Entropy (maxent) and how to conduct POS tagging using the various Maxent Annotator options
3. Understand how to interpret the Penn Treebank POS tag codes
4. Become familiar with some of the main NLP packages available for R (such as NLP, openNLP, coreNLP, and cleanNLP) and the powerful integration with the Python spaCy NLP library via reticulate and spacyr, with the option to use Python directly, especially the nltk and spaCy libraries.
5. Understand NLP in quanteda and tidytext.

1.2 Business Learning Objectives

1. Understand NLP alternative to the “Bag of Words” approach to text mining


2. Understand what you gain and lose in POS tagging and NLP approaches.
3. Understand the power of Named Entity Recognition (NER) for identifying dates, locations, money, organizations, people, percentages, and much more in a text corpus
4. Understand how to integrate the Python spaCy library into R using reticulate and the spacyr package, or use it natively in Python.
5. Understand how to select the NLP packages to use for your needs

2 Assignment Overview

In this lab, you will transition from our statistical text mining approach, using Bag-of-Words
(BoW) into a focus on Natural Language Processing (NLP) techniques. You will learn how to
parse text and do Parts of Speech (POS) tagging required for NLP. You will use that tagged
text to do Named Entity Recognition (NER). You will learn how to use the spaCy library
in Python and how to integrate it into R using the reticulate and spacyr packages. You will
also learn about the main NLP packages available for R, such as NLP, openNLP, coreNLP,
and cleanNLP, and how to use them. Finally, you will learn about NLP in quanteda and
tidytext.
Remember, the remaining three labs build on what you have learned and contain more
advanced techniques. While these techniques are very exciting, and some of you may choose to
incorporate them into your final projects, these labs are not required to meet the MVP for this
course. To receive minimum credit for the lab, you need only review the Lab Instructions and
submit a basic RMD file knitted to PDF. However, many of you want to learn and practice
these techniques. If so, these three advanced labs are for you. To receive full credit for the lab
please try any of the techniques included and submit the knitted PDF.
There are four main parts to the lab:
Part 1: Parts of Speech Tagging, will guide you through the process of parsing your text
and tagging the relevant parts of speech using openNLP, cleanNLP, and coreNLP.
Part 2: Applying Named Entity Recognition, introduces you to the concept of Maximum
Entropy (Maxent) and applies the maxent annotations for persons, locations, organizations,
and several others to the corpus.
Part 3: Conducting NLP Analysis with spacyr, provides an introduction to NLP analysis in the quanteda and tidytext ecosystems. In quanteda and tidytext, NLP requires a tight integration with spacyr, the R environment for the powerful Python spaCy library.
Part 4: Conducting NLP Analysis in Python, provides an introduction to NLP analysis
in Python using the nltk package and some of the built-in datasets.

3 Pre-Lab Instructions

Pre-Lab Instructions (to be completed before class):


Create Your Lab 8 Project in RStudio and a new Quarto document with a relevant title for
the lab, for example: “Lab 8: NLP, POS, and NER in R and Python”.
Download the lab files from the course GitHub repo and save them in your Lab 8 project folder. The materials for Lab 8 are contained in a zipped file with the Lab 8 instructions and the Lab 8 data, which sits in a folder named "C8_final_txts" and contains several unstructured files ending with a .txt file extension. Unzip the file and save the files in your Lab 8 project folder.
Now begin working your way through the Lab 8 instructions taking a literate programming
approach. Remember to use your Quarto document to write your code and your explanations.
Use your own words to explain what you are doing and why you are doing it. You may
complete the pre-lab instructions or lab before class, or wait until class on Wednesday.
As usual, plan to submit your rendered pdf file, with your Quarto YAML header set to:

echo: true
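For example, a minimal Quarto YAML header consistent with this setting might look like the following (the title and format values are illustrative; adjust them to your own document):

---
title: "Lab 8: NLP, POS, and NER in R and Python"
format: pdf
execute:
  echo: true
---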

For this lab you will be using several new packages in R and Python. You will need to install the following packages in R: openNLP, sentimentr, coreNLP, cleanNLP, magrittr, NLP, gridExtra, ggthemes, purrr, doBy, cshapes, rJava, sotu, spacyr, tinytex, and sf. Here again is a way to install multiple packages at once using the concatenate function c() in R:

install.packages(c("openNLP", "sentimentr", "coreNLP", "cleanNLP", "magrittr", "NLP",
                   "gridExtra", "ggthemes", "purrr", "doBy", "cshapes", "rJava",
                   "sotu", "spacyr", "tinytex", "sf"))

If you haven't done so already, you need to install openNLPmodels.en from an outside repository. Use the install.packages() function with "openNLPmodels.en" in the argument, but for repos = add "http://datacube.wu.ac.at/" and for type = add "source". Remember to put a comma between the elements of your argument.

#install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at/", type = "source")

Now, add an R code chunk to set up your R system environment as follows:

options(stringsAsFactors = FALSE)
Sys.setlocale("LC_ALL","C")
lib.loc <- "~/R/win-library/3.2" # optional: path to your local R library, if needed

[1] "C/C/C/C/C/en_US.UTF-8"

You will also need to install the spacyr package in R. You can install the spacyr package from
CRAN using the install.packages() function.

install.packages("spacyr")

You will need to load the required libraries in a specific order. Begin by loading the following packages: gridExtra, ggmap, ggthemes, and sf.
These packages must be loaded before openNLP because ggplot2 (which loads when ggthemes loads) has a conflict with openNLP over a function called annotate(), and we want to use the function from openNLP (not ggplot2).
Then load the following remaining packages: NLP, openNLP, openNLPmodels.en, pbapply, stringr, rvest, doBy, tm, cshapes, purrr, dplyr, spacyr, and tinytex.
These are some additional packages you may want to load: rJava, coreNLP, sentimentr, cleanNLP, dplyr, and magrittr.
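A sketch of that loading sequence is below (a minimal sketch; load only the packages you have actually installed):

# Load these first so that openNLP's annotate() masks ggplot2's annotate()
library(gridExtra)
library(ggmap)
library(ggthemes)
library(sf)

# Then load the NLP-related and remaining packages
library(NLP)
library(openNLP)
library(openNLPmodels.en)
library(pbapply)
library(stringr)
library(rvest)
library(doBy)
library(tm)
library(cshapes)
library(purrr)
library(dplyr)
library(spacyr)
library(tinytex)

# Optional additional packages: rJava, coreNLP, sentimentr, cleanNLP, magrittr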
You will also need to install the following packages into your Python environment: nltk,
pandas, numpy, and matplotlib. You can install these packages using Anaconda Navigator.
Just a note on rJava
When you load rJava, you may have an issue with not having a Java Virtual Machine (JVM)
for Java Runtime Environment (JRE) loaded on your computer. If so, you may go to Java.com
to download the appropriate version. Here is the link to install on a Mac: https://www.java.
com/en/download/mac_download.jsp. However, please note, openNLP may not be compatible
with the most recent version of Java 8 (12), so you may need to use the last stable version of
Java 8 (11). This issue may be resolved by the time you work on this lab.
Preparing for the spacyr part of the lab.
To use spacyr, first install the spacyr package. This process also installs the various dependencies for spacyr, including RcppTOML, here, and reticulate. The first time you install this package, it may take a little while, and may present some challenges.
You may install the latest version of spacyr from GitHub, or install from CRAN. If installing
from GitHub, the following is suggested code:

devtools::install_github("quanteda/spacyr", build_vignettes = FALSE)

Now, load the spacyr package.
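For example (a minimal sketch):

library(spacyr)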


Now, you will want to “initialize” spacy, but before you can do that, you have to make sure you
have a miniconda environment on your system. If you would like more detail, the tutorial below
will walk you through loading the miniconda environment: https://spacyr.quanteda.io.
Once you are sure you have a miniconda environment you now want to “install” spacy (not
the spacyr library you already installed). This gets a little complicated, but essentially, what
you need to do is run the following function:

spacy_install()

For some of you, this will work fine with all the defaults.
However, if the default function doesn’t work for you, or if you know you have multiple
installations of Python on your system, you will want to use the argument to point to the
specific version of Python you have installed and/or want to use. For example, since I have
Python 3.11.4 as my primary working version, I set the argument to point to that version of
Python. Below is some sample code with the elements of the argument completed:

spacy_install(
  conda = "auto",
  version = "latest",
  lang_models = "en_core_web_sm",
  python_version = "3.11.4",
  envname = "spacy_condaenv",
  pip = FALSE,
  python_path = NULL,
  prompt = TRUE
)

or, for my updated version using Anaconda, I used the following code:

spacy_install(
  conda = "auto",
  version = "latest",
  lang_models = "en_core_web_sm",
  python_version = "3.11.5",
  envname = "textmining",
  pip = FALSE,
  python_path = "/Users/derrickcogburn/miniconda3/envs/textmining/bin/python",
  prompt = TRUE
)

Be forewarned: this function is going to install quite a few additional and dependent packages, and it will take some time. You definitely want to comment out this function once you have installed spaCy and its dependencies.
You should hopefully eventually get a message similar to the following in your console:
== Download and installation successful You can now load the package via spacy.load(‘en_core_web_sm’)
Installation complete. Condaenv: spacy_condaenv: Language model(s): en_core_web_sm
==
Once you get this message, congratulations! You are ready to initialize spaCy using the spacy_initialize() function with the following code in the argument: model = "en_core_web_sm". Note that the "model" argument calls the smaller version of the English language model (the larger one is "en_core_web_lg"), which you may also use, but it takes longer to process.
Now you are ready to "initialize" spacyr. To do this, use the spacy_initialize() function. In your argument, for model = add: "en_core_web_sm". At the end of the lab you will learn the opposite function, spacy_finalize(), to "detach" spaCy from your environment.
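For example (a minimal sketch):

spacy_initialize(model = "en_core_web_sm")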

successfully initialized (spaCy Version: 3.7.4, language model: en_core_web_sm)

Once you initialize spaCy, you should see something similar to the following result in the console. It indicates spaCy could find the conda environment to use (spacy_condaenv). It also tells you the version of spaCy you are using, the language model it attached, and the Python options you have set up:
==
Found 'spacy_condaenv'. spacyr will use this environment. successfully initialized (spaCy Version: 3.1.3, language model: en_core_web_sm) (python options: type = "condaenv", value = "spacy_condaenv")
==
If your initialization is successful, you are now ready to use the powerful Python-based spacyr
package for Natural Language Processing (NLP) tasks.
*** End of Pre-Lab ***

4 Lab Instructions

Lab Instructions (to be completed during class):


This lab has 34 deliverables. Follow and complete these Lab Instructions before, during, or after the synchronous class.

4.1 Deliverable 1: Get your working directory and paste below:

getwd()

5 Part 1: Parts of Speech (POS) Tagging in openNLP

In this part of the lab, you will use the openNLP package to perform Parts of Speech (POS) tagging on a text. Let's now focus on text parsing, named entity recognition, and understanding a Parts of Speech (POS) tokenizer in the openNLP package.

5.1 Deliverable 2: Create Some Sample Text

First, let’s create some example text to start. Create an object called “s” and assign it the
results of the following code chunk:

s <- paste(c('Pierre Vinken, 61 years old, will join the board as a ',
'nonexecutive director Nov 29.',
'Mr. Vinken is chairman of Elsevier, N.V., ',
'the Dutch publishing group.'),
collapse = '')
s <- as.String(s)
s

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov 29. Mr. Vinke

5.2 Deliverable 3: Create Sentence and Word Token Annotations

Now, let's create sentence and word token annotation objects called sent_token_annotator and word_token_annotator. To do this, you will use the Maxent_Sent_Token_Annotator()
function and the Maxent_Word_Token_Annotator() function from the openNLP package.
Here is the sample code:

sent_token_annotator <- Maxent_Sent_Token_Annotator()


word_token_annotator <- Maxent_Word_Token_Annotator()

Then, create an object called a2 and assign it the results of the following code. And then review a2.

a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
a2

Now, create an object called pos_tag_annotator and assign it the results of the Maxent_POS_Tag_Annotator() function. Create an object called a3 and assign it the results of applying the annotate() function to object s. Use the following code for the argument of the annotate() function: (s, pos_tag_annotator, a2). Then review a3. Next, create an object called a3w and assign it the results of the function subset(), using the following code chunk in your argument: (a3, type == "word"). Then review a3w.
Sample code is below:

pos_tag_annotator <- Maxent_POS_Tag_Annotator()

a3 <- annotate(s, pos_tag_annotator, a2)

a3

a3w <- subset(a3, type == "word")

a3w

Now, create an object called tags, and assign it the results of the following code chunk:

sapply(a3w$features,"[[","POS").

Then review the object tags. Finally, use the table() function to create a table of tags.
Sample code is below:

tags <- sapply(a3w$features,"[[","POS")
tags
table(tags)

For more information on Parts of Speech (POS) tagging, please see: https://repository.upenn.
edu/cgi/viewcontent.cgi?article=1603&context=cis_reports; and for a summary, please see:
https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html.

5.3 Deliverable 3: Extract Tokens/POS Pairs

Now we will extract tokens/POS pairs (all of them):


Here is sample code:

sprintf("%s/%s", s[a3w], tags)

[1] "Pierre/NNP" "Vinken/NNP" ",/," "61/CD"


[5] "years/NNS" "old/JJ" ",/," "will/MD"
[9] "join/VB" "the/DT" "board/NN" "as/IN"
[13] "a/DT" "nonexecutive/JJ" "director/NN" "Nov/NNP"
[17] "29/CD" "./." "Mr./NNP" "Vinken/NNP"
[21] "is/VBZ" "chairman/NN" "of/IN" "Elsevier/NNP"
[25] ",/," "N.V./NNP" ",/," "the/DT"
[29] "Dutch/JJ" "publishing/NN" "group/NN" "./."

6 Part 2: Applying Named Entity Recognition (NER) to the Clinton Email Dataset

In this part of the lab, you will apply Named Entity Recognition (NER) to a subset of the
infamous Clinton Email Dataset. The Clinton Email Dataset is a collection of emails from
Hillary Clinton’s private email server. The dataset is available on Kaggle at: https://www.
kaggle.com/kaggle/hillary-clinton-emails.
We need to scan a folder in our working directory for multiple files representing the Hillary
Clinton emails saved as .txt files. We will be using the list.files function and a wildcard added
to the .txt file extension (*.txt) which will identify any file in that directory that ends in .txt.
We will then create an object called "tmp" that contains the file paths to those emails.
You may want to set your working directory, with the path leading to the folder containing
the emails.
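If needed, here is a hedged example (the path below is hypothetical; substitute the path to your own Lab 8 project folder):

# setwd("~/Documents/Lab8") # hypothetical path; adjust to your own project folder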

6.1 Deliverable 4: Import Data and Organize the Clinton Email Dataset

Here is the sample code:

tmp <- list.files(path = "data/C8_final_txts/", pattern = '*.txt', full.names = T)

emails <- pblapply(tmp, readLines)
names(emails) <- gsub('.txt', '', basename(tmp))

6.2 Deliverable 5: Examine the Clinton Email Dataset

Let’s now examine one (1) email. Use the following code:

emails[[1]]

[1] "UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05758905 Date: 02/
[2] ""
[3] "RELEASE IN FULL"
[4] ""
[5] "From: Sent: To: Subject:"
[6] ""
[7] "H <[email protected] > Monday, July 6, 2009 10:22 AM '[email protected]' Re: Sch
[8] ""
[9] "Either is fine w me."
[10] "Original Message From: Valmoro, Lona J <[email protected]> To: H Cc: Abedin, Huma <Abe
[11] "Secretary Salazar's office just called -- he would like to meeting with you early this
[12] ""
[13] "Lona Valmoro Special Assistant to the Secretary of State 202-647-9071 (direct)"
[14] ""
[15] "UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05758905 Date: 02/
[16] ""
[17] "\f"

6.3 Deliverable 6: Create a Custom Function to Clean the Emails

Now, let’s create a custom function called txtClean to clean the emails.
Sample code is below:

txtClean <- function(x) {
  x <- x[-1]
  x <- paste(x, collapse = " ")
  x <- str_replace_all(x, "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+", "")
  x <- str_replace_all(x, "Doc No.", "")
  x <- str_replace_all(x, "UNCLASSIFIED U.S. Department of State Case No.", "")
  x <- removeNumbers(x)
  x <- as.String(x)
  return(x)
}

6.4 Deliverable 7: Apply the Cleaning Function to the Clinton Email Dataset

Now, as a test, we will apply the cleaning function to one email and observe what happens to
it. Use the following code:

txtClean(emails[[1]])[[1]]

[1] " RELEASE IN FULL From: Sent: To: Subject: H < > Monday, July , : AM '' Re: Schedule

Now, let’s apply the cleaning function to all emails; and then review one email in the list.
Sample code is below:

allEmails <- pblapply(emails,txtClean)

allEmails[[2]][[1]][1]

[1] " From: Sent: To: Subject: H < > Tuesday, July , : AM '' Re: Reminder geithner is in ab

6.5 Deliverable 8: Apply POS Tagging to the Clinton Email Dataset

In this deliverable you will create several objects to use as various types of “annotators”.
Create three objects to use for your POS tagging. Create the following three objects: persons, locations, and organizations. For each of those, assign its respective Maxent entity annotator (e.g., kind = "person", kind = "location", kind = "organization").
Sample code is below:

persons <- Maxent_Entity_Annotator(kind='person')
locations <- Maxent_Entity_Annotator(kind='location')
organizations <- Maxent_Entity_Annotator(kind='organization')

Then create two objects to hold your Maxent sentence and word annotators.
Sample code below:

sentTokenAnnotator <- Maxent_Sent_Token_Annotator(language='en')


wordTokenAnnotator <- Maxent_Word_Token_Annotator(language='en')

Finally, create one more object called posTagAnnotator to hold your Maxent POS tag annotator.
Sample code below:

posTagAnnotator <- Maxent_POS_Tag_Annotator(language='en')

Together, these objects provide entity models for persons, locations, and organizations, as well as individual models for sentences, words, and parts of speech. This code loads the pre-existing feature weights so they can be called by your R session, but it does not yet apply them to any text.

6.6 Deliverable 9: Annotate the Clinton Email Dataset Using a For Loop

For this deliverable, we will work on understanding for loops in R.


In R, a "for loop" iterates over a controlled counter or index that is incremented at each cycle, while a "repeat loop" (with conditional clauses) uses break and next based on the verification of some logical condition.
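As a minimal illustration of the repeat/break pattern mentioned above (purely illustrative, not part of the lab deliverables):

i <- 1
repeat {
  if (i > 3) break # exit the loop once the condition is met
  print(i)
  i <- i + 1
}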
We will now annotate each document in a loop. Sample code is below:

annotationsData <- list()


for (i in 1:length(allEmails)){
print(paste('starting annotations on doc', i))
annotations <- annotate(allEmails[[i]], list(sentTokenAnnotator,
wordTokenAnnotator,
posTagAnnotator,
persons,
locations,
organizations))
annDF <- as.data.frame(annotations)[,2:5]

annDF$features <- unlist(as.character(annDF$features))

annotationsData[[tmp[i]]] <- annDF


print(paste('finished annotations on doc', i))
}

6.7 Deliverable 10: Extract Terms from the Clinton Email Dataset

Annotations have character indices. We will now obtain terms by index from each document
using a NESTED loop.
Sample code is below (run the entire code chunk at once):

allData<- list()
for (i in 1:length(allEmails)){
x <- allEmails[[i]] # get an individual document
y <- annotationsData[[i]] # get an individual doc's annotation information
print(paste('starting document:',i, 'of', length(allEmails)))
# for each row in the annotation information, extract the term by index
POSls <- list()
for(j in 1:nrow(y)){
annoChars <- ((substr(x,y[j,2],y[j,3]))) #substring position
# Organize information in data frame
z <- data.frame(doc_id = i,
type = y[j,1],
start = y[j,2],
end = y[j,3],
features = y[j,4],
text = as.character(annoChars))
POSls[[j]] <- z
#print(paste('getting POS:', j))
}
# Bind each documents annotations & terms from loop into a single DF
docPOS <- do.call(rbind, POSls)
# So each document will have an individual DF of terms, and annotations as a list element
allData[[i]] <- docPOS
}

6.8 Deliverable 11: Subset the Clinton Email Dataset

Now we can subset for each document.

Sample code is below:

people <- pblapply(allData, subset, grepl("*person", features))
locaction <- pblapply(allData, subset, grepl("*location", features))
organization <- pblapply(allData, subset, grepl("*organization", features))

people
locaction # note: "locaction" is deliberately misspelled as an object name
organization

### Or if you prefer to work with flat objects, make it a data frame w/ all info
POSdf <- do.call(rbind, allData)

# Subsetting example w/ 2 conditions: people found in email 1
subset(POSdf, POSdf$doc_id == 1 & grepl("*person", POSdf$features) == T)

6.9 Deliverable 12: Using the Annotate Entities Process

This final data frame contains not only persons, locations, and organizations, but also each
detected sentence, word, and part of speech. To complete the named entity analysis, you may
need to subset the data frame for specific features.
The sample code is below:

annDF

subset(annDF, grepl("*person", annDF$features) == T)
subset(annDF, grepl("*location", annDF$features) == T)
subset(annDF, grepl("*organization", annDF$features) == T)

6.10 Deliverable 13: Annotate Entities with OpenNLP

Now, you may define an annotation sequence, and create a list with the specific openNLP models.
Sample code is below:

annotate.entities <- function(doc, annotation.pipeline) {
annotations <- annotate(doc, annotation.pipeline)
AnnotatedPlainTextDocument(doc, annotations)
}

ner.pipeline <- list(


Maxent_Sent_Token_Annotator(),
Maxent_Word_Token_Annotator(),
Maxent_POS_Tag_Annotator(),
Maxent_Entity_Annotator(kind = "person"),
Maxent_Entity_Annotator(kind = "location"),
Maxent_Entity_Annotator(kind = "organization")
)

6.11 Deliverable 14: Apply the Annotation Function

Now that we have an annotation function that individually calls the models, we need to apply it to the entire email list. Either the lapply() or the pblapply() function will work, but pblapply() is helpful because it provides a progress bar (pb). Note that the pbapply package must be loaded.
Sample code is below:

all.ner <- pblapply(allEmails, annotate.entities, ner.pipeline)

Now, we can extract the useful information and construct a data frame with entity information.
all.ner
all.ner <- pluck(all.ner, "annotations")
all.ner <- pblapply(all.ner, as.data.frame)
#all.ner[[3]][244:250,]
#all.ner <- Map(function(tex,fea,id) cbind(fea, entity = substring(tex,fea$start, fea$end), f

7 Part 3: Conducting NLP Analysis with spacyr

7.1 Deliverable 15: Load the spacyr Package

The spacyr package is a powerful tool for conducting NLP analysis. It is a wrapper for the
spaCy Python library, which is known for its speed and accuracy.

To use the spacyr package, you need to install it and load it into your R environment.
Now, to practice with spacyr, let’s create an object called txt that is assigned the results of
two sample “documents”, d1 and d2. We then use the spacy_parse() function to parse them,
and save in a data.table called parsedtxt, which we can then review.
Sample code is below:

txt <- c(d1 = "spaCy is great at fast natural language processing.",


d2 = "Mr. Smith spent two years in North Carolina.")

parsedtxt <- spacy_parse(txt)

parsedtxt

doc_id sentence_id token_id token lemma pos entity


1 d1 1 1 spaCy spacy INTJ
2 d1 1 2 is be AUX
3 d1 1 3 great great ADJ
4 d1 1 4 at at ADP
5 d1 1 5 fast fast ADJ
6 d1 1 6 natural natural ADJ
7 d1 1 7 language language NOUN
8 d1 1 8 processing processing NOUN
9 d1 1 9 . . PUNCT
10 d2 1 1 Mr. Mr. PROPN
11 d2 1 2 Smith Smith PROPN PERSON_B
12 d2 1 3 spent spend VERB
13 d2 1 4 two two NUM DATE_B
14 d2 1 5 years year NOUN DATE_I
15 d2 1 6 in in ADP
16 d2 1 7 North North PROPN GPE_B
17 d2 1 8 Carolina Carolina PROPN GPE_I
18 d2 1 9 . . PUNCT

7.2 Deliverable 16: Review the Parsed Text

In this analysis, we see two fields added to each token (pos and entity). The pos field returns the parts of speech from the Universal Dependencies tagset; setting tag = TRUE, as below, also returns the more detailed Penn Treebank-style tags. You may adjust the arguments of the spacy_parse() function to determine what you get back. For example, you may choose not to generate the lemma or entity fields with the code chunk below:

spacy_parse(txt, tag = TRUE, entity = FALSE, lemma = FALSE)

doc_id sentence_id token_id token pos tag


1 d1 1 1 spaCy INTJ UH
2 d1 1 2 is AUX VBZ
3 d1 1 3 great ADJ JJ
4 d1 1 4 at ADP IN
5 d1 1 5 fast ADJ JJ
6 d1 1 6 natural ADJ JJ
7 d1 1 7 language NOUN NN
8 d1 1 8 processing NOUN NN
9 d1 1 9 . PUNCT .
10 d2 1 1 Mr. PROPN NNP
11 d2 1 2 Smith PROPN NNP
12 d2 1 3 spent VERB VBD
13 d2 1 4 two NUM CD
14 d2 1 5 years NOUN NNS
15 d2 1 6 in ADP IN
16 d2 1 7 North PROPN NNP
17 d2 1 8 Carolina PROPN NNP
18 d2 1 9 . PUNCT .

You may also parse and POS tag documents in multiple languages using over 70 other language models (some of these include German (de), Spanish (es), Portuguese (pt), French (fr), Italian (it), Dutch (nl), and many others; see https://spacy.io/usage/models#languages for a complete list).

7.3 Deliverable 17: Parse the txt Object with spacyr

You may also use spacyr to directly tokenize the txt object. Use the spacy_tokenize() function with txt in the argument. This approach is designed to match the tokens() function within quanteda.
Sample code is below:

spacy_tokenize(txt)

$d1
[1] "spaCy" "is" "great" "at" "fast"
[6] "natural" "language" "processing" "."

$d2
[1] "Mr." "Smith" "spent" "two" "years" "in" "North"
[8] "Carolina" "."

7.4 Deliverable 18: Tokenize Text into a Data Frame

The default returns a named list (where the document name, e.g. d1 or d2, is the list element name). Or you may specify output as a data.frame. Either way, make sure you have the dplyr library loaded.
Sample code is below:

library(dplyr)

spacy_tokenize(txt, remove_punct = TRUE, output = "data.frame") %>%


tail()

doc_id token
11 d2 spent
12 d2 two
13 d2 years
14 d2 in
15 d2 North
16 d2 Carolina

7.5 Deliverable 19: Extract Named Entities From Parsed Text

You may also use spacyr to extract named entities from the parsed text.
Sample code is below:

parsedtxt <- spacy_parse(txt, lemma = FALSE, entity = TRUE, nounphrase = TRUE)


entity_extract(parsedtxt)

doc_id sentence_id entity entity_type


1 d2 1 Smith PERSON
2 d2 1 North_Carolina GPE

7.6 Deliverable 20: Extract Extended Entity Set

The following approach lets you extract the “extended” entity set.
Sample code is below:

entity_extract(parsedtxt, type = "all")

doc_id sentence_id entity entity_type


1 d2 1 Smith PERSON
2 d2 1 two_years DATE
3 d2 1 North_Carolina GPE

7.7 Deliverable 21: Consolidate Named Entities

Another interesting possibility is to "consolidate" multi-word entities into single tokens using the entity_consolidate() function.
Sample code is below:

entity_consolidate(parsedtxt) %>%
tail()

doc_id sentence_id token_id token pos entity_type


11 d2 1 2 Smith ENTITY PERSON
12 d2 1 3 spent VERB
13 d2 1 4 two_years ENTITY DATE
14 d2 1 5 in ADP
15 d2 1 6 North_Carolina ENTITY GPE
16 d2 1 7 . PUNCT

7.8 Deliverable 22: Extract Noun Phrases

Similarly, spacyr can extract noun phrases.


Sample code is below:

nounphrase_extract(parsedtxt)

doc_id sentence_id nounphrase
1 d1 1 fast_natural_language_processing
2 d2 1 Mr._Smith
3 d2 1 two_years
4 d2 1 North_Carolina

Noun phrases may also be consolidated using the nounphrase_consolidate() function applied
to the parsedtxt object.
Sample code is below:

nounphrase_consolidate(parsedtxt)

doc_id sentence_id token_id token pos


1 d1 1 1 spaCy INTJ
2 d1 1 2 is AUX
3 d1 1 3 great ADJ
4 d1 1 4 at ADP
5 d1 1 5 fast_natural_language_processing nounphrase
6 d1 1 6 . PUNCT
7 d2 1 1 Mr._Smith nounphrase
8 d2 1 2 spent VERB
9 d2 1 3 two_years nounphrase
10 d2 1 4 in ADP
11 d2 1 5 North_Carolina nounphrase
12 d2 1 6 . PUNCT

Deliverable 23: Extract Entities Without Parsing the Entire Text


To only extract entities without parsing the entire text, you may use the spacy_extract_entity()
function applied to the txt object.
Sample code is below:

spacy_extract_entity(txt)

doc_id text ent_type start_id length


1 d2 Smith PERSON 2 1
2 d2 two years DATE 4 2
3 d2 North Carolina GPE 7 2

Similarly, to extract noun phrases without parsing the entire text, you may use the
spacy_extract_nounphrases() function:
Sample code is below:

spacy_extract_nounphrases(txt)

doc_id text root_text start_id root_id length


1 d1 fast natural language processing processing 5 8 4
2 d2 Mr. Smith Smith 1 2 2
3 d2 two years years 4 5 2
4 d2 North Carolina Carolina 7 8 2

For detailed parsing of syntactic dependencies, you may use the dependency=TRUE option in
the argument. Sample code is below:

spacy_parse(txt, dependency = TRUE, lemma = FALSE, pos = FALSE)

doc_id sentence_id token_id token head_token_id dep_rel entity


1 d1 1 1 spaCy 2 nsubj
2 d1 1 2 is 2 ROOT
3 d1 1 3 great 2 acomp
4 d1 1 4 at 2 prep
5 d1 1 5 fast 8 amod
6 d1 1 6 natural 7 amod
7 d1 1 7 language 8 compound
8 d1 1 8 processing 4 pobj
9 d1 1 9 . 2 punct
10 d2 1 1 Mr. 2 compound
11 d2 1 2 Smith 3 nsubj PERSON_B
12 d2 1 3 spent 3 ROOT
13 d2 1 4 two 5 nummod DATE_B
14 d2 1 5 years 3 dobj DATE_I
15 d2 1 6 in 3 prep
16 d2 1 7 North 8 compound GPE_B
17 d2 1 8 Carolina 6 pobj GPE_I
18 d2 1 9 . 3 punct

7.9 Deliverable 23: Extract Additional Attributes

You may also extract additional attributes of spaCy tokens using the additional_attributes
option in the argument.
Sample code is below:

spacy_parse("I have six email addresses, including [email protected].",


additional_attributes = c("like_num", "like_email"),
lemma = FALSE, pos = FALSE, entity = FALSE)

doc_id sentence_id token_id token like_num like_email


1 text1 1 1 I FALSE FALSE
2 text1 1 2 have FALSE FALSE
3 text1 1 3 six TRUE FALSE
4 text1 1 4 email FALSE FALSE
5 text1 1 5 addresses FALSE FALSE
6 text1 1 6 , FALSE FALSE
7 text1 1 7 including FALSE FALSE
8 text1 1 8 [email protected] FALSE TRUE
9 text1 1 9 . FALSE FALSE

Take a moment to review this output and discuss why this was produced from the parsing
request.

7.10 Deliverable 24: Integrating spacyr Output with quanteda

You may also integrate the output from spacyr directly into quanteda. Sample code is below:

library(quanteda, warn.conflicts = FALSE, quietly = TRUE)

# To identify the names of the documents


docnames(parsedtxt)

# To count the number of tokens in the documents


ntoken(parsedtxt)

# To count the number or types of tokens


ntype(parsedtxt)

Package version: 4.1.0
Unicode version: 14.0
ICU version: 71.1

Parallel computing: disabled

See https://quanteda.io for tutorials and examples.

[1] "d1" "d2"

d1 d2
9 9

d1 d2
9 9

7.11 Deliverable 25: Converting spacyr Tokens for quanteda

You may also convert the spacyr parsed output into quanteda tokens; spaCy's tokenizers are "smarter" than the purely syntactic, pattern-based tokenizers used by quanteda.
Sample code is below:

parsedtxt <- spacy_parse(txt, pos = TRUE, tag = TRUE)


as.tokens(parsedtxt)

Tokens consisting of 2 documents.


d1 :
[1] "spaCy" "is" "great" "at" "fast"
[6] "natural" "language" "processing" "."

d2 :
[1] "Mr." "Smith" "spent" "two" "years" "in" "North"
[8] "Carolina" "."

If you want to select only nouns, using “glob” pattern matching with quanteda, you may use
the tokens_select() function.
Sample code is below:

spacy_parse("The cat in the hat ate green eggs and ham.", pos = TRUE) %>%
as.tokens(include_pos = "pos") %>%
tokens_select(pattern = c("*/NOUN"))

Tokens consisting of 1 document.


text1 :
[1] "cat/NOUN" "hat/NOUN" "eggs/NOUN"

You may also directly convert the spaCy-based tokens. Sample code is below:

spacy_tokenize(txt) %>%
as.tokens()

Tokens consisting of 2 documents.


d1 :
[1] "spaCy" "is" "great" "at" "fast"
[6] "natural" "language" "processing" "."

d2 :
[1] "Mr." "Smith" "spent" "two" "years" "in" "North"
[8] "Carolina" "."

You may also do this for sentences, for which spaCy is very smart.
Sample code is below:

txt2 <- "A Ph.D. in Washington D.C. Mr. Smith went to Washington."
spacy_tokenize(txt2, what = "sentence") %>%
as.tokens()

Tokens consisting of 1 document.


text1 :
[1] "A Ph.D. in Washington D.C. Mr. Smith went to Washington."

This also works well with entity recognition.


Sample code is below:

spacy_parse(txt, entity = TRUE) %>%


entity_consolidate() %>%
as.tokens() %>%
head(1)

Tokens consisting of 1 document.
d1 :
[1] "spaCy" "is" "great" "at" "fast"
[6] "natural" "language" "processing" "."

7.12 Deliverable 26: Using spacyr with tidytext

The spacyr package also works well with tidytext.


Sample code is below:

library("tidytext")

unnest_tokens(parsedtxt, word, token) %>%


dplyr::anti_join(stop_words)

Joining with `by = join_by(word)`

doc_id sentence_id token_id lemma pos tag entity word


1 d1 1 1 spacy INTJ UH spacy
2 d1 1 5 fast ADJ JJ fast
3 d1 1 6 natural ADJ JJ natural
4 d1 1 7 language NOUN NN language
5 d1 1 8 processing NOUN NN processing
6 d2 1 2 Smith PROPN NNP PERSON_B smith
7 d2 1 3 spend VERB VBD spent
8 d2 1 7 North PROPN NNP GPE_B north
9 d2 1 8 Carolina PROPN NNP GPE_I carolina

7.13 Deliverable 27: POS Filtering

We can then use POS filtering using dplyr.


Sample code is below

spacy_parse("The cat in the hat ate green eggs and ham.", pos = TRUE) %>%
unnest_tokens(word, token) %>%
dplyr::filter(pos == "NOUN")

doc_id sentence_id token_id lemma pos entity word
1 text1 1 2 cat NOUN cat
2 text1 1 5 hat NOUN hat
3 text1 1 8 egg NOUN eggs

7.14 Deliverable 28: Finalizing the SpaCy Connection

Since spacy_initialize() attaches a background spaCy process in Python space, it takes up a significant amount of memory (especially when using a large language model such as en_core_web_lg). So, when you no longer need the connection to spaCy, you may remove the spaCy object by calling the spacy_finalize() function.
And, when you are ready to reattach the back-end spaCy object, you call spacy_initialize() again.
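For example (a minimal sketch):

spacy_finalize()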

8 Part 4: Conducting NLP Analysis in Python

In this final section, we will conduct NLP analysis in Python. We will use the nltk package and
some of the built-in datasets to conduct our analysis. Make sure the nltk package is installed
into the Python environment you are using for this lab.

8.1 Deliverable 29: Importing the nltk Package and Downloading Necessary
POS Taggers

For this part of the lab you will need to import the nltk package and download punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words. These downloads are necessary for the lab: punkt is a pre-trained model that helps you tokenize sentences and words, averaged_perceptron_tagger is a pre-trained model that helps you tag parts of speech, maxent_ne_chunker is a pre-trained model that helps you identify named entities, and words is a dataset that contains a list of words.
You can do this by running the following sample code in a Python code chunk:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

True

True

True

True

8.2 Deliverable 30: Downloading the Necessary Datasets

You will also need to download the necessary datasets from nltk which are required for the lab. NLTK comes with several built-in datasets. We will use the "state_union" corpus, a collection of US Presidential State of the Union addresses. We will use the 2006 address by President George W. Bush. Sample code is below:

from nltk.corpus import state_union


nltk.download('state_union')
# Load a sample text
sample_text = state_union.raw("2006-GWBush.txt")
print(sample_text)

True

PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE

January 31, 2006

THE PRESIDENT: Thank you all. Mr. Speaker, Vice President Cheney, members of Congress, member

President George W. Bush reacts to applause during his State of the Union Address at the Capi

In a system of two parties, two chambers, and two elected branches, there will always be diff

In this decisive year, you and I will make choices that determine both the future and the cha

Abroad, our nation is committed to an historic, long-term goal -- we seek the end of tyranny

Far from being a hopeless dream, the advance of freedom is the great story of our time. In 19

President George W. Bush delivers his State of the Union Address at the Capitol, Tuesday, Jan

Their aim is to seize power in Iraq, and use it as a safe haven to launch attacks against Ame

In a time of testing, we cannot find security by abandoning our commitments and retreating wi

America rejects the false comfort of isolationism. We are the nation that saved liberty in Eu

President George W. Bush greets members of Congress after his State of the Union Address at t

Second, we're continuing reconstruction efforts, and helping the Iraqi government to fight co

Our work in Iraq is difficult because our enemy is brutal. But that brutality has not stopped

The road of victory is the road that will take our troops home. As we make progress on the gr

Our coalition has learned from our experience in Iraq. We've adjusted our military tactics an

With so much in the balance, those of us in public office have a duty to speak with candor. A

Laura Bush is applauded as she is introduced Tuesday evening, Jan. 31, 2006 during the State

Staff Sergeant Dan Clay's wife, Lisa, and his mom and dad, Sara Jo and Bud, are with us this

Our nation is grateful to the fallen, who live in the memory of our country. We're grateful t

Our offensive against terror involves more than military action. Ultimately, the only way to

The great people of Egypt have voted in a multi-party presidential election -- and now their

President George W. Bush waves toward the upper visitors gallery of the House Chamber followi

Tonight, let me speak directly to the citizens of Iran: America respects you, and we respect

To overcome dangers in our world, we must also take the offensive by encouraging economic pro

In recent years, you and I have taken unprecedented action to fight AIDS and malaria, expand

Our country must also remain on the offensive against terrorism here at home. The enemy has n

It is said that prior to the attacks of September the 11th, our government failed to connect

In all these areas -- from the disruption of terror networks, to victory in Iraq, to the spre

Our own generation is in a long war against a determined enemy -- a war that will be fought b

Here at home, America also has a great opportunity: We will build the prosperity of our count

Our economy is healthy and vigorous, and growing faster than other major industrialized natio

The American economy is preeminent, but we cannot afford to be complacent. In a dynamic world

Tonight I will set out a better path: an agenda for a nation that competes with confidence; a

Keeping America competitive begins with keeping our economy growing. And our economy grows wh

Keeping America competitive requires us to be good stewards of tax dollars. Every year of my

I am pleased that members of Congress are working on earmark reform, because the federal budg

We must also confront the larger challenge of mandatory spending, or entitlements. This year,

So tonight, I ask you to join me in creating a commission to examine the full impact of baby

Keeping America competitive requires us to open more markets for all that Americans make and

Keeping America competitive requires an immigration system that upholds our laws, reflects ou

Keeping America competitive requires affordable health care. (Applause.) Our government has a

We will make wider use of electronic records and other health information technology, to help

Keeping America competitive requires affordable energy. And here we have a serious problem: A

So tonight, I announce the Advanced Energy Initiative -- a 22-percent increase in clean-energ

We must also change how we power our automobiles. We will increase our research in better bat

Breakthroughs on this and other new technologies will help us reach another great goal: to re

And to keep America competitive, one commitment is necessary above all: We must continue to l

First, I propose to double the federal commitment to the most critical basic research program

Second, I propose to make permanent the research and development tax credit -- (applause) --

Third, we need to encourage children to take more math and science, and to make sure those co

Preparing our nation to compete in the world is a goal that all of us can share. I urge you t

America is a great force for freedom and prosperity. Yet our greatness is not measured in pow

In recent years, America has become a more hopeful nation. Violent crime rates have fallen to

These gains are evidence of a quiet transformation -- a revolution of conscience, in which a

Yet many Americans, especially parents, still have deep concerns about the direction of our c

As we look at these challenges, we must never give in to the belief that America is in declin

A hopeful society depends on courts that deliver equal justice under the law. The Supreme Cou

Today marks the official retirement of a very special American. For 24 years of faithful serv

A hopeful society has institutions of science and medicine that do not cut ethical corners, a

A hopeful society expects elected officials to uphold the public trust. (Applause.) Honorable

As we renew the promise of our institutions, let us also show the character of America in our

A hopeful society gives special attention to children who lack direction and love. Through th

A hopeful society comes to the aid of fellow citizens in times of suffering and emergency --

In New Orleans and in other places, many of our fellow citizens have felt excluded from the p

A hopeful society acts boldly to fight diseases like HIV/AIDS, which can be prevented, and tr

Fellow citizens, we've been called to leadership in a period of consequence. We've entered a

Lincoln could have accepted peace at the cost of disunity and continued slavery. Martin Luthe

Before history is written down in books, it is written in courage. Like Americans before us,

May God bless America. (Applause.)

8.3 Deliverable 31: Tokenizing the Text

We want to perform POS tagging and NER, but before we can do that, we need to tokenize
the text, splitting it into sentences, and then into words. Sample code is below:

from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenize the sentences


sentences = sent_tokenize(sample_text)

# Tokenize the words


#words = word_tokenize(sample_text)
words = [word_tokenize(sentence) for sentence in sentences]

8.4 Deliverable 32: Perform POS Tagging and NER

We can now perform POS tagging.


Sample code is below:

pos_tagged = [nltk.pos_tag(word) for word in words]


print(pos_tagged)

[[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('A

8.5 Deliverable 33: Perform NER

Now, let’s perform NER. Sample code is below:

named_entities = []
for tagged_sentence in pos_tagged:
chunked_sentence = nltk.ne_chunk(tagged_sentence, binary = True)
named_entities.extend([chunk for chunk in chunked_sentence if hasattr(chunk, 'label')])
print(named_entities[:10])

[Tree('NE', [('GEORGE', 'NNP')]), Tree('NE', [('ADDRESS', 'NNP')]), Tree('NE', [('THE', 'NNP'

8.6 Deliverable 34: Visualizing the Results

Now, we can visualize these results. We will use matplotlib to visualize the POS tagging
and NER. If you have not already installed the matplotlib package, you may do so within
Anaconda Navigator.
First, let’s create a frequency distribution of the POS tags, and then plot this distribution.
Sample code is below:

import nltk
from nltk import FreqDist
import matplotlib.pyplot as plt

# Assuming `pos_tagged` contains your POS-tagged text


pos_tags = [tag for sentence in pos_tagged for _, tag in sentence]

# Frequency distribution of POS tags


pos_freq = FreqDist(pos_tags)

# Creating a bar plot for POS tags frequency


plt.figure(figsize=(12, 8))
pos_freq.plot(30, cumulative=False)
plt.show()

[Figure: bar plot of the frequency distribution of the 30 most common POS tags (e.g. NN, IN, DT, JJ, NNP, NNS, ...), with counts on the y-axis and tag samples on the x-axis]
Now, we will visualize the NER results. We will create a frequency distribution of the NER
tags, and then plot this distribution.
Sample code is below:

import nltk
from nltk import FreqDist
import matplotlib.pyplot as plt

# Assuming `named_entities` contains the NE chunks extracted above
entity_strings = [" ".join(word for word, tag in tree.leaves()) for tree in named_entities]

# Frequency distribution of the named entities
ner_freq = FreqDist(entity_strings)

# Creating a bar plot of the 30 most frequent named entities
plt.figure(figsize=(12, 8))
ner_freq.plot(30, cumulative=False)
plt.show()

[Figure: bar plot of the resulting frequency distribution, with counts on the y-axis]

*** End of Lab ***
Please render your Quarto file to pdf and submit to the assignment for Lab 8 within Canvas.
