
Lab 8: Natural Language Processing (NLP), Parts of Speech Tagging (POS), and Named Entity Recognition (NER)
Dr. Derrick L. Cogburn

2024-10-20

Table of contents

1 Lab Overview
1.1 Technical Learning Objectives
1.2 Business Learning Objectives
2 Assignment Overview
3 Pre-Lab Instructions
4 Lab Instructions
4.1 Deliverable 1: Get your working directory and paste below:
5 Part 1: Parts of Speech (POS) Tagging in openNLP
5.1 Deliverable 2: Create Some Sample Text
5.2 Deliverable 3: Create Sentence and Word Token Annotations
5.3 Deliverable 3: Extract Tokens/POS Pairs
6 Part 2: Applying Named Entity Recognition (NER) to the Clinton Email Dataset
6.1 Deliverable 4: Import Data and Organize the Clinton Email Dataset
6.2 Deliverable 5: Examine the Clinton Email Dataset
6.3 Deliverable 6: Create a Custom Function to Clean the Emails
6.4 Deliverable 7: Apply the Cleaning Function to the Clinton Email Dataset
6.5 Deliverable 8: Apply POS Tagging to the Clinton Email Dataset
6.6 Deliverable 9: Annotate the Clinton Email Dataset Using a For Loop
6.7 Deliverable 10: Extract Terms from the Clinton Email Dataset
6.8 Deliverable 11: Subset the Clinton Email Dataset
6.9 Deliverable 12: Using the Annotate Entities Process
6.10 Deliverable 13: Annotate Entities with OpenNLP
6.11 Deliverable 14: Apply the Annotation Function
7 Part 3: Conducting NLP Analysis with spacyr
7.1 Deliverable 15: Load the spacyr Package
7.2 Deliverable 16: Review the Parsed Text
7.3 Deliverable 17: Parse the txt Object with spacyr
7.4 Deliverable 18: Tokenize Text into a Data Frame
7.5 Deliverable 19: Extract Named Entities From Parsed Text
7.6 Deliverable 20: Extract Extended Entity Set
7.7 Deliverable 21: Consolidate Named Entities
7.8 Deliverable 22: Extract Noun Phrases
7.9 Deliverable 23: Extract Additional Attributes
7.10 Deliverable 24: Integrating spacyr Output with quanteda
7.11 Deliverable 25: Converting spacyr Tokens for quanteda
7.12 Deliverable 26: Using spacyr with tidytext
7.13 Deliverable 27: POS Filtering
7.14 Deliverable 28: Finalizing the SpaCy Connection
8 Part 4: Conducting NLP Analysis in Python
8.1 Deliverable 29: Importing the nltk Package and Downloading Necessary POS Taggers
8.2 Deliverable 30: Downloading the Necessary Datasets
8.3 Deliverable 31: Tokenizing the Text
8.4 Deliverable 32: Perform POS Tagging and NER
8.5 Deliverable 33: Perform NER
8.6 Deliverable 34: Visualizing the Results

1 Lab Overview

1.1 Technical Learning Objectives

1. Understand text parsing and Parts of Speech (POS) tagging


2. Understand the concept of Maximum Entropy (maxent) and how to conduct POS tagging using the various Maxent Annotator options
3. Understand how to interpret the Penn Treebank POS tag codes
4. Become familiar with some of the main NLP packages available for R (such as NLP, openNLP, coreNLP, and cleanNLP) and the powerful integration with the Python spaCy NLP library via reticulate and spacyr, with the option to use Python directly, especially the nltk and spaCy libraries.
5. Understand NLP in quanteda and tidytext.

1.2 Business Learning Objectives

1. Understand NLP alternative to the “Bag of Words” approach to text mining


2. Understand what you gain and lose in POS tagging and NLP approaches.
3. Understand the power of Named Entity Recognition (NER) for identifying dates, locations, money, organizations, people, percentages, and much more in a text corpus
4. Understand how to integrate the Python spaCy library into R using reticulate and the spacyr package, or use it natively in Python.
5. Understand how to select the NLP packages to use for your needs

2 Assignment Overview

In this lab, you will transition from our statistical text mining approach, using Bag-of-Words
(BoW) into a focus on Natural Language Processing (NLP) techniques. You will learn how to
parse text and do Parts of Speech (POS) tagging required for NLP. You will use that tagged
text to do Named Entity Recognition (NER). You will learn how to use the spaCy library
in Python and how to integrate it into R using the reticulate and spacyr packages. You will
also learn about the main NLP packages available for R, such as NLP, openNLP, coreNLP,
and cleanNLP, and how to use them. Finally, you will learn about NLP in quanteda and
tidytext.
Remember, the remaining three labs build on what you have learned and contain more
advanced techniques. While these techniques are very exciting, and some of you may choose to
incorporate them into your final projects, these labs are not required to meet the MVP for this
course. To receive minimum credit for the lab, you need only review the Lab Instructions and
submit a basic RMD file knitted to PDF. However, many of you want to learn and practice
these techniques. If so, these three advanced labs are for you. To receive full credit for the lab
please try any of the techniques included and submit the knitted PDF.
There are four main parts to the lab:
Part 1: Parts of Speech Tagging, will guide you through the process of parsing your text
and tagging the relevant parts of speech using openNLP, cleanNLP, and coreNLP.
Part 2: Applying Named Entity Recognition, introduces you to the concept of Maximum
Entropy (Maxent) and applies the maxent annotations for persons, locations, organizations,
and several others to the corpus.
Part 3: Conducting NLP Analysis with spacyr, provides an introduction to NLP analysis in the quanteda and tidytext ecosystems. In quanteda and tidytext, NLP requires a tight integration with spacyr, the R environment for the powerful Python spaCy library.
Part 4: Conducting NLP Analysis in Python, provides an introduction to NLP analysis
in Python using the nltk package and some of the built-in datasets.

3 Pre-Lab Instructions

Pre-Lab Instructions (to be completed before class):


Create Your Lab 8 Project in RStudio and a new Quarto document with a relevant title for
the lab, for example: “Lab 8: NLP, POS, and NER in R and Python”.
Download the lab files from the course GitHub repo and save them in your Lab 8 project folder. The materials for Lab 8 are contained in a zipped file with the Lab 8 instructions and the Lab 8 data, which sits in a folder named "C8_final_txts" and contains several unstructured files ending with a .txt file extension. Unzip the file and save the files in your Lab 8 project folder.
Now begin working your way through the Lab 8 instructions taking a literate programming
approach. Remember to use your Quarto document to write your code and your explanations.
Use your own words to explain what you are doing and why you are doing it. You may
complete the pre-lab instructions or lab before class, or wait until class on Wednesday.
As usual, plan to submit your rendered pdf file, with your Quarto YAML header set to:

echo: true
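For example, a minimal Quarto YAML header consistent with this setting might look like the following (the title and format values are illustrative; adjust them to your own document):

---
title: "Lab 8: NLP, POS, and NER in R and Python"
format: pdf
execute:
  echo: true
---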

For this lab you will be using several new packages in R and Python. You will need to install the following packages in R: openNLP, sentimentr, coreNLP, cleanNLP, magrittr, NLP, gridExtra, ggthemes, purrr, doBy, cshapes, rJava, sotu, spacyr, tinytex, and sf. Here again is a way to install multiple packages at once using the concatenate function c() in R:

install.packages(c("openNLP", "sentimentr", "coreNLP", "cleanNLP", "magrittr", "NLP",
                   "gridExtra", "ggthemes", "purrr", "doBy", "cshapes", "rJava",
                   "sotu", "spacyr", "tinytex", "sf"))

If you haven't done so already, you need to install openNLPmodels.en from an outside repository. Use the install.packages() function with "openNLPmodels.en" in the argument, but for repos = add "http://datacube.wu.ac.at/" and for type = add "source". Remember to put a comma between the elements of your argument.

#install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at/", type = "source")

Now, add an R code chunk to set up your R system environment as follows:

options(stringsAsFactors = FALSE)
Sys.setlocale("LC_ALL","C")
lib.loc <- "~/R/win-library/3.2" # optional: path to your local R library, if needed

[1] "C/C/C/C/C/en_US.UTF-8"

You will also need to install the spacyr package in R. You can install the spacyr package from
CRAN using the install.packages() function.

install.packages("spacyr")

You will need to load the required libraries in a specific order. Begin by loading the following packages: gridExtra, ggmap, ggthemes, and sf.
These packages must be loaded before openNLP because ggplot2 (which loads when ggthemes loads) has a conflict with openNLP over a function called annotate(), and we want to use the function from openNLP (not ggplot2).
Then load the following remaining packages: NLP, openNLP, openNLPmodels.en, pbapply, stringr, rvest, doBy, tm, cshapes, purrr, dplyr, spacyr, and tinytex.
These are some additional packages you may want to load: rJava, coreNLP, sentimentr, cleanNLP, dplyr, and magrittr.
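A sketch of that loading sequence is below (a minimal sketch; load only the packages you have actually installed):

# Load these first so that openNLP's annotate() masks ggplot2's annotate()
library(gridExtra)
library(ggmap)
library(ggthemes)
library(sf)

# Then load the NLP-related and remaining packages
library(NLP)
library(openNLP)
library(openNLPmodels.en)
library(pbapply)
library(stringr)
library(rvest)
library(doBy)
library(tm)
library(cshapes)
library(purrr)
library(dplyr)
library(spacyr)
library(tinytex)

# Optional additional packages: rJava, coreNLP, sentimentr, cleanNLP, magrittr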
You will also need to install the following packages into your Python environment: nltk,
pandas, numpy, and matplotlib. You can install these packages using Anaconda Navigator.
Just a note on rJava
When you load rJava, you may have an issue with not having a Java Virtual Machine (JVM)
for Java Runtime Environment (JRE) loaded on your computer. If so, you may go to Java.com
to download the appropriate version. Here is the link to install on a Mac: https://www.java.
com/en/download/mac_download.jsp. However, please note, openNLP may not be compatible
with the most recent version of Java 8 (12), so you may need to use the last stable version of
Java 8 (11). This issue may be resolved by the time you work on this lab.
Preparing for the spacyr part of the lab.
To use spacyr, first install the spacyr package. This process also installs the various dependencies for spacyr, including RcppTOML, here, and reticulate. The first time you install this package, it may take a little while, and may present some challenges.
You may install the latest version of spacyr from GitHub, or install from CRAN. If installing
from GitHub, the following is suggested code:

devtools::install_github("quanteda/spacyr", build_vignettes = FALSE)

Now, load the spacyr package.
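For example (a minimal sketch):

library(spacyr)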


Now, you will want to “initialize” spacy, but before you can do that, you have to make sure you
have a miniconda environment on your system. If you would like more detail, the tutorial below
will walk you through loading the miniconda environment: https://spacyr.quanteda.io.
Once you are sure you have a miniconda environment you now want to “install” spacy (not
the spacyr library you already installed). This gets a little complicated, but essentially, what
you need to do is run the following function:

spacy_install()

For some of you, this will work fine with all the defaults.
However, if the default function doesn’t work for you, or if you know you have multiple
installations of Python on your system, you will want to use the argument to point to the
specific version of Python you have installed and/or want to use. For example, since I have
Python 3.11.4 as my primary working version, I set the argument to point to that version of
Python. Below is some sample code with the elements of the argument completed:

spacy_install(
  conda = "auto",
  version = "latest",
  lang_models = "en_core_web_sm",
  python_version = "3.11.4",
  envname = "spacy_condaenv",
  pip = FALSE,
  python_path = NULL,
  prompt = TRUE
)

or, for my updated version using Anaconda, I used the following code:

spacy_install(
  conda = "auto",
  version = "latest",
  lang_models = "en_core_web_sm",
  python_version = "3.11.5",
  envname = "textmining",
  pip = FALSE,
  python_path = "/Users/derrickcogburn/miniconda3/envs/textmining/bin/python",
  prompt = TRUE
)

Be forewarned: this function is going to install quite a few additional and dependent packages, and it will take some time. You definitely want to comment out this function once you have installed spaCy and its dependencies.
You should hopefully eventually get a message similar to the following in your console:
== Download and installation successful You can now load the package via spacy.load(‘en_core_web_sm’)
Installation complete. Condaenv: spacy_condaenv: Language model(s): en_core_web_sm
==
Once you get this message, congratulations! You are ready to initialize spaCy using the spacy_initialize() function with the following code in the argument: model = "en_core_web_sm". Note that the "model" argument calls the smaller version of the English language model (the larger one is "en_core_web_lg"), which you may also use, but it takes longer to process.
Now you are ready to "initialize" spacyr. To do this, use the spacy_initialize() function. In your argument, for model = add: "en_core_web_sm". At the end of the lab you will learn the opposite function, spacy_finalize(), to "detach" spaCy from your environment.
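For example (a minimal sketch):

spacy_initialize(model = "en_core_web_sm")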

successfully initialized (spaCy Version: 3.7.4, language model: en_core_web_sm)

Once you initialize spaCy, you should see something similar to the following result in the console. It indicates spaCy could find the conda environment to use (spacy_condaenv). It also tells you the version of spaCy you are using, the language model it attached, and the Python options you have set up:
==
Found 'spacy_condaenv'. spacyr will use this environment. successfully initialized (spaCy Version: 3.1.3, language model: en_core_web_sm) (python options: type = "condaenv", value = "spacy_condaenv")
==
If your initialization is successful, you are now ready to use the powerful Python-based spacyr
package for Natural Language Processing (NLP) tasks.
*** End of Pre-Lab ***

4 Lab Instructions

Lab Instructions (to be completed during class):


This lab has 34 deliverables. Follow and complete these Lab Instructions before, during, or after the synchronous class.

4.1 Deliverable 1: Get your working directory and paste below:

getwd()

5 Part 1: Parts of Speech (POS) Tagging in openNLP

In this part of the lab, you will use the openNLP package to perform Parts of Speech (POS) tagging on a text. Let's now focus on text parsing, named entity recognition, and understanding a Parts of Speech (POS) tokenizer in the openNLP package.

5.1 Deliverable 2: Create Some Sample Text

First, let’s create some example text to start. Create an object called “s” and assign it the
results of the following code chunk:

s <- paste(c('Pierre Vinken, 61 years old, will join the board as a ',
'nonexecutive director Nov 29.',
'Mr. Vinken is chairman of Elsevier, N.V., ',
'the Dutch publishing group.'),
collapse = '')
s <- as.String(s)
s

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov 29. Mr. Vinke

5.2 Deliverable 3: Create Sentence and Word Token Annotations

Now, let's create sentence and word token annotation objects called sent_token_annotator and word_token_annotator. To do this, you will use the Maxent_Sent_Token_Annotator()
function and the Maxent_Word_Token_Annotator() function from the openNLP package.
Here is the sample code:

sent_token_annotator <- Maxent_Sent_Token_Annotator()


word_token_annotator <- Maxent_Word_Token_Annotator()

Then, create an object called a2 and assign it the results of the following code. And then review a2.

a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
a2

Now, create an object called pos_tag_annotator and assign it the results of the Maxent_POS_Tag_Annotator() function. Create an object called a3 and assign it the results of applying the annotate() function to object s. Use the following code for the argument of the annotate() function: (s, pos_tag_annotator, a2). Then review a3. Next, create an object called a3w and assign it the results of the function subset(), using the following code chunk in your argument: (a3, type == "word"). Then review a3w.
Sample code is below:

pos_tag_annotator <- Maxent_POS_Tag_Annotator()

a3 <- annotate(s, pos_tag_annotator, a2)

a3

a3w <- subset(a3, type == "word")

a3w

Now, create an object called tags, and assign it the results of the following code chunk:

sapply(a3w$features,"[[","POS").

Then review the object tags. Finally, use the table() function to create a table of tags.
Sample code is below:

tags <- sapply(a3w$features,"[[","POS")
tags
table(tags)

For more information on Parts of Speech (POS) tagging, please see: https://repository.upenn.
edu/cgi/viewcontent.cgi?article=1603&context=cis_reports; and for a summary, please see:
https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html.

5.3 Deliverable 3: Extract Tokens/POS Pairs

Now we will extract tokens/POS pairs (all of them):


Here is sample code:

sprintf("%s/%s", s[a3w], tags)

[1] "Pierre/NNP" "Vinken/NNP" ",/," "61/CD"


[5] "years/NNS" "old/JJ" ",/," "will/MD"
[9] "join/VB" "the/DT" "board/NN" "as/IN"
[13] "a/DT" "nonexecutive/JJ" "director/NN" "Nov/NNP"
[17] "29/CD" "./." "Mr./NNP" "Vinken/NNP"
[21] "is/VBZ" "chairman/NN" "of/IN" "Elsevier/NNP"
[25] ",/," "N.V./NNP" ",/," "the/DT"
[29] "Dutch/JJ" "publishing/NN" "group/NN" "./."

6 Part 2: Applying Named Entity Recognition (NER) to the Clinton Email Dataset

In this part of the lab, you will apply Named Entity Recognition (NER) to a subset of the
infamous Clinton Email Dataset. The Clinton Email Dataset is a collection of emails from
Hillary Clinton’s private email server. The dataset is available on Kaggle at: https://www.
kaggle.com/kaggle/hillary-clinton-emails.
We need to scan a folder in our working directory for multiple files representing the Hillary
Clinton emails saved as .txt files. We will be using the list.files function and a wildcard added
to the .txt file extension (*.txt) which will identify any file in that directory that ends in .txt.
We will then create an object called "tmp" that contains the file paths to those emails.
You may want to set your working directory, with the path leading to the folder containing
the emails.
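If needed, here is a hedged example (the path below is hypothetical; substitute the path to your own Lab 8 project folder):

# setwd("~/Documents/Lab8") # hypothetical path; adjust to your own project folder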

6.1 Deliverable 4: Import Data and Organize the Clinton Email Dataset

Here is the sample code:

tmp <- list.files(path = "data/C8_final_txts/", pattern = '*.txt', full.names = T)

emails <- pblapply(tmp, readLines)
names(emails) <- gsub('.txt', '', basename(tmp))

6.2 Deliverable 5: Examine the Clinton Email Dataset

Let’s now examine one (1) email. Use the following code:

emails[[1]]

[1] "UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05758905 Date: 02/
[2] ""
[3] "RELEASE IN FULL"
[4] ""
[5] "From: Sent: To: Subject:"
[6] ""
[7] "H <[email protected] > Monday, July 6, 2009 10:22 AM '[email protected]' Re: Sch
[8] ""
[9] "Either is fine w me."
[10] "Original Message From: Valmoro, Lona J <[email protected]> To: H Cc: Abedin, Huma <Abe
[11] "Secretary Salazar's office just called -- he would like to meeting with you early this
[12] ""
[13] "Lona Valmoro Special Assistant to the Secretary of State 202-647-9071 (direct)"
[14] ""
[15] "UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05758905 Date: 02/
[16] ""
[17] "\f"

6.3 Deliverable 6: Create a Custom Function to Clean the Emails

Now, let’s create a custom function called txtClean to clean the emails.
Sample code is below:

txtClean <- function(x) {
  x <- x[-1]
  x <- paste(x, collapse = " ")
  x <- str_replace_all(x, "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+", "")
  x <- str_replace_all(x, "Doc No.", "")
  x <- str_replace_all(x, "UNCLASSIFIED U.S. Department of State Case No.", "")
  x <- removeNumbers(x)
  x <- as.String(x)
  return(x)
}

6.4 Deliverable 7: Apply the Cleaning Function to the Clinton Email Dataset

Now, as a test, we will apply the cleaning function to one email and observe what happens to
it. Use the following code:

txtClean(emails[[1]])[[1]]

[1] " RELEASE IN FULL From: Sent: To: Subject: H < > Monday, July , : AM '' Re: Schedule

Now, let’s apply the cleaning function to all emails; and then review one email in the list.
Sample code is below:

allEmails <- pblapply(emails,txtClean)

allEmails[[2]][[1]][1]

[1] " From: Sent: To: Subject: H < > Tuesday, July , : AM '' Re: Reminder geithner is in ab

6.5 Deliverable 8: Apply POS Tagging to the Clinton Email Dataset

In this deliverable you will create several objects to use as various types of “annotators”.
Create three objects to use for your POS tagging. Create the following three objects: persons, locations, and organizations. For each of those, assign its respective Maxent entity annotator (e.g., kind = "person", kind = "location", kind = "organization").
Sample code is below:

persons <- Maxent_Entity_Annotator(kind='person')
locations <- Maxent_Entity_Annotator(kind='location')
organizations <- Maxent_Entity_Annotator(kind='organization')

Then create two objects to hold your Maxent sentence and word annotators.
Sample code below:

sentTokenAnnotator <- Maxent_Sent_Token_Annotator(language='en')


wordTokenAnnotator <- Maxent_Word_Token_Annotator(language='en')

Finally, create one more object called posTagAnnotator to hold your Maxent POS tag annotator.
Sample code below:

posTagAnnotator <- Maxent_POS_Tag_Annotator(language='en')

Together, these objects provide entity models for persons, locations, and organizations, as well as individual models for sentences, words, and parts of speech. This code loads the pre-existing feature weights so they can be called by your R session, but it does not yet apply them to any text.

6.6 Deliverable 9: Annotate the Clinton Email Dataset Using a For Loop

For this deliverable, we will work on understanding for loops in R.


In R, a "for loop" iterates over a controlled counter or index that is incremented at each cycle, while a "repeat loop" (with conditional clauses) uses break and next based on the verification of some logical condition.
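As a minimal illustration of the repeat/break pattern mentioned above (purely illustrative, not part of the lab deliverables):

i <- 1
repeat {
  if (i > 3) break # exit the loop once the condition is met
  print(i)
  i <- i + 1
}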
We will now annotate each document in a loop. Sample code is below:

annotationsData <- list()


for (i in 1:length(allEmails)){
print(paste('starting annotations on doc', i))
annotations <- annotate(allEmails[[i]], list(sentTokenAnnotator,
wordTokenAnnotator,
posTagAnnotator,
persons,
locations,
organizations))
annDF <- as.data.frame(annotations)[,2:5]

annDF$features <- unlist(as.character(annDF$features))

annotationsData[[tmp[i]]] <- annDF


print(paste('finished annotations on doc', i))
}

6.7 Deliverable 10: Extract Terms from the Clinton Email Dataset

Annotations have character indices. We will now obtain terms by index from each document
using a NESTED loop.
Sample code is below (run the entire code chunk at once):

allData<- list()
for (i in 1:length(allEmails)){
x <- allEmails[[i]] # get an individual document
y <- annotationsData[[i]] # get an individual doc's annotation information
print(paste('starting document:',i, 'of', length(allEmails)))
# for each row in the annotation information, extract the term by index
POSls <- list()
for(j in 1:nrow(y)){
annoChars <- ((substr(x,y[j,2],y[j,3]))) #substring position
# Organize information in data frame
z <- data.frame(doc_id = i,
type = y[j,1],
start = y[j,2],
end = y[j,3],
features = y[j,4],
text = as.character(annoChars))
POSls[[j]] <- z
#print(paste('getting POS:', j))
}
# Bind each documents annotations & terms from loop into a single DF
docPOS <- do.call(rbind, POSls)
# So each document will have an individual DF of terms, and annotations as a list element
allData[[i]] <- docPOS
}

6.8 Deliverable 11: Subset the Clinton Email Dataset

Now we can subset for each document.

Sample code is below:

people <- pblapply(allData, subset, grepl("*person", features))
locaction <- pblapply(allData, subset, grepl("*location", features))
organization <- pblapply(allData, subset, grepl("*organization", features))

people
locaction # note: "locaction" is deliberately misspelled as an object name
organization

### Or if you prefer to work with flat objects, make it a data frame w/ all info
POSdf <- do.call(rbind, allData)

# Subsetting example w/ 2 conditions: people found in email 1
subset(POSdf, POSdf$doc_id == 1 & grepl("*person", POSdf$features) == T)

6.9 Deliverable 12: Using the Annotate Entities Process

This final data frame contains not only persons, locations, and organizations, but also each
detected sentence, word, and part of speech. To complete the named entity analysis, you may
need to subset the data frame for specific features.
The sample code is below:

annDF

subset(annDF, grepl("*person", annDF$features) == T)
subset(annDF, grepl("*location", annDF$features) == T)
subset(annDF, grepl("*organization", annDF$features) == T)

6.10 Deliverable 13: Annotate Entities with OpenNLP

Now, you may define an annotation sequence, and create a list with the specific openNLP models.
Sample code is below:

annotate.entities <- function(doc, annotation.pipeline) {
annotations <- annotate(doc, annotation.pipeline)
AnnotatedPlainTextDocument(doc, annotations)
}

ner.pipeline <- list(


Maxent_Sent_Token_Annotator(),
Maxent_Word_Token_Annotator(),
Maxent_POS_Tag_Annotator(),
Maxent_Entity_Annotator(kind = "person"),
Maxent_Entity_Annotator(kind = "location"),
Maxent_Entity_Annotator(kind = "organization")
)

6.11 Deliverable 14: Apply the Annotation Function

Now that we have an annotation function that individually calls the models, we need to apply it to the entire email list. Either the lapply() or the pblapply() function will work, but pblapply() is helpful because it provides a progress bar (pb). Note that the pbapply package must be loaded.
Sample code is below:

all.ner <- pblapply(allEmails, annotate.entities, ner.pipeline)

Now, we can extract the useful information and construct a data frame with entity information.
all.ner
all.ner <- pluck(all.ner, "annotations")
all.ner <- pblapply(all.ner, as.data.frame)
#all.ner[[3]][244:250,]
#all.ner <- Map(function(tex,fea,id) cbind(fea, entity = substring(tex,fea$start, fea$end), f

7 Part 3: Conducting NLP Analysis with spacyr

7.1 Deliverable 15: Load the spacyr Package

The spacyr package is a powerful tool for conducting NLP analysis. It is a wrapper for the
spaCy Python library, which is known for its speed and accuracy.

To use the spacyr package, you need to install it and load it into your R environment.
Now, to practice with spacyr, let’s create an object called txt that is assigned the results of
two sample “documents”, d1 and d2. We then use the spacy_parse() function to parse them,
and save in a data.table called parsedtxt, which we can then review.
Sample code is below:

txt <- c(d1 = "spaCy is great at fast natural language processing.",


d2 = "Mr. Smith spent two years in North Carolina.")

parsedtxt <- spacy_parse(txt)

parsedtxt

doc_id sentence_id token_id token lemma pos entity


1 d1 1 1 spaCy spacy INTJ
2 d1 1 2 is be AUX
3 d1 1 3 great great ADJ
4 d1 1 4 at at ADP
5 d1 1 5 fast fast ADJ
6 d1 1 6 natural natural ADJ
7 d1 1 7 language language NOUN
8 d1 1 8 processing processing NOUN
9 d1 1 9 . . PUNCT
10 d2 1 1 Mr. Mr. PROPN
11 d2 1 2 Smith Smith PROPN PERSON_B
12 d2 1 3 spent spend VERB
13 d2 1 4 two two NUM DATE_B
14 d2 1 5 years year NOUN DATE_I
15 d2 1 6 in in ADP
16 d2 1 7 North North PROPN GPE_B
17 d2 1 8 Carolina Carolina PROPN GPE_I
18 d2 1 9 . . PUNCT

7.2 Deliverable 16: Review the Parsed Text

In this analysis, we see two fields added to each token (pos and entity). The pos field returns the parts of speech from the Universal Dependencies tagset; setting tag = TRUE, as below, also returns the more detailed Penn Treebank-style tags. You may adjust the arguments of the spacy_parse() function to determine what you get back. For example, you may choose not to generate the lemma or entity fields with the code chunk below:

spacy_parse(txt, tag = TRUE, entity = FALSE, lemma = FALSE)

doc_id sentence_id token_id token pos tag


1 d1 1 1 spaCy INTJ UH
2 d1 1 2 is AUX VBZ
3 d1 1 3 great ADJ JJ
4 d1 1 4 at ADP IN
5 d1 1 5 fast ADJ JJ
6 d1 1 6 natural ADJ JJ
7 d1 1 7 language NOUN NN
8 d1 1 8 processing NOUN NN
9 d1 1 9 . PUNCT .
10 d2 1 1 Mr. PROPN NNP
11 d2 1 2 Smith PROPN NNP
12 d2 1 3 spent VERB VBD
13 d2 1 4 two NUM CD
14 d2 1 5 years NOUN NNS
15 d2 1 6 in ADP IN
16 d2 1 7 North PROPN NNP
17 d2 1 8 Carolina PROPN NNP
18 d2 1 9 . PUNCT .

You may also parse and POS tag documents in multiple languages using over 70 other language models (some of these include German (de), Spanish (es), Portuguese (pt), French (fr), Italian (it), Dutch (nl), and many others; see https://spacy.io/usage/models#languages for a complete list).

7.3 Deliverable 17: Parse the txt Object with spacyr

You may also use spacyr to directly tokenize the txt object. Use the spacy_tokenize() function with txt in the argument. This approach is designed to match the tokens() function within quanteda.
Sample code is below:

spacy_tokenize(txt)

$d1
[1] "spaCy" "is" "great" "at" "fast"
[6] "natural" "language" "processing" "."

$d2
[1] "Mr." "Smith" "spent" "two" "years" "in" "North"
[8] "Carolina" "."

7.4 Deliverable 18: Tokenize Text into a Data Frame

The default returns a named list (where the document name, e.g. d1 or d2, is the list element name). Or you may specify output as a data.frame. Either way, make sure you have the dplyr library loaded.
Sample code is below:

library(dplyr)

spacy_tokenize(txt, remove_punct = TRUE, output = "data.frame") %>%


tail()

doc_id token
11 d2 spent
12 d2 two
13 d2 years
14 d2 in
15 d2 North
16 d2 Carolina

7.5 Deliverable 19: Extract Named Entities From Parsed Text

You may also use spacyr to extract named entities from the parsed text.
Sample code is below:

parsedtxt <- spacy_parse(txt, lemma = FALSE, entity = TRUE, nounphrase = TRUE)


entity_extract(parsedtxt)

doc_id sentence_id entity entity_type


1 d2 1 Smith PERSON
2 d2 1 North_Carolina GPE

7.6 Deliverable 20: Extract Extended Entity Set

The following approach lets you extract the “extended” entity set.
Sample code is below:

entity_extract(parsedtxt, type = "all")

doc_id sentence_id entity entity_type


1 d2 1 Smith PERSON
2 d2 1 two_years DATE
3 d2 1 North_Carolina GPE

7.7 Deliverable 21: Consolidate Named Entities

Another interesting possibility is to "consolidate" multi-word entities into single tokens using the entity_consolidate() function.
Sample code is below:

entity_consolidate(parsedtxt) %>%
tail()

doc_id sentence_id token_id token pos entity_type


11 d2 1 2 Smith ENTITY PERSON
12 d2 1 3 spent VERB
13 d2 1 4 two_years ENTITY DATE
14 d2 1 5 in ADP
15 d2 1 6 North_Carolina ENTITY GPE
16 d2 1 7 . PUNCT

7.8 Deliverable 22: Extract Noun Phrases

Similarly, spacyr can extract noun phrases.


Sample code is below:

nounphrase_extract(parsedtxt)

doc_id sentence_id nounphrase
1 d1 1 fast_natural_language_processing
2 d2 1 Mr._Smith
3 d2 1 two_years
4 d2 1 North_Carolina

Noun phrases may also be consolidated using the nounphrase_consolidate() function applied
to the parsedtxt object.
Sample code is below:

nounphrase_consolidate(parsedtxt)

doc_id sentence_id token_id token pos


1 d1 1 1 spaCy INTJ
2 d1 1 2 is AUX
3 d1 1 3 great ADJ
4 d1 1 4 at ADP
5 d1 1 5 fast_natural_language_processing nounphrase
6 d1 1 6 . PUNCT
7 d2 1 1 Mr._Smith nounphrase
8 d2 1 2 spent VERB
9 d2 1 3 two_years nounphrase
10 d2 1 4 in ADP
11 d2 1 5 North_Carolina nounphrase
12 d2 1 6 . PUNCT

Deliverable 23: Extract Entities Without Parsing the Entire Text


To only extract entities without parsing the entire text, you may use the spacy_extract_entity()
function applied to the txt object.
Sample code is below:

spacy_extract_entity(txt)

doc_id text ent_type start_id length


1 d2 Smith PERSON 2 1
2 d2 two years DATE 4 2
3 d2 North Carolina GPE 7 2

Similarly, to extract noun phrases without parsing the entire text, you may use the
spacy_extract_nounphrases() function:
Sample code is below:

spacy_extract_nounphrases(txt)

doc_id text root_text start_id root_id length


1 d1 fast natural language processing processing 5 8 4
2 d2 Mr. Smith Smith 1 2 2
3 d2 two years years 4 5 2
4 d2 North Carolina Carolina 7 8 2

For detailed parsing of syntactic dependencies, you may use the dependency=TRUE option in
the argument. Sample code is below:

spacy_parse(txt, dependency = TRUE, lemma = FALSE, pos = FALSE)

doc_id sentence_id token_id token head_token_id dep_rel entity


1 d1 1 1 spaCy 2 nsubj
2 d1 1 2 is 2 ROOT
3 d1 1 3 great 2 acomp
4 d1 1 4 at 2 prep
5 d1 1 5 fast 8 amod
6 d1 1 6 natural 7 amod
7 d1 1 7 language 8 compound
8 d1 1 8 processing 4 pobj
9 d1 1 9 . 2 punct
10 d2 1 1 Mr. 2 compound
11 d2 1 2 Smith 3 nsubj PERSON_B
12 d2 1 3 spent 3 ROOT
13 d2 1 4 two 5 nummod DATE_B
14 d2 1 5 years 3 dobj DATE_I
15 d2 1 6 in 3 prep
16 d2 1 7 North 8 compound GPE_B
17 d2 1 8 Carolina 6 pobj GPE_I
18 d2 1 9 . 3 punct

7.9 Deliverable 23: Extract Additional Attributes

You may also extract additional attributes of spaCy tokens using the additional_attributes
option in the argument.
Sample code is below:

spacy_parse("I have six email addresses, including [email protected].",


additional_attributes = c("like_num", "like_email"),
lemma = FALSE, pos = FALSE, entity = FALSE)

doc_id sentence_id token_id token like_num like_email


1 text1 1 1 I FALSE FALSE
2 text1 1 2 have FALSE FALSE
3 text1 1 3 six TRUE FALSE
4 text1 1 4 email FALSE FALSE
5 text1 1 5 addresses FALSE FALSE
6 text1 1 6 , FALSE FALSE
7 text1 1 7 including FALSE FALSE
8 text1 1 8 [email protected] FALSE TRUE
9 text1 1 9 . FALSE FALSE

Take a moment to review this output and discuss why this was produced from the parsing
request.

7.10 Deliverable 24: Integrating spacyr Output with quanteda

You may also integrate the output from spacyr directly into quanteda. Sample code is below:

library(quanteda, warn.conflicts = FALSE, quietly = TRUE)

# To identify the names of the documents


docnames(parsedtxt)

# To count the number of tokens in the documents


ntoken(parsedtxt)

# To count the number or types of tokens


ntype(parsedtxt)

Package version: 4.1.0
Unicode version: 14.0
ICU version: 71.1

Parallel computing: disabled

See https://quanteda.io for tutorials and examples.

[1] "d1" "d2"

d1 d2
9 9

d1 d2
9 9

7.11 Deliverable 25: Converting spacyr Tokens for quanteda

You may also convert the spacyr parsed output into quanteda tokens; spaCy's tokenizers are "smarter" than the purely syntactic, pattern-based tokenizers used by quanteda.
Sample code is below:

parsedtxt <- spacy_parse(txt, pos = TRUE, tag = TRUE)


as.tokens(parsedtxt)

Tokens consisting of 2 documents.


d1 :
[1] "spaCy" "is" "great" "at" "fast"
[6] "natural" "language" "processing" "."

d2 :
[1] "Mr." "Smith" "spent" "two" "years" "in" "North"
[8] "Carolina" "."

If you want to select only nouns, using “glob” pattern matching with quanteda, you may use
the tokens_select() function.
Sample code is below:

spacy_parse("The cat in the hat ate green eggs and ham.", pos = TRUE) %>%
as.tokens(include_pos = "pos") %>%
tokens_select(pattern = c("*/NOUN"))

Tokens consisting of 1 document.


text1 :
[1] "cat/NOUN" "hat/NOUN" "eggs/NOUN"

You may also directly convert the spaCy-based tokens. Sample code is below:

spacy_tokenize(txt) %>%
as.tokens()

Tokens consisting of 2 documents.


d1 :
[1] "spaCy" "is" "great" "at" "fast"
[6] "natural" "language" "processing" "."

d2 :
[1] "Mr." "Smith" "spent" "two" "years" "in" "North"
[8] "Carolina" "."

You may also do this for sentences, for which spaCy is very smart.
Sample code is below:

txt2 <- "A Ph.D. in Washington D.C. Mr. Smith went to Washington."
spacy_tokenize(txt2, what = "sentence") %>%
as.tokens()

Tokens consisting of 1 document.


text1 :
[1] "A Ph.D. in Washington D.C. Mr. Smith went to Washington."

This also works well with entity recognition.


Sample code is below:

spacy_parse(txt, entity = TRUE) %>%


entity_consolidate() %>%
as.tokens() %>%
head(1)

Tokens consisting of 1 document.
d1 :
[1] "spaCy" "is" "great" "at" "fast"
[6] "natural" "language" "processing" "."

7.12 Deliverable 26: Using spacyr with tidytext

The spacyr package also works well with tidytext.


Sample code is below:

library("tidytext")

unnest_tokens(parsedtxt, word, token) %>%


dplyr::anti_join(stop_words)

Joining with `by = join_by(word)`

doc_id sentence_id token_id lemma pos tag entity word


1 d1 1 1 spacy INTJ UH spacy
2 d1 1 5 fast ADJ JJ fast
3 d1 1 6 natural ADJ JJ natural
4 d1 1 7 language NOUN NN language
5 d1 1 8 processing NOUN NN processing
6 d2 1 2 Smith PROPN NNP PERSON_B smith
7 d2 1 3 spend VERB VBD spent
8 d2 1 7 North PROPN NNP GPE_B north
9 d2 1 8 Carolina PROPN NNP GPE_I carolina

7.13 Deliverable 27: POS Filtering

We can then use POS filtering using dplyr.


Sample code is below

spacy_parse("The cat in the hat ate green eggs and ham.", pos = TRUE) %>%
unnest_tokens(word, token) %>%
dplyr::filter(pos == "NOUN")

doc_id sentence_id token_id lemma pos entity word
1 text1 1 2 cat NOUN cat
2 text1 1 5 hat NOUN hat
3 text1 1 8 egg NOUN eggs

7.14 Deliverable 28: Finalizing the SpaCy Connection

Since spacy_initialize() attaches a background spaCy process in Python space, it takes up a significant amount of memory (especially when using a large language model such as en_core_web_lg). So, when you no longer need the connection to spaCy, you may remove the spaCy object by calling the spacy_finalize() function.
And, when you are ready to reattach the back-end spaCy object, you call spacy_initialize() again.
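For example (a minimal sketch):

spacy_finalize()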

8 Part 4: Conducting NLP Analysis in Python

In this final section, we will conduct NLP analysis in Python. We will use the nltk package and
some of the built-in datasets to conduct our analysis. Make sure the nltk package is installed
into the Python environment you are using for this lab.

8.1 Deliverable 29: Importing the nltk Package and Downloading Necessary
POS Taggers

For this part of the lab you will need to import the nltk package and download punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words. These downloads are necessary for the lab: punkt is a pre-trained model that helps you tokenize sentences and words, averaged_perceptron_tagger is a pre-trained model that helps you tag parts of speech, maxent_ne_chunker is a pre-trained model that helps you identify named entities, and words is a dataset that contains a list of words.
You can do this by running the following sample code in a Python code chunk:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

True

True

True

True

8.2 Deliverable 30: Downloading the Necessary Datasets

You will also need to download the necessary datasets from nltk which are required for the lab. NLTK comes with several built-in datasets. We will use the "state_union" corpus, a collection of US Presidential State of the Union addresses. We will use the 2006 address by President George W. Bush. Sample code is below:

from nltk.corpus import state_union


nltk.download('state_union')
# Load a sample text
sample_text = state_union.raw("2006-GWBush.txt")
print(sample_text)

True

PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE

January 31, 2006

THE PRESIDENT: Thank you all. Mr. Speaker, Vice President Cheney, members of Congress, member

President George W. Bush reacts to applause during his State of the Union Address at the Capi

In a system of two parties, two chambers, and two elected branches, there will always be diff

In this decisive year, you and I will make choices that determine both the future and the cha

Abroad, our nation is committed to an historic, long-term goal -- we seek the end of tyranny

Far from being a hopeless dream, the advance of freedom is the great story of our time. In 19

President George W. Bush delivers his State of the Union Address at the Capitol, Tuesday, Jan

Their aim is to seize power in Iraq, and use it as a safe haven to launch attacks against Ame

In a time of testing, we cannot find security by abandoning our commitments and retreating wi

America rejects the false comfort of isolationism. We are the nation that saved liberty in Eu

President George W. Bush greets members of Congress after his State of the Union Address at t

Second, we're continuing reconstruction efforts, and helping the Iraqi government to fight co

Our work in Iraq is difficult because our enemy is brutal. But that brutality has not stopped

The road of victory is the road that will take our troops home. As we make progress on the gr

Our coalition has learned from our experience in Iraq. We've adjusted our military tactics an

With so much in the balance, those of us in public office have a duty to speak with candor. A

Laura Bush is applauded as she is introduced Tuesday evening, Jan. 31, 2006 during the State

Staff Sergeant Dan Clay's wife, Lisa, and his mom and dad, Sara Jo and Bud, are with us this

Our nation is grateful to the fallen, who live in the memory of our country. We're grateful t

Our offensive against terror involves more than military action. Ultimately, the only way to

The great people of Egypt have voted in a multi-party presidential election -- and now their

President George W. Bush waves toward the upper visitors gallery of the House Chamber followi

Tonight, let me speak directly to the citizens of Iran: America respects you, and we respect

To overcome dangers in our world, we must also take the offensive by encouraging economic pro

In recent years, you and I have taken unprecedented action to fight AIDS and malaria, expand

Our country must also remain on the offensive against terrorism here at home. The enemy has n

It is said that prior to the attacks of September the 11th, our government failed to connect

In all these areas -- from the disruption of terror networks, to victory in Iraq, to the spre

Our own generation is in a long war against a determined enemy -- a war that will be fought b

Here at home, America also has a great opportunity: We will build the prosperity of our count

Our economy is healthy and vigorous, and growing faster than other major industrialized natio

The American economy is preeminent, but we cannot afford to be complacent. In a dynamic world

Tonight I will set out a better path: an agenda for a nation that competes with confidence; a

Keeping America competitive begins with keeping our economy growing. And our economy grows wh

Keeping America competitive requires us to be good stewards of tax dollars. Every year of my

I am pleased that members of Congress are working on earmark reform, because the federal budg

We must also confront the larger challenge of mandatory spending, or entitlements. This year,

So tonight, I ask you to join me in creating a commission to examine the full impact of baby

Keeping America competitive requires us to open more markets for all that Americans make and

Keeping America competitive requires an immigration system that upholds our laws, reflects ou

Keeping America competitive requires affordable health care. (Applause.) Our government has a

We will make wider use of electronic records and other health information technology, to help

Keeping America competitive requires affordable energy. And here we have a serious problem: A

So tonight, I announce the Advanced Energy Initiative -- a 22-percent increase in clean-energ

We must also change how we power our automobiles. We will increase our research in better bat

Breakthroughs on this and other new technologies will help us reach another great goal: to re

And to keep America competitive, one commitment is necessary above all: We must continue to l

First, I propose to double the federal commitment to the most critical basic research program

Second, I propose to make permanent the research and development tax credit -- (applause) --

Third, we need to encourage children to take more math and science, and to make sure those co

Preparing our nation to compete in the world is a goal that all of us can share. I urge you t

America is a great force for freedom and prosperity. Yet our greatness is not measured in pow

In recent years, America has become a more hopeful nation. Violent crime rates have fallen to

These gains are evidence of a quiet transformation -- a revolution of conscience, in which a

Yet many Americans, especially parents, still have deep concerns about the direction of our c

As we look at these challenges, we must never give in to the belief that America is in declin

A hopeful society depends on courts that deliver equal justice under the law. The Supreme Cou

Today marks the official retirement of a very special American. For 24 years of faithful serv

A hopeful society has institutions of science and medicine that do not cut ethical corners, a

A hopeful society expects elected officials to uphold the public trust. (Applause.) Honorable

As we renew the promise of our institutions, let us also show the character of America in our

A hopeful society gives special attention to children who lack direction and love. Through th

A hopeful society comes to the aid of fellow citizens in times of suffering and emergency --

In New Orleans and in other places, many of our fellow citizens have felt excluded from the p

A hopeful society acts boldly to fight diseases like HIV/AIDS, which can be prevented, and tr

Fellow citizens, we've been called to leadership in a period of consequence. We've entered a

Lincoln could have accepted peace at the cost of disunity and continued slavery. Martin Luthe

Before history is written down in books, it is written in courage. Like Americans before us,

May God bless America. (Applause.)

8.3 Deliverable 31: Tokenizing the Text

We want to perform POS tagging and NER, but before we can do that, we need to tokenize
the text, splitting it into sentences, and then into words. Sample code is below:

from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenize the sentences


sentences = sent_tokenize(sample_text)

# Tokenize the words


#words = word_tokenize(sample_text)
words = [word_tokenize(sentence) for sentence in sentences]

8.4 Deliverable 32: Perform POS Tagging and NER

We can now perform POS tagging.


Sample code is below:

pos_tagged = [nltk.pos_tag(word) for word in words]


print(pos_tagged)

[[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('A

8.5 Deliverable 33: Perform NER

Now, let’s perform NER. Sample code is below:

named_entities = []
for tagged_sentence in pos_tagged:
chunked_sentence = nltk.ne_chunk(tagged_sentence, binary = True)
named_entities.extend([chunk for chunk in chunked_sentence if hasattr(chunk, 'label')])
print(named_entities[:10])

[Tree('NE', [('GEORGE', 'NNP')]), Tree('NE', [('ADDRESS', 'NNP')]), Tree('NE', [('THE', 'NNP'

8.6 Deliverable 34: Visualizing the Results

Now, we can visualize these results. We will use matplotlib to visualize the POS tagging
and NER. If you have not already installed the matplotlib package, you may do so within
Anaconda Navigator.
First, let’s create a frequency distribution of the POS tags, and then plot this distribution.
Sample code is below:

import nltk
from nltk import FreqDist
import matplotlib.pyplot as plt

# Assuming `pos_tagged` contains your POS-tagged text


pos_tags = [tag for sentence in pos_tagged for _, tag in sentence]

# Frequency distribution of POS tags


pos_freq = FreqDist(pos_tags)

# Creating a bar plot for POS tags frequency


plt.figure(figsize=(12, 8))
pos_freq.plot(30, cumulative=False)
plt.show()

[Figure: bar plot of the frequency distribution of the 30 most common POS tags (e.g. NN, IN, DT, JJ, NNP, NNS, ...), with counts on the y-axis and tag samples on the x-axis]
Now, we will visualize the NER results. We will create a frequency distribution of the NER
tags, and then plot this distribution.
Sample code is below:

import nltk
from nltk import FreqDist
import matplotlib.pyplot as plt

# Assuming `named_entities` contains the NE chunks extracted above
entity_strings = [" ".join(word for word, tag in tree.leaves()) for tree in named_entities]

# Frequency distribution of the named entities
ner_freq = FreqDist(entity_strings)

# Creating a bar plot of the 30 most frequent named entities
plt.figure(figsize=(12, 8))
ner_freq.plot(30, cumulative=False)
plt.show()

[Figure: bar plot of the resulting frequency distribution, with counts on the y-axis]

*** End of Lab ***
Please render your Quarto file to pdf and submit to the assignment for Lab 8 within Canvas.
