Lab8 Instructions
2024-10-20
Table of contents
1 Lab Overview
1.1 Technical Learning Objectives
1.2 Business Learning Objectives
2 Assignment Overview
3 Pre-Lab Instructions
4 Lab Instructions
4.1 Deliverable 1: Get your working directory and paste below
6 Part 2: Applying Named Entity Recognition (NER) to the Clinton Email Dataset
6.1 Deliverable 4: Import Data and Organize the Clinton Email Dataset
6.2 Deliverable 5: Examine the Clinton Email Dataset
6.3 Deliverable 6: Create a Custom Function to Clean the Emails
6.4 Deliverable 7: Apply the Cleaning Function to the Clinton Email Dataset
6.5 Deliverable 8: Apply POS Tagging to the Clinton Email Dataset
6.6 Deliverable 9: Annotate the Clinton Email Dataset Using a For Loop
6.7 Deliverable 10: Extract Terms from the Clinton Email Dataset
6.8 Deliverable 11: Subset the Clinton Email Dataset
6.9 Deliverable 12: Using the Annotate Entities Process
6.10 Deliverable 13: Annotate Entities with OpenNLP
6.11 Deliverable 14: Apply the Annotation Function
1 Lab Overview
1.2 Business Learning Objectives
2 Assignment Overview
In this lab, you will transition from our statistical text mining approach using Bag-of-Words
(BoW) to a focus on Natural Language Processing (NLP) techniques. You will learn how to
parse text and do the Parts of Speech (POS) tagging required for NLP. You will use that tagged
text to do Named Entity Recognition (NER). You will learn how to use the spaCy library
in Python and how to integrate it into R using the reticulate and spacyr packages. You will
also learn about the main NLP packages available for R, such as NLP, openNLP, coreNLP,
and cleanNLP, and how to use them. Finally, you will learn about NLP in quanteda and
tidytext.
Remember, the remaining three labs build on what you have learned and contain more
advanced techniques. While these techniques are very exciting, and some of you may choose to
incorporate them into your final projects, these labs are not required to meet the MVP for this
course. To receive minimum credit for the lab, you need only review the Lab Instructions and
submit a basic RMD file knitted to PDF. However, many of you will want to learn and practice
these techniques; if so, these three advanced labs are for you. To receive full credit for the lab,
please try any of the techniques included and submit the knitted PDF.
There are four main parts to the lab:
Part 1: Parts of Speech Tagging, will guide you through the process of parsing your text
and tagging the relevant parts of speech using openNLP, cleanNLP, and coreNLP.
Part 2: Applying Named Entity Recognition, introduces you to the concept of Maximum
Entropy (Maxent) and applies the maxent annotations for persons, locations, organizations,
and several others to the corpus.
Part 3: Conducting NLP Analysis with spacyr, provides an introduction to NLP analysis
in the quanteda and tidytext ecosystems. In quanteda and tidytext, NLP requires a tight
integration with spacyr, the R wrapper for the powerful Python spaCy library.
Part 4: Conducting NLP Analysis in Python, provides an introduction to NLP analysis
in Python using the nltk package and some of the built-in datasets.
3 Pre-Lab Instructions
For this lab you will be using several new packages in R and Python. You will need to
install the following packages in R: openNLP, sentimentr, coreNLP, cleanNLP, magrittr, NLP,
gridExtra, ggthemes, purrr, doBy, cshapes, rJava, sotu, spacyr, tinytex, and sf. Here again
is a way to install multiple packages at once, using the concatenate function c() in R:
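Below is a sketch of that installation call, built from the package list above:

install.packages(c("openNLP", "sentimentr", "coreNLP", "cleanNLP", "magrittr",
                   "NLP", "gridExtra", "ggthemes", "purrr", "doBy", "cshapes",
                   "rJava", "sotu", "spacyr", "tinytex", "sf"))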
If you haven't done so already, you need to install openNLPmodels.en from an outside
repository. Use the install.packages() function with "openNLPmodels.en" in the argument,
but add repos = "http://datacube.wu.ac.at/" and type = "source". Remember to put a
comma between the elements of your argument.
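Putting those argument elements together, the call looks like this:

install.packages("openNLPmodels.en",
                 repos = "http://datacube.wu.ac.at/",
                 type = "source")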
Now, add an R code chunk to set up your R system environment as follows:
options(stringsAsFactors = FALSE)
Sys.setlocale("LC_ALL","C")
lib.loc = "~/R/win-library/3.2"
[1] "C/C/C/C/C/en_US.UTF-8"
You will also need to install the spacyr package in R. You can install the spacyr package from
CRAN using the install.packages() function.
install.packages("spacyr")
You will need to load the required libraries in a specific order. Begin by loading the following
packages: gridExtra, ggmap, ggthemes, and sf.
These packages must be loaded before loading openNLP because ggplot2 (which will load when
ggthemes loads) has a conflict with openNLP over a function called annotate(), and we want
to use the annotate() function from openNLP (not ggplot2).
Then load the following remaining packages: NLP, openNLP, openNLPmodels.en, pbapply,
stringr, rvest, doBy, tm, cshapes, purrr, dplyr, spacyr, tinytex.
These are some additional packages you may want to load: rJava, coreNLP, sentimentr,
cleanNLP, dplyr, magrittr.
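A sketch of the full loading sequence described above (the optional packages are commented out):

library(gridExtra)
library(ggmap)
library(ggthemes)
library(sf)
# now it is safe to load the NLP packages
library(NLP)
library(openNLP)
library(openNLPmodels.en)
library(pbapply)
library(stringr)
library(rvest)
library(doBy)
library(tm)
library(cshapes)
library(purrr)
library(dplyr)
library(spacyr)
library(tinytex)
# optional additional packages:
# library(rJava); library(coreNLP); library(sentimentr); library(cleanNLP); library(magrittr)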
You will also need to install the following packages into your Python environment: nltk,
pandas, numpy, and matplotlib. You can install these packages using Anaconda Navigator.
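If you prefer the command line to Anaconda Navigator, an equivalent conda command would be:

conda install nltk pandas numpy matplotlib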
Just a note on rJava
When you load rJava, you may have an issue if you do not have a Java Virtual Machine (JVM)
for the Java Runtime Environment (JRE) loaded on your computer. If so, you may go to Java.com
to download the appropriate version. Here is the link to install on a Mac:
https://www.java.com/en/download/mac_download.jsp. However, please note, openNLP may not be
compatible with the most recent version of Java 8 (12), so you may need to use the last stable
version of Java 8 (11). This issue may be resolved by the time you work on this lab.
Preparing for the spacyr part of the lab.
To use spacyr, first install the spacyr package. This process also installs the various depen-
dencies for spacyr, including: RcppTOML, here, and reticulate. The first time you install this
package, it may take a little while, and may present some challenges.
You may install the latest version of spacyr from GitHub, or install from CRAN. If installing
from GitHub, the following is suggested code:
devtools::install_github("quanteda/spacyr", build_vignettes = FALSE)
spacy_install()
For some of you, this will work fine with all the defaults.
However, if the default function doesn’t work for you, or if you know you have multiple
installations of Python on your system, you will want to use the argument to point to the
specific version of Python you have installed and/or want to use. For example, since I have
Python 3.11.4 as my primary working version, I set the argument to point to that version of
Python. Below is some sample code with the elements of the argument completed:
spacy_install(
conda = "auto",
version = "latest",
lang_models = "en_core_web_sm",
python_version = "3.11.4",
envname = "spacy_condaenv",
pip = FALSE,
python_path = NULL,
prompt = TRUE
)
or, for my updated version using Anaconda, I used the following code:
spacy_install(
conda = "auto",
version = "latest",
lang_models = "en_core_web_sm",
python_version = "3.11.5",
envname = "textmining",
pip = FALSE,
python_path = "/Users/derrickcogburn/miniconda3/envs/textmining/bin/python",
prompt = TRUE
)
Be forewarned, this function is going to install quite a few additional dependent packages,
and it will take some time. You definitely want to comment out this function once you have
installed spaCy and its dependencies.
You should eventually get a message similar to the following in your console:
==
Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Installation complete. Condaenv: spacy_condaenv; Language model(s): en_core_web_sm
==
Once you get this message, congratulations! You are ready to initialize spaCy using the
spacy_initialize() function with the following code in the argument: model = "en_core_web_sm".
Note the "model" in the argument is calling the smaller version of the English language model
(the larger one is "en_core_web_lg"), which you may also use, but it takes longer to
process.
Now you are ready to "initialize" spacyr. To do this, use the spacy_initialize() function. In
your argument, for model= add: "en_core_web_sm". At the end of the lab you will learn the
opposite function, spacy_finalize(), to "detach" spaCy from your environment.
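The initialization call therefore looks like this:

spacy_initialize(model = "en_core_web_sm")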
Once you initialize spaCy, you should see something similar to the following result in the
console. It indicates spacyr could find the conda environment to use (spacy_condaenv).
It also tells you the version of spaCy you are using, the language model it attached, and the
Python options you have set up:
==
Found 'spacy_condaenv'. spacyr will use this environment
successfully initialized (spaCy Version: 3.1.3, language model: en_core_web_sm)
(python options: type = "condaenv", value = "spacy_condaenv")
==
If your initialization is successful, you are now ready to use the powerful Python-based spacyr
package for Natural Language Processing (NLP) tasks.
*** End of Pre-Lab ***
4 Lab Instructions
4.1 Deliverable 1: Get your working directory and paste below:
getwd()
5 Part 1: Parts of Speech Tagging
In this part of the lab, you will use the openNLP package to perform Parts of Speech (POS)
tagging on a text. Let's now focus on text parsing, named entity recognition, and understanding
a Parts of Speech (POS) tokenizer in the openNLP package.
First, let’s create some example text to start. Create an object called “s” and assign it the
results of the following code chunk:
s <- paste(c('Pierre Vinken, 61 years old, will join the board as a ',
'nonexecutive director Nov 29.',
'Mr. Vinken is chairman of Elsevier, N.V., ',
'the Dutch publishing group.'),
collapse = '')
s <- as.String(s)
s
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov 29. Mr. Vinken is chairman of Elsevier, N.V., the Dutch publishing group.
5.2 Deliverable 3: Create Sentence and Word Token Annotations
Now, let's create sentence and word token annotation objects called sent_token_annotator
and word_token_annotator. To do this, you will use the Maxent_Sent_Token_Annotator()
function and the Maxent_Word_Token_Annotator() function from the openNLP package.
Here is the sample code:
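Based on those instructions, a minimal version would be:

sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()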
Then, create an object called a2, assign it the results of the following code, and
review a2.
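A minimal version, annotating s with the two annotators just created:

a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
a2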
Now, create an object called pos_tag_annotator and assign it the results of the Max-
ent_POS_Tag_Annotator() function. Create an object called a3 and assign it the results
of applying annotate() function to object s. Use the following code for the argument of the
annotate() function: (s, pos_tag_annotator, a2). Then review a3. Next, create an object
called a3w and assign it the results of the function subset(), using the following code chunk
in your argument:(a3, type == “word”). Then review a3w.
Sample code is below:
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
a3 <- annotate(s, pos_tag_annotator, a2)
a3
a3w <- subset(a3, type == "word")
a3w
Now, create an object called tags, and assign it the results of the following code chunk:
sapply(a3w$features, "[[", "POS"). Then review the object tags. Finally, use the table()
function to create a table of tags.
Sample code is below:
tags <- sapply(a3w$features,"[[","POS")
tags
table(tags)
For more information on Parts of Speech (POS) tagging, please see:
https://repository.upenn.edu/cgi/viewcontent.cgi?article=1603&context=cis_reports; and for a
summary, please see: https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html.
6 Part 2: Applying Named Entity Recognition (NER) to the Clinton Email Dataset
In this part of the lab, you will apply Named Entity Recognition (NER) to a subset of the
infamous Clinton Email Dataset. The Clinton Email Dataset is a collection of emails from
Hillary Clinton's private email server. The dataset is available on Kaggle at:
https://www.kaggle.com/kaggle/hillary-clinton-emails.
We need to scan a folder in our working directory for multiple files representing the Hillary
Clinton emails saved as .txt files. We will be using the list.files() function and a wildcard added
to the .txt file extension (*.txt), which will identify any file in that directory that ends in .txt.
We will then create an object called "temp" that contains those file names, and read the emails
into a list.
You may want to set your working directory, with the path leading to the folder containing
the emails.
6.1 Deliverable 4: Import Data and Organize the Clinton Email Dataset
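A minimal sketch of the import step described above, assuming the .txt files sit in the current
working directory and that each email is read in with readLines() (an assumption):

temp <- list.files(pattern = "*.txt")
emails <- pblapply(temp, readLines)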
6.2 Deliverable 5: Examine the Clinton Email Dataset
Let's now examine one (1) email. Use the following code:
emails[[1]]
[1] "UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05758905 Date: 02/
[2] ""
[3] "RELEASE IN FULL"
[4] ""
[5] "From: Sent: To: Subject:"
[6] ""
[7] "H <[email protected] > Monday, July 6, 2009 10:22 AM '[email protected]' Re: Sch
[8] ""
[9] "Either is fine w me."
[10] "Original Message From: Valmoro, Lona J <[email protected]> To: H Cc: Abedin, Huma <Abe
[11] "Secretary Salazar's office just called -- he would like to meeting with you early this
[12] ""
[13] "Lona Valmoro Special Assistant to the Secretary of State 202-647-9071 (direct)"
[14] ""
[15] "UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05758905 Date: 02/
[16] ""
[17] "\f"
6.3 Deliverable 6: Create a Custom Function to Clean the Emails
Now, let's create a custom function called txtClean to clean the emails.
Sample code is below:
txtClean <- function(x) {
  x <- x[-1]
  x <- paste(x, collapse = " ")
  x <- str_replace_all(x, "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+", "")
  x <- str_replace_all(x, "Doc No.", "")
  x <- str_replace_all(x, "UNCLASSIFIED U.S. Department of State Case No.", "")
  x <- removeNumbers(x)
  x <- as.String(x)
  return(x)
}
6.4 Deliverable 7: Apply the Cleaning Function to the Clinton Email Dataset
Now, as a test, we will apply the cleaning function to one email and observe what happens to
it. Use the following code:
txtClean(emails[[1]])[[1]]
[1] " RELEASE IN FULL From: Sent: To: Subject: H < > Monday, July , : AM '' Re: Schedule
Now, let’s apply the cleaning function to all emails; and then review one email in the list.
Sample code is below:
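A sketch of the apply step, assuming pblapply() from the pbapply package and the txtClean()
function defined above; allEmails is the name implied by the review code below:

allEmails <- pblapply(emails, txtClean)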
allEmails[[2]][[1]][1]
[1] " From: Sent: To: Subject: H < > Tuesday, July , : AM '' Re: Reminder geithner is in ab
6.5 Deliverable 8: Apply POS Tagging to the Clinton Email Dataset
In this deliverable you will create several objects to use as various types of "annotators".
Create three objects to use for your POS tagging: persons, locations, and organizations.
For each of those, assign its respective Maxent annotator (e.g., kind = "person",
kind = "location", kind = "organization").
Sample code is below:
persons <- Maxent_Entity_Annotator(kind='person')
locations <- Maxent_Entity_Annotator(kind='location')
organizations <- Maxent_Entity_Annotator(kind='organization')
Then create two objects to hold your Maxent sentence and word annotators.
Sample code below:
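A sketch, with assumed object names (sentTokenAnnotator and wordTokenAnnotator) that are
reused in the loop in Deliverable 9:

sentTokenAnnotator <- Maxent_Sent_Token_Annotator(language = 'en')
wordTokenAnnotator <- Maxent_Word_Token_Annotator(language = 'en')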
Finally, create one more object called posTagAnnotator to hold your Maxent POS tag anno-
tator.
Sample code below:
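Based on the instructions above:

posTagAnnotator <- Maxent_POS_Tag_Annotator(language = 'en')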
You now have annotators for persons, locations, and organizations, as well as individual
models for sentences, words, and parts of speech. This code loads the pre-existing feature
weights to be called by your R session, but it does not yet apply them to any text.
6.6 Deliverable 9: Annotate the Clinton Email Dataset Using a For Loop
annDF$features <- unlist(as.character(annDF$features))
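The line above sits inside the loop body. A minimal sketch of the full loop, assuming the
annotator objects from Deliverable 8 (sentTokenAnnotator and wordTokenAnnotator are assumed
names) and keeping only the columns the nested loop in Deliverable 10 indexes by position:

annotationsData <- list()
for (i in 1:length(allEmails)){
  print(paste('annotating document:', i, 'of', length(allEmails)))
  anno <- NLP::annotate(allEmails[[i]],
                        list(sentTokenAnnotator, wordTokenAnnotator, posTagAnnotator,
                             persons, locations, organizations))
  # keep type, start, end, and features as columns 1 through 4
  annDF <- as.data.frame(anno)[, c('type', 'start', 'end', 'features')]
  annDF$features <- unlist(as.character(annDF$features))
  annotationsData[[i]] <- annDF
}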
6.7 Deliverable 10: Extract Terms from the Clinton Email Dataset
Annotations have character indices. We will now obtain terms by index from each document
using a NESTED loop.
Sample code is below (run the entire code chunk at once):
allData <- list()
for (i in 1:length(allEmails)){
  x <- allEmails[[i]] # get an individual document
  y <- annotationsData[[i]] # get an individual doc's annotation information
  print(paste('starting document:', i, 'of', length(allEmails)))
  # for each row in the annotation information, extract the term by index
  POSls <- list()
  for(j in 1:nrow(y)){
    annoChars <- ((substr(x, y[j,2], y[j,3]))) # substring position
    # organize the information in a data frame
    z <- data.frame(doc_id = i,
                    type = y[j,1],
                    start = y[j,2],
                    end = y[j,3],
                    features = y[j,4],
                    text = as.character(annoChars))
    POSls[[j]] <- z
    #print(paste('getting POS:', j))
  }
  # bind each document's annotations & terms from the loop into a single DF
  docPOS <- do.call(rbind, POSls)
  # each document will have an individual DF of terms and annotations as a list element
  allData[[i]] <- docPOS
}
6.8 Deliverable 11: Subset the Clinton Email Dataset
Sample code is below:
people
locaction #note in this instance, location is misspelled. this is on purpose since location i
organization
### Or if you prefer to work with flat objects make it a data frame w/all info
POSdf <- do.call(rbind, allData)
# Subsetting example w/2 conditions; people found in email 1; note, do not include the \& in
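A sketch of that two-condition subset (the exact string to match in features depends on your
annotation output, so 'person' here is an assumption):

subset(POSdf, doc_id == 1 & grepl('person', features))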
This final data frame contains not only persons, locations, and organizations, but also each
detected sentence, word, and part of speech. To complete the named entity analysis, you may
need to subset the data frame for specific features.
The sample code is below:
annDF
person
6.9 Deliverable 12: Using the Annotate Entities Process
Now, you may define an annotation sequence, and create a list with the specific openNLP
models.
Sample code is below:
annotate.entities <- function(doc, annotation.pipeline) {
annotations <- annotate(doc, annotation.pipeline)
AnnotatedPlainTextDocument(doc, annotations)
}
Now that we have an annotation function that individually calls the models, we need to apply
it to the entire email list. Either the lapply() or the pblapply() function will work, but
pblapply() is helpful because it provides a progress bar (pb). Note, the pbapply package must
be loaded.
Sample code is below:
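A minimal sketch, assuming a pipeline list built from the six models loaded earlier (the list
composition and object names are assumptions):

ner.pipeline <- list(sentTokenAnnotator, wordTokenAnnotator, posTagAnnotator,
                     persons, locations, organizations)
all.ner <- pblapply(allEmails, annotate.entities, ner.pipeline)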
Now, we can extract the useful information and construct a data frame with entity informa-
tion.
all.ner
all.ner <- pluck(all.ner, "annotations")
all.ner <- pblapply(all.ner, as.data.frame)
#all.ner[[3]][244:250,]
#all.ner <- Map(function(tex,fea,id) cbind(fea, entity = substring(tex,fea$start, fea$end), f
7 Part 3: Conducting NLP Analysis with spacyr
The spacyr package is a powerful tool for conducting NLP analysis. It is a wrapper for the
spaCy Python library, which is known for its speed and accuracy.
To use the spacyr package, you need to install it and load it into your R environment.
Now, to practice with spacyr, let’s create an object called txt that is assigned the results of
two sample “documents”, d1 and d2. We then use the spacy_parse() function to parse them,
and save in a data.table called parsedtxt, which we can then review.
Sample code is below:
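The two sample documents can be reconstructed from the tokenized output shown later in the lab:

txt <- c(d1 = "spaCy is great at fast natural language processing.",
         d2 = "Mr. Smith spent two years in North Carolina.")
parsedtxt <- spacy_parse(txt)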
parsedtxt
In this analysis, we see two annotation fields alongside the tokens: pos and entity. The pos
field returns the parts of speech, and the tag option shown below adds the more detailed Penn
Treebank tagset. You may adjust the arguments of the spacy_parse() function to determine what
you get back. For example, you may choose to not generate the lemma or entity fields with the
code chunk below:
spacy_parse(txt, tag = TRUE, entity = FALSE, lemma = FALSE)
You may also parse and POS tag documents in multiple languages using over 70 other language
models (some of these include German (de), Spanish (es), Portuguese (pt), French (fr),
Italian (it), Dutch (nl), and many others; see https://spacy.io/usage/models#languages for
a complete list).
You may also use spacyr to directly parse the txt object. Use the spacy_tokenize() function
with txt in the argument. This approach is designed to match the tokens() function within
quanteda.
Sample code is below:
spacy_tokenize(txt)
$d1
[1] "spaCy" "is" "great" "at" "fast"
[6] "natural" "language" "processing" "."
$d2
[1] "Mr." "Smith" "spent" "two" "years" "in" "North"
[8] "Carolina" "."
The default returns a named list (where the document name, e.g., d1 or d2, is the list element
name). Or you may specify output as a data.frame. Either way, make sure you have the
dplyr library loaded.
Sample code is below:
library(dplyr)
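The output below is consistent with requesting a data frame and taking its tail:

spacy_tokenize(txt, output = "data.frame") %>%
  tail()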
doc_id token
11 d2 spent
12 d2 two
13 d2 years
14 d2 in
15 d2 North
16 d2 Carolina
You may also use spacyr to extract named entities from the parsed text.
Sample code is below:
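A minimal version uses the entity_extract() function on the parsed object:

entity_extract(parsedtxt)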
7.6 Deliverable 20: Extract Extended Entity Set
The following approach lets you extract the “extended” entity set.
Sample code is below:
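With entity_extract(), the type argument controls the entity set; "all" returns the extended set:

entity_extract(parsedtxt, type = "all")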
Another interesting possibility is to "consolidate" multi-word entities into single tokens using
the entity_consolidate() function.
Sample code is below:
entity_consolidate(parsedtxt) %>%
tail()
You may also extract noun phrases from the parsed text using the nounphrase_extract() function.
Sample code is below:
nounphrase_extract(parsedtxt)
doc_id sentence_id nounphrase
1 d1 1 fast_natural_language_processing
2 d2 1 Mr._Smith
3 d2 1 two_years
4 d2 1 North_Carolina
Noun phrases may also be consolidated using the nounphrase_consolidate() function applied
to the parsedtxt object.
Sample code is below:
nounphrase_consolidate(parsedtxt)
You may also extract entities directly from the text, without first parsing the entire text,
using the spacy_extract_entity() function. Sample code is below:
spacy_extract_entity(txt)
Similarly, to extract noun phrases without parsing the entire text, you may use the
spacy_extract_nounphrases() function:
Sample code is below:
spacy_extract_nounphrases(txt)
For detailed parsing of syntactic dependencies, you may use the dependency=TRUE option in
the argument. Sample code is below:
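A sketch matching that description (dropping the lemma and pos fields here is a stylistic
choice to keep the output readable):

spacy_parse(txt, dependency = TRUE, lemma = FALSE, pos = FALSE)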
7.9 Deliverable 23: Extract Additional Attributes
You may also extract additional attributes of spaCy tokens using the additional_attributes
option in the argument.
Sample code is below:
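A sketch with two assumed attributes (like_num and is_punct are standard spaCy token
attributes; the example sentence is illustrative):

spacy_parse("The one percent own half of it.",
            additional_attributes = c("like_num", "is_punct"),
            lemma = FALSE, pos = FALSE, entity = FALSE)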
Take a moment to review this output and discuss why this was produced from the parsing
request.
You may also integrate the output from spacyr directly into quanteda. Sample code is below:
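A minimal sketch; the ntoken() and ntype() calls are assumptions consistent with the two
d1/d2 count tables shown below:

library(quanteda)
ntoken(parsedtxt)
ntype(parsedtxt)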
Package version: 4.1.0
Unicode version: 14.0
ICU version: 71.1
d1 d2
9 9
d1 d2
9 9
You may also convert spacyr output into quanteda tokens; spacyr's tokenizers are "smarter"
than the purely syntactic pattern-based parsers used by quanteda.
Sample code is below:
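A minimal sketch using quanteda's as.tokens() method for spacyr parsed objects:

as.tokens(parsedtxt)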
d2 :
[1] "Mr." "Smith" "spent" "two" "years" "in" "North"
[8] "Carolina" "."
If you want to select only nouns, using “glob” pattern matching with quanteda, you may use
the tokens_select() function.
Sample code is below:
spacy_parse("The cat in the hat ate green eggs and ham.", pos = TRUE) %>%
as.tokens(include_pos = "pos") %>%
tokens_select(pattern = c("*/NOUN"))
You may also directly convert the spaCy-based tokens. Sample code is below:
spacy_tokenize(txt) %>%
as.tokens()
d2 :
[1] "Mr." "Smith" "spent" "two" "years" "in" "North"
[8] "Carolina" "."
You may also do this for sentences, for which spaCy is very smart.
Sample code is below:
txt2 <- "A Ph.D. in Washington D.C. Mr. Smith went to Washington."
spacy_tokenize(txt2, what = "sentence") %>%
as.tokens()
Tokens consisting of 1 document.
d1 :
[1] "spaCy" "is" "great" "at" "fast"
[6] "natural" "language" "processing" "."
library("tidytext")
spacy_parse("The cat in the hat ate green eggs and ham.", pos = TRUE) %>%
unnest_tokens(word, token) %>%
dplyr::filter(pos == "NOUN")
doc_id sentence_id token_id lemma pos entity word
1 text1 1 2 cat NOUN cat
2 text1 1 5 hat NOUN hat
3 text1 1 8 egg NOUN eggs
Since spacy_initialize() attaches a background process of spaCy in Python space, it takes
up a significant amount of memory (especially when using a large language model such as
en_core_web_lg). So, when you no longer need the connection to spaCy, you may remove
the spaCy object by calling the spacy_finalize() function.
And, when you are ready to reattach the back-end spaCy object, you call spacy_initialize()
again.
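The pair of calls looks like this:

spacy_finalize()
# later, to reattach the back-end spaCy object:
spacy_initialize(model = "en_core_web_sm")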
8 Part 4: Conducting NLP Analysis in Python
In this final section, we will conduct NLP analysis in Python. We will use the nltk package and
some of the built-in datasets to conduct our analysis. Make sure the nltk package is installed
into the Python environment you are using for this lab.
8.1 Deliverable 29: Importing the nltk Package and Downloading Necessary POS Taggers
For this part of the lab you will need to import the nltk package and download punkt,
averaged_perceptron_tagger, maxent_ne_chunker, and words. These downloads are
necessary for the lab: punkt is a pre-trained model that helps you tokenize words;
averaged_perceptron_tagger is a pre-trained model that helps you tag parts of speech;
maxent_ne_chunker is a pre-trained model that helps you identify named entities; and words
is a dataset that contains a list of words.
You can do this by running the following sample code in a Python code chunk:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
True
True
True
True
You will also need to download the necessary datasets from nltk, which are required for the
lab. NLTK comes with several built-in datasets. We will use the "state_union" corpus, a
collection of US Presidential State of the Union addresses; specifically, the 2006 address by
President George W. Bush. Sample code is below:
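A minimal sketch; '2006-GWBush.txt' is the assumed file id for the 2006 address (check
state_union.fileids() if it differs):

import nltk
nltk.download('state_union')
from nltk.corpus import state_union
text = state_union.raw('2006-GWBush.txt')  # file id is an assumption
print(text)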
True
PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE
THE PRESIDENT: Thank you all. Mr. Speaker, Vice President Cheney, members of Congress, member
President George W. Bush reacts to applause during his State of the Union Address at the Capi
In a system of two parties, two chambers, and two elected branches, there will always be diff
In this decisive year, you and I will make choices that determine both the future and the cha
Abroad, our nation is committed to an historic, long-term goal -- we seek the end of tyranny
Far from being a hopeless dream, the advance of freedom is the great story of our time. In 19
President George W. Bush delivers his State of the Union Address at the Capitol, Tuesday, Jan
Their aim is to seize power in Iraq, and use it as a safe haven to launch attacks against Ame
In a time of testing, we cannot find security by abandoning our commitments and retreating wi
America rejects the false comfort of isolationism. We are the nation that saved liberty in Eu
President George W. Bush greets members of Congress after his State of the Union Address at t
Second, we're continuing reconstruction efforts, and helping the Iraqi government to fight co
Our work in Iraq is difficult because our enemy is brutal. But that brutality has not stopped
The road of victory is the road that will take our troops home. As we make progress on the gr
Our coalition has learned from our experience in Iraq. We've adjusted our military tactics an
With so much in the balance, those of us in public office have a duty to speak with candor. A
Laura Bush is applauded as she is introduced Tuesday evening, Jan. 31, 2006 during the State
Staff Sergeant Dan Clay's wife, Lisa, and his mom and dad, Sara Jo and Bud, are with us this
Our nation is grateful to the fallen, who live in the memory of our country. We're grateful t
Our offensive against terror involves more than military action. Ultimately, the only way to
The great people of Egypt have voted in a multi-party presidential election -- and now their
President George W. Bush waves toward the upper visitors gallery of the House Chamber followi
Tonight, let me speak directly to the citizens of Iran: America respects you, and we respect
To overcome dangers in our world, we must also take the offensive by encouraging economic pro
In recent years, you and I have taken unprecedented action to fight AIDS and malaria, expand
Our country must also remain on the offensive against terrorism here at home. The enemy has n
It is said that prior to the attacks of September the 11th, our government failed to connect
In all these areas -- from the disruption of terror networks, to victory in Iraq, to the spre
Our own generation is in a long war against a determined enemy -- a war that will be fought b
Here at home, America also has a great opportunity: We will build the prosperity of our count
Our economy is healthy and vigorous, and growing faster than other major industrialized natio
The American economy is preeminent, but we cannot afford to be complacent. In a dynamic world
Tonight I will set out a better path: an agenda for a nation that competes with confidence; a
Keeping America competitive begins with keeping our economy growing. And our economy grows wh
Keeping America competitive requires us to be good stewards of tax dollars. Every year of my
I am pleased that members of Congress are working on earmark reform, because the federal budg
We must also confront the larger challenge of mandatory spending, or entitlements. This year,
So tonight, I ask you to join me in creating a commission to examine the full impact of baby
Keeping America competitive requires us to open more markets for all that Americans make and
Keeping America competitive requires an immigration system that upholds our laws, reflects ou
Keeping America competitive requires affordable health care. (Applause.) Our government has a
We will make wider use of electronic records and other health information technology, to help
Keeping America competitive requires affordable energy. And here we have a serious problem: A
We must also change how we power our automobiles. We will increase our research in better bat
Breakthroughs on this and other new technologies will help us reach another great goal: to re
And to keep America competitive, one commitment is necessary above all: We must continue to l
First, I propose to double the federal commitment to the most critical basic research program
Second, I propose to make permanent the research and development tax credit -- (applause) --
Third, we need to encourage children to take more math and science, and to make sure those co
Preparing our nation to compete in the world is a goal that all of us can share. I urge you t
America is a great force for freedom and prosperity. Yet our greatness is not measured in pow
In recent years, America has become a more hopeful nation. Violent crime rates have fallen to
Yet many Americans, especially parents, still have deep concerns about the direction of our c
As we look at these challenges, we must never give in to the belief that America is in declin
A hopeful society depends on courts that deliver equal justice under the law. The Supreme Cou
Today marks the official retirement of a very special American. For 24 years of faithful serv
A hopeful society has institutions of science and medicine that do not cut ethical corners, a
A hopeful society expects elected officials to uphold the public trust. (Applause.) Honorable
As we renew the promise of our institutions, let us also show the character of America in our
A hopeful society gives special attention to children who lack direction and love. Through th
A hopeful society comes to the aid of fellow citizens in times of suffering and emergency --
In New Orleans and in other places, many of our fellow citizens have felt excluded from the p
A hopeful society acts boldly to fight diseases like HIV/AIDS, which can be prevented, and tr
Fellow citizens, we've been called to leadership in a period of consequence. We've entered a
Lincoln could have accepted peace at the cost of disunity and continued slavery. Martin Luthe
Before history is written down in books, it is written in courage. Like Americans before us,
We want to perform POS tagging and NER, but before we can do that, we need to tokenize
the text, splitting it into sentences, and then into words. Sample code is below:
from nltk.tokenize import sent_tokenize, word_tokenize
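The import above is the first step; below is a minimal sketch of the remaining steps (the
variable names, such as pos_tagged, are assumptions consistent with the NER loop that follows):

sentences = sent_tokenize(text)                    # split the speech into sentences
tokenized = [word_tokenize(s) for s in sentences]  # split each sentence into words
pos_tagged = [nltk.pos_tag(s) for s in tokenized]  # tag parts of speech
print(pos_tagged[:1])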
[[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('A
named_entities = []
for tagged_sentence in pos_tagged:
chunked_sentence = nltk.ne_chunk(tagged_sentence, binary = True)
named_entities.extend([chunk for chunk in chunked_sentence if hasattr(chunk, 'label')])
print(named_entities[:10])
Now, we can visualize these results. We will use matplotlib to visualize the POS tagging
and NER. If you have not already installed the matplotlib package, you may do so within
Anaconda Navigator.
First, let’s create a frequency distribution of the POS tags, and then plot this distribution.
Sample code is below:
import nltk
from nltk import FreqDist
import matplotlib.pyplot as plt
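A minimal sketch, assuming the pos_tagged list built earlier:

# flatten the tagged sentences into a single list of tags, then count and plot
all_tags = [tag for sentence in pos_tagged for (word, tag) in sentence]
fdist_pos = FreqDist(all_tags)
fdist_pos.plot(30)
plt.show()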
[Figure: frequency distribution plot of the POS tag counts, with NN, IN, DT, JJ, and NNP among the most frequent tags]
Now, we will visualize the NER results. We will create a frequency distribution of the NER
tags, and then plot this distribution.
Sample code is below:
import nltk
from nltk import FreqDist
import matplotlib.pyplot as plt
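A minimal sketch, assuming the named_entities list built earlier (with binary = True each
chunk is simply labeled NE, so we count the entity strings themselves):

# join each chunk's leaves back into an entity string, then count and plot
entity_strings = [' '.join(word for word, tag in chunk.leaves()) for chunk in named_entities]
fdist_ner = FreqDist(entity_strings)
fdist_ner.plot(30)
plt.show()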
[Figure: frequency distribution plot of the NER results]
Please render your Quarto file to PDF and submit it to the assignment for Lab 8 within Canvas.