Principles of Natural Language Processing
Susan McRoy
Milwaukee
Principles of Natural Language Processing by Susan McRoy is licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License, except
where otherwise noted.
“I am putting myself to the fullest possible use, which is all I think that
any conscious entity can ever hope to do.”
As spoken by HAL 9000,
2001: A Space Odyssey (Kubrick, 1968)1
CHAPTER 1.
Figure 1.1 Examples of structured and unstructured content from the Steam
webpage for the game Dota 2
experts, with limited amounts of evidence to validate the
approach. Modern NLP often follows a general pattern: identify
a real-world concern, obtain samples of relevant
data, pre-process and mine data to suggest patterns and research
questions, and then answer research questions or solve technical
problems by finding functions that capture complex correlations
among different aspects of the data and validating these models
with statistical measures. For example, Figure 1.2 shows an
example conceptualization of an important problem: discovering
whether patients have experienced side effects while taking a
medication that may not have been discovered during clinical
testing when the drug was approved (and may not even have
been yet reported to their medical provider).
Conceptualization Example
Sentence: "Our pet duck keeps biting everyone, so I bought a cheap muzzle for it. Nothing flashy, but it fits the bill."
Reading 1: "fits the bill" as an idiom (i.e., "suitable for the purpose")
Reading 2: "fits the bill" as a compositional expression ("the right size for the beak of a duck")
among multiple possible derivations for a sequence, extracting
semantic information from sequences, and determining whether
accepting the truth of one expression of natural language
requires accepting the truth of some other.
ing promote social justice by allowing people to document and
share their experiences with the police.
Syntax    What did Bill buy    *What did Bill buy potatoes
learning and the models they create include both statistical clas-
sifiers and neural networks. It is usually necessary to experiment
with a variety of alternative learning strategies to see what works
best, including reusing models from existing systems, or coding
a moderate amount of data by hand and then using it to train
a model to code a larger set. Classification-based approaches
often treat the structures produced by NLP as input features,
but classifiers can also be used to perform many of the tasks in
the NLP pipeline. Open source data science workbench tools,
such as Weka, allow one to compare different strategies easily, by
providing most of the algorithms and configuration alternatives
that one might want, without installing any additional software.
Workbenches allow one to select data, algorithms, and parameter
settings using simple menus and check boxes, and provide func-
tions for analyzing or visualizing the results.
Figure 1.7 Software libraries and tool kits for NLP
AllenNLP: PyTorch-based toolkit for most medium and high level NLP tasks; includes Elmo word vectors
Natural Language Tool Kit (NLTK): Python software library for most symbolic and statistical NLP tasks with access to many corpora for many languages (widely used in education)
Scikit-learn: General Python toolkit for ML, but also has functions for text analytics (e.g., tokenization, classification)
spaCy: Industrial strength libraries for NLP, loadable via Conda or pip, but has only a dependency parser
Textblob: Python libraries for most common NLP tasks; built from NLTK but somewhat easier to use, by Steven Loria
Tweet NLP: Python software for most NLP tasks, trained on Twitter data; older versions for Java 6 also available (now maintained by CMU researchers)
FRED: demos of machine reading, http://wit.istc.cnr.it/stlab-tools/fred/demo/
Google Parsey McParseface: demo of dependency parsing, https://deepai.org/machine-learning-model/parseymcparseface
1.6 SUMMARY
into one of the richest resources of human experience. It is hard
to even imagine an area of social or commercial interest that does
not result in artifacts containing natural language. Although natural
languages and the purposes for which we use them are diverse,
there are abstractions that are essential to all of them:
words, sentences, syntax, semantics, and pragmatics. These shared
abstractions create an opportunity for generalizable methods for analyzing
language. While the first NLP methods relied on small sets of
examples and a high level of human expertise, there are now
large sets of digitized data available for public use
that make methods based on large data sets and the lat-
est advances in artificial intelligence feasible. Additionally, today
there are many highly accurate, freely available software tools for
analyzing natural language.
Notes
CHAPTER 2.
The data structures most common to NLP are strings, lists, vec-
tors, trees, and graphs. All of these are types of sequences,
which are ordered collections of elements. Unprocessed data is
usually input as string data which are processed into lists or
vectors, representing individual words, before subsequent pro-
cessing in the NLP pipeline. This processed data is usually not
just a list or vector of strings, but sequences of complex objects
that keep track of various attributes associated with each token,
such as part of speech. Later stages may add additional annota-
tions, such as marking the beginnings and endings of important
sequences within a sentence, such as the names of entities. This
section will overview the most commonly used data types along
with examples of their application in NLP.
2.1.1 Strings
2.1.2 Lists
Figure 2.1 Examples of patterns to use with the spaCy matcher function
Pattern    Example of matching phrase
2.1.3 Vectors
Vectors hold elements of the same size, such as numbers, and are
of fixed size. They are one-dimensional, which means elements
can be accessed using a single integer index. Similar represen-
tations of higher dimension are given special names; a matrix is
a two-dimensional, rectangular structure arranged in rows and
columns. The term tensor is used to describe generalizations
of matrices to N-dimensional space.7 Important operations for
vectors include accessing the value of an element of the vector,
finding the length of a vector, calculating the similarity between
two vectors, and taking the average of multiple vectors.
For much of data science, including modern NLP, vectors are
used as a representation format for a variety of tasks. For exam-
ple, vectors can represent subsets of words or labels from a fixed
vocabulary. To do this, each element corresponds to a single
word in the vocabulary and the value is either one or zero to
indicate whether or not the word is included in the set. Vectors
where only a single element can be nonzero8 are often used as an
output format (to indicate the selected class label or type of word
in context, e.g., to discriminate between a noun versus a verb
sense). Vector elements can also capture a probability distribu-
tion over a set, where the value of each element corresponds to
the probability of that element, a value between 0 and 1, and the
sum across all such values is 1. Calculating such a probability dis-
tribution is often done using a “softmax” function, which is a
mapping from a vector of real numbers, onto a probability distri-
bution with the same number of elements, where the probabili-
ties are proportional to the exponentials of the input values. Two
applications of vectors are of special note here: the use of vec-
tors to represent documents for information retrieval (which has
been termed the “Vector Space Model”) and the use of vectors for
word embeddings which are a representation of the meaning of
words, which can also be generalized to represent longer phrases
and sentences.
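As a rough sketch of these two uses of vectors, consider the following Python fragment (it uses NumPy, which is not one of the toolkits discussed here, and the four-word vocabulary and scores are made up): it builds a vector where only a single element is nonzero and then applies a softmax to a vector of scores.

import numpy as np

vocabulary = ["cat", "dog", "ran", "the"]            # a made-up four-word vocabulary

# A vector where exactly one element is nonzero, marking a selected word or label
selected = np.zeros(len(vocabulary))
selected[vocabulary.index("dog")] = 1.0              # [0., 1., 0., 0.]

def softmax(scores):
    """Map a vector of real-valued scores onto a probability distribution.
    The probabilities are proportional to the exponentials of the inputs."""
    exps = np.exp(scores - np.max(scores))           # subtracting the max improves numerical stability
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])
probabilities = softmax(scores)
print(probabilities, probabilities.sum())            # values between 0 and 1 that sum to 1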
tors using local statistics; it can either use context (e.g., a fixed
number of words on either side) to predict a target word (a
method known as continuous bag of words), or use a word to
predict a target context, which is called a skip-gram. Two varia-
tions of Word2vec are available: one that works better on high
frequency words and one that works better on low frequency
words. For high frequency words, an approach called “negative
sampling” is used to maximize the probability of a word and con-
text being in the corpus data if it is, and maximize the probabil-
ity of a word and context not being in the corpus data, if it is
not. Finding these optimal values can take a very long time. For
low frequency words, which require higher dimensionality vec-
tors, a more efficient approach called hierarchical softmax can
be selected. It uses a binary tree representation that reduces the
computational complexity to O(log2 |V|) instead of O(|V|), where
|V| is the size of the vocabulary.
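As a rough sketch of how these training choices are typically exposed, the following fragment uses the Gensim library (a third-party package not covered in the text; the parameter names follow Gensim 4.x, where older versions use size instead of vector_size), and the toy corpus is made up:

from gensim.models import Word2Vec

# A made-up corpus: each "sentence" is a list of tokens
sentences = [["the", "duck", "ate", "a", "bug"],
             ["the", "dog", "chased", "the", "ball"],
             ["the", "cat", "ate", "a", "mouse"]]

# sg=0 selects continuous bag of words (context predicts the target word);
# sg=1 selects skip-gram (the target word predicts its context).
# negative=5 selects negative sampling with five noise words;
# hs=1 (with negative=0) would select hierarchical softmax instead.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, hs=0, negative=5, epochs=50)

print(model.wv["duck"][:5])                  # the first few dimensions of one learned vector
print(model.wv.similarity("duck", "cat"))    # cosine similarity between two word vectors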
GloVe vectors14 15 use global statistics to predict the proba-
bility of word j appearing in the context of word i with a least-
squares objective. The general idea is to first count for all pairs
of words their co-occurrence, then find values such that for each
pair of word vectors, their dot product equals the logarithm of
the words’ probability of co-occurrence.
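Written as a formula (following the notation of the published GloVe objective, where wi and w̃j are the word and context vectors, bi and b̃j are bias terms, Xij is the co-occurrence count, and f is a weighting function that damps the effect of very rare and very frequent pairs), the least-squares objective is roughly:

J = Σ over word pairs (i, j) of f(Xij) * (wi · w̃j + bi + b̃j − log Xij)²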
Elmo (Embeddings from Language Models) vectors16 contain
values that have been learned from a neural network with a par-
ticular architecture known as a bidirectional Long Short Term
Memory (biLSTM or biLM) network. These networks have
multiple internal layers, some of which provide feedback to each
other. Bidirectionality refers to training on both the original sen-
tence and its reverse, to capture certain syntactic dependencies
on the semantics of a word. In addition, unlike the other types
of vectors discussed here, the representation of a word using an
Elmo vector is a function of the entire sentence in which it occurs,
rather than just a small number of nearby words.
For a particular application, the method of training is likely to
not be as important as the data that was used to create the vec-
Figure 2.3 Example phrase structure parse tree (Image from Stanford
CoreNLP.run)
Figure 2.4 Example dependency structure parse tree (Image from Stanford
CoreNLP.run)
Graphs are more general than trees, because they allow nodes
to have multiple incoming edges. While they are not needed
to represent sentence structure, they are helpful in describing
how language is processed. Graphs form the basis of the pro-
cessing architectures for both search based parsing and analysis
using neural networks. In a search, the nodes of the graph cor-
respond to a machine state and possible alternative next states.
In a neural network, the nodes of a graph correspond to oper-
ations or functions, which can have any number of inputs and
outputs. The edges of a neural network, which represent the data
that flows from the output of one node to the input of another
are tensors, as they may be scalars, vectors, matrices, or higher-
dimensionality structures. In both search graphs and neural net-
works, the nodes and edges represent models of process rather
than data.
This concludes our introduction to the data structures of NLP.
We will next consider the two primary processing paradigms:
search and classification.
2.2 PROCESSING PARADIGMS FOR NATURAL
LANGUAGE PROCESSING
is now commonly used for generating text, such as for creating
captions or brief summaries.
Hill-climbing search and its variants use a function to rank the
children of a current node to find a node that is better than the
current one, and then transition to that state, without keeping
any representation of the overall search path. Gradient descent
is a variant of hill climbing that searches for the child with min-
imum value of its ranking function. For machine learning, this
type of search is used to adjust parameters to find the minimum
amount of error or loss in comparison to the goal by making
changes proportional to the negative gradient, which is the mul-
tivariate generalization of a derivative (akin to the slope of a
line). Hill-climbing and gradient searching do not backtrack, and
hence do not require any memory to track previously visited
states.
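As a rough sketch of the idea, the following made-up one-parameter example repeatedly steps in the direction of the negative gradient of a simple loss function:

def gradient_descent(gradient, start, learning_rate=0.1, steps=50):
    """Repeatedly move in the direction of the negative gradient."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x

# The loss L(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 4))    # close to 3.0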
Figure 2.6 IOB encoding for the sentence "Ruth Bader Ginsberg went to New
York"
The same approach can be used for the task of marking the noun
phrases within a sentence. Figure 2.7 shows an example of an
IOB encoding for bracketing noun phrases21.
Figure 2.7 IOB encoding for the sentence “The bull chased the big red ball around
the yard.”
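As an illustration, the bracketing of Figure 2.7 corresponds to tagging each token roughly as follows (this listing is a sketch, not the output of any particular tool):

iob_tags = [("The", "B-NP"), ("bull", "I-NP"), ("chased", "O"),
            ("the", "B-NP"), ("big", "I-NP"), ("red", "I-NP"), ("ball", "I-NP"),
            ("around", "O"), ("the", "B-NP"), ("yard", "I-NP"), (".", "O")]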
Figure 2.9 Examples of linearly and non-linearly separable data (images from
Wikipedia)
In general, employing machine learning approaches is an itera-
tive process of training and testing. Training means running a
learning algorithm on a data set where the correct class is pro-
vided to the algorithm so it can use the information to find val-
ues for internal parameters. Testing means applying the trained
classifier to a subset of the data that was not used for training,
but where the correct class is known. The performance of the
classifier is then assessed. The most common measures to use
are precision, recall, and F1. Precision is the proportion of cor-
rectly classified items of a given category among the total num-
ber of items it classified as the category. Recall is the proportion
of correctly classified items among the total number of items that
should have been classified as the category. F1 is their harmonic
mean (which is calculated as two times the product of precision
and recall divided by their sum).
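As a rough sketch, these three measures can be computed directly from counts for a single category; the predicted and gold labels below are made up:

def precision_recall_f1(predicted, gold, category):
    true_positives = sum(p == category and g == category for p, g in zip(predicted, gold))
    predicted_positives = sum(p == category for p in predicted)
    actual_positives = sum(g == category for g in gold)
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

predicted = ["NOUN", "VERB", "NOUN", "NOUN", "VERB"]
gold      = ["NOUN", "VERB", "VERB", "NOUN", "NOUN"]
print(precision_recall_f1(predicted, gold, "NOUN"))    # roughly (0.67, 0.67, 0.67)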
Training and test sets can be created through a manual or
automated selection process. Manual selection, while less com-
mon, is sometimes done to assure consistency across training
and testing over time. Manual selection risks biasing the results,
however. Instead, experimenters typically use N-fold cross-valida-
tion, which is an iterative process where the data set is first par-
titioned into N equal subparts (the “folds”) and then training is
repeated N times, each time with a different one of the subparts
held out as a test set. Afterward, the performance measures are
averaged over all test sets.
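As a rough sketch of this procedure using Scikit-learn (listed in Figure 1.7), the fragment below evaluates a logistic regression classifier with 5-fold cross-validation on synthetic data standing in for features extracted from text:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Made-up data standing in for feature vectors and labels derived from text
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

classifier = LogisticRegression(max_iter=1000)
scores = cross_val_score(classifier, X, y, cv=5, scoring="f1")    # one F1 score per fold
print(scores, scores.mean())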
The success of training classifiers (of all types) depends pri-
marily on the data set available to train the model. (There can
also be differences due to the training algorithm, so several are
usually tried and compared.) The data used for training should
always be as similar as possible to the target test data and have
enough positive and negative examples for each category, to
minimize the impact of small differences in placement of bound-
aries. It is also important to choose an appropriate internal rep-
resentation of the data as features. A simple approach might
consider each of the individual words – but this can be both too
Figure 2.10 Example Deep Network architecture for NLP (Image credit: Ruder,
2018)
Newer architectures for encoders, developed by Lee-Thorp et al.
(2021), replace multiple layers of self-attention with a single layer
of linear transformations that "mixes" different input tokens, to
create faster processing models; the group reports that a "stan-
dard, unparameterized Fourier Transform achieves 92% of the
accuracy of BERT on the GLUE benchmark, but pre-trains and
runs up to seven times faster on GPUs and twice as fast on
TPUs"29. Other versions mix a single self-attention layer with
Fourier transforms to get better accuracy, at a somewhat smaller
performance benefit. Exploring such tradeoffs is likely to
remain an active area of research for a while.
Both encoders and decoders are trained on large collections of
text, such as Gigaword30, which includes text from several inter-
national news services. Transformers for dialog must be trained
on data collected from interactions between people, which can
be gathered either by scraping portals where people interact
(e.g., for language learners to practice conversation skills) or by
creating tasks for pairs of crowd-workers. These datasets have
then been annotated by researchers at universities or large com-
panies, such as Google. Facebook research has assembled one of
the most comprehensive collections of openly available datasets
and software tools, which it makes available through its ParlAI
project31.
For specific sequence to sequence classification tasks, the pre-
trained models are fine-tuned or updated with additional data
from the target domain, such as pairs of questions and answers32.
The pairs of sequences can be given as two separate inputs, {S1,
S2} or, more commonly, they are given as one concatenated
input, by adding special tokens to indicate the start, separation,
and end of each part of the pair, e.g., [Start]-[S1]-[Separator]-[S2]-
[End]. These models have been shown to work fairly well for
question answering, sentiment analysis, textual entailment and
parsing33. One limitation is that these models are big and slow in
production – and thus cannot yet be used for real-time systems
– however they could be used to create training data for simpler
2.4 SUMMARY
This chapter considered the most used data types and problem-
solving strategies for natural language processing. The data
types include strings, lists, vectors, trees, and graphs. Most of
these are meant to capture sequences (e.g., of letters or words)
and the hierarchical structures that emerge because of grammar.
Feature structures or objects are needed to associate various
attributes with tokens or types (as a way to keep the number
of unique types manageable). The processing par-
adigms of natural language processing include search, classifi-
cation, and more generally, machine learning, where the
development of a language model (including classifiers) using
machine learning represents a complex combination of manual
and automated search to find an optimal model for performing a
given task. Many steps for these processing paradigms have been
implemented in the form of software libraries and workbench-
style tools; however, no tool exists that can predict the optimal
approach or identify the most relevant data or the internal rep-
resentation to use. These tasks require an understanding of language;
levels of abstraction, benchmark tasks, and important
end-to-end applications will be discussed in the remainder of
this book.
Notes
3.2 WORD TYPES AND FEATURES
Figure 3.1 Example of nouns with a plural formed with -s or -es suffix
frog frogs
idea ideas
fly flies
fox foxes
class classes
Figure 3.2 Examples of nouns with irregular plural forms
child children
sheep sheep
goose geese
knife knives
Figure 3.3 Typical placement of nouns
Verbs are usually tensed (past, present, future). They include both
verbs where the tensed forms are regular (see Figure 3.4) or
irregular (see Figure 3.5). Also, in some contexts, verbs can
appear untensed, such as after an auxiliary or after the word “to”.
Verbs are also marked for number (singular or plural), and for
person. First person is “I”; second person is “you”; and third
person is “he”, “she”, or “it”. The third-person singular form is
marked with “-s”; the non-3rd person singular present looks the
same as the root form. Verbs also have participle forms for past
(e.g., "broken" or "thought") and present (e.g., "thinking").
Some verbs require a particle which is similar to a preposition
except that it forms an essential part of the meaning of the verb
that can be moved either before or after another argument, as in
“she took off her hat” or “she took her hat off”.
Verbs that can be main verbs are an open class. Verbs that are
modals or auxiliary verbs (also called helping verbs) are a closed
class. They are used along with a main verb to express abil-
ity (“can”, “could”), possibility (“may”, “might”), necessity (“shall”,
“should”, “ought”), certainty (“do”, “did”), future (“will”, “would”),
past (“has”, “had”, “have”, “was”, “were”). NLP systems treat modals
and auxiliaries as a separate part of speech. They are also all
irregular in the forms that they take for different combinations
of features, such as past, plural, etc. For example, the modal "can"
uses the form "can" for any value of number and "could" for any
value of "past".
3.2.4 Prepositions
Prepositions, such as “with”, “of”, “for”, and “from” are words that
relate two nouns or a noun and a verb. Prepositions require
a noun phrase argument (to form a prepositional phrase). It is
estimated that there are about 150 different prepositions (including
94 one-word prepositions and 56 complex prepositions, such as
“out of”)14. Prepositions are generally considered a closed class,
but the possibility of complex combinations suggests that algo-
rithms might be better off allowing for out-of-vocabulary exam-
ples.
3.2.6 Conjunctions
These have been among the country’s leading imports, particularly last year when there
were shortages that led many traders to buy heavily and pay dearly. [wsj 1469]
"You can walk but you cannot run."    Coordinating conjunction, labelled as conjunction
"You can read at home or in the library."    Coordinating conjunction, labelled as conjunction
"My daughter and her friends like to climb."    Coordinating conjunction, labelled as conjunction
"My cat purrs when you pet her."    Subordinating conjunction, labelled as adverb
"If you study hard, you will do well."    Subordinating conjunction, labelled as preposition
"You can either walk or take the bus."    Correlative conjunction, labelled as adverb, conjunction
"The car not only is quiet but also handles well."    Correlative conjunction, labelled adverb, adverb, conjunction, adverb
3.2.8 Wh-Words
Wh-words are words that begin with the letters "wh-", like "who", "what",
"when", "where", "which", "whose", and "why", along with their close
cousins "how", "how much", "how many", etc. They are used for
posing questions and are thus sometimes called interrogatives.
Unlike the word types mentioned so far, they can be determiners,
adverbs, or pronouns (both regular and possessive), and so it is
typical to see them marked as a special subtype of each. Identi-
fying phrases that include wh-words is important, because they
usually occur near the front in written text and fill an argument
role that has been left empty in its normal position, as in “Which
book did you like best?” In informal speech one might say “You
left your book where?” or “You said what?”, but the unusual syn-
tax also suggests a problem (like mishearing, shock, or criticism).
The semantics of the wh-expression specify what sort of answer
the speaker is expecting (e.g., a person, a description, a time, a
place, etc) and thus are essential to question-answering systems.
IN    conjunction, subordinating or preposition    that, of, on, before, unless
SYM    symbol    %, #
UH    interjection    oh, oops, gosh
Noun phrases are the most used type of phrase but also have
the most variation. Noun phrases can be simply a pronoun or
proper noun, or include a determiner, some premodifiers (such
as adjectives), the head noun, and some postmodifiers at the end.
3.4.2 Verb Phrases
Some examples of different types of sentences are shown in Fig-
ure 3.10.
INTJ    Interjection with several words    "Hello in there, I did not see you."
3.5 SUBCATEGORIZATION
Example    Subcategorization
"She said she wanted to be president."    "said" requires a complete S (any form)
3.7 SUMMARY
At one time, all these aspects required coding from scratch using
manually created dictionaries and grammar rules. Today, large
annotated data sets exist that allow one to either extract most
vocabulary items and grammar rules or to train statistical lan-
guage models for automated processing. However, there are
types of text that are not well covered by such corpora, including
social media and text within proprietary data warehouses (e.g.,
user and repair manuals for devices). Some of this data is impor-
tant as it includes the notes of medical providers and records cre-
ated by marketing and service departments of enterprises. In this
chapter, we have discussed the basis for computational methods
of language analysis. In the next chapters, we will consider com-
putational descriptions of language syntax and processing mod-
els for identifying syntactic structure and meaning.
Notes
LDC97S62. Web Download. Philadelphia: Linguistic Data Consor-
tium.
7. Bies, A., Mott, J., Warner, C., and Kulick, S. (2012). English Web
Treebank LDC2012T13. Web Download. Philadelphia: Linguistic
Data Consortium, 2012. URL: https://catalog.ldc.upenn.edu/
LDC2012T13
8. Silveira, N., Dozat, T., De Marneffe, M.C., Bowman, S.R., Connor,
M., Bauer, J., and Manning, C.D. (2014). A Gold Standard Depen-
dency Corpus for English. Proceedings of the Ninth International Con-
ference on Language Resources and Evaluation (LREC'14), pp.
2897-2904.
9. Weischedel, R., Palmer, M., Marcus, M., Hovy, E., Pradhan, S.,
Ramshaw, L., Xue, N., Taylor, A., Kaufman, J., Franchini, M., El-
Bachouti, M., Belvin, R., and Houston, A. (2013). OntoNotes Release
5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia.
10. spacy.io (2020). spaCy English Available Pretrained Statistical Mod-
els for English. URL: https://spacy.io/models/en
11. Li, S. (2018). Named Entity Recognition with NLTK and spaCy
URL: https://towardsdatascience.com/named-entity-recognition-
with-nltk-and-spacy-8c4a7d88e7da
12. ANC.org (2015) Open American National Corpus URL:
http://www.anc.org/
13. Napoles, C., Gormley, M. R., & Van Durme, B. (2012, June). Anno-
tated Gigaword. In Proceedings of the Joint Workshop on Auto-
matic Knowledge Base Construction and Web-scale Knowledge
Extraction (AKBC-WEKEX) (pp. 95-100).
14. Essberger, J. (2012). English Preposition List. Ebook Online:
http://www.englishclub.com/download/PDF/EnglishClub-English-
Prepositions-List.pdf
15. Bies, A., Ferguson, M., Katz, K., and MacIntyre, R. (1995). Bracketing
Guidelines For Treebank II Style Penn Treebank Project. URL
https://web.archive.org/web/20191212003907/http://lan-
guagelog.ldc.upenn.edu/myl/PennTreebank1995.pdf
16. Warner, C., Bies, A., Brisson, C., and Mott, J. (2004). Addendum to the
Penn Treebank II Style Bracketing Guidelines: BioMedical Treebank
Annotation, University of Pennsylvania, Linguistic Data Consortium.
17. Huddleston, R. D. and Pullum, G. K. (2002). The Cambridge Grammar
of the English Language. Cambridge, UK: Cambridge University
Press.
18. Grammars for NLP use a variety of categories for these words
including wh-determiner (which, that, who, whom), wh-possessive
CHAPTER 4.
The remainder of this chapter will discuss pipeline processes
1 to 3 from Figure 4.1 (tokenization, tagging, and parsing).
4.1 TOKENIZATION
Figure 4.2 The tokenization of "I sold my book for $80.00." (Image by Stanford
CoreNLP.run (2020))
Figure 4.3 Code for performing tokenization and part-of-speech tagging in NLTK
and spaCy
Tagging in NLTK
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

s = preprocess('My pet cat.')
print(s[0])
Tagging in spaCy
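A minimal sketch of the spaCy version, assuming the small English model (en_core_web_sm) has already been downloaded, would look roughly like this:

import spacy

nlp = spacy.load("en_core_web_sm")    # loads the tokenizer, tagger, and parser pipeline
doc = nlp("My pet cat.")
for token in doc:
    print(token.text, token.pos_, token.tag_)    # coarse and fine-grained part-of-speech tags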
example code for invoking the tagger within these two libraries,
after sentences are tokenized.
The earliest successful taggers were rule-based or combined
a rule-based approach with simple statistical modelling. Statis-
tical taggers were first introduced by Marshall in 19835, but
were not very accurate until larger and better datasets, such
as the Penn Treebank, became available, so that more advanced
modelling was feasible. The most successful taggers combine a
wide range of information including the possible syntactic cat-
egories of nearby words, the overall probability associated with
words being used in a particular way (e.g., "eats" can be a noun or a
verb, but it is more typically a verb), the probability of suffixes
and prefixes being associated with different parts of speech (e.g.,
words that end in “-ly” are usually adverbs), and the capitaliza-
tion of a word (e.g., words that begin with a capital letter are usu-
ally proper nouns). Rules for tagging can capture these known
linguistic generalizations.
The simplest statistical models just use the most frequent cat-
egory for each word. Sequence-based modelling (a type of
sequence classification) involves finding the best sequence of tags
for an entire sentence (rather than just a single word). These
approaches can combine information about the most common
category of a single word with those of nearby words, which is
the approach taken by Hidden Markov Models. More advanced
approaches, such as Maximum Entropy Markov Models (used
by the Stanford CoreNLP tagger) and Averaged Perceptrons (a
neural approach used in NLTK), consider the frequency of sub-
sequences, as well as the other linguistic features, but adjust their
impact by measuring the association between each feature and
tag in a training corpus. Thus, sequence classification requires
a data set of sentences where each word has been correctly
tagged. Also, the corpus must be large enough so that the algo-
rithm can find enough examples within the training corpus for
the estimates to be meaningful. A third factor in the success of
taggers trained from data is the similarity between the genre
Figure 4.4 Example of a tagger created using regular expressions in NLTK
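A rough sketch of such a tagger, with a few made-up suffix patterns and a default tag of NN, would look like this in NLTK:

import nltk

patterns = [
    (r'.*ing$', 'VBG'),                  # gerunds
    (r'.*ed$', 'VBD'),                   # simple past
    (r'.*ly$', 'RB'),                    # adverbs
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),    # cardinal numbers
    (r'.*s$', 'NNS'),                    # plural nouns
    (r'.*', 'NN'),                       # default: tag everything else as a noun
]
regexp_tagger = nltk.RegexpTagger(patterns)
print(regexp_tagger.tag("The dogs barked loudly".split()))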
the errors were reviewed, it was revealed that the hand-writ-
ten rules tended to solve all but the “hardest” cases of ambigu-
ity, while the statistical model made about 80% of its errors on
“easy” cases, but sometimes did better on cases that people think
would be “hard”, but were correctly tagged in its training data.
(Statistical tagging only got better than rule-based tagging when
enough high-quality annotated data was available, e.g., after the
completion of the Penn Treebank.)
d) P(ti | wi, ti-1, ti+1), probability of a tag for a word – that is, take the product of the three estimates:
(C(ti-1, ti) ÷ C(ti-1)) * (C(ti, ti+1) ÷ C(ti+1)) * (C(ti, wi) ÷ C(ti))
the estimated probability of examples that occur once10. One
can also build a more complex, and potentially more accurate
model, using Conditional Random Fields (CRF). CRFs do not
require that one assume independence and also provide a cer-
tainty value for different possible sequences. Maximum Entropy
Markov Models (MEMM), such as the Stanford CoreNLP tagger,
and CRF models do not assume independence so they allow one
to use ad hoc features, including suffixes and capitalization, but
as a result, they are much slower, and rarely used for relatively
simple tasks, such as part-of-speech tagging.
Figure 4.7 Example of Twitter text and tags from Gui et al 2017
Untagged sequence: @DORSEY33 lol aw i thought u was talkin bout another time . nd i dnt see u either !
Tags as labelled in Gui et al 2017: USR UH UH PRP VBD PRP VBD VBG IN DT NN . CC PRP VBP VB PRP RB
Figure 4.8 Example neural network architecture for part of speech tagging (Image
from Meftah and Semmar (2018))
An emerging approach that has been proposed for dealing with
novel domains is to combine the input with embeddings trained
from a domain that is large and more conventional to enrich
information from the smaller target domain, an approach known
as transfer learning, which has been applied to Twitter text15.
The tradeoff for these newer models is in the time needed for
training. Thus a reasonable approach might start with a pre-
trained tagger and then address errors with some domain-spe-
cific correction rules, if necessary. (One can also provide
synthetic training data, meant to teach the model how to handle
the erroneous cases.) We will now move to the next stage of syn-
tactic processing which is to identify the syntactic structure over
entire sequences of words.
4.3 GRAMMARS
Figure 4.9 Lexical entry for “given” in the Slot Grammar lexicon of McCord (1990)
Figure 4.10 Parse tree for the sentence “The dog ate the beef.” (parse and image
from Stanford CoreNLP.run 2020)
Figure 4.11 Example CFG learned from parse of "The dog barked". (parse and
image from Stanford CoreNLP.run)
S → NP VP
NP → DT NN
VP → VBD
DT → the
NN → dog
VBD → barked
NP → DT NN [0.4]    NN → money [0.4]
For the example in Figure 4.13, we must also consider the phrase
types within a pattern. So in Figure 4.13 there are 700 examples
of “S →NP VP“ (including 400 that appear nested within the
third type of pattern). There are 100 examples of “S → VB NP”
and 200 examples of “S → S CC S“, resulting in a probability of
assignment for the PCFG sentence rule of S → NP VP [0.7] | VB
NP [0.1] | S CC S [0.2].
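As a rough sketch, a PCFG of this shape can be written directly in NLTK; the lexical rules below are made up so that the grammar is complete enough to parse a toy sentence, and ViterbiParser returns the most probable tree:

import nltk

toy_pcfg = nltk.PCFG.fromstring("""
    S   -> NP VP [0.7] | VB NP [0.1] | S CC S [0.2]
    NP  -> DT NN [1.0]
    VP  -> VBD NP [0.5] | VBD [0.5]
    DT  -> 'the' [1.0]
    NN  -> 'dog' [0.5] | 'ball' [0.5]
    VBD -> 'chased' [1.0]
    VB  -> 'chase' [1.0]
    CC  -> 'and' [1.0]
""")
parser = nltk.ViterbiParser(toy_pcfg)
for tree in parser.parse("the dog chased the ball".split()):
    print(tree)    # the most probable parse, annotated with its probability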
Simple PCFGs provide a way to resolve local syntactic ambi-
guity, but they are not always very accurate, because many ambi-
guities depend on relationships between specific words, as in
“They ate udon with forks.” versus “They ate udon with chicken.”
Often accuracy can be improved by conditioning the probabili-
ties by the word that is the head of the constituent (such as the
main verb in a verb phrase or the main noun in a noun phrase).
Figure 4.13 gives an example rule by Collins19 that might be used
for finding the head of a noun phrase.
Figure 4.15 Example of a dependency tree for “Bills on ports and immigration were
submitted by Senator Brownback.” (Image: Stanford CoreNLP)
Figure 4.16 Example of the locations of some individual words and spans of words
Word positions:    0 The 1 cat 2 ate 3 a 4 mouse 5
Spans of length 1: "The", "cat", "ate", "a", "mouse"
Spans of length 2: "The cat", "a mouse"
Spans of length 3: "ate a mouse"
Spans of length 5: "The cat ate a mouse"
Sentence:
0 The 1 horse 2 raced 3 past 4 the 5 barn 6 fell. 7
VB -> horse
The edges that are created to keep track of the state of a parse
before a rule is completely matched are called “active edges”. As
the sentence is processed from left to right, for each partially
matched grammar rule, the algorithm will create an active edge
to store what has been matched, and what is still needed. A dot
(*) is used to designate the dividing point between what has been
matched and what is still needed. (If an edge is created with
no parts matched, then the dot will be immediately after the
arrow, as in A → * B1 B2.) As more parts are matched, new edges
are added, with the dot moved over to the right. When the dot
reaches the far right end of a rule, then the edge is “complete”
(also called “inactive”) and it can be used to extend other active
rules. Figure 4.18 shows another small grammar and some edges
that would be created, midway through the parse.
Sentence:
0 The 1 cat 2 sat 3
NP -> DT NP
Bottom-up rule: When a new category is seen (or a nonterminal rule for that cate-
gory is completed), then for any rule where that category is leftmost
on the RHS, create a new edge with the dot just to the left of the category.
If ( ([i:j], X) or [i:j] X -> U V *) and A -> X Y, then add ([i:i], A -> * X Y)
Note: Y can be empty.
Figure 4.20 Pseudocode for a basic CKY algorithm (an exhaustive enumeration
over all possible edges)
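A rough sketch of such an exhaustive, bottom-up enumeration in Python, assuming a grammar in Chomsky Normal Form supplied as two made-up rule tables, is shown below:

from collections import defaultdict

def cky_parse(words, lexical_rules, binary_rules):
    """Fill a table of categories for every span [i:j] of the sentence.
    lexical_rules maps a word to the categories that can cover it;
    binary_rules maps a pair of categories (B, C) to the parents A with A -> B C."""
    n = len(words)
    table = defaultdict(set)
    for i, word in enumerate(words):                       # spans of length 1
        table[(i, i + 1)] |= lexical_rules.get(word, set())
    for length in range(2, n + 1):                         # longer spans, shortest first
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):                      # every possible split point
                for B in table[(i, k)]:
                    for C in table[(k, j)]:
                        table[(i, j)] |= binary_rules.get((B, C), set())
    return table

lexical = {'the': {'DT'}, 'dog': {'NN'}, 'barked': {'VP'}}
binary = {('DT', 'NN'): {'NP'}, ('NP', 'VP'): {'S'}}
chart = cky_parse(['the', 'dog', 'barked'], lexical, binary)
print(chart[(0, 3)])    # {'S'} indicates a complete parse of the whole sentence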
[0:0] DT → * the
[0:1] DT → the *
[0:0] NP → * DT NN
[0:1] NP → DT * NN
[1:1] NN → * dog
[1:2] NN → dog *
[0:2] NP → DT NN *
[0:0] S → * NP VBD
[0:2] S → NP * VBD
[2:2] VBD → * barked
[2:3] VBD → barked *
[0:3] S → NP VBD *
Predict barked and apply the fundamental rule to complete VBD and apply the fundamental rule again to complete the S.
1 $ T and T$ Shift NA
7 $S $ Accept
Figure 4.24 shows the last few steps of the same parse. Note,
if one were to draw a tree at the same time, every shift action
would create a sibling whereas each LA/RA adds a new edge
between a head and a dependent. In Figure 4.24, the “Reduce”
actions at the end of the parse remove items from the stack with-
out creating any new dependencies. The parse ends when the
queue (column 3) is empty.26
Figure 4.26 The Stanza NLP processing pipeline (image from Qi et al (2020))
Figure 4.27 Example of a chunk grammar with recursion and its usage in NLTK
grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  PP: {<IN><NP>}
  VP: {<VB.*><RP|NP|PP|CLAUSE>+$}
  CLAUSE: {<NP><VP>}"""
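A rough sketch of applying such a grammar to an already-tagged sentence uses NLTK's RegexpParser; the loop argument gives the recursive CLAUSE rule a second pass, and the tagged sentence here is made up:

import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  PP: {<IN><NP>}
  VP: {<VB.*><RP|NP|PP|CLAUSE>+$}
  CLAUSE: {<NP><VP>}"""

chunker = nltk.RegexpParser(grammar, loop=2)
tagged = [("the", "DT"), ("bull", "NN"), ("chased", "VBD"),
          ("the", "DT"), ("red", "JJ"), ("ball", "NN")]
print(chunker.parse(tagged))    # a tree with NP, VP, and CLAUSE chunks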
4.7 SUMMARY
Notes
The most
recent Stanford parser may address this, however.
27. Collins, M., Ramshaw, L., Hajič, J. and Tillmann, C., (1999). A Statis-
tical Parser for Czech. In Proceedings of the 37th Annual Meeting of
the Association for Computational Linguistics (pp. 505-512).
28. Xia, F. and Palmer, M. (2001). Converting Dependency Structures to
Phrase Structures. In Proceedings of the First International Conference
on Human language Technology Research (pp. 1-5). Association for
Computational Linguistics.
29. Lee, Y. S., and Wang, Z. (2016). Language Independent Dependency
to Constituent Tree conversion. In Proceedings of COLING 2016,
the 26th International Conference on Computational Linguistics:
Technical Papers (pp. 421-428).
30. Ratnaparkhi, A. (1997). A Linear Observed Time Statistical
Parser Based on Maximum Entropy Models. In Proceedings of Empiri-
cal Methods in Natural Language Processing (EMNLP). Also in
ArXiv.org URL: https://arxiv.org/abs/cmp-lg/9706014
31. See Linzen, T. and Baroni, M. (2020). Syntactic Structure from
Deep Learning. arXiv preprint arXiv:2004.10827.
URL:https://arxiv.org/pdf/2004.10827
32. Gaddy, D., Stern, M. and Klein, D. (2018). What's Going On in Neural
Constituency Parsers? An Analysis. In the Proceedings of the 16th
Annual Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies. Also online
as: https://arxiv.org/pdf/1804.07853.pdf
33. Kitaev, N. and Klein, D. (2018). Constituency Parsing with a Self-
Attentive Encoder. arXiv preprint arXiv:1805.01052.
34. Qi, P., Zhang, Y., Zhang, Y., Bolton, J. and Manning, C. (2020).
Stanza: A Python Natural Language Processing Toolkit for Many
Human Languages. In the Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics (ACL) System Demonstrations.
Also see the STANZA Github website: https://stan-
fordnlp.github.io/stanza/
35. Dozat, T., Qi, P. and Manning, C.D. (2017). Stanford’s Graph-Based
Neural Dependency Parser at the CoNLL 2017 Shared Task. In Pro-
ceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw
Text to Universal Dependencies (pp. 20-30).
36. Qi, P., Dozat, T., Zhang, Y. and Manning, C.D. (2019). Universal
Dependency Parsing from Scratch. ArXiv preprint
arXiv:1901.10457.
37. Straka, M. and Straková, J. (2017). Tokenizing, POS Tagging, Lem-
matizing and Parsing UD 2.0 with UDpipe. In Proceedings of the
CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Univer-
sal Dependencies (pp. 88-99).
38. UDPipe is available from both a public Github and from CRAN. The
CRAN site provides more extensive user-level documentation.
Github URL: https://github.com/ufal/udpipe CRAN URL:
https://cran.r-project.org/package=udpipe
39. Universaldependencies.org CONLL-U Format URL: https://univer-
saldependencies.org/format.html
40. Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing
with Python: Analyzing Text with the Natural Language Toolkit.
For sentences that are not specific to any domain, the most com-
mon approach to semantics is to focus on the verbs and how they
are used to describe events, with some attention to the use of
quantifiers (such as “a few”, “many” or “all”) to specify the entities
that participate in those events. These models follow from work
in linguistics (e.g. case grammars and theta roles) and philosophy
(e.g., Montague Semantics5 and Generalized Quantifiers6). Four
types of information are identified to represent the meaning of
individual sentences.
First, it is useful to know what entities are being described.
These correspond to individuals or sets of individuals in the
real world that are specified using (possibly complex) quanti-
fiers. Entities can be identified by their names (such as a sequence
of proper nouns), by some complex description (such as a noun
phrase that includes a head noun, a determiner, and various types
of restrictive modifiers including possessive phrases, adjectives,
nouns, prepositional phrases, and relative clauses), or by a pro-
noun.
Second, it is useful to know what types of events or states
are being mentioned and their semantic roles, which is deter-
mined by our understanding of verbs and their senses, including
their required arguments and typical modifiers. For example, the
sentence “The duck ate a bug.” describes an eating event that
involved a duck as eater and a bug as the thing that was eaten.
The most complete source of this information is the Unified
Verb Index.
Third, semantic analysis might also consider what type of
propositional attitude a sentence expresses, such as a statement,
question, or request. The type of behavior can be determined by
whether there are “wh” words in the sentence or some other spe-
cial syntax (such as a sentence that begins with either an auxil-
iary or untensed main verb). These three types of information are
represented together, as expressions in a logic or some variant.
Fourth, word sense discrimination determines what word
senses are intended for tokens of a sentence. Discriminating
among the possible senses of a word involves selecting a label
from a given set (that is, a classification task). Alternatively, one
Figure 5.1 Fragment of the Foundational Model of Anatomy ontology for defining
a tendon (example and image from Noy et al (2004))
Figure 5.2 Examples of semantic roles that commonly appear in the subject
position
Experiencer (+sentient) of perception    [The children] tasted the soup.
Figure 5.3 Examples of semantic roles that commonly appear in the object
position.
Theme (-affected) The boy rolled [the ball] down the hill.
Figure 5.5 Examples of semantic roles that express a time or span of time (as a
modifier).
Python: Customers.retrieveRows(last_name=”Smith”)
Figure 5.8 gives a set of CFG grammar rules with semantic fea-
tures for creating SQL statements for queries like “What cities
are located in Canada?” or “In what country is Houston?” where
the target representations would be “SELECT City FROM
city_table WHERE Country = ‘Canada’ ” and “SELECT Coun-
try FROM city_table WHERE City = ‘Houston’ ”, respectively.
Figure 5.9 Dependency structures for “What cities are in Canada?” vs. “What
cities are located in Canada?”
Figure 5.10 Examples of Cypher query language equivalents for natural language
• If α is a formula then so is ¬α
• If α and β are formulas then so are α ˅ β and α ˄ β
• If x is a variable and α is a formula then so are ∃x.α and
∀x.α
There are two special cases. If the sentence within the scope of
a lambda variable includes the same variable as one in its argu-
ment, then the variables in the argument should be renamed to
eliminate the clash. The other special case is when the expression
within the scope of a lambda involves what is known as “inten-
sionality”. Sentences that talk about things that might not be true
in the world right now, such as statements about the past, state-
ments that include a modal operator (e.g., “it is possible”), state-
ments that include counterfactuals, and statements of “belief”,
such as “Rex believes that the cat is hungry.” all require special
care to separate what is true in the world versus some context
of an alternate time, of the mental state of some agent, etc. Since
the logics for these are quite complex and the circumstances for
needing them rare, here we will consider only sentences that
do not involve intensionality. In fact, the complexity of repre-
senting intensional contexts in logic is one of the reasons that
researchers cite for using graph-based representations (which we
consider later), as graphs can be partitioned to define different
contexts explicitly. Figure 5.12 shows some example mappings
Natural language types    Logical type

Conjoined VP: VP CC VP
[[VP]] and [[VP]], where [[X]] gives the semantics of X

Sentence: NP NPr1 VP, where NPr1 is a previously raised NP
λx Qy Ry and P(x, y)(NP), e.g., "Susan owns a grey cat."
λx ∃y [grey(y) and cat(y) and owns(x, y)](Susan)
∃y [grey(y) and cat(y) and owns(Susan, y)]

Sentence: NP NPr1 NPr2 VP, where NPri are previously raised NPs
Q1y Ry and Q2z Rz and λx P(x, y, z)(NP), e.g., "Sy gave a girl a book."
λx ∃y [girl(y) and ∃z book(z) and gave(x, y, z)](Sy)
∃y [girl(y) and ∃z book(z) and gave(Sy, y, z)]

Conjoined sentence: S CC S
[[S]] and [[S]], where [[X]] gives the semantics of X
Pure FOL, even with lambda expressions, does not fully capture
the meaning of referring expressions such as pronouns, proper
names, and definite descriptions (such as “the cat”), when we
consider what would be needed by a database or a knowledge
base (KB). A backend representation structure must be able to
link referential expressions, such as “the cat” or “a cat”, to some
entity in the KB, either existing (“the”) or newly asserted “a”. Also,
it is not always sufficient to use a constant for named entities,
e.g., [[Susan]] = susan. Using a constant like this assumes unique-
ness (e.g., that there is only one person named Susan), when in
reality there are millions of people who share that name. At an
abstract level, what we need is a representation of entities, dis-
tinct from the expressions (strings) used to name them and we
need functions to map between the expressions at the logical
level, with those in the underlying KB. We can express this at
the logical level by defining function symbols that invoke these
functions at the implementation level, such as Named_entity(
“Susan”) or Pronoun( “it”). Then to implement these at the back-
end we would need to define them with some new ad hoc func-
tion or database assertion or query. In Python we might build
a dictionary with dynamically created identifiers as keys and
asserted relations as values, e.g., relations[cat1] = [grey(cat1),
cat(cat1), eat(cat1, mouse1)], when “a grey cat ate a mouse” is men-
tioned. Or we might use a database language, such as Cypher,
with expressions like CREATE (susan:Person {name: ‘Susan’}) or
MATCH (:Person {name: ‘Susan’}), respectively, when Susan is
mentioned. Including KB entities within the semantics would
also require making a change to representations involving an
existential quantifier, where we remove the quantifier and sub-
stitute for the variable a new constant corresponding to the
unnamed individual for which the predicate is true. In logic
this notion is accomplished via a “skolem function”; for an NL
semantics one might use an ad hoc function to create new sym-
bols on the fly22. And, with possibly multiple symbols corre-
sponding to the same real-world entity, a logic or KB would
also need an equality operator, e.g., susan1 = susan2 or
owl.sameAs(susan1,susan2). Similar issues (and solutions) would
arise for references to entities that get their meaning from the
context, such as indexicals (“I”, “you”, “here”, or “there”) or ref-
erences that depend on time and location (“the president”, “the
teacher”). We will discuss these issues further in Chapter 7, when
we consider how multiple sentences taken together form a
coherent unit, known as a discourse.
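As a rough sketch of that kind of backend bookkeeping in Python, the fragment below uses a made-up helper, new_entity, that plays the role of a skolem-style symbol generator, and stores asserted relations in a dictionary keyed by entity:

import itertools

_counter = itertools.count(1)

def new_entity(base):
    """Return a new unique identifier, standing in for a skolem constant."""
    return f"{base}{next(_counter)}"

relations = {}

# "A grey cat ate a mouse": introduce two new entities and assert facts about them
cat = new_entity("cat")        # e.g., 'cat1'
mouse = new_entity("mouse")    # e.g., 'mouse2'
relations[cat] = [("grey", cat), ("cat", cat), ("eat", cat, mouse)]
relations[mouse] = [("mouse", mouse)]

print(relations[cat])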
Syntax Example
Figure 5.15 Examples of a complex concept in description logic and some example
sentences
"a company with at least 7 directors, whose managers are all women with PhDs,
and whose minimum salary is $100/hr"
[AND Company
  [EXISTS 7 :Director]
  [ALL :Manager [AND Woman [FILLS :Degree PhD] [FILLS :MinSalary '$100/hour']]]]
"A dog is among other things a mammal that is a pet and a carnivorous animal
whose voice call includes barking"
(Dog [AND Mammal Pet CarnivorousAnimal [FILLS :VoiceCall barking]])
"A FatherOfDaughters is a male with at least one child and all of whose children
are female"
(FatherOfDaughters ≡ [AND Male [EXISTS 1 :Child] [ALL :Child Female]])
Frame determiner    Determiner, "the", "a"    (the ?x), (a ?x)
Frame statement, Instance    Noun phrase, "the cat"    (the ?x (cat ?x)), cat01
Frame determiner    Clause-end punctuation, ".", "?"    (a ?x), (question ?x)
Figure 5.17 AMRL Graph for “Find restaurants near the Sharks game” (Image
from Perera et al 2018)
Figure 5.18 Conceptual Graph illustration and CGIF notation for one scoping of
“Tom believes Mary wants to marry a sailor.”
5.4 SUMMARY
Notes
At several different layers, it's a fascinating tale.    Bell Industries Inc. increased its quarterly to 10 cents from 7 cents a share.
Similarity class    Sentence 1    Sentence 2
6.5 SUMMARY
Notes
• To extract information,
• To find documents or information within larger collec-
tions,
• To convey distributed, structured information, such as
found in a database, in a more understandable form, and
• To translate from one form or language into another.
Computational models of dialog are also used to manage com-
plex devices or to elicit social behaviors from people (e.g., as a
diagnostic, monitoring, or treatment tool for depression1). The
participants in a dialog can all be people, in which case the role of
NLP might be to extract knowledge from their interaction, or to
provide mediating services. Or, the parties might be a heteroge-
neous group of people and an automated system, such as a chat-
bot. Because dialog involves multiple parties, it brings additional
complexity to manage the flow of control among participants
and also to assure that participants’ understandings of the dia-
log are similar. Applications of dialog include interactive voice
response systems (IVR), question answering systems, chatbots,
and information retrieval systems. Applications of discourse that
do not require interaction include text summarization systems
and machine translation systems.
When we think of the meaning of discourse, we might think
about the “story” that the discourse is trying to convey. Under-
standing a story requires a deeper level of understanding than
found in the benchmark tasks we discussed in an earlier chapter.
Story understanding was among the tasks that concerned many
AI and NLP researchers in the 1970’s, before access to large elec-
tronic corpora became widely available2. Some of the key ideas
of understanding a story are similar to what might be captured
in a deep, logic-based semantics, as we discussed in Chapter 5,
including wanting to know what people or things are involved
in the story. (In a logic, these would be referents associated with
logical terms.) One might also want to know the various sorts of
properties and relations that hold among characters, objects, and
events, for example, where things are located, when events hap-
pened, what caused the events to happen, and why the charac-
ters did what they did. In a 2020 article in the MIT Technology
Review, these elements of story understanding were noted as
being totally missed by current research in natural language pro-
cessing3. One reason that story understanding remains unsolved
is that these tasks were found to be much more difficult than
The reason to consider discourse and dialog rather than just the
sentences that comprise them is that sometimes information is
presented or requested over multiple sentences, and we want to
recognize various phrases or relations among them that identify
the who, what, when, where and why of the event. For discourse,
we might extract information found in articles in newspapers or
magazines or the chapters of books and store it in tables that
are more easily searchable. For dialog, we might want to extract
information from interactions to achieve tasks, such as book-
ing travel or making a restaurant reservation, or tutoring a stu-
dent5. Or, we might want to be able to identify direct or implied
requests within purely social conversations that unfold without
specified goals or roles among the participants. Many of these
tasks have been the focus of recent dialog state tracking and dia-
log system technology challenges6, 7
Figure 7.2 Some filled roles for the terrorism template from MUC-3
PERPETRATOR: ID OF ORG(S)    –
PERPETRATOR: CONFIDENCE    –
PHYSICAL TARGET: ID(S)    "FAST-FOOD RESTAURANT" / "PRESTO INSTALLATIONS" / "RESTAURANT"
PHYSICAL TARGET: TYPE(S)    COMMERCIAL: "FAST-FOOD RESTAURANT" / "PRESTO INSTALLATIONS" / "RESTAURANT"
INSTRUMENT: TYPE(S)    –
Coach: What goal could you make that would allow you to do more walking?
Patient: Maybe walk (S activity) more in the evening after work (S time).
Coach: Ok sounds good. [How many days after work (S time) would you like to
walk (S activity)?] (M days number intent)
Coach: [And which days would be best?] (M days name intent)
Patient: 2 days (M days number). Thursday (M days name), maybe Tuesday (M
days name update)
Coach: [Think about how much walking (S activity) you like to do for example 2
block (M quantity distance other)]
(M quantity intent)
Patient: At least around the block (M quantity distance) to start.
Coach: [On a scale of 1 − 10 with 10 being very sure. How sure are you that you
will accomplish your goal?](A intent)
Patient: 5 (A score)
Grounding type    Example of a turn (A) and a reply (B) with grounding
Continued attention
Relevant next contribution    A: How are you? B: I am fine thank you.
Type    Example    Possible intended responses
Directive    Let the cat out.    Hearer will let the cat out.
When the type of action matches the surface form, we say that
the expression is direct or literal, otherwise it is indirect or non-
literal. For example, the expression “Can you pass the salt?” is a
direct yes-no question, but an indirect request (to pass the salt),
which can be further clarified by adding other request features,
such as “please“45. The acceptable use of indirect language is spe-
cific to particular language groups. For example, in some cul-
tures (or family subcultures) it is considered impolite to request
or order something directly (e.g., “Close the window.”), so speak-
ers will put their request in another form that is related, and rely
on the listener to infer the request (e.g., “Can you close the win-
dow?” or “I’m feeling cold with that window open.”). These forms
are not universal, and thus others may find such indirect requests
impolite or confusing46.
Resources for using DiAML are limited. There is a tool for anno-
tating multimodal dialog in video data, called “ANVIL”55, 56, 57.
The creators of ANVIL provide small samples of annotated data
to help illustrate the use of the tool.
One older, but still useful dataset, is a version of the Switch-
board Corpus58 that has been annotated with general types of
dialog actions, including yes-no questions, statements, expres-
sions of appreciation, etc, comprising 42 distinct types overall59.
The corpus contains 1,155 five-minute telephone conversations
between two participants, where callers discuss one of a fixed set
of pre-defined topics, including child care, recycling, and news
media. Overall, about 440 different speakers participated, pro-
ducing 221,616 utterances (or 122,646 utterances, if consecutive
utterances by the same person are combined). The corpus is now
openly distributed by researchers at the University of Colorado
at Boulder60.
7.4 SUMMARY
Notes
Question Response
Now suppose that one wishes to create a tool to help the moder-
ator of an online forum answer questions from cancer survivors.
People can post their stories to a forum, including requests for
information or advice, and forum members can also respond to
each other. The moderator may chime in with the answer to
a question if it has gone unanswered for too long or the best
answer is missing. Sometimes the questions are answerable with-
out more information; sometimes the best thing to do is to refer
them to a healthcare provider. The challenges include: identify-
ing requests for information, identifying whether the responses
have adequately answered the question, and if necessary, finding
and providing either a direct answer (if it is brief) or reference
to a longer document that would contain the answer. One recent
approach tested the idea of training a classifier to identify sen-
tences that express an information need, extracting keywords
from those sentences to form a query to a search engine to
extract passages from the provider’s existing educational materi-
als for patient. They study found that very few questions would
be answerable from the educational documents, because they
contained concepts outside the scope of the materials1. (For
example, the patient might have already read the materials and
found they did not address their concerns, e.g., how to cope with
hair loss, or were too generic, and that is what motivated them to
seek help from peers.)
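As an illustration only (not the published system), the following sketch shows the general shape of such a pipeline in Python, using scikit-learn with a handful of hand-labeled sentences; the tiny training set and the stopword-based keyword heuristic are invented for this example, and the actual retrieval step is omitted.

# Illustrative sketch: classify sentences that express an information need,
# then turn the positive ones into keyword queries for a search engine.
# The training examples and the keyword heuristic are invented assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_sentences = [
    "Has anyone found a good way to cope with hair loss?",   # expresses an information need
    "What foods helped with nausea during treatment?",       # expresses an information need
    "Thanks everyone for the kind words.",                    # does not
    "I had my last treatment in March.",                      # does not
]
train_labels = [1, 1, 0, 0]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_sentences, train_labels)

stopwords = {"has", "anyone", "know", "how", "does", "to", "with", "a", "what", "i", "my", "after"}

def keyword_query(sentence):
    # Very simple keyword extraction: lowercase, strip punctuation, drop stopwords.
    words = [w.strip("?.,!").lower() for w in sentence.split()]
    return " ".join(w for w in words if w and w not in stopwords)

new_post = "Does anyone know how to cope with hair loss after treatment?"
if classifier.predict([new_post])[0] == 1:
    # The query would then be submitted to a search engine over the
    # provider's educational documents (retrieval step not shown).
    print("Search query:", keyword_query(new_post))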
Figure 8.3 Architecture of a text retrieval system (Image: Jurafsky and Martin 2008)
Query processing takes the input from the user (which we call a
query) and performs standardization and generalization steps to
improve the likelihood of finding a match between the inferred
intent and the documents that are retrieved. Terms in a query
may be removed if they are too general (based on a hand-created
“stoplist”) or terms may be added, such as synonyms from a
thesaurus (“query expansion”). Sometimes unrelated terms are
added on the basis that they were frequent in the top-ranked
matched documents of the original query (“pseudo-relevance
feedback”), which is one method of biasing the search towards
one sense of an ambiguous term. Terms may also be reordered
for efficiency. For example, when the results of queries with mul-
tiple terms conjoined by “and” are merged via an intersection
operation, the smallest sets of results are merged first, decreasing
the number of steps.
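As a concrete illustration of these steps, here is a minimal Python sketch of query processing, using an invented stoplist, thesaurus, and document-frequency table; a real system would derive these resources from the collection itself.

# Illustrative sketch of query processing: stoplist removal, synonym-based
# query expansion, and reordering so that the rarest term (the one with the
# smallest set of matching documents) is intersected first. All word lists
# and counts below are invented.
stoplist = {"the", "a", "an", "of", "for"}
thesaurus = {"car": ["automobile"], "film": ["movie"]}
document_frequency = {"car": 5000, "automobile": 800, "hybrid": 300, "film": 4000}

def process_query(query):
    terms = [t.lower() for t in query.split()]
    # 1. Remove overly general terms using the hand-created stoplist.
    terms = [t for t in terms if t not in stoplist]
    # 2. Query expansion: add synonyms from the thesaurus.
    expanded = list(terms)
    for t in terms:
        expanded.extend(thesaurus.get(t, []))
    # 3. Reorder terms so the least frequent come first; intersecting the
    #    smallest result sets first keeps the intermediate results small.
    expanded.sort(key=lambda t: document_frequency.get(t, 0))
    return expanded

print(process_query("the hybrid car"))   # ['hybrid', 'automobile', 'car']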
Search, also called matching, uses the processed query to find
(partially) matching documents within the index and assigns each one a
score that can be used to rank the results.
Estimated Probability_Y = [(# Judge 1 said Y) + (# Judge 2 said Y)] ÷ [2 × (# of samples)]
Estimated Probability_N = [(# Judge 1 said N) + (# Judge 2 said N)] ÷ [2 × (# of samples)]
Kappa = [Observed Agreement − Expected Agreement] ÷ (1 − Expected Agreement)
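The calculation can be written out directly; the following short sketch computes kappa from two judges' yes/no relevance judgments (the label lists are invented sample data).

# Sketch of computing kappa from two judges' Y/N judgments (invented sample data).
judge1 = ["Y", "Y", "N", "Y", "N", "N", "Y", "Y"]
judge2 = ["Y", "N", "N", "Y", "N", "Y", "Y", "Y"]
n = len(judge1)

observed_agreement = sum(a == b for a, b in zip(judge1, judge2)) / n

# Estimated probabilities of each label, pooled over both judges.
p_y = (judge1.count("Y") + judge2.count("Y")) / (2 * n)
p_n = (judge1.count("N") + judge2.count("N")) / (2 * n)

# Expected (chance) agreement and kappa.
expected_agreement = p_y ** 2 + p_n ** 2
kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)
print(round(kappa, 3))   # about 0.47 for this sample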
Source text: New York Times Co. named Russell T. Lewis, 45, president and general
manager of its flagship New York Times newspaper, responsible for all
business-side activities.

Extracted template:
<SUCCESSION-1>
ORGANIZATION : New York Times Co.
POST : president
WHO_IS_IN : Russell T. Lewis
WHO_IS_OUT :
</SUCCESSION>
AllenNLP output, one frame per verb:
raging: Extreme weather is battering the Western United States, with [ARG1: fires] [V: raging] [ARGM-LOC: along the Pacific Coast] and snow falling in Colorado
falling: Extreme weather is battering the Western United States, with [ARG1: fires raging along the Pacific Coast and] [ARG1: snow] [V: falling] [ARGM-LOC: in Colorado]
Figure 8.7 Example of an argument including claim and premise from Moens
(2018)
8.6 SUMMARY
Notes
While there are plenty of budget-busting robot vacuums ready to do your bidding,
finding one like the Roborock S4 Max, which combines performance and
affordability, is rare. It gets the job done smartly and efficiently – without cleaning
out your wallet. …
In our Roborock S4 Max review, we found a vacuum that works well and has useful,
modern features. With fast mapping, single room cleaning, and automatic carpet
detection, the $429 S4 Max strikes the right balance of performance, features, and
cost. All of that has earned a spot at the top of our best robot vacuums list.
At the same time, if you had searched online, you would have
found customer reviews as shown in Figure 9.2. In the positive
review, the features mentioned were: mopping, camera based
object avoidance, (quality of) cleaning, WiFi setup, laser naviga-
tion, battery life, (degree of) quiet, (speed of) mopping, (accuracy
of) map, virtual walls, no-go function. In the negative review, the
features mentioned were: (accuracy of) map, (quality of) clean-
ing (expressed as “There was a lot of debris left after two cycles
on max mode”), (quality of) suction, (accuracy of) mapping,
expressed as “It is currently in my master bathroom running into
the cabinet although it was set to clean the kitchen”; and (quality
of) object avoidance (expressed as “running into walls” and “stuck
under the dishwasher”).
Negative review (1 star): Let me start by saying this is my first robot vacuum. I read a
million reviews and did my research. I am less than impressed! The first day it didn’t
map my house correctly so I mapped it again no biggie. It worked great on my tile floor
the first couple of days but not so much on my large area rug. There was a lot of debris
left after two cycles on max mode. The suction is awful on rugs we are about a week in
and it’s the same with my tile. I would like to mention I have cleaned the dust bun and
untangled hair from the rollers after every cycle. Now it doesn’t even clean the rooms I
set for it to clean. It is currently in my master bathroom running into the cabinet
although it was set to clean the kitchen. I also had issues with it going in circles
running into walls. It literally runs into EVERYTHING it is constantly stuck under the
dishwasher. Even after I set it as a no go zone! I tried to contact support but they
haven’t responded. Highly disappointed as I have heard good things about roborock. I
would look elsewhere save your money!

Positive review (5 star): If you don’t need mopping, get the S4 Max vs the S5 Max. If
you don’t need camera based object avoidance, and most people don’t (or don’t want
vacuum cameras in your house) get the S4 Max vs the S6 Max.

Great cleaning, easy WiFi setup, laser navigation, ~150 minutes of battery life,
surprisingly quiet especially on the lowest power setting, and you can now know the
precise square footage of every room you have!

Better cleanup performance than the Roomba s9 and the same as the S6 Max. I like
that it doesn’t include a mopping function as I didn’t need this, saving extra costs.

The Roborock S4 Max uses LiDAR navigation enabling super fast mapping. Same
capability as the Roborock S5 Max. I was amazed how quickly you can see on the app
the map being generated; it created an accurate map on its first run. I found the virtual
walls and no-go function to fit my needs perfectly.
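To suggest how feature (aspect) mentions like those listed above might be extracted automatically, here is a small sketch that matches a hand-built list of aspect terms against a review and collects nearby opinion words; both word lists are invented for this illustration and are not part of any particular published system.

# Illustrative sketch of aspect spotting: find mentions of known aspect terms
# and collect opinion words that occur near them. The word lists are invented.
import re

aspect_terms = ["mapping", "map", "suction", "cleaning", "battery life",
                "wifi setup", "navigation", "object avoidance", "virtual walls"]
opinion_words = {"great", "awful", "easy", "accurate", "quiet", "amazed", "disappointed"}

def find_aspects(review, window=6):
    tokens = re.findall(r"[a-z']+", review.lower())
    found = []
    for aspect in aspect_terms:
        a_tokens = aspect.split()
        for i in range(len(tokens) - len(a_tokens) + 1):
            if tokens[i:i + len(a_tokens)] == a_tokens:
                # Look a few tokens to either side of the aspect for opinion words.
                context = tokens[max(0, i - window): i + len(a_tokens) + window]
                found.append((aspect, sorted(set(context) & opinion_words)))
    return found

text = "There was a lot of debris left after two cycles on max mode. The suction is awful on rugs."
print(find_aspects(text))   # [('suction', ['awful'])]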
English French
The cat is standing on the mat. Le chat est debout sur le tapis.
On the mat, the cat is standing. Sur le tapis, le chat est debout.
The cat is sleeping under the bed. Le chat dort sous le lit.
Under the bed, the cat is sleeping. Sous le lit, le chat dort.
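Sentence pairs like those above can be produced with a pretrained neural translation model. The sketch below uses the Hugging Face transformers library with one publicly available English-to-French model; the model choice is an illustrative assumption, and the output will not necessarily match the table word for word.

# Illustrative sketch: translating the English sentences above with a pretrained
# English-to-French model (the specific model is an assumption for this example).
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

sentences = [
    "The cat is standing on the mat.",
    "On the mat, the cat is standing.",
    "The cat is sleeping under the bed.",
    "Under the bed, the cat is sleeping.",
]
for s in sentences:
    print(s, "->", translator(s)[0]["translation_text"])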
9.5 SUMMARY
Notes
Make more top-down predictions to create active edges for each of the
nonterminal categories just to the right of the dot:
[0:0] S → * NP VBD
[0:0] S → * NP VP
[0:0] NP → * DT NN

Predict "dog"; apply the fundamental rule and then make a top-down
prediction for a VP (using the second S rule):
[0:0] S → * NP VBD
[0:0] S → * NP VP
[0:0] NP → * DT NN
[0:0] DT → * the
[0:1] DT → the *
[0:1] NP → DT * NN
[1:1] NN → * dog
[1:2] NN → dog *
[0:2] NP → DT NN *
[0:2] S → NP * VBD
[0:2] S → NP * VP
[2:2] VP → * VBZ NN

Predict "meat"; apply the fundamental rule for meat as a noun (NN) and
again to extend the active VP edge, to get VP → VBZ NN *. Finally, use the
fundamental rule to extend the active S edge, which is now complete:
[0:0] S → * NP VBD
[0:0] S → * NP VP
[0:0] NP → * DT NN
[0:0] DT → * the
[0:1] DT → the *
[0:1] NP → DT * NN
[1:1] NN → * dog
[1:2] NN → dog *
[0:2] NP → DT NN *
[0:2] S → NP * VBD
[0:2] S → NP * VP
[2:2] VP → * VBZ NN
[2:2] VBZ → * likes
[2:3] VBZ → likes *
[2:3] VP → VBZ * NN
[3:3] NN → * meat
[3:4] NN → meat *
[2:4] VP → VBZ NN *
[0:4] S → NP VP *
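The same parse can be reproduced with a chart parser from NLTK. The toy grammar below is an assumption reconstructed from the dotted rules in the trace above; setting trace=1 prints each edge as it is added to the chart.

# Sketch: chart parsing "the dog likes meat" with NLTK, using a toy grammar
# reconstructed (as an assumption) from the dotted rules shown above.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP | NP VBD
NP -> DT NN
VP -> VBZ NN
DT -> 'the'
NN -> 'dog' | 'meat'
VBZ -> 'likes'
""")

parser = nltk.ChartParser(grammar, trace=1)   # trace=1 prints edges as they are added
for tree in parser.parse("the dog likes meat".split()):
    print(tree)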
Notes