CHAPTER 8
Sequence Labeling for Parts of Speech and Named Entities

"To each word a warbling note"
A Midsummer Night's Dream, V.i
Dionysius Thrax of Alexandria (c. 100 B.C.), or perhaps someone else (it was a long
time ago), wrote a grammatical sketch of Greek (a “technē”) that summarized the
linguistic knowledge of his day. This work is the source of an astonishing proportion
of modern linguistic vocabulary, including the words syntax, diphthong, clitic, and
analogy. Also included is a description of eight parts of speech: noun, verb,
pronoun, preposition, adverb, conjunction, participle, and article. Although earlier
scholars (including Aristotle as well as the Stoics) had their own lists of parts of
speech, it was Thrax’s set of eight that became the basis for descriptions of European
languages for the next 2000 years. (All the way to the Schoolhouse Rock educational
television shows of our childhood, which had songs about 8 parts of speech, like the
late great Bob Dorough’s Conjunction Junction.) The durability of parts of speech
through two millennia speaks to their centrality in models of human language.
Proper names are another important and anciently studied linguistic category.
While parts of speech are generally assigned to individual words or morphemes, a
proper name is often an entire multiword phrase, like the name “Marie Curie”, the
location “New York City”, or the organization “Stanford University”. We’ll use the
term named entity for, roughly speaking, anything that can be referred to with a
proper name: a person, a location, an organization, although as we’ll see the term is
commonly extended to include things that aren’t entities per se.
Parts of speech (also known as POS) and named entities are useful clues to
sentence structure and meaning. Knowing whether a word is a noun or a verb tells us
about likely neighboring words (nouns in English are preceded by determiners and
adjectives, verbs by nouns) and syntactic structure (verbs have dependency links to
nouns), making part-of-speech tagging a key aspect of parsing. Knowing if a named
entity like Washington is a name of a person, a place, or a university is important to
many natural language processing tasks like question answering, stance detection,
or information extraction.
In this chapter we’ll introduce the task of part-of-speech tagging, taking a se-
quence of words and assigning each word a part of speech like NOUN or VERB, and
the task of named entity recognition (NER), assigning words or phrases tags like
PERSON, LOCATION, or ORGANIZATION.
Such tasks, in which we assign to each word $x_i$ in an input word sequence a
label $y_i$, so that the output sequence $Y$ has the same length as the input sequence $X$,
are called sequence labeling tasks. We'll introduce classic sequence labeling algo-
rithms, one generative— the Hidden Markov Model (HMM)—and one discriminative—
the Conditional Random Field (CRF). In following chapters we’ll introduce modern
sequence labelers based on RNNs and Transformers.
Tag     Description                                                               Example
Open Class
ADJ     Adjective: noun modifiers describing properties                           red, young, awesome
ADV     Adverb: verb modifiers of time, place, manner                             very, slowly, home, yesterday
NOUN    Words for persons, places, things, etc.                                   algorithm, cat, mango, beauty
VERB    Words for actions and processes                                           draw, provide, go
PROPN   Proper noun: name of a person, organization, place, etc.                  Regina, IBM, Colorado
INTJ    Interjection: exclamation, greeting, yes/no response, etc.                oh, um, yes, hello
Closed Class
ADP     Adposition (preposition/postposition): marks a noun's spatial, temporal, or other relation   in, on, by, under
AUX     Auxiliary: helping verb marking tense, aspect, mood, etc.                 can, may, should, are
CCONJ   Coordinating conjunction: joins two phrases/clauses                       and, or, but
DET     Determiner: marks noun phrase properties                                  a, an, the, this
NUM     Numeral                                                                   one, two, first, second
PART    Particle: a function word that must be associated with another word       's, not, (infinitive) to
PRON    Pronoun: a shorthand for referring to an entity or event                  she, who, I, others
SCONJ   Subordinating conjunction: joins a main clause with a subordinate clause such as a sentential complement   that, which
Other
PUNCT   Punctuation                                                               , . ( )
Parts of speech fall into two broad categories: closed class and open class.
Closed classes are those with relatively fixed membership, such as prepositions—
new prepositions are rarely coined. By contrast, nouns and verbs are open classes—
new nouns and verbs like iPhone or to fax are continually being created or borrowed.
Closed class words are generally function words like of, it, and, or you, which tend
to be very short, occur frequently, and often have structuring uses in grammar.
Four major open classes occur in the languages of the world: nouns (including
proper nouns), verbs, adjectives, and adverbs, as well as the smaller open class of
interjections. English has all five, although not every language does.
Nouns are words for people, places, or things, but include others as well. Com-
mon nouns include concrete terms like cat and mango, abstractions like algorithm
and beauty, and verb-like terms like pacing as in His pacing to and fro became quite
annoying. Nouns in English can occur with determiners (a goat, this bandwidth),
take possessives (IBM's annual revenue), and may occur in the plural (goats, abaci).
Many languages, including English, divide common nouns into count nouns and
mass nouns. Count nouns can occur in the singular and plural (goat/goats, rela-
tionship/relationships) and can be counted (one goat, two goats). Mass nouns are
used when something is conceptualized as a homogeneous group. So snow, salt, and
communism are not counted (i.e., *two snows or *two communisms). Proper nouns,
like Regina, Colorado, and IBM, are names of specific persons or entities.
Verbs refer to actions and processes, including main verbs like draw, provide,
and go. English verbs have inflections (non-third-person-singular (eat), third-person-
singular (eats), progressive (eating), past participle (eaten)). While many scholars
believe that all human languages have the categories of noun and verb, others have
argued that some languages, such as Riau Indonesian and Tongan, don’t even make
this distinction (Broschart 1997; Evans 2000; Gil 2000).
Adjectives often describe properties or qualities of nouns, like color (white,
black), age (old, young), and value (good, bad), but there are languages without
adjectives. In Korean, for example, the words corresponding to English adjectives
act as a subclass of verbs, so what is in English an adjective “beautiful” acts in
Korean like a verb meaning “to be beautiful”.
Adverbs are a hodge-podge. All the italicized words in this example are adverbs:
Actually, I ran home extremely quickly yesterday
Adverbs generally modify something (often verbs, hence the name “adverb”, but
also other adverbs and entire verb phrases). Directional adverbs or locative ad-
verbs (home, here, downhill) specify the direction or location of some action; degree
adverbs (extremely, very, somewhat) specify the extent of some action, process, or
property; manner adverbs (slowly, slinkily, delicately) describe the manner of some
action or process; and temporal adverbs describe the time that some action or event
took place (yesterday, Monday).
Interjections (oh, hey, alas, uh, um) are a smaller open class that also includes
greetings (hello, goodbye) and question responses (yes, no, uh-huh).
English adpositions occur before nouns, hence are called prepositions. They can
indicate spatial or temporal relations, whether literal (on it, before then, by the house)
or metaphorical (on time, with gusto, beside herself), and relations like marking the
agent in Hamlet was written by Shakespeare.
A particle resembles a preposition or an adverb and is used in combination with
a verb. Particles often have extended meanings that aren’t quite the same as the
prepositions they resemble, as in the particle over in she turned the paper over. A
verb and a particle acting as a single unit is called a phrasal verb. The meaning
of phrasal verbs is often non-compositional—not predictable from the individual
meanings of the verb and the particle. Thus, turn down means ‘reject’, rule out
‘eliminate’, and go on ‘continue’.
Determiners like this and that (this chapter, that page) can mark the start of an
English noun phrase. Articles like a, an, and the, are a type of determiner that mark
discourse properties of the noun and are quite frequent; the is the most common
word in written English, with a and an right behind.
Conjunctions join two phrases, clauses, or sentences. Coordinating conjunc-
tions like and, or, and but join two elements of equal status. Subordinating conjunc-
tions are used when one of the elements has some embedded status. For example,
the subordinating conjunction that in “I thought that you might like some milk” links
the main clause I thought with the subordinate clause you might like some milk. This
clause is called subordinate because this entire clause is the “content” of the main
verb thought. Subordinating conjunctions like that, which link a verb to its argument
in this way, are also called complementizers.
Pronouns act as a shorthand for referring to an entity or event. Personal pro-
nouns refer to persons or entities (you, she, I, it, me, etc.). Possessive pronouns are
forms of personal pronouns that indicate either actual possession or more often just
an abstract relation between the person and some object (my, your, his, her, its, one’s,
our, their). Wh-pronouns (what, who, whom, whoever) are used in certain question
Below we show some examples with each word tagged according to both the
UD and Penn tagsets. Notice that the Penn tagset distinguishes tense and participles
on verbs, and has a special tag for the existential there construction in English. Note
that since New England Journal of Medicine is a proper noun, both tagsets mark its
component nouns as NNP, including journal and medicine, which might otherwise
be labeled as common nouns (NOUN/NN).
(8.1) There/PRON/EX are/VERB/VBP 70/NUM/CD children/NOUN/NNS
there/ADV/RB ./PUNCT/.
(8.2) Preliminary/ADJ/JJ findings/NOUN/NNS were/AUX/VBD reported/VERB/VBN
in/ADP/IN today/NOUN/NN ’s/PART/POS New/PROPN/NNP
England/PROPN/NNP Journal/PROPN/NNP of/ADP/IN Medicine/PROPN/NNP
Figure 8.3 The task of part-of-speech tagging: mapping from input words $x_1, x_2, \ldots, x_n$ to
output POS tags $y_1, y_2, \ldots, y_n$.
thought that your flight was earlier). The goal of POS-tagging is to resolve these
ambiguities, choosing the proper tag for the context.
The accuracy of part-of-speech tagging algorithms (the percentage of test set
tags that match human gold labels) is extremely high. One study found accuracies
over 97% across 15 languages from the Universal Dependencies (UD) treebanks (Wu
and Dredze, 2019). Accuracies on various English treebanks (measured on the WSJ
corpus, sections 22-24) are also 97%, no matter the algorithm; HMMs, CRFs, and
BERT perform similarly. This 97% number is also about the human performance on
this task, at least for English (Manning, 2011).
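As a concrete illustration, here is a minimal sketch (in Python; the function and variable names are mine, not the chapter's) of how token-level tagging accuracy is computed: the fraction of predicted tags that match the gold tags.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Token-level accuracy: the fraction of positions where the
    predicted tag matches the human gold label."""
    assert len(gold_tags) == len(predicted_tags)
    correct = sum(1 for g, p in zip(gold_tags, predicted_tags) if g == p)
    return correct / len(gold_tags)

# Example: 4 of 5 tags match, so accuracy is 0.8
gold = ["NNP", "MD", "VB", "DT", "NN"]
pred = ["NNP", "MD", "VB", "DT", "NNS"]
print(tagging_accuracy(gold, pred))  # 0.8
```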
We’ll introduce algorithms for the task in the next few sections, but first let’s
explore the task. Exactly how hard is it? Fig. 8.4 shows that most word types
(85-86%) are unambiguous (Janet is always NNP, hesitantly is always RB). But the
ambiguous words, though accounting for only 14-15% of the vocabulary, are very
common, and 55-67% of word tokens in running text are ambiguous. Particularly
ambiguous common words include that, back, down, put and set; here are some
examples of the 6 different parts of speech for the word back:
earnings growth took a back/JJ seat
a small building in the back/NN
a clear majority of senators back/VBP the bill
Dave began to back/VB toward the door
enable the country to buy back/RP debt
I was twenty-one back/RB then
Nonetheless, many words are easy to disambiguate, because their different tags
aren’t equally likely. For example, a can be a determiner or the letter a, but the
determiner sense is much more likely.
This idea suggests a useful baseline: given an ambiguous word, choose the tag
which is most frequent in the training corpus. This is a key concept:
Most Frequent Class Baseline: Always compare a classifier against a baseline at
least as good as the most frequent class baseline (assigning each token to the class
it occurred in most often in the training set).
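To make the baseline concrete, here is a minimal sketch (in Python; the corpus format and names are illustrative assumptions, not from this chapter) of a most frequent class baseline tagger trained from a tagged corpus:

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """tagged_sentences: list of [(word, tag), ...] lists (assumed format).
    Returns a dict mapping each word to its most frequent training tag,
    plus the single most frequent tag overall (used for unknown words)."""
    counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
            tag_counts[tag] += 1
    most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = tag_counts.most_common(1)[0][0]
    return most_frequent, default_tag

def baseline_tag(words, most_frequent, default_tag):
    """Assign each token the tag it occurred with most often in training."""
    return [most_frequent.get(w, default_tag) for w in words]

# Tiny invented training set, just to show the interface
train = [[("a", "DT"), ("back", "NN")], [("back", "RB"), ("then", "RB")],
         [("the", "DT"), ("back", "NN")]]
mf, default = train_most_frequent_tag(train)
print(baseline_tag(["the", "back"], mf, default))  # ['DT', 'NN']
```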
Named entity tagging is a useful first step in lots of natural language processing
tasks. In sentiment analysis we might want to know a consumer’s sentiment toward a
particular entity. Entities are a useful first stage in question answering, or for linking
text to information in structured knowledge sources like Wikipedia. And named
entity tagging is also central to tasks involving building semantic representations,
like extracting events and the relationship between participants.
Unlike part-of-speech tagging, where there is no segmentation problem since
each word gets one tag, the task of named entity recognition is to find and label
spans of text, and is difficult partly because of the ambiguity of segmentation; we
need to decide what's an entity and what isn't, and where the boundaries are. Indeed,
most words in a text will not be named entities. Another difficulty is caused by type
ambiguity. The mention JFK can refer to a person, the airport in New York, or any
number of schools, bridges, and streets around the United States. Some examples of
this kind of cross-type confusion are given in Figure 8.6.
[PER Washington] was born into slavery on the farm of James Burroughs.
[ORG Washington] went up 2 games to 1 in the four-game series.
Blair arrived in [LOC Washington] for what may well be his last state visit.
In June, [GPE Washington] passed a primary seatbelt law.
Figure 8.6 Examples of type ambiguities in the use of the name Washington.
We’ve also shown two variant tagging schemes: IO tagging, which loses some
information by eliminating the B tag, and BIOES tagging, which adds an end tag
E for the end of a span, and a span tag S for a span consisting of only one word.
A sequence labeler (HMM, CRF, RNN, Transformer, etc.) is trained to label each
token in a text with tags that indicate the presence (or absence) of particular kinds
of named entities.
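As an illustration of span-to-tag conversion, here is a minimal sketch (in Python; the span format and the example sentence are assumptions of this illustration) that turns labeled entity spans into per-token BIO tags:

```python
def spans_to_bio(tokens, spans):
    """tokens: list of words.
    spans: list of (start, end, type) with end exclusive, e.g. (0, 2, "PER").
    Returns one BIO tag per token: B-TYPE for the first token of a span,
    I-TYPE for the remaining tokens, O for tokens outside any entity."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

tokens = ["Jane", "Villanueva", "of", "United", "Airlines", "Holding"]
spans = [(0, 2, "PER"), (3, 6, "ORG")]
print(spans_to_bio(tokens, spans))
# ['B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG', 'I-ORG']
```

The IO and BIOES variants mentioned above are simple transformations of this output: IO drops the B/I distinction, while BIOES additionally marks span-final (E) and single-token (S) entities.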
[Figure 8.8 shows two Markov chains: (a) weather, with states HOT1, COLD2, and WARM3; (b) words, with states like uniformly, charming, and are; transition probabilities label the arcs.]
Figure 8.8 A Markov chain for weather (a) and one for words (b), showing states and
transitions. A start distribution π is required; setting π = [0.1, 0.7, 0.2] for (a) would mean a
probability 0.7 of starting in state 2 (cold), probability 0.1 of starting in state 1 (hot), etc.
$$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})} \qquad (8.8)$$
In the WSJ corpus, for example, MD occurs 13124 times, of which it is followed
by VB 10471 times, for an MLE estimate of
$$P(\textrm{VB} \mid \textrm{MD}) = \frac{C(\textrm{MD}, \textrm{VB})}{C(\textrm{MD})} = \frac{10471}{13124} = .80 \qquad (8.9)$$
Let’s walk through an example, seeing how these probabilities are estimated and
used in a sample tagging task, before we return to the algorithm for decoding.
In HMM tagging, the probabilities are estimated by counting on a tagged training
corpus. For this example we’ll use the tagged WSJ corpus.
The B emission probabilities, $P(w_i \mid t_i)$, represent the probability, given a tag (say
MD), that it will be associated with a given word (say will). The MLE of the emis-
sion probability is
$$P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)} \qquad (8.10)$$
Of the 13124 occurrences of MD in the WSJ corpus, it is associated with will 4046
times:
$$P(\textit{will} \mid \textrm{MD}) = \frac{C(\textrm{MD}, \textit{will})}{C(\textrm{MD})} = \frac{4046}{13124} = .31 \qquad (8.11)$$
We saw this kind of Bayesian modeling in Chapter 4; recall that this likelihood
term is not asking “which is the most likely tag for the word will?” That would be
the posterior P(MD|will). Instead, P(will|MD) answers the slightly counterintuitive
question "If we were going to generate an MD, how likely is it that this modal would
be will?”
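The counting above is straightforward to implement. Here is a minimal sketch (in Python, assuming a tagged corpus represented as lists of (word, tag) pairs; the format and names are assumptions of this example, not the chapter's notation) of estimating the A transition and B emission probabilities by MLE:

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """Estimate unsmoothed HMM parameters by MLE.
    Returns transition probabilities P(tag | previous tag) and
    emission probabilities P(word | tag) as nested dicts."""
    transition_counts = defaultdict(Counter)  # C(t_{i-1}, t_i)
    emission_counts = defaultdict(Counter)    # C(t_i, w_i)
    tag_counts = Counter()                    # C(t_i)
    for sentence in tagged_sentences:
        prev = "<s>"                          # start-of-sentence pseudo-tag
        for word, tag in sentence:
            transition_counts[prev][tag] += 1
            emission_counts[tag][word] += 1
            tag_counts[tag] += 1
            prev = tag
    # Relative frequencies; the denominator for A is how often the
    # conditioning tag occurred in the "previous tag" position.
    A = {p: {t: c / sum(cs.values()) for t, c in cs.items()}
         for p, cs in transition_counts.items()}
    B = {t: {w: c / tag_counts[t] for w, c in ws.items()}
         for t, ws in emission_counts.items()}
    return A, B

# Usage sketch: A["MD"]["VB"] approximates P(VB|MD),
# and B["MD"]["will"] approximates P(will|MD).
```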
The A transition probabilities and B observation likelihoods of the HMM are
illustrated in Fig. 8.9 for three states in an HMM part-of-speech tagger; the full
tagger would have one state for each tag.
[Figure 8.9 shows three hidden states (MD, VB, NN) connected by transition probabilities $a_{ij}$, each state with its own emission distribution $B_i$ over words such as aardvark, will, the, back, and zebra.]
Figure 8.9 An illustration of the two parts of an HMM representation: the A transition
probabilities used to compute the prior probability, and the B observation likelihoods that are
associated with each state, one likelihood for each possible observation word.
For part-of-speech tagging, the goal of HMM decoding is to choose the tag
sequence $t_1 \ldots t_n$ that is most probable given the observation sequence of $n$ words
$w_1 \ldots w_n$:
$$\hat{t}_{1:n} = \operatorname*{argmax}_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n) \qquad (8.12)$$
The way we'll do this in the HMM is to use Bayes' rule to instead compute:
$$\hat{t}_{1:n} = \operatorname*{argmax}_{t_1 \ldots t_n} \frac{P(w_1 \ldots w_n \mid t_1 \ldots t_n)\, P(t_1 \ldots t_n)}{P(w_1 \ldots w_n)} \qquad (8.13)$$
Furthermore, we simplify Eq. 8.13 by dropping the denominator $P(w_1 \ldots w_n)$:
$$\hat{t}_{1:n} = \operatorname*{argmax}_{t_1 \ldots t_n} P(w_1 \ldots w_n \mid t_1 \ldots t_n)\, P(t_1 \ldots t_n) \qquad (8.14)$$
HMM taggers make two further simplifying assumptions. The first is that the
probability of a word appearing depends only on its own tag and is independent of
neighboring words and tags:
$$P(w_1 \ldots w_n \mid t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(w_i \mid t_i) \qquad (8.15)$$
The second assumption, the bigram assumption, is that the probability of a tag is
dependent only on the previous tag, rather than the entire tag sequence:
$$P(t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \qquad (8.16)$$
Plugging the simplifying assumptions from Eq. 8.15 and Eq. 8.16 into Eq. 8.14
results in the following equation for the most probable tag sequence from a bigram
tagger:
$$\hat{t}_{1:n} = \operatorname*{argmax}_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n) \approx \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} \overbrace{P(w_i \mid t_i)}^{\text{emission}}\; \overbrace{P(t_i \mid t_{i-1})}^{\text{transition}} \qquad (8.17)$$
The two parts of Eq. 8.17 correspond neatly to the B emission probability and A
transition probability that we just defined above!
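To make Eq. 8.17 concrete, here is a minimal sketch (in Python, reusing the A and B dictionaries from the estimation sketch above; these names are assumptions of these examples) that scores a single candidate tag sequence under the bigram HMM:

```python
def hmm_score(words, tags, A, B):
    """Unnormalized probability of a tag sequence under the bigram HMM:
    the product over positions of P(w_i | t_i) * P(t_i | t_{i-1})."""
    score = 1.0
    prev = "<s>"
    for word, tag in zip(words, tags):
        emission = B.get(tag, {}).get(word, 0.0)    # P(w_i | t_i)
        transition = A.get(prev, {}).get(tag, 0.0)  # P(t_i | t_{i-1})
        score *= emission * transition
        prev = tag
    return score

# Decoding means comparing this score across all candidate tag sequences;
# the Viterbi algorithm below does this efficiently with dynamic programming.
```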
Figure 8.10 Viterbi algorithm for finding the optimal sequence of tags. Given an observation sequence and
an HMM λ = (A, B), the algorithm returns the state path through the HMM that assigns maximum likelihood
to the observation sequence.
The Viterbi algorithm first sets up a probability matrix or lattice, with one col-
umn for each observation ot and one row for each state in the state graph. Each col-
umn thus has a cell for each state qi in the single combined automaton. Figure 8.11
shows an intuition of this lattice for the sentence Janet will back the bill.
Each cell of the lattice, $v_t(j)$, represents the probability that the HMM is in state
$j$ after seeing the first $t$ observations and passing through the most probable state
sequence $q_1, \ldots, q_{t-1}$, given the HMM $\lambda$. The value of each cell $v_t(j)$ is computed
by recursively taking the most probable path that could lead us to this cell. Formally,
each cell expresses the probability
$$v_t(j) = \max_{q_1, \ldots, q_{t-1}} P(q_1 \ldots q_{t-1},\, o_1 \ldots o_t,\, q_t = j \mid \lambda) \qquad (8.18)$$
We represent the most probable path by taking the maximum over all possible
previous state sequences $\max_{q_1, \ldots, q_{t-1}}$. Like other dynamic programming algorithms,
Viterbi fills each cell recursively. Given that we had already computed the probabil-
ity of being in every state at time t − 1, we compute the Viterbi probability by taking
the most probable of the extensions of the paths that lead to the current cell. For a
given state $q_j$ at time $t$, the value $v_t(j)$ is computed as
$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t) \qquad (8.19)$$
The three factors that are multiplied in Eq. 8.19 for extending the previous paths to
compute the Viterbi probability at time $t$ are
$v_{t-1}(i)$   the previous Viterbi path probability from the previous time step
$a_{ij}$       the transition probability from previous state $q_i$ to current state $q_j$
$b_j(o_t)$     the state observation likelihood of the observation symbol $o_t$ given
               the current state $j$
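Here is a minimal sketch of this recursion (in Python, again assuming the A and B dictionaries from the earlier sketches, with A["<s>"] as the start distribution; these names are assumptions of the example, not the chapter's pseudocode):

```python
def viterbi(words, tags, A, B):
    """Most probable tag sequence for `words` under a bigram HMM.
    A[prev][tag] = P(tag | prev), with A["<s>"] the start distribution;
    B[tag][word] = P(word | tag)."""
    V = [{}]            # V[t][tag] = best path probability ending in tag at time t
    backpointer = [{}]
    for tag in tags:
        V[0][tag] = A.get("<s>", {}).get(tag, 0.0) * B.get(tag, {}).get(words[0], 0.0)
        backpointer[0][tag] = None
    for t in range(1, len(words)):
        V.append({})
        backpointer.append({})
        for tag in tags:
            best_prev, best_score = tags[0], 0.0   # fallback if every extension is 0
            for prev in tags:
                score = (V[t - 1][prev] * A.get(prev, {}).get(tag, 0.0)
                         * B.get(tag, {}).get(words[t], 0.0))
                if score > best_score:
                    best_prev, best_score = prev, score
            V[t][tag] = best_score
            backpointer[t][tag] = best_prev
    # Termination: pick the best final state, then follow backpointers
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    return list(reversed(path))
```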
NNP MD VB JJ NN RB DT
<s>  0.2767 0.0006 0.0031 0.0453 0.0449 0.0510 0.2026
NNP 0.3777 0.0110 0.0009 0.0084 0.0584 0.0090 0.0025
MD 0.0008 0.0002 0.7968 0.0005 0.0008 0.1698 0.0041
VB 0.0322 0.0005 0.0050 0.0837 0.0615 0.0514 0.2231
JJ 0.0366 0.0004 0.0001 0.0733 0.4509 0.0036 0.0036
NN 0.0096 0.0176 0.0014 0.0086 0.1216 0.0177 0.0068
RB 0.0068 0.0102 0.1011 0.1012 0.0120 0.0728 0.0479
DT 0.1147 0.0021 0.0002 0.2157 0.4744 0.0102 0.0017
Figure 8.12 The A transition probabilities $P(t_i \mid t_{i-1})$ computed from the WSJ corpus with-
out smoothing. Rows are labeled with the conditioning event; thus $P(\textrm{VB} \mid \textrm{MD})$ is 0.7968.
Let the HMM be defined by the two tables in Fig. 8.12 and Fig. 8.13. Figure 8.12
lists the ai j probabilities for transitioning between the hidden states (part-of-speech
tags). Figure 8.13 expresses the $b_i(o_t)$ probabilities, the observation likelihoods of
words given tags. This table is (slightly simplified) from counts in the WSJ corpus.
So the word Janet only appears as an NNP, back has 4 possible parts of speech, and
the word the can appear as a determiner or as an NNP (in titles like “Somewhere
Over the Rainbow” all words are tagged as NNP).
Figure 8.14 The first few entries in the individual state columns for the Viterbi algorithm. Each cell keeps
the probability of the best path so far and a pointer to the previous cell along that path. We have only filled out
columns 1 and 2; to avoid clutter most cells with value 0 are left empty. The rest is left as an exercise for the
reader. After the cells are filled in, backtracing from the end state, we should be able to reconstruct the correct
state sequence NNP MD VB DT NN.
Figure 8.14 shows a fleshed-out version of the sketch we saw in Fig. 8.11, the
Viterbi lattice for computing the best hidden state sequence for the observation se-
quence Janet will back the bill.
There are five state columns, one for each observation (word). We begin in column 1 (for the word Janet) by
setting the Viterbi value in each cell to the product of the π transition probability
(the start probability for that state i, which we get from the <s> entry of Fig. 8.12),
and the observation likelihood of the word Janet given the tag for that cell. Most of
the cells in the column are zero since the word Janet cannot be any of those tags.
The reader should find this in Fig. 8.14.
Next, each cell in the will column gets updated. For each state, we compute the
value viterbi[s,t] by taking the maximum over the extensions of all the paths from
the previous column that lead to the current cell according to Eq. 8.19. We have
shown the values for the MD, VB, and NN cells. Each cell gets the max of the 7
values from the previous column, multiplied by the appropriate transition probabil-
ity; as it happens in this case, most of them are zero from the previous column. The
remaining value is multiplied by the relevant observation probability, and the (triv-
ial) max is taken. In this case the final value, 2.772e-8, comes from the NNP state at
the previous column. The reader should fill in the rest of the lattice in Fig. 8.14 and
backtrace to see whether or not the Viterbi algorithm returns the gold state sequence
NNP MD VB DT NN.
In a CRF, by contrast, we compute the posterior $p(Y \mid X)$ directly, training the CRF
to discriminate among the possible tag sequences:
$$\hat{Y} = \operatorname*{argmax}_{Y \in \mathcal{Y}} P(Y \mid X) \qquad (8.22)$$
However, the CRF does not compute a probability for each tag at each time step. In-
stead, at each time step the CRF computes log-linear functions over a set of relevant
features, and these local features are aggregated and normalized to produce a global
probability for the whole sequence.
Let’s introduce the CRF more formally, again using X and Y as the input and
output sequences. A CRF is a log-linear model that assigns a probability to an en-
tire output (tag) sequence $Y$, out of all possible sequences $\mathcal{Y}$, given the entire input
(word) sequence $X$. We can think of a CRF as a giant version of what multi-
nomial logistic regression does for a single token. Recall that the feature function
$f$ in regular multinomial logistic regression can be viewed as a function of a tuple:
a token $x$ and a label $y$. In a CRF, the function $F$ maps an entire input
sequence $X$ and an entire output sequence $Y$ to a feature vector. Let's assume we
have $K$ features, with a weight $w_k$ for each feature $F_k$:
$$p(Y \mid X) = \frac{\exp\left(\sum_{k=1}^{K} w_k F_k(X, Y)\right)}{\sum_{Y' \in \mathcal{Y}} \exp\left(\sum_{k=1}^{K} w_k F_k(X, Y')\right)} \qquad (8.23)$$
It's common to also describe the same equation by pulling out the denominator into
a function $Z(X)$:
$$p(Y \mid X) = \frac{1}{Z(X)} \exp\left(\sum_{k=1}^{K} w_k F_k(X, Y)\right) \qquad (8.24)$$
$$Z(X) = \sum_{Y' \in \mathcal{Y}} \exp\left(\sum_{k=1}^{K} w_k F_k(X, Y')\right) \qquad (8.25)$$
We'll call these $K$ functions $F_k(X, Y)$ global features, since each one is a property
of the entire input sequence $X$ and output sequence $Y$. We compute them by decom-
posing into a sum of local features for each position $i$ in $Y$:
$$F_k(X, Y) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, X, i) \qquad (8.26)$$
Each of these local features $f_k$ in a linear-chain CRF is allowed to make use of the
current output token $y_i$, the previous output token $y_{i-1}$, the entire input string $X$ (or
any subpart of it), and the current position $i$. This constraint of depending only on
the current and previous output tokens $y_i$ and $y_{i-1}$ is what characterizes a linear-
chain CRF. As we will see, this limitation makes it possible to use versions of the
efficient Viterbi and Forward-Backward algorithms from the HMM. A general CRF,
by contrast, allows a feature to make use of any output token, and such features are
necessary for tasks in which the decision depends on distant output tokens, like $y_{i-4}$.
General CRFs require more complex inference, and are less commonly used for
language processing.
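To illustrate Eq. 8.26 and Eq. 8.24, here is a minimal sketch (in Python; the two local features and their weights are toy values invented for this example, not learned parameters) of how local features are summed into global features and combined into an unnormalized sequence score:

```python
import math

# Toy local features f_k(y_prev, y, X, i); real CRFs use many thousands.
local_features = [
    lambda y_prev, y, X, i: 1.0 if y == "VB" and X[i] == "back" else 0.0,
    lambda y_prev, y, X, i: 1.0 if y == "VB" and y_prev == "MD" else 0.0,
]
weights = [1.2, 0.7]  # one weight w_k per feature, normally learned

def global_features(X, Y):
    """F_k(X, Y) = sum over positions i of f_k(y_{i-1}, y_i, X, i)."""
    return [sum(f(Y[i - 1] if i > 0 else "<s>", Y[i], X, i)
                for i in range(len(X)))
            for f in local_features]

def unnormalized_score(X, Y):
    """exp(sum_k w_k F_k(X, Y)); dividing by Z(X), the sum of this quantity
    over all tag sequences, would give the probability p(Y | X)."""
    return math.exp(sum(w * F for w, F in zip(weights, global_features(X, Y))))

X = ["Janet", "will", "back", "the", "bill"]
print(unnormalized_score(X, ["NNP", "MD", "VB", "DT", "NN"]))
```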
These templates automatically populate the set of features from every instance in
the training and test set. Thus for our example Janet/NNP will/MD back/VB the/DT
bill/NN, when xi is the word back, the following features would be generated and
have the value 1 (we’ve assigned them arbitrary feature numbers):
f3743 : yi = VB and xi = back
f156 : yi = VB and yi−1 = MD
f99732 : yi = VB and xi−1 = will and xi+2 = bill
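As a sketch of how such templates are instantiated, the following (Python; the three templates shown are a small illustrative subset, and the string naming scheme is mine) generates string-named binary features at a position i:

```python
def instantiate_templates(X, Y, i):
    """Return the names of binary features that fire at position i, for a few
    illustrative templates like (y_i, x_i), (y_i, y_{i-1}), (y_i, x_{i-1}, x_{i+2})."""
    y_prev = Y[i - 1] if i > 0 else "<s>"
    x_prev = X[i - 1] if i > 0 else "<s>"
    x_next2 = X[i + 2] if i + 2 < len(X) else "</s>"
    return [
        f"y_i={Y[i]} & x_i={X[i]}",
        f"y_i={Y[i]} & y_i-1={y_prev}",
        f"y_i={Y[i]} & x_i-1={x_prev} & x_i+2={x_next2}",
    ]

X = ["Janet", "will", "back", "the", "bill"]
Y = ["NNP", "MD", "VB", "DT", "NN"]
print(instantiate_templates(X, Y, 2))
# ['y_i=VB & x_i=back', 'y_i=VB & y_i-1=MD', 'y_i=VB & x_i-1=will & x_i+2=bill']
```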
It’s also important to have features that help with unknown words. One of the
word shape most important is word shape features, which represent the abstract letter pattern
of the word by mapping lower-case letters to ‘x’, upper-case to ‘X’, numbers to
’d’, and retaining punctuation. Thus for example I.M.F. would map to X.X.X. and
DC10-30 would map to XXdd-dd. A second class of shorter word shape features is
also used. In these features consecutive character types are removed, so words in all
caps map to X, words with initial-caps map to Xx, DC10-30 would be mapped to
Xd-d but I.M.F would still map to X.X.X. Prefix and suffix features are also useful.
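Word shape is easy to compute. Here is a minimal sketch (Python; the function names are mine, not the chapter's) of the two shape features just described:

```python
import re

def word_shape(word):
    """Map lower-case letters to 'x', upper-case to 'X', digits to 'd',
    and keep punctuation: 'DC10-30' -> 'XXdd-dd', 'I.M.F.' -> 'X.X.X.'."""
    shape = re.sub(r"[a-z]", "x", word)
    shape = re.sub(r"[A-Z]", "X", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return shape

def short_word_shape(word):
    """Same mapping, but with consecutive runs of the same character type
    collapsed: 'DC10-30' -> 'Xd-d', 'well-dressed' -> 'x-x'."""
    return re.sub(r"(.)\1+", r"\1", word_shape(word))

print(word_shape("DC10-30"), short_word_shape("DC10-30"))            # XXdd-dd Xd-d
print(word_shape("well-dressed"), short_word_shape("well-dressed"))  # xxxx-xxxxxxx x-x
print(word_shape("I.M.F."), short_word_shape("I.M.F."))              # X.X.X. X.X.X.
```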
In summary, here are some sample feature templates that help with unknown words:
For example, the word well-dressed might generate the following non-zero val-
ued feature values:
prefix($x_i$) = w
prefix($x_i$) = we
suffix($x_i$) = ed
suffix($x_i$) = d
word-shape($x_i$) = xxxx-xxxxxxx
short-word-shape($x_i$) = x-x
2 Because in HMMs all computation is based on the two probabilities P(tag|tag) and P(word|tag), if
we want to include some source of knowledge into the tagging process, we must find a way to encode
the knowledge into one of these two probabilities. Each time we add a feature we have to do a lot of
complicated conditioning which gets harder and harder as we have more and more such features.
The known-word templates are computed for every word seen in the training
set; the unknown word features can also be computed for all words in training, or
only on training words whose frequency is below some threshold. The result of the
known-word templates and word-signature features is a very large set of features.
Generally a feature cutoff is used in which features are thrown out if they have count
< 5 in the training set.
Remember that in a CRF we don’t learn weights for each of these local features
fk . Instead, we first sum the values of each local feature (for example feature f3743 )
over the entire sentence, to create each global feature (for example F3743 ). It is those
global features that will then be multiplied by weight w3743 . Thus for training and
inference there is always a fixed set of K features with K weights, even though the
length of each sentence is different.
One feature that is especially useful for locations is a gazetteer, a list of place
names, often providing millions of entries for locations with detailed geographical
and political information.3 This can be implemented as a binary feature indicating a
phrase appears in the list. Other related resources like name-lists, for example from
the United States Census Bureau4 , can be used, as can other entity dictionaries like
lists of corporations or products, although they may not be as helpful as a gazetteer
(Mikheev et al., 1999).
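A gazetteer feature can be implemented as a simple membership test over phrases. Here is a minimal sketch (Python; the tiny gazetteer set is invented for illustration, and a real one from a resource like geonames.org would have millions of entries):

```python
GAZETTEER = {"new york city", "colorado", "washington"}

def in_gazetteer(tokens, i, max_len=3):
    """Binary feature: 1 if some phrase starting at position i
    (up to max_len tokens long) appears in the gazetteer, else 0."""
    for n in range(1, max_len + 1):
        phrase = " ".join(tokens[i:i + n]).lower()
        if phrase in GAZETTEER:
            return 1
    return 0

tokens = ["Blair", "arrived", "in", "Washington"]
print([in_gazetteer(tokens, i) for i in range(len(tokens))])  # [0, 0, 0, 1]
```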
The sample named entity token L’Occitane would generate the following non-
zero valued feature values (assuming that L’Occitane is neither in the gazetteer nor
the census).
3 www.geonames.org
4 www.census.gov
We can ignore the exp function and the denominator Z(X), as we do above, because
exp doesn’t change the argmax, and the denominator Z(X) is constant for a given
observation sequence X.
How should we decode to find this optimal tag sequence ŷ? Just as with HMMs,
we’ll turn to the Viterbi algorithm, which works because, like the HMM, the linear-
chain CRF depends at each timestep on only one previous output token yi−1 .
Concretely, this involves filling an N ×T array with the appropriate values, main-
taining backpointers as we proceed. As with HMM Viterbi, when the table is filled,
we simply follow pointers back from the maximum value in the final column to
retrieve the desired set of labels.
The requisite changes from HMM Viterbi have to do only with how we fill each
cell. Recall from Eq. 8.19 that the recursive step of the Viterbi equation computes
the Viterbi value of time $t$ for state $j$ as
$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t); \quad 1 \leq j \leq N,\ 1 < t \leq T \qquad (8.31)$$
which is the HMM implementation of
$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, P(s_j \mid s_i)\, P(o_t \mid s_j); \quad 1 \leq j \leq N,\ 1 < t \leq T \qquad (8.32)$$
The CRF requires only a slight change to this latter formula, replacing the a and b
prior and likelihood probabilities with the CRF features:
$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i) + \sum_{k=1}^{K} w_k f_k(y_{t-1}, y_t, X, t); \quad 1 \leq j \leq N,\ 1 < t \leq T \qquad (8.33)$$
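A sketch of this change follows (Python, reusing the shape of the earlier Viterbi sketch; the function `score(y_prev, y, X, t)`, standing in for the weighted feature sum in Eq. 8.33, is an assumption of this example). Note that values are added rather than multiplied, since the CRF works with weighted feature sums rather than probabilities.

```python
def crf_viterbi(X, tags, score):
    """Viterbi decoding for a linear-chain CRF. `score(y_prev, y, X, t)` should
    return sum_k w_k * f_k(y_prev, y, X, t)."""
    V = [{y: score("<s>", y, X, 0) for y in tags}]
    back = [{y: None for y in tags}]
    for t in range(1, len(X)):
        V.append({})
        back.append({})
        for y in tags:
            prev_best = max(tags, key=lambda yp: V[t - 1][yp] + score(yp, y, X, t))
            V[t][y] = V[t - 1][prev_best] + score(prev_best, y, X, t)
            back[t][y] = prev_best
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(X) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy usage: a score that only rewards an MD -> VB transition on "back"
toy_score = lambda yp, y, X, t: 1.0 if (yp, y) == ("MD", "VB") and X[t] == "back" else 0.0
print(crf_viterbi(["Janet", "will", "back"], ["NNP", "MD", "VB"], toy_score))
# ['NNP', 'MD', 'VB']
```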
presented are supervised, having labeled data is essential for training and testing. A
wide variety of datasets exist for part-of-speech tagging and/or NER. The Universal
Dependencies (UD) dataset (de Marneffe et al., 2021) has POS tagged corpora in
over a hundred languages, as do the Penn Treebanks in English, Chinese, and Arabic.
OntoNotes has corpora labeled for named entities in English, Chinese, and Arabic
(Hovy et al., 2006). Named entity tagged corpora are also available in particular
domains, such as for biomedical (Bada et al., 2012) and literary text (Bamman et al.,
2019).
guages need to label words with case and gender information. Tagsets for morpho-
logically rich languages are therefore sequences of morphological tags rather than a
single primitive tag. Here’s a Turkish example, in which the word izin has three pos-
sible morphological/part-of-speech tags and meanings (Hakkani-Tür et al., 2002):
1. Yerdeki izin temizlenmesi gerek. iz + Noun+A3sg+Pnon+Gen
The trace on the floor should be cleaned.
8.8 Summary
This chapter introduced parts of speech and named entities, and the tasks of part-
of-speech tagging and named entity recognition:
• Languages generally have a small set of closed class words that are highly
frequent, ambiguous, and act as function words, and open-class words like
nouns, verbs, adjectives. Various part-of-speech tagsets exist, of between 40
and 200 tags.
• Part-of-speech tagging is the process of assigning a part-of-speech label to
each of a sequence of words.
• Named entities are words for proper nouns referring mainly to people, places,
and organizations, but extended to many other types that aren’t strictly entities
or even proper nouns.
• Two common approaches to sequence modeling are a generative approach,
HMM tagging, and a discriminative approach, CRF tagging. We will see a
neural approach in following chapters.
• The probabilities in HMM taggers are estimated by maximum likelihood es-
timation on tag-labeled training corpora. The Viterbi algorithm is used for
decoding, finding the most likely tag sequence.
• Conditional Random Fields or CRF taggers train a log-linear model that can
choose the best tag sequence given an observation sequence, based on features
that condition on the output tag, the prior output tag, the entire input sequence,
and the current timestep. They use the Viterbi algorithm for inference, to
choose the best sequence of tags, and a version of the Forward-Backward
algorithm (see Appendix A) for training.
Exercises
8.1 Find one tagging error in each of the following sentences that are tagged with
the Penn Treebank tagset:
1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN
2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS
3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP
4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
8.2 Use the Penn Treebank tagset to tag each word in the following sentences
from Damon Runyon’s short stories. You may ignore punctuation. Some of
these are quite difficult; do your best.
1. It is a nice night.
2. This crap game is over a garage in Fifty-second Street. . .
3. . . . Nobody ever takes the newspapers she sells . . .
4. He is a tall, skinny guy with a long, sad, mean-looking kisser, and a
mournful voice.
Abney, S. P., R. E. Schapire, and Y. Singer. 1999. Boosting applied to tagging and PP attachment. EMNLP/VLC.
Bada, M., M. Eckert, D. Evans, K. Garcia, K. Shipley, D. Sitnikov, W. A. Baumgartner, K. B. Cohen, K. Verspoor, J. A. Blake, and L. E. Hunter. 2012. Concept annotation in the CRAFT corpus. BMC Bioinformatics, 13(1):161.
Bahl, L. R. and R. L. Mercer. 1976. Part of speech assignment by a statistical decision algorithm. Proceedings IEEE International Symposium on Information Theory.
Bamman, D., S. Popat, and S. Shen. 2019. An annotated dataset of literary entities. NAACL HLT.
Bikel, D. M., S. Miller, R. Schwartz, and R. Weischedel. 1997. Nymble: A high-performance learning name-finder. ANLP.
Brants, T. 2000. TnT: A statistical part-of-speech tagger. ANLP.
Broschart, J. 1997. Why Tongan does it differently. Linguistic Typology, 1:123–165.
Charniak, E., C. Hendrickson, N. Jacobson, and M. Perkowitz. 1993. Equations for part-of-speech tagging. AAAI.
Chiticariu, L., M. Danilevsky, Y. Li, F. Reiss, and H. Zhu. 2018. SystemT: Declarative text understanding for enterprise. NAACL HLT, volume 3.
Chiticariu, L., Y. Li, and F. R. Reiss. 2013. Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems! EMNLP.
Christodoulopoulos, C., S. Goldwater, and M. Steedman. 2010. Two decades of unsupervised POS induction: How far have we come? EMNLP.
Church, K. W. 1988. A stochastic parts program and noun phrase parser for unrestricted text. ANLP.
Church, K. W. 1989. A stochastic parts program and noun phrase parser for unrestricted text. ICASSP.
Clark, S., J. R. Curran, and M. Osborne. 2003. Bootstrapping POS-taggers using unlabelled data. CoNLL.
Collobert, R., J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing (almost) from scratch. JMLR, 12:2493–2537.
DeRose, S. J. 1988. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14:31–39.
Evans, N. 2000. Word classes in the world's languages. In G. Booij, C. Lehmann, and J. Mugdan, editors, Morphology: A Handbook on Inflection and Word Formation, pages 708–732. Mouton.
Francis, W. N. and H. Kučera. 1982. Frequency Analysis of English Usage. Houghton Mifflin, Boston.
Garside, R. 1987. The CLAWS word-tagging system. In R. Garside, G. Leech, and G. Sampson, editors, The Computational Analysis of English, pages 30–41. Longman.
Garside, R., G. Leech, and A. McEnery. 1997. Corpus Annotation. Longman.
Gil, D. 2000. Syntactic categories, cross-linguistic variation and universal grammar. In P. M. Vogel and B. Comrie, editors, Approaches to the Typology of Word Classes, pages 173–216. Mouton.
Greene, B. B. and G. M. Rubin. 1971. Automatic grammatical tagging of English. Department of Linguistics, Brown University, Providence, Rhode Island.
Hajič, J. 2000. Morphological tagging: Data vs. dictionaries. NAACL.
Hakkani-Tür, D., K. Oflazer, and G. Tür. 2002. Statistical morphological disambiguation for agglutinative languages. Journal of Computers and Humanities, 36(4):381–410.
Harris, Z. S. 1962. String Analysis of Sentence Structure. Mouton, The Hague.
Householder, F. W. 1995. Dionysius Thrax, the technai, and Sextus Empiricus. In E. F. K. Koerner and R. E. Asher, editors, Concise History of the Language Sciences, pages 99–103. Elsevier Science.
Hovy, E. H., M. P. Marcus, M. Palmer, L. A. Ramshaw, and R. Weischedel. 2006. OntoNotes: The 90% solution. HLT-NAACL.
Huang, Z., W. Xu, and K. Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
Joshi, A. K. and P. Hopely. 1999. A parser from antiquity. In A. Kornai, editor, Extended Finite State Models of Language, pages 6–15. Cambridge University Press.
Karlsson, F., A. Voutilainen, J. Heikkilä, and A. Anttila, editors. 1995. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter.
Karttunen, L. 1999. Comments on Joshi. In A. Kornai, editor, Extended Finite State Models of Language, pages 16–18. Cambridge University Press.
Klein, S. and R. F. Simmons. 1963. A computational approach to grammatical coding of English words. Journal of the ACM, 10(3):334–347.
Kupiec, J. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225–242.
Lafferty, J. D., A. McCallum, and F. C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML.
Lample, G., M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer. 2016. Neural architectures for named entity recognition. NAACL HLT.
Lee, H., M. Surdeanu, and D. Jurafsky. 2017. A scaffolding approach to coreference resolution integrating statistical and rule-based models. Natural Language Engineering, 23(5):733–762.
Ma, X. and E. H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. ACL.
Manning, C. D. 2011. Part-of-speech tagging from 97% to 100%: Is it time for some linguistics? CICLing 2011.
Marcus, M. P., B. Santorini, and M. A. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330.
de Marneffe, M.-C., C. D. Manning, J. Nivre, and D. Zeman. 2021. Universal Dependencies. Computational Linguistics, 47(2):255–308.
Marshall, I. 1983. Choice of grammatical word-class without global syntactic analysis: Tagging words in the LOB corpus. Computers and the Humanities, 17:139–150.