Regular Expressions, Text Normalization, Edit Distance
Regular Expressions, Text Normalization, Edit Distance
Copyright
c 2019. All
rights reserved. Draft of October 2, 2019.
CHAPTER
Some languages, like Japanese, don’t have spaces between words, so word tokeniza-
tion becomes more difficult.
lemmatization Another part of text normalization is lemmatization, the task of determining
that two words have the same root, despite their surface differences. For example,
the words sang, sung, and sings are forms of the verb sing. The word sing is the
common lemma of these words, and a lemmatizer maps from all of these to sing.
Lemmatization is essential for processing morphologically complex languages like
stemming Arabic. Stemming refers to a simpler version of lemmatization in which we mainly
just strip suffixes from the end of the word. Text normalization also includes sen-
sentence
segmentation tence segmentation: breaking up a text into individual sentences, using cues like
periods or exclamation points.
Finally, we’ll need to compare words and other strings. We’ll introduce a metric
called edit distance that measures how similar two strings are based on the number
of edits (insertions, deletions, substitutions) it takes to change one string into the
other. Edit distance is an algorithm with applications throughout language process-
ing, from spelling correction to speech recognition to coreference resolution.
problem with the use of the square braces [ and ]. The string of characters inside the
braces specifies a disjunction of characters to match. For example, Fig. 2.2 shows
that the pattern /[wW]/ matches patterns containing either w or W.
The regular expression /[1234567890]/ specified any single digit. While such
classes of characters as digits or letters are important building blocks in expressions,
they can get awkward (e.g., it’s inconvenient to specify
/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/
to mean “any capital letter”). In cases where there is a well-defined sequence asso-
ciated with a set of characters, the brackets can be used with the dash (-) to specify
range any one character in a range. The pattern /[2-5]/ specifies any one of the charac-
ters 2, 3, 4, or 5. The pattern /[b-g]/ specifies one of the characters b, c, d, e, f, or
g. Some other examples are shown in Fig. 2.3.
The square braces can also be used to specify what a single character cannot be,
by use of the caret ˆ. If the caret ˆ is the first symbol after the open square brace [,
the resulting pattern is negated. For example, the pattern /[ˆa]/ matches any single
character (including special characters) except a. This is only true when the caret
is the first symbol after the open square brace. If it occurs anywhere else, it usually
stands for a caret; Fig. 2.4 shows some examples.
How can we talk about optional elements, like an optional s in woodchuck and
woodchucks? We can’t use the square brackets, because while they allow us to say
“s or S”, they don’t allow us to say “s or nothing”. For this we use the question mark
/?/, which means “the preceding character or nothing”, as shown in Fig. 2.5.
4 C HAPTER 2 • R EGULAR E XPRESSIONS , T EXT N ORMALIZATION , E DIT D ISTANCE
We can think of the question mark as meaning “zero or one instances of the
previous character”. That is, it’s a way of specifying how many of something that
we want, something that is very important in regular expressions. For example,
consider the language of certain sheep, which consists of strings that look like the
following:
baa!
baaa!
baaaa!
baaaaa!
...
This language consists of strings with a b, followed by at least two a’s, followed
by an exclamation point. The set of operators that allows us to say things like “some
Kleene * number of as” are based on the asterisk or *, commonly called the Kleene * (gen-
erally pronounced “cleany star”). The Kleene star means “zero or more occurrences
of the immediately previous character or regular expression”. So /a*/ means “any
string of zero or more as”. This will match a or aaaaaa, but it will also match Off
Minor since the string Off Minor has zero a’s. So the regular expression for matching
one or more a is /aa*/, meaning one a followed by zero or more as. More complex
patterns can also be repeated. So /[ab]*/ means “zero or more a’s or b’s” (not
“zero or more right square braces”). This will match strings like aaaa or ababab or
bbbb.
For specifying multiple digits (useful for finding prices) we can extend /[0-9]/,
the regular expression for a single digit. An integer (a string of digits) is thus
/[0-9][0-9]*/. (Why isn’t it just /[0-9]*/?)
Sometimes it’s annoying to have to write the regular expression for digits twice,
so there is a shorter way to specify “at least one” of some character. This is the
Kleene + Kleene +, which means “one or more occurrences of the immediately preceding
character or regular expression”. Thus, the expression /[0-9]+/ is the normal way
to specify “a sequence of digits”. There are thus two ways to specify the sheep
language: /baaa*!/ or /baa+!/.
One very important special character is the period (/./), a wildcard expression
that matches any single character (except a carriage return), as shown in Fig. 2.6.
The wildcard is often used together with the Kleene star to mean “any string of
characters”. For example, suppose we want to find any line in which a particular
word, for example, aardvark, appears twice. We can specify this with the regular
expression /aardvark.*aardvark/.
Anchors Anchors are special characters that anchor regular expressions to particular places
in a string. The most common anchors are the caret ˆ and the dollar sign $. The caret
ˆ matches the start of a line. The pattern /ˆThe/ matches the word The only at the
2.1 • R EGULAR E XPRESSIONS 5
start of a line. Thus, the caret ˆ has three uses: to match the start of a line, to in-
dicate a negation inside of square brackets, and just to mean a caret. (What are the
contexts that allow grep or Python to know which function a given caret is supposed
to have?) The dollar sign $ matches the end of a line. So the pattern $ is a useful
pattern for matching a space at the end of a line, and /ˆThe dog\.$/ matches a
line that contains only the phrase The dog. (We have to use the backslash here since
we want the . to mean “period” and not the wildcard.)
There are also two other anchors: \b matches a word boundary, and \B matches
a non-boundary. Thus, /\bthe\b/ matches the word the but not the word other.
More technically, a “word” for the purposes of a regular expression is defined as any
sequence of digits, underscores, or letters; this is based on the definition of “words”
in programming languages. For example, /\b99\b/ will match the string 99 in
There are 99 bottles of beer on the wall (because 99 follows a space) but not 99 in
There are 299 bottles of beer on the wall (since 99 follows a number). But it will
match 99 in $99 (since 99 follows a dollar sign ($), which is not a digit, underscore,
or letter).
Parenthesis ()
Counters * + ? {}
Sequences and anchors the ˆmy end$
Disjunction |
errors comes up again and again in implementing speech and language processing
systems. Reducing the overall error rate for an application thus involves two antag-
onistic efforts:
• Increasing precision (minimizing false positives)
• Increasing recall (minimizing false negatives)
RE Match
* zero or more occurrences of the previous char or expression
+ one or more occurrences of the previous char or expression
? exactly zero or one occurrence of the previous char or expression
{n} n occurrences of the previous char or expression
{n,m} from n to m occurrences of the previous char or expression
{n,} at least n occurrences of the previous char or expression
{,m} up to m occurrences of the previous char or expression
Figure 2.8 Regular expression operators for counting.
Finally, certain special characters are referred to by special notation based on the
Newline backslash (\) (see Fig. 2.9). The most common of these are the newline character
\n and the tab character \t. To refer to characters that are special themselves (like
., *, [, and \), precede them with a backslash, (i.e., /\./, /\*/, /\[/, and /\\/).
s/([0-9]+)/<\1>/
The parenthesis and number operators can also specify that a certain string or
expression must occur twice in the text. For example, suppose we are looking for
the pattern “the Xer they were, the Xer they will be”, where we want to constrain
the two X’s to be the same string. We do this by surrounding the first X with the
parenthesis operator, and replacing the second X with the number operator \1, as
follows:
/the (.*)er they were, the \1er they will be/
Here the \1 will be replaced by whatever string matched the first item in paren-
theses. So this will match the bigger they were, the bigger they will be but not the
bigger they were, the faster they will be.
capture group This use of parentheses to store a pattern in memory is called a capture group.
Every time a capture group is used (i.e., parentheses surround a pattern), the re-
register sulting match is stored in a numbered register. If you match two different sets of
parentheses, \2 means whatever matched the second capture group. Thus
/the (.*)er they (.*), the \1er we \2/
will match the faster they ran, the faster we ran but not the faster they ran, the faster
we ate. Similarly, the third capture group is stored in \3, the fourth is \4, and so on.
Parentheses thus have a double function in regular expressions; they are used to
group terms for specifying the order in which operators should apply, and they are
used to capture something in a register. Occasionally we might want to use parenthe-
ses for grouping, but don’t want to capture the resulting pattern in a register. In that
non-capturing
group case we use a non-capturing group, which is specified by putting the commands
?: after the open paren, in the form (?: pattern ).
/(?:some|a few) (people|cats) like some \1/
will match some cats like some cats but not some cats like some a few.
Substitutions and capture groups are very useful in implementing simple chat-
bots like ELIZA (Weizenbaum, 1966). Recall that ELIZA simulates a Rogerian
psychologist by carrying on conversations like the following:
Since multiple substitutions can apply to a given input, substitutions are assigned
a rank and applied in order. Creating patterns is the topic of Exercise 2.3, and we
return to the details of the ELIZA architecture in Chapter 26.
2.2 Words
Before we talk about processing words, we need to decide what counts as a word.
corpus Let’s start by looking at one particular corpus (plural corpora), a computer-readable
corpora collection of text or speech. For example the Brown corpus is a million-word col-
lection of samples from 500 written English texts from different genres (newspa-
per, fiction, non-fiction, academic, etc.), assembled at Brown University in 1963–64
(Kučera and Francis, 1967). How many words are in the following Brown sentence?
He stepped out into the hall, was delighted to encounter a water brother.
This sentence has 13 words if we don’t count punctuation marks as words, 15
if we count punctuation. Whether we treat period (“.”), comma (“,”), and so on as
words depends on the task. Punctuation is critical for finding boundaries of things
(commas, periods, colons) and for identifying some aspects of meaning (question
marks, exclamation marks, quotation marks). For some tasks, like part-of-speech
tagging or parsing or speech synthesis, we sometimes treat punctuation marks as if
they were separate words.
The Switchboard corpus of American English telephone conversations between
strangers was collected in the early 1990s; it contains 2430 conversations averaging
6 minutes each, totaling 240 hours of speech and about 3 million words (Godfrey
et al., 1992). Such corpora of spoken language don’t have punctuation but do intro-
duce other complications with regard to defining words. Let’s look at one utterance
utterance from Switchboard; an utterance is the spoken correlate of a sentence:
I do uh main- mainly business data processing
disfluency This utterance has two kinds of disfluencies. The broken-off word main- is
fragment called a fragment. Words like uh and um are called fillers or filled pauses. Should
filled pause we consider these to be words? Again, it depends on the application. If we are
building a speech transcription system, we might want to eventually strip out the
disfluencies.
2.2 • W ORDS 11
Fig. 2.10 shows the rough numbers of types and tokens computed from some
popular English corpora. The larger the corpora we look at, the more word types
we find, and in fact this relationship between the number of types |V | and number
Herdan’s Law of tokens N is called Herdan’s Law (Herdan, 1960) or Heaps’ Law (Heaps, 1978)
Heaps’ Law after its discoverers (in linguistics and information retrieval respectively). It is shown
in Eq. 2.1, where k and β are positive constants, and 0 < β < 1.
|V | = kN β (2.1)
The value of β depends on the corpus size and the genre, but at least for the
large corpora in Fig. 2.10, β ranges from .67 to .75. Roughly then we can say that
the vocabulary size for a text goes up significantly faster than the square root of its
length in words.
Another measure of the number of words in the language is the number of lem-
mas instead of wordform types. Dictionaries can help in giving lemma counts; dic-
tionary entries or boldface forms are a very rough upper bound on the number of
12 C HAPTER 2 • R EGULAR E XPRESSIONS , T EXT N ORMALIZATION , E DIT D ISTANCE
lemmas (since some lemmas have multiple boldface forms). The 1989 edition of the
Oxford English Dictionary had 615,000 entries.
2.3 Corpora
Words don’t appear out of nowhere. Any particular piece of text that we study
is produced by one or more specific speakers or writers, in a specific dialect of a
specific language, at a specific time, in a specific place, for a specific function.
Perhaps the most important dimension of variation is the language. NLP algo-
rithms are most useful when they apply across many languages. The world has 7097
languages at the time of this writing, according to the online Ethnologue catalog
(Simons and Fennig, 2018). Most NLP tools tend to be developed for the official
languages of large industrialized nations (Chinese, English, Spanish, Arabic, etc.),
but we don’t want to limit tools to just these few languages. Furthermore, most lan-
guages also have multiple varieties, such as dialects spoken in different regions or
by different social groups. Thus, for example, if we’re processing text in African
AAVE American Vernacular English (AAVE), a dialect spoken by millions of people in the
United States, it’s important to make use of NLP tools that function with that dialect.
Twitter posts written in AAVE make use of constructions like iont (I don’t in Stan-
SAE dard American English (SAE)), or talmbout corresponding to SAE talking about,
both examples that influence word segmentation (Blodgett et al. 2016, Jones 2015).
It’s also quite common for speakers or writers to use multiple languages in a
code switching single communicative act, a phenomenon called code switching. Code switch-
ing is enormously common across the world; here are examples showing Spanish
and (transliterated) Hindi code switching with English (Solorio et al. 2014, Jurgens
et al. 2017):
(2.2) Por primera vez veo a @username actually being hateful! it was beautiful:)
[For the first time I get to see @username actually being hateful! it was
beautiful:) ]
(2.3) dost tha or ra- hega ... dont wory ... but dherya rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
Another dimension of variation is the genre. The text that our algorithms must
process might come from newswire, fiction or non-fiction books, scientific articles,
Wikipedia, or religious texts. It might come from spoken genres like telephone
conversations, business meetings, police body-worn cameras, medical interviews,
or transcripts of television shows or movies. It might come from work situations
like doctors’ notes, legal text, or parliamentary or congressional proceedings.
Text also reflects the demographic characteristics of the writer (or speaker): their
age, gender, race, socioeconomic class can all influence the linguistic properties of
the text we are processing.
And finally, time matters too. Language changes over time, and for some lan-
guages we have good corpora of texts from different historical periods.
Because language is so situated, when developing computational models for lan-
guage processing, it’s important to consider who produced the language, in what
context, for what purpose, and make sure that the models are fit to the data.
2.4 • T EXT N ORMALIZATION 13
3 Abbot
...
Alternatively, we can collapse all the upper case to lower case:
tr -sc ’A-Za-z’ ’\n’ < sh.txt | tr A-Z a-z | sort | uniq -c
whose output is
14725 a
97 aaron
1 abaissiez
10 abandon
2 abandoned
2 abase
1 abash
14 abate
3 abated
3 abatement
...
Now we can sort again to find the frequent words. The -n option to sort means
to sort numerically rather than alphabetically, and the -r option means to sort in
reverse order (highest-to-lowest):
tr -sc ’A-Za-z’ ’\n’ < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r
The results show that the most frequent words in Shakespeare, as in any other
corpus, are the short function words like articles, pronouns, prepositions:
27378 the
26084 and
22538 i
19771 to
17481 of
14725 a
13826 you
12489 my
11318 that
11112 in
...
Unix tools of this sort can be very handy in building quick word count statistics
for any corpus.
($45.55) and dates (01/02/06); we don’t want to segment that price into separate to-
kens of “45” and “55”. And there are URLs (http://www.stanford.edu), Twitter
hashtags (#nlproc), or email addresses ([email protected]).
Number expressions introduce other complications as well; while commas nor-
mally appear at word boundaries, commas are used inside numbers in English, every
three digits: 555,500.50. Languages, and hence tokenization requirements, differ
on this; many continental European languages like Spanish, French, and German, by
contrast, use a comma to mark the decimal point, and spaces (or sometimes periods)
where English puts commas, for example, 555 500,50.
clitic A tokenizer can also be used to expand clitic contractions that are marked by
apostrophes, for example, converting what’re to the two tokens what are, and
we’re to we are. A clitic is a part of a word that can’t stand on its own, and can only
occur when it is attached to another word. Some such contractions occur in other
alphabetic languages, including articles and pronouns in French (j’ai, l’homme).
Depending on the application, tokenization algorithms may also tokenize mul-
tiword expressions like New York or rock ’n’ roll as a single token, which re-
quires a multiword expression dictionary of some sort. Tokenization is thus inti-
mately tied up with named entity detection, the task of detecting names, dates, and
organizations (Chapter 18).
One commonly used tokenization standard is known as the Penn Treebank to-
Penn Treebank kenization standard, used for the parsed corpora (treebanks) released by the Lin-
tokenization
guistic Data Consortium (LDC), the source of many useful datasets. This standard
separates out clitics (doesn’t becomes does plus n’t), keeps hyphenated words to-
gether, and separates out all punctuation (to save space we’re showing visible spaces
‘ ’ between tokens, although newlines is a more common output):
Input: "The San Francisco-based restaurant," they said,
"doesn’t charge $10".
Output: " The San Francisco-based restaurant , " they said ,
" does n’t charge $ 10 " .
In practice, since tokenization needs to be run before any other language pro-
cessing, it needs to be very fast. The standard method for tokenization is therefore
to use deterministic algorithms based on regular expressions compiled into very ef-
ficient finite state automata. For example, Fig. 2.11 shows an example of a basic
regular expression that can be used to tokenize with the nltk.regexp tokenize
function of the Python-based Natural Language Toolkit (NLTK) (Bird et al. 2009;
http://www.nltk.org).
Carefully designed deterministic algorithms can deal with the ambiguities that
arise, such as the fact that the apostrophe needs to be tokenized differently when used
as a genitive marker (as in the book’s cover), a quotative as in ‘The other class’, she
said, or in clitics like they’re.
Word tokenization is more complex in languages like written Chinese, Japanese,
and Thai, which do not use spaces to mark potential word-boundaries.
hanzi In Chinese, for example, words are composed of characters (called hanzi in
Chinese). Each character generally represents a single unit of meaning (called a
morpheme) and is pronounceable as a single syllable. Words are about 2.4 charac-
ters long on average. But deciding what counts as a word in Chinese is complex.
For example, consider the following sentence:
(2.4) 姚明进入总决赛
“Yao Ming reaches the finals”
16 C HAPTER 2 • R EGULAR E XPRESSIONS , T EXT N ORMALIZATION , E DIT D ISTANCE
As Chen et al. (2017) point out, this could be treated as 3 words (‘Chinese Treebank’
segmentation):
(2.5) 姚明 进入 总决赛
YaoMing reaches finals
or as 5 words (‘Peking University’ segmentation):
(2.6) 姚 明 进入 总 决赛
Yao Ming reaches overall finals
Finally, it is possible in Chinese simply to ignore words altogether and use characters
as the basic elements, treating the sentence as a series of 7 characters:
(2.7) 姚 明 进 入 总 决 赛
Yao Ming enter enter overall decision game
In fact, for most Chinese NLP tasks it turns out to work better to take characters
rather than words as input, since characters are at a reasonable semantic level for
most applications, and since most word standards result in a huge vocabulary with
large numbers of very rare words (Li et al., 2019).
However, for Japanese and Thai the character is too small a unit, and so algo-
word
segmentation rithms for word segmentation are required. These can also be useful for Chinese
in the rare situations where word rather than character boundaries are required. The
standard segmentation algorithms for these languages use neural sequence mod-
els trained via supervised machine learning on hand-segmented training sets; we’ll
introduce sequence models in Chapter 8.
2018). BPE and wordpiece both assume that we already have some initial tokeniza-
tion of words (such as by spaces, or from some initial dictionary) and so we never
tried to induce word parts across spaces. By contrast, the SentencePiece model
works from raw text; even whitespace is handled as a normal symbol. Thus it doesn’t
need an initial tokenization or word-list, and can be used in languages like Chinese
or Japanese that don’t have spaces.
CHAPTER
Part-of-Speech Tagging
8 Dionysius Thrax of Alexandria (c. 100 B . C .), or perhaps someone else (it was a long
time ago), wrote a grammatical sketch of Greek (a “technē”) that summarized the
linguistic knowledge of his day. This work is the source of an astonishing proportion
of modern linguistic vocabulary, including words like syntax, diphthong, clitic, and
parts of speech analogy. Also included are a description of eight parts of speech: noun, verb,
pronoun, preposition, adverb, conjunction, participle, and article. Although earlier
scholars (including Aristotle as well as the Stoics) had their own lists of parts of
speech, it was Thrax’s set of eight that became the basis for practically all subsequent
part-of-speech descriptions of most European languages for the next 2000 years.
Schoolhouse Rock was a series of popular animated educational television clips
from the 1970s. Its Grammar Rock sequence included songs about exactly 8 parts
of speech, including the late great Bob Dorough’s Conjunction Junction:
Conjunction Junction, what’s your function?
Hooking up words and phrases and clauses...
Although the list of 8 was slightly modified from Thrax’s original, the astonishing
durability of the parts of speech through two millennia is an indicator of both the
importance and the transparency of their role in human language.1
POS Parts of speech (also known as POS, word classes, or syntactic categories) are
useful because they reveal a lot about a word and its neighbors. Knowing whether
a word is a noun or a verb tells us about likely neighboring words (nouns are pre-
ceded by determiners and adjectives, verbs by nouns) and syntactic structure (nouns
are generally part of noun phrases), making part-of-speech tagging a key aspect of
parsing (Chapter 13). Parts of speech are useful features for labeling named entities
like people or organizations in information extraction (Chapter 18), or for corefer-
ence resolution (Chapter 22). A word’s part of speech can even play a role in speech
recognition or synthesis, e.g., the word content is pronounced CONtent when it is a
noun and conTENT when it is an adjective.
This chapter introduces parts of speech, and then introduces two algorithms for
part-of-speech tagging, the task of assigning parts of speech to words. One is
generative— Hidden Markov Model (HMM)—and one is discriminative—the Max-
imum Entropy Markov Model (MEMM). Chapter 9 then introduces a third algorithm
based on the recurrent neural network (RNN). All three have roughly equal perfor-
mance but, as we’ll see, have different tradeoffs.
properties and nouns people— parts of speech are traditionally defined instead based
on syntactic and morphological function, grouping words that have similar neighbor-
ing words (their distributional properties) or take similar affixes (their morpholog-
ical properties).
closed class Parts of speech can be divided into two broad supercategories: closed class types
open class and open class types. Closed classes are those with relatively fixed membership,
such as prepositions—new prepositions are rarely coined. By contrast, nouns and
verbs are open classes—new nouns and verbs like iPhone or to fax are continually
being created or borrowed. Any given speaker or corpus may have different open
class words, but all speakers of a language, and sufficiently large corpora, likely
function word share the set of closed class words. Closed class words are generally function words
like of, it, and, or you, which tend to be very short, occur frequently, and often have
structuring uses in grammar.
Four major open classes occur in the languages of the world: nouns, verbs,
adjectives, and adverbs. English has all four, although not every language does.
noun The syntactic class noun includes the words for most people, places, or things, but
others as well. Nouns include concrete terms like ship and chair, abstractions like
bandwidth and relationship, and verb-like terms like pacing as in His pacing to and
fro became quite annoying. What defines a noun in English, then, are things like its
ability to occur with determiners (a goat, its bandwidth, Plato’s Republic), to take
possessives (IBM’s annual revenue), and for most but not all nouns to occur in the
plural form (goats, abaci).
proper noun Open class nouns fall into two classes. Proper nouns, like Regina, Colorado,
and IBM, are names of specific persons or entities. In English, they generally aren’t
preceded by articles (e.g., the book is upstairs, but Regina is upstairs). In written
common noun English, proper nouns are usually capitalized. The other class, common nouns, are
count noun divided in many languages, including English, into count nouns and mass nouns.
mass noun Count nouns allow grammatical enumeration, occurring in both the singular and plu-
ral (goat/goats, relationship/relationships) and they can be counted (one goat, two
goats). Mass nouns are used when something is conceptualized as a homogeneous
group. So words like snow, salt, and communism are not counted (i.e., *two snows
or *two communisms). Mass nouns can also appear without articles where singular
count nouns cannot (Snow is white but not *Goat is white).
verb Verbs refer to actions and processes, including main verbs like draw, provide,
and go. English verbs have inflections (non-third-person-sg (eat), third-person-sg
(eats), progressive (eating), past participle (eaten)). While many researchers believe
that all human languages have the categories of noun and verb, others have argued
that some languages, such as Riau Indonesian and Tongan, don’t even make this
distinction (Broschart 1997; Evans 2000; Gil 2000) .
adjective The third open class English form is adjectives, a class that includes many terms
for properties or qualities. Most languages have adjectives for the concepts of color
(white, black), age (old, young), and value (good, bad), but there are languages
without adjectives. In Korean, for example, the words corresponding to English
adjectives act as a subclass of verbs, so what is in English an adjective “beautiful”
acts in Korean like a verb meaning “to be beautiful”.
adverb The final open class form, adverbs, is rather a hodge-podge in both form and
meaning. In the following all the italicized words are adverbs:
Actually, I ran home extremely quickly yesterday
What coherence the class has semantically may be solely that each of these
words can be viewed as modifying something (often verbs, hence the name “ad-
8.1 • (M OSTLY ) E NGLISH W ORD C LASSES 3
verb”, but also other adverbs and entire verb phrases). Directional adverbs or loca-
locative tive adverbs (home, here, downhill) specify the direction or location of some action;
degree degree adverbs (extremely, very, somewhat) specify the extent of some action, pro-
manner cess, or property; manner adverbs (slowly, slinkily, delicately) describe the manner
temporal of some action or process; and temporal adverbs describe the time that some ac-
tion or event took place (yesterday, Monday). Because of the heterogeneous nature
of this class, some adverbs (e.g., temporal adverbs like Monday) are tagged in some
tagging schemes as nouns.
The closed classes differ more from language to language than do the open
classes. Some of the important closed classes in English include:
prepositions: on, under, over, near, by, at, from, to, with
particles: up, down, on, off, in, out, at, by
determiners: a, an, the
conjunctions: and, but, or, as, if, when
pronouns: she, who, I, others
auxiliary verbs: can, may, should, are
numerals: one, two, three, first, second, third
preposition Prepositions occur before noun phrases. Semantically they often indicate spatial
or temporal relations, whether literal (on it, before then, by the house) or metaphor-
ical (on time, with gusto, beside herself), but often indicate other relations as well,
particle like marking the agent in Hamlet was written by Shakespeare. A particle resembles
a preposition or an adverb and is used in combination with a verb. Particles often
have extended meanings that aren’t quite the same as the prepositions they resemble,
as in the particle over in she turned the paper over.
A verb and a particle that act as a single syntactic and/or semantic unit are
phrasal verb called a phrasal verb. The meaning of phrasal verbs is often problematically non-
compositional—not predictable from the distinct meanings of the verb and the par-
ticle. Thus, turn down means something like ‘reject’, rule out ‘eliminate’, find out
‘discover’, and go on ‘continue’.
A closed class that occurs with nouns, often marking the beginning of a noun
determiner phrase, is the determiner. One small subtype of determiners is the article: English
article has three articles: a, an, and the. Other determiners include this and that (this chap-
ter, that page). A and an mark a noun phrase as indefinite, while the can mark it
as definite; definiteness is a discourse property (Chapter 23). Articles are quite fre-
quent in English; indeed, the is the most frequently occurring word in most corpora
of written English, and a and an are generally right behind.
conjunctions Conjunctions join two phrases, clauses, or sentences. Coordinating conjunc-
tions like and, or, and but join two elements of equal status. Subordinating conjunc-
tions are used when one of the elements has some embedded status. For example,
that in “I thought that you might like some milk” is a subordinating conjunction
that links the main clause I thought with the subordinate clause you might like some
milk. This clause is called subordinate because this entire clause is the “content” of
the main verb thought. Subordinating conjunctions like that which link a verb to its
complementizer argument in this way are also called complementizers.
pronoun Pronouns are forms that often act as a kind of shorthand for referring to some
personal noun phrase or entity or event. Personal pronouns refer to persons or entities (you,
possessive she, I, it, me, etc.). Possessive pronouns are forms of personal pronouns that in-
dicate either actual possession or more often just an abstract relation between the
wh person and some object (my, your, his, her, its, one’s, our, their). Wh-pronouns
(what, who, whom, whoever) are used in certain question forms, or may also act as
4 C HAPTER 8 • PART- OF -S PEECH TAGGING