Speech and Language Processing
An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition
Daniel Jurafsky
Stanford University
James H. Martin
University of Colorado at Boulder
Contents
1 Introduction
5 Logistic Regression
5.1 Classification: the sigmoid
5.2 Learning in Logistic Regression
5.3 The cross-entropy loss function
5.4 Gradient Descent
5.5 Regularization
5.6 Multinomial logistic regression
5.7 Interpreting models
5.8 Advanced: Deriving the Gradient Equation
5.9 Summary
25 Phonetics
25.1 Speech Sounds and Phonetic Transcription
Bibliography
Subject Index
CHAPTER
1 Introduction
La dernière chose qu’on trouve en faisant un ouvrage est de savoir celle qu’il faut
mettre la première.
[The last thing you figure out in writing a book is what to put first.]
Pascal
CHAPTER
2 Regular Expressions, Text Normalization, Edit Distance
Some languages, like Japanese, don’t have spaces between words, so word tokeniza-
tion becomes more difficult.
Another part of text normalization is lemmatization, the task of determining
that two words have the same root, despite their surface differences. For example,
the words sang, sung, and sings are forms of the verb sing. The word sing is the
common lemma of these words, and a lemmatizer maps from all of these to sing.
Lemmatization is essential for processing morphologically complex languages like
Arabic. Stemming refers to a simpler version of lemmatization in which we mainly
just strip suffixes from the end of the word. Text normalization also includes
sentence segmentation: breaking up a text into individual sentences, using cues like
periods or exclamation points.
Finally, we’ll need to compare words and other strings. We’ll introduce a metric
called edit distance that measures how similar two strings are based on the number
of edits (insertions, deletions, substitutions) it takes to change one string into the
other. Edit distance is an algorithm with applications throughout language process-
ing, from spelling correction to speech recognition to coreference resolution.
problem with the use of the square braces [ and ]. The string of characters inside the
braces specifies a disjunction of characters to match. For example, Fig. 2.2 shows
that the pattern /[wW]/ matches patterns containing either w or W.
The regular expression /[1234567890]/ specifies any single digit. While such
classes of characters as digits or letters are important building blocks in expressions,
they can get awkward (e.g., it’s inconvenient to specify
/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/
to mean “any capital letter”). In cases where there is a well-defined sequence
associated with a set of characters, the brackets can be used with the dash (-) to
specify any one character in a range. The pattern /[2-5]/ specifies any one of the
characters 2, 3, 4, or 5. The pattern /[b-g]/ specifies one of the characters b, c, d, e, f, or
g. Some other examples are shown in Fig. 2.3.
The square braces can also be used to specify what a single character cannot be,
by use of the caret ^. If the caret ^ is the first symbol after the open square brace [,
the resulting pattern is negated. For example, the pattern /[^a]/ matches any single
character (including special characters) except a. This is only true when the caret
is the first symbol after the open square brace. If it occurs anywhere else, it usually
stands for a caret; Fig. 2.4 shows some examples.
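The bracket notation can be tried out directly. The following is a minimal sketch using Python's re module; the test strings are our own illustrations, not examples from the figures.

```python
import re

# Disjunction: [wW] matches either a lowercase or an uppercase w.
w_matches = re.findall(r"[wW]", "Woodchuck wood")    # ['W', 'w']

# Range: [2-5] matches any one of the characters 2, 3, 4, or 5.
digit_matches = re.findall(r"[2-5]", "1234567")      # ['2', '3', '4', '5']

# Negation: a caret right after [ matches any single character not listed.
not_a = re.findall(r"[^a]", "banana")                # ['b', 'n', 'n']

# A caret anywhere else inside the brackets is just a caret.
e_or_caret = re.findall(r"[e^]", "e^x")              # ['e', '^']
```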
How can we talk about optional elements, like an optional s in woodchuck and
woodchucks? We can’t use the square brackets, because while they allow us to say
“s or S”, they don’t allow us to say “s or nothing”. For this we use the question mark
/?/, which means “the preceding character or nothing”, as shown in Fig. 2.5.
We can think of the question mark as meaning “zero or one instances of the
previous character”. That is, it’s a way of specifying how many of something that
we want, something that is very important in regular expressions. For example,
consider the language of certain sheep, which consists of strings that look like the
following:
baa!
baaa!
baaaa!
baaaaa!
...
This language consists of strings with a b, followed by at least two a’s, followed
by an exclamation point. The set of operators that allows us to say things like “some
number of as” is based on the asterisk or *, commonly called the Kleene * (generally
pronounced “cleany star”). The Kleene star means “zero or more occurrences
of the immediately previous character or regular expression”. So /a*/ means “any
string of zero or more as”. This will match a or aaaaaa, but it will also match Off
Minor since the string Off Minor has zero a’s. So the regular expression for matching
one or more a is /aa*/, meaning one a followed by zero or more as. More complex
patterns can also be repeated. So /[ab]*/ means “zero or more a’s or b’s” (not
“zero or more right square braces”). This will match strings like aaaa or ababab or
bbbb.
For specifying multiple digits (useful for finding prices) we can extend /[0-9]/,
the regular expression for a single digit. An integer (a string of digits) is thus
/[0-9][0-9]*/. (Why isn’t it just /[0-9]*/?)
Sometimes it’s annoying to have to write the regular expression for digits twice,
so there is a shorter way to specify “at least one” of some character. This is the
Kleene +, which means “one or more occurrences of the immediately preceding
character or regular expression”. Thus, the expression /[0-9]+/ is the normal way
to specify “a sequence of digits”. There are thus two ways to specify the sheep
language: /baaa*!/ or /baa+!/.
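As a quick check that the two sheep-language patterns agree, here is a small sketch in Python. The helper name is_sheep and the test strings are our own; re.fullmatch is used so that the whole string must belong to the language.

```python
import re

def is_sheep(s, pattern=r"baa+!"):
    """True if the whole string s belongs to the sheep language."""
    return re.fullmatch(pattern, s) is not None

# Both formulations accept and reject the same strings.
for s in ["baa!", "baaaaa!", "ba!", "b!", "baa"]:
    assert is_sheep(s, r"baaa*!") == is_sheep(s, r"baa+!")

# The question mark makes the preceding character optional.
optional_s = [bool(re.fullmatch(r"woodchucks?", w))
              for w in ["woodchuck", "woodchucks"]]

# /a*/ succeeds even on a string with no a's: it matches the empty string.
empty_match = re.search(r"a*", "Off Minor").group()

# /[0-9]+/ finds runs of one or more digits.
integers = re.findall(r"[0-9]+", "35 boxes at $199 each")
```

The empty match on Off Minor is the reason an integer is written /[0-9][0-9]*/ or /[0-9]+/ rather than /[0-9]*/.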
One very important special character is the period (/./), a wildcard expression
that matches any single character (except a carriage return), as shown in Fig. 2.6.
The wildcard is often used together with the Kleene star to mean “any string of
characters”. For example, suppose we want to find any line in which a particular
word, for example, aardvark, appears twice. We can specify this with the regular
expression /aardvark.*aardvark/.
Anchors are special characters that anchor regular expressions to particular places
in a string. The most common anchors are the caret ^ and the dollar sign $. The caret
^ matches the start of a line. The pattern /^The/ matches the word The only at the
start of a line. Thus, the caret ^ has three uses: to match the start of a line, to
indicate a negation inside of square brackets, and just to mean a caret. (What are the
contexts that allow grep or Python to know which function a given caret is supposed
to have?) The dollar sign $ matches the end of a line. So the pattern / $/ is a useful
pattern for matching a space at the end of a line, and /^The dog\.$/ matches a
line that contains only the phrase The dog. (We have to use the backslash here since
we want the . to mean “period” and not the wildcard.)
RE    Match
^     start of line
$     end of line
\b    word boundary
\B    non-word boundary
Figure 2.7 Anchors in regular expressions.
There are also two other anchors: \b matches a word boundary, and \B matches
a non-boundary. Thus, /\bthe\b/ matches the word the but not the word other.
More technically, a “word” for the purposes of a regular expression is defined as any
sequence of digits, underscores, or letters; this is based on the definition of “words”
in programming languages. For example, /\b99\b/ will match the string 99 in
There are 99 bottles of beer on the wall (because 99 follows a space) but not 99 in
There are 299 bottles of beer on the wall (since 99 follows a number). But it will
match 99 in $99 (since 99 follows a dollar sign ($), which is not a digit, underscore,
or letter).
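The anchor examples above can be verified with a short sketch in Python's re module. The variable names and the extra test strings (such as See The dog.) are ours.

```python
import re

# ^ anchors the match to the start of the line.
starts_with_the = [bool(re.search(r"^The", s))
                   for s in ["The dog.", "See The dog."]]

# ^...$ together require the line to contain only the phrase.
only_the_dog = [bool(re.search(r"^The dog\.$", s))
                for s in ["The dog.", "The dog. Woof"]]

# \b requires a word boundary on each side, so "other" is not matched.
whole_word_the = re.findall(r"\bthe\b", "the other theology, the")

# 99 after a space or a dollar sign is at a boundary; 99 inside 299 is not.
bottles = [bool(re.search(r"\b99\b", s)) for s in [
    "There are 99 bottles of beer on the wall",
    "There are 299 bottles of beer on the wall",
    "It costs $99",
]]
```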
/(Column [0-9]+ *)*/ to match the word Column, followed by a number and
optional spaces, the whole pattern repeated zero or more times.
This idea that one operator may take precedence over another, requiring us to
sometimes use parentheses to specify what we mean, is formalized by the operator
precedence hierarchy for regular expressions. The following table gives the order
of RE operator precedence, from highest precedence to lowest precedence.
Parenthesis              ()
Counters                 * + ? {}
Sequences and anchors    the ^my end$
Disjunction              |
Thus, because counters have a higher precedence than sequences,
/the*/ matches theeeee but not thethe. Because sequences have a higher prece-
dence than disjunction, /the|any/ matches the or any but not thany or theny.
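A short sketch makes the precedence rules concrete. We use re.fullmatch (our choice, so that the pattern must account for the entire string) and a hypothetical helper matches_whole.

```python
import re

def matches_whole(pattern, s):
    """True if the pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None

# The counter * binds only to the preceding e, not to the sequence "the".
counter_on_e = (matches_whole(r"the*", "theeeee"),
                matches_whole(r"the*", "thethe"))      # (True, False)

# Parentheses override this, so the counter applies to the whole sequence.
counter_on_group = matches_whole(r"(the)*", "thethe")  # True

# Sequence binds tighter than disjunction: the choice is "the" or "any".
disjunction = [matches_whole(r"the|any", s)
               for s in ["the", "any", "thany", "theny"]]
```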
Patterns can be ambiguous in another way. Consider the expression /[a-z]*/
when matching against the text once upon a time. Since /[a-z]*/ matches zero or
more letters, this expression could match nothing, or just the first letter o, on, onc,
or once. In these cases regular expressions always match the largest string they can;
we say that patterns are greedy, expanding to cover as much of a string as they can.
There are, however, ways to enforce non-greedy matching, using another meaning
of the ? qualifier. The operator *? is a Kleene star that matches as little text as
possible. The operator +? is a Kleene plus that matches as little text as possible.
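The difference between greedy and non-greedy matching is easy to see on the once upon a time example. This sketch uses re.match, which anchors the attempt at the start of the text.

```python
import re

text = "once upon a time"

greedy_star = re.match(r"[a-z]*", text).group()   # 'once': as much as possible
lazy_star = re.match(r"[a-z]*?", text).group()    # '': zero letters suffice
lazy_plus = re.match(r"[a-z]+?", text).group()    # 'o': one letter is the minimum
```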
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
The process we just went through was based on fixing two kinds of errors: false
positives, strings that we incorrectly matched like other or there, and false
negatives, strings that we incorrectly missed, like The. Addressing these two kinds of
errors comes up again and again in implementing speech and language processing
systems. Reducing the overall error rate for an application thus involves two
antagonistic efforts:
• Increasing precision (minimizing false positives)
• Increasing recall (minimizing false negatives)
We’ll come back to precision and recall with more precise definitions in Chapter 4.
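To see both kinds of error, we can compare a naive pattern with the refined one given above. This is a sketch; the naive pattern /the/ and the list of test strings are our own illustration.

```python
import re

# The refined pattern from the text, with ^ written as an ASCII caret.
THE = re.compile(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)")

def has_the(text, pattern=THE):
    """True if text contains the word 'the' (either case of t) as a whole word."""
    return pattern.search(text) is not None

examples = ["The dog", "in the end", "other", "there", "see the"]

# Naive pattern: misses "The" (false negative), fires on "other" and
# "there" (false positives).
naive = [bool(re.search(r"the", s)) for s in examples]

# Refined pattern: fixes both kinds of error on these examples.
refined = [has_the(s) for s in examples]
```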
RE       Match
*        zero or more occurrences of the previous char or expression
+        one or more occurrences of the previous char or expression
?        exactly zero or one occurrence of the previous char or expression
{n}      n occurrences of the previous char or expression
{n,m}    from n to m occurrences of the previous char or expression
{n,}     at least n occurrences of the previous char or expression
{,m}     up to m occurrences of the previous char or expression
Figure 2.9 Regular expression operators for counting.
Finally, certain special characters are referred to by special notation based on the
backslash (\) (see Fig. 2.10). The most common of these are the newline character
\n and the tab character \t. To refer to characters that are special themselves (like
., *, [, and \), precede them with a backslash (i.e., /\./, /\*/, /\[/, and /\\/).
s/colour/color/
It is often useful to be able to refer to a particular subpart of the string matching
the first pattern. For example, suppose we wanted to put angle brackets around all
integers in a text, for example, changing the 35 boxes to the <35> boxes. We’d
like a way to refer to the integer we’ve found so that we can easily add the brackets.
To do this, we put parentheses ( and ) around the first pattern and use the number
operator \1 in the second pattern to refer back. Here’s how it looks:
s/([0-9]+)/<\1>/
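In Python the same substitutions can be written with re.sub, which takes the pattern, the replacement, and the input string. The sketch below applies the two substitutions above; the sentence about the colour of the box is our own test input.

```python
import re

# s/colour/color/
american = re.sub(r"colour", "color", "the colour of the box")

# s/([0-9]+)/<\1>/ : \1 refers back to whatever the parentheses matched.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")
```

Here bracketed is the string the <35> boxes.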
The parenthesis and number operators can also specify that a certain string or
expression must occur twice in the text. For example, suppose we are looking for
the pattern “the Xer they were, the Xer they will be”, where we want to constrain
the two X’s to be the same string. We do this by surrounding the first X with the
parenthesis operator, and replacing the second X with the number operator \1, as
follows:
/the (.*)er they were, the \1er they will be/
Here the \1 will be replaced by whatever string matched the first item in paren-
theses. So this will match the bigger they were, the bigger they will be but not the
bigger they were, the faster they will be.
This use of parentheses to store a pattern in memory is called a capture group.
Every time a capture group is used (i.e., parentheses surround a pattern), the
resulting match is stored in a numbered register. If you match two different sets of
parentheses, \2 means whatever matched the second capture group. Thus
/the (.*)er they (.*), the \1er we \2/
will match the faster they ran, the faster we ran but not the faster they ran, the faster
we ate. Similarly, the third capture group is stored in \3, the fourth is \4, and so on.
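The following sketch checks the two backreference patterns against the example sentences. The variable names are ours; the patterns and sentences are the ones discussed above.

```python
import re

XER = re.compile(r"the (.*)er they were, the \1er they will be")
TWO_GROUPS = re.compile(r"the (.*)er they (.*), the \1er we \2")

same = XER.search("the bigger they were, the bigger they will be")
different = XER.search("the bigger they were, the faster they will be")

both_repeat = TWO_GROUPS.search("the faster they ran, the faster we ran")
second_differs = TWO_GROUPS.search("the faster they ran, the faster we ate")

# Register 1 holds "bigg" in the first match; the second pattern fills
# registers 1 and 2 with "fast" and "ran".
captured = (same.group(1), both_repeat.groups())
```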
Parentheses thus have a double function in regular expressions; they are used to
group terms for specifying the order in which operators should apply, and they are
used to capture something in a register. Occasionally we might want to use
parentheses for grouping, but don’t want to capture the resulting pattern in a register. In that
case we use a non-capturing group, which is specified by putting the commands
?: after the open paren, in the form (?: pattern ).
/(?:some|a few) (people|cats) like some \1/
will match some cats like some cats but not some cats like some a few.
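A sketch contrasting the non-capturing pattern with a capturing variant shows how the register numbering shifts. The capturing variant and the sentence some cats like some some are our own additions for contrast.

```python
import re

NON_CAPTURING = re.compile(r"(?:some|a few) (people|cats) like some \1")
CAPTURING = re.compile(r"(some|a few) (people|cats) like some \1")

# With (?: ...), register 1 is the (people|cats) group.
m1 = NON_CAPTURING.search("some cats like some cats")
m2 = NON_CAPTURING.search("some cats like some a few")

# Without ?:, register 1 is the quantifier group, so \1 must repeat "some".
m3 = CAPTURING.search("some cats like some some")
```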
Substitutions and capture groups are very useful in implementing simple chat-
bots like ELIZA (Weizenbaum, 1966). Recall that ELIZA simulates a Rogerian
psychologist by carrying on conversations like the following:
first uppercased. The first substitutions then change all instances of MY to YOUR,
and I’M to YOU ARE, and so on. The next set of substitutions matches and replaces
other patterns in the input. Here are some examples:
s/.* I’M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Since multiple substitutions can apply to a given input, substitutions are assigned
a rank and applied in order. Creating patterns is the topic of Exercise 2.3, and we
return to the details of the ELIZA architecture in Chapter 24.
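A minimal sketch of this rule cascade in Python follows. Several details are our assumptions rather than part of ELIZA as described: the rules are ranked simply by their order in the list, the keywords are uppercased to match the uppercased input, the input is padded with spaces so that patterns requiring surrounding spaces can match at the edges, a straight apostrophe is used in I'M, and the earlier step that rewrites MY to YOUR is not modelled.

```python
import re

# (pattern, response) pairs, tried in order; earlier rules outrank later ones.
RULES = [
    (r".* I'M (DEPRESSED|SAD) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* I AM (DEPRESSED|SAD) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* ALL .*", "IN WHAT WAY"),
    (r".* ALWAYS .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance):
    """Return the response of the first matching rule, or None."""
    # Uppercase the input and pad it with spaces (padding is our assumption).
    text = " " + utterance.upper() + " "
    for pattern, response in RULES:
        match = re.fullmatch(pattern, text)
        if match:
            # expand() fills in \1 from the capture group, if there is one.
            return match.expand(response)
    return None

reply_sad = respond("Well, I'm sad today")
reply_all = respond("They're all alike")
reply_none = respond("hello")
```

With these rules, reply_sad is I AM SORRY TO HEAR YOU ARE SAD and reply_all is IN WHAT WAY.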
2.2 Words
Before we talk about processing words, we need to decide what counts as a word.
Let’s start by looking at one particular corpus (plural corpora), a computer-readable
collection of text or speech. For example the Brown corpus is a million-word
collection of samples from 500 written English texts from different genres (newspaper,
fiction, non-fiction, academic, etc.), assembled at Brown University in 1963–64
(Kučera and Francis, 1967). How many words are in the following Brown sentence?
He stepped out into the hall, was delighted to encounter a water brother.
This sentence has 13 words if we don’t count punctuation marks as words, 15
if we count punctuation. Whether we treat period (“.”), comma (“,”), and so on as
words depends on the task. Punctuation is critical for finding boundaries of things
(commas, periods, colons) and for identifying some aspects of meaning (question
marks, exclamation marks, quotation marks). For some tasks, like part-of-speech
tagging or parsing or speech synthesis, we sometimes treat punctuation marks as if
they were separate words.
The Switchboard corpus of American English telephone conversations between
strangers was collected in the early 1990s; it contains 2430 conversations averaging
6 minutes each, totaling 240 hours of speech and about 3 million words (Godfrey
et al., 1992). Such corpora of spoken language don’t have punctuation but do
introduce other complications with regard to defining words. Let’s look at one utterance
from Switchboard; an utterance is the spoken correlate of a sentence:
I do uh main- mainly business data processing
This utterance has two kinds of disfluencies. The broken-off word main- is
called a fragment. Words like uh and um are called fillers or filled pauses. Should
we consider these to be words? Again, it depends on the application. If we are
building a speech transcription system, we might want to eventually strip out the
disfluencies.
But we also sometimes keep disfluencies around. Disfluencies like uh or um
are actually helpful in speech recognition in predicting the upcoming word, because
they may signal that the speaker is restarting the clause or idea, and so for speech
recognition they are treated as regular words. Because people use different disflu-
encies they can also be a cue to speaker identification. In fact Clark and Fox Tree
(2002) showed that uh and um have different meanings. What do you think they are?
Are capitalized tokens like They and uncapitalized tokens like they the same
word? These are lumped together in some tasks (speech recognition), while for part-
of-speech or named-entity tagging, capitalization is a useful feature and is retained.
How about inflected forms like cats versus cat? These two words have the same
lemma cat but are different wordforms. A lemma is a set of lexical forms having
the same stem, the same major part-of-speech, and the same word sense. The
wordform is the full inflected or derived form of the word. For morphologically complex
languages like Arabic, we often need to deal with lemmatization. For many tasks in
English, however, wordforms are sufficient.
How many words are there in English? To answer this question we need to
distinguish two ways of talking about words. Types are the number of distinct words
in a corpus; if the set of words in the vocabulary is V, the number of types is the
vocabulary size |V|. Tokens are the total number N of running words. If we ignore
punctuation, the following Brown sentence has 16 tokens and 14 types:
They picnicked by the pool, then lay back on the grass and looked at the stars.
When we speak about the number of words in the language, we are generally
referring to word types.
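The counts for this sentence can be reproduced with a few lines of Python. This is a sketch under simple assumptions of ours: a token is any maximal run of letters, punctuation is ignored, and capitalization is kept (so They and the are distinct types).

```python
import re

def count_tokens_and_types(text):
    """Return (number of tokens N, number of types |V|) for a text."""
    tokens = re.findall(r"[A-Za-z]+", text)   # punctuation is not a token
    return len(tokens), len(set(tokens))

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")
n_tokens, n_types = count_tokens_and_types(sentence)   # (16, 14)
```

The only repeated word is the, which occurs three times, so the 16 tokens collapse to 14 types.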
Fig. 2.11 shows the rough numbers of types and tokens computed from some
popular English corpora. The larger the corpora we look at, the more word types
we find, and in fact this relationship between the number of types |V| and number
of tokens N is called Herdan’s Law (Herdan, 1960) or Heaps’ Law (Heaps, 1978)
after its discoverers (in linguistics and information retrieval respectively). It is shown
in Eq. 2.1, where k and β are positive constants, and 0 < β < 1.
|V| = kN^{\beta}    (2.1)
The value of β depends on the corpus size and the genre, but at least for the large
corpora in Fig. 2.11, β ranges from .67 to .75. Roughly then we can say that the
vocabulary size for a text goes up significantly faster than the square root of its
length in words.
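To get a feel for Eq. 2.1, the sketch below evaluates it for the range of β mentioned above. The value of k is purely hypothetical (no value is given here); it cancels out when we compare vocabulary sizes at two corpus lengths.

```python
def heaps_vocab_size(n_tokens, k, beta):
    """Predicted vocabulary size |V| = k * N**beta (Eq. 2.1)."""
    return k * n_tokens ** beta

K = 30.0  # hypothetical constant, chosen only for illustration

# Growth in |V| when the corpus doubles in length, for several exponents.
# beta = 0.5 is the square-root baseline, included for comparison.
doubling_growth = {
    beta: heaps_vocab_size(2_000_000, K, beta) / heaps_vocab_size(1_000_000, K, beta)
    for beta in (0.5, 0.67, 0.75)
}
```

Doubling N multiplies |V| by 2^β, which is about 1.59 to 1.68 for β between .67 and .75, compared with about 1.41 for a square-root relationship.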
Another measure of the number of words in the language is the number of lem-
mas instead of wordform types. Dictionaries can help in giving lemma counts; dic-
tionary entries or boldface forms are a very rough upper bound on the number of
lemmas (since some lemmas have multiple boldface forms). The 1989 edition of the
Oxford English Dictionary had 615,000 entries.
2.3 Corpora
Words don’t appear out of nowhere. Any particular piece of text that we study
is produced by one or more specific speakers or writers, in a specific dialect of a
specific language, at a specific time, in a specific place, for a specific function.
Perhaps the most important dimension of variation is the language. NLP algo-
rithms are most useful when they apply across many languages. The world has 7097
languages at the time of this writing, according to the online Ethnologue catalog
(Simons and Fennig, 2018). It is important to test algorithms on more than one lan-
guage, and particularly on languages with different properties; by contrast there is
an unfortunate current tendency for NLP algorithms to be developed or tested just
on English (Bender, 2019). Even when algorithms are developed beyond English,
they tend to be developed for the official languages of large industrialized nations
(Chinese, Spanish, Japanese, German etc.), but we don’t want to limit tools to just
these few languages. Furthermore, most languages also have multiple varieties,
often spoken in different regions or by different social groups. Thus, for example, if
we’re processing text that uses features of African American Language (AAL),
the name for the many variations of language used by millions of people in African
American communities (King 2020), we must use NLP tools that function with
features of those varieties. Twitter posts might use features often used by speakers of
African American Language, such as constructions like iont (I don’t in Mainstream
American English (MAE)), or talmbout corresponding to MAE talking about, both
examples that influence word segmentation (Blodgett et al. 2016, Jones 2015).
It’s also quite common for speakers or writers to use multiple languages in a
single communicative act, a phenomenon called code switching. Code switching
is enormously common across the world; here are examples showing Spanish
and (transliterated) Hindi code switching with English (Solorio et al. 2014, Jurgens
et al. 2017):
(2.2) Por primera vez veo a @username actually being hateful! it was beautiful:)
[For the first time I get to see @username actually being hateful! it was
beautiful:) ]
(2.3) dost tha or ra- hega ... dont wory ... but dherya rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
Another dimension of variation is the genre. The text that our algorithms must
process might come from newswire, fiction or non-fiction books, scientific articles,
Wikipedia, or religious texts. It might come from spoken genres like telephone
conversations, business meetings, police body-worn cameras, medical interviews,
or transcripts of television shows or movies. It might come from work situations
27378 the
26084 and
22538 i
19771 to
17481 of
14725 a
13826 you
12489 my
11318 that
11112 in
...
Unix tools of this sort can be very handy in building quick word count statistics
for any corpus.
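The same kind of quick word count can be sketched in Python rather than with shell tools. This is our own analogue, not the pipeline that produced the list above: it lowercases the text, treats every maximal run of letters as a word, and sorts by frequency. The sample sentence is also ours.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercased alphabetic words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

sample = "To be, or not to be, that is the question"
counts = word_frequencies(sample)

# Print the most frequent words, one per line, count first.
for word, count in counts.most_common(3):
    print(count, word)
```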
Carefully designed deterministic algorithms can deal with the ambiguities that
arise, such as the fact that the apostrophe needs to be tokenized differently when used
as a genitive marker (as in the book’s cover), a quotative as in ‘The other class’, she
said, or in clitics like they’re.
Word tokenization is more complex in languages like written Chinese, Japanese,
and Thai, which do not use spaces to mark potential word-boundaries. In Chinese,
for example, words are composed of characters (called hanzi in Chinese). Each
character generally represents a single unit of meaning (called a morpheme) and is
pronounceable as a single syllable. Words are about 2.4 characters long on average.
But deciding what counts as a word in Chinese is complex. For example, consider
the following sentence:
(2.4) 姚明进入总决赛
“Yao Ming reaches the finals”
As Chen et al. (2017) point out, this could be treated as 3 words (‘Chinese Treebank’
segmentation):
(2.5) 姚明 进入 总决赛
YaoMing reaches finals
or as 5 words (‘Peking University’ segmentation):
(2.6) 姚 明 进入 总 决赛
Yao Ming reaches overall finals
Finally, it is possible in Chinese simply to ignore words altogether and use characters
as the basic elements, treating the sentence as a series of 7 characters:
(2.7) 姚 明 进 入 总 决 赛
Yao Ming enter enter overall decision game
In fact, for most Chinese NLP tasks it turns out to work better to take characters
rather than words as input, since characters are at a reasonable semantic level for
most applications, and since most word standards, by contrast, result in a huge vo-
cabulary with large numbers of very rare words (Li et al., 2019).
However, for Japanese and Thai the character is too small a unit, and so
algorithms for word segmentation are required. These can also be useful for Chinese
in the rare situations where word rather than character boundaries are required. The
standard segmentation algorithms for these languages use neural sequence
models trained via supervised machine learning on hand-segmented training sets; we’ll
introduce sequence models in Chapter 8 and Chapter 9.
The algorithm is usually run inside words (not merging across word boundaries),
so the input corpus is first white-space-separated to give a set of strings, each
corresponding to the characters of a word, plus a special end-of-word symbol _, and its
counts. Let’s see its operation on the following tiny input corpus of 18 word tokens
with counts for each word (the word low appears 5 times, the word newer 6 times,
and so on), which would have a starting vocabulary of 11 letters:
corpus               vocabulary
5 l o w _            _, d, e, i, l, n, o, r, s, t, w
2 l o w e s t _
6 n e w e r _
3 w i d e r _
2 n e w _
The BPE algorithm first counts all pairs of adjacent symbols: the most frequent
is the pair e r because it occurs in newer (frequency of 6) and wider (frequency of
3) for a total of 9 occurrences. (Note that there can be ties; we could have instead
chosen to merge r _ first, since that also has a frequency of 9.) We then merge these
symbols, treating er as one symbol, and count again:
corpus               vocabulary
5 l o w _            _, d, e, i, l, n, o, r, s, t, w, er
2 l o w e s t _
6 n e w er _
3 w i d er _
2 n e w _
Now the most frequent pair is er _, which we merge; our system has learned
that there should be a token for word-final er, represented as er_:
corpus               vocabulary
5 l o w _            _, d, e, i, l, n, o, r, s, t, w, er, er_
2 l o w e s t _
6 n e w er_
3 w i d er_
2 n e w _
Next n e (total count of 8) get merged to ne:
corpus               vocabulary
5 l o w _            _, d, e, i, l, n, o, r, s, t, w, er, er_, ne
2 l o w e s t _
6 ne w er_
3 w i d er_
2 ne w _
If we continue, the next merges are:
Merge          Current Vocabulary
(ne, w)        _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new
(l, o)         _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo
(lo, w)        _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low
(new, er_)     _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_
(low, _)       _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_, low_
Figure 2.13 The token learner part of the BPE algorithm for taking a corpus broken up
into individual characters or bytes, and learning a vocabulary by iteratively merging tokens.
Figure adapted from Bostrom and Durrett (2020).
Once we’ve learned our vocabulary, the token parser is used to tokenize a test
sentence. The token parser just runs on the test data the merges we have learned
from the training data, greedily, in the order we learned them. (Thus the frequencies
in the test data don’t play a role, just the frequencies in the training data.) So first
we segment each test sentence word into characters. Then we apply the first rule:
replace every instance of e r in the test corpus with er, and then the second rule:
replace every instance of er _ in the test corpus with er_, and so on. By the end,
if the test corpus contained the word n e w e r _, it would be tokenized as a full
word. But a new (unknown) word like l o w e r _ would be merged into the two
tokens low er_.
Of course in real algorithms BPE is run with many thousands of merges on a very
large input corpus. The result is that most words will be represented as full symbols,
and only the very rare words (and unknown words) will have to be represented by
their parts.
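The walkthrough above can be reproduced with a compact sketch of a token learner and a token parser. This is an illustration under our own assumptions, not the reference implementation: the function names are ours, the end-of-word symbol is written as an underscore, and ties between equally frequent pairs are broken in favor of the pair encountered first.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of pair in symbols by one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(word_counts, num_merges):
    """Token learner: return the ordered list of merges learned from the corpus."""
    # Each word becomes its characters plus the end-of-word symbol "_".
    corpus = [(list(word) + ["_"], count) for word, count in word_counts.items()]
    merges = []
    for step in range(num_merges):
        pair_counts = Counter()
        for symbols, count in corpus:
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair; on ties, the pair seen first wins (our choice).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = [(merge_pair(symbols, best), count) for symbols, count in corpus]
    return merges

def segment(word, merges):
    """Token parser: apply the learned merges to a word, in the order learned."""
    symbols = list(word) + ["_"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

counts = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
merges = learn_bpe(counts, num_merges=8)
known = segment("newer", merges)     # ['newer_']
unknown = segment("lower", merges)   # ['low', 'er_']
```

With this tie-breaking rule the eight merges come out in the same order as in the walkthrough, starting with (e, r) and ending with (low, _).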
be; the words dinner and dinners both have the lemma dinner. Lemmatizing each of
these forms to the same lemma will let us find all mentions of words in Russian like
Moscow. The lemmatized form of a sentence like He is reading detective stories
would thus be He be read detective story.
How is lemmatization done? The most sophisticated methods for lemmatization
involve complete morphological parsing of the word. Morphology is the study of
the way words are built up from smaller meaning-bearing units called morphemes.
Two broad classes of morphemes can be distinguished: stems (the central
morpheme of the word, supplying the main meaning) and affixes (adding “additional”
meanings of various kinds). So, for example, the word fox consists of one morpheme
(the morpheme fox) and the word cats consists of two: the morpheme cat and the
morpheme -s. A morphological parser takes a word like cats and parses it into the
two morphemes cat and s, or parses a Spanish word like amaren (‘if in the future
they would love’) into the morpheme amar ‘to love’, and the morphological features
3PL and future subjunctive.
I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s   i s
Figure 2.14 Representing the minimum edit distance between two strings as an alignment.
The final row gives the operation list for converting the top string into the bottom string: d for
deletion, s for substitution, i for insertion.
We can also assign a particular cost or weight to each of these operations. The
Levenshtein distance between two sequences is the simplest weighting factor in
which each of the three operations has a cost of 1 (Levenshtein, 1966)—we assume
that the substitution of a letter for itself, for example, t for t, has zero cost. The Lev-
enshtein distance between intention and execution is 5. Levenshtein also proposed
an alternative version of his metric in which each insertion or deletion has a cost of
1 and substitutions are not allowed. (This is equivalent to allowing substitution, but
giving each substitution a cost of 2 since any substitution can be represented by one
insertion and one deletion). Using this version, the Levenshtein distance between
intention and execution is 8.
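The two totals can be checked by summing the costs of the operation list in Fig. 2.14. The sketch below does this arithmetic; the function name `alignment_cost` and its parameters are our own, introduced only for illustration.

```python
def alignment_cost(ops, sub_cost=1, ins_cost=1, del_cost=1):
    """Sum the cost of an operation list such as 'dssis' (d=delete, s=substitute, i=insert)."""
    costs = {"d": del_cost, "s": sub_cost, "i": ins_cost}
    return sum(costs[op] for op in ops)

ops = "dssis"  # operation list from Fig. 2.14; matched letters cost nothing and are omitted
print(alignment_cost(ops))              # 5: every operation costs 1
print(alignment_cost(ops, sub_cost=2))  # 8: 1 + 2 + 2 + 1 + 2
```

Note that this only prices a given alignment; it does not show that no cheaper alignment exists.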
                         i n t e n t i o n
        del                    ins                   subst
  n t e n t i o n      i n t e c n t i o n      i n x e n t i o n
Figure 2.15 Finding the edit distance viewed as a search problem
The space of all possible edits is enormous, so we can’t search naively. However,
lots of distinct edit paths will end up in the same state (string), so rather than
recomputing all those paths, we could just remember the shortest path to a state each time
we saw it. We can do this by using dynamic programming. Dynamic programming
is the name for a class of algorithms, first introduced by Bellman (1957), that apply
a table-driven method to solve problems by combining solutions to sub-problems.
Some of the most commonly used algorithms in natural language processing make
use of dynamic programming, such as the Viterbi algorithm (Chapter 8) and the
CKY algorithm for parsing (Chapter 13).
The intuition of a dynamic programming problem is that a large problem can
be solved by properly combining the solutions to various sub-problems. Consider
the shortest path of transformed words that represents the minimum edit distance
between the strings intention and execution shown in Fig. 2.16.
Imagine some string (perhaps it is exention) that is in this optimal path (whatever
it is). The intuition of dynamic programming is that if exention is in the optimal
operation list, then the optimal sequence must also include the optimal path from
intention to exention. Why? If there were a shorter path from intention to exention,
then we could use it instead, resulting in a shorter overall path, and the optimal
sequence wouldn’t be optimal, thus leading to a contradiction.
i n t e n t i o n
delete i
n t e n t i o n
substitute n by e
e t e n t i o n
substitute t by x
e x e n t i o n
insert u
e x e n u t i o n
substitute n by c
e x e c u t i o n
Figure 2.16 Path from intention to execution.
The minimum edit distance algorithm was named by Wagner and
Fischer (1974) but independently discovered by many people (see the Historical
Notes section of Chapter 8).
Let’s first define the minimum edit distance between two strings. Given two
strings, the source string X of length n, and target string Y of length m, we’ll define
D[i, j] as the edit distance between X[1..i] and Y [1.. j], i.e., the first i characters of X
and the first j characters of Y . The edit distance between X and Y is thus D[n, m].
We’ll use dynamic programming to compute D[n, m] bottom up, combining so-
lutions to subproblems. In the base case, with a source substring of length i but an
empty target string, going from i characters to 0 requires i deletes. With a target
substring of length j but an empty source, going from 0 characters to j characters
requires j inserts. Having computed D[i, j] for small i, j we then compute larger
D[i, j] based on previously computed smaller values. The value of D[i, j] is com-
puted by taking the minimum of the three possible paths through the matrix which
arrive there:
$$D[i,j] = \min \begin{cases} D[i-1,j] + \text{del-cost}(\text{source}[i]) \\ D[i,j-1] + \text{ins-cost}(\text{target}[j]) \\ D[i-1,j-1] + \text{sub-cost}(\text{source}[i], \text{target}[j]) \end{cases}$$
If we assume the version of Levenshtein distance in which the insertions and deletions
each have a cost of 1 (ins-cost(·) = del-cost(·) = 1), and substitutions have a
cost of 2 (except that substitution of identical letters has zero cost), the computation for
D[i, j] becomes:
$$D[i,j] = \min \begin{cases} D[i-1,j] + 1 \\ D[i,j-1] + 1 \\ D[i-1,j-1] + \begin{cases} 2; & \text{if } \text{source}[i] \neq \text{target}[j] \\ 0; & \text{if } \text{source}[i] = \text{target}[j] \end{cases} \end{cases} \tag{2.8}$$
The algorithm is summarized in Fig. 2.17; Fig. 2.18 shows the results of applying
the algorithm to the distance between intention and execution with the version of
Levenshtein in Eq. 2.8.
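Before turning to the table-filling version, it may help to see the recurrence of Eq. 2.8 written directly as a recursive function whose subproblem results are remembered. This is a minimal top-down sketch, not the algorithm of Fig. 2.17; the name `min_edit_distance_memo` and the `sub_cost` parameter are assumptions of the example, and Python strings are 0-indexed, so `source[i - 1]` plays the role of source[i].

```python
from functools import lru_cache

def min_edit_distance_memo(source, target, sub_cost=2):
    """Top-down edit distance; each (i, j) subproblem is solved once and cached."""
    @lru_cache(maxsize=None)
    def dist(i, j):
        # dist(i, j) = cost of turning the first i chars of source into the first j of target
        if i == 0:
            return j          # j insertions
        if j == 0:
            return i          # i deletions
        same = source[i - 1] == target[j - 1]
        return min(
            dist(i - 1, j) + 1,                              # deletion
            dist(i, j - 1) + 1,                              # insertion
            dist(i - 1, j - 1) + (0 if same else sub_cost),  # substitution or match
        )
    return dist(len(source), len(target))

print(min_edit_distance_memo("intention", "execution"))              # 8
print(min_edit_distance_memo("intention", "execution", sub_cost=1))  # 5
```

The cache is what makes this dynamic programming rather than naive search: without it the same (i, j) pair would be recomputed along many different edit paths.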
Knowing the minimum edit distance is useful for algorithms like finding potential
spelling error corrections. But the edit distance algorithm is important in another
way; with a small change, it can also provide the minimum cost alignment between
two strings. Aligning two strings is useful throughout speech and language processing.
In speech recognition, minimum edit distance alignment is used to compute
the word error rate (Chapter 26). Alignment plays a role in machine translation, in
which sentences in a parallel corpus (a corpus with a text in two languages) need to
be matched to each other.
function MIN-EDIT-DISTANCE(source, target) returns min-distance
  n ← LENGTH(source)
  m ← LENGTH(target)
  Create a distance matrix D[n+1,m+1]
  # Initialization: the zeroth row and column is the distance from the empty string
  D[0,0] ← 0
  for each row i from 1 to n do
    D[i,0] ← D[i−1,0] + del-cost(source[i])
  for each column j from 1 to m do
    D[0,j] ← D[0,j−1] + ins-cost(target[j])
  # Recurrence relation:
  for each row i from 1 to n do
    for each column j from 1 to m do
      D[i,j] ← MIN( D[i−1,j] + del-cost(source[i]),
                    D[i−1,j−1] + sub-cost(source[i], target[j]),
                    D[i,j−1] + ins-cost(target[j]))
  # Termination
  return D[n,m]
Figure 2.17 The minimum edit distance algorithm, an example of the class of dynamic
programming algorithms. The various costs can either be fixed (e.g., ∀x, ins-cost(x) = 1)
or can be specific to the letter (to model the fact that some letters are more likely to be
inserted than others). We assume that there is no cost for substituting a letter for itself (i.e.,
sub-cost(x, x) = 0).
Src\Tar    #    e    x    e    c    u    t    i    o    n
#          0    1    2    3    4    5    6    7    8    9
i          1    2    3    4    5    6    7    6    7    8
n          2    3    4    5    6    7    8    7    8    7
t          3    4    5    6    7    8    7    8    9    8
e          4    3    4    5    6    7    8    9   10    9
n          5    4    5    6    7    8    9   10   11   10
t          6    5    6    7    8    9    8    9   10   11
i          7    6    7    8    9   10    9    8    9   10
o          8    7    8    9   10   11   10    9    8    9
n          9    8    9   10   11   12   11   10    9    8
Figure 2.18 Computation of minimum edit distance between intention and execution with
the algorithm of Fig. 2.17, using Levenshtein distance with cost of 1 for insertions or
deletions, 2 for substitutions.
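The pseudocode of Fig. 2.17 translates almost line for line into Python. The sketch below is one possible rendering, assuming cost functions passed as parameters (the default values reproduce the costs used for Fig. 2.18); because Python strings are 0-indexed, `source[i - 1]` corresponds to source[i] in the figure.

```python
def min_edit_distance(source, target,
                      ins_cost=lambda ch: 1,
                      del_cost=lambda ch: 1,
                      sub_cost=lambda a, b: 0 if a == b else 2):
    """Bottom-up minimum edit distance following the structure of Fig. 2.17."""
    n, m = len(source), len(target)
    # D[i][j] = distance between the first i chars of source and the first j of target
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):               # zeroth column: deletions only
        D[i][0] = D[i - 1][0] + del_cost(source[i - 1])
    for j in range(1, m + 1):               # zeroth row: insertions only
        D[0][j] = D[0][j - 1] + ins_cost(target[j - 1])
    for i in range(1, n + 1):               # recurrence relation
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + del_cost(source[i - 1]),
                D[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
                D[i][j - 1] + ins_cost(target[j - 1]),
            )
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8, the lower-right cell of Fig. 2.18
print(min_edit_distance("intention", "execution",
                        sub_cost=lambda a, b: 0 if a == b else 1))  # 5
```

Passing the costs as functions keeps the letter-specific weighting mentioned in the caption of Fig. 2.17 available without changing the loop.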
To extend the edit distance algorithm to produce an alignment, we can start by
visualizing an alignment as a path through the edit distance matrix. Figure 2.19
shows this path with the boldfaced cells. Each boldfaced cell represents an alignment
of a pair of letters in the two strings. If two boldfaced cells occur in the same row,
there will be an insertion in going from the source to the target; two boldfaced cells
in the same column indicate a deletion.
Figure 2.19 also shows the intuition of how to compute this alignment path. The
computation proceeds in two steps. In the first step, we augment the minimum edit
distance algorithm to store backpointers in each cell. The backpointer from a cell
points to the previous cell (or cells) that we came from in entering the current cell.
We’ve shown a schematic of these backpointers in Fig. 2.19. Some cells have multiple
backpointers because the minimum extension could have come from multiple
previous cells. In the second step, we perform a backtrace. In a backtrace, we start
from the last cell (at the final row and column), and follow the pointers back through
the dynamic programming matrix. Each complete path between the final cell and the
initial cell is a minimum distance alignment. Exercise 2.7 asks you to modify the
minimum edit distance algorithm to store the pointers and compute the backtrace to
output an alignment.
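The two steps can be sketched as follows. This is one possible simplification rather than a full solution: it stores a single backpointer per cell (breaking ties in favor of the diagonal) and therefore recovers only one of the minimum cost alignments, whereas the description above allows several pointers per cell. The function name, the `"*"` gap symbol, and the operation labels are assumptions of the example.

```python
def min_edit_alignment(source, target, sub_cost=2):
    """Return (distance, alignment); alignment is a list of (src_char, tgt_char, op) triples."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]   # one backpointer per cell
    for i in range(1, n + 1):
        D[i][0], back[i][0] = i, "up"        # deletion
    for j in range(1, m + 1):
        D[0][j], back[0][j] = j, "left"      # insertion
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            candidates = [
                (D[i - 1][j - 1] + (0 if same else sub_cost), "diag"),
                (D[i - 1][j] + 1, "up"),
                (D[i][j - 1] + 1, "left"),
            ]
            # min() keeps the first minimal candidate, so ties prefer the diagonal
            D[i][j], back[i][j] = min(candidates, key=lambda c: c[0])
    # Backtrace from the final cell to the origin
    alignment, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            op = "match" if source[i - 1] == target[j - 1] else "sub"
            alignment.append((source[i - 1], target[j - 1], op))
            i, j = i - 1, j - 1
        elif move == "up":
            alignment.append((source[i - 1], "*", "del"))
            i -= 1
        else:
            alignment.append(("*", target[j - 1], "ins"))
            j -= 1
    alignment.reverse()
    return D[n][m], alignment

distance, alignment = min_edit_alignment("intention", "execution")
print(distance)  # 8
for src, tgt, op in alignment:
    print(src, tgt, op)
```

Storing a set of pointers per cell instead of one, and enumerating every path in the backtrace, would recover all minimum cost alignments.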
# e x e c u t i o n
# 0 ← 1 ← 2 ← 3 ← 4 ← 5 ← 6 ← 7 ←8 ← 9
i ↑1 -←↑ 2 -←↑ 3 -←↑ 4 -←↑ 5 -←↑ 6 -←↑ 7 -6 ←7 ←8
n ↑2 -←↑ 3 -←↑ 4 -←↑ 5 -←↑ 6 -←↑ 7 -←↑ 8 ↑7 -←↑ 8 -7
t ↑3 -←↑ 4 -←↑ 5 -←↑ 6 -←↑ 7 -←↑ 8 -7 ←↑ 8 -←↑ 9 ↑8
e ↑4 -3 ←4 -← 5 ←6 ←7 ←↑ 8 -←↑ 9 -←↑ 10 ↑9
n ↑5 ↑4 -←↑ 5 -←↑ 6 -←↑ 7 -←↑ 8 -←↑ 9 -←↑ 10 -←↑ 11 -↑ 10
t ↑6 ↑5 -←↑ 6 -←↑ 7 -←↑ 8 -←↑ 9 -8 ←9 ← 10 ←↑ 11
i ↑7 ↑6 -←↑ 7 -←↑ 8 -←↑ 9 -←↑ 10 ↑9 -8 ←9 ← 10
o ↑8 ↑7 -←↑ 8 -←↑ 9 -←↑ 10 -←↑ 11 ↑ 10 ↑9 -8 ←9
n ↑9 ↑8 -←↑ 9 -←↑ 10 -←↑ 11 -←↑ 12 ↑ 11 ↑ 10 ↑9 -8
Figure 2.19 When entering a value in each cell, we mark which of the three neighboring
cells we came from with up to three arrows. After the table is full we compute an alignment
(minimum edit path) by using a backtrace, starting at the 8 in the lower-right corner and
following the arrows back. The sequence of bold cells represents one possible minimum cost
alignment between the two strings. Diagram design after Gusfield (1997).
While we worked our example with simple Levenshtein distance, the algorithm
in Fig. 2.17 allows arbitrary weights on the operations. For spelling correction, for
example, substitutions are more likely to happen between letters that are next to
each other on the keyboard. The Viterbi algorithm is a probabilistic extension of
minimum edit distance. Instead of computing the “minimum edit distance” between
two strings, Viterbi computes the “maximum probability alignment” of one string
with another. We’ll discuss this more in Chapter 8.
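To make the idea of operation-specific weights concrete, here is a small sketch in which substituting a letter for a keyboard neighbor is cheaper than substituting an arbitrary letter. The adjacency table is a hypothetical, partial stand-in for a real keyboard layout, and the particular costs (0, 1, 2) are assumptions chosen for illustration, not values from the text.

```python
# Hypothetical, partial QWERTY adjacency table; a real system would cover every key.
ADJACENT = {("a", "s"), ("s", "d"), ("e", "r"), ("r", "t"), ("n", "m"), ("i", "o")}

def keyboard_sub_cost(a, b):
    """Substitution cost: 0 for identical letters, 1 for neighboring keys, 2 otherwise."""
    if a == b:
        return 0
    if (a, b) in ADJACENT or (b, a) in ADJACENT:
        return 1
    return 2

def weighted_edit_distance(source, target, sub_cost):
    """Minimum edit distance with unit insert/delete cost and a pluggable substitution cost."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]))
    return D[n][m]

print(weighted_edit_distance("thr", "the", keyboard_sub_cost))  # 1: r and e are neighbors
print(weighted_edit_distance("thx", "the", keyboard_sub_cost))  # 2: x and e are not
```

Under these weights the typo thr is closer to the than thx is, which is the behavior a spelling corrector would want.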
2.6 Summary
This chapter introduced a fundamental tool in language processing, the regular ex-
pression, and showed how to perform basic text normalization tasks including
word segmentation and normalization, sentence segmentation, and stemming.
We also introduced the important minimum edit distance algorithm for comparing
strings. Here’s a summary of the main points we covered about these ideas:
• The regular expression language is a powerful tool for pattern-matching.
• Basic operations in regular expressions include concatenation of symbols,
disjunction of symbols ([], |, and .), counters (*, +, and {n,m}), anchors
(^, $) and precedence operators ((, )).
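The operators listed above can be tried out with Python's standard `re` module. The example sentence and the patterns below are our own, chosen only to exercise each operator once.

```python
import re

text = "The woodchuck saw 3 woodchucks and 10 Woodchucks."

# Disjunction: brackets, the pipe, and the wildcard period
print(re.findall(r"[wW]oodchuck", text))     # ['woodchuck', 'woodchuck', 'Woodchuck']
print(re.findall(r"saw|and", text))          # ['saw', 'and']
print(re.findall(r"s.w", text))              # ['saw']

# Counters: * (zero or more), + (one or more), {n,m} (between n and m)
print(re.findall(r"wo*d", text))             # ['wood', 'wood']
print(re.findall(r"[0-9]+", text))           # ['3', '10']
print(re.findall(r"\b[a-z]{3,4}\b", text))   # ['saw', 'and']

# Anchors: ^ matches at the start of the string, $ at the end
print(bool(re.search(r"^The", text)))            # True
print(bool(re.search(r"Woodchucks\.$", text)))   # True

# Parentheses group a sub-pattern so that a counter applies to the whole group
print(re.fullmatch(r"(ab)+", "ababab") is not None)  # True
```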
Bibliographical and Historical Notes
“...The 1950s were not good years for mathematical research. [the]
Secretary of Defense ...had a pathological fear and hatred of the word,
research... I decided therefore to use the word, “programming”. I
wanted to get across the idea that this was dynamic, this was multi-
stage... I thought, let’s ... take a word that has an absolutely precise
meaning, namely dynamic... it’s impossible to use the word, dynamic,
in a pejorative sense. Try thinking of some combination that will pos-
sibly give it a pejorative meaning. It’s impossible. Thus, I thought
dynamic programming was a good name. It was something not even a
Congressman could object to.”
Exercises
2.1 Write regular expressions for the following languages.
1. the set of all alphabetic strings;
2. the set of all lower case alphabetic strings ending in a b;
3. the set of all strings from the alphabet a, b such that each a is immedi-
ately preceded by and immediately followed by a b;
2.2 Write regular expressions for the following languages. By “word”, we mean
an alphabetic string separated from other words by whitespace, any relevant
punctuation, line breaks, and so forth.
1. the set of all strings with two consecutive repeated words (e.g., “Hum-
bert Humbert” and “the the” but not “the bug” or “the big bug”);
2. all strings that start at the beginning of the line with an integer and that
end at the end of the line with a word;
3. all strings that have both the word grotto and the word raven in them
(but not, e.g., words like grottos that merely contain the word grotto);
4. write a pattern that places the first word of an English sentence in a
register. Deal with punctuation.
2.3 Implement an ELIZA-like program, using substitutions such as those described
on page 11. You might want to choose a different domain than a Rogerian psy-
chologist, although keep in mind that you would need a domain in which your
program can legitimately engage in a lot of simple repetition.
2.4 Compute the edit distance (using insertion cost 1, deletion cost 1, substitution
cost 1) of “leda” to “deal”. Show your work (using the edit distance grid).
2.5 Figure out whether drive is closer to brief or to divers and what the edit dis-
tance is to each. You may use any version of distance that you like.
2.6 Now implement a minimum edit distance algorithm and use your hand-computed
results to check your code.
2.7 Augment the minimum edit distance algorithm to output an alignment; you
will need to store pointers and add a stage to compute the backtrace.
CHAPTER 3
N-gram Language Models
“You are uniformly charming!” cried he, with a smile of associating and now
and then I bowed and they perceived a chaise and four to wish for.
Random sentence generated from a Jane Austen trigram model
Predicting is difficult—especially about the future, as the old quip goes. But how
about predicting something that seems much easier, like the next few words someone
is going to say? What word, for example, is likely to follow
Please turn your homework ...
Hopefully, most of you concluded that a very likely word is in, or possibly over,
but probably not refrigerator or the. In the following sections we will formalize
this intuition by introducing models that assign a probability to each possible next
word. The same models will also serve to assign a probability to an entire sentence.
Such a model, for example, could predict that the following sequence has a much
higher probability of appearing in a text:
all of a sudden I notice three guys standing on the sidewalk
than does this same set of words in a different order:
Why would you want to predict upcoming words, or assign probabilities to sen-
tences? Probabilities are essential in any task in which we have to identify words in
noisy, ambiguous input, like speech recognition. For a speech recognizer to realize
that you said I will be back soonish and not I will be bassoon dish, it helps to know
that back soonish is a much more probable sequence than bassoon dish. For writing
tools like spelling correction or grammatical error correction, we need to find and
correct errors in writing like Their are two midterms, in which There was mistyped
as Their, or Everything has improve, in which improve should have been improved.
The phrase There are will be much more probable than Their are, and has improved
than has improve, allowing us to help users by detecting and correcting these errors.
Assigning probabilities to sequences of words is also essential in machine trans-
lation. Suppose we are translating a Chinese source sentence:
他 向 记者 介绍了 主要 内容
He to reporters introduced main content
As part of the process we might have built the following set of potential rough
English translations:
he introduced reporters to the main contents of the statement
he briefed to reporters the main contents of the statement
he briefed reporters on the main contents of the statement
CHAPTER 3 • N-GRAM LANGUAGE MODELS
3.1 N-Grams
Let’s begin with the task of computing P(w|h), the probability of a word w given
some history h. Suppose the history h is “its water is so transparent that” and we
want to know the probability that the next word is the:
$$P(\text{the} \mid \text{its water is so transparent that}) \tag{3.1}$$
One way to estimate this probability is from relative frequency counts: take a
very large corpus, count the number of times we see its water is so transparent that,
and count the number of times this is followed by the. This would be answering the
question “Out of the times we saw the history h, how many times was it followed by
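the word w”, as follows:
$$P(\text{the} \mid \text{its water is so transparent that}) = \frac{C(\text{its water is so transparent that the})}{C(\text{its water is so transparent that})} \tag{3.2}$$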
With a large enough corpus, such as the web, we can compute these counts and
estimate the probability from Eq. 3.2. You should pause now, go to the web, and
compute this estimate for yourself.
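The same relative-frequency estimate can be computed on any tokenized text. The sketch below uses a tiny made-up corpus rather than the web; the function name `relative_frequency` and the corpus itself are assumptions of the example.

```python
def relative_frequency(tokens, history, word):
    """Estimate P(word | history) as C(history + word) / C(history) over a token list."""
    h = len(history)
    history_count = 0
    followed_count = 0
    for start in range(len(tokens) - h + 1):
        if tokens[start:start + h] == history:
            history_count += 1
            if start + h < len(tokens) and tokens[start + h] == word:
                followed_count += 1
    # None signals that the history was never seen, so no estimate is possible
    return followed_count / history_count if history_count else None

# Tiny made-up corpus, for illustration only
tokens = ("its water is so transparent that the fish show ; "
          "its water is so transparent that you can see").split()
history = "its water is so transparent that".split()
print(relative_frequency(tokens, history, "the"))  # 0.5
```

The `None` case is exactly the problem raised next: for most long histories, even a very large corpus contains no occurrences to count.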
While this method of estimating probabilities directly from counts works fine in
many cases, it turns out that even the web isn’t big enough to give us good estimates
in most cases. This is because language is creative; new sentences are created all the
time, and we won’t always be able to count entire sentences. Even simple extensions
of the example sentence may have counts of zero on the web (such as “Walden
Pond’s water is so transparent that the”; well, used to have counts of zero).
The chain rule shows the link between computing the joint probability of a sequence
and computing the conditional probability of a word given previous words. Equa-
tion 3.4 suggests that we could estimate the joint probability of an entire sequence of
words by multiplying together a number of conditional probabilities. But using the
chain rule doesn’t really seem to help us! We don’t know any way to compute the
exact probability of a word given a long sequence of preceding words, $P(w_n \mid w_{1:n-1})$.
As we said above, we can’t just estimate by counting the number of times every word
occurs following every long string, because language is creative and any particular
context might have never occurred before!
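For reference, the chain rule of probability referred to here is the standard decomposition of a joint probability into a product of conditionals. Written in the $w_{1:n}$ notation of Eq. 3.7, and assuming this is the form that Eq. 3.4 takes, it reads:

```latex
P(w_{1:n}) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_{1:2}) \cdots P(w_n \mid w_{1:n-1})
           = \prod_{k=1}^{n} P(w_k \mid w_{1:k-1}) \tag{3.4}
```

Each factor conditions on the entire preceding sequence, which is why the later factors are the ones that cannot be estimated reliably by counting.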
The intuition of the n-gram model is that instead of computing the probability of
a word given its entire history, we can approximate the history by just the last few
words.
The bigram model, for example, approximates the probability of a word given
all the previous words $P(w_n \mid w_{1:n-1})$ by using only the conditional probability of the
preceding word $P(w_n \mid w_{n-1})$. In other words, instead of computing the probability
P(the|Walden Pond’s water is so transparent that) (3.5)
we approximate it with the probability
P(the|that) (3.6)
When we use a bigram model to predict the conditional probability of the next word,
we are thus making the following approximation:
$$P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-1}) \tag{3.7}$$
The assumption that the probability of a word depends only on the previous word is
called a Markov assumption. Markov models are the class of probabilistic models
that assume we can predict the probability of some future unit without looking too
far into the past. We can generalize the bigram (which looks one word into the past)
to the trigram (which looks two words into the past) and thus to the n-gram (which
looks n − 1 words into the past).
Thus, the general equation for this n-gram approximation to the conditional
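probability of the next word in a sequence is
$$P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1}) \tag{3.8}$$
where N is the n-gram size (N = 2 for bigrams, N = 3 for trigrams).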
Given the bigram assumption for the probability of an individual word, we can com-
pute the probability of a complete word sequence by substituting Eq. 3.7 into Eq. 3.4:
$$P(w_{1:n}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1}) \tag{3.9}$$
How do we estimate these bigram or n-gram probabilities? An intuitive way to
estimate probabilities is called maximum likelihood estimation or MLE. We get
the MLE estimate for the parameters of an n-gram model by getting counts from a
corpus, and normalizing the counts so that they lie between 0 and 1.¹
For example, to compute a particular bigram probability of a word y given a
previous word x, we’ll compute the count of the bigram C(xy) and normalize by the
sum of all the bigrams that share the same first word x:
$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)} \tag{3.10}$$
We can simplify this equation, since the sum of all bigram counts that start with
a given word $w_{n-1}$ must be equal to the unigram count for that word $w_{n-1}$ (the reader
should take a moment to be convinced of this):
$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})} \tag{3.11}$$
Let’s work through an example using a mini-corpus of three sentences. We’ll
first need to augment each sentence with a special symbol <s> at the beginning
of the sentence, to give us the bigram context of the first word. We’ll also need a
special end-symbol, </s>.²
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
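Here are the calculations for some of the bigram probabilities from this corpus:
$$P(\texttt{I} \mid \texttt{<s>}) = \tfrac{2}{3} = .67 \qquad P(\texttt{Sam} \mid \texttt{<s>}) = \tfrac{1}{3} = .33 \qquad P(\texttt{am} \mid \texttt{I}) = \tfrac{2}{3} = .67$$
$$P(\texttt{</s>} \mid \texttt{Sam}) = \tfrac{1}{2} = 0.5 \qquad P(\texttt{Sam} \mid \texttt{am}) = \tfrac{1}{2} = .5 \qquad P(\texttt{do} \mid \texttt{I}) = \tfrac{1}{3} = .33$$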
1 For probabilistic models, normalizing means dividing by some total count so that the resulting proba-
bilities fall legally between 0 and 1.
2 We need the end-symbol to make the bigram grammar a true probability distribution. Without an
end-symbol, the sentence probabilities for all sentences of a given length would sum to one. This model
would define an infinite set of probability distributions, with one distribution per sentence length. See
Exercise 3.5.
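The bigram calculations for the three-sentence mini-corpus can be reproduced with a few lines of Python. This is a minimal sketch of the MLE estimate of Eq. 3.11 and the sentence probability of Eq. 3.9; the function names `bigram_prob` and `sentence_prob` are our own, and no smoothing is applied, so a previous word that never occurs in the corpus would cause a division by zero.

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(word, prev):
    """MLE estimate P(word | prev) = C(prev word) / C(prev), as in Eq. 3.11."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """Product of bigram probabilities (Eq. 3.9) for a sentence that already includes <s> and </s>."""
    tokens = sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(word, prev)
    return prob

print(round(bigram_prob("I", "<s>"), 2))             # 0.67
print(round(bigram_prob("am", "I"), 2))              # 0.67
print(round(bigram_prob("</s>", "Sam"), 2))          # 0.5
print(round(sentence_prob("<s> I am Sam </s>"), 3))  # 0.111, i.e. 2/3 * 2/3 * 1/2 * 1/2
```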
Marriage and family are thus intimately connected with each
other: it is for the benefit of the young that male and female
continue to live together. Marriage is therefore rooted in family,
rather than family in marriage. There are also many peoples among
whom true conjugal life does not begin before a child is born, and
others who consider that the birth of a child out of wedlock makes it
obligatory for the parents to marry. Among the Eastern
Greenlanders102 and the Fuegians,103 marriage is not regarded as
complete till the woman has become a mother. Among the
Shawanese104 and Abipones,105 the wife very often remains at her
father’s house till she has a child. Among the Khyens, the Ainos of
Yesso, and one of the aboriginal tribes of China, the husband goes
to live with his wife at her father’s house, and never takes her away
till after the birth of a child.106 In Circassia, the bride and
bridegroom are kept apart until the first child is born;107 and among
the Bedouins of Mount Sinai, a wife never enters her husband’s tent
until she becomes far advanced in pregnancy.108 Among the Baele,
the wife remains with her parents until she becomes a mother, and if
this does not happen, she stays there for ever, the husband getting
back what he has paid for her.109 In Siam, a wife does not receive
her marriage portion before having given birth to a child;110 while
among the Atkha Aleuts, according to Erman, a husband does not
pay the purchase sum before he has become a father.111 Again, the
Badagas in Southern India have two marriage ceremonies, the
second of which does not take place till there is some indication that
the pair are to have a family; and if there is no appearance of this,
the couple not uncommonly separate.112 Dr. Bérenger-Féraud states
that, among the Wolofs in Senegambia, “it is only when the
signs of pregnancy are undeniable in the betrothed, sometimes
even only after the birth of one or more children, that
the marriage ceremony proper takes place.”113 And the
Igorrotes of Luzon consider no engagement binding until the woman
has become pregnant.114
On the other hand, Emin Pasha tells us that, among the Mádi in
Central Africa, “should a girl become pregnant, the youth who has
been her companion is bound to marry her, and to pay to her father
the customary price of a bride.”115 Burton reports a similar custom
as prevailing among peoples dwelling to the south of the equator.116
Among many of the wild tribes of Borneo, there is almost
unrestrained intercourse between the youth of both sexes; but, if
pregnancy ensue, marriage is regarded as necessary.117 The same,
as I am informed by Dr. A. Bunker, is the case with some Karen
tribes in Burma. In Tahiti, according to Cook, the father might kill his
natural child, but if he suffered it to live, the parties were considered
to be in the married state.118 Among the Tipperahs of the
Chittagong Hills,119 as well as the peasants of the Ukraine,120 a
seducer is bound to marry the girl, should she become pregnant.
Again, Mr. Powers informs us that, among the Californian Wintun, if a
wife is abandoned when she has a young child, she is justified by
her friends in destroying it on the ground that it has no supporter.121
And among the Creeks, a young woman that becomes pregnant by a
man whom she had expected to marry, and is disappointed, is
allowed the same privilege.122
It might, however, be supposed that, in man, the prolonged union
of the sexes is due to another cause besides the offspring’s want of
parental care, i.e., to the fact that the sexual instinct is not restricted
to any particular season, but endures throughout the whole year.
“That which distinguishes man from the beast,” Beaumarchais says,
“is drinking without being thirsty, and making love at all seasons.”
But in the next chapter, I shall endeavour to show that this is
probably not quite correct, so far as our earliest human or semi-
human ancestors are concerned.
CHAPTER II
A HUMAN PAIRING SEASON IN PRIMITIVE
TIMES