Speech and Language Processing
An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition

Third Edition draft

Daniel Jurafsky
Stanford University

James H. Martin
University of Colorado at Boulder

Copyright ©2020. All rights reserved.

Draft of December 30, 2020. Comments and typos welcome!


Summary of Contents
1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Regular Expressions, Text Normalization, Edit Distance . . . . . . . . . 2
3 N-gram Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 Naive Bayes and Sentiment Classification . . . . . . . . . . . . . . . . . . . . . . . 55
5 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6 Vector Semantics and Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7 Neural Networks and Neural Language Models . . . . . . . . . . . . . . . . . 127
8 Sequence Labeling for Parts of Speech and Named Entities . . . . . . 148
9 Deep Learning Architectures for Sequence Processing . . . . . . . . . . . 173
10 Contextual Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
11 Machine Translation and Encoder-Decoder Models . . . . . . . . . . . . . 203
12 Constituency Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
13 Constituency Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
14 Dependency Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
15 Logical Representations of Sentence Meaning . . . . . . . . . . . . . . . . . . . 305
16 Computational Semantics and Semantic Parsing . . . . . . . . . . . . . . . . 331
17 Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
18 Word Senses and WordNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
19 Semantic Role Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
20 Lexicons for Sentiment, Affect, and Connotation . . . . . . . . . . . . . . . . 393
21 Coreference Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
22 Discourse Coherence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
23 Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
24 Chatbots & Dialogue Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
25 Phonetics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
26 Automatic Speech Recognition and Text-to-Speech . . . . . . . . . . . . . . 548
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
Subject Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607

Contents
1 Introduction 1

2 Regular Expressions, Text Normalization, Edit Distance 2


2.1 Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Text Normalization . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Minimum Edit Distance . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 27
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3 N-gram Language Models 29


3.1 N-Grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Evaluating Language Models . . . . . . . . . . . . . . . . . . . . 35
3.3 Generalization and Zeros . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Kneser-Ney Smoothing . . . . . . . . . . . . . . . . . . . . . . . 45
3.6 Huge Language Models and Stupid Backoff . . . . . . . . . . . . 47
3.7 Advanced: Perplexity’s Relation to Entropy . . . . . . . . . . . . 49
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 52
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4 Naive Bayes and Sentiment Classification 55


4.1 Naive Bayes Classifiers . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Training the Naive Bayes Classifier . . . . . . . . . . . . . . . . . 59
4.3 Worked example . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Optimizing for Sentiment Analysis . . . . . . . . . . . . . . . . . 61
4.5 Naive Bayes for other text classification tasks . . . . . . . . . . . 63
4.6 Naive Bayes as a Language Model . . . . . . . . . . . . . . . . . 64
4.7 Evaluation: Precision, Recall, F-measure . . . . . . . . . . . . . . 65
4.8 Test sets and Cross-validation . . . . . . . . . . . . . . . . . . . . 67
4.9 Statistical Significance Testing . . . . . . . . . . . . . . . . . . . 69
4.10 Avoiding Harms in Classification . . . . . . . . . . . . . . . . . . 72
4.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 73
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5 Logistic Regression 76
5.1 Classification: the sigmoid . . . . . . . . . . . . . . . . . . . . . 77
5.2 Learning in Logistic Regression . . . . . . . . . . . . . . . . . . . 81
5.3 The cross-entropy loss function . . . . . . . . . . . . . . . . . . . 82
5.4 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.6 Multinomial logistic regression . . . . . . . . . . . . . . . . . . . 90
5.7 Interpreting models . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.8 Advanced: Deriving the Gradient Equation . . . . . . . . . . . . . 93
5.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94


Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 94


Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

6 Vector Semantics and Embeddings 96


6.1 Lexical Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2 Vector Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3 Words and Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4 Cosine for measuring similarity . . . . . . . . . . . . . . . . . . . 105
6.5 TF-IDF: Weighing terms in the vector . . . . . . . . . . . . . . . 106
6.6 Pointwise Mutual Information (PMI) . . . . . . . . . . . . . . . . 109
6.7 Applications of the tf-idf or PPMI vector models . . . . . . . . . . 111
6.8 Word2vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.9 Visualizing Embeddings . . . . . . . . . . . . . . . . . . . . . . . 118
6.10 Semantic properties of embeddings . . . . . . . . . . . . . . . . . 118
6.11 Bias and Embeddings . . . . . . . . . . . . . . . . . . . . . . . . 120
6.12 Evaluating Vector Models . . . . . . . . . . . . . . . . . . . . . . 122
6.13 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 123
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

7 Neural Networks and Neural Language Models 127


7.1 Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.2 The XOR problem . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.3 Feed-Forward Neural Networks . . . . . . . . . . . . . . . . . . . 133
7.4 Training Neural Nets . . . . . . . . . . . . . . . . . . . . . . . . 137
7.5 Neural Language Models . . . . . . . . . . . . . . . . . . . . . . 142
7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 147

8 Sequence Labeling for Parts of Speech and Named Entities 148


8.1 (Mostly) English Word Classes . . . . . . . . . . . . . . . . . . . 149
8.2 Part-of-Speech Tagging . . . . . . . . . . . . . . . . . . . . . . . 151
8.3 Named Entities and Named Entity Tagging . . . . . . . . . . . . . 153
8.4 HMM Part-of-Speech Tagging . . . . . . . . . . . . . . . . . . . 155
8.5 Conditional Random Fields (CRFs) . . . . . . . . . . . . . . . . . 162
8.6 Evaluation of Named Entity Recognition . . . . . . . . . . . . . . 167
8.7 Further Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 170
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

9 Deep Learning Architectures for Sequence Processing 173


9.1 Language Models Revisited . . . . . . . . . . . . . . . . . . . . . 174
9.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . 176
9.3 Managing Context in RNNs: LSTMs and GRUs . . . . . . . . . . 186
9.4 Self-Attention Networks: Transformers . . . . . . . . . . . . . . . 190
9.5 Potential Harms from Language Models . . . . . . . . . . . . . . 198
9.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 200

10 Contextual Embeddings 202



11 Machine Translation and Encoder-Decoder Models 203


11.1 Language Divergences and Typology . . . . . . . . . . . . . . . . 205
11.2 The Encoder-Decoder Model . . . . . . . . . . . . . . . . . . . . 208
11.3 Encoder-Decoder with RNNs . . . . . . . . . . . . . . . . . . . . 209
11.4 Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
11.5 Beam Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
11.6 Encoder-Decoder with Transformers . . . . . . . . . . . . . . . . 217
11.7 Some practical details on building MT systems . . . . . . . . . . . 218
11.8 MT Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
11.9 Bias and Ethical Issues . . . . . . . . . . . . . . . . . . . . . . . 226
11.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 228
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

12 Constituency Grammars 231


12.1 Constituency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
12.2 Context-Free Grammars . . . . . . . . . . . . . . . . . . . . . . . 232
12.3 Some Grammar Rules for English . . . . . . . . . . . . . . . . . . 237
12.4 Treebanks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
12.5 Grammar Equivalence and Normal Form . . . . . . . . . . . . . . 249
12.6 Lexicalized Grammars . . . . . . . . . . . . . . . . . . . . . . . . 250
12.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 256
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257

13 Constituency Parsing 259


13.1 Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
13.2 CKY Parsing: A Dynamic Programming Approach . . . . . . . . 261
13.3 Span-Based Neural Constituency Parsing . . . . . . . . . . . . . . 267
13.4 Evaluating Parsers . . . . . . . . . . . . . . . . . . . . . . . . . . 269
13.5 Partial Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
13.6 CCG Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
13.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 278
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

14 Dependency Parsing 280


14.1 Dependency Relations . . . . . . . . . . . . . . . . . . . . . . . . 281
14.2 Dependency Formalisms . . . . . . . . . . . . . . . . . . . . . . . 283
14.3 Dependency Treebanks . . . . . . . . . . . . . . . . . . . . . . . 284
14.4 Transition-Based Dependency Parsing . . . . . . . . . . . . . . . 285
14.5 Graph-Based Dependency Parsing . . . . . . . . . . . . . . . . . 296
14.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
14.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 302
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

15 Logical Representations of Sentence Meaning 305


15.1 Computational Desiderata for Representations . . . . . . . . . . . 306
15.2 Model-Theoretic Semantics . . . . . . . . . . . . . . . . . . . . . 308
15.3 First-Order Logic . . . . . . . . . . . . . . . . . . . . . . . . . . 311
15.4 Event and State Representations . . . . . . . . . . . . . . . . . . . 318

15.5 Description Logics . . . . . . . . . . . . . . . . . . . . . . . . . . 323


15.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 329
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330

16 Computational Semantics and Semantic Parsing 331

17 Information Extraction 332


17.1 Relation Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 333
17.2 Relation Extraction Algorithms . . . . . . . . . . . . . . . . . . . 336
17.3 Extracting Times . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
17.4 Extracting Events and their Times . . . . . . . . . . . . . . . . . . 348
17.5 Template Filling . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
17.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 353
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354

18 Word Senses and WordNet 355


18.1 Word Senses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
18.2 Relations Between Senses . . . . . . . . . . . . . . . . . . . . . . 358
18.3 WordNet: A Database of Lexical Relations . . . . . . . . . . . . . 360
18.4 Word Sense Disambiguation . . . . . . . . . . . . . . . . . . . . . 363
18.5 Alternate WSD algorithms and Tasks . . . . . . . . . . . . . . . . 366
18.6 Using Thesauruses to Improve Embeddings . . . . . . . . . . . . 369
18.7 Word Sense Induction . . . . . . . . . . . . . . . . . . . . . . . . 369
18.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 371
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372

19 Semantic Role Labeling 373


19.1 Semantic Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
19.2 Diathesis Alternations . . . . . . . . . . . . . . . . . . . . . . . . 375
19.3 Semantic Roles: Problems with Thematic Roles . . . . . . . . . . 376
19.4 The Proposition Bank . . . . . . . . . . . . . . . . . . . . . . . . 377
19.5 FrameNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
19.6 Semantic Role Labeling . . . . . . . . . . . . . . . . . . . . . . . 380
19.7 Selectional Restrictions . . . . . . . . . . . . . . . . . . . . . . . 384
19.8 Primitive Decomposition of Predicates . . . . . . . . . . . . . . . 389
19.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 390
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392

20 Lexicons for Sentiment, Affect, and Connotation 393


20.1 Defining Emotion . . . . . . . . . . . . . . . . . . . . . . . . . . 394
20.2 Available Sentiment and Affect Lexicons . . . . . . . . . . . . . . 395
20.3 Creating Affect Lexicons by Human Labeling . . . . . . . . . . . 398
20.4 Semi-supervised Induction of Affect Lexicons . . . . . . . . . . . 399
20.5 Supervised Learning of Word Sentiment . . . . . . . . . . . . . . 402
20.6 Using Lexicons for Sentiment Recognition . . . . . . . . . . . . . 406
20.7 Other tasks: Personality . . . . . . . . . . . . . . . . . . . . . . . 407
20.8 Affect Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 408
20.9 Lexicon-based methods for Entity-Centric Affect . . . . . . . . . . 410

20.10 Connotation Frames . . . . . . . . . . . . . . . . . . . . . . . . . 411


20.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 413

21 Coreference Resolution 415


21.1 Coreference Phenomena: Linguistic Background . . . . . . . . . . 418
21.2 Coreference Tasks and Datasets . . . . . . . . . . . . . . . . . . . 423
21.3 Mention Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 424
21.4 Architectures for Coreference Algorithms . . . . . . . . . . . . . 427
21.5 Classifiers using hand-built features . . . . . . . . . . . . . . . . . 429
21.6 A neural mention-ranking algorithm . . . . . . . . . . . . . . . . 430
21.7 Evaluation of Coreference Resolution . . . . . . . . . . . . . . . . 434
21.8 Winograd Schema problems . . . . . . . . . . . . . . . . . . . . . 435
21.9 Gender Bias in Coreference . . . . . . . . . . . . . . . . . . . . . 436
21.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 438
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441

22 Discourse Coherence 442


22.1 Coherence Relations . . . . . . . . . . . . . . . . . . . . . . . . . 444
22.2 Discourse Structure Parsing . . . . . . . . . . . . . . . . . . . . . 447
22.3 Centering and Entity-Based Coherence . . . . . . . . . . . . . . . 451
22.4 Representation learning models for local coherence . . . . . . . . 456
22.5 Global Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . 458
22.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 461
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463

23 Question Answering 464


23.1 Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . 465
23.2 IR-based Factoid Question Answering . . . . . . . . . . . . . . . 473
23.3 Entity Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
23.4 Knowledge-based Question Answering . . . . . . . . . . . . . . . 482
23.5 Using Language Models to do QA . . . . . . . . . . . . . . . . . 484
23.6 Classic QA Models . . . . . . . . . . . . . . . . . . . . . . . . . 485
23.7 Evaluation of Factoid Answers . . . . . . . . . . . . . . . . . . . 488
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 489
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491

24 Chatbots & Dialogue Systems 492


24.1 Properties of Human Conversation . . . . . . . . . . . . . . . . . 493
24.2 Chatbots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
24.3 GUS: Simple Frame-based Dialogue Systems . . . . . . . . . . . 504
24.4 The Dialogue-State Architecture . . . . . . . . . . . . . . . . . . 508
24.5 Evaluating Dialogue Systems . . . . . . . . . . . . . . . . . . . . 517
24.6 Dialogue System Design . . . . . . . . . . . . . . . . . . . . . . . 520
24.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 523
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525

25 Phonetics 526
25.1 Speech Sounds and Phonetic Transcription . . . . . . . . . . . . . 526

25.2 Articulatory Phonetics . . . . . . . . . . . . . . . . . . . . . . . . 527


25.3 Prosody . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
25.4 Acoustic Phonetics and Signals . . . . . . . . . . . . . . . . . . . 534
25.5 Phonetic Resources . . . . . . . . . . . . . . . . . . . . . . . . . 544
25.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 546
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547

26 Automatic Speech Recognition and Text-to-Speech 548


26.1 The Automatic Speech Recognition Task . . . . . . . . . . . . . . 549
26.2 Feature Extraction for ASR: Log Mel Spectrum . . . . . . . . . . 551
26.3 Speech Recognition Architecture . . . . . . . . . . . . . . . . . . 555
26.4 CTC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
26.5 ASR Evaluation: Word Error Rate . . . . . . . . . . . . . . . . . 562
26.6 TTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
26.7 Other Speech Tasks . . . . . . . . . . . . . . . . . . . . . . . . . 569
26.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 570
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573

Bibliography 575
Subject Index 607
CHAPTER 1
Introduction

La dernière chose qu’on trouve en faisant un ouvrage est de savoir celle qu’il faut
mettre la première.
[The last thing you figure out in writing a book is what to put first.]
Pascal


CHAPTER 2
Regular Expressions, Text Normalization, Edit Distance
User: I am unhappy.
ELIZA: DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
User: I need some help, that much seems certain.
ELIZA: WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
User: Perhaps I could learn to get along with my mother.
ELIZA: TELL ME MORE ABOUT YOUR FAMILY
User: My mother takes care of me.
ELIZA: WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
User: My father.
ELIZA: YOUR FATHER
User: You are like my father in some ways.
Weizenbaum (1966)
The dialogue above is from ELIZA, an early natural language processing system
that could carry on a limited conversation with a user by imitating the responses of
a Rogerian psychotherapist (Weizenbaum, 1966). ELIZA is a surprisingly simple
program that uses pattern matching to recognize phrases like “I need X” and translate
them into suitable outputs like “What would it mean to you if you got X?”. This
simple technique succeeds in this domain because ELIZA doesn’t actually need to
know anything to mimic a Rogerian psychotherapist. As Weizenbaum notes, this is
one of the few dialogue genres where listeners can act as if they know nothing of the
world. Eliza’s mimicry of human conversation was remarkably successful: many
people who interacted with ELIZA came to believe that it really understood them
and their problems, many continued to believe in ELIZA’s abilities even after the
program’s operation was explained to them (Weizenbaum, 1976), and even today
such chatbots are a fun diversion.
Of course modern conversational agents are much more than a diversion; they
can answer questions, book flights, or find restaurants, functions for which they rely
on a much more sophisticated understanding of the user’s intent, as we will see in
Chapter 24. Nonetheless, the simple pattern-based methods that powered ELIZA
and other chatbots play a crucial role in natural language processing.
We’ll begin with the most important tool for describing text patterns: the regular
expression. Regular expressions can be used to specify strings we might want to
extract from a document, from transforming “I need X” in Eliza above, to defining
strings like $199 or $24.99 for extracting tables of prices from a document.
We’ll then turn to a set of tasks collectively called text normalization, in which
regular expressions play an important part. Normalizing text means converting it
to a more convenient, standard form. For example, most of what we are going to
do with language relies on first separating out or tokenizing words from running
text, the task of tokenization. English words are often separated from each other
by whitespace, but whitespace is not always sufficient. New York and rock ’n’ roll
are sometimes treated as large words despite the fact that they contain spaces, while
sometimes we’ll need to separate I’m into the two words I and am. For processing
tweets or texts we’ll need to tokenize emoticons like :) or hashtags like #nlproc.

Some languages, like Japanese, don’t have spaces between words, so word tokeniza-
tion becomes more difficult.
Another part of text normalization is lemmatization, the task of determining
that two words have the same root, despite their surface differences. For example,
the words sang, sung, and sings are forms of the verb sing. The word sing is the
common lemma of these words, and a lemmatizer maps from all of these to sing.
Lemmatization is essential for processing morphologically complex languages like
Arabic. Stemming refers to a simpler version of lemmatization in which we mainly
just strip suffixes from the end of the word. Text normalization also includes sentence
segmentation: breaking up a text into individual sentences, using cues like
periods or exclamation points.
Finally, we’ll need to compare words and other strings. We’ll introduce a metric
called edit distance that measures how similar two strings are based on the number
of edits (insertions, deletions, substitutions) it takes to change one string into the
other. Edit distance is an algorithm with applications throughout language process-
ing, from spelling correction to speech recognition to coreference resolution.

2.1 Regular Expressions


One of the unsung successes in standardization in computer science has been the
regular expression (RE), a language for specifying text search strings. This prac-
tical language is used in every computer language, word processor, and text pro-
cessing tools like the Unix tools grep or Emacs. Formally, a regular expression is
an algebraic notation for characterizing a set of strings. They are particularly use-
ful for searching in texts, when we have a pattern to search for and a corpus of
texts to search through. A regular expression search function will search through the
corpus, returning all texts that match the pattern. The corpus can be a single docu-
ment or a collection. For example, the Unix command-line tool grep takes a regular
expression and returns every line of the input document that matches the expression.
A search can be designed to return every match on a line, if there are more than
one, or just the first match. In the following examples we generally underline the
exact part of the pattern that matches the regular expression and show only the first
match. We’ll show regular expressions delimited by slashes but note that slashes are
not part of the regular expressions.
Regular expressions come in many variants. We’ll be describing extended regu-
lar expressions; different regular expression parsers may only recognize subsets of
these, or treat some expressions slightly differently. Using an online regular expres-
sion tester is a handy way to test out your expressions and explore these variations.

2.1.1 Basic Regular Expression Patterns


The simplest kind of regular expression is a sequence of simple characters. To search
for woodchuck, we type /woodchuck/. The expression /Buttercup/ matches any
string containing the substring Buttercup; grep with that expression would return the
line I’m called little Buttercup. The search string can consist of a single character
(like /!/) or a sequence of characters (like /urgl/).
Regular expressions are case sensitive; lower case /s/ is distinct from upper
case /S/ (/s/ matches a lower case s but not an upper case S). This means that
the pattern /woodchucks/ will not match the string Woodchucks. We can solve this

RE Example Patterns Matched


/woodchucks/ “interesting links to woodchucks and lemurs”
/a/ “Mary Ann stopped by Mona’s”
/!/ “You’ve left the burglar behind again!” said Nori
Figure 2.1 Some simple regex searches.

problem with the use of the square braces [ and ]. The string of characters inside the
braces specifies a disjunction of characters to match. For example, Fig. 2.2 shows
that the pattern /[wW]/ matches patterns containing either w or W.

RE Match Example Patterns


/[wW]oodchuck/ Woodchuck or woodchuck “Woodchuck”
/[abc]/ ‘a’, ‘b’, or ‘c’ “In uomini, in soldati”
/[1234567890]/ any digit “plenty of 7 to 5”
Figure 2.2 The use of the brackets [] to specify a disjunction of characters.

The regular expression /[1234567890]/ specifies any single digit. While such
classes of characters as digits or letters are important building blocks in expressions,
they can get awkward (e.g., it’s inconvenient to specify
/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/
to mean “any capital letter”). In cases where there is a well-defined sequence asso-
ciated with a set of characters, the brackets can be used with the dash (-) to specify
any one character in a range. The pattern /[2-5]/ specifies any one of the charac-
ters 2, 3, 4, or 5. The pattern /[b-g]/ specifies one of the characters b, c, d, e, f, or
g. Some other examples are shown in Fig. 2.3.

RE Match Example Patterns Matched


/[A-Z]/ an upper case letter “we should call it ‘Drenched Blossoms’ ”
/[a-z]/ a lower case letter “my beans were impatient to be hoed!”
/[0-9]/ a single digit “Chapter 1: Down the Rabbit Hole”
Figure 2.3 The use of the brackets [] plus the dash - to specify a range.
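The bracketed patterns above can be tried out directly. Here is a minimal sketch using Python's re module (one implementation among many; recall that the delimiting slashes are our notation and are not typed):

import re

# Disjunction of characters: /[wW]oodchuck/ matches either capitalization.
print(re.search(r'[wW]oodchuck', 'interesting links to Woodchucks and lemurs'))
# <re.Match ...  match='Woodchuck'>

# A range: /[2-5]/ matches any one of the characters 2, 3, 4, or 5.
print(re.findall(r'[2-5]', 'plenty of 7 to 5'))                 # ['5']

# /[a-z]/ matches any single lower-case letter.
print(re.findall(r'[a-z]', 'Chapter 1: Down the Rabbit Hole')[:6])
# ['h', 'a', 'p', 't', 'e', 'r']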

The square braces can also be used to specify what a single character cannot be,
by use of the caret ˆ. If the caret ˆ is the first symbol after the open square brace [,
the resulting pattern is negated. For example, the pattern /[ˆa]/ matches any single
character (including special characters) except a. This is only true when the caret
is the first symbol after the open square brace. If it occurs anywhere else, it usually
stands for a caret; Fig. 2.4 shows some examples.

RE Match (single characters) Example Patterns Matched


/[ˆA-Z]/ not an upper case letter “Oyfn pripetchik”
/[ˆSs]/ neither ‘S’ nor ‘s’ “I have no exquisite reason for’t”
/[ˆ.]/ not a period “our resident Djinn”
/[eˆ]/ either ‘e’ or ‘ˆ’ “look up ˆ now”
/aˆb/ the pattern ‘aˆb’ “look up aˆ b now”
Figure 2.4 The caret ˆ for negation or just to mean ˆ. See below re: the backslash for escaping the period.

How can we talk about optional elements, like an optional s in woodchuck and
woodchucks? We can’t use the square brackets, because while they allow us to say
“s or S”, they don’t allow us to say “s or nothing”. For this we use the question mark
/?/, which means “the preceding character or nothing”, as shown in Fig. 2.5.

RE Match Example Patterns Matched


/woodchucks?/ woodchuck or woodchucks “woodchuck”
/colou?r/ color or colour “color”
Figure 2.5 The question mark ? marks optionality of the previous expression.

We can think of the question mark as meaning “zero or one instances of the
previous character”. That is, it’s a way of specifying how many of something that
we want, something that is very important in regular expressions. For example,
consider the language of certain sheep, which consists of strings that look like the
following:
baa!
baaa!
baaaa!
baaaaa!
...
This language consists of strings with a b, followed by at least two a’s, followed
by an exclamation point. The set of operators that allows us to say things like “some
number of as” are based on the asterisk or *, commonly called the Kleene * (gen-
erally pronounced “cleany star”). The Kleene star means “zero or more occurrences
of the immediately previous character or regular expression”. So /a*/ means “any
string of zero or more as”. This will match a or aaaaaa, but it will also match Off
Minor since the string Off Minor has zero a’s. So the regular expression for matching
one or more a is /aa*/, meaning one a followed by zero or more as. More complex
patterns can also be repeated. So /[ab]*/ means “zero or more a’s or b’s” (not
“zero or more right square braces”). This will match strings like aaaa or ababab or
bbbb.
For specifying multiple digits (useful for finding prices) we can extend /[0-9]/,
the regular expression for a single digit. An integer (a string of digits) is thus
/[0-9][0-9]*/. (Why isn’t it just /[0-9]*/?)
Sometimes it’s annoying to have to write the regular expression for digits twice,
so there is a shorter way to specify “at least one” of some character. This is the
Kleene +, which means “one or more occurrences of the immediately preceding
character or regular expression”. Thus, the expression /[0-9]+/ is the normal way
to specify “a sequence of digits”. There are thus two ways to specify the sheep
language: /baaa*!/ or /baa+!/.
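A small sketch (again assuming Python's re) confirms that the two patterns describe the same sheep language, and shows the optional ? of Fig. 2.5 alongside them:

import re

bleats = ['baa!', 'baaa!', 'baaaa!', 'ba!', 'b!']

# /baaa*!/ and /baa+!/ both require a b, at least two a's, then !
for pattern in [r'baaa*!', r'baa+!']:
    print(pattern, [s for s in bleats if re.fullmatch(pattern, s)])
# both lines print ['baa!', 'baaa!', 'baaaa!']

# ? makes the previous expression optional.
print(re.findall(r'woodchucks?', 'one woodchuck, two woodchucks'))
# ['woodchuck', 'woodchucks']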
One very important special character is the period (/./), a wildcard expression
that matches any single character (except a carriage return), as shown in Fig. 2.6.

RE Match Example Matches


/beg.n/ any character between beg and n begin, beg’n, begun
Figure 2.6 The use of the period . to specify any character.

The wildcard is often used together with the Kleene star to mean “any string of
characters”. For example, suppose we want to find any line in which a particular
word, for example, aardvark, appears twice. We can specify this with the regular
expression /aardvark.*aardvark/.
Anchors are special characters that anchor regular expressions to particular places
in a string. The most common anchors are the caret ˆ and the dollar sign $. The caret
ˆ matches the start of a line. The pattern /ˆThe/ matches the word The only at the

start of a line. Thus, the caret ˆ has three uses: to match the start of a line, to in-
dicate a negation inside of square brackets, and just to mean a caret. (What are the
contexts that allow grep or Python to know which function a given caret is supposed
to have?) The dollar sign $ matches the end of a line. So the pattern / $/ (a space followed by $) is a useful
pattern for matching a space at the end of a line, and /ˆThe dog\.$/ matches a
line that contains only the phrase The dog. (We have to use the backslash here since
we want the . to mean “period” and not the wildcard.)

RE Match
ˆ start of line
$ end of line
\b word boundary
\B non-word boundary
Figure 2.7 Anchors in regular expressions.

There are also two other anchors: \b matches a word boundary, and \B matches
a non-boundary. Thus, /\bthe\b/ matches the word the but not the word other.
More technically, a “word” for the purposes of a regular expression is defined as any
sequence of digits, underscores, or letters; this is based on the definition of “words”
in programming languages. For example, /\b99\b/ will match the string 99 in
There are 99 bottles of beer on the wall (because 99 follows a space) but not 99 in
There are 299 bottles of beer on the wall (since 99 follows a number). But it will
match 99 in $99 (since 99 follows a dollar sign ($), which is not a digit, underscore,
or letter).
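The anchor and boundary behavior just described can be checked with a short sketch (Python's re assumed; note that in code we type the ordinary caret ^):

import re

# ^ anchors a pattern to the start of a line, $ to the end.
print(bool(re.search(r'^The', 'The dog chased the cat')))   # True
print(bool(re.search(r'^The dog\.$', 'The dog.')))          # True

# \b requires a word boundary: /\bthe\b/ finds "the" but not "other".
print(re.findall(r'\bthe\b', 'the other one'))              # ['the']

# /\b99\b/ matches 99 and $99, but not the 99 inside 299.
for text in ['There are 99 bottles', 'There are 299 bottles', 'it costs $99']:
    print(text, bool(re.search(r'\b99\b', text)))           # True, False, True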

2.1.2 Disjunction, Grouping, and Precedence


Suppose we need to search for texts about pets; perhaps we are particularly interested
in cats and dogs. In such a case, we might want to search for either the string cat or
the string dog. Since we can’t use the square brackets to search for “cat or dog” (why
can’t we say /[catdog]/?), we need a new operator, the disjunction operator, also
called the pipe symbol |. The pattern /cat|dog/ matches either the string cat or
the string dog.
Sometimes we need to use this disjunction operator in the midst of a larger se-
quence. For example, suppose I want to search for information about pet fish for
my cousin David. How can I specify both guppy and guppies? We cannot simply
say /guppy|ies/, because that would match only the strings guppy and ies. This
is because sequences like guppy take precedence over the disjunction operator |.
To make the disjunction operator apply only to a specific pattern, we need to use the
parenthesis operators ( and ). Enclosing a pattern in parentheses makes it act like
a single character for the purposes of neighboring operators like the pipe | and the
Kleene*. So the pattern /gupp(y|ies)/ would specify that we meant the disjunc-
tion only to apply to the suffixes y and ies.
The parenthesis operator ( is also useful when we are using counters like the
Kleene*. Unlike the | operator, the Kleene* operator applies by default only to
a single character, not to a whole sequence. Suppose we want to match repeated
instances of a string. Perhaps we have a line that has column labels of the form
Column 1 Column 2 Column 3. The expression /Column [0-9]+ */ will not
match any number of columns; instead, it will match a single column followed by
any number of spaces! The star here applies only to the space that precedes it,
not to the whole sequence. With the parentheses, we could write the expression

/(Column [0-9]+ *)*/ to match the word Column, followed by a number and
optional spaces, the whole pattern repeated zero or more times.
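A brief sketch makes the difference concrete (Python's re assumed):

import re

line = 'Column 1 Column 2 Column 3'

# Without parentheses, the * applies only to the space that precedes it.
print(re.findall(r'Column [0-9]+ *', line))
# ['Column 1 ', 'Column 2 ', 'Column 3']

# With parentheses, the whole group repeats, matching the entire row at once.
print(re.search(r'(Column [0-9]+ *)*', line).group(0))
# 'Column 1 Column 2 Column 3'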
This idea that one operator may take precedence over another, requiring us to
sometimes use parentheses to specify what we mean, is formalized by the operator
precedence hierarchy for regular expressions. The following table gives the order
of RE operator precedence, from highest precedence to lowest precedence.
Parenthesis ()
Counters * + ? {}
Sequences and anchors the ˆmy end$
Disjunction |
Thus, because counters have a higher precedence than sequences,
/the*/ matches theeeee but not thethe. Because sequences have a higher prece-
dence than disjunction, /the|any/ matches the or any but not thany or theny.
Patterns can be ambiguous in another way. Consider the expression /[a-z]*/
when matching against the text once upon a time. Since /[a-z]*/ matches zero or
more letters, this expression could match nothing, or just the first letter o, on, onc,
or once. In these cases regular expressions always match the largest string they can;
we say that patterns are greedy, expanding to cover as much of a string as they can.
There are, however, ways to enforce non-greedy matching, using another mean-
ing of the ? qualifier. The operator *? is a Kleene star that matches as little text as
possible. The operator +? is a Kleene plus that matches as little text as possible.
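For example (a sketch in Python's re, which uses the same *? notation):

import re

text = 'once upon a time'

# Greedy: [a-z]* expands to cover as many letters as it can.
print(re.match(r'[a-z]*', text).group(0))     # 'once'

# Non-greedy: [a-z]*? matches as little text as possible (here, nothing).
print(re.match(r'[a-z]*?', text).group(0))    # ''

# With a following constraint, the non-greedy version stops at the first chance.
print(re.match(r'o.*n', text).group(0))       # 'once upon'
print(re.match(r'o.*?n', text).group(0))      # 'on'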

2.1.3 A Simple Example


Suppose we wanted to write a RE to find cases of the English article the. A simple
(but incorrect) pattern might be:
/the/
One problem is that this pattern will miss the word when it begins a sentence and
hence is capitalized (i.e., The). This might lead us to the following pattern:
/[tT]he/
But we will still incorrectly return texts with the embedded in other words (e.g.,
other or theology). So we need to specify that we want instances with a word bound-
ary on both sides:
/\b[tT]he\b/
Suppose we wanted to do this without the use of /\b/. We might want this since
/\b/ won’t treat underscores and numbers as word boundaries; but we might want
to find the in some context where it might also have underlines or numbers nearby
(the_ or the25). We need to specify that we want instances in which there are no
alphabetic letters on either side of the the:
/[ˆa-zA-Z][tT]he[ˆa-zA-Z]/
But there is still one more problem with this pattern: it won’t find the word the
when it begins a line. This is because the regular expression [ˆa-zA-Z], which
we used to avoid embedded instances of the, implies that there must be some single
(although non-alphabetic) character before the the. We can avoid this by specify-
ing that before the the we require either the beginning-of-line or a non-alphabetic
character, and the same at the end of the line:

/(ˆ|[ˆa-zA-Z])[tT]he([ˆa-zA-Z]|$)/

The process we just went through was based on fixing two kinds of errors: false
positives, strings that we incorrectly matched like other or there, and false nega-
tives, strings that we incorrectly missed, like The. Addressing these two kinds of
errors comes up again and again in implementing speech and language processing
systems. Reducing the overall error rate for an application thus involves two antag-
onistic efforts:
• Increasing precision (minimizing false positives)
• Increasing recall (minimizing false negatives)
We’ll come back to precision and recall with more precise definitions in Chapter 4.
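To see the false positives and false negatives directly, we can run the four candidate patterns side by side (a minimal sketch in Python's re, with a few invented test strings):

import re

tests = ['the cat', 'The cat', 'in other words', 'the_25 files']

patterns = [
    r'the',                                # false negative: misses "The"
    r'[tT]he',                             # false positive: matches the "the" in "other"
    r'\b[tT]he\b',                         # \b treats _ and digits as word characters,
                                           #   so this one misses "the_25"
    r'(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)',   # the final pattern derived above
]

for pattern in patterns:
    print(pattern, [t for t in tests if re.search(pattern, t)])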

2.1.4 More Operators


Figure 2.8 shows some aliases for common ranges, which can be used mainly to
save typing. Besides the Kleene * and Kleene + we can also use explicit numbers as
counters, by enclosing them in curly brackets. The regular expression /{3}/ means
“exactly 3 occurrences of the previous character or expression”. So /a\.{24}z/
will match a followed by 24 dots followed by z (but not a followed by 23 or 25 dots
followed by a z).

RE Expansion Match First Matches


\d [0-9] any digit Party of 5
\D [ˆ0-9] any non-digit Blue moon
\w [a-zA-Z0-9_] any alphanumeric/underscore Daiyu
\W [ˆ\w] a non-alphanumeric !!!!
\s [ \r\t\n\f] whitespace (space, tab)
\S [ˆ\s] Non-whitespace in Concord
Figure 2.8 Aliases for common sets of characters.

A range of numbers can also be specified. So /{n,m}/ specifies from n to m


occurrences of the previous char or expression, and /{n,}/ means at least n occur-
rences of the previous expression. REs for counting are summarized in Fig. 2.9.

RE Match
* zero or more occurrences of the previous char or expression
+ one or more occurrences of the previous char or expression
? exactly zero or one occurrence of the previous char or expression
{n} n occurrences of the previous char or expression
{n,m} from n to m occurrences of the previous char or expression
{n,} at least n occurrences of the previous char or expression
{,m} up to m occurrences of the previous char or expression
Figure 2.9 Regular expression operators for counting.
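For instance (a small sketch in Python's re, using a three-dot version of the example above to keep it short):

import re

# {3} means exactly three occurrences of the previous expression.
print(bool(re.search(r'a\.{3}z', 'a...z')))   # True
print(bool(re.search(r'a\.{3}z', 'a..z')))    # False

# {2,4} means from two to four occurrences; counters are greedy.
print(re.findall(r'[0-9]{2,4}', 'year 2020, room 42, id 12345'))
# ['2020', '42', '1234']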

Finally, certain special characters are referred to by special notation based on the
backslash (\) (see Fig. 2.10). The most common of these are the newline character
\n and the tab character \t. To refer to characters that are special themselves (like
., *, [, and \), precede them with a backslash, (i.e., /\./, /\*/, /\[/, and /\\/).

RE Match First Patterns Matched


\* an asterisk “*” “K*A*P*L*A*N”
\. a period “.” “Dr. Livingston, I presume”
\? a question mark “Why don’t they come and lend a hand?”
\n a newline
\t a tab
Figure 2.10 Some characters that need to be backslashed.

2.1.5 A More Complex Example


Let’s try out a more significant example of the power of REs. Suppose we want to
build an application to help a user buy a computer on the Web. The user might want
“any machine with at least 6 GHz and 500 GB of disk space for less than $1000”.
To do this kind of retrieval, we first need to be able to look for expressions like 6
GHz or 500 GB or Mac or $999.99. In the rest of this section we’ll work out some
simple regular expressions for this task.
First, let’s complete our regular expression for prices. Here’s a regular expres-
sion for a dollar sign followed by a string of digits:
/$[0-9]+/
Note that the $ character has a different function here than the end-of-line function
we discussed earlier. Most regular expression parsers are smart enough to realize
that $ here doesn’t mean end-of-line. (As a thought experiment, think about how
regex parsers might figure out the function of $ from the context.)
Now we just need to deal with fractions of dollars. We’ll add a decimal point
and two digits afterwards:
/$[0-9]+\.[0-9][0-9]/
This pattern only allows $199.99 but not $199. We need to make the cents
optional and to make sure we’re at a word boundary:
/(ˆ|\W)$[0-9]+(\.[0-9][0-9])?\b/
One last catch! This pattern allows prices like $199999.99 which would be far
too expensive! We need to limit the dollars:
/(ˆ|\W)$[0-9]{0,3}(\.[0-9][0-9])?\b/
How about disk space? We’ll need to allow for optional fractions again (5.5 GB);
note the use of ? for making the final s optional, and the use of / */ to mean “zero or
more spaces” since there might always be extra spaces lying around:
/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/
Modifying this regular expression so that it only matches more than 500 GB is
left as an exercise for the reader.
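Putting the price and disk-space patterns to work might look like the following sketch (Python's re assumed; Python is not one of the "smart" parsers mentioned above, so the literal dollar sign must be written \$, and the example ad text is invented):

import re

price = r'(^|\W)\$[0-9]{0,3}(\.[0-9][0-9])?\b'
disk  = r'\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b'

ad = 'Now only $799.99! Includes 500 GB of disk and 5.5 gigabytes of RAM.'

print([m.group(0).strip() for m in re.finditer(price, ad)])  # ['$799.99']
print([m.group(0) for m in re.finditer(disk, ad)])           # ['500 GB', '5.5 gigabytes']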

2.1.6 Substitution, Capture Groups, and ELIZA


An important use of regular expressions is in substitutions. For example, the substi-
tution operator s/regexp1/pattern/ used in Python and in Unix commands like
vim or sed allows a string characterized by a regular expression to be replaced by
another string:

s/colour/color/
It is often useful to be able to refer to a particular subpart of the string matching
the first pattern. For example, suppose we wanted to put angle brackets around all
integers in a text, for example, changing the 35 boxes to the <35> boxes. We’d
like a way to refer to the integer we’ve found so that we can easily add the brackets.
To do this, we put parentheses ( and ) around the first pattern and use the number
operator \1 in the second pattern to refer back. Here’s how it looks:
s/([0-9]+)/<\1>/
The parenthesis and number operators can also specify that a certain string or
expression must occur twice in the text. For example, suppose we are looking for
the pattern “the Xer they were, the Xer they will be”, where we want to constrain
the two X’s to be the same string. We do this by surrounding the first X with the
parenthesis operator, and replacing the second X with the number operator \1, as
follows:
/the (.*)er they were, the \1er they will be/
Here the \1 will be replaced by whatever string matched the first item in paren-
theses. So this will match the bigger they were, the bigger they will be but not the
bigger they were, the faster they will be.
This use of parentheses to store a pattern in memory is called a capture group.
Every time a capture group is used (i.e., parentheses surround a pattern), the re-
sulting match is stored in a numbered register. If you match two different sets of
parentheses, \2 means whatever matched the second capture group. Thus
/the (.*)er they (.*), the \1er we \2/
will match the faster they ran, the faster we ran but not the faster they ran, the faster
we ate. Similarly, the third capture group is stored in \3, the fourth is \4, and so on.
Parentheses thus have a double function in regular expressions; they are used to
group terms for specifying the order in which operators should apply, and they are
used to capture something in a register. Occasionally we might want to use parenthe-
ses for grouping, but don’t want to capture the resulting pattern in a register. In that
case we use a non-capturing group, which is specified by putting the commands
?: after the open paren, in the form (?: pattern ).
/(?:some|a few) (people|cats) like some \1/
will match some cats like some cats but not some cats like some a few.
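Here is a brief sketch of these operators in Python's re (an implementation choice on our part; the patterns are the ones just described):

import re

# A capture group plus \1 in the replacement: bracket every integer.
print(re.sub(r'([0-9]+)', r'<\1>', 'the 35 boxes'))   # 'the <35> boxes'

# A backreference inside the pattern forces the two X's to be the same string.
pat = r'the (.*)er they were, the \1er they will be'
print(bool(re.search(pat, 'the bigger they were, the bigger they will be')))  # True
print(bool(re.search(pat, 'the bigger they were, the faster they will be')))  # False

# A non-capturing group (?: ...) groups without filling a register,
# so \1 still refers to (people|cats).
pat = r'(?:some|a few) (people|cats) like some \1'
print(bool(re.search(pat, 'some cats like some cats')))   # True
print(bool(re.search(pat, 'some cats like some a few')))  # False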
Substitutions and capture groups are very useful in implementing simple chat-
bots like ELIZA (Weizenbaum, 1966). Recall that ELIZA simulates a Rogerian
psychologist by carrying on conversations like the following:

User1 : Men are all alike.


ELIZA1 : IN WHAT WAY
User2 : They’re always bugging us about something or other.
ELIZA2 : CAN YOU THINK OF A SPECIFIC EXAMPLE
User3 : Well, my boyfriend made me come here.
ELIZA3 : YOUR BOYFRIEND MADE YOU COME HERE
User4 : He says I’m depressed much of the time.
ELIZA4 : I AM SORRY TO HEAR YOU ARE DEPRESSED

ELIZA works by having a series or cascade of regular expression substitutions


each of which matches and changes some part of the input lines. Input lines are

first uppercased. The first substitutions then change all instances of MY to YOUR,
and I’M to YOU ARE, and so on. The next set of substitutions matches and replaces
other patterns in the input. Here are some examples:
s/.* I’M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

Since multiple substitutions can apply to a given input, substitutions are assigned
a rank and applied in order. Creating patterns is the topic of Exercise 2.3, and we
return to the details of the ELIZA architecture in Chapter 24.
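A miniature version of such a cascade might look like the following sketch (Python's re assumed; the rules paraphrase the substitutions above, with the patterns uppercased to match the uppercased input, and the fallback response is our invention, not ELIZA's):

import re

RULES = [  # tried in rank order
    (r".* I'M (DEPRESSED|SAD) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* I AM (DEPRESSED|SAD) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* ALL .*", "IN WHAT WAY"),
    (r".* ALWAYS .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def eliza_turn(line):
    line = line.upper()                    # input lines are first uppercased
    for pattern, response in RULES:
        if re.match(pattern, line):
            return re.sub(pattern, response, line)
    return "PLEASE GO ON"                  # hypothetical fallback

print(eliza_turn("Men are all alike."))                      # IN WHAT WAY
print(eliza_turn("He says I'm depressed much of the time."))
# I AM SORRY TO HEAR YOU ARE DEPRESSED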

2.1.7 Lookahead Assertions


Finally, there will be times when we need to predict the future: look ahead in the
text to see if some pattern matches, but not advance the match cursor, so that we can
then deal with the pattern if it occurs.
These lookahead assertions make use of the (? syntax that we saw in the previ-
ous section for non-capture groups. The operator (?= pattern) is true if pattern
occurs, but is zero-width, i.e. the match pointer doesn’t advance. The operator
(?! pattern) only returns true if a pattern does not match, but again is zero-width
and doesn’t advance the cursor. Negative lookahead is commonly used when we
are parsing some complex pattern but want to rule out a special case. For example
suppose we want to match, at the beginning of a line, any single word that doesn’t
start with “Volcano”. We can use negative lookahead to do this:
/ˆ(?!Volcano)[A-Za-z]+/
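For example (a sketch in Python's re, which uses the same lookahead syntax):

import re

for line in ['Volcano eruptions are rare', 'Volcanoes erupt', 'Krakatoa is a volcano']:
    m = re.match(r'^(?!Volcano)[A-Za-z]+', line)
    print(line, '->', m.group(0) if m else None)
# Volcano eruptions are rare -> None
# Volcanoes erupt            -> None   (the line still begins with "Volcano")
# Krakatoa is a volcano      -> Krakatoa

# Positive lookahead (?= ...) checks the following context without consuming it.
print(re.findall(r'[0-9]+(?= *GB)', '500 GB drive, 8 GB RAM, 3 TB backup'))
# ['500', '8']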

2.2 Words
Before we talk about processing words, we need to decide what counts as a word.
Let’s start by looking at one particular corpus (plural corpora), a computer-readable
collection of text or speech. For example the Brown corpus is a million-word col-
lection of samples from 500 written English texts from different genres (newspa-
per, fiction, non-fiction, academic, etc.), assembled at Brown University in 1963–64
(Kučera and Francis, 1967). How many words are in the following Brown sentence?
He stepped out into the hall, was delighted to encounter a water brother.
This sentence has 13 words if we don’t count punctuation marks as words, 15
if we count punctuation. Whether we treat period (“.”), comma (“,”), and so on as
words depends on the task. Punctuation is critical for finding boundaries of things
(commas, periods, colons) and for identifying some aspects of meaning (question
marks, exclamation marks, quotation marks). For some tasks, like part-of-speech
tagging or parsing or speech synthesis, we sometimes treat punctuation marks as if
they were separate words.
The Switchboard corpus of American English telephone conversations between
strangers was collected in the early 1990s; it contains 2430 conversations averaging
6 minutes each, totaling 240 hours of speech and about 3 million words (Godfrey
et al., 1992). Such corpora of spoken language don’t have punctuation but do intro-

duce other complications with regard to defining words. Let’s look at one utterance
utterance from Switchboard; an utterance is the spoken correlate of a sentence:
I do uh main- mainly business data processing
disfluency This utterance has two kinds of disfluencies. The broken-off word main- is
fragment called a fragment. Words like uh and um are called fillers or filled pauses. Should
filled pause we consider these to be words? Again, it depends on the application. If we are
building a speech transcription system, we might want to eventually strip out the
disfluencies.
But we also sometimes keep disfluencies around. Disfluencies like uh or um
are actually helpful in speech recognition in predicting the upcoming word, because
they may signal that the speaker is restarting the clause or idea, and so for speech
recognition they are treated as regular words. Because people use different disflu-
encies they can also be a cue to speaker identification. In fact Clark and Fox Tree
(2002) showed that uh and um have different meanings. What do you think they are?
Are capitalized tokens like They and uncapitalized tokens like they the same
word? These are lumped together in some tasks (speech recognition), while for part-
of-speech or named-entity tagging, capitalization is a useful feature and is retained.
How about inflected forms like cats versus cat? These two words have the same
lemma lemma cat but are different wordforms. A lemma is a set of lexical forms having
the same stem, the same major part-of-speech, and the same word sense. The word-
wordform form is the full inflected or derived form of the word. For morphologically complex
languages like Arabic, we often need to deal with lemmatization. For many tasks in
English, however, wordforms are sufficient.
How many words are there in English? To answer this question we need to
word type distinguish two ways of talking about words. Types are the number of distinct words
in a corpus; if the set of words in the vocabulary is V , the number of types is the
word token vocabulary size |V |. Tokens are the total number N of running words. If we ignore
punctuation, the following Brown sentence has 16 tokens and 14 types:
They picnicked by the pool, then lay back on the grass and looked at the stars.
When we speak about the number of words in the language, we are generally
referring to word types.

Corpus Tokens = N Types = |V |


Shakespeare 884 thousand 31 thousand
Brown corpus 1 million 38 thousand
Switchboard telephone conversations 2.4 million 20 thousand
COCA 440 million 2 million
Google N-grams 1 trillion 13 million
Figure 2.11 Rough numbers of types and tokens for some English language corpora. The
largest, the Google N-grams corpus, contains 13 million types, but this count only includes
types appearing 40 or more times, so the true number would be much larger.

Fig. 2.11 shows the rough numbers of types and tokens computed from some
popular English corpora. The larger the corpora we look at, the more word types
we find, and in fact this relationship between the number of types |V | and number
Herdan’s Law of tokens N is called Herdan’s Law (Herdan, 1960) or Heaps’ Law (Heaps, 1978)
Heaps’ Law after its discoverers (in linguistics and information retrieval respectively). It is shown
in Eq. 2.1, where k and β are positive constants, and 0 < β < 1.

|V| = kN^β     (2.1)

The value of β depends on the corpus size and the genre, but at least for the large
corpora in Fig. 2.11, β ranges from .67 to .75. Roughly then we can say that the
vocabulary size for a text goes up significantly faster than the square root of its
length in words.
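As a small numerical illustration of Eq. 2.1, we can plug in constants in the typical range; the values k = 30 and β = 0.7 below are made-up for illustration, not fitted to any particular corpus:

# Herdan's/Heaps' Law: |V| = k * N**beta
k, beta = 30, 0.7          # illustrative constants, not fitted values

for N in (1_000_000, 100_000_000, 1_000_000_000):
    V = k * N ** beta
    print(f"N = {N:>13,}   predicted |V| ≈ {V:,.0f}")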
Another measure of the number of words in the language is the number of lem-
mas instead of wordform types. Dictionaries can help in giving lemma counts; dic-
tionary entries or boldface forms are a very rough upper bound on the number of
lemmas (since some lemmas have multiple boldface forms). The 1989 edition of the
Oxford English Dictionary had 615,000 entries.

2.3 Corpora
Words don’t appear out of nowhere. Any particular piece of text that we study
is produced by one or more specific speakers or writers, in a specific dialect of a
specific language, at a specific time, in a specific place, for a specific function.
Perhaps the most important dimension of variation is the language. NLP algo-
rithms are most useful when they apply across many languages. The world has 7097
languages at the time of this writing, according to the online Ethnologue catalog
(Simons and Fennig, 2018). It is important to test algorithms on more than one lan-
guage, and particularly on languages with different properties; by contrast there is
an unfortunate current tendency for NLP algorithms to be developed or tested just
on English (Bender, 2019). Even when algorithms are developed beyond English,
they tend to be developed for the official languages of large industrialized nations
(Chinese, Spanish, Japanese, German etc.), but we don’t want to limit tools to just
these few languages. Furthermore, most languages also have multiple varieties, of-
ten spoken in different regions or by different social groups. Thus, for example, if
AAL we’re processing text that uses features of African American Language (AAL) —
the name for the many variations of language used by millions of people in African
American communities (King 2020) — we must use NLP tools that function with
features of those varieties. Twitter posts might use features often used by speakers of
African American Language, such as constructions like iont (I don’t in Mainstream
MAE American English (MAE)), or talmbout corresponding to MAE talking about, both
examples that influence word segmentation (Blodgett et al. 2016, Jones 2015).
It’s also quite common for speakers or writers to use multiple languages in a
code switching single communicative act, a phenomenon called code switching. Code switch-
ing is enormously common across the world; here are examples showing Spanish
and (transliterated) Hindi code switching with English (Solorio et al. 2014, Jurgens
et al. 2017):
(2.2) Por primera vez veo a @username actually being hateful! it was beautiful:)
[For the first time I get to see @username actually being hateful! it was
beautiful:) ]
(2.3) dost tha or ra- hega ... dont wory ... but dherya rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
Another dimension of variation is the genre. The text that our algorithms must
process might come from newswire, fiction or non-fiction books, scientific articles,
Wikipedia, or religious texts. It might come from spoken genres like telephone
conversations, business meetings, police body-worn cameras, medical interviews,
or transcripts of television shows or movies. It might come from work situations

like doctors’ notes, legal text, or parliamentary or congressional proceedings.


Text also reflects the demographic characteristics of the writer (or speaker): their
age, gender, race, socioeconomic class can all influence the linguistic properties of
the text we are processing.
And finally, time matters too. Language changes over time, and for some lan-
guages we have good corpora of texts from different historical periods.
Because language is so situated, when developing computational models for lan-
guage processing from a corpus, it’s important to consider who produced the lan-
guage, in what context, for what purpose. How can a user of a dataset know all these
datasheet details? The best way is for the corpus creator to build a datasheet (Gebru et al.,
2020) or data statement (Bender and Friedman, 2018) for each corpus. A datasheet
specifies properties of a dataset like:
Motivation: Why was the corpus collected, by whom, and who funded it?
Situation: When and in what situation was the text written/spoken? For example,
was there a task? Was the language originally spoken conversation, edited
text, social media communication, monologue vs. dialogue?
Language variety: What language (including dialect/region) was the corpus in?
Speaker demographics: What was, e.g., age or gender of the authors of the text?
Collection process: How big is the data? If it is a subsample how was it sampled?
Was the data collected with consent? How was the data pre-processed, and
what metadata is available?
Annotation process: What are the annotations, what are the demographics of the
annotators, how were they trained, how was the data annotated?
Distribution: Are there copyright or other intellectual property restrictions?

2.4 Text Normalization


Before almost any natural language processing of a text, the text has to be normal-
ized. At least three tasks are commonly applied as part of any normalization process:
1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
In the next sections we walk through each of these tasks.

2.4.1 Unix Tools for Crude Tokenization and Normalization


Let’s begin with an easy, if somewhat naive version of word tokenization and nor-
malization (and frequency computation) that can be accomplished for English solely
in a single UNIX command-line, inspired by Church (1994). We’ll make use of some
Unix commands: tr, used to systematically change particular characters in the in-
put; sort, which sorts input lines in alphabetical order; and uniq, which collapses
and counts adjacent identical lines.
For example let’s begin with the ‘complete words’ of Shakespeare in one textfile,
sh.txt. We can use tr to tokenize the words by changing every sequence of non-
alphabetic characters to a newline (’A-Za-z’ means alphabetic, the -c option com-
plements to non-alphabet, and the -s option squeezes all sequences into a single
character):

tr -sc ’A-Za-z’ ’\n’ < sh.txt


The output of this command will be:
THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
Now that there is one word per line, we can sort the lines, and pass them to uniq
-c which will collapse and count them:
tr -sc ’A-Za-z’ ’\n’ < sh.txt | sort | uniq -c
with the following output:
1945 A
72 AARON
19 ABBESS
25 Aaron
6 Abate
1 Abates
5 Abbess
6 Abbey
3 Abbot
...
Alternatively, we can collapse all the upper case to lower case:
tr -sc ’A-Za-z’ ’\n’ < sh.txt | tr A-Z a-z | sort | uniq -c
whose output is
14725 a
97 aaron
1 abaissiez
10 abandon
2 abandoned
2 abase
1 abash
14 abate
3 abated
3 abatement
...
Now we can sort again to find the frequent words. The -n option to sort means
to sort numerically rather than alphabetically, and the -r option means to sort in
reverse order (highest-to-lowest):
tr -sc ’A-Za-z’ ’\n’ < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r
The results show that the most frequent words in Shakespeare, as in any other
corpus, are the short function words like articles, pronouns, prepositions:

27378 the
26084 and
22538 i
19771 to
17481 of
14725 a
13826 you
12489 my
11318 that
11112 in
...

Unix tools of this sort can be very handy in building quick word count statistics
for any corpus.
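The same rough counts can be computed in a few lines of Python, mirroring the tr | tr | sort | uniq -c | sort -n -r pipeline above (the filename sh.txt is assumed to be the same Shakespeare file):

import re
from collections import Counter

with open("sh.txt", encoding="utf-8") as f:
    text = f.read().lower()                  # the "tr A-Z a-z" step

words = re.findall(r"[a-z]+", text)          # the "tr -sc 'A-Za-z' '\n'" step
counts = Counter(words)                      # the "sort | uniq -c" step

for word, count in counts.most_common(10):   # the "sort -n -r" step (top 10)
    print(f"{count:7d} {word}")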

2.4.2 Word Tokenization


The simple UNIX tools above were fine for getting rough word statistics but more
tokenization sophisticated algorithms are generally necessary for tokenization, the task of seg-
menting running text into words.
While the Unix command sequence just removed all the numbers and punctu-
ation, for most NLP applications we’ll need to keep these in our tokenization. We
often want to break off punctuation as a separate token; commas are a useful piece of
information for parsers, periods help indicate sentence boundaries. But we’ll often
want to keep the punctuation that occurs word internally, in examples like m.p.h.,
Ph.D., AT&T, and cap’n. Special characters and numbers will need to be kept in
prices ($45.55) and dates (01/02/06); we don’t want to segment that price into sep-
arate tokens of “45” and “55”. And there are URLs (http://www.stanford.edu),
Twitter hashtags (#nlproc), or email addresses ([email protected]).
Number expressions introduce other complications as well; while commas nor-
mally appear at word boundaries, commas are used inside numbers in English, every
three digits: 555,500.50. Languages, and hence tokenization requirements, differ
on this; many continental European languages like Spanish, French, and German, by
contrast, use a comma to mark the decimal point, and spaces (or sometimes periods)
where English puts commas, for example, 555 500,50.
clitic A tokenizer can also be used to expand clitic contractions that are marked by
apostrophes, for example, converting what’re to the two tokens what are, and
we’re to we are. A clitic is a part of a word that can’t stand on its own, and can only
occur when it is attached to another word. Some such contractions occur in other
alphabetic languages, including articles and pronouns in French (j’ai, l’homme).
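A toy regular-expression sketch of clitic expansion for a few English contractions; the tiny substitution table here covers only the cases shown and is not a general-purpose rule set:

import re

# Expand a few clitic contractions; this table is illustrative only.
CONTRACTIONS = [
    (r"\bwhat're\b", "what are"),
    (r"\bwe're\b", "we are"),
    (r"\b(\w+)n't\b", r"\1 n't"),   # Penn Treebank style: doesn't -> does n't
]

def expand_clitics(text):
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

print(expand_clitics("what're we doing? she doesn't know"))
# -> what are we doing? she does n't know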
Depending on the application, tokenization algorithms may also tokenize mul-
tiword expressions like New York or rock ’n’ roll as a single token, which re-
quires a multiword expression dictionary of some sort. Tokenization is thus inti-
mately tied up with named entity recognition, the task of detecting names, dates,
and organizations (Chapter 8).
Penn Treebank
tokenization
One commonly used tokenization standard is known as the Penn Treebank to-
kenization standard, used for the parsed corpora (treebanks) released by the Lin-
guistic Data Consortium (LDC), the source of many useful datasets. This standard
separates out clitics (doesn’t becomes does plus n’t), keeps hyphenated words to-
gether, and separates out all punctuation (to save space we’re showing visible spaces
‘ ’ between tokens, although newlines is a more common output):

Input: "The San Francisco-based restaurant," they said,


"doesn’t charge $10".
Output: " The San Francisco-based restaurant , " they said ,
" does n’t charge $ 10 " .
In practice, since tokenization needs to be run before any other language pro-
cessing, it needs to be very fast. The standard method for tokenization is therefore
to use deterministic algorithms based on regular expressions compiled into very ef-
ficient finite state automata. For example, Fig. 2.12 shows an example of a basic
regular expression that can be used to tokenize with the nltk.regexp tokenize
function of the Python-based Natural Language Toolkit (NLTK) (Bird et al. 2009;
http://www.nltk.org).

>>> text = ’That U.S.A. poster-print costs $12.40...’


>>> pattern = r’’’(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"’?():-_‘] # these are separate tokens; includes ], [
... ’’’
>>> nltk.regexp_tokenize(text, pattern)
[’That’, ’U.S.A.’, ’poster-print’, ’costs’, ’$12.40’, ’...’]
Figure 2.12 A Python trace of regular expression tokenization in the NLTK Python-based
natural language processing toolkit (Bird et al., 2009), commented for readability; the (?x)
verbose flag tells Python to strip comments and whitespace. Figure from Chapter 3 of Bird
et al. (2009).

Carefully designed deterministic algorithms can deal with the ambiguities that
arise, such as the fact that the apostrophe needs to be tokenized differently when used
as a genitive marker (as in the book’s cover), a quotative as in ‘The other class’, she
said, or in clitics like they’re.
Word tokenization is more complex in languages like written Chinese, Japanese,
and Thai, which do not use spaces to mark potential word-boundaries. In Chinese,
hanzi for example, words are composed of characters (called hanzi in Chinese). Each
character generally represents a single unit of meaning (called a morpheme) and is
pronounceable as a single syllable. Words are about 2.4 characters long on average.
But deciding what counts as a word in Chinese is complex. For example, consider
the following sentence:
(2.4) 姚明进入总决赛
“Yao Ming reaches the finals”
As Chen et al. (2017) point out, this could be treated as 3 words (‘Chinese Treebank’
segmentation):
(2.5) 姚明 进入 总决赛
YaoMing reaches finals
or as 5 words (‘Peking University’ segmentation):
(2.6) 姚 明 进入 总 决赛
Yao Ming reaches overall finals
Finally, it is possible in Chinese simply to ignore words altogether and use characters
as the basic elements, treating the sentence as a series of 7 characters:

(2.7) 姚 明 进 入 总 决 赛
Yao Ming enter enter overall decision game

In fact, for most Chinese NLP tasks it turns out to work better to take characters
rather than words as input, since characters are at a reasonable semantic level for
most applications, and since most word standards, by contrast, result in a huge vo-
cabulary with large numbers of very rare words (Li et al., 2019).
word
segmentation
However, for Japanese and Thai the character is too small a unit, and so algo-
rithms for word segmentation are required. These can also be useful for Chinese
in the rare situations where word rather than character boundaries are required. The
standard segmentation algorithms for these languages use neural sequence mod-
els trained via supervised machine learning on hand-segmented training sets; we’ll
introduce sequence models in Chapter 8 and Chapter 9.

2.4.3 Byte-Pair Encoding for Tokenization


There is a third option to tokenizing text. Instead of defining tokens as words
(whether delimited by spaces or more complex algorithms), or as characters (as in
Chinese), we can use our data to automatically tell us what the tokens should be.
This is especially useful in dealing with unknown words, an important problem in
language processing. As we will see in the next chapter, NLP algorithms often learn
some facts about language from one corpus (a training corpus) and then use these
facts to make decisions about a separate test corpus and its language. Thus if our
training corpus contains, say the words low, new, newer, but not lower, then if the
word lower appears in our test corpus, our system will not know what to do with it.
To deal with this unknown word problem, modern tokenizers often automati-
subwords cally induce sets of tokens that include tokens smaller than words, called subwords.
Subwords can be arbitrary substrings, or they can be meaning-bearing units like the
morphemes -est or -er. (A morpheme is the smallest meaning-bearing unit of a lan-
guage; for example the word unlikeliest has the morphemes un-, likely, and -est.)
In modern tokenization schemes, most tokens are words, but some tokens are fre-
quently occurring morphemes or other subwords like -er. Every unseen word like
lower can thus be represented by some sequence of known subword units, such as
low and er, or even as a sequence of individual letters if necessary.
Most tokenization schemes have two parts: a token learner, and a token seg-
menter. The token learner takes a raw training corpus (sometimes roughly pre-
separated into words, for example by whitespace) and induces a vocabulary, a set
of tokens. The token segmenter takes a raw test sentence and segments it into the
tokens in the vocabulary. Three algorithms are widely used: byte-pair encoding
(Sennrich et al., 2016), unigram language modeling (Kudo, 2018), and WordPiece
(Schuster and Nakajima, 2012); there is also a SentencePiece library that includes
implementations of the first two of the three (Kudo and Richardson, 2018).
In this section we introduce the simplest of the three, the byte-pair encoding or
BPE BPE algorithm (Sennrich et al., 2016); see Fig. 2.13. The BPE token learner begins
with a vocabulary that is just the set of all individual characters. It then examines the
training corpus, chooses the two symbols that are most frequently adjacent (say ‘A’,
‘B’), adds a new merged symbol ‘AB’ to the vocabulary, and replaces every adjacent
’A’ ’B’ in the corpus with the new ‘AB’. It continues to count and merge, creating
new longer and longer character strings, until k merges have been done creating k
novel tokens; k is thus a parameter of the algorithm. The resulting vocabulary
consists of the original set of characters plus k new symbols.

The algorithm is usually run inside words (not merging across word boundaries),
so the input corpus is first white-space-separated to give a set of strings, each corre-
sponding to the characters of a word, plus a special end-of-word symbol _, and its
counts. Let’s see its operation on the following tiny input corpus of 18 word tokens
with counts for each word (the word low appears 5 times, the word newer 6 times,
and so on), which would have a starting vocabulary of 11 letters:
corpus                vocabulary
5  l o w _            _, d, e, i, l, n, o, r, s, t, w
2  l o w e s t _
6  n e w e r _
3  w i d e r _
2  n e w _
The BPE algorithm first counts all pairs of adjacent symbols: the most frequent
is the pair e r because it occurs in newer (frequency of 6) and wider (frequency of
3) for a total of 9 occurrences.¹ We then merge these symbols, treating er as one
symbol, and count again:
corpus                vocabulary
5  l o w _            _, d, e, i, l, n, o, r, s, t, w, er
2  l o w e s t _
6  n e w er _
3  w i d er _
2  n e w _
Now the most frequent pair is er _, which we merge; our system has learned
that there should be a token for word-final er, represented as er_:
corpus                vocabulary
5  l o w _            _, d, e, i, l, n, o, r, s, t, w, er, er_
2  l o w e s t _
6  n e w er_
3  w i d er_
2  n e w _
Next n e (total count of 8) get merged to ne:
corpus                vocabulary
5  l o w _            _, d, e, i, l, n, o, r, s, t, w, er, er_, ne
2  l o w e s t _
6  ne w er_
3  w i d er_
2  ne w _
If we continue, the next merges are:
Merge          Current Vocabulary
(ne, w)        _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new
(l, o)         _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo
(lo, w)        _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low
(new, er_)     _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_
(low, _)       _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_, low_
Once we’ve learned our vocabulary, the token parser is used to tokenize a test
sentence. The token parser just runs on the test data the merges we have learned
1 Note that there can be ties; we could have instead chosen to merge r _ first, since that also has a
frequency of 9.

function BYTE-PAIR ENCODING(strings C, number of merges k) returns vocab V

V ← all unique characters in C # initial set of tokens is characters


for i = 1 to k do # merge tokens til k times
tL , tR ← Most frequent pair of adjacent tokens in C
tNEW ← tL + tR # make new token by concatenating
V ← V + tNEW # update the vocabulary
Replace each occurrence of tL , tR in C with tNEW # and update the corpus
return V

Figure 2.13 The token learner part of the BPE algorithm for taking a corpus broken up
into individual characters or bytes, and learning a vocabulary by iteratively merging tokens.
Figure adapted from Bostrom and Durrett (2020).

from the training data, greedily, in the order we learned them. (Thus the frequencies
in the test data don’t play a role, just the frequencies in the training data). So first
we segment each test sentence word into characters. Then we apply the first rule:
replace every instance of e r in the test corpus with er, and then the second rule:
replace every instance of er _ in the test corpus with er_, and so on. By the end,
if the test corpus contained the word n e w e r _, it would be tokenized as a full
word. But a new (unknown) word like l o w e r _ would be merged into the two
tokens low er_.
Of course in real algorithms BPE is run with many thousands of merges on a very
large input corpus. The result is that most words will be represented as full symbols,
and only the very rare words (and unknown words) will have to be represented by
their parts.
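The token learner of Fig. 2.13 can be written out directly in Python; the sketch below is a compact illustration (using _ as the end-of-word symbol and breaking ties by first occurrence), not an optimized library implementation:

from collections import Counter

def merge_pair(symbols, a, b):
    """Replace every adjacent occurrence of (a, b) in a symbol list with a+b."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(word_counts, k):
    """BPE token learner (Fig. 2.13): word_counts maps tuples of symbols
    (each word ending in the end-of-word symbol '_') to corpus counts."""
    corpus = dict(word_counts)
    vocab = {sym for word in corpus for sym in word}
    merges = []
    for _ in range(k):
        pairs = Counter()                        # counts of adjacent symbol pairs
        for word, count in corpus.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += count
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]      # most frequent adjacent pair
        merges.append((a, b))
        vocab.add(a + b)
        corpus = {tuple(merge_pair(list(word), a, b)): count
                  for word, count in corpus.items()}
    return vocab, merges

counts = {("l", "o", "w", "_"): 5,
          ("l", "o", "w", "e", "s", "t", "_"): 2,
          ("n", "e", "w", "e", "r", "_"): 6,
          ("w", "i", "d", "e", "r", "_"): 3,
          ("n", "e", "w", "_"): 2}
vocab, merges = learn_bpe(counts, 8)
print(merges)
# With this tie-breaking, the learned merges should follow the sequence above:
# (e,r), (er,_), (n,e), (ne,w), (l,o), (lo,w), (new,er_), (low,_)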

2.4.4 Word Normalization, Lemmatization and Stemming


normalization Word normalization is the task of putting words/tokens in a standard format, choos-
ing a single normal form for words with multiple forms like USA and US or uh-huh
and uhhuh. This standardization may be valuable, despite the spelling information
that is lost in the normalization process. For information retrieval or information
extraction about the US, we might want to see information from documents whether
they mention the US or the USA.
case folding Case folding is another kind of normalization. Mapping everything to lower
case means that Woodchuck and woodchuck are represented identically, which is
very helpful for generalization in many tasks, such as information retrieval or speech
recognition. For sentiment analysis and other text classification tasks, information
extraction, and machine translation, by contrast, case can be quite helpful and case
folding is generally not done. This is because maintaining the difference between,
for example, US the country and us the pronoun can outweigh the advantage in
generalization that case folding would have provided for other words.
For many natural language processing situations we also want two morpholog-
ically different forms of a word to behave similarly. For example in web search,
someone may type the string woodchucks but a useful system might want to also
return pages that mention woodchuck with no s. This is especially common in mor-
phologically complex languages like Russian, where for example the word Moscow
has different endings in the phrases Moscow, of Moscow, to Moscow, and so on.
Lemmatization is the task of determining that two words have the same root,
despite their surface differences. The words am, are, and is have the shared lemma

be; the words dinner and dinners both have the lemma dinner. Lemmatizing each of
these forms to the same lemma will let us find all mentions of words in Russian like
Moscow. The lemmatized form of a sentence like He is reading detective stories
would thus be He be read detective story.
How is lemmatization done? The most sophisticated methods for lemmatization
involve complete morphological parsing of the word. Morphology is the study of
morpheme the way words are built up from smaller meaning-bearing units called morphemes.
stem Two broad classes of morphemes can be distinguished: stems—the central mor-
affix pheme of the word, supplying the main meaning— and affixes—adding “additional”
meanings of various kinds. So, for example, the word fox consists of one morpheme
(the morpheme fox) and the word cats consists of two: the morpheme cat and the
morpheme -s. A morphological parser takes a word like cats and parses it into the
two morphemes cat and s, or parses a Spanish word like amaren (‘if in the future
they would love’) into the morpheme amar ‘to love’, and the morphological features
3PL and future subjunctive.
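For a quick look at lemmatization in practice, NLTK's WordNet-based lemmatizer can be used; this is a sketch assuming the WordNet data has been downloaded, and note that it needs the part-of-speech to lemmatize verbs correctly:

from nltk.stem import WordNetLemmatizer

# May require a one-time: nltk.download("wordnet")
wnl = WordNetLemmatizer()
print(wnl.lemmatize("dinners"))           # expected: dinner
print(wnl.lemmatize("are", pos="v"))      # expected: be
print(wnl.lemmatize("reading", pos="v"))  # expected: read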

The Porter Stemmer


Lemmatization algorithms can be complex. For this reason we sometimes make use
of a simpler but cruder method, which mainly consists of chopping off word-final
stemming affixes. This naive version of morphological analysis is called stemming. One of
Porter stemmer the most widely used stemming algorithms is the Porter (1980) algorithm. The Porter stemmer
applied to the following paragraph:
This was not the map we found in Billy Bones’s chest, but
an accurate copy, complete in all things-names and heights
and soundings-with the single exception of the red crosses
and the written notes.
produces the following stemmed output:
Thi wa not the map we found in Billi Bone s chest but an
accur copi complet in all thing name and height and sound
with the singl except of the red cross and the written note
cascade The algorithm is based on series of rewrite rules run in series, as a cascade, in
which the output of each pass is fed as input to the next pass; here is a sampling of
the rules:
ATIONAL → ATE (e.g., relational → relate)
ING → ε if stem contains vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
Detailed rule lists for the Porter stemmer, as well as code (in Java, Python, etc.)
can be found on Martin Porter’s homepage; see also the original paper (Porter, 1980).
Simple stemmers can be useful in cases where we need to collapse across differ-
ent variants of the same lemma. Nonetheless, they do tend to commit errors of both
over- and under-generalizing, as shown in the table below (Krovetz, 1993):

Errors of Commission              Errors of Omission
organization → organ              European → Europe
doing → doe                       analysis → analyzes
numerical → numerous              noise → noisy
policy → police                   sparse → sparsity
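NLTK also ships an implementation of the Porter stemmer; a minimal usage sketch follows. The outputs shown are indicative only: the full algorithm applies several rule steps in sequence, so for example relational may come out as relat rather than relate.

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["relational", "motoring", "grasses", "organization"]
print([stemmer.stem(w) for w in words])
# Indicative output: ['relat', 'motor', 'grass', 'organ']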

2.4.5 Sentence Segmentation


sentence
segmentation Sentence segmentation is another important step in text processing. The most use-
ful cues for segmenting a text into sentences are punctuation, like periods, question
marks, and exclamation points. Question marks and exclamation points are rela-
tively unambiguous markers of sentence boundaries. Periods, on the other hand, are
more ambiguous. The period character “.” is ambiguous between a sentence bound-
ary marker and a marker of abbreviations like Mr. or Inc. The previous sentence that
you just read showed an even more complex case of this ambiguity, in which the final
period of Inc. marked both an abbreviation and the sentence boundary marker. For
this reason, sentence tokenization and word tokenization may be addressed jointly.
In general, sentence tokenization methods work by first deciding (based on rules
or machine learning) whether a period is part of the word or is a sentence-boundary
marker. An abbreviation dictionary can help determine whether the period is part
of a commonly used abbreviation; the dictionaries can be hand-built or machine-
learned (Kiss and Strunk, 2006), as can the final sentence splitter. In the Stan-
ford CoreNLP toolkit (Manning et al., 2014), for example sentence splitting is
rule-based, a deterministic consequence of tokenization; a sentence ends when a
sentence-ending punctuation (., !, or ?) is not already grouped with other charac-
ters into a token (such as for an abbreviation or number), optionally followed by
additional final quotes or brackets.
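NLTK's default sentence tokenizer implements the unsupervised Punkt approach of Kiss and Strunk (2006); a minimal sketch, assuming the Punkt model data has been downloaded:

import nltk

# One-time model download may be needed: nltk.download("punkt")
text = ("The period in Mr. is not a sentence boundary. "
        "But this one is! And so is this one?")
for sent in nltk.sent_tokenize(text):
    print(sent)
# A well-trained tokenizer should keep "Mr." inside the first sentence
# and split on the following ., !, and ?.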

2.5 Minimum Edit Distance


Much of natural language processing is concerned with measuring how similar two
strings are. For example in spelling correction, the user typed some erroneous
string—let’s say graffe–and we want to know what the user meant. The user prob-
ably intended a word that is similar to graffe. Among candidate similar words,
the word giraffe, which differs by only one letter from graffe, seems intuitively
to be more similar than, say grail or graf, which differ in more letters. Another
example comes from coreference, the task of deciding whether two strings such as
the following refer to the same entity:
Stanford President Marc Tessier-Lavigne
Stanford University President Marc Tessier-Lavigne
Again, the fact that these two strings are very similar (differing by only one word)
seems like useful evidence for deciding that they might be coreferent.
Edit distance gives us a way to quantify both of these intuitions about string sim-
minimum edit ilarity. More formally, the minimum edit distance between two strings is defined
distance
as the minimum number of editing operations (operations like insertion, deletion,
substitution) needed to transform one string into another.
The gap between intention and execution, for example, is 5 (delete an i, substi-
tute e for n, substitute x for t, insert c, substitute u for n). It’s much easier to see
alignment this by looking at the most important visualization for string distances, an alignment
between the two strings, shown in Fig. 2.14. Given two sequences, an alignment is
a correspondence between substrings of the two sequences. Thus, we say I aligns
with the empty string, N with E, and so on. Beneath the aligned strings is another
representation; a series of symbols expressing an operation list for converting the
top string into the bottom string: d for deletion, s for substitution, i for insertion.

I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s   i s

Figure 2.14 Representing the minimum edit distance between two strings as an alignment.
The final row gives the operation list for converting the top string into the bottom string: d for
deletion, s for substitution, i for insertion.

We can also assign a particular cost or weight to each of these operations. The
Levenshtein distance between two sequences is the simplest weighting factor in
which each of the three operations has a cost of 1 (Levenshtein, 1966)—we assume
that the substitution of a letter for itself, for example, t for t, has zero cost. The Lev-
enshtein distance between intention and execution is 5. Levenshtein also proposed
an alternative version of his metric in which each insertion or deletion has a cost of
1 and substitutions are not allowed. (This is equivalent to allowing substitution, but
giving each substitution a cost of 2 since any substitution can be represented by one
insertion and one deletion). Using this version, the Levenshtein distance between
intention and execution is 8.

2.5.1 The Minimum Edit Distance Algorithm


How do we find the minimum edit distance? We can think of this as a search task, in
which we are searching for the shortest path—a sequence of edits—from one string
to another.

                    i n t e n t i o n
          del              ins               subst
  n t e n t i o n   i n t e c n t i o n   i n x e n t i o n
Figure 2.15 Finding the edit distance viewed as a search problem

The space of all possible edits is enormous, so we can’t search naively. However,
lots of distinct edit paths will end up in the same state (string), so rather than recom-
puting all those paths, we could just remember the shortest path to a state each time
we saw it.
dynamic
programming
We can do this by using dynamic programming. Dynamic programming
is the name for a class of algorithms, first introduced by Bellman (1957), that apply
a table-driven method to solve problems by combining solutions to sub-problems.
Some of the most commonly used algorithms in natural language processing make
use of dynamic programming, such as the Viterbi algorithm (Chapter 8) and the
CKY algorithm for parsing (Chapter 13).
The intuition of a dynamic programming problem is that a large problem can
be solved by properly combining the solutions to various sub-problems. Consider
the shortest path of transformed words that represents the minimum edit distance
between the strings intention and execution shown in Fig. 2.16.
Imagine some string (perhaps it is exention) that is in this optimal path (whatever
it is). The intuition of dynamic programming is that if exention is in the optimal
operation list, then the optimal sequence must also include the optimal path from
intention to exention. Why? If there were a shorter path from intention to exention,

i n t e n t i o n
delete i
n t e n t i o n
substitute n by e
e t e n t i o n
substitute t by x
e x e n t i o n
insert u
e x e n u t i o n
substitute n by c
e x e c u t i o n
Figure 2.16 Path from intention to execution.

then we could use it instead, resulting in a shorter overall path, and the optimal
sequence wouldn’t be optimal, thus leading to a contradiction.
minimum edit
distance
algorithm
The minimum edit distance algorithm was named by Wagner and
Fischer (1974) but independently discovered by many people (see the Historical
Notes section of Chapter 8).
Let’s first define the minimum edit distance between two strings. Given two
strings, the source string X of length n, and target string Y of length m, we’ll define
D[i, j] as the edit distance between X[1..i] and Y [1.. j], i.e., the first i characters of X
and the first j characters of Y . The edit distance between X and Y is thus D[n, m].
We’ll use dynamic programming to compute D[n, m] bottom up, combining so-
lutions to subproblems. In the base case, with a source substring of length i but an
empty target string, going from i characters to 0 requires i deletes. With a target
substring of length j but an empty source going from 0 characters to j characters
requires j inserts. Having computed D[i, j] for small i, j we then compute larger
D[i, j] based on previously computed smaller values. The value of D[i, j] is com-
puted by taking the minimum of the three possible paths through the matrix which
arrive there:

D[i, j] = min( D[i − 1, j] + del-cost(source[i]),
               D[i, j − 1] + ins-cost(target[j]),
               D[i − 1, j − 1] + sub-cost(source[i], target[j]) )

If we assume the version of Levenshtein distance in which the insertions and dele-
tions each have a cost of 1 (ins-cost(·) = del-cost(·) = 1), and substitutions have a
cost of 2 (except substitution of identical letters have zero cost), the computation for
D[i, j] becomes:


D[i, j] = min( D[i − 1, j] + 1,
               D[i, j − 1] + 1,
               D[i − 1, j − 1] + (2 if source[i] ≠ target[j], else 0) )     (2.8)

The algorithm is summarized in Fig. 2.17; Fig. 2.18 shows the results of applying
the algorithm to the distance between intention and execution with the version of
Levenshtein in Eq. 2.8.
Knowing the minimum edit distance is useful for algorithms like finding poten-
tial spelling error corrections. But the edit distance algorithm is important in another
way; with a small change, it can also provide the minimum cost alignment between
two strings. Aligning two strings is useful throughout speech and language process-
ing. In speech recognition, minimum edit distance alignment is used to compute
the word error rate (Chapter 26). Alignment plays a role in machine translation, in

function MIN-EDIT-DISTANCE(source, target) returns min-distance

n ← L ENGTH(source)
m ← L ENGTH(target)
Create a distance matrix distance[n+1,m+1]

# Initialization: the zeroth row and column is the distance from the empty string
D[0,0] = 0
for each row i from 1 to n do
D[i,0] ← D[i-1,0] + del-cost(source[i])
for each column j from 1 to m do
D[0,j] ← D[0, j-1] + ins-cost(target[j])

# Recurrence relation:
for each row i from 1 to n do
for each column j from 1 to m do
D[i, j] ← M IN( D[i−1, j] + del-cost(source[i]),
D[i−1, j−1] + sub-cost(source[i], target[j]),
D[i, j−1] + ins-cost(target[j]))
# Termination
return D[n,m]

Figure 2.17 The minimum edit distance algorithm, an example of the class of dynamic
programming algorithms. The various costs can either be fixed (e.g., ∀x, ins-cost(x) = 1)
or can be specific to the letter (to model the fact that some letters are more likely to be in-
serted than others). We assume that there is no cost for substituting a letter for itself (i.e.,
sub-cost(x, x) = 0).

Src\Tar # e x e c u t i o n
# 0 1 2 3 4 5 6 7 8 9
i 1 2 3 4 5 6 7 6 7 8
n 2 3 4 5 6 7 8 7 8 7
t 3 4 5 6 7 8 7 8 9 8
e 4 3 4 5 6 7 8 9 10 9
n 5 4 5 6 7 8 9 10 11 10
t 6 5 6 7 8 9 8 9 10 11
i 7 6 7 8 9 10 9 8 9 10
o 8 7 8 9 10 11 10 9 8 9
n 9 8 9 10 11 12 11 10 9 8
Figure 2.18 Computation of minimum edit distance between intention and execution with
the algorithm of Fig. 2.17, using Levenshtein distance with cost of 1 for insertions or dele-
tions, 2 for substitutions.

which sentences in a parallel corpus (a corpus with a text in two languages) need to
be matched to each other.
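A direct Python transcription of the algorithm in Fig. 2.17 with the costs of Eq. 2.8 is sketched below; it is meant as an illustration, not an optimized implementation:

def min_edit_distance(source, target, del_cost=1, ins_cost=1, sub_cost=2):
    """Minimum edit distance with the Levenshtein costs of Eq. 2.8:
    insertions/deletions cost 1, substitutions cost 2, matches cost 0."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                 # distance from the empty target
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):                 # distance from the empty source
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + del_cost,      # deletion
                          D[i][j - 1] + ins_cost,      # insertion
                          D[i - 1][j - 1] + sub)       # substitution / match
    return D[n][m]

print(min_edit_distance("intention", "execution"))    # 8, as in Fig. 2.18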
To extend the edit distance algorithm to produce an alignment, we can start by
visualizing an alignment as a path through the edit distance matrix. Figure 2.19
shows this path with the boldfaced cell. Each boldfaced cell represents an alignment
of a pair of letters in the two strings. If two boldfaced cells occur in the same row,
there will be an insertion in going from the source to the target; two boldfaced cells
in the same column indicate a deletion.
Figure 2.19 also shows the intuition of how to compute this alignment path. The

computation proceeds in two steps. In the first step, we augment the minimum edit
distance algorithm to store backpointers in each cell. The backpointer from a cell
points to the previous cell (or cells) that we came from in entering the current cell.
We’ve shown a schematic of these backpointers in Fig. 2.19. Some cells have mul-
tiple backpointers because the minimum extension could have come from multiple
backtrace previous cells. In the second step, we perform a backtrace. In a backtrace, we start
from the last cell (at the final row and column), and follow the pointers back through
the dynamic programming matrix. Each complete path between the final cell and the
initial cell is a minimum distance alignment. Exercise 2.7 asks you to modify the
minimum edit distance algorithm to store the pointers and compute the backtrace to
output an alignment.

# e x e c u t i o n
# 0 ← 1 ← 2 ← 3 ← 4 ← 5 ← 6 ← 7 ←8 ← 9
i ↑1 -←↑ 2 -←↑ 3 -←↑ 4 -←↑ 5 -←↑ 6 -←↑ 7 -6 ←7 ←8
n ↑2 -←↑ 3 -←↑ 4 -←↑ 5 -←↑ 6 -←↑ 7 -←↑ 8 ↑7 -←↑ 8 -7
t ↑3 -←↑ 4 -←↑ 5 -←↑ 6 -←↑ 7 -←↑ 8 -7 ←↑ 8 -←↑ 9 ↑8
e ↑4 -3 ←4 -← 5 ←6 ←7 ←↑ 8 -←↑ 9 -←↑ 10 ↑9
n ↑5 ↑4 -←↑ 5 -←↑ 6 -←↑ 7 -←↑ 8 -←↑ 9 -←↑ 10 -←↑ 11 -↑ 10
t ↑6 ↑5 -←↑ 6 -←↑ 7 -←↑ 8 -←↑ 9 -8 ←9 ← 10 ←↑ 11
i ↑7 ↑6 -←↑ 7 -←↑ 8 -←↑ 9 -←↑ 10 ↑9 -8 ←9 ← 10
o ↑8 ↑7 -←↑ 8 -←↑ 9 -←↑ 10 -←↑ 11 ↑ 10 ↑9 -8 ←9
n ↑9 ↑8 -←↑ 9 -←↑ 10 -←↑ 11 -←↑ 12 ↑ 11 ↑ 10 ↑9 -8
Figure 2.19 When entering a value in each cell, we mark which of the three neighboring
cells we came from with up to three arrows. After the table is full we compute an alignment
(minimum edit path) by using a backtrace, starting at the 8 in the lower-right corner and
following the arrows back. The sequence of bold cells represents one possible minimum cost
alignment between the two strings. Diagram design after Gusfield (1997).

While we worked our example with simple Levenshtein distance, the algorithm
in Fig. 2.17 allows arbitrary weights on the operations. For spelling correction, for
example, substitutions are more likely to happen between letters that are next to
each other on the keyboard. The Viterbi algorithm is a probabilistic extension of
minimum edit distance. Instead of computing the “minimum edit distance” between
two strings, Viterbi computes the “maximum probability alignment” of one string
with another. We’ll discuss this more in Chapter 8.

2.6 Summary
This chapter introduced a fundamental tool in language processing, the regular ex-
pression, and showed how to perform basic text normalization tasks including
word segmentation and normalization, sentence segmentation, and stemming.
We also introduced the important minimum edit distance algorithm for comparing
strings. Here’s a summary of the main points we covered about these ideas:
• The regular expression language is a powerful tool for pattern-matching.
• Basic operations in regular expressions include concatenation of symbols,
disjunction of symbols ([], |, and .), counters (*, +, and {n,m}), anchors
(ˆ, $) and precedence operators ((,)).

• Word tokenization and normalization are generally done by cascades of


simple regular expression substitutions or finite automata.
• The Porter algorithm is a simple and efficient way to do stemming, stripping
off affixes. It does not have high accuracy but may be useful for some tasks.
• The minimum edit distance between two strings is the minimum number of
operations it takes to edit one into the other. Minimum edit distance can be
computed by dynamic programming, which also results in an alignment of
the two strings.

Bibliographical and Historical Notes


Kleene (1951, 1956) first defined regular expressions and the finite automaton, based
on the McCulloch-Pitts neuron. Ken Thompson was one of the first to build regular
expressions compilers into editors for text searching (Thompson, 1968). His edi-
tor ed included a command “g/regular expression/p”, or Global Regular Expression
Print, which later became the Unix grep utility.
Text normalization algorithms have been applied since the beginning of the
field. One of the earliest widely used stemmers was Lovins (1968). Stemming
was also applied early to the digital humanities, by Packard (1973), who built an
affix-stripping morphological parser for Ancient Greek. Currently a wide vari-
ety of code for tokenization and normalization is available, such as the Stanford
Tokenizer (http://nlp.stanford.edu/software/tokenizer.shtml) or spe-
cialized tokenizers for Twitter (O’Connor et al., 2010), or for sentiment (http:
//sentiment.christopherpotts.net/tokenizing.html). See Palmer (2012)
for a survey of text preprocessing. NLTK is an essential tool that offers both useful
Python libraries (http://www.nltk.org) and textbook descriptions (Bird et al.,
2009) of many algorithms including text normalization and corpus interfaces.
For more on Herdan’s law and Heaps’ Law, see Herdan (1960, p. 28), Heaps
(1978), Egghe (2007) and Baayen (2001); Yasseri et al. (2012) discuss the relation-
ship with other measures of linguistic complexity. For more on edit distance, see the
excellent Gusfield (1997). Our example measuring the edit distance from ‘intention’
to ‘execution’ was adapted from Kruskal (1983). There are various publicly avail-
able packages to compute edit distance, including Unix diff and the NIST sclite
program (NIST, 2005).
In his autobiography Bellman (1984) explains how he originally came up with
the term dynamic programming:

“...The 1950s were not good years for mathematical research. [the]
Secretary of Defense ...had a pathological fear and hatred of the word,
research... I decided therefore to use the word, “programming”. I
wanted to get across the idea that this was dynamic, this was multi-
stage... I thought, let’s ... take a word that has an absolutely precise
meaning, namely dynamic... it’s impossible to use the word, dynamic,
in a pejorative sense. Try thinking of some combination that will pos-
sibly give it a pejorative meaning. It’s impossible. Thus, I thought
dynamic programming was a good name. It was something not even a
Congressman could object to.”

Exercises
2.1 Write regular expressions for the following languages.
1. the set of all alphabetic strings;
2. the set of all lower case alphabetic strings ending in a b;
3. the set of all strings from the alphabet a, b such that each a is immedi-
ately preceded by and immediately followed by a b;
2.2 Write regular expressions for the following languages. By “word”, we mean
an alphabetic string separated from other words by whitespace, any relevant
punctuation, line breaks, and so forth.
1. the set of all strings with two consecutive repeated words (e.g., “Hum-
bert Humbert” and “the the” but not “the bug” or “the big bug”);
2. all strings that start at the beginning of the line with an integer and that
end at the end of the line with a word;
3. all strings that have both the word grotto and the word raven in them
(but not, e.g., words like grottos that merely contain the word grotto);
4. write a pattern that places the first word of an English sentence in a
register. Deal with punctuation.
2.3 Implement an ELIZA-like program, using substitutions such as those described
on page 11. You might want to choose a different domain than a Rogerian psy-
chologist, although keep in mind that you would need a domain in which your
program can legitimately engage in a lot of simple repetition.
2.4 Compute the edit distance (using insertion cost 1, deletion cost 1, substitution
cost 1) of “leda” to “deal”. Show your work (using the edit distance grid).
2.5 Figure out whether drive is closer to brief or to divers and what the edit dis-
tance is to each. You may use any version of distance that you like.
2.6 Now implement a minimum edit distance algorithm and use your hand-computed
results to check your code.
2.7 Augment the minimum edit distance algorithm to output an alignment; you
will need to store pointers and add a stage to compute the backtrace.
CHAPTER

3 N-gram Language Models

“You are uniformly charming!” cried he, with a smile of associating and now
and then I bowed and they perceived a chaise and four to wish for.
Random sentence generated from a Jane Austen trigram model

Predicting is difficult—especially about the future, as the old quip goes. But how
about predicting something that seems much easier, like the next few words someone
is going to say? What word, for example, is likely to follow
Please turn your homework ...
Hopefully, most of you concluded that a very likely word is in, or possibly over,
but probably not refrigerator or the. In the following sections we will formalize
this intuition by introducing models that assign a probability to each possible next
word. The same models will also serve to assign a probability to an entire sentence.
Such a model, for example, could predict that the following sequence has a much
higher probability of appearing in a text:
all of a sudden I notice three guys standing on the sidewalk
than does this same set of words in a different order:

on guys all I of notice sidewalk three a sudden standing the

Why would you want to predict upcoming words, or assign probabilities to sen-
tences? Probabilities are essential in any task in which we have to identify words in
noisy, ambiguous input, like speech recognition. For a speech recognizer to realize
that you said I will be back soonish and not I will be bassoon dish, it helps to know
that back soonish is a much more probable sequence than bassoon dish. For writing
tools like spelling correction or grammatical error correction, we need to find and
correct errors in writing like Their are two midterms, in which There was mistyped
as Their, or Everything has improve, in which improve should have been improved.
The phrase There are will be much more probable than Their are, and has improved
than has improve, allowing us to help users by detecting and correcting these errors.
Assigning probabilities to sequences of words is also essential in machine trans-
lation. Suppose we are translating a Chinese source sentence:
他 向 记者 介绍了 主要 内容
He to reporters introduced main content
As part of the process we might have built the following set of potential rough
English translations:
he introduced reporters to the main contents of the statement
he briefed to reporters the main contents of the statement
he briefed reporters on the main contents of the statement

A probabilistic model of word sequences could suggest that briefed reporters on


is a more probable English phrase than briefed to reporters (which has an awkward
to after briefed) or introduced reporters to (which uses a verb that is less fluent
English in this context), allowing us to correctly select the boldfaced sentence above.
Probabilities are also important for augmentative and alternative communi-
AAC cation systems (Trnka et al. 2007, Kane et al. 2017). People often use such AAC
devices if they are physically unable to speak or sign but can instead use eye gaze or
other specific movements to select words from a menu to be spoken by the system.
Word prediction can be used to suggest likely words for the menu.
Models that assign probabilities to sequences of words are called language mod-
language model els or LMs. In this chapter we introduce the simplest model that assigns probabil-
LM ities to sentences and sequences of words, the n-gram. An n-gram is a sequence
n-gram of n words: a 2-gram (which we’ll call bigram) is a two-word sequence of words
like “please turn”, “turn your”, or ”your homework”, and a 3-gram (a trigram) is
a three-word sequence of words like “please turn your”, or “turn your homework”.
We’ll see how to use n-gram models to estimate the probability of the last word of
an n-gram given the previous words, and also to assign probabilities to entire se-
quences. In a bit of terminological ambiguity, we usually drop the word “model”,
and use the term n-gram (and bigram, etc.) to mean either the word sequence itself
or the predictive model that assigns it a probability. In later chapters we’ll introduce
more sophisticated language models like the RNN LMs of Chapter 9.

3.1 N-Grams
Let’s begin with the task of computing P(w|h), the probability of a word w given
some history h. Suppose the history h is “its water is so transparent that” and we
want to know the probability that the next word is the:

P(the|its water is so transparent that). (3.1)

One way to estimate this probability is from relative frequency counts: take a
very large corpus, count the number of times we see its water is so transparent that,
and count the number of times this is followed by the. This would be answering the
question “Out of the times we saw the history h, how many times was it followed by
the word w”, as follows:

P(the|its water is so transparent that) =
    C(its water is so transparent that the) / C(its water is so transparent that)     (3.2)

With a large enough corpus, such as the web, we can compute these counts and
estimate the probability from Eq. 3.2. You should pause now, go to the web, and
compute this estimate for yourself.
While this method of estimating probabilities directly from counts works fine in
many cases, it turns out that even the web isn’t big enough to give us good estimates
in most cases. This is because language is creative; new sentences are created all the
time, and we won’t always be able to count entire sentences. Even simple extensions
of the example sentence may have counts of zero on the web (such as “Walden
Pond’s water is so transparent that the”; well, used to have counts of zero).

Similarly, if we wanted to know the joint probability of an entire sequence of


words like its water is so transparent, we could do it by asking “out of all possible
sequences of five words, how many of them are its water is so transparent?” We
would have to get the count of its water is so transparent and divide by the sum of
the counts of all possible five word sequences. That seems rather a lot to estimate!
For this reason, we’ll need to introduce more clever ways of estimating the prob-
ability of a word w given a history h, or the probability of an entire word sequence
W . Let’s start with a little formalizing of notation. To represent the probability
of a particular random variable Xi taking on the value “the”, or P(Xi = “the”), we
will use the simplification P(the). We’ll represent a sequence of N words either as
w1 . . . wn or w1:n (so the expression w1:n−1 means the string w1 , w2 , ..., wn−1 ). For the
joint probability of each word in a sequence having a particular value P(X = w1 ,Y =
w2 , Z = w3 , ...,W = wn ) we’ll use P(w1 , w2 , ..., wn ).
Now how can we compute probabilities of entire sequences like P(w1 , w2 , ..., wn )?
One thing we can do is decompose this probability using the chain rule of proba-
bility:
P(X1 ... Xn) = P(X1) P(X2|X1) P(X3|X1:2) . . . P(Xn|X1:n−1)
             = ∏_{k=1}^{n} P(Xk|X1:k−1)     (3.3)

Applying the chain rule to words, we get


P(w1:n) = P(w1) P(w2|w1) P(w3|w1:2) . . . P(wn|w1:n−1)
        = ∏_{k=1}^{n} P(wk|w1:k−1)     (3.4)

The chain rule shows the link between computing the joint probability of a sequence
and computing the conditional probability of a word given previous words. Equa-
tion 3.4 suggests that we could estimate the joint probability of an entire sequence of
words by multiplying together a number of conditional probabilities. But using the
chain rule doesn’t really seem to help us! We don’t know any way to compute the
exact probability of a word given a long sequence of preceding words, P(wn |w1:n−1 ).
As we said above, we can’t just estimate by counting the number of times every word
occurs following every long string, because language is creative and any particular
context might have never occurred before!
The intuition of the n-gram model is that instead of computing the probability of
a word given its entire history, we can approximate the history by just the last few
words.
bigram The bigram model, for example, approximates the probability of a word given
all the previous words P(wn |w1:n−1 ) by using only the conditional probability of the
preceding word P(wn |wn−1 ). In other words, instead of computing the probability
P(the|Walden Pond’s water is so transparent that) (3.5)
we approximate it with the probability
P(the|that) (3.6)
When we use a bigram model to predict the conditional probability of the next word,
we are thus making the following approximation:
P(wn |w1:n−1 ) ≈ P(wn |wn−1 ) (3.7)
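To make this approximation concrete, here is a minimal Python sketch (an illustration, not code from the text): it conditions only on the last word of the history, and the lookup table and its single probability value are invented.

    # Sketch of the bigram approximation in Eq. 3.7: condition on the last
    # word of the history instead of the whole history.
    def approx_next_word_prob(word, history, bigram_prob):
        """Approximate P(word | history) by P(word | last word of history)."""
        prev = history[-1]                    # keep only the preceding word
        return bigram_prob.get((prev, word), 0.0)

    history = "Walden Pond's water is so transparent that".split()
    bigram_prob = {("that", "the"): 0.4}      # hypothetical value
    print(approx_next_word_prob("the", history, bigram_prob))   # 0.4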
The assumption that the probability of a word depends only on the previous word is
called a Markov assumption. Markov models are the class of probabilistic models
that assume we can predict the probability of some future unit without looking too
far into the past. We can generalize the bigram (which looks one word into the past)
to the trigram (which looks two words into the past) and thus to the n-gram (which
looks n − 1 words into the past).
Thus, the general equation for this n-gram approximation to the conditional
probability of the next word in a sequence is
P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})          (3.8)
Given the bigram assumption for the probability of an individual word, we can com-
pute the probability of a complete word sequence by substituting Eq. 3.7 into Eq. 3.4:
P(w_{1:n}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})           (3.9)
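To see Eq. 3.9 in code, here is a rough Python sketch (not the book's implementation; the probability values below are invented). It multiplies bigram probabilities over adjacent word pairs; because products of many small probabilities underflow, a log-space version that sums log probabilities is also shown.

    import math

    def bigram_sentence_prob(words, bigram_prob):
        # Product of P(w_k | w_{k-1}) over adjacent pairs, as in Eq. 3.9.
        p = 1.0
        for prev, w in zip(words, words[1:]):
            p *= bigram_prob.get((prev, w), 0.0)
        return p

    def bigram_sentence_logprob(words, bigram_prob):
        # Same computation in log space; assumes every bigram is present.
        return sum(math.log(bigram_prob[(prev, w)])
                   for prev, w in zip(words, words[1:]))

    # Hypothetical probabilities, for illustration only. A start-of-sentence
    # symbol (introduced below) would normally supply the context for w_1.
    bigram_prob = {("its", "water"): 0.10, ("water", "is"): 0.30,
                   ("is", "so"): 0.20, ("so", "transparent"): 0.05}
    words = "its water is so transparent".split()
    print(bigram_sentence_prob(words, bigram_prob))      # ~0.0003
    print(bigram_sentence_logprob(words, bigram_prob))   # about -8.1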
How do we estimate these bigram or n-gram probabilities? An intuitive way to
estimate probabilities is called maximum likelihood estimation or MLE. We get
the MLE estimate for the parameters of an n-gram model by getting counts from a
corpus, and normalizing the counts so that they lie between 0 and 1.¹
For example, to compute a particular bigram probability of a word y given a
previous word x, we’ll compute the count of the bigram C(xy) and normalize by the
sum of all the bigrams that share the same first word x:
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)}        (3.10)
We can simplify this equation, since the sum of all bigram counts that start with
a given word wn−1 must be equal to the unigram count for that word wn−1 (the reader
should take a moment to be convinced of this):
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}          (3.11)
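The count-and-divide recipe of Eq. 3.11 is compact to express in code. The Python sketch below is an illustration rather than a reference implementation: it pads each tokenized sentence with the start and end symbols <s> and </s> (described just below), counts bigrams and their left contexts, and divides.

    from collections import Counter

    def mle_bigram_probs(sentences):
        # MLE bigram estimates, Eq. 3.11: P(w | prev) = C(prev w) / C(prev).
        bigram_counts = Counter()
        context_counts = Counter()
        for sent in sentences:
            padded = ["<s>"] + sent + ["</s>"]
            for prev, w in zip(padded, padded[1:]):
                bigram_counts[(prev, w)] += 1
                context_counts[prev] += 1
        return {(prev, w): count / context_counts[prev]
                for (prev, w), count in bigram_counts.items()}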
Let’s work through an example using a mini-corpus of three sentences. We’ll
first need to augment each sentence with a special symbol <s> at the beginning
of the sentence, to give us the bigram context of the first word. We’ll also need a
special end-symbol </s>.²
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Here are the calculations for some of the bigram probabilities from this corpus:
P(I|<s>) = 2/3 = .67      P(Sam|<s>) = 1/3 = .33     P(am|I) = 2/3 = .67
P(</s>|Sam) = 1/2 = .5    P(Sam|am) = 1/2 = .5       P(do|I) = 1/3 = .33
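As a check, running the estimator sketched after Eq. 3.11 on this three-sentence mini-corpus reproduces the same values:

    corpus = ["I am Sam".split(),
              "Sam I am".split(),
              "I do not like green eggs and ham".split()]
    probs = mle_bigram_probs(corpus)
    print(round(probs[("<s>", "I")], 2))     # 0.67
    print(round(probs[("I", "am")], 2))      # 0.67
    print(round(probs[("Sam", "</s>")], 2))  # 0.5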
¹ For probabilistic models, normalizing means dividing by some total count so that the resulting probabilities fall legally between 0 and 1.
² We need the end-symbol to make the bigram grammar a true probability distribution. Without an
end-symbol, the sentence probabilities for all sentences of a given length would sum to one. This model
would define an infinite set of probability distributions, with one distribution per sentence length. See
Exercise 3.5.
Marriage and family are thus intimately connected with each
other: it is for the benefit of the young that male and female
continue to live together. Marriage is therefore rooted in family,
rather than family in marriage. There are also many peoples among
whom true conjugal life does not begin before a child is born, and
others who consider that the birth of a child out of wedlock makes it
obligatory for the parents to marry. Among the Eastern
Greenlanders102 and the Fuegians,103 marriage is not regarded as
complete till the woman has become a mother. Among the
Shawanese104 and Abipones,105 the wife very often remains at her
father’s house till she has a child. Among the Khyens, the Ainos of
Yesso, and one of the aboriginal tribes of China, the husband goes
to live with his wife at her father’s house, and never takes her away
till after the birth of a child.106 In Circassia, the bride and
bridegroom are kept apart until the first child is born;107 and among
the Bedouins of Mount Sinai, a wife never enters her husband’s tent
until she becomes far advanced in pregnancy.108 Among the Baele,
the wife remains with her parents until she becomes a mother, and if
this does not happen, she stays there for ever, the husband getting
back what he has paid for her.109 In Siam, a wife does not receive
her marriage portion before having given birth to a child;110 while
among the Atkha Aleuts, according to Erman, a husband does not
pay the purchase sum before he has become a father.111 Again, the
Badagas in Southern India have two marriage ceremonies, the
second of which does not take place till there is some indication that
the pair are to have a family; and if there is no appearance of this,
the couple not uncommonly separate.112 Dr. Bérenger-Féraud states
that, among the Wolofs in Senegambia, “it is only when the signs of
pregnancy are unmistakable in the betrothed, sometimes indeed only
after the birth of one or more children, that the marriage ceremony
properly so called takes place.”113 And the
Igorrotes of Luzon consider no engagement binding until the woman
has become pregnant.114
On the other hand, Emin Pasha tells us that, among the Mádi in
Central Africa, “should a girl become pregnant, the youth who has
been her companion is bound to marry her, and to pay to her father
the customary price of a bride.”115 Burton reports a similar custom
as prevailing among peoples dwelling to the south of the equator.116
Among many of the wild tribes of Borneo, there is almost
unrestrained intercourse between the youth of both sexes; but, if
pregnancy ensue, marriage is regarded as necessary.117 The same,
as I am informed by Dr. A. Bunker, is the case with some Karen
tribes in Burma. In Tahiti, according to Cook, the father might kill his
natural child, but if he suffered it to live, the parties were considered
to be in the married state.118 Among the Tipperahs of the
Chittagong Hills,119 as well as the peasants of the Ukraine,120 a
seducer is bound to marry the girl, should she become pregnant.
Again, Mr. Powers informs us that, among the Californian Wintun, if a
wife is abandoned when she has a young child, she is justified by
her friends in destroying it on the ground that it has no supporter.121
And among the Creeks, a young woman that becomes pregnant by a
man whom she had expected to marry, and is disappointed, is
allowed the same privilege.122
It might, however, be supposed that, in man, the prolonged union
of the sexes is due to another cause besides the offspring’s want of
parental care, i.e., to the fact that the sexual instinct is not restricted
to any particular season, but endures throughout the whole year.
“That which distinguishes man from the beast,” Beaumarchais says,
“is drinking without being thirsty, and making love at all seasons.”
But in the next chapter, I shall endeavour to show that this is
probably not quite correct, so far as our earliest human or semi-
human ancestors are concerned.
CHAPTER II
A HUMAN PAIRING SEASON IN PRIMITIVE
TIMES
Professor Leuckart assumes that the periodicity in the sexual life
of animals depends upon economical conditions, the reproductive
matter being a surplus of the individual economy. Hence he says
that the rut occurs at the time when the proportion between receipts
and expenditure is most favourable.123
Though this hypothesis is accepted by several eminent
physiologists, facts do not support the assumption that the power of
reproduction is correlated with abundance of food and bodily vigour.
There are some writers who even believe that the reverse is the
case.124
At any rate, it is not correct to say, with Dr. Gruenhagen, that “the
general wedding-feast is spring, when awakening nature opens, to
most animals, new and ample sources of living.”125 This is certainly
true of Reptiles and Birds, but not of Mammals; every month or
season of the year is the pairing season of one or another
mammalian species.126 But notwithstanding this apparent
irregularity, the pairing time of every species is bound by an unfailing
law; it sets in earlier or later, according as the period of gestation
lasts longer or shorter, so that the young may be born at the time
when they are most likely to survive. Thus, most Mammals bring
forth their young early in spring, or, in tropical countries, at the
beginning of the rainy season; the period then commences when life
is more easily sustained, when prey is most abundant, when there is
enough water and vegetable food, and when the climate becomes
warmer. In the highlands, animals pair later than those living in
lower regions,127 whilst those of the polar and temperate zones
generally pair later than those of the tropics. As regards the species
living in different latitudes the pairing time comes earlier or later,
according to the differences in climate.128
Far from depending upon any general physiological law, the rut is
thus adapted to the requirements of each species separately. Here
again we have an example of the powerful effects of natural
selection, often showing themselves very obviously. The dormouse
(Muscardinus avellanarius), for instance, that feeds upon hazel-nuts,
pairs in July, and brings forth its young in August, when nuts begin
to ripen. Then the young grow very quickly, so that they are able to
bear the autumn and winter cold.129
There are, however, a few wild species, as some whales,130 the
elephant,131 many Rodents,132 and several of the lower
monkeys,133 that seem to have no definite pairing season. As to
them it is, perhaps, sufficient to quote Dr. Brehm’s statement with
reference to the elephant, “The richness of their woods is so great,
that they really never suffer want.”134 But the man-like apes do not
belong to this class. According to Mr. Winwood Reade, the male
Gorillas fight at the rutting season for their females;135 Dr. Mohnike,
as also other authorities, mentions the occurrence of a rut-time with
the Orang-utan.136 And we find that both of these species breed
early in the season when fruits begin to be plentiful,—that is, their
pairing time depends on the same law as that which prevails in the
rest of the animal kingdom.
Sir Richard Burton says, “The Gorilla breeds about December, a
cool and dry month; according to my bushmen, the period of
gestation is between five and six months.”137 I have referred this
important statement to Mr. Alfred R. Wallace, who writes as follows:
“From the maps of rain distribution in Africa in Stanford’s
‘Compendium,’ the driest months in the Gorilla country seem to be
January and February, and these would probably be the months of
greatest fruit supply.” As regards the Orang-utan, Mr. Wallace adds,
“I found the young sucking Orang-utan in May; that was about the
second or third month of the dry season, in which fruits began to be
plentiful.”
Considering, then, that the periodicity of the sexual life rests on
the kind of food on which the species lives, together with other
circumstances connected with anatomical and physiological
peculiarities, and considering, further, the close biological
resemblance between man and the man-like apes, we are almost
compelled to assume that the pairing time of our earliest human or
half-human ancestors was restricted to a certain season of the year,
as was also the case with their nearest relations among the lower
animals. This presumption derives further probability from there
being, even now, some rude peoples who are actually stated to have
an annual pairing time, and other peoples whose sexual instinct
undergoes most decidedly a periodical increase at a certain time of
the year.
According to Mr. Johnston, the wild Indians of California,
belonging to the lowest races on earth, “have their rutting seasons
as regularly as have the deer, the elk, the antelope, or any other
animals.”138 And Mr. Powers confirms the correctness of this
statement, at least with regard to some of these Indians, saying that
spring “is a literal Saint Valentine’s Day with them, as with the
natural beasts and birds of the forest.”139
As regards the Goddanes in Luzon, Mr. Foreman tells us that “it is
the custom of the young men about to marry, to vie with each other
in presenting to the sires of their future bride all the scalps they are
able to take from their enemies, as proof of their manliness and
courage. This practice prevails at the season of the year when the
tree—popularly called by the Spaniards ‘the fire-tree’—is in
bloom.”140
Speaking of the Watch-an-dies in the western part of Australia, Mr.
Oldfield remarks, “Like the beasts of the field, the savage has but
one time for copulation in the year.141 About the middle of spring ...
the Watch-an-dies begin to think of holding their grand semi-
religious festival of Caa-ro, preparatory to the performance of the
important duty of procreation.”142 A similar feast, according to Mr.
Bonwick, was celebrated by the Tasmanians at the same time of the
year.143
The Hos, an Indian hill tribe, have, as we are informed by Colonel
Dalton, every year a great feast in January, “when the granaries are
full of grain, and the people, to use their own expression, full of
devilry. They have a strange notion that at this period, men and
women are so over-charged with vicious propensities, that it is
absolutely necessary for the safety of the person to let off steam by
allowing for a time full vent to the passions. The festival, therefore,
becomes a saturnalia, during which servants forget their duty to
their masters, children their reverence for parents, men their respect
for women, and women all notions of modesty, delicacy, and
gentleness.” Men and women become almost like animals in the
indulgence of their amorous propensities, and the utmost liberty is
given to the girls.144
The same writer adds that “it would appear that most Hill Tribes
have found it necessary to promote marriage by stimulating
intercourse between the sexes at particular seasons of the year.”145
Among the Santals, “the marriages mostly take place once a year, in
January; for six days all the candidates for matrimony live in
promiscuous concubinage, after which the whole party are supposed
to have paired off as man and wife.”146 The Punjas in Jeypore,
according to Dr. Shortt, have a festival in the first month of the new
year, where men and women assemble. The lower order or castes
observe this festival, which is kept up for a month, by both sexes
mixing promiscuously, and taking partners as their choice directs.147
A similar feast, comprising a continuous course of debauchery and
licentiousness, is held once a year, by the Kotars, a tribe inhabiting
the Neilgherries;148 according to Mr. Bancroft, by the Keres in New
Mexico;149 according to Dr. Fritsch, by the Hottentots;150 according
to the Rev. H. Rowley, by the Kafirs;151 and, as I am informed by Mr.
A. J. Swann, by some tribes near Nyassa. Writers of the sixteenth
century speak of the existence of certain early festivals in Russia, at
which great license prevailed. According to Pamphill, these annual
gatherings took place, as a rule, at the end of June, the day before
the festival of St. John the Baptist, which, in pagan times, was that
of a divinity known by the name of Jarilo, corresponding to the
Priapus of the Greeks.152 At Rome, a festival in honour of Venus
took place in the month of April;153 and Mannhardt mentions some
curious popular customs in Germany, England, Esthonia and other
European countries, which seem to indicate an increase of the
sexual instinct in spring or at the beginning of summer.154
By questions addressed to persons living among various savage
peoples, I have inquired whether among these peoples, marriages
are principally contracted at a certain time of the year, and whether
more children are born in one month or season than in another. In
answer, Mr. Radfield writes from Lifu, near New Caledonia, that
marriages there formerly took place at various times, when suitable,
but “November used to be the time at which engagements were
made.” As the seasons in this island are the reverse of those in
England, this month includes the end of spring and the beginning of
summer. The Rev. H. T. Cousins informs me that, among the Kafirs
inhabiting what is known as Cis-Natalian Kafirland, “there are more
children born in one month or season than in another, viz. August
and September, which are the spring months in South Africa;” and
he ascribes this surplus of births to feasts, comprising debauchery
and unrestricted intercourse between the unmarried people of both
sexes. Again, Dr. A. Sims writes from Stanley Pool that, among the
Bateke, more children are born in September and October, that is, in
the seasons of the early rains, than at other times; and the Rev. Ch.
E. Ingham, writing from Banza Manteka, states that he believes the
same to be the case among the Bakongo. But the Rev. T. Bridges
informs me that, among the Yahgans in the southern part of Tierra
del Fuego, so far as he knows, one month is the same as another
with regard to the number of births. I venture, however, to think
that this result might be somewhat modified by a minute inquiry,
embracing a sufficient number of cases. For statistics prove that
even in civilized countries, there is a regular periodical fluctuation in
the birth-rate.
In the eighteenth century Wargentin showed that, in Sweden
more children were born in one month than in another.155 The same
has since been found to be the case in other European countries.
According to Wappäus, the number of births in Sardinia, Belgium,
Holland, and Sweden is subject to a regular increase twice a year,
the maximum of the first increase occurring in February or March,
that of the second in September and October.156 M. Sormani
observed that, in the south of Italy, there is an increase only once in
the year, but more to the north twice, in spring and in autumn.157
Dr. Mayr and Dr. Beukemann found in Germany two annual maxima
—in February or March, and in September;—158 and Dr. Haycraft
states that, in the eight largest towns of Scotland, more children are
born in legitimate wedlock in April than in any other month.159 As a
rule, according to M. Sormani, the first annual augmentation of
births has its maximum, in Sweden, in March; in France and Holland,
between February and March; in Belgium, Spain, Austria and Italy, in
February; in Greece, in January; so that it comes earlier in southern
Europe than farther to the north.160 Again, the second annual
increase is found more considerable the more to the north we go. In
South Germany it is smaller than the first one, but in North Germany
generally larger;161 and in Sweden, it is decidedly larger.162
As to non-European countries, Wappäus observed that in
Massachusetts, the birth-rate likewise underwent an increase twice a
year, the maxima falling in March and September; and that in Chili
many more children were born in September and October—i.e., at
the beginning of spring—than in any other month.163 Finally, Mr.
S. A. Hill, of Allahabad, has proved, by statistical data, that, among
the Hindus of that province, the birth-rates exhibit a most distinct
annual variation, the minimum falling in June and the maximum in
September and October.164
This unequal distribution of births over the different months of the
year is ascribed to various causes by statisticians. It is, however,
generally admitted that the maximum in February and March (in
Chili, September) is, at least to a great extent, due to the sexual
instinct being strongest in May and June (in Chili, December).165
This is the more likely to be the case as it is especially illegitimate
births that are then comparatively numerous. And it appears
extremely probable that, in Africa also, the higher birth-rates in the
seasons of the early rains owe their origin to the same cause.
Thus, comparing the facts stated, we find, among various races of
men, the sexual instinct increasing at the end of spring, or, rather, at
the beginning of summer. Some peoples of India seem to form an
exception to this rule, lascivious festivals, in the case of several of
them, taking place in the month of January, and the maximum of
births, among the Hindus of Allahabad, falling at the end of the hot
season, or in early autumn. But in India also there are traces of
strengthened passions in spring. M. Rousselet gives the following
description of the indecent Holi festival, as it is celebrated among the
Hindus of Oudeypour. “The festival of Holi marks the arrival of
spring, and is held in honour of the goddess Holica, or Vasanti, who
personifies that season in the Hindu Pantheon. The carnival lasts
several days, during which time the most licentious debauchery and
disorder reign throughout every class of society. It is the regular
saturnalia of India. Persons of the greatest respectability, without
regard to rank or age, are not ashamed to take part in the orgies
which mark this season of the year.... Women and children crowd
round the hideous idols of the feast of Holica, and deck them with
flowers; and immorality reigns supreme in the streets of the
capital.”166 Among the Aryans who inhabited the plains of the North,
the spring, or “vasanta,” corresponding to the months of March and
April, was the season of love and pleasure, celebrated in song by the
poets, and the time for marriages and religious feasts.167 And
among the Rajputs of Mewar, according to Lieutenant-Colonel Tod,
the last days of spring are dedicated to Camdéva, the god of love:
“the scorching winds of the hot season are already beginning to
blow, when Flora droops her head, and the ‘god of love turns
anchorite.’”168
We must not, however, infer that this enhancement of the
procreative power is to be attributed directly to “the different
positions of the sun with respect to the earth,”169 or to the
temperature of a certain season. The phenomenon does not
immediately spring from this cause in the case of any other animal
species. Neither can it be due to abundance of food. In the northern
parts of Europe many more conceptions take place in the months of
May and June, when the conditions of life are often rather hard,
than in September, October, and November, when the supplies of
food are comparatively plentiful. In the north-western provinces of
Germany, as well as in Sweden, the latter months are characterized
by a minimum of conceptions.170 Among the Kaffirs, more children
are conceived in November and December than in any other month,
although, according to the Rev. H. T. Cousins, food is most abundant
among them from March to September. And among the Bateke, the
maximum of conceptions falls in December and January, although
food is, as I am informed by Dr. Sims, most plentiful in the dry
season, that is, from May to the end of August.
On the other hand, the periodical increase of conceptions cannot
be explained by the opposite hypothesis, entertained by some
physiologists, that the power of reproduction is increased by want
and distress. Among the Western Australians and Californians,171 for
instance, the season of love is accompanied by a surplus of food,
and in the land of the Bakongo, among whom Mr. Ingham believes
most conceptions to take place in December and January, food is,
according to him, most abundant precisely in these months and in
February.
It seems, therefore, a reasonable presumption that the increase of
the sexual instinct at the end of spring or in the beginning of
summer, is a survival of an ancient pairing season, depending upon
the same law that rules in the rest of the animal kingdom. Since
spring is rather a time of want than a time of abundance for a
frugivorous species, it is impossible to believe that our early
ancestors, as long as they fed upon fruits, gave birth to their young
at the beginning of that period. From the statements of Sir Richard
Burton and Mr. A. R. Wallace, already quoted,172 we know that the
man-like apes breed early in the season when fruits begin to be
plentiful. But when man began to feed on herbs, roots, and animal
food, the conditions were changed. Spring is the season of the re-
awakening of life, when there are plenty of vegetables and prey.
Hence those children whose infancy fell in this period survived more
frequently than those born at any other. Considering that the parents
of at least a few of them must have had an innate tendency to the
increase of the power of reproduction at the beginning of summer,
and considering, further, that this tendency must have been
transmitted to some of the offspring, like many other characteristics
which occur periodically at certain seasons,173 we can readily
understand that gradually, through the influence of natural selection,
a race would emerge whose pairing time would be exclusively or
predominantly restricted to the season most favourable to its
subsistence. To judge from the period when most children are born
among existing peoples, the pairing season of our prehistoric
ancestors occurred, indeed, somewhat earlier in the year than is the
case with the majority of mammalian species. But we must
remember that the infancy of man is unusually long; and, with
regard to the time most favourable to the subsistence of children,
we must take into consideration not only the first days of their
existence, but the first period of their infancy in general. Besides
food and warmth, several other factors affect the welfare of the
offspring, and it is often difficult to find out all of them. We do not
know the particular circumstances that make the badger breed at
the end of February or the beginning of March,174 and the reindeer
of the Norwegian mountains as early as April;175 but there can be
no doubt that these breeding seasons are adapted to the
requirements of the respective species.
The cause of the winter maximum of conceptions, especially
considerable among the peoples of Northern Europe, is generally
sought in social influences, as the quiet ensuing on the harvest time,
the better food, and the amusements of Christmas.176 But the
people certainly recover before December from the labours of the
field, and Christmas amusements, as Wargentin remarks, take place
at the end of that month and far into January, without any particular
influence upon the number of births in October being observable.177
It has, further, been proved that the unequal distribution of
marriages over the different months exercises hardly any influence
upon the distribution of births.178 Again, among the Hindus the
December and January maximum of conceptions seems from the
lascivious festivities of several Indian peoples to be due to an
increase of the sexual instinct. According to Mr. Hill, this increase
depends upon healthy conditions with an abundant food supply. But,
as I have already said, it is not proved that a strengthened power of
reproduction and abundance of food are connected with one
another.
I am far from venturing to express any definite opinion as to the
cause of these particular phenomena, but it is not impossible that
they also are effects of natural selection, although of a comparatively
recent date. Considering that the September maximum of births (or
December maximum of conceptions) in Europe becomes larger the
farther north we go; that the agricultural peoples of Northern Europe
have plenty of food in autumn and during the first part of winter, but
often suffer a certain degree of want in spring; and, finally, that the
winter cold does not affect the health of infants, the woods giving
sufficient material for fuel,—it has occurred to me that children born
in September may have a better chance of surviving than others.
Indeed, Dr. Beukemann states that the number of still-born births is
largest in winter or at the beginning of spring, and that “the children
born in autumn possess the greatest vitality and resisting power
against the dangers of earliest infancy.”179 This would perhaps be an
adequate explanation either of an increase of the sexual instinct or
of greater disposition to impregnation in December. It is not
impossible either, that the increase of the power of reproduction
among the Hindus in December and January, which causes an
increase of births in September and October—i.e., the end of the hot
season and the beginning of winter—owes its origin to the fact that
during the winter the granaries get filled and some of the conditions
of life become more healthy. But it should be remarked that
September itself, according to Mr. Hill, is a very unhealthy month.180
Now it can be explained, I believe for the first time, how it
happens that man, unlike the lower animals, is not limited to a
particular period of the year in which to court the female.181 The
Darwinian theory of natural selection can, as it seems to me,
account for the periodicity of the sexual instinct in such a rude race
as the Western Australians, among whom the mortality of children is
so enormous that the greater number of them do not survive even
the first month after birth,182 and who inhabit a land pre-eminently
unproductive of animals and vegetables fitted to sustain human life,
a land where, “during the summer seasons, the black man riots in
comparative abundance, but during the rest of the year ... the
struggle for existence becomes very severe.”183 The more progress
man makes in arts and inventions; the more he acquires the power
of resisting injurious external influences; the more he rids himself of
the necessity of freezing when it is cold, and starving when nature is
less lavish with food; in short, the more independent he becomes of
the changes of the seasons—the greater is the probability that
children born at one time of the year will survive as well, or almost
as well, as those born at any other. Variations as regards the pairing
time, always likely to occur occasionally, will do so the more
frequently on account of changed conditions of life, which directly or
indirectly cause variability of every kind;184 and these variations will
be preserved and transmitted to following generations. Thus we can
understand how a race has arisen, endowed with the ability to
procreate children in any season. We can also understand how, even
in such a rude race as the Yahgans in Tierra del Fuego, the
seasonable distribution of births seems to be pretty equal, as there
is, according to the Rev. T. Bridges, “such a variety of food in the
various seasons that there is strictly no period of hardship, save such
as is caused by accidents of weather.” We can explain, too, why the
periodical fluctuation in the number of births, though comparatively
inconsiderable in every civilized society, is greater in countries
predominantly agricultural, such as Chili, than in countries
predominantly industrial, as Saxony;185 why it is greater in rural
districts than in towns;186 and why it was greater in Sweden in the
middle of the last century than it is now.187 For the more man has
abandoned natural life out of doors, the more luxury has increased
and his habits have got refined, the greater is the variability to which
his sexual life has become subject, and the smaller has been the
influence exerted upon it by the changes of the seasons.
Man has thus gone through the same transition as certain
domestic animals. The he-goat188 and the ass in southern
countries,189 for instance, rut throughout the whole year. The
domestic pig pairs generally twice a year, while its wild ancestors
had but one rutting season.190 Dr. Hermann Müller has even
observed a canary that laid eggs in autumn and winter.191 Natural
selection cannot, of course, account for such alterations: they fall
under the law of variation. It is the limited pairing season that is a
product of this powerful process, which acts with full force only
under conditions free from civilization and domestication.
If the hypothesis set forth in this chapter holds good, it must be
admitted that the continued excitement of the sexual instinct could
not have played a part in the origin of human marriage—provided
that this institution did exist among primitive men. Whether this was
the case I shall examine in the following chapters.
CHAPTER III
THE ANTIQUITY OF HUMAN MARRIAGE
If it be admitted that marriage, as a necessary requirement for the
existence of certain species, is connected with some peculiarities in
their organism, and, more particularly among the highest monkeys,
with the paucity of their progeny and their long period of infancy,—it
must at the same time be admitted that, among primitive men, from
the same causes as among these animals, the sexes in all probability
kept together till after the birth of the offspring. Later on, when the
human race passed beyond its frugivorous stage and spread over
the earth, living chiefly on animal food, the assistance of an adult
male became still more necessary for the subsistence of the children.
Everywhere the chase devolves on the man, it being a rare
exception among savage peoples for a woman to engage in it.192
Under such conditions a family consisting of mother and young only,
would probably, as a rule, have succumbed.
It has, however, been suggested that, in olden times, the natural
guardian of the children was not the father, but the maternal
uncle.193 This inference has been drawn chiefly from the common
practice of a nephew succeeding his mother’s brother in rank and
property. But sometimes the relation between the two is still more
intimate. “The Malay family properly so called, the Sa-Mandei,” says
a Dutch writer, as quoted by Professor Giraud-Teulon, “consists of
the mother and her children: the father is no part of it. The ties of
kinship which unite him to his brothers and sisters are closer than
those which bind him to his wife and his own children. Even after his
marriage he continues to live in his maternal family; it is there
that his true home is, and not in his wife’s house: he does not cease
to cultivate his own family’s field, to work for it, and helps his
wife only occasionally. The head of the family is ordinarily the
eldest brother on the mother’s side (the mamak or avunculus). By his
rights and his duties, he is the true father of his sister’s
children.”194 As
regards the mountaineers of Georgia, especially the Pshaves, M.
Kovalevsky states that, among them, “the mother’s brother takes the
father’s place in all circumstances where shed blood has to be
avenged, above all in the case of a murder committed upon the person
of his nephew.”195 Among the Goajiro Indians,196 the Negroes of
Bondo,197 the Barea, and the Bazes,198 it is the mother’s brother
who has the right of selling a girl to her suitor. Touching the Kois,
the Rev. John Cain says, “The maternal uncle of any Koi girl has the
right to bestow her hand on any one of his sons, or any other
suitable candidate who meets with his approval. The father and the
mother of the girl have no acknowledged voice in the matter. A
similar custom prevails amongst some of the Komâti (Vaiśya)
caste.”199 Among the Savaras in India, the bridegroom has to give a
bullock not only to the girl’s father, but to the maternal uncle;200
whilst among the Creeks, the proxy of the suitor asked for the
consent of the uncles, aunts, and brothers of the young woman,
“the father having no voice or authority in the business.”201
But such cases are rare. Besides, most of them imply only that the
children in a certain way belong to the uncle, not that the father is
released from the obligation of supporting them. Even where
succession runs through females only, the father is nearly always
certainly the head of the family. Thus, for instance, in Melanesia,
where the clan of the children is determined by that of the mother,
“the mother is,” to quote Dr. Codrington, “in no way the head of the
family. The house of the family is the father’s, the garden is his, the
rule and government are his.”202 Nor is there any reason to believe
that it was generally otherwise in former times. A man could not of
course be the guardian of his sister’s children, if he did not live in
close connection with them. But except in such a decidedly
anomalous case as that of the Malays, just referred to, this could
scarcely happen unless marriages were contracted between persons
living closely together. Nowadays, however, such marriages are
usually avoided, and I shall endeavour later on to show that they
were probably also avoided by our remote ancestors.
It might, further, be objected that the children were equally well or
better provided for, if not the fathers only, but all the males of the
tribe indiscriminately were their guardians. The supporters of the
hypothesis of promiscuity, and even other sociologists, as for
instance Herr Kautsky,203 believe that this really was the case
among primitive men. According to them, the tribe or horde is the
primary social unit of the human race, and the family only a
secondary unit, developed in later times. Indeed, this assumption
has been treated by many writers, not as a more or less probable
hypothesis, but as a demonstrated truth. Yet the idea that a man’s
children belong to the tribe, has no foundation in fact. Everywhere
we find the tribes or clans composed of several families, the
members of each family being more closely connected with one
another than with the rest of the tribe. The family, consisting of
parents, children, and often also their next descendants, is a
universal institution among existing peoples.204 And it seems
extremely probable that, among our earliest human ancestors, the
family formed, if not the society itself, at least the nucleus of it. As
this is a question of great importance, I must deal with it at some
length.
Mr. Darwin remarks, “Judging from the analogy of the majority of
the Quadrumana, it is probable that the early ape-like progenitors of
man were likewise social.”205 But it may be doubted whether Mr.
Darwin would have drawn this inference, had he taken into
consideration the remarkable fact that none of the monkeys most
nearly allied to man can be called social animals.
The solitary life of the Orang-utan has already been noted. As
regards Gorillas, Dr. Savage states that there is only one adult male
attached to each group;206 and Mr. Reade says expressly that they
