Speech and Language Processing

An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition

Third Edition draft

Daniel Jurafsky
Stanford University

James H. Martin
University of Colorado at Boulder

Copyright ©2020. All rights reserved.

Draft of December 30, 2020. Comments and typos welcome!


Summary of Contents
1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Regular Expressions, Text Normalization, Edit Distance . . . . . . . . . 2
3 N-gram Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 Naive Bayes and Sentiment Classification . . . . . . . . . . . . . . . . . . . . . . . 55
5 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6 Vector Semantics and Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7 Neural Networks and Neural Language Models . . . . . . . . . . . . . . . . . 127
8 Sequence Labeling for Parts of Speech and Named Entities . . . . . . 148
9 Deep Learning Architectures for Sequence Processing . . . . . . . . . . . 173
10 Contextual Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
11 Machine Translation and Encoder-Decoder Models . . . . . . . . . . . . . 203
12 Constituency Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
13 Constituency Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
14 Dependency Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
15 Logical Representations of Sentence Meaning . . . . . . . . . . . . . . . . . . . 305
16 Computational Semantics and Semantic Parsing . . . . . . . . . . . . . . . . 331
17 Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
18 Word Senses and WordNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
19 Semantic Role Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
20 Lexicons for Sentiment, Affect, and Connotation . . . . . . . . . . . . . . . . 393
21 Coreference Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
22 Discourse Coherence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
23 Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
24 Chatbots & Dialogue Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
25 Phonetics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
26 Automatic Speech Recognition and Text-to-Speech . . . . . . . . . . . . . . 548
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
Subject Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607

Contents
1 Introduction 1

2 Regular Expressions, Text Normalization, Edit Distance 2


2.1 Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Text Normalization . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Minimum Edit Distance . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 27
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3 N-gram Language Models 29


3.1 N-Grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Evaluating Language Models . . . . . . . . . . . . . . . . . . . . 35
3.3 Generalization and Zeros . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Kneser-Ney Smoothing . . . . . . . . . . . . . . . . . . . . . . . 45
3.6 Huge Language Models and Stupid Backoff . . . . . . . . . . . . 47
3.7 Advanced: Perplexity’s Relation to Entropy . . . . . . . . . . . . 49
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 52
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4 Naive Bayes and Sentiment Classification 55


4.1 Naive Bayes Classifiers . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Training the Naive Bayes Classifier . . . . . . . . . . . . . . . . . 59
4.3 Worked example . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Optimizing for Sentiment Analysis . . . . . . . . . . . . . . . . . 61
4.5 Naive Bayes for other text classification tasks . . . . . . . . . . . 63
4.6 Naive Bayes as a Language Model . . . . . . . . . . . . . . . . . 64
4.7 Evaluation: Precision, Recall, F-measure . . . . . . . . . . . . . . 65
4.8 Test sets and Cross-validation . . . . . . . . . . . . . . . . . . . . 67
4.9 Statistical Significance Testing . . . . . . . . . . . . . . . . . . . 69
4.10 Avoiding Harms in Classification . . . . . . . . . . . . . . . . . . 72
4.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 73
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5 Logistic Regression 76
5.1 Classification: the sigmoid . . . . . . . . . . . . . . . . . . . . . 77
5.2 Learning in Logistic Regression . . . . . . . . . . . . . . . . . . . 81
5.3 The cross-entropy loss function . . . . . . . . . . . . . . . . . . . 82
5.4 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.6 Multinomial logistic regression . . . . . . . . . . . . . . . . . . . 90
5.7 Interpreting models . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.8 Advanced: Deriving the Gradient Equation . . . . . . . . . . . . . 93
5.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94


Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 94


Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

6 Vector Semantics and Embeddings 96


6.1 Lexical Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2 Vector Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3 Words and Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4 Cosine for measuring similarity . . . . . . . . . . . . . . . . . . . 105
6.5 TF-IDF: Weighing terms in the vector . . . . . . . . . . . . . . . 106
6.6 Pointwise Mutual Information (PMI) . . . . . . . . . . . . . . . . 109
6.7 Applications of the tf-idf or PPMI vector models . . . . . . . . . . 111
6.8 Word2vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.9 Visualizing Embeddings . . . . . . . . . . . . . . . . . . . . . . . 118
6.10 Semantic properties of embeddings . . . . . . . . . . . . . . . . . 118
6.11 Bias and Embeddings . . . . . . . . . . . . . . . . . . . . . . . . 120
6.12 Evaluating Vector Models . . . . . . . . . . . . . . . . . . . . . . 122
6.13 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 123
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

7 Neural Networks and Neural Language Models 127


7.1 Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.2 The XOR problem . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.3 Feed-Forward Neural Networks . . . . . . . . . . . . . . . . . . . 133
7.4 Training Neural Nets . . . . . . . . . . . . . . . . . . . . . . . . 137
7.5 Neural Language Models . . . . . . . . . . . . . . . . . . . . . . 142
7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 147

8 Sequence Labeling for Parts of Speech and Named Entities 148


8.1 (Mostly) English Word Classes . . . . . . . . . . . . . . . . . . . 149
8.2 Part-of-Speech Tagging . . . . . . . . . . . . . . . . . . . . . . . 151
8.3 Named Entities and Named Entity Tagging . . . . . . . . . . . . . 153
8.4 HMM Part-of-Speech Tagging . . . . . . . . . . . . . . . . . . . 155
8.5 Conditional Random Fields (CRFs) . . . . . . . . . . . . . . . . . 162
8.6 Evaluation of Named Entity Recognition . . . . . . . . . . . . . . 167
8.7 Further Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 170
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

9 Deep Learning Architectures for Sequence Processing 173


9.1 Language Models Revisited . . . . . . . . . . . . . . . . . . . . . 174
9.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . 176
9.3 Managing Context in RNNs: LSTMs and GRUs . . . . . . . . . . 186
9.4 Self-Attention Networks: Transformers . . . . . . . . . . . . . . . 190
9.5 Potential Harms from Language Models . . . . . . . . . . . . . . 198
9.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 200

10 Contextual Embeddings 202



11 Machine Translation and Encoder-Decoder Models 203


11.1 Language Divergences and Typology . . . . . . . . . . . . . . . . 205
11.2 The Encoder-Decoder Model . . . . . . . . . . . . . . . . . . . . 208
11.3 Encoder-Decoder with RNNs . . . . . . . . . . . . . . . . . . . . 209
11.4 Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
11.5 Beam Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
11.6 Encoder-Decoder with Transformers . . . . . . . . . . . . . . . . 217
11.7 Some practical details on building MT systems . . . . . . . . . . . 218
11.8 MT Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
11.9 Bias and Ethical Issues . . . . . . . . . . . . . . . . . . . . . . . 226
11.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 228
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

12 Constituency Grammars 231


12.1 Constituency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
12.2 Context-Free Grammars . . . . . . . . . . . . . . . . . . . . . . . 232
12.3 Some Grammar Rules for English . . . . . . . . . . . . . . . . . . 237
12.4 Treebanks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
12.5 Grammar Equivalence and Normal Form . . . . . . . . . . . . . . 249
12.6 Lexicalized Grammars . . . . . . . . . . . . . . . . . . . . . . . . 250
12.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 256
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257

13 Constituency Parsing 259


13.1 Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
13.2 CKY Parsing: A Dynamic Programming Approach . . . . . . . . 261
13.3 Span-Based Neural Constituency Parsing . . . . . . . . . . . . . . 267
13.4 Evaluating Parsers . . . . . . . . . . . . . . . . . . . . . . . . . . 269
13.5 Partial Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
13.6 CCG Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
13.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 278
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

14 Dependency Parsing 280


14.1 Dependency Relations . . . . . . . . . . . . . . . . . . . . . . . . 281
14.2 Dependency Formalisms . . . . . . . . . . . . . . . . . . . . . . . 283
14.3 Dependency Treebanks . . . . . . . . . . . . . . . . . . . . . . . 284
14.4 Transition-Based Dependency Parsing . . . . . . . . . . . . . . . 285
14.5 Graph-Based Dependency Parsing . . . . . . . . . . . . . . . . . 296
14.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
14.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 302
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

15 Logical Representations of Sentence Meaning 305


15.1 Computational Desiderata for Representations . . . . . . . . . . . 306
15.2 Model-Theoretic Semantics . . . . . . . . . . . . . . . . . . . . . 308
15.3 First-Order Logic . . . . . . . . . . . . . . . . . . . . . . . . . . 311
15.4 Event and State Representations . . . . . . . . . . . . . . . . . . . 318

15.5 Description Logics . . . . . . . . . . . . . . . . . . . . . . . . . . 323


15.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 329
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330

16 Computational Semantics and Semantic Parsing 331

17 Information Extraction 332


17.1 Relation Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 333
17.2 Relation Extraction Algorithms . . . . . . . . . . . . . . . . . . . 336
17.3 Extracting Times . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
17.4 Extracting Events and their Times . . . . . . . . . . . . . . . . . . 348
17.5 Template Filling . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
17.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 353
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354

18 Word Senses and WordNet 355


18.1 Word Senses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
18.2 Relations Between Senses . . . . . . . . . . . . . . . . . . . . . . 358
18.3 WordNet: A Database of Lexical Relations . . . . . . . . . . . . . 360
18.4 Word Sense Disambiguation . . . . . . . . . . . . . . . . . . . . . 363
18.5 Alternate WSD algorithms and Tasks . . . . . . . . . . . . . . . . 366
18.6 Using Thesauruses to Improve Embeddings . . . . . . . . . . . . 369
18.7 Word Sense Induction . . . . . . . . . . . . . . . . . . . . . . . . 369
18.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 371
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372

19 Semantic Role Labeling 373


19.1 Semantic Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
19.2 Diathesis Alternations . . . . . . . . . . . . . . . . . . . . . . . . 375
19.3 Semantic Roles: Problems with Thematic Roles . . . . . . . . . . 376
19.4 The Proposition Bank . . . . . . . . . . . . . . . . . . . . . . . . 377
19.5 FrameNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
19.6 Semantic Role Labeling . . . . . . . . . . . . . . . . . . . . . . . 380
19.7 Selectional Restrictions . . . . . . . . . . . . . . . . . . . . . . . 384
19.8 Primitive Decomposition of Predicates . . . . . . . . . . . . . . . 389
19.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 390
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392

20 Lexicons for Sentiment, Affect, and Connotation 393


20.1 Defining Emotion . . . . . . . . . . . . . . . . . . . . . . . . . . 394
20.2 Available Sentiment and Affect Lexicons . . . . . . . . . . . . . . 395
20.3 Creating Affect Lexicons by Human Labeling . . . . . . . . . . . 398
20.4 Semi-supervised Induction of Affect Lexicons . . . . . . . . . . . 399
20.5 Supervised Learning of Word Sentiment . . . . . . . . . . . . . . 402
20.6 Using Lexicons for Sentiment Recognition . . . . . . . . . . . . . 406
20.7 Other tasks: Personality . . . . . . . . . . . . . . . . . . . . . . . 407
20.8 Affect Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 408
20.9 Lexicon-based methods for Entity-Centric Affect . . . . . . . . . . 410

20.10 Connotation Frames . . . . . . . . . . . . . . . . . . . . . . . . . 411


20.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 413

21 Coreference Resolution 415


21.1 Coreference Phenomena: Linguistic Background . . . . . . . . . . 418
21.2 Coreference Tasks and Datasets . . . . . . . . . . . . . . . . . . . 423
21.3 Mention Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 424
21.4 Architectures for Coreference Algorithms . . . . . . . . . . . . . 427
21.5 Classifiers using hand-built features . . . . . . . . . . . . . . . . . 429
21.6 A neural mention-ranking algorithm . . . . . . . . . . . . . . . . 430
21.7 Evaluation of Coreference Resolution . . . . . . . . . . . . . . . . 434
21.8 Winograd Schema problems . . . . . . . . . . . . . . . . . . . . . 435
21.9 Gender Bias in Coreference . . . . . . . . . . . . . . . . . . . . . 436
21.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 438
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441

22 Discourse Coherence 442


22.1 Coherence Relations . . . . . . . . . . . . . . . . . . . . . . . . . 444
22.2 Discourse Structure Parsing . . . . . . . . . . . . . . . . . . . . . 447
22.3 Centering and Entity-Based Coherence . . . . . . . . . . . . . . . 451
22.4 Representation learning models for local coherence . . . . . . . . 456
22.5 Global Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . 458
22.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 461
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463

23 Question Answering 464


23.1 Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . 465
23.2 IR-based Factoid Question Answering . . . . . . . . . . . . . . . 473
23.3 Entity Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
23.4 Knowledge-based Question Answering . . . . . . . . . . . . . . . 482
23.5 Using Language Models to do QA . . . . . . . . . . . . . . . . . 484
23.6 Classic QA Models . . . . . . . . . . . . . . . . . . . . . . . . . 485
23.7 Evaluation of Factoid Answers . . . . . . . . . . . . . . . . . . . 488
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 489
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491

24 Chatbots & Dialogue Systems 492


24.1 Properties of Human Conversation . . . . . . . . . . . . . . . . . 493
24.2 Chatbots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
24.3 GUS: Simple Frame-based Dialogue Systems . . . . . . . . . . . 504
24.4 The Dialogue-State Architecture . . . . . . . . . . . . . . . . . . 508
24.5 Evaluating Dialogue Systems . . . . . . . . . . . . . . . . . . . . 517
24.6 Dialogue System Design . . . . . . . . . . . . . . . . . . . . . . . 520
24.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 523
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525

25 Phonetics 526
25.1 Speech Sounds and Phonetic Transcription . . . . . . . . . . . . . 526

25.2 Articulatory Phonetics . . . . . . . . . . . . . . . . . . . . . . . . 527


25.3 Prosody . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
25.4 Acoustic Phonetics and Signals . . . . . . . . . . . . . . . . . . . 534
25.5 Phonetic Resources . . . . . . . . . . . . . . . . . . . . . . . . . 544
25.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 546
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547

26 Automatic Speech Recognition and Text-to-Speech 548


26.1 The Automatic Speech Recognition Task . . . . . . . . . . . . . . 549
26.2 Feature Extraction for ASR: Log Mel Spectrum . . . . . . . . . . 551
26.3 Speech Recognition Architecture . . . . . . . . . . . . . . . . . . 555
26.4 CTC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
26.5 ASR Evaluation: Word Error Rate . . . . . . . . . . . . . . . . . 562
26.6 TTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
26.7 Other Speech Tasks . . . . . . . . . . . . . . . . . . . . . . . . . 569
26.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 570
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573

Bibliography 575
Subject Index 607
CHAPTER

1 Introduction

La dernière chose qu’on trouve en faisant un ouvrage est de savoir celle qu’il faut
mettre la première.
[The last thing you figure out in writing a book is what to put first.]
Pascal


CHAPTER

2 Regular Expressions, Text Normalization, Edit Distance
User: I am unhappy.
ELIZA: DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
User: I need some help, that much seems certain.
ELIZA: WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
User: Perhaps I could learn to get along with my mother.
ELIZA: TELL ME MORE ABOUT YOUR FAMILY
User: My mother takes care of me.
ELIZA: WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
User: My father.
ELIZA: YOUR FATHER
User: You are like my father in some ways.
Weizenbaum (1966)
The dialogue above is from ELIZA, an early natural language processing system
that could carry on a limited conversation with a user by imitating the responses of
a Rogerian psychotherapist (Weizenbaum, 1966). ELIZA is a surprisingly simple
program that uses pattern matching to recognize phrases like “I need X” and translate
them into suitable outputs like “What would it mean to you if you got X?”. This
simple technique succeeds in this domain because ELIZA doesn’t actually need to
know anything to mimic a Rogerian psychotherapist. As Weizenbaum notes, this is
one of the few dialogue genres where listeners can act as if they know nothing of the
world. ELIZA’s mimicry of human conversation was remarkably successful: many
people who interacted with ELIZA came to believe that it really understood them
and their problems, many continued to believe in ELIZA’s abilities even after the
program’s operation was explained to them (Weizenbaum, 1976), and even today
such chatbots are a fun diversion.
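As a rough illustration, the “I need X” rule can be sketched in a few lines of Python. This is not Weizenbaum’s implementation: the function name eliza_reply, the fallback reply, and the decision to end X at the first punctuation mark are all assumptions made for this sketch.

```python
import re

# Hypothetical rule: capture the words after "I need", stopping at the
# first punctuation mark (an assumption made for this sketch).
NEED_RULE = re.compile(r"\bI need ([^,.!?]+)", re.IGNORECASE)

def eliza_reply(utterance):
    """Return an ELIZA-style reply using a single 'I need X' rule."""
    match = NEED_RULE.search(utterance)
    if match is None:
        return "PLEASE GO ON"  # hypothetical fallback when the rule does not apply
    wanted = match.group(1).strip().upper()
    return f"WHAT WOULD IT MEAN TO YOU IF YOU GOT {wanted}"

print(eliza_reply("I need some help, that much seems certain."))
# WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
```

The reply is built purely from the matched substring, which is why no knowledge of the world is required.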
Of course modern conversational agents are much more than a diversion; they
can answer questions, book flights, or find restaurants, functions for which they rely
on a much more sophisticated understanding of the user’s intent, as we will see in
Chapter 24. Nonetheless, the simple pattern-based methods that powered ELIZA
and other chatbots play a crucial role in natural language processing.
We’ll begin with the most important tool for describing text patterns: the regular
expression. Regular expressions can be used to specify strings we might want to
extract from a document, from transforming “I need X” in ELIZA above, to defining
strings like $199 or $24.99 for extracting tables of prices from a document.
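To make the price example concrete, here is a minimal sketch using Python’s re module. The pattern and the sample sentence are our own assumptions, and the pattern uses some notation (the + and the (?:...) group) that this part of the chapter has not yet introduced; it only handles prices written as a dollar sign, digits, and an optional two-digit cents part.

```python
import re

# A dollar sign, one or more digits, and optionally a period plus two digits.
PRICE = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?")

# Made-up sample text for illustration.
text = "The basic plan is $199 and the add-on costs $24.99 per month."
prices = PRICE.findall(text)
print(prices)  # ['$199', '$24.99']
```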
We’ll then turn to a set of tasks collectively called text normalization, in which
regular expressions play an important part. Normalizing text means converting it
to a more convenient, standard form. For example, most of what we are going to
do with language relies on first separating out or tokenizing words from running
text, the task of tokenization. English words are often separated from each other
by whitespace, but whitespace is not always sufficient. New York and rock ’n’ roll
are sometimes treated as large words despite the fact that they contain spaces, while
sometimes we’ll need to separate I’m into the two words I and am. For processing
tweets or texts we’ll need to tokenize emoticons like :) or hashtags like #nlproc.
Some languages, like Japanese, don’t have spaces between words, so word
tokenization becomes more difficult.
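The limits of whitespace can be seen in a small sketch. The tokenizer below is only an illustration under stated assumptions: the pattern, the two hard-coded multiword expressions, and the tiny emoticon inventory are hypothetical choices, and it keeps I'm as one token rather than splitting it into I and am.

```python
import re

TOKEN = re.compile(
    r"New York|rock 'n' roll"   # hypothetical list of multiword expressions
    r"|#\w+"                    # hashtags such as #nlproc
    r"|[:;]-?[)(]"              # a few simple emoticons such as :) or ;-(
    r"|\w+(?:'\w+)?"            # words, keeping clitics attached (I'm)
    r"|[^\w\s]"                 # any other single punctuation character
)

def tokenize(text):
    """Return the tokens found by the pattern, scanning left to right."""
    return TOKEN.findall(text)

text = "I'm in New York, listening to rock 'n' roll :) #nlproc"
print(text.split())
# ["I'm", 'in', 'New', 'York,', 'listening', 'to', 'rock', "'n'", 'roll', ':)', '#nlproc']
print(tokenize(text))
# ["I'm", 'in', 'New York', ',', 'listening', 'to', "rock 'n' roll", ':)', '#nlproc']
```

Splitting on whitespace breaks New York apart and leaves the comma attached to York, while the pattern-based version keeps the multiword expressions whole and separates the punctuation.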
Another part of text normalization is lemmatization, the task of determining
that two words have the same root, despite their surface differences. For example,
the words sang, sung, and sings are forms of the verb sing. The word sing is the
common lemma of these words, and a lemmatizer maps from all of these to sing.
Lemmatization is essential for processing morphologically complex languages like
Arabic. Stemming refers to a simpler version of lemmatization in which we mainly
just strip suffixes from the end of the word. Text normalization also includes
sentence segmentation: breaking up a text into individual sentences, using cues like
periods or exclamation points.
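The following sketch shows how crude suffix stripping and punctuation-based sentence splitting might look. Both functions are hypothetical illustrations, not established algorithms: the suffix list and the minimum stem length are arbitrary assumptions, and the sentence splitter will wrongly split after abbreviations that end in a period.

```python
import re

# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def split_sentences(text):
    """Split after a period, question mark, or exclamation point plus whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)

print([crude_stem(w) for w in ["sings", "singing", "sang", "sung"]])
# ['sing', 'sing', 'sang', 'sung']
print(split_sentences("Hello there. How are you? Fine!"))
# ['Hello there.', 'How are you?', 'Fine!']
```

Note that suffix stripping leaves sang and sung untouched, which is exactly the kind of case that requires a real lemmatizer.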
Finally, we’ll need to compare words and other strings. We’ll introduce a metric
called edit distance that measures how similar two strings are based on the number
of edits (insertions, deletions, substitutions) it takes to change one string into the
other. Edit distance has applications throughout language processing, from
spelling correction to speech recognition to coreference resolution.
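As a preview, edit distance can be computed with a small dynamic-programming table. The sketch below assumes that every insertion, deletion, and substitution costs 1; other cost schemes are possible and would give different numbers. The function name is our own.

```python
def edit_distance(source, target):
    """Minimum number of unit-cost insertions, deletions, and substitutions."""
    n, m = len(source), len(target)
    # dist[i][j] holds the distance between source[:i] and target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i          # delete all i characters of the source prefix
    for j in range(1, m + 1):
        dist[0][j] = j          # insert all j characters of the target prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
                dist[i - 1][j - 1] + sub_cost,  # substitution or match
            )
    return dist[n][m]

print(edit_distance("kitten", "sitting"))  # 3
```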

2.1 Regular Expressions


One of the unsung successes in standardization in computer science has been the
regular expression (RE), a language for specifying text search strings. This
practical language is supported in virtually every programming language, in word
processors, and in text-processing tools like the Unix tool grep and the editor
Emacs. Formally, a regular expression is an algebraic notation for characterizing
a set of strings. Regular expressions are particularly useful for searching in
texts, when we have a pattern to search for and a corpus of
texts to search through. A regular expression search function will search through the
corpus, returning all texts that match the pattern. The corpus can be a single
document or a collection. For example, the Unix command-line tool grep takes a regular
expression and returns every line of the input document that matches the expression.
A search can be designed to return every match on a line, if there is more than
one, or just the first match. In the following examples we generally underline the
exact part of the example string that matches the regular expression and show only
the first match. We’ll show regular expressions delimited by slashes but note that
slashes are not part of the regular expressions.
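The behavior described above can be imitated with Python’s re module. This is a minimal sketch, not a reimplementation of the Unix tool: the function name grep and the sample lines are our own, and the function only mimics the “return every matching line” behavior.

```python
import re

def grep(pattern, lines):
    """Return every line that contains a match for the pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Made-up corpus of three lines.
lines = ["the cat sat on the mat", "a dog barked", "the end"]
print(grep("the", lines))
# ['the cat sat on the mat', 'the end']

# First match only versus every match on a line.
first = re.search("the", lines[0])
print(first.span())                  # (0, 3)
print(re.findall("the", lines[0]))   # ['the', 'the']
```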
Regular expressions come in many variants. We’ll be describing extended
regular expressions; different regular expression parsers may only recognize subsets of
these, or treat some expressions slightly differently. Using an online regular
expression tester is a handy way to test out your expressions and explore these variations.

2.1.1 Basic Regular Expression Patterns


The simplest kind of regular expression is a sequence of simple characters. To search
for woodchuck, we type /woodchuck/. The expression /Buttercup/ matches any
string containing the substring Buttercup; grep with that expression would return the
line I’m called little Buttercup. The search string can consist of a single character
(like /!/) or a sequence of characters (like /urgl/).

RE              Example Patterns Matched
/woodchucks/    “interesting links to woodchucks and lemurs”
/a/             “Mary Ann stopped by Mona’s”
/!/             “You’ve left the burglar behind again!” said Nori
Figure 2.1 Some simple regex searches.

Regular expressions are case sensitive; lower case /s/ is distinct from upper
case /S/ (/s/ matches a lower case s but not an upper case S). This means that
the pattern /woodchucks/ will not match the string Woodchucks. We can solve this
problem with the use of the square braces [ and ]. The string of characters inside the
braces specifies a disjunction of characters to match. For example, Fig. 2.2 shows
that the pattern /[wW]/ matches patterns containing either w or W.

RE                 Match                     Example Patterns
/[wW]oodchuck/     Woodchuck or woodchuck    “Woodchuck”
/[abc]/            ‘a’, ‘b’, or ‘c’          “In uomini, in soldati”
/[1234567890]/     any digit                 “plenty of 7 to 5”
Figure 2.2 The use of the brackets [] to specify a disjunction of characters.
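The rows of Fig. 2.2 can be checked directly with Python’s re module, as in the sketch below. The variable names are our own; re.search returns only the first match, which mirrors the convention used in the figures.

```python
import re

# Plain character sequences are case sensitive, so this finds nothing.
no_match = re.search(r"woodchucks", "Woodchucks")
print(no_match)                 # None

# Brackets specify a disjunction of single characters.
either_case = re.search(r"[wW]oodchuck", "Woodchuck")
print(either_case.group())      # Woodchuck

first_abc = re.search(r"[abc]", "In uomini, in soldati")
print(first_abc.group())        # a

first_digit = re.search(r"[1234567890]", "plenty of 7 to 5")
print(first_digit.group())      # 7
```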

The regular expression /[1234567890]/ specifies any single digit. While such
classes of characters as digits or letters are important building blocks in expressions,
they can get awkward (e.g., it’s inconvenient to specify
/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/
to mean “any capital letter”). In cases where there is a well-defined sequence
associated with a set of characters, the brackets can be used with the dash (-) to specify
any one character in a range. The pattern /[2-5]/ specifies any one of the
characters 2, 3, 4, or 5. The pattern /[b-g]/ specifies one of the characters b, c, d, e, f, or
g. Some other examples are shown in Fig. 2.3.

RE         Match                    Example Patterns Matched
/[A-Z]/    an upper case letter     “we should call it ‘Drenched Blossoms’ ”
/[a-z]/    a lower case letter      “my beans were impatient to be hoed!”
/[0-9]/    a single digit           “Chapter 1: Down the Rabbit Hole”
Figure 2.3 The use of the brackets [] plus the dash - to specify a range.
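A short sketch in Python confirms what the ranges match. The test strings "0123456789" and "abcdefgh" are our own additions; the other strings are taken from Fig. 2.3 with straight quotes substituted.

```python
import re

# Each range matches exactly one character drawn from the range.
digits_2_to_5 = re.findall(r"[2-5]", "0123456789")
letters_b_to_g = re.findall(r"[b-g]", "abcdefgh")
print(digits_2_to_5)    # ['2', '3', '4', '5']
print(letters_b_to_g)   # ['b', 'c', 'd', 'e', 'f', 'g']

# First matches for the rows of Fig. 2.3.
first_upper = re.search(r"[A-Z]", "we should call it 'Drenched Blossoms'").group()
first_lower = re.search(r"[a-z]", "my beans were impatient to be hoed!").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group()
print(first_upper, first_lower, first_digit)   # D m 1
```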

The square braces can also be used to specify what a single character cannot be,
by use of the caret ^. If the caret ^ is the first symbol after the open square brace [,
the resulting pattern is negated. For example, the pattern /[^a]/ matches any single
character (including special characters) except a. This is only true when the caret
is the first symbol after the open square brace. If it occurs anywhere else, it usually
stands for a caret; Fig. 2.4 shows some examples.

RE          Match (single characters)    Example Patterns Matched
/[^A-Z]/    not an upper case letter     “Oyfn pripetchik”
/[^Ss]/     neither ‘S’ nor ‘s’          “I have no exquisite reason for’t”
/[^.]/      not a period                 “our resident Djinn”
/[e^]/      either ‘e’ or ‘^’            “look up ^ now”
/a^b/       the pattern ‘a^b’            “look up a^b now”
Figure 2.4 The caret ^ for negation or just to mean ^. See below re: the backslash for escaping the period.
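The bracketed rows of Fig. 2.4 behave the same way in Python’s re module, but the last row is a case where regular expression variants differ. In Python, a caret outside brackets is a start-of-string anchor, so the pattern a^b written as-is never matches; the caret has to be escaped with a backslash to be taken literally. The sketch below, with variable names of our own, shows both behaviors.

```python
import re

not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()
not_s = re.search(r"[^Ss]", "I have no exquisite reason for't").group()
not_period = re.search(r"[^.]", "our resident Djinn").group()
e_or_caret = re.search(r"[e^]", "look up ^ now").group()
print(not_upper, not_s, not_period, e_or_caret)   # y I o ^

# Outside brackets, Python reads ^ as an anchor, so this finds nothing.
unescaped = re.search(r"a^b", "look up a^b now")
print(unescaped)           # None

# Escaping the caret makes it a literal character.
escaped = re.search(r"a\^b", "look up a^b now")
print(escaped.group())     # a^b
```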

How can we talk about optional elements, like an optional s in woodchuck and
woodchucks? We can’t use the square brackets, because while they allow us to say
“s or S”, they don’t allow us to say “s or nothing”. For this we use the question mark
/?/, which means “the preceding character or nothing”, as shown in Fig. 2.5.

RE              Match                     Example Patterns Matched
/woodchucks?/   woodchuck or woodchucks   “woodchuck”
/colou?r/       color or colour           “color”
Figure 2.5 The question mark ? marks optionality of the previous expression.

We can think of the question mark as meaning “zero or one instances of the
previous character”. That is, it’s a way of specifying how many of something that
we want, something that is very important in regular expressions. For example,
consider the language of certain sheep, which consists of strings that look like the
following:
baa!
baaa!
baaaa!
baaaaa!
...
This language consists of strings with a b, followed by at least two a’s, followed
by an exclamation point. The set of operators that allows us to say things like “some
number of as” is based on the asterisk or *, commonly called the Kleene * (generally
pronounced “cleany star”). The Kleene star means “zero or more occurrences
of the immediately previous character or regular expression”. So /a*/ means “any
string of zero or more as”. This will match a or aaaaaa, but it will also match Off
Minor since the string Off Minor has zero a’s. So the regular expression for matching
one or more a is /aa*/, meaning one a followed by zero or more as. More complex
patterns can also be repeated. So /[ab]*/ means “zero or more a’s or b’s” (not
“zero or more right square braces”). This will match strings like aaaa or ababab or
bbbb.
For specifying multiple digits (useful for finding prices) we can extend /[0-9]/,
the regular expression for a single digit. An integer (a string of digits) is thus
/[0-9][0-9]*/. (Why isn’t it just /[0-9]*/?)
Sometimes it’s annoying to have to write the regular expression for digits twice,
so there is a shorter way to specify “at least one” of some character. This is the
Kleene +, which means “one or more occurrences of the immediately preceding
character or regular expression”. Thus, the expression /[0-9]+/ is the normal way
to specify “a sequence of digits”. There are thus two ways to specify the sheep
language: /baaa*!/ or /baa+!/.
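
The counting operators above can be checked with a short Python sketch. It assumes Python's re module; the candidate strings are made up for illustration. fullmatch is used for the sheep language so that the whole string, not just a substring, has to fit the pattern.

```python
import re

# Optionality with ?
print(re.findall(r"woodchucks?", "a woodchuck and two woodchucks"))
# ['woodchuck', 'woodchucks']
print(re.findall(r"colou?r", "color or colour"))
# ['color', 'colour']

# Two equivalent patterns for the sheep language: b, at least two a's, then !
sheep_star = re.compile(r"baaa*!")
sheep_plus = re.compile(r"baa+!")
candidates = ["baa!", "baaaa!", "ba!", "b!"]
accepted_star = [s for s in candidates if sheep_star.fullmatch(s)]
accepted_plus = [s for s in candidates if sheep_plus.fullmatch(s)]
print(accepted_star)   # ['baa!', 'baaaa!']
print(accepted_plus)   # ['baa!', 'baaaa!']

# One or more digits
print(re.findall(r"[0-9]+", "plenty of 7 to 5, or 42"))   # ['7', '5', '42']

# The star alone is satisfied by zero digits, so the first match can be empty
empty = re.search(r"[0-9]*", "no digits first: 7").group(0)
print(repr(empty))   # ''
```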
One very important special character is the period (/./), a wildcard expression
that matches any single character (except a carriage return), as shown in Fig. 2.6.

RE        Match                             Example Matches
/beg.n/   any character between beg and n   begin, beg’n, begun
Figure 2.6 The use of the period . to specify any character.

The wildcard is often used together with the Kleene star to mean “any string of
characters”. For example, suppose we want to find any line in which a particular
word, for example, aardvark, appears twice. We can specify this with the regular
expression /aardvark.*aardvark/.
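
As a small illustration of the wildcard, the following Python sketch (assuming the re module; the sample lines are invented) applies /beg.n/ to a string and uses /aardvark.*aardvark/ to select the lines that mention the word twice.

```python
import re

# The period matches exactly one character between "beg" and "n"
print(re.findall(r"beg.n", "begin, beg'n, begun, begn"))
# ['begin', "beg'n", 'begun']

# Wildcard plus Kleene star: any run of characters between two occurrences
lines = [
    "an aardvark met another aardvark",
    "a single aardvark",
    "no such animal here",
]
twice = [line for line in lines if re.search(r"aardvark.*aardvark", line)]
print(twice)   # ['an aardvark met another aardvark']
```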
Anchors are special characters that anchor regular expressions to particular places
in a string. The most common anchors are the caret ^ and the dollar sign $. The caret
^ matches the start of a line. The pattern /^The/ matches the word The only at the
start of a line. Thus, the caret ^ has three uses: to match the start of a line, to
indicate a negation inside of square brackets, and just to mean a caret. (What are the
contexts that allow grep or Python to know which function a given caret is supposed
to have?) The dollar sign $ matches the end of a line. So the pattern / $/ is a useful
pattern for matching a space at the end of a line, and /^The dog\.$/ matches a
line that contains only the phrase The dog. (We have to use the backslash here since
we want the . to mean “period” and not the wildcard.)

RE    Match
^     start of line
$     end of line
\b    word boundary
\B    non-word boundary
Figure 2.7 Anchors in regular expressions.

There are also two other anchors: \b matches a word boundary, and \B matches
a non-boundary. Thus, /\bthe\b/ matches the word the but not the word other.
More technically, a “word” for the purposes of a regular expression is defined as any
sequence of digits, underscores, or letters; this is based on the definition of “words”
in programming languages. For example, /\b99\b/ will match the string 99 in
There are 99 bottles of beer on the wall (because 99 follows a space) but not 99 in
There are 299 bottles of beer on the wall (since 99 follows a number). But it will
match 99 in $99 (since 99 follows a dollar sign ($), which is not a digit, underscore,
or letter).
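
The anchors can be exercised with the sketch below, which assumes Python's re module. One Python-specific detail matters here: ^ and $ match at the start and end of every line only when the re.MULTILINE flag is given; without it they anchor to the whole string. The sample text is invented for illustration.

```python
import re

text = "The dog.\nThe other dog.\nthen the end"

# Start-of-line and end-of-line anchors (per line, hence re.MULTILINE)
starts = re.findall(r"^The", text, flags=re.MULTILINE)
only_the_dog = re.findall(r"^The dog\.$", text, flags=re.MULTILINE)
trailing = re.findall(r" $", "trailing space \nnone", flags=re.MULTILINE)
print(starts)         # ['The', 'The']
print(only_the_dog)   # ['The dog.']
print(trailing)       # [' ']

# Word boundaries: "the" as a word, but not inside "other"
print(re.findall(r"\bthe\b", "the other one"))   # ['the']

# The 99 examples from the text
for s in ["There are 99 bottles", "There are 299 bottles", "$99"]:
    print(s, bool(re.search(r"\b99\b", s)))
```

The last loop prints True, False, True, matching the discussion of what counts as a word boundary.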

2.1.2 Disjunction, Grouping, and Precedence


Suppose we need to search for texts about pets; perhaps we are particularly interested
in cats and dogs. In such a case, we might want to search for either the string cat or
the string dog. Since we can’t use the square brackets to search for “cat or dog” (why
can’t we say /[catdog]/?), we need a new operator, the disjunction operator, also
called the pipe symbol |. The pattern /cat|dog/ matches either the string cat or
the string dog.
Sometimes we need to use this disjunction operator in the midst of a larger se-
quence. For example, suppose I want to search for information about pet fish for
my cousin David. How can I specify both guppy and guppies? We cannot simply
say /guppy|ies/, because that would match only the strings guppy and ies. This
is because sequences like guppy take precedence over the disjunction operator |.
To make the disjunction operator apply only to a specific pattern, we need to use the
parenthesis operators ( and ). Enclosing a pattern in parentheses makes it act like
a single character for the purposes of neighboring operators like the pipe | and the
Kleene*. So the pattern /gupp(y|ies)/ would specify that we meant the disjunc-
tion only to apply to the suffixes y and ies.
The parenthesis operator ( is also useful when we are using counters like the
Kleene*. Unlike the | operator, the Kleene* operator applies by default only to
a single character, not to a whole sequence. Suppose we want to match repeated
instances of a string. Perhaps we have a line that has column labels of the form
Column 1 Column 2 Column 3. The expression /Column [0-9]+ */ will not
match any number of columns; instead, it will match a single column followed by
any number of spaces! The star here applies only to the space that precedes it,
not to the whole sequence. With the parentheses, we could write the expression
/(Column [0-9]+ *)*/ to match the word Column, followed by a number and
optional spaces, the whole pattern repeated zero or more times.
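
The effect of parentheses on the pipe and the Kleene star can be seen in the sketch below (Python's re module assumed; all_matches is a helper named here for illustration). It uses finditer rather than findall because findall returns the contents of the parentheses instead of the whole match when a pattern contains a group.

```python
import re

def all_matches(pattern, text):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Without parentheses the pipe separates the whole sequences "guppy" and "ies"
print(all_matches(r"guppy|ies", "guppy guppies"))      # ['guppy', 'ies']
# With parentheses the disjunction applies only to the suffix
print(all_matches(r"gupp(y|ies)", "guppy guppies"))    # ['guppy', 'guppies']

labels = "Column 1 Column 2 Column 3"
# The star applies only to the space that precedes it
one_column = re.match(r"Column [0-9]+ *", labels).group(0)
# The star applies to the whole parenthesized sequence
every_column = re.match(r"(Column [0-9]+ *)*", labels).group(0)
print(repr(one_column))     # 'Column 1 '
print(repr(every_column))   # 'Column 1 Column 2 Column 3'
```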
This idea that one operator may take precedence over another, requiring us to
sometimes use parentheses to specify what we mean, is formalized by the operator
precedence hierarchy for regular expressions. The following table gives the order
of RE operator precedence, from highest precedence to lowest precedence.
Parenthesis              ()
Counters                 * + ? {}
Sequences and anchors    the ^my end$
Disjunction              |
Thus, because counters have a higher precedence than sequences,
/the*/ matches theeeee but not thethe. Because sequences have a higher prece-
dence than disjunction, /the|any/ matches the or any but not thany or theny.
Patterns can be ambiguous in another way. Consider the expression /[a-z]*/
when matching against the text once upon a time. Since /[a-z]*/ matches zero or
more letters, this expression could match nothing, or just the first letter o, on, onc,
or once. In these cases regular expressions always match the largest string they can;
we say that patterns are greedy, expanding to cover as much of a string as they can.
There are, however, ways to enforce non-greedy matching, using another meaning
of the ? qualifier. The operator *? is a Kleene star that matches as little text as
possible. The operator +? is a Kleene plus that matches as little text as possible.
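
The difference between greedy and non-greedy matching shows up directly on the example text. This is a minimal Python sketch assuming the re module; re.match is used so that every pattern is tried at the start of the string.

```python
import re

text = "once upon a time"

greedy_star = re.match(r"[a-z]*", text).group(0)    # as much as possible
lazy_star = re.match(r"[a-z]*?", text).group(0)     # as little as possible: nothing
greedy_plus = re.match(r"[a-z]+", text).group(0)
lazy_plus = re.match(r"[a-z]+?", text).group(0)     # as little as possible: one letter

print(repr(greedy_star))   # 'once'
print(repr(lazy_star))     # ''
print(repr(greedy_plus))   # 'once'
print(repr(lazy_plus))     # 'o'
```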

2.1.3 A Simple Example


Suppose we wanted to write a RE to find cases of the English article the. A simple
(but incorrect) pattern might be:
/the/
One problem is that this pattern will miss the word when it begins a sentence and
hence is capitalized (i.e., The). This might lead us to the following pattern:
/[tT]he/
But we will still incorrectly return texts with the embedded in other words (e.g.,
other or theology). So we need to specify that we want instances with a word bound-
ary on both sides:
/\b[tT]he\b/
Suppose we wanted to do this without the use of /\b/. We might want this since
/\b/ won’t treat underscores and numbers as word boundaries; but we might want
to find the in some context where it might also have underlines or numbers nearby
(the_ or the25). We need to specify that we want instances in which there are no
alphabetic letters on either side of the the:
/[^a-zA-Z][tT]he[^a-zA-Z]/
But there is still one more problem with this pattern: it won’t find the word the
when it begins a line. This is because the regular expression [^a-zA-Z], which
we used to avoid embedded instances of the, implies that there must be some single
(although non-alphabetic) character before the the. We can avoid this by specify-
ing that before the the we require either the beginning-of-line or a non-alphabetic
character, and the same at the end of the line:
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/

The process we just went through was based on fixing two kinds of errors: false
positives, strings that we incorrectly matched like other or there, and false negatives,
strings that we incorrectly missed, like The. Addressing these two kinds of
errors comes up again and again in implementing speech and language processing
systems. Reducing the overall error rate for an application thus involves two antag-
onistic efforts:
• Increasing precision (minimizing false positives)
• Increasing recall (minimizing false negatives)
We’ll come back to precision and recall with more precise definitions in Chapter 4.
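
The successive refinements of the pattern for the can be compared by counting matches on a test sentence. The sketch below assumes Python's re module; the sentence and the helper count_matches are invented for illustration. The sentence contains two real instances of the article (The and the) and one embedded instance inside other.

```python
import re

text = "The cat saw the other cat"

def count_matches(pattern, text):
    """Count the non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))

patterns = [
    r"the",                                  # misses "The", matches inside "other"
    r"[tT]he",                               # finds "The", still matches inside "other"
    r"\b[tT]he\b",                           # word boundaries on both sides
    r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)",     # the version without \b
]
counts = [count_matches(p, text) for p in patterns]
print(counts)   # [2, 3, 2, 2]
```

The first pattern also returns 2, but one of its matches is a false positive (inside other) and it has one false negative (The), so the raw count alone does not show that it is wrong.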

2.1.4 More Operators


Figure 2.8 shows some aliases for common ranges, which can be used mainly to
save typing. Besides the Kleene * and Kleene + we can also use explicit numbers as
counters, by enclosing them in curly brackets. The regular expression /{3}/ means
“exactly 3 occurrences of the previous character or expression”. So /a\.{24}z/
will match a followed by 24 dots followed by z (but not a followed by 23 or 25 dots
followed by a z).

RE   Expansion       Match                         First Matches
\d   [0-9]           any digit                     Party of 5
\D   [^0-9]          any non-digit                 Blue moon
\w   [a-zA-Z0-9_]    any alphanumeric/underscore   Daiyu
\W   [^\w]           a non-alphanumeric            !!!!
\s   [ \r\t\n\f]     whitespace (space, tab)
\S   [^\s]           Non-whitespace                in Concord
Figure 2.8 Aliases for common sets of characters.

A range of numbers can also be specified. So /{n,m}/ specifies from n to m
occurrences of the previous char or expression, and /{n,}/ means at least n
occurrences of the previous expression. REs for counting are summarized in Fig. 2.9.

RE Match
* zero or more occurrences of the previous char or expression
+ one or more occurrences of the previous char or expression
? exactly zero or one occurrence of the previous char or expression
{n} n occurrences of the previous char or expression
{n,m} from n to m occurrences of the previous char or expression
{n,} at least n occurrences of the previous char or expression
{,m} up to m occurrences of the previous char or expression
Figure 2.9 Regular expression operators for counting.
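
The explicit counters in Fig. 2.9 behave as follows in Python's re module. This is a sketch with made-up test strings; fullmatch is used where the whole string should satisfy the count.

```python
import re

# Exactly 24 dots between a and z
exactly_24 = bool(re.fullmatch(r"a\.{24}z", "a" + "." * 24 + "z"))
only_23 = bool(re.fullmatch(r"a\.{24}z", "a" + "." * 23 + "z"))
print(exactly_24, only_23)   # True False

numbers = "7 42 123 2020 12345"
two_to_four = re.findall(r"\b[0-9]{2,4}\b", numbers)   # from 2 to 4 digits
three_plus = re.findall(r"\b[0-9]{3,}\b", numbers)     # at least 3 digits
print(two_to_four)   # ['42', '123', '2020']
print(three_plus)    # ['123', '2020', '12345']

# Up to 2 occurrences
print(bool(re.fullmatch(r"a{,2}", "aa")), bool(re.fullmatch(r"a{,2}", "aaa")))
# True False
```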

Finally, certain special characters are referred to by special notation based on the
backslash (\) (see Fig. 2.10). The most common of these are the newline character
\n and the tab character \t. To refer to characters that are special themselves (like
., *, [, and \), precede them with a backslash (i.e., /\./, /\*/, /\[/, and /\\/).

RE   Match               First Patterns Matched
\*   an asterisk “*”     “K*A*P*L*A*N”
\.   a period “.”        “Dr. Livingston, I presume”
\?   a question mark     “Why don’t they come and lend a hand?”
\n   a newline
\t   a tab
Figure 2.10 Some characters that need to be backslashed.

2.1.5 A More Complex Example


Let’s try out a more significant example of the power of REs. Suppose we want to
build an application to help a user buy a computer on the Web. The user might want
“any machine with at least 6 GHz and 500 GB of disk space for less than $1000”.
To do this kind of retrieval, we first need to be able to look for expressions like 6
GHz or 500 GB or Mac or $999.99. In the rest of this section we’ll work out some
simple regular expressions for this task.
First, let’s complete our regular expression for prices. Here’s a regular expres-
sion for a dollar sign followed by a string of digits:
/$[0-9]+/
Note that the $ character has a different function here than the end-of-line function
we discussed earlier. Most regular expression parsers are smart enough to realize
that $ here doesn’t mean end-of-line. (As a thought experiment, think about how
regex parsers might figure out the function of $ from the context.)
Now we just need to deal with fractions of dollars. We’ll add a decimal point
and two digits afterwards:
/$[0-9]+\.[0-9][0-9]/
This pattern only allows $199.99 but not $199. We need to make the cents
optional and to make sure we’re at a word boundary:
/(^|\W)$[0-9]+(\.[0-9][0-9])?\b/
One last catch! This pattern allows prices like $199999.99 which would be far
too expensive! We need to limit the dollars:
/(^|\W)$[0-9]{0,3}(\.[0-9][0-9])?\b/
How about disk space? We’ll need to allow for optional fractions again (5.5 GB);
note the use of ? for making the final s optional, and the use of / */ to mean “zero or
more spaces” since there might always be extra spaces lying around:
/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/
Modifying this regular expression so that it only matches more than 500 GB is
left as an exercise for the reader.
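
A hedged Python version of the price and disk-space patterns is sketched below. Two adaptations are assumptions of this sketch rather than part of the patterns above: in Python's re module an unescaped $ is always an end anchor, so the literal dollar sign is written \$; and the dollar amount uses {1,3} instead of {0,3}, so that at least one digit is required. The helper name find is hypothetical.

```python
import re

# Literal dollar sign escaped as \$; {1,3} requires one to three digits (sketch choice)
price = re.compile(r"(^|\W)\$[0-9]{1,3}(\.[0-9][0-9])?\b")
disk = re.compile(r"\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b")

def find(regex, text):
    """Return the first full match of regex in text, or None."""
    m = regex.search(text)
    return m.group(0) if m else None

print(find(price, "$199.99 today"))         # $199.99
print(repr(find(price, "only $199")))       # ' $199' (the \W consumed the space)
print(find(price, "$199999.99"))            # None

print(find(disk, "with 500 GB of disk"))    # 500 GB
print(find(disk, "a 5.5 gigabyte drive"))   # 5.5 gigabyte
print(find(disk, "500 MB"))                 # None
```

Because (^|\W) consumes the character before the dollar sign, the match can include a leading space; a capture group around the price itself would be one way to return only the amount.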

2.1.6 Substitution, Capture Groups, and ELIZA


An important use of regular expressions is in substitutions. For example, the substitution
operator s/regexp1/pattern/ used in Python and in Unix commands like
vim or sed allows a string characterized by a regular expression to be replaced by
another string:
s/colour/color/
It is often useful to be able to refer to a particular subpart of the string matching
the first pattern. For example, suppose we wanted to put angle brackets around all
integers in a text, for example, changing the 35 boxes to the <35> boxes. We’d
like a way to refer to the integer we’ve found so that we can easily add the brackets.
To do this, we put parentheses ( and ) around the first pattern and use the number
operator \1 in the second pattern to refer back. Here’s how it looks:
s/([0-9]+)/<\1>/
The parenthesis and number operators can also specify that a certain string or
expression must occur twice in the text. For example, suppose we are looking for
the pattern “the Xer they were, the Xer they will be”, where we want to constrain
the two X’s to be the same string. We do this by surrounding the first X with the
parenthesis operator, and replacing the second X with the number operator \1, as
follows:
/the (.*)er they were, the \1er they will be/
Here the \1 will be replaced by whatever string matched the first item in paren-
theses. So this will match the bigger they were, the bigger they will be but not the
bigger they were, the faster they will be.
This use of parentheses to store a pattern in memory is called a capture group.
Every time a capture group is used (i.e., parentheses surround a pattern), the resulting
match is stored in a numbered register. If you match two different sets of
parentheses, \2 means whatever matched the second capture group. Thus
/the (.*)er they (.*), the \1er we \2/
will match the faster they ran, the faster we ran but not the faster they ran, the faster
we ate. Similarly, the third capture group is stored in \3, the fourth is \4, and so on.
Parentheses thus have a double function in regular expressions; they are used to
group terms for specifying the order in which operators should apply, and they are
used to capture something in a register. Occasionally we might want to use parentheses
for grouping, but don’t want to capture the resulting pattern in a register. In that
case we use a non-capturing group, which is specified by putting the commands
?: after the open paren, in the form (?: pattern ).
/(?:some|a few) (people|cats) like some \1/
will match some cats like some cats but not some cats like some a few.
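
In Python the closest counterpart of the s/.../.../ notation is the function re.sub(pattern, replacement, string), which also understands \1 in the replacement. The sketch below illustrates substitution, back-references inside a pattern, and a non-capturing group, using the examples from this section.

```python
import re

# Substitution with a reference to the first capture group
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")
print(bracketed)   # the <35> boxes

# \1 inside the pattern requires the same string to occur twice
xer = r"the (.*)er they were, the \1er they will be"
same = re.search(xer, "the bigger they were, the bigger they will be")
different = re.search(xer, "the bigger they were, the faster they will be")
print(same is not None, different is not None)   # True False

# (?: ...) groups without capturing, so \1 refers to (people|cats)
nc = r"(?:some|a few) (people|cats) like some \1"
print(bool(re.search(nc, "some cats like some cats")))    # True
print(bool(re.search(nc, "some cats like some a few")))   # False
```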
Substitutions and capture groups are very useful in implementing simple chat-
bots like ELIZA (Weizenbaum, 1966). Recall that ELIZA simulates a Rogerian
psychologist by carrying on conversations like the following:

User1 : Men are all alike.
ELIZA1 : IN WHAT WAY
User2 : They’re always bugging us about something or other.
ELIZA2 : CAN YOU THINK OF A SPECIFIC EXAMPLE
User3 : Well, my boyfriend made me come here.
ELIZA3 : YOUR BOYFRIEND MADE YOU COME HERE
User4 : He says I’m depressed much of the time.
ELIZA4 : I AM SORRY TO HEAR YOU ARE DEPRESSED

ELIZA works by having a series or cascade of regular expression substitutions
each of which matches and changes some part of the input lines. Input lines are
first uppercased. The first substitutions then change all instances of MY to YOUR,
and I’M to YOU ARE, and so on. The next set of substitutions matches and replaces
other patterns in the input. Here are some examples:
s/.* I’M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

Since multiple substitutions can apply to a given input, substitutions are assigned
a rank and applied in order. Creating patterns is the topic of Exercise 2.3, and we
return to the details of the ELIZA architecture in Chapter 24.
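
The ranked rules above can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not Weizenbaum's program: it skips the first stage of pronoun substitutions, applies the four example rules in the order listed and stops at the first one that fires, matches case-insensitively instead of uppercasing the input first, and falls back to a default reply (PLEASE GO ON) that is invented here. The names RULES and respond are hypothetical.

```python
import re

# (pattern, replacement) pairs in rank order, taken from the examples in the text
RULES = [
    (r".* I'M (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* I AM (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(line, default="PLEASE GO ON"):
    """Apply the first rule whose pattern matches; otherwise use the default."""
    for pattern, replacement in RULES:
        reply, n_subs = re.subn(pattern, replacement, line, flags=re.IGNORECASE)
        if n_subs > 0:
            return reply.upper()
    return default

print(respond("Men are all alike."))
# IN WHAT WAY
print(respond("They're always bugging us about something or other."))
# CAN YOU THINK OF A SPECIFIC EXAMPLE
print(respond("He says I'm depressed much of the time."))
# I AM SORRY TO HEAR YOU ARE DEPRESSED
```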

2.1.7 Lookahead Assertions


Finally, there will be times when we need to predict the future: look ahead in the
text to see if some pattern matches, but not advance the match cursor, so that we can
then deal with the pattern if it occurs.
These lookahead assertions make use of the (? syntax that we saw in the previous
section for non-capture groups. The operator (?= pattern) is true if pattern
occurs, but is zero-width, i.e. the match pointer doesn’t advance. The operator
(?! pattern) only returns true if a pattern does not match, but again is zero-width
and doesn’t advance the cursor. Negative lookahead is commonly used when we
are parsing some complex pattern but want to rule out a special case. For example
suppose we want to match, at the beginning of a line, any single word that doesn’t
start with “Volcano”. We can use negative lookahead to do this:
/^(?!Volcano)[A-Za-z]+/
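
Both kinds of lookahead can be tried in Python's re module, as in the sketch below. The helper first_word and the sample strings are invented for illustration.

```python
import re

def first_word(line):
    """Return the first word of line unless it starts with 'Volcano'."""
    m = re.match(r"^(?!Volcano)[A-Za-z]+", line)
    return m.group(0) if m else None

print(first_word("Mountains rise"))    # Mountains
print(first_word("Volcanoes erupt"))   # None

# Positive lookahead: digits that are followed by " GB"; the " GB" is not consumed
sizes = re.findall(r"[0-9]+(?= GB)", "500 GB and 16 MB")
print(sizes)   # ['500']
```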

2.2 Words
Before we talk about processing words, we need to decide what counts as a word.
Let’s start by looking at one particular corpus (plural corpora), a computer-readable
collection of text or speech. For example the Brown corpus is a million-word
collection of samples from 500 written English texts from different genres (newspaper,
fiction, non-fiction, academic, etc.), assembled at Brown University in 1963–64
(Kučera and Francis, 1967). How many words are in the following Brown sentence?
He stepped out into the hall, was delighted to encounter a water brother.
This sentence has 13 words if we don’t count punctuation marks as words, 15
if we count punctuation. Whether we treat period (“.”), comma (“,”), and so on as
words depends on the task. Punctuation is critical for finding boundaries of things
(commas, periods, colons) and for identifying some aspects of meaning (question
marks, exclamation marks, quotation marks). For some tasks, like part-of-speech
tagging or parsing or speech synthesis, we sometimes treat punctuation marks as if
they were separate words.
The Switchboard corpus of American English telephone conversations between
strangers was collected in the early 1990s; it contains 2430 conversations averaging
6 minutes each, totaling 240 hours of speech and about 3 million words (Godfrey
et al., 1992). Such corpora of spoken language don’t have punctuation but do
introduce other complications with regard to defining words. Let’s look at one utterance
from Switchboard; an utterance is the spoken correlate of a sentence:
I do uh main- mainly business data processing
This utterance has two kinds of disfluencies. The broken-off word main- is
called a fragment. Words like uh and um are called fillers or filled pauses. Should
we consider these to be words? Again, it depends on the application. If we are
building a speech transcription system, we might want to eventually strip out the
disfluencies.
But we also sometimes keep disfluencies around. Disfluencies like uh or um
are actually helpful in speech recognition in predicting the upcoming word, because
they may signal that the speaker is restarting the clause or idea, and so for speech
recognition they are treated as regular words. Because people use different disflu-
encies they can also be a cue to speaker identification. In fact Clark and Fox Tree
(2002) showed that uh and um have different meanings. What do you think they are?
Are capitalized tokens like They and uncapitalized tokens like they the same
word? These are lumped together in some tasks (speech recognition), while for part-
of-speech or named-entity tagging, capitalization is a useful feature and is retained.
How about inflected forms like cats versus cat? These two words have the same
lemma cat but are different wordforms. A lemma is a set of lexical forms having
the same stem, the same major part-of-speech, and the same word sense. The wordform
is the full inflected or derived form of the word. For morphologically complex
languages like Arabic, we often need to deal with lemmatization. For many tasks in
English, however, wordforms are sufficient.
How many words are there in English? To answer this question we need to
distinguish two ways of talking about words. Types are the number of distinct words
in a corpus; if the set of words in the vocabulary is V, the number of types is the
vocabulary size |V|. Tokens are the total number N of running words. If we ignore
punctuation, the following Brown sentence has 16 tokens and 14 types:
They picnicked by the pool, then lay back on the grass and looked at the stars.
When we speak about the number of words in the language, we are generally
referring to word types.
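
The token and type counts for the two Brown sentences can be reproduced with a short Python sketch. It makes one simplifying assumption: a word is any maximal run of ASCII letters, and punctuation is counted only when the pattern explicitly includes it.

```python
import re

picnic = "They picnicked by the pool, then lay back on the grass and looked at the stars."
tokens = re.findall(r"[A-Za-z]+", picnic)   # running words, punctuation ignored
types = set(tokens)                         # distinct words
print(len(tokens), len(types))              # 16 14

hall = "He stepped out into the hall, was delighted to encounter a water brother."
words_only = re.findall(r"[A-Za-z]+", hall)
with_punct = re.findall(r"[A-Za-z]+|[.,]", hall)
print(len(words_only), len(with_punct))     # 13 15
```

The two counts for the first sentence differ because the appears three times.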

Corpus                                Tokens = N     Types = |V|
Shakespeare                           884 thousand   31 thousand
Brown corpus                          1 million      38 thousand
Switchboard telephone conversations   2.4 million    20 thousand
COCA                                  440 million    2 million
Google N-grams                        1 trillion     13 million
Figure 2.11 Rough numbers of types and tokens for some English language corpora. The
largest, the Google N-grams corpus, contains 13 million types, but this count only includes
types appearing 40 or more times, so the true number would be much larger.

Fig. 2.11 shows the rough numbers of types and tokens computed from some
popular English corpora. The larger the corpora we look at, the more word types
we find, and in fact this relationship between the number of types |V| and number
of tokens N is called Herdan’s Law (Herdan, 1960) or Heaps’ Law (Heaps, 1978)
after its discoverers (in linguistics and information retrieval respectively). It is shown
in Eq. 2.1, where k and β are positive constants, and 0 < β < 1.

|V| = kN^β    (2.1)

The value of β depends on the corpus size and the genre, but at least for the large
corpora in Fig. 2.11, β ranges from .67 to .75. Roughly then we can say that the
vocabulary size for a text goes up significantly faster than the square root of its
length in words.
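
To make the growth rate concrete, the sketch below evaluates Eq. 2.1 in Python. The constant k = 10 and the corpus sizes are hypothetical values chosen only for illustration; the point is that doubling N multiplies |V| by 2^β regardless of k, which for β between .67 and .75 lies between the square-root rate (about 1.41) and linear growth (2).

```python
import math

def vocabulary_size(n_tokens, k, beta):
    """Herdan's/Heaps' Law: |V| = k * N**beta."""
    return k * n_tokens ** beta

def doubling_factor(beta, k=10.0, n_tokens=1_000_000):
    """Factor by which |V| grows when the number of tokens doubles."""
    return vocabulary_size(2 * n_tokens, k, beta) / vocabulary_size(n_tokens, k, beta)

for beta in (0.67, 0.75):
    print(beta, round(doubling_factor(beta), 2))
# 0.67 1.59
# 0.75 1.68
```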
Another measure of the number of words in the language is the number of lem-
mas instead of wordform types. Dictionaries can help in giving lemma counts; dic-
tionary entries or boldface forms are a very rough upper bound on the number of
lemmas (since some lemmas have multiple boldface forms). The 1989 edition of the
Oxford English Dictionary had 615,000 entries.

2.3 Corpora
Words don’t appear out of nowhere. Any particular piece of text that we study
is produced by one or more specific speakers or writers, in a specific dialect of a
specific language, at a specific time, in a specific place, for a specific function.
Perhaps the most important dimension of variation is the language. NLP algo-
rithms are most useful when they apply across many languages. The world has 7097
languages at the time of this writing, according to the online Ethnologue catalog
(Simons and Fennig, 2018). It is important to test algorithms on more than one lan-
guage, and particularly on languages with different properties; by contrast there is
an unfortunate current tendency for NLP algorithms to be developed or tested just
on English (Bender, 2019). Even when algorithms are developed beyond English,
they tend to be developed for the official languages of large industrialized nations
(Chinese, Spanish, Japanese, German etc.), but we don’t want to limit tools to just
these few languages. Furthermore, most languages also have multiple varieties, of-
ten spoken in different regions or by different social groups. Thus, for example, if
we’re processing text that uses features of African American Language (AAL),
the name for the many variations of language used by millions of people in African
American communities (King 2020), we must use NLP tools that function with
features of those varieties. Twitter posts might use features often used by speakers of
African American Language, such as constructions like iont (I don’t in Mainstream
American English (MAE)), or talmbout corresponding to MAE talking about, both
examples that influence word segmentation (Blodgett et al. 2016, Jones 2015).
It’s also quite common for speakers or writers to use multiple languages in a
single communicative act, a phenomenon called code switching. Code switching
is enormously common across the world; here are examples showing Spanish
and (transliterated) Hindi code switching with English (Solorio et al. 2014, Jurgens
et al. 2017):
(2.2) Por primera vez veo a @username actually being hateful! it was beautiful:)
[For the first time I get to see @username actually being hateful! it was
beautiful:) ]
(2.3) dost tha or ra- hega ... dont wory ... but dherya rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
Another dimension of variation is the genre. The text that our algorithms must
process might come from newswire, fiction or non-fiction books, scientific articles,
Wikipedia, or religious texts. It might come from spoken genres like telephone
conversations, business meetings, police body-worn cameras, medical interviews,
or transcripts of television shows or movies. It might come from work situations
like doctors’ notes, legal text, or parliamentary or congressional proceedings.


Text also reflects the demographic characteristics of the writer (or speaker): their
age, gender, race, socioeconomic class can all influence the linguistic properties of
the text we are processing.
And finally, time matters too. Language changes over time, and for some lan-
guages we have good corpora of texts from different historical periods.
Because language is so situated, when developing computational models for lan-
guage processing from a corpus, it’s important to consider who produced the lan-
guage, in what context, for what purpose. How can a user of a dataset know all these
details? The best way is for the corpus creator to build a datasheet (Gebru et al.,
2020) or data statement (Bender and Friedman, 2018) for each corpus. A datasheet
specifies properties of a dataset like:
Motivation: Why was the corpus collected, by whom, and who funded it?
Situation: When and in what situation was the text written/spoken? For example,
was there a task? Was the language originally spoken conversation, edited
text, social media communication, monologue vs. dialogue?
Language variety: What language (including dialect/region) was the corpus in?
Speaker demographics: What was, e.g., age or gender of the authors of the text?
Collection process: How big is the data? If it is a subsample how was it sampled?
Was the data collected with consent? How was the data pre-processed, and
what metadata is available?
Annotation process: What are the annotations, what are the demographics of the
annotators, how were they trained, how was the data annotated?
Distribution: Are there copyright or other intellectual property restrictions?

2.4 Text Normalization


Before almost any natural language processing of a text, the text has to be normal-
ized. At least three tasks are commonly applied as part of any normalization process:
1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
In the next sections we walk through each of these tasks.

2.4.1 Unix Tools for Crude Tokenization and Normalization


Let’s begin with an easy, if somewhat naive version of word tokenization and nor-
malization (and frequency computation) that can be accomplished for English solely
in a single UNIX command-line, inspired by Church (1994). We’ll make use of some
Unix commands: tr, used to systematically change particular characters in the in-
put; sort, which sorts input lines in alphabetical order; and uniq, which collapses
and counts adjacent identical lines.
For example let’s begin with the ‘complete words’ of Shakespeare in one textfile,
sh.txt. We can use tr to tokenize the words by changing every sequence of non-alphabetic
characters to a newline ('A-Za-z' means alphabetic, the -c option complements
to non-alphabet, and the -s option squeezes all sequences into a single
character):
tr -sc 'A-Za-z' '\n' < sh.txt


The output of this command will be:
THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
Now that there is one word per line, we can sort the lines, and pass them to uniq
-c which will collapse and count them:
tr -sc 'A-Za-z' '\n' < sh.txt | sort | uniq -c
with the following output:
1945 A
72 AARON
19 ABBESS
25 Aaron
6 Abate
1 Abates
5 Abbess
6 Abbey
3 Abbot
...
Alternatively, we can collapse all the upper case to lower case:
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c
whose output is
14725 a
97 aaron
1 abaissiez
10 abandon
2 abandoned
2 abase
1 abash
14 abate
3 abated
3 abatement
...
Now we can sort again to find the frequent words. The -n option to sort means
to sort numerically rather than alphabetically, and the -r option means to sort in
reverse order (highest-to-lowest):
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r
The results show that the most frequent words in Shakespeare, as in any other
corpus, are the short function words like articles, pronouns, prepositions:
27378 the
26084 and
22538 i
19771 to
17481 of
14725 a
13826 you
12489 my
11318 that
11112 in
...

Unix tools of this sort can be very handy in building quick word count statistics
for any corpus.
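
For comparison, roughly the same computation as the Unix pipeline can be written in Python. This is a sketch using only the standard library; the function name word_frequencies and the sample sentence are made up, and the text is passed in as a string rather than read from sh.txt.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercased alphabetic words, mirroring the tr | sort | uniq -c pipeline."""
    words = re.findall(r"[A-Za-z]+", text)   # like: tr -sc 'A-Za-z' '\n'
    lowered = (w.lower() for w in words)     # like: tr A-Z a-z
    return Counter(lowered)                  # like: sort | uniq -c

sample = "The cat and the dog. The end!"
counts = word_frequencies(sample)

# like: sort -n -r (most frequent first)
for word, n in counts.most_common(1):
    print(n, word)   # 3 the
```

To run it on a file, the contents could be read with open("sh.txt").read() and passed to word_frequencies.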

2.4.2 Word Tokenization


The simple UNIX tools above were fine for getting rough word statistics but more
sophisticated algorithms are generally necessary for tokenization, the task of
segmenting running text into words.
While the Unix command sequence just removed all the numbers and punctu-
ation, for most NLP applications we’ll need to keep these in our tokenization. We
often want to break off punctuation as a separate token; commas are a useful piece of
information for parsers, periods help indicate sentence boundaries. But we’ll often
want to keep the punctuation that occurs word internally, in examples like m.p.h.,
Ph.D., AT&T, and cap’n. Special characters and numbers will need to be kept in
prices ($45.55) and dates (01/02/06); we don’t want to segment that price into
separate tokens of “45” and “55”. And there are URLs (http://www.stanford.edu),
Twitter hashtags (#nlproc), or email addresses ([email protected]).
Number expressions introduce other complications as well; while commas nor-
mally appear at word boundaries, commas are used inside numbers in English, every
three digits: 555,500.50. Languages, and hence tokenization requirements, differ
on this; many continental European languages like Spanish, French, and German, by
contrast, use a comma to mark the decimal point, and spaces (or sometimes periods)
where English puts commas, for example, 555 500,50.
A tokenizer can also be used to expand clitic contractions that are marked by
apostrophes, for example, converting what’re to the two tokens what are, and
we’re to we are. A clitic is a part of a word that can’t stand on its own, and can only
occur when it is attached to another word. Some such contractions occur in other
alphabetic languages, including articles and pronouns in French (j’ai, l’homme).
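Clitic expansion of this kind can be sketched with a small lookup table and one regular expression. The table below is a hypothetical, deliberately tiny sample that uses ASCII apostrophes; ambiguous clitics such as 's are left out because they need context to resolve.

```python
import re

# Hypothetical sample mapping from clitic to full form (not exhaustive).
CLITICS = {"'re": "are", "'ll": "will", "'ve": "have"}

# A word followed immediately by one of the clitics above.
CLITIC_PATTERN = re.compile(r"(\w+)('re|'ll|'ve)\b")

def expand_clitics(text):
    """Rewrite each word+clitic as two space-separated tokens."""
    return CLITIC_PATTERN.sub(
        lambda m: f"{m.group(1)} {CLITICS[m.group(2)]}", text
    )

print(expand_clitics("what're they doing"))
print(expand_clitics("we're sure they'll come"))
```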
Depending on the application, tokenization algorithms may also tokenize mul-
tiword expressions like New York or rock ’n’ roll as a single token, which re-
quires a multiword expression dictionary of some sort. Tokenization is thus inti-
mately tied up with named entity recognition, the task of detecting names, dates,
and organizations (Chapter 8).
One commonly used tokenization standard is known as the Penn Treebank tokenization
standard, used for the parsed corpora (treebanks) released by the Linguistic
Data Consortium (LDC), the source of many useful datasets. This standard
separates out clitics (doesn’t becomes does plus n’t), keeps hyphenated words together,
and separates out all punctuation (to save space we’re showing the tokens
separated by spaces, although one token per line is a more common output):

Input:  "The San Francisco-based restaurant," they said,
        "doesn’t charge $10".
Output: " The San Francisco-based restaurant , " they said ,
        " does n’t charge $ 10 " .
In practice, since tokenization needs to be run before any other language processing,
it needs to be very fast. The standard method for tokenization is therefore
to use deterministic algorithms based on regular expressions compiled into very
efficient finite state automata. For example, Fig. 2.12 shows an example of a basic
regular expression that can be used to tokenize with the nltk.regexp_tokenize
function of the Python-based Natural Language Toolkit (NLTK) (Bird et al. 2009;
http://www.nltk.org).

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)          # set flag to allow verbose regexps
...     (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)*            # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.                  # ellipsis
...   | [][.,;"'?():-_`]        # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
Figure 2.12 A Python trace of regular expression tokenization in the NLTK Python-based
natural language processing toolkit (Bird et al., 2009), commented for readability; the (?x)
verbose flag tells Python to strip comments and whitespace. Figure from Chapter 3 of Bird
et al. (2009).

Carefully designed deterministic algorithms can deal with the ambiguities that
arise, such as the fact that the apostrophe needs to be tokenized differently when used
as a genitive marker (as in the book’s cover), a quotative as in ‘The other class’, she
said, or in clitics like they’re.
Word tokenization is more complex in languages like written Chinese, Japanese,
and Thai, which do not use spaces to mark potential word-boundaries. In Chinese,
for example, words are composed of characters (called hanzi in Chinese). Each
character generally represents a single unit of meaning (called a morpheme) and is
pronounceable as a single syllable. Words are about 2.4 characters long on average.
But deciding what counts as a word in Chinese is complex. For example, consider
the following sentence:
(2.4) 姚明进入总决赛
“Yao Ming reaches the finals”
As Chen et al. (2017) point out, this could be treated as 3 words (‘Chinese Treebank’
segmentation):
(2.5) 姚明 进入 总决赛
YaoMing reaches finals
or as 5 words (‘Peking University’ segmentation):
(2.6) 姚 明 进入 总 决赛
Yao Ming reaches overall finals
Finally, it is possible in Chinese simply to ignore words altogether and use characters
as the basic elements, treating the sentence as a series of 7 characters:

(2.7) 姚 明 进 入 总 决 赛
Yao Ming enter enter overall decision game

In fact, for most Chinese NLP tasks it turns out to work better to take characters
rather than words as input, since characters are at a reasonable semantic level for
most applications, and since most word standards, by contrast, result in a huge vo-
cabulary with large numbers of very rare words (Li et al., 2019).
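The character-level option is trivial to implement, which is part of its appeal. The sketch below simply splits the example sentence into characters and, for comparison, writes out by hand the two word segmentations discussed above; the variable names are ours.

```python
sentence = "姚明进入总决赛"

# Character-level tokenization: every hanzi is its own token.
characters = list(sentence)

# The two word segmentations from the text, written out by hand.
chinese_treebank = ["姚明", "进入", "总决赛"]
peking_university = ["姚", "明", "进入", "总", "决赛"]

print(len(characters), len(chinese_treebank), len(peking_university))
```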
However, for Japanese and Thai the character is too small a unit, and so algorithms
for word segmentation are required. These can also be useful for Chinese
in the rare situations where word rather than character boundaries are required. The
standard segmentation algorithms for these languages use neural sequence models
trained via supervised machine learning on hand-segmented training sets; we’ll
introduce sequence models in Chapter 8 and Chapter 9.

2.4.3 Byte-Pair Encoding for Tokenization


There is a third option for tokenizing text. Instead of defining tokens as words
(whether delimited by spaces or more complex algorithms), or as characters (as in
Chinese), we can use our data to automatically tell us what the tokens should be.
This is especially useful in dealing with unknown words, an important problem in
language processing. As we will see in the next chapter, NLP algorithms often learn
some facts about language from one corpus (a training corpus) and then use these
facts to make decisions about a separate test corpus and its language. Thus if our
training corpus contains, say, the words low, new, newer, but not lower, then if the
word lower appears in our test corpus, our system will not know what to do with it.
To deal with this unknown word problem, modern tokenizers often automatically
induce sets of tokens that include tokens smaller than words, called subwords.
Subwords can be arbitrary substrings, or they can be meaning-bearing units like the
morphemes -est or -er. (A morpheme is the smallest meaning-bearing unit of a
language; for example the word unlikeliest has the morphemes un-, likely, and -est.)
In modern tokenization schemes, most tokens are words, but some tokens are
frequently occurring morphemes or other subwords like -er. Every unseen word like
lower can thus be represented by some sequence of known subword units, such as
low and er, or even as a sequence of individual letters if necessary.
Most tokenization schemes have two parts: a token learner, and a token seg-
menter. The token learner takes a raw training corpus (sometimes roughly pre-
separated into words, for example by whitespace) and induces a vocabulary, a set
of tokens. The token segmenter takes a raw test sentence and segments it into the
tokens in the vocabulary. Three algorithms are widely used: byte-pair encoding
(Sennrich et al., 2016), unigram language modeling (Kudo, 2018), and WordPiece
(Schuster and Nakajima, 2012); there is also a SentencePiece library that includes
implementations of the first two of the three (Kudo and Richardson, 2018).
In this section we introduce the simplest of the three, the byte-pair encoding or
BPE algorithm (Sennrich et al., 2016); see Fig. 2.13. The BPE token learner begins
with a vocabulary that is just the set of all individual characters. It then examines the
training corpus, chooses the two symbols that are most frequently adjacent (say ‘A’,
‘B’), adds a new merged symbol ‘AB’ to the vocabulary, and replaces every adjacent
‘A’ ‘B’ in the corpus with the new ‘AB’. It continues to count and merge, creating
new longer and longer character strings, until k merges have been done, creating k
novel tokens; k is thus a parameter of the algorithm. The resulting vocabulary
consists of the original set of characters plus k new symbols.

The algorithm is usually run inside words (not merging across word boundaries),
so the input corpus is first white-space-separated to give a set of strings, each
corresponding to the characters of a word, plus a special end-of-word symbol _, and its
counts. Let’s see its operation on the following tiny input corpus of 18 word tokens
with counts for each word (the word low appears 5 times, the word newer 6 times,
and so on), which would have a starting vocabulary of 11 letters:

    corpus               vocabulary
    5   l o w _          _, d, e, i, l, n, o, r, s, t, w
    2   l o w e s t _
    6   n e w e r _
    3   w i d e r _
    2   n e w _

The BPE algorithm first counts all pairs of adjacent symbols: the most frequent
is the pair e r because it occurs in newer (frequency of 6) and wider (frequency of
3) for a total of 9 occurrences.¹ We then merge these symbols, treating er as one
symbol, and count again:

    corpus               vocabulary
    5   l o w _          _, d, e, i, l, n, o, r, s, t, w, er
    2   l o w e s t _
    6   n e w er _
    3   w i d er _
    2   n e w _

Now the most frequent pair is er _, which we merge; our system has learned
that there should be a token for word-final er, represented as er_:

    corpus               vocabulary
    5   l o w _          _, d, e, i, l, n, o, r, s, t, w, er, er_
    2   l o w e s t _
    6   n e w er_
    3   w i d er_
    2   n e w _

Next n e (total count of 8) get merged to ne:

    corpus               vocabulary
    5   l o w _          _, d, e, i, l, n, o, r, s, t, w, er, er_, ne
    2   l o w e s t _
    6   ne w er_
    3   w i d er_
    2   ne w _

If we continue, the next merges are:

    Merge        Current Vocabulary
    (ne, w)      _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new
    (l, o)       _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo
    (lo, w)      _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low
    (new, er_)   _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_
    (low, _)     _, d, e, i, l, n, o, r, s, t, w, er, er_, ne, new, lo, low, newer_, low_
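The walkthrough above can be reproduced with a short Python sketch of the token learner. It is an illustration under stated assumptions rather than a reference implementation: the names merge_pair and learn_bpe are ours, "_" stands for the end-of-word symbol, and ties between equally frequent pairs are broken in favor of the pair seen first when scanning the corpus (one arbitrary but deterministic choice).

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_counts, k):
    """Learn up to k merges; returns (vocabulary, ordered list of merges)."""
    # Each word becomes a tuple of characters plus the end-of-word symbol "_".
    corpus = {tuple(word) + ("_",): count for word, count in word_counts.items()}
    vocab = sorted({symbol for word in corpus for symbol in word})
    merges = []
    for _ in range(k):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, count in corpus.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # max returns the first pair with the highest count (our tie-breaking assumption).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab.append(best[0] + best[1])
        corpus = {merge_pair(word, best): count for word, count in corpus.items()}
    return vocab, merges

word_counts = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
vocab, merges = learn_bpe(word_counts, 8)
print(merges)
print(vocab)
```

With this tie-breaking rule the eight merges come out in the same order as in the walkthrough; a different rule for ties (for example merging r _ before e r) would give a different but equally valid run.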
Once we’ve learned our vocabulary, the token segmenter is used to tokenize a test
sentence. The token segmenter just runs on the test data the merges we have learned
¹ Note that there can be ties; we could have instead chosen to merge r _ first, since that also has a
frequency of 9.

function BYTE-PAIR ENCODING(strings C, number of merges k) returns vocab V

  V ← all unique characters in C                # initial set of tokens is characters
  for i = 1 to k do                             # merge tokens k times
    t_L, t_R ← most frequent pair of adjacent tokens in C
    t_NEW ← t_L + t_R                           # make new token by concatenating
    V ← V + t_NEW                               # update the vocabulary
    Replace each occurrence of t_L, t_R in C with t_NEW   # and update the corpus
  return V

Figure 2.13 The token learner part of the BPE algorithm for taking a corpus broken up
into individual characters or bytes, and learning a vocabulary by iteratively merging tokens.
Figure adapted from Bostrom and Durrett (2020).

from the training data, greedily, in the order we learned them. (Thus the frequencies
in the test data don’t play a role, just the frequencies in the training data.) So first
we segment each test sentence word into characters. Then we apply the first rule:
replace every instance of e r in the test corpus with er, and then the second rule:
replace every instance of er _ in the test corpus with er_, and so on. By the end,
if the test corpus contained the word n e w e r _, it would be tokenized as a full
word. But a new (unknown) word like l o w e r _ would be merged into the two
tokens low er_.
Of course in real algorithms BPE is run with many thousands of merges on a very
large input corpus. The result is that most words will be represented as full symbols,
and only the very rare words (and unknown words) will have to be represented by
their parts.
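Applying the learned merges to new text can be sketched as follows. The merge list is the one from the worked example, written out by hand; the function names are ours, and "_" again stands for the end-of-word symbol.

```python
# The eight merges from the worked example, in the order they were learned.
MERGES = [
    ("e", "r"), ("er", "_"), ("n", "e"), ("ne", "w"),
    ("l", "o"), ("lo", "w"), ("new", "er_"), ("low", "_"),
]

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def segment(word, merges=MERGES):
    """Split a word into characters plus "_", then replay the merges in order."""
    symbols = list(word) + ["_"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

print(segment("newer"))   # a word seen in training
print(segment("lower"))   # a word never seen in training
```

Note that the segmenter never looks at frequencies in the test data; it only replays the training-time merges, which is why the unseen word lower still decomposes into known pieces.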

2.4.4 Word Normalization, Lemmatization and Stemming


Word normalization is the task of putting words/tokens in a standard format, choosing
a single normal form for words with multiple forms like USA and US or uh-huh
and uhhuh. This standardization may be valuable, despite the spelling information
that is lost in the normalization process. For information retrieval or information
extraction about the US, we might want to see information from documents whether
they mention the US or the USA.
Case folding is another kind of normalization. Mapping everything to lower
case means that Woodchuck and woodchuck are represented identically, which is
very helpful for generalization in many tasks, such as information retrieval or speech
recognition. For sentiment analysis and other text classification tasks, information
extraction, and machine translation, by contrast, case can be quite helpful and case
folding is generally not done. This is because maintaining the difference between,
for example, US the country and us the pronoun can outweigh the advantage in
generalization that case folding would have provided for other words.
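The trade-off is easy to see in a few lines of Python. This is only an illustration using the built-in str.lower; the token list is made up for the example.

```python
tokens = ["Woodchuck", "woodchuck", "US", "us"]

# Case folding: map every token to lower case.
folded = [token.lower() for token in tokens]

# Four distinct types before folding, two after: Woodchuck/woodchuck merge
# (helpful), but so do US the country and us the pronoun (harmful for some tasks).
print(len(set(tokens)), len(set(folded)))
```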
For many natural language processing situations we also want two morpholog-
ically different forms of a word to behave similarly. For example in web search,
someone may type the string woodchucks but a useful system might want to also
return pages that mention woodchuck with no s. This is especially common in mor-
phologically complex languages like Russian, where for example the word Moscow
has different endings in the phrases Moscow, of Moscow, to Moscow, and so on.
Lemmatization is the task of determining that two words have the same root,
despite their surface differences. The words am, are, and is have the shared lemma

be; the words dinner and dinners both have the lemma dinner. Lemmatizing each of
these forms to the same lemma will let us find all mentions of words in Russian like
Moscow. The lemmatized form of a sentence like He is reading detective stories
would thus be He be read detective story.
How is lemmatization done? The most sophisticated methods for lemmatization
involve complete morphological parsing of the word. Morphology is the study of
the way words are built up from smaller meaning-bearing units called morphemes.
Two broad classes of morphemes can be distinguished: stems (the central morpheme
of the word, supplying the main meaning) and affixes (adding “additional”
meanings of various kinds). So, for example, the word fox consists of one morpheme
(the morpheme fox) and the word cats consists of two: the morpheme cat and the
morpheme -s. A morphological parser takes a word like cats and parses it into the
two morphemes cat and s, or parses a Spanish word like amaren (‘if in the future
they would love’) into the morpheme amar ‘to love’, and the morphological features
3PL and future subjunctive.

The Porter Stemmer


Lemmatization algorithms can be complex. For this reason we sometimes make use
of a simpler but cruder method, which mainly consists of chopping off word-final
affixes. This naive version of morphological analysis is called stemming. One of
the most widely used stemming algorithms is the Porter stemmer (Porter, 1980). The
Porter stemmer applied to the following paragraph:
This was not the map we found in Billy Bones’s chest, but
an accurate copy, complete in all things-names and heights
and soundings-with the single exception of the red crosses
and the written notes.
produces the following stemmed output:
Thi wa not the map we found in Billi Bone s chest but an
accur copi complet in all thing name and height and sound
with the singl except of the red cross and the written note
The algorithm is based on a series of rewrite rules run as a cascade, in
which the output of each pass is fed as input to the next pass; here is a sampling of
the rules:
ATIONAL → ATE (e.g., relational → relate)
ING → ε if stem contains vowel (e.g., motoring → motor)
SSES → SS (e.g., grasses → grass)
Detailed rule lists for the Porter stemmer, as well as code (in Java, Python, etc.)
can be found on Martin Porter’s homepage; see also the original paper (Porter, 1980).
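To make the cascade idea concrete, here is a toy sketch that applies just the three sample rules above in sequence. It is not the Porter stemmer: the real algorithm has many more rules and conditions, and the names RULES, contains_vowel, and apply_rules are ours.

```python
import re

def contains_vowel(stem):
    """Simplified condition: the stem has at least one of a, e, i, o, u."""
    return re.search(r"[aeiou]", stem) is not None

# (suffix, replacement, condition on the remaining stem, or None for no condition)
RULES = [
    ("sses", "ss", None),
    ("ational", "ate", None),
    ("ing", "", contains_vowel),
]

def apply_rules(word):
    """Run the rules as a cascade: each rule sees the output of the previous one."""
    for suffix, replacement, condition in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if condition is None or condition(stem):
                word = stem + replacement
    return word

for word in ["grasses", "relational", "motoring", "sing"]:
    print(word, "->", apply_rules(word))
```

The vowel condition is what keeps sing from being reduced to s, since the candidate stem contains no vowel.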
Simple stemmers can be useful in cases where we need to collapse across differ-
ent variants of the same lemma. Nonetheless, they do tend to commit errors of both
over- and under-generalizing, as shown in the table below (Krovetz, 1993):

    Errors of Commission          Errors of Omission
    organization   organ          European   Europe
    doing          doe            analysis   analyzes
    numerical      numerous       noise      noisy
    policy         police         sparse     sparsity

2.4.5 Sentence Segmentation


Sentence segmentation is another important step in text processing. The most useful
cues for segmenting a text into sentences are punctuation, like periods, question
marks, and exclamation points. Question marks and exclamation points are relatively
unambiguous markers of sentence boundaries. Periods, on the other hand, are
more ambiguous. The period character “.” is ambiguous between a sentence boundary
marker and a marker of abbreviations like Mr. or Inc. The previous sentence that
you just read showed an even more complex case of this ambiguity, in which the final
period of Inc. marked both an abbreviation and the sentence boundary marker. For
this reason, sentence tokenization and word tokenization may be addressed jointly.
In general, sentence tokenization methods work by first deciding (based on rules
or machine learning) whether a period is part of the word or is a sentence-boundary
marker. An abbreviation dictionary can help determine whether the period is part
of a commonly used abbreviation; the dictionaries can be hand-built or
machine-learned (Kiss and Strunk, 2006), as can the final sentence splitter. In the
Stanford CoreNLP toolkit (Manning et al., 2014), for example, sentence splitting is
rule-based, a deterministic consequence of tokenization; a sentence ends when a
sentence-ending punctuation (., !, or ?) is not already grouped with other characters
into a token (such as for an abbreviation or number), optionally followed by
additional final quotes or brackets.
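A rough sketch of the rule-based idea, assuming whitespace tokenization and a small hand-built abbreviation dictionary (both are simplifications, and this is not how CoreNLP is implemented):

```python
# Hypothetical hand-built abbreviation dictionary (far from complete).
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Inc."}

def split_sentences(text):
    """End a sentence at ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

for sentence in split_sentences("Mr. Smith arrived. He sat down! Did Dr. Jones see him?"):
    print(sentence)
```

This sketch deliberately ignores the harder case mentioned above, where an abbreviation such as Inc. also ends the sentence; handling that requires looking at the following context.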

2.5 Minimum Edit Distance


Much of natural language processing is concerned with measuring how similar two
strings are. For example, in spelling correction, the user typed some erroneous
string (let’s say graffe) and we want to know what the user meant. The user probably
intended a word that is similar to graffe. Among candidate similar words,
the word giraffe, which differs by only one letter from graffe, seems intuitively
to be more similar than, say, grail or graf, which differ in more letters. Another
example comes from coreference, the task of deciding whether two strings such as
the following refer to the same entity:
Stanford President Marc Tessier-Lavigne
Stanford University President Marc Tessier-Lavigne
Again, the fact that these two strings are very similar (differing by only one word)
seems like useful evidence for deciding that they might be coreferent.
Edit distance gives us a way to quantify both of these intuitions about string similarity.
More formally, the minimum edit distance between two strings is defined
as the minimum number of editing operations (operations like insertion, deletion,
substitution) needed to transform one string into another.
The gap between intention and execution, for example, is 5 (delete an i, substitute
e for n, substitute x for t, insert c, substitute u for n). It’s much easier to see
this by looking at the most important visualization for string distances, an alignment
between the two strings, shown in Fig. 2.14. Given two sequences, an alignment is
a correspondence between substrings of the two sequences. Thus, we say I aligns
with the empty string, N with E, and so on. Beneath the aligned strings is another
representation, a series of symbols expressing an operation list for converting the
top string into the bottom string: d for deletion, s for substitution, i for insertion.

I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s   i s

Figure 2.14 Representing the minimum edit distance between two strings as an alignment.
The final row gives the operation list for converting the top string into the bottom string: d for
deletion, s for substitution, i for insertion.

We can also assign a particular cost or weight to each of these operations. The
Levenshtein distance between two sequences is the simplest weighting factor in
which each of the three operations has a cost of 1 (Levenshtein, 1966)—we assume
that the substitution of a letter for itself, for example, t for t, has zero cost. The Lev-
enshtein distance between intention and execution is 5. Levenshtein also proposed
an alternative version of his metric in which each insertion or deletion has a cost of
1 and substitutions are not allowed. (This is equivalent to allowing substitution, but
giving each substitution a cost of 2 since any substitution can be represented by one
insertion and one deletion). Using this version, the Levenshtein distance between
intention and execution is 8.

2.5.1 The Minimum Edit Distance Algorithm


How do we find the minimum edit distance? We can think of this as a search task, in
which we are searching for the shortest path—a sequence of edits—from one string
to another.

                    i n t e n t i o n

       del                 ins                  subst

 n t e n t i o n   i n t e c n t i o n   i n x e n t i o n

Figure 2.15 Finding the edit distance viewed as a search problem.

The space of all possible edits is enormous, so we can’t search naively. However,
lots of distinct edit paths will end up in the same state (string), so rather than
recomputing all those paths, we could just remember the shortest path to a state each time
we saw it. We can do this by using dynamic programming. Dynamic programming
is the name for a class of algorithms, first introduced by Bellman (1957), that apply
a table-driven method to solve problems by combining solutions to sub-problems.
Some of the most commonly used algorithms in natural language processing make
use of dynamic programming, such as the Viterbi algorithm (Chapter 8) and the
CKY algorithm for parsing (Chapter 13).
The intuition of a dynamic programming problem is that a large problem can
be solved by properly combining the solutions to various sub-problems. Consider
the shortest path of transformed words that represents the minimum edit distance
between the strings intention and execution shown in Fig. 2.16.
Imagine some string (perhaps it is exention) that is in this optimal path (whatever
it is). The intuition of dynamic programming is that if exention is in the optimal
operation list, then the optimal sequence must also include the optimal path from
intention to exention. Why? If there were a shorter path from intention to exention,

i n t e n t i o n
delete i
n t e n t i o n
substitute n by e
e t e n t i o n
substitute t by x
e x e n t i o n
insert u
e x e n u t i o n
substitute n by c
e x e c u t i o n
Figure 2.16 Path from intention to execution.

then we could use it instead, resulting in a shorter overall path, and the optimal
sequence wouldn’t be optimal, thus leading to a contradiction.
The minimum edit distance algorithm was named by Wagner and
Fischer (1974) but independently discovered by many people (see the Historical
Notes section of Chapter 8).
Let’s first define the minimum edit distance between two strings. Given two
strings, the source string X of length n, and target string Y of length m, we’ll define
D[i, j] as the edit distance between X[1..i] and Y [1.. j], i.e., the first i characters of X
and the first j characters of Y . The edit distance between X and Y is thus D[n, m].
We’ll use dynamic programming to compute D[n, m] bottom up, combining so-
lutions to subproblems. In the base case, with a source substring of length i but an
empty target string, going from i characters to 0 requires i deletes. With a target
substring of length j but an empty source going from 0 characters to j characters
requires j inserts. Having computed D[i, j] for small i, j we then compute larger
D[i, j] based on previously computed smaller values. The value of D[i, j] is computed
by taking the minimum of the three possible paths through the matrix which
arrive there:

$$
D[i,j] = \min
\begin{cases}
D[i-1,j] + \text{del-cost}(source[i]) \\
D[i,j-1] + \text{ins-cost}(target[j]) \\
D[i-1,j-1] + \text{sub-cost}(source[i], target[j])
\end{cases}
$$

If we assume the version of Levenshtein distance in which the insertions and dele-
tions each have a cost of 1 (ins-cost(·) = del-cost(·) = 1), and substitutions have a
cost of 2 (except substitution of identical letters have zero cost), the computation for
D[i, j] becomes:

$$
D[i,j] = \min
\begin{cases}
D[i-1,j] + 1 \\
D[i,j-1] + 1 \\
D[i-1,j-1] +
  \begin{cases}
  2; & \text{if } source[i] \neq target[j] \\
  0; & \text{if } source[i] = target[j]
  \end{cases}
\end{cases}
\tag{2.8}
$$

The algorithm is summarized in Fig. 2.17; Fig. 2.18 shows the results of applying
the algorithm to the distance between intention and execution with the version of
Levenshtein in Eq. 2.8.
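The recurrence translates almost line for line into Python. The sketch below is a minimal version with fixed insertion and deletion costs of 1 and a sub_cost parameter (our name) so that both versions of Levenshtein distance discussed earlier can be checked; note that Python strings are indexed from 0, so source[i - 1] plays the role of source[i] in the equations.

```python
def min_edit_distance(source, target, sub_cost=2):
    """Minimum edit distance with insert/delete cost 1 and the given substitution cost."""
    n, m = len(source), len(target)
    # D[i][j] holds the distance between source[:i] and target[:j].
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # base case: i deletions
        D[i][0] = D[i - 1][0] + 1
    for j in range(1, m + 1):          # base case: j insertions
        D[0][j] = D[0][j - 1] + 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(
                D[i - 1][j] + 1,                 # deletion
                D[i][j - 1] + 1,                 # insertion
                D[i - 1][j - 1] + substitution,  # substitution (or match)
            )
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # substitution cost 2
print(min_edit_distance("intention", "execution", sub_cost=1))  # substitution cost 1
```

The two calls should agree with the distances of 8 and 5 given in the text for the two cost schemes.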
Knowing the minimum edit distance is useful for algorithms like finding poten-
tial spelling error corrections. But the edit distance algorithm is important in another
way; with a small change, it can also provide the minimum cost alignment between
two strings. Aligning two strings is useful throughout speech and language process-
ing. In speech recognition, minimum edit distance alignment is used to compute
the word error rate (Chapter 26). Alignment plays a role in machine translation, in

function MIN-EDIT-DISTANCE(source, target) returns min-distance

  n ← LENGTH(source)
  m ← LENGTH(target)
  Create a distance matrix D[n+1,m+1]

  # Initialization: the zeroth row and column is the distance from the empty string
  D[0,0] ← 0
  for each row i from 1 to n do
     D[i,0] ← D[i−1,0] + del-cost(source[i])
  for each column j from 1 to m do
     D[0,j] ← D[0,j−1] + ins-cost(target[j])

  # Recurrence relation:
  for each row i from 1 to n do
     for each column j from 1 to m do
        D[i,j] ← MIN( D[i−1,j] + del-cost(source[i]),
                      D[i−1,j−1] + sub-cost(source[i], target[j]),
                      D[i,j−1] + ins-cost(target[j]))
  # Termination
  return D[n,m]

Figure 2.17 The minimum edit distance algorithm, an example of the class of dynamic
programming algorithms. The various costs can either be fixed (e.g., ∀x, ins-cost(x) = 1)
or can be specific to the letter (to model the fact that some letters are more likely to be in-
serted than others). We assume that there is no cost for substituting a letter for itself (i.e.,
sub-cost(x, x) = 0).

Src\Tar   #   e   x   e   c   u   t   i   o   n
   #      0   1   2   3   4   5   6   7   8   9
   i      1   2   3   4   5   6   7   6   7   8
   n      2   3   4   5   6   7   8   7   8   7
   t      3   4   5   6   7   8   7   8   9   8
   e      4   3   4   5   6   7   8   9  10   9
   n      5   4   5   6   7   8   9  10  11  10
   t      6   5   6   7   8   9   8   9  10  11
   i      7   6   7   8   9  10   9   8   9  10
   o      8   7   8   9  10  11  10   9   8   9
   n      9   8   9  10  11  12  11  10   9   8
Figure 2.18 Computation of minimum edit distance between intention and execution with
the algorithm of Fig. 2.17, using Levenshtein distance with cost of 1 for insertions or dele-
tions, 2 for substitutions.

which sentences in a parallel corpus (a corpus with a text in two languages) need to
be matched to each other.
To extend the edit distance algorithm to produce an alignment, we can start by
visualizing an alignment as a path through the edit distance matrix. Figure 2.19
shows this path with boldfaced cells. Each boldfaced cell represents an alignment
of a pair of letters in the two strings. If two boldfaced cells occur in the same row,
there will be an insertion in going from the source to the target; two boldfaced cells
in the same column indicate a deletion.
Figure 2.19 also shows the intuition of how to compute this alignment path. The

computation proceeds in two steps. In the first step, we augment the minimum edit
distance algorithm to store backpointers in each cell. The backpointer from a cell
points to the previous cell (or cells) that we came from in entering the current cell.
We’ve shown a schematic of these backpointers in Fig. 2.19. Some cells have multiple
backpointers because the minimum extension could have come from multiple
previous cells. In the second step, we perform a backtrace. In a backtrace, we start
from the last cell (at the final row and column), and follow the pointers back through
the dynamic programming matrix. Each complete path between the final cell and the
initial cell is a minimum distance alignment. Exercise 2.7 asks you to modify the
minimum edit distance algorithm to store the pointers and compute the backtrace to
output an alignment.
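One possible shape for such a solution is sketched below. It is a simplified illustration, not the intended answer to the exercise: it stores only a single backpointer per cell (preferring the diagonal on ties, an arbitrary choice of ours), so it recovers one minimum-cost alignment rather than all of them. The function name align and the operation labels are assumptions.

```python
def align(source, target, sub_cost=2):
    """Return (distance, alignment); the alignment is a list of
    (source_char, target_char, op) triples with '*' marking a gap."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], back[i][0] = i, "d"
    for j in range(1, m + 1):
        D[0][j], back[0][j] = j, "i"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            candidates = [
                (D[i - 1][j - 1] + (0 if same else sub_cost), "s"),  # diagonal
                (D[i - 1][j] + 1, "d"),                              # from above
                (D[i][j - 1] + 1, "i"),                              # from the left
            ]
            # Keep one backpointer only; min returns the first lowest-cost candidate.
            D[i][j], back[i][j] = min(candidates, key=lambda c: c[0])
    # Backtrace from the final cell to the initial cell.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        op = back[i][j]
        if op == "s":
            label = " " if source[i - 1] == target[j - 1] else "s"
            ops.append((source[i - 1], target[j - 1], label))
            i, j = i - 1, j - 1
        elif op == "d":
            ops.append((source[i - 1], "*", "d"))
            i -= 1
        else:
            ops.append(("*", target[j - 1], "i"))
            j -= 1
    ops.reverse()
    return D[n][m], ops

distance, alignment = align("intention", "execution")
print(distance)
print(" ".join(s for s, _, _ in alignment))
print(" ".join(t for _, t, _ in alignment))
print(" ".join(op for _, _, op in alignment))
```

Storing a list of backpointers per cell instead of a single label would let the backtrace enumerate every minimum-cost alignment.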

      #     e      x      e      c      u      t      i      o      n
#     0    ←1     ←2     ←3     ←4     ←5     ←6     ←7     ←8     ←9
i    ↑1   ↖←↑2   ↖←↑3   ↖←↑4   ↖←↑5   ↖←↑6   ↖←↑7    ↖6     ←7     ←8
n    ↑2   ↖←↑3   ↖←↑4   ↖←↑5   ↖←↑6   ↖←↑7   ↖←↑8    ↑7    ↖←↑8    ↖7
t    ↑3   ↖←↑4   ↖←↑5   ↖←↑6   ↖←↑7   ↖←↑8    ↖7    ←↑8    ↖←↑9    ↑8
e    ↑4    ↖3     ←4    ↖←5     ←6     ←7    ←↑8    ↖←↑9   ↖←↑10   ↑9
n    ↑5    ↑4    ↖←↑5   ↖←↑6   ↖←↑7   ↖←↑8   ↖←↑9   ↖←↑10  ↖←↑11   ↖↑10
t    ↑6    ↑5    ↖←↑6   ↖←↑7   ↖←↑8   ↖←↑9    ↖8     ←9     ←10    ←↑11
i    ↑7    ↑6    ↖←↑7   ↖←↑8   ↖←↑9   ↖←↑10   ↑9     ↖8     ←9     ←10
o    ↑8    ↑7    ↖←↑8   ↖←↑9   ↖←↑10  ↖←↑11   ↑10    ↑9     ↖8     ←9
n    ↑9    ↑8    ↖←↑9   ↖←↑10  ↖←↑11  ↖←↑12   ↑11    ↑10    ↑9     ↖8
Figure 2.19 When entering a value in each cell, we mark which of the three neighboring
cells we came from with up to three arrows. After the table is full we compute an alignment
(minimum edit path) by using a backtrace, starting at the 8 in the lower-right corner and
following the arrows back. The sequence of bold cells represents one possible minimum cost
alignment between the two strings. Diagram design after Gusfield (1997).

While we worked our example with simple Levenshtein distance, the algorithm
in Fig. 2.17 allows arbitrary weights on the operations. For spelling correction, for
example, substitutions are more likely to happen between letters that are next to
each other on the keyboard. The Viterbi algorithm is a probabilistic extension of
minimum edit distance. Instead of computing the “minimum edit distance” between
two strings, Viterbi computes the “maximum probability alignment” of one string
with another. We’ll discuss this more in Chapter 8.
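As an illustration of non-uniform weights, the sketch below passes a substitution-cost function into the dynamic program. The adjacency table and the cost values (0.5 for neighboring keys, 1.0 otherwise) are made-up assumptions for the example, not values from the literature.

```python
# Hypothetical, tiny sample of letter pairs that are neighbors on a QWERTY keyboard.
ADJACENT = {frozenset("as"), frozenset("sd"), frozenset("qw"), frozenset("we")}

def keyboard_sub_cost(x, y):
    """Cheaper substitution for keyboard neighbors (illustrative costs only)."""
    if x == y:
        return 0.0
    if frozenset((x, y)) in ADJACENT:
        return 0.5
    return 1.0

def weighted_edit_distance(source, target, sub_cost=keyboard_sub_cost,
                           ins_cost=1.0, del_cost=1.0):
    """Minimum edit distance with a caller-supplied substitution cost function."""
    n, m = len(source), len(target)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + del_cost,
                D[i][j - 1] + ins_cost,
                D[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return D[n][m]

print(weighted_edit_distance("cat", "cst"))  # a and s are neighbors
print(weighted_edit_distance("cat", "cpt"))  # a and p are not
```

Under these weights a typo that hits a neighboring key counts as a smaller edit than one that hits a distant key, which is the behavior a spelling corrector would want.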

2.6 Summary
This chapter introduced a fundamental tool in language processing, the regular ex-
pression, and showed how to perform basic text normalization tasks including
word segmentation and normalization, sentence segmentation, and stemming.
We also introduced the important minimum edit distance algorithm for comparing
strings. Here’s a summary of the main points we covered about these ideas:
• The regular expression language is a powerful tool for pattern-matching.
• Basic operations in regular expressions include concatenation of symbols,
disjunction of symbols ([], |, and .), counters (*, +, and {n,m}), anchors
(^, $) and precedence operators ((,)).

• Word tokenization and normalization are generally done by cascades of
simple regular expression substitutions or finite automata.
• The Porter algorithm is a simple and efficient way to do stemming, stripping
off affixes. It does not have high accuracy but may be useful for some tasks.
• The minimum edit distance between two strings is the minimum number of
operations it takes to edit one into the other. Minimum edit distance can be
computed by dynamic programming, which also results in an alignment of
the two strings.

Bibliographical and Historical Notes


Kleene (1951, 1956) first defined regular expressions and the finite automaton, based
on the McCulloch-Pitts neuron. Ken Thompson was one of the first to build regular
expression compilers into editors for text searching (Thompson, 1968). His editor
ed included a command “g/regular expression/p”, or Global Regular Expression
Print, which later became the Unix grep utility.
Text normalization algorithms have been applied since the beginning of the
field. One of the earliest widely used stemmers was Lovins (1968). Stemming
was also applied early to the digital humanities, by Packard (1973), who built an
affix-stripping morphological parser for Ancient Greek. Currently a wide variety
of code for tokenization and normalization is available, such as the Stanford
Tokenizer (http://nlp.stanford.edu/software/tokenizer.shtml) or specialized
tokenizers for Twitter (O’Connor et al., 2010), or for sentiment
(http://sentiment.christopherpotts.net/tokenizing.html). See Palmer (2012)
for a survey of text preprocessing. NLTK is an essential tool that offers both useful
Python libraries (http://www.nltk.org) and textbook descriptions (Bird et al.,
2009) of many algorithms including text normalization and corpus interfaces.
For more on Herdan’s Law and Heaps’ Law, see Herdan (1960, p. 28), Heaps
(1978), Egghe (2007) and Baayen (2001); Yasseri et al. (2012) discuss the relation-
ship with other measures of linguistic complexity. For more on edit distance, see the
excellent Gusfield (1997). Our example measuring the edit distance from ‘intention’
to ‘execution’ was adapted from Kruskal (1983). There are various publicly avail-
able packages to compute edit distance, including Unix diff and the NIST sclite
program (NIST, 2005).
In his autobiography Bellman (1984) explains how he originally came up with
the term dynamic programming:

“...The 1950s were not good years for mathematical research. [the]
Secretary of Defense ...had a pathological fear and hatred of the word,
research... I decided therefore to use the word, “programming”. I
wanted to get across the idea that this was dynamic, this was multi-
stage... I thought, let’s ... take a word that has an absolutely precise
meaning, namely dynamic... it’s impossible to use the word, dynamic,
in a pejorative sense. Try thinking of some combination that will pos-
sibly give it a pejorative meaning. It’s impossible. Thus, I thought
dynamic programming was a good name. It was something not even a
Congressman could object to.”

Exercises
2.1 Write regular expressions for the following languages.
1. the set of all alphabetic strings;
2. the set of all lower case alphabetic strings ending in a b;
3. the set of all strings from the alphabet a, b such that each a is immedi-
ately preceded by and immediately followed by a b;
2.2 Write regular expressions for the following languages. By “word”, we mean
an alphabetic string separated from other words by whitespace, any relevant
punctuation, line breaks, and so forth.
1. the set of all strings with two consecutive repeated words (e.g., “Hum-
bert Humbert” and “the the” but not “the bug” or “the big bug”);
2. all strings that start at the beginning of the line with an integer and that
end at the end of the line with a word;
3. all strings that have both the word grotto and the word raven in them
(but not, e.g., words like grottos that merely contain the word grotto);
4. write a pattern that places the first word of an English sentence in a
register. Deal with punctuation.
2.3 Implement an ELIZA-like program, using substitutions such as those described
on page 11. You might want to choose a different domain than a Rogerian psy-
chologist, although keep in mind that you would need a domain in which your
program can legitimately engage in a lot of simple repetition.
2.4 Compute the edit distance (using insertion cost 1, deletion cost 1, substitution
cost 1) of “leda” to “deal”. Show your work (using the edit distance grid).
2.5 Figure out whether drive is closer to brief or to divers and what the edit dis-
tance is to each. You may use any version of distance that you like.
2.6 Now implement a minimum edit distance algorithm and use your hand-computed
results to check your code.
2.7 Augment the minimum edit distance algorithm to output an alignment; you
will need to store pointers and add a stage to compute the backtrace.
CHAPTER 3 N-gram Language Models

“You are uniformly charming!” cried he, with a smile of associating and now
and then I bowed and they perceived a chaise and four to wish for.
Random sentence generated from a Jane Austen trigram model

Predicting is difficult—especially about the future, as the old quip goes. But how
about predicting something that seems much easier, like the next few words someone
is going to say? What word, for example, is likely to follow
Please turn your homework ...
Hopefully, most of you concluded that a very likely word is in, or possibly over,
but probably not refrigerator or the. In the following sections we will formalize
this intuition by introducing models that assign a probability to each possible next
word. The same models will also serve to assign a probability to an entire sentence.
Such a model, for example, could predict that the following sequence has a much
higher probability of appearing in a text:
all of a sudden I notice three guys standing on the sidewalk
than does this same set of words in a different order:

on guys all I of notice sidewalk three a sudden standing the

Why would you want to predict upcoming words, or assign probabilities to sen-
tences? Probabilities are essential in any task in which we have to identify words in
noisy, ambiguous input, like speech recognition. For a speech recognizer to realize
that you said I will be back soonish and not I will be bassoon dish, it helps to know
that back soonish is a much more probable sequence than bassoon dish. For writing
tools like spelling correction or grammatical error correction, we need to find and
correct errors in writing like Their are two midterms, in which There was mistyped
as Their, or Everything has improve, in which improve should have been improved.
The phrase There are will be much more probable than Their are, and has improved
than has improve, allowing us to help users by detecting and correcting these errors.
Assigning probabilities to sequences of words is also essential in machine trans-
lation. Suppose we are translating a Chinese source sentence:
他 向 记者 介绍了 主要 内容
He to reporters introduced main content
As part of the process we might have built the following set of potential rough
English translations:
he introduced reporters to the main contents of the statement
he briefed to reporters the main contents of the statement
he briefed reporters on the main contents of the statement
A probabilistic model of word sequences could suggest that briefed reporters on
is a more probable English phrase than briefed to reporters (which has an awkward
to after briefed) or introduced reporters to (which uses a verb that is less fluent
English in this context), allowing us to correctly select the third sentence above.
Probabilities are also important for augmentative and alternative communication
(AAC) systems (Trnka et al. 2007, Kane et al. 2017). People often use such AAC
devices if they are physically unable to speak or sign but can instead use eye gaze or
other specific movements to select words from a menu to be spoken by the system.
Word prediction can be used to suggest likely words for the menu.
Models that assign probabilities to sequences of words are called language models
or LMs. In this chapter we introduce the simplest model that assigns probabilities
to sentences and sequences of words, the n-gram. An n-gram is a sequence
of n words: a 2-gram (which we’ll call a bigram) is a two-word sequence
like “please turn”, “turn your”, or “your homework”, and a 3-gram (a trigram) is
a three-word sequence like “please turn your”, or “turn your homework”.
We’ll see how to use n-gram models to estimate the probability of the last word of
an n-gram given the previous words, and also to assign probabilities to entire se-
quences. In a bit of terminological ambiguity, we usually drop the word “model”,
and use the term n-gram (and bigram, etc.) to mean either the word sequence itself
or the predictive model that assigns it a probability. In later chapters we’ll introduce
more sophisticated language models like the RNN LMs of Chapter 9.

3.1 N-Grams
Let’s begin with the task of computing P(w|h), the probability of a word w given
some history h. Suppose the history h is “its water is so transparent that” and we
want to know the probability that the next word is the:

P(the|its water is so transparent that). (3.1)

One way to estimate this probability is from relative frequency counts: take a
very large corpus, count the number of times we see its water is so transparent that,
and count the number of times this is followed by the. This would be answering the
question “Out of the times we saw the history h, how many times was it followed by
the word w”, as follows:

$$
P(\text{the} \mid \text{its water is so transparent that}) =
\frac{C(\text{its water is so transparent that the})}{C(\text{its water is so transparent that})}
\tag{3.2}
$$


With a large enough corpus, such as the web, we can compute these counts and
estimate the probability from Eq. 3.2. You should pause now, go to the web, and
compute this estimate for yourself.
While this method of estimating probabilities directly from counts works fine in
many cases, it turns out that even the web isn’t big enough to give us good estimates
in most cases. This is because language is creative; new sentences are created all the
time, and we won’t always be able to count entire sentences. Even simple extensions
of the example sentence may have counts of zero on the web (such as “Walden
Pond’s water is so transparent that the”; well, used to have counts of zero).
Similarly, if we wanted to know the joint probability of an entire sequence of
words like its water is so transparent, we could do it by asking “out of all possible
sequences of five words, how many of them are its water is so transparent?” We
would have to get the count of its water is so transparent and divide by the sum of
the counts of all possible five word sequences. That seems rather a lot to estimate!
For this reason, we’ll need to introduce more clever ways of estimating the prob-
ability of a word w given a history h, or the probability of an entire word sequence
$W$. Let’s start with a little formalizing of notation. To represent the probability
of a particular random variable $X_i$ taking on the value “the”, or $P(X_i = \text{“the”})$, we
will use the simplification $P(\text{the})$. We’ll represent a sequence of $n$ words either as
$w_1 \ldots w_n$ or $w_{1:n}$ (so the expression $w_{1:n-1}$ means the string $w_1, w_2, \ldots, w_{n-1}$). For the
joint probability of each word in a sequence having a particular value
$P(X = w_1, Y = w_2, Z = w_3, \ldots, W = w_n)$ we’ll use $P(w_1, w_2, \ldots, w_n)$.
Now how can we compute probabilities of entire sequences like P(w1 , w2 , ..., wn )?
One thing we can do is decompose this probability using the chain rule of proba-
bility:
$$
\begin{aligned}
P(X_1 \ldots X_n) &= P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_{1:2}) \ldots P(X_n \mid X_{1:n-1}) \\
&= \prod_{k=1}^{n} P(X_k \mid X_{1:k-1})
\end{aligned}
\tag{3.3}
$$

Applying the chain rule to words, we get


$$
\begin{aligned}
P(w_{1:n}) &= P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_{1:2}) \ldots P(w_n \mid w_{1:n-1}) \\
&= \prod_{k=1}^{n} P(w_k \mid w_{1:k-1})
\end{aligned}
\tag{3.4}
$$

The chain rule shows the link between computing the joint probability of a sequence
and computing the conditional probability of a word given previous words. Equa-
tion 3.4 suggests that we could estimate the joint probability of an entire sequence of
words by multiplying together a number of conditional probabilities. But using the
chain rule doesn’t really seem to help us! We don’t know any way to compute the
exact probability of a word given a long sequence of preceding words, $P(w_n \mid w_{1:n-1})$.
As we said above, we can’t just estimate by counting the number of times every word
occurs following every long string, because language is creative and any particular
context might have never occurred before!
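
To make Eq. 3.4 concrete, here is the chain rule written out for the five-word sequence its water is so transparent used above. This is only a worked instance of the equation, not a new result.

```latex
\begin{aligned}
P(\text{its water is so transparent}) ={}& P(\text{its}) \times P(\text{water} \mid \text{its})
   \times P(\text{is} \mid \text{its water}) \\
 & \times P(\text{so} \mid \text{its water is})
   \times P(\text{transparent} \mid \text{its water is so})
\end{aligned}
```

The last factor already conditions on a four-word history, which is exactly the kind of long context whose counts are hard to estimate reliably.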
The intuition of the n-gram model is that instead of computing the probability of
a word given its entire history, we can approximate the history by just the last few
words.
The bigram model, for example, approximates the probability of a word given
all the previous words $P(w_n \mid w_{1:n-1})$ by using only the conditional probability of the
preceding word $P(w_n \mid w_{n-1})$. In other words, instead of computing the probability
P(the|Walden Pond’s water is so transparent that) (3.5)
we approximate it with the probability
P(the|that) (3.6)
When we use a bigram model to predict the conditional probability of the next word,
we are thus making the following approximation:
$$P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-1}) \tag{3.7}$$

The assumption that the probability of a word depends only on the previous word is
called a Markov assumption. Markov models are the class of probabilistic models
that assume we can predict the probability of some future unit without looking too
far into the past. We can generalize the bigram (which looks one word into the past)
to the trigram (which looks two words into the past) and thus to the n-gram (which
looks n − 1 words into the past).
Thus, the general equation for this n-gram approximation to the conditional
probability of the next word in a sequence is

$$P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1}) \tag{3.8}$$

Given the bigram assumption for the probability of an individual word, we can com-
pute the probability of a complete word sequence by substituting Eq. 3.7 into Eq. 3.4:
$$P(w_{1:n}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1}) \tag{3.9}$$

How do we estimate these bigram or n-gram probabilities? An intuitive way to
estimate probabilities is called maximum likelihood estimation or MLE. We get
the MLE estimate for the parameters of an n-gram model by getting counts from a
corpus, and normalizing the counts so that they lie between 0 and 1.¹
For example, to compute a particular bigram probability of a word y given a
previous word x, we’ll compute the count of the bigram C(xy) and normalize by the
sum of all the bigrams that share the same first word x:

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)} \tag{3.10}$$
We can simplify this equation, since the sum of all bigram counts that start with
a given word wn−1 must be equal to the unigram count for that word wn−1 (the reader
should take a moment to be convinced of this):

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})} \tag{3.11}$$
Let’s work through an example using a mini-corpus of three sentences. We’ll
first need to augment each sentence with a special symbol <s> at the beginning
of the sentence, to give us the bigram context of the first word. We’ll also need a
special end-symbol, </s>.²
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Here are the calculations for some of the bigram probabilities from this corpus:
P(I|<s>) = 2/3 = .67       P(Sam|<s>) = 1/3 = .33     P(am|I) = 2/3 = .67
P(</s>|Sam) = 1/2 = .5     P(Sam|am) = 1/2 = .5       P(do|I) = 1/3 = .33
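
The same numbers can be reproduced with a few lines of Python. This is a sketch under simple assumptions: tokens are split on whitespace, and the helper name bigram_prob is ours. It applies Eq. 3.11 directly, dividing a bigram count by the unigram count of the first word.

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    # Adjacent token pairs within one sentence are the bigrams.
    bigram_counts.update(zip(tokens, tokens[1:]))


def bigram_prob(word, prev):
    """MLE estimate P(word | prev) = C(prev word) / C(prev), as in Eq. 3.11."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]


for word, prev in [("I", "<s>"), ("Sam", "<s>"), ("am", "I"),
                   ("</s>", "Sam"), ("Sam", "am"), ("do", "I")]:
    print(f"P({word}|{prev}) = {bigram_prob(word, prev):.2f}")
```

Because bigrams are collected sentence by sentence, no bigram spans a sentence boundary, which is why the explicit <s> and </s> symbols are needed.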
1 For probabilistic models, normalizing means dividing by some total count so that the resulting proba-
bilities fall legally between 0 and 1.
2 We need the end-symbol to make the bigram grammar a true probability distribution. Without an
end-symbol, the sentence probabilities for all sentences of a given length would sum to one. This model
would define an infinite set of probability distributions, with one distribution per sentence length. See
Exercise 3.5.

For the general case of MLE n-gram parameter estimation:

$$P(w_n \mid w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1}\, w_n)}{C(w_{n-N+1:n-1})} \tag{3.12}$$

Equation 3.12 (like Eq. 3.11) estimates the n-gram probability by dividing the
observed frequency of a particular sequence by the observed frequency of a prefix.
This ratio is called a relative frequency. We said above that this use of relative
frequencies as a way to estimate probabilities is an example of maximum likelihood
estimation or MLE. In MLE, the resulting parameter set maximizes the likelihood
of the training set T given the model M (i.e., P(T|M)). For example, suppose the
word Chinese occurs 400 times in a corpus of a million words like the Brown corpus.
What is the probability that a random word selected from some other text of, say,
a million words will be the word Chinese? The MLE of its probability is 400/1000000
or .0004. Now .0004 is not the best possible estimate of the probability of Chinese
occurring in all situations; it might turn out that in some other corpus or context
Chinese is a very unlikely word. But it is the probability that makes it most likely
that Chinese will occur 400 times in a million-word corpus. We present ways to
modify the MLE estimates slightly to get better probability estimates in Section 3.4.
Let’s move on to some examples from a slightly larger corpus than our 14-word
example above. We’ll use data from the now-defunct Berkeley Restaurant Project,
a dialogue system from the last century that answered questions about a database
of restaurants in Berkeley, California (Jurafsky et al., 1994). Here are some text-
normalized sample user queries (a sample of 9332 sentences is on the website):
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day

Figure 3.1 shows the bigram counts from a piece of a bigram grammar from the
Berkeley Restaurant Project. Note that the majority of the values are zero. In fact,
we have chosen the sample words to cohere with each other; a matrix selected from
a random set of eight words would be even more sparse.

           i  want   to  eat  chinese  food  lunch  spend
i          5   827    0    9        0     0      0      2
want       2     0  608    1        6     6      5      1
to         2     0    4  686        2     0      6    211
eat        0     0    2    0       16     2     42      0
chinese    1     0    0    0        0    82      1      0
food      15     0   15    0        1     4      0      0
lunch      2     0    0    0        0     1      0      0
spend      1     0    1    0        0     0      0      0
Figure 3.1 Bigram counts for eight of the words (out of V = 1446) in the Berkeley Restau-
rant Project corpus of 9332 sentences. Zero counts are in gray.

Figure 3.2 shows the bigram probabilities after normalization (dividing each cell
in Fig. 3.1 by the appropriate unigram count for its row, taken from the following set of
unigram counts):

i     want  to    eat  chinese  food  lunch  spend
2533  927   2417  746  158      1093  341    278


         i        want     to       eat      chinese  food     lunch    spend
i        0.002    0.33     0        0.0036   0        0        0        0.00079
want     0.0022   0        0.66     0.0011   0.0065   0.0065   0.0054   0.0011
to       0.00083  0        0.0017   0.28     0.00083  0        0.0025   0.087
eat      0        0        0.0027   0        0.021    0.0027   0.056    0
chinese  0.0063   0        0        0        0        0.52     0.0063   0
food     0.014    0        0.014    0        0.00092  0.0037   0        0
lunch    0.0059   0        0        0        0        0.0029   0        0
spend    0.0036   0        0.0036   0        0        0        0        0
Figure 3.2 Bigram probabilities for eight words in the Berkeley Restaurant Project corpus
of 9332 sentences. Zero probabilities are in gray.
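
The normalization that turns Fig. 3.1 into Fig. 3.2 can be sketched as follows. The counts are copied from Fig. 3.1 and the unigram counts listed above; the dictionary layout and the function name bigram_prob are our own illustrative choices.

```python
words = ["i", "want", "to", "eat", "chinese", "food", "lunch", "spend"]

# Unigram counts C(w) for each row word.
unigram_counts = dict(zip(words, [2533, 927, 2417, 746, 158, 1093, 341, 278]))

# Rows of Fig. 3.1: bigram_counts[prev][k] is C(prev, words[k]).
bigram_counts = {
    "i":       [5, 827, 0, 9, 0, 0, 0, 2],
    "want":    [2, 0, 608, 1, 6, 6, 5, 1],
    "to":      [2, 0, 4, 686, 2, 0, 6, 211],
    "eat":     [0, 0, 2, 0, 16, 2, 42, 0],
    "chinese": [1, 0, 0, 0, 0, 82, 1, 0],
    "food":    [15, 0, 15, 0, 1, 4, 0, 0],
    "lunch":   [2, 0, 0, 0, 0, 1, 0, 0],
    "spend":   [1, 0, 1, 0, 0, 0, 0, 0],
}


def bigram_prob(word, prev):
    """P(word | prev): the Fig. 3.1 cell divided by the row's unigram count."""
    return bigram_counts[prev][words.index(word)] / unigram_counts[prev]


print(round(bigram_prob("want", "i"), 2))        # 0.33
print(round(bigram_prob("to", "want"), 2))       # 0.66
print(round(bigram_prob("food", "chinese"), 2))  # 0.52
```

Note that a row of Fig. 3.2 does not sum to 1, since the figure shows only eight of the V = 1446 possible next words.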

Here are a few other useful probabilities:


P(i|<s>) = 0.25 P(english|want) = 0.0011
P(food|english) = 0.5 P(</s>|food) = 0.68
Now we can compute the probability of sentences like I want English food or
I want Chinese food by simply multiplying the appropriate bigram probabilities to-
gether, as follows:

P(<s> i want english food </s>)
   = P(i|<s>) P(want|i) P(english|want) P(food|english) P(</s>|food)
   = .25 × .33 × .0011 × 0.5 × 0.68
   = .000031

We leave it as Exercise 3.2 to compute the probability of i want chinese food.
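
The multiplication above can be written as a short function. In this sketch the probability table holds only the five bigrams needed for this one sentence, with values taken from Fig. 3.2 and the list of additional probabilities above; the name sentence_prob and the choice to return 0 for any bigram missing from the table are our assumptions.

```python
# Bigram probabilities keyed by (previous word, word).
bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}


def sentence_prob(sentence):
    """Product of bigram probabilities, including the <s> and </s> markers."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        # A bigram absent from the table contributes probability 0.
        prob *= bigram_prob.get((prev, word), 0.0)
    return prob


print(f"{sentence_prob('i want english food'):.6f}")  # 0.000031
```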


What kinds of linguistic phenomena are captured in these bigram statistics?
Some of the bigram probabilities above encode some facts that we think of as strictly
syntactic in nature, like the fact that what comes after eat is usually a noun or an
adjective, or that what comes after to is usually a verb. Others might be a fact about
the personal assistant task, like the high probability of sentences beginning with
the word I. And some might even be cultural rather than linguistic, like the higher
probability that people are looking for Chinese versus English food.
Some practical issues: Although for pedagogical purposes we have only described
bigram models, in practice it’s more common to use trigram models, which condition
on the previous two words rather than the previous word, or 4-gram or even
5-gram models, when there is sufficient training data. Note that for these larger
n-grams, we’ll need to assume extra contexts to the left and right of the sentence end.
For example, to compute trigram probabilities at the very beginning of the sentence,
we use two pseudo-words for the first trigram (i.e., P(I|<s><s>)).
We always represent and compute language model probabilities in log format
as log probabilities. Since probabilities are (by definition) less than or equal to
1, the more probabilities we multiply together, the smaller the product becomes.
Multiplying enough n-grams together would result in numerical underflow. By using
log probabilities instead of raw probabilities, we get numbers that are not as small.
Adding in log space is equivalent to multiplying in linear space, so we combine log
probabilities by adding them. The result of doing all computation and storage in log
space is that we only need to convert back into probabilities if we need to report
them at the end; then we can just take the exp of the logprob:

p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4 ) (3.13)
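
A small experiment makes the underflow problem visible. The sketch below first checks Eq. 3.13 on the five bigram probabilities from the i want english food example, then multiplies one hundred made-up probabilities of 0.00001 each; the value 0.00001 and the variable names are illustrative assumptions.

```python
import math

# Eq. 3.13 on the five bigram probabilities from the example sentence.
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]
log_prob = sum(math.log(p) for p in probs)
recovered = math.exp(log_prob)  # same value as multiplying the probabilities

# One hundred small (made-up) probabilities: the true product is 1e-500,
# which is far below the smallest positive double-precision number.
tiny_probs = [1e-5] * 100
raw_product = 1.0
for p in tiny_probs:
    raw_product *= p  # silently underflows to 0.0
log_total = sum(math.log(p) for p in tiny_probs)  # stays a usable number

print(recovered)
print(raw_product)
print(log_total)
```

The raw product collapses to exactly zero, so two very different sentences would become indistinguishable, while the summed log probability still orders them correctly.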

3.2 Evaluating Language Models


The best way to evaluate the performance of a language model is to embed it in
an application and measure how much the application improves. Such end-to-end
evaluation is called extrinsic evaluation. Extrinsic evaluation is the only way to
know if a particular improvement in a component is really going to help the task
at hand. Thus, for speech recognition, we can compare the performance of two
language models by running the speech recognizer twice, once with each language
model, and seeing which gives the more accurate transcription.
Unfortunately, running big NLP systems end-to-end is often very expensive. Instead,
it would be nice to have a metric that can be used to quickly evaluate potential
improvements in a language model. An intrinsic evaluation metric is one that measures
the quality of a model independent of any application.
For an intrinsic evaluation of a language model we need a test set. As with many
of the statistical models in our field, the probabilities of an n-gram model come from
the corpus it is trained on, the training set or training corpus. We can then measure
the quality of an n-gram model by its performance on some unseen data called the
test set or test corpus. We will also sometimes call test sets and other datasets that
are not in our training sets held out corpora because we hold them out from the
training data.
So if we are given a corpus of text and want to compare two different n-gram
models, we divide the data into training and test sets, train the parameters of both
models on the training set, and then compare how well the two trained models fit the
test set.
But what does it mean to “fit the test set”? The answer is simple: whichever
model assigns a higher probability to the test set—meaning it more accurately
predicts the test set—is a better model. Given two probabilistic models, the better
model is the one that has a tighter fit to the test data or that better predicts the details
of the test data, and hence will assign a higher probability to the test data.
Since our evaluation metric is based on test set probability, it’s important not to
let the test sentences into the training set. Suppose we are trying to compute the
probability of a particular “test” sentence. If our test sentence is part of the training
corpus, we will mistakenly assign it an artificially high probability when it occurs
in the test set. We call this situation training on the test set. Training on the test
set introduces a bias that makes the probabilities all look too high, and causes huge
inaccuracies in perplexity, the probability-based metric we introduce below.
Sometimes we use a particular test set so often that we implicitly tune to its
characteristics. We then need a fresh test set that is truly unseen. In such cases, we
call the initial test set the development test set, or devset. How do we divide our
data into training, development, and test sets? We want our test set to be as large
as possible, since a small test set may be accidentally unrepresentative, but we also
want as much training data as possible. At the minimum, we would want to pick
the smallest test set that gives us enough statistical power to measure a statistically
significant difference between two potential models. In practice, we often just divide
our data into 80% training, 10% development, and 10% test. Given a large corpus
that we want to divide into training and test, test data can either be taken from some
continuous sequence of text inside the corpus, or we can remove smaller “stripes”
of text from randomly selected parts of our corpus and combine them into a test set.
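
One simple way to realize the 80/10/10 division is sketched below. It treats each sentence as its own "stripe" by shuffling sentences before cutting, which is only one of the options just described; the function name split_corpus, its parameters, and the fixed random seed are our assumptions.

```python
import random


def split_corpus(sentences, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle sentences and cut them into training, development, and test sets."""
    shuffled = list(sentences)              # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]       # whatever remains is the test set
    return train, dev, test


sentences = [f"sentence {i}" for i in range(100)]
train, dev, test = split_corpus(sentences)
print(len(train), len(dev), len(test))  # 80 10 10
```

To take the test data from one continuous sequence instead, skip the shuffle and slice the original list directly.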

3.2.1 Perplexity
In practice we don’t use raw probability as our metric for evaluating language models,
but a variant called perplexity. The perplexity (sometimes called PP for short)
of a language model on a test set is the inverse probability of the test set, normalized
by the number of words. For a test set $W = w_1 w_2 \ldots w_N$:

$$
\begin{aligned}
PP(W) &= P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} \\
&= \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
\end{aligned}
\tag{3.14}
$$

We can use the chain rule to expand the probability of W :

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}} \tag{3.15}$$

Thus, if we are computing the perplexity of W with a bigram language model,
we get:

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}} \tag{3.16}$$
i=1

Note that because of the inverse in Eq. 3.15, the higher the conditional probabil-
ity of the word sequence, the lower the perplexity. Thus, minimizing perplexity is
equivalent to maximizing the test set probability according to the language model.
What we generally use for word sequence in Eq. 3.15 or Eq. 3.16 is the entire se-
quence of words in some test set. Since this sequence will cross many sentence
boundaries, we need to include the begin- and end-sentence markers <s> and </s>
in the probability computation. We also need to include the end-of-sentence marker
</s> (but not the beginning-of-sentence marker <s>) in the total count of word to-
kens N.
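
The bookkeeping for sentence markers is easy to get wrong, so here is a sketch of Eq. 3.16 using the unsmoothed bigram model of the three-sentence mini-corpus from Section 3.1. The function name perplexity is ours, and the code assumes every bigram in the test data was seen in training; an unseen bigram would have probability zero and make the logarithm fail.

```python
import math
from collections import Counter

train = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]
unigram_counts, bigram_counts = Counter(), Counter()
for sentence in train:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))


def perplexity(test_sentences):
    """Bigram perplexity (Eq. 3.16), computed in log space."""
    log_prob = 0.0
    n_tokens = 0
    for sentence in test_sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_prob += math.log(bigram_counts[(prev, word)] / unigram_counts[prev])
        # N counts every word plus </s>, but not <s>.
        n_tokens += len(tokens) - 1
    return math.exp(-log_prob / n_tokens)


print(perplexity(["I am Sam"]))  # about 1.732
```

For the test sentence I am Sam the four bigram probabilities are 2/3, 2/3, 1/2, and 1/2, whose product is 1/9; with N = 4 the perplexity is the fourth root of 9.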
There is another way to think about perplexity: as the weighted average branch-
ing factor of a language. The branching factor of a language is the number of possi-
ble next words that can follow any word. Consider the task of recognizing the digits
in English (zero, one, two,..., nine), given that (both in some training set and in some
test set) each of the 10 digits occurs with equal probability P = 1/10. The perplexity of
this mini-language is in fact 10. To see that, imagine a test string of digits of length
N, and assume that in the training set all the digits occurred with equal probability.
By Eq. 3.15, the perplexity will be

$$
\begin{aligned}
PP(W) &= P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} \\
&= \left(\left(\frac{1}{10}\right)^{N}\right)^{-\frac{1}{N}} \\
&= \left(\frac{1}{10}\right)^{-1} \\
&= 10
\end{aligned}
\tag{3.17}
$$

But suppose that the number zero is really frequent and occurs far more often
than other numbers. Let’s say that 0 occurs 91 times in the training set, and each
of the other digits occurs 1 time. Now we see the following test set: 0 0
0 0 0 3 0 0 0 0. We should expect the perplexity of this test set to be lower since
most of the time the next number will be zero, which is very predictable, i.e. has
a high probability. Thus, although the branching factor is still 10, the perplexity or
weighted branching factor is smaller. We leave this exact calculation as Exercise 3.12.
We see in Section 3.7 that perplexity is also closely related to the information-
theoretic notion of entropy.
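
The weighted branching factor view can be checked numerically. The helper below (perplexity_from_probs, our name) takes the probability a model assigned to each test token. The uniform case reproduces Eq. 3.17; the skewed case uses made-up probabilities, not the counts from the exercise, purely to show the direction of the effect.

```python
import math


def perplexity_from_probs(token_probs):
    """Perplexity given the probability the model assigned to each test token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)


# Uniform digits: every token has probability 1/10, whatever the length.
uniform = perplexity_from_probs([1 / 10] * 20)

# Hypothetical skewed model: one digit has probability 0.5 and the other
# nine share the rest; the test string has nine frequent digits and one rare one.
skewed = perplexity_from_probs([0.5] * 9 + [0.5 / 9])

print(uniform)  # 10, up to floating-point rounding
print(skewed)   # well below 10
```

The vocabulary still has ten digits in both cases, so the branching factor is unchanged; only the probability-weighted version drops.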
Finally, let’s look at an example of how perplexity can be used to compare dif-
ferent n-gram models. We trained unigram, bigram, and trigram grammars on 38
million words (including start-of-sentence tokens) from the Wall Street Journal, us-
ing a 19,979 word vocabulary. We then computed the perplexity of each of these
models on a test set of 1.5 million words with Eq. 3.16. The table below shows the
perplexity of a 1.5 million word WSJ test set according to each of these grammars.
             Unigram  Bigram  Trigram
Perplexity   962      170     109
As we see above, the more information the n-gram gives us about the word
sequence, the lower the perplexity (since as Eq. 3.15 showed, perplexity is related
inversely to the likelihood of the test sequence according to the model).
Note that in computing perplexities, the n-gram model P must be constructed
without any knowledge of the test set or any prior knowledge of the vocabulary of
the test set. Any kind of knowledge of the test set can cause the perplexity to be
artificially low. The perplexity of two language models is only comparable if they
use identical vocabularies.
An (intrinsic) improvement in perplexity does not guarantee an (extrinsic) im-
provement in the performance of a language processing task like speech recognition
or machine translation. Nonetheless, because perplexity often correlates with such
improvements, it is commonly used as a quick check on an algorithm. But a model’s
improvement in perplexity should always be confirmed by an end-to-end evaluation
of a real task before concluding the evaluation of the model.

3.3 Generalization and Zeros


The n-gram model, like many statistical models, is dependent on the training corpus.
One implication of this is that the probabilities often encode specific facts about a
given training corpus. Another implication is that n-grams do a better and better job
of modeling the training corpus as we increase the value of N.
We can visualize both of these facts by borrowing the technique of Shannon
(1951) and Miller and Selfridge (1950) of generating random sentences from different
n-gram models. It’s simplest to visualize how this works for the unigram
case. Imagine all the words of the English language covering the probability space
between 0 and 1, each word covering an interval proportional to its frequency. We
choose a random value between 0 and 1 and print the word whose interval includes
this chosen value. We continue choosing random numbers and generating words
until we randomly generate the sentence-final token </s>. We can use the same
technique to generate bigrams by first generating a random bigram that starts with
<s> (according to its bigram probability). Let’s say the second word of that bigram
is w. We next choose a random bigram starting with w (again, drawn according to its
bigram probability), and so on.
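
The sampling procedure just described can be sketched in a few lines. To keep the example self-contained it is trained on the three-sentence mini-corpus from Section 3.1 rather than on Shakespeare; the names successors, sample_next, and generate, and the max_len cutoff, are our assumptions.

```python
import random
from collections import Counter, defaultdict

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# successors[prev][word] = C(prev word)
successors = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        successors[prev][word] += 1


def sample_next(prev, rng):
    """Pick the word whose interval contains a random point, as in the text."""
    counts = successors[prev]
    point = rng.random() * sum(counts.values())
    cumulative = 0
    for word, count in counts.items():
        cumulative += count          # each word covers an interval of width C(prev word)
        if point < cumulative:
            return word
    return word                      # guard against floating-point edge cases


def generate(rng, max_len=50):
    """Generate words from <s> until </s> is drawn (or max_len is reached)."""
    word, output = "<s>", []
    while len(output) < max_len:
        word = sample_next(word, rng)
        if word == "</s>":
            break
        output.append(word)
    return output


print(" ".join(generate(random.Random(0))))
```

Every adjacent pair in the output is a bigram seen in training, which is the same effect, in miniature, that makes higher-order n-gram output look so much like its training text.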
To give an intuition for the increasing power of higher-order n-grams, Fig. 3.3
shows random sentences generated from unigram, bigram, trigram, and 4-gram
models trained on Shakespeare’s works.

1-gram:
–To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
–Hill he late speaks; or! a more to leg less first you enter
2-gram:
–Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
–What means, sir. I confess she? then all sorts, he is trim, captain.
3-gram:
–Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, ’tis done.
–This shall forbid it should be branded, if renown made it empty.
4-gram:
–King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv’d in;
–It cannot be but so.
Figure 3.3 Eight sentences randomly generated from four n-grams computed from Shakespeare’s works. All
characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected
for capitalization to improve readability.

The longer the context on which we train the model, the more coherent the sen-
tences. In the unigram sentences, there is no coherent relation between words or any
sentence-final punctuation. The bigram sentences have some local word-to-word
coherence (especially if we consider that punctuation counts as a word). The tri-
gram and 4-gram sentences are beginning to look a lot like Shakespeare. Indeed, a
careful investigation of the 4-gram sentences shows that they look a little too much
like Shakespeare. The words It cannot be but so are directly from King John. This is
because, not to put the knock on Shakespeare, his oeuvre is not very large as corpora
go (N = 884,647, V = 29,066), and our n-gram probability matrices are ridiculously
sparse. There are $V^2 = 844{,}000{,}000$ possible bigrams alone, and the number of
possible 4-grams is $V^4 = 7 \times 10^{17}$. Thus, once the generator has chosen the first 4-gram
(It cannot be but), there are only five possible continuations (that, I, he, thou, and
so); indeed, for many 4-grams, there is only one continuation.
To get an idea of the dependence of a grammar on its training set, let’s look at an
n-gram grammar trained on a completely different corpus: the Wall Street Journal
(WSJ) newspaper. Shakespeare and the Wall Street Journal are both English, so
we might expect some overlap between our n-grams for the two genres. Fig. 3.4

shows sentences generated by unigram, bigram, and trigram grammars trained on
40 million words from WSJ.

1-gram  Months the my and issue of year foreign new exchange’s september
        were recession exchange new endorsed a acquire to six executives
2-gram  Last December through the way to preserve the Hudson corporation N.
        B. E. C. Taylor would seem to complete the major central planners one
        point five percent of U. S. E. has already old M. X. corporation of living
        on information such as more frequently fishing to keep her
3-gram  They also point to ninety nine point six billion dollars from two hundred
        four oh six three percent of the rates of interest stores as Mexico and
        Brazil on market conditions
Figure 3.4 Three sentences randomly generated from three n-gram models computed from
40 million words of the Wall Street Journal, lower-casing all characters and treating punctua-
tion as words. Output was then hand-corrected for capitalization to improve readability.

Compare these examples to the pseudo-Shakespeare in Fig. 3.3. While they both
model “English-like sentences”, there is clearly no overlap in generated sentences,
and little overlap even in small phrases. Statistical models are likely to be pretty use-
less as predictors if the training sets and the test sets are as different as Shakespeare
and WSJ.
How should we deal with this problem when we build n-gram models? One step
is to be sure to use a training corpus that has a similar genre to whatever task we are
trying to accomplish. To build a language model for translating legal documents,
we need a training corpus of legal documents. To build a language model for a
question-answering system, we need a training corpus of questions.
It is equally important to get training data in the appropriate dialect or variety,
especially when processing social media posts or spoken transcripts. For example,
some tweets will use features of African American Language (AAL), the name
for the many variations of language used in African American communities (King,
2020). Such features include words like finna (an auxiliary verb that marks immediate
future tense) that don’t occur in other varieties, or spellings like den for then,
in tweets like this one (Blodgett and O’Connor, 2017):
(3.18) Bored af den my phone finna die!!!
while tweets from varieties like Nigerian English have markedly different vocabu-
lary and n-gram patterns from American English (Jurgens et al., 2017):
(3.19) @username R u a wizard or wat gan sef: in d mornin - u tweet, afternoon - u
tweet, nyt gan u dey tweet. beta get ur IT placement wiv twitter
Matching genres and dialects is still not sufficient. Our models may still be
subject to the problem of sparsity. For any n-gram that occurred a sufficient number
of times, we might have a good estimate of its probability. But because any corpus is
limited, some perfectly acceptable English word sequences are bound to be missing
from it. That is, we’ll have many cases of putative “zero probability n-grams” that
should really have some non-zero probability. Consider the words that follow the
bigram denied the in the WSJ Treebank3 corpus, together with their counts:
denied the allegations: 5
denied the speculation: 2
denied the rumors: 1
denied the report: 1

But suppose our test set has phrases like:


denied the offer
denied the loan
Our model will incorrectly estimate that P(offer|denied the) is 0!
These zeros (things that don’t ever occur in the training set but do occur in
the test set) are a problem for two reasons. First, their presence means we are
underestimating the probability of all sorts of words that might occur, which will
hurt the performance of any application we want to run on this data.
Second, if the probability of any word in the test set is 0, the entire probability
of the test set is 0. By definition, perplexity is based on the inverse probability of the
test set. Thus if some words have zero probability, we can’t compute perplexity at
all, since we can’t divide by 0!
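The effect of a single zero on perplexity can be seen in a small sketch. It assumes the definition of perplexity from Section 3.2.1 (the inverse test-set probability, normalized by the number of tokens) and works in log space; the function name `perplexity` and the choice to return infinity rather than raise an error are assumptions made for this illustration.

```python
import math

def perplexity(probs):
    """Perplexity of a test set, given the model probability of each token."""
    if any(p == 0.0 for p in probs):
        # log(0) is undefined and the inverse probability is unbounded,
        # so this sketch reports infinity.
        return math.inf
    log_sum = sum(math.log2(p) for p in probs)
    return 2 ** (-log_sum / len(probs))

print(perplexity([0.25, 0.5, 0.25]))  # finite
print(perplexity([0.25, 0.0, 0.25]))  # one zero makes the whole test set unscorable
```

However many tokens the model predicts well, one zero-probability token is enough to make the measure unusable, which is the motivation for the smoothing methods that follow.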

3.3.1 Unknown Words


The previous section discussed the problem of words whose bigram probability is
zero. But what about words we simply have never seen before?
Sometimes we have a language task in which this can’t happen because we know
all the words that can occur. In such a closed vocabulary system the test set can
only contain words from this lexicon, and there will be no unknown words. This is
a reasonable assumption in some domains, such as speech recognition or machine
translation, where we have a pronunciation dictionary or a phrase table that is fixed
in advance, and so the language model can only use the words in that dictionary or
phrase table.
In other cases we have to deal with words we haven’t seen before, which we’ll
call unknown words, or out of vocabulary (OOV) words. The percentage of OOV
words that appear in the test set is called the OOV rate. An open vocabulary system
is one in which we model these potential unknown words in the test set by adding a
pseudo-word called <UNK>.
There are two common ways to train the probabilities of the unknown word
model <UNK>. The first one is to turn the problem back into a closed vocabulary one
by choosing a fixed vocabulary in advance:
1. Choose a vocabulary (word list) that is fixed in advance.
2. Convert in the training set any word that is not in this set (any OOV word) to
the unknown word token <UNK> in a text normalization step.
3. Estimate the probabilities for <UNK> from its counts just like any other regular
word in the training set.
The second alternative, in situations where we don’t have a vocabulary fixed in
advance, is to create such a vocabulary implicitly, replacing words in the training data
by <UNK> based on their frequency. For example, we can replace by <UNK> all words
that occur fewer than n times in the training set, where n is some small number, or
equivalently select a vocabulary size V in advance (say 50,000), choose the top
V words by frequency, and replace the rest by <UNK>. In either case we then proceed
to train the language model as before, treating <UNK> like a regular word.
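The frequency-threshold variant can be sketched as follows. The helper names `build_vocab` and `replace_oov`, the threshold `min_count=2`, and the toy sentences are all assumptions for illustration; the same `replace_oov` step would be applied to test data before scoring.

```python
from collections import Counter

def build_vocab(tokens, min_count=2):
    """Keep words seen at least min_count times; the rest become OOV."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(tokens, vocab, unk="<UNK>"):
    """Map every token outside the vocabulary to the <UNK> pseudo-word."""
    return [w if w in vocab else unk for w in tokens]

# Hypothetical toy data.
train = "the cat sat on the mat the dog sat".split()
vocab = build_vocab(train, min_count=2)
train_unk = replace_oov(train, vocab)
test_unk = replace_oov("the bird sat".split(), vocab)
print(train_unk)
print(test_unk)
```

After this normalization step, `<UNK>` has ordinary counts in the training data, so its probabilities are estimated like those of any other word.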
The exact choice of <UNK> model does have an effect on metrics like perplexity.
A language model can achieve low perplexity by choosing a small vocabulary and
assigning the unknown word a high probability. For this reason, perplexities should
only be compared across language models with the same vocabularies (Buck et al.,
2014).

3.4 Smoothing
What do we do with words that are in our vocabulary (they are not unknown words)
but appear in a test set in an unseen context (for example they appear after a word
they never appeared after in training)? To keep a language model from assigning
zero probability to these unseen events, we’ll have to shave off a bit of probability
mass from some more frequent events and give it to the events we’ve never seen.
This modification is called smoothing or discounting. In this section and the
following ones we’ll introduce a variety of ways to do smoothing: Laplace (add-one)
smoothing, add-k smoothing, stupid backoff, and Kneser-Ney smoothing.

3.4.1 Laplace Smoothing


The simplest way to do smoothing is to add one to all the bigram counts, before
we normalize them into probabilities. All the counts that used to be zero will now
have a count of 1, the counts of 1 will be 2, and so on. This algorithm is called
Laplace smoothing. Laplace smoothing does not perform well enough to be used
in modern n-gram models, but it usefully introduces many of the concepts that we
see in other smoothing algorithms, gives a useful baseline, and is also a practical
smoothing algorithm for other tasks like text classification (Chapter 4).
Let’s start with the application of Laplace smoothing to unigram probabilities.
Recall that the unsmoothed maximum likelihood estimate of the unigram probability
of the word $w_i$ is its count $c_i$ normalized by the total number of word tokens N:

\[
P(w_i) = \frac{c_i}{N}
\]
Laplace smoothing merely adds one to each count (hence its alternate name
add-one smoothing). Since there are V words in the vocabulary and each one was
incremented, we also need to adjust the denominator to take into account the extra V
observations. (What happens to our P values if we don’t increase the denominator?)

\[
P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V} \tag{3.20}
\]
Instead of changing both the numerator and denominator, it is convenient to
describe how a smoothing algorithm affects the numerator, by defining an adjusted
count $c^*$. This adjusted count is easier to compare directly with the MLE counts and
can be turned into a probability like an MLE count by normalizing by N. To define
this count, since we are only changing the numerator, in addition to adding 1 we’ll
also need to multiply by a normalization factor $\frac{N}{N+V}$:

\[
c_i^* = (c_i + 1)\,\frac{N}{N + V} \tag{3.21}
\]

We can now turn $c_i^*$ into a probability $P_i^*$ by normalizing by N.
A related way to view smoothing is as discounting (lowering) some non-zero
counts in order to get the probability mass that will be assigned to the zero counts.
Thus, instead of referring to the discounted counts $c^*$, we might describe a
smoothing algorithm in terms of a relative discount $d_c$, the ratio of the discounted counts to
the original counts:

\[
d_c = \frac{c^*}{c}
\]
Now that we have the intuition for the unigram case, let’s smooth our Berkeley
Restaurant Project bigrams. Figure 3.5 shows the add-one smoothed counts for the
bigrams in Fig. 3.1.

              i    want      to     eat  chinese    food   lunch   spend
i             6     828       1      10        1       1       1       3
want          3       1     609       2        7       7       6       2
to            3       1       5     687        3       1       7     212
eat           1       1       3       1       17       3      43       1
chinese       2       1       1       1        1      83       2       1
food         16       1      16       1        2       5       1       1
lunch         3       1       1       1        1       2       1       1
spend         2       1       2       1        1       1       1       1
Figure 3.5 Add-one smoothed bigram counts for eight of the words (out of V = 1446) in
the Berkeley Restaurant Project corpus of 9332 sentences. Previously-zero counts are in gray.

Figure 3.6 shows the add-one smoothed probabilities for the bigrams in Fig. 3.2.
Recall that normal bigram probabilities are computed by normalizing each row of
counts by the unigram count:

\[
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})} \tag{3.22}
\]
For add-one smoothed bigram counts, we need to augment the unigram count by
the number of total word types in the vocabulary V :

\[
P^*_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{\sum_w \left( C(w_{n-1} w) + 1 \right)} = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V} \tag{3.23}
\]
Thus, each of the unigram counts given in the previous section will need to be
augmented by V = 1446. The result is the smoothed bigram probabilities in Fig. 3.6.

         i        want     to       eat      chinese  food     lunch    spend
i        0.0015   0.21     0.00025  0.0025   0.00025  0.00025  0.00025  0.00075
want     0.0013   0.00042  0.26     0.00084  0.0029   0.0029   0.0025   0.00084
to       0.00078  0.00026  0.0013   0.18     0.00078  0.00026  0.0018   0.055
eat      0.00046  0.00046  0.0014   0.00046  0.0078   0.0014   0.02     0.00046
chinese  0.0012   0.00062  0.00062  0.00062  0.00062  0.052    0.0012   0.00062
food     0.0063   0.00039  0.0063   0.00039  0.00079  0.002    0.00039  0.00039
lunch    0.0017   0.00056  0.00056  0.00056  0.00056  0.0011   0.00056  0.00056
spend    0.0012   0.00058  0.0012   0.00058  0.00058  0.00058  0.00058  0.00058
Figure 3.6 Add-one smoothed bigram probabilities for eight of the words (out of V = 1446) in the BeRP
corpus of 9332 sentences. Previously-zero probabilities are in gray.

It is often convenient to reconstruct the count matrix so we can see how much a
smoothing algorithm has changed the original counts. These adjusted counts can be
computed by Eq. 3.24. Figure 3.7 shows the reconstructed counts.
\[
c^*(w_{n-1} w_n) = \frac{\left[ C(w_{n-1} w_n) + 1 \right] \times C(w_{n-1})}{C(w_{n-1}) + V} \tag{3.24}
\]

         i      want   to     eat    chinese  food   lunch  spend
i        3.8    527    0.64   6.4    0.64     0.64   0.64   1.9
want     1.2    0.39   238    0.78   2.7      2.7    2.3    0.78
to       1.9    0.63   3.1    430    1.9      0.63   4.4    133
eat      0.34   0.34   1      0.34   5.8      1      15     0.34
chinese  0.2    0.098  0.098  0.098  0.098    8.2    0.2    0.098
food     6.9    0.43   6.9    0.43   0.86     2.2    0.43   0.43
lunch    0.57   0.19   0.19   0.19   0.19     0.38   0.19   0.19
spend    0.32   0.16   0.32   0.16   0.16     0.16   0.16   0.16
Figure 3.7 Add-one reconstituted counts for eight words (of V = 1446) in the BeRP corpus
of 9332 sentences. Previously-zero counts are in gray.

Note that add-one smoothing has made a very big change to the counts. C(want to)
changed from 608 to 238! We can see this in probability space as well: P(to|want)
decreases from .66 in the unsmoothed case to .26 in the smoothed case. Looking at
the discount d (the ratio between new and old counts) shows us how strikingly the
counts for each prefix word have been reduced; the discount for the bigram want to
is .39, while the discount for Chinese food is .10, a factor of 10!
The sharp change in counts and probabilities occurs because too much probabil-
ity mass is moved to all the zeros.
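These figures can be checked with a short calculation based on Eq. 3.24. In the sketch below the unsmoothed bigram counts (608 and 82) are the Fig. 3.5 entries minus one, V = 1446 comes from the figure caption, and the unigram counts C(want) = 927 and C(chinese) = 158 are assumed from the count table earlier in the chapter, which is not reproduced in this section.

```python
V = 1446          # vocabulary size, from the caption of Fig. 3.5
C_WANT = 927      # assumed unigram count C(want)
C_CHINESE = 158   # assumed unigram count C(chinese)

def adjusted_count(bigram_count, context_count, vocab_size):
    """Eq. 3.24: the count implied by the add-one smoothed probability."""
    return (bigram_count + 1) * context_count / (context_count + vocab_size)

c_star_want_to = adjusted_count(608, C_WANT, V)
c_star_chinese_food = adjusted_count(82, C_CHINESE, V)

# Adjusted count and relative discount d = c* / c for each bigram.
print(round(c_star_want_to), round(c_star_want_to / 608, 2))
print(round(c_star_chinese_food, 1), round(c_star_chinese_food / 82, 2))
```

Because V is large compared with C(chinese), almost nine tenths of the mass for that context is handed to unseen bigrams, which is why the discount is so much harsher for chinese food than for want to.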

3.4.2 Add-k smoothing


One alternative to add-one smoothing is to move a bit less of the probability mass
from the seen to the unseen events. Instead of adding 1 to each count, we add a
fractional count k (.5? .05? .01?). This algorithm is therefore called add-k smoothing.

\[
P^*_{\text{Add-k}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV} \tag{3.25}
\]
Add-k smoothing requires that we have a method for choosing k; this can be
done, for example, by optimizing on a devset. Although add-k is useful for some
tasks (including text classification), it turns out that it still doesn’t work well for
language modeling, generating counts with poor variances and often inappropriate
discounts (Gale and Church, 1994).
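A minimal sketch of Eq. 3.25 shows how the choice of k trades mass between seen and unseen bigrams. The function name `add_k_prob` is hypothetical, and the counts used in the loop (608 for want to, 927 for want, V = 1446) are assumed from the Berkeley Restaurant Project tables earlier in the chapter.

```python
def add_k_prob(bigram_count, context_count, vocab_size, k=1.0):
    """Eq. 3.25; k = 1 recovers Laplace (add-one) smoothing."""
    return (bigram_count + k) / (context_count + k * vocab_size)

# Seen bigram "want to" versus an unseen bigram after "want".
for k in (1.0, 0.5, 0.05, 0.01):
    seen = add_k_prob(608, 927, 1446, k)
    unseen = add_k_prob(0, 927, 1446, k)
    print(k, round(seen, 4), round(unseen, 6))
```

Smaller values of k leave the seen bigram closer to its maximum likelihood estimate while still giving every unseen bigram a non-zero probability; in practice k would be tuned on a devset as described above.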

3.4.3 Backoff and Interpolation


The discounting we have been discussing so far can help solve the problem of zero
frequency n-grams. But there is an additional source of knowledge we can draw on.
If we are trying to compute $P(w_n \mid w_{n-2} w_{n-1})$ but we have no examples of a particular
trigram $w_{n-2} w_{n-1} w_n$, we can instead estimate its probability by using the bigram
probability $P(w_n \mid w_{n-1})$. Similarly, if we don’t have counts to compute $P(w_n \mid w_{n-1})$,
we can look to the unigram $P(w_n)$.
In other words, sometimes using less context is a good thing, helping to generalize
more for contexts that the model hasn’t learned much about. There are two ways
to use this n-gram “hierarchy”. In backoff, we use the trigram if the evidence is
sufficient, otherwise we use the bigram, otherwise the unigram. In other words, we
only “back off” to a lower-order n-gram if we have zero evidence for a higher-order
n-gram. By contrast, in interpolation, we always mix the probability estimates from
all the n-gram estimators, weighting and combining the trigram, bigram, and unigram
counts.

In simple linear interpolation, we combine different order n-grams by linearly
interpolating all the models. Thus, we estimate the trigram probability $P(w_n \mid w_{n-2} w_{n-1})$
by mixing together the unigram, bigram, and trigram probabilities, each weighted
by a λ:

\[
\begin{aligned}
\hat{P}(w_n \mid w_{n-2} w_{n-1}) = {} & \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) \\
& + \lambda_2 P(w_n \mid w_{n-1}) \\
& + \lambda_3 P(w_n)
\end{aligned}
\tag{3.26}
\]

such that the λs sum to 1:

\[
\sum_i \lambda_i = 1 \tag{3.27}
\]

In a slightly more sophisticated version of linear interpolation, each λ weight is
computed by conditioning on the context. This way, if we have particularly accurate
counts for a particular bigram, we assume that the counts of the trigrams based on
this bigram will be more trustworthy, so we can make the λs for those trigrams
higher and thus give that trigram more weight in the interpolation. Equation 3.28
shows the equation for interpolation with context-conditioned weights:

\[
\begin{aligned}
\hat{P}(w_n \mid w_{n-2} w_{n-1}) = {} & \lambda_1(w_{n-2:n-1}) P(w_n \mid w_{n-2} w_{n-1}) \\
& + \lambda_2(w_{n-2:n-1}) P(w_n \mid w_{n-1}) \\
& + \lambda_3(w_{n-2:n-1}) P(w_n)
\end{aligned}
\tag{3.28}
\]

How are these λ values set? Both the simple interpolation and conditional
interpolation λs are learned from a held-out corpus. A held-out corpus is an additional
training corpus that we use to set hyperparameters like these λ values, by choosing
the λ values that maximize the likelihood of the held-out corpus. That is, we fix
the n-gram probabilities and then search for the λ values that, when plugged into
Eq. 3.26, give us the highest probability of the held-out set. There are various ways
to find this optimal set of λs. One way is to use the EM algorithm, an iterative
learning algorithm that converges on locally optimal λs (Jelinek and Mercer, 1980).
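As an illustration of how the λs of Eq. 3.26 might be fit, the sketch below applies the standard EM update for mixture weights to fixed component probabilities. It is a simplified stand-in, not necessarily the exact procedure of Jelinek and Mercer (1980): the function name `em_lambdas`, the input format (one tuple of trigram, bigram, and unigram probabilities per held-out token), and the fixed iteration count are all assumptions.

```python
def em_lambdas(component_probs, iterations=50):
    """Fit interpolation weights on held-out data with EM.

    component_probs holds one tuple per held-out token, giving the
    probability each fixed n-gram model assigns to that token.
    """
    k = len(component_probs[0])
    lambdas = [1.0 / k] * k
    for _ in range(iterations):
        totals = [0.0] * k
        for probs in component_probs:
            mix = sum(l * p for l, p in zip(lambdas, probs))
            if mix == 0.0:
                continue  # no model explains this token; skip it
            # E-step: share of this token's probability owed to each model.
            for j in range(k):
                totals[j] += lambdas[j] * probs[j] / mix
        # M-step: renormalize the expected shares into new weights.
        norm = sum(totals)
        lambdas = [t / norm for t in totals]
    return lambdas

# Hypothetical held-out tokens as (P_trigram, P_bigram, P_unigram).
heldout = [(0.5, 0.1, 0.01), (0.0, 0.2, 0.05), (0.3, 0.3, 0.02), (0.0, 0.0, 0.001)]
print(em_lambdas(heldout))
```

Each iteration cannot decrease the held-out likelihood, and the weights stay non-negative and sum to 1, as Eq. 3.27 requires.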
In a backoff n-gram model, if the n-gram we need has zero counts, we approxi-
mate it by backing off to the (N-1)-gram. We continue backing off until we reach a
history that has some counts.
In order for a backoff model to give a correct probability distribution, we have
to discount the higher-order n-grams to save some probability mass for the lower
order n-grams. Just as with add-one smoothing, if the higher-order n-grams aren’t
discounted and we just used the undiscounted MLE probability, then as soon as we
replaced an n-gram which has zero probability with a lower-order n-gram, we would
be adding probability mass, and the total probability assigned to all possible strings
by the language model would be greater than 1! In addition to this explicit discount
factor, we’ll need a function α to distribute this probability mass to the lower order
n-grams.
This kind of backoff with discounting is also called Katz backoff. In Katz backoff
we rely on a discounted probability $P^*$ if we’ve seen this n-gram before (i.e., if
we have non-zero counts). Otherwise, we recursively back off to the Katz probability
for the shorter-history (N-1)-gram. The probability for a backoff n-gram $P_{BO}$ is

thus computed as follows:

\[
P_{BO}(w_n \mid w_{n-N+1:n-1}) =
\begin{cases}
P^*(w_n \mid w_{n-N+1:n-1}), & \text{if } C(w_{n-N+1:n}) > 0 \\
\alpha(w_{n-N+1:n-1})\, P_{BO}(w_n \mid w_{n-N+2:n-1}), & \text{otherwise.}
\end{cases}
\tag{3.29}
\]

Katz backoff is often combined with a smoothing method called Good-Turing.
The combined Good-Turing backoff algorithm involves quite detailed computation
for estimating the Good-Turing smoothing and the $P^*$ and α values.

3.5 Kneser-Ney Smoothing


One of the most commonly used and best performing n-gram smoothing methods
is the interpolated Kneser-Ney algorithm (Kneser and Ney 1995, Chen and
Goodman 1998).
Kneser-Ney has its roots in a method called absolute discounting. Recall that
discounting of the counts for frequent n-grams is necessary to save some probability
mass for the smoothing algorithm to distribute to the unseen n-grams.
To see this, we can use a clever idea from Church and Gale (1991). Consider
an n-gram that has count 4. We need to discount this count by some amount. But
how much should we discount it? Church and Gale’s clever idea was to look at a
held-out corpus and just see what the count is for all those bigrams that had count
4 in the training set. They computed a bigram grammar from 22 million words of
AP newswire and then checked the counts of each of these bigrams in another 22
million words. On average, a bigram that occurred 4 times in the first 22 million
words occurred 3.23 times in the next 22 million words. Fig. 3.8 from Church and
Gale (1991) shows these counts for bigrams with c from 0 to 9.

Bigram count in    Bigram count in
training set       heldout set
0                  0.0000270
1                  0.448
2                  1.25
3                  2.24
4                  3.23
5                  4.21
6                  5.23
7                  6.21
8                  7.21
9                  8.26
Figure 3.8 For all bigrams in 22 million words of AP newswire of count 0, 1, 2,...,9, the
counts of these bigrams in a held-out corpus also of 22 million words.

Notice in Fig. 3.8 that except for the held-out counts for 0 and 1, all the other
bigram counts in the held-out set could be estimated pretty well by just subtracting
0.75 from the count in the training set! Absolute discounting formalizes this
intuition by subtracting a fixed (absolute) discount d from each count. The intuition is
that since we have good estimates already for the very high counts, a small discount
d won’t affect them much. It will mainly modify the smaller counts, for which we

don’t necessarily trust the estimate anyway, and Fig. 3.8 suggests that in practice this
discount is actually a good one for bigrams with counts 2 through 9. The equation
for interpolated absolute discounting applied to bigrams:

\[
P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) - d}{\sum_v C(w_{i-1} v)} + \lambda(w_{i-1}) P(w_i) \tag{3.30}
\]

The first term is the discounted bigram, and the second term is the unigram with
an interpolation weight λ . We could just set all the d values to .75, or we could keep
a separate discount value of 0.5 for the bigrams with counts of 1.
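The size of the discount suggested by Fig. 3.8 can be read off directly by subtracting each held-out count from the corresponding training count. The sketch below copies the values from the figure; the variable names are illustrative only.

```python
# Training count -> average held-out count, copied from Fig. 3.8.
heldout = {0: 0.0000270, 1: 0.448, 2: 1.25, 3: 2.24, 4: 3.23,
           5: 4.21, 6: 5.23, 7: 6.21, 8: 7.21, 9: 8.26}

# How much each training count overestimates the held-out count.
gaps = {c: c - h for c, h in heldout.items()}
mean_gap_2_to_9 = sum(gaps[c] for c in range(2, 10)) / 8

for c in range(1, 10):
    print(c, round(gaps[c], 3))
print("mean gap for counts 2..9:", round(mean_gap_2_to_9, 2))
```

For counts 2 through 9 the gap stays between 0.74 and 0.79, while for a count of 1 it is about 0.55, which is the motivation for keeping a separate, smaller discount for bigrams seen once.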
Kneser-Ney discounting (Kneser and Ney, 1995) augments absolute discount-
ing with a more sophisticated way to handle the lower-order unigram distribution.
Consider the job of predicting the next word in this sentence, assuming we are inter-
polating a bigram and a unigram model.
I can’t see without my reading ______.
The word glasses seems much more likely to follow here than, say, the word
Kong, so we’d like our unigram model to prefer glasses. But in fact it’s Kong that is
more common, since Hong Kong is a very frequent word. A standard unigram model
will assign Kong a higher probability than glasses. We would like to capture the
intuition that although Kong is frequent, it is mainly only frequent in the phrase Hong
Kong, that is, after the word Hong. The word glasses has a much wider distribution.
In other words, instead of $P(w)$, which answers the question “How likely is
w?”, we’d like to create a unigram model that we might call $P_{\text{CONTINUATION}}$, which
answers the question “How likely is w to appear as a novel continuation?”. How can
we estimate this probability of seeing the word w as a novel continuation, in a new
unseen context? The Kneser-Ney intuition is to base our estimate of $P_{\text{CONTINUATION}}$
on the number of different contexts word w has appeared in, that is, the number of
bigram types it completes. Every bigram type was a novel continuation the first time
it was seen. We hypothesize that words that have appeared in more contexts in the
past are more likely to appear in some new context as well. The number of times a
word w appears as a novel continuation can be expressed as:

\[
P_{\text{CONTINUATION}}(w) \propto |\{v : C(vw) > 0\}| \tag{3.31}
\]


To turn this count into a probability, we normalize by the total number of word
bigram types. In summary:

\[
P_{\text{CONTINUATION}}(w) = \frac{|\{v : C(vw) > 0\}|}{|\{(u', w') : C(u'w') > 0\}|} \tag{3.32}
\]
An equivalent formulation based on a different metaphor is to use the number of
word types seen to precede w (Eq. 3.31 repeated):

\[
P_{\text{CONTINUATION}}(w) \propto |\{v : C(vw) > 0\}| \tag{3.33}
\]

normalized by the number of words preceding all words, as follows:

\[
P_{\text{CONTINUATION}}(w) = \frac{|\{v : C(vw) > 0\}|}{\sum_{w'} |\{v : C(vw') > 0\}|} \tag{3.34}
\]

A frequent word (Kong) occurring in only one context (Hong) will have a low
continuation probability.

The final equation for Interpolated Kneser-Ney smoothing for bigrams is then:

\[
P_{KN}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1} w_i) - d, 0)}{C(w_{i-1})} + \lambda(w_{i-1}) P_{\text{CONTINUATION}}(w_i) \tag{3.35}
\]

The λ is a normalizing constant that is used to distribute the probability mass
we’ve discounted:

\[
\lambda(w_{i-1}) = \frac{d}{\sum_v C(w_{i-1} v)}\, |\{w : C(w_{i-1} w) > 0\}| \tag{3.36}
\]

The first term, $\frac{d}{\sum_v C(w_{i-1} v)}$, is the normalized discount. The second term,
$|\{w : C(w_{i-1} w) > 0\}|$, is the number of word types that can follow $w_{i-1}$ or,
equivalently, the number of word types that we discounted; in other words, the number of
times we applied the normalized discount.
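Equations 3.32, 3.35, and 3.36 fit together as in the following sketch of a bigram-only interpolated Kneser-Ney model. The function name `train_kn_bigram`, the toy sentences, and the handling of an unseen context (falling back entirely on the continuation probability) are assumptions made for illustration. The code uses the sum of bigram counts for a context as the denominator, which equals the context's unigram count whenever every token is followed by another token or by the end marker.

```python
from collections import Counter, defaultdict

def train_kn_bigram(sentences, d=0.75):
    """Return (p_kn, vocab) for an interpolated Kneser-Ney bigram model."""
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        bigrams.update(zip(tokens, tokens[1:]))

    context_total = Counter()     # sum over v of C(prev v)
    followers = defaultdict(set)  # word types seen after prev
    preceders = defaultdict(set)  # word types seen before word
    for (prev, word), count in bigrams.items():
        context_total[prev] += count
        followers[prev].add(word)
        preceders[word].add(prev)
    vocab = set(followers) | set(preceders)
    n_bigram_types = len(bigrams)

    def p_continuation(word):
        # Eq. 3.32: share of all bigram types that end in this word.
        return len(preceders.get(word, ())) / n_bigram_types

    def p_kn(word, prev):
        total = context_total[prev]
        if total == 0:
            # Unseen context: nothing to discount (a choice made for this sketch).
            return p_continuation(word)
        discounted = max(bigrams[(prev, word)] - d, 0) / total
        lam = d / total * len(followers[prev])  # Eq. 3.36
        return discounted + lam * p_continuation(word)

    return p_kn, vocab

# Hypothetical toy corpus.
p_kn, vocab = train_kn_bigram(["i am sam", "sam i am", "i do not like green eggs and ham"])
print(p_kn("am", "i"), p_kn("sam", "i"))
```

The unseen bigram i sam still receives probability because sam completes two bigram types, and summing `p_kn(w, "i")` over the vocabulary gives 1, since the mass removed by the discount is exactly what λ redistributes.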
The general recursive formulation is as follows:

\[
P_{KN}(w_i \mid w_{i-n+1:i-1}) = \frac{\max(c_{KN}(w_{i-n+1:i}) - d, 0)}{\sum_v c_{KN}(w_{i-n+1:i-1}\, v)} + \lambda(w_{i-n+1:i-1})\, P_{KN}(w_i \mid w_{i-n+2:i-1}) \tag{3.37}
\]

where the definition of the count $c_{KN}$ depends on whether we are counting the
highest-order n-gram being interpolated (for example trigram if we are interpolating
trigram, bigram, and unigram) or one of the lower-order n-grams (bigram or unigram
if we are interpolating trigram, bigram, and unigram):

\[
c_{KN}(\cdot) =
\begin{cases}
\text{count}(\cdot) & \text{for the highest order} \\
\text{continuationcount}(\cdot) & \text{for lower orders}
\end{cases}
\tag{3.38}
\]
The continuation count is the number of unique single word contexts for ·.
At the termination of the recursion, unigrams are interpolated with the uniform
distribution, where the parameter ε is the empty string:

\[
P_{KN}(w) = \frac{\max(c_{KN}(w) - d, 0)}{\sum_{w'} c_{KN}(w')} + \lambda(\epsilon)\,\frac{1}{V} \tag{3.39}
\]

If we want to include an unknown word <UNK>, it’s just included as a regular
vocabulary entry with count zero, and hence its probability will be a lambda-weighted
uniform distribution $\frac{\lambda(\epsilon)}{V}$.
The best performing version of Kneser-Ney smoothing is called modified
Kneser-Ney smoothing, and is due to Chen and Goodman (1998). Rather than use a single
fixed discount d, modified Kneser-Ney uses three different discounts $d_1$, $d_2$, and
$d_{3+}$ for n-grams with counts of 1, 2 and three or more, respectively. See Chen and
Goodman (1998, p. 19) or Heafield et al. (2013) for the details.

3.6 Huge Language Models and Stupid Backoff


By using text from the web or other enormous collections, it is possible to build
extremely large language models. The Web 1 Trillion 5-gram corpus released by
Google includes various large sets of n-grams, including 1-grams through 5-grams
from all the five-word sequences that appear at least 40 times in

1,024,908,267,229 words of text from publicly accessible Web pages in English
(Franz and Brants, 2006). Google has also released Google Books Ngrams
corpora with n-grams drawn from their book collections, including another 800 billion
tokens of n-grams from Chinese, English, French, German, Hebrew, Italian,
Russian, and Spanish (Lin et al., 2012a). Smaller but more carefully curated n-gram
corpora for English include the million most frequent n-grams drawn from the COCA
(Corpus of Contemporary American English) 1 billion word corpus of American
English (Davies, 2020). COCA is a balanced corpus, meaning that it has roughly
equal numbers of words from different genres (web, newspapers, spoken
conversation transcripts, fiction, and so on) drawn from the period 1990-2019, and has the
context of each n-gram as well as labels for genre and provenance.
Some example 4-grams from the Google Web corpus:
4-gram                      Count
serve as the incoming          92
serve as the incubator         99
serve as the independent      794
serve as the index            223
serve as the indication        72
serve as the indicator        120
serve as the indicators        45
Efficiency considerations are important when building language models that use
such large sets of n-grams. Rather than store each word as a string, it is generally
represented in memory as a 64-bit hash number, with the words themselves stored
on disk. Probabilities are generally quantized using only 4-8 bits (instead of 8-byte
floats), and n-grams are stored in reverse tries.
N-grams can also be shrunk by pruning, for example only storing n-grams with
counts greater than some threshold (such as the count threshold of 40 used for the
Google n-gram release) or using entropy to prune less-important n-grams (Stolcke,
1998). Another option is to build approximate language models using techniques
like Bloom filters (Talbot and Osborne 2007, Church et al. 2007). Finally,
efficient language model toolkits like KenLM (Heafield 2011, Heafield et al. 2013) use
sorted arrays, efficiently combine probabilities and backoffs in a single value, and
use merge sorts to efficiently build the probability tables in a minimal number of
passes through a large corpus.
Although with these toolkits it is possible to build web-scale language models
using full Kneser-Ney smoothing, Brants et al. (2007) show that with very large
language models a much simpler algorithm may be sufficient. The algorithm is called
stupid backoff. Stupid backoff gives up the idea of trying to make the language
model a true probability distribution. There is no discounting of the higher-order
probabilities. If a higher-order n-gram has a zero count, we simply back off to a
lower order n-gram, weighted by a fixed (context-independent) weight. This
algorithm does not produce a probability distribution, so we’ll follow Brants et al. (2007)
in referring to it as S:

\[
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{\text{count}(w_{i-k+1}^{i})}{\text{count}(w_{i-k+1}^{i-1})} & \text{if } \text{count}(w_{i-k+1}^{i}) > 0 \\[2ex]
\lambda\, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\tag{3.40}
\]

The backoff terminates in the unigram, which has probability $S(w) = \frac{\text{count}(w)}{N}$.
Brants et al. (2007) find that a value of 0.4 worked well for λ.
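Eq. 3.40 translates almost line for line into a recursive function. The sketch below is an illustration under simple assumptions (in-memory counts, a tiny hypothetical corpus); the names `ngram_counts` and `stupid_backoff` are invented for the example, and a web-scale system would store counts in the compact structures described above.

```python
from collections import Counter

def ngram_counts(tokens, max_order):
    """Count every n-gram of order 1..max_order, keyed by a tuple of words."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(word, context, counts, total_tokens, lam=0.4):
    """Score S(word | context); a score, not a normalized probability."""
    context = tuple(context)
    if not context:
        return counts[(word,)] / total_tokens  # unigram base case
    ngram = context + (word,)
    if counts[ngram] > 0:
        return counts[ngram] / counts[context]
    # Drop the earliest context word and scale by the fixed weight.
    return lam * stupid_backoff(word, context[1:], counts, total_tokens, lam)

# Hypothetical toy corpus.
tokens = "i want to eat chinese food i want to spend".split()
counts = ngram_counts(tokens, max_order=3)
print(stupid_backoff("to", ["i", "want"], counts, len(tokens)))
print(stupid_backoff("food", ["i", "want"], counts, len(tokens)))
```

The second call backs off twice, from trigram to bigram to unigram, so its score is the unigram relative frequency of food multiplied by λ twice.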

3.7 Advanced: Perplexity’s Relation to Entropy


We introduced perplexity in Section 3.2.1 as a way to evaluate n-gram models on
a test set. A better n-gram model is one that assigns a higher probability to the
test data, and perplexity is a normalized version of the probability of the test set.
The perplexity measure actually arises from the information-theoretic concept of
cross-entropy, which explains otherwise mysterious properties of perplexity (why
the inverse probability, for example?) and its relationship to entropy. Entropy is a
measure of information. Given a random variable X ranging over whatever we are
predicting (words, letters, parts of speech, the set of which we’ll call χ) and with a
particular probability function, call it p(x), the entropy of the random variable X is:

\[
H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x) \tag{3.41}
\]

The log can, in principle, be computed in any base. If we use log base 2, the
resulting value of entropy will be measured in bits.
One intuitive way to think about entropy is as a lower bound on the number of
bits it would take to encode a certain decision or piece of information in the optimal
coding scheme.
Consider an example from the standard information theory textbook Cover and
Thomas (1991). Imagine that we want to place a bet on a horse race but it is too
far to go all the way to Yonkers Racetrack, so we’d like to send a short message to
the bookie to tell him which of the eight horses to bet on. One way to encode this
message is just to use the binary representation of the horse’s number as the code;
thus, horse 1 would be 001, horse 2 010, horse 3 011, and so on, with horse 8 coded
as 000. If we spend the whole day betting and each horse is coded with 3 bits, on
average we would be sending 3 bits per race.
Can we do better? Suppose that the spread is the actual distribution of the bets
placed and that we represent it as the prior probability of each horse as follows:

Horse 1    1/2      Horse 5    1/64
Horse 2    1/4      Horse 6    1/64
Horse 3    1/8      Horse 7    1/64
Horse 4    1/16     Horse 8    1/64

The entropy of the random variable X that ranges over horses gives us a lower
bound on the number of bits and is

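H(X) = -\sum_{i=1}^{8} p(i) \log p(i)
     = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{4}\log\tfrac{1}{4} - \tfrac{1}{8}\log\tfrac{1}{8} - \tfrac{1}{16}\log\tfrac{1}{16} - 4\left(\tfrac{1}{64}\log\tfrac{1}{64}\right)
     = 2 \text{ bits}    (3.42)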

A code that averages 2 bits per race can be built with short encodings for more
probable horses, and longer encodings for less probable horses. For example, we
could encode the most likely horse with the code 0, and the remaining horses as 10,
then 110, 1110, 111100, 111101, 111110, and 111111.

What if the horses are equally likely? We saw above that if we used an equal-
length binary code for the horse numbers, each horse took 3 bits to code, so the
average was 3. Is the entropy the same? In this case each horse would have a
probability of 1/8. The entropy of the choice of horses is then

H(X) = -\sum_{i=1}^{8} \frac{1}{8} \log \frac{1}{8} = -\log \frac{1}{8} = 3 \text{ bits}    (3.43)
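The two horse-race calculations can be checked numerically. The sketch below assumes base-2 logs (so the result is in bits); the function name `entropy_bits` and the variable names are ours, while the probabilities and the variable-length code are the ones given in the text.

```python
import math

def entropy_bits(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Prior probabilities of the eight horses, and the uniform alternative.
skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
uniform = [1/8] * 8

# The variable-length code from the text: shorter codes for likelier horses.
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

# Expected number of bits per race when horses win according to `skewed`.
avg_code_length = sum(p * len(code) for p, code in zip(skewed, codes))

h_skewed = entropy_bits(skewed)
h_uniform = entropy_bits(uniform)
```

For the skewed distribution the expected code length equals the entropy, so this particular code meets the lower bound. For the uniform distribution the entropy is 3 bits, the same as the fixed-length code.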

Until now we have been computing the entropy of a single variable. But most of
what we will use entropy for involves sequences. For a grammar, for example, we
will be computing the entropy of some sequence of words W = {w0 , w1 , w2 , . . . , wn }.
One way to do this is to have a variable that ranges over sequences of words. For
example we can compute the entropy of a random variable that ranges over all finite
sequences of words of length n in some language L as follows:
H(w_1, w_2, \ldots, w_n) = -\sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)    (3.44)

We could define the entropy rate (we could also think of this as the per-word
entropy) as the entropy of this sequence divided by the number of words:

\frac{1}{n} H(W_1^n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)    (3.45)

But to measure the true entropy of a language, we need to consider sequences of
infinite length. If we think of a language as a stochastic process L that produces a
sequence of words, and allow W to represent the sequence of words w_1, ..., w_n, then
L’s entropy rate H(L) is defined as

H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n)
     = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log p(w_1, \ldots, w_n)    (3.46)

The Shannon-McMillan-Breiman theorem (Algoet and Cover 1988, Cover and
Thomas 1991) states that if the language is regular in certain ways (to be exact, if it
is both stationary and ergodic),

H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1 w_2 \ldots w_n)    (3.47)

That is, we can take a single sequence that is long enough instead of summing
over all possible sequences. The intuition of the Shannon-McMillan-Breiman the-
orem is that a long-enough sequence of words will contain in it many other shorter
sequences and that each of these shorter sequences will reoccur in the longer se-
quence according to their probabilities.
A stochastic process is said to be stationary if the probabilities it assigns to a
sequence are invariant with respect to shifts in the time index. In other words, the
probability distribution for words at time t is the same as the probability distribution
at time t + 1. Markov models, and hence n-grams, are stationary. For example, in
a bigram, P_i is dependent only on P_{i−1}. So if we shift our time index by x, P_{i+x} is
still dependent on P_{i+x−1}. But natural language is not stationary, since as we show
in Chapter 12, the probability of upcoming words can be dependent on events that
were arbitrarily distant and time dependent. Thus, our statistical models only give
an approximation to the correct distributions and entropies of natural language.
To summarize, by making some incorrect but convenient simplifying assump-
tions, we can compute the entropy of some stochastic process by taking a very long
sample of the output and computing its average log probability.
Now we are ready to introduce cross-entropy. The cross-entropy is useful when
we don’t know the actual probability distribution p that generated some data. It
allows us to use some m, which is a model of p (i.e., an approximation to p). The
cross-entropy of m on p is defined by

H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log m(w_1, \ldots, w_n)    (3.48)

That is, we draw sequences according to the probability distribution p, but sum
the log of their probabilities according to m.
Again, following the Shannon-McMillan-Breiman theorem, for a stationary er-
godic process:

H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log m(w_1 w_2 \ldots w_n)    (3.49)

This means that, as for entropy, we can estimate the cross-entropy of a model
m on some distribution p by taking a single sequence that is long enough instead of
summing over all possible sequences.
What makes the cross-entropy useful is that the cross-entropy H(p, m) is an up-
per bound on the entropy H(p). For any model m:

H(p) ≤ H(p, m) (3.50)

This means that we can use some simplified model m to help estimate the true en-
tropy of a sequence of symbols drawn according to probability p. The more accurate
m is, the closer the cross-entropy H(p, m) will be to the true entropy H(p). Thus,
the difference between H(p, m) and H(p) is a measure of how accurate a model is.
Between two models m1 and m2 , the more accurate model will be the one with the
lower cross-entropy. (The cross-entropy can never be lower than the true entropy, so
a model cannot err by underestimating the true entropy.)
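A small simulation illustrates both Eq. 3.49 and Eq. 3.50. As a simplifying assumption, the sketch uses an i.i.d. source over three symbols (which is stationary and ergodic, so the per-symbol entropy is also the entropy rate). The distributions `p` and `m`, the sample size, and the function names are all hypothetical choices made for illustration.

```python
import math
import random

def entropy(p):
    """Exact entropy H(p) in bits; p maps each symbol to its probability."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def cross_entropy(p, m):
    """Exact per-symbol cross-entropy H(p, m) in bits for an i.i.d. source."""
    return -sum(p[s] * math.log2(m[s]) for s in p if p[s] > 0)

def estimate_cross_entropy(sample, m):
    """Estimate H(p, m) from one long sample drawn from p (compare Eq. 3.49)."""
    return -sum(math.log2(m[s]) for s in sample) / len(sample)

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # hypothetical "true" distribution
m = {"a": 0.25, "b": 0.25, "c": 0.5}   # hypothetical imperfect model of p

rng = random.Random(0)                  # fixed seed so the run is repeatable
sample = rng.choices(list(p), weights=list(p.values()), k=50_000)

h_p = entropy(p)                        # true entropy
h_pm = cross_entropy(p, m)              # exact cross-entropy of m on p
h_est = estimate_cross_entropy(sample, m)   # estimate from a single sample
```

Here the exact cross-entropy (1.75 bits) is above the true entropy (1.5 bits), as Eq. 3.50 requires, and the single-sample estimate `h_est` should land close to the exact value without ever summing over all possible sequences.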
We are finally ready to see the relation between perplexity and cross-entropy
as we saw it in Eq. 3.49. Cross-entropy is defined in the limit as the length of the
observed word sequence goes to infinity. We will need an approximation to
cross-entropy, relying on a (sufficiently long) sequence of fixed length. This
approximation to the cross-entropy of a model M = P(w_i | w_{i−N+1} ... w_{i−1}) on a sequence of
words W is

H(W) = -\frac{1}{N} \log P(w_1 w_2 \ldots w_N)    (3.51)

The perplexity of a model P on a sequence of words W is now formally defined as
2 raised to the power of this cross-entropy (with the log in Eq. 3.51 taken base 2):

\mathrm{Perplexity}(W) = 2^{H(W)}
                       = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
                       = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
                       = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}    (3.52)
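Eqs. 3.51 and 3.52 translate directly into code. The sketch below assumes we already have the model’s conditional probability for each word of the test sequence; the function name `perplexity` and the example probability lists are hypothetical.

```python
import math

def perplexity(word_probs):
    """Perplexity of a sequence, given P(w_i | w_1 ... w_{i-1}) for each word."""
    n = len(word_probs)
    # Eq. 3.51: approximate cross-entropy in bits per word.
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n
    # Eq. 3.52: perplexity is 2 raised to the cross-entropy.
    return 2 ** cross_entropy

# A model that gives every word probability 1/10 has perplexity 10.
uniform_ppl = perplexity([0.1] * 5)

# Same value computed as the inverse probability normalized by length.
probs = [0.5, 0.25, 0.125, 0.125]
via_logs = perplexity(probs)
via_root = math.prod(probs) ** (-1 / len(probs))
```

Summing logs rather than multiplying probabilities is the safer route for long test sets, since the raw product quickly underflows.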

3.8 Summary
This chapter introduced language modeling and the n-gram, one of the most widely
used tools in language processing.
• Language models offer a way to assign a probability to a sentence or other
sequence of words, and to predict a word from preceding words.
• n-grams are Markov models that estimate words from a fixed window of pre-
vious words. n-gram probabilities can be estimated by counting in a corpus
and normalizing (the maximum likelihood estimate).
• n-gram language models are evaluated extrinsically in some task, or intrinsi-
cally using perplexity.
• The perplexity of a test set according to a language model is the geometric
mean of the inverse test set probability computed by the model.
• Smoothing algorithms provide a more sophisticated way to estimate the prob-
ability of n-grams. Commonly used smoothing algorithms for n-grams rely on
lower-order n-gram counts through backoff or interpolation.
• Both backoff and interpolation require discounting to create a probability dis-
tribution.
• Kneser-Ney smoothing makes use of the probability of a word being a novel
continuation. The interpolated Kneser-Ney smoothing algorithm mixes a
discounted probability with a lower-order continuation probability.

Bibliographical and Historical Notes


The underlying mathematics of the n-gram was first proposed by Markov (1913),
who used what are now called Markov chains (bigrams and trigrams) to predict
whether an upcoming letter in Pushkin’s Eugene Onegin would be a vowel or a con-
sonant. Markov classified 20,000 letters as V or C and computed the bigram and
trigram probability that a given letter would be a vowel given the previous one or
two letters. Shannon (1948) applied n-grams to compute approximations to English
word sequences. Based on Shannon’s work, Markov models were commonly used in
engineering, linguistic, and psychological work on modeling word sequences by the
1950s. In a series of extremely influential papers starting with Chomsky (1956) and
including Chomsky (1957) and Miller and Chomsky (1963), Noam Chomsky argued
that “finite-state Markov processes”, while a possibly useful engineering heuristic,
were incapable of being a complete cognitive model of human grammatical knowledge.
These arguments led many linguists and computational linguists to ignore
work in statistical modeling for decades.
The resurgence of n-gram models came from Jelinek and colleagues at the IBM
Thomas J. Watson Research Center, who were influenced by Shannon, and Baker
at CMU, who was influenced by the work of Baum and colleagues. Independently
these two labs successfully used n-grams in their speech recognition systems (Baker 1975b,
Jelinek 1976, Baker 1975a, Bahl et al. 1983, Jelinek 1990).
Add-one smoothing derives from Laplace’s 1812 law of succession and was first
applied as an engineering solution to the zero frequency problem by Jeffreys (1948)
based on an earlier Add-K suggestion by Johnson (1932). Problems with the add-
one algorithm are summarized in Gale and Church (1994).
A wide variety of different language modeling and smoothing techniques were
proposed in the 80s and 90s, including Good-Turing discounting—first applied to
the n-gram smoothing at IBM by Katz (Nádas 1984, Church and Gale 1991)—
Witten-Bell discounting (Witten and Bell, 1991), and varieties of class-based
n-gram models that used information about word classes.
Starting in the late 1990s, Chen and Goodman performed a number of carefully
controlled experiments comparing different discounting algorithms, cache models,
class-based models, and other language model parameters (Chen and Goodman 1999,
Goodman 2006, inter alia). They showed the advantages of Modified Interpolated
Kneser-Ney, which became the standard baseline for n-gram language modeling,
especially because they showed that caches and class-based models provided only
minor additional improvement. These papers are recommended for any reader with
further interest in n-gram language modeling. SRILM (Stolcke, 2002) and KenLM
(Heafield 2011, Heafield et al. 2013) are publicly available toolkits for building n-
gram language models.
Modern language modeling is more commonly done with neural network lan-
guage models, which solve the major problems with n-grams: the number of param-
eters increases exponentially as the n-gram order increases, and n-grams have no
way to generalize from training to test set. Neural language models instead project
words into a continuous space in which words with similar contexts have simi-
lar representations. We’ll introduce both feedforward language models (Bengio
et al. 2006, Schwenk 2007) in Chapter 7, and recurrent language models (Mikolov,
2012) in Chapter 9.

Exercises
3.1 Write out the equation for trigram probability estimation (modifying Eq. 3.11).
Now write out all the non-zero trigram probabilities for the I am Sam corpus
on page 32.
3.2 Calculate the probability of the sentence i want chinese food. Give two
probabilities, one using Fig. 3.2 and the ‘useful probabilities’ just below it on
page 34, and another using the add-1 smoothed table in Fig. 3.6. Assume the
additional add-1 smoothed probabilities P(i|<s>) = 0.19 and P(</s>|food) =
0.40.
3.3 Which of the two probabilities you computed in the previous exercise is higher,
unsmoothed or smoothed? Explain why.
3.4 We are given the following corpus, modified from the one in the chapter:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam </s>
Using a bigram language model with add-one smoothing, what is P(Sam |
am)? Include <s> and </s> in your counts just like any other token.
3.5 Suppose we didn’t use the end-symbol </s>. Train an unsmoothed bigram
grammar on the following training corpus without using the end-symbol </s>:
<s> a b
<s> b b
<s> b a
<s> a a
Demonstrate that your bigram model does not assign a single probability dis-
tribution across all sentence lengths by showing that the sum of the probability
of the four possible 2 word sentences over the alphabet {a,b} is 1.0, and the
sum of the probability of all possible 3 word sentences over the alphabet {a,b}
is also 1.0.
3.6 Suppose we train a trigram language model with add-one smoothing on a
given corpus. The corpus contains V word types. Express a formula for esti-
mating P(w3|w1,w2), where w3 is a word which follows the bigram (w1,w2),
in terms of various N-gram counts and V. Use the notation c(w1,w2,w3) to
denote the number of times that trigram (w1,w2,w3) occurs in the corpus, and
so on for bigrams and unigrams.
3.7 We are given the following corpus, modified from the one in the chapter:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam </s>
If we use linear interpolation smoothing between a maximum-likelihood bigram
model and a maximum-likelihood unigram model with λ1 = 1/2 and λ2 = 1/2,
what is P(Sam|am)? Include <s> and </s> in your counts just like any
other token.
3.8 Write a program to compute unsmoothed unigrams and bigrams.
3.9 Run your n-gram program on two different small corpora of your choice (you
might use email text or newsgroups). Now compare the statistics of the two
corpora. What are the differences in the most common unigrams between the
two? How about interesting differences in bigrams?
3.10 Add an option to your program to generate random sentences.
3.11 Add an option to your program to compute the perplexity of a test set.
3.12 You are given a training set of 100 numbers that consists of 91 zeros and 1
each of the other digits 1-9. Now we see the following test set: 0 0 0 0 0 3 0 0
0 0. What is the unigram perplexity?
CHAPTER 4

Naive Bayes and Sentiment Classification
Classification lies at the heart of both human and machine intelligence. Deciding
what letter, word, or image has been presented to our senses, recognizing faces
or voices, sorting mail, assigning grades to homeworks; these are all examples of
assigning a category to an input. The potential challenges of this task are highlighted
by the fabulist Jorge Luis Borges (1964), who imagined classifying animals into:
(a) those that belong to the Emperor, (b) embalmed ones, (c) those that
are trained, (d) suckling pigs, (e) mermaids, (f) fabulous ones, (g) stray
dogs, (h) those that are included in this classification, (i) those that
tremble as if they were mad, (j) innumerable ones, (k) those drawn with
a very fine camel’s hair brush, (l) others, (m) those that have just broken
a flower vase, (n) those that resemble flies from a distance.
Many language processing tasks involve classification, although luckily our classes
are much easier to define than those of Borges. In this chapter we introduce the naive
Bayes algorithm and apply it to text categorization, the task of assigning a label or
category to an entire text or document.
We focus on one common text categorization task, sentiment analysis, the extraction
of sentiment, the positive or negative orientation that a writer expresses
toward some object. A review of a movie, book, or product on the web expresses the
author’s sentiment toward the product, while an editorial or political text expresses
sentiment toward a candidate or political action. Extracting consumer or public sen-
timent is thus relevant for fields from marketing to politics.
The simplest version of sentiment analysis is a binary classification task, and
the words of the review provide excellent cues. Consider, for example, the follow-
ing phrases extracted from positive and negative reviews of movies and restaurants.
Words like great, richly, awesome, pathetic, awful, and ridiculously are very
informative cues:
+ ...zany characters and richly applied satire, and some great plot twists
− It was pathetic. The worst part about it was the boxing scenes...
+ ...awesome caramel sauce and sweet toasty almonds. I love this place!
− ...awful pizza and ridiculously overpriced...
Spam detection is another important commercial application, the binary classification
task of assigning an email to one of the two classes spam or not-spam.
Many lexical and other features can be used to perform this classification. For ex-
ample you might quite reasonably be suspicious of an email containing phrases like
“online pharmaceutical” or “WITHOUT ANY COST” or “Dear Winner”.
Another thing we might want to know about a text is the language it’s written
in. Texts on social media, for example, can be in any number of languages and
we’ll need to apply different processing. The task of language id is thus the first
step in most language processing pipelines. Related text classification tasks like
authorship attribution (determining a text’s author) are also relevant to the digital
humanities, social sciences, and forensic linguistics.

Finally, one of the oldest tasks in text classification is assigning a library sub-
ject category or topic label to a text. Deciding whether a research paper concerns
epidemiology or instead, perhaps, embryology, is an important component of infor-
mation retrieval. Various sets of subject categories exist, such as the MeSH (Medical
Subject Headings) thesaurus. In fact, as we will see, subject category classification
is the task for which the naive Bayes algorithm was invented in 1961.
Classification is essential for tasks below the level of the document as well.
We’ve already seen period disambiguation (deciding if a period is the end of a sen-
tence or part of a word), and word tokenization (deciding if a character should be
a word boundary). Even language modeling can be viewed as classification: each
word can be thought of as a class, and so predicting the next word is classifying the
context-so-far into a class for each next word. A part-of-speech tagger (Chapter 8)
classifies each occurrence of a word in a sentence as, e.g., a noun or a verb.
The goal of classification is to take a single observation, extract some useful
features, and thereby classify the observation into one of a set of discrete classes.
One method for classifying text is to use handwritten rules. There are many areas of
language processing where handwritten rule-based classifiers constitute a state-of-
the-art system, or at least part of it.
Rules can be fragile, however, as situations or data change over time, and for
some tasks humans aren’t necessarily good at coming up with the rules. Most cases
of classification in language processing are instead done via supervised machine
learning, and this will be the subject of the remainder of this chapter. In supervised
learning, we have a data set of input observations, each associated with some correct
output (a ‘supervision signal’). The goal of the algorithm is to learn how to map
from a new observation to a correct output.
Formally, the task of supervised classification is to take an input x and a fixed
set of output classes Y = {y_1, y_2, ..., y_M} and return a predicted class y ∈ Y. For text
classification, we’ll sometimes talk about c (for “class”) instead of y as our output
variable, and d (for “document”) instead of x as our input variable. In the supervised
situation we have a training set of N documents that have each been hand-labeled
with a class: (d_1, c_1), ..., (d_N, c_N). Our goal is to learn a classifier that is capable of
mapping from a new document d to its correct class c ∈ C. A probabilistic classifier
additionally will tell us the probability of the observation being in the class. This
full distribution over the classes can be useful information for downstream decisions;
avoiding making discrete decisions early on can be useful when combining systems.
Many kinds of machine learning algorithms are used to build classifiers. This
chapter introduces naive Bayes; the following one introduces logistic regression.
These exemplify two ways of doing classification. Generative classifiers like naive
Bayes build a model of how a class could generate some input data. Given an ob-
servation, they return the class most likely to have generated the observation. Dis-
criminative classifiers like logistic regression instead learn what features from the
input are most useful to discriminate between the different possible classes. While
discriminative systems are often more accurate and hence more commonly used,
generative classifiers still have a role.

4.1 Naive Bayes Classifiers


In this section we introduce the multinomial naive Bayes classifier, so called because
it is a Bayesian classifier that makes a simplifying (naive) assumption about
how the features interact.


The intuition of the classifier is shown in Fig. 4.1. We represent a text document
as if it were a bag-of-words, that is, an unordered set of words with their position
ignored, keeping only their frequency in the document. In the example in the figure,
instead of representing the word order in all the phrases like “I love this movie” and
“I would recommend it”, we simply note that the word I occurred 5 times in the
entire excerpt, the word it 6 times, the words love, recommend, and movie once, and
so on.

[Figure 4.1 image: a movie review excerpt (left) is reduced to an unordered bag of its words (center) and then to a table of word counts (right).

Review text: “I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!”

Word counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, …]

Figure 4.1 Intuition of the multinomial naive Bayes classifier applied to a movie review. The position of the
words is ignored (the bag of words assumption) and we make use of the frequency of each word.

Naive Bayes is a probabilistic classifier, meaning that for a document d, out of
all classes c ∈ C the classifier returns the class ĉ which has the maximum posterior
probability given the document. In Eq. 4.1 we use the hat notation ˆ to mean “our
estimate of the correct class”.

\hat{c} = \operatorname*{argmax}_{c \in C} P(c \mid d)    (4.1)

This idea of Bayesian inference has been known since the work of Bayes (1763),
and was first applied to text classification by Mosteller and Wallace (1964). The
intuition of Bayesian classification is to use Bayes’ rule to transform Eq. 4.1 into
other probabilities that have some useful properties. Bayes’ rule is presented in
Eq. 4.2; it gives us a way to break down any conditional probability P(x|y) into
three other probabilities:

P(x \mid y) = \frac{P(y \mid x) P(x)}{P(y)}    (4.2)
We can then substitute Eq. 4.2 into Eq. 4.1 to get Eq. 4.3:

\hat{c} = \operatorname*{argmax}_{c \in C} P(c \mid d) = \operatorname*{argmax}_{c \in C} \frac{P(d \mid c) P(c)}{P(d)}    (4.3)

We can conveniently simplify Eq. 4.3 by dropping the denominator P(d). This
is possible because we will be computing P(d|c)P(c)/P(d) for each possible class. But P(d)
doesn’t change for each class; we are always asking about the most likely class for
the same document d, which must have the same probability P(d). Thus, we can
choose the class that maximizes this simpler formula:

\hat{c} = \operatorname*{argmax}_{c \in C} P(c \mid d) = \operatorname*{argmax}_{c \in C} P(d \mid c) P(c)    (4.4)
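A tiny numeric check makes the step from Eq. 4.3 to Eq. 4.4 concrete. The prior and likelihood values below are made up for illustration; the point is only that dividing every class score by the same P(d) cannot change which class wins.

```python
# Hypothetical values for a single document d and two classes.
prior = {"pos": 0.4, "neg": 0.6}             # P(c)
likelihood = {"pos": 0.002, "neg": 0.001}    # P(d|c)

# Numerator of Eq. 4.3 for each class: P(d|c) P(c).
joint = {c: likelihood[c] * prior[c] for c in prior}

# P(d) is the same for every class: the sum of the joint scores.
p_d = sum(joint.values())

# Full posterior P(c|d) from Eq. 4.3.
posterior = {c: joint[c] / p_d for c in prior}

best_by_posterior = max(posterior, key=posterior.get)   # Eq. 4.3
best_by_joint = max(joint, key=joint.get)               # Eq. 4.4
```

Both selections pick the same class. The posterior is only needed when we want calibrated probabilities rather than just the best label.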
We call Naive Bayes a generative model because we can read Eq. 4.4 as stating
a kind of implicit assumption about how a document is generated: first a class is
sampled from P(c), and then the words are generated by sampling from P(d|c). (In
fact we could imagine generating artificial documents, or at least their word counts,
by following this process). We’ll say more about this intuition of generative models
in Chapter 5.
To return to classification: we compute the most probable class ĉ given some
document d by choosing the class which has the highest product of two probabilities:
the prior probability of the class P(c) and the likelihood of the document P(d|c):

\hat{c} = \operatorname*{argmax}_{c \in C} \overbrace{P(d \mid c)}^{\text{likelihood}} \; \overbrace{P(c)}^{\text{prior}}    (4.5)

Without loss of generality, we can represent a document d as a set of features
f_1, f_2, ..., f_n:

\hat{c} = \operatorname*{argmax}_{c \in C} \overbrace{P(f_1, f_2, \ldots, f_n \mid c)}^{\text{likelihood}} \; \overbrace{P(c)}^{\text{prior}}    (4.6)
Unfortunately, Eq. 4.6 is still too hard to compute directly: without some sim-
plifying assumptions, estimating the probability of every possible combination of
features (for example, every possible set of words and positions) would require huge
numbers of parameters and impossibly large training sets. Naive Bayes classifiers
therefore make two simplifying assumptions.
The first is the bag of words assumption discussed intuitively above: we assume
position doesn’t matter, and that the word “love” has the same effect on classification
whether it occurs as the 1st, 20th, or last word in the document. Thus we assume
that the features f1 , f2 , ..., fn only encode word identity and not position.
The second is commonly called the naive Bayes assumption: this is the conditional
independence assumption that the probabilities P(f_i|c) are independent given
the class c and hence can be ‘naively’ multiplied as follows:

P(f_1, f_2, \ldots, f_n \mid c) = P(f_1 \mid c) \cdot P(f_2 \mid c) \cdot \ldots \cdot P(f_n \mid c)    (4.7)

The final equation for the class chosen by a naive Bayes classifier is thus:
c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{f \in F} P(f \mid c)    (4.8)

To apply the naive Bayes classifier to text, we need to consider word positions, by
simply walking an index through every word position in the document:
positions ← all word positions in test document

c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{i \in positions} P(w_i \mid c)    (4.9)

Naive Bayes calculations, like calculations for language modeling, are done in log
space, to avoid underflow and increase speed. Thus Eq. 4.9 is generally instead
expressed as
c_{NB} = \operatorname*{argmax}_{c \in C} \left[ \log P(c) + \sum_{i \in positions} \log P(w_i \mid c) \right]    (4.10)

By considering features in log space, Eq. 4.10 computes the predicted class as a lin-
ear function of input features. Classifiers that use a linear combination of the inputs
to make a classification decision (like naive Bayes and also logistic regression)
are called linear classifiers.
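The underflow problem that motivates Eq. 4.10 is easy to reproduce. In the sketch below the per-word probabilities are hypothetical, chosen only so that a 200-word document drives the raw product of Eq. 4.9 below the smallest representable double.

```python
import math

# Hypothetical P(w_i|c) values for a 200-word document under two classes.
word_probs_a = [1e-3] * 200
word_probs_b = [2e-3] * 200

# Eq. 4.9 style: multiplying the raw probabilities underflows to 0.0,
# so the two classes can no longer be told apart.
product_a = math.prod(word_probs_a)
product_b = math.prod(word_probs_b)

# Eq. 4.10 style: summing log probabilities stays well within range
# and preserves the ordering of the classes.
log_score_a = sum(math.log(p) for p in word_probs_a)
log_score_b = sum(math.log(p) for p in word_probs_b)
```

Because log is monotonic, the class with the larger log score is also the class with the larger (unrepresentable) product, so the argmax is unchanged.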

4.2 Training the Naive Bayes Classifier


How can we learn the probabilities P(c) and P( fi |c)? Let’s first consider the maxi-
mum likelihood estimate. We’ll simply use the frequencies in the data. For the class
prior P(c) we ask what percentage of the documents in our training set are in each
class c. Let Nc be the number of documents in our training data with class c and
Ndoc be the total number of documents. Then:

\hat{P}(c) = \frac{N_c}{N_{doc}}    (4.11)

To learn the probability P( fi |c), we’ll assume a feature is just the existence of a word
in the document’s bag of words, and so we’ll want P(wi |c), which we compute as
the fraction of times the word wi appears among all words in all documents of topic
c. We first concatenate all documents with category c into one big “category c” text.
Then we use the frequency of wi in this concatenated document to give a maximum
likelihood estimate of the probability:

\hat{P}(w_i \mid c) = \frac{\mathrm{count}(w_i, c)}{\sum_{w \in V} \mathrm{count}(w, c)}    (4.12)

Here the vocabulary V consists of the union of all the word types in all classes, not
just the words in one class c.
There is a problem, however, with maximum likelihood training. Imagine we
are trying to estimate the likelihood of the word “fantastic” given class positive, but
suppose there are no training documents that both contain the word “fantastic” and
are classified as positive. Perhaps the word “fantastic” happens to occur (sarcasti-
cally?) in the class negative. In such a case the probability for this feature will be
zero:
\hat{P}(\text{“fantastic”} \mid \text{positive}) = \frac{\mathrm{count}(\text{“fantastic”}, \text{positive})}{\sum_{w \in V} \mathrm{count}(w, \text{positive})} = 0    (4.13)

But since naive Bayes naively multiplies all the feature likelihoods together, zero
probabilities in the likelihood term for any class will cause the probability of the
class to be zero, no matter the other evidence!
The simplest solution is the add-one (Laplace) smoothing introduced in Chapter 3.
While Laplace smoothing is usually replaced by more sophisticated smoothing
algorithms in language modeling, it is commonly used in naive Bayes text categorization:

\hat{P}(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} (\mathrm{count}(w, c) + 1)} = \frac{\mathrm{count}(w_i, c) + 1}{\left( \sum_{w \in V} \mathrm{count}(w, c) \right) + |V|}    (4.14)

Note once again that it is crucial that the vocabulary V consists of the union of all the
word types in all classes, not just the words in one class c (try to convince yourself
why this must be true; see the exercise at the end of the chapter).
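The contrast between Eq. 4.12 and Eq. 4.14 can be seen on a toy example. The two word lists below are hypothetical stand-ins for the concatenated positive and negative training documents; note that the vocabulary is the union over both classes, as the text requires.

```python
from collections import Counter

# Hypothetical concatenated training text for each class.
positive_words = "great fun great acting".split()
negative_words = "fantastic waste of time".split()

# V is the union of word types over all classes.
vocab = set(positive_words) | set(negative_words)
counts = Counter(positive_words)          # count(w, positive)
total = sum(counts.values())

def mle(word):
    """Maximum likelihood estimate of P(word | positive), Eq. 4.12."""
    return counts[word] / total

def laplace(word):
    """Add-one smoothed estimate of P(word | positive), Eq. 4.14."""
    return (counts[word] + 1) / (total + len(vocab))

p_mle = mle("fantastic")                  # word never seen with class positive
p_smoothed = laplace("fantastic")
smoothed_total = sum(laplace(w) for w in vocab)
```

The unsmoothed estimate is exactly zero, which would wipe out the whole product in Eq. 4.9, while the smoothed estimate is small but positive. Summing the smoothed estimates over the full vocabulary gives 1, which would not hold if V contained only the words seen in the positive class.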
What do we do about words that occur in our test data but are not in our vocabulary
at all because they did not occur in any training document in any class? The
solution for such unknown words is to ignore them: remove them from the test
document and not include any probability for them at all.
Finally, some systems choose to completely ignore another class of words: stop
words, very frequent words like the and a. This can be done by sorting the vocabulary
by frequency in the training set, and defining the top 10–100 vocabulary entries
as stop words, or alternatively by using one of the many predefined stop word lists
available online. Then every instance of these stop words is simply removed from
both training and test documents as if it had never occurred. In most text classification
applications, however, using a stop word list doesn’t improve performance,
and so it is more common to make use of the entire vocabulary and not use a stop
word list.
Fig. 4.2 shows the final algorithm.

function TRAIN NAIVE BAYES(D, C) returns log P(c) and log P(w|c)

  for each class c ∈ C                         # Calculate P(c) terms
    Ndoc = number of documents in D
    Nc = number of documents from D in class c
    logprior[c] ← log (Nc / Ndoc)
    V ← vocabulary of D
    bigdoc[c] ← append(d) for d ∈ D with class c
    for each word w in V                       # Calculate P(w|c) terms
      count(w,c) ← # of occurrences of w in bigdoc[c]
      loglikelihood[w,c] ← log [ (count(w,c) + 1) / Σ_{w′ in V} (count(w′,c) + 1) ]
  return logprior, loglikelihood, V

function TEST NAIVE BAYES(testdoc, logprior, loglikelihood, C, V) returns best c

  for each class c ∈ C
    sum[c] ← logprior[c]
    for each position i in testdoc
      word ← testdoc[i]
      if word ∈ V
        sum[c] ← sum[c] + loglikelihood[word,c]
  return argmax_c sum[c]

Figure 4.2 The naive Bayes algorithm, using add-1 smoothing. To use add-α smoothing
instead, change the +1 to +α for loglikelihood counts in training.
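The pseudocode in Fig. 4.2 maps closely onto Python. The following is a minimal sketch rather than a reference implementation: it assumes documents are already tokenized into lists of words, that every class has at least one training document, and it uses the hypothetical names `train_naive_bayes` and `predict_naive_bayes`. The three-document training set is invented for the usage example.

```python
import math
from collections import Counter

def train_naive_bayes(documents, classes):
    """documents: list of (tokens, label) pairs. Returns logprior, loglikelihood, V."""
    vocab = {w for tokens, _ in documents for w in tokens}
    logprior, loglikelihood = {}, {}
    for c in classes:
        class_docs = [tokens for tokens, label in documents if label == c]
        # log P(c) = log(Nc / Ndoc); assumes the class is not empty.
        logprior[c] = math.log(len(class_docs) / len(documents))
        # Word counts in the concatenation of all documents of class c.
        counts = Counter(w for tokens in class_docs for w in tokens)
        denom = sum(counts.values()) + len(vocab)   # add-1 smoothing
        for w in vocab:
            loglikelihood[w, c] = math.log((counts[w] + 1) / denom)
    return logprior, loglikelihood, vocab

def predict_naive_bayes(testdoc, logprior, loglikelihood, classes, vocab):
    """Return the class with the highest log score; unknown words are skipped."""
    scores = {}
    for c in classes:
        scores[c] = logprior[c] + sum(
            loglikelihood[w, c] for w in testdoc if w in vocab
        )
    return max(scores, key=scores.get)

# Hypothetical toy data.
train = [
    ("great fun great".split(), "pos"),
    ("dull boring".split(), "neg"),
    ("boring plot".split(), "neg"),
]
classes = ["pos", "neg"]
logprior, loglikelihood, vocab = train_naive_bayes(train, classes)
label = predict_naive_bayes("great plot fun unseen".split(),
                            logprior, loglikelihood, classes, vocab)
```

The word "unseen" is not in the vocabulary, so it contributes nothing to either score, mirroring the `if word ∈ V` test in the pseudocode. To experiment with add-α smoothing, the `+ 1` and `len(vocab)` terms would become `+ alpha` and `alpha * len(vocab)`.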

4.3 Worked example


Let’s walk through an example of training and testing naive Bayes with add-one
smoothing. We’ll use a sentiment analysis domain with the two classes positive
(+) and negative (-), and take the following miniature training and test documents
simplified from actual movie reviews.
            Cat   Documents
Training    -     just plain boring
            -     entirely predictable and lacks energy
            -     no surprises and very few laughs
            +     very powerful
            +     the most fun film of the summer
Test        ?     predictable with no fun
The prior P(c) for the two classes is computed via Eq. 4.11 as $N_c / N_{doc}$:

\[ P(-) = \frac{3}{5} \qquad P(+) = \frac{2}{5} \]
The word with doesn’t occur in the training set, so we drop it completely (as
mentioned above, we don’t use unknown word models for naive Bayes). The like-
lihoods from the training set for the remaining three words “predictable”, “no”, and
“fun”, are as follows, from Eq. 4.14 (computing the probabilities for the remainder
of the words in the training set is left as an exercise for the reader):
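\[
\begin{aligned}
P(\text{“predictable”} \mid -) &= \frac{1+1}{14+20} & \qquad P(\text{“predictable”} \mid +) &= \frac{0+1}{9+20} \\
P(\text{“no”} \mid -) &= \frac{1+1}{14+20} & \qquad P(\text{“no”} \mid +) &= \frac{0+1}{9+20} \\
P(\text{“fun”} \mid -) &= \frac{0+1}{14+20} & \qquad P(\text{“fun”} \mid +) &= \frac{1+1}{9+20}
\end{aligned}
\]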
For the test sentence S = “predictable with no fun”, after removing the word ‘with’,
the chosen class, via Eq. 4.9, is therefore computed as follows:
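\[
\begin{aligned}
P(-)P(S \mid -) &= \frac{3}{5} \times \frac{2 \times 2 \times 1}{34^3} = 6.1 \times 10^{-5} \\
P(+)P(S \mid +) &= \frac{2}{5} \times \frac{1 \times 1 \times 2}{29^3} = 3.2 \times 10^{-5}
\end{aligned}
\]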
The model thus predicts the class negative for the test sentence.
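
The arithmetic of this worked example can be checked with a few lines of Python. The sketch below uses exact fractions so that the intermediate values (14 and 9 tokens, a vocabulary of 20 word types) can be compared directly with the equations above; the variable names are our own.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("-", "just plain boring"),
    ("-", "entirely predictable and lacks energy"),
    ("-", "no surprises and very few laughs"),
    ("+", "very powerful"),
    ("+", "the most fun film of the summer"),
]
test_doc = "predictable with no fun".split()

# vocabulary = all word types in the training set, over both classes
vocab = {w for _, text in train for w in text.split()}

scores = {}
for c in ("-", "+"):
    texts = [text for label, text in train if label == c]
    counts = Counter(w for text in texts for w in text.split())
    total = sum(counts.values())                 # tokens in class c
    score = Fraction(len(texts), len(train))     # prior P(c)
    for w in test_doc:
        if w in vocab:                           # "with" is unknown and dropped
            score *= Fraction(counts[w] + 1, total + len(vocab))  # add-one smoothing
    scores[c] = score

prediction = max(scores, key=scores.get)
print(len(vocab), {c: float(s) for c, s in scores.items()}, prediction)
```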

4.4 Optimizing for Sentiment Analysis


While standard naive Bayes text classification can work well for sentiment analysis,
some small changes are generally employed that improve performance.
First, for sentiment classification and a number of other text classification tasks,
whether a word occurs or not seems to matter more than its frequency. Thus it
often improves performance to clip the word counts in each document at 1 (see
the end of the chapter for pointers to these results). This variant is called binary
multinomial naive Bayes or binary NB. The variant uses the same Eq. 4.10 except
that for each document we remove all duplicate words before concatenating them
into the single big document. Fig. 4.3 shows an example in which a set of four
documents (shortened and text-normalized for this example) are remapped to binary,
with the modified counts shown in the table on the right. The example is worked
without add-1 smoothing to make the differences clearer. Note that the resulting counts
need not be 1; the word great has a count of 2 even for Binary NB, because it appears
in multiple documents.

Four original documents:
  −  it was pathetic the worst part was the boxing scenes
  −  no plot twists or great scenes
  +  and satire and great plot twists
  +  great scenes great film

After per-document binarization:
  −  it was pathetic the worst part boxing scenes
  −  no plot twists or great scenes
  +  and satire great plot twists
  +  great scenes film

              NB Counts     Binary Counts
               +    −         +    −
and            2    0         1    0
boxing         0    1         0    1
film           1    0         1    0
great          3    1         2    1
it             0    1         0    1
no             0    1         0    1
or             0    1         0    1
part           0    1         0    1
pathetic       0    1         0    1
plot           1    1         1    1
satire         1    0         1    0
scenes         1    2         1    2
the            0    2         0    1
twists         1    1         1    1
was            0    2         0    1
worst          0    1         0    1
Figure 4.3 An example of binarization for the binary naive Bayes algorithm.
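
As a small sketch of how the two count tables in Fig. 4.3 arise, the Python below computes per-class counts with and without per-document binarization. The function name class_counts and the whitespace tokenization are illustrative assumptions.

```python
from collections import Counter

docs = [
    ("-", "it was pathetic the worst part was the boxing scenes"),
    ("-", "no plot twists or great scenes"),
    ("+", "and satire and great plot twists"),
    ("+", "great scenes great film"),
]


def class_counts(docs, binary=False):
    """Per-class word counts; with binary=True each document counts a word at most once."""
    counts = {}
    for label, text in docs:
        tokens = text.split()
        if binary:
            tokens = set(tokens)  # remove duplicate words within this document
        counts.setdefault(label, Counter()).update(tokens)
    return counts


nb = class_counts(docs)
binary_nb = class_counts(docs, binary=True)
print(nb["+"]["great"], binary_nb["+"]["great"])
```

Because duplicates are removed per document and not per class, a word that appears in several documents of a class still receives a count above 1.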

A second important addition commonly made when doing text classification for
sentiment is to deal with negation. Consider the difference between I really like this
movie (positive) and I didn’t like this movie (negative). The negation expressed by
didn’t completely alters the inferences we draw from the predicate like. Similarly,
negation can modify a negative word to produce a positive review (don’t dismiss this
film, doesn’t let us get bored).
A very simple baseline that is commonly used in sentiment analysis to deal with
negation is the following: during text normalization, prepend the prefix NOT to
every word after a token of logical negation (n’t, not, no, never) until the next punc-
tuation mark. Thus the phrase
didn’t like this movie , but I
becomes
didn’t NOT_like NOT_this NOT_movie , but I
Newly formed ‘words’ like NOT_like, NOT_recommend will thus occur more often
in negative documents and act as cues for negative sentiment, while words like
NOT_bored, NOT_dismiss will acquire positive associations. We will return in Chapter 16
to the use of parsing to deal more accurately with the scope relationship between
these negation words and the predicates they modify, but this simple baseline
works quite well in practice.
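
The negation baseline can be sketched as a small token-level function. This is only an illustration under several assumptions of ours: the input is already tokenized with punctuation as separate tokens, the negation cues are matched by a simple regular expression (not, no, never, or any word ending in n't), and a negation cue that falls inside an already negated span is itself prefixed.

```python
import re

# Assumed cue list: "not", "no", "never", or a word ending in n't (straight or curly apostrophe)
NEGATION = re.compile(r"(?:not|no|never|\w*n['’]t)", re.IGNORECASE)
PUNCTUATION = re.compile(r"[.,:;!?]+")


def mark_negation(tokens):
    """Prefix NOT_ to every token after a negation cue, up to the next punctuation mark."""
    out, negated = [], False
    for tok in tokens:
        if PUNCTUATION.fullmatch(tok):
            negated = False          # punctuation ends the negated span
            out.append(tok)
        elif negated:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if NEGATION.fullmatch(tok):
                negated = True       # start prefixing from the next token
    return out


print(" ".join(mark_negation("didn't like this movie , but I".split())))
```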
Finally, in some situations we might have insufficient labeled training data to
train accurate naive Bayes classifiers using all words in the training set to estimate
positive and negative sentiment. In such cases we can instead derive the positive
and negative word features from sentiment lexicons, lists of words that are pre-annotated
with positive or negative sentiment. Four popular lexicons are the General
Inquirer (Stone et al., 1966), LIWC (Pennebaker et al., 2007), the opinion lexicon
of Hu and Liu (2004a) and the MPQA Subjectivity Lexicon (Wilson et al., 2005).
For example the MPQA subjectivity lexicon has 6885 words, 2718 positive and
4912 negative, each marked for whether it is strongly or weakly biased. Some sam-
ples of positive and negative words from the MPQA lexicon include:
+ : admirable, beautiful, confident, dazzling, ecstatic, favor, glee, great
− : awful, bad, bias, catastrophe, cheat, deny, envious, foul, harsh, hate
A common way to use lexicons in a naive Bayes classifier is to add a feature
that is counted whenever a word from that lexicon occurs. Thus we might add a
feature called ‘this word occurs in the positive lexicon’, and treat all instances of
words in the lexicon as counts for that one feature, instead of counting each word
separately. Similarly, we might add a second feature, ‘this word occurs in the
negative lexicon’, for words in the negative lexicon. If we have lots of training data,
and if the test data matches the training data, using just two features won’t work as
well as using all the words. But when training data is sparse or not representative of
the test set, using dense lexicon features instead of sparse individual-word features
may generalize better.
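
A minimal sketch of the two dense lexicon features might look as follows. The tiny word sets reuse only the sample MPQA words quoted above and stand in for a full lexicon; the feature names and the lowercasing step are our own assumptions.

```python
from collections import Counter

# Toy stand-ins for a real lexicon: just the sample words listed in the text
POSITIVE = {"admirable", "beautiful", "confident", "dazzling",
            "ecstatic", "favor", "glee", "great"}
NEGATIVE = {"awful", "bad", "bias", "catastrophe", "cheat",
            "deny", "envious", "foul", "harsh", "hate"}


def lexicon_features(tokens):
    """Map a tokenized document to counts of two dense lexicon features."""
    feats = Counter()
    for tok in tokens:
        word = tok.lower()
        if word in POSITIVE:
            feats["IN_POSITIVE_LEXICON"] += 1
        elif word in NEGATIVE:
            feats["IN_NEGATIVE_LEXICON"] += 1
    return feats


print(lexicon_features("a great and beautiful film with a bad ending".split()))
```

These two counts can then be treated by naive Bayes exactly like word counts, with the two feature names playing the role of a two-word vocabulary.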
We’ll return to this use of lexicons in Chapter 20, showing how these lexicons
can be learned automatically, and how they can be applied to many other tasks be-
yond sentiment classification.

4.5 Naive Bayes for other text classification tasks


In the previous section we pointed out that naive Bayes doesn’t require that our
classifier use all the words in the training data as features. In fact features in naive
Bayes can express any property of the input text we want.
Consider the task of spam detection, deciding if a particular piece of email is
an example of spam (unsolicited bulk email), one of the first applications of
naive Bayes to text classification (Sahami et al., 1998).
A common solution here, rather than using all the words as individual features,
is to predefine likely sets of words or phrases as features, combined with features
that are not purely linguistic. For example the open-source SpamAssassin tool1
predefines features like the phrase “one hundred percent guaranteed”, or the feature
mentions millions of dollars, which is a regular expression that matches suspiciously
large sums of money. But it also includes features like HTML has a low ratio of text
to image area, that aren’t purely linguistic and might require some sophisticated
computation, or totally non-linguistic features about, say, the path that the email
took to arrive. More sample SpamAssassin features:
• Email subject line is all capital letters
• Contains phrases of urgency like “urgent reply”
• Email subject line contains “online pharmaceutical”
• HTML has unbalanced “head” tags
• Claims you can be removed from the list
1 https://spamassassin.apache.org

For other tasks, like language ID (determining what language a given piece
of text is written in), the most effective naive Bayes features are not words at all,
but character n-grams: 2-grams (‘zw’), 3-grams (‘nya’, ‘ Vo’), or 4-grams (‘ie z’,
‘thei’), or, even simpler, byte n-grams, where instead of using the multibyte Unicode
character representations called codepoints, we just pretend everything is a string of
raw bytes. Because spaces count as a byte, byte n-grams can model statistics about
the beginning or ending of words. A widely used naive Bayes system, langid.py
(Lui and Baldwin, 2012), begins with all possible n-grams of lengths 1-4, using feature
selection to winnow down to the most informative 7000 final features.
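
To make the difference between character and byte n-grams concrete, here is a small sketch of both feature extractors. It is not the feature extraction code of langid.py; the function names and the choice of UTF-8 as the byte encoding are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=4):
    """Count all character n-grams (over Unicode codepoints) of lengths n_min..n_max."""
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            feats[text[i:i + n]] += 1
    return feats


def byte_ngrams(text, n_min=1, n_max=4, encoding="utf-8"):
    """Count all byte n-grams of the encoded text, ignoring codepoint boundaries."""
    data = text.encode(encoding)
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(data) - n + 1):
            feats[data[i:i + n]] += 1
    return feats


print(char_ngrams("their")["thei"], len("é"), len("é".encode("utf-8")))
```

A single accented character such as é is one codepoint but two bytes in UTF-8, so its byte n-grams differ from its character n-grams.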
Language ID systems are trained on multilingual text, such as Wikipedia (Wikipedia
text in 68 different languages was used in Lui and Baldwin (2011)), or
newswire. To make sure that this multilingual text correctly reflects different re-
gions, dialects, and socioeconomic classes, systems also add Twitter text in many
languages geotagged to many regions (important for getting world English dialects
from countries with large Anglophone populations like Nigeria or India), Bible and
Quran translations, slang websites like Urban Dictionary, corpora of African Amer-
ican Vernacular English (Blodgett et al., 2016), and so on (Jurgens et al., 2017).

4.6 Naive Bayes as a Language Model


As we saw in the previous section, naive Bayes classifiers can use any sort of fea-
ture: dictionaries, URLs, email addresses, network features, phrases, and so on. But
if, as in the previous section, we use only individual word features, and we use all
of the words in the text (not a subset), then naive Bayes has an important similar-
ity to language modeling. Specifically, a naive Bayes model can be viewed as a
set of class-specific unigram language models, in which the model for each class
instantiates a unigram language model.
Since the likelihood features from the naive Bayes model assign a probability to
each word P(word|c), the model also assigns a probability to each sentence:
\[ P(s \mid c) = \prod_{i \in \text{positions}} P(w_i \mid c) \tag{4.15} \]

Thus consider a naive Bayes model with the classes positive (+) and negative (-)
and the following model parameters:

w P(w|+) P(w|-)
I 0.1 0.2
love 0.1 0.001
this 0.01 0.01
fun 0.05 0.005
film 0.1 0.1
... ... ...

Each of the two columns above instantiates a language model that can assign a
probability to the sentence “I love this fun film”:

P(“I love this fun film”|+) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005
P(“I love this fun film”|−) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 = .0000000010

As it happens, the positive model assigns a higher probability to the sentence:


P(s|pos) > P(s|neg). Note that this is just the likelihood part of the naive Bayes
model; once we multiply in the prior a full naive Bayes model might well make a
different classification decision.
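
The two class-specific unigram models above can be written out directly. In the sketch below the word probabilities are those of the table; the prior at the end is a purely hypothetical value, chosen only to illustrate how a strongly skewed prior could reverse the decision made by the likelihood alone.

```python
import math

params = {
    "+": {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1},
    "-": {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1},
}


def sentence_logprob(tokens, word_probs):
    """Unigram log probability of a sentence: sum of log P(w_i | c), as in Eq. 4.15."""
    return sum(math.log(word_probs[w]) for w in tokens)


sentence = "I love this fun film".split()
likelihood = {c: math.exp(sentence_logprob(sentence, p)) for c, p in params.items()}

# Hypothetical, strongly skewed prior (not from the text), for illustration only
prior = {"+": 0.001, "-": 0.999}
joint = {c: prior[c] * likelihood[c] for c in params}

print(max(likelihood, key=likelihood.get), max(joint, key=joint.get))
```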

4.7 Evaluation: Precision, Recall, F-measure


To introduce the methods for evaluating text classification, let’s first consider some
simple binary detection tasks. For example, in spam detection, our goal is to label
every text as being in the spam category (“positive”) or not in the spam category
(“negative”). For each item (email document) we therefore need to know whether
our system called it spam or not. We also need to know whether the email is actually
spam or not, i.e. the human-defined labels for each document that we are trying to
match. We will refer to these human labels as the gold labels.
Or imagine you’re the CEO of the Delicious Pie Company and you need to know
what people are saying about your pies on social media, so you build a system that
detects tweets concerning Delicious Pie. Here the positive class is tweets about
Delicious Pie and the negative class is all other tweets.
In both cases, we need a metric for knowing how well our spam detector (or
pie-tweet-detector) is doing. To evaluate any system for detecting things, we start
by building a confusion matrix like the one shown in Fig. 4.4. A confusion matrix
is a table for visualizing how an algorithm performs with respect to the human gold
labels, using two dimensions (system output and gold labels), and each cell labeling
a set of possible outcomes. In the spam detection case, for example, true positives
are documents that are indeed spam (indicated by human-created gold labels) that
our system correctly said were spam. False negatives are documents that are indeed
spam but our system incorrectly labeled as non-spam.
To the bottom right of the table is the equation for accuracy, which asks what
percentage of all the observations (for the spam or pie examples that means all emails
or tweets) our system labeled correctly. Although accuracy might seem a natural
metric, we generally don’t use it for text classification tasks. That’s because accuracy
doesn’t work well when the classes are unbalanced (as indeed they are with spam,
which is a large majority of email, or with tweets, which are mainly not about pie).

                                      gold standard labels
                                 gold positive      gold negative
system output   system positive  true positive      false positive     precision = tp / (tp + fp)
labels          system negative  false negative     true negative
                                 recall =                              accuracy =
                                 tp / (tp + fn)                        (tp + tn) / (tp + fp + tn + fn)

Figure 4.4 A confusion matrix for visualizing how well a binary classification system per-
forms against gold standard labels.

To make this more explicit, imagine that we looked at a million tweets, and
let’s say that only 100 of them are discussing their love (or hatred) for our pie,
while the other 999,900 are tweets about something completely unrelated. Imagine a
simple classifier that stupidly classified every tweet as “not about pie”. This classifier
would have 999,900 true negatives and only 100 false negatives for an accuracy of
999,900/1,000,000 or 99.99%! What an amazing accuracy level! Surely we should
be happy with this classifier? But of course this fabulous ‘no pie’ classifier would
be completely useless, since it wouldn’t find a single one of the customer comments
we are looking for. In other words, accuracy is not a good metric when the goal is
to discover something that is rare, or at least not completely balanced in frequency,
which is a very common situation in the world.
That’s why instead of accuracy we generally turn to two other metrics shown in
Fig. 4.4: precision and recall. Precision measures the percentage of the items that
the system detected (i.e., the system labeled as positive) that are in fact positive (i.e.,
are positive according to the human gold labels). Precision is defined as

\[ \text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \]

Recall measures the percentage of items actually present in the input that were
correctly identified by the system. Recall is defined as

\[ \text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \]

Precision and recall will help solve the problem with the useless “nothing is
pie” classifier. This classifier, despite having a fabulous accuracy of 99.99%, has
a terrible recall of 0 (since there are no true positives, and 100 false negatives, the
recall is 0/100). You should convince yourself that the precision at finding relevant
tweets is equally problematic. Thus precision and recall, unlike accuracy, emphasize
true positives: finding the things that we are supposed to be looking for.
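
The numbers for the “nothing is pie” classifier can be reproduced with a short function. One detail is our own assumption: when the system labels nothing as positive, precision is 0/0 and formally undefined, and the sketch below follows the common convention of reporting it as 0.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Compute the three metrics of Fig. 4.4 from confusion-matrix counts."""
    # Convention (an assumption): report 0.0 when the denominator is zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# A classifier that labels every one of a million tweets as "not about pie"
p, r, acc = precision_recall_accuracy(tp=0, fp=0, fn=100, tn=999_900)
print(p, r, acc)
```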
There are many ways to define a single metric that incorporates aspects of both
precision and recall. The simplest of these combinations is the F-measure (van
Rijsbergen, 1975), defined as:

\[ F_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R} \]
The β parameter differentially weights the importance of recall and precision,
based perhaps on the needs of an application. Values of β > 1 favor recall, while
values of β < 1 favor precision. When β = 1, precision and recall are equally balanced;
this is the most frequently used metric, and is called F_{β=1} or just F_1:

\[ F_1 = \frac{2PR}{P + R} \tag{4.16} \]
F-measure comes from a weighted harmonic mean of precision and recall. The
harmonic mean of a set of numbers is the reciprocal of the arithmetic mean of recip-
rocals:
\[ \text{HarmonicMean}(a_1, a_2, a_3, a_4, \ldots, a_n) = \frac{n}{\frac{1}{a_1} + \frac{1}{a_2} + \frac{1}{a_3} + \ldots + \frac{1}{a_n}} \tag{4.17} \]

and hence F-measure is


 
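\[ F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} \quad \text{or} \quad \left( \text{with } \beta^2 = \frac{1 - \alpha}{\alpha} \right) \quad F = \frac{(\beta^2 + 1)PR}{\beta^2 P + R} \tag{4.18} \]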

Harmonic mean is used because it is a conservative metric; the harmonic mean of


two values is closer to the minimum of the two values than the arithmetic mean is.
Thus it weighs the lower of the two numbers more heavily.
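
A short sketch makes the relationship between F-measure and the harmonic mean easy to check numerically. The function names are ours, and returning 0 when both precision and recall are 0 is a convention we assume, since the formula is otherwise undefined there.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure: (beta^2 + 1) P R / (beta^2 P + R)."""
    if precision == 0 and recall == 0:
        return 0.0  # assumed convention for the undefined 0/0 case
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of reciprocals (Eq. 4.17)."""
    return len(values) / sum(1 / v for v in values)


p, r = 0.9, 0.1
print(f_beta(p, r), harmonic_mean([p, r]), (p + r) / 2)
```

With a high precision and a low recall, F1 stays close to the lower value, while the arithmetic mean would hide the weak recall.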

4.7.1 Evaluating with more than two classes


Up to now we have been describing text classification tasks with only two classes.
But lots of classification tasks in language processing have more than two classes.
For sentiment analysis we generally have 3 classes (positive, negative, neutral) and
even more classes are common for tasks like part-of-speech tagging, word sense
disambiguation, semantic role labeling, emotion detection, and so on. Luckily the
naive Bayes algorithm is already a multi-class classification algorithm.

                              gold labels
                    urgent     normal     spam
system    urgent       8         10          1       precision_u = 8 / (8+10+1)
output    normal       5         60         50       precision_n = 60 / (5+60+50)
          spam         3         30        200       precision_s = 200 / (3+30+200)

          recall_u = 8 / (8+5+3)     recall_n = 60 / (10+60+30)     recall_s = 200 / (1+50+200)

Figure 4.5 Confusion matrix for a three-class categorization task, showing for each pair of
classes (c1 , c2 ), how many documents from c1 were (in)correctly assigned to c2

But we’ll need to slightly modify our definitions of precision and recall. Con-
sider the sample confusion matrix for a hypothetical 3-way one-of email catego-
rization decision (urgent, normal, spam) shown in Fig. 4.5. The matrix shows, for
example, that the system mistakenly labeled one spam document as urgent, and we
have shown how to compute a distinct precision and recall value for each class. In
order to derive a single metric that tells us how well the system is doing, we can combine
these values in two ways. In macroaveraging, we compute the performance
for each class, and then average over classes. In microaveraging, we collect the decisions
for all classes into a single confusion matrix, and then compute precision and
recall from that table. Fig. 4.6 shows the confusion matrix for each class separately,
and shows the computation of microaveraged and macroaveraged precision.
As the figure shows, a microaverage is dominated by the more frequent class (in
this case spam), since the counts are pooled. The macroaverage better reflects the
statistics of the smaller classes, and so is more appropriate when performance on all
the classes is equally important.
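
As a sketch, macroaveraged and microaveraged precision can be computed directly from the three-class confusion matrix of Fig. 4.5. The nested-dictionary layout, indexed first by system label and then by gold label, is our own choice of representation.

```python
# confusion[system_label][gold_label], with the counts of Fig. 4.5
confusion = {
    "urgent": {"urgent": 8, "normal": 10, "spam": 1},
    "normal": {"urgent": 5, "normal": 60, "spam": 50},
    "spam":   {"urgent": 3, "normal": 30, "spam": 200},
}
classes = list(confusion)

# For each class: true positives on the diagonal, false positives in the rest of the row
tp = {c: confusion[c][c] for c in classes}
fp = {c: sum(confusion[c].values()) - tp[c] for c in classes}

per_class_precision = {c: tp[c] / (tp[c] + fp[c]) for c in classes}

# Macroaverage: average the per-class precisions
macro_precision = sum(per_class_precision.values()) / len(classes)

# Microaverage: pool the counts first, then compute a single precision
micro_precision = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))

print(round(macro_precision, 2), round(micro_precision, 2))
```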

4.8 Test sets and Cross-validation


The training and testing procedure for text classification follows what we saw with
language modeling (Section 3.2): we use the training set to train the model, then use
the development test set (also called a devset) to perhaps tune some parameters,

Class 1: Urgent                Class 2: Normal                Class 3: Spam                 Pooled
               true    true                   true    true                 true   true                true   true
               urgent  not                    normal  not                  spam   not                 yes    no
system urgent    8      11     system normal   60      55     system spam  200     33     system yes  268     99
system not       8     340     system not      40     212     system not    51     83     system no    99    635

precision = 8/(8+11) = .42     precision = 60/(60+55) = .52   precision = 200/(200+33) = .86   microaverage precision = 268/(268+99) = .73

macroaverage precision = (.42 + .52 + .86) / 3 = .60

Figure 4.6 Separate confusion matrices for the 3 classes from the previous figure, showing the pooled confu-
sion matrix and the microaveraged and macroaveraged precision.

and in general decide what the best model is. Once we come up with what we think
is the best model, we run it on the (hitherto unseen) test set to report its performance.
While the use of a devset avoids overfitting the test set, having a fixed train-
ing set, devset, and test set creates another problem: in order to save lots of data
for training, the test set (or devset) might not be large enough to be representative.
Wouldn’t it be better if we could somehow use all our data for training and still use
all our data for test? We can do this by cross-validation: we randomly choose a
training and test set division of our data, train our classifier, and then compute the
error rate on the test set. Then we repeat with a different randomly selected training
set and test set. We do this sampling process 10 times and average these 10 runs to
get an average error rate. This is called 10-fold cross-validation.
The only problem with cross-validation is that because all the data is used for
testing, we need the whole corpus to be blind; we can’t examine any of the data
to suggest possible features and in general see what’s going on, because we’d be
peeking at the test set, and such cheating would cause us to overestimate the perfor-
mance of our system. However, looking at the corpus to understand what’s going
on is important in designing NLP systems! What to do? For this reason, it is com-
mon to create a fixed training set and test set, then do 10-fold cross-validation inside
the training set, but compute error rate the normal way in the test set, as shown in
Fig. 4.7.

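      Training Iterations                                                        Testing
  1   Dev    Train  Train  Train  Train  Train  Train  Train  Train  Train
  2   Train  Dev    Train  Train  Train  Train  Train  Train  Train  Train
  3   Train  Train  Dev    Train  Train  Train  Train  Train  Train  Train
  4   Train  Train  Train  Dev    Train  Train  Train  Train  Train  Train
  5   Train  Train  Train  Train  Dev    Train  Train  Train  Train  Train      Test Set
  6   Train  Train  Train  Train  Train  Dev    Train  Train  Train  Train
  7   Train  Train  Train  Train  Train  Train  Dev    Train  Train  Train
  8   Train  Train  Train  Train  Train  Train  Train  Dev    Train  Train
  9   Train  Train  Train  Train  Train  Train  Train  Train  Dev    Train
 10   Train  Train  Train  Train  Train  Train  Train  Train  Train  Dev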

Figure 4.7 10-fold cross-validation
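
The fold structure of Fig. 4.7 can be sketched as follows. We assume the common variant in which the training data is shuffled once and partitioned into k disjoint folds, each serving as the dev set exactly once; train_and_score is a hypothetical callback that trains on one list of items and returns a score (for instance an error rate) on the other.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, dev_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        dev = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, dev


def cross_validate(items, train_and_score, k=10, seed=0):
    """Average the score of train_and_score(train_items, dev_items) over the k folds."""
    scores = []
    for train_idx, dev_idx in k_fold_splits(len(items), k, seed):
        train = [items[i] for i in train_idx]
        dev = [items[i] for i in dev_idx]
        scores.append(train_and_score(train, dev))
    return sum(scores) / len(scores)
```

The held-out test set plays no part in this loop; it is used only once, after the model and its parameters have been chosen.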



4.9 Statistical Significance Testing


In building systems we often need to compare the performance of two systems. How
can we know if the new system we just built is better than our old one? Or better than
some other system described in the literature? This is the domain of statistical
hypothesis testing, and in this section we introduce tests for statistical significance
for NLP classifiers, drawing especially on the work of Dror et al. (2020) and Berg-
Kirkpatrick et al. (2012).
Suppose we’re comparing the performance of classifiers A and B on a metric M
such as F1 , or accuracy. Perhaps we want to know if our logistic regression senti-
ment classifier A (Chapter 5) gets a higher F1 score than our naive Bayes sentiment
classifier B on a particular test set x. Let’s call M(A, x) the score that system A gets
on test set x, and δ (x) the performance difference between A and B on x:

δ (x) = M(A, x) − M(B, x) (4.19)

We would like to know if δ(x) > 0, meaning that our logistic regression classifier
has a higher F1 than our naive Bayes classifier on x. δ(x) is called the effect size;
a bigger δ means that A seems to be way better than B; a small δ means A seems to
be only a little better.
Why don’t we just check if δ (x) is positive? Suppose we do, and we find that
the F1 score of A is higher than B’s by .04. Can we be certain that A is better? We
cannot! That’s because A might just be accidentally better than B on this particular x.
We need something more: we want to know if A’s superiority over B is likely to hold
again if we checked another test set x′, or under some other set of circumstances.
In the paradigm of statistical hypothesis testing, we test this by formalizing two
hypotheses.

H0 : δ (x) ≤ 0
H1 : δ (x) > 0 (4.20)

The hypothesis H0, called the null hypothesis, supposes that δ(x) is actually negative
or zero, meaning that A is not better than B. We would like to know if we can
confidently rule out this hypothesis, and instead support H1 , that A is better.
We do this by creating a random variable X ranging over all test sets. Now we
ask how likely is it, if the null hypothesis H0 was correct, that among these test sets
we would encounter the value of δ (x) that we found. We formalize this likelihood
as the p-value: the probability, assuming the null hypothesis H0 is true, of seeing
the δ(x) that we saw or one even greater:

P(δ (X) ≥ δ (x)|H0 is true) (4.21)

So in our example, this p-value is the probability that we would see δ (x) assuming
A is not better than B. If δ (x) is huge (let’s say A has a very respectable F1 of .9
and B has a terrible F1 of only .2 on x), we might be surprised, since that would be
extremely unlikely to occur if H0 were in fact true, and so the p-value would be low
(unlikely to have such a large δ if A is in fact not better than B). But if δ (x) is very
small, it might be less surprising to us even if H0 were true and A is not really better
than B, and so the p-value would be higher.
A very small p-value means that the difference we observed is very unlikely
under the null hypothesis, and we can reject the null hypothesis. What counts as very
small? It is common to use values like .05 or .01 as the thresholds. A value of .01
means that if the p-value (the probability of observing the δ we saw assuming H0 is
true) is less than .01, we reject the null hypothesis and assume that A is indeed better
than B. We say that a result (e.g., “A is better than B”) is statistically significant if
the δ we saw has a probability that is below the threshold and we therefore reject
this null hypothesis.
How do we compute this probability we need for the p-value? In NLP we gen-
erally don’t use simple parametric tests like t-tests or ANOVAs that you might be
familiar with. Parametric tests make assumptions about the distributions of the test
statistic (such as normality) that don’t generally hold in our cases. So in NLP we
usually use non-parametric tests based on sampling: we artificially create many versions
of the experimental setup. For example, if we had lots of different test sets x′
we could just measure all the δ(x′) for all the x′. That gives us a distribution. Now
we set a threshold (like .01) and if we see in this distribution that 99% or more of
those deltas are smaller than the delta we observed, i.e., that p-value(x), the probability
of seeing a δ(x) as big as the one we saw, is less than .01, then we can reject
the null hypothesis and agree that δ(x) was a sufficiently surprising difference and
A is really a better algorithm than B.
There are two common non-parametric tests used in NLP: approximate randomization
(Noreen, 1989) and the bootstrap test. We will describe bootstrap
below, showing the paired version of the test, which again is most common in NLP.
Paired tests are those in which we compare two sets of observations that are aligned:
each observation in one set can be paired with an observation in another. This hap-
pens naturally when we are comparing the performance of two systems on the same
test set; we can pair the performance of system A on an individual observation xi
with the performance of system B on the same xi .

4.9.1 The Paired Bootstrap Test


The bootstrap test (Efron and Tibshirani, 1993) can apply to any metric, from precision,
recall, or F1 to the BLEU metric used in machine translation. The word
bootstrapping refers to repeatedly drawing large numbers of smaller samples with
replacement (called bootstrap samples) from an original larger sample. The intu-
ition of the bootstrap test is that we can create many virtual test sets from an observed
test set by repeatedly sampling from it. The method only makes the assumption that
the sample is representative of the population.
Consider a tiny text classification example with a test set x of 10 documents. The
first row of Fig. 4.8 shows the results of two classifiers (A and B) on this test set,
with each document labeled by one of the four possibilities (A and B both right,
both wrong, A right and B wrong, A wrong and B right); a slash through a letter
(B̸) means that the classifier got the answer wrong. On the first document both A
and B get the correct class (AB), while on the second document A got it right but B
got it wrong (AB̸). If we assume for simplicity that our metric is accuracy, A has an
accuracy of .70 and B of .50, so δ(x) is .20.
Now we create a large number b (perhaps 10^5) of virtual test sets x(i), each of
size n = 10. Fig. 4.8 shows a couple of examples. To create each virtual test set x(i), we
repeatedly (n = 10 times) select a cell from row x with replacement. For example, to
create the first cell of the first virtual test set x(1), if we happened to randomly select
the second cell of the x row, we would copy the value AB̸ into our new cell, and
move on to create the second cell of x(1), each time sampling (randomly choosing)
from the original x with replacement.

        documents 1 to 10                                  A%     B%     δ()
x       (outcome of A and B on each document)             .70    .50    .20
x(1)    (10 cells sampled from x with replacement)        .60    .60    .00
x(2)    (10 cells sampled from x with replacement)        .60    .70    -.10
...
x(b)
Figure 4.8 The paired bootstrap test: Examples of b pseudo test sets x(i) being created
from an initial true test set x. Each pseudo test set is created by sampling n = 10 times with
replacement; thus an individual sample is a single cell, a document with its gold label and
the correct or incorrect performance of classifiers A and B. Of course real test sets don’t have
only 10 examples, and b needs to be large as well.

Now that we have the b test sets, providing a sampling distribution, we can do
statistics on how often A has an accidental advantage. There are various ways to
compute this advantage; here we follow the version laid out in Berg-Kirkpatrick
et al. (2012). Assuming H0 (A isn’t better than B), we would expect that δ (X), esti-
mated over many test sets, would be zero; a much higher value would be surprising,
since H0 specifically assumes A isn’t better than B. To measure exactly how surpris-
ing is our observed δ (x) we would in other circumstances compute the p-value by
counting over many test sets how often δ (x(i) ) exceeds the expected zero value by
δ (x) or more:

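\[ \text{p-value}(x) = \frac{1}{b} \sum_{i=1}^{b} \mathbb{1}\left( \delta(x^{(i)}) - \delta(x) \geq 0 \right) \]

(Here $\mathbb{1}(\cdot)$ is 1 if its argument is true and 0 otherwise, so the sum counts the test sets and the factor 1/b turns that count into a proportion.)

However, although it’s generally true that the expected value of δ(X) over many test
sets (again assuming A isn’t better than B) is 0, this isn’t true for the bootstrapped
test sets we created. That’s because we didn’t draw these samples from a distribution
with 0 mean; we happened to create them from the original test set x, which happens
to be biased (by .20) in favor of A. So to measure how surprising is our observed
δ(x), we actually compute the p-value by counting over many test sets how often
δ(x(i)) exceeds the expected value of δ(x) by δ(x) or more:

\[
\begin{aligned}
\text{p-value}(x) &= \frac{1}{b} \sum_{i=1}^{b} \mathbb{1}\left( \delta(x^{(i)}) - \delta(x) \geq \delta(x) \right) \\
&= \frac{1}{b} \sum_{i=1}^{b} \mathbb{1}\left( \delta(x^{(i)}) \geq 2\delta(x) \right)
\end{aligned}
\tag{4.22}
\]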

So if for example we have 10,000 test sets x(i) and a threshold of .01, and in only
47 of the test sets do we find that δ (x(i) ) ≥ 2δ (x), the resulting p-value of .0047 is
smaller than .01, indicating δ (x) is indeed sufficiently surprising, and we can reject
the null hypothesis and conclude A is better than B.
The full algorithm for the bootstrap is shown in Fig. 4.9. It is given a test set x, a
number of samples b, and counts the percentage of the b bootstrap test sets in which
δ(x(i)) > 2δ(x). This percentage then acts as a one-sided empirical p-value.

function BOOTSTRAP(test set x, num of samples b) returns p-value(x)

Calculate δ(x)                # how much better does algorithm A do than B on x
s = 0
for i = 1 to b do
    for j = 1 to n do         # Draw a bootstrap sample x(i) of size n
        Select a member of x at random and add it to x(i)
    Calculate δ(x(i))         # how much better does algorithm A do than B on x(i)
    s ← s + 1 if δ(x(i)) > 2δ(x)
p-value(x) ≈ s/b              # on what % of the b samples did algorithm A beat expectations?
return p-value(x)             # if very few did, our observed δ is probably not accidental

Figure 4.9 A version of the paired bootstrap algorithm after Berg-Kirkpatrick et al. (2012).
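The pseudocode in Fig. 4.9 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: δ is taken to be the difference in accuracy between A and B, each system's output is given as a list of 0/1 correctness flags over the same test items, and the name paired_bootstrap and its parameters are illustrative choices that do not come from the text.

```python
import random


def paired_bootstrap(correct_a, correct_b, b=10_000, seed=0):
    """One-sided paired-bootstrap p-value for "A is better than B".

    correct_a[j] and correct_b[j] are 1 if the system classified test
    item j correctly and 0 otherwise (an assumed input format).
    """
    n = len(correct_a)
    if n == 0 or n != len(correct_b):
        raise ValueError("both systems must be scored on the same non-empty test set")
    rng = random.Random(seed)

    # Per-item difference; its mean is delta(x), the accuracy gap on x.
    diffs = [a - c for a, c in zip(correct_a, correct_b)]
    delta_x = sum(diffs) / n

    s = 0
    for _ in range(b):
        # Draw a bootstrap sample x(i) of size n, with replacement.
        sample = rng.choices(diffs, k=n)
        delta_i = sum(sample) / n
        # Count samples that beat the biased expectation, as in Eq. 4.22.
        if delta_i >= 2 * delta_x:
            s += 1
    return s / b
```

Sampling the per-item differences keeps the test paired: each bootstrap draw reuses the same item for both systems. A call such as paired_bootstrap(flags_a, flags_b, b=10_000) returns a value to compare against a chosen threshold such as .01.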

4.10 Avoiding Harms in Classification


It is important to avoid harms that may result from classifiers, harms that exist both
for naive Bayes classifiers and for the other classification algorithms we introduce
in later chapters.
One class of harms is representational harms (Crawford 2017, Blodgett et al. 2020),
harms caused by a system that demeans a social group, for example by perpetuating
negative stereotypes about them. For example Kiritchenko and Mohammad (2018)
examined the performance of 200 sentiment analysis systems on pairs of sentences
that were identical except for containing either a common African American first
name (like Shaniqua) or a common European American first name (like Stephanie),
chosen from the Caliskan et al. (2017) study discussed in Chapter 6. They found
that most systems assigned lower sentiment and more negative emotion to sentences
with African American names, reflecting and perpetuating stereotypes that associate
African Americans with negative emotions (Popp et al., 2003).
In other tasks classifiers may lead to both representational harms and other
harms, such as censorship. For example the important text classification task of
toxicity detection is the task of detecting hate speech, abuse, harassment, or other
kinds of toxic language. While the goal of such classifiers is to help reduce societal
harm, toxicity classifiers can themselves cause harms. For example, researchers
have shown that some widely used toxicity classifiers incorrectly flag as being toxic
sentences that are non-toxic but simply mention minority identities like women
(Park et al., 2018), blind people (Hutchinson et al., 2020) or gay people (Dixon
et al., 2018), or simply use linguistic features characteristic of varieties like African-
American Vernacular English (Sap et al. 2019, Davidson et al. 2019). Such false
positive errors, if employed by toxicity detection systems without human oversight,
could lead to the censoring of discourse by or about these groups.
These model problems can be caused by biases or other problems in the training
data; in general, machine learning systems replicate and even amplify the biases in
their training data. But these problems can also be caused by the labels (for example,
by biases in the human labelers), by the resources used (like lexicons,
or model components like pretrained embeddings), or even by model architecture
(like what the model is trained to optimize). While the mitigation of these biases
(for example by carefully considering the training data sources) is an important area
of research, we currently don’t have general solutions. For this reason it’s important,
when introducing any NLP model, to study these kinds of factors and
make them clear. One way to do this is by releasing a model card (Mitchell et al.,
2019) for each version of a model. A model card documents a machine learning model with
information like:
• training algorithms and parameters
• training data sources, motivation, and preprocessing
• evaluation data sources, motivation, and preprocessing
• intended use and users
• model performance across different demographic or other groups and envi-
ronmental situations

4.11 Summary
This chapter introduced the naive Bayes model for classification and applied it to
the text categorization task of sentiment analysis.
• Many language processing tasks can be viewed as tasks of classification.
• Text categorization, in which an entire text is assigned a class from a finite set,
includes such tasks as sentiment analysis, spam detection, language identi-
fication, and authorship attribution.
• Sentiment analysis classifies a text as reflecting the positive or negative orien-
tation (sentiment) that a writer expresses toward some object.
• Naive Bayes is a generative model that makes the bag of words assumption
(position doesn’t matter) and the conditional independence assumption (words
are conditionally independent of each other given the class).
• Naive Bayes with binarized features seems to work better for many text clas-
sification tasks.
• Classifiers are evaluated based on precision and recall.
• Classifiers are trained using distinct training, dev, and test sets, including the
use of cross-validation in the training set.
• Statistical significance tests should be used to determine whether we can be
confident that one version of a classifier is better than another.
• Designers of classifiers should carefully consider harms that may be caused
by the model, including its training data and other components, and report
model characteristics in a model card.

Bibliographical and Historical Notes


Multinomial naive Bayes text classification was proposed by Maron (1961) at the
RAND Corporation for the task of assigning subject categories to journal abstracts.
His model introduced most of the features of the modern form presented here, ap-
proximating the classification task with one-of categorization, and implementing
add-δ smoothing and information-based feature selection.
The conditional independence assumptions of naive Bayes and the idea of Bayesian
analysis of text seem to have arisen multiple times. The same year as Maron’s
paper, Minsky (1961) proposed a naive Bayes classifier for vision and other arti-
ficial intelligence problems, and Bayesian techniques were also applied to the text
classification task of authorship attribution by Mosteller and Wallace (1963). It had
long been known that Alexander Hamilton, John Jay, and James Madison wrote
the anonymously-published Federalist papers in 1787–1788 to persuade New York
to ratify the United States Constitution. Yet although some of the 85 essays were
clearly attributable to one author or another, the authorship of 12 was in dispute
between Hamilton and Madison. Mosteller and Wallace (1963) trained a Bayesian
probabilistic model of the writing of Hamilton and another model on the writings
of Madison, then computed the maximum-likelihood author for each of the disputed
essays. Naive Bayes was first applied to spam detection in Heckerman et al. (1998).
Metsis et al. (2006), Pang et al. (2002), and Wang and Manning (2012) show
that using boolean attributes with multinomial naive Bayes works better than full
counts. Binary multinomial naive Bayes is sometimes confused with another variant
of naive Bayes that also uses a binary representation of whether a term occurs in
a document: Multivariate Bernoulli naive Bayes. The Bernoulli variant instead
estimates P(w|c) as the fraction of documents that contain a term, and includes a
probability for whether a term is not in a document. McCallum and Nigam (1998)
and Wang and Manning (2012) show that the multivariate Bernoulli variant of naive
Bayes doesn’t work as well as the multinomial algorithm for sentiment or other text
tasks.
There are a variety of sources covering the many kinds of text classification
tasks. For sentiment analysis see Pang and Lee (2008), and Liu and Zhang (2012).
Stamatatos (2009) surveys authorship attribution algorithms. On language identification
see Jauhiainen et al. (2018); Jaech et al. (2016) is an important early neural
system. The task of newswire indexing was often used as a test case for text classi-
fication algorithms, based on the Reuters-21578 collection of newswire articles.
See Manning et al. (2008) and Aggarwal and Zhai (2012) on text classification;
classification in general is covered in machine learning textbooks (Hastie et al. 2001,
Witten and Frank 2005, Bishop 2006, Murphy 2012).
Non-parametric methods for computing statistical significance were used first in
NLP in the MUC competition (Chinchor et al., 1993), and even earlier in speech
recognition (Gillick and Cox 1989, Bisani and Ney 2004). Our description of the
bootstrap draws on the description in Berg-Kirkpatrick et al. (2012). Recent work
has focused on issues including multiple test sets and multiple metrics (Søgaard
et al. 2014, Dror et al. 2017).
Feature selection is a method of removing features that are unlikely to generalize
well. Features are generally ranked by how informative they are about the classification
decision. A very common metric, information gain, tells us how many bits of
information the presence of the word gives us for guessing the class. Other feature
selection metrics include χ², pointwise mutual information, and GINI index; see
Yang and Pedersen (1997) for a comparison and Guyon and Elisseeff (2003) for an
introduction to feature selection.

Exercises
4.1 Assume the following likelihoods for each word being part of a positive or
negative movie review, and equal prior probabilities for each class.
             pos     neg
    I        0.09    0.16
    always   0.07    0.06
    like     0.29    0.06
    foreign  0.04    0.15
    films    0.08    0.11
What class will naive Bayes assign to the sentence “I always like foreign
films.”?
4.2 Given the following short movie reviews, each labeled with a genre, either
comedy or action:
    1. fun, couple, love, love       comedy
    2. fast, furious, shoot          action
    3. couple, fly, fast, fun, fun   comedy
    4. furious, shoot, shoot, fun    action
    5. fly, fast, shoot, love        action
and a new document D:
fast, couple, shoot, fly
compute the most likely class for D. Assume a naive Bayes classifier and use
add-1 smoothing for the likelihoods.
4.3 Train two models, multinomial naive Bayes and binarized naive Bayes, both
with add-1 smoothing, on the following document counts for key sentiment
words, with positive or negative class assigned as noted.
    doc   “good”   “poor”   “great”   (class)
    d1.     3        0        3        pos
    d2.     0        1        2        pos
    d3.     1        3        0        neg
    d4.     1        5        2        neg
    d5.     0        2        0        neg
Use both naive Bayes models to assign a class (pos or neg) to this sentence:
A good, good plot and great characters, but poor acting.
Recall from page 60 that with naive Bayes text classification, we simply ig-
nore (throw out) any word that never occurred in the training document. (We
don’t throw out words that appear in some classes but not others; that’s what
add-one smoothing is for.) Do the two models agree or disagree?
76 C HAPTER 5 • L OGISTIC R EGRESSION

CHAPTER

5 Logistic Regression

“And how do you know that these fine begonias are not of equal importance?”
Hercule Poirot, in Agatha Christie’s The Mysterious Affair at Styles
Detective stories are as littered with clues as texts are with words. Yet for the
poor reader it can be challenging to know how to weigh the author’s clues in order
to make the crucial classification task: deciding whodunnit.
In this chapter we introduce an algorithm that is admirably suited for discovering
the link between features or cues and some particular outcome: logistic regression.
Indeed, logistic regression is one of the most important analytic tools in the social
and natural sciences. In natural language processing, logistic regression is the base-
line supervised machine learning algorithm for classification, and also has a very
close relationship with neural networks. As we will see in Chapter 7, a neural net-
work can be viewed as a series of logistic regression classifiers stacked on top of
each other. Thus the classification and machine learning techniques introduced here
will play an important role throughout the book.
Logistic regression can be used to classify an observation into one of two classes
(like ‘positive sentiment’ and ‘negative sentiment’), or into one of many classes.
Because the mathematics for the two-class case is simpler, we’ll describe this special
case of logistic regression first in the next few sections, and then briefly summarize
the use of multinomial logistic regression for more than two classes in Section 5.6.
We’ll introduce the mathematics of logistic regression in the next few sections.
But let’s begin with some high-level issues.
Generative and Discriminative Classifiers: The most important difference be-
tween naive Bayes and logistic regression is that logistic regression is a discrimina-
tive classifier while naive Bayes is a generative classifier.
These are two very different frameworks for how
to build a machine learning model. Consider a visual
metaphor: imagine we’re trying to distinguish dog
images from cat images. A generative model would
have the goal of understanding what dogs look like
and what cats look like. You might literally ask such
a model to ‘generate’, i.e., draw, a dog. Given a test
image, the system then asks whether it’s the cat model or the dog model that better
fits (is less surprised by) the image, and chooses that as its label.
A discriminative model, by contrast, is only try-
ing to learn to distinguish the classes (perhaps with-
out learning much about them). So maybe all the
dogs in the training data are wearing collars and the
cats aren’t. If that one feature neatly separates the
classes, the model is satisfied. If you ask such a
model what it knows about cats all it can say is that
they don’t wear collars.
More formally, recall that naive Bayes assigns a class c to a document d not
by directly computing P(c|d) but by computing a likelihood and a prior:
\[
\hat{c} = \operatorname*{argmax}_{c \in C} \; \overbrace{P(d|c)}^{\text{likelihood}} \; \overbrace{P(c)}^{\text{prior}} \tag{5.1}
\]

A generative model like naive Bayes makes use of this likelihood term, which
expresses how to generate the features of a document if we knew it was of class c.
By contrast a discriminative model in this text categorization scenario attempts
to directly compute P(c|d). Perhaps it will learn to assign a high weight to document
features that directly improve its ability to discriminate between possible classes,
even if it couldn’t generate an example of one of the classes.
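The link between the two views is Bayes’ rule. As a short worked step (the middle expression with P(d) is standard probability, recalled here rather than taken from this passage):
\[
\hat{c} = \operatorname*{argmax}_{c \in C} P(c|d)
        = \operatorname*{argmax}_{c \in C} \frac{P(d|c)\,P(c)}{P(d)}
        = \operatorname*{argmax}_{c \in C} P(d|c)\,P(c)
\]
The denominator P(d) is the same for every class, so it can be dropped without changing the argmax. A generative classifier therefore reaches the best class indirectly, through the likelihood and the prior, while a discriminative classifier models P(c|d) itself.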
Components of a probabilistic machine learning classifier: Like naive Bayes,
logistic regression is a probabilistic classifier that makes use of supervised machine
learning. Machine learning classifiers require a training corpus of m input/output
pairs (x(i) , y(i) ). (We’ll use superscripts in parentheses to refer to individual instances
in the training set—for sentiment classification each instance might be an individual
document to be classified). A machine learning system for classification then has
four components:
1. A feature representation of the input. For each input observation x(i) , this
will be a vector of features [x1 , x2 , ..., xn ]. We will generally refer to feature
i for input x( j) as $x_i^{(j)}$, sometimes simplified as xi , but we will also see the
notation fi , fi (x), or, for multiclass classification, fi (c, x).
2. A classification function that computes ŷ, the estimated class, via p(y|x). In
the next section we will introduce the sigmoid and softmax tools for classifi-
cation.
3. An objective function for learning, usually involving minimizing error on
training examples. We will introduce the cross-entropy loss function.
4. An algorithm for optimizing the objective function. We introduce the stochas-
tic gradient descent algorithm.
Logistic regression has two phases:
training: we train the system (specifically the weights w and b) using stochastic
gradient descent and the cross-entropy loss.
test: Given a test example x we compute p(y|x) and return the higher probability
label y = 1 or y = 0.

5.1 Classification: the sigmoid


The goal of binary logistic regression is to train a classifier that can make a binary
decision about the class of a new input observation. Here we introduce the sigmoid
classifier that will help us make this decision.
Consider a single input observation x, which we will represent by a vector of fea-
tures [x1 , x2 , ..., xn ] (we’ll show sample features in the next subsection). The classifier
output y can be 1 (meaning the observation is a member of the class) or 0 (the ob-
servation is not a member of the class). We want to know the probability P(y = 1|x)
that this observation is a member of the class. So perhaps the decision is “positive
sentiment” versus “negative sentiment”, the features represent counts of words in a
document, P(y = 1|x) is the probability that the document has positive sentiment,
and P(y = 0|x) is the probability that the document has negative sentiment.
Logistic regression solves this task by learning, from a training set, a vector of
weights and a bias term. Each weight wi is a real number, and is associated with one
of the input features xi . The weight wi represents how important that input feature
is to the classification decision, and can be positive (providing evidence that the in-
stance being classified belongs in the positive class) or negative (providing evidence
that the instance being classified belongs in the negative class). Thus we might
expect in a sentiment task the word awesome to have a high positive weight, and
abysmal to have a very negative weight. The bias term, also called the intercept, is
another real number that’s added to the weighted inputs.
To make a decision on a test instance, after we’ve learned the weights in
training, the classifier first multiplies each xi by its weight wi , sums up the weighted
features, and adds the bias term b. The resulting single number z expresses the
weighted sum of the evidence for the class.
\[
z = \left(\sum_{i=1}^{n} w_i x_i\right) + b \tag{5.2}
\]
In the rest of the book we’ll represent such sums using the dot product notation from
linear algebra. The dot product of two vectors a and b, written as a · b, is the sum of
the products of the corresponding elements of each vector. Thus the following is an
equivalent formulation to Eq. 5.2:
z = w·x+b (5.3)

But note that nothing in Eq. 5.3 forces z to be a legal probability, that is, to lie
between 0 and 1. In fact, since weights are real-valued, the output might even be
negative; z ranges from −∞ to ∞.
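The weighted sum of Eq. 5.2 and Eq. 5.3 can be sketched in a few lines of Python. The function name weighted_sum and the toy weights and features below are illustrative assumptions; they only show that z is an unbounded real number rather than a probability.

```python
def weighted_sum(w, x, b):
    """Return z = w . x + b for equal-length lists of weights and features."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Hypothetical two-feature inputs: one gives a positive z, one a negative z.
z_positive = weighted_sum([1.0, -2.0], [3.0, 1.0], 0.5)
z_negative = weighted_sum([1.0, -2.0], [0.0, 4.0], 0.5)
```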

Figure 5.1 The sigmoid function $y = \frac{1}{1+e^{-z}}$ takes a real value and maps it to the range (0, 1).
It is nearly linear around 0 but outlier values get squashed toward 0 or 1.

To create a probability, we’ll pass z through the sigmoid function, σ (z). The
sigmoid function (named because it looks like an s) is also called the logistic function,
and gives logistic regression its name. The sigmoid has the following equation,
shown graphically in Fig. 5.1:
\[
y = \sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + \exp(-z)} \tag{5.4}
\]
(For the rest of the book, we’ll use the notation exp(x) to mean e^x.) The sigmoid
has a number of advantages; it takes a real-valued number and maps it into the range
(0, 1), which is just what we want for a probability. Because it is nearly linear around
0 but flattens toward the ends, it tends to squash outlier values toward 0 or 1. And
it’s differentiable, which as we’ll see in Section 5.8 will be handy for learning.
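As a minimal sketch, Eq. 5.4 can be written in Python as below. The branch on the sign of z is an implementation choice that is not discussed in the text: it avoids calling exp on a large positive argument, which would overflow in floating point.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), arranged to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form exp(z) / (1 + exp(z)).
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

Mathematically the output is strictly between 0 and 1; in floating point, very large or very small inputs round to exactly 1.0 or 0.0.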
We’re almost there. If we apply the sigmoid to the sum of the weighted features,
we get a number between 0 and 1. To make it a probability, we just need to make
sure that the two cases, p(y = 1) and p(y = 0), sum to 1. We can do this as follows:

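\[
\begin{aligned}
P(y=1) &= \sigma(w \cdot x + b) \\
       &= \frac{1}{1 + \exp(-(w \cdot x + b))} \\
P(y=0) &= 1 - \sigma(w \cdot x + b) \\
       &= 1 - \frac{1}{1 + \exp(-(w \cdot x + b))} \\
       &= \frac{\exp(-(w \cdot x + b))}{1 + \exp(-(w \cdot x + b))}
\end{aligned}
\tag{5.5}
\]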

The sigmoid function has the property

1 − σ (x) = σ (−x) (5.6)

so we could also have expressed P(y = 0) as σ (−(w · x + b)).
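The property in Eq. 5.6 follows directly from the definition in Eq. 5.4. A short derivation, multiplying numerator and denominator by e^x in the third step:
\[
1 - \sigma(x) = 1 - \frac{1}{1 + e^{-x}} = \frac{e^{-x}}{1 + e^{-x}} = \frac{1}{e^{x} + 1} = \sigma(-x)
\]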


Now we have an algorithm that given an instance x computes the probability
P(y = 1|x). How do we make a decision? For a test instance x, we say yes if the
probability P(y = 1|x) is more than .5, and no otherwise. We call .5 the decision
boundary:
\[
\hat{y} = \begin{cases} 1 & \text{if } P(y=1|x) > 0.5 \\ 0 & \text{otherwise} \end{cases}
\]

5.1.1 Example: sentiment classification


Let’s have an example. Suppose we are doing binary sentiment classification on
movie review text, and we would like to know whether to assign the sentiment class
+ or − to a review document doc. We’ll represent each input observation by the 6
features x1 ...x6 of the input shown in the following table; Fig. 5.2 shows the features
in a sample mini test document.

    Var   Definition                              Value in Fig. 5.2
    x1    count(positive lexicon words ∈ doc)     3
    x2    count(negative lexicon words ∈ doc)     2
    x3    1 if “no” ∈ doc, 0 otherwise            1
    x4    count(1st and 2nd pronouns ∈ doc)       3
    x5    1 if “!” ∈ doc, 0 otherwise             0
    x6    log(word count of doc)                  ln(66) = 4.19

    It's hokey . There are virtually no surprises , and the writing is second-rate .
    So why was it so enjoyable ? For one thing , the cast is
    great . Another nice touch is the music . I was overcome with the urge to get off
    the couch and start dancing . It sucked me in , and it'll do the same to you .

    Extracted features: x1=3, x2=2, x3=1, x4=3, x5=0, x6=4.19

Figure 5.2 A sample mini test document showing the extracted features in the vector x.

Let’s assume for the moment that we’ve already learned a real-valued weight for
each of these features, and that the 6 weights corresponding to the 6 features are
[2.5, −5.0, −1.2, 0.5, 2.0, 0.7], while b = 0.1. (We’ll discuss in the next section how
the weights are learned.) The weight w1 , for example indicates how important a
feature the number of positive lexicon words (great, nice, enjoyable, etc.) is to
a positive sentiment decision, while w2 tells us the importance of negative lexicon
words. Note that w1 = 2.5 is positive, while w2 = −5.0, meaning that negative words
are negatively associated with a positive sentiment decision, and are about twice as
important as positive words.
Given these 6 features and the input review x, P(+|x) and P(−|x) can be com-
puted using Eq. 5.5:

\[
\begin{aligned}
p(+|x) = P(Y=1|x) &= \sigma(w \cdot x + b) \\
                  &= \sigma([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] \cdot [3, 2, 1, 3, 0, 4.19] + 0.1) \\
                  &= \sigma(.833) \\
                  &= 0.70 \\
p(-|x) = P(Y=0|x) &= 1 - \sigma(w \cdot x + b) \\
                  &= 0.30
\end{aligned}
\tag{5.7}
\]
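The arithmetic above can be checked with a few lines of Python. The weights, bias, and feature values are the ones given in the text; the variable names and the decision at the .5 boundary are written out here only as an illustrative sketch.

```python
import math

# Weights, bias, and feature values as given in the text for Fig. 5.2.
w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
x = [3, 2, 1, 3, 0, math.log(66)]  # x6 = ln(66), about 4.19
b = 0.1

# Weighted sum of the evidence (Eq. 5.3), then the sigmoid (Eq. 5.4).
z = sum(wi * xi for wi, xi in zip(w, x)) + b
p_pos = 1.0 / (1.0 + math.exp(-z))
p_neg = 1.0 - p_pos

# Decision boundary at .5.
label = "+" if p_pos > 0.5 else "-"

print(f"z = {z:.3f}, P(+|x) = {p_pos:.2f}, P(-|x) = {p_neg:.2f}, label = {label}")
```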

Logistic regression is commonly applied to all sorts of NLP tasks, and any property
of the input can be a feature. Consider the task of period disambiguation: deciding
if a period is the end of a sentence or part of a word, by classifying each period
into one of two classes EOS (end-of-sentence) and not-EOS. We might use features
like x1 below expressing that the current word is lower case and the class is EOS
(perhaps with a positive weight), or that the current word is in our abbreviations
dictionary (“Prof.”) and the class is EOS (perhaps with a negative weight). A feature
can also express a quite complex combination of properties. For example a period
following an upper case word is likely to be an EOS, but if the word itself is St. and
the previous word is capitalized, then the period is likely part of a shortening of the
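word street.
\[
x_1 = \begin{cases} 1 & \text{if ``Case($w_i$) = Lower''} \\ 0 & \text{otherwise} \end{cases}
\]
\[
x_2 = \begin{cases} 1 & \text{if ``$w_i \in$ AcronymDict''} \\ 0 & \text{otherwise} \end{cases}
\]
\[
x_3 = \begin{cases} 1 & \text{if ``$w_i$ = St. \& Case($w_{i-1}$) = Cap''} \\ 0 & \text{otherwise} \end{cases}
\]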
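These three binary features can be sketched as a small Python function. Several details are assumptions made for illustration: tokens are taken to keep their trailing period, "Lower" is read as the token having no upper-case letters, "Cap" as the previous token starting with an upper-case letter, and ABBREVIATIONS is a tiny hypothetical stand-in for the abbreviation dictionary.

```python
# Hypothetical stand-in for the abbreviation dictionary mentioned in the text.
ABBREVIATIONS = {"Prof.", "Dr.", "Mr.", "St."}


def period_features(word: str, prev_word: str) -> list[int]:
    """Return [x1, x2, x3] for a period-final token and the token before it."""
    x1 = 1 if word.islower() else 0                      # current word is lower case
    x2 = 1 if word in ABBREVIATIONS else 0               # current word is a known abbreviation
    x3 = 1 if word == "St." and prev_word[:1].isupper() else 0  # "St." after a capitalized word
    return [x1, x2, x3]
```

For example, period_features("St.", "Main") fires both the abbreviation feature and the combination feature, while period_features("dancing.", "start") fires only the lower-case feature.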

Designing features: Features are generally designed by examining the training
set with an eye to linguistic intuitions and the linguistic literature on the domain. A
careful error analysis on the training set or devset of an early version of a system
often provides insights into features.
For some tasks it is especially helpful to build complex features that are combi-
nations of more primitive features. We saw such a feature for period disambiguation
above, where a period on the word St. was less likely to be the end of the sentence
if the previous word was capitalized. For logistic regression and naive Bayes these
combination features or feature interactions have to be designed by hand.
For many tasks (especially when feature values can reference specific words)
we’ll need large numbers of features. Often these are created automatically via feature
templates, abstract specifications of features. For example a bigram template
for period disambiguation might create a feature for every pair of words that occurs
before a period in the training set. Thus the feature space is sparse, since we only
have to create a feature if that n-gram exists in that position in the training set. The
feature is generally created as a hash from the string descriptions. A user description
of a feature such as “bigram(American breakfast)” is hashed into a unique integer i that
becomes the feature number fi .
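A bigram template of this kind might be sketched as follows. This is an illustration under stated assumptions rather than the method of any particular toolkit: sentences are assumed to be pre-tokenized lists of strings, and instead of a hash function the sketch uses a dictionary that hands out consecutive integers, which guarantees that each description gets a unique feature number.

```python
def bigram_before_period_features(sentences):
    """Map each bigram seen just before a period to a unique feature number."""
    index = {}
    for tokens in sentences:
        for k in range(2, len(tokens)):
            if tokens[k] == ".":
                description = f"bigram({tokens[k - 2]} {tokens[k - 1]})"
                # Only bigrams that actually occur get a feature (sparse space).
                index.setdefault(description, len(index))
    return index


# Hypothetical toy training set.
training = [
    ["I", "ate", "an", "American", "breakfast", "."],
    ["He", "skipped", "breakfast", "."],
    ["An", "American", "breakfast", "."],
]
feature_index = bigram_before_period_features(training)
```

With a true hash function two descriptions could collide on the same integer; the dictionary avoids that at the cost of storing every description string.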
In order to avoid the extensive human effort of feature design, recent research in
NLP has focused on representation learning: ways to learn features automatically
in an unsupervised way from the input. We’ll introduce methods for representation
learning in Chapter 6 and Chapter 7.

Choosing a classifier: Logistic regression has a number of advantages over naive
Bayes. Naive Bayes has overly strong conditional independence assumptions. Consider
two features which are strongly correlated; in fact, imagine that we just add the
same feature f1 twice. Naive Bayes will treat both copies of f1 as if they were sep-
arate, multiplying them both in, overestimating the evidence. By contrast, logistic
regression is much more robust to correlated features; if two features f1 and f2 are
perfectly correlated, regression will simply assign part of the weight to w1 and part
to w2 . Thus when there are many correlated features, logistic regression will assign
a more accurate probability than naive Bayes. So logistic regression generally works
better on larger documents or datasets and is a common default.
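The double-counting effect can be seen in a toy calculation. The sketch below is illustrative only: the likelihood ratio of 3, the equal prior, and the even split of a weight of 1.0 across the two copies are hypothetical numbers, and splitting the weight evenly is just one of many ways logistic regression could share it.

```python
def nb_posterior(likelihood_ratios, prior_odds=1.0):
    """Naive Bayes posterior for the positive class from per-feature likelihood ratios."""
    odds = prior_odds
    for ratio in likelihood_ratios:
        odds *= ratio  # each feature is multiplied in as if independent
    return odds / (1.0 + odds)


def lr_score(weights, features, bias=0.0):
    """Logistic regression weighted sum z = w . x + b."""
    return sum(w * f for w, f in zip(weights, features)) + bias


# Naive Bayes: adding a duplicate of the same feature inflates the posterior.
nb_once = nb_posterior([3.0])
nb_twice = nb_posterior([3.0, 3.0])

# Logistic regression: the weight can be shared between the copies, leaving z unchanged.
z_single = lr_score([1.0], [2.0])
z_duplicated = lr_score([0.5, 0.5], [2.0, 2.0])
```

Here the naive Bayes posterior rises from 0.75 to 0.9 even though no new evidence was added, while the logistic regression score is the same in both cases.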
Despite the less accurate probabilities, naive Bayes still often makes the correct
classification decision. Furthermore, naive Bayes can work extremely well (some-
times even better than logistic regression) on very small datasets (Ng and Jordan,
2002) or short documents (Wang and Manning, 2012). Furthermore, naive Bayes is
easy to implement and very fast to train (there’s no optimization step). So it’s still a
reasonable approach to use in some situations.

5.2 Learning in Logistic Regression


How are the parameters of the model, the weights w and bias b, learned? Logistic
regression is an instance of supervised classification in which we know the correct
label y (either 0 or 1) for each observation x. What the system produces via Eq. 5.5
is ŷ, the system’s estimate of the true y. We want to learn parameters (meaning w
and b) that make ŷ for each training observation as close as possible to the true y.
This requires two components that we foreshadowed in the introduction to the
chapter. The first is a metric for how close the current label (ŷ) is to the true gold
label y. Rather than measure similarity, we usually talk about the opposite of this:
the distance between the system output and the gold output, and we call this distance
the loss function or the cost function. In the next section we’ll introduce the loss
function that is commonly used for logistic regression and also for neural networks,
the cross-entropy loss.
The second thing we need is an optimization algorithm for iteratively updating
the weights so as to minimize this loss function. The standard algorithm for this is
gradient descent; we’ll introduce the stochastic gradient descent algorithm in the
following section.

5.3 The cross-entropy loss function


We need a loss function that expresses, for an observation x, how close the classifier
output (ŷ = σ (w · x + b)) is to the correct output (y, which is 0 or 1). We’ll call this:

L(ŷ, y) = How much ŷ differs from the true y (5.8)

We do this via a loss function that prefers the correct class labels of the train-
ing examples to be more likely. This is called conditional maximum likelihood
estimation: we choose the parameters w, b that maximize the log probability of
the true y labels in the training data given the observations x. The resulting loss
function is the negative log likelihood loss, generally called the cross-entropy loss.
Let’s derive this loss function, applied to a single observation x. We’d like to
learn weights that maximize the probability of the correct label p(y|x). Since there
are only two discrete outcomes (1 or 0), this is a Bernoulli distribution, and we can
express the probability p(y|x) that our classifier produces for one observation as
the following (keeping in mind that if y=1, Eq. 5.9 simplifies to ŷ; if y=0, Eq. 5.9
simplifies to 1 − ŷ):

\[
p(y|x) = \hat{y}^{\,y} \, (1 - \hat{y})^{1-y} \tag{5.9}
\]

Now we take the log of both sides. This will turn out to be handy mathematically,
and doesn’t hurt us; whatever values maximize a probability will also maximize the
log of the probability:
 
\[
\begin{aligned}
\log p(y|x) &= \log\left[\hat{y}^{\,y} \, (1 - \hat{y})^{1-y}\right] \\
            &= y \log \hat{y} + (1 - y) \log(1 - \hat{y})
\end{aligned}
\tag{5.10}
\]

Eq. 5.10 describes a log likelihood that should be maximized. In order to turn this
into a loss function (something that we need to minimize), we’ll just flip the sign on
Eq. 5.10. The result is the cross-entropy loss LCE :
\[
L_{CE}(\hat{y}, y) = -\log p(y|x) = -\left[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right] \tag{5.11}
\]
Finally, we can plug in the definition of ŷ = σ (w · x + b):
\[
L_{CE}(\hat{y}, y) = -\left[y \log \sigma(w \cdot x + b) + (1 - y) \log\left(1 - \sigma(w \cdot x + b)\right)\right] \tag{5.12}
\]

Let’s see if this loss function does the right thing for our example from Fig. 5.2. We
want the loss to be smaller if the model’s estimate is close to correct, and bigger if
the model is confused. So first let’s suppose the correct gold label for the sentiment
example in Fig. 5.2 is positive, i.e., y = 1. In this case our model is doing well, since
from Eq. 5.7 it indeed gave the example a higher probability of being positive (.70)
than negative (.30). If we plug σ (w · x + b) = .70 and y = 1 into Eq. 5.12, the right
side of the equation drops out, leading to the following loss (we’ll use log to mean
natural log when the base is not specified):
\[
\begin{aligned}
L_{CE}(\hat{y}, y) &= -\left[y \log \sigma(w \cdot x + b) + (1 - y) \log\left(1 - \sigma(w \cdot x + b)\right)\right] \\
                   &= -\left[\log \sigma(w \cdot x + b)\right] \\
                   &= -\log(.70) \\
                   &= .36
\end{aligned}
\]
By contrast, let’s pretend instead that the example in Fig. 5.2 was actually negative,
i.e., y = 0 (perhaps the reviewer went on to say “But bottom line, the movie is
terrible! I beg you not to see it!”). In this case our model is confused and we’d want
the loss to be higher. Now if we plug y = 0 and 1 − σ (w · x + b) = .30 from Eq. 5.7
into Eq. 5.12, the left side of the equation drops out:
\[
\begin{aligned}
L_{CE}(\hat{y}, y) &= -\left[y \log \sigma(w \cdot x + b) + (1 - y) \log\left(1 - \sigma(w \cdot x + b)\right)\right] \\
                   &= -\left[\log\left(1 - \sigma(w \cdot x + b)\right)\right] \\
                   &= -\log(.30) \\
                   &= 1.20
\end{aligned}
\]
Sure enough, the loss for the first classifier (.36) is less than the loss for the second
classifier (1.20).
Why does minimizing this negative log probability do what we want? A per-
fect classifier would assign probability 1 to the correct outcome (y=1 or y=0) and
probability 0 to the incorrect outcome. That means the higher ŷ (the closer it is
to 1), the better the classifier; the lower ŷ is (the closer it is to 0), the worse the
classifier. The negative log of this probability is a convenient loss metric since it
goes from 0 (negative log of 1, no loss) to infinity (negative log of 0, infinite loss).
This loss function also ensures that as the probability of the correct answer is max-
imized, the probability of the incorrect answer is minimized; since the two sum to
one, any increase in the probability of the correct answer is coming at the expense
of the incorrect answer. It’s called the cross-entropy loss, because Eq. 5.11 is also
the formula for the cross-entropy between the true probability distribution y and our
estimated distribution ŷ.
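The loss for a single observation can be sketched directly from Eq. 5.11. The function name and the input checks are illustrative additions; in particular, the sketch rejects ŷ values of exactly 0 or 1, where the log would be undefined.

```python
import math


def cross_entropy_loss(y_hat: float, y: int) -> float:
    """Binary cross-entropy -[y log y_hat + (1 - y) log(1 - y_hat)]."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    if not 0.0 < y_hat < 1.0:
        raise ValueError("y_hat must be strictly between 0 and 1")
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))
```

Calling it with the same ŷ and each of the two possible gold labels shows the behavior described above: the loss is small when ŷ is close to the gold label and grows without bound as ŷ approaches the wrong label.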
Now we know what we want to minimize; in the next section, we’ll see how to
find the minimum.

5.4 Gradient Descent


Our goal with gradient descent is to find the optimal weights: minimize the loss
function we’ve defined for the model. In Eq. 5.13 below, we’ll explicitly represent
the fact that the loss function L is parameterized by the weights, which we’ll refer
to in machine learning in general as θ (in the case of logistic regression θ = w, b).
So the goal is to find the set of weights which minimizes the loss function, averaged
over all examples:

\[
\hat{\theta} = \operatorname*{argmin}_{\theta} \frac{1}{m} \sum_{i=1}^{m} L_{CE}(f(x^{(i)}; \theta), y^{(i)}) \tag{5.13}
\]
84 C HAPTER 5 • L OGISTIC R EGRESSION

How shall we find the minimum of this (or any) loss function? Gradient descent
is a method that finds a minimum of a function by figuring out in which direction
(in the space of the parameters θ ) the function’s slope is rising the most steeply,
and moving in the opposite direction. The intuition is that if you are hiking in a
canyon and trying to descend most quickly down to the river at the bottom, you might
look around yourself 360 degrees, find the direction where the ground is sloping the
steepest, and walk downhill in that direction.
For logistic regression, this loss function is conveniently convex. A convex function
has just one minimum; there are no local minima to get stuck in, so gradient
descent starting from any point is guaranteed to find the minimum. (By contrast,
the loss for multi-layer neural networks is non-convex, and gradient descent may
get stuck in local minima for neural network training and never find the global optimum.)
Although the algorithm (and the concept of gradient) are designed for direction
vectors, let’s first consider a visualization of the case where the parameter of our
system is just a single scalar w, shown in Fig. 5.3.
Given a random initialization of w at some value w1 , and assuming the loss
function L happened to have the shape in Fig. 5.3, we need the algorithm to tell us
whether at the next iteration we should move left (making w2 smaller than w1 ) or
right (making w2 bigger than w1 ) to reach the minimum.

[Figure 5.3: plot of Loss against w, marking the starting point w1 (where the slope of the loss is negative), one step of gradient descent, and the goal wmin.]
Figure 5.3 The first step in iteratively finding the minimum of this loss function, by moving
w in the reverse direction from the slope of the function. Since the slope is negative, we need
to move w in a positive direction, to the right. Here superscripts are used for learning steps,
so w1 means the initial value of w (which is 0), w2 at the second step, and so on.

The gradient descent algorithm answers this question by finding the gradient
of the loss function at the current point and moving in the opposite direction. The
gradient of a function of many variables is a vector pointing in the direction of the
greatest increase in a function. The gradient is a multi-variable generalization of the
slope, so for a function of one variable like the one in Fig. 5.3, we can informally
think of the gradient as the slope. The dotted line in Fig. 5.3 shows the slope of this
hypothetical loss function at point w = w1 . You can see that the slope of this dotted
line is negative. Thus to find the minimum, gradient descent tells us to go in the
opposite direction: moving w in a positive direction.
The magnitude of the amount to move in gradient descent is the value of the slope
$\frac{d}{dw} f(x; w)$ weighted by a learning rate η. A higher (faster) learning rate means that
we should move w more on each step. The change we make in our parameter is the
learning rate times the gradient (or the slope, in our single-variable example):

\[
w^{t+1} = w^{t} - \eta \frac{d}{dw} f(x; w) \tag{5.14}
\]
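
To make the single-variable update concrete, here is a small sketch. The loss L(w) = (w − 2)² is a hypothetical convex function chosen only for illustration, and `gradient_descent_1d` is our own helper name.

```python
def gradient_descent_1d(df, w0, eta, steps):
    """Repeatedly apply w <- w - eta * df(w) and return every value visited."""
    history = [w0]
    w = w0
    for _ in range(steps):
        w = w - eta * df(w)  # move against the slope
        history.append(w)
    return history

# Hypothetical convex loss L(w) = (w - 2)^2, whose derivative is 2 * (w - 2).
def slope(w):
    return 2.0 * (w - 2.0)

history = gradient_descent_1d(slope, w0=0.0, eta=0.1, steps=50)
print(history[1], history[-1])
```

At w = 0 the slope is negative, so the first step moves w to the right, and later steps shrink as the slope flattens near the minimum at w = 2.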
Now let’s extend the intuition from a function of one scalar variable w to many
variables, because we don’t just want to move left or right, we want to know where
in the N-dimensional space (of the N parameters that make up θ ) we should move.
The gradient is just such a vector; it expresses the directional components of the
sharpest slope along each of those N dimensions. If we’re just imagining two weight
dimensions (say for one weight w and one bias b), the gradient might be a vector with
two orthogonal components, each of which tells us how much the ground slopes in
the w dimension and in the b dimension. Fig. 5.4 shows a visualization of the value
of a 2-dimensional gradient vector taken at the red point.

[Figure 5.4: surface plot of Cost(w,b) over the two dimensions w and b.]
Figure 5.4 Visualization of the gradient vector at the red point in two dimensions w and b,
showing the gradient as a red arrow in the x-y plane.

In an actual logistic regression, the parameter vector w is much longer than 1 or
2, since the input feature vector x can be quite long, and we need a weight wi for
each xi . For each dimension/variable wi in w (plus the bias b), the gradient will have
a component that tells us the slope with respect to that variable. Essentially we’re
asking: “How much would a small change in that variable wi influence the total loss
function L?”
In each dimension wi, we express the slope as a partial derivative ∂/∂wi of the loss
function. The gradient is then defined as a vector of these partials. We’ll represent ŷ
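as f(x; θ) to make the dependence on θ more obvious:

\[
\nabla_{\theta} L(f(x; \theta), y) =
\begin{bmatrix}
\frac{\partial}{\partial w_1} L(f(x; \theta), y) \\
\frac{\partial}{\partial w_2} L(f(x; \theta), y) \\
\vdots \\
\frac{\partial}{\partial w_n} L(f(x; \theta), y)
\end{bmatrix} \tag{5.15}
\]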

The final equation for updating θ based on the gradient is thus

\[
\theta^{t+1} = \theta^{t} - \eta \nabla L(f(x; \theta), y) \tag{5.16}
\]



5.4.1 The Gradient for Logistic Regression


In order to update θ , we need a definition for the gradient ∇L( f (x; θ ), y). Recall that
for logistic regression, the cross-entropy loss function is:
\[
L_{CE}(\hat{y}, y) = -\left[ y \log \sigma(w \cdot x + b) + (1 - y) \log\left(1 - \sigma(w \cdot x + b)\right) \right] \tag{5.17}
\]
It turns out that the derivative of this function for one observation vector x is
Eq. 5.18 (the interested reader can see Section 5.8 for the derivation of this equation):
\[
\frac{\partial L_{CE}(\hat{y}, y)}{\partial w_j} = \left[ \sigma(w \cdot x + b) - y \right] x_j \tag{5.18}
\]
Note in Eq. 5.18 that the gradient with respect to a single weight w j represents a
very intuitive value: the difference between the true y and our estimated ŷ = σ (w ·
x + b) for that observation, multiplied by the corresponding input value x j .
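
One way to gain confidence in Eq. 5.18 is to compare it against a numerical derivative of the loss. The sketch below does this for one observation; the weights, bias and input are arbitrary illustrative values, and all function names are our own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss of Eq. 5.17 for one observation."""
    y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def grad_w(w, b, x, y):
    """Analytic gradient from Eq. 5.18: (sigma(w . x + b) - y) * x_j."""
    y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return [(y_hat - y) * xj for xj in x]

def numeric_grad_w(w, b, x, y, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += eps
        down[j] -= eps
        grads.append((loss(up, b, x, y) - loss(down, b, x, y)) / (2 * eps))
    return grads

w, b, x, y = [0.5, -1.0], 0.1, [3.0, 2.0], 1  # arbitrary illustrative values
analytic = grad_w(w, b, x, y)
numeric = numeric_grad_w(w, b, x, y)
print(analytic, numeric)
```

The two lists should agree to several decimal places, which is a useful sanity check whenever a gradient is derived by hand.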

5.4.2 The Stochastic Gradient Descent Algorithm


Stochastic gradient descent is an online algorithm that minimizes the loss function
by computing its gradient after each training example, and nudging θ in the right
direction (the opposite direction of the gradient). Fig. 5.5 shows the algorithm.

function STOCHASTIC GRADIENT DESCENT(L(), f(), x, y) returns θ
     # where: L is the loss function
     #        f is a function parameterized by θ
     #        x is the set of training inputs x(1), x(2), ..., x(m)
     #        y is the set of training outputs (labels) y(1), y(2), ..., y(m)

θ ← 0
repeat til done   # see caption
   For each training tuple (x(i), y(i)) (in random order)
      1. Optional (for reporting):         # How are we doing on this tuple?
         Compute ŷ(i) = f(x(i); θ)         # What is our estimated output ŷ?
         Compute the loss L(ŷ(i), y(i))    # How far off is ŷ(i) from the true output y(i)?
      2. g ← ∇θ L(f(x(i); θ), y(i))        # How should we move θ to maximize loss?
      3. θ ← θ − η g                       # Go the other way instead
return θ

Figure 5.5 The stochastic gradient descent algorithm. Step 1 (computing the loss) is used
to report how well we are doing on the current tuple. The algorithm can terminate when it
converges (or when the gradient norm < ε), or when progress halts (for example when the
loss starts going up on a held-out set).
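
The pseudocode in Fig. 5.5 can be sketched in plain Python for binary logistic regression. This is an illustration under stated assumptions rather than a reference implementation: a fixed number of epochs stands in for "repeat til done", the toy data is made up, and `sgd_logistic_regression` is our own name.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic_regression(xs, ys, eta=0.1, epochs=100, seed=0):
    """Stochastic gradient descent for binary logistic regression.

    Visits the training tuples in random order and updates
    theta = (w, b) after every single example, as in Fig. 5.5.
    """
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])  # theta <- 0
    b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):  # assumption: fixed epoch count instead of a convergence test
        rng.shuffle(order)
        for i in order:
            y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, xs[i])) + b)
            err = y_hat - ys[i]  # sigma(w . x + b) - y, shared by every partial derivative
            w = [wj - eta * err * xj for wj, xj in zip(w, xs[i])]
            b = b - eta * err
    return w, b

# Made-up toy data: [count of positive words, count of negative words].
xs = [[3, 0], [2, 1], [0, 3], [1, 2]]
ys = [1, 1, 0, 0]

w, b = sgd_logistic_regression(xs, ys)
predictions = [
    1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) > 0.5 else 0
    for x in xs
]
print(w, b, predictions)
```

A production implementation would normally add the stopping criteria described in the caption, such as monitoring the loss on a held-out set.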

The learning rate η is a hyperparameter that must be adjusted. If it’s too high,
the learner will take steps that are too large, overshooting the minimum of the loss
function. If it’s too low, the learner will take steps that are too small, and take too
long to get to the minimum. It is common to start with a higher learning rate and then
slowly decrease it, so that it is a function of the iteration k of training; the notation
ηk can be used to mean the value of the learning rate at iteration k.
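
As a small illustration of a learning rate that depends on the iteration k, the sketch below uses inverse-time decay. The particular schedule and its constants (`eta0`, `decay`) are assumptions chosen for the example; the text does not prescribe a specific formula.

```python
def learning_rate(k: int, eta0: float = 0.1, decay: float = 0.01) -> float:
    """Hypothetical inverse-time decay: eta_k = eta0 / (1 + decay * k)."""
    return eta0 / (1.0 + decay * k)

# Learning rate at the start, after 100 iterations, and after 1000 iterations.
schedule = [learning_rate(k) for k in (0, 100, 1000)]
print(schedule)
```

The rate starts at `eta0` and shrinks steadily, so early steps are large and later steps are small.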
We’ll discuss hyperparameters in more detail in Chapter 7, but briefly they are
a special kind of parameter for any machine learning model. Unlike regular param-
eters of a model (weights like w and b), which are learned by the algorithm from
the training set, hyperparameters are special parameters chosen by the algorithm
designer that affect how the algorithm works.

5.4.3 Working through an example


Let’s walk through a single step of the gradient descent algorithm. We’ll use a simplified
version of the example in Fig. 5.2 as it sees a single observation x, whose
correct value is y = 1 (this is a positive review), and with only two features:

x1 = 3 (count of positive lexicon words)


x2 = 2 (count of negative lexicon words)

Let’s assume the initial weights and bias in $\theta^{0}$ are all set to 0, and the initial learning
rate η is 0.1:

w1 = w2 = b = 0
η = 0.1

The single update step requires that we compute the gradient, multiplied by the
learning rate

\[
\theta^{t+1} = \theta^{t} - \eta \nabla_{\theta} L(f(x^{(i)}; \theta), y^{(i)})
\]

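In our mini example there are three parameters, so the gradient vector has 3 dimensions,
for w1, w2, and b. We can compute the first gradient as follows:

\[
\nabla_{w,b} =
\begin{bmatrix}
\frac{\partial L_{CE}(\hat{y}, y)}{\partial w_1} \\
\frac{\partial L_{CE}(\hat{y}, y)}{\partial w_2} \\
\frac{\partial L_{CE}(\hat{y}, y)}{\partial b}
\end{bmatrix}
=
\begin{bmatrix}
(\sigma(w \cdot x + b) - y) x_1 \\
(\sigma(w \cdot x + b) - y) x_2 \\
\sigma(w \cdot x + b) - y
\end{bmatrix}
=
\begin{bmatrix}
(\sigma(0) - 1) x_1 \\
(\sigma(0) - 1) x_2 \\
\sigma(0) - 1
\end{bmatrix}
=
\begin{bmatrix}
-0.5 x_1 \\
-0.5 x_2 \\
-0.5
\end{bmatrix}
=
\begin{bmatrix}
-1.5 \\
-1.0 \\
-0.5
\end{bmatrix}
\]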

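Now that we have a gradient, we compute the new parameter vector $\theta^{1}$ by moving
$\theta^{0}$ in the opposite direction from the gradient:

\[
\theta^{1} =
\begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix}
- \eta
\begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix}
=
\begin{bmatrix} .15 \\ .1 \\ .05 \end{bmatrix}
\]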

So after one step of gradient descent, the weights have shifted to be: w1 = .15,
w2 = .1, and b = .05.
Note that this observation x happened to be a positive example. We would expect
that after seeing more negative examples with high counts of negative words,
the weight w2 would shift to have a negative value.
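
The single update step worked through above can be checked in code. The sketch uses only the values given in the example (x1 = 3, x2 = 2, y = 1, all parameters initialized to 0, η = 0.1); the variable names are our own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [3.0, 2.0]   # x1: positive lexicon count, x2: negative lexicon count
y = 1            # positive review
w = [0.0, 0.0]   # initial weights
b = 0.0          # initial bias
eta = 0.1        # learning rate

# Forward pass: sigma(0) = 0.5 because every parameter starts at 0.
y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Gradient: (sigma(w . x + b) - y) * x_j for each weight, and (sigma(w . x + b) - y) for the bias.
grad_w = [(y_hat - y) * xj for xj in x]
grad_b = y_hat - y

# One gradient descent step in the direction opposite to the gradient.
w_new = [wj - eta * gj for wj, gj in zip(w, grad_w)]
b_new = b - eta * grad_b

print(grad_w, grad_b, w_new, b_new)
```

Both weights move in the positive direction after this one positive example, including w2, which is why more data is needed before w2 turns negative.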

5.4.4 Mini-batch training


Stochastic gradient descent is called stochastic because it chooses a single random
example at a time, moving the weights so as to improve performance on that single
example. That can result in very choppy movements, so it’s common to compute the
gradient over batches of training instances rather than a single instance.
For example in batch training we compute the gradient over the entire dataset.
By seeing so many examples, batch training offers a superb estimate of which direction
to move the weights, at the cost of spending a lot of time processing every
single example in the training set to compute this perfect direction.
A compromise is mini-batch training: we train on a group of m examples (perhaps
512, or 1024) that is less than the whole dataset. (If m is the size of the dataset,
then we are doing batch gradient descent; if m = 1, we are back to doing stochastic
gradient descent.) Mini-batch training also has the advantage of computational
efficiency. The mini-batches can easily be vectorized, choosing the size of the
mini-batch based on the computational resources. This allows us to process all the
examples in one mini-batch in parallel and then accumulate the loss, something that’s not
possible with individual or batch training.
We just need to define mini-batch versions of the cross-entropy loss function
we defined in Section 5.3 and the gradient in Section 5.4.1. Let’s extend the cross-
entropy loss for one example from Eq. 5.11 to mini-batches of size m. We’ll continue
to use the notation that x(i) and y(i) mean the ith training features and training label,
respectively. We make the assumption that the training examples are independent:
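\[
\begin{aligned}
\log p(\text{training labels}) &= \log \prod_{i=1}^{m} p(y^{(i)} | x^{(i)}) \\
&= \sum_{i=1}^{m} \log p(y^{(i)} | x^{(i)}) \\
&= -\sum_{i=1}^{m} L_{CE}(\hat{y}^{(i)}, y^{(i)})
\end{aligned} \tag{5.19}
\]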

Now the cost function for the mini-batch of m examples is the average loss for each
example:
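\[
\begin{aligned}
\text{Cost}(\hat{y}, y) &= \frac{1}{m} \sum_{i=1}^{m} L_{CE}(\hat{y}^{(i)}, y^{(i)}) \\
&= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \sigma(w \cdot x^{(i)} + b) + (1 - y^{(i)}) \log\left(1 - \sigma(w \cdot x^{(i)} + b)\right) \right]
\end{aligned} \tag{5.20}
\]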

The mini-batch gradient is the average of the individual gradients from Eq. 5.18:
\[
\frac{\partial \text{Cost}(\hat{y}, y)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left[ \sigma(w \cdot x^{(i)} + b) - y^{(i)} \right] x_j^{(i)} \tag{5.21}
\]
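
The mini-batch cost and gradient can be sketched directly from Eq. 5.20 and Eq. 5.21. The two-example batch below is made up, the function name is our own, and a real implementation would typically vectorize the loop with a matrix library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def minibatch_cost_and_grad(w, b, xs, ys):
    """Average cross-entropy (Eq. 5.20) and its gradient (Eq. 5.21) over one mini-batch."""
    m = len(xs)
    cost = 0.0
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in zip(xs, ys):
        y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        cost += -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
        for j, xj in enumerate(x):
            grad_w[j] += (y_hat - y) * xj  # accumulate per-example gradients
        grad_b += y_hat - y
    # Divide by m to turn the sums into averages.
    return cost / m, [g / m for g in grad_w], grad_b / m

xs = [[3.0, 2.0], [1.0, 4.0]]  # made-up mini-batch of two examples
ys = [1, 0]
cost, grad_w, grad_b = minibatch_cost_and_grad([0.0, 0.0], 0.0, xs, ys)
print(cost, grad_w, grad_b)
```

Averaging lets the two examples partly cancel each other, which is the smoothing effect that makes mini-batch updates less choppy than single-example updates.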

5.5 Regularization

Numquam ponenda est pluralitas sine necessitate


‘Plurality should never be proposed unless needed’
William of Occam

There is a problem with learning weights that make the model perfectly match the
training data. If a feature is perfectly predictive of the outcome because it happens
to only occur in one class, it will be assigned a very high weight. The weights for
features will attempt to perfectly fit details of the training set, in fact too perfectly,
modeling noisy factors that just accidentally correlate with the class. This problem is
called overfitting. A good model should be able to generalize well from the training
data to the unseen test set, but a model that overfits will have poor generalization.
To avoid overfitting, a new regularization term R(θ) is added to the objective
function in Eq. 5.13, resulting in the following objective for a batch of m examples
(slightly rewritten from Eq. 5.13 to be maximizing log probability rather than
minimizing loss, and removing the 1/m term which doesn’t affect the argmax):

\[
\hat{\theta} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log P(y^{(i)} | x^{(i)}) - \alpha R(\theta) \tag{5.22}
\]

The new regularization term R(θ) is used to penalize large weights. Thus a setting
of the weights that matches the training data perfectly, but uses many weights with
high values to do so, will be penalized more than a setting that matches the data a
little less well, but does so using smaller weights. There are two common ways to
compute this regularization term R(θ). L2 regularization is a quadratic function of
the weight values, named because it uses the (square of the) L2 norm of the weight
values. The L2 norm, ||θ||2, is the same as the Euclidean distance of the vector θ
from the origin. If θ consists of n weights, then:

\[
R(\theta) = ||\theta||_2^2 = \sum_{j=1}^{n} \theta_j^2 \tag{5.23}
\]

The L2 regularized objective function becomes:

\[
\hat{\theta} = \operatorname*{argmax}_{\theta} \left[ \sum_{i=1}^{m} \log P(y^{(i)} | x^{(i)}) \right] - \alpha \sum_{j=1}^{n} \theta_j^2 \tag{5.24}
\]

L1 regularization is a linear function of the weight values, named after the L1 norm
||θ||1, the sum of the absolute values of the weights, or Manhattan distance (the
Manhattan distance is the distance you’d have to walk between two points in a city
with a street grid like New York):

\[
R(\theta) = ||\theta||_1 = \sum_{i=1}^{n} |\theta_i| \tag{5.25}
\]

The L1 regularized objective function becomes:

\[
\hat{\theta} = \operatorname*{argmax}_{\theta} \left[ \sum_{i=1}^{m} \log P(y^{(i)} | x^{(i)}) \right] - \alpha \sum_{j=1}^{n} |\theta_j| \tag{5.26}
\]

These kinds of regularization come from statistics, where L1 regularization is called
lasso regression (Tibshirani, 1996) and L2 regularization is called ridge regression,
and both are commonly used in language processing. L2 regularization is easier to
optimize because of its simple derivative (the derivative of θ² is just 2θ), while
L1 regularization is more complex (the derivative of |θ| is non-continuous at zero).
But where L2 prefers weight vectors with many small weights, L1 prefers sparse
solutions with some larger weights but many more weights set to zero. Thus L1
regularization leads to much sparser weight vectors, that is, far fewer features.
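
The two penalty terms are simple to compute. The sketch below evaluates both on a made-up weight vector and also shows the L2 derivative mentioned above; the function names are our own.

```python
def l2_penalty(theta):
    """R(theta) = sum_j theta_j^2, the squared L2 norm (Eq. 5.23)."""
    return sum(t * t for t in theta)

def l1_penalty(theta):
    """R(theta) = sum_j |theta_j|, the L1 norm (Eq. 5.25)."""
    return sum(abs(t) for t in theta)

def l2_penalty_grad(theta):
    """Gradient of the L2 penalty: the derivative of theta_j^2 is 2 * theta_j."""
    return [2.0 * t for t in theta]

theta = [0.5, -2.0, 0.0, 1.5]  # made-up weights

print(l2_penalty(theta), l1_penalty(theta), l2_penalty_grad(theta))
```

Note how the single large weight dominates the L2 penalty, since squaring punishes large values disproportionately, while it contributes only in proportion to its size under L1.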
Both L1 and L2 regularization have Bayesian interpretations as constraints on
the prior of how weights should look. L1 regularization can be viewed as a Laplace
prior on the weights. L2 regularization corresponds to assuming that weights are
distributed according to a Gaussian distribution with mean µ = 0. In a Gaussian
or normal distribution, the further away a value is from the mean, the lower its
probability (scaled by the variance σ²). By using a Gaussian prior on the weights, we
are saying that weights prefer to have the value 0. A Gaussian for a weight θj is

\[
\frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left( -\frac{(\theta_j - \mu_j)^2}{2\sigma_j^2} \right) \tag{5.27}
\]

If we multiply each weight by a Gaussian prior on the weight, we are thus maximiz-
ing the following constraint:
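\[
\hat{\theta} = \operatorname*{argmax}_{\theta} \prod_{i=1}^{m} P(y^{(i)} | x^{(i)}) \times \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left( -\frac{(\theta_j - \mu_j)^2}{2\sigma_j^2} \right) \tag{5.28}
\]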

which in log space, with µ = 0, and assuming 2σ 2 = 1, corresponds to


\[
\hat{\theta} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log P(y^{(i)} | x^{(i)}) - \alpha \sum_{j=1}^{n} \theta_j^2 \tag{5.29}
\]

which is in the same form as Eq. 5.24.
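
The move from the product form to the log-space form can be spelled out. The following is a sketch of the intermediate steps, assuming (for simplicity, and beyond what the text states) that every weight shares the same variance σ² and that µ = 0:

```latex
\begin{aligned}
\hat{\theta}
&= \operatorname*{argmax}_{\theta} \; \log \left[ \prod_{i=1}^{m} P(y^{(i)} | x^{(i)})
   \times \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\theta_j^2}{2\sigma^2} \right) \right] \\
&= \operatorname*{argmax}_{\theta} \; \sum_{i=1}^{m} \log P(y^{(i)} | x^{(i)})
   - \sum_{j=1}^{n} \frac{1}{2} \log (2\pi\sigma^2)
   - \frac{1}{2\sigma^2} \sum_{j=1}^{n} \theta_j^2 \\
&= \operatorname*{argmax}_{\theta} \; \sum_{i=1}^{m} \log P(y^{(i)} | x^{(i)})
   - \frac{1}{2\sigma^2} \sum_{j=1}^{n} \theta_j^2
\end{aligned}
```

The middle term does not depend on θ, so it can be dropped without changing the argmax. Under these assumptions the regularization strength plays the role of 1/(2σ²): a narrower prior (smaller σ²) corresponds to a stronger penalty on large weights.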

5.6 Multinomial logistic regression


Sometimes we need more than two classes. Perhaps we might want to do 3-way
sentiment classification (positive, negative, or neutral). Or we could be assigning
some of the labels we will introduce in Chapter 8, like the part of speech of a word
(choosing from 10, 30, or even 50 different parts of speech), or the named entity
type of a phrase (choosing from tags like person, location, organization).
In such cases we use multinomial logistic regression, also called softmax regression
(or, historically, the maxent classifier). In multinomial logistic regression
the target y is a variable that ranges over more than two classes; we want to know
the probability of y being in each potential class c ∈ C, p(y = c|x).
The multinomial logistic classifier uses a generalization of the sigmoid, called
the softmax function, to compute the probability p(y = c|x). The softmax function
takes a vector z = [z1, z2, ..., zk] of k arbitrary values and maps them to a probability
distribution, with each value in the range (0,1), and all the values summing to 1.
Like the sigmoid, it is an exponential function.
For a vector z of dimensionality k, the softmax is defined as:

\[
\text{softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{k} \exp(z_j)} \qquad 1 \le i \le k \tag{5.30}
\]

The softmax of an input vector z = [z1, z2, ..., zk] is thus a vector itself:

\[
\text{softmax}(z) = \left[ \frac{\exp(z_1)}{\sum_{i=1}^{k} \exp(z_i)}, \frac{\exp(z_2)}{\sum_{i=1}^{k} \exp(z_i)}, \ldots, \frac{\exp(z_k)}{\sum_{i=1}^{k} \exp(z_i)} \right] \tag{5.31}
\]
The denominator $\sum_{i=1}^{k} \exp(z_i)$ is used to normalize all the values into probabilities.
Thus for example given a vector:

z = [0.6, 1.1, −1.5, 1.2, 3.2, −1.1]

the resulting (rounded) softmax(z) is

[0.055, 0.090, 0.007, 0.100, 0.738, 0.010]
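
The softmax computation for this vector can be sketched as follows. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not discussed in the text; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math

def softmax(z):
    """Map a list of arbitrary scores to a probability distribution (Eq. 5.30)."""
    z_max = max(z)  # shift scores so the largest exponent is exp(0) = 1
    exps = [math.exp(zi - z_max) for zi in z]
    total = sum(exps)  # the normalizing denominator
    return [e / total for e in exps]

z = [0.6, 1.1, -1.5, 1.2, 3.2, -1.1]
probs = softmax(z)
print([round(p, 3) for p in probs])
```

The largest input, 3.2, receives most of the probability mass, which illustrates how the softmax exaggerates differences between scores.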

Again like the sigmoid, the input to the softmax will be the dot product between
a weight vector w and an input vector x (plus a bias). But now we’ll need separate
weight vectors (and bias) for each of the K classes.

\[
p(y = c | x) = \frac{\exp(w_c \cdot x + b_c)}{\sum_{j=1}^{K} \exp(w_j \cdot x + b_j)} \tag{5.32}
\]

Like the sigmoid, the softmax has the property of squashing values toward 0 or 1.
Thus if one of the inputs is larger than the others, it will tend to push its probability
toward 1, and suppress the probabilities of the smaller inputs.

5.6.1 Features in Multinomial Logistic Regression


Features in multinomial logistic regression function similarly to binary logistic regression,
with one difference that we’ll need separate weight vectors (and biases) for
each of the K classes. Recall our binary exclamation point feature x5 from page 79:

\[
x_5 = \begin{cases} 1 & \text{if ``!''} \in \text{doc} \\ 0 & \text{otherwise} \end{cases}
\]

In binary classification a positive weight w5 on a feature influences the classifier
toward y = 1 (positive sentiment) and a negative weight influences it toward y = 0
(negative sentiment) with the absolute value indicating how important the feature
is. For multinomial logistic regression, by contrast, with separate weights for each
class, a feature can be evidence for or against each individual class.
In 3-way multiclass sentiment classification, for example, we must assign each
document one of the 3 classes +, −, or 0 (neutral). Now a feature related to excla-
mation marks might have a negative weight for 0 documents, and a positive weight
for + or − documents:

Feature   Definition                       w5,+    w5,−    w5,0
f5(x)     1 if “!” ∈ doc, 0 otherwise      3.5     3.1     −5.3

5.6.2 Learning in Multinomial Logistic Regression


The loss function for multinomial logistic regression generalizes the loss function
for binary logistic regression from 2 to K classes. Recall that the cross-entropy
loss for binary logistic regression (repeated from Eq. 5.11) is:

LCE (ŷ, y) = − log p(y|x) = − [y log ŷ + (1 − y) log(1 − ŷ)] (5.33)



The loss function for multinomial logistic regression generalizes the two terms in
Eq. 5.33 (one that is non-zero when y = 1 and one that is non-zero when y = 0) to K
terms. The loss function for a single example x is thus the sum of the logs of the K
output classes, each weighted by yk, the probability of the true class:

\[
\begin{aligned}
L_{CE}(\hat{y}, y) &= -\sum_{k=1}^{K} y_k \log \hat{y}_k \\
&= -\sum_{k=1}^{K} y_k \log \hat{p}(y = k | x)
\end{aligned} \tag{5.34}
\]

Because only one class (let’s call it i) is the correct one, the vector y takes the value
1 only for this value of k, i.e., has yi = 1 and yj = 0 ∀ j ≠ i. A vector like this,
with one value=1 and the rest 0, is called a one-hot vector. The terms in the sum in
Eq. 5.34 will thus be 0 except for the term corresponding to the true class, i.e.:

\[
\begin{aligned}
L_{CE}(\hat{y}, y) &= -\sum_{k=1}^{K} 1\{y = k\} \log \hat{p}(y = k | x) \\
&= -\sum_{k=1}^{K} 1\{y = k\} \log \frac{\exp(w_k \cdot x + b_k)}{\sum_{j=1}^{K} \exp(w_j \cdot x + b_j)}
\end{aligned} \tag{5.35}
\]

Hence the cross-entropy loss is simply the negative log of the output probability
corresponding to the correct class, and we therefore also call this the negative log
likelihood loss:

\[
\begin{aligned}
L_{CE}(\hat{y}, y) &= -\log \hat{y}_k \qquad \text{(where } k \text{ is the correct class)} \\
&= -\log \frac{\exp(w_k \cdot x + b_k)}{\sum_{j=1}^{K} \exp(w_j \cdot x + b_j)} \qquad \text{(where } k \text{ is the correct class)}
\end{aligned} \tag{5.36}
\]

The gradient for a single example turns out to be very similar to the gradient
for binary logistic regression, although we don’t show the derivation here. It is the
difference between the value for the true class k (which is 1) and the probability the
classifier outputs for class k, weighted by the value of the input xi corresponding to
the ith element wk,i of the weight vector for class k:

\[
\begin{aligned}
\frac{\partial L_{CE}}{\partial w_{k,i}} &= -\left( 1\{y = k\} - p(y = k | x) \right) x_i \\
&= -\left( 1\{y = k\} - \frac{\exp(w_k \cdot x + b_k)}{\sum_{j=1}^{K} \exp(w_j \cdot x + b_j)} \right) x_i
\end{aligned} \tag{5.37}
\]
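
The multinomial loss and gradient can be sketched together. The example below assumes three classes, two input features, and all-zero initial weights; these choices and the function names are ours and serve only as an illustration.

```python
import math

def softmax(z):
    z_max = max(z)
    exps = [math.exp(zi - z_max) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def multinomial_loss_and_grad(W, b, x, y):
    """Negative log likelihood (Eq. 5.36) and its weight gradient (Eq. 5.37).

    W[k] is the weight vector for class k, b[k] its bias,
    and y is the index of the correct class.
    """
    K = len(W)
    scores = [sum(wki * xi for wki, xi in zip(W[k], x)) + b[k] for k in range(K)]
    probs = softmax(scores)
    loss = -math.log(probs[y])
    grad_W = [
        [-((1.0 if k == y else 0.0) - probs[k]) * xi for xi in x]
        for k in range(K)
    ]
    return loss, grad_W

W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 3 classes, 2 features, all zeros
b = [0.0, 0.0, 0.0]
loss, grad_W = multinomial_loss_and_grad(W, b, x=[3.0, 2.0], y=0)
print(loss, grad_W)
```

With all-zero weights every class gets probability 1/3, so the gradient for the correct class is negative (a descent step raises its weights) and the gradients for the other classes are positive (a descent step lowers theirs).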

5.7 Interpreting models


Often we want to know more than just the correct classification of an observation.
We want to know why the classifier made the decision it did. That is, we want our
decision to be interpretable. Interpretability can be hard to define strictly, but the
core idea is that as humans we should know why our algorithms reach the conclusions
they do. Because the features to logistic regression are often human-designed,
one way to understand a classifier’s decision is to understand the role each feature
plays in the decision. Logistic regression can be combined with statistical tests (the
likelihood ratio test, or the Wald test); investigating whether a particular feature is
significant by one of these tests, or inspecting its magnitude (how large is the weight
w associated with the feature?) can help us interpret why the classifier made the
decision it makes. This is enormously important for building transparent models.
Furthermore, in addition to its use as a classifier, logistic regression in NLP and
many other fields is widely used as an analytic tool for testing hypotheses about the
effect of various explanatory variables (features). In text classification, perhaps we
want to know if logically negative words (no, not, never) are more likely to be asso-
ciated with negative sentiment, or if negative reviews of movies are more likely to
discuss the cinematography. However, in doing so it’s necessary to control for po-
tential confounds: other factors that might influence sentiment (the movie genre, the
year it was made, perhaps the length of the review in words). Or we might be study-
ing the relationship between NLP-extracted linguistic features and non-linguistic
outcomes (hospital readmissions, political outcomes, or product sales), but need to
control for confounds (the age of the patient, the county of voting, the brand of the
product). In such cases, logistic regression allows us to test whether some feature is
associated with some outcome above and beyond the effect of other features.

5.8 Advanced: Deriving the Gradient Equation


In this section we give the derivation of the gradient of the cross-entropy loss func-
tion LCE for logistic regression. Let’s start with some quick calculus refreshers.
First, the derivative of ln(x):

\[
\frac{d}{dx} \ln(x) = \frac{1}{x} \tag{5.38}
\]
Second, the (very elegant) derivative of the sigmoid:

\[
\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z)) \tag{5.39}
\]
Finally, the chain rule of derivatives. Suppose we are computing the derivative
of a composite function f(x) = u(v(x)). The derivative of f(x) is the derivative of
u(x) with respect to v(x) times the derivative of v(x) with respect to x:

\[
\frac{df}{dx} = \frac{du}{dv} \cdot \frac{dv}{dx} \tag{5.40}
\]
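First, we want to know the derivative of the loss function with respect to a single
weight wj (we’ll need to compute it for each weight, and for the bias):

\[
\begin{aligned}
\frac{\partial L_{CE}}{\partial w_j} &= -\frac{\partial}{\partial w_j} \left[ y \log \sigma(w \cdot x + b) + (1 - y) \log\left(1 - \sigma(w \cdot x + b)\right) \right] \\
&= -\left[ \frac{\partial}{\partial w_j} y \log \sigma(w \cdot x + b) + \frac{\partial}{\partial w_j} (1 - y) \log\left[1 - \sigma(w \cdot x + b)\right] \right]
\end{aligned} \tag{5.41}
\]

Next, using the chain rule, and relying on the derivative of log:

\[
\frac{\partial L_{CE}}{\partial w_j} = -\frac{y}{\sigma(w \cdot x + b)} \frac{\partial}{\partial w_j} \sigma(w \cdot x + b) - \frac{1 - y}{1 - \sigma(w \cdot x + b)} \frac{\partial}{\partial w_j} \left[ 1 - \sigma(w \cdot x + b) \right] \tag{5.42}
\]

Rearranging terms:

\[
\frac{\partial L_{CE}}{\partial w_j} = -\left[ \frac{y}{\sigma(w \cdot x + b)} - \frac{1 - y}{1 - \sigma(w \cdot x + b)} \right] \frac{\partial}{\partial w_j} \sigma(w \cdot x + b) \tag{5.43}
\]

And now plugging in the derivative of the sigmoid, and using the chain rule one
more time, we end up with Eq. 5.44:

\[
\begin{aligned}
\frac{\partial L_{CE}}{\partial w_j} &= -\left[ \frac{y - \sigma(w \cdot x + b)}{\sigma(w \cdot x + b)[1 - \sigma(w \cdot x + b)]} \right] \sigma(w \cdot x + b)[1 - \sigma(w \cdot x + b)] \frac{\partial (w \cdot x + b)}{\partial w_j} \\
&= -\left[ \frac{y - \sigma(w \cdot x + b)}{\sigma(w \cdot x + b)[1 - \sigma(w \cdot x + b)]} \right] \sigma(w \cdot x + b)[1 - \sigma(w \cdot x + b)] x_j \\
&= -[y - \sigma(w \cdot x + b)] x_j \\
&= [\sigma(w \cdot x + b) - y] x_j
\end{aligned} \tag{5.44}
\]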
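
The derivation leans on the sigmoid derivative of Eq. 5.39, which is easy to sanity-check numerically. The sketch below compares the closed form with a central finite difference at a few arbitrarily chosen points; the helper names are our own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Closed-form derivative from Eq. 5.39: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(f, z, eps=1e-6):
    """Numerical estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

points = (-3.0, -0.5, 0.0, 0.5, 3.0)  # arbitrary test points
max_gap = max(abs(sigmoid_prime(z) - central_difference(sigmoid, z)) for z in points)
print(max_gap)
```

The gap between the two estimates should be tiny, and the derivative peaks at z = 0, where σ(z) = 0.5.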

5.9 Summary
This chapter introduced the logistic regression model of classification.
• Logistic regression is a supervised machine learning classifier that extracts
real-valued features from the input, multiplies each by a weight, sums them,
and passes the sum through a sigmoid function to generate a probability. A
threshold is used to make a decision.
• Logistic regression can be used with two classes (e.g., positive and negative
sentiment) or with multiple classes (multinomial logistic regression, for ex-
ample for n-ary text classification, part-of-speech labeling, etc.).
• Multinomial logistic regression uses the softmax function to compute proba-
bilities.
• The weights (vector w and bias b) are learned from a labeled training set via a
loss function, such as the cross-entropy loss, that must be minimized.
• Minimizing this loss function is a convex optimization problem, and iterative
algorithms like gradient descent are used to find the optimal weights.
• Regularization is used to avoid overfitting.
• Logistic regression is also one of the most useful analytic tools, because of its
ability to transparently study the importance of individual features.

Bibliographical and Historical Notes


Logistic regression was developed in the field of statistics, where it was used for
the analysis of binary data by the 1960s, and was particularly common in medicine
(Cox, 1969). Starting in the late 1970s it became widely used in linguistics as one

of the formal foundations of the study of linguistic variation (Sankoff and Labov,
1979).
Nonetheless, logistic regression didn’t become common in natural language pro-
cessing until the 1990s, when it seems to have appeared simultaneously from two
directions. The first source was the neighboring fields of information retrieval and
speech processing, both of which had made use of regression, and both of which
lent many other statistical techniques to NLP. Indeed a very early use of logistic
regression for document routing was one of the first NLP applications to use (LSI)
embeddings as word representations (Schütze et al., 1995).
At the same time in the early 1990s logistic regression was developed and applied
to NLP at IBM Research under the name maximum entropy modeling or
maxent (Berger et al., 1996), seemingly independent of the statistical literature. Under
that name it was applied to language modeling (Rosenfeld, 1996), part-of-speech
tagging (Ratnaparkhi, 1996), parsing (Ratnaparkhi, 1997), coreference resolution
(Kehler, 1997b), and text classification (Nigam et al., 1999).
More on classification can be found in machine learning textbooks (Hastie et al. 2001,
Witten and Frank 2005, Bishop 2006, Murphy 2012).

Exercises
CHAPTER 6

Vector Semantics and Embeddings
荃者所以在鱼,得鱼而忘荃 Nets are for fish;
Once you get the fish, you can forget the net.
言者所以在意,得意而忘言 Words are for meaning;
Once you get the meaning, you can forget the words
庄子(Zhuangzi), Chapter 26

The asphalt that Los Angeles is famous for occurs mainly on its freeways. But
in the middle of the city is another patch of asphalt, the La Brea tar pits, and this
asphalt preserves millions of fossil bones from the last of the Ice Ages of the Pleistocene
Epoch. One of these fossils is the Smilodon, or saber-toothed tiger, instantly
recognizable by its long canines. Five million years ago or so, a completely different
saber-toothed tiger called Thylacosmilus lived in Argentina and other parts of South
America. Thylacosmilus was a marsupial whereas Smilodon was a placental mammal,
but Thylacosmilus had the same long upper canines and, like Smilodon, had a
protective bone flange on the lower jaw. The similarity of these two mammals is one
of many examples of parallel or convergent evolution, in which particular contexts or
environments lead to the evolution of very similar structures in different species (Gould, 1980).
The role of context is also important in the similarity of a less biological kind
of organism: the word. Words that occur in similar contexts tend to have similar
meanings. This link between similarity in how words are distributed and similarity
in what they mean is called the distributional hypothesis. The hypothesis was
first formulated in the 1950s by linguists like Joos (1950), Harris (1954), and Firth
(1957), who noticed that words which are synonyms (like oculist and eye-doctor)
tended to occur in the same environment (e.g., near words like eye or examined)
with the amount of meaning difference between two words “corresponding roughly
to the amount of difference in their environments” (Harris, 1954, 157).
In this chapter we introduce vector semantics, which instantiates this linguistic
hypothesis by learning representations of the meaning of words, called embeddings,
directly from their distributions in texts. These representations are used in every nat-
ural language processing application that makes use of meaning, and the static em-
beddings we introduce here underlie the more powerful dynamic or contextualized
embeddings like BERT that we will see in Chapter 10.
These word representations are also the first example in this book of representation
learning, automatically learning useful representations of the input text.
Finding such self-supervised ways to learn representations of the input, instead of
creating representations by hand via feature engineering, is an important focus of
NLP research (Bengio et al., 2013).

6.1 Lexical Semantics


Let’s begin by introducing some basic principles of word meaning. How should we
represent the meaning of a word? In the n-gram models of Chapter 3, and in classical
NLP applications, our only representation of a word is as a string of letters, or an
index in a vocabulary list. This representation is not that different from a tradition
in philosophy (perhaps you’ve seen it in introductory logic classes) in which the
meaning of words is represented by just spelling the word with small capital letters;
representing the meaning of “dog” as DOG, and “cat” as CAT.
Representing the meaning of a word by capitalizing it is a pretty unsatisfactory
model. You might have seen a joke due originally to semanticist Barbara Partee
(Carlson, 1977):
Q: What’s the meaning of life?
A: LIFE’
Surely we can do better than this! After all, we’ll want a model of word meaning
to do all sorts of things for us. It should tell us that some words have similar mean-
ings (cat is similar to dog), others are antonyms (cold is the opposite of hot), some
have positive connotations (happy) while others have negative connotations (sad).
It should represent the fact that the meanings of buy, sell, and pay offer differing
perspectives on the same underlying purchasing event (If I buy something from you,
you’ve probably sold it to me, and I likely paid you). More generally, a model of
word meaning should allow us to draw inferences to address meaning-related tasks
like question-answering or dialogue.
In this section we summarize some of these desiderata, drawing on results in the
linguistic study of word meaning, which is called lexical semantics; we’ll return to
and expand on this list in Chapter 18 and Chapter 10.
Lemmas and Senses Let’s start by looking at how one word (we’ll choose mouse)
might be defined in a dictionary (simplified from the online dictionary WordNet):
mouse (N)
1. any of numerous small rodents...
2. a hand-operated device that controls a cursor...
Here the form mouse is the lemma, also called the citation form. The form
mouse would also be the lemma for the word mice; dictionaries don’t have separate
definitions for inflected forms like mice. Similarly sing is the lemma for sing, sang,
sung. In many languages the infinitive form is used as the lemma for the verb, so
Spanish dormir “to sleep” is the lemma for duermes “you sleep”. The specific forms
sung or carpets or sing or duermes are called wordforms.
As the example above shows, each lemma can have multiple meanings; the
lemma mouse can refer to the rodent or the cursor control device. We call each
of these aspects of the meaning of mouse a word sense. The fact that lemmas can
be polysemous (have multiple senses) can make interpretation difficult (is someone
who types “mouse info” into a search engine looking for a pet or a tool?). Chapter 18
will discuss the problem of polysemy, and introduce word sense disambiguation,
the task of determining which sense of a word is being used in a particular context.
Synonymy One important component of word meaning is the relationship be-
tween word senses. For example when one word has a sense whose meaning is
identical to a sense of another word, or nearly identical, we say the two senses of
those two words are synonyms. Synonyms include such pairs as

couch/sofa vomit/throw up filbert/hazelnut car/automobile


A more formal definition of synonymy (between words rather than senses) is that
two words are synonymous if they are substitutable for one another in any sentence
without changing the truth conditions of the sentence, the situations in which the
sentence would be true. We often say in this case that the two words have the same
propositional meaning.
While substitutions between some pairs of words like car / automobile or
water / H2O are truth preserving, the words are still not identical in meaning. Indeed,
probably no two words are absolutely identical in meaning. One of the fundamental
tenets of semantics, called the principle of contrast (Girard 1718, Bréal 1897,
Clark 1987), states that a difference in linguistic form is always associated with some
difference in meaning. For example, the word H2O is used in scientific contexts and
would be inappropriate in a hiking guide (water would be more appropriate), and
this genre difference is part of the meaning of the word. In practice, the word synonym
is therefore used to describe a relationship of approximate or rough synonymy.
Word Similarity While words don’t have many synonyms, most words do have
lots of similar words. Cat is not a synonym of dog, but cats and dogs are certainly
similar words. In moving from synonymy to similarity, it will be useful to shift from
talking about relations between word senses (like synonymy) to relations between
words (like similarity). Dealing with words avoids having to commit to a particular
representation of word senses, which will turn out to simplify our task.
The notion of word similarity is very useful in larger semantic tasks. Knowing
how similar two words are can help in computing how similar the meanings of
two phrases or sentences are, a very important component of natural language
understanding tasks like question answering, paraphrasing, and summarization. One
way of getting values for word similarity is to ask humans to judge how similar one
word is to another. A number of datasets have resulted from such experiments. For
example the SimLex-999 dataset (Hill et al., 2015) gives values on a scale from 0 to
10, like the examples below, which range from near-synonyms (vanish, disappear)
to pairs that scarcely seem to have anything in common (hole, agreement):
vanish disappear 9.8
belief impression 5.95
muscle bone 3.65
modest flexible 0.98
hole agreement 0.3

Word Relatedness The meaning of two words can be related in ways other than
similarity. One such class of connections is called word relatedness (Budanitsky
and Hirst, 2006), also traditionally called word association in psychology.
Consider the meanings of the words coffee and cup. Coffee is not similar to cup;
they share practically no features (coffee is a plant or a beverage, while a cup is a
manufactured object with a particular shape). But coffee and cup are clearly related;
they are associated by co-participating in an everyday event (the event of drinking
coffee out of a cup). Similarly scalpel and surgeon are not similar but are related
eventively (a surgeon tends to make use of a scalpel).
One common kind of relatedness between words is if they belong to the same
semantic field. A semantic field is a set of words which cover a particular semantic
domain and bear structured relations with each other. For example, words might be
related by being in the semantic field of hospitals (surgeon, scalpel, nurse, anesthetic,
hospital), restaurants (waiter, menu, plate, food, chef), or houses (door, roof,
kitchen, family, bed). Semantic fields are also related to topic models, like Latent
Dirichlet Allocation, LDA, which apply unsupervised learning on large sets of texts
to induce sets of associated words from text. Semantic fields and topic models are
very useful tools for discovering topical structure in documents.
In Chapter 18 we’ll introduce more relations between senses like hypernymy or
IS-A, antonymy (opposites) and meronymy (part-whole relations).
Semantic Frames and Roles Closely related to semantic fields is the idea of a
semantic frame. A semantic frame is a set of words that denote perspectives or
participants in a particular type of event. A commercial transaction, for example,
is a kind of event in which one entity trades money to another entity in return for
some good or service, after which the good changes hands or perhaps the service is
performed. This event can be encoded lexically by using verbs like buy (the event
from the perspective of the buyer), sell (from the perspective of the seller), pay
(focusing on the monetary aspect), or nouns like buyer. Frames have semantic roles
(like buyer, seller, goods, money), and words in a sentence can take on these roles.
Knowing that buy and sell have this relation makes it possible for a system to
know that a sentence like Sam bought the book from Ling could be paraphrased as
Ling sold the book to Sam, and that Sam has the role of the buyer in the frame and
Ling the seller. Being able to recognize such paraphrases is important for question
answering, and can help in shifting perspective for machine translation.
Connotation Finally, words have affective meanings or connotations. The word
connotation has different meanings in different fields, but here we use it to mean
the aspects of a word’s meaning that are related to a writer or reader’s emotions,
sentiment, opinions, or evaluations. For example some words have positive conno-
tations (happy) while others have negative connotations (sad). Even words whose
meanings are similar in other ways can vary in connotation; consider the difference
in connotations between fake, knockoff, forgery, on the one hand, and copy, replica,
reproduction on the other, or innocent (positive connotation) and naive (negative
connotation). Some words describe positive evaluation (great, love) and others neg-
ative evaluation (terrible, hate). Positive or negative evaluation language is called
sentiment, as we saw in Chapter 4, and word sentiment plays a role in important
tasks like sentiment analysis, stance detection, and applications of NLP to the lan-
guage of politics and consumer reviews.
Early work on affective meaning (Osgood et al., 1957) found that words varied
along three important dimensions of affective meaning:
valence: the pleasantness of the stimulus
arousal: the intensity of emotion provoked by the stimulus
dominance: the degree of control exerted by the stimulus
Thus words like happy or satisfied are high on valence, while unhappy or an-
noyed are low on valence. Excited is high on arousal, while calm is low on arousal.
Controlling is high on dominance, while awed or influenced are low on dominance.
Each word is thus represented by three numbers, corresponding to its value on each
of the three dimensions:
              Valence   Arousal   Dominance
courageous      8.05      5.5       7.38
music           7.67      5.57      6.5
heartbreak      2.45      5.65      3.58
cub             6.71      3.95      4.24

Osgood et al. (1957) noticed that in using these 3 numbers to represent the
meaning of a word, the model was representing each word as a point in a three-
dimensional space, a vector whose three dimensions corresponded to the word’s
rating on the three scales. This revolutionary idea that word meaning could be rep-
resented as a point in space (e.g., that part of the meaning of heartbreak can be
represented as the point [2.45, 5.65, 3.58]) was the first expression of the vector se-
mantics models that we introduce next.

6.2 Vector Semantics


Vector semantics is the standard way to represent word meaning in NLP, helping
us model many of the aspects of word meaning we saw in the previous section. The
roots of the model lie in the 1950s when two big ideas converged: Osgood’s (1957)
idea mentioned above to use a point in three-dimensional space to represent the
connotation of a word, and the proposal by linguists like Joos (1950), Harris (1954),
and Firth (1957) to define the meaning of a word by its distribution in language
use, meaning its neighboring words or grammatical environments. Their idea was
that two words that occur in very similar distributions (whose neighboring words are
similar) have similar meanings.
For example, suppose you didn’t know the meaning of the word ongchoi (a re-
cent borrowing from Cantonese) but you see it in the following contexts:
(6.1) Ongchoi is delicious sauteed with garlic.
(6.2) Ongchoi is superb over rice.
(6.3) ...ongchoi leaves with salty sauces...
And suppose that you had seen many of these context words in other contexts:
(6.4) ...spinach sauteed with garlic over rice...
(6.5) ...chard stems and leaves are delicious...
(6.6) ...collard greens and other salty leafy greens
The fact that ongchoi occurs with words like rice and garlic and delicious and
salty, as do words like spinach, chard, and collard greens might suggest that ongchoi
is a leafy green similar to these other leafy greens.1 We can do the same thing
computationally by just counting words in the context of ongchoi.
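
The counting idea can be sketched in a few lines. This is a minimal illustration, not a method from the text: it assumes the context of a word is simply the whole example sentence it appears in, and it uses the six example contexts above, lowercased and with punctuation removed. The function name `context_counts` is hypothetical.

```python
from collections import Counter

# The six example contexts (6.1)-(6.6), lowercased, punctuation removed.
sentences = [
    "ongchoi is delicious sauteed with garlic",
    "ongchoi is superb over rice",
    "ongchoi leaves with salty sauces",
    "spinach sauteed with garlic over rice",
    "chard stems and leaves are delicious",
    "collard greens and other salty leafy greens",
]

def context_counts(target, sentences):
    """Count the other words in every sentence that contains `target`.

    Assumption: the context is the whole sentence, not a fixed window.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    return counts

ongchoi = context_counts("ongchoi", sentences)
spinach = context_counts("spinach", sentences)

# Context words that ongchoi shares with spinach.
shared = sorted(set(ongchoi) & set(spinach))
print(shared)
```

On this toy data the shared context words are garlic, over, rice, sauteed, and with, which is the kind of overlap that suggests the two words are similar.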
The idea of vector semantics is to represent a word as a point in a multidimen-
sional semantic space that is derived (in ways we’ll see) from the distributions of
word neighbors. Vectors for representing words are called embeddings (although
the term is sometimes more strictly applied only to dense vectors like word2vec
(Section 6.8), rather than sparse tf-idf or PPMI vectors (Section 6.3-Section 6.6)).
The word “embedding” derives from its mathematical sense as a mapping from one
space or structure to another, although the meaning has shifted; see the end of the
chapter.
Fig. 6.1 shows a visualization of embeddings learned for sentiment analysis,
showing the location of selected words projected down from 60-dimensional space
into a two dimensional space. Notice the distinct regions containing positive words,
negative words, and neutral function words.
1 It’s in fact Ipomoea aquatica, a relative of morning glory sometimes called water spinach in English.

[Figure 6.1: scatter plot. Labeled points include: not good, bad, dislike, worst, incredibly bad, worse; to, by, ’s, that, now, are, a, i, you, than, with, is; very good, incredibly good, amazing, fantastic, terrific, wonderful, nice, good.]

Figure 6.1 A two-dimensional (t-SNE) projection of embeddings for some words and
phrases, showing that words with similar meanings are nearby in space. The original 60-
dimensional embeddings were trained for sentiment analysis. Simplified from Li et al. (2015)
with colors added for explanation.

The fine-grained model of word similarity of vector semantics offers enormous
power to NLP applications. NLP applications like the sentiment classifiers of
Chapter 4 or Chapter 5 depend on the same words appearing in the training and test sets.
But by representing words as embeddings, a classifier can assign sentiment as long as
it sees some words with similar meanings. And as we’ll see, vector semantic models
can be learned automatically from text without supervision.
In this chapter we’ll introduce the two most commonly used models. In the tf-idf
model, an important baseline, the meaning of a word is defined by a simple function
of the counts of nearby words. We will see that this method results in very long
vectors that are sparse, i.e. mostly zeros (since most words simply never occur in
the context of others). We’ll introduce the word2vec model family for construct-
ing short, dense vectors that have useful semantic properties. We’ll also introduce
the cosine, the standard way to use embeddings to compute semantic similarity
between two words, two sentences, or two documents, an important tool in practical
applications like question answering, summarization, or automatic essay grading.

6.3 Words and Vectors


“The most important attributes of a vector in 3-space are {Location, Location, Location}”
Randall Munroe, https://xkcd.com/2358/

Vector or distributional models of meaning are generally based on a co-occurrence
matrix, a way of representing how often words co-occur. We’ll look at two popular
matrices: the term-document matrix and the term-term matrix.

6.3.1 Vectors and documents


In a term-document matrix, each row represents a word in the vocabulary and each
column represents a document from some collection of documents. Fig. 6.2 shows a
small selection from a term-document matrix showing the occurrence of four words
in four plays by Shakespeare. Each cell in this matrix represents the number of times
a particular word (defined by the row) occurs in a particular document (defined by
the column). Thus fool appeared 58 times in Twelfth Night.
The term-document matrix of Fig. 6.2 was first defined as part of the vector
space model of information retrieval (Salton, 1971). In this model, a document is
represented as a count vector, a column in Fig. 6.3.

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle           1               0               7            13
good           114              80              62            89
fool            36              58               1             4
wit             20              15               2             3
Figure 6.2 The term-document matrix for four words in four Shakespeare plays. Each cell
contains the number of times the (row) word occurs in the (column) document.


To review some basic linear algebra, a vector is, at heart, just a list or array of
numbers. So As You Like It is represented as the list [1,114,36,20] (the first column
vector in Fig. 6.3) and Julius Caesar is represented as the list [7,62,1,2] (the third
column vector). A vector space is a collection of vectors, characterized by their
dimension. In the example in Fig. 6.3, the document vectors are of dimension 4,
just so they fit on the page; in real term-document matrices, the vectors representing
each document would have dimensionality |V|, the vocabulary size.
The ordering of the numbers in a vector space indicates different meaningful di-
mensions on which documents vary. Thus the first dimension for both these vectors
corresponds to the number of times the word battle occurs, and we can compare
each dimension, noting for example that the vectors for As You Like It and Twelfth
Night have similar values (1 and 0, respectively) for the first dimension.
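
As a small sketch of the column-vector view, the counts from the term-document matrix above can be stored as a list of rows, with a document vector read off as one column. The names `term_doc` and `document_vector` are illustrative choices, not part of any library.

```python
words = ["battle", "good", "fool", "wit"]
plays = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]

# Rows are words, columns are plays (counts from the term-document matrix).
term_doc = [
    [1, 0, 7, 13],      # battle
    [114, 80, 62, 89],  # good
    [36, 58, 1, 4],     # fool
    [20, 15, 2, 3],     # wit
]

def document_vector(play):
    """Return the column of word counts for one play."""
    j = plays.index(play)
    return [row[j] for row in term_doc]

as_you_like_it = document_vector("As You Like It")
julius_caesar = document_vector("Julius Caesar")
print(as_you_like_it)
print(julius_caesar)

# The first dimension of every document vector is the count of "battle".
battle_dimension = [document_vector(p)[0] for p in plays]
print(battle_dimension)
```

In a real collection each column would have |V| entries rather than 4, so a sparse representation would be a more sensible choice than nested lists.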

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle           1               0               7            13
good           114              80              62            89
fool            36              58               1             4
wit             20              15               2             3
Figure 6.3 The term-document matrix for four words in four Shakespeare plays. The red
boxes show that each document is represented as a column vector of length four.

We can think of the vector for a document as a point in |V|-dimensional space;
thus the documents in Fig. 6.3 are points in 4-dimensional space. Since 4-dimensional
spaces are hard to visualize, Fig. 6.4 shows a visualization in two dimensions; we’ve
arbitrarily chosen the dimensions corresponding to the words battle and fool.

[Figure 6.4: scatter plot with fool on the horizontal axis and battle on the vertical axis. Points as labeled in the figure: As You Like It [36,1], Twelfth Night [58,0], Julius Caesar [1,7], Henry V [4,13].]
Figure 6.4 A spatial visualization of the document vectors for the four Shakespeare play
documents, showing just two of the dimensions, corresponding to the words battle and fool.
The comedies have high values for the fool dimension and low values for the battle dimension.

Term-document matrices were originally defined as a means of finding similar
documents for the task of document information retrieval. Two documents that are
similar will tend to have similar words, and if two documents have similar words
their column vectors will tend to be similar. The vectors for the comedies As You
Like It [1,114,36,20] and Twelfth Night [0,80,58,15] look a lot more like each other
(more fools and wit than battles) than they look like Julius Caesar [7,62,1,2] or
Henry V [13,89,4,3]. This is clear with the raw numbers; in the first dimension
(battle) the comedies have low numbers and the others have high numbers, and we
can see it visually in Fig. 6.4; we’ll see very shortly how to quantify this intuition
more formally.
A real term-document matrix, of course, wouldn’t just have 4 rows and columns,
let alone 2. More generally, the term-document matrix has |V | rows (one for each
word type in the vocabulary) and D columns (one for each document in the collec-
tion); as we’ll see, vocabulary sizes are generally in the tens of thousands, and the
number of documents can be enormous (think about all the pages on the web).
Information retrieval (IR) is the task of finding the document d from the D
documents in some collection that best matches a query q. For IR we’ll therefore also
represent a query by a vector, also of length |V|, and we’ll need a way to compare
two vectors to find how similar they are. (Doing IR will also require efficient ways
to store and manipulate these vectors by making use of the convenient fact that these
vectors are sparse, i.e., mostly zeros.)
Later in the chapter we’ll introduce some of the components of this vector com-
parison process: the tf-idf term weighting, and the cosine similarity metric.

6.3.2 Words as vectors: document dimensions


We’ve seen that documents can be represented as vectors in a vector space. But
vector semantics can also be used to represent the meaning of words. We do this
by associating each word with a word vector, a row vector rather than a column
vector, hence with different dimensions, as shown in Fig. 6.5. The four dimensions
of the vector for fool, [36,58,1,4], correspond to the four Shakespeare plays. Word
counts in the same four dimensions are used to form the vectors for the other 3
words: wit, [20,15,2,3]; battle, [1,0,7,13]; and good [114,80,62,89].

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle           1               0               7            13
good           114              80              62            89
fool            36              58               1             4
wit             20              15               2             3
Figure 6.5 The term-document matrix for four words in four Shakespeare plays. The red
boxes show that each word is represented as a row vector of length four.

For documents, we saw that similar documents had similar vectors, because sim-
ilar documents tend to have similar words. This same principle applies to words:
similar words have similar vectors because they tend to occur in similar documents.
The term-document matrix thus lets us represent the meaning of a word by the doc-
uments it tends to occur in.
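
The row-vector view can be sketched the same way. Here each word maps to its row of counts across the four plays, and a small helper (the hypothetical `most_frequent_play`) reports the play in which a word occurs most often, a crude illustration of describing a word by the documents it tends to occur in.

```python
plays = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]

# Each word is a row vector whose dimensions are the four plays.
word_vectors = {
    "battle": [1, 0, 7, 13],
    "good": [114, 80, 62, 89],
    "fool": [36, 58, 1, 4],
    "wit": [20, 15, 2, 3],
}

def most_frequent_play(word):
    """Return the play (dimension) with the largest count for `word`."""
    row = word_vectors[word]
    return plays[row.index(max(row))]

for word in word_vectors:
    print(word, word_vectors[word], most_frequent_play(word))
```

On these counts fool and wit peak in the comedies while battle peaks in Henry V, which matches the intuition that similar words occur in similar documents.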

6.3.3 Words as vectors: word dimensions


An alternative to using the term-document matrix to represent words as vectors of
document counts is to use the term-term matrix, also called the word-word matrix
or the term-context matrix, in which the columns are labeled by words rather
than documents. This matrix is thus of dimensionality |V|×|V| and each cell records
the number of times the row (target) word and the column (context) word co-occur
in some context in some training corpus. The context could be the document, in
which case the cell represents the number of times the two words appear in the same
document. It is most common, however, to use smaller contexts, generally a win-
dow around the word, for example of 4 words to the left and 4 words to the right,
in which case the cell represents the number of times (in some training corpus) the
column word occurs in such a ±4 word window around the row word. Here is one
example of each of four words (cherry, strawberry, digital, information) in its window:
is traditionally followed by cherry pie, a traditional dessert
often mixed, such as strawberry rhubarb pie. Apple pie
computer peripherals and personal digital assistants. These devices usually
a computer. This includes information available on the internet
If we then take every occurrence of each word (say strawberry) and count the
context words around it, we get a word-word co-occurrence matrix. Fig. 6.6 shows a
simplified subset of the word-word co-occurrence matrix for these four words com-
puted from the Wikipedia corpus (Davies, 2015).

              aardvark  ...  computer   data  result   pie  sugar  ...
cherry               0  ...         2      8       9   442     25  ...
strawberry           0  ...         0      0       1    60     19  ...
digital              0  ...      1670   1683      85     5      4  ...
information          0  ...      3325   3982     378     5     13  ...
Figure 6.6 Co-occurrence vectors for four words in the Wikipedia corpus, showing six of
the dimensions (hand-picked for pedagogical purposes). The vector for digital is outlined in
red. Note that a real vector would have vastly more dimensions and thus be much sparser.
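
The window counting described above can be sketched as follows. This is a minimal version under simple assumptions: whitespace tokenization, a symmetric ±4 token window, and a dictionary of counters standing in for a sparse matrix. The function name `cooccurrence_counts` is hypothetical.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=4):
    """Map each target word to a Counter of words within +/- `window` tokens."""
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target position itself
                counts[target][tokens[j]] += 1
    return counts

# One of the example windows from the text, with punctuation removed.
text = "is traditionally followed by cherry pie a traditional dessert"
counts = cooccurrence_counts(text.split(), window=4)

print(counts["cherry"])          # all eight neighbors fall inside the window
print(counts["cherry"]["pie"])   # 1
print(counts["is"]["pie"])       # 0: "pie" is five tokens away from "is"
```

Only the nonzero cells are stored, which is one simple way to exploit the sparsity of real co-occurrence vectors.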

Note in Fig. 6.6 that the two words cherry and strawberry are more similar to
each other (both pie and sugar tend to occur in their window) than they are to other
words like digital; conversely, digital and information are more similar to each other
than, say, to strawberry. Fig. 6.7 shows a spatial visualization.

[Figure 6.7: plot with data on the horizontal axis and computer on the vertical axis (both 0 to 4000), showing the vectors information [3982,3325] and digital [1683,1670].]
Figure 6.7 A spatial visualization of word vectors for digital and information, showing just
two of the dimensions, corresponding to the words data and computer.

Note that |V|, the length of the vector, is generally the size of the vocabulary, often
between 10,000 and 50,000 words (using the most frequent words in the training
corpus; keeping words beyond roughly the 50,000 most frequent is generally not
helpful). Since most of these numbers are zero, these are sparse vector representations;
there are efficient algorithms for storing and computing with sparse matrices.
Now that we have some intuitions, let’s move on to examine the details of com-
puting word similarity. Afterwards we’ll discuss methods for weighting cells.

6.4 Cosine for measuring similarity


To measure similarity between two target words v and w, we need a metric that
takes two vectors (of the same dimensionality, either both with words as dimensions,
hence of length |V|, or both with documents as dimensions, hence of length
|D|) and gives a measure of their similarity. By far the most common similarity
metric is the cosine of the angle between the vectors.
The cosine, like most measures for vector similarity used in NLP, is based on
the dot product operator from linear algebra, also called the inner product:

$$\text{dot product}(\mathbf{v}, \mathbf{w}) = \mathbf{v} \cdot \mathbf{w} = \sum_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + \dots + v_N w_N \tag{6.7}$$

As we will see, most metrics for similarity between vectors are based on the dot
product. The dot product acts as a similarity metric because it will tend to be high
just when the two vectors have large values in the same dimensions. Alternatively,
vectors that have zeros in different dimensions—orthogonal vectors—will have a
dot product of 0, representing their strong dissimilarity.
This raw dot product, however, has a problem as a similarity metric: it favors
long vectors. The vector length is defined as

$$|\mathbf{v}| = \sqrt{\sum_{i=1}^{N} v_i^2} \tag{6.8}$$

The dot product is higher if a vector is longer, with higher values in each dimension.
More frequent words have longer vectors, since they tend to co-occur with more
words and have higher co-occurrence values with each of them. The raw dot product
thus will be higher for frequent words. But this is a problem; we’d like a similarity
metric that tells us how similar two words are regardless of their frequency.
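
A short sketch makes the length bias concrete. It uses three of the co-occurrence counts from Fig. 6.6 (the pie, data, and computer dimensions); the helper names `dot` and `length` are illustrative.

```python
import math

def dot(v, w):
    """Dot product (Eq. 6.7): sum of elementwise products."""
    return sum(vi * wi for vi, wi in zip(v, w))

def length(v):
    """Vector length (Eq. 6.8): square root of the sum of squares."""
    return math.sqrt(dot(v, v))

# Counts over the dimensions (pie, data, computer) from Fig. 6.6.
cherry = [442, 8, 2]
digital = [5, 1683, 1670]
information = [5, 3982, 3325]

print(dot(cherry, information))    # 40716
print(dot(digital, information))   # 12254481

# Doubling every count (a "more frequent" version of the same word)
# doubles the raw dot product even though the direction is unchanged.
digital_doubled = [2 * x for x in digital]
print(dot(digital_doubled, information) == 2 * dot(digital, information))
print(length(digital_doubled) == 2 * length(digital))
```

The last two lines show why the raw dot product rewards frequency: scaling a vector scales both its length and its dot product with every other vector.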
We modify the dot product to normalize for the vector length by dividing the
dot product by the lengths of each of the two vectors. This normalized dot product
turns out to be the same as the cosine of the angle between the two vectors, following
from the definition of the dot product between two vectors a and b:

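$$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}||\mathbf{b}| \cos\theta$$

$$\frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}||\mathbf{b}|} = \cos\theta \tag{6.9}$$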
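The cosine similarity metric between two vectors v and w thus can be computed as:

$$\text{cosine}(\mathbf{v}, \mathbf{w}) = \frac{\mathbf{v} \cdot \mathbf{w}}{|\mathbf{v}||\mathbf{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}} \tag{6.10}$$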

For some applications we pre-normalize each vector, by dividing it by its length,
creating a unit vector of length 1. Thus we could compute a unit vector from a by
dividing it by |a|. For unit vectors, the dot product is the same as the cosine.
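
Pre-normalization can be sketched as below. The vectors [3, 4] and [4, 3] are made-up values chosen so the arithmetic is easy to check by hand, and `normalize` is a hypothetical helper; a real implementation would also need to decide what to do with an all-zero vector, which has no direction.

```python
import math

def dot(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))

def normalize(v):
    """Divide a vector by its length to get a unit vector.

    Assumes v is not the zero vector (its length would be 0).
    """
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

a = normalize([3.0, 4.0])   # length 5, so the unit vector is [0.6, 0.8]
b = normalize([4.0, 3.0])   # length 5, so the unit vector is [0.8, 0.6]

print(dot(a, a))  # a unit vector has length 1, up to floating-point rounding
print(dot(a, b))  # equals the cosine of the original vectors, 24/25 = 0.96
```

Once every vector is stored in unit form, each similarity computation needs only a dot product, which is why pre-normalizing can be convenient when many comparisons are made.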

The cosine value ranges from 1 for vectors pointing in the same direction, through
0 for orthogonal vectors, to -1 for vectors pointing in opposite directions. But since
raw frequency values are non-negative, the cosine for these vectors ranges from 0 to 1.
Let’s see how the cosine computes which of the words cherry or digital is closer
in meaning to information, just using raw counts from the following shortened table:
              pie    data   computer
cherry        442       8          2
digital         5    1683       1670
information     5    3982       3325

$$\cos(\text{cherry}, \text{information}) = \frac{442 \cdot 5 + 8 \cdot 3982 + 2 \cdot 3325}{\sqrt{442^2 + 8^2 + 2^2}\,\sqrt{5^2 + 3982^2 + 3325^2}} = .018$$

$$\cos(\text{digital}, \text{information}) = \frac{5 \cdot 5 + 1683 \cdot 3982 + 1670 \cdot 3325}{\sqrt{5^2 + 1683^2 + 1670^2}\,\sqrt{5^2 + 3982^2 + 3325^2}} = .996$$


The model decides that information is way closer to digital than it is to cherry, a
result that seems sensible. Fig. 6.8 shows a visualization.
[Figure 6.8: plot with Dimension 2 (‘computer’) on the horizontal axis and Dimension 1 (‘pie’) on the vertical axis, showing the vectors for cherry, digital, and information.]
Figure 6.8 A (rough) graphical demonstration of cosine similarity, showing vectors for
three words (cherry, digital, and information) in the two dimensional space defined by counts
of the words computer and pie nearby. Note that the angle between digital and information is
smaller than the angle between cherry and information. When two vectors are more similar,
the cosine is larger but the angle is smaller; the cosine has its maximum (1) when the angle
between two vectors is smallest (0◦ ); the cosine of all other angles is less than 1.
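
The worked cosine example can be reproduced with a few lines of code. This is a plain-Python sketch of Eq. 6.10 applied to the three-dimensional count vectors from the shortened table; the function name `cosine` is an illustrative choice.

```python
import math

def cosine(v, w):
    """Cosine similarity (Eq. 6.10): dot product divided by both lengths."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w)

# Dimensions are (pie, data, computer).
cherry = [442, 8, 2]
digital = [5, 1683, 1670]
information = [5, 3982, 3325]

cos_cherry = cosine(cherry, information)
cos_digital = cosine(digital, information)
print(cos_cherry)    # roughly 0.0178
print(cos_digital)   # roughly 0.996
```

Because all counts are non-negative, both values fall between 0 and 1, and the much larger value for digital reflects the smaller angle between digital and information.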

6.5 TF-IDF: Weighing terms in the vector


The co-occurrence matrices above represent each cell by frequencies, either of words
with documents (Fig. 6.5), or words with other words (Fig. 6.6). But raw frequency
is not the best measure of association between words. Raw frequency is very skewed
and not very discriminative. If we want to know what kinds of contexts are shared
by cherry and strawberry but not by digital and information, we’re not going to get
good discrimination from words like the, it, or they, which occur frequently with
all sorts of words and aren’t informative about any particular word. We saw this
also in Fig. 6.3 for the Shakespeare corpus; the dimension for the word good is not
very discriminative between plays; good is simply a frequent word and has roughly
equivalent high frequencies in each of the plays.
It’s a bit of a paradox. Words that occur nearby frequently (maybe pie nearby
cherry) are more important than words that only appear once or twice. Yet words
that are too frequent (ubiquitous, like the or good) are unimportant. How can we
balance these two conflicting constraints?
There are two common solutions to this problem: in this section we’ll describe
the tf-idf algorithm, usually used when the dimensions are documents. In the next
section we introduce the PPMI algorithm (usually used when the dimensions are words).
The tf-idf algorithm (the ‘-’ here is a hyphen, not a minus sign) is the product
of two terms, each term capturing one of these two intuitions:
The first is the term frequency (Luhn, 1957): the frequency of the word t in the
document d. We can just use the raw count as the term frequency:

$$\text{tf}_{t,d} = \text{count}(t,d) \tag{6.11}$$

More commonly we squash the raw frequency a bit, by using the log10 of the
frequency instead. The intuition is that a word appearing 100 times in a document
doesn’t make that word 100 times more likely to be relevant to the meaning of the
document. Because we can’t take the log of 0, we normally add 1 to the count:2

$$\text{tf}_{t,d} = \log_{10}(\text{count}(t,d) + 1) \tag{6.12}$$

If we use log weighting, terms which occur 0 times in a document would have
tf = log10(1) = 0, 10 times in a document tf = log10(11) = 1.04, 100 times tf =
log10(101) = 2.004, 1000 times tf = 3.00044, and so on.
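
The two term-frequency variants can be sketched directly. `tf_log` follows Eq. 6.12 and `tf_alt` follows the alternative given in the footnote; both function names are hypothetical.

```python
import math

def tf_log(count):
    """Eq. 6.12: log10 of the count plus one."""
    return math.log10(count + 1)

def tf_alt(count):
    """Footnote alternative: 1 + log10(count) for positive counts, else 0."""
    return 1 + math.log10(count) if count > 0 else 0.0

for count in (0, 10, 100, 1000):
    print(count, round(tf_log(count), 5), round(tf_alt(count), 5))
```

Both variants grow by about 1 for every tenfold increase in the raw count, which is the squashing behavior the text motivates.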
The second factor in tf-idf is used to give a higher weight to words that occur
only in a few documents. Terms that are limited to a few documents are useful
for discriminating those documents from the rest of the collection; terms that occur
frequently across the entire collection aren’t as helpful. The document frequency
df_t of a term t is the number of documents it occurs in. Document frequency is
not the same as the collection frequency of a term, which is the total number of
times the word appears in the whole collection in any document. Consider in the
collection of Shakespeare’s 37 plays the two words Romeo and action. The words
have identical collection frequencies (they both occur 113 times in all the plays) but
very different document frequencies, since Romeo only occurs in a single play. If
our goal is to find documents about the romantic tribulations of Romeo, the word
Romeo should be highly weighted, but not action:
          Collection Frequency   Document Frequency
Romeo             113                     1
action            113                    31
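
The difference between the two frequencies is easy to see in code. The three documents below are invented toy strings, not Shakespeare, chosen so that two words have the same collection frequency but different document frequencies; the function name is hypothetical.

```python
from collections import Counter

def collection_and_document_frequency(documents):
    """Return (collection frequency, document frequency) Counters."""
    cf = Counter()
    df = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        cf.update(tokens)        # every occurrence counts
        df.update(set(tokens))   # each document counts once per term
    return cf, df

# Hypothetical toy collection.
docs = [
    "romeo loves juliet and romeo fights",
    "the action begins",
    "more action follows the battle",
]
cf, df = collection_and_document_frequency(docs)

print(cf["romeo"], df["romeo"])     # 2 1
print(cf["action"], df["action"])   # 2 2
```

As with Romeo and action in the plays, the two words occur equally often overall, but one is concentrated in a single document and is therefore more useful for picking that document out.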
We emphasize discriminative words like Romeo via the inverse document frequency
or idf term weight (Sparck Jones, 1972). The idf is defined using the fraction
N/df_t, where N is the total number of documents in the collection, and df_t is
the number of documents in which term t occurs. The fewer documents in which a
term occurs, the higher this weight. The lowest value of this fraction, 1, is assigned
to terms that occur in all the documents. It’s usually clear what counts as a document:
in Shakespeare we would use a play; when processing a collection of encyclopedia articles
like Wikipedia, the document is a Wikipedia page; in processing newspaper articles,
the document is a single article. Occasionally your corpus might not have appropri-
ate document divisions and you might need to break up the corpus into documents
yourself for the purposes of computing idf.
2 Or we can use this alternative:
\[ \mathrm{tf}_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{count}(t,d) & \text{if } \mathrm{count}(t,d) > 0 \\ 0 & \text{otherwise} \end{cases} \]
108 C HAPTER 6 • V ECTOR S EMANTICS AND E MBEDDINGS

Because of the large number of documents in many collections, this measure


too is usually squashed with a log function. The resulting definition for inverse
document frequency (idf) is thus
 
\[ \mathrm{idf}_t = \log_{10}\left(\frac{N}{\mathrm{df}_t}\right) \tag{6.13} \]

Here are some idf values for some words in the Shakespeare corpus, ranging from
extremely informative words which occur in only one play like Romeo, to those that
occur in a few like salad or Falstaff, to those which are very common like fool or so
common as to be completely non-discriminative since they occur in all 37 plays like
good or sweet.3
Word df idf
Romeo 1 1.57
salad 2 1.27
Falstaff 4 0.967
forest 12 0.489
battle 21 0.246
wit 34 0.037
fool 36 0.012
good 37 0
sweet 37 0
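As a rough check on the table above, the idf values can be recomputed from the document frequencies with Eq. 6.13. The sketch below assumes N = 37 plays; the names idf, doc_freq and idf_values are illustrative only.

```python
import math

N_PLAYS = 37  # number of documents (plays) in the Shakespeare collection

def idf(df: int, n_docs: int = N_PLAYS) -> float:
    """Inverse document frequency of Eq. 6.13: log10(N / df)."""
    return math.log10(n_docs / df)

# Document frequencies taken from the table in the text.
doc_freq = {"Romeo": 1, "salad": 2, "Falstaff": 4, "forest": 12,
            "battle": 21, "wit": 34, "fool": 36, "good": 37}
idf_values = {word: idf(df) for word, df in doc_freq.items()}

for word, value in idf_values.items():
    print(f"{word:10s} {value:.3f}")
```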
The tf-idf weighted value wt, d for word t in document d thus combines term
frequency tft, d (defined either by Eq. 6.11 or by Eq. 6.12) with idf from Eq. 6.13:

wt, d = tft, d × idft (6.14)

Fig. 6.9 applies tf-idf weighting to the Shakespeare term-document matrix in Fig. 6.2,
using the tf equation Eq. 6.12. Note that the tf-idf values for the dimension corre-
sponding to the word good have now all become 0; since this word appears in every
document, the tf-idf algorithm leads it to be ignored. Similarly, the word fool, which
appears in 36 out of the 37 plays, has a much lower weight.

As You Like It Twelfth Night Julius Caesar Henry V


battle 0.074 0 0.22 0.28
good 0 0 0 0
fool 0.019 0.021 0.0036 0.0083
wit 0.049 0.044 0.018 0.022
Figure 6.9 A tf-idf weighted term-document matrix for four words in four Shakespeare
plays, using the counts in Fig. 6.2. For example the 0.049 value for wit in As You Like It is
the product of tf = log10 (20 + 1) = 1.322 and idf = .037. Note that the idf weighting has
eliminated the importance of the ubiquitous word good and vastly reduced the impact of the
almost-ubiquitous word fool.
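The worked example in the caption of Fig. 6.9 can be reproduced by combining Eq. 6.12 and Eq. 6.13 as in Eq. 6.14. This is a minimal sketch; tf_idf is a hypothetical helper name, and the inputs (a count of 20 for wit in As You Like It, df = 34, N = 37) come from the text.

```python
import math

def tf_idf(count: int, df: int, n_docs: int) -> float:
    """tf-idf weight of Eq. 6.14: log tf (Eq. 6.12) times idf (Eq. 6.13)."""
    tf = math.log10(count + 1)
    idf = math.log10(n_docs / df)
    return tf * idf

# wit occurs 20 times in As You Like It and appears in 34 of the 37 plays.
w_wit = tf_idf(count=20, df=34, n_docs=37)
print(round(w_wit, 3))

# A word that occurs in every play gets weight 0 however frequent it is.
w_ubiquitous = tf_idf(count=100, df=37, n_docs=37)
print(w_ubiquitous)
```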

The tf-idf weighting is the standard way of weighting co-occurrence matrices in information
retrieval, and it also plays a role in many other aspects of natural language
processing. It’s also a great baseline, the simple thing to try first. We’ll look at other
weightings like PPMI (Positive Pointwise Mutual Information) in Section 6.6.
3 Sweet was one of Shakespeare’s favorite adjectives, a fact probably related to the increased use of
sugar in European recipes around the turn of the 16th century (Jurafsky, 2014, p. 175).
6.6 • P OINTWISE M UTUAL I NFORMATION (PMI) 109

6.6 Pointwise Mutual Information (PMI)


An alternative weighting function to tf-idf, PPMI (positive pointwise mutual information),
is used for term-term matrices, when the vector dimensions correspond to
words rather than documents. PPMI draws on the intuition that the best way to weigh
the association between two words is to ask how much more the two words co-occur
in our corpus than we would have a priori expected them to appear by chance.
Pointwise mutual information (Fano, 1961)4 is one of the most important concepts
in NLP. It is a measure of how often two events x and y occur, compared with
what we would expect if they were independent:

\[ I(x, y) = \log_2 \frac{P(x, y)}{P(x)P(y)} \tag{6.16} \]
The pointwise mutual information between a target word w and a context word
c (Church and Hanks 1989, Church and Hanks 1990) is then defined as:

\[ \mathrm{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)P(c)} \tag{6.17} \]

The numerator tells us how often we observed the two words together (assuming
we compute probability by using the MLE). The denominator tells us how often
we would expect the two words to co-occur assuming they each occurred indepen-
dently; recall that the probability of two independent events both occurring is just
the product of the probabilities of the two events. Thus, the ratio gives us an esti-
mate of how much more the two words co-occur than we expect by chance. PMI is
a useful tool whenever we need to find words that are strongly associated.
PMI values range from negative to positive infinity. But negative PMI values
(which imply things are co-occurring less often than we would expect by chance)
tend to be unreliable unless our corpora are enormous. To distinguish whether
two words whose individual probability is each 10−6 occur together less often than
chance, we would need to be certain that the probability of the two occurring to-
gether is significantly different than 10−12 , and this kind of granularity would require
an enormous corpus. Furthermore it’s not clear whether it’s even possible to evalu-
ate such scores of ‘unrelatedness’ with human judgments. For this reason it is more
common to use Positive PMI (called PPMI) which replaces all negative PMI values
with zero (Church and Hanks 1989, Dagan et al. 1993, Niwa and Nitta 1994)5 :

\[ \mathrm{PPMI}(w, c) = \max\left(\log_2 \frac{P(w, c)}{P(w)P(c)},\, 0\right) \tag{6.18} \]

More formally, let’s assume we have a co-occurrence matrix F with W rows (words)
and C columns (contexts), where fi j gives the number of times word wi occurs in
4 PMI is based on the mutual information between two random variables X and Y , defined as:
\[ I(X, Y) = \sum_{x} \sum_{y} P(x, y) \log_2 \frac{P(x, y)}{P(x)P(y)} \tag{6.15} \]
In a confusion of terminology, Fano used the phrase mutual information to refer to what we now call
pointwise mutual information and the phrase expectation of the mutual information for what we now call
mutual information.
5 Positive PMI also cleanly solves the problem of what to do with zero counts, using 0 to replace the
−∞ from log(0).

context c j . This can be turned into a PPMI matrix where ppmii j gives the PPMI
value of word wi with context c j as follows:
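\[ p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}, \qquad p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}, \qquad p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \tag{6.19} \]

\[ \mathrm{PPMI}_{ij} = \max\left(\log_2 \frac{p_{ij}}{p_{i*}\, p_{*j}},\, 0\right) \tag{6.20} \]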

Let’s see some PPMI calculations. We’ll use Fig. 6.10, which repeats Fig. 6.6 plus
all the count marginals, and let’s pretend for ease of calculation that these are the
only words/contexts that matter.

computer data result pie sugar count(w)


cherry 2 8 9 442 25 486
strawberry 0 0 1 60 19 80
digital 1670 1683 85 5 4 3447
information 3325 3982 378 5 13 7703

count(context) 4997 5673 473 512 61 11716


Figure 6.10 Co-occurrence counts for four words in 5 contexts in the Wikipedia corpus,
together with the marginals, pretending for the purpose of this calculation that no other word-
s/contexts matter.

Thus for example we could compute PPMI(w=information,c=data), assuming


we pretended that Fig. 6.6 encompassed all the relevant word contexts/dimensions,
as follows:
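\[ P(w{=}\text{information}, c{=}\text{data}) = \frac{3982}{11716} = .3399 \]
\[ P(w{=}\text{information}) = \frac{7703}{11716} = .6575 \]
\[ P(c{=}\text{data}) = \frac{5673}{11716} = .4842 \]
\[ \mathrm{ppmi}(\text{information}, \text{data}) = \log_2 \frac{.3399}{.6575 \times .4842} = .0944 \]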

Fig. 6.11 shows the joint probabilities computed from the counts in Fig. 6.10, and
Fig. 6.12 shows the PPMI values. Not surprisingly, cherry and strawberry are highly
associated with both pie and sugar, and data is mildly associated with information.
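The whole PPMI computation of Eq. 6.19 and Eq. 6.20 can be sketched in plain Python from the counts in Fig. 6.10. The function name ppmi_matrix and the list-of-lists layout are assumptions made for illustration; zero counts are mapped directly to 0, in line with the footnote on zero counts.

```python
import math

words = ["cherry", "strawberry", "digital", "information"]
contexts = ["computer", "data", "result", "pie", "sugar"]
counts = [  # co-occurrence counts from Fig. 6.10 (rows: words, columns: contexts)
    [2, 8, 9, 442, 25],
    [0, 0, 1, 60, 19],
    [1670, 1683, 85, 5, 4],
    [3325, 3982, 378, 5, 13],
]

def ppmi_matrix(f):
    """Return the PPMI matrix (Eq. 6.20) for a count matrix f."""
    total = sum(sum(row) for row in f)
    p_word = [sum(row) / total for row in f]                     # p_{i*}
    p_ctx = [sum(row[j] for row in f) / total
             for j in range(len(f[0]))]                          # p_{*j}
    result = []
    for i, row in enumerate(f):
        out_row = []
        for j, count in enumerate(row):
            if count == 0:
                out_row.append(0.0)  # log(0) would be -inf; PPMI clips it to 0
            else:
                pmi = math.log2((count / total) / (p_word[i] * p_ctx[j]))
                out_row.append(max(pmi, 0.0))
        result.append(out_row)
    return result

ppmi = ppmi_matrix(counts)
for word, row in zip(words, ppmi):
    print(f"{word:12s}", [round(v, 2) for v in row])
```

Under these assumptions the entry for (information, data) comes out near 0.09 and the entry for (cherry, pie) near 4.38.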

p(w,context) p(w)
computer data result pie sugar p(w)
cherry 0.0002 0.0007 0.0008 0.0377 0.0021 0.0415
strawberry 0.0000 0.0000 0.0001 0.0051 0.0016 0.0068
digital 0.1425 0.1436 0.0073 0.0004 0.0003 0.2942
information 0.2838 0.3399 0.0323 0.0004 0.0011 0.6575

p(context) 0.4265 0.4842 0.0404 0.0437 0.0052


Figure 6.11 Replacing the counts in Fig. 6.6 with joint probabilities, showing the marginals
around the outside.

PMI has the problem of being biased toward infrequent events; very rare words
tend to have very high PMI values. One way to reduce this bias toward low frequency
6.7 • A PPLICATIONS OF THE TF - IDF OR PPMI VECTOR MODELS 111

computer data result pie sugar


cherry 0 0 0 4.38 3.30
strawberry 0 0 0 4.10 5.51
digital 0.18 0.01 0 0 0
information 0.02 0.09 0.28 0 0
Figure 6.12 The PPMI matrix showing the association between words and context words,
computed from the counts in Fig. 6.11. Note that most of the 0 PPMI values are ones that had
a negative PMI; for example PMI(cherry,computer) = -6.7, meaning that cherry and computer
co-occur on Wikipedia less often than we would expect by chance, and with PPMI we replace
negative values by zero.

events is to slightly change the computation for P(c), using a different function Pα (c)
that raises the probability of the context word to the power of α:

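\[ \mathrm{PPMI}_{\alpha}(w, c) = \max\left(\log_2 \frac{P(w, c)}{P(w)P_{\alpha}(c)},\, 0\right) \tag{6.21} \]
\[ P_{\alpha}(c) = \frac{\mathrm{count}(c)^{\alpha}}{\sum_{c} \mathrm{count}(c)^{\alpha}} \tag{6.22} \]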

Levy et al. (2015) found that a setting of α = 0.75 improved performance of


embeddings on a wide range of tasks (drawing on a similar weighting used for skip-
grams described below in Eq. 6.32). This works because raising the count to α =
0.75 increases the probability assigned to rare contexts, and hence lowers their PMI
(Pα (c) > P(c) when c is rare).
Another possible solution is Laplace smoothing: Before computing PMI, a small
constant k (values of 0.1-3 are common) is added to each of the counts, shrinking
(discounting) all the non-zero values. The larger the k, the more the non-zero counts
are discounted.
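To see the effect of Eq. 6.22, the sketch below applies α = 0.75 to the context marginals of Fig. 6.10. The helper name smoothed_context_probs is hypothetical, and only the α weighting is shown, not add-k smoothing.

```python
def smoothed_context_probs(context_counts, alpha=0.75):
    """P_alpha(c) of Eq. 6.22: counts raised to alpha, then renormalized."""
    powered = [count ** alpha for count in context_counts]
    normalizer = sum(powered)
    return [value / normalizer for value in powered]

contexts = ["computer", "data", "result", "pie", "sugar"]
context_counts = [4997, 5673, 473, 512, 61]  # count(context) row of Fig. 6.10

total = sum(context_counts)
p_raw = [count / total for count in context_counts]
p_alpha = smoothed_context_probs(context_counts)

for name, raw, smooth in zip(contexts, p_raw, p_alpha):
    print(f"{name:10s} P(c)={raw:.4f}  P_alpha(c)={smooth:.4f}")
```

The rare context sugar gains probability mass and the frequent context data loses some, so the PMI of pairs involving rare contexts is lowered.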

6.7 Applications of the tf-idf or PPMI vector models


In summary, the vector semantics model we’ve described so far represents a target
word as a vector with dimensions corresponding either to the documents in a large
collection (the term-document matrix) or to the counts of words in some neighboring
window (the term-term matrix). The values in each dimension are counts, weighted
by tf-idf (for term-document matrices) or PPMI (for term-term matrices), and the
vectors are sparse (since most values are zero).
The model computes the similarity between two words x and y by taking the
cosine of their tf-idf or PPMI vectors; high cosine, high similarity. This entire model
is sometimes referred to as the tf-idf model or the PPMI model, after the weighting
function.
The tf-idf model of meaning is often used for document functions like deciding
if two documents are similar. We represent a document by taking the vectors of
all the words in the document, and computing the centroid of all those vectors.
The centroid is the multidimensional version of the mean; the centroid of a set of
vectors is a single vector that has the minimum sum of squared distances to each of
the vectors in the set. Given k word vectors w1 , w2 , ..., wk , the centroid document
vector d is:
\[ d = \frac{w_1 + w_2 + \dots + w_k}{k} \tag{6.23} \]

Given two documents, we can then compute their document vectors d1 and d2 , and
estimate the similarity between the two documents by cos(d1 , d2 ). Document similarity
is also useful for all sorts of applications: information retrieval, plagiarism
detection, news recommender systems, and even for digital humanities tasks like
comparing different versions of a text to see which are similar to each other.
Either the PPMI model or the tf-idf model can be used to compute word simi-
larity, for tasks like finding word paraphrases, tracking changes in word meaning, or
automatically discovering meanings of words in different corpora. For example, we
can find the 10 most similar words to any target word w by computing the cosines
between w and each of the V − 1 other words, sorting, and looking at the top 10.
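The centroid of Eq. 6.23, cosine similarity, and the nearest-neighbor search just described might be sketched as follows. The toy vectors are the PPMI rows of Fig. 6.12; the function names and the dictionary layout are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Centroid of Eq. 6.23: the component-wise mean of the vectors."""
    k = len(vectors)
    return [sum(column) / k for column in zip(*vectors)]

def most_similar(target, vectors, n=10):
    """The n words whose vectors have the highest cosine with the target's."""
    scores = [(cosine(vectors[target], vec), word)
              for word, vec in vectors.items() if word != target]
    return [word for _, word in sorted(scores, reverse=True)[:n]]

# PPMI rows from Fig. 6.12, used here as toy word vectors.
ppmi_vectors = {
    "cherry":      [0, 0, 0, 4.38, 3.30],
    "strawberry":  [0, 0, 0, 4.10, 5.51],
    "digital":     [0.18, 0.01, 0, 0, 0],
    "information": [0.02, 0.09, 0.28, 0, 0],
}

nearest = most_similar("cherry", ppmi_vectors, n=1)
d1 = centroid([ppmi_vectors["cherry"], ppmi_vectors["strawberry"]])
d2 = centroid([ppmi_vectors["digital"], ppmi_vectors["information"]])
print(nearest, cosine(d1, d2))
```

Here the two toy "documents" share no non-zero dimensions, so their cosine is zero.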

6.8 Word2vec
In the previous sections we saw how to represent a word as a sparse, long vector with
dimensions corresponding to words in the vocabulary or documents in a collection.
We now introduce a more powerful word representation: embeddings, short dense
vectors. Unlike the vectors we’ve seen so far, embeddings are short, with number
of dimensions d ranging from 50-1000, rather than the much larger vocabulary size
|V | or number of documents D we’ve seen. These d dimensions don’t have a clear
interpretation. And the vectors are dense: instead of vector entries being sparse,
mostly-zero counts or functions of counts, the values will be real-valued numbers
that can be negative.
It turns out that dense vectors work better in every NLP task than sparse vectors.
While we don’t completely understand all the reasons for this, we have some intu-
itions. Representing words as 300-dimensional dense vectors requires our classifiers
to learn far fewer weights than if we represented words as 50,000-dimensional vec-
tors, and the smaller parameter space possibly helps with generalization and avoid-
ing overfitting. Dense vectors may also do a better job of capturing synonymy.
For example, in a sparse vector representation, dimensions for synonyms like car
and automobile are distinct and unrelated; sparse vectors may thus fail
to capture the similarity between a word with car as a neighbor and a word with
automobile as a neighbor.
In this section we introduce one method for computing embeddings: skip-gram
with negative sampling, sometimes called SGNS. The skip-gram algorithm is one
of two algorithms in a software package called word2vec, and so sometimes the
algorithm is loosely referred to as word2vec (Mikolov et al. 2013, Mikolov et al. 2013a).
The word2vec methods are fast, efficient to train, and easily available online with
code and pretrained embeddings. Word2vec embeddings are static embeddings,
meaning that the method learns one fixed embedding for each word in the vocabu-
lary. In Chapter 10 we’ll introduce methods for learning dynamic contextual em-
beddings like the popular BERT or ELMO representations, in which the vector for
each word is different in different contexts.
The intuition of word2vec is that instead of counting how often each word w oc-
curs near, say, apricot, we’ll instead train a classifier on a binary prediction task: “Is
word w likely to show up near apricot?” We don’t actually care about this prediction
task; instead we’ll take the learned classifier weights as the word embeddings.
The revolutionary intuition here is that we can just use running text as implicitly
supervised training data for such a classifier; a word c that occurs near the target
word apricot acts as gold ‘correct answer’ to the question “Is word c likely to show
6.8 • W ORD 2 VEC 113

up near apricot?” This method, often called self-supervision, avoids the need for
any sort of hand-labeled supervision signal. This idea was first proposed in the task
of neural language modeling, when Bengio et al. (2003) and Collobert et al. (2011)
showed that a neural language model (a neural network that learned to predict the
next word from prior words) could just use the next word in running text as its
supervision signal, and could be used to learn an embedding representation for each
word as part of doing this prediction task.
We’ll see how to do neural networks in the next chapter, but word2vec is a
much simpler model than the neural network language model, in two ways. First,
word2vec simplifies the task (making it binary classification instead of word pre-
diction). Second, word2vec simplifies the architecture (training a logistic regression
classifier instead of a multi-layer neural network with hidden layers that demand
more sophisticated training algorithms). The intuition of skip-gram is:
1. Treat the target word and a neighboring context word as positive examples.
2. Randomly sample other words in the lexicon to get negative samples.
3. Use logistic regression to train a classifier to distinguish those two cases.
4. Use the learned weights as the embeddings.

6.8.1 The classifier


Let’s start by thinking about the classification task, and then turn to how to train.
Imagine a sentence like the following, with a target word apricot, and assume we’re
using a window of ±2 context words:
... lemon, a [tablespoon of apricot jam, a] pinch ...
c1 c2 w c3 c4
Our goal is to train a classifier such that, given a tuple (w, c) of a target word
w paired with a candidate context word c (for example (apricot, jam), or perhaps
(apricot, aardvark)) it will return the probability that c is a real context word (true
for jam, false for aardvark):

P(+|w, c) (6.24)
The probability that word c is not a real context word for w is just 1 minus
Eq. 6.24:

P(−|w, c) = 1 − P(+|w, c) (6.25)


How does the classifier compute the probability P? The intuition of the skip-
gram model is to base this probability on embedding similarity: a word is likely to
occur near the target if its embedding is similar to the target embedding. To compute
similarity between these dense embeddings, we rely on the intuition that two vectors
are similar if they have a high dot product (after all, cosine is just a normalized dot
product). In other words:
Similarity(w, c) ≈ c · w (6.26)

The dot product c · w is not a probability, it’s just a number ranging from −∞ to ∞
(since the elements in word2vec embeddings can be negative, the dot product can be
negative). To turn the dot product into a probability, we’ll use the logistic or sigmoid
function σ (x), the fundamental core of logistic regression:
\[ \sigma(x) = \frac{1}{1 + \exp(-x)} \tag{6.27} \]

We model the probability that word c is a real context word for target word w as:
\[ P(+|w, c) = \sigma(c \cdot w) = \frac{1}{1 + \exp(-c \cdot w)} \tag{6.28} \]
The sigmoid function returns a number between 0 and 1, but to make it a probability
we’ll also need the total probability of the two possible events (c is a context word,
and c isn’t a context word) to sum to 1. We thus estimate the probability that word c
is not a real context word for w as:

\[ P(-|w, c) = 1 - P(+|w, c) = \sigma(-c \cdot w) = \frac{1}{1 + \exp(c \cdot w)} \tag{6.29} \]
Equation 6.28 gives us the probability for one word, but there are many context
words in the window. Skip-gram makes the simplifying assumption that all context
words are independent, allowing us to just multiply their probabilities:
\[ P(+|w, c_{1:L}) = \prod_{i=1}^{L} \sigma(c_i \cdot w) \tag{6.30} \]
\[ \log P(+|w, c_{1:L}) = \sum_{i=1}^{L} \log \sigma(c_i \cdot w) \tag{6.31} \]

In summary, skip-gram trains a probabilistic classifier that, given a test target word
w and its context window of L words c1:L , assigns a probability based on how similar
this context window is to the target word. The probability is based on applying the
logistic (sigmoid) function to the dot product of the embeddings of the target word
with each context word. To compute this probability, we just need embeddings for
each target word and context word in the vocabulary.
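The classifier of Eq. 6.28, Eq. 6.29 and Eq. 6.31 is small enough to sketch directly. The function names and the 3-dimensional embeddings below are made up for illustration; real embeddings would be learned.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function of Eq. 6.27, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def dot(u, v) -> float:
    return sum(a * b for a, b in zip(u, v))

def p_positive(w, c) -> float:
    """P(+|w, c) of Eq. 6.28: probability that c is a real context word of w."""
    return sigmoid(dot(c, w))

def p_negative(w, c) -> float:
    """P(-|w, c) of Eq. 6.29."""
    return sigmoid(-dot(c, w))

def log_p_window(w, context_vectors) -> float:
    """log P(+|w, c_1:L) of Eq. 6.31, assuming independent context words."""
    return sum(math.log(p_positive(w, c)) for c in context_vectors)

# Hypothetical embeddings, for illustration only.
w_apricot = [0.5, -0.2, 0.1]
c_jam = [0.4, -0.1, 0.3]
c_aardvark = [-0.6, 0.3, -0.2]

print(p_positive(w_apricot, c_jam), p_positive(w_apricot, c_aardvark))
print(log_p_window(w_apricot, [c_jam, c_jam]))
```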

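[Figure 6.13 diagram: the parameter matrix θ, with rows 1 to |V| (aardvark, apricot, ..., zebra) holding the target embeddings W, and rows |V|+1 to 2|V| (aardvark, apricot, ..., zebra) holding the context and noise embeddings C; each row has dimensions 1..d.]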

Figure 6.13 The embeddings learned by the skipgram model. The algorithm stores two
embeddings for each word, the target embedding (sometimes called the input embedding)
and the context embedding (sometimes called the output embedding). The parameter θ that
the algorithm learns is thus a matrix of 2|V | vectors, each of dimension d, formed by concate-
nating two matrices, the target embeddings W and the context+noise embeddings C.

Fig. 6.13 shows the intuition of the parameters we’ll need. Skip-gram actually
stores two embeddings for each word, one for the word as a target, and one for the

word considered as context. Thus the parameters we need to learn are two matrices
W and C, each containing an embedding for every one of the |V | words in the vo-
cabulary V .6 Let’s now turn to learning these embeddings (which is the real goal of
training this classifier in the first place).

6.8.2 Learning skip-gram embeddings


Skip-gram learns embeddings by starting with random embedding vectors and then
iteratively shifting the embedding of each word w to be more like the embeddings
of words that occur nearby in texts, and less like the embeddings of words that don’t
occur nearby. Let’s start by considering a single piece of training data:
... lemon, a [tablespoon of apricot jam, a] pinch ...
c1 c2 w c3 c4
This example has a target word w (apricot), and 4 context words in the L = ±2
window, resulting in 4 positive training instances (on the left below):
positive examples + negative examples -
w cpos w cneg w cneg
apricot tablespoon apricot aardvark apricot seven
apricot of apricot my apricot forever
apricot jam apricot where apricot dear
apricot a apricot coaxial apricot if
For training a binary classifier we also need negative examples. In fact skip-
gram with negative sampling (SGNS) uses more negative examples than positive
examples (with the ratio between them set by a parameter k). So for each of these
(w, cpos ) training instances we’ll create k negative samples, each consisting of the
target w plus a ‘noise word’ cneg . A noise word is a random word from the lexicon,
constrained not to be the target word w. The right above shows the setting where
k = 2, so we’ll have 2 negative examples in the negative training set − for each
positive example w, cpos .
The noise words are chosen according to their weighted unigram frequency
pα (w), where α is a weight. If we were sampling according to unweighted fre-
quency p(w), it would mean that with unigram probability p(“the”) we would choose
the word the as a noise word, with unigram probability p(“aardvark”) we would
choose aardvark, and so on. But in practice it is common to set α = .75, i.e. use the
weighting \( p^{3/4}(w) \):
\[ P_{\alpha}(w) = \frac{\mathrm{count}(w)^{\alpha}}{\sum_{w'} \mathrm{count}(w')^{\alpha}} \tag{6.32} \]

Setting α = .75 gives better performance because it gives rare noise words slightly
higher probability: for rare words, Pα (w) > P(w). To illustrate this intuition, it
might help to work out the probabilities for an example with two events, P(a) = .99
and P(b) = .01:

\[ P_{\alpha}(a) = \frac{.99^{.75}}{.99^{.75} + .01^{.75}} = .97 \]
\[ P_{\alpha}(b) = \frac{.01^{.75}}{.99^{.75} + .01^{.75}} = .03 \tag{6.33} \]
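The weighting of Eq. 6.32 and the drawing of k noise words can be sketched with the standard library. The function names are hypothetical, and this is only one plausible way to implement the sampling; the two-event example reuses the probabilities .99 and .01 in place of counts, which gives the same result after normalization.

```python
import random

def weighted_unigram(counts, alpha=0.75):
    """P_alpha(w) of Eq. 6.32: counts raised to alpha, then renormalized."""
    powered = {word: count ** alpha for word, count in counts.items()}
    normalizer = sum(powered.values())
    return {word: value / normalizer for word, value in powered.items()}

def sample_noise_words(target, probs, k, rng):
    """Draw k noise words from probs, never returning the target word itself."""
    candidates = [word for word in probs if word != target]
    weights = [probs[word] for word in candidates]
    return rng.choices(candidates, weights=weights, k=k)

# The two-event example of Eq. 6.33.
probs = weighted_unigram({"a": 0.99, "b": 0.01})
print({word: round(p, 2) for word, p in probs.items()})

# With target "a", the only possible noise word in this toy lexicon is "b".
noise = sample_noise_words("a", probs, k=2, rng=random.Random(0))
print(noise)
```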
6 In principle the target matrix and the context matrix could use different vocabularies, but we’ll simplify
by assuming one shared vocabulary V .

Given the set of positive and negative training instances, and an initial set of embed-
dings, the goal of the learning algorithm is to adjust those embeddings to
• Maximize the similarity of the target word, context word pairs (w, cpos ) drawn
from the positive examples
• Minimize the similarity of the (w, cneg ) pairs from the negative examples.
If we consider one word/context pair (w, cpos ) with its k noise words cneg1 ...cnegk ,
we can express these two goals as the following loss function L to be minimized
(hence the −); here the first term expresses that we want the classifier to assign the
real context word cpos a high probability of being a neighbor, and the second term
expresses that we want to assign each of the noise words cnegi a high probability of
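being a non-neighbor, all multiplied because we assume independence:
\[
\begin{aligned}
L_{CE} &= -\log\left[ P(+|w, c_{pos}) \prod_{i=1}^{k} P(-|w, c_{neg_i}) \right] \\
&= -\left[ \log P(+|w, c_{pos}) + \sum_{i=1}^{k} \log P(-|w, c_{neg_i}) \right] \\
&= -\left[ \log P(+|w, c_{pos}) + \sum_{i=1}^{k} \log\left(1 - P(+|w, c_{neg_i})\right) \right] \\
&= -\left[ \log \sigma(c_{pos} \cdot w) + \sum_{i=1}^{k} \log \sigma(-c_{neg_i} \cdot w) \right]
\end{aligned}
\tag{6.34}
\]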
That is, we want to maximize the dot product of the word with the actual context
words, and minimize the dot products of the word with the k negative sampled non-
neighbor words.
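The last line of Eq. 6.34 translates almost directly into code. The sketch below uses made-up 2-dimensional embeddings, and sgns_loss is an illustrative name rather than anything from the word2vec package.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def dot(u, v) -> float:
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(w, c_pos, c_negs) -> float:
    """L_CE of Eq. 6.34 for one (w, c_pos) pair and its k noise words."""
    loss = -math.log(sigmoid(dot(c_pos, w)))
    for c_neg in c_negs:
        loss -= math.log(sigmoid(-dot(c_neg, w)))
    return loss

# Hypothetical embeddings: a good fit (w near c_pos, away from the noise words)
# should give a lower loss than an uninformative all-zero initialization.
good = sgns_loss([2.0, 0.0], [2.0, 0.0], [[-2.0, 0.0], [-2.0, 0.0]])
flat = sgns_loss([0.0, 0.0], [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
print(good, flat)
```

With all-zero vectors every probability is 0.5, so the loss is 3 log 2 for k = 2.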
We minimize this loss function using stochastic gradient descent. Fig. 6.14
shows the intuition of one step of learning.

[Figure 6.14 diagram: for the training text “...apricot jam...” with k = 2, the target embedding w for apricot (a row of W) is moved closer to the context embedding cpos for jam (increasing cpos · w), and apart from the noise-word embeddings cneg1 for matrix and cneg2 for Tolstoy in C (decreasing cneg1 · w and cneg2 · w).]

Figure 6.14 Intuition of one step of gradient descent. The skip-gram model tries to shift
embeddings so the target embeddings (here for apricot) are closer to (have a higher dot prod-
uct with) context embeddings for nearby words (here jam) and further from (lower dot product
with) context embeddings for noise words that don’t occur nearby (here Tolstoy and matrix).

To get the gradient, we need to take the derivative of Eq. 6.34 with respect to
the different embeddings. It turns out the derivatives are the following (we leave the

proof as an exercise at the end of the chapter):

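\[ \frac{\partial L_{CE}}{\partial c_{pos}} = [\sigma(c_{pos} \cdot w) - 1]\, w \tag{6.35} \]
\[ \frac{\partial L_{CE}}{\partial c_{neg}} = [\sigma(c_{neg} \cdot w)]\, w \tag{6.36} \]
\[ \frac{\partial L_{CE}}{\partial w} = [\sigma(c_{pos} \cdot w) - 1]\, c_{pos} + \sum_{i=1}^{k} [\sigma(c_{neg_i} \cdot w)]\, c_{neg_i} \tag{6.37} \]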

The update equations going from time step t to t + 1 in stochastic gradient descent
are thus:

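\[ c_{pos}^{t+1} = c_{pos}^{t} - \eta\, [\sigma(c_{pos}^{t} \cdot w^{t}) - 1]\, w^{t} \tag{6.38} \]
\[ c_{neg}^{t+1} = c_{neg}^{t} - \eta\, [\sigma(c_{neg}^{t} \cdot w^{t})]\, w^{t} \tag{6.39} \]
\[ w^{t+1} = w^{t} - \eta \left[ [\sigma(c_{pos}^{t} \cdot w^{t}) - 1]\, c_{pos}^{t} + \sum_{i=1}^{k} [\sigma(c_{neg_i}^{t} \cdot w^{t})]\, c_{neg_i}^{t} \right] \tag{6.40} \]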

Just as in logistic regression, then, the learning algorithm starts with randomly initialized
W and C matrices, and then walks through the training corpus using gradient
descent to move W and C so as to minimize the loss in Eq. 6.34 by making the
updates in Eq. 6.38-Eq. 6.40.
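One stochastic gradient step on a single training instance might look like the sketch below, which uses the gradients of the loss with respect to cpos, each cneg, and w. The function name sgd_step, the learning rate, and the tiny 2-dimensional embeddings are assumptions for illustration; all gradients are computed from the old parameter values before anything is updated.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def dot(u, v) -> float:
    return sum(a * b for a, b in zip(u, v))

def sgd_step(w, c_pos, c_negs, eta=0.1):
    """Return updated (w, c_pos, c_negs) after one gradient descent step."""
    g_pos = sigmoid(dot(c_pos, w)) - 1.0            # scalar factor for c_pos
    g_negs = [sigmoid(dot(c, w)) for c in c_negs]   # scalar factor per noise word

    # Gradient with respect to w, using the old context vectors.
    grad_w = [g_pos * x for x in c_pos]
    for g, c in zip(g_negs, c_negs):
        grad_w = [gw + g * x for gw, x in zip(grad_w, c)]

    new_c_pos = [x - eta * g_pos * wi for x, wi in zip(c_pos, w)]
    new_c_negs = [[x - eta * g * wi for x, wi in zip(c, w)]
                  for g, c in zip(g_negs, c_negs)]
    new_w = [wi - eta * gw for wi, gw in zip(w, grad_w)]
    return new_w, new_c_pos, new_c_negs

# Hypothetical starting embeddings for apricot, jam, and two noise words.
w = [0.5, -0.2]
c_pos = [0.1, 0.3]
c_negs = [[0.4, 0.1], [-0.3, 0.2]]
new_w, new_c_pos, new_c_negs = sgd_step(w, c_pos, c_negs)
print(dot(c_pos, w), dot(new_c_pos, new_w))
```

After the step the dot product between the target and its real context word should have increased, which is the behavior pictured in Fig. 6.14.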
Recall that the skip-gram model learns two separate embeddings for each word i:
the target embedding wi and the context embedding ci , stored in two matrices, the
target matrix W and the context matrix C. It’s common to just add them together,
representing word i with the vector wi + ci . Alternatively we can throw away the C
matrix and just represent each word i by the vector wi .
As with the simple count-based methods like tf-idf, the context window size L
affects the performance of skip-gram embeddings, and experiments often tune the
parameter L on a devset.

6.8.3 Other kinds of static embeddings


There are many kinds of static embeddings. An extension of word2vec, fasttext
(Bojanowski et al., 2017), deals with unknown words and sparsity in languages with
rich morphology, by using subword models. Each word in fasttext is represented as
itself plus a bag of constituent n-grams, with special boundary symbols < and >
added to each word. For example, with n = 3 the word where would be represented
by the sequence <where> plus the character n-grams:
<wh, whe, her, ere, re>
Then a skipgram embedding is learned for each constituent n-gram, and the word
where is represented by the sum of all of the embeddings of its constituent n-grams.
A fasttext open-source library, including pretrained embeddings for 157 languages,
is available at https://fasttext.cc.
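The subword decomposition described above can be sketched in a few lines. This is not the fasttext library's API; char_ngrams and word_vector are hypothetical helpers, and the embedding table passed to word_vector is assumed to hold one vector for the bracketed whole word and one for each n-gram.

```python
def char_ngrams(word: str, n: int = 3):
    """Character n-grams of a word padded with the boundary symbols < and >."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def word_vector(word: str, subword_vectors: dict, n: int = 3):
    """Sum the embeddings of the whole word and of its constituent n-grams.

    subword_vectors is a hypothetical lookup table; units missing from it
    are simply skipped in this sketch.
    """
    units = [f"<{word}>"] + char_ngrams(word, n)
    known = [subword_vectors[u] for u in units if u in subword_vectors]
    return [sum(column) for column in zip(*known)]

print(char_ngrams("where"))
```

Because unknown words still share n-grams with known words, a vector can be built for them from the n-grams alone.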
The most widely used static embedding model besides word2vec is GloVe (Pen-
nington et al., 2014), short for Global Vectors, because the model is based on cap-
turing global corpus statistics. GloVe is based on ratios of probabilities from the
word-word co-occurrence matrix, combining the intuitions of count-based models
like PPMI while also capturing the linear structures used by methods like word2vec.
It turns out that dense embeddings like word2vec actually have an elegant mathematical
relationship with sparse embeddings like PPMI, in which word2vec can
be seen as implicitly optimizing a shifted version of a PPMI matrix (Levy and Goldberg,
2014c).

6.9 Visualizing Embeddings

“I see well in many dimensions as long as the dimensions are around two.”
The late economist Martin Shubik

Visualizing embeddings is an important goal in helping understand, apply, and
improve these models of word meaning. But how can we visualize a (for example)
100-dimensional vector?
The simplest way to visualize the meaning of a word w embedded in a space is to
list the most similar words to w by sorting the vectors for all words in the
vocabulary by their cosine with the vector for w. For example the 7 closest
words to frog using the GloVe embeddings are: frogs, toad, litoria,
leptodactylidae, rana, lizard, and eleutherodactylus (Pennington et al., 2014).
Yet another visualization method is to use a clustering algorithm to show a
hierarchical representation of which words are similar to others in the
embedding space. The uncaptioned figure on the left uses hierarchical clustering
of some embedding vectors for nouns as a visualization method (Rohde et al., 2006).
[Figure (Rohde et al., 2006): multidimensional scaling and hierarchical clustering, using distances based on vector correlations, for three noun classes: body parts (WRIST, ANKLE, HAND, ...), animals (DOG, CAT, LION, ...), and place names (CHICAGO, TOKYO, FRANCE, ...).]
Probably the most common visualization method, however, is to project the 100
dimensions of a word down into 2 dimensions. Fig. 6.1 showed one such
visualization, as does Fig. 6.16, using a projection method called t-SNE (van der
Maaten and Hinton, 2008).

6.10 Semantic properties of embeddings
In this section we briefly summarize some of the semantic properties of embeddings
that have been studied.
Different types of similarity or association: One parameter of vector semantic
models that is relevant to both sparse tf-idf vectors and dense word2vec vectors is
the size of the context window used to collect counts. This is generally between 1
and 10 words on each side of the target word (for a total context of 2-20 words).
The choice depends on the goals of the representation. Shorter context windows
tend to lead to representations that are a bit more syntactic, since the information is
coming from immediately nearby words. When the vectors are computed from short
context windows, the most similar words to a target word w tend to be semantically
similar words with the same parts of speech. When vectors are computed from long
context windows, the highest cosine words to a target word w tend to be words that
are topically related but not similar.
6.10 • S EMANTIC PROPERTIES OF EMBEDDINGS 119

For example Levy and Goldberg (2014a) showed that using skip-gram with a
window of ±2, the most similar words to the word Hogwarts (from the Harry Potter
series) were names of other fictional schools: Sunnydale (from Buffy the Vampire
Slayer) or Evernight (from a vampire series). With a window of ±5, the most similar
words to Hogwarts were other words topically related to the Harry Potter series:
Dumbledore, Malfoy, and half-blood.
It’s also often useful to distinguish two kinds of similarity or association between
words (Schütze and Pedersen, 1993). Two words have first-order co-occurrence
(sometimes called syntagmatic association) if they are typically nearby each other.
Thus wrote is a first-order associate of book or poem. Two words have second-order
co-occurrence (sometimes called paradigmatic association) if they have similar
neighbors. Thus wrote is a second-order associate of words like said or remarked.
Analogy/Relational Similarity: Another semantic property of embeddings is their
ability to capture relational meanings. In an important early vector space model of
cognition, Rumelhart and Abrahamson (1973) proposed the parallelogram model
for solving simple analogy problems of the form a is to b as a* is to what?. In such
problems, a system is given a problem like apple:tree::grape:?, i.e., apple is to tree as
grape is to ___, and must fill in the word vine. In the parallelogram model, illustrated
in Fig. 6.15, the vector from the word apple to the word tree
(\( = \overrightarrow{\text{tree}} - \overrightarrow{\text{apple}} \)) is added to the vector for grape
(\( \overrightarrow{\text{grape}} \)); the nearest word to that point is returned.

Figure 6.15 The parallelogram model for analogy problems (Rumelhart and Abrahamson,
1973): the location of $\vec{\text{vine}}$ can be found by subtracting $\vec{\text{apple}}$ from $\vec{\text{tree}}$ and adding $\vec{\text{grape}}$.

In early work with sparse embeddings, scholars showed that sparse vector models
of meaning could solve such analogy problems (Turney and Littman, 2005), but
the parallelogram method received more modern attention because of its success
with word2vec or GloVe vectors (Mikolov et al. 2013b, Levy and Goldberg 2014b,
Pennington et al. 2014). For example, the result of the expression
$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}}$ is a vector close to $\vec{\text{queen}}$.
Similarly, $\vec{\text{Paris}} - \vec{\text{France}} + \vec{\text{Italy}}$ results in a
vector that is close to $\vec{\text{Rome}}$. The embedding model thus seems to be extracting
representations of relations like MALE-FEMALE, or CAPITAL-CITY-OF, or even
COMPARATIVE/SUPERLATIVE, as shown in Fig. 6.16 from GloVe.
For an a:b::a*:b* problem, meaning the algorithm is given a, b, and a* and must
find b*, the parallelogram method is thus:
$$\hat{b}^{*} = \operatorname*{argmin}_{x} \; \text{distance}(x,\; a^{*} - a + b) \tag{6.41}$$

with the distance function defined either as cosine distance or as Euclidean distance.


Figure 6.16 Relational properties of the GloVe vector space, shown by projecting vectors onto two
dimensions. (a) $\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}}$ is close to $\vec{\text{queen}}$. (b) offsets seem to capture
comparative and superlative morphology (Pennington et al., 2014).

There are some caveats. For example, the closest value returned by the parallelogram
algorithm in word2vec or GloVe embedding spaces is usually not in fact
b* but one of the 3 input words or their morphological variants (e.g., cherry:red ::
potato:x returns potato or potatoes instead of brown), so these must be explicitly
excluded. Furthermore while embedding spaces perform well if the task involves
frequent words, small distances, and certain relations (like relating countries with
their capitals or verbs/nouns with their inflected forms), the parallelogram method
with embeddings doesn’t work as well for other relations (Linzen 2016, Gladkova
et al. 2016, Ethayarajh et al. 2019a), and indeed Peterson et al. (2020) argue that the
parallelogram method is in general too simple to model the human cognitive process
of forming analogies of this kind.
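To make the parallelogram method concrete, here is a minimal sketch of Eq. 6.41 that also applies the caveat about excluding the three input words. It uses cosine similarity, so the nearest word is the one with the highest cosine rather than the lowest distance. The function name `solve_analogy` and the 2-dimensional vectors are hypothetical, invented so the arithmetic can be checked by hand; real embeddings have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity: the dot product normalized by the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, b, a_star, vectors):
    """Return the word nearest to b - a + a*, never returning an input word."""
    target = [vb - va + vs
              for va, vb, vs in zip(vectors[a], vectors[b], vectors[a_star])]
    candidates = [w for w in vectors if w not in {a, b, a_star}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy embeddings (assumption): the first dimension loosely
# encodes "royalty" and the second loosely encodes "gender".
toy_vectors = {
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "apple": [-0.8, 0.0],
}

print(solve_analogy("man", "king", "woman", toy_vectors))
```

The candidate filter is the explicit exclusion step; without it, an implementation over real embeddings would often return one of the query words themselves.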

6.10.1 Embeddings and Historical Semantics


Embeddings can also be a useful tool for studying how meaning changes over time,
by computing multiple embedding spaces, each from texts written in a particular
time period. For example Fig. 6.17 shows a visualization of changes in meaning in
English words over the last two centuries, computed by building separate embedding
spaces for each decade from historical corpora like Google N-grams (Lin et al.,
2012b) and the Corpus of Historical American English (Davies, 2012).

6.11 Bias and Embeddings


In addition to their ability to learn word meaning from text, embeddings, alas,
also reproduce the implicit biases and stereotypes that were latent in the text. As
the prior section just showed, embeddings can roughly model relational similarity:
‘queen’ as the closest word to ‘king’ - ‘man’ + ‘woman’ implies the analogy
man:woman::king:queen. But these same embedding analogies also exhibit gender
stereotypes. For example Bolukbasi et al. (2016) find that the closest occupation
to ‘computer programmer’ - ‘man’ + ‘woman’ in word2vec embeddings trained on
news text is ‘homemaker’, and that the embeddings similarly suggest the analogy
‘father’ is to ‘doctor’ as ‘mother’ is to ‘nurse’. This could result in what Crawford
(2017) and Blodgett et al. (2020) call an allocational harm, when a system allocates
resources (jobs or credit) unfairly to different groups. For example algorithms
that use embeddings as part of a search for hiring potential programmers or doctors
might thus incorrectly downweight documents with women’s names.

Figure 6.17 A t-SNE visualization of the semantic change of 3 words in English using
word2vec vectors. The modern sense of each word, and the grey context words, are computed
from the most recent (modern) time-point embedding space. Earlier points are computed
from earlier historical embedding spaces. The visualizations show the changes in the
word gay from meanings related to “cheerful” or “frolicsome” to referring to homosexuality,
the development of the modern “transmission” sense of broadcast from its original sense of
sowing seeds, and the pejoration of the word awful as it shifted from meaning “full of awe”
to meaning “terrible or appalling” (Hamilton et al., 2016b).
It turns out that embeddings don’t just reflect the statistics of their input, but
also amplify bias (bias amplification); gendered terms become more gendered in embedding space than
they were in the input text statistics (Zhao et al. 2017, Ethayarajh et al. 2019b, Jia
et al. 2020), and biases are more exaggerated than in actual labor employment statistics
(Garg et al., 2018).
Embeddings also encode the implicit associations that are a property of human
reasoning. The Implicit Association Test (Greenwald et al., 1998) measures people’s
associations between concepts (like ‘flowers’ or ‘insects’) and attributes (like
‘pleasantness’ and ‘unpleasantness’) by measuring differences in the latency with
which they label words in the various categories.7 Using such methods, people
in the United States have been shown to associate African-American names with
unpleasant words (more than European-American names), male names more with
mathematics and female names with the arts, and old people’s names with unpleasant
words (Greenwald et al. 1998, Nosek et al. 2002a, Nosek et al. 2002b). Caliskan
et al. (2017) replicated all these findings of implicit associations using GloVe vectors
and cosine similarity instead of human latencies. For example African-American
names like ‘Leroy’ and ‘Shaniqua’ had a higher GloVe cosine with unpleasant words
while European-American names (‘Brad’, ‘Greg’, ‘Courtney’) had a higher cosine
with pleasant words. These problems with embeddings are an example of a
representational harm (Crawford 2017, Blodgett et al. 2020), which is a harm caused by
a system demeaning or even ignoring some social groups. Any embedding-aware
algorithm that made use of word sentiment could thus exacerbate bias against African
Americans.
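The kind of measurement involved can be sketched with the flowers and insects example from the Implicit Association Test. The code below is only an illustration of comparing cosines against two attribute sets; it is not the exact test statistic of Caliskan et al. (2017), and the function names and 2-dimensional vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def mean_cosine(word, attribute_words, vectors):
    """Average cosine between one word and a set of attribute words."""
    total = sum(cosine(vectors[word], vectors[a]) for a in attribute_words)
    return total / len(attribute_words)

def association(word, pleasant, unpleasant, vectors):
    """Positive when `word` is, on average, closer to the pleasant set."""
    return (mean_cosine(word, pleasant, vectors)
            - mean_cosine(word, unpleasant, vectors))

# Hypothetical toy vectors (assumption), not taken from any trained model.
toy_vectors = {
    "daisy":    [0.9, 0.2],
    "flea":     [-0.8, 0.3],
    "love":     [0.8, 0.1],
    "laughter": [0.7, 0.3],
    "hatred":   [-0.9, 0.1],
    "abuse":    [-0.7, 0.2],
}
pleasant = ["love", "laughter"]
unpleasant = ["hatred", "abuse"]

for word in ("daisy", "flea"):
    print(word, round(association(word, pleasant, unpleasant, toy_vectors), 3))
```

The sign of the score plays the role that the latency difference plays in the human experiment: a positive value means the word sits closer to the pleasant attribute words in the space.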
Recent research focuses on ways to try to remove these kinds of biases, for example
by developing a transformation of the embedding space that removes gender
stereotypes but preserves definitional gender (Bolukbasi et al. 2016, Zhao et al. 2017)
or changing the training procedure (Zhao et al., 2018b). However, although these
sorts of debiasing may reduce bias in embeddings, they do not eliminate it (Gonen
and Goldberg, 2019), and this remains an open problem.

7 Roughly speaking, if humans associate ‘flowers’ with ‘pleasantness’ and ‘insects’ with ‘unpleasantness’,
when they are instructed to push a green button for ‘flowers’ (daisy, iris, lilac) and ‘pleasant words’
(love, laughter, pleasure) and a red button for ‘insects’ (flea, spider, mosquito) and ‘unpleasant words’
(abuse, hatred, ugly) they are faster than in an incongruous condition where they push a red button for
‘flowers’ and ‘unpleasant words’ and a green button for ‘insects’ and ‘pleasant words’.
Historical embeddings are also being used to measure biases in the past. Garg
et al. (2018) used embeddings from historical texts to measure the association be-
tween embeddings for occupations and embeddings for names of various ethnici-
ties or genders (for example the relative cosine similarity of women’s names versus
men’s to occupation words like ‘librarian’ or ‘carpenter’) across the 20th century.
They found that the cosines correlate with the empirical historical percentages of
women or ethnic groups in those occupations. Historical embeddings also repli-
cated old surveys of ethnic stereotypes; the tendency of experimental participants in
1933 to associate adjectives like ‘industrious’ or ‘superstitious’ with, e.g., Chinese
ethnicity, correlates with the cosine between Chinese last names and those adjectives
using embeddings trained on 1930s text. They also were able to document historical
gender biases, such as the fact that embeddings for adjectives related to competence
(‘smart’, ‘wise’, ‘thoughtful’, ‘resourceful’) had a higher cosine with male than fe-
male words, and showed that this bias has been slowly decreasing since 1960. We
return in later chapters to this question about the role of bias in natural language
processing.

6.12 Evaluating Vector Models


The most important evaluation metric for vector models is extrinsic evaluation on
tasks, i.e., using vectors in an NLP task and seeing whether this improves perfor-
mance over some other model.
Nonetheless it is useful to have intrinsic evaluations. The most common metric
is to test their performance on similarity, computing the correlation between an
algorithm’s word similarity scores and word similarity ratings assigned by humans.
WordSim-353 (Finkelstein et al., 2002) is a commonly used set of ratings from 0
to 10 for 353 noun pairs; for example (plane, car) had an average score of 5.77.
SimLex-999 (Hill et al., 2015) is a more difficult dataset that quantifies similarity
(cup, mug) rather than relatedness (cup, coffee), and includes both concrete and
abstract adjective, noun and verb pairs. The TOEFL dataset is a set of 80 questions,
each consisting of a target word with 4 additional word choices; the task is to choose
which is the correct synonym, as in the example: Levied is closest in meaning to:
imposed, believed, requested, correlated (Landauer and Dumais, 1997). All of these
datasets present words without context.
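As an illustration of this kind of intrinsic evaluation, the sketch below correlates a model's similarity scores with human ratings. We assume Spearman rank correlation here, a common choice for such datasets, although the text above says only "correlation". All numbers are invented for the example, and the helper names are hypothetical.

```python
import math

def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Invented human ratings (0 to 10) and model cosines for four word pairs.
human = [9.0, 5.77, 6.5, 1.2]
model = [0.81, 0.40, 0.35, 0.05]
print(round(spearman(human, model), 2))
```

Because only the ranks matter, the model is not penalized for producing cosines on a different scale from the 0 to 10 human ratings, only for ordering the pairs differently.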
Slightly more realistic are intrinsic similarity tasks that include context. The
Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012) and the
Word-in-Context (WiC) dataset (Pilehvar and Camacho-Collados, 2019) offer richer
evaluation scenarios. SCWS gives human judgments on 2,003 pairs of words in their
sentential context, while WiC gives target words in two sentential contexts that are
either in the same or different senses; see Section 18.5.3. The semantic textual
similarity task (Agirre et al. 2012, Agirre et al. 2015) evaluates the performance of
sentence-level similarity algorithms, consisting of a set of pairs of sentences, each
pair with human-labeled similarity scores.
Another task used for evaluation is the analogy task, discussed in Section 6.10,
where the system has to solve problems of the form a is to b as a* is to b*, given a, b,
and a* and having to find b* (Turney and Littman, 2005). A number of sets of tuples
have been created for this task (Mikolov et al. 2013, Mikolov et al. 2013b, Gladkova
et al. 2016), covering morphology (city:cities::child:children), lexicographic relations
(leg:table::spout:teapot) and encyclopedia relations (Beijing:China::Dublin:Ireland),
some drawing from the SemEval-2012 Task 2 dataset of 79 different relations
(Jurgens et al., 2012).
All embedding algorithms suffer from inherent variability. For example because
of randomness in the initialization and the random negative sampling, algorithms
like word2vec may produce different results even from the same dataset, and in-
dividual documents in a collection may strongly impact the resulting embeddings
(Hellrich and Hahn 2016, Antoniak and Mimno 2018). When embeddings are used
to study word associations in particular corpora, therefore, it is best practice to train
multiple embeddings with bootstrap sampling over documents and average the re-
sults (Antoniak and Mimno, 2018).
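The bootstrap procedure can be sketched as follows. Training a real embedding model on each resample is too heavy for a short example, so we substitute a cheap document-level co-occurrence rate for the embedding-based association score; that substitution, the function names, and the toy documents are all assumptions made for illustration.

```python
import random

def bootstrap_average(documents, statistic, n_samples=20, seed=0):
    """Average `statistic` over bootstrap resamples of the document collection."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    values = []
    for _ in range(n_samples):
        # Resample documents with replacement, keeping the collection size.
        resample = [rng.choice(documents) for _ in range(len(documents))]
        values.append(statistic(resample))
    return sum(values) / len(values), values

def association(docs, w1="tea", w2="cup"):
    """Stand-in score: fraction of documents containing both words.
    In practice this would be a cosine from embeddings trained on `docs`."""
    both = sum(1 for doc in docs if w1 in doc and w2 in doc)
    return both / len(docs)

# Hypothetical toy collection; each document is a list of tokens.
documents = [
    ["tea", "cup", "saucer"],
    ["tea", "leaf"],
    ["cup", "trophy"],
    ["tea", "cup"],
    ["coffee", "mug"],
]

mean_score, scores = bootstrap_average(documents, association)
print(round(mean_score, 3), min(scores), max(scores))
```

The spread between the smallest and largest resampled scores gives a rough sense of how much a single document can move the result, which is the variability the averaging is meant to smooth out.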

6.13 Summary
• In vector semantics, a word is modeled as a vector, a point in high-dimensional
space, also called an embedding. In this chapter we focus on static embeddings,
in which each word is mapped to a fixed embedding.
• Vector semantic models fall into two classes: sparse and dense. In sparse
models each dimension corresponds to a word in the vocabulary V and cells
are functions of co-occurrence counts. The term-document matrix has a row
for each word (term) in the vocabulary and a column for each document. The
word-context or term-term matrix has a row for each (target) word in the
vocabulary and a column for each context term in the vocabulary. Two sparse
weightings are common: the tf-idf weighting, which weights each cell by its
term frequency and inverse document frequency, and PPMI (positive
pointwise mutual information), which is most common for word-context matrices.
• Dense vector models have dimensionality 50–1000. Word2vec algorithms
like skip-gram are a popular way to compute dense embeddings. Skip-gram
trains a logistic regression classifier to compute the probability that two words
are ‘likely to occur nearby in text’. This probability is computed from the dot
product between the embeddings for the two words.
• Skip-gram uses stochastic gradient descent to train the classifier, by learning
embeddings that have a high dot product with embeddings of words that occur
nearby and a low dot product with noise words.
• Other important embedding algorithms include GloVe, a method based on
ratios of word co-occurrence probabilities.
• Whether using sparse or dense vectors, word and document similarities are
computed by some function of the dot product between vectors. The cosine
of two vectors—a normalized dot product—is the most popular such metric.

Bibliographical and Historical Notes


The idea of vector semantics arose out of research in the 1950s in three distinct
fields: linguistics, psychology, and computer science, each of which contributed a
fundamental aspect of the model.


The idea that meaning is related to the distribution of words in context was
widespread in linguistic theory of the 1950s, among distributionalists like Zellig
Harris, Martin Joos, and J. R. Firth, and semioticians like Thomas Sebeok. As Joos
(1950) put it,
the linguist’s “meaning” of a morpheme. . . is by definition the set of conditional
probabilities of its occurrence in context with all other morphemes.
The idea that the meaning of a word might be modeled as a point in a multi-
dimensional semantic space came from psychologists like Charles E. Osgood, who
had been studying how people responded to the meaning of words by assigning val-
ues along scales like happy/sad or hard/soft. Osgood et al. (1957) proposed that the
meaning of a word in general could be modeled as a point in a multidimensional
Euclidean space, and that the similarity of meaning between two words could be
modeled as the distance between these points in the space.
A final intellectual source in the 1950s and early 1960s was the field then called
mechanical indexing, now known as information retrieval. In what became known
as the vector space model for information retrieval (Salton 1971, Sparck Jones 1986),
researchers demonstrated new ways to define the meaning of words in terms of vec-
tors (Switzer, 1965), and refined methods for word similarity based on measures
of statistical association between words like mutual information (Giuliano, 1965)
and idf (Sparck Jones, 1972), and showed that the meaning of documents could be
represented in the same vector spaces used for words.
Some of the philosophical underpinning of the distributional way of thinking
came from the late writings of the philosopher Wittgenstein, who was skeptical of
the possibility of building a completely formal theory of meaning definitions for
each word, suggesting instead that “the meaning of a word is its use in the language”
(Wittgenstein, 1953, PI 43). That is, instead of using some logical language to define
each word, or drawing on denotations or truth values, Wittgenstein’s idea is that we
should define a word by how it is used by people in speaking and understanding in
their day-to-day interactions, thus prefiguring the movement toward embodied and
experiential models in linguistics and NLP (Glenberg and Robertson 2000, Lake and
Murphy 2020, Bisk et al. 2020, Bender and Koller 2020).
More distantly related is the idea of defining words by a vector of discrete fea-
tures, which has roots at least as far back as Descartes and Leibniz (Wierzbicka 1992,
Wierzbicka 1996). By the middle of the 20th century, beginning with the work of
Hjelmslev (Hjelmslev, 1969) (originally 1943) and fleshed out in early models of
generative grammar (Katz and Fodor, 1963), the idea arose of representing meaning
with semantic features, symbols that represent some sort of primitive meaning.
For example words like hen, rooster, or chick, have something in common (they all
describe chickens) and something different (their age and sex), representable as:
hen +female, +chicken, +adult
rooster -female, +chicken, +adult
chick +chicken, -adult
The dimensions used by vector models of meaning to define words, however, are
only abstractly related to this idea of a small fixed number of hand-built dimensions.
Nonetheless, there has been some attempt to show that certain dimensions of em-
bedding models do contribute some specific compositional aspect of meaning like
these early semantic features.
The use of dense vectors to model word meaning, and indeed the term embedding,
grew out of the latent semantic indexing (LSI) model (Deerwester et al.,
1988) recast as LSA (latent semantic analysis) (Deerwester et al., 1990). In LSA,
singular value decomposition (SVD) is applied to a term-document matrix (each
cell weighted by log frequency and normalized by entropy), and then the first 300
dimensions are used as the LSA embedding. SVD is a method for finding the most
important dimensions of a data set, those dimensions along which the data varies
the most. LSA was then quickly and widely applied:
as a cognitive model (Landauer and Dumais, 1997), and for tasks like spell checking
(Jones and Martin, 1997), language modeling (Bellegarda 1997, Coccaro and
Jurafsky 1998, Bellegarda 2000), morphology induction (Schone and Jurafsky 2000,
Schone and Jurafsky 2001b), multiword expressions (MWEs) (Schone and Jurafsky,
2001a), and essay grading (Rehder et al., 1998). Related models were simultaneously
developed and applied to word sense disambiguation by Schütze (1992b).
LSA also led to the earliest use of embeddings to represent words in a probabilis-
tic classifier, in the logistic regression document router of Schütze et al. (1995).
The idea of SVD on the term-term matrix (rather than the term-document matrix)
as a model of meaning for NLP was proposed soon after LSA by Schütze (1992b).
Schütze applied the low-rank (97-dimensional) embeddings produced by SVD to the
task of word sense disambiguation, analyzed the resulting semantic space, and also
suggested possible techniques like dropping high-order dimensions. See Schütze
(1997a).
A number of alternative matrix models followed on from the early SVD work,
including Probabilistic Latent Semantic Indexing (PLSI) (Hofmann, 1999), Latent
Dirichlet Allocation (LDA) (Blei et al., 2003), and Non-negative Matrix Factoriza-
tion (NMF) (Lee and Seung, 1999).
The LSA community seems to have first used the word “embedding” in Landauer
et al. (1997), in a variant of its mathematical meaning as a mapping from one space
or mathematical structure to another. In LSA, the word embedding seems to have
described the mapping from the space of sparse count vectors to the latent space of
SVD dense vectors. Although the word thus originally meant the mapping from one
space to another, it has metonymically shifted to mean the resulting dense vector in
the latent space, and it is in this sense that we currently use the word.
By the next decade, Bengio et al. (2003) and Bengio et al. (2006) showed that
neural language models could also be used to develop embeddings as part of the task
of word prediction. Collobert and Weston (2007), Collobert and Weston (2008), and
Collobert et al. (2011) then demonstrated that embeddings could be used to rep-
resent word meanings for a number of NLP tasks. Turian et al. (2010) compared
the value of different kinds of embeddings for different NLP tasks. Mikolov et al.
(2011) showed that recurrent neural nets could be used as language models. The
idea of simplifying the hidden layer of these neural net language models to create
the skip-gram (and also CBOW) algorithms was proposed by Mikolov et al. (2013).
The negative sampling training algorithm was proposed in Mikolov et al. (2013a).
There are numerous surveys of static embeddings and their parameterizations (Bul-
linaria and Levy 2007, Bullinaria and Levy 2012, Lapesa and Evert 2014, Kiela and
Clark 2014, Levy et al. 2015).
See Manning et al. (2008) for a deeper understanding of the role of vectors in in-
formation retrieval, including how to compare queries with documents, more details
on tf-idf, and issues of scaling to very large datasets. See Kim (2019) for a clear and
comprehensive tutorial on word2vec. Cruse (2004) is a useful introductory linguistic
text on lexical semantics.

Exercises
CHAPTER 7

Neural Networks and Neural Language Models

“[M]achines of this character can behave in a very complicated manner when
the number of units is large.”
Alan Turing (1948) “Intelligent Machines”, page 6

Neural networks are a fundamental computational tool for language processing,
and a very old one. They are called neural because their origins lie in the
McCulloch-Pitts neuron (McCulloch and Pitts, 1943), a simplified model of the
human neuron as a kind of computing element that could be described in terms of
propositional logic. But the modern use in language processing no longer draws on
these early biological inspirations.
Instead, a modern neural network is a network of small computing units, each
of which takes a vector of input values and produces a single output value. In this
chapter we introduce the neural net applied to classification. The architecture we
introduce is called a feedforward network because the computation proceeds
iteratively from one layer of units to the next. The use of modern neural nets is often
called deep learning, because modern networks are often deep (have many layers).
Neural networks share much of the same mathematics as logistic regression. But
neural networks are a more powerful classifier than logistic regression, and indeed a
minimal neural network (technically one with a single ‘hidden layer’) can be shown
to approximate any continuous function arbitrarily well, given enough hidden units.
Neural net classifiers are different from logistic regression in another way. With
logistic regression, we applied the regression classifier to many different tasks by
developing many rich kinds of feature templates based on domain knowledge. When
working with neural networks, it is more common to avoid most uses of rich hand-
derived features, instead building neural networks that take raw words as inputs
and learn to induce features as part of the process of learning to classify. We saw
examples of this kind of representation learning for embeddings in Chapter 6. Nets
that are very deep are particularly good at representation learning. For that reason
deep neural nets are the right tool for large scale problems that offer sufficient data
to learn features automatically.
In this chapter we’ll introduce feedforward networks as classifiers, and also ap-
ply them to the simple task of language modeling: assigning probabilities to word
sequences and predicting upcoming words. In subsequent chapters we’ll introduce
many other aspects of neural models, such as recurrent neural networks and the
Transformer (Chapter 9), contextual embeddings like BERT (Chapter 10), and
encoder-decoder models and attention (Chapter 11).

7.1 Units
The building block of a neural network is a single computational unit. A unit takes
a set of real valued numbers as input, performs some computation on them, and
produces an output.
At its heart, a neural unit is taking a weighted sum of its inputs, with one additional
term in the sum called a bias term. Given a set of inputs x1 ...xn , a unit has
a set of corresponding weights w1 ...wn and a bias b, so the weighted sum z can be
represented as:
$$z = b + \sum_{i} w_i x_i \tag{7.1}$$

Often it’s more convenient to express this weighted sum using vector notation; recall
from linear algebra that a vector is, at heart, just a list or array of numbers. Thus
we’ll talk about z in terms of a weight vector w, a scalar bias b, and an input vector
x, and we’ll replace the sum with the convenient dot product:

z = w·x+b (7.2)

As defined in Eq. 7.2, z is just a real valued number.


Finally, instead of using z, a linear function of x, as the output, neural units
apply a non-linear function f to z. We will refer to the output of this function as
the activation value for the unit, a. Since we are just modeling a single unit, the
activation for the node is in fact the final output of the network, which we’ll generally
call y. So the value y is defined as:

y = a = f (z)

We’ll discuss three popular non-linear functions f () below (the sigmoid, the tanh,
and the rectified linear ReLU) but it’s pedagogically convenient to start with the
sigmoid function since we saw it in Chapter 5:
$$y = \sigma(z) = \frac{1}{1 + e^{-z}} \tag{7.3}$$
The sigmoid (shown in Fig. 7.1) has a number of advantages; it maps the output
into the range [0, 1], which is useful in squashing outliers toward 0 or 1. And it’s
differentiable, which as we saw in Section 5.8 will be handy for learning.

Figure 7.1 The sigmoid function takes a real value and maps it to the range [0, 1]. It is
nearly linear around 0 but outlier values get squashed toward 0 or 1.

Substituting Eq. 7.2 into Eq. 7.3 gives us the output of a neural unit:
$$y = \sigma(w \cdot x + b) = \frac{1}{1 + \exp(-(w \cdot x + b))} \tag{7.4}$$
Fig. 7.2 shows a final schematic of a basic neural unit. In this example the unit
takes 3 input values x1 , x2 , and x3 , and computes a weighted sum, multiplying each
value by a weight (w1 , w2 , and w3 , respectively), adds them to a bias term b, and then
passes the resulting sum through a sigmoid function to result in a number between 0
and 1.


Figure 7.2 A neural unit, taking 3 inputs x1 , x2 , and x3 (and a bias b that we represent as a
weight for an input clamped at +1) and producing an output y. We include some convenient
intermediate variables: the output of the summation, z, and the output of the sigmoid, a. In
this case the output of the unit y is the same as a, but in deeper networks we’ll reserve y to
mean the final output of the entire network, leaving a as the activation of an individual node.

Let’s walk through an example just to get an intuition. Let’s suppose we have a
unit with the following weight vector and bias:

w = [0.2, 0.3, 0.9]


b = 0.5

What would this unit do with the following input vector:

x = [0.5, 0.6, 0.1]

The resulting output y would be:

$$y = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}} = \frac{1}{1 + e^{-(.5 \ast .2 + .6 \ast .3 + .1 \ast .9 + .5)}} = \frac{1}{1 + e^{-0.87}} = .70$$
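The worked example can be checked with a few lines of code. This is a minimal sketch of a single sigmoid unit (Eq. 7.4) applied to the weights, bias, and input given above; the function names are our own.

```python
import math

def sigmoid(z):
    """The sigmoid function of Eq. 7.3."""
    return 1.0 / (1.0 + math.exp(-z))

def weighted_sum(w, x, b):
    """z = w . x + b, as in Eq. 7.2."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def unit_output(w, x, b):
    """Output of a single sigmoid unit, as in Eq. 7.4."""
    return sigmoid(weighted_sum(w, x, b))

w = [0.2, 0.3, 0.9]
b = 0.5
x = [0.5, 0.6, 0.1]

z = weighted_sum(w, x, b)
y = unit_output(w, x, b)
print(round(z, 2), round(y, 2))
```

The weighted sum comes out to 0.87 and the activation to about 0.70, matching the calculation by hand.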
In practice, the sigmoid is not commonly used as an activation function. A function
that is very similar but almost always better is the tanh function shown in Fig. 7.3a;
tanh is a variant of the sigmoid that ranges from -1 to +1:
$$y = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \tag{7.5}$$
The simplest activation function, and perhaps the most commonly used, is the
rectified linear unit, also called the ReLU, shown in Fig. 7.3b. It’s just the same as z
when z is positive, and 0 otherwise:

$$y = \max(z, 0) \tag{7.6}$$

Figure 7.3 The tanh (a) and ReLU (b) activation functions.

These activation functions have different properties that make them useful for
different language applications or network architectures. For example, the tanh func-
tion has the nice properties of being smoothly differentiable and mapping outlier
values toward the mean. The rectifier function, on the other hand has nice properties
that result from it being very close to linear. In the sigmoid or tanh functions, very
high values of z result in values of y that are saturated, i.e., extremely close to 1,
and have derivatives very close to 0. Zero derivatives cause problems for learning,
because as we’ll see in Section 7.4, we’ll train networks by propagating an error
signal backwards, multiplying gradients (partial derivatives) from each layer of the
network; gradients that are almost 0 cause the error signal to get smaller and smaller
until it is too small to be used for training, a problem called the vanishing gradient
problem. Rectifiers don’t have this problem, since the derivative of ReLU for high
values of z is 1 rather than very close to 0.
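Saturation is easy to see numerically. The sketch below evaluates the derivatives of the three activation functions at a few values of z, using the standard closed forms σ(z)(1 − σ(z)) and 1 − tanh²(z). Setting the derivative of ReLU to 0 at exactly z = 0 is a convention we assume here, since the function is not differentiable at that point.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(z, 0.0)

def d_sigmoid(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    """Derivative of ReLU; 0 at z = 0 by convention (an assumption)."""
    return 1.0 if z > 0 else 0.0

for z in (0.0, 2.0, 10.0):
    print(z, d_sigmoid(z), d_tanh(z), d_relu(z))
```

At z = 10 the sigmoid and tanh derivatives are already tiny, while the ReLU derivative is still 1, which is the property that lets rectifiers avoid the vanishing gradient problem for large positive inputs.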

7.2 The XOR problem


Early in the history of neural networks it was realized that the power of neural net-
works, as with the real neurons that inspired them, comes from combining these
units into larger networks.
One of the most clever demonstrations of the need for multi-layer networks was
the proof by Minsky and Papert (1969) that a single neural unit cannot compute
some very simple functions of its input. Consider the task of computing elementary
logical functions of two inputs, like AND, OR, and XOR. As a reminder, here are
the truth tables for those functions:
    AND     |     OR     |    XOR
x1  x2  y   | x1  x2  y  | x1  x2  y
0   0   0   | 0   0   0  | 0   0   0
0   1   0   | 0   1   1  | 0   1   1
1   0   0   | 1   0   1  | 1   0   1
1   1   1   | 1   1   1  | 1   1   0

This example was first shown for the perceptron, which is a very simple neural
unit that has a binary output and does not have a non-linear activation function. The
output y of a perceptron is 0 or 1, and is computed as follows (using the same weight
w, input x, and bias b as in Eq. 7.2):

$$y = \begin{cases} 0, & \text{if } w \cdot x + b \le 0 \\ 1, & \text{if } w \cdot x + b > 0 \end{cases} \tag{7.7}$$
It’s very easy to build a perceptron that can compute the logical AND and OR
functions of its binary inputs; Fig. 7.4 shows the necessary weights.

(a) AND: w1 = 1, w2 = 1, b = −1        (b) OR: w1 = 1, w2 = 1, b = 0
Figure 7.4 The weights w and bias b for perceptrons for computing logical functions. The
inputs are shown as x1 and x2 and the bias as a special node with value +1 which is multiplied
with the bias weight b. (a) logical AND, showing weights w1 = 1 and w2 = 1 and bias weight
b = −1. (b) logical OR, showing weights w1 = 1 and w2 = 1 and bias weight b = 0. These
weights/biases are just one from an infinite number of possible sets of weights and biases that
would implement the functions.
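
As a quick check of Eq. 7.7 and the weights in Fig. 7.4, here is a minimal sketch in plain Python. The function name `perceptron` and the list `INPUTS` are illustrative choices, not names from the text; the weights and biases are the ones given in the figure caption.

```python
def perceptron(w, b, x):
    """Perceptron of Eq. 7.7: output 1 if w·x + b > 0, otherwise 0."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if z > 0 else 0

# The four possible binary inputs (x1, x2).
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Weights and biases as given in the caption of Fig. 7.4.
and_outputs = [perceptron((1, 1), -1, x) for x in INPUTS]
or_outputs = [perceptron((1, 1), 0, x) for x in INPUTS]

print(and_outputs)  # [0, 0, 0, 1]
print(or_outputs)   # [0, 1, 1, 1]
```

Note that the strict inequality in Eq. 7.7 matters for AND: the inputs (0, 1) and (1, 0) give w · x + b = 0, which maps to output 0.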

It turns out, however, that it’s not possible to build a perceptron to compute
logical XOR! (It’s worth spending a moment to give it a try!)
The intuition behind this important result relies on understanding that a perceptron
is a linear classifier. For a two-dimensional input x1 and x2, the perceptron
equation, w1 x1 + w2 x2 + b = 0, is the equation of a line. (We can see this by putting
it in the standard linear format: x2 = (−w1/w2)x1 + (−b/w2).) This line acts as a
decision boundary in two-dimensional space in which the output 0 is assigned to all
inputs lying on one side of the line, and the output 1 to all input points lying on the
other side of the line. If we had more than 2 inputs, the decision boundary would become
a hyperplane instead of a line, but the idea is the same, separating the space into two
categories.
Fig. 7.5 shows the possible logical inputs (00, 01, 10, and 11) and the line drawn
by one possible set of parameters for an AND and an OR classifier. Notice that there
is simply no way to draw a line that separates the positive cases of XOR (01 and 10)
from the negative cases (00 and 11). We say that XOR is not a linearly separable
function. Of course we could draw a boundary with a curve, or some other function,
but not a single line.
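
The claim can be probed empirically with a small brute-force search. The sketch below tries every perceptron whose weights and bias lie on a coarse grid (the grid itself is an arbitrary assumption made for illustration) and counts how many reproduce a given truth table. A search like this is only an illustration, not a proof; the proof is the geometric argument above.

```python
from itertools import product

def perceptron(w1, w2, b, x1, x2):
    """Two-input perceptron of Eq. 7.7."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Hypothetical search grid: values from -4.0 to 4.0 in steps of 0.5.
GRID = [k / 2 for k in range(-8, 9)]

def count_solutions(truth_table):
    """Count grid settings (w1, w2, b) that reproduce the whole truth table."""
    return sum(
        all(perceptron(w1, w2, b, *x) == y for x, y in truth_table.items())
        for w1, w2, b in product(GRID, repeat=3)
    )

and_solutions = count_solutions(AND)
xor_solutions = count_solutions(XOR)
print(and_solutions > 0)  # True: many settings compute AND
print(xor_solutions)      # 0: no setting on the grid computes XOR
```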

7.2.1 The solution: neural networks


While the XOR function cannot be calculated by a single perceptron, it can be cal-
culated by a layered network of units. Let’s see an example of how to do this from
Goodfellow et al. (2016) that computes XOR using two layers of ReLU-based units.
Fig. 7.6 shows a figure with the input being processed by two layers of neural units.
The middle layer (called h) has two units, and the output layer (called y) has one
unit. A set of weights and biases are shown for each ReLU that correctly computes
the XOR function.
Let’s walk through what happens with the input x = [0 0]. If we multiply each
input value by the appropriate weight, sum, and then add the bias b, we get the

a) x1 AND x2        b) x1 OR x2        c) x1 XOR x2

Figure 7.5 The functions AND, OR, and XOR, represented with input x1 on the x-axis and input x2 on the
y-axis. Filled circles represent perceptron outputs of 1, and white circles perceptron outputs of 0. There is no
way to draw a line that correctly separates the two categories for XOR. Figure styled after Russell and Norvig
(2002).

h1 = ReLU(1·x1 + 1·x2 + 0)      h2 = ReLU(1·x1 + 1·x2 − 1)      y1 = ReLU(1·h1 − 2·h2 + 0)
Figure 7.6 XOR solution after Goodfellow et al. (2016). There are three ReLU units, in
two layers; we’ve called them h1 , h2 (h for “hidden layer”) and y1 . As before, the numbers
on the arrows represent the weights w for each unit, and we represent the bias b as a weight
on a unit clamped to +1, with the bias weights/units in gray.

vector [0 -1], and we then apply the rectified linear transformation to give the output
of the h layer as [0 0]. Now we once again multiply by the weights, sum, and add
the bias (0 in this case) resulting in the value 0. The reader should work through the
computation of the remaining 3 possible input pairs to see that the resulting y values
are 1 for the inputs [0 1] and [1 0] and 0 for [0 0] and [1 1].
It’s also instructive to look at the intermediate results, the outputs of the two
hidden nodes h1 and h2 . We showed in the previous paragraph that the h vector for
the inputs x = [0 0] was [0 0]. Fig. 7.7b shows the values of the h layer for all 4
inputs. Notice that hidden representations of the two input points x = [0 1] and x
= [1 0] (the two cases with XOR output = 1) are merged to the single point h = [1
0]. The merger makes it easy to linearly separate the positive and negative cases
of XOR. In other words, we can view the hidden layer of the network as forming a
representation for the input.
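
The walk-through above can be reproduced in a few lines. This is a minimal sketch assuming the weights as read from Fig. 7.6 (both hidden units have input weights 1 and 1, with biases 0 and −1; the output unit has weights 1 and −2 and bias 0); the function names are illustrative.

```python
def relu(z):
    """Rectified linear unit, y = max(z, 0)."""
    return max(z, 0)

def xor_net(x1, x2):
    """Two-layer ReLU network for XOR, after Goodfellow et al. (2016)."""
    h1 = relu(1 * x1 + 1 * x2 + 0)   # hidden unit 1, bias 0
    h2 = relu(1 * x1 + 1 * x2 - 1)   # hidden unit 2, bias -1
    y1 = relu(1 * h1 - 2 * h2 + 0)   # output unit, bias 0
    return (h1, h2), y1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h, y = xor_net(*x)
    print(x, "-> h =", h, ", y =", y)
```

Running it shows the merger described in the text: the inputs (0, 1) and (1, 0) both map to h = (1, 0), while (0, 0) maps to (0, 0) and (1, 1) maps to (2, 1).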
In this example we just stipulated the weights in Fig. 7.6. But for real examples
the weights for neural networks are learned automatically using the error backprop-
agation algorithm to be introduced in Section 7.4. That means the hidden layers will
learn to form useful representations. This intuition, that neural networks can auto-
matically learn useful representations of the input, is one of their key advantages,

a) The original x space (axes x1, x2)        b) The new (linearly separable) h space (axes h1, h2)


Figure 7.7 The hidden layer forming a new representation of the input. (b) shows the
representation of the hidden layer, h, compared to the original input representation x in (a).
Notice that the input point [0 1] has been collapsed with the input point [1 0], making it
possible to linearly separate the positive and negative cases of XOR. After Goodfellow et al.
(2016).

and one that we will return to again and again in later chapters.
Note that the solution to the XOR problem requires a network of units with non-
linear activation functions. A network made up of simple linear (perceptron) units
cannot solve the XOR problem. This is because a network formed by many layers of
purely linear units can always be reduced to (i.e., shown to be computationally identical
to) a single layer of linear units with appropriate weights, and we’ve already shown
(visually, in Fig. 7.5) that a single unit cannot solve the XOR problem.
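
To see why stacked linear layers collapse, note that W2(W1 x + b1) + b2 = (W2 W1)x + (W2 b1 + b2), which is again a single linear layer. The sketch below checks this numerically; the matrices, biases, and helper names are made-up values for illustration only.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def vadd(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

# Two purely linear layers with arbitrary integer parameters.
W1, b1 = [[1, 2], [0, -1], [3, 1]], [1, 0, -2]   # 2 inputs -> 3 units
W2, b2 = [[2, -1, 0], [1, 1, 1]], [0, 5]         # 3 units -> 2 outputs

def two_linear_layers(x):
    return vadd(matvec(W2, vadd(matvec(W1, x), b1)), b2)

# The equivalent single layer.
W = matmul(W2, W1)
b = vadd(matvec(W2, b1), b2)

def one_linear_layer(x):
    return vadd(matvec(W, x), b)

print(two_linear_layers([3, -2]), one_linear_layer([3, -2]))
```

Inserting a non-linearity such as ReLU between the two layers breaks this identity, which is what lets the hidden layer of Fig. 7.6 form a new representation.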

7.3 Feed-Forward Neural Networks


Let’s now walk through a slightly more formal presentation of the simplest kind of
neural network, the feedforward network. A feedforward network is a multilayer
network in which the units are connected with no cycles; the outputs from units in
each layer are passed to units in the next higher layer, and no outputs are passed
back to lower layers. (In Chapter 9 we’ll introduce networks with cycles, called
recurrent neural networks.)
For historical reasons multilayer networks, especially feedforward networks, are
sometimes called multi-layer perceptrons (or MLPs); this is a technical misnomer,
since the units in modern multilayer networks aren’t perceptrons (perceptrons are
purely linear, but modern networks are made up of units with non-linearities like
sigmoids), but at some point the name stuck.
Simple feedforward networks have three kinds of nodes: input units, hidden
units, and output units. Fig. 7.8 shows a picture.
The input units are simply scalar values just as we saw in Fig. 7.2.
The core of the neural network is the hidden layer formed of hidden units,
each of which is a neural unit as described in Section 7.1, taking a weighted sum of
its inputs and then applying a non-linearity. In the standard architecture, each layer
is fully-connected, meaning that each unit in each layer takes as input the outputs
from all the units in the previous layer, and there is a link between every pair of units
from two adjacent layers. Thus each hidden unit sums over all the input units.

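[Network diagram: input layer x1 … x_n0 and a bias node +1; hidden layer h1 … h_n1, connected to the inputs through W and b; output layer y1 … y_n2]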
Figure 7.8 A simple 2-layer feedforward network, with one hidden layer, one output layer,
and one input layer (the input layer is usually not counted when enumerating layers).

Recall that a single hidden unit has parameters w (the weight vector) and b (the
bias scalar). We represent the parameters for the entire hidden layer by combining
the weight vector w_i and bias b_i for each unit i into a single weight matrix W and
a single bias vector b for the whole layer (see Fig. 7.8). Each element W_ji of the
weight matrix W represents the weight of the connection from the ith input unit x_i to
the jth hidden unit h_j.
The advantage of using a single matrix W for the weights of the entire layer is
that now the hidden layer computation for a feedforward network can be done very
efficiently with simple matrix operations. In fact, the computation only has three
steps: multiplying the weight matrix by the input vector x, adding the bias vector b,
and applying the activation function g (such as the sigmoid, tanh, or ReLU activation
function defined above).
The output of the hidden layer, the vector h, is thus the following, using the
sigmoid function σ :

h = σ (W x + b) (7.8)

Notice that we’re applying the σ function here to a vector, while in Eq. 7.3 it was
applied to a scalar. We’re thus allowing σ (·), and indeed any activation function
g(·), to apply to a vector element-wise, so g[z1 , z2 , z3 ] = [g(z1 ), g(z2 ), g(z3 )].
Let’s introduce some constants to represent the dimensionalities of these vectors
and matrices. We’ll refer to the input layer as layer 0 of the network, and have n0
represent the number of inputs, so x is a vector of real numbers of dimension n0,
or more formally x ∈ ℝ^{n0}, a column vector of dimensionality [n0, 1]. Let’s call the
hidden layer layer 1 and the output layer layer 2. The hidden layer has dimensionality
n1, so h ∈ ℝ^{n1} and also b ∈ ℝ^{n1} (since each hidden unit can take a different bias
value). And the weight matrix W has dimensionality W ∈ ℝ^{n1×n0}, i.e. [n1, n0].
Take a moment to convince yourself that the matrix multiplication in Eq. 7.8 will
compute the value of each h_j as $\sigma\left(\sum_{i=1}^{n_0} W_{ji} x_i + b_j\right)$.
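
The per-unit sum can be written out directly. The following is a minimal sketch of Eq. 7.8 in plain Python, with W stored as a list of n1 rows of length n0; the example values for W, b, and x are invented so that the result is easy to verify by hand.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a scalar z."""
    return 1.0 / (1.0 + math.exp(-z))

def hidden_layer(W, b, x):
    """h = sigma(Wx + b), with h_j = sigma(sum_i W[j][i] * x[i] + b[j])."""
    return [
        sigmoid(sum(W_ji * x_i for W_ji, x_i in zip(row, x)) + b_j)
        for row, b_j in zip(W, b)
    ]

# Illustrative shapes: n0 = 3 inputs, n1 = 2 hidden units.
W = [[0.5, -1.0, 0.0],
     [1.0, 1.0, 1.0]]      # dimensionality [n1, n0] = [2, 3]
b = [0.0, -3.0]            # one bias per hidden unit
x = [2.0, 1.0, 0.0]        # input vector of length n0

h = hidden_layer(W, b, x)
print(h)  # both pre-activations are 0 here, so each h_j is sigmoid(0) = 0.5
```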
As we saw in Section 7.2, the resulting value h (for hidden but also for hypoth-
esis) forms a representation of the input. The role of the output layer is to take
this new representation h and compute a final output. This output could be a real-
valued number, but in many cases the goal of the network is to make some sort of
classification decision, and so we will focus on the case of classification.
If we are doing a binary task like sentiment classification, we might have a single
output node, and its value y is the probability of positive versus negative sentiment.
If we are doing multinomial classification, such as assigning a part-of-speech tag, we
might have one output node for each potential part-of-speech, whose output value
is the probability of that part-of-speech, and the values of all the output nodes must
sum to one. The output layer thus gives a probability distribution across the output
nodes.
Let’s see how this happens. Like the hidden layer, the output layer has a weight
matrix (let’s call it U), but some models don’t include a bias vector b in the output
layer, so we’ll simplify by eliminating the bias vector in this example. The weight
matrix is multiplied by its input vector (h) to produce the intermediate output z.

z = Uh

There are n2 output nodes, so z ∈ ℝ^{n2}, weight matrix U has dimensionality
U ∈ ℝ^{n2×n1}, and element U_ij is the weight from unit j in the hidden layer to unit i in the
output layer.
However, z can’t be the output of the classifier, since it’s a vector of real-valued
numbers, while what we need for classification is a vector of probabilities. There is
a convenient function for normalizing a vector of real values, by which we mean
converting it to a vector that encodes a probability distribution (all the numbers lie
between 0 and 1 and sum to 1): the softmax function that we saw on page 90 of
Chapter 5. For a vector z of dimensionality d, the softmax is defined as:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{d} e^{z_j}} \qquad 1 \le i \le d \tag{7.9}$$

Thus for example given a vector z=[0.6, 1.1, -1.5, 1.2, 3.2, -1.1], softmax(z) is
[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010].
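
The example can be reproduced with a short sketch of Eq. 7.9. Subtracting the maximum before exponentiating is a common numerical-stability convention that is not part of the text; it leaves the result unchanged because the same factor cancels in numerator and denominator.

```python
import math

def softmax(z):
    """Softmax of Eq. 7.9 over a list of real-valued scores."""
    m = max(z)                             # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [0.6, 1.1, -1.5, 1.2, 3.2, -1.1]
probs = softmax(z)
print([round(p, 4) for p in probs])
print(sum(probs))  # the outputs form a probability distribution
```

Up to rounding, the printed values agree with the vector given in the text.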
You may recall that softmax is exactly what was used to create a probability
distribution from a vector of real-valued numbers (computed from summing weights
times features) in logistic regression in Chapter 5.
That means we can think of a neural network classifier with one hidden layer
as building a vector h which is a hidden layer representation of the input, and then
running standard logistic regression on the features that the network develops in h.
By contrast, in Chapter 5 the features were mainly designed by hand via feature
templates. So a neural network is like logistic regression, but (a) with many layers,
since a deep neural network is like layer after layer of logistic regression classifiers,
and (b) rather than forming the features by feature templates, the prior layers of the
network induce the feature representations themselves.
Here are the final equations for a feedforward network with a single hidden layer,
which takes an input vector x, outputs a probability distribution y, and is parameter-
ized by weight matrices W and U and a bias vector b:

h = σ (W x + b)
z = Uh
y = softmax(z) (7.10)

We’ll call this network a 2-layer network (we traditionally don’t count the input
layer when numbering layers, but do count the output layer). So by this terminology
logistic regression is a 1-layer network.
Let’s now set up some notation to make it easier to talk about deeper networks
of depth more than 2. We’ll use superscripts in square brackets to mean layer num-
bers, starting at 0 for the input layer. So W [1] will mean the weight matrix for the

(first) hidden layer, and b[1] will mean the bias vector for the (first) hidden layer. n_j
will mean the number of units at layer j. We’ll use g(·) to stand for the activation
function, which will tend to be ReLU or tanh for intermediate layers and softmax
for output layers. We’ll use a[i] to mean the output from layer i, and z[i] to mean the
combination of weights and biases W[i] a[i−1] + b[i]. The 0th layer is for inputs, so the
inputs x we’ll refer to more generally as a[0].
Thus we can re-represent our 2-layer net from Eq. 7.10 as follows:

z[1] = W [1] a[0] + b[1]


a[1] = g[1] (z[1] )
z[2] = W [2] a[1] + b[2]
a[2] = g[2] (z[2] )
ŷ = a[2] (7.11)

Note that with this notation, the equations for the computation done at each layer are
the same. The algorithm for computing the forward step in an n-layer feedforward
network, given the input vector a[0] is thus simply:

for i in 1..n
z[i] = W [i] a[i−1] + b[i]
a[i] = g[i] (z[i] )
ŷ = a[n]

The activation functions g(·) are generally different at the final layer. Thus g[2]
might be softmax for multinomial classification or sigmoid for binary classification,
while ReLU or tanh might be the activation function g(·) at the internal layers.
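
The forward-step algorithm above translates almost line for line into code. This is a minimal sketch in plain Python, assuming each layer is given as a (W, b, g) triple; that representation, the helper names, and the example parameters are all illustrative choices rather than anything fixed by the text.

```python
import math

def relu(z):
    """Element-wise ReLU."""
    return [max(v, 0.0) for v in z]

def softmax(z):
    """Softmax over a list of scores (shifted by the max for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(layers, x):
    """Forward step: layers is a list of (W, b, g) for layers 1..n."""
    a = x                                        # a[0] is the input
    for W, b, g in layers:
        # z[i] = W[i] a[i-1] + b[i]
        z = [sum(w * a_k for w, a_k in zip(row, a)) + b_j
             for row, b_j in zip(W, b)]
        a = g(z)                                 # a[i] = g[i](z[i])
    return a                                     # y_hat = a[n]

# A made-up 2-layer net: 2 inputs -> 3 ReLU units -> 2 softmax outputs.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.0, 0.0], relu),
    ([[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]], [0.0, 0.0], softmax),
]
y_hat = forward(layers, [1.0, 2.0])
print(y_hat)
```

For this input the hidden activations are [0, 1.5, 3], so the second output class receives almost all of the probability mass.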
Replacing the bias unit In describing networks, we will often use a slightly
simplified notation that represents exactly the same function without referring to an
explicit bias node b. Instead, we add a dummy node a_0 to each layer whose value will
always be 1. Thus layer 0, the input layer, will have a dummy node $a_0^{[0]} = 1$, layer 1
will have $a_0^{[1]} = 1$, and so on. This dummy node still has an associated weight, and
that weight represents the bias value b. For example instead of an equation like

h = σ (W x + b) (7.12)

we’ll use:

h = σ (W x) (7.13)

But now instead of our vector x having n0 values: x = x1, . . . , xn0, it will have n0 +
1 values, with a new 0th dummy value x0 = 1: x = x0, . . . , xn0. And instead of
computing each h_j as follows:

$$h_j = \sigma\left(\sum_{i=1}^{n_0} W_{ji} x_i + b_j\right), \tag{7.14}$$

we’ll instead use:

$$\sigma\left(\sum_{i=0}^{n_0} W_{ji} x_i\right), \tag{7.15}$$

(a) network with an explicit bias node +1 and bias vector b        (b) the same network with a dummy input x0 = 1 in place of the bias node
Figure 7.9 Replacing the bias node (shown in a) with x0 (b).

where the value W_j0 replaces what had been b_j. Fig. 7.9 shows a visualization.
We’ll continue showing the bias as b for the learning example in the next section,
but then we’ll switch to this simplified notation without explicit bias terms for the
rest of the book.
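
The equivalence of the two notations is easy to verify numerically. In this sketch (plain Python, invented values) the bias vector becomes column 0 of an augmented weight matrix and the input gains a leading 1; only the pre-activation is compared, since the same activation function is applied in both cases.

```python
def affine(W, b, x):
    """Explicit-bias form: Wx + b."""
    return [sum(w * x_i for w, x_i in zip(row, x)) + b_j
            for row, b_j in zip(W, b)]

def linear(W, x):
    """Bias-free form: Wx."""
    return [sum(w * x_i for w, x_i in zip(row, x)) for row in W]

W = [[2, -1], [0, 3]]
b = [5, -4]
x = [1, 2]

# Fold the bias in: W_aug[j][0] plays the role of b[j], and x_aug[0] = 1.
W_aug = [[b_j] + row for row, b_j in zip(W, b)]
x_aug = [1] + x

print(affine(W, b, x))       # [5, 2]
print(linear(W_aug, x_aug))  # [5, 2]
```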

7.4 Training Neural Nets


A feedforward neural net is an instance of supervised machine learning in which we
know the correct output y for each observation x. What the system produces, via
Eq. 7.11, is ŷ, the system’s estimate of the true y. The goal of the training procedure
is to learn parameters W [i] and b[i] for each layer i that make ŷ for each training
observation as close as possible to the true y.
In general, we do all this by drawing on the methods we introduced in Chapter 5
for logistic regression, so the reader should be comfortable with that chapter before
proceeding.
First, we’ll need a loss function that models the distance between the system
output and the gold output, and it’s common to use the loss function used for logistic
regression, the cross-entropy loss.
Second, to find the parameters that minimize this loss function, we’ll use the
gradient descent optimization algorithm introduced in Chapter 5.
Third, gradient descent requires knowing the gradient of the loss function, the
vector that contains the partial derivative of the loss function with respect to each of
the parameters. Here is one part where learning for neural networks is more complex
than for logistic regression. In logistic regression, for each observation we could
directly compute the derivative of the loss function with respect to an individual w
or b. But for neural networks, with millions of parameters in many layers, it’s much
harder to see how to compute the partial derivative of some weight in layer 1 when
the loss is attached to some much later layer. How do we partial out the loss over all
those intermediate layers?
The answer is the algorithm called error backpropagation or reverse differen-
tiation.

7.4.1 Loss function


The cross-entropy loss that is used in neural networks is the same one we saw for
logistic regression.
In fact, if the neural network is being used as a binary classifier, with the sigmoid
at the final layer, the loss function is exactly the same as we saw with logistic
regression in Eq. 5.11:

LCE (ŷ, y) = − log p(y|x) = − [y log ŷ + (1 − y) log(1 − ŷ)] (7.16)

What about if the neural network is being used as a multinomial classifier? Let y be
a vector over the K classes representing the true output probability distribution. The
cross-entropy loss here is

$$L_{CE}(\hat{y}, y) = -\sum_{i=1}^{K} y_i \log \hat{y}_i \tag{7.17}$$

We can simplify this equation further. Assume this is a hard classification task,
meaning that only one class is the correct one, and that there is one output unit in y
for each class. If the true class is i, then y is a vector where y_i = 1 and y_j = 0 ∀ j ≠ i.
A vector like this, with one value=1 and the rest 0, is called a one-hot vector. The
terms in the sum in Eq. 7.17 will be 0 except for the term corresponding to the true
class, i.e.:

$$\begin{aligned}
L_{CE}(\hat{y}, y) &= -\sum_{k=1}^{K} \mathbb{1}\{y = k\} \log \hat{y}_k \\
&= -\sum_{k=1}^{K} \mathbb{1}\{y = k\} \log \hat{p}(y = k \mid x) \\
&= -\sum_{k=1}^{K} \mathbb{1}\{y = k\} \log \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}
\end{aligned} \tag{7.18}$$

Hence the cross-entropy loss is simply the negative log of the output probability
corresponding to the correct class, and we therefore also call this the negative log
likelihood loss:

$$L_{CE}(\hat{y}, y) = -\log \hat{y}_i \qquad \text{(where $i$ is the correct class)} \tag{7.19}$$

Plugging in the softmax formula from Eq. 7.9, and with K the number of classes:

$$L_{CE}(\hat{y}, y) = -\log \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \qquad \text{(where $i$ is the correct class)} \tag{7.20}$$
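
The loss variants above can be compared side by side. This is a minimal sketch in plain Python; the function names are illustrative, and the score vector is borrowed from the softmax example earlier in the chapter with class index 4 arbitrarily assumed to be the correct one.

```python
import math

def softmax(z):
    """Softmax over a list of scores (shifted by the max for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(y_hat, y):
    """Eq. 7.16: y is 0 or 1, y_hat is the predicted probability of class 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def cross_entropy(y_hat, y):
    """Eq. 7.17: y is the true distribution; zero terms of y are skipped."""
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, y_hat) if y_i > 0)

def negative_log_likelihood(z, correct):
    """Eq. 7.20: negative log softmax probability of the correct class."""
    return -math.log(softmax(z)[correct])

z = [0.6, 1.1, -1.5, 1.2, 3.2, -1.1]
correct = 4                                   # assumed gold class
one_hot = [1 if k == correct else 0 for k in range(len(z))]

loss_full = cross_entropy(softmax(z), one_hot)
loss_nll = negative_log_likelihood(z, correct)
print(loss_full, loss_nll)                    # the two values coincide
```

With a one-hot y, only the term for the correct class survives in Eq. 7.17, which is why both functions return the same number.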

7.4.2 Computing the Gradient


How do we compute the gradient of this loss function? Computing the gradient
requires the partial derivative of the loss function with respect to each parameter.
For a network with one weight layer and sigmoid output (which is what logistic
regression is), we could simply use the derivative of the loss that we used for logistic
regression in Eq. 7.21 (and derived in Section 5.8):

$$\begin{aligned}
\frac{\partial L_{CE}(w, b)}{\partial w_j} &= (\hat{y} - y)\, x_j \\
&= (\sigma(w \cdot x + b) - y)\, x_j
\end{aligned} \tag{7.21}$$

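Or for a network with one weight layer and softmax output, we could use the
derivative of the softmax loss from Eq. 5.37:

$$\begin{aligned}
\frac{\partial L_{CE}}{\partial w_k} &= -\left(\mathbb{1}\{y = k\} - p(y = k \mid x)\right) x_k \\
&= -\left(\mathbb{1}\{y = k\} - \frac{\exp(w_k \cdot x + b_k)}{\sum_{j=1}^{K} \exp(w_j \cdot x + b_j)}\right) x_k
\end{aligned} \tag{7.22}$$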

But these derivatives only give correct updates for one weight layer: the last one!
For deep networks, computing the gradients for each weight is much more complex,
since we are computing the derivative with respect to weight parameters that appear
all the way back in the very early layers of the network, even though the loss is
computed only at the very end of the network.
The solution to computing this gradient is an algorithm called error backpropagation
or backprop (Rumelhart et al., 1986). While backprop was invented specially
for neural networks, it turns out to be the same as a more general procedure
called backward differentiation, which depends on the notion of computation
graphs. Let’s see how that works in the next subsection.

7.4.3 Computation Graphs


A computation graph is a representation of the process of computing a mathematical
expression, in which the computation is broken down into separate operations, each
of which is modeled as a node in a graph.
Consider computing the function L(a, b, c) = c(a + 2b). If we make each of the
component addition and multiplication operations explicit, and add names (d and e)
for the intermediate outputs, the resulting series of computations is:

d = 2∗b
e = a+d
L = c∗e

We can now represent this as a graph, with nodes for each operation, and di-
rected edges showing the outputs from each operation as the inputs to the next, as
in Fig. 7.10. The simplest use of computation graphs is to compute the value of
the function with some given inputs. In the figure, we’ve assumed the inputs a = 3,
b = 1, c = −2, and we’ve shown the result of the forward pass to compute the re-
sult L(3, 1, −2) = −10. In the forward pass of a computation graph, we apply each
operation left to right, passing the outputs of each computation as the input to the
next node.
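
The forward pass on this small graph amounts to evaluating the three operations in order. A minimal sketch, with one line per node of the graph (the function name is an illustrative choice):

```python
def forward(a, b, c):
    """Forward pass for L(a, b, c) = c(a + 2b), one line per graph node."""
    d = 2 * b      # d = 2 * b
    e = a + d      # e = a + d
    L = c * e      # L = c * e
    return d, e, L

d, e, L = forward(3, 1, -2)
print(d, e, L)  # 2 5 -10
```

Keeping the intermediate values d and e around matters: the backward pass reuses them.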

7.4.4 Backward differentiation on computation graphs


The importance of the computation graph comes from the backward pass, which
is used to compute the derivatives that we’ll need for the weight update. In this
example our goal is to compute the derivative of the output function L with respect
to each of the input variables, i.e., ∂L/∂a, ∂L/∂b, and ∂L/∂c. The derivative ∂L/∂a tells us how
much a small change in a affects L.
Backwards differentiation makes use of the chain rule in calculus. Suppose we
are computing the derivative of a composite function f(x) = u(v(x)). The derivative

[Forward pass: inputs a = 3, b = 1, c = −2; d = 2b = 2; e = a + d = 5; L = ce = −10]
Figure 7.10 Computation graph for the function L(a, b, c) = c(a+2b), with values for input
nodes a = 3, b = 1, c = −2, showing the forward pass computation of L.

of f(x) is the derivative of u(x) with respect to v(x) times the derivative of v(x) with
respect to x:

$$\frac{df}{dx} = \frac{du}{dv} \cdot \frac{dv}{dx} \tag{7.23}$$

The chain rule extends to more than two functions. If computing the derivative of a
composite function f(x) = u(v(w(x))), the derivative of f(x) is:

$$\frac{df}{dx} = \frac{du}{dv} \cdot \frac{dv}{dw} \cdot \frac{dw}{dx} \tag{7.24}$$
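Let’s now compute the 3 derivatives we need. Since in the computation graph
L = ce, we can directly compute the derivative ∂L/∂c:

$$\frac{\partial L}{\partial c} = e \tag{7.25}$$

For the other two, we’ll need to use the chain rule:

$$\begin{aligned}
\frac{\partial L}{\partial a} &= \frac{\partial L}{\partial e}\,\frac{\partial e}{\partial a} \\
\frac{\partial L}{\partial b} &= \frac{\partial L}{\partial e}\,\frac{\partial e}{\partial d}\,\frac{\partial d}{\partial b}
\end{aligned} \tag{7.26}$$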

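Eq. 7.26 thus requires five intermediate derivatives: ∂L/∂e, ∂L/∂c, ∂e/∂a, ∂e/∂d, and ∂d/∂b,
which are as follows (making use of the fact that the derivative of a sum is the sum
of the derivatives):

$$\begin{aligned}
L = ce &: \quad \frac{\partial L}{\partial e} = c, \quad \frac{\partial L}{\partial c} = e \\
e = a + d &: \quad \frac{\partial e}{\partial a} = 1, \quad \frac{\partial e}{\partial d} = 1 \\
d = 2b &: \quad \frac{\partial d}{\partial b} = 2
\end{aligned}$$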
In the backward pass, we compute each of these partials along each edge of the graph
from right to left, multiplying the necessary partials to result in the final derivative
we need. Thus we begin by annotating the final node with ∂L/∂L = 1. Moving to the
left, we then compute ∂L/∂c and ∂L/∂e, and so on, until we have annotated the graph all
the way to the input variables. The forward pass conveniently already will have
computed the values of the forward intermediate variables we need (like d and e)

to compute these derivatives. Fig. 7.11 shows the backward pass. At each node we
need to compute the local partial derivative with respect to the parent, multiply it by
the partial derivative that is being passed down from the parent, and then pass it to
the child.

[Backward pass, with a = 3, b = 1, c = −2, d = 2, e = 5, L = −10: ∂L/∂L = 1; ∂L/∂e = c = −2; ∂L/∂c = e = 5; ∂e/∂a = 1, so ∂L/∂a = −2; ∂e/∂d = 1, so ∂L/∂d = −2; ∂d/∂b = 2, so ∂L/∂b = −4]
Figure 7.11 Computation graph for the function L(a, b, c) = c(a + 2b), showing the
backward pass computation of ∂L/∂a, ∂L/∂b, and ∂L/∂c.
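
The backward pass can be traced by hand in code. The sketch below hard-codes the local derivatives of the three operations (it is a hand-written illustration for this one graph, not a general automatic differentiation routine) and multiplies them from right to left, as the chain rule requires. Variable names such as `dL_de` are illustrative.

```python
def forward_backward(a, b, c):
    """Forward then backward pass for L = c * (a + 2 * b)."""
    # Forward pass: compute and keep the intermediate values.
    d = 2 * b
    e = a + d
    L = c * e

    # Backward pass: start from dL/dL = 1 and move right to left.
    dL_dL = 1
    dL_de = dL_dL * c      # L = c * e, so dL/de = c
    dL_dc = dL_dL * e      # L = c * e, so dL/dc = e
    dL_da = dL_de * 1      # e = a + d, so de/da = 1
    dL_dd = dL_de * 1      # e = a + d, so de/dd = 1
    dL_db = dL_dd * 2      # d = 2 * b, so dd/db = 2
    return L, {"a": dL_da, "b": dL_db, "c": dL_dc}

L, grads = forward_backward(3, 1, -2)
print(L, grads)  # -10 {'a': -2, 'b': -4, 'c': 5}
```

Each gradient is the product of the upstream derivative passed down from the parent node and the local derivative of the operation, matching the values annotated in Fig. 7.11.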

Backward differentiation for a neural network


Of course computation graphs for real neural networks are much more complex.
Fig. 7.12 shows a sample computation graph for a 2-layer neural network with n0 =
2, n1 = 2, and n2 = 1, assuming binary classification and hence using a sigmoid
output unit for simplicity. The function that the computation graph is computing is:

z[1] = W [1] x + b[1]


a[1] = ReLU(z[1] )
z[2] = W [2] a[1] + b[2]
a[2] = σ (z[2] )
ŷ = a[2] (7.27)

The weights that need updating (those for which we need to know the partial
derivative of the loss function) are shown in orange. In order to do the backward
pass, we’ll need to know the derivatives of all the functions in the graph. We already
saw in Section 5.8 the derivative of the sigmoid σ :
$$\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z)) \tag{7.28}$$

We’ll also need the derivatives of each of the other activation functions. The
derivative of tanh is:

$$\frac{d\tanh(z)}{dz} = 1 - \tanh^2(z) \tag{7.29}$$

The derivative of the ReLU is

$$\frac{d\,\mathrm{ReLU}(z)}{dz} = \begin{cases} 0 & \text{for } z < 0 \\ 1 & \text{for } z \ge 0 \end{cases} \tag{7.30}$$
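
These three derivatives can be sanity-checked against a numerical estimate. The sketch below compares each closed form with a central finite difference at two test points away from 0 (ReLU is not differentiable at exactly 0, where the text adopts the value 1 by convention). The step size `eps` and the test points are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(z, 0.0)

def d_sigmoid(z):
    """Eq. 7.28."""
    return sigmoid(z) * (1.0 - sigmoid(z))

def d_tanh(z):
    """Eq. 7.29."""
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    """Eq. 7.30, using the convention that the derivative at 0 is 1."""
    return 0.0 if z < 0 else 1.0

def numeric_derivative(f, z, eps=1e-6):
    """Central finite-difference estimate of df/dz."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

checks = [(sigmoid, d_sigmoid), (math.tanh, d_tanh), (relu, d_relu)]
max_error = max(
    abs(numeric_derivative(f, z) - df(z))
    for f, df in checks
    for z in (0.7, -1.3)
)
print(max_error)  # very small: the closed forms agree with the estimates
```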

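[Computation graph: inputs x1 and x2 are multiplied by the first-layer weights w11[1], w12[1], w21[1], w22[1] and summed with the biases b1[1], b2[1] to give the two z[1] nodes, each followed by a[1] = ReLU; these are multiplied by the second-layer weights w11[2], w12[2] and summed with the bias b1[2] to give z[2], followed by a[2] = σ and the loss L(a[2], y)]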

Figure 7.12 Sample computation graph for a simple 2-layer neural net (= 1 hidden layer)
with two input dimensions and 2 hidden dimensions.

7.4.5 More details on learning


Optimization in neural networks is a non-convex optimization problem, more com-
plex than for logistic regression, and for that and other reasons there are many best
practices for successful learning.
For logistic regression we can initialize gradient descent with all the weights and
biases having the value 0. In neural networks, by contrast, we need to initialize the
weights with small random numbers. It’s also helpful to normalize the input values
to have 0 mean and unit variance.
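
As a rough illustration of these two practices, here is a minimal sketch in plain Python. The Gaussian initialization with standard deviation 0.01 and the fixed seed are assumptions made for the example; the text only says "small random numbers" and does not prescribe a distribution or scale.

```python
import random
import statistics

def init_weights(n_out, n_in, scale=0.01, seed=0):
    """Small random weights (assumed here: Gaussian, std = scale)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(n_in)]
            for _ in range(n_out)]

def standardize(values):
    """Rescale one input feature to have 0 mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # population standard deviation
    return [(v - mean) / std for v in values]

W = init_weights(n_out=3, n_in=4)
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = standardize(feature)

print(statistics.fmean(standardized), statistics.pstdev(standardized))
```

In practice the mean and standard deviation would be estimated on the training set and reused unchanged for the dev and test sets.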
Various forms of regularization are used to prevent overfitting. One of the most
important is dropout: randomly dropping some units and their connections from
the network during training (Hinton et al. 2012, Srivastava et al. 2014). Tuning
of hyperparameters is also important. The parameters of a neural network are the
weights W and biases b; those are learned by gradient descent. The hyperparameters
are things that are chosen by the algorithm designer; optimal values are tuned on a
devset rather than by gradient descent learning on the training set. Hyperparameters
include the learning rate η, the mini-batch size, the model architecture (the number
of layers, the number of hidden nodes per layer, the choice of activation functions),
how to regularize, and so on. Gradient descent itself also has many architectural
variants such as Adam (Kingma and Ba, 2015).
Finally, most modern neural networks are built using computation graph
formalisms that make it easy and natural to do gradient computation and parallelization
onto vector-based GPUs (Graphics Processing Units). PyTorch (Paszke et al., 2017)
and TensorFlow (Abadi et al., 2015) are two of the most popular. The interested
reader should consult a neural network textbook for further details; some
suggestions are at the end of the chapter.

7.5 Neural Language Models


As our first application of neural networks, let’s consider language modeling: pre-
dicting upcoming words from prior word context.
Neural net-based language models turn out to have many advantages over the n-
gram language models of Chapter 3. Among these are that neural language models

don’t need smoothing, they can handle much longer histories, and they can general-
ize over contexts of similar words. For a training set of a given size, a neural lan-
guage model has much higher predictive accuracy than an n-gram language model.
Furthermore, neural language models underlie many of the models we’ll introduce
for tasks like machine translation, dialog, and language generation.
On the other hand, there is a cost for this improved performance: neural net
language models are strikingly slower to train than traditional language models, and
so for many tasks an n-gram language model is still the right tool.
In this chapter we’ll describe simple feedforward neural language models, first
introduced by Bengio et al. (2003). Modern neural language models are generally
not feedforward but recurrent, using the technology that we will introduce in Chap-
ter 9.
A feedforward neural LM is a standard feedforward network that takes as input
at time t a representation of some number of previous words (w_{t−1}, w_{t−2}, etc.)
and outputs a probability distribution over possible next words. Thus, like the
n-gram LM, the feedforward neural LM approximates the probability of a word given
the entire prior context P(w_t | w_{1:t−1}) by approximating based on the N − 1 previous
words:

$$P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-N+1}, \ldots, w_{t-1}) \tag{7.31}$$

In the following examples we’ll use a 4-gram example, so we’ll show a net to esti-
mate the probability P(wt = i|wt−1 , wt−2 , wt−3 ).

7.5.1 Embeddings
In neural language models, the prior context is represented by embeddings of the
previous words. Representing the prior context as embeddings, rather than by ex-
act words as used in n-gram language models, allows neural language models to
generalize to unseen data much better than n-gram language models. For example,
suppose we’ve seen this sentence in training:
I have to make sure that the cat gets fed.
but have never seen the words “gets fed” after the word “dog”. Our test set has the
prefix “I forgot to make sure that the dog gets”. What’s the next word? An n-gram
language model will predict “fed” after “that the cat gets”, but not after “that the dog
gets”. But a neural LM, knowing that “cat” and “dog” have similar embeddings, will
be able to generalize from the “cat” context to assign a high enough probability to
“fed” even after seeing “dog”.
Let’s see how this works in practice. For now we’ll assume we already have
an embedding dictionary E that gives us, for each word in our vocabulary V , the
embedding for that word.
Fig. 7.13 shows a sketch of this simplified feedforward neural language model
with N=3; we have a moving window at time t with an embedding vector represent-
ing each of the 3 previous words (words wt−1 , wt−2 , and wt−3 ). These 3 vectors are
concatenated together to produce x, the input layer of a neural network whose output
is a softmax with a probability distribution over words. Thus y42 , the value of output
node 42, is the probability of the next word wt being V42 , the vocabulary word with
index 42.
The model shown in Fig. 7.13 is quite sufficient, assuming we have already
learned the embeddings separately by a method like the word2vec methods of
Chapter 6. Relying on another algorithm to have already learned an embedding
representation for input words is called pretraining. If those pretrained
embeddings are sufficient for your purposes, then this is all you need.

[Figure 7.13: network diagram for the example “... and thanks for all the ? ...”
(context words wt−3 , wt−2 , wt−1 ; next word wt ). From bottom to top: embeddings
looked up in E for words 35, 9925, and 45180; projection layer (3d × 1); weight
matrix W (dh × 3d); hidden layer h (dh × 1); weight matrix U (|V | × dh ); softmax
output layer ŷ (|V | × 1), with nodes such as p(aardvark|…), p(fish|…), p(for|…),
and p(zebra|…).]

Figure 7.13 A simplified view of a feedforward neural language model moving through
a text. At each timestep t the network takes the 3 context words, converts each to a
d-dimensional embedding, and concatenates the 3 embeddings together to get the 3d × 1
input layer x for the network. These units are multiplied by a weight matrix W and then an
activation function is applied element-wise to produce the hidden layer h, which is then
multiplied by another weight matrix U. Finally, a softmax output layer predicts at each node i the
probability that the next word wt will be vocabulary word Vi . (This picture is simplified
because it assumes we just look up in an embedding dictionary E the d-dimensional embedding
vector for each word, precomputed by an algorithm like word2vec.)
However, often we’d like to learn the embeddings simultaneously with training
the network. This is true when the task the network is designed for (sentiment clas-
sification, or translation, or parsing) places strong constraints on what makes a good
representation.
Let’s therefore show an architecture that allows the embeddings to be learned.
To do this, we’ll add an extra layer to the network, and propagate the error all the
way back to the embedding vectors, starting with embeddings with random values
and slowly moving toward sensible representations.
For this to work at the input layer, instead of pretrained embeddings, we’re going
to represent each of the N previous words as a one-hot vector of length |V |, i.e., with
one dimension for each word in the vocabulary. A one-hot vector is a vector that
has one element equal to 1 (in the dimension corresponding to that word’s index in
the vocabulary) while all the other elements are set to zero.
Thus in a one-hot representation for the word “toothpaste”, supposing it is index
5 in the vocabulary, x5 = 1, and xi = 0 ∀i ≠ 5, as shown here:
[0 0 0 0 1 0 0 ... 0 0 0 0]
 1 2 3 4 5 6 7 ...      |V|
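
The one-hot representation can be sketched in a few lines. This is only an illustration: the helper name one_hot and the toy vocabulary size of 12 are assumptions, and the index is 1-based to match the text.

```python
def one_hot(index, vocab_size):
    """Build a one-hot list of length vocab_size.

    The index is 1-based, following the text, so index 5 sets the fifth element.
    """
    if not 1 <= index <= vocab_size:
        raise ValueError("index must be between 1 and vocab_size")
    vec = [0] * vocab_size
    vec[index - 1] = 1
    return vec


# "toothpaste" is assumed to have index 5; 12 is an arbitrary toy |V|.
x = one_hot(5, 12)
print(x)
```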
Fig. 7.14 shows the additional layers needed to learn the embeddings during LM
training. Here the N=3 context words are represented as 3 one-hot vectors, fully
connected to the embedding layer via 3 instantiations of the embedding matrix E.

[Figure 7.14: network diagram for the example “... and thanks for all the fish ...”
(context words wt−3 , wt−2 , wt−1 ; next word wt = fish). From bottom to top: one-hot
input vectors (|V | × 1) for “for” (index 59), “all” (index 9925), and “the”
(index 45180); embedding matrix E (d × |V |), shared across words; projection
layer (3d × 1); weight matrix W (dh × 3d); hidden layer h (dh × 1); weight matrix
U (|V | × dh ); softmax output layer ŷ (|V | × 1), with nodes such as p(aardvark|…),
p(do|…), p(fish|…), and p(zebra|…); loss L = −log P(fish | for, all, the).]

Figure 7.14 Learning all the way back to embeddings. Notice that the embedding matrix
E is shared among the 3 context words.

Note that we don’t want to learn separate weight matrices for mapping each of the
3 previous words to the projection layer; we want one single embedding dictionary
E that’s shared among these three. That’s because over time, many different words
will appear as wt−2 or wt−1 , and we’d like to just represent each word with one
vector, whichever context position it appears in. The embedding weight matrix E
thus has a column for each word, each a column vector of d dimensions, and hence
has dimensionality d × |V |.
Let’s walk through the forward pass of Fig. 7.14.
1. Select three embeddings from E: Given the three previous words, we look
   up their indices, create 3 one-hot vectors, and then multiply each by the
   embedding matrix E. Consider wt−3 . The one-hot vector for ‘for’ (index 59) is
   multiplied by the embedding matrix E, to give the first part of the first hidden
   layer, called the projection layer. Since each column of the input matrix E is just
   an embedding for a word, and the input is a one-hot column vector xi for word
   Vi , the projection layer for input w will be Exi = ei , the embedding for word i.
   We now concatenate the three embeddings for the context words.
2. Multiply by W: We now multiply by W (and add b) and pass through the
   rectified linear (or other) activation function to get the hidden layer h.
3. Multiply by U: h is now multiplied by U.
4. Apply softmax: After the softmax, each node i in the output layer estimates
   the probability P(wt = i|wt−1 , wt−2 , wt−3 ).
In summary, if we use e to represent the projection layer, formed by concatenating
the embeddings for the N = 3 context words, the equations for a neural
language model become:

e = (E x_1, E x_2, \ldots, E x_N)    (7.32)
h = \sigma(W e + b)    (7.33)
z = U h    (7.34)
\hat{y} = \mathrm{softmax}(z)    (7.35)
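
The forward pass of Eqs. 7.32 to 7.35 can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: the sizes, the random initialization, the zero-based word indices, and the helper names (matvec, random_matrix, forward) are all ours, and ReLU stands in for the generic activation σ.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def relu(v):
    return [max(0.0, x) for x in v]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def forward(context_ids, E, W, b, U):
    # E is d x |V|, so picking column i is the same as computing E x_i
    # for the one-hot vector x_i.
    e = []
    for i in context_ids:
        e.extend(row[i] for row in E)                      # Eq. 7.32
    h = relu([a + bj for a, bj in zip(matvec(W, e), b)])   # Eq. 7.33
    z = matvec(U, h)                                       # Eq. 7.34
    return softmax(z)                                      # Eq. 7.35


rng = random.Random(0)
V, d, dh, n_ctx = 10, 4, 5, 3          # toy sizes, not from the text
E = random_matrix(d, V, rng)           # d x |V|
W = random_matrix(dh, n_ctx * d, rng)  # dh x 3d
b = [0.0] * dh
U = random_matrix(V, dh, rng)          # |V| x dh
y_hat = forward([2, 7, 1], E, W, b, U)
print(len(y_hat), sum(y_hat))
```

Looking up a column of E directly, rather than multiplying by an explicit one-hot vector, gives the same result while avoiding a product that is almost entirely zeros.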

7.5.2 Training the neural language model


To train the model, i.e. to set all the parameters θ = E,W,U, b, we do gradient
descent (Fig. 5.5), using error backpropagation on the computation graph to compute
the gradient. Training thus not only sets the weights W and U of the network, but
also, as we’re predicting upcoming words, learns the embeddings E for each
word that best predict upcoming words.
Generally training proceeds by taking as input a very long text, concatenating all
the sentences, starting with random weights, and then iteratively moving through the
text predicting each word wt . At each word wt , we use the cross-entropy (negative
log likelihood) loss. Recall that the general form for this (repeated from Eq. 7.19) is:

L_{CE}(\hat{y}, y) = -\log \hat{y}_i \quad \text{(where $i$ is the correct class)}    (7.36)

For language modeling, the classes are the words in the vocabulary, so ŷi here
means the probability that the model assigns to the correct next word wt :

L_{CE} = -\log p(w_t \mid w_{t-1}, \ldots, w_{t-n+1})    (7.37)
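
As a small numeric illustration of this loss, the sketch below evaluates the negative log probability assigned to the correct next word. The four-word distribution and the index of the correct word are made-up values.

```python
import math


def cross_entropy(y_hat, correct_index):
    """Negative log probability that the model assigns to the correct class."""
    return -math.log(y_hat[correct_index])


# Hypothetical softmax output over a 4-word vocabulary.
y_hat = [0.1, 0.2, 0.6, 0.1]
loss_if_confident = cross_entropy(y_hat, 2)  # correct word received 0.6
loss_if_wrong = cross_entropy(y_hat, 0)      # correct word received only 0.1
print(loss_if_confident, loss_if_wrong)
```

The loss is small when the model puts most of its probability on the correct word and grows as that probability shrinks.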

The parameter update for stochastic gradient descent for this loss from step s to s + 1
is then:
\theta^{s+1} = \theta^{s} - \eta \, \frac{\partial \left[ -\log p(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) \right]}{\partial \theta}    (7.38)
This gradient can be computed in any standard neural network framework which
will then backpropagate through θ = E,W,U, b.
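
To make the update of Eq. 7.38 concrete, here is a deliberately reduced sketch that applies one stochastic gradient step to the output matrix U only, holding the hidden layer h fixed. In real training the gradient also flows back into W, b, and E. The sizes, values, and function names are illustrative assumptions. The sketch relies on the standard result that the gradient of the cross-entropy loss with respect to z is ŷ minus the one-hot vector for the correct word.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_probs(U, h, target):
    z = [sum(u * x for u, x in zip(row, h)) for row in U]
    y_hat = softmax(z)
    return -math.log(y_hat[target]), y_hat


def sgd_step(U, h, target, eta):
    """One SGD step on U: dL/dU[k][j] = (y_hat[k] - y[k]) * h[j]."""
    _, y_hat = loss_and_probs(U, h, target)
    new_U = []
    for k, row in enumerate(U):
        grad_z = y_hat[k] - (1.0 if k == target else 0.0)
        new_U.append([u - eta * grad_z * x for u, x in zip(row, h)])
    return new_U


h = [1.0, 0.5]                      # toy hidden layer (assumed fixed here)
U = [[0.0, 0.0] for _ in range(3)]  # toy |V| = 3 output matrix
target = 1                          # index of the correct next word
loss_before, _ = loss_and_probs(U, h, target)
U = sgd_step(U, h, target, eta=0.1)
loss_after, _ = loss_and_probs(U, h, target)
print(loss_before, loss_after)
```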
Training the parameters to minimize loss will result both in an algorithm for
language modeling (a word predictor) and in a new set of embeddings E that can
be used as word representations for other tasks.

7.6 Summary
• Neural networks are built out of neural units, originally inspired by human
neurons but now simply an abstract computational device.
• Each neural unit multiplies input values by a weight vector, adds a bias, and
then applies a non-linear activation function like sigmoid, tanh, or rectified
linear.
• In a fully-connected, feedforward network, each unit in layer i is connected
to each unit in layer i + 1, and there are no cycles.
• The power of neural networks comes from the ability of early layers to learn
representations that can be utilized by later layers in the network.
• Neural networks are trained by optimization algorithms like gradient de-
scent.
• Error backpropagation, backward differentiation on a computation graph,
is used to compute the gradients of the loss function for a network.
• Neural language models use a neural network as a probabilistic classifier, to
compute the probability of the next word given the previous n words.
• Neural language models can use pretrained embeddings, or can learn embed-
dings from scratch in the process of language modeling.

Bibliographical and Historical Notes


The origins of neural networks lie in the 1940s McCulloch-Pitts neuron (McCul-
loch and Pitts, 1943), a simplified model of the human neuron as a kind of com-
puting element that could be described in terms of propositional logic. By the late
1950s and early 1960s, a number of labs (including Frank Rosenblatt at Cornell and
Bernard Widrow at Stanford) developed research into neural networks; this phase
saw the development of the perceptron (Rosenblatt, 1958), and the transformation
of the threshold into a bias, a notation we still use (Widrow and Hoff, 1960).
The field of neural networks declined after it was shown that a single percep-
tron unit was unable to model functions as simple as XOR (Minsky and Papert,
1969). While some small amount of work continued during the next two decades,
a major revival for the field didn’t come until the 1980s, when practical tools for
building deeper networks like error backpropagation became widespread (Rumel-
hart et al., 1986). During the 1980s a wide variety of neural network and related
architectures were developed, particularly for applications in psychology and cog-
nitive science (Rumelhart and McClelland 1986b, McClelland and Elman 1986,
Rumelhart and McClelland 1986a, Elman 1990), for which the term connectionist
or parallel distributed processing was often used (Feldman and Ballard 1982,
Smolensky 1988). Many of the principles and techniques developed in this period
are foundational to modern work, including the ideas of distributed representations
(Hinton, 1986), recurrent networks (Elman, 1990), and the use of tensors for com-
positionality (Smolensky, 1990).
By the 1990s larger neural networks began to be applied to many practical lan-
guage processing tasks as well, like handwriting recognition (LeCun et al. 1989) and
speech recognition (Morgan and Bourlard 1990). By the early 2000s, improvements
in computer hardware and advances in optimization and training techniques made it
possible to train even larger and deeper networks, leading to the modern term deep
learning (Hinton et al. 2006, Bengio et al. 2007). We cover more related history in
Chapter 9 and Chapter 26.
There are a number of excellent books on the subject. Goldberg (2017) has
superb coverage of neural networks for natural language processing. For neural
networks in general see Goodfellow et al. (2016) and Nielsen (2015).
CHAPTER 8

Sequence Labeling for Parts of Speech and Named Entities

To each word a warbling note
A Midsummer Night’s Dream, V.I

Dionysius Thrax of Alexandria (c. 100 B . C .), or perhaps someone else (it was a long
time ago), wrote a grammatical sketch of Greek (a “technē”) that summarized the
linguistic knowledge of his day. This work is the source of an astonishing proportion
of modern linguistic vocabulary, including the words syntax, diphthong, clitic, and
analogy. Also included is a description of eight parts of speech: noun, verb,
pronoun, preposition, adverb, conjunction, participle, and article. Although earlier
scholars (including Aristotle as well as the Stoics) had their own lists of parts of
speech, it was Thrax’s set of eight that became the basis for descriptions of European
languages for the next 2000 years. (All the way to the Schoolhouse Rock educational
television shows of our childhood, which had songs about 8 parts of speech, like the
late great Bob Dorough’s Conjunction Junction.) The durability of parts of speech
through two millennia speaks to their centrality in models of human language.
Proper names are another important and anciently studied linguistic category.
While parts of speech are generally assigned to individual words or morphemes, a
proper name is often an entire multiword phrase, like the name “Marie Curie”, the
location “New York City”, or the organization “Stanford University”. We’ll use the
term named entity for, roughly speaking, anything that can be referred to with a
proper name: a person, a location, an organization, although as we’ll see the term is
commonly extended to include things that aren’t entities per se.
Parts of speech (also known as POS) and named entities are useful clues to sentence
structure and meaning. Knowing whether a word is a noun or a verb tells us
about likely neighboring words (nouns in English are preceded by determiners and
adjectives, verbs by nouns) and syntactic structure (verbs have dependency links to
nouns), making part-of-speech tagging a key aspect of parsing. Knowing if a named
entity like Washington is a name of a person, a place, or a university is important to
many natural language understanding tasks like question answering, stance detec-
tion, or information extraction.
In this chapter we’ll introduce the task of part-of-speech tagging, taking a se-
quence of words and assigning each word a part of speech like NOUN or VERB, and
the task of named entity recognition (NER), assigning words or phrases tags like
PERSON , LOCATION , or ORGANIZATION .
Such tasks in which we assign, to each word xi in an input word sequence, a
label yi , so that the output sequence Y has the same length as the input sequence X,
are called sequence labeling tasks. We’ll introduce classic sequence labeling
algorithms, one generative (the Hidden Markov Model, HMM) and one discriminative
(the Conditional Random Field, CRF). In following chapters we’ll introduce modern
sequence labelers based on RNNs and Transformers.

8.1 (Mostly) English Word Classes


Until now we have been using part-of-speech terms like noun and verb rather freely.
In this section we give more complete definitions. While word classes do have
semantic tendencies (adjectives, for example, often describe properties and nouns
people), parts of speech are defined instead based on their grammatical relationship
with neighboring words or the morphological properties of their affixes.

Class         Tag    Description                                                  Example
Open Class    ADJ    Adjective: noun modifiers describing properties              red, young, awesome
              ADV    Adverb: verb modifiers of time, place, manner                very, slowly, home, yesterday
              NOUN   words for persons, places, things, etc.                      algorithm, cat, mango, beauty
              VERB   words for actions and processes                              draw, provide, go
              PROPN  Proper noun: name of a person, organization, place, etc.     Regina, IBM, Colorado
              INTJ   Interjection: exclamation, greeting, yes/no response, etc.   oh, um, yes, hello
Closed Class  ADP    Adposition (Preposition/Postposition): marks a noun’s        in, on, by, under
Words                spatial, temporal, or other relation
              AUX    Auxiliary: helping verb marking tense, aspect, mood, etc.    can, may, should, are
              CCONJ  Coordinating Conjunction: joins two phrases/clauses          and, or, but
              DET    Determiner: marks noun phrase properties                     a, an, the, this
              NUM    Numeral                                                      one, two, first, second
              PART   Particle: a preposition-like form used together with a verb  up, down, on, off, in, out, at, by
              PRON   Pronoun: a shorthand for referring to an entity or event     she, who, I, others
              SCONJ  Subordinating Conjunction: joins a main clause with a        that, which
                     subordinate clause such as a sentential complement
Other         PUNCT  Punctuation                                                  . , ()
              SYM    Symbols like $ or emoji                                      $, %
              X      Other                                                        asdf, qwfg
Figure 8.1 The 17 parts of speech in the Universal Dependencies tagset (Nivre et al., 2016a). Features can
be added to make finer-grained distinctions (with properties like number, case, definiteness, and so on).

Parts of speech fall into two broad categories: closed class and open class.
Closed classes are those with relatively fixed membership, such as prepositions;
new prepositions are rarely coined. By contrast, nouns and verbs are open classes;
new nouns and verbs like iPhone or to fax are continually being created or borrowed.
Closed class words are generally function words like of, it, and, or you, which tend
to be very short, occur frequently, and often have structuring uses in grammar.
Four major open classes occur in the languages of the world: nouns (including
proper nouns), verbs, adjectives, and adverbs, as well as the smaller open class of
interjections. English has all five, although not every language does.
Nouns are words for people, places, or things, but include others as well.
Common nouns include concrete terms like cat and mango, abstractions like algorithm
and beauty, and verb-like terms like pacing as in His pacing to and fro became quite
annoying. Nouns in English can occur with determiners (a goat, its bandwidth),
take possessives (IBM’s annual revenue), and may occur in the plural (goats, abaci).
Many languages, including English, divide common nouns into count nouns and
mass nouns. Count nouns can occur in the singular and plural (goat/goats,
relationship/relationships) and can be counted (one goat, two goats). Mass nouns are
used when something is conceptualized as a homogeneous group. So snow, salt, and
communism are not counted (i.e., *two snows or *two communisms). Proper nouns,
like Regina, Colorado, and IBM, are names of specific persons or entities.
Verbs refer to actions and processes, including main verbs like draw, provide,
and go. English verbs have inflections: non-third-person-singular (eat),
third-person-singular (eats), progressive (eating), and past participle (eaten). While many scholars
believe that all human languages have the categories of noun and verb, others have
argued that some languages, such as Riau Indonesian and Tongan, don’t even make
this distinction (Broschart 1997; Evans 2000; Gil 2000).
Adjectives often describe properties or qualities of nouns, like color (white,
black), age (old, young), and value (good, bad), but there are languages without
adjectives. In Korean, for example, the words corresponding to English adjectives
act as a subclass of verbs, so what is in English an adjective “beautiful” acts in
Korean like a verb meaning “to be beautiful”.
Adverbs are a hodge-podge. In the following example, the words Actually, home,
extremely, quickly, and yesterday are all adverbs:
Actually, I ran home extremely quickly yesterday
Adverbs generally modify something (often verbs, hence the name “adverb”, but
also other adverbs and entire verb phrases). Directional adverbs or locative
adverbs (home, here, downhill) specify the direction or location of some action; degree
adverbs (extremely, very, somewhat) specify the extent of some action, process, or
property; manner adverbs (slowly, slinkily, delicately) describe the manner of some
action or process; and temporal adverbs describe the time that some action or event
took place (yesterday, Monday).
Interjections (oh, hey, alas, uh, um) are a smaller open class that also includes
greetings (hello, goodbye) and question responses (yes, no, uh-huh).
English adpositions occur before nouns, hence are called prepositions. They can
indicate spatial or temporal relations, whether literal (on it, before then, by the house)
or metaphorical (on time, with gusto, beside herself), and relations like marking the
agent in Hamlet was written by Shakespeare.
A particle resembles a preposition or an adverb and is used in combination with
a verb. Particles often have extended meanings that aren’t quite the same as the
prepositions they resemble, as in the particle over in she turned the paper over. A
verb and a particle acting as a single unit is called a phrasal verb. The meaning
of phrasal verbs is often non-compositional, that is, not predictable from the individual
meanings of the verb and the particle. Thus, turn down means ‘reject’, rule out
‘eliminate’, and go on ‘continue’.
Determiners like this and that (this chapter, that page) can mark the start of an
English noun phrase. Articles like a, an, and the are a type of determiner that mark
discourse properties of the noun and are quite frequent; the is the most common
word in written English, with a and an right behind.
Conjunctions join two phrases, clauses, or sentences. Coordinating conjunctions
like and, or, and but join two elements of equal status. Subordinating conjunctions
are used when one of the elements has some embedded status. For example,
the subordinating conjunction that in “I thought that you might like some milk” links
the main clause I thought with the subordinate clause you might like some milk. This
clause is called subordinate because this entire clause is the “content” of the main
verb thought. Subordinating conjunctions like that which link a verb to its argument
in this way are also called complementizers.
Pronouns act as a shorthand for referring to an entity or event. Personal pronouns
refer to persons or entities (you, she, I, it, me, etc.). Possessive pronouns
are forms of personal pronouns that indicate either actual possession or more often
just an abstract relation between the person and some object (my, your, his, her, its,
one’s, our, their). Wh-pronouns (what, who, whom, whoever) are used in certain
question forms, or act as complementizers (Frida, who married Diego...).


Auxiliary verbs mark semantic features of a main verb such as its tense, whether
it is completed (aspect), whether it is negated (polarity), and whether an action is
necessary, possible, suggested, or desired (mood). English auxiliaries include the
copula verb be, the two verbs do and have, along with their inflected forms, as well
as modal verbs used to mark the mood associated with the event depicted by the main
verb: can indicates ability or possibility, may permission or possibility, must necessity.
An English-specific tagset, the 45-tag Penn Treebank tagset (Marcus et al., 1993),
shown in Fig. 8.2, has been used to label many syntactically annotated corpora like
the Penn Treebank corpora, so is worth knowing about.

Tag  Description                Example       Tag   Description         Example       Tag  Description           Example
CC   coord. conj.               and, but, or  NNP   proper noun, sing.  IBM           TO   “to”                  to
CD   cardinal number            one, two      NNPS  proper noun, plu.   Carolinas     UH   interjection          ah, oops
DT   determiner                 a, the        NNS   noun, plural        llamas        VB   verb base             eat
EX   existential ‘there’        there         PDT   predeterminer       all, both     VBD  verb past tense       ate
FW   foreign word               mea culpa     POS   possessive ending   ’s            VBG  verb gerund           eating
IN   preposition/subordin-conj  of, in, by    PRP   personal pronoun    I, you, he    VBN  verb past participle  eaten
JJ   adjective                  yellow        PRP$  possess. pronoun    your, one’s   VBP  verb non-3sg-pr       eat
JJR  comparative adj            bigger        RB    adverb              quickly       VBZ  verb 3sg pres         eats
JJS  superlative adj            wildest       RBR   comparative adv     faster        WDT  wh-determ.            which, that
LS   list item marker           1, 2, One     RBS   superlatv. adv      fastest       WP   wh-pronoun            what, who
MD   modal                      can, should   RP    particle            up, off       WP$  wh-possess.           whose
NN   sing or mass noun          llama         SYM   symbol              +, %, &       WRB  wh-adverb             how, where
Figure 8.2 Penn Treebank part-of-speech tags.

Below we show some examples with each word tagged according to both the
UD and Penn tagsets. Notice that the Penn tagset distinguishes tense and participles
on verbs, and has a special tag for the existential there construction in English. Note
that since New England Journal of Medicine is a proper noun, both tagsets mark its
component nouns as proper nouns (PROPN/NNP), including journal and medicine, which
might otherwise be labeled as common nouns (NOUN/NN).
(8.1) There/PRON/EX are/VERB/VBP 70/NUM/CD children/NOUN/NNS
there/ADV/RB ./PUNCT/.
(8.2) Preliminary/ADJ/JJ findings/NOUN/NNS were/AUX/VBD reported/VERB/VBN
in/ADP/IN today/NOUN/NN ’s/PART/POS New/PROPN/NNP
England/PROPN/NNP Journal/PROPN/NNP of/ADP/IN Medicine/PROPN/NNP
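
As a minimal sketch of how such doubly tagged text might be read into a program, the following splits each token of example (8.1) into its word, UD tag, and Penn tag. The function name parse_tagged is an assumption, and the UD tag names follow Fig. 8.1.

```python
def parse_tagged(text):
    """Split 'word/UD/PTB' tokens into (word, ud_tag, ptb_tag) triples.

    rsplit is used so that a word which itself contains '/' stays intact.
    """
    triples = []
    for token in text.split():
        word, ud_tag, ptb_tag = token.rsplit("/", 2)
        triples.append((word, ud_tag, ptb_tag))
    return triples


example = ("There/PRON/EX are/VERB/VBP 70/NUM/CD children/NOUN/NNS "
           "there/ADV/RB ./PUNCT/.")
triples = parse_tagged(example)
for word, ud_tag, ptb_tag in triples:
    print(f"{word:10s} {ud_tag:6s} {ptb_tag}")
```

Note how the two occurrences of there receive different tags in both tagsets, which is exactly the kind of ambiguity a tagger has to resolve.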

8.2 Part-of-Speech Tagging


Part-of-speech tagging is the process of assigning a part-of-speech to each word in
a text. The input is a sequence x1 , x2 , ..., xn of (tokenized) words and a tagset, and
the output is a sequence y1 , y2 , ..., yn of tags, each output yi corresponding exactly to
one input xi , as shown in the intuition in Fig. 8.3.
Tagging is a disambiguation task; words are ambiguous (they have more than one
possible part-of-speech) and the goal is to find the correct tag for the situation.
For example, book can be a verb (book that flight) or a noun (hand me that book).
That can be a determiner (Does that flight serve dinner) or a complementizer (I
thought that your flight was earlier). The goal of POS-tagging is to resolve these
ambiguities, choosing the proper tag for the context.

[Figure 8.3: a part-of-speech tagger maps the input words x1 ... x5 (Janet will back
the bill) to the output tags y1 ... y5 (NOUN AUX VERB DET NOUN).]

Figure 8.3 The task of part-of-speech tagging: mapping from input words x1 , x2 , ..., xn to
output POS tags y1 , y2 , ..., yn .

The accuracy of part-of-speech tagging algorithms (the percentage of test set
tags that match human gold labels) is extremely high. One study found accuracies
over 97% across 15 languages from the Universal Dependency (UD) treebank (Wu
and Dredze, 2019). Accuracies on various English treebanks are also 97% (no matter
the algorithm; HMMs, CRFs, BERT perform similarly). This 97% number is also
about the human performance on this task, at least for English (Manning, 2011).

                         WSJ              Brown
Types:
  Unambiguous (1 tag)     44,432 (86%)     45,799 (85%)
  Ambiguous (2+ tags)      7,025 (14%)      8,050 (15%)
Tokens:
  Unambiguous (1 tag)    577,421 (45%)    384,349 (33%)
  Ambiguous (2+ tags)    711,780 (55%)    786,646 (67%)
Figure 8.4 Tag ambiguity in the Brown and WSJ corpora (Treebank-3 45-tag tagset).

We’ll introduce algorithms for the task in the next few sections, but first let’s
explore the task. Exactly how hard is it? Fig. 8.4 shows that most word types
(85-86%) are unambiguous (Janet is always NNP, hesitantly is always RB). But the
ambiguous words, though accounting for only 14-15% of the vocabulary, are very
common, and 55-67% of word tokens in running text are ambiguous. Particularly
ambiguous common words include that, back, down, put and set; here are some
examples of the 6 different parts of speech for the word back:
earnings growth took a back/JJ seat
a small building in the back/NN
a clear majority of senators back/VBP the bill
Dave began to back/VB toward the door
enable the country to buy back/RP debt
I was twenty-one back/RB then
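
The type and token counts of Fig. 8.4 can be computed from any tagged corpus along the following lines. This is a sketch on a tiny invented corpus, so the numbers it prints say nothing about WSJ or Brown; the function name ambiguity_stats is an assumption.

```python
from collections import defaultdict


def ambiguity_stats(tagged_tokens):
    """Count word types and tokens that occur with more than one tag."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word].add(tag)
    ambiguous_types = {w for w, tags in tags_by_word.items() if len(tags) > 1}
    ambiguous_tokens = sum(1 for w, _ in tagged_tokens if w in ambiguous_types)
    return {
        "types": len(tags_by_word),
        "ambiguous_types": len(ambiguous_types),
        "tokens": len(tagged_tokens),
        "ambiguous_tokens": ambiguous_tokens,
    }


# Invented toy corpus; only 'back' appears with more than one tag.
toy = [("the", "DT"), ("back", "NN"), ("senators", "NNS"), ("back", "VBP"),
       ("the", "DT"), ("bill", "NN"), ("back", "RB")]
stats = ambiguity_stats(toy)
print(stats)
```

Even in this toy corpus a single ambiguous type accounts for a large share of the tokens, which mirrors the pattern in the figure.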
Nonetheless, many words are easy to disambiguate, because their different tags
aren’t equally likely. For example, a can be a determiner or the letter a, but the
determiner sense is much more likely.
This idea suggests a useful baseline: given an ambiguous word, choose the tag
which is most frequent in the training corpus. This is a key concept:
Most Frequent Class Baseline: Always compare a classifier against a baseline at
least as good as the most frequent class baseline (assigning each token to the class
it occurred in most often in the training set).

The most-frequent-tag baseline has an accuracy of about 92%¹. The baseline
thus differs from the state-of-the-art and human ceiling (97%) by only 5 percentage points.
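
A minimal sketch of the most-frequent-tag baseline follows. The training sentences are invented, the function names are ours, and backing off to the overall most frequent tag for unseen words is one simple assumption among several possible choices.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Record each word's most frequent tag, plus the most frequent tag overall."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
            all_tags[tag] += 1
    best = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    default = all_tags.most_common(1)[0][0]  # assumed fallback for unseen words
    return best, default


def tag_words(words, best, default):
    return [best.get(word, default) for word in words]


def accuracy(predicted, gold):
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


train = [
    [("the", "DT"), ("back", "NN"), ("door", "NN")],
    [("they", "PRP"), ("back", "VBP"), ("the", "DT"), ("bill", "NN")],
    [("in", "IN"), ("the", "DT"), ("back", "NN")],
]
best, default = train_baseline(train)
predicted = tag_words(["senators", "back", "the", "bill"], best, default)
gold = ["NNS", "VBP", "DT", "NN"]
print(predicted, accuracy(predicted, gold))
```

The baseline gets back wrong here because it ignores context entirely, which is the gap that the sequence models in the rest of the chapter aim to close.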

8.3 Named Entities and Named Entity Tagging


Part of speech tagging can tell us that words like Janet, Stanford University, and
Colorado are all proper nouns; being a proper noun is a grammatical property of
these words. But viewed from a semantic perspective, these proper nouns refer to
different kinds of entities: Janet is a person, Stanford University is an organization,
and Colorado is a location.
A named entity is, roughly speaking, anything that can be referred to with a
proper name: a person, a location, an organization. The task of named entity
recognition (NER) is to find spans of text that constitute proper names and tag the type of
the entity. Four entity tags are most common: PER (person), LOC (location), ORG
(organization), and GPE (geo-political entity). However, the term named entity is
commonly extended to include things that aren’t entities per se, including dates,
times, and other kinds of temporal expressions, and even numerical expressions like
prices. Here’s an example of the output of an NER tagger:
Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it
has increased fares by [MONEY $6] per round trip on flights to some
cities also served by lower-cost carriers. [ORG American Airlines], a
unit of [ORG AMR Corp.], immediately matched the move, spokesman
[PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.],
said the increase took effect [TIME Thursday] and applies to most
routes where it competes against discount carriers, such as [LOC Chicago]
to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].
The text contains 13 mentions of named entities including 5 organizations, 4 loca-
tions, 2 times, 1 person, and 1 mention of money. Figure 8.5 shows typical generic
named entity types. Many applications will also need to use specific entity types like
proteins, genes, commercial products, or works of art.
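
The mention counts quoted for the example passage can be checked mechanically. The sketch below reads the bracketed notation with a regular expression; the pattern and the function name extract_mentions are our own illustration, not a standard NER output format.

```python
import re
from collections import Counter

# Matches "[TYPE some text]" and captures the type and the mention text.
MENTION = re.compile(r"\[([A-Z]+) ([^\]]+)\]")


def extract_mentions(text):
    return [(m.group(1), m.group(2).strip()) for m in MENTION.finditer(text)]


text = (
    "Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it "
    "has increased fares by [MONEY $6] per round trip on flights to some "
    "cities also served by lower-cost carriers. [ORG American Airlines], a "
    "unit of [ORG AMR Corp.], immediately matched the move, spokesman "
    "[PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], "
    "said the increase took effect [TIME Thursday] and applies to most "
    "routes where it competes against discount carriers, such as [LOC Chicago] "
    "to [LOC Dallas] and [LOC Denver] to [LOC San Francisco]."
)
mentions = extract_mentions(text)
counts = Counter(label for label, _ in mentions)
print(len(mentions), dict(counts))
```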

Type                  Tag  Sample Categories        Example sentences
People                PER  people, characters       Turing is a giant of computer science.
Organization          ORG  companies, sports teams  The IPCC warned about the cyclone.
Location              LOC  regions, mountains, seas Mt. Sanitas is in Sunshine Canyon.
Geo-Political Entity  GPE  countries, states        Palo Alto is raising the fees for parking.
Figure 8.5 A list of generic named entity types with the kinds of entities they refer to.

Named entity tagging is a useful first step in lots of natural language understand-
ing tasks. In sentiment analysis we might want to know a consumer’s sentiment
toward a particular entity. Entities are a useful first stage in question answering,
or for linking text to information in structured knowledge sources like Wikipedia.
And named entity tagging is also central to natural language understanding tasks
of building semantic representations, like extracting events and the relationship be-
tween participants.
Unlike part-of-speech tagging, where there is no segmentation problem since
each word gets one tag, the task of named entity recognition is to find and label
spans of text, and is difficult partly because of the ambiguity of segmentation; we
need to decide what’s an entity and what isn’t, and where the boundaries are. Indeed,
most words in a text will not be named entities. Another difficulty is caused by type
ambiguity. The mention JFK can refer to a person, the airport in New York, or any
number of schools, bridges, and streets around the United States. Some examples of
this kind of cross-type confusion are given in Figure 8.6.

¹ In English, on the WSJ corpus, tested on sections 22-24.

[PER Washington] was born into slavery on the farm of James Burroughs.
[ORG Washington] went up 2 games to 1 in the four-game series.
Blair arrived in [LOC Washington] for what may well be his last state visit.
In June, [GPE Washington] passed a primary seatbelt law.
Figure 8.6 Examples of type ambiguities in the use of the name Washington.

The standard approach to sequence labeling for a span-recognition problem like
NER is BIO tagging (Ramshaw and Marcus, 1995). This is a method that allows us
to treat NER like a word-by-word sequence labeling task, via tags that capture both
the boundary and the named entity type. Consider the following sentence:
[PER Jane Villanueva ] of [ORG United] , a unit of [ORG United Airlines
Holding] , said the fare applies to the [LOC Chicago ] route.
BIO Figure 8.7 shows the same excerpt represented with BIO tagging, as well as
variants called IO tagging and BIOES tagging. In BIO tagging we label any token
that begins a span of interest with the label B, tokens that occur inside a span are
tagged with an I, and any tokens outside of any span of interest are labeled O. While
there is only one O tag, we’ll have distinct B and I tags for each named entity class.
The number of tags is thus 2n + 1 tags, where n is the number of entity types. BIO
tagging can represent exactly the same information as the bracketed notation, but has
the advantage that we can represent the task in the same simple sequence modeling
way as part-of-speech tagging: assigning a single label yi to each input word xi :

Words IO Label BIO Label BIOES Label


Jane I-PER B-PER B-PER
Villanueva I-PER I-PER E-PER
of O O O
United I-ORG B-ORG B-ORG
Airlines I-ORG I-ORG I-ORG
Holding I-ORG I-ORG E-ORG
discussed O O O
the O O O
Chicago I-LOC B-LOC S-LOC
route O O O
. O O O
Figure 8.7 NER as a sequence model, showing IO, BIO, and BIOES taggings.

We’ve also shown two variant tagging schemes: IO tagging, which loses some
information by eliminating the B tag, and BIOES tagging, which adds an end tag
E for the end of a span, and a span tag S for a span consisting of only one word.
A sequence labeler (HMM, CRF, RNN, Transformer, etc.) is trained to label each
token in a text with tags that indicate the presence (or absence) of particular kinds
of named entities.
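
The three schemes in Figure 8.7 can be sketched in a few lines of code. The function name `spans_to_tags` and the `(start, end, type)` span encoding with an exclusive end index are assumptions made for this illustration; they are not taken from the text.

```python
def spans_to_tags(n_tokens, spans, scheme="BIO"):
    """Convert (start, end, type) spans (end exclusive) into one tag per token.

    Hypothetical helper for illustration; supports the IO, BIO, and BIOES schemes.
    """
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        for i in range(start, end):
            if scheme == "IO":
                prefix = "I"
            elif scheme == "BIO":
                prefix = "B" if i == start else "I"
            elif scheme == "BIOES":
                if end - start == 1:
                    prefix = "S"          # single-token span
                elif i == start:
                    prefix = "B"
                elif i == end - 1:
                    prefix = "E"          # last token of a multi-token span
                else:
                    prefix = "I"
            else:
                raise ValueError(f"unknown scheme: {scheme}")
            tags[i] = f"{prefix}-{etype}"
    return tags


words = "Jane Villanueva of United Airlines Holding discussed the Chicago route .".split()
spans = [(0, 2, "PER"), (3, 6, "ORG"), (8, 9, "LOC")]
for scheme in ("IO", "BIO", "BIOES"):
    print(scheme, list(zip(words, spans_to_tags(len(words), spans, scheme))))
```

Note that under IO tagging two adjacent entities of the same type would merge into one run of I tags, which is the information loss mentioned above.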
8.4 • HMM PART- OF -S PEECH TAGGING 155

8.4 HMM Part-of-Speech Tagging


In this section we introduce our first sequence labeling algorithm, the Hidden Markov
Model, and show how to apply it to part-of-speech tagging. Recall that a sequence
labeler is a model whose job is to assign a label to each unit in a sequence, thus map-
ping a sequence of observations to a sequence of labels of the same length. HMMs
are a classic model that introduces many of the key concepts of sequence modeling
that we will see again in more modern models.
An HMM is a probabilistic sequence model: given a sequence of units (words,
letters, morphemes, sentences, whatever), it computes a probability distribution over
possible sequences of labels and chooses the best label sequence.

8.4.1 Markov Chains


The HMM is based on augmenting the Markov chain. A Markov chain is a model
that tells us something about the probabilities of sequences of random variables,
states, each of which can take on values from some set. These sets can be words, or
tags, or symbols representing anything, for example the weather. A Markov chain
makes a very strong assumption that if we want to predict the future in the sequence,
all that matters is the current state. All the states before the current state have no im-
pact on the future except via the current state. It’s as if to predict tomorrow’s weather
you could examine today’s weather but you weren’t allowed to look at yesterday’s
weather.

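[Figure 8.8 graphic: (a) a three-state weather chain with states HOT1, COLD2, and WARM3; (b) a three-state word chain with states are, uniformly, and charming. Each arc is labeled with a transition probability.]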
Figure 8.8 A Markov chain for weather (a) and one for words (b), showing states and
transitions. A start distribution π is required; setting π = [0.1, 0.7, 0.2] for (a) would mean a
probability 0.7 of starting in state 2 (cold), probability 0.1 of starting in state 1 (hot), etc.

More formally, consider a sequence of state variables $q_1, q_2, \ldots, q_i$. A Markov
model embodies the Markov assumption on the probabilities of this sequence: that
when predicting the future, the past doesn’t matter, only the present.

$$\text{Markov Assumption:}\quad P(q_i = a \mid q_1 \ldots q_{i-1}) = P(q_i = a \mid q_{i-1}) \tag{8.3}$$

Figure 8.8a shows a Markov chain for assigning a probability to a sequence of
weather events, for which the vocabulary consists of HOT, COLD, and WARM. The
states are represented as nodes in the graph, and the transitions, with their
probabilities, as edges. The transitions are probabilities: the values of arcs leaving a given
state must sum to 1. Figure 8.8b shows a Markov chain for assigning a probability to
a sequence of words $w_1 \ldots w_t$. This Markov chain should be familiar; in fact, it
represents a bigram language model, with each edge expressing the probability $p(w_i \mid w_j)$!
Given the two models in Fig. 8.8, we can assign a probability to any sequence from
our vocabulary.

Formally, a Markov chain is specified by the following components:

$Q = q_1 q_2 \ldots q_N$    a set of $N$ states
$A = a_{11} a_{12} \ldots a_{N1} \ldots a_{NN}$    a transition probability matrix $A$, each $a_{ij}$ representing the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^{N} a_{ij} = 1 \;\; \forall i$
$\pi = \pi_1, \pi_2, \ldots, \pi_N$    an initial probability distribution over states. $\pi_i$ is the probability that the Markov chain will start in state $i$. Some states $j$ may have $\pi_j = 0$, meaning that they cannot be initial states. Also, $\sum_{i=1}^{N} \pi_i = 1$

Before you go on, use the sample probabilities in Fig. 8.8a (with π = [0.1, 0.7, 0.2])
to compute the probability of each of the following sequences:
(8.4) hot hot hot hot
(8.5) cold hot cold hot
What does the difference in these probabilities tell you about a real-world weather
fact encoded in Fig. 8.8a?
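
The computation asked for here multiplies the start probability of the first state by one transition probability per subsequent step. A minimal sketch follows; the start distribution is the $\pi$ given in the caption of Fig. 8.8, but the transition values in `A` are an assumed reading of the arcs in Fig. 8.8a and should be checked against the figure before relying on the numbers.

```python
# Start distribution from the caption of Fig. 8.8: pi = [0.1, 0.7, 0.2]
# for states 1 (hot), 2 (cold), 3 (warm).
PI = {"hot": 0.1, "cold": 0.7, "warm": 0.2}

# ASSUMPTION: an illustrative reading of the arcs in Fig. 8.8a.
# A[prev][cur] = P(cur | prev); each row sums to 1.
A = {
    "hot":  {"hot": 0.6, "cold": 0.1, "warm": 0.3},
    "cold": {"hot": 0.1, "cold": 0.8, "warm": 0.1},
    "warm": {"hot": 0.3, "cold": 0.1, "warm": 0.6},
}


def sequence_probability(states, pi, a):
    """P(q_1 ... q_n) = pi[q_1] * product over i of a[q_{i-1}][q_i]."""
    prob = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= a[prev][cur]
    return prob


for seq in (["hot"] * 4, ["cold", "hot", "cold", "hot"]):
    print(" ".join(seq), sequence_probability(seq, PI, A))
```

Under these assumed values the first sequence is far more probable than the second, even though cold is the more likely starting state, because self-transitions dominate each row.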

8.4.2 The Hidden Markov Model


A Markov chain is useful when we need to compute a probability for a sequence
of observable events. In many cases, however, the events we are interested in are
hidden: we don’t observe them directly. For example we don’t normally observe
part-of-speech tags in a text. Rather, we see words, and must infer the tags from the
word sequence. We call the tags hidden because they are not observed.
A hidden Markov model (HMM) allows us to talk about both observed events
(like words that we see in the input) and hidden events (like part-of-speech tags) that
we think of as causal factors in our probabilistic model. An HMM is specified by
the following components:
$Q = q_1 q_2 \ldots q_N$    a set of $N$ states
$A = a_{11} \ldots a_{ij} \ldots a_{NN}$    a transition probability matrix $A$, each $a_{ij}$ representing the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^{N} a_{ij} = 1 \;\; \forall i$
$O = o_1 o_2 \ldots o_T$    a sequence of $T$ observations, each one drawn from a vocabulary $V = v_1, v_2, \ldots, v_V$
$B = b_i(o_t)$    a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation $o_t$ being generated from a state $q_i$
$\pi = \pi_1, \pi_2, \ldots, \pi_N$    an initial probability distribution over states. $\pi_i$ is the probability that the Markov chain will start in state $i$. Some states $j$ may have $\pi_j = 0$, meaning that they cannot be initial states. Also, $\sum_{i=1}^{N} \pi_i = 1$

A first-order hidden Markov model instantiates two simplifying assumptions.
First, as with a first-order Markov chain, the probability of a particular state depends
only on the previous state:

$$\text{Markov Assumption:}\quad P(q_i \mid q_1, \ldots, q_{i-1}) = P(q_i \mid q_{i-1}) \tag{8.6}$$

Second, the probability of an output observation $o_i$ depends only on the state that
produced the observation $q_i$ and not on any other states or any other observations:

$$\text{Output Independence:}\quad P(o_i \mid q_1, \ldots, q_i, \ldots, q_T, o_1, \ldots, o_i, \ldots, o_T) = P(o_i \mid q_i) \tag{8.7}$$



8.4.3 The components of an HMM tagger


Let’s start by looking at the pieces of an HMM tagger, and then we’ll see how to use
it to tag. An HMM has two components, the A and B probabilities.
The A matrix contains the tag transition probabilities P(ti |ti−1 ) which represent
the probability of a tag occurring given the previous tag. For example, modal verbs
like will are very likely to be followed by a verb in the base form, a VB, like race, so
we expect this probability to be high. We compute the maximum likelihood estimate
of this transition probability by counting, out of the times we see the first tag in a
labeled corpus, how often the first tag is followed by the second:

$$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})} \tag{8.8}$$

In the WSJ corpus, for example, MD occurs 13124 times, of which it is followed
by VB 10471 times, for an MLE estimate of

$$P(VB \mid MD) = \frac{C(MD, VB)}{C(MD)} = \frac{10471}{13124} = .80 \tag{8.9}$$

Let’s walk through an example, seeing how these probabilities are estimated and
used in a sample tagging task, before we return to the algorithm for decoding.
In HMM tagging, the probabilities are estimated by counting on a tagged training
corpus. For this example we’ll use the tagged WSJ corpus.
The B emission probabilities, P(wi |ti ), represent the probability, given a tag (say
MD), that it will be associated with a given word (say will). The MLE of the emis-
sion probability is
$$P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)} \tag{8.10}$$

Of the 13124 occurrences of MD in the WSJ corpus, it is associated with will 4046
times:

$$P(\textit{will} \mid MD) = \frac{C(MD, \textit{will})}{C(MD)} = \frac{4046}{13124} = .31 \tag{8.11}$$

We saw this kind of Bayesian modeling in Chapter 4; recall that this likelihood
term is not asking “which is the most likely tag for the word will?” That would be
the posterior P(MD|will). Instead, P(will|MD) answers the slightly counterintuitive
question “If we were going to generate a MD, how likely is it that this modal would
be will?”
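
The counting behind Eq. 8.8 and Eq. 8.10 can be sketched as follows. The two-sentence corpus, the function name `estimate_hmm`, and the use of `<s>` as the start symbol inside the count tables are assumptions for illustration; a real tagger would count over a large tagged corpus such as the WSJ data.

```python
from collections import Counter


def estimate_hmm(tagged_sentences):
    """MLE estimates of P(t_i | t_{i-1}) and P(w_i | t_i) from (word, tag) pairs."""
    tag_counts = Counter()         # C(t), with <s> counted once per sentence
    transition_counts = Counter()  # C(t_{i-1}, t_i)
    emission_counts = Counter()    # C(t_i, w_i)
    for sentence in tagged_sentences:
        prev = "<s>"
        tag_counts[prev] += 1
        for word, tag in sentence:
            transition_counts[(prev, tag)] += 1
            emission_counts[(tag, word)] += 1
            tag_counts[tag] += 1
            prev = tag
    transitions = {(prev, tag): count / tag_counts[prev]
                   for (prev, tag), count in transition_counts.items()}
    emissions = {(tag, word): count / tag_counts[tag]
                 for (tag, word), count in emission_counts.items()}
    return transitions, emissions


# Hypothetical toy corpus, for illustration only.
corpus = [
    [("Janet", "NNP"), ("will", "MD"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
    [("the", "DT"), ("will", "NN"), ("will", "MD"), ("stand", "VB")],
]
transitions, emissions = estimate_hmm(corpus)
print("P(VB | MD)   =", transitions[("MD", "VB")])
print("P(will | NN) =", emissions[("NN", "will")])
```

Because the denominator is the plain tag count, as in Eq. 8.8, a tag that ends a sentence contributes to $C(t_{i-1})$ without a following tag, so rows of the estimated matrix need not sum exactly to 1 unless an end-of-sentence symbol is added.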
The A transition probabilities and B observation likelihoods of the HMM are
illustrated in Fig. 8.9 for three states in an HMM part-of-speech tagger; the full
tagger would have one state for each tag.

8.4.4 HMM tagging as decoding


For any model, such as an HMM, that contains hidden variables, the task of
determining the hidden variables sequence corresponding to the sequence of observations
is called decoding. More formally,

Decoding: Given as input an HMM $\lambda = (A, B)$ and a sequence of
observations $O = o_1, o_2, \ldots, o_T$, find the most probable sequence of states
$Q = q_1 q_2 q_3 \ldots q_T$.

[Figure 8.9 graphic: three fully connected states VB1, MD2, and NN3 with transition arcs a11 ... a33. Each state i has an emission table Bi listing P("aardvark" | tag) ... P("will" | tag) ... P("the" | tag) ... P("back" | tag) ... P("zebra" | tag).]

Figure 8.9 An illustration of the two parts of an HMM representation: the A transition
probabilities used to compute the prior probability, and the B observation likelihoods that are
associated with each state, one likelihood for each possible observation word.

For part-of-speech tagging, the goal of HMM decoding is to choose the tag
sequence $t_1 \ldots t_n$ that is most probable given the observation sequence of $n$ words
$w_1 \ldots w_n$:

$$\hat{t}_{1:n} = \operatorname*{argmax}_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n) \tag{8.12}$$

The way we’ll do this in the HMM is to use Bayes’ rule to instead compute:

$$\hat{t}_{1:n} = \operatorname*{argmax}_{t_1 \ldots t_n} \frac{P(w_1 \ldots w_n \mid t_1 \ldots t_n)\, P(t_1 \ldots t_n)}{P(w_1 \ldots w_n)} \tag{8.13}$$

Furthermore, we simplify Eq. 8.13 by dropping the denominator $P(w_1 \ldots w_n)$, which is the same for every candidate tag sequence:

$$\hat{t}_{1:n} = \operatorname*{argmax}_{t_1 \ldots t_n} P(w_1 \ldots w_n \mid t_1 \ldots t_n)\, P(t_1 \ldots t_n) \tag{8.14}$$

HMM taggers make two further simplifying assumptions. The first is that the
probability of a word appearing depends only on its own tag and is independent of
neighboring words and tags:

$$P(w_1 \ldots w_n \mid t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(w_i \mid t_i) \tag{8.15}$$

The second assumption, the bigram assumption, is that the probability of a tag is
dependent only on the previous tag, rather than the entire tag sequence:

$$P(t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \tag{8.16}$$

Plugging the simplifying assumptions from Eq. 8.15 and Eq. 8.16 into Eq. 8.14
results in the following equation for the most probable tag sequence from a bigram
tagger:

$$\hat{t}_{1:n} = \operatorname*{argmax}_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n) \approx \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} \overbrace{P(w_i \mid t_i)}^{\text{emission}}\; \overbrace{P(t_i \mid t_{i-1})}^{\text{transition}} \tag{8.17}$$

The two parts of Eq. 8.17 correspond neatly to the B emission probability and A
transition probability that we just defined above!

8.4.5 The Viterbi Algorithm


The decoding algorithm for HMMs is the Viterbi algorithm shown in Fig. 8.10.
As an instance of dynamic programming, Viterbi resembles the dynamic
programming minimum edit distance algorithm of Chapter 2.

function VITERBI(observations of len T, state-graph of len N) returns best-path, path-prob

  create a path probability matrix viterbi[N,T]
  for each state s from 1 to N do                                  ; initialization step
      viterbi[s,1] ← π_s ∗ b_s(o_1)
      backpointer[s,1] ← 0
  for each time step t from 2 to T do                              ; recursion step
      for each state s from 1 to N do
          viterbi[s,t] ← max over s′ = 1..N of viterbi[s′,t−1] ∗ a_{s′,s} ∗ b_s(o_t)
          backpointer[s,t] ← argmax over s′ = 1..N of viterbi[s′,t−1] ∗ a_{s′,s} ∗ b_s(o_t)
  bestpathprob ← max over s = 1..N of viterbi[s,T]                 ; termination step
  bestpathpointer ← argmax over s = 1..N of viterbi[s,T]           ; termination step
  bestpath ← the path starting at state bestpathpointer, that follows backpointer[] to states back in time
  return bestpath, bestpathprob

Figure 8.10 Viterbi algorithm for finding the optimal sequence of tags. Given an observation sequence and
an HMM λ = (A, B), the algorithm returns the state path through the HMM that assigns maximum likelihood
to the observation sequence.

The Viterbi algorithm first sets up a probability matrix or lattice, with one col-
umn for each observation ot and one row for each state in the state graph. Each col-
umn thus has a cell for each state qi in the single combined automaton. Figure 8.11
shows an intuition of this lattice for the sentence Janet will back the bill.
Each cell of the lattice, vt ( j), represents the probability that the HMM is in state
j after seeing the first t observations and passing through the most probable state
sequence q1 , ..., qt−1 , given the HMM λ . The value of each cell vt ( j) is computed
by recursively taking the most probable path that could lead us to this cell. Formally,
each cell expresses the probability

$$v_t(j) = \max_{q_1, \ldots, q_{t-1}} P(q_1 \ldots q_{t-1}, o_1, o_2 \ldots o_t, q_t = j \mid \lambda) \tag{8.18}$$

We represent the most probable path by taking the maximum over all possible
previous state sequences, $\max_{q_1, \ldots, q_{t-1}}$. Like other dynamic programming algorithms,
Viterbi fills each cell recursively. Given that we had already computed the probabil-
ity of being in every state at time t − 1, we compute the Viterbi probability by taking
the most probable of the extensions of the paths that lead to the current cell. For a
given state q j at time t, the value vt ( j) is computed as

$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t) \tag{8.19}$$

The three factors that are multiplied in Eq. 8.19 for extending the previous paths to
compute the Viterbi probability at time t are

$v_{t-1}(i)$    the previous Viterbi path probability from the previous time step
$a_{ij}$    the transition probability from previous state $q_i$ to current state $q_j$
$b_j(o_t)$    the state observation likelihood of the observation symbol $o_t$ given the current state $j$

[Figure 8.11 graphic: a lattice with one column for each word of Janet will back the bill and, in each column, one node for each of the tags DT, RB, NN, JJ, VB, MD, and NNP.]
Figure 8.11 A sketch of the lattice for Janet will back the bill, showing the possible tags ($q_i$)
for each word and highlighting the path corresponding to the correct tag sequence through the
hidden states. States (parts of speech) which have a zero probability of generating a particular
word according to the B matrix (such as the probability that a determiner DT will be realized
as Janet) are greyed out.

8.4.6 Working through an example


Let’s tag the sentence Janet will back the bill; the goal is the correct series of tags
(see also Fig. 8.11):
(8.20) Janet/NNP will/MD back/VB the/DT bill/NN

        NNP      MD       VB       JJ       NN       RB       DT
<s>     0.2767   0.0006   0.0031   0.0453   0.0449   0.0510   0.2026
NNP     0.3777   0.0110   0.0009   0.0084   0.0584   0.0090   0.0025
MD      0.0008   0.0002   0.7968   0.0005   0.0008   0.1698   0.0041
VB      0.0322   0.0005   0.0050   0.0837   0.0615   0.0514   0.2231
JJ      0.0366   0.0004   0.0001   0.0733   0.4509   0.0036   0.0036
NN      0.0096   0.0176   0.0014   0.0086   0.1216   0.0177   0.0068
RB      0.0068   0.0102   0.1011   0.1012   0.0120   0.0728   0.0479
DT      0.1147   0.0021   0.0002   0.2157   0.4744   0.0102   0.0017
Figure 8.12 The A transition probabilities $P(t_i \mid t_{i-1})$ computed from the WSJ corpus
without smoothing. Rows are labeled with the conditioning event; thus $P(VB \mid MD)$ is 0.7968.

Let the HMM be defined by the two tables in Fig. 8.12 and Fig. 8.13. Figure 8.12
lists the $a_{ij}$ probabilities for transitioning between the hidden states (part-of-speech
tags). Figure 8.13 expresses the $b_i(o_t)$ probabilities, the observation likelihoods of
words given tags. This table is (slightly simplified) from counts in the WSJ corpus.
So the word Janet only appears as an NNP, back has 4 possible parts of speech, and
the word the can appear as a determiner or as an NNP (in titles like “Somewhere
Over the Rainbow” all words are tagged as NNP).

        Janet      will       back       the        bill
NNP     0.000032   0          0          0.000048   0
MD      0          0.308431   0          0          0
VB      0          0.000028   0.000672   0          0.000028
JJ      0          0          0.000340   0          0
NN      0          0.000200   0.000223   0          0.002337
RB      0          0          0.010446   0          0
DT      0          0          0          0.506099   0
Figure 8.13 Observation likelihoods B computed from the WSJ corpus without smoothing,
simplified slightly.

[Figure 8.14 graphic: the Viterbi lattice for Janet will back the bill, with rows q1 NNP, q2 MD, q3 VB, q4 JJ, q5 NN, q6 RB, q7 DT, columns o1 ... o5, a start state π, and backtrace arrows. Values shown include v1(1) = P(NNP|start) × .000032 = .28 × .000032 = .000009; v2(2) = max × .308 = 2.772e-8; v2(3) = max × .000028 = 2.5e-13; v2(5) = max × .0002 = .0000000001.]

Figure 8.14 The first few entries in the individual state columns for the Viterbi algorithm. Each cell keeps
the probability of the best path so far and a pointer to the previous cell along that path. We have only filled out
columns 1 and 2; to avoid clutter most cells with value 0 are left empty. The rest is left as an exercise for the
reader. After the cells are filled in, backtracing from the end state, we should be able to reconstruct the correct
state sequence NNP MD VB DT NN.

Figure 8.14 shows a fleshed-out version of the sketch we saw in Fig. 8.11, the
Viterbi lattice for computing the best hidden state sequence for the observation se-
quence Janet will back the bill.
There are T = 5 columns, one per word, each with N = 7 state cells. We begin in
column 1 (for the word Janet) by
setting the Viterbi value in each cell to the product of the π transition probability
(the start probability for that state i, which we get from the <s> entry of Fig. 8.12),
and the observation likelihood of the word Janet given the tag for that cell. Most of
the cells in the column are zero since the word Janet cannot be any of those tags.
The reader should find this in Fig. 8.14.
Next, each cell in the will column gets updated. For each state, we compute the
value viterbi[s,t] by taking the maximum over the extensions of all the paths from
the previous column that lead to the current cell according to Eq. 8.19. We have
shown the values for the MD, VB, and NN cells. Each cell gets the max of the 7
values from the previous column, multiplied by the appropriate transition probabil-
ity; as it happens in this case, most of them are zero from the previous column. The
remaining value is multiplied by the relevant observation probability, and the (triv-
ial) max is taken. In this case the final value, 2.772e-8, comes from the NNP state at
the previous column. The reader should fill in the rest of the lattice in Fig. 8.14 and
backtrace to see whether or not the Viterbi algorithm returns the gold state sequence
NNP MD VB DT NN.
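
As a way to check a hand-filled lattice, the following sketch runs the recursion of Eq. 8.19 on the tables of Fig. 8.12 and Fig. 8.13. The list-of-lists layout, the 0-based indices, and the function name `viterbi` are choices made for this illustration, not the book's implementation.

```python
TAGS = ["NNP", "MD", "VB", "JJ", "NN", "RB", "DT"]
WORDS = ["Janet", "will", "back", "the", "bill"]

# Start probabilities: the <s> row of Fig. 8.12.
PI = [0.2767, 0.0006, 0.0031, 0.0453, 0.0449, 0.0510, 0.2026]

# A[i][j] = P(TAGS[j] | TAGS[i]): the remaining rows of Fig. 8.12.
A = [
    [0.3777, 0.0110, 0.0009, 0.0084, 0.0584, 0.0090, 0.0025],  # NNP
    [0.0008, 0.0002, 0.7968, 0.0005, 0.0008, 0.1698, 0.0041],  # MD
    [0.0322, 0.0005, 0.0050, 0.0837, 0.0615, 0.0514, 0.2231],  # VB
    [0.0366, 0.0004, 0.0001, 0.0733, 0.4509, 0.0036, 0.0036],  # JJ
    [0.0096, 0.0176, 0.0014, 0.0086, 0.1216, 0.0177, 0.0068],  # NN
    [0.0068, 0.0102, 0.1011, 0.1012, 0.0120, 0.0728, 0.0479],  # RB
    [0.1147, 0.0021, 0.0002, 0.2157, 0.4744, 0.0102, 0.0017],  # DT
]

# B[i][k] = P(WORDS[k] | TAGS[i]): Fig. 8.13.
B = [
    [0.000032, 0, 0, 0.000048, 0],         # NNP
    [0, 0.308431, 0, 0, 0],                # MD
    [0, 0.000028, 0.000672, 0, 0.000028],  # VB
    [0, 0, 0.000340, 0, 0],                # JJ
    [0, 0.000200, 0.000223, 0, 0.002337],  # NN
    [0, 0, 0.010446, 0, 0],                # RB
    [0, 0, 0, 0.506099, 0],                # DT
]


def viterbi(obs, pi, a, b):
    """Return (best state path, its probability) for observation indices obs."""
    n_states, n_steps = len(pi), len(obs)
    v = [[0.0] * n_steps for _ in range(n_states)]
    back = [[0] * n_steps for _ in range(n_states)]
    for s in range(n_states):                       # initialization
        v[s][0] = pi[s] * b[s][obs[0]]
    for t in range(1, n_steps):                     # recursion (Eq. 8.19)
        for s in range(n_states):
            # b[s][obs[t]] is constant over the previous state, so it can
            # be applied after the max.
            best_prev = max(range(n_states), key=lambda sp: v[sp][t - 1] * a[sp][s])
            v[s][t] = v[best_prev][t - 1] * a[best_prev][s] * b[s][obs[t]]
            back[s][t] = best_prev
    last = max(range(n_states), key=lambda s: v[s][n_steps - 1])  # termination
    path = [last]
    for t in range(n_steps - 1, 0, -1):             # follow backpointers
        path.append(back[path[-1]][t])
    path.reverse()
    return path, v[last][n_steps - 1]


path, prob = viterbi(list(range(len(WORDS))), PI, A, B)
print([TAGS[s] for s in path], prob)
```

With the unrounded table entries the MD cell for will comes out near 3.0e-8; the 2.772e-8 quoted above appears to result from rounded intermediate values (.000009 × .01 × .308). For long sentences an implementation would normally add log probabilities instead of multiplying, to avoid numerical underflow.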

8.5 Conditional Random Fields (CRFs)


While the HMM is a useful and powerful model, it turns out that HMMs need a
number of augmentations to achieve high accuracy. For example, in POS tagging
as in other tasks, we often run into unknown words: proper names and acronyms
are created very often, and even new common nouns and verbs enter the language
at a surprising rate. It would be great to have ways to add arbitrary features to
help with this, perhaps based on capitalization or morphology (words starting with
capital letters are likely to be proper nouns, words ending with -ed tend to be past
tense (VBD or VBN), etc.). Or knowing the previous or following words might be a
useful feature (if the previous word is the, the current tag is unlikely to be a verb).
Although we could try to hack the HMM to find ways to incorporate some of
these, in general it’s hard for generative models like HMMs to add arbitrary features
directly into the model in a clean way. We’ve already seen a model for combining
arbitrary features in a principled way: log-linear models like the logistic regression
model of Chapter 5! But logistic regression isn’t a sequence model; it assigns a class
to a single observation.
Luckily, there is a discriminative sequence model based on log-linear models:
the conditional random field (CRF). We’ll describe here the linear chain CRF,
the version of the CRF most commonly used for language processing, and the one
whose conditioning closely matches the HMM.
Assume we have a sequence of input words $X = x_1 \ldots x_n$ and want to
compute a sequence of output tags $Y = y_1 \ldots y_n$. In an HMM, to compute the
best tag sequence that maximizes $P(Y \mid X)$ we rely on Bayes’ rule and the likelihood
$P(X \mid Y)$:

$$\begin{aligned}
\hat{Y} &= \operatorname*{argmax}_{Y} p(Y \mid X) \\
        &= \operatorname*{argmax}_{Y} p(X \mid Y)\, p(Y) \\
        &= \operatorname*{argmax}_{Y} \prod_{i} p(x_i \mid y_i) \prod_{i} p(y_i \mid y_{i-1})
\end{aligned} \tag{8.21}$$

In a CRF, by contrast, we compute the posterior $p(Y \mid X)$ directly, training the
CRF to discriminate among the possible tag sequences:

$$\hat{Y} = \operatorname*{argmax}_{Y \in \mathcal{Y}} P(Y \mid X) \tag{8.22}$$


However, the CRF does not compute a probability for each tag at each time step. In-
stead, at each time step the CRF computes log-linear functions over a set of relevant
features, and these local features are aggregated and normalized to produce a global
probability for the whole sequence.
Let’s introduce the CRF more formally, again using X and Y as the input and
output sequences. A CRF is a log-linear model that assigns a probability to an entire
output (tag) sequence Y , out of all possible sequences Y, given the entire input (word)
sequence X. We can think of a CRF as a giant version of what multinomial
logistic regression does for a single token. Recall that the feature function f in
regular multinomial logistic regression maps a tuple of a token x and a label y into
a feature vector. In a CRF, the function F maps an entire input sequence X and an
entire output sequence Y to a feature vector. Let’s assume we have K features, with
a weight wk for each feature Fk :
$$p(Y \mid X) = \frac{\exp\left(\sum_{k=1}^{K} w_k F_k(X, Y)\right)}{\sum_{Y' \in \mathcal{Y}} \exp\left(\sum_{k=1}^{K} w_k F_k(X, Y')\right)} \tag{8.23}$$

It’s common to also describe the same equation by pulling out the denominator into
a function Z(X):
$$p(Y \mid X) = \frac{1}{Z(X)} \exp\left(\sum_{k=1}^{K} w_k F_k(X, Y)\right) \tag{8.24}$$

$$Z(X) = \sum_{Y' \in \mathcal{Y}} \exp\left(\sum_{k=1}^{K} w_k F_k(X, Y')\right) \tag{8.25}$$

We’ll call these K functions Fk (X,Y ) global features, since each one is a property
of the entire input sequence X and output sequence Y . We compute them by decom-
posing into a sum of local features for each position i in Y :
$$F_k(X, Y) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, X, i) \tag{8.26}$$

Each of these local features $f_k$ in a linear-chain CRF is allowed to make use of the
current output token $y_i$, the previous output token $y_{i-1}$, the entire input string $X$ (or
any subpart of it), and the current position $i$. This constraint to only depend on
the current and previous output tokens $y_i$ and $y_{i-1}$ is what characterizes a linear
chain CRF. As we will see, this limitation makes it possible to use versions of the
efficient Viterbi and Forward-Backward algorithms from the HMM. A general CRF,
by contrast, allows a feature to make use of any output token, and is thus necessary
for tasks in which the decision depends on distant output tokens, like $y_{i-4}$. General
CRFs require more complex inference, and are less commonly used for language
processing.
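
To see Eqs. 8.23 to 8.26 at work on a very small problem, the sketch below sums local features into global features, scores every possible tag sequence, and normalizes by $Z(X)$ through brute-force enumeration. The three-tag set, the three feature functions, and the weights are all invented for illustration; enumeration is only feasible because the example is tiny.

```python
import itertools
import math

TAGS = ["DET", "NOUN", "VERB"]  # hypothetical tag set


# Hypothetical local features f_k(y_prev, y, X, i), each returning 0 or 1.
def f_the_det(y_prev, y, X, i):
    return 1.0 if X[i] == "the" and y == "DET" else 0.0


def f_det_noun(y_prev, y, X, i):
    return 1.0 if y_prev == "DET" and y == "NOUN" else 0.0


def f_noun_verb(y_prev, y, X, i):
    return 1.0 if y_prev == "NOUN" and y == "VERB" else 0.0


FEATURES = [f_the_det, f_det_noun, f_noun_verb]
WEIGHTS = [2.0, 1.5, 1.0]  # made-up weights w_k


def global_features(X, Y):
    """F_k(X, Y) = sum over positions i of f_k(y_{i-1}, y_i, X, i)  (Eq. 8.26)."""
    prev = ("<s>",) + tuple(Y[:-1])
    return [sum(f(prev[i], Y[i], X, i) for i in range(len(X))) for f in FEATURES]


def score(X, Y):
    """Unnormalized log score: sum over k of w_k * F_k(X, Y)."""
    return sum(w * F for w, F in zip(WEIGHTS, global_features(X, Y)))


def crf_probability(X, Y):
    """p(Y | X) with Z(X) computed by enumerating all tag sequences (Eq. 8.24, 8.25)."""
    z = sum(math.exp(score(X, Yp)) for Yp in itertools.product(TAGS, repeat=len(X)))
    return math.exp(score(X, Y)) / z


X = ["the", "dog", "barks"]
best = max(itertools.product(TAGS, repeat=len(X)), key=lambda Y: crf_probability(X, Y))
print(best, crf_probability(X, best))
```

Enumeration costs $N^n$ evaluations for $N$ tags and $n$ words, which is why practical linear-chain CRFs rely on dynamic programming instead.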

8.5.1 Features in a CRF POS Tagger


Let’s look at some of these features in detail, since the reason to use a discriminative
sequence model is that it’s easier to incorporate a lot of features.2
Again, in a linear-chain CRF, each local feature fk at position i can depend on
any information from: (yi−1 , yi , X, i). So some legal features representing common
situations might be the following:

$\mathbb{1}\{x_i = \textit{the},\ y_i = \text{DET}\}$
$\mathbb{1}\{y_i = \text{PROPN},\ x_{i+1} = \textit{Street},\ y_{i-1} = \text{NUM}\}$
$\mathbb{1}\{y_i = \text{VERB},\ y_{i-1} = \text{AUX}\}$

For simplicity, we’ll assume all CRF features take on the value 1 or 0. Above, we
explicitly use the notation $\mathbb{1}\{x\}$ to mean “1 if x is true, and 0 otherwise”. From now
on, we’ll leave off the $\mathbb{1}$ when we define features, but you can assume each feature
has it there implicitly.
Although the choice of which features to use is made by hand by the system designer,
the specific features are automatically populated by using feature templates as we
briefly mentioned in Chapter 5. Here are some templates that only use information
from $(y_{i-1}, y_i, X, i)$:

$\langle y_i, x_i \rangle,\ \langle y_i, y_{i-1} \rangle,\ \langle y_i, x_{i-1}, x_{i+2} \rangle$

These templates automatically populate the set of features from every instance in
the training and test set. Thus for our example Janet/NNP will/MD back/VB the/DT
bill/NN, when xi is the word back, the following features would be generated and
have the value 1 (we’ve assigned them arbitrary feature numbers):
f3743 : yi = VB and xi = back
f156 : yi = VB and yi−1 = MD
f99732 : yi = VB and xi−1 = will and xi+2 = bill
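
The way templates expand into concrete features can be sketched as follows. The string encoding of each feature, the `<s>` and `<pad>` boundary symbols, and the function name `instantiate_templates` are assumptions for illustration; the feature numbers assigned here are arbitrary, just as in the example above.

```python
def instantiate_templates(words, tags, i):
    """Instantiate <y_i, x_i>, <y_i, y_{i-1}>, and <y_i, x_{i-1}, x_{i+2}> at position i."""
    def word_at(j):
        # Hypothetical padding symbol for positions outside the sentence.
        return words[j] if 0 <= j < len(words) else "<pad>"

    y_prev = tags[i - 1] if i > 0 else "<s>"
    return [
        f"y_i={tags[i]},x_i={words[i]}",
        f"y_i={tags[i]},y_i-1={y_prev}",
        f"y_i={tags[i]},x_i-1={word_at(i - 1)},x_i+2={word_at(i + 2)}",
    ]


words = ["Janet", "will", "back", "the", "bill"]
tags = ["NNP", "MD", "VB", "DT", "NN"]

# Assign each distinct instantiated feature an arbitrary integer id.
feature_index = {}
for i in range(len(words)):
    for name in instantiate_templates(words, tags, i):
        feature_index.setdefault(name, len(feature_index))

for name in instantiate_templates(words, tags, 2):  # position of "back"
    print(feature_index[name], name)
```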
It’s also important to have features that help with unknown words. Among the
most important are word shape features, which represent the abstract letter pattern
of the word by mapping lower-case letters to ‘x’, upper-case to ‘X’, numbers to
‘d’, and retaining punctuation. Thus for example I.M.F. would map to X.X.X. and
DC10-30 would map to XXdd-dd. A second class of shorter word shape features is
also used. In these features consecutive characters of the same type are collapsed, so words in all
caps map to X, words with initial-caps map to Xx, DC10-30 would be mapped to
Xd-d but I.M.F. would still map to X.X.X. Prefix and suffix features are also useful.
In summary, here are some sample feature templates that help with unknown words:

xi contains a particular prefix (perhaps from all prefixes of length ≤ 2)


xi contains a particular suffix (perhaps from all suffixes of length ≤ 2)
xi ’s word shape
xi ’s short word shape

2 Because in HMMs all computation is based on the two probabilities P(tag|tag) and P(word|tag), if
we want to include some source of knowledge into the tagging process, we must find a way to encode
the knowledge into one of these two probabilities. Each time we add a feature we have to do a lot of
complicated conditioning which gets harder and harder as we have more and more such features.

For example the word well-dressed might generate the following non-zero
valued feature values:

prefix(xi) = w
prefix(xi) = we
suffix(xi) = ed
suffix(xi) = d
word-shape(xi) = xxxx-xxxxxxx
short-word-shape(xi) = x-x
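
These unknown-word features are straightforward to compute. The sketch below is one possible implementation; the function names and the string format of the returned features are assumptions, and the short shape is obtained by collapsing runs of identical shape characters, which also collapses repeated punctuation.

```python
def word_shape(word):
    """Map upper-case letters to X, lower-case to x, digits to d; keep everything else."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def short_word_shape(word):
    """Collapse each run of identical shape characters into a single character."""
    collapsed = []
    for ch in word_shape(word):
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)


def unknown_word_features(word, max_affix=2):
    """Prefix, suffix, and shape features for a word (hypothetical string encoding)."""
    features = []
    for n in range(1, min(max_affix, len(word)) + 1):
        features.append(f"prefix={word[:n]}")
        features.append(f"suffix={word[-n:]}")
    features.append(f"word-shape={word_shape(word)}")
    features.append(f"short-word-shape={short_word_shape(word)}")
    return features


for w in ("well-dressed", "DC10-30", "I.M.F."):
    print(w, unknown_word_features(w))
```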

The known-word templates are computed for every word seen in the training
set; the unknown word features can also be computed for all words in training, or
only on training words whose frequency is below some threshold. The result of the
known-word templates and word-signature features is a very large set of features.
Generally a feature cutoff is used in which features are thrown out if they have count
< 5 in the training set.
Remember that in a CRF we don’t learn weights for each of these local features
fk . Instead, we first sum the values of each local feature (for example feature f3743 )
over the entire sentence, to create each global feature (for example F3743 ). It is those
global features that will then be multiplied by weight w3743 . Thus for training and
inference there is always a fixed set of K features with K weights, even though the
length of each sentence is different.

8.5.2 Features for CRF Named Entity Recognizers


A CRF for NER makes use of very similar features to a POS tagger, as shown in
Figure 8.15.

identity of wi , identity of neighboring words


embeddings for wi , embeddings for neighboring words
part of speech of wi , part of speech of neighboring words
presence of wi in a gazetteer
wi contains a particular prefix (from all prefixes of length ≤ 4)
wi contains a particular suffix (from all suffixes of length ≤ 4)
word shape of wi , word shape of neighboring words
short word shape of wi , short word shape of neighboring words
gazetteer features
Figure 8.15 Typical features for a feature-based NER system.

One feature that is especially useful for locations is a gazetteer, a list of place
names, often providing millions of entries for locations with detailed geographical
and political information.3 This can be implemented as a binary feature indicating a
phrase appears in the list. Other related resources like name-lists, for example from
the United States Census Bureau4 , can be used, as can other entity dictionaries like
lists of corporations or products, although they may not be as helpful as a gazetteer
(Mikheev et al., 1999).
3 www.geonames.org
4 www.census.gov

The sample named entity token L’Occitane would generate the following non-zero
valued feature values (assuming that L’Occitane is neither in the gazetteer nor
the census).

prefix(xi) = L                   suffix(xi) = tane
prefix(xi) = L’                  suffix(xi) = ane
prefix(xi) = L’O                 suffix(xi) = ne
prefix(xi) = L’Oc                suffix(xi) = e
word-shape(xi) = X’Xxxxxxxx      short-word-shape(xi) = X’Xx

Figure 8.16 illustrates the result of adding part-of-speech tags and some shape
information to our earlier example.

Words POS Short shape Gazetteer BIO Label


Jane NNP Xx 0 B-PER
Villanueva NNP Xx 1 I-PER
of IN x 0 O
United NNP Xx 0 B-ORG
Airlines NNP Xx 0 I-ORG
Holding NNP Xx 0 I-ORG
discussed VBD x 0 O
the DT x 0 O
Chicago NNP Xx 1 B-LOC
route NN x 0 O
. . . 0 O
Figure 8.16 Some NER features for a sample sentence, assuming that Chicago and Vil-
lanueva are listed as locations in a gazetteer. We assume features only take on the values 0 or
1, so the first POS feature, for example, would be represented as 1{POS = NNP}.

8.5.3 Inference and Training for CRFs


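How do we find the best tag sequence $\hat{Y}$ for a given input $X$? We start with Eq. 8.22:

$$\begin{aligned}
\hat{Y} &= \operatorname*{argmax}_{Y \in \mathcal{Y}} P(Y \mid X) \\
        &= \operatorname*{argmax}_{Y \in \mathcal{Y}} \frac{1}{Z(X)} \exp\left(\sum_{k=1}^{K} w_k F_k(X, Y)\right) && (8.27) \\
        &= \operatorname*{argmax}_{Y \in \mathcal{Y}} \exp\left(\sum_{k=1}^{K} w_k \sum_{i=1}^{n} f_k(y_{i-1}, y_i, X, i)\right) && (8.28) \\
        &= \operatorname*{argmax}_{Y \in \mathcal{Y}} \sum_{k=1}^{K} w_k \sum_{i=1}^{n} f_k(y_{i-1}, y_i, X, i) && (8.29) \\
        &= \operatorname*{argmax}_{Y \in \mathcal{Y}} \sum_{i=1}^{n} \sum_{k=1}^{K} w_k f_k(y_{i-1}, y_i, X, i) && (8.30)
\end{aligned}$$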

We can ignore the exp function and the denominator Z(X), as we do above, because
exp doesn’t change the argmax, and the denominator Z(X) is constant for a given
observation sequence X.
How should we decode to find this optimal tag sequence ŷ? Just as with HMMs,
we’ll turn to the Viterbi algorithm, which works because, like the HMM, the linear-
chain CRF depends at each timestep on only one previous output token yi−1 .
Concretely, this involves filling an N ×T array with the appropriate values, main-
taining backpointers as we proceed. As with HMM Viterbi, when the table is filled,
we simply follow pointers back from the maximum value in the final column to
retrieve the desired set of labels.
The requisite changes from HMM Viterbi have to do only with how we fill each
cell. Recall from Eq. 8.19 that the recursive step of the Viterbi equation computes
the Viterbi value of time t for state j as
\[ v_t(j) = \max_{i=1}^{N} v_{t-1}(i) \, a_{ij} \, b_j(o_t); \quad 1 \le j \le N, \; 1 < t \le T \tag{8.31} \]

which is the HMM implementation of


\[ v_t(j) = \max_{i=1}^{N} v_{t-1}(i) \, P(s_j \mid s_i) \, P(o_t \mid s_j); \quad 1 \le j \le N, \; 1 < t \le T \tag{8.32} \]

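The CRF requires only a slight change to this latter formula, replacing the a and b
prior and likelihood probabilities with the CRF features. Because Eq. 8.30 scores a
sequence as a sum over positions rather than a product, the local feature score is
added to the previous Viterbi value (with y_{t−1} = s_i and y_t = s_j):

\[ v_t(j) = \max_{i=1}^{N} \left[ v_{t-1}(i) + \sum_{k=1}^{K} w_k f_k(y_{t-1}, y_t, X, t) \right]; \quad 1 \le j \le N, \; 1 < t \le T \tag{8.33} \]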
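To make the table-filling procedure concrete, here is a minimal sketch of Viterbi decoding for a linear-chain CRF. It works with the summed scores of Eq. 8.30, so local scores are added rather than multiplied. The function name `crf_viterbi` and the `local_score(prev, cur, t)` callback, which stands in for the weighted feature sum \sum_k w_k f_k(y_{t−1}, y_t, X, t), are our own assumptions for illustration; tags are represented as integers 0..N−1 and `prev` is `None` at the first position.

```python
def crf_viterbi(num_tags, seq_len, local_score):
    """Return the highest-scoring tag sequence (a list of tag indices).

    local_score(prev, cur, t) is assumed to return the weighted feature sum
    for moving from tag prev to tag cur at position t (prev is None at t = 0).
    """
    # v[j][t]: best score of any path ending in tag j at position t (an N x T table)
    v = [[0.0] * seq_len for _ in range(num_tags)]
    back = [[0] * seq_len for _ in range(num_tags)]

    for j in range(num_tags):
        v[j][0] = local_score(None, j, 0)

    for t in range(1, seq_len):
        for j in range(num_tags):
            # pick the previous tag i that maximizes v[i][t-1] + local score
            best_i = max(range(num_tags),
                         key=lambda i: v[i][t - 1] + local_score(i, j, t))
            v[j][t] = v[best_i][t - 1] + local_score(best_i, j, t)
            back[j][t] = best_i

    # follow backpointers from the best cell in the final column
    last = max(range(num_tags), key=lambda j: v[j][seq_len - 1])
    path = [last]
    for t in range(seq_len - 1, 0, -1):
        path.append(back[path[-1]][t])
    path.reverse()
    return path
```

Since the partition function Z(X) is constant for a given input, the sketch never needs to compute it; only the relative scores of tag sequences matter for decoding.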

Learning in CRFs relies on the same supervised learning algorithms we presented
for logistic regression. Given a sequence of observations, feature functions, and
corresponding outputs, we use stochastic gradient descent to train the weights to
maximize the log-likelihood of the training corpus. The local nature of linear-chain CRFs
means that a CRF version of the forward-backward algorithm (see Appendix A) can
be used to efficiently compute the necessary derivatives. As with logistic regression,
L1 or L2 regularization is important.

8.6 Evaluation of Named Entity Recognition


Part-of-speech taggers are evaluated by the standard metric of accuracy. Named
entity recognizers are evaluated by recall, precision, and F1 measure. Recall that
recall is the ratio of the number of correctly labeled responses to the total that should
have been labeled; precision is the ratio of the number of correctly labeled responses
to the total labeled; and F-measure is the harmonic mean of the two.
To know if the difference between the F1 scores of two NER systems is a significant
difference, we use the paired bootstrap test, or the similar randomization test
(Section 4.9).
For named entities, the entity rather than the word is the unit of response. Thus
in the example in Fig. 8.16, the two entities Jane Villanueva and United Airlines
Holding and the non-entity discussed would each count as a single response.
The fact that named entity tagging has a segmentation component which is not
present in tasks like text categorization or part-of-speech tagging causes some prob-
lems with evaluation. For example, a system that labeled Jane but not Jane Vil-
lanueva as a person would cause two errors, a false positive for O and a false nega-
tive for I-PER. In addition, using entities as the unit of response but words as the unit
of training means that there is a mismatch between the training and test conditions.
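Entity-level scoring can be sketched in a few lines. The code below is an illustrative sketch rather than a standard scorer: it assumes BIO tags, converts each tag sequence into a set of (start, end, type) spans, and counts a predicted entity as correct only if its boundaries and type both match a gold entity exactly. The function names are our own.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (start, end, type) entity spans."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a final entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        # a stray I- tag with no open entity is leniently treated as a start
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans

def entity_prf(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 using exact span and type match."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# The sentence of Fig. 8.16, with a system that labels only "Jane" as a person.
gold = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC", "O", "O"]
pred = ["B-PER", "O",     "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC", "O", "O"]
p, r, f = entity_prf(gold, pred)
```

In this example the truncated span for Jane counts against both precision and recall, since the predicted entity is wrong and the gold entity is missed; this is the double penalty described above.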

8.7 Further Details


In this section we summarize a few remaining details of the data and models, beginning
with data. Since the algorithms we have presented are supervised, having
labeled data is essential for training and test. A wide variety of datasets exist
for part-of-speech tagging and/or NER. The Universal Dependencies (UD) dataset
(Nivre et al., 2016b) has POS tagged corpora in 92 languages at the time of this
writing, as do the Penn Treebanks in English, Chinese, and Arabic. OntoNotes has
corpora labeled for named entities in English, Chinese, and Arabic (Hovy et al.,
2006). Named entity tagged corpora are also available in particular domains, such as
for biomedical (Bada et al., 2012) and literary text (Bamman et al., 2019).

8.7.1 Bidirectionality
One problem with the CRF and HMM architectures as presented is that the models
are exclusively run left-to-right. While the Viterbi algorithm still allows present
decisions to be influenced indirectly by future decisions, it would help even more if
a decision about word wi could directly use information about future tags ti+1 and
ti+2 .
Any sequence model can be turned into a bidirectional model by
using multiple passes. For example, the first pass would use only part-of-speech
features from already-disambiguated words on the left. In the second pass, tags for
all words, including those on the right, can be used. Alternatively, the tagger can be
run twice, once left-to-right and once right-to-left. In Viterbi decoding, the labeler
would choose the higher scoring of the two sequences (left-to-right or right-to-left).
Bidirectional models are quite standard for neural models, as we will see with the
biLSTM models to be introduced in Chapter 9.

8.7.2 Rule-based Methods


While machine learned (neural or CRF) sequence models are the norm in academic
research, commercial approaches to NER are often based on pragmatic combina-
tions of lists and rules, with some smaller amount of supervised machine learning
(Chiticariu et al., 2013). For example in the IBM System T architecture, a user
specifies declarative constraints for tagging tasks in a formal query language that
includes regular expressions, dictionaries, semantic constraints, and other operators,
which the system compiles into an efficient extractor (Chiticariu et al., 2018).
One common approach is to make repeated rule-based passes over a text, starting
with rules with very high precision but low recall, and, in subsequent stages, using
machine learning methods that take the output of the first pass into account.
1. First, use high-precision rules to tag unambiguous entity mentions.
2. Then, search for substring matches of the previously detected names.
3. Use application-specific name lists to find likely domain-specific mentions.
4. Finally, apply supervised sequence labeling techniques that use tags from pre-
vious stages as additional features.
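The first two passes of such a cascade can be sketched as follows. This is only a toy illustration under our own assumptions: the honorific pattern `TITLE_RULE` is a hypothetical high-precision rule, and the function names are invented for this example; a real system would use many more rules as well as the name lists and supervised tagger of the later passes.

```python
import re

# Hypothetical high-precision rule: an honorific followed by capitalized words.
TITLE_RULE = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\.\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)")

def high_precision_pass(text):
    """Pass 1: return (start, end, name) spans matched by the honorific rule."""
    return [(m.start(1), m.end(1), m.group(1)) for m in TITLE_RULE.finditer(text)]

def substring_pass(text, spans):
    """Pass 2: find other mentions that reuse a word from a detected name."""
    found = []
    parts = {part for _, _, name in spans for part in name.split()}
    for part in sorted(parts):
        for m in re.finditer(r"\b" + re.escape(part) + r"\b", text):
            # skip matches that fall inside a span already tagged in pass 1
            covered = any(s <= m.start() < e for s, e, _ in spans)
            if not covered:
                found.append((m.start(), m.end(), part))
    return sorted(found)

text = "Dr. Jane Villanueva spoke first. Later Villanueva left for Chicago."
first = high_precision_pass(text)       # tags "Jane Villanueva"
second = substring_pass(text, first)    # tags the later bare "Villanueva"
```

The spans found by these passes could then be supplied as extra features to a supervised sequence labeler, as in the final step of the list.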
Rule-based methods were also the earliest methods for part-of-speech tagging.
Rule-based taggers like the English Constraint Grammar system (Karlsson et al. 1995,
Voutilainen 1999) use the two-stage formalism that was invented in the 1950s and
1960s: a morphological analyzer with tens of thousands of word stem entries re-
turns all parts of speech for a word. Then a large set of thousands of constraints are
applied to the input sentence to rule out parts of speech inconsistent with the context.

8.7.3 POS Tagging for Morphologically Rich Languages


Augmentations to tagging algorithms become necessary when dealing with lan-
guages with rich morphology like Czech, Hungarian and Turkish.
The productive word-formation processes of these languages result in large
vocabularies: a 250,000 word token corpus of Hungarian has more than twice as many
word types as a similarly sized corpus of English (Oravecz and Dienes, 2002), while
a 10 million word token corpus of Turkish contains four times as many word types
as a similarly sized English corpus (Hakkani-Tür et al., 2002). Large vocabular-
ies mean many unknown words, and these unknown words cause significant per-
formance degradations in a wide variety of languages (including Czech, Slovene,
Estonian, and Romanian) (Hajič, 2000).
Highly inflectional languages also have much more information than English
coded in word morphology, like case (nominative, accusative, genitive) or gender
(masculine, feminine). Because this information is important for tasks like pars-
ing and coreference resolution, part-of-speech taggers for morphologically rich lan-
guages need to label words with case and gender information. Tagsets for morpho-
logically rich languages are therefore sequences of morphological tags rather than a
single primitive tag. Here’s a Turkish example, in which the word izin has three pos-
sible morphological/part-of-speech tags and meanings (Hakkani-Tür et al., 2002):
1. Yerdeki izin temizlenmesi gerek. iz + Noun+A3sg+Pnon+Gen
The trace on the floor should be cleaned.
2. Üzerinde parmak izin kalmiş iz + Noun+A3sg+P2sg+Nom
Your finger print is left on (it).
3. Içeri girmek için izin alman gerekiyor. izin + Noun+A3sg+Pnon+Nom
You need permission to enter.
Using a morphological parse sequence like Noun+A3sg+Pnon+Gen as the part-
of-speech tag greatly increases the number of parts of speech, and so tagsets can
be 4 to 10 times larger than the 50–100 tags we have seen for English. With such
large tagsets, each word needs to be morphologically analyzed to generate the list
of possible morphological tag sequences (part-of-speech tags) for the word. The
role of the tagger is then to disambiguate among these tags. This method also helps
with unknown words since morphological parsers can accept unknown stems and
still segment the affixes properly.

8.8 Summary
This chapter introduced parts of speech and named entities, and the tasks of part-
of-speech tagging and named entity recognition:
• Languages generally have a small set of closed class words that are highly
frequent, ambiguous, and act as function words, and open-class words like
nouns, verbs, adjectives. Various part-of-speech tagsets exist, of between 40
and 200 tags.
• Part-of-speech tagging is the process of assigning a part-of-speech label to
each of a sequence of words.
• Named entities are words for proper nouns referring mainly to people, places,
and organizations, but extended to many other types that aren’t strictly entities
or even proper nouns.
• Two common approaches to sequence modeling are a generative approach,
HMM tagging, and a discriminative approach, CRF tagging. We will see a
neural approach in following chapters.
• The probabilities in HMM taggers are estimated by maximum likelihood es-
timation on tag-labeled training corpora. The Viterbi algorithm is used for
decoding, finding the most likely tag sequence.
• Conditional Random Fields or CRF taggers train a log-linear model that can
choose the best tag sequence given an observation sequence, based on features
that condition on the output tag, the prior output tag, the entire input sequence,
and the current timestep. They use the Viterbi algorithm for inference, to
choose the best sequence of tags, and a version of the Forward-Backward
algorithm (see Appendix A) for training.

Bibliographical and Historical Notes


What is probably the earliest part-of-speech tagger was part of the parser in Zellig
Harris’s Transformations and Discourse Analysis Project (TDAP), implemented be-
tween June 1958 and July 1959 at the University of Pennsylvania (Harris, 1962),
although earlier systems had used part-of-speech dictionaries. TDAP used 14 hand-
written rules for part-of-speech disambiguation; the use of part-of-speech tag se-
quences and the relative frequency of tags for a word prefigures modern algorithms.
The parser was implemented essentially as a cascade of finite-state transducers; see
Joshi and Hopely (1999) and Karttunen (1999) for a reimplementation.
The Computational Grammar Coder (CGC) of Klein and Simmons (1963) had
three components: a lexicon, a morphological analyzer, and a context disambigua-
tor. The small 1500-word lexicon listed only function words and other irregular
words. The morphological analyzer used inflectional and derivational suffixes to as-
sign part-of-speech classes. These were run over words to produce candidate parts
of speech which were then disambiguated by a set of 500 context rules by relying on
surrounding islands of unambiguous words. For example, one rule said that between
an ARTICLE and a VERB, the only allowable sequences were ADJ-NOUN, NOUN-
ADVERB, or NOUN-NOUN. The TAGGIT tagger (Greene and Rubin, 1971) used
the same architecture as Klein and Simmons (1963), with a bigger dictionary and
more tags (87). TAGGIT was applied to the Brown corpus and, according to Francis
and Kučera (1982, p. 9), accurately tagged 77% of the corpus; the remainder of the
Brown corpus was then tagged by hand. All these early algorithms were based on
a two-stage architecture in which a dictionary was first used to assign each word a
set of potential parts of speech, and then lists of handwritten disambiguation rules
winnowed the set down to a single part of speech per word.
Probabilities were used in tagging by Stolz et al. (1965) and a complete proba-
bilistic tagger with Viterbi decoding was sketched by Bahl and Mercer (1976). The
Lancaster-Oslo/Bergen (LOB) corpus, a British English equivalent of the Brown cor-
pus, was tagged in the early 1980’s with the CLAWS tagger (Marshall 1983; Mar-
shall 1987; Garside 1987), a probabilistic algorithm that approximated a simplified
HMM tagger. The algorithm used tag bigram probabilities, but instead of storing the
word likelihood of each tag, the algorithm marked tags either as rare (P(tag|word) <
.01), infrequent (P(tag|word) < .10), or normally frequent (P(tag|word) > .10).
DeRose (1988) developed a quasi-HMM algorithm, including the use of dy-
namic programming, although computing P(t|w)P(w) instead of P(w|t)P(w). The
same year, the probabilistic PARTS tagger of Church (1988), (1989) was probably
the first implemented HMM tagger, described correctly in Church (1989), although
Church (1988) also described the computation incorrectly as P(t|w)P(w) instead
of P(w|t)P(w). Church (p.c.) explained that he had simplified for pedagogical pur-
poses because using the probability P(t|w) made the idea seem more understandable
as “storing a lexicon in an almost standard form”.
Later taggers explicitly introduced the use of the hidden Markov model (Ku-
piec 1992; Weischedel et al. 1993; Schütze and Singer 1994). Merialdo (1994)
showed that fully unsupervised EM didn’t work well for the tagging task and that
reliance on hand-labeled data was important. Charniak et al. (1993) showed the
importance of the most frequent tag baseline; the 92.3% number we give above
was from Abney et al. (1999). See Brants (2000) for HMM tagger implementa-
tion details, including the extension to trigram contexts, and the use of sophisticated
unknown word features; its performance is still close to state of the art taggers.
Log-linear models for POS tagging were introduced by Ratnaparkhi (1996),
who introduced a system called MXPOST which implemented a maximum en-
tropy Markov model (MEMM), a slightly simpler version of a CRF. Around the
same time, sequence labelers were applied to the task of named entity tagging, first
with HMMs (Bikel et al., 1997) and MEMMs (McCallum et al., 2000), and then
once CRFs were developed (Lafferty et al. 2001), they were also applied to NER
(McCallum and Li, 2003). A wide exploration of features followed (Zhou et al.,
2005). Neural approaches to NER mainly follow from the pioneering results of Col-
lobert et al. (2011), who applied a CRF on top of a convolutional net. BiLSTMs
with word and character-based embeddings as input followed shortly and became a
standard neural algorithm for NER (Huang et al. 2015, Ma and Hovy 2016, Lample
et al. 2016) followed by the more recent use of Transformers and BERT.
The idea of using letter suffixes for unknown words is quite old; the early Klein
and Simmons (1963) system checked all final letter suffixes of lengths 1-5. The un-
known word features described on page 164 come mainly from Ratnaparkhi (1996),
with augmentations from Toutanova et al. (2003) and Manning (2011).
State of the art POS taggers use neural algorithms, either bidirectional RNNs or
Transformers like BERT; see Chapter 9 and Chapter 10. HMM (Brants 2000; Thede
and Harper 1999) and CRF tagger accuracies are likely just a tad lower.
Manning (2011) investigates the remaining 2.7% of errors in a high-performing
tagger (Toutanova et al., 2003). He suggests that a third or half of these remaining
errors are due to errors or inconsistencies in the training data, a third might be solv-
able with richer linguistic models, and for the remainder the task is underspecified
or unclear.
Supervised tagging relies heavily on in-domain training data hand-labeled by
experts. Ways to relax this assumption include unsupervised algorithms for cluster-
ing words into part-of-speech-like classes, summarized in Christodoulopoulos et al.
(2010), and ways to combine labeled and unlabeled data, for example by co-training
(Clark et al. 2003; Søgaard 2010).
See Householder (1995) for historical notes on parts of speech, and Sampson
(1987) and Garside et al. (1997) on the provenance of the Brown and other tagsets.

Exercises
8.1 Find one tagging error in each of the following sentences that are tagged with
the Penn Treebank tagset:
1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN
2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS


3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP
4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
8.2 Use the Penn Treebank tagset to tag each word in the following sentences
from Damon Runyon’s short stories. You may ignore punctuation. Some of
these are quite difficult; do your best.
1. It is a nice night.
2. This crap game is over a garage in Fifty-second Street. . .
3. . . . Nobody ever takes the newspapers she sells . . .
4. He is a tall, skinny guy with a long, sad, mean-looking kisser, and a
mournful voice.
5. . . . I am sitting in Mindy’s restaurant putting on the gefillte fish, which is
a dish I am very fond of, . . .
6. When a guy and a doll get to taking peeks back and forth at each other,
why there you are indeed.
8.3 Now compare your tags from the previous exercise with one or two friends’
answers. On which words did you disagree the most? Why?
8.4 Implement the “most likely tag” baseline. Find a POS-tagged training set,
and use it to compute for each word the tag that maximizes p(t|w). You will
need to implement a simple tokenizer to deal with sentence boundaries. Start
by assuming that all unknown words are NN and compute your error rate on
known and unknown words. Now write at least five rules to do a better job of
tagging unknown words, and show the difference in error rates.
8.5 Build a bigram HMM tagger. You will need a part-of-speech-tagged corpus.
First split the corpus into a training set and test set. From the labeled training
set, train the transition and observation probabilities of the HMM tagger di-
rectly on the hand-tagged data. Then implement the Viterbi algorithm so you
can decode a test sentence. Now run your algorithm on the test set. Report its
error rate and compare its performance to the most frequent tag baseline.
8.6 Do an error analysis of your tagger. Build a confusion matrix and investigate
the most frequent errors. Propose some features for improving the perfor-
mance of your tagger on these errors.
8.7 Develop a set of regular expressions to recognize the character shape features
described on page 164.
8.8 The BIO and other labeling schemes given in this chapter aren’t the only
possible ones. For example, the B tag can be reserved only for those situations
where an ambiguity exists between adjacent entities. Propose a new set of
BIO tags for use with your NER system. Experiment with it and compare its
performance with the schemes presented in this chapter.
8.9 Names of works of art (books, movies, video games, etc.) are quite different
from the kinds of named entities we’ve discussed in this chapter. Collect a
list of names of works of art from a particular category from a Web-based
source (e.g., gutenberg.org, amazon.com, imdb.com, etc.). Analyze your list
and give examples of ways that the names in it are likely to be problematic for
the techniques described in this chapter.
8.10 Develop an NER system specific to the category of names that you collected in
the last exercise. Evaluate your system on a collection of text likely to contain
instances of these named entities.
CHAPTER 9

Deep Learning Architectures for Sequence Processing
Time will explain.
Jane Austen, Persuasion

Language is an inherently temporal phenomenon. When we comprehend and produce
spoken language, we process continuous input streams of indefinite length.
Even when dealing with written text, we normally process it sequentially. The tem-
poral nature of language is reflected in the metaphors we use; we talk of the flow of
conversations, news feeds, and twitter streams, all of which call out the notion that
language is a sequence that unfolds in time.
This temporal nature is reflected in the algorithms we use to process language.
For example, when applied to the problem of part-of-speech tagging, the Viterbi
algorithm works its way incrementally through the input a word at a time, carrying
forward information gleaned along the way. On the other hand, the machine learning
approaches we’ve studied for sentiment analysis and other text classification tasks
don’t have this temporal nature – they assume simultaneous access to all aspects of
their input. This is especially true of feedforward neural networks, including their
application to neural language models. These fully-connected networks use fixed-
size inputs, along with associated weights, to capture all the relevant aspects of an
example at once. This makes it difficult to deal with sequences of varying length
and fails to capture important temporal aspects of language.
A work-around for these problems is the sliding window approach employed
with neural language models. These models operate by accepting fixed-sized win-
dows of tokens as input; sequences longer than the window size are processed by
walking through the input making predictions along the way, with the end result
being a sequence of predictions spanning the input. Importantly, decisions made in
one window have no impact on subsequent decisions. Fig. 9.1, reproduced here from
Chapter 7, depicts the operation of a neural language model using this approach with
a window of size 3. Here, we’re predicting which word will come next given the in-
put for all the. Subsequent words are predicted by sliding the window forward a
word at a time.
This general approach is problematic for a number of reasons. First, it shares
the primary weakness of our earlier Markov N-gram approaches in that it limits
the context from which information can be extracted; anything outside the context
window has no impact on the decision being made. This is an issue since there are
many tasks that require access to information that can be arbitrarily distant from
the point at which processing is happening. Second, the use of windows makes
it difficult for networks to learn systematic patterns arising from phenomena like
constituency. For example, in Fig. 9.1 the phrase all the appears in two separate
windows: first as the second and third positions in the window, and again in the next
step where it appears as the first and second positions, thus forcing the network to
learn two separate patterns for what should be the same item.

[Figure 9.1 diagram: the context words w_{t-3}, w_{t-2}, w_{t-1} ("for all the") are looked up in the embedding matrix E and concatenated into the projection layer (3d × 1); weights W (d_h × 3d) map it to the hidden layer h (d_h × 1); weights U (|V| × d_h) followed by a softmax produce the output layer ŷ (|V| × 1), whose entries include p(aardvark|…), p(fish|…), p(for|…), and p(zebra|…).]

Figure 9.1 A simplified view of a feedforward neural language model moving through a text. At each time
step t the network takes the 3 context words, converts each to a d-dimensional embedding, and concatenates
the 3 embeddings together to get the 1 × Nd unit input layer x for the network. The output of the network is a
probability distribution over the vocabulary representing the model’s belief with respect to each word being the
next possible word.
This chapter covers two closely related deep learning architectures designed to
address these challenges: recurrent neural networks and transformer networks. Both
approaches have mechanisms to deal directly with the sequential nature of language
that allow them to handle variable length inputs without the use of arbitrary fixed-
sized windows, and to capture and exploit the temporal nature of language.

9.1 Language Models Revisited


In this chapter, we’ll explore these two architectures primarily through the lens of
probabilistic language models. Recall from Chapter 3 that probabilistic language
models predict the next word in a sequence given some preceding context. For
example, if the preceding context is “Thanks for all the” and we want to know how
likely the next word is “fish” we would compute:

P(fish|Thanks for all the)

Language models give us the ability to assign such a conditional probability to every
possible next word, giving us a distribution over the entire vocabulary. We can also
assign probabilities to entire sequences by using these conditional probabilities in
combination with the chain rule:

\[ P(w_{1:n}) = \prod_{i=1}^{n} P(w_i \mid w_{<i}) \]

This formulation gives rise to a wide range of sequence labeling applications, and
as we’ll see, it provides a clear training objective based on how well a model is
predicting the next word in a sequence.
We’ve already seen two ways to instantiate probabilistic language models with
the N-gram models from Chapter 3 and the feedforward neural networks with sliding
windows from Chapter 7. Unfortunately, both of these methods are constrained by
the Markov assumption embodied in the following equation.

\[ P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1}) \]

That is, the prediction is based on a fixed preceding context of N − 1 words; any input that
occurred earlier than that has no bearing on the outcome. The methods we explore
in this chapter will relax this assumption, allowing the models to make use of much
larger contexts.
We evaluate language models by examining how well they predict unseen data
drawn from the same source as the training data. Intuitively, good models are those
that assign higher probabilities to unseen data. To make this intuition concrete, we
use perplexity as a measure of model quality. The perplexity (PP) of a model θ with
respect to an unseen test set is the inverse probability the model assigns to it,
normalized by its length:

\[ PP_\theta(w_{1:n}) = P(w_{1:n})^{-\frac{1}{n}} \]

An alternative way of viewing perplexity, inspired by information theory, is in terms
of entropy:

\begin{align*}
PP(w_{1:n}) &= 2^{H(w_{1:n})} \\
            &= 2^{-\frac{1}{n} \sum_{i=1}^{n} \log_2 m(w_i)}
\end{align*}

In this formulation, the value in the exponent is the cross-entropy of our current
model with respect to the true distribution.
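The two views of perplexity give the same number, which a short computation can confirm. The sketch below assumes we already have the probability the model assigned to each word of the test sequence given its context (the list `token_probs`, a name of our own choosing), and takes perplexity to be the inverse probability of the sequence normalized by its length.

```python
import math

def perplexity_from_product(token_probs):
    """Inverse probability of the whole sequence, normalized by its length n."""
    n = len(token_probs)
    sequence_prob = math.prod(token_probs)  # chain rule: product of conditionals
    return sequence_prob ** (-1.0 / n)

def perplexity_from_entropy(token_probs):
    """2 raised to the average negative log2 probability per word."""
    n = len(token_probs)
    cross_entropy = -sum(math.log2(p) for p in token_probs) / n
    return 2 ** cross_entropy

# Per-word probabilities a hypothetical model assigned to a 3-word test sequence.
probs = [0.5, 0.25, 0.125]
pp_product = perplexity_from_product(probs)
pp_entropy = perplexity_from_entropy(probs)
```

In practice the log-space version is preferred, since multiplying many small probabilities together quickly underflows for long sequences.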
Another way to assess a language model is to use it to generate novel sequences.
The extent to which a generated sequence mirrors the training data is an indication
of the quality of the model. We saw how to do this in Chapter 3 by adapting a
technique suggested contemporaneously by Claude Shannon (Shannon, 1951) and
the psychologists George Miller and Selfridge (Miller and Selfridge, 1950). To get
started, we randomly sample a word to begin a sequence based on its suitability as
the start of a sequence. Having sampled the first word, we sample further words
conditioned on our previous choices until we reach a pre-determined length, or an
end of sequence token is generated. Today, this approach is called autoregressive
generation and we’ll cover its practical application in problems like machine translation
and text summarization in this and later chapters.
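The sampling procedure just described can be sketched with a toy model. For simplicity the sketch below conditions each choice only on the previous word (a bigram model), whereas the models in this chapter condition on the entire preceding sequence; the probability table `toy_model`, the `<s>` and `</s>` boundary symbols, and the function name are all illustrative assumptions.

```python
import random

def generate(next_word_probs, max_len=20, seed=0):
    """Sample words one at a time until </s> is drawn or max_len is reached."""
    rng = random.Random(seed)
    words, context = [], "<s>"
    while len(words) < max_len:
        dist = next_word_probs[context]  # distribution over possible next words
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == "</s>":               # end-of-sequence token: stop generating
            break
        words.append(word)
        context = word                   # condition the next choice on this word
    return words

# A tiny hand-built table of next-word probabilities (hypothetical values).
toy_model = {
    "<s>":    {"thanks": 0.6, "so": 0.4},
    "so":     {"long": 1.0},
    "long":   {"and": 0.5, "</s>": 0.5},
    "and":    {"thanks": 1.0},
    "thanks": {"for": 1.0},
    "for":    {"all": 1.0},
    "all":    {"the": 1.0},
    "the":    {"fish": 1.0},
    "fish":   {"</s>": 1.0},
}
sample = generate(toy_model, max_len=10, seed=1)
```

Replacing the table lookup with a call to a neural language model that returns a distribution given the full history yields the autoregressive generation used in later sections.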

9.2 Recurrent Neural Networks


A recurrent neural network (RNN) is any network that contains a cycle within its
network connections. That is, any network where the value of a unit is directly,
or indirectly, dependent on its own earlier outputs as an input. While powerful,
such networks are difficult to reason about and to train. However, within the general
class of recurrent networks there are constrained architectures that have proven to be
extremely effective when applied to spoken and written language. In this section, we
consider a class of recurrent networks referred to as Elman Networks (Elman, 1990)
or simple recurrent networks. These networks are useful in their own right and
serve as the basis for more complex approaches like the Long Short-Term Memory
(LSTM) networks discussed later in this chapter. Going forward, when we use the
term RNN we’ll be referring to these simpler more constrained networks.

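[Figure 9.2 diagram: the input x_t feeds the hidden layer h_t, which feeds the output y_t; a dashed recurrent link connects the hidden layer back to itself.]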

Figure 9.2 Simple recurrent neural network after Elman (Elman, 1990). The hidden layer
includes a recurrent connection as part of its input. That is, the activation value of the hidden
layer depends on the current input as well as the activation value of the hidden layer from the
previous time step.

Fig. 9.2 illustrates the structure of an RNN. As with ordinary feedforward net-
works, an input vector representing the current input, xt , is multiplied by a weight
matrix and then passed through a non-linear activation function to compute the val-
ues for a layer of hidden units. This hidden layer is then used to calculate a cor-
responding output, yt . In a departure from our earlier window-based approach, se-
quences are processed by presenting one item at a time to the network. The key
difference from a feedforward network lies in the recurrent link shown in the figure
with the dashed line. This link augments the input to the computation at the hidden
layer with the value of the hidden layer from the preceding point in time.
The hidden layer from the previous time step provides a form of memory, or
context, that encodes earlier processing and informs the decisions to be made at
later points in time. Critically, this approach does not impose a fixed-length limit
on this prior context; the context embodied in the previous hidden layer includes
information extending back to the beginning of the sequence.
Adding this temporal dimension makes RNNs appear to be more complex than
non-recurrent architectures. But in reality, they’re not all that different. Given an
input vector and the values for the hidden layer from the previous time step, we’re
still performing the standard feedforward calculation introduced in Chapter 7. To
see this, consider Fig. 9.3 which clarifies the nature of the recurrence and how it
factors into the computation at the hidden layer. The most significant change lies in
the new set of weights, U, that connect the hidden layer from the previous time step
to the current hidden layer. These weights determine how the network makes use of
past context in calculating the output for the current input. As with the other weights
in the network, these connections are trained via backpropagation.

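[Figure 9.3 diagram: the previous hidden layer h_{t-1} (through weights U) and the current input x_t (through weights W) both feed the hidden layer h_t, which produces the output y_t.]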

Figure 9.3 Simple recurrent neural network illustrated as a feedforward network.

9.2.1 Inference in RNNs


Forward inference (mapping a sequence of inputs to a sequence of outputs) in an
RNN is nearly identical to what we’ve already seen with feedforward networks. To
compute an output yt for an input xt , we need the activation value for the hidden
layer ht . To calculate this, we multiply the input xt with the weight matrix W , and
the hidden layer from the previous time step ht−1 with the weight matrix U. We
add these values together and pass them through a suitable activation function, g,
to arrive at the activation value for the current hidden layer, ht . Once we have the
values for the hidden layer, we proceed with the usual computation to generate the
output vector.

\begin{align*}
h_t &= g(U h_{t-1} + W x_t) \\
y_t &= f(V h_t)
\end{align*}

It’s worthwhile here to be careful about specifying the dimensions of the input, hid-
den and output layers, as well as the weight matrices to make sure these calculations
are correct. Let’s refer to the input, hidden and output layer dimensions as $d_{in}$, $d_h$,
and $d_{out}$ respectively. Given this, our three parameter matrices are: $W \in \mathbb{R}^{d_h \times d_{in}}$,
$U \in \mathbb{R}^{d_h \times d_h}$, and $V \in \mathbb{R}^{d_{out} \times d_h}$.
In the commonly encountered case of soft classification, computing yt consists
of a softmax computation that provides a probability distribution over the possible
output classes.

y_t = \mathrm{softmax}(V h_t)

The fact that the computation at time t requires the value of the hidden layer
from time t − 1 mandates an incremental inference algorithm that proceeds from the
start of the sequence to the end as illustrated in Fig. 9.4. The sequential nature of
simple recurrent networks can also be seen by unrolling the network in time as is
shown in Fig. 9.5. In this figure, the various layers of units are copied for each time
step to illustrate that they will have differing values over time. However, the various
weight matrices are shared across time.

function FORWARDRNN(x, network) returns output sequence y

    h_0 ← 0
    for i ← 1 to LENGTH(x) do
        h_i ← g(U h_{i-1} + W x_i)
        y_i ← f(V h_i)
    return y

Figure 9.4 Forward inference in a simple recurrent network. The matrices U, V and W are
shared across time, while new values for h and y are calculated with each time step.
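The algorithm in Fig. 9.4 can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name forward_rnn, the choice of tanh for g and softmax for f, and the layer sizes are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def forward_rnn(xs, W, U, V, g=np.tanh, f=softmax):
    """Run a simple RNN over a sequence.

    xs: array of shape (n, d_in), one input vector per time step
    W:  (d_h, d_in), U: (d_h, d_h), V: (d_out, d_h)
    Returns (hs, ys) with shapes (n, d_h) and (n, d_out).
    """
    h = np.zeros(U.shape[0])     # h_0 <- 0
    hs, ys = [], []
    for x in xs:                 # strictly left to right
        h = g(U @ h + W @ x)     # h_t = g(U h_{t-1} + W x_t)
        hs.append(h)
        ys.append(f(V @ h))      # y_t = f(V h_t)
    return np.array(hs), np.array(ys)

# Illustrative sizes and random (untrained) weights.
rng = np.random.default_rng(0)
d_in, d_h, d_out, n = 4, 3, 5, 6
W = rng.normal(size=(d_h, d_in))
U = rng.normal(size=(d_h, d_h))
V = rng.normal(size=(d_out, d_h))
xs = rng.normal(size=(n, d_in))
hs, ys = forward_rnn(xs, W, U, V)
```

Note that the same W, U and V are reused at every step; only h and y change from one step to the next.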

9.2.2 Training
As with feedforward networks, we’ll use a training set, a loss function, and back-
propagation to obtain the gradients needed to adjust the weights in these recurrent
networks. As shown in Fig. 9.3, we now have 3 sets of weights to update: W , the
weights from the input layer to the hidden layer, U, the weights from the previous
hidden layer to the current hidden layer, and finally V , the weights from the hidden
layer to the output layer.
Fig. 9.5 highlights two considerations that we didn’t have to worry about with
backpropagation in feedforward networks. First, to compute the loss function for
the output at time t we need the hidden layer from time t − 1. Second, the hidden
layer at time t influences both the output at time t and the hidden layer at time t + 1
(and hence the output and loss at t + 1). It follows from this that to assess the error
accruing to ht , we’ll need to know its influence on both the current output as well as
the ones that follow.
Tailoring the backpropagation algorithm to this situation leads to a two-pass
algorithm for training the weights in RNNs. In the first pass, we perform forward
inference, computing h_t and y_t, accumulating the loss at each step in time, and saving
the value of the hidden layer at each step for use at the next time step. In the second
pass, we process the sequence in reverse, computing the required gradients as we
go, and computing and saving the error term for use in the hidden layer for each step
backward in time. This general approach is commonly referred to as Backpropagation
Through Time (Werbos 1974, Rumelhart et al. 1986, Werbos 1990).
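To make the two passes concrete, here is a small NumPy sketch. It assumes a tanh hidden layer, a softmax output with cross-entropy loss summed over time steps, and integer class targets; the function name bptt and all sizes are hypothetical choices for this example, not details taken from the cited papers.

```python
import numpy as np

def bptt(xs, targets, W, U, V):
    """Total cross-entropy loss and gradients for a tanh/softmax simple RNN."""
    n, d_h = len(xs), U.shape[0]
    hs = np.zeros((n + 1, d_h))          # hs[0] is h_0 = 0; hs[t + 1] holds step t
    ps, loss = [], 0.0
    # Pass 1: forward inference, saving hidden states and accumulating the loss.
    for t in range(n):
        hs[t + 1] = np.tanh(U @ hs[t] + W @ xs[t])
        z = V @ hs[t + 1]
        p = np.exp(z - z.max())
        p /= p.sum()
        ps.append(p)
        loss -= np.log(p[targets[t]])
    # Pass 2: walk the sequence in reverse, carrying the error term for h.
    dW, dU, dV = np.zeros_like(W), np.zeros_like(U), np.zeros_like(V)
    dh_next = np.zeros(d_h)              # error arriving from step t + 1
    for t in reversed(range(n)):
        dz = ps[t].copy()
        dz[targets[t]] -= 1.0            # gradient of softmax + cross-entropy
        dV += np.outer(dz, hs[t + 1])
        dh = V.T @ dz + dh_next          # current output plus all later ones
        da = dh * (1.0 - hs[t + 1] ** 2) # back through tanh
        dW += np.outer(da, xs[t])
        dU += np.outer(da, hs[t])
        dh_next = U.T @ da               # saved for the previous time step
    return loss, dW, dU, dV

rng = np.random.default_rng(0)
d_in, d_h, d_out, n = 3, 4, 5, 6         # illustrative sizes
W = rng.normal(scale=0.5, size=(d_h, d_in))
U = rng.normal(scale=0.5, size=(d_h, d_h))
V = rng.normal(scale=0.5, size=(d_out, d_h))
xs = rng.normal(size=(n, d_in))
targets = rng.integers(0, d_out, size=n)
loss, dW, dU, dV = bptt(xs, targets, W, U, V)
```

The line that adds dh_next to the output error is where the hidden layer's influence on later outputs enters the gradient.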
Fortunately, with modern computational frameworks and adequate computing
resources, there is no need for a specialized approach to training RNNs. As illus-
trated in Fig. 9.5, explicitly unrolling a recurrent network into a feedforward com-
putational graph eliminates any explicit recurrences, allowing the network weights
to be trained directly. In such an approach, we provide a template that specifies the
basic structure of the network, including all the necessary parameters for the input,
output, and hidden layers, the weight matrices, as well as the activation and output
functions to be used. Then, when presented with a specific input sequence, we can
generate an unrolled feedforward network specific to that input, and use that graph
to perform forward inference or training via ordinary backpropagation.
For applications that involve much longer input sequences, such as speech recog-
nition, character-level processing, or streaming of continuous inputs, unrolling an
entire input sequence may not be feasible. In these cases, we can unroll the input
into manageable fixed-length segments and treat each segment as a distinct training
item.

Figure 9.5 A simple recurrent neural network shown unrolled in time. Network layers are copied for each
time step, while the weights U, V and W are shared in common across all time steps.
(Diagram: three time steps; h_0 and x_1 produce h_1, which with x_2 produces h_2, which with x_3
produces h_3, each via U and W; the outputs y_1, y_2 and y_3 are computed from the hidden layers via V.)
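Splitting a long input into fixed-length segments might look like the following sketch. The helper name fixed_length_segments is hypothetical, and the sketch assumes, as the text describes, that each segment is treated as a separate training item.

```python
def fixed_length_segments(sequence, segment_len):
    """Split a long sequence into consecutive chunks of at most segment_len items."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [sequence[i:i + segment_len]
            for i in range(0, len(sequence), segment_len)]

# Each chunk would be unrolled and trained on separately; the last may be shorter.
segments = fixed_length_segments(list(range(10)), 4)
```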

9.2.3 RNNs as Language Models


RNN-based language models process sequences a word at a time, attempting to
predict the next word in a sequence by using the current word and the previous
hidden state as inputs (Mikolov et al., 2010). The limited context constraint inherent
in N-gram models is avoided since the hidden state embodies information about all
of the preceding words all the way back to the beginning of the sequence.
Forward inference in a recurrent language model proceeds exactly as described
in Section 9.2.1. The input sequence x consists of words represented
as one-hot vectors of size |V| × 1, and the output predictions, y, are
vectors representing a probability distribution over the vocabulary. At each step, the
model uses the word embedding matrix E to retrieve the embedding for the current
word, and then combines it with the hidden layer from the previous step to compute a
new hidden layer. This hidden layer is then used to generate an output layer which is
passed through a softmax layer to generate a probability distribution over the entire
vocabulary. That is, at time t:

e_t = E^T x_t
h_t = g(U h_{t-1} + W e_t)
y_t = \mathrm{softmax}(V h_t)

The vector resulting from V h can be thought of as a set of scores over the vocabulary
given the evidence provided in h. Passing these scores through the softmax normal-
izes the scores into a probability distribution. Given y, the probability of a particular
word in the vocabulary, i, as the next word is just its corresponding component of y.

P(w_{t+1} = i | w_{1:t}) = y_t^i

It follows from this that the probability of an entire sequence is just the product of
the probabilities of each item in the sequence.

P(w_{1:n}) = \prod_{i=1}^{n} P(w_i | w_{1:i-1})
           = \prod_{i=1}^{n} y_i^{w_i}

To train an RNN as a language model, we use a corpus of text as training
material in combination with a training regimen called teacher forcing. The task
is to minimize the error in predicting the next word in the training sequence, using
cross-entropy as the loss function. Recall that the cross-entropy loss measures the
difference between a predicted probability distribution and the correct distribution.

L_{CE} = -\sum_{w \in V} y_t^w \log \hat{y}_t^w

In the case of language modeling, the correct distribution y comes from knowing
the next word. This is represented as a one-hot vector corresponding to the vocab-
ulary where the entry for the actual next word is 1, and all the other entries are 0.
Thus, the cross-entropy loss for language modeling is determined by the probability
the model assigns to the correct next word. To be specific, at time t the CE loss is
the negative log probability assigned to the next word in the training sequence.
L_{CE}(\hat{y}_t, y_t) = -\log \hat{y}_t^{w_{t+1}}    (9.1)

In practice, the weights in the network are adjusted to minimize the average CE
loss over the training sequence via gradient descent. Fig. 9.6 illustrates this training
regimen.
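As a rough sketch of this training objective, the function below computes the average cross-entropy loss of Eq. 9.1 over one training sequence. The name lm_teacher_forcing_loss, the use of tanh for g, the embedding size equal to d_h, and the toy sizes are assumptions for illustration. The key point is that the input at each step is always the true word from the corpus, never the model's own prediction.

```python
import numpy as np

def lm_teacher_forcing_loss(word_ids, E, W, U, V):
    """Average cross-entropy of an RNN LM over one training sequence.

    word_ids: vocabulary indices w_1 ... w_n
    E: (|V|, d_h) embeddings, W: (d_h, d_h), U: (d_h, d_h), V: (|V|, d_h)
    """
    h = np.zeros(U.shape[0])
    losses = []
    for t in range(len(word_ids) - 1):
        e = E[word_ids[t]]                 # row lookup; equals E^T x_t for one-hot x_t
        h = np.tanh(U @ h + W @ e)
        scores = V @ h
        shifted = scores - scores.max()
        log_probs = shifted - np.log(np.sum(np.exp(shifted)))   # log softmax
        # Loss at time t: negative log probability of the true next word.
        losses.append(-log_probs[word_ids[t + 1]])
    return float(np.mean(losses))

rng = np.random.default_rng(0)
vocab, d_h = 7, 4                          # toy sizes
E = rng.normal(size=(vocab, d_h))
W = rng.normal(size=(d_h, d_h))
U = rng.normal(size=(d_h, d_h))
V = rng.normal(size=(vocab, d_h))
ids = [0, 3, 5, 2, 6]                      # a toy training sequence
loss = lm_teacher_forcing_loss(ids, E, W, U, V)
```

In a real system the gradient of this loss with respect to E, W, U and V would drive the weight updates; here we only compute the loss itself.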
Careful readers may have noticed that the input embedding matrix E and the
final layer matrix V, which feeds the output softmax, are quite similar. The rows of
E represent the word embeddings for each word in the vocabulary, learned during the
training process with the goal that words that have similar meaning and function will
have similar embeddings. And, since the length of these embeddings corresponds to
the size of the hidden layer d_h, the shape of the embedding matrix E is |V| × d_h.
The final layer matrix V provides a way to score the likelihood of each word
in the vocabulary given the evidence present in the final hidden layer of the
network through the calculation of V h. This entails that it also has the dimensionality
|V| × d_h. That is, the rows of V provide a second set of learned word embeddings
that capture relevant aspects of word meaning and function. This leads to an obvious
question: is it even necessary to have both? Weight tying is a method that
dispenses with this redundancy and uses a single set of embeddings at the input and
softmax layers. That is, E = V. To do this, we set the dimensionality of the final
hidden layer to be the same d_h (or add an additional projection layer to do the
same thing), and simply use the same matrix for both layers. In addition to providing
improved perplexity results, this approach significantly reduces the number of
parameters required for the model.
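The parameter savings from weight tying are easy to see in a small sketch. The sizes below (a 10,000-word vocabulary and d_h = 256) and the function name tied_step are illustrative assumptions only.

```python
import numpy as np

vocab_size, d_h = 10_000, 256                        # illustrative sizes
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(vocab_size, d_h))    # single shared matrix, E = V
W = rng.normal(scale=0.1, size=(d_h, d_h))
U = rng.normal(scale=0.1, size=(d_h, d_h))

def tied_step(word_id, h_prev):
    """One LM step that uses E for both the input lookup and the output scores."""
    h = np.tanh(U @ h_prev + W @ E[word_id])
    scores = E @ h                                   # the same matrix scores every word
    return h, scores

h, scores = tied_step(3, np.zeros(d_h))

untied_params = 2 * vocab_size * d_h + 2 * d_h * d_h   # E, V, W, U
tied_params = vocab_size * d_h + 2 * d_h * d_h         # E (= V), W, U
```

With these sizes, tying removes vocab_size × d_h parameters, which is nearly half of the total.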
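Figure 9.6 Training RNNs as language models. (Diagram: the input words “In a hole in the”
pass through input embeddings, RNN layer(s), and a softmax over the vocabulary. The correct
next words are “a hole in the ground”; the loss at each step is based on the log probability
assigned to the correct next word, and the per-step losses are averaged,
\frac{1}{T} \sum_{t=1}^{T} L_{CE}.)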

Generation with RNN-Based Language Models


As with the probabilistic Shakespeare generator from Chapter 3, a useful way to
gain insight into a language model is to use a trained model to generate random
novel sentences. The procedure is basically the same as that described on page 38.
• To begin, sample a word in the output from the softmax distribution that re-
sults from using the beginning of sentence marker, <s>, as the first input.
• Use the word embedding for that first word as the input to the network at the
next time step, and then sample the next word in the same fashion.
• Continue generating until the end of sentence marker, </s>, is sampled or a
fixed length limit is reached.
This technique is called autoregressive generation since the word generated at each
time step is conditioned on the word selected by the network from the previous step.
Fig. 9.7 illustrates this approach. In this figure, the details of the RNN’s hidden
layers and recurrent connections are hidden within the blue block.
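The sampling procedure can be sketched as a short loop. The function name generate, the use of tanh, and the toy vocabulary in which index 0 stands for <s> and index 1 for </s> are assumptions for this example; the weights are random, so the output is meaningless apart from showing the control flow.

```python
import numpy as np

def generate(E, W, U, V, bos_id, eos_id, max_len=20, seed=0):
    """Sample word ids autoregressively until </s> is drawn or max_len is reached."""
    rng = np.random.default_rng(seed)
    h = np.zeros(U.shape[0])
    word, out = bos_id, []
    for _ in range(max_len):
        h = np.tanh(U @ h + W @ E[word])
        scores = V @ h
        p = np.exp(scores - scores.max())
        p /= p.sum()                            # softmax over the vocabulary
        word = int(rng.choice(len(p), p=p))     # sample rather than take the argmax
        if word == eos_id:
            break
        out.append(word)                        # the sample becomes the next input
    return out

rng = np.random.default_rng(1)
vocab, d_h = 6, 4                               # toy sizes
E = rng.normal(size=(vocab, d_h))
W = rng.normal(size=(d_h, d_h))
U = rng.normal(size=(d_h, d_h))
V = rng.normal(size=(vocab, d_h))
sampled = generate(E, W, U, V, bos_id=0, eos_id=1, max_len=10)
```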
While this is an entertaining exercise, this architecture has inspired state-of-the-
art approaches to applications such as machine translation, summarization, and ques-
tion answering. The key to these approaches is to prime the generation component
with an appropriate context. That is, instead of simply using <s> to get things started
we can provide a richer task-appropriate context. We’ll discuss the application of
contextual generation to the problem of summarization in Section ?? in the context
of Transformer-based language models.

9.2.4 Other Applications of RNNs


Recurrent neural networks have proven to be an effective approach to language
modeling, sequence labeling tasks such as part-of-speech tagging, as well as sequence
classification tasks such as sentiment analysis and topic classification. And as we’ll
see in Chapter 11, they form the basis for sequence-to-sequence
approaches to summarization, machine translation, and question answering.
Figure 9.7 Autoregressive generation with an RNN-based neural language model.
(Diagram: the input words <s>, In, a, hole pass through embedding, RNN and softmax layers;
the words sampled at each step are In, a, hole, ?, and each sampled word becomes the next input.)

Sequence Labeling
In sequence labeling, the network’s task is to assign a label chosen from a small
fixed set of labels to each element of a sequence. Canonical examples of sequence
labeling include part-of-speech tagging and named entity recognition discussed in
detail in Chapter 8. In an RNN approach to sequence labeling, inputs are word
embeddings and the outputs are tag probabilities generated by a softmax layer over
the given tagset, as illustrated in Fig. 9.8.
In this figure, the inputs at each time step are pre-trained word embeddings
corresponding to the input tokens. The RNN block is an abstraction that represents
an unrolled simple recurrent network consisting of an input layer, hidden layer, and
output layer at each time step, as well as the shared U, V and W weight matrices that
comprise the network. The outputs of the network at each time step represent the
distribution over the POS tagset generated by a softmax layer.

Figure 9.8 Part-of-speech tagging as sequence labeling with a simple RNN. Pre-trained
word embeddings serve as inputs and a softmax layer provides a probability distribution over
the part-of-speech tags as output at each time step. (Diagram: the words “Janet will back the
bill” pass through embeddings, RNN and softmax layers; the argmax at each step yields the tags
NNP MD VB DT NN.)
To generate a sequence of tags for a given input, we run forward inference over
the input sequence and select the most likely tag from the softmax at each step. Since
we’re using a softmax layer to generate the probability distribution over the output
tagset at each time step, we will again employ the cross-entropy loss during training.
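Tag selection at inference time can be sketched as follows. The five-tag tagset, the function name tag_sequence, and the random stand-ins for pre-trained embeddings and trained weights are assumptions for illustration.

```python
import numpy as np

TAGS = ["NNP", "MD", "VB", "DT", "NN"]       # toy tagset for illustration

def tag_sequence(embeddings, W, U, V):
    """Return the index of the most likely tag at each step of a simple RNN tagger."""
    h = np.zeros(U.shape[0])
    tags = []
    for x in embeddings:
        h = np.tanh(U @ h + W @ x)
        scores = V @ h                        # one score per tag
        # Softmax is monotonic, so the argmax of the scores is the most likely tag.
        tags.append(int(np.argmax(scores)))
    return tags

rng = np.random.default_rng(0)
d_in, d_h, n = 8, 6, 5                        # toy sizes; n words in the sentence
W = rng.normal(size=(d_h, d_in))
U = rng.normal(size=(d_h, d_h))
V = rng.normal(size=(len(TAGS), d_h))
embeddings = rng.normal(size=(n, d_in))       # stand-ins for pre-trained embeddings
predicted = [TAGS[i] for i in tag_sequence(embeddings, W, U, V)]
```

During training, the cross-entropy loss would be computed at every step against the gold tag for that token.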

9.2.5 RNNs for Sequence Classification


Another use of RNNs is to classify entire sequences rather than the tokens within
them. We’ve already encountered this task in Chapter 4 with our discussion of sen-
timent analysis. Other examples include document-level topic classification, spam
detection, message routing for customer service applications, and deception detec-
tion. In all of these applications, sequences of text are classified as belonging to one
of a small number of categories.
To apply RNNs in this setting, the text to be classified is passed through the RNN
a word at a time, generating a new hidden layer at each time step. The hidden layer
for the final element of the text, h_n, is taken to constitute a compressed representation
of the entire sequence. In the simplest approach to classification, h_n serves as the
input to a subsequent feedforward network that chooses a class via a softmax over
the possible classes. Fig. 9.9 illustrates this approach.

Figure 9.9 Sequence classification using a simple RNN combined with a feedforward
network. The final hidden state from the RNN is used as the input to a feedforward network that
performs the classification. (Diagram: the inputs x_1, x_2, x_3, ..., x_n pass through the RNN;
the final hidden state h_n feeds a softmax.)
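A minimal sketch of this architecture follows. The function name classify_sequence, the single ReLU layer in the feedforward classifier, and all sizes are assumptions; the text does not specify the classifier's internals.

```python
import numpy as np

def classify_sequence(xs, W, U, W1, W2):
    """Encode xs with a simple RNN, then classify h_n with a small feedforward net."""
    h = np.zeros(U.shape[0])
    for x in xs:                        # no outputs are produced at these steps
        h = np.tanh(U @ h + W @ x)
    hidden = np.maximum(0.0, W1 @ h)    # feedforward layer on h_n (ReLU is assumed)
    scores = W2 @ hidden
    p = np.exp(scores - scores.max())
    return p / p.sum()                  # probability distribution over the classes

rng = np.random.default_rng(0)
d_in, d_h, d_ff, n_classes = 5, 4, 6, 3     # toy sizes
W = rng.normal(size=(d_h, d_in))
U = rng.normal(size=(d_h, d_h))
W1 = rng.normal(size=(d_ff, d_h))
W2 = rng.normal(size=(n_classes, d_ff))
xs = rng.normal(size=(7, d_in))
probs = classify_sequence(xs, W, U, W1, W2)
```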

Note that in this approach there are no intermediate outputs for the words in
the sequence preceding the last element. Therefore, there are no loss terms
associated with those elements. Instead, the loss function used to train the weights in the
network is based entirely on the final text classification task. Specifically, the
softmax output from the feedforward classifier together with a cross-entropy
loss drives the training. The error signal from the classification is backpropagated
all the way through the weights in the feedforward classifier to its
input, and then through to the three sets of weights in the RNN as described earlier
in Section 9.2.2. This combination of a simple recurrent network with a feedforward
classifier is our first example of a deep neural network. And the training regimen
that uses the loss from a downstream application to adjust the weights all the way
through the network is referred to as end-to-end training.

9.2.6 Stacked and Bidirectional RNNs


As suggested by the sequence classification architecture shown in Fig. 9.9, recurrent
networks are quite flexible. By combining the feedforward nature of unrolled com-
putational graphs with vectors as common inputs and outputs, complex networks
can be treated as modules that can be combined in creative ways. This section intro-
duces two of the more common network architectures used in language processing
with RNNs.

Stacked RNNs
In our examples thus far, the inputs to our RNNs have consisted of sequences of
word or character embeddings (vectors) and the outputs have been vectors useful for
predicting words, tags or sequence labels. However, nothing prevents us from using
the entire sequence of outputs from one RNN as an input sequence to another one.
Stacked RNNs consist of multiple networks where the output of one layer serves as
the input to a subsequent layer, as shown in Fig. 9.10.

Figure 9.10 Stacked recurrent networks. The output of a lower level serves as the input to
higher levels with the output of the last network serving as the final output. (Diagram: the
inputs x_1, x_2, x_3, ..., x_n feed RNN 1, whose outputs feed RNN 2 and then RNN 3, which
produces y_1, y_2, y_3, ..., y_n.)
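Stacking amounts to feeding the full sequence of hidden states from one layer into the next. The sketch below assumes tanh layers and uses the hypothetical helper names rnn_layer and stacked_rnn.

```python
import numpy as np

def rnn_layer(xs, W, U):
    """Return the full sequence of hidden states for one simple RNN layer."""
    h = np.zeros(U.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(U @ h + W @ x)
        hs.append(h)
    return np.array(hs)

def stacked_rnn(xs, layers):
    """layers is a list of (W, U) pairs; each layer reads the output of the one below."""
    seq = xs
    for W, U in layers:
        seq = rnn_layer(seq, W, U)
    return seq

rng = np.random.default_rng(0)
d_in, d_h, n = 4, 3, 5                       # toy sizes
layers = [(rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)))]
layers += [(rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h)))
           for _ in range(2)]                # three layers in total
xs = rng.normal(size=(n, d_in))
top = stacked_rnn(xs, layers)                # outputs of the last (third) layer
```

Only the first layer's W has to match the input dimension; the higher layers take d_h-dimensional inputs.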

It has been demonstrated across numerous tasks that stacked RNNs can outper-
form single-layer networks. One reason for this success has to do with the network’s
ability to induce representations at differing levels of abstraction across layers. Just
as the early stages of the human visual system detect edges that are then used for
finding larger regions and shapes, the initial layers of stacked networks can induce
representations that serve as useful abstractions for further layers — representations
that might prove difficult to induce in a single RNN.
The optimal number of stacked RNNs is specific to each application and to each
training set. However, as the number of stacks is increased the training costs rise
quickly.

Bidirectional RNNs
In a simple recurrent network, the hidden state at a given time t represents everything
the network knows about the sequence up to that point in the sequence. That is, the
hidden state at time t is the result of a function of the inputs from the start up through
time t. We can think of this as the context of the network to the left of the current
time.

h_t^f = \mathrm{RNN}_{forward}(x_1^t)

where h_t^f corresponds to the normal hidden state at time t, and represents everything
the network has gleaned from the sequence to that point.
In many applications we have access to the entire input sequence all at once. We
might ask whether it is helpful to take advantage of the context to the right of the
current input as well. One way to recover such information is to train an RNN on an
input sequence in reverse, using exactly the same kind of networks that we’ve been
discussing. With this approach, the hidden state at time t now represents information
about the sequence to the right of the current input.

h_t^b = \mathrm{RNN}_{backward}(x_t^n)

Here, the hidden state h_t^b represents all the information we have discerned about the
sequence from t to the end of the sequence.
Combining the forward and backward networks results in a bidirectional RNN (Schuster
and Paliwal, 1997). A Bi-RNN consists of two independent RNNs, one where the
input is processed from the start to the end, and the other from the end to the start.
We then combine the outputs of the two networks into a single representation that
captures both the left and right contexts of an input at each point in time.

h_t = h_t^f \oplus h_t^b

Fig. 9.11 illustrates a bidirectional network where the outputs of the forward and
backward pass are concatenated. Other simple ways to combine the forward and
backward contexts include element-wise addition or multiplication. The output at
each step in time thus captures information to the left and to the right of the current
input. In sequence labeling applications, these concatenated outputs can serve as the
basis for a local labeling decision.
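The concatenation of forward and backward states can be sketched as below. The helper names rnn_layer and birnn, the tanh activation, and the sizes are assumptions for this example. The one detail worth noticing is that the backward states must be reversed again so that position t of both sequences refers to the same input.

```python
import numpy as np

def rnn_layer(xs, W, U):
    """Return the full sequence of hidden states for one simple RNN layer."""
    h = np.zeros(U.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(U @ h + W @ x)
        hs.append(h)
    return np.array(hs)

def birnn(xs, fwd, bwd):
    """Concatenate forward and backward hidden states at every time step."""
    hf = rnn_layer(xs, *fwd)                 # left to right
    hb = rnn_layer(xs[::-1], *bwd)[::-1]     # right to left, re-aligned to input order
    return np.concatenate([hf, hb], axis=1)  # h_t = h_t^f (+) h_t^b

rng = np.random.default_rng(0)
d_in, d_h, n = 4, 3, 4                       # toy sizes
fwd = (rng.normal(scale=0.5, size=(d_h, d_in)), rng.normal(scale=0.5, size=(d_h, d_h)))
bwd = (rng.normal(scale=0.5, size=(d_h, d_in)), rng.normal(scale=0.5, size=(d_h, d_h)))
xs = rng.normal(size=(n, d_in))
out = birnn(xs, fwd, bwd)                    # shape (n, 2 * d_h)
```

Changing the last input leaves the forward half of the first position untouched but alters its backward half, which is exactly the right-context information a unidirectional RNN lacks.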
Bidirectional RNNs have also proven to be quite effective for sequence
classification. Recall from Fig. 9.9 that for sequence classification we used the final
hidden state of the RNN as the input to a subsequent feedforward classifier. A
difficulty with this approach is that the final state naturally reflects more information
about the end of the sentence than its beginning. Bidirectional RNNs provide a
simple solution to this problem; as shown in Fig. 9.12, we simply combine the final
hidden states from the forward and backward passes and use that as input for
follow-on processing. Again, concatenation is a common approach to combining the two
outputs but element-wise summation, multiplication or averaging are also used.
Figure 9.11 A bidirectional RNN. Separate models are trained in the forward and backward
directions with the output of each model at each time point concatenated to represent the state
of affairs at that point in time. The box wrapped around the forward and backward network
emphasizes the modular nature of this architecture. (Diagram: the inputs x_1, x_2, x_3, ..., x_n
feed RNN 1 (left to right) and RNN 2 (right to left); their outputs are combined at each step to
give y_1, y_2, y_3, ..., y_n.)

Figure 9.12 A bidirectional RNN for sequence classification. The final hidden units from
the forward and backward passes are combined to represent the entire sequence. This
combined representation serves as input to the subsequent classifier. (Diagram: hn_forw from
RNN 1 (left to right) and h1_back from RNN 2 (right to left) are combined and passed to a softmax.)

9.3 Managing Context in RNNs: LSTMs and GRUs


In practice, it is quite difficult to train RNNs for tasks that require a network to make
use of information distant from the current point of processing. Despite having
access to the entire preceding sequence, the information encoded in hidden states
tends to be fairly local, more relevant to the most recent parts of the input sequence
and recent decisions. It is often the case, however, that distant information is critical
to many language applications. To see this, consider the following example in the
context of language modeling.
(9.2) The flights the airline was cancelling were full.
Assigning a high probability to was following airline is straightforward since airline
provides a strong local context for the singular agreement. However, assigning an
appropriate probability to were is quite difficult, not only because the plural flights is
quite distant, but also because the intervening context involves singular constituents.
Ideally, a network should be able to retain the distant information about plural flights
until it is needed, while still processing the intermediate parts of the sequence cor-
rectly.
One reason for the inability of RNNs to carry forward critical information is that
the hidden layers, and, by extension, the weights that determine the values in the
hidden layer, are being asked to perform two tasks simultaneously: providing information
useful for the current decision, and updating and carrying forward information
required for future decisions.
A second difficulty with training simple recurrent networks (SRNs) arises from the
need to backpropagate the error signal back through time. Recall from Section 9.2.2
that the hidden layer at time t contributes to the loss at the next time step since it
takes part in that calculation. As a result, during the backward pass of training, the
hidden layers are subject to repeated multiplications, as determined by the length of
the sequence. A frequent result of this process is that the gradients are eventually
driven to zero, the so-called vanishing gradients problem.
To address these issues, more complex network architectures have been designed
to explicitly manage the task of maintaining relevant context over time. More specif-
ically, the network needs to learn to forget information that is no longer needed and
to remember information required for decisions still to come.

9.3.1 Long Short-Term Memory


Long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997)
divide the context management problem into two sub-problems: removing information
no longer needed from the context, and adding information likely to be needed
for later decision making. The key to solving both problems is to learn how to manage
this context rather than hard-coding a strategy into the architecture. LSTMs
accomplish this by first adding an explicit context layer to the architecture (in addition
to the usual recurrent hidden layer), and through the use of specialized neural
units that make use of gates to control the flow of information into and out of the
units that comprise the network layers. These gates are implemented through the
use of additional weights that operate sequentially on the input, the previous hidden
layer, and the previous context layer.
The gates in an LSTM share a common design pattern; each consists of a feed-
forward layer, followed by a sigmoid activation function, followed by a pointwise
multiplication with the layer being gated. The choice of the sigmoid as the activation
function arises from its tendency to push its outputs to either 0 or 1. Combining this
with a pointwise multiplication has an effect similar to that of a binary mask. Values
in the layer being gated that align with values near 1 in the mask are passed through
nearly unchanged; values corresponding to lower values are essentially erased.
The first gate we’ll consider is the forget gate. The purpose of this gate is to delete
information from the context that is no longer needed. The forget gate computes a
weighted sum of the previous state’s hidden layer and the current input and passes
that through a sigmoid. This mask is then multiplied element-wise by the context
vector to remove the information from context that is no longer required.

f_t = \sigma(U_f h_{t-1} + W_f x_t)
k_t = c_{t-1} \odot f_t

The next task is to compute the actual information we need to extract from the
previous hidden state and current inputs, the same basic computation we’ve been
using for all our recurrent networks.

g_t = \tanh(U_g h_{t-1} + W_g x_t)    (9.3)

Next, we generate the mask for the add gate to select the information to add to the
current context.

i_t = \sigma(U_i h_{t-1} + W_i x_t)    (9.4)

j_t = g_t \odot i_t    (9.5)

Next, we add this to the modified context vector to get our new context vector.

c_t = j_t + k_t    (9.6)

The final gate we’ll use is the output gate, which is used to decide what information
is required for the current hidden state (as opposed to what information needs
to be preserved for future decisions).

o_t = \sigma(U_o h_{t-1} + W_o x_t)    (9.7)

h_t = o_t \odot \tanh(c_t)    (9.8)

Fig. 9.13 illustrates the complete computation for a single LSTM unit. Given
the appropriate weights for the various gates, an LSTM accepts as input the context
layer, and hidden layer from the previous time step, along with the current input
vector. It then generates updated context and hidden vectors as output. The hidden
layer, ht , can be used as input to subsequent layers in a stacked RNN, or to generate
an output for the final layer of a network.
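Putting the gate equations together, a single LSTM step can be sketched as follows. The function name lstm_step and the dictionary of weights keyed by names such as "Uf" and "Wf" are conventions invented for this example, and bias terms are omitted because the equations above do not include them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p maps names like 'Uf' and 'Wf' to weight matrices."""
    f = sigmoid(p["Uf"] @ h_prev + p["Wf"] @ x)   # forget gate mask
    k = c_prev * f                                # context with unneeded info removed
    g = np.tanh(p["Ug"] @ h_prev + p["Wg"] @ x)   # candidate information (9.3)
    i = sigmoid(p["Ui"] @ h_prev + p["Wi"] @ x)   # add gate mask (9.4)
    j = g * i                                     # information selected to add (9.5)
    c = j + k                                     # new context vector (9.6)
    o = sigmoid(p["Uo"] @ h_prev + p["Wo"] @ x)   # output gate mask (9.7)
    h = o * np.tanh(c)                            # new hidden state (9.8)
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                                  # toy sizes
params = {}
for gate in "fgio":
    params["U" + gate] = rng.normal(size=(d_h, d_h))
    params["W" + gate] = rng.normal(size=(d_h, d_in))

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):              # run over a short random sequence
    h, c = lstm_step(x, h, c, params)
```

The function takes (x, h_prev, c_prev) and returns (h, c), so from the outside the only difference from a simple recurrent step is the extra context vector.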

9.3.2 Gated Recurrent Units


LSTMs introduce a considerable number of additional parameters to our recurrent
networks. We now have 8 sets of weights to learn (i.e., the U and W for each of the 4
gates within each unit), whereas with simple recurrent units we only had 2. Training
these additional parameters imposes a significantly higher training cost. Gated
Recurrent Units (GRUs) (Cho et al., 2014) ease this burden by dispensing with the
use of a separate context vector, and by reducing the number of gates to 2: a reset
gate, r, and an update gate, z.

r_t = \sigma(U_r h_{t-1} + W_r x_t)    (9.10)

z_t = \sigma(U_z h_{t-1} + W_z x_t)    (9.11)
Figure 9.13 A single LSTM unit displayed as a computation graph. The inputs to each unit consist of the
current input, x, the previous hidden state, h_{t-1}, and the previous context, c_{t-1}. The outputs are a new hidden
state, h_t, and an updated context, c_t. (Diagram: h_{t-1} and x_t feed the f, g, i and o computations;
element-wise products and a sum produce the new context and h_t. The diagram labels the context
s_{t-1} and s_t.)

As with LSTMs, the use of the sigmoid in the design of these gates results in
a binary-like mask that either blocks information with values near zero or allows
information to pass through unchanged with values near one. The purpose of the
reset gate is to decide which aspects of the previous hidden state are relevant to the
current context and what can be ignored. This is accomplished by performing an
element-wise multiplication of r with the value of the previous hidden state. We
then use this masked value in computing an intermediate representation for the new
hidden state at time t.

\tilde{h}_t = \tanh(U(r_t \odot h_{t-1}) + W x_t)    (9.12)

The job of the update gate z is to determine which aspects of this new state will
be used directly in the new hidden state and which aspects of the previous state need
to be preserved for future use. This is accomplished by using the values in z to
interpolate between the old hidden state and the new one.

h_t = (1 - z_t) h_{t-1} + z_t \tilde{h}_t    (9.13)
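Equations 9.10 through 9.13 translate directly into a short function. As before, this is a sketch: the name gru_step and the weight dictionary keys are invented for the example, bias terms are omitted, and the products in Eq. 9.13 are taken to be element-wise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step; p maps 'Ur', 'Wr', 'Uz', 'Wz', 'U', 'W' to weight matrices."""
    r = sigmoid(p["Ur"] @ h_prev + p["Wr"] @ x)            # reset gate (9.10)
    z = sigmoid(p["Uz"] @ h_prev + p["Wz"] @ x)            # update gate (9.11)
    h_tilde = np.tanh(p["U"] @ (r * h_prev) + p["W"] @ x)  # candidate state (9.12)
    return (1.0 - z) * h_prev + z * h_tilde                # interpolation (9.13)

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                                           # toy sizes
params = {}
for name in ("r", "z", ""):
    params["U" + name] = rng.normal(size=(d_h, d_h))
    params["W" + name] = rng.normal(size=(d_h, d_in))

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):                       # short random sequence
    h = gru_step(x, h, params)
```

Unlike the LSTM, the step takes and returns only a hidden state, which is why a GRU has the same external interface as a simple recurrent unit.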

9.3.3 Gated Units, Layers and Networks


The neural units used in LSTMs and GRUs are obviously much more complex than
those used in basic feedforward networks. Fortunately, this complexity is encapsulated
within the basic processing units, allowing us to maintain modularity and to
easily experiment with different architectures. To see this, consider Fig. 9.14 which
illustrates the inputs and outputs associated with each kind of unit.

Figure 9.14 Basic neural units used in feedforward networks, simple recurrent networks (SRN), long
short-term memory (LSTM) networks and gated recurrent unit (GRU) networks. (Diagram: (a) a
feedforward unit with input x and output h; (b) a simple recurrent unit with inputs h_{t-1} and x_t
and output h_t; (c) an LSTM unit with inputs c_{t-1}, h_{t-1} and x_t and outputs c_t and h_t;
(d) a GRU unit with inputs h_{t-1} and x_t and output h_t.)

At the far left, (a) is the basic feedforward unit where a single set of weights and
a single activation function determine its output, and when arranged in a layer there
are no connections among the units in the layer. Next, (b) represents the unit in a
simple recurrent network. Now there are two inputs and an additional set of weights
to go with it. However, there is still a single activation function and output.
The increased complexity of the LSTM (c) and GRU (d) units on the right is
encapsulated within the units themselves. The only additional external complexity
for the LSTM over the basic recurrent unit (b) is the presence of the additional
context vector as an input and output. The GRU units have the same input and
output architecture as the simple recurrent unit.
This modularity is key to the power and widespread applicability of LSTM and
GRU units. LSTM and GRU units can be substituted into any of the network ar-
chitectures described in Section 9.2.6. And, as with simple RNNs, multi-layered
networks making use of gated units can be unrolled into deep feedforward networks
and trained in the usual fashion with backpropagation.

9.4 Self-Attention Networks: Transformers


Despite the ability of LSTMs to mitigate the loss of distant information due to the
recurrence in RNNs, the underlying problem remains. Passing information forward
through an extended series of recurrent connections leads to a loss of relevant
information and to difficulties in training. Moreover, the inherently sequential nature
of recurrent networks inhibits the use of parallel computational resources. These
considerations led to the development of Transformers, an approach to sequence
processing that eliminates recurrent connections and returns to architectures
reminiscent of the fully connected networks described earlier in Chapter 7.
Transformers map sequences of input vectors (x_1, ..., x_n) to sequences of output
vectors (y_1, ..., y_n) of the same length. Transformers are made up of stacks of
network layers consisting of simple linear layers, feedforward networks, and custom
connections around them. In addition to these standard components, the key
innovation of transformers is the use of self-attention layers. We’ll start by describing
how self-attention works and then return to how it fits into larger transformer blocks.
Self-attention allows a network to directly extract and use information from arbitrarily
large contexts without the need to pass it through intermediate recurrent connections
as in RNNs. In this chapter, we’ll focus on the application of self-attention to
the problems of language modeling and autoregressive generation where the context
to be used lies in the past. We’ll return to wider applications of self-attention and
Transformers in later chapters.
Fig. 9.15 illustrates the flow of information in a single causal, or backward look-
ing, self-attention layer. As with the overall Transformer, a self-attention layer maps
input sequences (x1 , ..., xn ) to output sequences of the same length (y1 , ..., yn ). When
processing each item in the input, the model has access to all of the inputs up to and
including the one under consideration, but no access to information about inputs
beyond the current one. In addition, the computation performed for each item is
independent of all the other computations. The first point ensures that we can use
this approach to create language models and use them for autoregressive generation,
and the second point means that we can easily parallelize both forward inference
and training of such models.

[Figure 9.15 diagram: inputs x1 ... x5 enter a self-attention layer that produces outputs y1 ... y5.]

Figure 9.15 Information flow in a causal (or masked) self-attention model. In processing each element of
the sequence, the model attends to all the inputs up to, and including, the current one. Unlike RNNs, the
computations at each time step are independent of all the other steps and therefore can be performed in parallel.

At the core of an attention-based approach is the ability to compare an item of
interest to a collection of other items in a way that reveals their relevance in the current
context. In the case of self-attention, the set of comparisons is to other elements
within a given sequence. The result of these comparisons is then used to compute an
output for the current input. For example, returning to Fig. 9.15, the computation of
y3 is based on a set of comparisons between the input x3 and its preceding elements
x1 and x2 , and to x3 itself. The simplest form of comparison between elements in a
self-attention layer is a dot product. To allow for other possible comparisons, let’s
refer to the result of these comparisons as scores.
\[ \mathrm{score}(x_i, x_j) = x_i \cdot x_j \tag{9.14} \]

The result of a dot product is a scalar value ranging from −∞ to ∞; the larger
the value, the more similar the vectors that are being compared. Continuing with our
example, the first step in computing y3 would be to compute three scores: x3 · x1,
x3 · x2 and x3 · x3. Then to make effective use of these scores, we’ll normalize them
with a softmax to create a vector of weights, αij, that indicates the proportional
relevance of each input to the input element i that is the current focus of attention.
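\[ \alpha_{ij} = \mathrm{softmax}(\mathrm{score}(x_i, x_j)) \quad \forall j \le i \tag{9.15} \]
\[ \alpha_{ij} = \frac{\exp(\mathrm{score}(x_i, x_j))}{\sum_{k=1}^{i} \exp(\mathrm{score}(x_i, x_k))} \quad \forall j \le i \tag{9.16} \]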
Given the proportional scores in α, we then generate an output value yi by taking
the sum of the inputs seen so far, weighted by their respective α value.
\[ y_i = \sum_{j \le i} \alpha_{ij} x_j \tag{9.17} \]

The steps embodied in Equations 9.14 through 9.17 represent the core of an
attention-based approach: a set of comparisons to relevant items in some context,
a normalization of those scores to provide a probability distribution, followed by a
weighted sum using this distribution. The output y is the result of this straightfor-
ward computation over the inputs.
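The three steps of Equations 9.14 through 9.17 can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names and the toy two-dimensional input vectors are made up for the example and are not taken from any library.

```python
import math

def dot(u, v):
    """Dot-product score between two equal-length vectors (Eq. 9.14)."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn a list of scores into weights that sum to 1 (Eq. 9.16)."""
    m = max(scores)                                # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simple_causal_self_attention(xs):
    """Return y_1..y_n, where y_i is a weighted sum of x_1..x_i (Eq. 9.17)."""
    ys = []
    for i, x_i in enumerate(xs):
        context = xs[: i + 1]                      # inputs up to and including x_i
        alphas = softmax([dot(x_i, x_j) for x_j in context])
        y_i = [sum(a * x_j[d] for a, x_j in zip(alphas, context))
               for d in range(len(x_i))]
        ys.append(y_i)
    return ys

# Toy input: three made-up 2-dimensional "embeddings".
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = simple_causal_self_attention(xs)
```

Because each y_i looks only at x_1 through x_i, truncating the input leaves the earlier outputs unchanged, which is the property that makes the layer usable for autoregressive generation.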
Unfortunately, this simple mechanism provides no opportunity for learning; everything
is directly based on the original input values x. In particular, there are no
opportunities to learn the diverse ways that words can contribute to the represen-
tation of longer inputs. To allow for this kind of learning, Transformers include
additional parameters in the form of a set of weight matrices that operate over the
input embeddings. To motivate these new parameters, consider the different roles
that each input embedding plays during the course of the attention process.
• As the current focus of attention when being compared to all of the other
preceding inputs. We’ll refer to this role as a query.
• In its role as a preceding input being compared to the current focus of atten-
tion. We’ll refer to this role as a key.
• And finally, as a value used to compute the output for the current focus of
attention.
To capture the different roles that input embeddings play in each of these steps,
Transformers introduce three sets of weights which we’ll call W^Q, W^K, and W^V.
These weights will be used to compute linear transformations of each input x with
the resulting values being used in their respective roles in subsequent calculations.
\[ q_i = W^Q x_i; \quad k_i = W^K x_i; \quad v_i = W^V x_i \]
Given input embeddings of size dm, the dimensions of these matrices are dq × dm,
dk × dm and dv × dm, respectively. In the original Transformer work (Vaswani et al.,
2017), dm was 1024, and dk, dq and dv were each 64.
Given these projections, the score between a current focus of attention, xi, and
an element in the preceding context, xj, consists of a dot product between its query
vector qi and the preceding element’s key vector kj. Let’s update our previous com-
parison calculation to reflect this.
\[ \mathrm{score}(x_i, x_j) = q_i \cdot k_j \tag{9.18} \]
The ensuing softmax calculation resulting in αi, j remains the same, but the output
calculation for yi is now based on a weighted sum over the value vectors v.
\[ y_i = \sum_{j \le i} \alpha_{ij} v_j \tag{9.19} \]
[Figure 9.16 diagram: each input x1, x2, x3 is multiplied by W^k, W^q and W^v to generate key, query and value vectors; the key/query comparisons pass through a softmax to give the weights αi,j, which are used to weight and sum the value vectors into the output vector y3.]
Figure 9.16 Calculation of the value of the third element of a sequence using causal self-attention.

Fig. 9.16 illustrates this calculation in the case of computing the third output y3 in a
sequence.
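A minimal sketch of this calculation follows, using the column-vector form q_i = W^Q x_i. The weight matrices here are arbitrary hand-picked values chosen only so the example runs; in a real model they would be learned, and all function names are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by the column vector x."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(i, xs, W_Q, W_K, W_V):
    """Compute y_i (0-based i) from xs[0..i] with Eqs. 9.18 and 9.19."""
    q_i = matvec(W_Q, xs[i])                       # query for the current focus
    context = xs[: i + 1]
    keys = [matvec(W_K, x_j) for x_j in context]   # one key per context element
    values = [matvec(W_V, x_j) for x_j in context] # one value per context element
    scores = [sum(a * b for a, b in zip(q_i, k_j)) for k_j in keys]
    alphas = softmax(scores)
    d_v = len(values[0])
    return [sum(a * v_j[d] for a, v_j in zip(alphas, values)) for d in range(d_v)]

# Toy setup: d_m = 2, d_q = d_k = 2, d_v = 3 (illustrative values only).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y3 = attention_output(2, xs, W_Q, W_K, W_V)        # third output, as in Fig. 9.16
```

Note that the output has the dimensionality of the value vectors (here 3), not of the inputs.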
A practical consideration in computing αij arises from the use of a
dot product as a comparison in combination with the exponential in the softmax.
The result of a dot product can be an arbitrarily large (positive or negative) value.
Exponentiating such large values can lead to numerical issues and to an effective
loss of gradients during training. To avoid this, the dot product needs to be scaled
in a suitable fashion. A scaled dot-product approach divides the result of the dot
product by a factor related to the size of the embeddings before passing them through
the softmax. A typical approach is to divide the dot product by the square root of
the dimensionality of the query and key vectors, leading us to update our scoring
function one more time.
\[ \mathrm{score}(x_i, x_j) = \frac{q_i \cdot k_j}{\sqrt{d_k}} \tag{9.20} \]
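To see why the scaling matters, consider a small worked example with d_k = 64 and deliberately extreme, made-up vectors: one key identical to the query and one key of all zeros. Without scaling, the softmax puts essentially all of its mass on one item, which is the saturated regime in which gradients vanish.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
q = [1.0] * d_k            # made-up query vector
k_match = [1.0] * d_k      # key identical to the query
k_other = [0.0] * d_k      # key orthogonal to everything

raw_scores = [dot(q, k_match), dot(q, k_other)]            # grows with d_k
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]   # Eq. 9.20

raw_weights = softmax(raw_scores)        # second weight is vanishingly small
scaled_weights = softmax(scaled_scores)  # second weight is small but usable
```

Here the raw scores are 64 and 0, while the scaled scores are 8 and 0, so the second item keeps a weight on the order of 1e-4 instead of roughly 1e-28.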
This description of the self-attention process has been from the perspective of
computing a single output at a particular point in time. However, since each out-
put, yi, is computed independently, this entire process can be parallelized by taking
advantage of efficient matrix multiplication routines: we pack the input embeddings
into a single matrix and multiply it by the key, query and value matrices to
produce matrices containing all the key, query and value vectors.

\[ Q = XW^Q; \quad K = XW^K; \quad V = XW^V \]
Here each row of X holds one input embedding, so each weight matrix is the transpose
of the one applied to a single column vector xi earlier. Given these matrices we can
compute all the requisite query-key comparisons simultaneously by multiplying Q and
K^T in a single matrix multiplication. Taking
this one step further, we can scale these scores, take the softmax, and then multiply
the result by V, thus reducing the entire self-attention step for an entire sequence to
the following computation.
\[ \mathrm{SelfAttention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \tag{9.21} \]
Unfortunately, this process goes a bit too far since the calculation of the compar-
isons in QK^T results in a score for each query value to every key value, including
those that follow the query. This is inappropriate in the setting of language modeling
since guessing the next word is pretty simple if you already know it. To fix this, the
elements in the upper-triangular portion of the comparisons matrix (those above the
main diagonal) are set to −∞ before the softmax, which turns them into zero weights
and thus eliminates any knowledge of words that follow in the sequence.
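The masked matrix computation can be sketched as follows. The sketch assumes that each row of Q, K and V holds the vector for one position, and it uses small pure-Python helpers (all hypothetical names) in place of an optimized matrix library.

```python
import math

NEG_INF = float("-inf")

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_row(row):
    m = max(row)                                   # finite: the diagonal is never masked
    exps = [math.exp(s - m) for s in row]          # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(Q, K, V):
    """softmax(mask(Q K^T / sqrt(d_k))) V, returning outputs and weights."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))               # n x n query-key comparisons
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if j > i:
                scores[i][j] = NEG_INF             # hide positions after the query
            else:
                scores[i][j] /= math.sqrt(d_k)     # scaled dot product
    weights = [softmax_row(r) for r in scores]
    return matmul(weights, V), weights

# Toy matrices with one row per position (made-up values).
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(Q, K, V)
```

Each row of the weight matrix is a probability distribution over the current and preceding positions only; the entries above the diagonal are exactly zero.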

Transformer Blocks
The self-attention calculation lies at the core of what’s called a transformer block,
which, in addition to the self-attention layer, includes additional feedforward layers,
residual connections, and normalizing layers. Fig. 9.17 illustrates a typical trans-
former block consisting of a single attention layer followed by a fully-connected
feedforward layer with residual connections and layer normalizations following each.
These blocks can then be stacked just as was the case for stacked RNNs.
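The wiring of such a block can be sketched independently of the details of its sublayers. In the sketch below the attention and feedforward sublayers are passed in as functions, and the two stand-ins supplied (a causal running average and an elementwise ReLU) are placeholders chosen only to make the example runnable. The layer normalization shown omits the learned gain and bias that practical implementations usually include; that simplification is an assumption of this sketch.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one vector to zero mean and (roughly) unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add_and_normalize(xs, sublayer_outputs):
    """Residual connection followed by layer normalization, per position."""
    return [layer_norm([a + b for a, b in zip(x, s)])
            for x, s in zip(xs, sublayer_outputs)]

def transformer_block(xs, attention_fn, feedforward_fn):
    h = add_and_normalize(xs, attention_fn(xs))                  # attention sublayer
    return add_and_normalize(h, [feedforward_fn(v) for v in h])  # feedforward sublayer

def causal_average(xs):
    """Stand-in for self-attention: average of the inputs seen so far."""
    out = []
    for i in range(len(xs)):
        context = xs[: i + 1]
        out.append([sum(col) / len(context) for col in zip(*context)])
    return out

def relu_feedforward(v):
    """Stand-in for a feedforward layer: elementwise ReLU, no weights."""
    return [max(0.0, a) for a in v]

xs = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0], [2.0, 2.0, 2.0]]
ys = transformer_block(xs, causal_average, relu_feedforward)
stacked = transformer_block(ys, causal_average, relu_feedforward)  # blocks can be stacked
```

Because the block maps n vectors of size d to n vectors of the same size, its output can be fed directly into another block.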

Multihead Attention
The different words in a sentence can relate to each other in many different ways si-
multaneously. For example, distinct syntactic, semantic, and discourse relationships
can hold between verbs and their arguments in a sentence. It would be difficult for
a single transformer block to learn to capture all of the different kinds of parallel
relations among its inputs. Transformers address this issue with multihead self-
attention layers. These are sets of self-attention layers, called heads, that reside in
parallel layers at the same depth in a model, each with its own set of parameters.
Given these distinct sets of parameters, each head can learn different aspects of the
relationships that exist among inputs at the same level of abstraction.
To implement this notion, each head, i, in a self-attention layer is provided with
its own set of key, query and value matrices: W_i^K, W_i^Q and W_i^V. These are used
to project the inputs to the layer, X, separately for each head, with the rest of the
self-attention computation remaining unchanged. The output of a multi-head layer
with h heads consists of h vectors of the same length. To make use of these vec-
tors in further processing, they are combined and then reduced down to the original
input dimension dm . This is accomplished by concatenating the outputs from each
head and then using yet another linear projection to reduce it to the original output
dimension.
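With each row of X holding one input embedding, as Eq. 9.21 requires:
\[ \mathrm{MultiHeadAttn}(Q, K, V) = (\mathrm{head}_1 \oplus \mathrm{head}_2 \oplus \cdots \oplus \mathrm{head}_h)\, W^O \]
\[ \mathrm{head}_i = \mathrm{SelfAttention}(XW_i^Q, XW_i^K, XW_i^V) \]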
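A sketch of the multihead computation is given below. It assumes one input embedding per row of X (so the projections are written X W rather than W X, and W^O multiplies on the right), uses randomly initialized weights in place of learned ones, and all helper names are hypothetical.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_row(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (one row per position)."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = []
    for i, row in enumerate(scores):
        masked = [s / math.sqrt(d_k) if j <= i else float("-inf")
                  for j, s in enumerate(row)]
        weights.append(softmax_row(masked))
    return matmul(weights, V)

def multihead_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one triple per head."""
    head_outputs = [causal_attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
                    for (W_Q, W_K, W_V) in heads]
    # Concatenate the h head outputs position by position, then project with W_O.
    concatenated = [sum((out[t] for out in head_outputs), []) for t in range(len(X))]
    return matmul(concatenated, W_O)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_m, h, d_k = 3, 8, 4, 2                       # toy sizes, with d_v = d_k
X = random_matrix(n, d_m, rng)
heads = [tuple(random_matrix(d_m, d_k, rng) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(h * d_k, d_m, rng)
Y = multihead_attention(X, heads, W_O)            # n rows, each of size d_m
```

The final projection returns each position to the input dimension d_m, so the layer can replace a single-head layer without changing anything around it.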
Figure 9.17 A transformer block with all the layers: inputs x1 ... xn pass through a self-attention layer and then a feedforward layer, each followed by an add-and-normalize step, producing outputs y1 ... yn.

Fig. 9.18 illustrates this approach with 4 self-attention heads. This multihead
layer replaces the single self-attention layer in the transformer block shown earlier
in Fig. 9.17; the rest of the Transformer block with its feedforward layer, residual
connections, and layer norms remains the same.

Positional Embeddings
With RNNs, information about the order of the inputs was baked into the nature of
the models. Unfortunately, the same isn’t true for Transformers; there’s nothing that
would allow such models to make use of information about the relative, or absolute,
positions of the elements of an input sequence. This can be seen from the fact that if
you scramble the order of inputs in the attention computation illustrated earlier you
get exactly the same answer. To address this issue, Transformer inputs are combined
with positional embeddings specific to each position in an input sequence.
Where do we get these positional embeddings? A simple and effective approach
is to start with randomly initialized embeddings corresponding to each possible input
position up to some maximum length. For example, just as we have an embedding
for the word fish, we’ll have an embedding for the position 3. As with word embed-
dings, these positional embeddings are learned along with other parameters during
training. To produce an input embedding that captures positional information, we
just add the word embedding for each input to its corresponding positional embed-
ding. This new embedding serves as the input for further processing.
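As a rough sketch of this scheme, the example below keeps one table of word embeddings and one table of positional embeddings and adds them elementwise. The vocabulary, sizes and random initial values are made up for illustration, and the training step that would update both tables is not shown.

```python
import random

rng = random.Random(0)
d_m = 4          # embedding size (toy value)
max_len = 8      # maximum input length supported by the position table
vocab = ["fish", "swim", "fast"]

# Both tables start out random and would be learned during training.
word_emb = {w: [rng.gauss(0.0, 1.0) for _ in range(d_m)] for w in vocab}
pos_emb = [[rng.gauss(0.0, 1.0) for _ in range(d_m)] for _ in range(max_len)]

def embed(tokens):
    """Add each token's word embedding to the embedding of its position."""
    if len(tokens) > max_len:
        raise ValueError("input is longer than the position table")
    return [[w + p for w, p in zip(word_emb[tok], pos_emb[t])]
            for t, tok in enumerate(tokens)]

inputs = embed(["fish", "swim", "fish"])
```

The two occurrences of fish receive different input vectors because they sit at different positions, which is exactly the information the attention computation otherwise lacks.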
[Figure 9.18 diagram: inputs x1 ... xn feed several parallel self-attention heads, each with its own W_i^Q, W_i^K and W_i^V; the head outputs are concatenated and projected by W^O to give y1 ... yn.]
Figure 9.18 Multihead self-attention: Each of the multihead self-attention layers is provided with its own
set of key, query and value weight matrices. The outputs from each of the layers are concatenated and then
projected down to dmodel , thus producing an output of the right size.

A potential problem with this approach is that there will be plenty of training
examples for the initial positions in our inputs and correspondingly fewer at the
outer length limits. These latter embeddings may be poorly trained and may not
generalize well during testing. An alternative approach to positional embeddings is
to choose a static function that maps integer inputs to real-valued vectors in a
way that captures the inherent relationships among the positions. That is, it captures
the fact that position 4 in an input is more closely related to position 5 than it is to
position 17. A combination of sine and cosine functions with differing frequencies
was used in the original Transformer work.
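One concrete static function of this kind is the sinusoidal encoding of Vaswani et al. (2017), in which even dimensions hold sin(pos / 10000^(2i/d)) and odd dimensions hold the matching cosine. The sketch below implements that formula; the function names and the choice of d_m = 16 are illustrative assumptions.

```python
import math

def sinusoidal_position(pos, d_m):
    """Return the d_m-dimensional sinusoidal encoding of integer position pos."""
    vec = []
    for i in range(0, d_m, 2):                     # i is the even dimension index
        angle = pos / (10000 ** (i / d_m))
        vec.append(math.sin(angle))                # dimension i
        vec.append(math.cos(angle))                # dimension i + 1
    return vec[:d_m]

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_m = 16
p4, p5, p17 = (sinusoidal_position(p, d_m) for p in (4, 5, 17))
near = distance(p4, p5)     # neighbouring positions
far = distance(p4, p17)     # positions 13 steps apart
```

With these settings the encoding of position 4 lies closer to that of position 5 than to that of position 17, and no parameters need to be trained for any position.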

9.4.1 Transformers as Autoregressive Language Models


Now that we’ve seen all the major components of Transformers, let’s examine how
to deploy them as language models via semi-supervised learning. To do this, we’ll
proceed just as we did with the RNN-based approach: given a training corpus of
plain text we’ll train a model to predict the next word in a sequence using teacher
forcing. Fig. 9.19 illustrates the general approach. At each step, given all the pre-
ceding words, the final Transformer layer produces an output distribution over the
entire vocabulary. During training, the probability assigned to the correct word is
used to calculate the cross-entropy loss for each item in the sequence. As with
RNNs, the loss for a training sequence is the average cross-entropy loss over the
entire sequence.
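The loss computation described here can be sketched as follows, assuming the model's output at each step is available as a dictionary from words to probabilities. The distributions below are made up; in practice they would come from the softmax over the vocabulary. The perplexity line uses the standard relation between perplexity and average cross-entropy measured with natural logarithms.

```python
import math

def sequence_loss(predicted_distributions, correct_words):
    """Average cross-entropy: mean of -log P(correct next word) over the sequence."""
    losses = [-math.log(dist[w])
              for dist, w in zip(predicted_distributions, correct_words)]
    return sum(losses) / len(losses)

# Toy example: a uniform distribution over a 4-word vocabulary at every step.
vocab = ["a", "hole", "in", "the"]
uniform = {w: 0.25 for w in vocab}
targets = ["a", "hole", "in", "the"]
loss = sequence_loss([uniform] * len(targets), targets)

# Per-word perplexity is the exponential of the average cross-entropy.
perplexity = math.exp(loss)
```

For the uniform toy model the loss is log 4 and the perplexity is 4, matching the intuition that the model is choosing among four equally likely words.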
Note the key difference between this figure and the earlier RNN-based version
shown in Fig. 9.6. There the calculation of the outputs and the losses at each step
was inherently serial given the recurrence in the calculation of the hidden states.
With Transformers, each training item can be processed in parallel since the output
for each element in the sequence is computed separately. Once trained, we can
compute the perplexity of the resulting model, or autoregressively generate novel
text just as with RNN-based models.
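[Figure 9.19 diagram: the input words "In a hole in the" pass through input embeddings, Transformer block(s), and a softmax over the vocabulary; at each step the loss is computed from the probability assigned to the correct next word ("a hole in the ground"), and the per-step losses L_CE are averaged over the T steps of the sequence.]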
Figure 9.19 Training a Transformer as a language model.

Contextual Generation and Summarization


A simple variation on autoregressive generation that underlies a number of practi-
cal applications uses a prior context to prime the autoregressive generation process.
Fig. 9.20 illustrates this with the task of text completion. Here a standard language
model is given the prefix to some text and is asked to generate a possible completion
to it. Note that as the generation process proceeds, the model has direct access to
the priming context as well as to all of its own subsequently generated outputs. This
ability to incorporate the entirety of the earlier context and generated outputs at each
time step is the key to the power of these models.
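The generation loop itself is simple and can be sketched without a real model. Below, `toy_next_word_distribution` is a hypothetical stand-in for a trained Transformer: it receives the entire context (prefix plus everything generated so far) and returns a distribution over the next word. The end-of-sequence marker `</s>` is likewise an assumption of this sketch.

```python
import random

MEMORIZED = "In a hole in the ground there </s>".split()

def toy_next_word_distribution(context):
    """Stand-in model: P(next word | entire context) from one memorized sentence."""
    n = len(context)
    if n < len(MEMORIZED) and context == MEMORIZED[:n]:
        return {MEMORIZED[n]: 1.0}
    return {"</s>": 1.0}

def generate(prefix, next_word_distribution, max_new_words=20, seed=0):
    """Autoregressively sample a completion for the given prefix."""
    rng = random.Random(seed)
    context = list(prefix)                         # priming context
    completion = []
    for _ in range(max_new_words):
        dist = next_word_distribution(context)     # conditions on the full context
        words = list(dist.keys())
        probs = list(dist.values())
        word = rng.choices(words, weights=probs, k=1)[0]   # sample from the softmax
        if word == "</s>":
            break
        completion.append(word)
        context.append(word)                       # generated word joins the context
    return completion

completion = generate("In a hole in the".split(), toy_next_word_distribution)
```

The important point is the last line of the loop: each sampled word is appended to the context, so every later prediction sees both the prefix and all earlier outputs.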

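[Figure 9.20 diagram: the prefix text "In a hole in the" is passed through input embeddings and Transformer blocks; each next word of the completion text ("ground", "there") is sampled from the softmax and appended to the input for the following step.]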
Figure 9.20 Autoregressive text completion with Transformers.

Text summarization is a practical application of context-based autoregressive
generation. Here, the task is to take a full-length article and produce an effective
summary of it. To train a Transformer-based autoregressive model to perform this
task, we start with a corpus consisting of full-length articles accompanied by their
corresponding summaries. Fig. 9.21 shows an example of this kind of data from a
widely used summarization corpus consisting of CNN and Daily Mail news arti-
cles.

Original Article
The only thing crazier than a guy in snowbound Massachusetts boxing up the powdery white stuff
and offering it for sale online? People are actually buying it. For $89, self-styled entrepreneur
Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box – enough
for 10 to 15 snowballs, he says.
But not if you live in New England or surrounding states. “We will not ship snow to any states
in the northeast!” says Waring’s website, ShipSnowYo.com. “We’re in the business of expunging
snow!”
His website and social media accounts claim to have filled more than 133 orders for snow – more
than 30 on Tuesday alone, his busiest day yet. With more than 45 total inches, Boston has set a
record this winter for the snowiest month in its history. Most residents see the huge piles of snow
choking their yards and sidewalks as a nuisance, but Waring saw an opportunity.
According to Boston.com, it all started a few weeks ago, when Waring and his wife were shov-
eling deep snow from their yard in Manchester-by-the-Sea, a coastal suburb north of Boston.
He joked about shipping the stuff to friends and family in warmer states, and an idea was born.
His business slogan: “Our nightmare is your dream!” At first, ShipSnowYo sold snow packed
into empty 16.9-ounce water bottles for $19.99, but the snow usually melted before it reached its
destination...
Summary
Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box – enough
for 10 to 15 snowballs, he says. But not if you live in New England or surrounding states.
Figure 9.21 Examples of articles and summaries from the CNN/Daily Mail corpus (Hermann et al., 2015b),
(Nallapati et al., 2016).

A surprisingly effective approach to applying Transformers to summarization is
to append a summary to each full-length article in a corpus, with a unique marker
separating the two. More formally, each article-summary pair (x1, ..., xm), (y1, ..., yn)
in a training corpus is converted into a single training instance (x1, ..., xm, δ, y1, ..., yn)
with an overall length of n + m + 1. These training instances are treated as long
sentences and then used to train an autoregressive language model using teacher
forcing, exactly as we did earlier.
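The data preparation step can be sketched as below. The delimiter string `<sep>` and the function names are hypothetical; any token that never occurs in the articles or summaries would serve as the marker δ.

```python
DELIMITER = "<sep>"   # hypothetical marker token playing the role of delta

def make_training_instance(article_tokens, summary_tokens):
    """Concatenate article, delimiter and summary into one long sequence."""
    return article_tokens + [DELIMITER] + summary_tokens

def teacher_forcing_pairs(instance):
    """Inputs and next-word targets for training an autoregressive model."""
    return instance[:-1], instance[1:]

def make_summarization_prompt(article_tokens):
    """At test time the article plus delimiter primes generation of a summary."""
    return article_tokens + [DELIMITER]

article = "Kyle Waring will ship you snow".split()     # m = 6 tokens
summary = "Waring ships snow".split()                  # n = 3 tokens
instance = make_training_instance(article, summary)    # length n + m + 1
inputs, targets = teacher_forcing_pairs(instance)
prompt = make_summarization_prompt(article)
```

At the position of the delimiter the training target is the first summary word, which is how the model learns that the marker signals the start of a summary.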
Once trained, full articles ending with the special marker are used as the context
to prime the generation process to produce a summary as illustrated in Fig. 9.22.
Note that, in contrast to RNNs, the model has access to the original article as well
as to the newly generated text throughout the process.
As we’ll see in later chapters, variations on this simple scheme are the basis
for successful text-to-text applications including machine translation, summariza-
tion and question answering.

9.5 Potential Harms from Language Models


Large neural language models exhibit many of the potential harms discussed in
Chapter 4 and Chapter 6. Problems may occur whenever language models are used
for text generation, such as in assistive technologies like web search query comple-
tion or predictive typing for email (Olteanu et al., 2020).
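[Figure 9.22 diagram: the original story ("The only ... reached its destination") followed by the delimiter primes the model, which then generates the summary ("Kyle Waring will ...") one word at a time, feeding each generated word back in as input.]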
Figure 9.22 Summarization with Transformers.

For example, language models can generate toxic language. Gehman et al.
(2020) show that many kinds of completely non-toxic prompts can nonetheless lead
large language models to output hate speech and abuse. Brown et al. (2020) and
Sheng et al. (2019) showed that large language models generate sentences display-
ing negative attitudes toward minority identities such as being Black or gay.
Indeed, language models are biased in a number of ways by the distributions
of their training data. Gehman et al. (2020) show that large language model train-
ing datasets include toxic text scraped from banned sites. In addition to problems
of toxicity, internet data is disproportionately generated by authors from developed
countries, and many large language models train on data from Reddit, whose authors
skew male and young. Such biased population samples likely skew the resulting
generation away from the perspectives or topics of underrepresented populations.
Furthermore, language models can amplify demographic and other biases in train-
ing data, just as we saw for embedding models in Chapter 6.
Language models can also be a tool for generating text for misinformation,
phishing, radicalization, and other socially harmful activities (Brown et al., 2020).
McGuffie and Newhouse (2020) show how large language models generate text
that emulates online extremists, with the risk of amplifying extremist movements
and their attempts to radicalize and recruit.
Finally, there are important privacy issues. Language models, like other machine
learning models, can leak information about their training data. It is thus possible
for an adversary to extract individual training-data phrases from a language model
such as an individual person’s name, phone number, and address (Carlini et al. 2020,
using the techniques introduced by Henderson et al. 2017). This is a problem if large
language models are trained on private datasets such as electronic health records
(EHRs).
Mitigating all these harms is an important but unsolved research question in
NLP. Extra pre-training (Gururangan et al., 2020) on non-toxic subcorpora seems to
reduce a language model’s tendency to generate toxic language somewhat (Gehman
et al., 2020). And analyzing the data used to pretrain large language models is
important to understand toxicity and bias in generation, as well as privacy, making
it extremely important that language models include datasheets (page ??) or model
cards (page 73) giving full replicable information on the corpora used to train them.
9.6 Summary
This chapter has introduced the concept of recurrent neural networks and how they
can be applied to language problems. Here’s a summary of the main points that we
covered:
• In simple Recurrent Neural Networks, sequences are processed naturally, one
element at a time.
• The output of a neural unit at a particular point in time is based both on the
current input and on the value of the hidden layer from the previous time step.
• RNNs can be trained with a straightforward extension of the backpropagation
algorithm, known as backpropagation through time (BPTT).
• Common language-based applications for RNNs include:
– Probabilistic language modeling, where the model assigns a probability
to a sequence, or to the next element of a sequence given the preceding
words.
– Auto-regressive generation using a trained language model.
– Sequence labeling, where each element of a sequence is assigned a label,
as with part-of-speech tagging.
– Sequence classification, where an entire text is assigned to a category, as
in spam detection, sentiment analysis or topic classification.
• Simple recurrent networks often fail since it is extremely difficult to success-
fully train them due to problems maintaining useful gradients over time.
• More complex gated architectures such as LSTMs and GRUs are designed
to overcome these issues by explicitly managing the task of deciding what to
remember and forget in their hidden and context layers.

Bibliographical and Historical Notes


Influential investigations of the kind of simple RNNs discussed here were conducted
in the context of the Parallel Distributed Processing (PDP) group at UC San Diego
in the 1980s. Much of this work was directed at human cognitive modeling rather
than practical NLP applications (Rumelhart and McClelland 1986, McClelland and
Rumelhart 1986). Models using recurrence at the hidden layer in a feedforward net-
work (Elman networks) were introduced by Elman (1990). Similar architectures
were investigated by Jordan (1986) with a recurrence from the output layer, and
Mathis and Mozer (1995) with the addition of a recurrent context layer prior to the
hidden layer. The possibility of unrolling a recurrent network into an equivalent
feedforward network is discussed in Rumelhart and McClelland (1986).
In parallel with work in cognitive modeling, RNNs were investigated extensively
in the continuous domain in the signal processing and speech communities (Giles
et al., 1994). Schuster and Paliwal (1997) introduced bidirectional RNNs and de-
scribed results on the TIMIT phoneme transcription task.
While theoretically interesting, the difficulty with training RNNs and manag-
ing context over long sequences impeded progress on practical applications. This
situation changed with the introduction of LSTMs in Hochreiter and Schmidhuber
(1997). Impressive performance gains were demonstrated on tasks at the bound-
ary of signal processing and language processing including phoneme recognition
(Graves and Schmidhuber, 2005), handwriting recognition (Graves et al., 2007) and
most significantly speech recognition (Graves et al., 2013b).
Interest in applying neural networks to practical NLP problems surged with the
work of Collobert and Weston (2008) and Collobert et al. (2011). These efforts made
use of learned word embeddings, convolutional networks, and end-to-end training.
They demonstrated near state-of-the-art performance on a number of standard shared
tasks including part-of-speech tagging, chunking, named entity recognition and se-
mantic role labeling without the use of hand-engineered features.
Approaches that married LSTMs with pre-trained collections of word embeddings
based on word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)
quickly came to dominate many common tasks: part-of-speech tagging (Ling et al.,
2015), syntactic chunking (Søgaard and Goldberg, 2016), named entity recog-
nition via IOB tagging (Chiu and Nichols 2016, Ma and Hovy 2016), opinion mining
(Irsoy and Cardie, 2014), semantic role labeling (Zhou and Xu, 2015a) and AMR
parsing (Foland and Martin, 2016). As with the earlier surge of progress involving
statistical machine learning, these advances were made possible by the availability
of training data provided by CoNLL, SemEval, and other shared tasks, as well as
shared resources such as OntoNotes (Pradhan et al., 2007b), and PropBank (Palmer
et al., 2005).
CHAPTER 10 Contextual Embeddings

Placeholder

CHAPTER 11 Machine Translation and Encoder-Decoder Models
“I want to talk the dialect of your people. It’s no use of talking unless
people understand what you say.”
Zora Neale Hurston, Moses, Man of the Mountain 1939, p. 121

This chapter introduces machine translation (MT), the use of computers to trans-
late from one language to another.
Of course translation, in its full generality, such as the translation of literature, or
poetry, is a difficult, fascinating, and intensely human endeavor, as rich as any other
area of human creativity.
Machine translation in its present form therefore focuses on a number of very
practical tasks. Perhaps the most common current use of machine translation is
for information access. We might want to translate some instructions on the web,
perhaps the recipe for a favorite dish, or the steps for putting together some furniture.
Or we might want to read an article in a newspaper, or get information from an
online resource like Wikipedia or a government webpage in a foreign language.
MT for information access is probably one of the most common uses of NLP
technology, and Google Translate alone (shown above) translates hundreds of
billions of words a day between over 100 languages.
Another common use of machine translation is to aid human translators. MT sys-
tems are routinely used to produce a draft translation that is fixed up in a post-editing
phase by a human translator. This task is often called computer-aided translation
or CAT. CAT is commonly used as part of localization: the task of adapting content
or a product to a particular language community.
Finally, a more recent application of MT is to in-the-moment human commu-
nication needs. This includes incremental translation, translating speech on-the-fly
before the entire sentence is complete, as is commonly used in simultaneous inter-
pretation. Image-centric translation can be used for example to use OCR of the text
on a phone camera image as input to an MT system to translate menus or street signs.
The standard algorithm for MT is the encoder-decoder network, also called the
sequence to sequence network, an architecture that can be implemented with RNNs
or with Transformers. We’ve seen in prior chapters that RNN or Transformer archi-
tectures can be used to do classification (for example to map a sentence to a positive
or negative sentiment tag for sentiment analysis), or can be used to do sequence la-
beling (for example to assign each word in an input sentence with a part-of-speech,
or with a named entity tag). For part-of-speech tagging, recall that the output tag is
associated directly with each input word, and so we can just model the tag as output
yt for each input word xt .
Encoder-decoder or sequence-to-sequence models are used for a different kind
of sequence modeling in which the output sequence is a complex function of the
entire input sequence; we must map from a sequence of input words or tokens to a
sequence of tags that are not merely direct mappings from individual words.
Machine translation is exactly such a task: the words of the target language
don’t necessarily agree with the words of the source language in number or order.
Consider translating the following made-up English sentence into Japanese.
(11.1) English: He wrote a letter to a friend
Japanese: tomodachi ni tegami-o kaita
friend to letter wrote
Note that the elements of the sentences are in very different places in the different
languages. In English, the verb is in the middle of the sentence, while in Japanese,
the verb kaita comes at the end. The Japanese sentence doesn’t require the pronoun
he, while English does.
Such differences between languages can be quite complex. In the following ac-
tual sentence from the United Nations, notice the many changes between the Chinese
sentence (we’ve given in red a word-by-word gloss of the Chinese characters) and
its English equivalent.
(11.2) 大会/General Assembly 在/on 1982年/1982 12月/December 10日/10 通过了/adopted
第37号/37th 决议/resolution ,核准了/approved 第二次/second 探索/exploration
及/and 和平/peaceful 利用/using 外层空间/outer space 会议/conference 的/of
各项/various 建议/suggestions 。
On 10 December 1982 , the General Assembly adopted resolution 37 in
which it endorsed the recommendations of the Second United Nations
Conference on the Exploration and Peaceful Uses of Outer Space .
Note the many ways the English and Chinese differ. For example the order-
ing differs in major ways (the Chinese order of the noun phrase is “peaceful using
outer space conference of suggestions” while the English has “suggestions of the ...
conference on peaceful use of outer space”). And the order differs in minor ways
(the date is ordered differently). English requires the in many places that Chinese
doesn’t, and adds some details (like “in which” and “it”) that aren’t necessary in
Chinese. Chinese doesn’t grammatically mark plurality on nouns (unlike English,
which has the “-s” in “recommendations”), and so the Chinese must use the modi-
fier 各项/various to make it clear that there is not just one recommendation. English
capitalizes some words but not others.
Encoder-decoder networks are very successful at handling these sorts of com-
plicated cases of sequence mappings. Indeed, the encoder-decoder algorithm is not
just for MT; it’s the state of the art for many other tasks where complex mappings
between two sequences are involved. These include summarization (where we map
from a long text to its summary, like a title or an abstract), dialogue (where we map
from what the user said to what our dialogue system should respond), semantic
parsing (where we map from a string of words to a semantic representation like
logic or SQL), and many others.
We’ll introduce the algorithm in Section 11.2, and in following sections
give important components of the model like beam search decoding, and we’ll
discuss how MT is evaluated, introducing the popular BLEU metric.
But first, in the next section, we summarize the linguistic background
to MT: key differences among languages that are important to consider in
the task of translation.
11.1 Language Divergences and Typology


Some aspects of human language seem to be universal, holding true for every
language, or are statistical universals, holding true for most languages. Many universals
arise from the functional role of language as a communicative system used by humans.
Every language, for example, seems to have words for referring to people, for talking
about eating and drinking, for being polite or not. There are also structural linguistic
universals; for example, every language seems to have nouns and verbs (Chapter 8),
ways to ask questions or issue commands, and linguistic mechanisms for indicating
agreement or disagreement.
Yet languages also differ in many ways, and an understanding of what causes
such translation divergences will help us build better MT models. We often
distinguish the idiosyncratic and lexical differences that must be dealt with one by one
(the word for “dog” differs wildly from language to language), from systematic
differences that we can model in a general way (many languages put the verb before the
direct object; others put the verb after the direct object). The study of these
systematic cross-linguistic similarities and differences is called linguistic typology. This
section sketches some typological facts that impact machine translation; the
interested reader should also look into WALS, the World Atlas of Language Structures,
which gives many typological facts about languages (Dryer and Haspelmath, 2013).

11.1.1 Word Order Typology


As we hinted in our example above comparing English and Japanese, languages
differ in the basic word order of verbs, subjects, and objects in simple declarative
clauses. German, French, English, and Mandarin, for example, are all SVO
(Subject-Verb-Object) languages, meaning that the verb tends to come between
the subject and object. Hindi and Japanese, by contrast, are SOV languages, meaning
that the verb tends to come at the end of basic clauses, and Irish and Arabic are
VSO languages. Two languages that share their basic word order type often have
other similarities. For example, VO languages generally have prepositions, whereas
OV languages generally have postpositions.
Let’s look in more detail at the example we saw above. In this SVO English
sentence, the verb wrote is followed by its object a letter and the prepositional phrase
to a friend, in which the preposition to is followed by its argument a friend. Arabic,
with a VSO order, also has the verb before the object and prepositions. By contrast,
in the Japanese example that follows, each of these orderings is reversed; the verb is
preceded by its arguments, and the postposition follows its argument.
(11.3) English: He wrote a letter to a friend
Japanese: tomodachi ni tegami-o kaita
friend to letter wrote
Arabic: katabt risāla li ṡadq
wrote letter to friend
Other kinds of ordering preferences vary idiosyncratically from language to
language. In some SVO languages (like English and Mandarin) adjectives tend to
appear before nouns, while in other languages, like Spanish and Modern Hebrew,
adjectives appear after the noun:
(11.4) Spanish bruja verde English green witch
Figure 11.1 Examples of other word order differences: (a) In German, adverbs occur in
initial position that in English are more natural later, and tensed verbs occur in second posi-
tion. (b) In Mandarin, preposition phrases expressing goals often occur pre-verbally, unlike
in English.

Fig. 11.1 shows examples of other word order differences. All of these word
order differences between languages can cause problems for translation, requiring
the system to do huge structural reorderings as it generates the output.

11.1.2 Lexical Divergences


Of course we also need to translate the individual words from one language to an-
other. For any translation, the appropriate word can vary depending on the context.
The English source-language word bass, for example, can appear in Spanish as the
fish lubina or the musical instrument bajo. German uses two distinct words for what
in English would be called a wall: Wand for walls inside a building, and Mauer for
walls outside a building. Where English uses the word brother for any male sib-
ling, Chinese and many other languages have distinct words for older brother and
younger brother (Mandarin gege and didi, respectively). In all these cases, trans-
lating bass, wall, or brother from English would require a kind of specialization,
disambiguating the different uses of a word. For this reason the fields of MT and
Word Sense Disambiguation (Chapter 18) are closely linked.
Sometimes one language places more grammatical constraints on word choice
than another. We saw above that English marks nouns for whether they are singular
or plural, while Mandarin doesn’t. French and Spanish, for example, mark grammatical
gender on adjectives, so translating from English into French requires specifying
adjective gender.
The way that languages differ in lexically dividing up conceptual space may be
more complex than this one-to-many translation problem, leading to many-to-many
mappings. For example, Fig. 11.2 summarizes some of the complexities discussed
by Hutchins and Somers (1992) in translating English leg, foot, and paw to French.
For example, when leg is used about a human it’s translated as French jambe; but
about the leg of a journey, as French etape; if the leg is of a chair, we use French
pied.
Further, one language may have a lexical gap, where no word or phrase, short
of an explanatory footnote, can express the exact meaning of a word in the other
language. For example, English does not have a word that corresponds neatly to
Mandarin xiào or Japanese oyakōkō (in English one has to make do with awkward
phrases like filial piety or loving child, or good son/daughter for both).
[Figure 11.2 diagram: English paw, leg, and foot linked to French etape, patte, jambe, and pied, with links labeled ANIMAL, JOURNEY, HUMAN, BIRD, and CHAIR.]
Figure 11.2 The complex overlap between English leg, foot, etc., and various French
translations as discussed by Hutchins and Somers (1992).

Finally, languages differ systematically in how the conceptual properties of an
event are mapped onto specific words. Talmy (1985, 1991) noted that languages
can be characterized by whether direction of motion and manner of motion are
marked on the verb or on the “satellites”: particles, prepositional phrases, or
adverbial phrases. For example, a bottle floating out of a cave would be described in
English with the direction marked on the particle out, while in Spanish the direction
would be marked on the verb:


(11.5) English: The bottle floated out.
Spanish: La botella salió flotando.
The bottle exited floating.
Verb-framed languages mark the direction of motion on the verb (leaving the
satellites to mark the manner of motion), like Spanish acercarse ‘approach’,
alcanzar ‘reach’, entrar ‘enter’, salir ‘exit’. Satellite-framed languages mark the
direction of motion on the satellite (leaving the verb to mark the manner of motion),
like English crawl out, float off, jump down, run after. Languages like Japanese,
Tamil, and the many languages in the Romance, Semitic, and Mayan language
families are verb-framed; Chinese as well as non-Romance Indo-European languages
like English, Swedish, Russian, Hindi, and Farsi are satellite-framed (Talmy 1991,
Slobin 1996).

11.1.3 Morphological Typology


Morphologically, languages are often characterized along two dimensions of
variation. The first is the number of morphemes per word, ranging from isolating
languages like Vietnamese and Cantonese, in which each word generally has one
morpheme, to polysynthetic languages like Siberian Yupik (“Eskimo”), in which a
single word may have very many morphemes, corresponding to a whole sentence in
English. The second dimension is the degree to which morphemes are segmentable,
ranging from agglutinative languages like Turkish, in which morphemes have
relatively clean boundaries, to fusion languages like Russian, in which a single affix
may conflate multiple morphemes, like -om in the word stolom (table-SG-INSTR-DECL1),
which fuses the distinct morphological categories instrumental, singular,
and first declension.
Translating between languages with rich morphology requires dealing with struc-
ture below the word level, and for this reason modern systems generally use subword
models like the wordpiece or BPE models of Section 11.7.1.

11.1.4 Referential density


Finally, languages vary along a typological dimension related to the things they tend
to omit. Some languages, like English, require that we use an explicit pronoun when
talking about a referent that is given in the discourse. In other languages, however,
we can sometimes omit pronouns altogether, as the following example from Spanish
shows (here we use the ∅-notation; we’ll introduce this and discuss this issue further
in Chapter 22):

(11.6) [El jefe]_i dio con un libro. ∅_i Mostró a un descifrador ambulante.
[The boss] came upon a book. [He] showed it to a wandering decoder.
Languages that can omit pronouns are called pro-drop languages. Even among
the pro-drop languages, there are marked differences in frequencies of omission.
Japanese and Chinese, for example, tend to omit far more than does Spanish. This
dimension of variation across languages is called the dimension of referential
density. We say that languages that tend to use more pronouns are more referentially
dense than those that use more zeros. Referentially sparse languages, like Chinese or
Japanese, that require the hearer to do more inferential work to recover antecedents
are also called cold languages. Languages that are more explicit and make it easier
for the hearer are called hot languages. The terms hot and cold are borrowed from
Marshall McLuhan’s (1964) distinction between hot media like movies, which fill in
many details for the viewer, versus cold media like comics, which require the reader
to do more inferential work to fill out the representation (Bickel, 2003).
Translating from languages with extensive pro-drop, like Chinese or Japanese, to
non-pro-drop languages like English can be difficult since the model must somehow
identify each zero and recover who or what is being talked about in order to insert
the proper pronoun.

11.2 The Encoder-Decoder Model


Encoder-decoder networks, or sequence-to-sequence networks, are models
capable of generating contextually appropriate, arbitrary-length output sequences.
Encoder-decoder networks have been applied to a very wide range of applications
including machine translation, summarization, question answering, and dialogue.
The key idea underlying these networks is the use of an encoder network that
takes an input sequence and creates a contextualized representation of it, often called
the context. This representation is then passed to a decoder which generates a
task-specific output sequence. Fig. 11.3 illustrates the architecture.

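[Figure 11.3 diagram: the input x1 x2 … xn enters the Encoder, which produces the Context, which is passed to the Decoder, which produces the output y1 y2 … ym.]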

Figure 11.3 The encoder-decoder architecture. The context is a function of the hidden
representations of the input, and may be used by the decoder in a variety of ways.

Encoder-decoder networks consist of three components:

1. An encoder that accepts an input sequence, $x_1^n$, and generates a corresponding
sequence of contextualized representations, $h_1^n$. LSTMs, GRUs, convolutional
networks, and Transformers can all be employed as encoders.
2. A context vector, $c$, which is a function of $h_1^n$, and conveys the essence of the
input to the decoder.
3. A decoder, which accepts $c$ as input and generates an arbitrary-length
sequence of hidden states $h_1^m$, from which a corresponding sequence of output
states $y_1^m$ can be obtained. Just as with encoders, decoders can be realized by
any kind of sequence architecture.

11.3 Encoder-Decoder with RNNs


Let’s begin by describing an encoder-decoder network based on a pair of RNNs.2
Recall the conditional RNN language model from Chapter 9 for computing p(y),
the probability of a sequence y. Like any language model, we can break down the
probability as follows:

$$p(y) = p(y_1)\,p(y_2 \mid y_1)\,p(y_3 \mid y_1, y_2) \cdots p(y_m \mid y_1, \ldots, y_{m-1}) \tag{11.7}$$

At a particular time t, we pass the prefix of t − 1 tokens through the language


model, using forward inference to produce a sequence of hidden states, ending with
the hidden state corresponding to the last word of the prefix. We then use the final
hidden state of the prefix as our starting point to generate the next token.
More formally, if g is an activation function like tanh or ReLU, a function of
the input at time t and the hidden state at time t − 1, and f is a softmax over the
set of possible vocabulary items, then at time t the output yt and hidden state ht are
computed as:

$$h_t = g(h_{t-1}, x_t) \tag{11.8}$$
$$y_t = f(h_t) \tag{11.9}$$
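
As a minimal sketch of Eqs. 11.8 and 11.9, the following pure-Python code takes g to be a tanh over a weighted sum of the previous hidden state and the current input, and f to be a softmax over a linear output layer. The weight matrices `U`, `W`, and `V_out`, their toy values, and the helper names are assumptions made for illustration; they are not given in the text above.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Turn a vector of scores into a probability distribution."""
    top = max(z)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(h_prev, x_t, U, W):
    """Eq. 11.8 with g = tanh (assumed): h_t = tanh(U h_{t-1} + W x_t)."""
    rec = matvec(U, h_prev)
    inp = matvec(W, x_t)
    return [math.tanh(r + i) for r, i in zip(rec, inp)]

def output_distribution(h_t, V_out):
    """Eq. 11.9 with f = softmax over a linear layer (assumed)."""
    return softmax(matvec(V_out, h_t))

# Toy sizes (assumed): 2-dim embeddings, 2-dim hidden state, 3-word vocabulary.
U = [[0.5, -0.1], [0.2, 0.3]]
W = [[1.0, 0.0], [0.0, 1.0]]
V_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

h = [0.0, 0.0]                       # initial hidden state h_0
for x in ([1.0, 0.0], [0.0, 1.0]):   # embeddings of a two-token prefix
    h = rnn_step(h, x, U, W)
y = output_distribution(h, V_out)    # distribution over the next token
```

The loop mirrors the forward inference described above: the prefix is consumed one token at a time, and only the final hidden state is used to score the next token.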

We only have to make one slight change to turn this language model with
autoregressive generation into a translation model that can translate from a source text
in one language to a target text in a second: add a sentence separation marker at
the end of the source text, and then simply concatenate the target text. We briefly
introduced this idea of a sentence separator token in Chapter 9 when we considered
using a Transformer language model to do summarization, by training a conditional
language model.
If we call the source text x and the target text y, we are computing the probability
p(y|x) as follows:

$$p(y \mid x) = p(y_1 \mid x)\,p(y_2 \mid y_1, x)\,p(y_3 \mid y_1, y_2, x) \cdots p(y_m \mid y_1, \ldots, y_{m-1}, x) \tag{11.10}$$

Fig. 11.4 shows the setup for a simplified version of the encoder-decoder model
(we’ll see the full model, which requires attention, in the next section).
Fig. 11.4 shows an English source text (“the green witch arrived”), a sentence
separator token (<s>), and a Spanish target text (“llegó la bruja verde”). To
translate a source text, we run it through the network performing forward inference to
generate hidden states until we get to the end of the source. Then we begin
autoregressive generation, asking for a word in the context of the hidden layer from the
end of the source input as well as the end-of-sentence marker. Subsequent words
are conditioned on the previous hidden state and the embedding for the last word
generated.
2 Later we’ll see how to use pairs of Transformers as well; it’s even possible to use separate architectures
for the encoder and decoder.
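[Figure 11.4 diagram: the source text “the green witch arrived”, the separator <s>, and the target tokens “llegó la bruja verde” pass through the embedding layer, hidden layer(s), and softmax. The softmax output over the source is ignored; starting from the final source hidden state hn, the network outputs the target text “llegó la bruja verde </s>”.]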

Figure 11.4 Translating a single sentence (inference time) in the basic RNN version of encoder-decoder ap-
proach to machine translation. Source and target sentences are concatenated with a separator token in between,
and the decoder uses context information from the encoder’s last hidden state.

Let’s formalize and generalize this model a bit in Fig. 11.5. (To help keep things
straight, we’ll use the superscripts e and d where needed to distinguish the hidden
states of the encoder and the decoder.) The elements of the network on the left
process the input sequence x and comprise the encoder. While our simplified fig-
ure shows only a single network layer for the encoder, stacked architectures are the
norm, where the output states from the top layer of the stack are taken as the fi-
nal representation. A widely used encoder design makes use of stacked biLSTMs
where the hidden states from top layers from the forward and backward passes are
concatenated as described in Chapter 9 to provide the contextualized representations
for each time step.

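[Figure 11.5 diagram: the encoder reads x1 x2 x3 … xn through the embedding layer and hidden layer(s), producing hidden states $h_1^e, h_2^e, h_3^e, \ldots, h_n^e$ (its softmax output is ignored during encoding). The final state $h_n^e = c = h_0^d$ initializes the decoder, which after the separator <s> produces hidden states $h_1^d, h_2^d, h_3^d, h_4^d, \ldots$ and the outputs y1 y2 y3 y4 </s>.]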

Figure 11.5 A more formal version of translating a sentence at inference time in the basic RNN-based
encoder-decoder architecture. The final hidden state of the encoder RNN, $h_n^e$, serves as the context for the
decoder in its role as $h_0^d$ in the decoder RNN.

The entire purpose of the encoder is to generate a contextualized representation
of the input. This representation is embodied in the final hidden state of the encoder,
$h_n^e$. This representation, also called $c$ for context, is then passed to the decoder.
The decoder network on the right takes this state and uses it to initialize the first
hidden state of the decoder. That is, the first decoder RNN cell uses $c$ as its prior
hidden state $h_0^d$. The decoder autoregressively generates a sequence of outputs, an
element at a time, until an end-of-sequence marker is generated. Each hidden state
is conditioned on the previous hidden state and the output generated in the previous
state.

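[Figure 11.6 diagram: the context c feeds into every decoder hidden state $h_1^d, h_2^d, \ldots, h_i^d$, which produce the outputs $y_1, y_2, \ldots, y_i$.]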

Figure 11.6 Allowing every hidden state of the decoder (not just the first decoder state) to
be influenced by the context c produced by the encoder.

One weakness of this approach as described so far is that the influence of the
context vector, c, will wane as the output sequence is generated. A solution is to
make the context vector c available at each step in the decoding process by adding
it as a parameter to the computation of the current hidden state, using the following
equation (illustrated in Fig. 11.6):

$$h_t^d = g(\hat{y}_{t-1}, h_{t-1}^d, c) \tag{11.11}$$

Now we’re ready to see the full equations for this version of the decoder in the basic
encoder-decoder model, with context available at each decoding timestep. Recall
that g is a stand-in for some flavor of RNN and $\hat{y}_{t-1}$ is the embedding for the output
sampled from the softmax at the previous step:

$$\begin{aligned}
c &= h_n^e \\
h_0^d &= c \\
h_t^d &= g(\hat{y}_{t-1}, h_{t-1}^d, c) \\
z_t &= f(h_t^d) \\
y_t &= \mathrm{softmax}(z_t)
\end{aligned} \tag{11.12}$$

Finally, as shown earlier, the output y at each time step consists of a softmax com-
putation over the set of possible outputs (the vocabulary, in the case of language
modeling or MT). We compute the most likely output at each time step by taking the
argmax over the softmax output:

$$\hat{y}_t = \operatorname{argmax}_{w \in V} P(w \mid x, y_1 \ldots y_{t-1}) \tag{11.13}$$
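
The decoder equations and the argmax step can be put together in a short sketch. This is an illustration under stated assumptions, not the chapter's implementation: g is taken to be a tanh over a weighted sum of its three arguments, f is taken to be a linear layer, and every weight matrix, the toy vocabulary, and the function names `encode` and `greedy_decode` are hypothetical.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    top = max(z)
    exps = [math.exp(s - top) for s in z]
    total = sum(exps)
    return [e / total for e in exps]

def encode(source_embs, W_e, U_e):
    """Run the encoder RNN and return its final hidden state h_n^e."""
    h = [0.0] * len(U_e)
    for x in source_embs:
        h = [math.tanh(a + b) for a, b in zip(matvec(W_e, x), matvec(U_e, h))]
    return h

def greedy_decode(c, emb, W_d, U_d, C_d, V_out, sep_id, eos_id, max_len=10):
    """Generate token ids with h_0^d = c and with c fed in at every step."""
    h, prev, output = c, sep_id, []
    for _ in range(max_len):
        # h_t^d = g(y_hat_{t-1}, h_{t-1}^d, c), with g = tanh of a weighted sum
        parts = zip(matvec(W_d, emb[prev]), matvec(U_d, h), matvec(C_d, c))
        h = [math.tanh(a + b + d) for a, b, d in parts]
        y = softmax(matvec(V_out, h))                  # y_t = softmax(z_t)
        prev = max(range(len(y)), key=y.__getitem__)   # Eq. 11.13: argmax
        output.append(prev)
        if prev == eos_id:                             # stop at end of sequence
            break
    return output

# Toy parameters (assumed): ids 0 = separator, 1 = end of sentence, 2 and 3 = words.
emb = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
H2 = [[0.5, 0.0], [0.0, 0.5]]
V_out = [[0.0, 0.0], [1.0, 1.0], [1.0, -1.0], [-1.0, 1.0]]

c = encode([emb[2], emb[3]], W_e=I2, U_e=H2)
tokens = greedy_decode(c, emb, W_d=I2, U_d=H2, C_d=H2, V_out=V_out,
                       sep_id=0, eos_id=1)
```

The `max_len` cap is a practical guard added here, since an untrained toy model may never emit the end-of-sequence token.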

There are also various ways to make the model a bit more powerful. For example,
we can help the model keep track of what has already been generated and what
hasn’t by conditioning the output layer $y_t$ not solely on the hidden state $h_t^d$ and
the context $c$ but also on the output $\hat{y}_{t-1}$ generated at the previous timestep:

$$y_t = \mathrm{softmax}(\hat{y}_{t-1}, z_t, c) \tag{11.14}$$

11.3.1 Training the Encoder-Decoder Model


Encoder-decoder architectures are trained end-to-end, just as with the RNN language
models of Chapter 9. Each training example is a tuple of paired strings, a source and
a target. Concatenated with a separator token, these source-target pairs can now
serve as training data.
For MT, the training data typically consists of sets of sentences and their transla-
tions. These can be drawn from standard datasets of aligned sentence pairs, as we’ll
discuss in Section 11.7.2. Once we have a training set, the training itself proceeds
as with any RNN-based language model. The network is given the source text and
then starting with the separator token is trained autoregressively to predict the next
word, as shown in Fig. 11.7.

[Figure 11.7 diagram: the encoder reads “the green witch arrived” (x1 x2 x3 x4); after the separator <s>, the decoder is fed the gold target tokens “llegó la bruja verde” and is trained to predict the gold answers “llegó la bruja verde </s>” (y1 … y5). The per-word loss is $L_i = -\log P(y_i)$, and the total loss is the average cross-entropy loss per target word, $L = \frac{1}{T}\sum_{i=1}^{T} L_i$.]

Figure 11.7 Training the basic RNN encoder-decoder approach to machine translation. Note that in the
decoder we usually don’t propagate the model’s softmax outputs ŷt , but use teacher forcing to force each input
to the correct gold value for training. We compute the softmax output distribution over ŷ in the decoder in order
to compute the loss at each token, which can then be averaged to compute a loss for the sentence.

Note the differences between training (Fig. 11.7) and inference (Fig. 11.4) with
respect to the outputs at each time step. The decoder during inference uses its own
estimated output $\hat{y}_t$ as the input for the next time step $x_{t+1}$. Thus the decoder will
tend to deviate more and more from the gold target sentence as it keeps generating
more tokens. In training, therefore, it is more common to use teacher forcing in the
decoder. Teacher forcing means that we force the system to use the gold target token
from training as the next input $x_{t+1}$, rather than allowing it to rely on the (possibly
erroneous) decoder output $\hat{y}_t$. This speeds up training.
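
To make teacher forcing concrete, here is a small sketch of how the training pairs and the sentence loss could be laid out. The function names and the per-token probabilities are made up for illustration; a real system would obtain the probabilities from the decoder's softmax.

```python
import math

def teacher_forcing_pairs(source, target, sep="<s>", eos="</s>"):
    """Pair each gold input context with the gold token to predict next."""
    inputs = source + [sep] + target   # what the network reads (gold tokens only)
    gold = target + [eos]              # what it must predict after the separator
    start = len(source)                # index of the separator token
    return [(inputs[: start + i + 1], gold[i]) for i in range(len(gold))]

def sentence_loss(gold_probs):
    """Average cross-entropy per target word: L = (1/T) * sum_i -log P(y_i)."""
    return sum(-math.log(p) for p in gold_probs) / len(gold_probs)

pairs = teacher_forcing_pairs(
    ["the", "green", "witch", "arrived"], ["llegó", "la", "bruja", "verde"]
)
# Probabilities the model assigned to each gold token (made-up numbers).
loss = sentence_loss([0.5, 0.25, 0.5, 0.25, 1.0])
```

Because every context in `pairs` is built from gold tokens, an early mistake by the model never contaminates the inputs for later positions during training.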

11.4 Attention
The simplicity of the encoder-decoder model is its clean separation of the encoder,
which builds a representation of the source text, from the decoder, which uses
this context to generate a target text. In the model as we’ve described it so far, this
context vector is $h_n$, the hidden state of the last (nth) time step of the source text.
This final hidden state is thus acting as a bottleneck: it must represent absolutely
everything about the meaning of the source text, since the only thing the decoder
knows about the source text is what’s in this context vector. Information at the
beginning of the sentence, especially for long sentences, may not be equally well
represented in the context vector.

[Figure 11.8 diagram: the Encoder is connected to the Decoder only through a narrow bottleneck.]

Figure 11.8 Requiring the context c to be only the encoder’s final hidden state forces all the
information from the entire source sentence to pass through this representational bottleneck.

The attention mechanism is a solution to the bottleneck problem, a way of
allowing the decoder to get information from all the hidden states of the encoder,
not just the last hidden state.
In the attention mechanism, as in the vanilla encoder-decoder model, the context
vector $c$ is a single vector that is a function of the hidden states of the encoder, that is,
$c = f(h_1^n)$. Because the number of hidden states varies with the size of the input, we
can’t use the entire tensor of encoder hidden state vectors directly as the context for
the decoder. The idea of attention is instead to create the single fixed-length vector
$c$ by taking a weighted sum of all the encoder hidden states $h_1^n$.
The weights are used to focus on a particular part of the source text that is rel-
evant for the token currently being produced by the decoder. The context vector
produced by the attention mechanism is thus dynamic, different for each token in
decoding.
Attention thus replaces the static context vector with one that is dynamically
derived from the encoder hidden states at each point during decoding. This context
vector, $c_i$, is generated anew with each decoding step $i$ and takes all of the encoder
hidden states into account in its derivation. We then make this context available
during decoding by conditioning the computation of the current decoder hidden state
on it (along with the prior hidden state and the previous output generated by the
decoder), as in the following equation (and Fig. 11.9):

$$h_i^d = g(\hat{y}_{i-1}, h_{i-1}^d, c_i)$$

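[Figure 11.9 diagram: each decoder hidden state $h_1^d, h_2^d, \ldots, h_i^d$ receives its own context $c_1, c_2, \ldots, c_i$ and produces the outputs $y_1, y_2, \ldots, y_i$.]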

Figure 11.9 The attention mechanism allows each hidden state of the decoder to see a
different, dynamic, context, which is a function of all the encoder hidden states.

The first step in computing $c_i$ is to compute how much to focus on each encoder
state, how relevant each encoder state is to the decoder state captured in $h_{i-1}^d$. We
capture relevance by computing, at each state $i$ during decoding, a
$\mathrm{score}(h_{i-1}^d, h_j^e)$ for each encoder state $j$.
The simplest such score, called dot-product attention, implements relevance as
similarity: measuring how similar the decoder hidden state is to an encoder hidden
state, by computing the dot product between them:

$$\mathrm{score}(h_{i-1}^d, h_j^e) = h_{i-1}^d \cdot h_j^e \tag{11.15}$$


The score that results from this dot product is a scalar that reflects the degree of
similarity between the two vectors. The vector of these scores across all the encoder
hidden states gives us the relevance of each encoder state to the current step of the
decoder.
To make use of these scores, we’ll normalize them with a softmax to create a
vector of weights, $\alpha_{ij}$, that tells us the proportional relevance of each encoder hidden
state $j$ to the prior hidden decoder state, $h_{i-1}^d$.

$$\alpha_{ij} = \mathrm{softmax}(\mathrm{score}(h_{i-1}^d, h_j^e) \;\; \forall j \in e) \tag{11.16}$$
$$\alpha_{ij} = \frac{\exp(\mathrm{score}(h_{i-1}^d, h_j^e))}{\sum_k \exp(\mathrm{score}(h_{i-1}^d, h_k^e))} \tag{11.17}$$

Finally, given the distribution in $\alpha$, we can compute a fixed-length context vector for
the current decoder state by taking a weighted average over all the encoder hidden
states.

$$c_i = \sum_j \alpha_{ij} h_j^e \tag{11.18}$$

With this, we finally have a fixed-length context vector that takes into account
information from the entire encoder state that is dynamically updated to reflect the
needs of the decoder at each step of decoding. Fig. 11.10 illustrates an
encoder-decoder network with attention, focusing on the computation of one context vector
$c_i$.
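
The three steps of dot-product attention (score, normalize, weighted sum) can be sketched in a few lines of pure Python. The encoder states and decoder state below are made-up toy vectors, and the function name `attention_context` is an assumption for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(z):
    top = max(z)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in z]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(h_dec_prev, enc_states):
    """Return (alpha_i, c_i) for one decoding step."""
    scores = [dot(h_dec_prev, h_e) for h_e in enc_states]    # Eq. 11.15
    alphas = softmax(scores)                                 # Eqs. 11.16-11.17
    dim = len(enc_states[0])
    context = [sum(a * h_e[d] for a, h_e in zip(alphas, enc_states))
               for d in range(dim)]                          # Eq. 11.18
    return alphas, context

enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # h_1^e, h_2^e, h_3^e (made up)
alphas, c_i = attention_context([2.0, 0.0], enc_states)
```

Here the first and third encoder states have the same dot product with the decoder state, so they receive equal weight, and the second state, which is orthogonal to it, receives the least. The context vector has the encoder's dimensionality regardless of how many encoder states there are.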

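[Figure 11.10 diagram: encoder hidden states $h_1^e, h_2^e, h_3^e, \ldots, h_n^e$ over inputs x1 x2 x3 … xn are each scored against the prior decoder state by the dot product $h_{i-1}^d \cdot h_j^e$. The resulting attention weights $\alpha_{ij}$ (for example .4, .3, .1, .2) give $c_i = \sum_j \alpha_{ij} h_j^e$, which feeds the decoder state $h_i^d$ along with $y_{i-1}$ to produce $y_i$.]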

Figure 11.10 A sketch of the encoder-decoder network with attention, focusing on the computation of $c_i$. The
context value $c_i$ is one of the inputs to the computation of $h_i^d$. It is computed by taking the weighted sum of all
the encoder hidden states, each weighted by their dot product with the prior decoder hidden state $h_{i-1}^d$.

It’s also possible to create more sophisticated scoring functions for attention
models. Instead of simple dot product attention, we can get a more powerful function
that computes the relevance of each encoder hidden state to the decoder hidden state
by parameterizing the score with its own set of weights, $W_s$.

$$\mathrm{score}(h_{i-1}^d, h_j^e) = h_{i-1}^d W_s h_j^e$$

The weights $W_s$, which are then trained during normal end-to-end training, give the
network the ability to learn which aspects of similarity between the decoder and
encoder states are important to the current application. This bilinear model also
allows the encoder and decoder to use different dimensional vectors, whereas the
simple dot-product attention requires that the encoder and decoder hidden states have
the same dimensionality.
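
A brief sketch of the bilinear score shows why the dimensionalities no longer need to match: $W_s$ has one row per decoder dimension and one column per encoder dimension. The function name and the toy values are assumptions for illustration; in practice $W_s$ would be learned.

```python
def bilinear_score(h_dec_prev, W_s, h_enc):
    """score = h_{i-1}^d W_s h_j^e, with W_s shaped (decoder dim) x (encoder dim)."""
    projected = [sum(w * x for w, x in zip(row, h_enc)) for row in W_s]  # W_s h_j^e
    return sum(d * p for d, p in zip(h_dec_prev, projected))

# A 2-dimensional decoder state scored against a 3-dimensional encoder state.
W_s = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
score = bilinear_score([1.0, 2.0], W_s, [3.0, 4.0, 5.0])
```

When the two dimensionalities are equal and $W_s$ is the identity matrix, this reduces to the plain dot product of Eq. 11.15.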

11.5 Beam Search


The decoding algorithm we gave above for generating translations has a problem (as
does the autoregressive generation we introduced in Chapter 9 for generating from a
conditional language model). Recall that algorithm: at each time step in decoding,
the output yt is chosen by computing a softmax over the set of possible outputs (the
vocabulary, in the case of language modeling or MT), and then choosing the highest
probability token (the argmax):

$$\hat{y}_t = \operatorname{argmax}_{w \in V} P(w \mid x, y_1 \ldots y_{t-1}) \tag{11.19}$$

Choosing the single most probable token to generate at each step is called greedy
decoding; a greedy algorithm is one that makes a choice that is locally optimal,
whether or not it will turn out to have been the best choice with hindsight.
Indeed, greedy search is not optimal, and may not find the highest probability
translation. The problem is that the token that looks good to the decoder now might
turn out later to have been the wrong choice!
Let’s see this by looking at the search tree, a graphical representation of the
choices the decoder makes in searching for the best translation, in which we view
the decoding problem as a heuristic state-space search and systematically explore
the space of possible outputs. In such a search tree, the branches are the actions, in
this case the action of generating a token, and the nodes are the states, in this case
the state of having generated a particular prefix. We are searching for the best action
sequence, i.e. the target string with the highest probability. Fig. 11.11 demonstrates
the problem, using a made-up example. Notice that the most probable sequence is
ok ok </s> (with a probability of .4*.7*1.0), but a greedy search algorithm will fail
to find it, because it incorrectly chooses yes as the first word since it has the highest
local probability.
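
The failure of greedy search on this example can be checked directly. The sketch below encodes the made-up probabilities as read off Fig. 11.11 (treat the exact branch values as an assumption of this illustration) and compares greedy decoding with a brute-force search over every complete sequence. The names `TREE`, `next_probs`, `greedy`, and `exhaustive` are hypothetical.

```python
EOS = "</s>"

# Next-token probabilities for each prefix, as read off the made-up example.
TREE = {
    (): {"ok": 0.4, "yes": 0.5, EOS: 0.1},
    ("ok",): {"ok": 0.7, "yes": 0.2, EOS: 0.1},
    ("yes",): {"ok": 0.3, "yes": 0.4, EOS: 0.2},
}

def next_probs(prefix):
    # After two words the toy model always ends the sentence.
    return TREE.get(prefix, {EOS: 1.0})

def greedy(next_probs):
    """Pick the locally most probable token at every step."""
    prefix, prob = (), 1.0
    while not prefix or prefix[-1] != EOS:
        dist = next_probs(prefix)
        tok = max(dist, key=dist.get)
        prefix, prob = prefix + (tok,), prob * dist[tok]
    return prefix, prob

def exhaustive(next_probs, prefix=(), prob=1.0):
    """Score every complete sequence and return the most probable one."""
    if prefix and prefix[-1] == EOS:
        return prefix, prob
    best = None
    for tok, p in next_probs(prefix).items():
        cand = exhaustive(next_probs, prefix + (tok,), prob * p)
        if best is None or cand[1] > best[1]:
            best = cand
    return best

greedy_seq, greedy_prob = greedy(next_probs)
best_seq, best_prob = exhaustive(next_probs)
```

Greedy search commits to yes because .5 beats .4 at the first step, and ends with a sequence of probability .5*.4*1.0 = .2, lower than the .28 of ok ok </s>.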
Recall from Chapter 8 that for part-of-speech tagging we used dynamic
programming search (the Viterbi algorithm) to address this problem. Unfortunately,
dynamic programming is not applicable to generation problems with long-distance
dependencies between the output decisions. The only method guaranteed to find the
best solution is exhaustive search: computing the probability of every one of the $V^T$
possible sentences (for some length value $T$), which is obviously too slow.
[Figure 11.11 diagram: a search tree rooted at start. For the first token, p(t1|source) is ok .4, yes .5, </s> .1. After ok, p(t2|source, t1) is ok .7, yes .2, </s> .1. After yes, it is ok .3, yes .4, </s> .2. Every two-word prefix is followed by </s> with p(t3|source, t1, t2) = 1.0.]
Figure 11.11 A search tree for generating the target string $T = t_1, t_2, \ldots$ from the vocabulary
V = {yes, ok, </s>}, given the source string, showing the probability of generating each token
from that state. Greedy search would choose yes at the first time step followed by yes, instead
of the globally most probable sequence ok ok.

Instead, decoding in MT and other sequence generation problems generally uses
a method called beam search. In beam search, instead of choosing the best token
to generate at each timestep, we keep k possible tokens at each step. This fixed-size
memory footprint k is called the beam width, on the metaphor of a flashlight beam
that can be parameterized to be wider or narrower.
Thus at the first step of decoding, we compute a softmax over the entire
vocabulary, assigning a probability to each word. We then select the k-best options from
this softmax output. These initial k outputs are the search frontier and these k initial
words are called hypotheses. A hypothesis is an output sequence, a
translation-so-far, together with its probability.
At subsequent steps, each of the k best hypotheses is extended incrementally
by being passed to distinct decoders, which each generate a softmax over the entire
vocabulary to extend the hypothesis to every possible next token. Each of these $k \times V$
hypotheses is scored by $P(y_i \mid x, y_{<i})$: the product of the probability of the current word
choice multiplied by the probability of the path that led to it. We then prune the $k \times V$
hypotheses down to the k best hypotheses, so there are never more than k hypotheses
at the frontier of the search, and never more than k decoders.
Fig. 11.12 illustrates this process with a beam width of 2.
This process continues until a </s> is generated indicating that a complete can-
didate output has been found. At this point, the completed hypothesis is removed
from the frontier and the size of the beam is reduced by one. The search continues
until the beam has been reduced to 0. The result will be k hypotheses.
Let’s see how the scoring works in detail, scoring each node by its log proba-
bility. Recall from Eq. 11.10 that we can use the chain rule of probability to break
down p(y|x) into the product of the probability of each word given its prior context,
which we can turn into a sum of logs (for an output string of length t):
$$
\begin{aligned}
\text{score}(y) &= \log P(y|x) \\
&= \log \big( P(y_1|x)\, P(y_2|y_1,x)\, P(y_3|y_1,y_2,x) \cdots P(y_t|y_1,\ldots,y_{t-1},x) \big) \\
&= \sum_{i=1}^{t} \log P(y_i|y_1,\ldots,y_{i-1},x)
\end{aligned}
\tag{11.20}
$$

Thus at each step, to compute the probability of a partial translation, we simply add
the log probability of the prefix translation so far to the log probability of generating
the next token. Fig. 11.13 shows the scoring for the example sentence shown in
Fig. 11.12, using some simple made-up probabilities. Log probabilities are negative
or 0, and the max of two log probabilities is the one that is greater (closer to 0).
Fig. 11.14 gives the algorithm.
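As a small illustration of this additive scoring, the Python sketch below extends two prefixes by one token and keeps the best two. The per-token log probabilities are illustrative values chosen to agree with the path totals of Fig. 11.13, and the helper name `extend` is hypothetical.

```python
# Accumulated log probabilities of the two prefixes on the frontier (k = 2).
frontier = {("arrived",): -1.6, ("the",): -0.92}

# Illustrative log P(next token | prefix, x) for a few possible extensions.
next_logprob = {
    ("arrived", "the"): -0.69,
    ("arrived", "witch"): -2.3,
    ("the", "green"): -0.69,
    ("the", "witch"): -1.2,
}


def extend(frontier, next_logprob):
    """Score each extension by adding its log probability to its prefix score."""
    scored = {}
    for (prev, token), logp in next_logprob.items():
        prefix = (prev,)
        scored[prefix + (token,)] = frontier[prefix] + logp
    return scored


scored = extend(frontier, next_logprob)
# Keep the k = 2 hypotheses whose log probability is largest (closest to 0).
top2 = sorted(scored, key=scored.get, reverse=True)[:2]
print(top2)
```

Under these assumed values, the two surviving hypotheses are the ones that begin with the, as in the caption of Fig. 11.12.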
One problem arises from the fact that the completed hypotheses may have differ-
ent lengths. Because models generally assign lower probabilities to longer strings,
a naive algorithm would also choose shorter strings for y. This was not an issue
during the earlier steps of decoding; due to the breadth-first nature of beam search

[Figure 11.12: beam search tree over time steps t1, t2, t3. From start, decoder states hd1, hd2, ... score every vocabulary word (a, aardvark, ..., zebra, EOS); arrived and the are kept at t1, and the green and the witch at t2.]

Figure 11.12 Beam search decoding with a beam width of k = 2. At each time step, we choose the k best
hypotheses, compute the V possible extensions of each hypothesis, score the resulting k × V possible hypotheses
and choose the best k to continue. At time 1, the frontier is filled with the best 2 options from the initial state
of the decoder: arrived and the. We then extend each of those, compute the probability of all the hypotheses so
far (arrived the, arrived aardvark, the green, the witch) and choose the best 2 (in this case the green and the
witch) to be the search frontier to extend on the next step. On the arcs we show the decoders that we run to score
the extension words (although for simplicity we haven’t shown the context value c_i that is input at each step).

all the hypotheses being compared had the same length. The usual solution to this is
to apply some form of length normalization to each of the hypotheses, for example
simply dividing the negative log probability by the number of words (for an output
string of length t; with this sign convention a lower score is better):

$$
\text{score}(y) = -\frac{1}{t} \log P(y|x) = -\frac{1}{t} \sum_{i=1}^{t} \log P(y_i|y_1,\ldots,y_{i-1},x) \tag{11.21}
$$
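To see why normalization matters, here is a minimal Python sketch with two made-up hypotheses. The token log probabilities and the function names are assumptions for illustration only; the score follows the description of Eq. 11.21 (negative log probability divided by the number of words, so lower is better).

```python
def total_logprob(token_logprobs):
    """Unnormalized score: the sum of per-token log probabilities."""
    return sum(token_logprobs)


def length_normalized_score(token_logprobs):
    """Negative log probability divided by the number of tokens (lower is better)."""
    return -sum(token_logprobs) / len(token_logprobs)


short = [-1.0, -1.0]                     # 2 tokens, log P = -2.0
long_ = [-0.5, -0.5, -0.5, -0.5, -0.5]   # 5 tokens, log P = -2.5

# Raw log probability favors the short hypothesis ...
print(total_logprob(short), total_logprob(long_))
# ... while the per-word score favors the long one.
print(length_normalized_score(short), length_normalized_score(long_))
```

The longer hypothesis has the lower total probability only because it has more tokens; per word, it is the more probable of the two.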

Beam search is common in large production MT systems, generally with beam
widths k between 5 and 10. What do we do with the resulting k hypotheses? In some
cases, all we need from our MT algorithm is the single best hypothesis, so we can
return that. In other cases our downstream application might want to look at all k
hypotheses, so we can pass them all (or a subset) to the downstream application with
their respective scores.

11.6 Encoder-Decoder with Transformers


TBD

[Figure 11.13: beam search lattice over positions y1 to y5, with columns log P(y1|x), log P(y2|y1,x), log P(y3|y2,y1,x), log P(y4|y3,y2,y1,x), log P(y5|y4,y3,y2,y1,x). Nodes show accumulated log probabilities, for example log P(the|x) = -.92, log P(arrived|x) = -1.6, log P(the green|x) = -1.6, log P(the witch|x) = -2.1, log P(arrived the|x) = -2.3, log P(arrived witch|x) = -3.9. For a complete path: log P("the green witch arrived"|x) = log P(the|x) + log P(green|the,x) + log P(witch|the,green,x) + log P(arrived|the,green,witch,x) + log P(END|the,green,witch,arrived,x).]

Figure 11.13 Scoring for beam search decoding with a beam width of k = 2. We maintain the log probability
of each hypothesis in the beam by incrementally adding the logprob of generating each next token. Only the top
k paths are extended to the next step.

11.7 Some practical details on building MT systems

11.7.1 Tokenization
Machine translation systems generally use a fixed vocabulary. A common way to
generate this vocabulary is with the BPE or wordpiece algorithms sketched in Chapter 2.
Generally a shared vocabulary is used for the source and target languages,
which makes it easy to copy tokens (like names) from source to target, so we build
the wordpiece/BPE lexicon on a corpus that contains both source and target language
data. Wordpieces use a special symbol at the beginning of each token; here’s
a resulting tokenization from the Google MT system (Wu et al., 2016):
words: Jet makers feud over seat width with big orders at stake
wordpieces: _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake
We gave the BPE algorithm in detail in Chapter 2; here are more details on the
wordpiece algorithm, which is given a training corpus and a desired vocabulary size
V, and proceeds as follows:
1. Initialize the wordpiece lexicon with characters (for example a subset of Uni-
code characters, collapsing all the remaining characters to a special unknown
character token).
2. Repeat until there are V wordpieces:
(a) Train an n-gram language model on the training corpus, using the current
set of wordpieces.
(b) Consider the set of possible new wordpieces made by concatenating two
wordpieces from the current lexicon. Choose the one new wordpiece that
most increases the language model probability of the training corpus.
A vocabulary of 8K to 32K wordpieces is commonly used.
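The loop above can be sketched in a few lines of Python. This is only a toy illustration under strong assumptions: the language model is a unigram model estimated by relative frequency (the simplest n-gram model), every candidate merge is scored by brute force, and the function names are hypothetical. It is not the procedure used in production systems.

```python
import math
from collections import Counter


def unigram_log_likelihood(corpus):
    """Log probability of the corpus under a unigram model of its own tokens."""
    counts = Counter(tok for word in corpus for tok in word)
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values())


def merge_pair(corpus, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated piece."""
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged


def train_wordpieces(words, vocab_size):
    corpus = [list(w) for w in words]             # step 1: start from characters
    vocab = {ch for w in corpus for ch in w}
    while len(vocab) < vocab_size:                # step 2: grow to the target size
        pairs = sorted({(w[i], w[i + 1]) for w in corpus for i in range(len(w) - 1)})
        pairs = [p for p in pairs if p[0] + p[1] not in vocab]
        if not pairs:
            break
        # Steps 2(a)-(b): pick the merge giving the highest corpus likelihood.
        best = max(pairs, key=lambda p: unigram_log_likelihood(merge_pair(corpus, p)))
        corpus = merge_pair(corpus, best)
        vocab.add(best[0] + best[1])
    return vocab, corpus


words = ["low", "low", "lower", "newest", "newest", "widest"]
vocab, segmented = train_wordpieces(words, vocab_size=13)
print(sorted(vocab))
print(segmented)
```

Re-estimating the model for every candidate merge is far too slow for real corpora; the sketch is meant only to make the selection criterion concrete.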

function BEAMDECODE(c, beam_width) returns best paths

  y0, h0 ← 0
  path ← ()
  complete_paths ← ()
  state ← (c, y0, h0, path)          ; initial state
  frontier ← ⟨state⟩                 ; initial frontier

  while frontier contains incomplete paths and beam_width > 0
    extended_frontier ← ⟨⟩
    for each state ∈ frontier do
      y ← DECODE(state)
      for each word i ∈ Vocabulary do
        successor ← NEWSTATE(state, i, y_i)
        extended_frontier ← ADDTOBEAM(successor, extended_frontier, beam_width)

    for each state in extended_frontier do
      if state is complete do
        complete_paths ← APPEND(complete_paths, state)
        extended_frontier ← REMOVE(extended_frontier, state)
        beam_width ← beam_width - 1
    frontier ← extended_frontier

  return complete_paths

function NEWSTATE(state, word, word_prob) returns new state

function ADDTOBEAM(state, frontier, width) returns updated frontier

  if LENGTH(frontier) < width then
    frontier ← INSERT(state, frontier)
  else if SCORE(state) > SCORE(WORSTOF(frontier))
    frontier ← REMOVE(WORSTOF(frontier))
    frontier ← INSERT(state, frontier)
  return frontier

Figure 11.14 Beam search decoding.
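A runnable counterpart to this pseudocode might look like the Python sketch below. It is a simplification under stated assumptions: the decoder is replaced by a hypothetical lookup table `TOY_MODEL` (made-up probabilities loosely modeled on Fig. 11.11), a hypothesis is just a token tuple with its log probability, and the candidates are sorted rather than inserted one at a time.

```python
import math

# Hypothetical next-token distributions, keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"yes": 0.5, "ok": 0.4, "</s>": 0.1},
    ("ok",): {"ok": 0.7, "yes": 0.2, "</s>": 0.1},
    ("yes",): {"yes": 0.4, "ok": 0.3, "</s>": 0.3},
}


def next_token_logprobs(prefix):
    """Stand-in for the decoder softmax; longer prefixes always end the sentence."""
    dist = TOY_MODEL.get(prefix, {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in dist.items()}


def beam_search(next_token_logprobs, beam_width, eos="</s>", max_len=20):
    frontier = [((), 0.0)]            # (token tuple, accumulated log probability)
    completed = []
    width = beam_width
    steps = 0
    while frontier and width > 0 and steps < max_len:
        # Extend every hypothesis on the frontier with every possible next token.
        candidates = []
        for prefix, score in frontier:
            for token, logp in next_token_logprobs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Prune to the best `width` hypotheses.
        candidates.sort(key=lambda h: h[1], reverse=True)
        frontier = []
        for hyp in candidates[:width]:
            if hyp[0][-1] == eos:
                completed.append(hyp)  # finished: leave the frontier,
                width -= 1             # and shrink the beam by one
            else:
                frontier.append(hyp)
        steps += 1
    return sorted(completed, key=lambda h: h[1], reverse=True)


greedy = beam_search(next_token_logprobs, beam_width=1)
beam = beam_search(next_token_logprobs, beam_width=2)
print(greedy[0][0], math.exp(greedy[0][1]))
print(beam[0][0], math.exp(beam[0][1]))
```

With these assumed probabilities, a beam of width 1 (greedy search) commits to yes at the first step, while a beam of width 2 recovers the more probable sequence ok ok.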

11.7.2 MT corpora
Machine translation models are trained on a parallel corpus, sometimes called a
bitext, a text that appears in two (or more) languages. Large numbers of parallel
corpora are available. Some are governmental; the Europarl corpus (Koehn,
2005), extracted from the proceedings of the European Parliament, contains between
400,000 and 2 million sentences each from 21 European languages. The United Nations
Parallel Corpus contains on the order of 10 million sentences in the six official
languages of the United Nations (Arabic, Chinese, English, French, Russian, Spanish)
(Ziemski et al., 2016). Other parallel corpora have been made from movie and
TV subtitles, like the OpenSubtitles corpus (Lison and Tiedemann, 2016), or from
general web text, like the ParaCrawl corpus of 223 million sentence pairs between
23 EU languages and English extracted from the CommonCrawl (Bañón et al.,
2020).

Sentence alignment
Standard training corpora for MT come as aligned pairs of sentences. When creating
new corpora, for example for underresourced languages or new domains, these sen-
tence alignments must be created. Fig. 11.15 gives a sample hypothetical sentence
alignment.

English:
E1: “Good morning," said the little prince.
E2: “Good morning," said the merchant.
E3: This was a merchant who sold pills that had been perfected to quench thirst.
E4: You just swallow one pill a week and you won’t feel the need for anything to drink.
E5: “They save a huge amount of time," said the merchant.
E6: “Fifty-three minutes a week."
E7: “If I had fifty-three minutes to spend?" said the little prince to himself.
E8: “I would take a stroll to a spring of fresh water”

French:
F1: -Bonjour, dit le petit prince.
F2: -Bonjour, dit le marchand de pilules perfectionnées qui apaisent la soif.
F3: On en avale une par semaine et l'on n'éprouve plus le besoin de boire.
F4: -C’est une grosse économie de temps, dit le marchand.
F5: Les experts ont fait des calculs.
F6: On épargne cinquante-trois minutes par semaine.
F7: “Moi, se dit le petit prince, si j'avais cinquante-trois minutes à dépenser, je marcherais tout doucement vers une fontaine..."

Figure 11.15 A sample alignment between sentences in English and French, with sentences extracted from
Antoine de Saint-Exupéry’s Le Petit Prince and a hypothetical translation. Sentence alignment takes sentences
e1, ..., en and f1, ..., fm and finds minimal sets of sentences that are translations of each other, including single
sentence mappings like (e1, f1), (e4, f3), (e5, f4), (e6, f6) as well as 2-1 alignments (e2/e3, f2), (e7/e8, f7), and null
alignments (f5).

Given two documents that are translations of each other, we generally need two
steps to produce sentence alignments:
• a cost function that takes a span of source sentences and a span of target sen-
tences and returns a score measuring how likely these spans are to be transla-
tions.
• an alignment algorithm that takes these scores to find a good alignment be-
tween the documents.
Since it is possible to induce multilingual sentence embeddings (Artetxe and
Schwenk, 2019), cosine similarity of such embeddings provides a natural scoring
function (Schwenk, 2018). Thompson and Koehn (2019) give the following cost
function between two sentences or spans x,y from the source and target documents
respectively:

$$
c(x, y) = \frac{(1 - \cos(x, y))\,\text{nSents}(x)\,\text{nSents}(y)}{\sum_{s=1}^{S} \big(1 - \cos(x, y_s)\big) + \sum_{s=1}^{S} \big(1 - \cos(x_s, y)\big)} \tag{11.22}
$$

where nSents() gives the number of sentences (this biases the metric toward many
alignments of single sentences instead of aligning very large spans). The denom-
inator helps to normalize the similarities, and so x1 , ..., xS , y1 , ..., yS , are randomly
selected sentences sampled from the respective documents.
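A direct transcription of Eq. 11.22 into Python is sketched below. The embeddings are assumed to be plain lists of floats produced by some multilingual sentence encoder (not shown), and the function and parameter names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_cost(x, y, n_sents_x, n_sents_y, sampled_x, sampled_y):
    """Cost of aligning span embedding x with span embedding y (Eq. 11.22).

    sampled_x and sampled_y are embeddings of randomly sampled sentences
    from the source and target documents, used for normalization.
    """
    numerator = (1 - cosine(x, y)) * n_sents_x * n_sents_y
    denominator = (sum(1 - cosine(x, ys) for ys in sampled_y)
                   + sum(1 - cosine(xs, y) for xs in sampled_x))
    return numerator / denominator


# Toy 2-dimensional "embeddings" for illustration.
x = [1.0, 0.0]
same = alignment_cost(x, [1.0, 0.0], 1, 1, [[0.0, 1.0]], [[0.0, 1.0]])
different = alignment_cost(x, [0.0, 1.0], 1, 1, [[0.0, 1.0]], [[0.0, 1.0]])
print(same, different)
```

Identical embeddings give a cost of 0, and the cost grows with the number of sentences in either span, which is what discourages very large alignments.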
Usually dynamic programming is used as the alignment algorithm (Gale and
Church, 1993), in a simple extension of the minimum edit distance algorithm we
introduced in Chapter 2.
Finally, it’s helpful to do some corpus cleanup by removing noisy sentence pairs.
This can involve handwritten rules to remove low-precision pairs (for example re-
moving sentences that are too long, too short, have different URLs, or even pairs
that are too similar, suggesting that they were copies rather than translations). Or
pairs can be ranked by their multilingual embedding cosine score and low-scoring
pairs discarded.

11.7.3 Backtranslation
We’re often short of data for training MT models, since parallel corpora may be
limited for particular languages or domains. However, often we can find a large
monolingual corpus, to add to the smaller parallel corpora that are available.
Backtranslation is a way of making use of monolingual corpora in the target
language by creating synthetic bitexts. In backtranslation, we train an intermediate
target-to-source MT system on the small bitext to translate the monolingual target
data to the source language. Now we can add this synthetic bitext (natural target
sentences, aligned with MT-produced source sentences) to our training data, and
retrain our source-to-target MT model. For example suppose we want to translate
from Navajo to English but only have a small Navajo-English bitext, although of
course we can find lots of monolingual English data. We use the small bitext to build
an MT engine going the other way (from English to Navajo). Once we translate the
monolingual English text to Navajo, we can add this synthetic Navajo/English bitext
to our training data.
Backtranslation has various parameters. One is how we generate the backtranslated
data; we can run the decoder in greedy inference, or use beam search. Or
we can do sampling, or Monte Carlo search. In Monte Carlo decoding, at each
timestep, instead of always generating the word with the highest softmax probability,
we roll a weighted die, and use it to choose the next word according to its
softmax probability. This works just like the sampling algorithm we saw in Chapter 3
for generating random sentences from n-gram language models. Imagine there
are only 4 words and the softmax probability distribution at time t is (the: 0.6, green:
0.2, a: 0.1, witch: 0.1). We roll a weighted die, with the 4 sides weighted 0.6, 0.2,
0.1, and 0.1, and choose the word based on which side comes up. Another parameter
is the ratio of backtranslated data to natural bitext data; we can choose to upsample
the bitext data (include multiple copies of each sentence).
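The weighted die in this example can be simulated with Python's standard library. The sketch below draws one word per call from the four-word distribution in the text; the function name `sample_next_word` and the fixed random seed are assumptions made for reproducibility.

```python
import random


def sample_next_word(distribution, rng):
    """Draw one word with probability proportional to its softmax weight."""
    words = list(distribution)
    weights = [distribution[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


softmax_output = {"the": 0.6, "green": 0.2, "a": 0.1, "witch": 0.1}
rng = random.Random(0)

# Rolling the die many times: the empirical frequencies approach the weights.
samples = [sample_next_word(softmax_output, rng) for _ in range(10000)]
for word in softmax_output:
    print(word, samples.count(word) / len(samples))
```

In a real decoder the distribution would be recomputed at every timestep from the words sampled so far.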
In general backtranslation works surprisingly well; one estimate suggests that a
system trained on backtranslated text gets about 2/3 of the gain that training on
the same amount of natural bitext would give (Edunov et al., 2018).

11.8 MT Evaluation
Translations can be evaluated along two dimensions, adequacy and fluency.
adequacy: how well the translation captures the exact meaning of the source sentence.
Sometimes called faithfulness or fidelity.
fluency: how fluent the translation is in the target language (is it grammatical, clear,
readable, natural).
Both human and automatic evaluation metrics are used.

11.8.1 Using Human Raters to Evaluate MT


The most accurate evaluations use human raters to evaluate each translation along
the two dimensions (often these raters are online crowdworkers hired specifically to
evaluate).
For example, along the dimension of fluency, we can ask how intelligible, how
clear, how readable, or how natural the MT output (the target text) is. We can give
the raters a scale, for example, from 1 (totally unintelligible) to 5 (totally intelligible),
or 1 to 100, and ask them to rate each sentence or paragraph of the MT output.
We can do the same thing to judge the second dimension, adequacy, using raters
to assign scores on a scale. If we have bilingual raters, we can give them the source
sentence and a proposed target sentence, and rate, on a 5-point or 100-point scale,
how much of the information in the source was preserved in the target. If we only
have monolingual raters but we have a good human translation of the source text,
we can give the monolingual raters the human reference translation and a target
machine translation and again rate how much information is preserved. If we use
a fine-grained enough scale, we can normalize raters by subtracting the mean from
their scores and dividing by the variance.
An alternative is to do ranking: give the raters a pair of candidate translations,
and ask them which one they prefer.
While humans produce the best evaluations of machine translation output, run-
ning a human evaluation can be time consuming and expensive. In the next section
we introduce an automatic metric that, while less accurate than human evaluation, is
widely used because it can quickly evaluate potential system improvements, or even
be used as an automatic loss function for training.

11.8.2 Automatic Evaluation: BLEU


The most popular automatic metric for machine translation is called BLEU (for
BiLingual Evaluation Understudy). BLEU (along with the many alternative metrics,
e.g., NIST, TER, Precision and Recall, and METEOR) is based on a simple
intuition derived from the pioneering work of Miller and Beebe-Center (1958): a
good machine translation will tend to contain words and phrases that occur in a
human translation of the same sentence.
Consider a test set from a parallel corpus, in which each source sentence has both
a gold human target translation and a candidate MT translation we’d like to evaluate.
The BLEU metric ranks each MT target sentence by a function of the number of
n-gram overlaps with the human translation.
Figure 11.16 shows an intuition from two candidate translations of a Spanish
source sentence, shown with a human reference translation. Note that Candidate
1 shares many more n-grams (in boxes) and especially longer n-grams (in darker
boxes) with the reference translations than does Candidate 2.
The intuition in Fig. 11.16 shows a single sentence, but BLEU is actually not a
score for a single sentence; it’s a score for an entire corpus of candidate translation
sentences. More formally, the BLEU score for a corpus of candidate translation
sentences is a function of the n-gram precision over all the sentences combined
with a brevity penalty computed over the corpus as a whole.
What do we mean by n-gram precision? Consider a corpus composed of a single
sentence. The unigram precision for this corpus is the percentage of unigram tokens
in the candidate translation that also occur in the reference translation, and ditto for
bigrams and so on, up to 4-grams. Candidate 1 in Fig. 11.16 has 19 unique unigrams,
some of which occur multiple times, for a total of 26 tokens. Of these, 16 unique
unigrams, totaling 23 tokens, also occur in the reference translation (3 don’t: voice,
deposit, and actions). Thus the unigram precision for the Candidate 1 corpus is
23/26 = .88.
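The count for Candidate 1 can be checked mechanically. The Python sketch below assumes a deliberately simple tokenizer (lowercase alphabetic strings, punctuation dropped) and computes the unclipped unigram precision described in this paragraph; the function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def unigram_precision(candidate, reference):
    """Return (matched candidate tokens, total candidate tokens)."""
    cand = tokenize(candidate)
    ref_types = set(tokenize(reference))
    matched = sum(1 for tok in cand if tok in ref_types)
    return matched, len(cand)


reference = ("truth, whose mother is history, rival of time, storehouse of deeds, "
             "witness for the past, example and counsel for the present, "
             "and warning for the future.")
candidate1 = ("truth, whose mother is history, voice of time, deposit of actions, "
              "witness for the past, example and warning for the present, "
              "and warning for the future")

matched, total = unigram_precision(candidate1, reference)
print(matched, total, round(matched / total, 2))
```

The three unmatched tokens are voice, deposit, and actions, giving 23 matches out of 26 tokens.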

Source
la verdad, cuya madre es la historia, émula del tiempo, depósito de las acciones,
testigo de lo pasado, ejemplo y aviso de lo presente, advertencia de lo por venir.

Reference
truth, whose mother is history, rival of time, storehouse of deeds,
witness for the past, example and counsel for the present, and warning for the future.
Candidate 1
truth, whose mother is history, voice of time, deposit of actions,
witness for the past, example and warning for the present, and warning for the future

Candidate 2
the truth, which mother is the history, émula of the time, deposition of the shares,
witness of the past, example and notice of the present, warning of it for coming

Figure 11.16 Intuition for BLEU: One of two candidate translations of a Spanish sentence
shares more n-grams, and especially longer n-grams, with the reference human translation.

We extend this unigram metric to a whole corpus of many sentences as follows.
For the numerator, we sum for each sentence the counts of all the unigram types that
also occur in the reference translation, and then sum those counts over all sentences.
The denominator is the total of the counts of all unigrams in all candidate sentences.
We compute this n-gram precision for unigrams, bigrams, trigrams, and 4-grams.
The n-gram precisions prec_n of a whole corpus of candidate sentences are thus:

$$
\text{prec}_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{n\text{-gram} \in C} \text{Count}_{\text{match}}(n\text{-gram})}{\sum_{C' \in \{\text{Candidates}\}} \sum_{n\text{-gram}' \in C'} \text{Count}(n\text{-gram}')} \tag{11.23}
$$

BLEU combines these four n-gram precisions by taking their geometric mean.
In addition, BLEU penalizes candidate translations that are too short. Imagine
our machine translation engine returned the following terrible candidate translation
3 for the example in Fig. 11.16:
(11.24) for the
Because the words for and the and the bigram for the all appear in the human ref-
erence, n-gram precision alone will assign candidate 3 a great score, since it has
perfect unigram and bigram precisions of 1.0!
One option for dealing with this problem is to combine recall with precision,
but BLEU chooses another option: adding a brevity penalty over the whole corpus,
penalizing a system that produces translations that are on average shorter than the
reference translations. Let sys_len be the sum of the length of all the candidate
translation sentences, and ref_len be the sum of the length of all the reference translation
sentences. If the candidate translations are shorter than the reference, we assign a
brevity penalty BP that is a function of their ratio:

$$
\text{BP} = \min\left(1, \exp\left(1 - \frac{\text{ref\_len}}{\text{sys\_len}}\right)\right)
$$

$$
\text{BLEU} = \text{BP} \times \left(\prod_{n=1}^{4} \text{prec}_n\right)^{\frac{1}{4}} \tag{11.25}
$$
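Eq. 11.25 is short enough to write out directly. The Python sketch below takes the four n-gram precisions and the two corpus lengths as given; the function names are hypothetical, and the smoothing that real implementations apply to zero precisions is omitted.

```python
import math


def brevity_penalty(sys_len, ref_len):
    """1 when the system output is at least as long as the reference, else < 1."""
    return min(1.0, math.exp(1.0 - ref_len / sys_len))


def bleu(precisions, sys_len, ref_len):
    """Brevity penalty times the geometric mean of the n-gram precisions."""
    geo_mean = math.prod(precisions) ** (1.0 / len(precisions))
    return brevity_penalty(sys_len, ref_len) * geo_mean


# A system whose output is half the reference length is penalized by exp(-1).
print(brevity_penalty(sys_len=13, ref_len=26))
# Equal lengths: the score is just the geometric mean of the precisions.
print(bleu([0.5, 0.5, 0.5, 0.5], sys_len=10, ref_len=10))
```

Because the precisions are multiplied, a single n-gram order with zero matches drives the whole score to zero, which is one reason standard implementations add smoothing.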

Advanced details of BLEU


The above description was simplified in a number of ways. BLEU actually uses a
slightly different version of n-gram precision than the one in Eq. 11.23. Eq. 11.23
has a flaw that rewards candidates that have extra repeated words. Figure 11.17
shows an example of a pathological candidate sentence composed of 7 instances of
the single word the, leading to a unigram precision of 7/7!

Candidate: the the the the the the the

Reference 1: the cat is on the mat

Reference 2: there is a cat on the mat


Figure 11.17 A pathological example showing why BLEU uses a modified precision met-
ric. Unigram precision would be unreasonably high (7/7). Modified unigram precision is
appropriately low (2/7).
To avoid this problem, BLEU uses a modified n-gram precision metric. We
first count the maximum number of times a word is used in any single reference
translation. The count of each candidate word is then clipped by this maximum
reference count. Thus, the modified unigram precision in the example in Fig. 11.17
would be 2/7, since Reference 1 has a maximum of 2 thes.
To compute a score over the whole test set, BLEU first computes the N-gram
matches for each sentence and sums the clipped counts over all the candidate sen-
tences, then divides by the total number of candidate N-grams in the test set. If we
define the function Count_match_clipped to mean “the clipped count of all n-grams
that match the reference,” the real prec_n of a whole corpus of candidate sentences
that BLEU uses are:

$$
\text{prec}_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{n\text{-gram} \in C} \text{Count}_{\text{match\_clipped}}(n\text{-gram})}{\sum_{C' \in \{\text{Candidates}\}} \sum_{n\text{-gram}' \in C'} \text{Count}(n\text{-gram}')} \tag{11.26}
$$
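Clipping is easy to get subtly wrong, so here is a minimal Python sketch of modified n-gram precision for one candidate sentence, applied to the pathological example of Fig. 11.17. Whitespace tokenization and the function names are assumptions; for a whole corpus, the clipped and total counts would be summed over sentences before dividing.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Return (clipped match count, total candidate n-gram count)."""
    cand_counts = Counter(ngrams(candidate, n))
    # For each n-gram, the largest count found in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())


candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]

print(modified_precision(candidate, references, 1))
```

Reference 1 contains the twice, so only two of the seven candidate tokens are credited.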

BLEU also works fine if we have multiple human reference translations for a
source sentence. In fact BLEU works better in this situation, since a source sentence
can be legitimately translated in many ways and n-gram precision will hence be
more robust. We just match an n-gram if it occurs in any of the references. And for
the brevity penalty, we choose for each candidate sentence the reference sentence
that is the closest in length to compute the ref len. But in practice most translation
corpora only have a single human translation to compare against.
Finally, implementing BLEU requires standardizing on many details of smooth-
ing and tokenization; for this reason it is recommended to use standard implemen-
tations like SACREBLEU (Post, 2018) rather than trying to implement BLEU from
scratch.

Statistical Significance Testing for BLEU


BLEU scores are mainly used to compare two systems, with the goal of answering
questions like: did the special new algorithm we invented improve our MT system?
To know if the difference between the BLEU scores of two MT systems is a sig-
nificant difference, we use the paired bootstrap test, or the similar randomization
test.

To get a confidence interval on a single BLEU score using the bootstrap test,
recall from Section 4.9 that we take our test set (or devset) and create thousands of
pseudo-testsets by repeatedly sampling with replacement from the original test set.
We now compute the BLEU score of each of the pseudo-testsets. If we drop the
top 2.5% and bottom 2.5% of the scores, the remaining scores will give us the 95%
confidence interval for the BLEU score of our system.
To compare two MT systems A and B, we draw the same set of pseudo-testsets,
and compute the BLEU scores for each of them. We then compute the percentage
of pseudo-test-sets in which A has a higher BLEU score than B.
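Both procedures can be sketched in Python as follows. To keep the example self-contained, each test item carries a single per-sentence number and the corpus-level metric is their mean; this is a stand-in assumption, since for BLEU one would instead resample per-sentence n-gram statistics and recompute the corpus score. All function names are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def bootstrap_ci(items, corpus_metric=mean, num_samples=1000, seed=0):
    """95% confidence interval of corpus_metric over bootstrap pseudo-testsets."""
    rng = random.Random(seed)
    n = len(items)
    scores = sorted(
        corpus_metric([items[rng.randrange(n)] for _ in range(n)])
        for _ in range(num_samples)
    )
    # Drop the bottom 2.5% and the top 2.5% of the scores.
    return scores[int(0.025 * num_samples)], scores[int(0.975 * num_samples) - 1]


def paired_bootstrap(items_a, items_b, corpus_metric=mean, num_samples=1000, seed=0):
    """Fraction of pseudo-testsets on which system A scores higher than system B."""
    rng = random.Random(seed)
    n = len(items_a)
    wins = 0
    for _ in range(num_samples):
        # The same sampled indices are used for both systems (paired test).
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = corpus_metric([items_a[i] for i in idx])
        score_b = corpus_metric([items_b[i] for i in idx])
        if score_a > score_b:
            wins += 1
    return wins / num_samples


system_a = [0.6, 0.7, 0.8, 0.9]
system_b = [0.5, 0.6, 0.7, 0.8]
print(bootstrap_ci(system_a))
print(paired_bootstrap(system_a, system_b))
```

Sampling the same indices for both systems is what makes the comparison paired: each pseudo-testset contains the same source sentences for A and for B.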

BLEU: Limitations
While automatic metrics like BLEU are useful, they have important limitations.
BLEU is very local: a large phrase that is moved around might not change the
BLEU score at all, and BLEU can’t evaluate cross-sentence properties of a docu-
ment like its discourse coherence (Chapter 22). BLEU and similar automatic met-
rics also do poorly at comparing very different kinds of systems, such as comparing
human-aided translation against machine translation, or different machine transla-
tion architectures against each other (Callison-Burch et al., 2006). Such automatic
metrics are probably most appropriate when evaluating changes to a single system.

11.8.3 Automatic Evaluation: Embedding-Based Methods


The BLEU metric is based on measuring the exact word or n-grams a human ref-
erence and candidate machine translation have in common. However, this criterion
is overly strict, since a good translation may use alternate words or paraphrases. A
solution pioneered in early metrics like METEOR (Banerjee and Lavie, 2005) was
to allow synonyms to match between the reference x and candidate x̃. More recent
metrics use BERT or other embeddings to implement this intuition.
For example, in some situations we might have datasets that have human assessments
of translation quality. Such datasets consist of tuples (x, x̃, r), where
x = (x1 , . . . , xn ) is a reference translation, x̃ = (x̃1 , . . . , x̃m ) is a candidate machine
translation, and r ∈ R is a human rating that expresses the quality of x̃ with respect
to x. Given such data, algorithms like BLEURT (Sellam et al., 2020) train a pre-
dictor on the human-labeled datasets, by passing x and x̃ through a version of BERT
(trained with extra pretraining, and then fine-tuned on the human-labeled sentences),
followed by a linear layer that is trained to predict r. The output of such models cor-
relates highly with human labels.
In other cases, however, we don’t have such human-labeled datasets. In that
case we can measure the similarity of x and x̃ by the similarity of their embeddings.
The BERTSCORE algorithm (Zhang et al., 2020) shown in Fig. 11.18, for example,
passes the reference x and the candidate x̃ through BERT, computing a BERT embedding
for each token x_i and x̃_j. Each pair of tokens (x_i, x̃_j) is scored by its cosine
$\frac{x_i \cdot \tilde{x}_j}{|x_i||\tilde{x}_j|}$. Each token in x is matched to a token in x̃ to compute recall, and each token in
x̃ is matched to a token in x to compute precision (with each token greedily matched
to the most similar token in the corresponding sentence). BERTSCORE provides
precision and recall (and hence F1):

$$
R_{\text{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\tilde{x}_j \in \tilde{x}} x_i \cdot \tilde{x}_j
\qquad
P_{\text{BERT}} = \frac{1}{|\tilde{x}|} \sum_{\tilde{x}_j \in \tilde{x}} \max_{x_i \in x} x_i \cdot \tilde{x}_j
\tag{11.27}
$$
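The greedy matching in Eq. 11.27 can be sketched without any neural network by treating the token embeddings as given. In the Python sketch below the vectors are toy values rather than BERT outputs, they are normalized to unit length so that the dot product equals the cosine, and the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so that dot products are cosines."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bert_score(ref_vecs, cand_vecs):
    """Greedy-matching recall, precision, and F1 over token embeddings."""
    ref = [normalize(v) for v in ref_vecs]
    cand = [normalize(v) for v in cand_vecs]
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(dot(x, c) for c in cand) for x in ref) / len(ref)
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(dot(x, c) for x in ref) for c in cand) / len(cand)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return recall, precision, f1


# Toy embeddings: the candidate covers only one of the two reference tokens.
reference_tokens = [[1.0, 0.0], [0.0, 1.0]]
candidate_tokens = [[2.0, 0.0]]
recall, precision, f1 = bert_score(reference_tokens, candidate_tokens)
print(recall, precision, f1)
```

Here precision is perfect because the single candidate token has an exact match, while recall is halved because the second reference token has no counterpart.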
Published as a conference paper at ICLR 2020
226 C HAPTER 11 • M ACHINE T RANSLATION AND E NCODER -D ECODER M ODELS

Contextual Pairwise Cosine Maximum Similarity Importance Weighting


Embedding Similarity (Optional)
Reference x
1.27
<latexit sha1_base64="f2yzimwbR/Dgjzp6tZ360fHRqNI=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0GPRi8cW7Ae0oWy2k3btZhN2N2IJ/QVePCji1Z/kzX/jts1BWx8MPN6bYWZekAiujet+O4W19Y3NreJ2aWd3b/+gfHjU0nGqGDZZLGLVCahGwSU2DTcCO4lCGgUC28H4dua3H1FpHst7M0nQj+hQ8pAzaqzUeOqXK27VnYOsEi8nFchR75e/eoOYpRFKwwTVuuu5ifEzqgxnAqelXqoxoWxMh9i1VNIItZ/ND52SM6sMSBgrW9KQufp7IqOR1pMosJ0RNSO97M3E/7xuasJrP+MySQ1KtlgUpoKYmMy+JgOukBkxsYQyxe2thI2ooszYbEo2BG/55VXSuqh6btVrXFZqN3kcRTiBUzgHD66gBndQhyYwQHiGV3hzHpwX5935WLQWnHzmGP7A+fwB5jmM/A==</latexit>
<latexit

7.94
the weather is

Reference
1.82
cold today (0.713 1.27)+(0.515 7.94)+...
7.90
RBERT =
<latexit sha1_base64="fGWl4NCvlvtMu17rjLtk25oWpdc=">AAACSHicbZBLS+RAFIUrPT7bVzsu3RQ2ghIIqVbpuBgQRZiVqNgqdJpQqa5oYeVB1Y1ME/Lz3Lic3fwGNy6UwZ2VNgtfBwoO372Xe+uEmRQaXPef1fgxMTk1PTPbnJtfWFxqLf8812muGO+xVKbqMqSaS5HwHgiQ/DJTnMah5BfhzUFVv7jlSos0OYNRxgcxvUpEJBgFg4JWcBoUPvA/UOwfnp6VJf6F/UhRVmy4Tpds+SBirjFxOt1N26AdslOjrrO7vWn7cpiCLouqwa6QTRyvUznX9hzPK4NW23XcsfBXQ2rTRrWOg9Zff5iyPOYJMEm17hM3g0FBFQgmedn0c80zym7oFe8bm1BzzKAYB1HidUOGOEqVeQngMX0/UdBY61Ecms6YwrX+XKvgd7V+DpE3KESS5cAT9rYoyiWGFFep4qFQnIEcGUOZEuZWzK6pyRFM9k0TAvn85a/mvOMQ1yEnpL13VMcxg1bRGtpABHXRHvqNjlEPMXSHHtATerburUfrv/Xy1tqw6pkV9EGNxisxMKq0</latexit>
sha1_base64="OJyoKlmBAgUA0KDtUcsH/di5BlI=">AAACSHicbZDLattAFIaPnLRJ3JvTLrsZYgoJAqFxGqwsCqal0FVJQ5wELCNG41EyZHRh5ijECL1EnqAv002X2eUZsumipXRR6Mj2Ipf+MPDznXM4Z/64UNKg7187raXlR49XVtfaT54+e/6is/7y0OSl5mLIc5Xr45gZoWQmhihRieNCC5bGShzFZx+a+tG50Ebm2QFOCzFO2UkmE8kZWhR1ov2oClFcYPX+4/5BXZN3JEw049Wm7/XpdogyFYZQr9ffci3aoTsL1Pd23265oZrkaOqqaXAb5FIv6DXOdwMvCOqo0/U9fyby0NCF6Q52/15+BYC9qHMVTnJepiJDrpgxI+oXOK6YRsmVqNthaUTB+Bk7ESNrM2aPGVezIGryxpIJSXJtX4ZkRm9PVCw1ZprGtjNleGru1xr4v9qoxCQYVzIrShQZny9KSkUwJ02qZCK14Kim1jCupb2V8FNmc0SbfduGQO9/+aE57HnU9+gX2h18hrlW4TVswCZQ6MMAPsEeDIHDN7iBn/DL+e78cH47f+atLWcx8wruqNX6B8dUrVw=</latexit>
sha1_base64="RInTcZkWiVBnf/ncBstCvatCtG4=">AAACSHicbZDPShxBEMZ7Nproxugaj14al4AyMEyvyoyHwGIQPImKq8LOMvT09mhjzx+6a0KWYV4iL5EnySXH3HwGLx4U8SDYs7sHo/mg4eNXVVT1F+VSaHDda6vxbmb2/Ye5+ebHhU+LS63lz6c6KxTjPZbJTJ1HVHMpUt4DAZKf54rTJJL8LLr6VtfPvnOlRZaewCjng4RepCIWjIJBYSs8DssA+A8od/eOT6oKf8VBrCgr113HI5sBiIRrTJyOt2EbtE22p8hzdrY27EAOM9BVWTfYNbKJ43dq59q+4/tV2Gq7jjsWfmvI1LS7O08/f3nLi4dh628wzFiR8BSYpFr3iZvDoKQKBJO8agaF5jllV/SC941NqTlmUI6DqPAXQ4Y4zpR5KeAxfTlR0kTrURKZzoTCpX5dq+H/av0CYn9QijQvgKdssiguJIYM16nioVCcgRwZQ5kS5lbMLqnJEUz2TRMCef3lt+a04xDXIUek3T1AE82hVbSG1hFBHuqifXSIeoih3+gG3aF76491az1Yj5PWhjWdWUH/qNF4BkPYrbk=</latexit>
1.27+7.94+1.82+7.90+8.88

Candidate x̂ <latexit sha1_base64="5QTnVRVSrnyzznVU7d5bF5u03Iw=">AAAB7nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0GPRi8cK9gPaUDbbTbt0swm7E7GE/ggvHhTx6u/x5r9x0+agrQ8GHu/NMDMvSKQw6LrfTmltfWNzq7xd2dnd2z+oHh61TZxqxlsslrHuBtRwKRRvoUDJu4nmNAok7wST29zvPHJtRKwecJpwP6IjJULBKFqp0x9TzJ5mg2rNrbtzkFXiFaQGBZqD6ld/GLM04gqZpMb0PDdBP6MaBZN8VumnhieUTeiI9yxVNOLGz+bnzsiZVYYkjLUthWSu/p7IaGTMNApsZ0RxbJa9XPzP66UYXvuZUEmKXLHFojCVBGOS/06GQnOGcmoJZVrYWwkbU00Z2oQqNgRv+eVV0r6oe27du7+sNW6KOMpwAqdwDh5cQQPuoAktYDCBZ3iFNydxXpx352PRWnKKmWP4A+fzB7A8j8k=</latexit>


8.88

it is freezing today idf


weights

Candidate

Figure 1: Illustration of the computation of the recall metric R BERT . Given the reference x and
Figure 11.18
candidate The computation
x̂, we compute of BERT
BERT embeddings andSCORE recall
pairwise cosinefrom reference
similarity. x and candidate
We highlight the greedy x̂,
from Figure 1 in Zhang et al. (2020). This version shows an
matching in red, and include the optional idf importance weighting.extended version of the metric in
which tokens are also weighted by their idf values.

11.9 Bias and Ethical Issues

Machine translation raises many of the same ethical issues that we’ve discussed in
earlier chapters. For example, consider MT systems translating from Hungarian
(which has the gender neutral pronoun ő) or Spanish (which often drops pronouns)
into English (in which pronouns are obligatory, and they have grammatical gender).
When translating a reference to a person described without specified gender, MT
systems often default to male gender (Schiebinger 2014, Prates et al. 2019). And
MT systems often assign gender according to cultural stereotypes of the sort we saw
in Section 6.11. Fig. 11.19 shows examples from Prates et al. (2019), in which
Hungarian gender-neutral ő is a nurse is translated with she, but gender-neutral ő
is a CEO is translated with he. Prates et al. (2019) find that these stereotypes can’t
be completely accounted for by gender bias in US labor statistics, because the biases
are amplified by MT systems, with pronouns being mapped to male or female
gender with a probability higher than if the mapping was based on actual labor
employment statistics.

Hungarian (gender neutral) source    English MT output
ő egy ápoló                          she is a nurse
ő egy tudós                          he is a scientist
ő egy mérnök                         he is an engineer
ő egy pék                            he is a baker
ő egy tanár                          she is a teacher
ő egy esküvőszervező                 she is a wedding organizer
ő egy vezérigazgató                  he is a CEO
Figure 11.19 When translating from gender-neutral languages like Hungarian into English,
current MT systems interpret people from traditionally male-dominated occupations as male,
and traditionally female-dominated occupations as female (Prates et al., 2019).
Similarly, a recent challenge set, the WinoMT dataset (Stanovsky et al., 2019)
shows that MT systems perform worse when they are asked to translate sentences
that describe people with non-stereotypical gender roles, like “The doctor asked the
nurse to help her in the operation”.
Many open ethical issues in MT require further research. One is the need for better
metrics for knowing what our systems don’t know. MT systems can be used in
urgent situations where human translators may be unavailable or delayed: in medical
domains, to help translate when patients and doctors don’t speak the same language,
or in legal domains, to help judges or lawyers communicate with witnesses or
defendants. In order to ‘do no harm’, systems need ways to assign confidence values
to candidate translations, so they can abstain from giving incorrect translations that
may cause harm.
Another is the need for low-resource algorithms that can do translation to and
from the vast majority of the world’s languages, which do not have large parallel
texts available for training. This problem is exacerbated by the fact that cross-lingual
transfer and multilingual approaches to MT tend to focus on the case where one
of the languages is English (Anastasopoulos and Neubig, 2020). ∀ et al. (2020)
propose a participatory design process to encourage content creators, curators, and
language technologists who speak these low-resourced languages to participate in
development of MT algorithms. Their method uses online groups, mentoring, and
online infrastructure, and they report on a case study on developing MT algorithms
for low-resource African languages.

11.10 Summary
Machine translation is one of the most widely used applications of NLP, and the
encoder-decoder model, first developed for MT, is a key tool that has applications
throughout NLP.
• Languages have divergences, both structural and lexical, that make translation
difficult.
• The linguistic field of typology investigates some of these differences; lan-
guages can be classified by their position along typological dimensions like
whether verbs precede their objects.
• Encoder-decoder networks are composed of an encoder network that takes
an input sequence and creates a contextualized representation of it, the con-
text. This context representation is then passed to a decoder which generates
a task-specific output sequence.
• The attention mechanism enriches the context vector by allowing the decoder
to view information from all the hidden states of the encoder, not just the last
hidden state.
• The encoder-decoder architecture can be implemented by RNNs or by Trans-
formers.
• For the decoder, choosing the single most probable token to generate at each
step is called greedy decoding.
• In beam search, instead of choosing the best token to generate at each timestep,
we keep k possible tokens at each step. This fixed-size memory footprint k is
called the beam width.
• Machine translation models are trained on a parallel corpus, sometimes called
a bitext, a text that appears in two (or more) languages.
• Backtranslation is a way of making use of monolingual corpora in the target
language by running a pilot MT engine backwards to create synthetic bitexts.
• MT is evaluated by measuring a translation’s adequacy (how well it captures
the meaning of the source sentence) and fluency (how fluent or natural it is
in the target language). Human evaluation is the gold standard, but automatic
evaluation metrics like BLEU, which measure word or n-gram overlap with
human translations, or more recent metrics based on embedding similarity, are
also commonly used.
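The difference between greedy decoding and beam search summarized above can be sketched as follows. The toy probability table is a hypothetical stand-in for a trained decoder, and the sketch omits end-of-sequence handling and length normalization; greedy decoding is simply the case where the beam width is 1.

```python
import math

# Hypothetical toy "decoder": maps a prefix (tuple of tokens) to a
# distribution over next tokens. A real decoder would be a neural network.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, beam_width, num_steps):
    """Keep the beam_width highest-scoring prefixes at every step."""
    beam = [((), 0.0)]  # pairs of (prefix, log probability)
    for _ in range(num_steps):
        candidates = []
        for prefix, score in beam:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Prune to the best beam_width hypotheses.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:beam_width]
    return beam[0]

greedy_seq, greedy_score = beam_search(TOY_MODEL, beam_width=1, num_steps=2)
beam_seq, beam_score = beam_search(TOY_MODEL, beam_width=2, num_steps=2)
print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

In this toy table the greedy choice commits to the locally best first token and ends with a lower-probability sequence than the width-2 beam, which keeps the second-best first token alive long enough to find a better continuation.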

Bibliographical and Historical Notes


MT was proposed seriously by the late 1940s, soon after the birth of the computer
(Weaver, 1955). In 1954, the first public demonstration of an MT system proto-
type (Dostert, 1955) led to great excitement in the press (Hutchins, 1997). The next
decade saw a great flowering of ideas, prefiguring most subsequent developments.
But this work was ahead of its time—implementations were limited by, for exam-
ple, the fact that pending the development of disks there was no good way to store
dictionary information.
As high-quality MT proved elusive (Bar-Hillel, 1960), there grew a consensus
on the need for better evaluation and more basic research in the new fields of formal
and computational linguistics. This consensus culminated in the famous ALPAC
(Automatic Language Processing Advisory Committee) report of 1966 (Pierce et al.,
1966) that led in the mid-1960s to a dramatic cut in funding for MT in the US. As
MT research lost academic respectability, the Association for Machine Translation
and Computational Linguistics dropped MT from its name. Some MT developers,
however, persevered, and there were early industrial engines like Systran, as well as
early MT systems like Météo, which translated weather forecasts from English to
French (Chandioux, 1976).
In early years, the space of MT architectures spanned three general models.
In perhaps the earliest developed method, direct translation, the system proceeds
word-by-word through the source-language text, translating each word incremen-
tally. Direct translation uses a large bilingual dictionary, each of whose entries is
a small program with the job of translating one word. In transfer approaches, we
first parse the input text and then apply rules to transform the source-language parse
into a target language parse. We then generate the target language sentence from the
parse tree. In interlingua approaches, we analyze the source language text into some
abstract meaning representation, called an interlingua. We then generate into the
target language from this interlingual representation. A common way to visualize
these three early approaches was the Vauquois triangle shown in Fig. 11.20. The
triangle shows the increasing depth of analysis required (on both the analysis and
generation end) as we move from the direct approach through transfer approaches
to interlingual approaches. In addition, it shows the decreasing amount of transfer
knowledge needed as we move up the triangle, from huge amounts of transfer at
the direct level (almost all knowledge is transfer knowledge for each word) through
transfer (transfer rules only for parse trees or thematic roles) through interlingua
(no specific transfer knowledge). We can view the encoder-decoder network as an
interlingual approach, with attention acting as an integration of direct and transfer,
allowing words or their representations to be directly accessed by the decoder.
Statistical methods began to be applied around 1990, enabled first by the devel-
opment of large bilingual corpora like the Hansard corpus of the proceedings of the
Canadian Parliament, which are kept in both French and English, and then by the
growth of the Web. Early on, a number of researchers showed that it was possible
to extract pairs of aligned sentences from bilingual corpora, using words or simple
cues like sentence length (Kay and Röscheisen 1988, Gale and Church 1991, Gale
and Church 1993, Kay and Röscheisen 1993).
At the same time, the IBM group, drawing directly on the noisy channel model
for speech recognition, proposed algorithms for statistical MT, algorithms that became
known as IBM Models 1 through 5, implemented in the Candide system.

[Figure: the Vauquois triangle. Direct translation connects source text to target text
along the base; transfer connects the source and target semantic/syntactic structures
at the middle level; the interlingua is at the apex. The left side is labeled source
language analysis and the right side target language generation.]
Figure 11.20 The Vauquois (1968) triangle.

The algorithms (except for the decoder) were published in full detail, encouraged
by the US government, which had partially funded the work, and this gave them a
huge impact on the research community (Brown et al. 1990, Brown et al. 1993).
By the turn of the century, most academic research on machine translation used the
statistical noisy channel model. Progress was made much easier by the development
of publicly available toolkits, like the GIZA toolkit (Och and Ney, 2003), which
implements IBM models 1–5 as well as the HMM alignment model.
Around the turn of the century, an extended approach, called phrase-based
translation, was developed, which was based on inducing translations for phrase
pairs (Och 1998, Marcu and Wong 2002, Koehn et al. 2003, Och and Ney 2004,
Deng and Byrne 2005, inter alia). A log-linear formulation (Och and Ney, 2004)
was trained to directly optimize evaluation metrics like BLEU in a method known
as Minimum Error Rate Training, or MERT (Och, 2003), also drawing from
speech recognition models (Chou et al., 1993). Popular toolkits like Moses were
developed (Koehn et al. 2006, Zens and Ney 2007).
There were also approaches around the turn of the century that were based on
syntactic structure (Chapter 12). Models based on transduction grammars (also
called synchronous grammars) assign a parallel syntactic tree structure to a pair of
sentences in different languages, with the goal of translating the sentences by
applying reordering operations on the trees. From a generative perspective, we can
view a transduction grammar as generating pairs of aligned sentences in two
languages. Some of the most widely used models included the inversion transduction
grammar (Wu, 1996) and synchronous context-free grammars (Chiang, 2005).
The modern encoder-decoder approach to MT was introduced in work such as
Kalchbrenner and Blunsom (2013), Cho et al. (2014), and Sutskever et al. (2014).
Beam search has an interesting relationship with human language processing;
Meister et al. (2020) show that beam search enforces the cognitive property of
uniform information density in text. Uniform information density is the hypothesis
that human language processors tend to prefer to distribute information equally
across the sentence (Jaeger and Levy, 2007).
Research on evaluation of machine translation began quite early. Miller and
Beebe-Center (1958) proposed a number of methods drawing on work in psycholin-
guistics. These included the use of cloze and Shannon tasks to measure intelligibil-
ity as well as a metric of edit distance from a human translation, the intuition that
underlies all modern automatic evaluation metrics like BLEU. The ALPAC report
included an early evaluation study conducted by John Carroll that was extremely in-
fluential (Pierce et al., 1966, Appendix 10). Carroll proposed distinct measures for
fidelity and intelligibility, and had raters score them subjectively on 9-point scales.

More recent work on evaluation has focused on coming up with automatic metrics,
including the work on BLEU discussed in Section 11.8.2 (Papineni et al., 2002), as
well as related measures like NIST (Doddington, 2002), TER (Translation Error
Rate) (Snover et al., 2006), Precision and Recall (Turian et al., 2003), and
METEOR (Banerjee and Lavie, 2005).
Good surveys of the early history of MT are Hutchins (1986) and Hutchins (1997).
Nirenburg et al. (2002) is a collection of early readings in MT.
See Croft (1990) or Comrie (1989) for introductions to typology.

Exercises
CHAPTER 12 Constituency Grammars

If on a winter’s night a traveler by Italo Calvino


Nuclear and Radiochemistry by Gerhart Friedlander et al.
The Fire Next Time by James Baldwin
A Tad Overweight, but Violet Eyes to Die For by G. B. Trudeau
Sometimes a Great Notion by Ken Kesey
Dancer from the Dance by Andrew Holleran
Six books in English whose titles are not
constituents, from Pullum (1991, p. 195)

The study of grammar has an ancient pedigree; Panini’s grammar of Sanskrit
was written over two thousand years ago and is still referenced today in teaching
Sanskrit. And our word syntax comes from the Greek sýntaxis, meaning “setting
out together or arrangement”, and refers to the way words are arranged together. We
have seen various syntactic notions in previous chapters: ordering of sequences of
words (Chapter 2), probabilities for these word sequences (Chapter 3), and the use of
part-of-speech categories as a grammatical equivalence class for words (Chapter 8).
In this chapter and the next three we introduce a variety of syntactic phenomena that
go well beyond these simpler approaches, together with formal models for capturing
them in a computationally useful manner.
The bulk of this chapter is devoted to context-free grammars. Context-free gram-
mars are the backbone of many formal models of the syntax of natural language (and,
for that matter, of computer languages). As such, they play a role in many computa-
tional applications, including grammar checking, semantic interpretation, dialogue
understanding, and machine translation. They are powerful enough to express so-
phisticated relations among the words in a sentence, yet computationally tractable
enough that efficient algorithms exist for parsing sentences with them (as we show
in Chapter 13). And in Chapter 16 we show how they provide a systematic frame-
work for semantic interpretation. Here we also introduce the concept of lexicalized
grammars, focusing on one example, combinatory categorial grammar, or CCG.
In Chapter 14 we introduce a formal model of grammar called syntactic depen-
dencies that is an alternative to these constituency grammars, and we’ll give algo-
rithms for dependency parsing. Both constituency and dependency formalisms are
important for language processing.
Finally, we provide a brief overview of the grammar of English, illustrated from
a domain with relatively simple sentences called ATIS (Air Travel Information
System) (Hemphill et al., 1990). ATIS systems were early spoken language systems
that let users book flights by saying sentences like I’d like to fly to Atlanta.

12.1 Constituency
Syntactic constituency is the idea that groups of words can behave as single units,
or constituents. Part of developing a grammar involves building an inventory of the
constituents in the language. How do words group together in English? Consider
the noun phrase, a sequence of words surrounding at least one noun. Here are some
examples of noun phrases (thanks to Damon Runyon):

Harry the Horse         a high-class spot such as Mindy’s
the Broadway coppers    the reason he comes into the Hot Box
they                    three parties from Brooklyn
What evidence do we have that these words group together (or “form constituents”)?
One piece of evidence is that they can all appear in similar syntactic environments,
for example, before a verb.

three parties from Brooklyn arrive. . .


a high-class spot such as Mindy’s attracts. . .
the Broadway coppers love. . .
they sit
But while the whole noun phrase can occur before a verb, this is not true of each
of the individual words that make up a noun phrase. The following are not grammat-
ical sentences of English (recall that we use an asterisk (*) to mark fragments that
are not grammatical English sentences):

*from arrive. . .       *as attracts. . .
*the is. . .            *spot sat. . .
Thus, to correctly describe facts about the ordering of these words in English, we
must be able to say things like “Noun Phrases can occur before verbs”.
Other kinds of evidence for constituency come from what are called preposed or
postposed constructions. For example, the prepositional phrase on September
seventeenth can be placed in a number of different locations in the following examples,
including at the beginning (preposed) or at the end (postposed):
On September seventeenth, I’d like to fly from Atlanta to Denver
I’d like to fly on September seventeenth from Atlanta to Denver
I’d like to fly from Atlanta to Denver on September seventeenth
But again, while the entire phrase can be placed differently, the individual words
making up the phrase cannot be:
*On September, I’d like to fly seventeenth from Atlanta to Denver
*On I’d like to fly September seventeenth from Atlanta to Denver
*I’d like to fly on September from Atlanta to Denver seventeenth

12.2 Context-Free Grammars


The most widely used formal system for modeling constituent structure in English
and other natural languages is the Context-Free Grammar, or CFG. Context-free
grammars are also called Phrase-Structure Grammars, and the formalism
is equivalent to Backus-Naur Form, or BNF. The idea of basing a grammar on
constituent structure dates back to the psychologist Wilhelm Wundt (1900) but was
not formalized until Chomsky (1956) and, independently, Backus (1959).
A context-free grammar consists of a set of rules or productions, each of which
expresses the ways that symbols of the language can be grouped and ordered
together, and a lexicon of words and symbols. For example, the following productions
express that an NP (or noun phrase) can be composed of either a ProperNoun or
a determiner (Det) followed by a Nominal; a Nominal in turn can consist of one or
more Nouns.

NP → Det Nominal
NP → ProperNoun
Nominal → Noun | Nominal Noun

Context-free rules can be hierarchically embedded, so we can combine the previous
rules with others, like the following, that express facts about the lexicon:

Det → a
Det → the
Noun → flight

The symbols that are used in a CFG are divided into two classes. The symbols
that correspond to words in the language (“the”, “nightclub”) are called terminal
symbols; the lexicon is the set of rules that introduce these terminal symbols. The
symbols that express abstractions over these terminals are called non-terminals. In
each context-free rule, the item to the right of the arrow (→) is an ordered list of one
or more terminals and non-terminals; to the left of the arrow is a single non-terminal
symbol expressing some cluster or generalization. The non-terminal associated with
each word in the lexicon is its lexical category, or part of speech.
A CFG can be thought of in two ways: as a device for generating sentences
and as a device for assigning a structure to a given sentence. Viewing a CFG as a
generator, we can read the → arrow as “rewrite the symbol on the left with the string
of symbols on the right”.
So starting from the symbol: NP
we can use our first rule to rewrite NP as: Det Nominal
and then rewrite Nominal as: Det Noun
and finally rewrite these parts-of-speech as: a flight
We say the string a flight can be derived from the non-terminal NP. Thus, a CFG
can be used to generate a set of strings. This sequence of rule expansions is called a
derivation of the string of words. It is common to represent a derivation by a parse
tree (commonly shown inverted with the root at the top). Figure 12.1 shows the tree
representation of this derivation.
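The rewriting process just described can be sketched in a few lines. This is a minimal illustration, assuming rules are stored as (left-hand side, right-hand side) pairs; the helper names `rewrite` and `derive` are invented for the example. Unlike the text, which rewrites both parts-of-speech in one step, the sketch applies one rule at a time.

```python
# Rules from the text, each written as (left-hand side, right-hand side).
RULES = [
    ("NP", ["Det", "Nominal"]),
    ("NP", ["ProperNoun"]),
    ("Nominal", ["Noun"]),
    ("Nominal", ["Nominal", "Noun"]),
    ("Det", ["a"]),
    ("Det", ["the"]),
    ("Noun", ["flight"]),
]

def rewrite(symbols, rule):
    """Apply one rule to the leftmost occurrence of its left-hand side."""
    lhs, rhs = rule
    i = symbols.index(lhs)  # raises ValueError if the rule does not apply
    return symbols[:i] + rhs + symbols[i + 1:]

def derive(start, rule_sequence):
    """Return every intermediate string of a derivation."""
    steps = [[start]]
    for rule in rule_sequence:
        steps.append(rewrite(steps[-1], rule))
    return steps

# NP => Det Nominal => Det Noun => a Noun => a flight
steps = derive("NP", [RULES[0], RULES[2], RULES[4], RULES[6]])
for step in steps:
    print(" ".join(step))
```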
In the parse tree shown in Fig. 12.1, we can say that the node NP dominates
all the nodes in the tree (Det, Nom, Noun, a, flight). We can say further that it
immediately dominates the nodes Det and Nom.
The formal language defined by a CFG is the set of strings that are derivable
from the designated start symbol. Each grammar must have one designated start
symbol, which is often called S. Since context-free grammars are often used to define
sentences, S is usually interpreted as the “sentence” node, and the set of strings that
are derivable from S is the set of sentences in some simplified version of English.

NP
    Det
        a
    Nom
        Noun
            flight

Figure 12.1 A parse tree for “a flight”.

Let’s add a few additional rules to our inventory. The following rule expresses
the fact that a sentence can consist of a noun phrase followed by a verb phrase:

S → NP VP                 I prefer a morning flight

A verb phrase in English consists of a verb followed by assorted other things;
for example, one kind of verb phrase consists of a verb followed by a noun phrase:

VP → Verb NP              prefer a morning flight

Or the verb may be followed by a noun phrase and a prepositional phrase:

VP → Verb NP PP           leave Boston in the morning

Or the verb phrase may have a verb followed by a prepositional phrase alone:

VP → Verb PP              leaving on Thursday

A prepositional phrase generally has a preposition followed by a noun phrase.
For example, a common type of prepositional phrase in the ATIS corpus is used to
indicate location or direction:

PP → Preposition NP       from Los Angeles

The NP inside a PP need not be a location; PPs are often used with times and
dates, and with other nouns as well; they can be arbitrarily complex. Here are ten
examples from the ATIS corpus:
to Seattle              on these flights
in Minneapolis          about the ground transportation in Chicago
on Wednesday            of the round trip flight on United Airlines
in the evening          of the AP fifty seven flight
on the ninth of July    with a stopover in Nashville
Figure 12.2 gives a sample lexicon, and Fig. 12.3 summarizes the grammar rules
we’ve seen so far, which we’ll call L0. Note that we can use the or-symbol | to
indicate that a non-terminal has alternate possible expansions.
We can use this grammar to generate sentences of this “ATIS-language”. We
start with S, expand it to NP VP, then choose a random expansion of NP (let’s say, to
I), and a random expansion of VP (let’s say, to Verb NP), and so on until we generate
the string I prefer a morning flight. Figure 12.4 shows a parse tree that represents a
complete derivation of I prefer a morning flight.
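The generation procedure described above can be sketched as a recursive random expansion. The dictionary below encodes the rules of L0 with only a few lexicon entries, and the depth cap that forces the shortest expansion is an assumption added here so the recursion always terminates; neither the function name nor the cap comes from the text.

```python
import random

# Rules of L0 (Fig. 12.3) plus a handful of lexicon entries (Fig. 12.2).
L0 = {
    "S": [["NP", "VP"]],
    "NP": [["Pronoun"], ["Proper-Noun"], ["Det", "Nominal"]],
    "Nominal": [["Nominal", "Noun"], ["Noun"]],
    "VP": [["Verb"], ["Verb", "NP"], ["Verb", "NP", "PP"], ["Verb", "PP"]],
    "PP": [["Preposition", "NP"]],
    "Noun": [["flights"], ["morning"], ["trip"]],
    "Verb": [["prefer"], ["want"], ["need"]],
    "Pronoun": [["I"], ["you"]],
    "Proper-Noun": [["Baltimore"], ["Chicago"]],
    "Det": [["a"], ["the"]],
    "Preposition": [["from"], ["to"]],
}

def generate(symbol, rng, depth=0, max_depth=6):
    """Expand symbol top-down, choosing among its expansions at random."""
    if symbol not in L0:  # a terminal, i.e. an actual word
        return [symbol]
    options = L0[symbol]
    if depth >= max_depth:
        # Assumption: past the cap, always take the shortest expansion.
        options = [min(options, key=len)]
    words = []
    for child in rng.choice(options):
        words.extend(generate(child, rng, depth + 1, max_depth))
    return words

rng = random.Random(0)  # fixed seed so repeated runs give the same sentence
sentence = generate("S", rng)
print(" ".join(sentence))
```

Because the choices are random, most outputs are grammatical according to L0 but semantically odd, which is expected for a grammar this small.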
We can also represent a parse tree in a more compact format called bracketed
notation; here is the bracketed representation of the parse tree of Fig. 12.4:

Noun        → flights | breeze | trip | morning
Verb        → is | prefer | like | need | want | fly
Adjective   → cheapest | non-stop | first | latest
              | other | direct
Pronoun     → me | I | you | it
Proper-Noun → Alaska | Baltimore | Los Angeles
              | Chicago | United | American
Determiner  → the | a | an | this | these | that
Preposition → from | to | on | near
Conjunction → and | or | but
Figure 12.2 The lexicon for L0.

Grammar Rules                 Examples

S → NP VP                     I + want a morning flight

NP → Pronoun                  I
    | Proper-Noun             Los Angeles
    | Det Nominal             a + flight
Nominal → Nominal Noun        morning + flight
    | Noun                    flights

VP → Verb                     do
    | Verb NP                 want + a flight
    | Verb NP PP              leave + Boston + in the morning
    | Verb PP                 leaving + on Thursday

PP → Preposition NP           from + Los Angeles

Figure 12.3 The grammar for L0, with example phrases for each rule.

S
    NP
        Pro  I
    VP
        Verb  prefer
        NP
            Det  a
            Nom
                Nom
                    Noun  morning
                Noun  flight

Figure 12.4 The parse tree for “I prefer a morning flight” according to grammar L0.

(12.1) [S [NP [Pro I]] [VP [V prefer] [NP [Det a] [Nom [Nom [N morning]] [N flight]]]]]
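A small sketch of how a parse tree can be rendered in bracketed notation. It assumes a tree is stored as a nested tuple (label, child, ...) with plain strings at the leaves, which is a representation chosen for this example; the tree follows the Nominal → Nominal Noun rule of L0, as in Fig. 12.4.

```python
def to_brackets(tree):
    """Render a (label, child, ...) tuple tree in bracketed notation."""
    if isinstance(tree, str):  # a word at a leaf
        return tree
    label, *children = tree
    return "[" + label + " " + " ".join(to_brackets(c) for c in children) + "]"

def leaves(tree):
    """Read the words off the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]

tree = ("S",
        ("NP", ("Pro", "I")),
        ("VP", ("V", "prefer"),
               ("NP", ("Det", "a"),
                      ("Nom", ("Nom", ("N", "morning")), ("N", "flight")))))

print(to_brackets(tree))
print(" ".join(leaves(tree)))
```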
A CFG like that of L0 defines a formal language. We saw in Chapter 2 that a for-
mal language is a set of strings. Sentences (strings of words) that can be derived by a
grammar are in the formal language defined by that grammar, and are called
grammatical sentences. Sentences that cannot be derived by a given formal grammar are
not in the language defined by that grammar and are referred to as ungrammatical.

This hard line between “in” and “out” characterizes all formal languages but is only
a very simplified model of how natural languages really work. This is because de-
termining whether a given sentence is part of a given natural language (say, English)
often depends on the context. In linguistics, the use of formal languages to model
natural languages is called generative grammar since the language is defined by
the set of possible sentences “generated” by the grammar.

12.2.1 Formal Definition of Context-Free Grammar


We conclude this section with a quick, formal description of a context-free gram-
mar and the language it generates. A context-free grammar G is defined by four
parameters: N, Σ, R, S (technically this is a “4-tuple”).

N    a set of non-terminal symbols (or variables)
Σ    a set of terminal symbols (disjoint from N)
R    a set of rules or productions, each of the form A → β,
     where A is a non-terminal,
     β is a string of symbols from the infinite set of strings (Σ ∪ N)∗
S    a designated start symbol and a member of N
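The 4-tuple above maps naturally onto a small data structure. The following is a sketch under the assumption that symbols are strings and each rule is a pair (A, β) with β a tuple of symbols; the class name `CFG`, its field names, and the `check` method are invented for this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CFG:
    nonterminals: frozenset  # N
    terminals: frozenset     # Σ
    rules: tuple             # R: pairs (A, β), with β a tuple of symbols
    start: str               # S

    def check(self):
        """Return a list of violations of the formal definition."""
        problems = []
        if self.nonterminals & self.terminals:
            problems.append("N and Σ must be disjoint")
        if self.start not in self.nonterminals:
            problems.append("S must be a member of N")
        symbols = self.nonterminals | self.terminals
        for lhs, rhs in self.rules:
            if lhs not in self.nonterminals:
                problems.append(f"left-hand side {lhs!r} is not in N")
            if any(sym not in symbols for sym in rhs):
                problems.append(f"right-hand side of {lhs!r} leaves Σ ∪ N")
        return problems

toy = CFG(
    nonterminals=frozenset({"NP", "Det", "Nominal", "Noun"}),
    terminals=frozenset({"a", "the", "flight"}),
    rules=(
        ("NP", ("Det", "Nominal")),
        ("Nominal", ("Noun",)),
        ("Nominal", ("Nominal", "Noun")),
        ("Det", ("a",)),
        ("Det", ("the",)),
        ("Noun", ("flight",)),
    ),
    start="NP",
)
print(toy.check())
```

An empty list from `check` means the grammar satisfies the three conditions in the definition: N and Σ are disjoint, S is in N, and every rule has a non-terminal on the left and a string over Σ ∪ N on the right.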

For the remainder of the book we adhere to the following conventions when dis-
cussing the formal properties of context-free grammars (as opposed to explaining
particular facts about English or other languages).
Capital letters like A, B, and S               Non-terminals
S                                              The start symbol
Lower-case Greek letters like α, β, and γ      Strings drawn from (Σ ∪ N)∗
Lower-case Roman letters like u, v, and w      Strings of terminals

A language is defined through the concept of derivation. One string derives an-
other one if it can be rewritten as the second one by some series of rule applications.
More formally, following Hopcroft and Ullman (1979),
if A → β is a production of R and α and γ are any strings in the set
(Σ ∪ N)∗, then we say that αAγ directly derives αβγ, or αAγ ⇒ αβγ.
Derivation is then a generalization of direct derivation:
Let α1 , α2 , . . . , αm be strings in (Σ ∪ N)∗ , m ≥ 1, such that

α1 ⇒ α2 , α2 ⇒ α3 , . . . , αm−1 ⇒ αm

We say that α1 derives αm, or α1 ⇒∗ αm.
We can then formally define the language LG generated by a grammar G as the
set of strings composed of terminal symbols that can be derived from the designated
start symbol S.

LG = {w | w is in Σ∗ and S ⇒∗ w}
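The definitions of direct derivation and of the generated language can be made concrete with a bounded search. In this sketch the toy rule set, the function names, and the length bound `max_len` are assumptions for illustration; the bound is needed because the full language is infinite, and pruning by length is safe here only because no rule in this toy grammar shortens a string.

```python
from collections import deque

RULES = [
    ("NP", ("Det", "Nominal")),
    ("Nominal", ("Noun",)),
    ("Nominal", ("Nominal", "Noun")),
    ("Det", ("a",)),
    ("Det", ("the",)),
    ("Noun", ("flight",)),
    ("Noun", ("morning",)),
]
NONTERMINALS = {lhs for lhs, _ in RULES}

def directly_derives(form):
    """All strings reachable from form by applying one rule once."""
    results = set()
    for i, symbol in enumerate(form):
        for lhs, rhs in RULES:
            if symbol == lhs:
                results.add(form[:i] + rhs + form[i + 1:])
    return results

def language(start, max_len):
    """Terminal strings of length <= max_len derivable from start."""
    seen, queue, sentences = {(start,)}, deque([(start,)]), set()
    while queue:
        form = queue.popleft()
        if not any(sym in NONTERMINALS for sym in form):
            sentences.add(" ".join(form))  # only terminals remain
            continue
        for nxt in directly_derives(form):
            # No rule here shrinks a string, so longer forms can be pruned.
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sentences

print(sorted(language("NP", 2)))
```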
The problem of mapping from a string of words to its parse tree is called syntactic
parsing; we define algorithms for constituency parsing in Chapter 13.

12.3 Some Grammar Rules for English


In this section, we introduce a few more aspects of the phrase structure of English;
for consistency we will continue to focus on sentences from the ATIS domain. Be-
cause of space limitations, our discussion is necessarily limited to highlights. Read-
ers are strongly advised to consult a good reference grammar of English, such as
Huddleston and Pullum (2002).

12.3.1 Sentence-Level Constructions


In the small grammar L0, we provided only one sentence-level construction for
declarative sentences like I prefer a morning flight. Among the large number of
constructions for English sentences, four are particularly common and important:
declaratives, imperatives, yes-no questions, and wh-questions.
Sentences with declarative structure have a subject noun phrase followed by
a verb phrase, like “I prefer a morning flight”. Sentences with this structure have
a great number of different uses that we follow up on in Chapter 24. Here are a
number of examples from the ATIS domain:
I want a flight from Ontario to Chicago
The flight should be eleven a.m. tomorrow
The return flight should leave at around seven p.m.
Sentences with imperative structure often begin with a verb phrase and have
no subject. They are called imperative because they are almost always used for
commands and suggestions; in the ATIS domain they are commands to the system.
Show the lowest fare
Give me Sunday’s flights arriving in Las Vegas from New York City
List all flights between five and seven p.m.
We can model this sentence structure with another rule for the expansion of S:

S → VP

Sentences with yes-no question structure are often (though not always) used to
ask questions; they begin with an auxiliary verb, followed by a subject NP, followed
by a VP. Here are some examples. Note that the third example is not a question at
all but a request; Chapter 24 discusses the uses of these question forms to perform
different pragmatic functions such as asking, requesting, or suggesting.
Do any of these flights have stops?
Does American’s flight eighteen twenty five serve dinner?
Can you give me the same information for United?
Here’s the rule:

S → Aux NP VP

The most complex sentence-level structures we examine here are the various
wh-structures. These are so named because one of their constituents is a wh-phrase, that
is, one that includes a wh-word (who, whose, when, where, what, which, how, why).
These may be broadly grouped into two classes of sentence-level structures. The
wh-subject-question structure is identical to the declarative structure, except that
the first noun phrase contains some wh-word.

What airlines fly from Burbank to Denver?


Which flights depart Burbank after noon and arrive in Denver by six p.m.?
Whose flights serve breakfast?
Here is a rule. Exercise 12.7 discusses rules for the constituents that make up the
Wh-NP.
S → Wh-NP VP
In the wh-non-subject-question structure, the wh-phrase is not the subject of the
sentence, and so the sentence includes another subject. In these types of sentences
the auxiliary appears before the subject NP, just as in the yes-no question structures.
Here is an example followed by a sample rule:
What flights do you have from Burbank to Tacoma Washington?

S → Wh-NP Aux NP VP
Constructions like the wh-non-subject-question contain what are called long-distance
dependencies because the Wh-NP what flights is far away from the predicate
that it is semantically related to, the main verb have in the VP. In some models
of parsing and understanding compatible with the grammar rule above, long-distance
dependencies like the relation between flights and have are thought of as a semantic
relation. In such models, the job of figuring out that flights is the argument of have is
done during semantic interpretation. Other models of parsing represent the relation-
ship between flights and have as a syntactic relation, and the grammar is modified to
insert a small marker called a trace or empty category after the verb. We discuss
empty-category models when we introduce the Penn Treebank on page 245.
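The five S rules given in this section can be collected into a small lookup table. The following is a minimal sketch, assuming the top-level constituents of a sentence have already been identified; the dictionary keys and the function name classify_sentence are illustrative choices, not part of any grammar formalism.

```python
# Right-hand sides of the S rules introduced in this section.
# The keys and the helper name are illustrative, not from the text.
S_RULES = {
    "declarative": ("NP", "VP"),
    "imperative": ("VP",),
    "yes-no question": ("Aux", "NP", "VP"),
    "wh-subject-question": ("Wh-NP", "VP"),
    "wh-non-subject-question": ("Wh-NP", "Aux", "NP", "VP"),
}


def classify_sentence(constituents):
    """Return the name of the S rule whose right-hand side matches the
    sequence of top-level constituents, or None if no rule matches."""
    for name, rhs in S_RULES.items():
        if tuple(constituents) == rhs:
            return name
    return None


# "What flights do you have from Burbank to Tacoma Washington?"
print(classify_sentence(["Wh-NP", "Aux", "NP", "VP"]))
# "Show the lowest fare"
print(classify_sentence(["VP"]))
```

The table only names the construction; it says nothing about pragmatic use, so a request phrased as a yes-no question is still classified by its syntactic form.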

12.3.2 Clauses and Sentences


Before we move on, we should clarify the status of the S rules in the grammars we
just described. S rules are intended to account for entire sentences that stand alone
as fundamental units of discourse. However, S can also occur on the right-hand side
of grammar rules and hence can be embedded within larger sentences. Clearly then,
there’s more to being an S than just standing alone as a unit of discourse.
What differentiates sentence constructions (i.e., the S rules) from the rest of the
grammar is the notion that they are in some sense complete. In this way they correspond
to the notion of a clause, which traditional grammars often describe as forming
a complete thought. One way of making this notion of “complete thought” more
precise is to say an S is a node of the parse tree below which the main verb of the S
has all of its arguments. We define verbal arguments later, but for now let’s just see
an illustration from the tree for I prefer a morning flight in Fig. 12.4 on page 235.
The verb prefer has two arguments: the subject I and the object a morning flight.
One of the arguments appears below the VP node, but the other one, the subject NP,
appears only below the S node.

12.3.3 The Noun Phrase


Our L0 grammar introduced three of the most frequent types of noun phrases that
occur in English: pronouns, proper nouns and the NP → Det Nominal construction.
The central focus of this section is on the last type since that is where the bulk of
the syntactic complexity resides. These noun phrases consist of a head, the central
noun in the noun phrase, along with various modifiers that can occur before or after
the head noun. Let’s take a close look at the various parts.

The Determiner
Noun phrases can begin with simple lexical determiners:
a stop the flights this flight
those flights any flights some flights
The role of the determiner can also be filled by more complex expressions:
United’s flight
United’s pilot’s union
Denver’s mayor’s mother’s canceled flight
In these examples, the role of the determiner is filled by a possessive expression
consisting of a noun phrase followed by an ’s as a possessive marker, as in the
following rule.
Det → NP ’s
The fact that this rule is recursive (since an NP can start with a Det) helps us model
the last two examples above, in which a sequence of possessive expressions serves
as a determiner.
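The recursion between NP and Det can be made concrete with a small bracketing routine. This is only a sketch under simplifying assumptions: the possessive marker is written as the separate ASCII token "'s", and the words between two markers are taken to be the remainder of that noun phrase. The function name bracket_possessives is hypothetical.

```python
def bracket_possessives(tokens):
    """Bracket a noun phrase whose determiner is a chain of possessives.

    Each time a "'s" token is seen, the noun phrase built so far is
    wrapped as a determiner, mirroring the rule Det -> NP 's.
    """
    det = None   # bracketed determiner built so far, if any
    words = []   # words seen since the last possessive marker
    for tok in tokens:
        if tok == "'s":
            inner = " ".join(words) if det is None else f"{det} {' '.join(words)}"
            det = f"[Det [NP {inner}] 's]"
            words = []
        else:
            words.append(tok)
    tail = " ".join(words)
    return f"[NP {tail}]" if det is None else f"[NP {det} {tail}]"


print(bracket_possessives(["United", "'s", "flight"]))
print(bracket_possessives("Denver 's mayor 's mother 's canceled flight".split()))
```

Each additional possessive adds one more level of Det and NP nesting on the left, which is the left-recursive structure the rule predicts.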
Under some circumstances determiners are optional in English. For example,
determiners may be omitted if the noun they modify is plural:
(12.2) Show me flights from San Francisco to Denver on weekdays
As we saw in Chapter 8, mass nouns also don’t require determination. Recall that
mass nouns often (not always) involve something that is treated like a substance
(including e.g., water and snow), don’t take the indefinite article “a”, and don’t tend
to pluralize. Many abstract nouns are mass nouns (music, homework). Mass nouns
in the ATIS domain include breakfast, lunch, and dinner:
(12.3) Does this flight serve dinner?

The Nominal
The nominal construction follows the determiner and contains any pre- and post-
head noun modifiers. As indicated in grammar L0 , in its simplest form a nominal
can consist of a single noun.
Nominal → Noun
As we’ll see, this rule also provides the basis for the bottom of various recursive
rules used to capture more complex nominal constructions.

Before the Head Noun


A number of different kinds of word classes can appear before the head noun but
after the determiner (the “postdeterminers”) in a nominal. These include cardinal
numbers, ordinal numbers, quantifiers, and adjectives. Examples of cardinal
numbers:
two friends one stop
Ordinal numbers include first, second, third, and so on, but also words like next,
last, past, other, and another:
the first one the next day the second leg
the last flight the other American flight
Some quantifiers (many, (a) few, several) occur only with plural count nouns:

many fares
Adjectives occur after quantifiers but before nouns.
a first-class fare a non-stop flight
the longest layover the earliest lunch flight
Adjectives can also be grouped into a phrase called an adjective phrase or AP.
APs can have an adverb before the adjective (see Chapter 8 for definitions of adjec-
tives and adverbs):
the least expensive fare

After the Head Noun


A head noun can be followed by postmodifiers. Three kinds of nominal postmodifiers
are common in English:
  prepositional phrases    all flights from Cleveland
  non-finite clauses       any flights arriving after eleven a.m.
  relative clauses         a flight that serves breakfast
Prepositional phrase postmodifiers are especially common in the ATIS corpus since
they are used to mark the origin and destination of flights.
Here are some examples of prepositional phrase postmodifiers, with brackets
inserted to show the boundaries of each PP; note that two or more PPs can be strung
together within a single NP:
all flights [from Cleveland] [to Newark]
arrival [in San Jose] [before seven p.m.]
a reservation [on flight six oh six] [from Tampa] [to Montreal]
Here’s a new nominal rule to account for postnominal PPs:
Nominal → Nominal PP
The three most common kinds of non-finite postmodifiers are the gerundive (-ing),
-ed, and infinitive forms.
Gerundive postmodifiers are so called because they consist of a verb phrase that
begins with the gerundive (-ing) form of the verb. Here are some examples:
any of those [leaving on Thursday]
any flights [arriving after eleven a.m.]
flights [arriving within thirty minutes of each other]
We can define the Nominals with gerundive modifiers as follows, making use of
a new non-terminal GerundVP:
Nominal → Nominal GerundVP
We can make rules for GerundVP constituents by duplicating all of our VP pro-
ductions, substituting GerundV for V.
GerundVP → GerundV NP
         | GerundV PP
         | GerundV
         | GerundV NP PP
GerundV can then be defined as
GerundV → being | arriving | leaving | . . .
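The duplication of VP productions described above is mechanical, so it can be sketched as a simple rewriting step. The VP rule list below is an assumption (the four simple VP rules used in this chapter), and gerundive_rules is a hypothetical helper name.

```python
# Assumed starting point: the simple VP productions used in this chapter.
VP_RULES = [
    ("VP", ["Verb", "NP"]),
    ("VP", ["Verb", "PP"]),
    ("VP", ["Verb"]),
    ("VP", ["Verb", "NP", "PP"]),
]


def gerundive_rules(vp_rules):
    """Copy each VP production, renaming the left-hand side to GerundVP
    and substituting GerundV for Verb on the right-hand side."""
    derived = []
    for lhs, rhs in vp_rules:
        if lhs != "VP":
            continue  # only VP productions are duplicated
        new_rhs = ["GerundV" if sym == "Verb" else sym for sym in rhs]
        derived.append(("GerundVP", new_rhs))
    return derived


for lhs, rhs in gerundive_rules(VP_RULES):
    print(lhs, "→", " ".join(rhs))
```

Only the verb symbol is renamed; any other constituent on the right-hand side is left as it was.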
The bracketed phrases below are examples of the two other common kinds of
non-finite clauses, infinitives and -ed forms:
the last flight [to arrive in Boston]
I need to have dinner [served]
Which is the aircraft [used by this flight]?
A postnominal relative clause (more correctly a restrictive relative clause) is
a clause that often begins with a relative pronoun (that and who are the most common).
The relative pronoun functions as the subject of the embedded verb in the
following examples:

a flight that serves breakfast


flights that leave in the morning
the one that leaves at ten thirty five
We might add rules like the following to deal with these:
Nominal → Nominal RelClause
RelClause → (who | that) VP
The relative pronoun may also function as the object of the embedded verb, as
in the following example; we leave for the reader the exercise of writing grammar
rules for more complex relative clauses of this kind.

the earliest American Airlines flight that I can get


Various postnominal modifiers can be combined:
a flight [from Phoenix to Detroit] [leaving Monday evening]
evening flights [from Nashville to Houston] [that serve dinner]
a friend [living in Denver] [that would like to visit me in DC]

Before the Noun Phrase


Word classes that modify and appear before NPs are called predeterminers. Many
of these have to do with number or amount; a common predeterminer is all:
all the flights all flights all non-stop flights
The example noun phrase given in Fig. 12.5 illustrates some of the complexity
that arises when these rules are combined.

12.3.4 The Verb Phrase


The verb phrase consists of the verb and a number of other constituents. In the
simple rules we have built so far, these other constituents include NPs and PPs and
combinations of the two:
VP → Verb disappear
VP → Verb NP prefer a morning flight
VP → Verb NP PP leave Boston in the morning
VP → Verb PP leaving on Thursday
Verb phrases can be significantly more complicated than this. Many other kinds
of constituents, such as an entire embedded sentence, can follow the verb. These are
called sentential complements:
You [VP [V said] [S you had a two hundred sixty-six dollar fare]]
[VP [V Tell] [NP me] [S how to get from the airport to downtown]]
I [VP [V think] [S I would like to take the nine thirty flight]]

[NP [PreDet all]
    [NP [Det the]
        [Nom [Nom [Nom [Nom [Nom [Noun morning]]
                            [Noun flights]]
                       [PP from Denver]]
                  [PP to Tampa]]
             [GerundiveVP leaving before 10]]]]

Figure 12.5 A parse tree for “all the morning flights from Denver to Tampa leaving before 10”.

Here’s a rule for these:

VP → Verb S

Similarly, another potential constituent of the VP is another VP. This is often the
case for verbs like want, would like, try, intend, need:
I want [VP to fly from Milwaukee to Orlando]
Hi, I want [VP to arrange three flights]
While a verb phrase can have many possible kinds of constituents, not every
verb is compatible with every verb phrase. For example, the verb want can be used
either with an NP complement (I want a flight . . . ) or with an infinitive VP comple-
ment (I want to fly to . . . ). By contrast, a verb like find cannot take this sort of VP
complement (* I found to fly to Dallas).
This idea that verbs are compatible with different kinds of complements is a very
old one; traditional grammar distinguishes between transitive verbs like find, which
take a direct object NP (I found a flight), and intransitive verbs like disappear,
which do not (*I disappeared a flight).
Where traditional grammars subcategorize verbs into these two categories (transitive
and intransitive), modern grammars distinguish as many as 100 subcategories.
We say that a verb like find subcategorizes for an NP, and a verb like want
subcategorizes for either an NP or a non-finite VP. We also call these constituents the
complements of the verb (hence our use of the term sentential complement above).
So we say that want can take a VP complement. These possible sets of complements
are called the subcategorization frame for the verb. Another way of talking about
the relation between the verb and these other constituents is to think of the verb as
a logical predicate and the constituents as logical arguments of the predicate. So we
can think of such predicate-argument relations as FIND(I, A FLIGHT) or WANT(I, TO FLY).
We talk more about this view of verbs and arguments in Chapter 15 when we
talk about predicate calculus representations of verb semantics. Subcategorization
frames for a set of example verbs are given in Fig. 12.6.

Frame         Verb                  Example
∅             eat, sleep            I ate
NP            prefer, find, leave   Find [NP the flight from Pittsburgh to Boston]
NP NP         show, give            Show [NP me] [NP airlines with flights from Pittsburgh]
PPfrom PPto   fly, travel           I would like to fly [PP from Boston] [PP to Philadelphia]
NP PPwith     help, load            Can you help [NP me] [PP with a flight]
VPto          prefer, want, need    I would prefer [VPto to go by United Airlines]
S             mean                  Does this mean [S AA has a hub in Boston]
Figure 12.6 Subcategorization frames for a set of example verbs.

We can capture the association between verbs and their complements by making
separate subtypes of the class Verb (e.g., Verb-with-NP-complement, Verb-with-Inf-
VP-complement, Verb-with-S-complement, and so on):

Verb-with-NP-complement → find | leave | repeat | . . .


Verb-with-S-complement → think | believe | say | . . .
Verb-with-Inf-VP-complement → want | try | need | . . .

Each VP rule could then be modified to require the appropriate verb subtype:

VP → Verb-with-no-complement disappear
VP → Verb-with-NP-comp NP prefer a morning flight
VP → Verb-with-S-comp S said there were two flights

A problem with this approach is the significant increase in the number of rules and
the associated loss of generality.
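One way to avoid multiplying VP rules is to keep the frames in the lexicon and check them separately. The sketch below assumes a tiny hand-built lexicon drawn from the verbs discussed in this section; the dictionary SUBCAT and the function licenses are hypothetical names, and the frame labels follow Fig. 12.6.

```python
# A toy subcategorization lexicon: each verb maps to the set of
# complement sequences (frames) it accepts. The empty tuple is the
# frame with no complements.
SUBCAT = {
    "disappear": {()},
    "eat": {()},
    "find": {("NP",)},
    "want": {("NP",), ("VPto",)},
    "prefer": {("NP",), ("VPto",)},
    "show": {("NP", "NP")},
    "think": {("S",)},
}


def licenses(verb, complements):
    """Return True if the verb's lexical entry lists this exact
    sequence of complements as one of its frames."""
    return tuple(complements) in SUBCAT.get(verb, set())


print(licenses("want", ["VPto"]))      # I want to fly to ...
print(licenses("find", ["VPto"]))      # *I found to fly to Dallas
print(licenses("disappear", ["NP"]))   # *I disappeared a flight
```

With this arrangement the grammar needs only general VP rules, and adding a verb means adding a lexical entry rather than a new rule.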

12.3.5 Coordination
The major phrase types discussed here can be conjoined with conjunctions like and,
or, and but to form larger constructions of the same type. For example, a coordinate
noun phrase can consist of two other noun phrases separated by a conjunction:
Please repeat [NP [NP the flights] and [NP the costs]]
I need to know [NP [NP the aircraft] and [NP the flight number]]
Here’s a rule that allows these structures:

NP → NP and NP

Note that the ability to form coordinate phrases through conjunctions is often
used as a test for constituency. Consider the following examples, which differ from
the ones given above in that they lack the second determiner.
Please repeat the [Nom [Nom flights] and [Nom costs]]
I need to know the [Nom [Nom aircraft] and [Nom flight number]]
The fact that these phrases can be conjoined is evidence for the presence of the
underlying Nominal constituent we have been making use of. Here’s a rule for this:

Nominal → Nominal and Nominal

The following examples illustrate conjunctions involving VPs and Ss.



What flights do you have [VP [VP leaving Denver] and [VP arriving in
San Francisco]]
[S [S I’m interested in a flight from Dallas to Washington] and [S I’m
also interested in going to Baltimore]]
The rules for VP and S conjunctions mirror the NP one given above.
VP → VP and VP
S → S and S
Since all the major phrase types can be conjoined in this fashion, it is also possible
to represent this conjunction fact more generally; a number of grammar formalisms
such as GPSG (Gazdar et al., 1985) do this using metarules like:
X → X and X
This metarule states that any non-terminal can be conjoined with the same non-
terminal to yield a constituent of the same type; the variable X must be designated
as a variable that stands for any non-terminal rather than a non-terminal itself.
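The effect of such a metarule can be illustrated by expanding it into ordinary rules, one per non-terminal. This is a sketch of the idea only, not of how GPSG itself implements metarules; the function name expand_metarule and its parameters are assumptions.

```python
def expand_metarule(nonterminals, conjunctions=("and",)):
    """Instantiate X -> X conj X for every non-terminal X and every
    conjunction, returning ordinary context-free rules."""
    rules = []
    for x in nonterminals:
        for conj in conjunctions:
            rules.append((x, (x, conj, x)))
    return rules


for lhs, rhs in expand_metarule(["NP", "Nominal", "VP", "S"]):
    print(lhs, "→", " ".join(rhs))
```

Run on the four categories above, the expansion reproduces the four coordination rules given in this section.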

12.4 Treebanks
Sufficiently robust grammars consisting of context-free grammar rules can be used
to assign a parse tree to any sentence. This means that it is possible to build a
corpus where every sentence in the collection is paired with a corresponding parse
tree. Such a syntactically annotated corpus is called a treebank. Treebanks play
an important role in parsing, as we discuss in Chapter 13, as well as in linguistic
investigations of syntactic phenomena.
A wide variety of treebanks have been created, generally through the use of
parsers (of the sort described in the next few chapters) to automatically parse each
sentence, followed by the use of humans (linguists) to hand-correct the parses. The
Penn Treebank project (whose POS tagset we introduced in Chapter 8) has produced
treebanks from the Brown, Switchboard, ATIS, and Wall Street Journal corpora
of English, as well as treebanks in Arabic and Chinese. A number of treebanks
use the dependency representation we will introduce in Chapter 14, including many
that are part of the Universal Dependencies project (Nivre et al., 2016b).

12.4.1 Example: The Penn Treebank Project


Figure 12.7 shows sentences from the Brown and ATIS portions of the Penn Tree-
bank.1 Note the formatting differences for the part-of-speech tags; such small dif-
ferences are common and must be dealt with in processing treebanks. The Penn
Treebank part-of-speech tagset was defined in Chapter 8. The use of LISP-style
parenthesized notation for trees is extremely common and resembles the bracketed
notation we saw earlier in (12.1). For those who are not familiar with it we show a
standard node-and-line tree representation in Fig. 12.8.
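Because the parenthesized notation is so regular, it can be read with a few lines of code. The following is a minimal sketch that handles the fully bracketed style, in which every word has its own (TAG word) pair; it does not handle the word/TAG style also found in some treebank files. The names parse_bracketed and leaves are hypothetical, and trees are represented as (label, children) tuples purely for illustration.

```python
import re


def parse_bracketed(text):
    """Parse a LISP-style bracketed tree into nested (label, children)
    tuples. Words at the leaves are plain strings. A node with no
    label, such as the outer wrapper in "((S ...))", gets label ""."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words of a tree in left-to-right order."""
    words = []
    for child in node[1]:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


tree = parse_bracketed(
    "((S (NP-SBJ (DT That) (JJ cold) (, ,) (JJ empty) (NN sky))"
    " (VP (VBD was) (ADJP-PRD (JJ full))) (. .)))"
)
print(leaves(tree))
```

The parser assumes balanced parentheses and will raise an IndexError on truncated input, which is usually acceptable for a first pass over treebank files.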
1 The Penn Treebank project released treebanks in multiple languages and in various stages; for
example, there were Treebank I (Marcus et al., 1993), Treebank II (Marcus et al., 1994), and Treebank III
releases of English treebanks. We use Treebank III for our examples.

(a)
((S
   (NP-SBJ (DT That)
     (JJ cold) (, ,)
     (JJ empty) (NN sky) )
   (VP (VBD was)
     (ADJP-PRD (JJ full)
       (PP (IN of)
         (NP (NN fire)
           (CC and)
           (NN light) ))))
   (. .) ))

(b)
((S
   (NP-SBJ The/DT flight/NN )
   (VP should/MD
     (VP arrive/VB
       (PP-TMP at/IN
         (NP eleven/CD a.m/RB ))
       (NP-TMP tomorrow/NN )))))

Figure 12.7 Parsed sentences from the LDC Treebank3 version of the (a) Brown and (b)
ATIS corpora.

S
├── NP-SBJ
│   ├── DT  That
│   ├── JJ  cold
│   ├── ,   ,
│   ├── JJ  empty
│   └── NN  sky
├── VP
│   ├── VBD  was
│   └── ADJP-PRD
│       ├── JJ  full
│       └── PP
│           ├── IN  of
│           └── NP
│               ├── NN  fire
│               ├── CC  and
│               └── NN  light
└── .  .

Figure 12.8 The tree corresponding to the Brown corpus sentence in the previous figure.

Figure 12.9 shows a tree from the Wall Street Journal. This tree shows another
feature of the Penn Treebanks: the use of traces (-NONE- nodes) to mark
long-distance dependencies or syntactic movement. For example, quotations often
follow a quotative verb like say. But in this example, the quotation “We would have
to wait until we have collected on those assets” precedes the words he said. An
empty S containing only the node -NONE- marks the position after said where the
quotation sentence often occurs. This empty node is marked (in Treebanks II and
III) with the index 2, as is the quotation S at the beginning of the sentence. Such
co-indexing may make it easier for some parsers to recover the fact that this fronted
or topicalized quotation is the complement of the verb said. A similar -NONE- node
marks the fact that there is no syntactic subject right before the verb to wait; instead,
the subject is the earlier NP We. Again, they are both co-indexed with the index 1.

( (S (`` ``)
    (S-TPC-2
      (NP-SBJ-1 (PRP We) )
      (VP (MD would)
        (VP (VB have)
          (S
            (NP-SBJ (-NONE- *-1) )
            (VP (TO to)
              (VP (VB wait)
                (SBAR-TMP (IN until)
                  (S
                    (NP-SBJ (PRP we) )
                    (VP (VBP have)
                      (VP (VBN collected)
                        (PP-CLR (IN on)
                          (NP (DT those)(NNS assets)))))))))))))
    (, ,) ('' '')
    (NP-SBJ (PRP he) )
    (VP (VBD said)
      (S (-NONE- *T*-2) ))
    (. .) ))

Figure 12.9 A sentence from the Wall Street Journal portion of the LDC Penn Treebank.
Note the use of the empty -NONE- nodes.

The Penn Treebank II and Treebank III releases added further information to
make it easier to recover the relationships between predicates and arguments.
Certain phrases were marked with tags indicating the grammatical function of the phrase
(as surface subject, logical topic, cleft, non-VP predicates), its presence in particular
text categories (headlines, titles), and its semantic function (temporal phrases,
locations) (Marcus et al. 1994, Bies et al. 1995). Figure 12.9 shows examples of the
-SBJ (surface subject) and -TMP (temporal phrase) tags. Figure 12.8 shows in addition
the -PRD tag, which is used for predicates that are not VPs (the one in Fig. 12.8
is an ADJP). We’ll return to the topic of grammatical function when we consider
dependency grammars and parsing in Chapter 14.

12.4.2 Treebanks as Grammars


The sentences in a treebank implicitly constitute a grammar of the language repre-
sented by the corpus being annotated. For example, from the three parsed sentences
in Fig. 12.7 and Fig. 12.9, we can extract each of the CFG rules in them. For sim-
plicity, let’s strip off the rule suffixes (-SBJ and so on). The resulting grammar is
shown in Fig. 12.10.
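The extraction step just described can be sketched in a few lines. The code below is an illustration under stated assumptions rather than a standard tool: it reads fully bracketed trees only, represents a tree as (label, children) tuples, and treats anything after the first "-" or "=" in a label as a suffix to strip, leaving -NONE- intact. The names parse, strip_suffix and extract_rules, and the example string, are hypothetical.

```python
import re
from collections import Counter


def parse(text):
    """Read one fully bracketed tree into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)

    def node(i):
        # tokens[i] is "("; returns (tree, index just past the matching ")")
        i += 1
        label = ""
        if tokens[i] not in ("(", ")"):
            label, i = tokens[i], i + 1
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = node(i)
            else:
                child, i = tokens[i], i + 1
            children.append(child)
        return (label, children), i + 1

    return node(0)[0]


def strip_suffix(label):
    """NP-SBJ -> NP, S-TPC-2 -> S; labels like -NONE- are kept as is."""
    if label.startswith("-"):
        return label
    return re.split(r"[-=]", label)[0]


def extract_rules(tree, rules=None):
    """Count every rule used in the tree. Phrase-level rules have
    category labels on the right; lexical rules have a word."""
    rules = Counter() if rules is None else rules
    label, children = tree
    kids = [c for c in children if isinstance(c, tuple)]
    if label and kids:
        rules[(strip_suffix(label), tuple(strip_suffix(k[0]) for k in kids))] += 1
    elif label:
        rules[(strip_suffix(label), tuple(children))] += 1
    for kid in kids:
        extract_rules(kid, rules)
    return rules


# The ATIS sentence of Fig. 12.7(b), rewritten in fully bracketed form.
EXAMPLE = ("(S (NP-SBJ (DT The) (NN flight)) (VP (MD should) (VP (VB arrive)"
           " (PP-TMP (IN at) (NP (CD eleven) (RB a.m))) (NP-TMP (NN tomorrow)))))")

for (lhs, rhs), count in extract_rules(parse(EXAMPLE)).items():
    print(lhs, "→", " ".join(rhs), f"({count})")
```

Counting rule tokens with a Counter, rather than collecting a set, keeps the frequencies that a probabilistic grammar would need later.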
The grammar used to parse the Penn Treebank is relatively flat, resulting in very
many and very long rules. For example, among the approximately 4,500 different
rules for expanding VPs are separate rules for PP sequences of any length and every
possible arrangement of verb arguments:
VP → VBD PP
VP → VBD PP PP
VP → VBD PP PP PP
VP → VBD PP PP PP PP
VP → VB ADVP PP
VP → VB PP ADVP
VP → ADVP VB PP
as well as even longer rules, such as
VP → VBP PP PP PP PP PP ADVP PP
which comes from the bracketed VP:
This mostly happens because we [VP go from football in the fall to lifting in the
winter to football again in the spring].

Grammar                   Lexicon
S → NP VP .               PRP → we | he
S → NP VP                 DT → the | that | those
S → “ S ” , NP VP .       JJ → cold | empty | full
S → -NONE-                NN → sky | fire | light | flight | tomorrow
NP → DT NN                NNS → assets
NP → DT NNS               CC → and
NP → NN CC NN             IN → of | at | until | on
NP → CD RB                CD → eleven
NP → DT JJ , JJ NN        RB → a.m.
NP → PRP                  VB → arrive | have | wait
NP → -NONE-               VBD → was | said
VP → MD VP                VBP → have
VP → VBD ADJP             VBN → collected
VP → VBD S                MD → should | would
VP → VBN PP               TO → to
VP → VB S
VP → VB SBAR
VP → VBP VP
VP → TO VP
SBAR → IN S
ADJP → JJ PP
PP → IN NP
Figure 12.10 A sample of the CFG grammar rules and lexical entries that would be
extracted from the three treebank sentences in Fig. 12.7 and Fig. 12.9.

Some of the many thousands of NP rules include
NP → DT JJ NN
NP → DT JJ NNS
NP → DT JJ NN NN
NP → DT JJ JJ NN
NP → DT JJ CD NNS
NP → RB DT JJ NN NN
NP → RB DT JJ JJ NNS
NP → DT JJ JJ NNP NNS
NP → DT NNP NNP NNP NNP JJ NN
NP → DT JJ NNP CC JJ JJ NN NNS
NP → RB DT JJS NN NN SBAR
NP → DT VBG JJ NNP NNP CC NNP
NP → DT JJ NNS , NNS CC NN NNS NN
NP → DT JJ JJ VBG NN NNP NNP FW NNP
NP → NP JJ , JJ ‘‘ SBAR ’’ NNS

The last two of those rules, for example, come from the following two noun phrases:
[DT The] [JJ state-owned] [JJ industrial] [VBG holding] [NN company] [NNP Instituto] [NNP Nacional]
[FW de] [NNP Industria]
[NP Shearson’s] [JJ easy-to-film], [JJ black-and-white] “[SBAR Where We Stand]” [NNS commercials]
Viewed as a large grammar in this way, the Penn Treebank III Wall Street Journal
corpus, which contains about 1 million words, also has about 1 million non-lexical
rule tokens, consisting of about 17,500 distinct rule types.

S(dumped)
├── NP(workers)
│   └── NNS(workers)  workers
└── VP(dumped)
    ├── VBD(dumped)  dumped
    ├── NP(sacks)
    │   └── NNS(sacks)  sacks
    └── PP(into)
        ├── P  into
        └── NP(bin)
            ├── DT(a)  a
            └── NN(bin)  bin

Figure 12.11 A lexicalized tree from Collins (1999).

Various facts about the treebank grammars, such as their large numbers of flat
rules, pose problems for probabilistic parsing algorithms. For this reason, it is com-
mon to make various modifications to a grammar extracted from a treebank. We
discuss these further in Appendix C.

12.4.3 Heads and Head Finding


We suggested informally earlier that syntactic constituents could be associated with
a lexical head; N is the head of an NP, V is the head of a VP. This idea of a head for
each constituent dates back to Bloomfield (1914), and is central to the dependency
grammars and dependency parsing we’ll introduce in Chapter 14. Heads are also
important in probabilistic parsing (Appendix C) and in constituent-based grammar
formalisms like Head-Driven Phrase Structure Grammar (Pollard and Sag, 1994).
In one simple model of lexical heads, each context-free rule is associated with
a head (Charniak 1997, Collins 1999). The head is the word in the phrase that is
grammatically the most important. Heads are passed up the parse tree; thus, each
non-terminal in a parse tree is annotated with a single word, which is its lexical head.
Figure 12.11 shows an example of such a tree from Collins (1999), in which each
non-terminal is annotated with its head.
For the generation of such a tree, each CFG rule must be augmented to identify
one right-side constituent to be the head child. The headword for a node is then set to
the headword of its head child. Choosing these head children is simple for textbook
examples (NN is the head of NP) but is complicated and indeed controversial for
most phrases. (Should the complementizer to or the verb be the head of an infinitive
verb phrase?) Modern linguistic theories of syntax generally include a component
that defines heads (see, e.g., Pollard and Sag, 1994).
An alternative approach to finding a head is used in most practical computational
systems. Instead of specifying head rules in the grammar itself, heads are identified
dynamically in the context of trees for specific sentences. In other words, once
a sentence is parsed, the resulting tree is walked to decorate each node with the
appropriate head. Most current systems rely on a simple set of handwritten rules,
such as a practical one for Penn Treebank grammars given in Collins (1999) but
developed originally by Magerman (1995). For example, the rule for finding the
head of an NP is as follows (Collins, 1999, p. 238):

• If the last word is tagged POS, return last-word.
• Else search from right to left for the first child which is an NN, NNP, NNPS, NX, POS,
  or JJR.
• Else search from left to right for the first child which is an NP.
• Else search from right to left for the first child which is a $, ADJP, or PRN.
• Else search from right to left for the first child which is a CD.
• Else search from right to left for the first child which is a JJ, JJS, RB or QP.
• Else return the last word.

Selected other rules from this set are shown in Fig. 12.12. For example, for VP
rules of the form VP → Y1 · · · Yn , the algorithm would start from the left of Y1 · · ·
Yn looking for the first Yi of type TO; if no TOs are found, it would search for the
first Yi of type VBD; if no VBDs are found, it would search for a VBN, and so on.
See Collins (1999) for more details.
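Both kinds of head rule can be sketched as functions that take the labels of a node's children and return the index of the head child. The code below is an illustration of the procedure as described in this section, not a reproduction of any particular parser's implementation. The label sets are copied from the list and table here, the function names are hypothetical, and the fallback in head_child_by_priority (return the first child in the search direction when nothing matches) is an assumption.

```python
# Priority list for VP, copied from Fig. 12.12 (search direction: Left).
VP_PRIORITY = ["TO", "VBD", "VBN", "MD", "VBZ", "VB", "VBG", "VBP",
               "VP", "ADJP", "NN", "NNS", "NP"]


def head_child_by_priority(labels, priority, from_left=True):
    """For each category in the priority list, in order, scan the
    children in the given direction and return the first match."""
    order = range(len(labels)) if from_left else range(len(labels) - 1, -1, -1)
    for wanted in priority:
        for i in order:
            if labels[i] == wanted:
                return i
    # Assumed fallback when no listed category is present.
    return 0 if from_left else len(labels) - 1


# The NP steps as (label set, search from left?) pairs, in order.
NP_STEPS = [
    ({"NN", "NNP", "NNPS", "NX", "POS", "JJR"}, False),
    ({"NP"}, True),
    ({"$", "ADJP", "PRN"}, False),
    ({"CD"}, False),
    ({"JJ", "JJS", "RB", "QP"}, False),
]


def np_head_child(labels):
    """Apply the NP head rule to a list of child labels."""
    last = len(labels) - 1
    if labels[last] == "POS":
        return last
    for wanted, from_left in NP_STEPS:
        order = range(len(labels)) if from_left else range(last, -1, -1)
        for i in order:
            if labels[i] in wanted:
                return i
    return last


print(head_child_by_priority(["VBD", "NP", "PP"], VP_PRIORITY))  # VP(dumped) in Fig. 12.11
print(np_head_child(["DT", "JJ", "NN"]))
print(np_head_child(["NP", "PP"]))
```

Note the difference between the two functions: the VP rule tries one category at a time across all children, whereas each NP step accepts any member of a set during a single scan.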

Parent  Direction  Priority List
ADJP    Left       NNS QP NN $ ADVP JJ VBN VBG ADJP JJR NP JJS DT FW RBR RBS SBAR RB
ADVP    Right      RB RBR RBS FW ADVP TO CD JJR JJ IN NP JJS NN
PRN     Left
PRT     Right      RP
QP      Left       $ IN NNS NN JJ RB DT CD NCD QP JJR JJS
S       Left       TO IN VP S SBAR ADJP UCP NP
SBAR    Left       WHNP WHPP WHADVP WHADJP IN DT S SQ SINV SBAR FRAG
VP      Left       TO VBD VBN MD VBZ VB VBG VBP VP ADJP NN NNS NP
Figure 12.12 Some head rules from Collins (1999). The head rules are also called a head percolation table.

12.5 Grammar Equivalence and Normal Form


A formal language is defined as a (possibly infinite) set of strings of words. This
suggests that we could ask if two grammars are equivalent by asking if they gener-
ate the same set of strings. In fact, it is possible to have two distinct context-free
grammars generate the same language.
We usually distinguish two kinds of grammar equivalence: weak equivalence
and strong equivalence. Two grammars are strongly equivalent if they generate the
same set of strings and if they assign the same phrase structure to each sentence
(allowing merely for renaming of the non-terminal symbols). Two grammars are
weakly equivalent if they generate the same set of strings but do not assign the same
phrase structure to each sentence.
It is sometimes useful to have a normal form for grammars, in which each of
the productions takes a particular form. For example, a context-free grammar is in
Chomsky normal form (CNF) (Chomsky, 1963) if it is ε-free and if in addition
each production is either of the form A → B C or A → a. That is, the right-hand side
of each rule either has two non-terminal symbols or one terminal symbol. Chomsky
normal form grammars are binary branching, that is they have binary trees (down
to the prelexical nodes). We make use of this binary branching property in the CKY
parsing algorithm in Chapter 13.
Any context-free grammar can be converted into a weakly equivalent Chomsky
normal form grammar. For example, a rule of the form

A → B C D

can be converted into the following two CNF rules (Exercise 12.8 asks the reader to
formulate the complete algorithm):

A → B X
X → C D
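The step shown above generalizes to right-hand sides of any length. The sketch below covers only this binarization step; a complete conversion to CNF must also deal with ε-rules, unit productions, and terminals mixed with non-terminals. The function name binarize is hypothetical, and the fresh symbols X1, X2, ... are assumed not to clash with symbols already in the grammar.

```python
def binarize(rules):
    """Split every rule whose right-hand side has more than two symbols
    into a chain of binary rules, introducing fresh non-terminals."""
    out = []
    counter = 0
    for lhs, rhs in rules:
        rhs = list(rhs)
        while len(rhs) > 2:
            counter += 1
            fresh = f"X{counter}"
            out.append((lhs, [rhs[0], fresh]))  # A -> B X
            lhs, rhs = fresh, rhs[1:]           # continue with X -> C D ...
        out.append((lhs, rhs))
    return out


for lhs, rhs in binarize([("A", ["B", "C", "D"])]):
    print(lhs, "→", " ".join(rhs))
```

Each fresh symbol covers the remainder of the original right-hand side, so the resulting grammar generates the same strings with different, strictly binary trees.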

Sometimes using binary branching can actually produce smaller grammars. For
example, the sentences that might be characterized as
VP → VBD NP PP*
are represented in the Penn Treebank by this series of rules:
VP → VBD NP PP
VP → VBD NP PP PP
VP → VBD NP PP PP PP
VP → VBD NP PP PP PP PP
...
but could also be generated by the following two-rule grammar:
VP → VBD NP PP
VP → VP PP
The generation of a symbol A with a potentially infinite sequence of symbols B with
a rule of the form A → A B is known as Chomsky-adjunction.
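A quick bounded check makes the claim about the two-rule grammar concrete. This sketch compares the tag sequences licensed by the flat treebank-style rules with those derived by repeatedly applying VP → VP PP, up to a chosen number of PPs; both function names and the bound are illustrative assumptions.

```python
def flat_vp_strings(max_pps):
    """Tag sequences from the flat rules VP -> VBD NP PP ... PP,
    with between 1 and max_pps PPs."""
    return {("VBD", "NP") + ("PP",) * k for k in range(1, max_pps + 1)}


def adjoined_vp_strings(max_pps):
    """Tag sequences derived from VP -> VBD NP PP and VP -> VP PP,
    stopping once a sequence would exceed max_pps PPs."""
    results = set()
    current = ("VBD", "NP", "PP")          # VP -> VBD NP PP
    while current.count("PP") <= max_pps:
        results.add(current)
        current = current + ("PP",)        # VP -> VP PP
    return results


print(flat_vp_strings(4) == adjoined_vp_strings(4))
```

The two grammars agree on the strings but not on the trees: the flat rules give one VP node with many PP children, while the recursive rule stacks a new VP for each PP.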

12.6 Lexicalized Grammars


The approach to grammar presented thus far emphasizes phrase-structure rules while
minimizing the role of the lexicon. However, as we saw in the discussions of
agreement, subcategorization, and long-distance dependencies, this approach leads
to solutions that are cumbersome at best, yielding grammars that are redundant,
hard to manage, and brittle. To overcome these issues, numerous alternative ap-
proaches have been developed that all share the common theme of making bet-
ter use of the lexicon. Among the more computationally relevant approaches are
Lexical-Functional Grammar (LFG) (Bresnan, 1982), Head-Driven Phrase Structure
Grammar (HPSG) (Pollard and Sag, 1994), Tree-Adjoining Grammar (TAG) (Joshi,
1985), and Combinatory Categorial Grammar (CCG). These approaches differ with
respect to how lexicalized they are — the degree to which they rely on the lexicon
as opposed to phrase structure rules to capture facts about the language.
The following section provides an introduction to CCG, a heavily lexicalized
approach motivated by both syntactic and semantic considerations, which we will
return to in Chapter 15. Chapter 14 discusses dependency grammars, an approach
that eliminates phrase-structure rules entirely.

12.6.1 Combinatory Categorial Grammar


In this section, we provide an overview of categorial grammar (Ajdukiewicz 1935,
Bar-Hillel 1953), an early lexicalized grammar model, as well as an important modern
extension, combinatory categorial grammar, or CCG (Steedman 1996, Steedman 1989,
Steedman 2000).
The categorial approach consists of three major elements: a set of categories,
a lexicon that associates words with categories, and a set of rules that govern how
categories combine in context.

Categories
Categories are either atomic elements or single-argument functions that return a cat-
egory as a value when provided with a desired category as argument. More formally,
we can define C, a set of categories for a grammar as follows:
• A ⊆ C, where A is a given set of atomic elements
• (X/Y), (X\Y) ∈ C, if X, Y ∈ C
The slash notation shown here is used to define the functions in the grammar.
It specifies the type of the expected argument, the direction in which it is expected
to be found, and the type of the result. Thus, (X/Y) is a function that seeks a constituent
of type Y to its right and returns a value of X; (X\Y) is the same except it seeks its
argument to the left.
The set of atomic categories is typically very small and includes familiar el-
ements such as sentences and noun phrases. Functional categories include verb
phrases and complex noun phrases among others.

The Lexicon
The lexicon in a categorial approach consists of assignments of categories to words.
These assignments can either be to atomic or functional categories, and due to lexical
ambiguity words can be assigned to multiple categories. Consider the following
sample lexical entries.

flight : N
Miami : NP
cancel : (S\NP)/NP

Nouns and proper nouns like flight and Miami are assigned to atomic categories,
reflecting their typical role as arguments to functions. On the other hand, a transitive
verb like cancel is assigned the category (S\NP)/NP: a function that seeks an NP on
its right and returns as its value a function with the type (S\NP). This function can,
in turn, combine with an NP on the left, yielding an S as the result. This captures the
kind of subcategorization information discussed in Section 12.3.4; here, however, the
information has a rich, computationally useful, internal structure.
Ditransitive verbs like give, which expect two arguments after the verb, would
have the category ((S\NP)/NP)/NP: a function that combines with an NP on its
right to yield yet another function corresponding to the transitive verb (S\NP)/NP
category such as the one given above for cancel.
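
The category definition and the sample lexicon above can be written down directly as data. The following is a minimal Python sketch rather than the interface of any CCG toolkit: the tuple encoding and the helper names fwd, bwd, and show are assumptions made purely for illustration.

```python
# A category is either an atomic label (a string such as "NP") or a
# triple (result, slash, argument). This encoding is an assumption of
# the sketch, not a standard representation.

def fwd(result, arg):
    """Build X/Y: seeks an argument of type Y to its right, returns X."""
    return (result, "/", arg)

def bwd(result, arg):
    """Build X\\Y: seeks an argument of type Y to its left, returns X."""
    return (result, "\\", arg)

def show(cat, top=True):
    """Render a category in slash notation, parenthesizing nested functions."""
    if isinstance(cat, str):
        return cat
    result, slash, arg = cat
    text = f"{show(result, False)}{slash}{show(arg, False)}"
    return text if top else f"({text})"

# Each word maps to a list of categories, since lexical ambiguity
# allows a word to have more than one.
LEXICON = {
    "flight": ["N"],
    "Miami": ["NP"],
    "cancel": [fwd(bwd("S", "NP"), "NP")],            # transitive verb
    "give": [fwd(fwd(bwd("S", "NP"), "NP"), "NP")],   # ditransitive verb
}

for word, cats in LEXICON.items():
    print(word, ":", ", ".join(show(c) for c in cats))
```

Storing a list of categories per word leaves room for lexical ambiguity without changing the representation.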

Rules
The rules of a categorial grammar specify how functions and their arguments com-
bine. The following two rule templates constitute the basis for all categorial gram-
mars.

X/Y Y ⇒ X (12.4)
Y X\Y ⇒ X (12.5)

The first rule applies a function to its argument on the right, while the second
looks to the left for its argument. We’ll refer to the first as forward function appli-
cation, and the second as backward function application. The result of applying
either of these rules is the category specified as the value of the function being ap-
plied.
Given these rules and a simple lexicon, let’s consider an analysis of the sentence
United serves Miami. Assume that serves is a transitive verb with the category
(S\NP)/NP and that United and Miami are both simple NPs. Using both forward
and backward function application, the derivation would proceed as follows:
United   serves      Miami
NP       (S\NP)/NP   NP
         ----------------- >
               S\NP
-------------------------- <
             S
Categorial grammar derivations are illustrated growing down from the words;
rule applications are illustrated with a horizontal line that spans the elements
involved, with the type of the operation indicated at the right end of the line. In this
example, there are two function applications: one forward function application indi-
cated by the > that applies the verb serves to the NP on its right, and one backward
function application indicated by the < that applies the result of the first to the NP
United on its left.
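
The two application rules (12.4) and (12.5) can be sketched in a few lines of Python. This is an illustrative sketch under an assumed encoding: atomic categories are strings, functional categories are (result, slash, argument) triples, and the function names are hypothetical.

```python
def forward_apply(left, right):
    """X/Y  Y  =>  X  (rule 12.4). Returns None if the rule does not apply."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None

def backward_apply(left, right):
    """Y  X\\Y  =>  X  (rule 12.5). Returns None if the rule does not apply."""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

# Categories for "United serves Miami".
NP = "NP"
VP = ("S", "\\", NP)      # S\NP
serves = (VP, "/", NP)    # (S\NP)/NP

vp = forward_apply(serves, NP)   # serves + Miami
s = backward_apply(NP, vp)       # United + [serves Miami]
print(vp, s)
```

The two calls mirror the two horizontal lines of the derivation: the forward step consumes the object and the backward step consumes the subject.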
With the addition of another rule, the categorial approach provides a straight-
forward way to implement the coordination metarule described earlier on page 244.
Recall that English permits the coordination of two constituents of the same type,
resulting in a new constituent of the same type. The following rule provides the
mechanism to handle such examples.

X CONJ X ⇒ X (12.6)

This rule states that when two constituents of the same category are separated by a
constituent of type CONJ they can be combined into a single larger constituent of
the same type. The following derivation illustrates the use of this rule.
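We  flew       to     Geneva  and   drove      to     Chamonix
NP  (S\NP)/PP  PP/NP  NP      CONJ  (S\NP)/PP  PP/NP  NP
               ------------- >                 --------------- >
                    PP                               PP
    ------------------------ >      -------------------------- >
              S\NP                             S\NP
    ---------------------------------------------------------- <Φ>
                               S\NP
-------------------------------------------------------------- <
                              S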

Here the two S\NP constituents are combined via the conjunction operator <Φ>
to form a larger constituent of the same type, which can then be combined with the
subject NP via backward function application.
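
The coordination rule (12.6) can be added to the same kind of sketch. The Python below is illustrative only; the tuple encoding of categories and all function names are assumptions, and the derivation replays the example above step by step.

```python
def forward_apply(left, right):
    """X/Y  Y  =>  X"""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None

def backward_apply(left, right):
    """Y  X\\Y  =>  X"""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

def coordinate(left, conj, right):
    """X  CONJ  X  =>  X  (rule 12.6): both conjuncts must share a category."""
    if conj == "CONJ" and left == right:
        return left
    return None

VP = ("S", "\\", "NP")        # S\NP
verb_pp = (VP, "/", "PP")     # (S\NP)/PP, used for both flew and drove
to = ("PP", "/", "NP")        # PP/NP

# "flew to Geneva" and "drove to Chamonix" each reduce to S\NP.
left_vp = forward_apply(verb_pp, forward_apply(to, "NP"))
right_vp = forward_apply(verb_pp, forward_apply(to, "NP"))

both = coordinate(left_vp, "CONJ", right_vp)   # S\NP CONJ S\NP => S\NP
sentence = backward_apply("NP", both)          # We + coordinated VP
print(both, sentence)
```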
These examples illustrate the lexical nature of the categorial grammar approach.
The grammatical facts about a language are largely encoded in the lexicon, while the
rules of the grammar are boiled down to a set of three rules. Unfortunately, the basic
categorial approach does not give us any more expressive power than we had with
traditional CFG rules; it just moves information from the grammar to the lexicon. To
move beyond these limitations CCG includes operations that operate over functions.

The first pair of operators permit us to compose adjacent functions.

X/Y Y/Z ⇒ X/Z (12.7)


Y \Z X\Y ⇒ X\Z (12.8)
The first rule, called forward composition, can be applied to adjacent constituents
where the first is a function seeking an argument of type Y to its right, and
the second is a function that provides Y as a result. This rule allows us to compose
these two functions into a single one with the type of the first constituent and the
argument of the second. Although the notation is a little awkward, the second rule,
backward composition, is the same, except that we’re looking to the left instead of
to the right for the relevant arguments. Both kinds of composition are signalled by a
B in CCG diagrams, accompanied by a < or > to indicate the direction.
The next operator is type raising. Type raising elevates simple categories to the
status of functions. More specifically, type raising takes a category and converts
it to a function that seeks as an argument a function that takes the original category
as its argument. The following schemata show two versions of type raising: one for
arguments to the right, and one for the left.

X ⇒ T /(T \X) (12.9)


X ⇒ T \(T /X) (12.10)

The category T in these rules can correspond to any of the atomic or functional
categories already present in the grammar.
A particularly useful example of type raising transforms a simple NP argument
in subject position to a function that can compose with a following VP. To see how
this works, let’s revisit our earlier example of United serves Miami. Instead of
classifying United as an NP which can serve as an argument to the function attached to
serves, we can use type raising to reinvent it as a function in its own right as follows.

NP ⇒ S/(S\NP)

Combining this type-raised constituent with the forward composition rule (12.7)
permits the following alternative to our previous derivation.
United     serves      Miami
NP         (S\NP)/NP   NP
-------- >T
S/(S\NP)
-------------------- >B
        S/NP
---------------------------- >
             S
By type raising United to S/(S\NP), we can compose it with the transitive verb
serves to yield the (S/NP) function needed to complete the derivation.
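
The left-to-right derivation above can be checked mechanically. The sketch below implements forward type raising (12.9) and forward composition (12.7) in Python under an assumed encoding: atomic categories are strings, functional categories are (result, slash, argument) triples, and the function names are hypothetical.

```python
def forward_apply(left, right):
    """X/Y  Y  =>  X"""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None

def forward_compose(left, right):
    """X/Y  Y/Z  =>  X/Z  (rule 12.7)."""
    if (isinstance(left, tuple) and isinstance(right, tuple)
            and left[1] == "/" and right[1] == "/" and left[2] == right[0]):
        return (left[0], "/", right[2])
    return None

def forward_raise(x, t):
    """X  =>  T/(T\\X)  (rule 12.9), for a chosen target category T."""
    return (t, "/", (t, "\\", x))

serves = (("S", "\\", "NP"), "/", "NP")          # (S\NP)/NP

united = forward_raise("NP", "S")                # S/(S\NP)
united_serves = forward_compose(united, serves)  # S/NP
sentence = forward_apply(united_serves, "NP")    # S
print(united, united_serves, sentence)
```

Note that the target category T is passed explicitly here; choosing which T to raise to is a decision a real parser has to make, and this sketch does not address it.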
There are several interesting things to note about this derivation. First, it pro-
vides a left-to-right, word-by-word derivation that more closely mirrors the way
humans process language. This makes CCG a particularly apt framework for psy-
cholinguistic studies. Second, this derivation involves the use of an intermediate
unit of analysis, United serves, that does not correspond to a traditional constituent
in English. This ability to make use of such non-constituent elements provides CCG
with the ability to handle the coordination of phrases that are not proper constituents,
as in the following example.
(12.11) We flew IcelandAir to Geneva and SwissAir to London.

Here, the segments that are being coordinated are IcelandAir to Geneva and
SwissAir to London, phrases that would not normally be considered constituents, as
can be seen in the following standard derivation for the verb phrase flew IcelandAir
to Geneva.
flew        IcelandAir    to     Geneva
(VP/PP)/NP  NP            PP/NP  NP
---------------------- >  ------------- >
        VP/PP                  PP
--------------------------------------- >
                  VP
In this derivation, there is no single constituent that corresponds to IcelandAir
to Geneva, and hence no opportunity to make use of the <Φ> operator. Note that
complex CCG categories can get a little cumbersome, so we’ll use VP as a shorthand
for (S\NP) in this and the following derivations.
The following alternative derivation provides the required element through the
use of both backward type raising (12.10) and backward function composition (12.8).
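flew        IcelandAir               to     Geneva
(VP/PP)/NP  NP                       PP/NP  NP
            -------------------- <T  ------------- >
            (VP/PP)\((VP/PP)/NP)          PP
                                     ------------- <T
                                      VP\(VP/PP)
            -------------------------------------- <B
                       VP\((VP/PP)/NP)
-------------------------------------------------- <
                        VP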
Applying the same analysis to SwissAir to London satisfies the requirements
for the <Φ> operator, yielding the following derivation for our original example
(12.11).
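flew        IcelandAir               to     Geneva    and   SwissAir                 to     London
(VP/PP)/NP  NP                       PP/NP  NP        CONJ  NP                       PP/NP  NP
            -------------------- <T  ------------- >        -------------------- <T  ------------- >
            (VP/PP)\((VP/PP)/NP)          PP                (VP/PP)\((VP/PP)/NP)          PP
                                     ------------- <T                                ------------- <T
                                      VP\(VP/PP)                                      VP\(VP/PP)
            -------------------------------------- <B       -------------------------------------- <B
                       VP\((VP/PP)/NP)                                 VP\((VP/PP)/NP)
            -------------------------------------------------------------------------------------- <Φ>
                                               VP\((VP/PP)/NP)
-------------------------------------------------------------------------------------------------- <
                                                VP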
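
The backward operators used in these derivations can be sketched the same way. The Python below is a hedged illustration of backward type raising (12.10) and backward composition (12.8), applied to flew IcelandAir to Geneva; the tuple encoding and function names are assumptions, and VP is treated as an atomic shorthand for (S\NP), as in the text.

```python
def backward_apply(left, right):
    """Y  X\\Y  =>  X"""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

def backward_compose(left, right):
    """Y\\Z  X\\Y  =>  X\\Z  (rule 12.8)."""
    if (isinstance(left, tuple) and isinstance(right, tuple)
            and left[1] == "\\" and right[1] == "\\" and right[2] == left[0]):
        return (right[0], "\\", left[2])
    return None

def backward_raise(x, t):
    """X  =>  T\\(T/X)  (rule 12.10), for a chosen target category T."""
    return (t, "\\", (t, "/", x))

vp_pp = ("VP", "/", "PP")      # VP/PP
flew = (vp_pp, "/", "NP")      # (VP/PP)/NP

icelandair = backward_raise("NP", vp_pp)   # (VP/PP)\((VP/PP)/NP)
to_geneva = backward_raise("PP", "VP")     # VP\(VP/PP), after to + Geneva => PP

# The non-constituent "IcelandAir to Geneva" becomes a single category.
chunk = backward_compose(icelandair, to_geneva)   # VP\((VP/PP)/NP)
vp = backward_apply(flew, chunk)                  # VP
print(chunk, vp)
```

Because SwissAir to London reduces to the same category as chunk, the two segments satisfy the identical-category requirement of the coordination rule.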

Finally, let’s examine how these advanced operators can be used to handle long-
distance dependencies (also referred to as syntactic movement or extraction). As
mentioned in Section 12.3.1, long-distance dependencies arise from many English
constructions including wh-questions, relative clauses, and topicalization. What
these constructions have in common is a constituent that appears somewhere dis-
tant from its usual, or expected, location. Consider the following relative clause as
an example.
the flight that United diverted
Here, divert is a transitive verb that expects two NP arguments, a subject NP to its
left and a direct object NP to its right; its category is therefore (S\NP)/NP. However,
in this example the direct object the flight has been “moved” to the beginning of the
clause, while the subject United remains in its normal position. What is needed is a
way to incorporate the subject argument, while dealing with the fact that the flight is
not in its expected location.
The following derivation accomplishes this, again through the combined use of
type raising and function composition.
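the   flight  that            United    diverted
NP/N  N       (NP\NP)/(S/NP)  NP        (S\NP)/NP
------------ >                -------- >T
     NP                       S/(S\NP)
                              ------------------- >B
                                     S/NP
              ----------------------------------- >
                            NP\NP
------------------------------------------------- <
                       NP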
As we saw with our earlier examples, the first step of this derivation is type raising
United to the category S/(S\NP) allowing it to combine with diverted via forward
composition. The result of this composition is S/NP which preserves the fact that we
are still looking for an NP to fill the missing direct object. The second critical piece
is the lexical category assigned to the word that: (NP\NP)/(S/NP). This function
seeks a verb phrase missing an argument to its right, and transforms it into an NP
seeking a missing element to its left, precisely where we find the flight.
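
The relative-clause derivation can be replayed step by step to confirm that each category matches. This Python sketch uses an assumed tuple encoding of categories and hypothetical function names; it is a check of the derivation above, not a parser.

```python
def forward_apply(left, right):
    """X/Y  Y  =>  X"""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None

def backward_apply(left, right):
    """Y  X\\Y  =>  X"""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

def forward_compose(left, right):
    """X/Y  Y/Z  =>  X/Z"""
    if (isinstance(left, tuple) and isinstance(right, tuple)
            and left[1] == "/" and right[1] == "/" and left[2] == right[0]):
        return (left[0], "/", right[2])
    return None

def forward_raise(x, t):
    """X  =>  T/(T\\X)"""
    return (t, "/", (t, "\\", x))

the = ("NP", "/", "N")                                # NP/N
that = (("NP", "\\", "NP"), "/", ("S", "/", "NP"))    # (NP\NP)/(S/NP)
diverted = (("S", "\\", "NP"), "/", "NP")             # (S\NP)/NP

head = forward_apply(the, "N")                               # the flight => NP
gapped = forward_compose(forward_raise("NP", "S"), diverted) # United diverted => S/NP
modifier = forward_apply(that, gapped)                       # that ... => NP\NP
result = backward_apply(head, modifier)                      # NP
print(gapped, modifier, result)
```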

CCGBank
As with phrase-structure approaches, treebanks play an important role in CCG-
based approaches to parsing. CCGBank (Hockenmaier and Steedman, 2007) is the
largest and most widely used CCG treebank. It was created by automatically trans-
lating phrase-structure trees from the Penn Treebank via a rule-based approach. The
method produced successful translations of over 99% of the trees in the Penn Tree-
bank resulting in 48,934 sentences paired with CCG derivations. It also provides a
lexicon of 44,000 words with over 1200 categories. Appendix C will discuss how
these resources can be used to train CCG parsers.

12.7 Summary
This chapter has introduced a number of fundamental concepts in syntax through
the use of context-free grammars.
• In many languages, groups of consecutive words act as a group or a con-
stituent, which can be modeled by context-free grammars (which are also
known as phrase-structure grammars).
• A context-free grammar consists of a set of rules or productions, expressed
over a set of non-terminal symbols and a set of terminal symbols. Formally,
a particular context-free language is the set of strings that can be derived
from a particular context-free grammar.
• A generative grammar is a traditional name in linguistics for a formal lan-
guage that is used to model the grammar of a natural language.
• There are many sentence-level grammatical constructions in English; declar-
ative, imperative, yes-no question, and wh-question are four common types;
these can be modeled with context-free rules.
• An English noun phrase can have determiners, numbers, quantifiers, and
adjective phrases preceding the head noun, which can be followed by a num-
ber of postmodifiers; gerundive and infinitive VPs are common possibilities.
• Subjects in English agree with the main verb in person and number.
• Verbs can be subcategorized by the types of complements they expect. Simple
subcategories are transitive and intransitive; most grammars include
many more categories than these.
• Treebanks of parsed sentences exist for many genres of English and for many
languages. Treebanks can be searched with tree-search tools.
• Any context-free grammar can be converted to Chomsky normal form, in
which the right-hand side of each rule has either two non-terminals or a single
terminal.
• Lexicalized grammars place more emphasis on the structure of the lexicon,
lessening the burden on pure phrase-structure rules.
• Combinatory categorial grammar (CCG) is an important computationally
relevant lexicalized approach.

Bibliographical and Historical Notes


According to Percival (1976), the idea of breaking up a sentence into a hierarchy of
constituents appeared in the Völkerpsychologie of the groundbreaking psychologist
Wilhelm Wundt (Wundt, 1900):
...den sprachlichen Ausdruck für die willkürliche Gliederung einer Ge-
sammtvorstellung in ihre in logische Beziehung zueinander gesetzten
Bestandteile
[the linguistic expression for the arbitrary division of a total idea
into its constituent parts placed in logical relations to one another]
Wundt’s idea of constituency was taken up into linguistics by Leonard Bloom-
field in his early book An Introduction to the Study of Language (Bloomfield, 1914).
By the time of his later book, Language (Bloomfield, 1933), what was then called
“immediate-constituent analysis” was a well-established method of syntactic study
in the United States. By contrast, traditional European grammar, dating from the
Classical period, defined relations between words rather than constituents, and Eu-
ropean syntacticians retained this emphasis on such dependency grammars, the sub-
ject of Chapter 14.
American Structuralism saw a number of specific definitions of the immediate
constituent, couched in terms of their search for a “discovery procedure”: a method-
ological algorithm for describing the syntax of a language. In general, these attempt
to capture the intuition that “The primary criterion of the immediate constituent is the
degree in which combinations behave as simple units” (Bazell, 1966, p. 284). The
most well known of the specific definitions is Harris’ idea of distributional similarity
to individual units, with the substitutability test. Essentially, the method proceeded
by breaking up a construction into constituents by attempting to substitute simple
structures for possible constituents: if a simple form, say, man, was substitutable
in a construction for a more complex set (like intense young man),
then the form intense young man was probably a constituent. Harris’s test was the
beginning of the intuition that a constituent is a kind of equivalence class.
The first formalization of this idea of hierarchical constituency was the phrase-
structure grammar defined in Chomsky (1956) and further expanded upon (and
argued against) in Chomsky (1957) and Chomsky (1975). From this time on, most
generative linguistic theories were based at least in part on context-free grammars or
generalizations of them (such as Head-Driven Phrase Structure Grammar (Pollard
and Sag, 1994), Lexical-Functional Grammar (Bresnan, 1982), the Minimalist Program
(Chomsky, 1995), and Construction Grammar (Kay and Fillmore, 1999), inter
alia); many of these theories used schematic context-free templates known as X-bar
schemata, which also relied on the notion of syntactic head.
Shortly after Chomsky’s initial work, the context-free grammar was reinvented
by Backus (1959) and independently by Naur et al. (1960) in their descriptions of
the ALGOL programming language; Backus (1996) noted that he was influenced by
the productions of Emil Post and that Naur’s work was independent of his (Backus’)
own. After this early work, a great number of computational models of natural
language processing were based on context-free grammars because of the early de-
velopment of efficient algorithms to parse these grammars (see Chapter 13).
There are various classes of extensions to CFGs, many designed to handle long-distance
dependencies in the syntax. (Other grammars instead treat long-distance-dependent
items as being related semantically rather than syntactically (Kay and
Fillmore 1999, Culicover and Jackendoff 2005).)
One extended formalism is Tree Adjoining Grammar (TAG) (Joshi, 1985).
The primary TAG data structure is the tree, rather than the rule. Trees come in two
kinds: initial trees and auxiliary trees. Initial trees might, for example, represent
simple sentential structures, and auxiliary trees add recursion into a tree. Trees are
combined by two operations called substitution and adjunction. The adjunction
operation handles long-distance dependencies. See Joshi (1985) for more details.
Tree Adjoining Grammar is a member of the family of mildly context-sensitive
languages.
We mentioned on page 245 another way of handling long-distance dependencies,
based on the use of empty categories and co-indexing. The Penn Treebank uses
this model, which draws (in various Treebank corpora) from the Extended Standard
Theory and Minimalism (Radford, 1997).
Readers interested in the grammar of English should get one of the three large
reference grammars of English: Huddleston and Pullum (2002), Biber et al. (1999),
and Quirk et al. (1985).
There are many good introductory textbooks on syntax from different perspectives.
Sag et al. (2003) is an introduction to syntax from a generative perspective,
focusing on the use of phrase-structure rules, unification, and the type hierarchy in
Head-Driven Phrase Structure Grammar. Van Valin, Jr. and La Polla (1997) is an
introduction from a functional perspective, focusing on cross-linguistic data and on
the functional motivation for syntactic structures.

Exercises
12.1 Draw tree structures for the following ATIS phrases:
1. Dallas
2. from Denver
3. after five p.m.
4. arriving in Washington
5. early flights
6. all redeye flights
7. on Thursday
8. a one-way fare
9. any delays in Denver


12.2 Draw tree structures for the following ATIS sentences:
1. Does American Airlines have a flight between five a.m. and six a.m.?
2. I would like to fly on American Airlines.
3. Please repeat that.
4. Does American 487 have a first-class section?
5. I need to fly between Philadelphia and Atlanta.
6. What is the fare from Atlanta to Denver?
7. Is there an American Airlines flight from Philadelphia to Dallas?
12.3 Assume a grammar that has many VP rules for different subcategorizations,
as expressed in Section 12.3.4, and differently subcategorized verb rules like
Verb-with-NP-complement. How would the rule for postnominal relative clauses
(12.4) need to be modified if we wanted to deal properly with examples like
the earliest flight that you have? Recall that in such examples the pronoun
that is the object of the verb get. Your rules should allow this noun phrase but
should correctly rule out the ungrammatical S *I get.
12.4 Does your solution to the previous problem correctly model the NP the earliest
flight that I can get? How about the earliest flight that I think my mother
wants me to book for her? Hint: this phenomenon is called long-distance
dependency.
12.5 Write rules expressing the verbal subcategory of English auxiliaries; for ex-
ample, you might have a rule verb-with-bare-stem-VP-complement → can.
12.6 NPs like Fortune’s office or my uncle’s marks are called possessive or genitive
noun phrases. We can model possessive noun phrases by treating the sub-NP
like Fortune’s or my uncle’s as a determiner of the following head noun. Write
grammar rules for English possessives. You may treat ’s as if it were a separate
word (i.e., as if there were always a space before ’s).
12.7 Page 238 discussed the need for a Wh-NP constituent. The simplest Wh-NP
is one of the Wh-pronouns (who, whom, whose, which). The Wh-words what
and which can be determiners: which four will you have?, what credit do you
have with the Duke? Write rules for the different types of Wh-NPs.
12.8 Write an algorithm for converting an arbitrary context-free grammar into Chom-
sky normal form.
CHAPTER

13 Constituency Parsing

One morning I shot an elephant in my pajamas.


How he got into my pajamas I don’t know.
Groucho Marx, Animal Crackers, 1930

Syntactic parsing is the task of assigning a syntactic structure to a sentence. This
chapter focuses on constituency structures, those assigned by context-free grammars
of the kind described in Chapter 12. In the next chapter we’ll introduce dependency
parses, an alternative kind of parse structure.
Parse trees can be used in applications such as grammar checking: a sentence that
cannot be parsed may have grammatical errors (or at least be hard to read). Parse
trees can be an intermediate stage of representation for semantic analysis (as we
show in Chapter 16) and thus play a role in applications like question answering.
For example, to answer the question
Which flights to Denver depart before the Seattle flight?
we’ll need to know that the questioner wants a list of flights going to Denver, not
flights going to Seattle, and parse structure (knowing that to Denver modifies flights,
and which flights to Denver is the subject of depart) can help us.
We begin by discussing ambiguity and the problems it presents, and then give
the Cocke-Kasami-Younger (CKY) algorithm (Kasami 1965, Younger 1967), the
standard dynamic programming approach to syntactic parsing. We’ve already seen
other dynamic programming algorithms like minimum edit distance (Chapter 2) and
Viterbi (Chapter 8).
The vanilla CKY algorithm returns an efficient representation of the set of parse
trees for a sentence, but doesn’t tell us which parse tree is the right one. For that,
we need to augment CKY with scores for each possible constituent. We’ll see how
to do this with neural span-based parsers. We’ll also introduce supertagging
for parsing CCG, partial parsing methods for use in situations in
which a superficial syntactic analysis of an input may be sufficient, and the standard
set of metrics for evaluating parser accuracy.

13.1 Ambiguity
Ambiguity is the most serious problem faced by syntactic parsers. Chapter 8 introduced
the notions of part-of-speech ambiguity and part-of-speech disambiguation.
Here, we introduce a new kind of ambiguity, called structural ambiguity,
illustrated with a new toy grammar L1, shown in Figure 13.1, which adds a few
rules to the L0 grammar from the last chapter.
Structural ambiguity occurs when the grammar can assign more than one parse
to a sentence. Groucho Marx’s well-known line as Captain Spaulding in Animal
Crackers is ambiguous because the phrase in my pajamas can be part of the NP

Grammar                     Lexicon
S → NP VP                   Det → that | this | the | a
S → Aux NP VP               Noun → book | flight | meal | money
S → VP                      Verb → book | include | prefer
NP → Pronoun                Pronoun → I | she | me
NP → Proper-Noun            Proper-Noun → Houston | NWA
NP → Det Nominal            Aux → does
Nominal → Noun              Preposition → from | to | on | near | through
Nominal → Nominal Noun
Nominal → Nominal PP
VP → Verb
VP → Verb NP
VP → Verb NP PP
VP → Verb PP
VP → VP PP
PP → Preposition NP
Figure 13.1 The L1 miniature English grammar and lexicon.

Left parse:
[S [NP [Pronoun I]] [VP [Verb shot] [NP [Det an] [Nominal [Nominal [Noun elephant]] [PP in my pajamas]]]]]

Right parse:
[S [NP [Pronoun I]] [VP [VP [Verb shot] [NP [Det an] [Nominal [Noun elephant]]]] [PP in my pajamas]]]

Figure 13.2 Two parse trees for an ambiguous sentence, shown in bracketed form. The parse on the left
corresponds to the humorous reading in which the elephant is in the pajamas; the parse on the right
corresponds to the reading in which Captain Spaulding did the shooting in his pajamas.

headed by elephant or a part of the verb phrase headed by shot. Figure 13.2 illustrates
these two analyses of Marx’s line using rules from L1.
Structural ambiguity, appropriately enough, comes in many forms. Two common
kinds of ambiguity are attachment ambiguity and coordination ambiguity. A
sentence has an attachment ambiguity if a particular constituent can be attached to
the parse tree at more than one place. The Groucho Marx sentence is an example of
PP-attachment ambiguity. Various kinds of adverbial phrases are also subject to this
kind of ambiguity. For instance, in the following example the gerundive-VP flying
to Paris can be part of a gerundive sentence whose subject is the Eiffel Tower or it
can be an adjunct modifying the VP headed by saw:
(13.1) We saw the Eiffel Tower flying to Paris.
In coordination ambiguity, phrases can be conjoined by a conjunction like and.
For example, the phrase old men and women can be bracketed as [old [men and
women]], referring to old men and old women, or as [old men] and [women], in
which case it is only the men who are old. These ambiguities combine in complex
ways in real sentences, like the following news sentence from the Brown corpus:
(13.2) President Kennedy today pushed aside other White House business to
devote all his time and attention to working on the Berlin crisis address he
will deliver tomorrow night to the American people over nationwide
television and radio.
This sentence has a number of ambiguities, although since they are semantically
unreasonable, it requires a careful reading to see them. The last noun phrase could be
parsed [nationwide [television and radio]] or [[nationwide television] and radio].
The direct object of pushed aside should be other White House business but could
also be the bizarre phrase [other White House business to devote all his time and
attention to working] (i.e., a structure like Kennedy affirmed [his intention to propose
a new budget to address the deficit]). Then the phrase on the Berlin crisis address he
will deliver tomorrow night to the American people could be an adjunct modifying
the verb pushed. A PP like over nationwide television and radio could be attached
to any of the higher VPs or NPs (e.g., it could modify people or night).
The fact that there are many grammatically correct but semantically unreasonable
parses for naturally occurring sentences is an irksome problem that affects all
parsers. Fortunately, the CKY algorithm below is designed to efficiently handle
structural ambiguities. And as we’ll see in the following section, we can augment
CKY with neural methods to choose a single correct parse by syntactic disambiguation.

13.2 CKY Parsing: A Dynamic Programming Approach


Dynamic programming provides a powerful framework for addressing the prob-
lems caused by ambiguity in grammars. Recall that a dynamic programming ap-
proach systematically fills in a table of solutions to sub-problems. The complete ta-
ble has the solution to all the sub-problems needed to solve the problem as a whole.
In the case of syntactic parsing, these sub-problems represent parse trees for all the
constituents detected in the input.
The dynamic programming advantage arises from the context-free nature of our
grammar rules — once a constituent has been discovered in a segment of the input
we can record its presence and make it available for use in any subsequent derivation
that might require it. This provides both time and storage efficiencies since subtrees
can be looked up in a table, not reanalyzed. This section presents the Cocke-Kasami-Younger
(CKY) algorithm, the most widely used dynamic-programming based approach
to parsing. Chart parsing (Kaplan 1973, Kay 1982) is a related approach,
and dynamic programming methods are often referred to as chart parsing methods.

13.2.1 Conversion to Chomsky Normal Form


The CKY algorithm requires grammars to first be in Chomsky Normal Form (CNF).
Recall from Chapter 12 that grammars in CNF are restricted to rules of the form
A → B C or A → w. That is, the right-hand side of each rule must expand either to
two non-terminals or to a single terminal. Restricting a grammar to CNF does not
lead to any loss in expressiveness, since any context-free grammar can be converted
into a corresponding CNF grammar that accepts exactly the same set of strings as
the original grammar.
Let’s start with the process of converting a generic CFG into one represented in
CNF. Assuming we’re dealing with an ε-free grammar, there are three situations we
need to address in any generic grammar: rules that mix terminals with non-terminals
on the right-hand side, rules that have a single non-terminal on the right-hand side,
and rules in which the length of the right-hand side is greater than 2.
The remedy for rules that mix terminals and non-terminals is to simply introduce
a new dummy non-terminal that covers only the original terminal. For example, a
rule for an infinitive verb phrase such as INF-VP → to VP would be replaced by the
two rules INF-VP → TO VP and TO → to.
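
This first remedy can be sketched in Python as follows. The representation is an assumption made for illustration: a rule is a (left-hand side, right-hand side tuple) pair, the set of non-terminals is passed in explicitly, and each dummy non-terminal is named by upper-casing the terminal, which would need a safer naming scheme in a grammar where that name is already taken.

```python
def replace_terminals(rules, nonterminals):
    """Replace terminals in mixed or long right-hand sides with dummy non-terminals."""
    new_rules = []
    dummies = {}  # terminal -> dummy non-terminal that covers only it
    for lhs, rhs in rules:
        if len(rhs) == 1:
            # A -> w (or a unit production) has nothing to replace.
            new_rules.append((lhs, rhs))
            continue
        new_rhs = []
        for sym in rhs:
            if sym in nonterminals:
                new_rhs.append(sym)
            else:
                # Assumed naming convention: "to" becomes "TO".
                new_rhs.append(dummies.setdefault(sym, sym.upper()))
        new_rules.append((lhs, tuple(new_rhs)))
    # Add one lexical rule per dummy non-terminal.
    for terminal, dummy in dummies.items():
        new_rules.append((dummy, (terminal,)))
    return new_rules

rules = [("INF-VP", ("to", "VP"))]
converted = replace_terminals(rules, {"INF-VP", "VP"})
print(converted)
```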
Rules with a single non-terminal on the right are called unit productions. We
can eliminate unit productions by rewriting the right-hand side of the original rules
with the right-hand side of all the non-unit production rules that they ultimately lead
to. More formally, if A ⇒* B by a chain of one or more unit productions and B → γ
is a non-unit production in our grammar, then we add A → γ for each such rule in
the grammar and discard all the intervening unit productions. As we demonstrate
with our toy grammar, this can lead to a substantial flattening of the grammar and a
consequent promotion of terminals to fairly high levels in the resulting trees.
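
A minimal Python sketch of unit-production elimination is given below, on a small fragment of the toy grammar. The rule representation, the explicit set of non-terminals, and the function name are assumptions of the sketch; it assumes every left-hand side appears in that set.

```python
def eliminate_unit_productions(rules, nonterminals):
    """Remove unit productions A -> B, where B is a non-terminal."""
    def is_unit(rhs):
        return len(rhs) == 1 and rhs[0] in nonterminals

    # reach[A] holds every B reachable from A by one or more unit productions.
    reach = {a: set() for a in nonterminals}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if is_unit(rhs):
                targets = {rhs[0]} | reach[rhs[0]]
                if not targets <= reach[lhs]:
                    reach[lhs] |= targets
                    changed = True

    # Keep the non-unit rules, then add A -> gamma for each non-unit B -> gamma.
    non_unit = [(lhs, rhs) for lhs, rhs in rules if not is_unit(rhs)]
    result = list(non_unit)
    for a in sorted(nonterminals):
        for b in sorted(reach[a]):
            for lhs, rhs in non_unit:
                if lhs == b and (a, rhs) not in result:
                    result.append((a, rhs))
    return result

rules = [
    ("S", ("VP",)),
    ("VP", ("Verb",)),
    ("VP", ("Verb", "NP")),
    ("Verb", ("book",)),
]
flat = eliminate_unit_productions(rules, {"S", "VP", "Verb", "NP"})
for lhs, rhs in flat:
    print(lhs, "→", " ".join(rhs))
```

On this fragment the verb book is promoted to both VP and S, which is the flattening effect described above.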
Rules with right-hand sides longer than 2 are normalized through the introduc-
tion of new non-terminals that spread the longer sequences over several new rules.
Formally, if we have a rule like

A → B C γ

we replace the leftmost pair of non-terminals with a new non-terminal and introduce
a new production, resulting in the following new rules:

A → X1 γ
X1 → B C

In the case of longer right-hand sides, we simply iterate this process until the of-
fending rule has been replaced by rules of length 2. The choice of replacing the
leftmost pair of non-terminals is purely arbitrary; any systematic scheme that results
in binary rules would suffice.
In our current grammar, the rule S → Aux NP VP would be replaced by the two
rules S → X1 VP and X1 → Aux NP.
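
The binarization step can be sketched in Python as below. The rule representation and the X1, X2, ... naming of fresh non-terminals follow the text, but the function itself is an illustrative assumption: it assumes those names are not already used in the grammar, and it creates a fresh symbol for every long rule rather than sharing one between rules that begin with the same pair.

```python
def binarize(rules):
    """Split right-hand sides longer than 2 by replacing the leftmost pair."""
    new_rules = []
    counter = 0
    for lhs, rhs in rules:
        rhs = tuple(rhs)
        while len(rhs) > 2:
            counter += 1
            fresh = f"X{counter}"
            new_rules.append((fresh, rhs[:2]))   # X_k -> B C
            rhs = (fresh,) + rhs[2:]             # A -> X_k gamma
        new_rules.append((lhs, rhs))
    return new_rules

binary = binarize([("S", ("Aux", "NP", "VP"))])
for lhs, rhs in binary:
    print(lhs, "→", " ".join(rhs))
```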
The entire conversion process can be summarized as follows:
1. Copy all conforming rules to the new grammar unchanged.
2. Convert terminals within rules to dummy non-terminals.
3. Convert unit productions.
4. Make all rules binary and add them to new grammar.
Figure 13.3 shows the results of applying this entire conversion procedure to
the L1 grammar introduced earlier on page 260. Note that this figure doesn’t show
the original lexical rules; since these original lexical rules are already in CNF, they
all carry over unchanged to the new grammar. Figure 13.3 does, however, show
the various places where the process of eliminating unit productions has, in effect,
created new lexical rules. For example, all the original verbs have been promoted to
both VPs and to Ss in the converted grammar.

L1 Grammar                  L1 in CNF
S → NP VP                   S → NP VP
S → Aux NP VP               S → X1 VP
                            X1 → Aux NP
S → VP                      S → book | include | prefer
                            S → Verb NP
                            S → X2 PP
                            S → Verb PP
                            S → VP PP
NP → Pronoun                NP → I | she | me
NP → Proper-Noun            NP → TWA | Houston
NP → Det Nominal            NP → Det Nominal
Nominal → Noun              Nominal → book | flight | meal | money
Nominal → Nominal Noun      Nominal → Nominal Noun
Nominal → Nominal PP        Nominal → Nominal PP
VP → Verb                   VP → book | include | prefer
VP → Verb NP                VP → Verb NP
VP → Verb NP PP             VP → X2 PP
                            X2 → Verb NP
VP → Verb PP                VP → Verb PP
VP → VP PP                  VP → VP PP
PP → Preposition NP         PP → Preposition NP
Figure 13.3 L1 Grammar and its conversion to CNF. Note that although they aren’t shown
here, all the original lexical entries from L1 carry over unchanged as well.

13.2.2 CKY Recognition


With our grammar now in CNF, each non-terminal node above the part-of-speech
level in a parse tree will have exactly two daughters. A two-dimensional matrix can
be used to encode the structure of an entire tree. For a sentence of length n, we will
work with the upper-triangular portion of an (n + 1) × (n + 1) matrix. Each cell [i, j]
in this matrix contains the set of non-terminals that represent all the constituents that
span positions i through j of the input. Since our indexing scheme begins with 0, it’s
natural to think of the indexes as pointing at the gaps between the input words (as in
0 Book 1 that 2 flight 3). These gaps are often called fenceposts, on the metaphor of
the posts between segments of fencing. It follows then that the cell that represents
the entire input resides in position [0, n] in the matrix.
Since each non-terminal entry in our table has two daughters in the parse, it fol-
lows that for each constituent represented by an entry [i, j], there must be a position
in the input, k, where it can be split into two parts such that i < k < j. Given such
a position k, the first constituent [i, k] must lie to the left of entry [i, j] somewhere
along row i, and the second entry [k, j] must lie beneath it, along column j.
To make this more concrete, consider the following example with its completed
parse matrix, shown in Fig. 13.4.
(13.3) Book the flight through Houston.
The superdiagonal row in the matrix contains the parts of speech for each word in
the input. The subsequent diagonals above that superdiagonal contain constituents
that cover all the spans of increasing length in the input.
Given this setup, CKY recognition consists of filling the parse table in the right
way. To do this, we’ll proceed in a bottom-up fashion so that at the point where
we are filling any cell [i, j], the cells containing the parts that could contribute to

        Book                the         flight            through      Houston
row 0:  [0,1] S, VP, Verb,  [0,2] ∅     [0,3] S, VP, X2   [0,4] ∅      [0,5] S, VP, X2
              Nominal, Noun
row 1:                      [1,2] Det   [1,3] NP          [1,4] ∅      [1,5] NP
row 2:                                  [2,3] Nominal,    [2,4] ∅      [2,5] Nominal
                                              Noun
row 3:                                                    [3,4] Prep   [3,5] PP
row 4:                                                                 [4,5] NP, Proper-Noun

Figure 13.4 Completed parse table for Book the flight through Houston.

this entry (i.e., the cells to the left and the cells below) have already been filled.
The algorithm given in Fig. 13.5 fills the upper-triangular matrix a column at a time
working from left to right, with each column filled from bottom to top, as the right
side of Fig. 13.4 illustrates. This scheme guarantees that at each point in time we
have all the information we need (to the left, since all the columns to the left have
already been filled, and below since we’re filling bottom to top). It also mirrors on-
line processing, since filling the columns from left to right corresponds to processing
each word one at a time.

function CKY-PARSE(words, grammar) returns table

  for j ← from 1 to LENGTH(words) do
    for all {A | A → words[j] ∈ grammar}
      table[j − 1, j] ← table[j − 1, j] ∪ A
    for i ← from j − 2 down to 0 do
      for k ← i + 1 to j − 1 do
        for all {A | A → B C ∈ grammar and B ∈ table[i, k] and C ∈ table[k, j]}
          table[i, j] ← table[i, j] ∪ A
Figure 13.5 The CKY algorithm.

The outermost loop of the algorithm given in Fig. 13.5 iterates over the columns,
and the second loop iterates over the rows, from the bottom up. The purpose of the
innermost loop is to range over all the places where a substring spanning i to j in
the input might be split in two. As k ranges over the places where the string can be
split, the pairs of cells we consider move, in lockstep, to the right along row i and
down along column j. Figure 13.6 illustrates the general case of filling cell [i, j]. At
each such split, the algorithm considers whether the contents of the two cells can be
combined in a way that is sanctioned by a rule in the grammar. If such a rule exists,
the non-terminal on its left-hand side is entered into the table.
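The three nested loops translate almost directly into Python. The sketch below is one possible rendering, not a definitive implementation: the dictionary encodings of the lexical and binary rules, the function name cky_recognize, and the lowercase tokenization are assumptions made for the example, and the grammar contains only the part of the CNF version of L1 needed for the example sentence.

```python
from collections import defaultdict

def cky_recognize(words, lexical, binary):
    """lexical: word -> set of A with A -> word.
    binary: (B, C) -> set of A with A -> B C."""
    n = len(words)
    table = defaultdict(set)  # table[i, j]: non-terminals spanning fenceposts i..j
    for j in range(1, n + 1):                  # columns, left to right
        table[j - 1, j] |= lexical.get(words[j - 1], set())
        for i in range(j - 2, -1, -1):         # rows, bottom to top
            for k in range(i + 1, j):          # split points
                for b in table[i, k]:
                    for c in table[k, j]:
                        table[i, j] |= binary.get((b, c), set())
    return table


lexical = {
    "book": {"S", "VP", "Verb", "Nominal", "Noun"},
    "the": {"Det"},
    "flight": {"Nominal", "Noun"},
    "through": {"Preposition"},
    "houston": {"NP", "Proper-Noun"},
}
binary = {
    ("NP", "VP"): {"S"},
    ("Verb", "NP"): {"S", "VP", "X2"},
    ("X2", "PP"): {"S", "VP"},
    ("Verb", "PP"): {"S", "VP"},
    ("VP", "PP"): {"S", "VP"},
    ("Det", "Nominal"): {"NP"},
    ("Nominal", "Noun"): {"Nominal"},
    ("Nominal", "PP"): {"Nominal"},
    ("Preposition", "NP"): {"PP"},
}
words = "book the flight through houston".split()
table = cky_recognize(words, lexical, binary)
recognized = "S" in table[0, len(words)]
```

Because Python lists are 0-indexed, words[j − 1] plays the role of words[j] in the pseudocode; the cell indices themselves are unchanged.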
Figure 13.7 shows how the five cells of column 5 of the table are filled after the
word Houston is read. The arrows point out the two spans that are being used to add
an entry to the table. Note that the action in cell [0, 5] indicates the presence of three
alternative parses for this input, one where the PP modifies the flight, one where

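[Figure 13.6 diagram: cell [i, j] is filled by pairing each cell in row i with the matching cell in column j: [i, i+1] with [i+1, j], [i, i+2] with [i+2, j], ..., [i, j−2] with [j−2, j], and [i, j−1] with [j−1, j].]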

Figure 13.6 All the ways to fill the [i, j]th cell in the CKY table.

it modifies the booking, and one that captures the second argument in the original
VP → Verb NP PP rule, now captured indirectly with the VP → X2 PP rule.

13.2.3 CKY Parsing


The algorithm given in Fig. 13.5 is a recognizer, not a parser; for it to succeed, it
simply has to find an S in cell [0, n]. To turn it into a parser capable of returning all
possible parses for a given input, we can make two simple changes to the algorithm:
the first change is to augment the entries in the table so that each non-terminal is
paired with pointers to the table entries from which it was derived (more or less as
shown in Fig. 13.7); the second change is to permit multiple versions of the same
non-terminal to be entered into the table (again as shown in Fig. 13.7). With these
changes, the completed table contains all the possible parses for a given input. Re-
turning an arbitrary single parse consists of choosing an S from cell [0, n] and then
recursively retrieving its component constituents from the table.
Returning every parse for a sentence may not be useful, since there may be an
exponential number of parses. We’ll see in the next section how to retrieve only the
best parse.
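The two changes can be sketched as follows. This is an illustrative version only: the table layout (each cell maps a non-terminal to a list of derivations), the helper names cky_parse, count_trees, and first_tree, and the grammar encoding are assumptions for the example rather than part of the algorithm in Fig. 13.5.

```python
from collections import defaultdict

def cky_parse(words, lexical, binary):
    """table[i, j][A] lists every derivation of A over span (i, j);
    a derivation is either the word itself or a split (k, B, C)."""
    n = len(words)
    table = defaultdict(lambda: defaultdict(list))
    for j in range(1, n + 1):
        for a in lexical.get(words[j - 1], ()):
            table[j - 1, j][a].append(words[j - 1])
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for b in list(table[i, k]):
                    for c in list(table[k, j]):
                        for a in binary.get((b, c), ()):
                            table[i, j][a].append((k, b, c))
    return table

def count_trees(table, i, j, a):
    """Number of distinct parse trees rooted in A over span (i, j)."""
    total = 0
    for d in table[i, j][a]:
        if isinstance(d, str):
            total += 1
        else:
            k, b, c = d
            total += count_trees(table, i, k, b) * count_trees(table, k, j, c)
    return total

def first_tree(table, i, j, a):
    """Follow the first backpointer at every node to recover one tree."""
    d = table[i, j][a][0]
    if isinstance(d, str):
        return (a, d)
    k, b, c = d
    return (a, first_tree(table, i, k, b), first_tree(table, k, j, c))


lexical = {
    "book": ["S", "VP", "Verb", "Nominal", "Noun"],
    "the": ["Det"],
    "flight": ["Nominal", "Noun"],
    "through": ["Preposition"],
    "houston": ["NP", "Proper-Noun"],
}
binary = {
    ("Verb", "NP"): ["S", "VP", "X2"],
    ("X2", "PP"): ["S", "VP"],
    ("VP", "PP"): ["S", "VP"],
    ("Det", "Nominal"): ["NP"],
    ("Nominal", "PP"): ["Nominal"],
    ("Preposition", "NP"): ["PP"],
}
words = "book the flight through houston".split()
table = cky_parse(words, lexical, binary)
n_parses = count_trees(table, 0, len(words), "S")
one_parse = first_tree(table, 0, len(words), "S")
```

Counting trees multiplies the counts of the two daughters at each split, so the total can be computed without enumerating every tree, which matters when the number of parses grows exponentially.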

[Figure 13.7 diagram: five copies of the parse table for Book the flight through Houston, showing the cells of column 5 being filled from bottom to top: [4,5] (NP, Proper-Noun), [3,5] (PP), [2,5] (Nominal), [1,5] (NP), and finally [0,5] (S1, VP, X2, S2, VP, S3).]

Figure 13.7 Filling the cells of column 5 after reading the word Houston.

13.2.4 CKY in Practice


Finally, we should note that while the restriction to CNF does not pose a prob-
lem theoretically, it does pose some non-trivial problems in practice. Obviously, as
things stand now, our parser isn’t returning trees that are consistent with the grammar
given to us by our friendly syntacticians. In addition to making our grammar devel-
opers unhappy, the conversion to CNF will complicate any syntax-driven approach
to semantic analysis.
One approach to getting around these problems is to keep enough information
around to transform our trees back to the original grammar as a post-processing step
of the parse. This is trivial in the case of the transformation used for rules with length
greater than 2. Simply deleting the new dummy non-terminals and promoting their
daughters restores the original tree.
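A sketch of this post-processing step is shown below. It assumes, purely for illustration, that trees are nested tuples of the form (label, child, ...), that a part-of-speech node is (label, word), and that dummy non-terminals are named X followed by digits; none of these conventions comes from the text. It reverses only the binarization step, not the elimination of unit productions.

```python
def unbinarize(tree, is_dummy=lambda label: label[0] == "X" and label[1:].isdigit()):
    """Delete dummy non-terminals and promote their daughters."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return tree  # node directly above a word: nothing to undo
    new_children = []
    for child in children:
        child = unbinarize(child, is_dummy)
        if is_dummy(child[0]):
            new_children.extend(child[1:])  # splice the dummy's daughters in place
        else:
            new_children.append(child)
    return (label, *new_children)


cnf_tree = (
    "VP",
    ("X2", ("Verb", "book"), ("NP", ("Det", "the"), ("Nominal", "flight"))),
    ("PP", ("Preposition", "through"), ("NP", "Houston")),
)
original_tree = unbinarize(cnf_tree)
```

Here the VP → X2 PP and X2 → Verb NP nodes collapse back into a single VP → Verb NP PP node.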
In the case of unit productions, it turns out to be more convenient to alter the ba-
sic CKY algorithm to handle them directly than it is to store the information needed
to recover the correct trees. Exercise 13.3 asks you to make this change. Many of
the probabilistic parsers presented in Appendix C use the CKY algorithm altered in
just this manner.

13.3 Span-Based Neural Constituency Parsing


While the CKY parsing algorithm we’ve seen so far does a good job of enumerating all
the possible parse trees for a sentence, it has a large problem: it doesn’t tell us which
parse is the correct one! That is, it doesn’t disambiguate among the possible parses.
To solve the disambiguation problem we’ll use a simple neural extension of the
CKY algorithm. The intuition of such parsing algorithms (often called span-based
constituency parsing, or neural CKY) is to train a neural classifier to assign a
score to each constituent, and then use a modified version of CKY to combine these
constituent scores to find the best-scoring parse tree. Here we’ll describe a version
of the algorithm from Kitaev et al. (2019).

13.3.1 Computing Scores for a Span


Let’s begin by considering just the constituent (we’ll call it a span) that lies between
fencepost positions i and j with non-terminal symbol label l. We’ll build a classifier
to assign a score s(i, j, l) to this constituent span.
Fig. 13.8 sketches the architecture. The input word tokens are embedded by
passing them through a pretrained language model like BERT. Because BERT oper-
ates on the level of subword (wordpiece) tokens rather than words, we’ll first need to
convert the BERT outputs to word representations. One standard way of doing this
is to simply use the last subword unit as the representation for the word (using the
first subword unit seems to work equivalently well). The embeddings can then be
passed through some postprocessing layers; Kitaev et al. (2019), for example, use 8
Transformer layers.
The resulting word encoder outputs y_t are then used to compute a span score.
First, we must map the word encodings (indexed by word positions) to span encod-
ings (indexed by fenceposts). We do this by representing each fencepost with two
separate values; the intuition is that a span endpoint to the right of a word represents
different information than a span endpoint to the left of a word. We convert each

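[Figure 13.8 diagram, bottom to top: the input [START] Book the flight through Houston [END] is mapped to subwords, passed through BERT, mapped back to words, and passed through postprocessing layers; the outputs sit between fenceposts 0 to 5; the span with i=1 and j=3 is represented as hj − hi and scored by an MLP, and the span scores are passed to CKY to compute the best parse, here with the label NP.]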


Figure 13.8 A simplified outline of computing the span score for the span the flight with
the label NP.

word output y_t into a (leftward-pointing) value for spans ending at this fencepost,
$\overrightarrow{y}_t$, and a (rightward-pointing) value $\overleftarrow{y}_t$ for spans beginning at this fencepost, by
splitting y_t into two halves. Each span then stretches from one double-vector fencepost
to another, as in the following representation of the flight, which is span(1, 3):

             START          Book           the            flight         through
words:       y_0            y_1            y_2            y_3            y_4
halves:          →y_0  ←y_1     →y_1  ←y_2     →y_2  ←y_3     →y_3  ←y_4     →y_4  ←y_5  ...
fenceposts:          0              1              2              3              4
                                    |--------- span(1,3) ---------|

A traditional way to represent a span, developed originally for RNN-based models
(Wang and Chang, 2016), but extended also to Transformers, is to take the differ-
ence between the embeddings of its start and end, i.e., representing span (i, j) by
subtracting the embedding of i from the embedding of j. Here we represent a span
by concatenating the difference of each of its fencepost components:
$$ v(i, j) = \left[ \overrightarrow{y}_j - \overrightarrow{y}_i \,;\, \overleftarrow{y}_{j+1} - \overleftarrow{y}_{i+1} \right] \tag{13.4} $$

The span vector v is then passed through an MLP span classifier, with two fully-
connected layers and one ReLU activation function, whose output dimensionality is
the number of possible non-terminal labels:
s(i, j, ·) = W2 ReLU(LayerNorm(W1 v(i, j))) (13.5)

The MLP then outputs a score for each possible non-terminal.
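To make the shapes concrete, here is a small pure-Python sketch of Eqs. 13.4 and 13.5. It is an illustration under several assumptions: the encoder outputs and the weights W1 and W2 are random stand-ins rather than trained values, the first half of each output is taken as the forward half and the second as the backward half, and LayerNorm is applied without learned gain and bias parameters. All function names are hypothetical.

```python
import math
import random

def span_vector(ys, i, j):
    """Eq. 13.4: concatenate the forward-half and backward-half differences."""
    half = len(ys[0]) // 2
    fwd = [y[:half] for y in ys]   # assumed forward halves
    bwd = [y[half:] for y in ys]   # assumed backward halves
    left = [a - b for a, b in zip(fwd[j], fwd[i])]
    right = [a - b for a, b in zip(bwd[j + 1], bwd[i + 1])]
    return left + right

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def span_scores(ys, i, j, W1, W2):
    """Eq. 13.5: W2 ReLU(LayerNorm(W1 v(i, j))), one score per label."""
    hidden = [max(0.0, h) for h in layer_norm(matvec(W1, span_vector(ys, i, j)))]
    return matvec(W2, hidden)


random.seed(0)
dim, hidden_dim, labels = 8, 6, ["NP", "VP", "PP"]
# One output per position: START, five words, END.
ys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(7)]
W1 = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(hidden_dim)]
W2 = [[random.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(len(labels))]
scores = dict(zip(labels, span_scores(ys, 1, 3, W1, W2)))  # span "the flight"
```

The output has one entry per non-terminal label, so a single pass through the MLP scores every candidate label for the span.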

13.3.2 Integrating Span Scores into a Parse


Now we have a score for each labeled constituent span s(i, j, l). But we need a score
for an entire parse tree. Formally a tree T is represented as a set of |T | such labeled

spans, with the t-th span starting at position i_t and ending at position j_t, with label l_t:

$$ T = \{(i_t, j_t, l_t) : t = 1, \ldots, |T|\} \tag{13.6} $$

Thus once we have a score for each span, the parser can compute a score for the
whole tree s(T ) simply by summing over the scores of its constituent spans:
$$ s(T) = \sum_{(i, j, l) \in T} s(i, j, l) \tag{13.7} $$

And we can choose the final parse tree as the tree with the maximum score:

$$ \hat{T} = \operatorname*{argmax}_{T} s(T) \tag{13.8} $$

The simplest method to produce the most likely parse is to greedily choose the
highest scoring label for each span. This greedy method is not guaranteed to produce
a tree, since the best label for a span might not fit into a complete tree. In practice,
however, the greedy method tends to find trees; in their experiments Gaddy et al.
(2018) find that 95% of predicted bracketings form valid trees.
Nonetheless it is more common to use a variant of the CKY algorithm to find the
full parse. The variant defined in Gaddy et al. (2018) works as follows. Let’s define
$s_{\text{best}}(i, j)$ as the score of the best subtree spanning (i, j). For spans of length one, we
choose the best label:

$$ s_{\text{best}}(i, i+1) = \max_{l} s(i, i+1, l) \tag{13.9} $$

For other spans (i, j), the recursion is:

$$ s_{\text{best}}(i, j) = \max_{l} s(i, j, l) + \max_{k} \left[ s_{\text{best}}(i, k) + s_{\text{best}}(k, j) \right] \tag{13.10} $$
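The recursion can be written as a short memoized function. The sketch below is illustrative only: the function name best_parse, the toy span scores, and the default score of −1 for unlisted spans are assumptions for the example, and a real parser would obtain s(i, j, l) from the trained span classifier. Besides the best score it also returns the labeled spans of the best tree, so the result can be checked against the sum in Eq. 13.7.

```python
from functools import lru_cache

def best_parse(n, span_score, labels):
    """Return (score, spans) of the best binary tree over fenceposts 0..n."""
    @lru_cache(maxsize=None)
    def best(i, j):
        label = max(labels, key=lambda l: span_score(i, j, l))
        score = span_score(i, j, label)
        spans = ((i, j, label),)
        if j > i + 1:
            # Choose the split point k that maximizes the two subtree scores.
            k = max(range(i + 1, j), key=lambda k: best(i, k)[0] + best(k, j)[0])
            score += best(i, k)[0] + best(k, j)[0]
            spans += best(i, k)[1] + best(k, j)[1]
        return score, spans
    return best(0, n)


# Hypothetical span scores for a three-word sentence.
toy_scores = {
    (0, 3, "S"): 2.0, (1, 3, "NP"): 1.5, (0, 2, "VP"): 0.5,
    (0, 1, "VP"): 1.0, (1, 2, "NP"): 0.2, (2, 3, "NP"): 0.3,
}
labels = ["S", "NP", "VP"]

def span_score(i, j, l):
    return toy_scores.get((i, j, l), -1.0)

best_score, best_spans = best_parse(3, span_score, labels)
```

With these toy scores the best tree splits at k = 1, pairing the one-word span (0, 1) with the two-word span (1, 3).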

For more details on span-based parsing, including the margin-based training al-
gorithm, see Stern et al. (2017), Gaddy et al. (2018), Kitaev and Klein (2018), and
Kitaev et al. (2019).

13.4 Evaluating Parsers


The standard tool for evaluating parsers that assign a single parse tree to a sentence
is the PARSEVAL metrics (Black et al., 1991). The PARSEVAL metric measures
how much the constituents in the hypothesis parse tree look like the constituents in a
hand-labeled, reference parse. PARSEVAL thus requires a human-labeled reference
(or “gold standard”) parse tree for each sentence in the test set; we generally draw
these reference parses from a treebank like the Penn Treebank.
A constituent in a hypothesis parse C_h of a sentence s is labeled correct if there
is a constituent in the reference parse C_r with the same starting point, ending point,
and non-terminal symbol. We can then measure the precision and recall just as for
tasks we’ve seen already like named entity tagging:

$$ \text{labeled recall} = \frac{\text{\# of correct constituents in hypothesis parse of } s}{\text{\# of total constituents in reference parse of } s} $$

$$ \text{labeled precision} = \frac{\text{\# of correct constituents in hypothesis parse of } s}{\text{\# of total constituents in hypothesis parse of } s} $$

As usual, we often report a combination of the two, F1:

$$ F_1 = \frac{2PR}{P + R} \tag{13.11} $$
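As a rough illustration of these definitions (not a substitute for a standard scorer), labeled precision, recall, and F1 can be computed from two sets of labeled spans. The function name parseval and the example spans are assumptions; representing each parse as a set also assumes that no parse contains two constituents with the same span and label.

```python
def parseval(hypothesis, reference):
    """Each parse is a collection of (start, end, label) constituents."""
    hyp, ref = set(hypothesis), set(reference)
    correct = len(hyp & ref)  # same start, end, and non-terminal symbol
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


reference = {(0, 5, "S"), (0, 5, "VP"), (1, 5, "NP"), (3, 5, "PP")}
hypothesis = {(0, 5, "S"), (0, 5, "VP"), (0, 3, "VP"), (1, 3, "NP"), (3, 5, "PP")}
precision, recall, f1 = parseval(hypothesis, reference)
```

In this example three of the five hypothesis constituents match the reference, which has four constituents in total.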
We additionally use a new metric, crossing brackets, for each sentence s:
cross-brackets: the number of constituents for which the reference parse has a
bracketing such as ((A B) C) but the hypothesis parse has a bracketing such
as (A (B C)).
For comparing parsers that use different grammars, the PARSEVAL metric in-
cludes a canonicalization algorithm for removing information likely to be grammar-
specific (auxiliaries, pre-infinitival “to”, etc.) and for computing a simplified score
(Black et al., 1991). The canonical implementation of the PARSEVAL metrics is
called evalb (Sekine and Collins, 1997).

13.5 Partial Parsing


Many language processing tasks do not require complex, complete parse trees for all
inputs. For these tasks, a partial parse, or shallow parse, of input sentences may be
sufficient. For example, information extraction systems generally do not extract all
the possible information from a text: they simply identify and classify the segments
in a text that are likely to contain valuable information.
One kind of partial parsing is known as chunking. Chunking is the process
of identifying and classifying the flat, non-overlapping segments of a sentence that
constitute the basic non-recursive phrases corresponding to the major content-word
parts-of-speech: noun phrases, verb phrases, adjective phrases, and prepositional
phrases. The task of finding all the base noun phrases in a text is particularly com-
mon. Since chunked texts lack a hierarchical structure, a simple bracketing notation
is sufficient to denote the location and the type of the chunks in a given example:
(13.12) [NP The morning flight] [PP from] [NP Denver] [VP has arrived.]
This bracketing notation makes clear the two fundamental tasks that are involved
in chunking: segmenting (finding the non-overlapping extents of the chunks) and
labeling (assigning the correct tag to the discovered chunks). Some input words
may not be part of any chunk, particularly in tasks like base NP:
(13.13) [NP The morning flight] from [NP Denver] has arrived.
What constitutes a syntactic base phrase depends on the application (and whether
the phrases come from a treebank). Nevertheless, some standard guidelines are fol-
lowed in most systems. First and foremost, base phrases of a given type do not
recursively contain any constituents of the same type. Eliminating this kind of recur-
sion leaves us with the problem of determining the boundaries of the non-recursive
phrases. In most approaches, base phrases include the headword of the phrase, along
with any pre-head material within the constituent, while crucially excluding any
post-head material. Eliminating post-head modifiers obviates the need to resolve
attachment ambiguities. This exclusion does lead to certain oddities, such as PPs
and VPs often consisting solely of their heads. Thus a flight from Indianapolis to
Houston would be reduced to the following:

(13.14) [NP a flight] [PP from] [NP Indianapolis][PP to][NP Houston]


Chunking Algorithms Chunking is generally done via supervised learning, train-
ing a BIO sequence labeler of the sort we saw in Chapter 8 from annotated training
data. Recall that in BIO tagging, we have a tag for the beginning (B) and inside (I) of
each chunk type, and one for tokens outside (O) any chunk. The following example
shows the bracketing notation of (13.12) on page 270 reframed as a tagging task:
(13.15) The    morning  flight  from  Denver  has   arrived
        B_NP   I_NP     I_NP    B_PP  B_NP    B_VP  I_VP
The same sentence with only the base-NPs tagged illustrates the role of the O tags.
(13.16) The    morning  flight  from  Denver  has   arrived.
        B_NP   I_NP     I_NP    O     B_NP    O     O
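The mapping from bracketed chunks to BIO tags is mechanical, as the following sketch suggests. The span encoding (start, end, label) with an exclusive end index, the function name chunks_to_bio, and the tag spelling B_NP are assumptions made for the example.

```python
def chunks_to_bio(tokens, chunks):
    """chunks: list of (start, end, label) token spans, end exclusive."""
    tags = ["O"] * len(tokens)          # tokens outside any chunk stay O
    for start, end, label in chunks:
        tags[start] = f"B_{label}"      # first token of the chunk
        for t in range(start + 1, end):
            tags[t] = f"I_{label}"      # remaining tokens of the chunk
    return tags


tokens = "The morning flight from Denver has arrived".split()
all_chunks = [(0, 3, "NP"), (3, 4, "PP"), (4, 5, "NP"), (5, 7, "VP")]
base_nps = [(0, 3, "NP"), (4, 5, "NP")]
full_tags = chunks_to_bio(tokens, all_chunks)
np_tags = chunks_to_bio(tokens, base_nps)
```

The two calls reproduce the tag sequences of the two examples, including the O tags that appear when only base NPs are annotated.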
Since annotation efforts are expensive and time consuming, chunkers usually
rely on existing treebanks like the Penn Treebank, extracting syntactic phrases from
the full parse constituents of a sentence, finding the appropriate heads and then in-
cluding the material to the left of the head, ignoring the text to the right. This is
somewhat error-prone since it relies on the accuracy of the head-finding rules de-
scribed in Chapter 12.
Given a training set, any sequence model can be used to chunk: CRF, RNN,
Transformer, etc. As with the evaluation of named-entity taggers, the evaluation of
chunkers proceeds by comparing chunker output with gold-standard answers pro-
vided by human annotators, using precision, recall, and F1 .

13.6 CCG Parsing


Lexicalized grammar frameworks such as CCG pose problems for which the phrase-
based methods we’ve been discussing are not particularly well-suited. To quickly
review, CCG consists of three major parts: a set of categories, a lexicon that asso-
ciates words with categories, and a set of rules that govern how categories combine
in context. Categories can be either atomic elements, such as S and NP, or functions
such as (S\NP)/NP which specifies the transitive verb category. Rules specify how
functions, their arguments, and other functions combine. For example, the following
rule templates, forward and backward function application, specify the way that
functions apply to their arguments.

X/Y Y ⇒ X
Y X\Y ⇒ X

The first rule applies a function to its argument on the right, while the second
looks to the left for its argument. The result of applying either of these rules is the
category specified as the value of the function being applied. For the purposes of
this discussion, we’ll rely on these two rules along with the forward and backward
composition rules and type-raising, as described in Chapter 12.
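The two application rules can be sketched with a simple category encoding. Everything about the encoding is an assumption for illustration: atomic categories are strings, a function category is a tuple (result, slash, argument), and the function names are hypothetical. Composition and type-raising are not covered.

```python
def forward_apply(left, right):
    # X/Y  Y  =>  X
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None

def backward_apply(left, right):
    # Y  X\Y  =>  X
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


# (S\NP)/NP, the transitive verb category.
transitive_verb = (("S", "\\", "NP"), "/", "NP")

verb_phrase = forward_apply(transitive_verb, "NP")   # verb + object  -> S\NP
sentence = backward_apply("NP", verb_phrase)         # subject + VP   -> S
```

Each rule returns the category specified as the value of the function being applied, or None when the two categories cannot combine.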

13.6.1 Ambiguity in CCG


As is always the case in parsing, managing ambiguity is the key to successful CCG
parsing. The difficulties with CCG parsing arise from the ambiguity caused by the
large number of complex lexical categories combined with the very general nature of

the grammatical rules. To see some of the ways that ambiguity arises in a categorial
framework, consider the following example.
(13.17) United diverted the flight to Reno.
Our grasp of the role of the flight in this example depends on whether the prepo-
sitional phrase to Reno is taken as a modifier of the flight, as a modifier of the entire
verb phrase, or as a potential second argument to the verb divert. In a context-free
grammar approach, this ambiguity would manifest itself as a choice among the fol-
lowing rules in the grammar.

Nominal → Nominal PP
VP → VP PP
VP → Verb NP PP

In a phrase-structure approach we would simply assign the word to to the category
P, allowing it to combine with Reno to form a prepositional phrase. The subsequent
choice of grammar rules would then dictate the ultimate derivation. In the
categorial approach, we can associate to with distinct categories to reflect the ways
in which it might interact with other elements in a sentence. The fairly abstract
combinatoric rules would then sort out which derivations are possible. Therefore,
the source of ambiguity arises not from the grammar but rather from the lexicon.
Let’s see how this works by considering several possible derivations for this
example. To capture the case where the prepositional phrase to Reno modifies the
flight, we assign the preposition to the category (NP\NP)/NP, which gives rise to
the following derivation.

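United   diverted    the    flight   to            Reno
NP       (S\NP)/NP   NP/N   N        (NP\NP)/NP    NP
                     -------------->  ----------------->
                           NP               NP\NP
                     ----------------------------------<
                                     NP
         ---------------------------------------------->
                              S\NP
-------------------------------------------------------<
                               S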
Here, the category assigned to to expects to find two arguments: one to the right as
with a traditional preposition, and one to the left that corresponds to the NP to be
modified.
Alternatively, we could assign to to the category (S\S)/NP, which permits the
following derivation where to Reno modifies the preceding verb phrase.

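United   diverted    the    flight   to          Reno
NP       (S\NP)/NP   NP/N   N        (S\S)/NP    NP
                     -------------->  --------------->
                           NP               S\S
         -------------------------->
                    S\NP
         --------------------------------------------<B
                            S\NP
-----------------------------------------------------<
                             S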
A third possibility is to view divert as a ditransitive verb by assigning it to the
category ((S\NP)/PP)/NP, while treating to Reno as a simple prepositional phrase.

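United   diverted          the    flight   to      Reno
NP       ((S\NP)/PP)/NP    NP/N   N        PP/NP   NP
                           -------------->  ---------->
                                 NP             PP
         -------------------------------->
                   (S\NP)/PP
         --------------------------------------------->
                          S\NP
------------------------------------------------------<
                            S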

While CCG parsers are still subject to ambiguity arising from the choice of
grammar rules, including the kind of spurious ambiguity discussed in Chapter 12,
it should be clear that the choice of lexical categories is the primary problem to be
addressed in CCG parsing.

13.6.2 CCG Parsing Frameworks


Since the rules in combinatory grammars are either binary or unary, a bottom-up,
tabular approach based on the CKY algorithm should be directly applicable to CCG
parsing. Unfortunately, the large number of lexical categories available for each
word, combined with the promiscuity of CCG’s combinatoric rules, leads to an ex-
plosion in the number of (mostly useless) constituents added to the parsing table.
The key to managing this explosion of zombie constituents is to accurately assess
and exploit the most likely lexical categories possible for each word — a process
called supertagging.
The following sections describe two approaches to CCG parsing that make use of
supertags. Section 13.6.4 presents an approach that structures the parsing process
as a heuristic search through the use of the A* algorithm. The following section
then briefly describes a more traditional classifier-based approach that manages the
search space complexity through the use of adaptive supertagging — a process that
iteratively considers more and more tags until a parse is found.

13.6.3 Supertagging
Chapter 8 introduced the task of part-of-speech tagging, the process of assigning the
correct lexical category to each word in a sentence. Supertagging is the corresponding
task for highly lexicalized grammar frameworks, where the assigned tags often
dictate much of the derivation for a sentence.
CCG supertaggers rely on treebanks such as CCGbank to provide both the over-
all set of lexical categories as well as the allowable category assignments for each
word in the lexicon. CCGbank includes over 1000 lexical categories; however, in
practice, most supertaggers limit their tagsets to those tags that occur at least 10
times in the training corpus. This results in a total of around 425 lexical categories
available for use in the lexicon. Note that even this smaller number is large in con-
trast to the 45 POS types used by the Penn Treebank tagset.
As with traditional part-of-speech tagging, the standard approach to building a
CCG supertagger is to use supervised machine learning to build a sequence labeler
from hand-annotated training data. To find the most likely sequence of tags given a
sentence, it is most common to use a neural sequence model, either RNN or Trans-
former.
It’s also possible, however, to use the CRF tagging model described in Chapter 8,
using similar features: the current word w_i, its surrounding words within l words,
local POS tags and character suffixes, and the supertag from the prior timestep,

training by maximizing log-likelihood of the training corpus and decoding via the
Viterbi algorithm as described in Chapter 8.
Unfortunately the large number of possible supertags combined with high per-
word ambiguity leads the naive CRF algorithm to error rates that are too high for
practical use in a parser. The single best tag sequence T̂ will typically contain too
many incorrect tags for effective parsing to take place. To overcome this, we instead
return a probability distribution over the possible supertags for each word in the
input. The following table illustrates an example distribution for a simple sentence,
in which each column represents the probability of each supertag for a given word
in the context of the input sentence. The “...” represent all the remaining supertags
possible for each word.

United      serves            Denver
N/N: 0.4    (S\NP)/NP: 0.8    NP: 0.9
NP: 0.3     N: 0.1            N/N: 0.05
S/S: 0.1    ...               ...
S\S: 0.05
...

To get the probability of each possible word/tag pair, we’ll need to sum the
probabilities of all the supertag sequences that contain that tag at that location. This
can be done with the forward-backward algorithm that is also used to train the CRF,
described in Appendix A.

13.6.4 CCG Parsing using the A* Algorithm


The A* algorithm is a heuristic search method that employs an agenda to find an
optimal solution. Search states representing partial solutions are added to an agenda
based on a cost function, with the least-cost option being selected for further ex-
ploration at each iteration. When a state representing a complete solution is first
selected from the agenda, it is guaranteed to be optimal and the search terminates.
The A* cost function, f (n), is used to efficiently guide the search to a solution.
The f -cost has two components: g(n), the exact cost of the partial solution repre-
sented by the state n, and h(n) a heuristic approximation of the cost of a solution
that makes use of n. When h(n) satisfies the criterion of not overestimating the actual
cost, A* will find an optimal solution. Not surprisingly, the closer the heuristic can
get to the actual cost, the more effective A* is at finding a solution without having
to explore a significant portion of the solution space.
When applied to parsing, search states correspond to edges representing com-
pleted constituents. Each edge specifies a constituent’s start and end positions, its
grammatical category, and its f -cost. Here, the g component represents the current
cost of an edge and the h component represents an estimate of the cost to complete
a derivation that makes use of that edge. The use of A* for phrase structure parsing
originated with Klein and Manning (2003), while the CCG approach presented here
is based on the work of Lewis and Steedman (2014).
Using information from a supertagger, an agenda and a parse table are initial-
ized with states representing all the possible lexical categories for each word in the
input, along with their f -costs. The main loop removes the lowest cost edge from
the agenda and tests to see if it is a complete derivation. If it reflects a complete
derivation it is selected as the best solution and the loop terminates. Otherwise, new
states based on the applicable CCG rules are generated, assigned costs, and entered

into the agenda to await further processing. The loop continues until a complete
derivation is discovered, or the agenda is exhausted, indicating a failed parse. The
algorithm is given in Fig. 13.9.

function CCG-ASTAR-PARSE(words) returns table or failure

  supertags ← SUPERTAGGER(words)
  for i ← from 1 to LENGTH(words) do
    for all {A | (words[i], A, score) ∈ supertags}
      edge ← MAKEEDGE(i − 1, i, A, score)
      table ← INSERTEDGE(table, edge)
      agenda ← INSERTEDGE(agenda, edge)
  loop do
    if EMPTY?(agenda) return failure
    current ← POP(agenda)
    if COMPLETEDPARSE?(current) return table
    table ← INSERTEDGE(table, current)
    for each rule in APPLICABLERULES(current) do
      successor ← APPLY(rule, current)
      if successor not ∈ agenda or table
        agenda ← INSERTEDGE(agenda, successor)
      else if successor ∈ agenda with higher cost
        agenda ← REPLACEEDGE(agenda, successor)

Figure 13.9 A*-based CCG parsing.

Heuristic Functions
Before we can define a heuristic function for our A* search, we need to decide how
to assess the quality of CCG derivations. We’ll make the simplifying assumption
that the probability of a CCG derivation is just the product of the probability of
the supertags assigned to the words in the derivation, ignoring the rules used in the
derivation. More formally, given a sentence S and derivation D that contains supertag
sequence T , we have:

\begin{align}
P(D, S) &= P(T, S) \tag{13.18} \\
        &= \prod_{i=1}^{n} P(t_i \mid s_i) \tag{13.19}
\end{align}

To better fit with the traditional A* approach, we’d prefer to have states scored by
a cost function where lower is better (i.e., we’re trying to minimize the cost of a
derivation). To achieve this, we’ll use negative log probabilities to score derivations;
this results in the following equation, which we’ll use to score completed
CCG derivations.

-\log P(D, S) = -\log P(T, S)    (13.20)

              = \sum_{i=1}^{n} -\log P(t_i \mid s_i)    (13.21)
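
As a quick check on the step from products to sums, the sketch below scores one supertag sequence both ways. The function names and the three probabilities are illustrative assumptions, and base-10 logs are used only because Fig. 13.10 reports costs in that base.

```python
import math

def derivation_probability(tag_probs):
    """Product of the supertag probabilities P(t_i | s_i)."""
    return math.prod(tag_probs)

def derivation_cost(tag_probs):
    """Sum of negative log10 supertag probabilities (lower is better)."""
    return sum(-math.log10(p) for p in tag_probs)

# Hypothetical probabilities for a three-word derivation.
probs = [0.3, 0.8, 0.8]
cost = derivation_cost(probs)

# The summed cost equals the negative log of the product, up to rounding.
assert abs(cost - (-math.log10(derivation_probability(probs)))) < 1e-12
```

Because the logarithm is monotonic, the derivation with the highest probability is exactly the one with the lowest cost, so minimizing the sum is equivalent to maximizing the product.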

Given this model, we can define our f-cost as follows. The f-cost of an edge is
the sum of two components: g(n), the cost of the span represented by the edge, and
h(n), the estimate of the cost to complete a derivation containing that edge (these
are often referred to as the inside and outside costs). We’ll define g(n) for an edge
using Equation 13.21. That is, it is just the sum of the costs of the supertags that
comprise the span.
For h(n), we need a score that approximates but never overestimates the actual
cost of the final derivation. A simple heuristic that meets this requirement assumes
that each of the words in the outside span will be assigned its most probable supertag.
If these are the tags used in the final derivation, then its score will equal
the heuristic. If any other tags are used in the final derivation, its actual cost will be
higher, since the new tags must have higher costs; the heuristic is therefore guaranteed
never to overestimate.
Putting this all together, we arrive at the following definition of a suitable f-cost
for an edge.

f(w_{i,j}, t_{i,j}) = g(w_{i,j}) + h(w_{i,j})    (13.22)

  = \sum_{k=i}^{j} -\log P(t_k \mid w_k)
    + \sum_{k=1}^{i-1} \min_{t \in tags} (-\log P(t \mid w_k))
    + \sum_{k=j+1}^{N} \min_{t \in tags} (-\log P(t \mid w_k))

As an example, consider an edge representing the word serves with the supertag N
in the following sentence.
(13.23) United serves Denver.
The g-cost for this edge is just the negative log probability of this tag, −log10(0.1),
or 1. The outside h-cost consists of the most optimistic supertag assignments for
United and Denver, which are N/N and NP respectively. The resulting f-cost for
this edge is therefore 1.494, the value shown for serves: N in Fig. 13.10.
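
The arithmetic behind such an f-cost can be sketched as follows. Only the probability 0.1 for serves: N comes from the text; the other probabilities in TAG_PROBS are assumptions chosen to be consistent with the costs in Fig. 13.10, and the names TAG_PROBS and f_cost are hypothetical. Spans here are 0-indexed and half-open, unlike the 1-indexed word positions of Equation 13.22.

```python
import math

# Assumed supertag probabilities; only P(N | serves) = 0.1 is given in the text.
TAG_PROBS = {
    "United": {"N/N": 0.4, "NP": 0.3},
    "serves": {"(S\\NP)/NP": 0.8, "N": 0.1},
    "Denver": {"NP": 0.8, "N": 0.05},
}

def neg_log10(p):
    return -math.log10(p)

def f_cost(words, i, j, tags, tag_probs=TAG_PROBS):
    """Return (g, h, f) for an edge over words[i:j] carrying the given tags."""
    # Inside cost g: the supertags actually chosen inside the span.
    g = sum(neg_log10(tag_probs[w][t]) for w, t in zip(words[i:j], tags))
    # Outside cost h: the cheapest possible supertag for every other word.
    outside = words[:i] + words[j:]
    h = sum(min(neg_log10(p) for p in tag_probs[w].values()) for w in outside)
    return g, h, g + h

words = ["United", "serves", "Denver"]
g, h, f = f_cost(words, 1, 2, ["N"])   # the edge serves: N
```

Under these assumptions g is 1, h is about 0.398 + 0.097 = 0.495, and f is about 1.495, in line with the cost listed for serves: N in Fig. 13.10. Since h uses the minimum cost for each outside word, no completed derivation containing the edge can cost less than f.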

An Example
Fig. 13.10 shows the initial agenda and the progress of a complete parse for this
example. After initializing the agenda and the parse table with information from the
supertagger, the algorithm selects the best edge from the agenda: the entry for United
with the tag N/N and f-cost 0.591. This edge does not constitute a complete parse and is
therefore used to generate new states by applying all the relevant grammar rules. In
this case, applying forward application to United: N/N and serves: N creates the
edge United serves: N[0,2], with f-cost 1.795, which is added to the agenda.
Skipping ahead, at the third iteration an edge representing the complete derivation
United serves Denver, S[0,3], .716 is added to the agenda. However, the algorithm
does not terminate at this point since the cost of this edge (.716) does not place
it at the top of the agenda. Instead, the edge representing Denver with the category
NP is popped. This leads to the addition of another edge to the agenda (type-raising
Denver). Only after this edge is popped and dealt with does the earlier state representing
a complete derivation rise to the top of the agenda where it is popped, goal
tested, and returned as a solution.
The effectiveness of the A* approach is reflected in the coloring of the states
in Fig. 13.10 as well as the final parsing table. The edges shown in blue (including
all the initial lexical category assignments not explicitly shown) reflect states in
the search space that never made it to the top of the agenda and, therefore, never
contributed any edges to the final table. This is in contrast to the PCKY approach
where the parser systematically fills the parse table with all possible constituents for
all possible spans in the input, filling the table with myriad constituents that do not
contribute to the final analysis.

Agenda states and their f-costs

  Popped, in order:
    1  United: N/N                     .591
    2  serves: (S\NP)/NP               .591
    3  serves Denver: S\NP[1,3]        .591
    4  Denver: NP                      .591
    5  Denver: S/(S\NP)[2,3]           .591
    6  United serves Denver: S[0,3]    .716   (goal state)

  Never popped:
    United: NP                         .716
    United: S/S                        1.1938
    United: S\S                        1.494
    serves: N                          1.494
    Denver: N                          1.795
    Denver: N/N                        2.494
    United serves: N[0,2]              1.795

Final parse table for United serves Denver

  [0,1]  N/N: 0.591, NP: 0.716, S/S: 1.1938, S\S: 1.494
  [0,2]  N: 1.795
  [0,3]  S: 0.716
  [1,2]  (S\NP)/NP: 0.591, N: 1.494
  [1,3]  S\NP: 0.591
  [2,3]  NP: 0.591, N: 1.795, N/N: 2.494

Figure 13.10 Example of an A* search for the example “United serves Denver”. The circled
numbers indicate the order in which the states are popped from the agenda. The costs in
each state are based on f-costs using negative log10 probabilities.

13.7 Summary
This chapter introduced constituency parsing. Here’s a summary of the main points:
• Structural ambiguity is a significant problem for parsers. Common sources
of structural ambiguity include PP-attachment, coordination ambiguity,
and noun-phrase bracketing ambiguity.
• Dynamic programming parsing algorithms, such as CKY, use a table of
partial parses to efficiently parse ambiguous sentences.
• CKY restricts the form of the grammar to Chomsky normal form (CNF).
• Parsers are evaluated with three metrics: labeled recall, labeled precision,
and cross-brackets.
• Partial parsing and chunking are methods for identifying shallow syntactic
constituents in a text. They are solved by sequence models trained on
syntactically-annotated data.

Bibliographical and Historical Notes


Writing about the history of compilers, Knuth notes:
In this field there has been an unusual amount of parallel discovery of
the same technique by people working independently.
Well, perhaps not unusual, since multiple discovery is the norm in science (see
page ??). But there has certainly been enough parallel publication that this history
errs on the side of succinctness in giving only a characteristic early mention of
each algorithm; the interested reader should see Aho and Ullman (1972).
Bottom-up parsing seems to have been first described by Yngve (1955), who
gave a breadth-first, bottom-up parsing algorithm as part of an illustration of a
machine translation procedure. Top-down approaches to parsing and translation were
described (presumably independently) by at least Glennie (1960), Irons (1961), and
Kuno and Oettinger (1963). Dynamic programming parsing, once again, has a history
of independent discovery. According to Martin Kay (personal communication),
a dynamic programming parser containing the roots of the CKY algorithm
was first implemented by John Cocke in 1960. Later work extended and formalized
the algorithm, as well as proving its time complexity (Kay 1967, Younger 1967,
Kasami 1965). The related well-formed substring table (WFST) seems to have
been independently proposed by Kuno (1965) as a data structure that stores the
results of all previous computations in the course of the parse. Based on a
generalization of Cocke’s work, a similar data structure had been independently described
in Kay (1967) (and Kay 1973). The top-down application of dynamic programming
to parsing was described in Earley’s Ph.D. dissertation (Earley 1968, Earley 1970).
Sheil (1976) showed the equivalence of the WFST and the Earley algorithm. Norvig
(1991) shows that the efficiency offered by dynamic programming can be captured
in any language with a memoization function (such as in LISP) simply by wrapping
the memoization operation around a simple top-down parser.
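
The idea of wrapping memoization around a simple top-down parser can be illustrated with a small sketch. This is not Norvig's code: the toy grammar, the lexicon, and the names GRAMMAR, LEXICON, make_parser, and recognize are assumptions for illustration, Python's functools.lru_cache stands in for a memoization function, and the grammar avoids left recursion, which this naive scheme does not handle.

```python
from functools import lru_cache

# Hypothetical toy grammar: nonterminal -> list of right-hand sides.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N"), ("Det", "N", "PP")],
    "VP": [("V", "NP"), ("V", "NP", "PP")],
    "PP": [("P", "NP")],
}

# Hypothetical lexicon: preterminal -> words it can cover.
LEXICON = {
    "Det": {"the"},
    "N": {"dog", "man", "park"},
    "V": {"saw"},
    "P": {"in"},
}

def make_parser(words):
    words = tuple(words)

    @lru_cache(maxsize=None)          # the memoization wrapper
    def ends(symbol, start):
        """Positions at which a `symbol` beginning at `start` can end."""
        if symbol in LEXICON:
            matches = start < len(words) and words[start] in LEXICON[symbol]
            return frozenset({start + 1}) if matches else frozenset()
        result = set()
        for rhs in GRAMMAR[symbol]:
            positions = {start}
            for child in rhs:         # thread end positions through the rule
                positions = {e for p in positions for e in ends(child, p)}
            result |= positions
        return frozenset(result)

    return ends

def recognize(words, start_symbol="S"):
    """True if the whole input can be analyzed as `start_symbol`."""
    return len(words) in make_parser(words)(start_symbol, 0)
```

Each (symbol, start) pair is computed at most once per sentence, which is the same sharing of subresults that a WFST or chart provides. For instance, recognize("the dog saw the man in the park".split()) succeeds even though the PP can attach in two ways, and the shared NP and PP analyses are not recomputed.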
While parsing via cascades of finite-state automata had been common in the
early history of parsing (Harris, 1962), the focus shifted to full CFG parsing quite
soon afterward. Church (1980) argued for a return to finite-state grammars as a
processing model for natural language understanding; other early finite-state parsing
models include Ejerhed (1988).
The classic reference for parsing algorithms is Aho and Ullman (1972); although
the focus of that book is on computer languages, most of the algorithms have been
applied to natural language. A good programming languages textbook such as Aho
et al. (1986) is also useful.

Exercises
13.1 Implement the algorithm to convert arbitrary context-free grammars to CNF.
Apply your program to the L1 grammar.
13.2 Implement the CKY algorithm and test it with your converted L1 grammar.
13.3 Rewrite the CKY algorithm given in Fig. 13.5 on page 264 so that it can accept
grammars that contain unit productions.
13.4 Discuss the relative advantages and disadvantages of partial versus full parsing.
13.5 Discuss how to augment a parser to deal with input that may be incorrect, for
example, containing spelling errors or mistakes arising from automatic speech
recognition.
13.6 Implement the PARSEVAL metrics described in Section 13.4. Next, using a
parser and a treebank, compare your metrics against a standard implementation.
Analyze the errors in your approach.
