Raymond S. T. Lee - Natural Language Processing. A Textbook With Python Implementation-Springer (2024)
Natural Language Processing
Raymond S. T. Lee
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore
Pte Ltd. 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
This book is dedicated to all readers and
students taking my undergraduate and
postgraduate courses in Natural Language
Processing; your enthusiasm in seeking
knowledge inspired me to write this book.
Preface
Natural Language Processing (NLP) and its related applications have become part of daily life with the exponential growth of Artificial Intelligence (AI) over the past decades. NLP applications, including Information Retrieval (IR) systems, Text Summarization systems, and Question-and-Answering (chatbot) systems, have become prevalent topics in both industry and academia and now benefit a wide array of day-to-day services.
The objective of this book is to provide NLP concepts and knowledge to readers through seven step-by-step workshops (14 hours in total) to practice various core Python-based NLP tools: NLTK, spaCy, TensorFlow Keras, Transformer, and BERT technology to construct NLP applications.
• Chapter 1: Natural Language Processing
  This introductory chapter begins with human language and intelligence, covering the six levels of linguistics, followed by a brief history of NLP with its major components and applications. It serves as the cornerstone for the NLP concepts and technology discussed in the following chapters. This chapter also serves as the conceptual basis for Workshop#1: Basics of Natural Language Toolkit (NLTK) in Chap. 10.
• Chapter 2: N-gram Language Model
  A language model is the foundation of NLP. This chapter introduces the N-gram language model and Markov Chains, using the classical literature The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (1859–1930) to illustrate how the N-gram model works and forms the basics of text analysis in NLP, followed by Shannon's model, text generation, and evaluation schemes. This chapter also serves as the conceptual basis for Workshop#2 on N-gram modelling with NLTK in Chap. 11.
• Chapter 3: Part-of-Speech Tagging
  Part-of-Speech (POS) tagging is the foundation of text processing in NLP. This chapter describes how it relates to NLP and Natural Language Understanding (NLU), and covers the main types of and algorithms for POS tagging, including rule-based, stochastic, and hybrid POS tagging with the Brill tagger, together with evaluation schemes. This chapter also serves as the conceptual basis for Workshop#3: Part-of-Speech Tagging using Natural Language Toolkit in Chap. 12.
• Chapter 4: Syntax and Parsing
  As another major component of Natural Language Understanding (NLU), this chapter explores syntax analysis and introduces the different types of constituents in the English language, followed by the main concepts of context-free grammar (CFG) and CFG parsing. It also studies the major parsing techniques, including lexical and probabilistic parsing, with live examples for illustration.
• Chapter 5: Meaning Representation
Before the study of Semantic Analysis, this chapter explores meaning repre-
sentation, a vital component in NLP. It studies four major meaning representa-
tion techniques which include: first-order predicate calculus (FOPC), semantic
net, conceptual dependency diagram (CDD), and frame-based representation.
After that it explores canonical form and introduces Fillmore’s theory of univer-
sal cases followed by predicate logic and inference work using FOPC with live
examples.
• Chapter 6: Semantic Analysis
  This chapter studies Semantic Analysis, one of the core concepts for learning NLP. First, it studies the two basic schemes of semantic analysis: lexical and compositional semantic analysis. After that it explores word senses and six commonly used lexical semantic relations, followed by word sense disambiguation (WSD) and various WSD schemes. Further, it studies WordNet and online thesauri for word similarity and various distributional similarity measures, including the Point-wise Mutual Information (PMI) and Positive Point-wise Mutual Information (PPMI) models, with live examples for illustration. Chapters 5 and 6 also serve as the conceptual basis for Workshop#4: Semantic Analysis and Word Vectors using spaCy in Chap. 13.
• Chapter 7: Pragmatic Analysis
  After the discussion of semantic meaning and analysis, this chapter explores pragmatic analysis in linguistics and discourse phenomena. It studies coherence and coreference as the key components of pragmatics and discourse critical to NLP, followed by discourse segmentation and different algorithms for Coreference Resolution, including the Hobbs Algorithm, Centering Algorithm, Log-Linear Model, the latest machine learning methods, and evaluation schemes. This chapter also serves as the conceptual basis for Workshop#5: Sentiment Analysis and Text Classification in Chap. 14.
• Chapter 8: Transfer Learning and Transformer Technology
  Transfer learning is a commonly used deep learning technique for minimizing computational resources. This chapter explores: (1) Transfer Learning (TL) against traditional Machine Learning (ML); (2) Recurrent Neural Networks (RNN), a significant component of transfer learning, with core technologies such as the Long Short-Term Memory (LSTM) Network and Bidirectional Recurrent Neural Networks (BRNNs) in NLP applications; and (3) Transformer technology architecture, the Bidirectional Encoder Representation from Transformers (BERT) Model, and related technologies including Transformer-XL and ALBERT. This chapter also serves as the conceptual basis for Workshop#6: Transformers with spaCy and TensorFlow in Chap. 15.
• Chapter 9: Major Natural Language Processing Applications
  This chapter is a summary of Part I covering three core NLP applications: Information Retrieval (IR) systems, Text Summarization (TS) systems, and Question-and-Answering (Q&A) chatbot systems, how they work, and related R&D in building NLP applications. This chapter also serves as the conceptual basis for Workshop#7: Building Chatbot with TensorFlow and Transformer Technology in Chap. 16.
provide a foundation technique for text analysis, parsing, and semantic analysis in subsequent workshops. Part II introduces spaCy, the second important Python NLP implementation tool, used not only for teaching and learning (like NLTK) but also widely in NLP applications including text summarization, information extraction, and Q&A chatbots. It is a critical component to integrate with Transformer technology in subsequent workshops.
• Chapter 12: Workshop#3 Part-of-Speech Tagging with Natural Language Toolkit (Hour 5–6)
  In Chap. 3, we studied basic concepts and theories related to Part-of-Speech (POS) and various POS tagging techniques. This workshop explores how to implement POS tagging using NLTK, starting from a simple recap of tokenization techniques and two fundamental processes in word-level processing: stemming and stop-word removal. It introduces two stemming techniques, the Porter Stemmer and the Snowball Stemmer, which can be integrated with WordCloud, commonly used in data visualization. This is followed by the main theme of the workshop: an introduction to the PENN Treebank tagset and the creation of your own POS tagger.
• Chapter 13: Workshop#4 Semantic Analysis and Word Vectors using spaCy (Hour 7–8)
  In Chaps. 5 and 6, we studied the basic concepts and theories related to meaning representation and semantic analysis. This workshop explores how to use spaCy technology to perform semantic analysis, starting with a revisit of the word vector concept and how to implement and pre-train word vectors, followed by the study of similarity methods and other advanced semantic analysis.
• Chapter 14: Workshop#5 Sentiment Analysis and Text Classification (Hour 9–10)
  As the companion workshop to Chap. 7, this workshop explores how to apply NLP implementation techniques to two important NLP applications: text classification and sentiment analysis. TensorFlow and Keras are two vital components used to implement Long Short-Term Memory networks (LSTM networks), a commonly used type of Recurrent Neural Network (RNN) in machine learning, especially in NLP applications.
• Chapter 15: Workshop#6 Transformers with spaCy and TensorFlow (Hour 11–12)
  Chapter 8 introduced the basic concepts of Transfer Learning, its motivation, and related background knowledge such as Recurrent Neural Networks (RNN), together with Transformer technology and the BERT model. This workshop explores how to put these concepts and theories into practice and, more importantly, how to implement Transformers and BERT technology by integrating spaCy's Transformer pipeline technology with TensorFlow. First, it gives an overview and summation of Transformer and BERT technology. Second, it explores Transformer implementation with TensorFlow by revisiting text classification using the BERT model as an example. Third, it introduces spaCy's Transformer pipeline technology and how to implement a sentiment analysis and text classification system using Transformer technology.
This book is both an NLP textbook and an NLP Python implementation book tailored for:
• Undergraduates and postgraduates of various disciplines including AI, Computer
Science, IT, Data Science, etc.
• Lecturers and tutors teaching NLP or related AI courses.
• NLP, AI scientists and developers who would like to learn NLP basic concepts,
practice and implement via Python workshops.
• Readers who would like to learn NLP concepts, practice Python-based NLP
workshops using various NLP implementation tools such as NLTK, spaCy,
TensorFlow Keras, BERT, and Transformer technology.
This book can serve as a textbook for undergraduate and postgraduate courses on Natural Language Processing, and as a reference book for general readers who would like to learn key technologies and implement NLP applications with contemporary implementation tools such as NLTK, spaCy, TensorFlow, BERT, and Transformer technology.
Part I (Chaps. 1–9) covers the main course materials on basic concepts and key technologies, which include the N-gram Language Model, Part-of-Speech Tagging, Syntax and Parsing, Meaning Representation, Semantic Analysis, Pragmatic Analysis, Transfer Learning and Transformer Technology, and major NLP applications.
Contents
16 Workshop#7 Building Chatbot with TensorFlow and Transformer Technology (Hour 13–14)
  16.1 Introduction
  16.2 Technical Requirements
  16.3 AI Chatbot in a Nutshell
    16.3.1 What Is a Chatbot?
    16.3.2 What Is a Wake Word in Chatbot?
    16.3.3 NLP Components in a Chatbot
  16.4 Building Movie Chatbot by Using TensorFlow and Transformer Technology
    16.4.1 The Chatbot Dataset
    16.4.2 Movie Dialog Preprocessing
    16.4.3 Tokenization of Movie Conversation
    16.4.4 Filtering and Padding Process
    16.4.5 Creation of TensorFlow Movie Dataset Object (mDS)
    16.4.6 Calculate Attention Learning Weights
    16.4.7 Multi-Head-Attention (MHAttention)
    16.4.8 System Implementation
  16.5 Related Works
  References
Index
About the Author
Raymond Lee is the founder of the Quantum Finance Forecast System (QFFC) (https://qffc.uic.edu.cn) and currently an Associate Professor at United International College (UIC), with 25+ years of experience in AI research and consultancy covering chaotic neural networks, NLP, intelligent fintech systems, quantum finance, and intelligent e-commerce systems. He has published over 100 publications and authored 8 textbooks in the fields of AI, chaotic neural networks, AI-based fintech systems, intelligent agent technology, chaotic cryptosystems, ontological agents, neural oscillators, biometrics, and weather simulation and forecasting systems.
Upon completion of the QFFC project, in 2018 he joined United International
College (UIC), China, to pursue further R&D work on AI-Fintech and to share his
expertise in AI-Fintech, chaotic neural networks, and related intelligent systems
with fellow students and the community. His three latest textbooks, Quantum
Finance: Intelligent Forecast and Trading Systems (2019), Artificial Intelligence in
Daily Life (2020), and this NLP book have been adopted as the main textbooks for
various AI courses in UIC.
Abbreviations
AI Artificial intelligence
ASR Automatic speech recognition
BERT Bidirectional encoder representations from transformers
BRNN Bidirectional recurrent neural networks
CDD Conceptual dependency diagram
CFG Context-free grammar
CFL Context-free language
CNN Convolutional neural networks
CR Coreference resolution
DNN Deep neural networks
DT Determiner
FOPC First-order predicate calculus
GRU Gated recurrent unit
HMM Hidden Markov model
IE Information extraction
IR Information retrieval
KAI Knowledge acquisition and inferencing
LSTM Long short-term memory
MEMM Maximum entropy Markov model
MeSH Medical Subject Headings
ML Machine learning
NER Named entity recognition
NLP Natural language processing
NLTK Natural language toolkit
NLU Natural language understanding
NN Noun
NNP Proper noun
Nom Nominal
NP Noun phrase
PCFG Probabilistic context-free grammar
PMI Pointwise mutual information
POS Part-of-speech
POST Part-of-speech tagging
PPMI Positive pointwise mutual information
Q&A Question-and-answering
RNN Recurrent neural networks
TBL Transformation-based learning
VB Verb
VP Verb phrase
WSD Word sense disambiguation
Part I
Concepts and Technology
Chapter 1
Natural Language Processing
Consider this scenario: Late in the evening, Jack starts a mobile app and talks with
AI Tutor Max.
1.1 Introduction
Nowadays there are many chatbots that allow humans to communicate with a device in natural language. Figure 1.1 illustrates a dialogue between a student, who has returned to the dormitory after a full day of classes, and a mobile application called AI Tutor 2.0 (Cui et al. 2020) from our latest research on AI tutor chatbots. The objective is to enable the user (Jack) not only to learn from book reading but also to communicate candidly with AI Tutor 2.0 (Max), which provides knowledge responses in natural language. It differs from chatbots that respond with basic commands; it is human–computer interaction that demonstrates how a user wishes to communicate, the way a student converses with a tutor about subject knowledge in the physical world. It is a dynamic process consisting of (1) world knowledge for simple handshaking dialogue such as greetings and general discussions, which is not an easy task as it involves knowledge and common sense to construct a functional chatbot with daily dialogues, and (2) technical knowledge of a particular knowledge domain, i.e. domain expertise, which requires learning first from the author's book AI in Daily Life (Lee 2020), covering all basic knowledge on the subject, to form a knowledge tree or ontology graph. This can serve as a new type of publication and an interactive device between human and computer for learning new knowledge.
Natural language processing (NLP) is related to several disciplines including human linguistics, computational linguistics, statistical engineering, machine learning, data mining, and human voice processing (recognition and synthesis). Many ingenious chatbots initiated by NLP and AI scientists have become commercial products in the past decades.
This chapter introduces this prime technology and its components, followed by pertinent technologies in subsequent chapters.
There is an old saying: The way you behave says more about who you are. Since we can never know what people think, the only method is to evaluate and judge their behaviors.
1.2 Human Language and Intelligence
NLP core technologies and methodologies arose from the famous Turing Test (Eisenstein 2019; Bender 2013; Turing 1936, 1950) proposed in the 1950s by Alan Turing (1912–1954), the father of AI. Figure 1.2 shows a human judge conversing with two individuals in two rooms: one is a human, the other is either a robot, a chatbot, or an NLP application. During a 20-minute conversation, the judge can ask the human/machine technical and non-technical questions and requires responses to every question, so that the judge can decide whether the respondent is a human or a machine. The role of NLP in the Turing Test is to recognize and understand the questions and respond in human language. It remains a popular topic in AI today because we cannot see and judge people's thinking to define intelligence. It is the ultimate challenge in AI.
Human language is a significant component of human behavior and civilization. It can generally be categorized into (1) written and (2) oral (spoken) aspects. Written language processes, stores, and passes human/natural language knowledge to the next generations. Oral or spoken language acts as a communication medium among individuals.
NLP touches on basic questions in philosophy such as meaning and knowledge, psychology in word meanings, linguistics in phrase and sentence formation, and computational linguistics in language models. Hence, NLP is a cross-disciplinary integration of philosophy in human language ontology models, psychology of behavior between natural and human language, linguistics in mathematical and language models, and computational linguistics in agent and ontology tree technology, as shown in Fig. 1.3.
Pragmatic ambiguity arises when a statement is not clearly defined and the context of a sentence provides multiple interpretations, such as I like that too. It can mean that the speaker likes that too or that another person likes that too, and the referent of that is uncertain.
NLP must analyze sentence ambiguity constantly. The earlier ambiguities can be identified, the easier it is to determine the proper meanings.
There are several major transformation stages in NLP history (Santilal 2020).
As AI grew popular over time, major NLP development focused on how it could be used in different areas, such as knowledge engineering (also called agent ontology) to shape meaning representations. The BASEBALL system
(Green et al. 1961) was a typical example of a Q&A-based domain expert system for human–computer interaction developed in the 1960s, but its inputs were restrictive and its language processing techniques remained basic.
In 1968, Prof. Marvin Minsky (1927–2016) developed a more powerful NLP system. This advanced system used an AI-based question-answering inference engine between humans and computers to provide knowledge-based interpretations of questions and answers. Further, Prof. William A. Woods proposed the augmented transition network (ATN) to represent natural language input in 1970. During this period, many programmers started to write code in different AI languages to conceptualize natural language ontology knowledge of real-world structural information into a form understandable by humans. Yet these expert systems were unable to meet expectations, which signified the second winter of AI.
R&D on NLP statistical techniques and rule-based systems has since evolved with cloud computing, mobile computing, and big data towards deep network analysis, e.g. recurrent neural networks using LSTM and related networks. Google, Amazon, and Facebook contributed to the development of agent technologies and deep neural networks in the 2010s to devise products such as auto-driving and Q&A chatbots, and further developments are under way.
1.6 NLP and AI
(Figure: the NLU pipeline—spoken language passes through speech recognition, syntax analysis (lexicon, grammar), semantic analysis (semantic rules), and pragmatic analysis (contextual information) to produce the target meaning representation.)
1.8.1 Speech Recognition
Speech recognition (Li et al. 2015) is the first stage in NLU; it performs phonetic, phonological, and morphological processing to analyze spoken language. The task involves breaking down spoken words, called utterances, into distinct tokens representing paragraphs, sentences, and words. Current speech recognition models apply spectrogram analysis to extract distinct frequencies, e.g. the word uncanny can be split into the two sub-word tokens un and canny. Different languages require different spectrogram analysis.
1.8.2 Syntax Analysis
Syntax analysis (Sportier et al. 2013) is the second stage of NLU, responding directly to speech recognition by analyzing the structural meaning of spoken sentences. This task has two purposes: (1) check the syntactic correctness of the sentence/utterance, and (2) break spoken sentences down into syntactic structures that reflect the syntactic relationships between words. For instance, the utterance oranges to the boys will be rejected by a syntax parser because of syntactic errors.
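This syntax-checking behavior can be sketched with NLTK's context-free grammar facilities. The toy grammar below is purely illustrative (it is an assumption of this sketch, not a grammar taken from this book); it accepts a complete sentence but yields no parse tree for the fragment oranges to the boys.

```python
import nltk

# A minimal illustrative grammar (an assumption for this sketch, not a full English grammar)
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N | N
VP -> V NP PP | V NP
PP -> P NP
Det -> 'the'
N  -> 'boy' | 'boys' | 'oranges'
V  -> 'gave'
P  -> 'to'
""")
parser = nltk.ChartParser(grammar)

def check_syntax(sentence):
    tokens = sentence.split()
    trees = list(parser.parse(tokens))          # all parse trees licensed by the grammar
    print(sentence, "->", "accepted" if trees else "rejected")

check_syntax("the boy gave oranges to the boys")  # accepted: parses as a sentence S
check_syntax("oranges to the boys")               # rejected: no parse as a full sentence
```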
1.8.3 Semantic Analysis
Semantic analysis (Goddard 1998) is the third stage in NLU, following syntax analysis. This task extracts the precise meaning of a sentence/utterance, i.e. the dictionary meanings defined by the text, and rejects meaningless ones; e.g. a semantic analyzer rejects a word phrase like hot snowflakes, which is syntactically correct but semantically meaningless.
1.8.4 Pragmatic Analysis
Pragmatic analysis (Ibileye 2018) is the fourth stage in NLU and a challenging part of spoken language analysis, as it involves high-level or expert knowledge and common sense, e.g. will you crack open the door? I'm getting hot. Understanding this sentence/utterance requires the extra knowledge in the second clause: crack semantically means to break, but here it should be interpreted pragmatically as to open slightly.
After years of research and development from machine translation and rule-based
systems to data mining and deep networks, NLP technology has a wide range of
applications in everyday activities such as machine translation, information retrieval,
sentiment analysis, information extraction, and question-answering chatbots as in
Fig. 1.8.
Machine translation (Scott 2018) is the earliest application of NLP, dating from the 1950s. Although translating one language into another may not look difficult, there are two major challenges: (1) naturalness (or fluency), since different languages have different styles and usages, and (2) adequacy (or accuracy), since different languages may present the same idea in different ways. Experienced human translators address this trade-off in creative ways. In the past, machine translation used statistical methods or case-by-case rule-based systems, but since there are many ambiguous scenarios in language translation, machine translation R&D nowadays strives to apply AI techniques such as recurrent networks and deep network black-box systems to enhance machine learning capabilities.
1.9.4 Sentiment Analysis
Sentiment analysis (Liu 2012) is a kind of data mining in NLP that analyzes user sentiment towards products, people, and ideas from social media, forums, and online platforms. It is an important application for extracting data from messages, comments, and conversations published on these platforms and assigning a labeled sentiment classification, as in Fig. 1.9, to understand natural language and utterances. Deep networks are one way to analyze such large amounts of data. Part II: NLP Implementation Workshops will explore how to implement sentiment analysis in detail using Python, spaCy, and Transformer technology.
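As a foretaste of those workshops, the sketch below shows how a labeled sentiment classification can be obtained with a pretrained Transformer model via the Hugging Face transformers library (an assumption of this sketch; the workshops in Part II use spaCy and TensorFlow in detail). The example comments are invented for illustration.

```python
from transformers import pipeline  # Hugging Face transformers (assumed installed)

# Load a default pretrained sentiment-analysis pipeline (downloads a model on first run)
classifier = pipeline("sentiment-analysis")

comments = [
    "The new phone camera is fantastic.",
    "Delivery was late and nobody replied to my emails.",
]
for text, result in zip(comments, classifier(comments)):
    print(f"{result['label']:>8}  {result['score']:.3f}  {text}")
```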
Q&A systems are a key objective of NLP (Raj 2018). A process flow is necessary to implement a Q&A chatbot. It includes speech recognition to convert an utterance into a list of tokens, syntactic (grammatical) analysis, semantic analysis of the whole sentence, and pragmatic analysis for embedded or complex meanings. When the enquirer's utterance meaning has been generated, the knowledge base is searched for the most appropriate answer or response through inferencing, either by a rule-based system, a statistical system, or a deep network, e.g. the Google BERT system. Once a response is available, reverse engineering is required to generate a natural voice from the verbal response, called voice synthesis. Hence, Q&A systems in NLP are an important technology that can apply to daily activities such as human–computer interaction in auto-driving, customer service support, and language skills improvement.
The final workshop will discuss how to integrate various Python NLP implementation tools, including NLTK, spaCy, TensorFlow Keras, and Transformer technology, to implement a Q&A movie chatbot system.
References
Cui, Y., Huang, C., Lee, Raymond (2020). AI Tutor: A Computer Science Domain Knowledge
Graph-Based QA System on JADE platform. World Academy of Science, Engineering and
Technology, Open Science Index 168, International Journal of Industrial and Manufacturing
Engineering, 14(12), 543 - 553.
Eisenstein, J. (2019) Introduction to Natural Language Processing (Adaptive Computation and
Machine Learning series). The MIT Press.
Goddard, C. (1998) Semantic Analysis: A Practical Introduction (Oxford Textbooks in Linguistics).
Oxford University Press.
Green, B., Wolf, A., Chomsky, C. and Laughery, K. (1961). BASEBALL: an automatic question-
answerer. In Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM com-
puter conference (IRE-AIEE-ACM ’61 (Western)). Association for Computing Machinery,
New York, NY, USA, 219–224.
Hausser, R. (2014) Foundations of Computational Linguistics: Human-Computer Communication
in Natural Language (3rd edition). Springer.
Hemdev, P. (2011) Information Extraction: A Smart Calendar Application: Using NLP,
Computational Linguistics, Machine Learning and Information Retrieval Techniques. VDM
Verlag Dr. Müller.
Ibileye, G. (2018) Discourse Analysis and Pragmatics: Issues in Theory and Practice.
Malthouse Press.
Lee, R. S. T. (2020). AI in Daily Life. Springer.
Li, J. et al. (2015) Robust Automatic Speech Recognition: A Bridge to Practical Applications.
Academic Press.
Liu, B. (2012) Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers.
Peters, C. et al. (2012) Multilingual Information Retrieval: From Research To Practice. Springer.
Raj, S. (2018) Building Chatbots with Python: Using Natural Language Processing and Machine
Learning. Apress.
Santilal, U. (2020) Natural Language Processing: NLP & its History (Kindle edition). Amazon.com.
Scott, B. (2018) Translation, Brains and the Computer: A Neurolinguistic Solution to Ambiguity
and Complexity in Machine Translation (Machine Translation: Technologies and Applications
Book 2). Springer.
Sportier, D. et al. (2013) An Introduction to Syntactic Analysis and Theory. Wiley-Blackwell.
Tuchong (2020a) The Turing Test. https://stock.tuchong.com/image/detail?imag
eId=921224657742331926. Accessed 14 May 2022.
Tuchong (2020b) NLP and AI. https://stock.tuchong.com/image/detail?imag
eId=1069700818174345308. Accessed 14 May 2022.
Turing, A. (1936) On computable numbers, with an application to the Entscheidungsproblem. In:
Proc. London Mathematical Society, Series 2, 42:230–265
Turing, A. (1950) Computing Machinery and Intelligence. Mind, LIX (236): 433–460.
Chapter 2
N-Gram Language Model
2.1 Introduction
A text with spelling and grammatical errors highlighted in yellow and blue is shown in Fig. 2.2. This method calculates word occurrence-frequency probabilities to suggest substitutions of higher probability, but it cannot always present accurate options.
Figure 2.3 illustrates a simple scenario of next-word prediction for the sample utterances I like photography, I like science, and I love mathematics. The probability of I like is 0.67 (2/3) compared with 0.33 (1/3) for I love, while the probabilities of like photography and like science are equal at 0.5 (1/2). Assigning probabilities to whole utterances, I like photography and I like science are both 0.67 × 0.5 = 0.335, and I love mathematics is 0.33 × 1 = 0.33.
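A minimal sketch of this calculation in Python is shown below, using maximum-likelihood bigram estimates over the three sample utterances (the <s> start markers are an assumption added for clarity).

```python
from collections import Counter

# Toy corpus from the example above, with <s> marking the start of each utterance
corpus = [["<s>", "i", "like", "photography"],
          ["<s>", "i", "like", "science"],
          ["<s>", "i", "love", "mathematics"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p(w2, w1):
    """Maximum-likelihood bigram probability P(w2 | w1) = C(w1 w2) / C(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p("like", "i"))                              # 2/3 ≈ 0.67
print(p("love", "i"))                              # 1/3 ≈ 0.33
print(p("like", "i") * p("photography", "like"))   # ≈ 0.33
```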
When applying probability to language models, one must always note that (1) domain-specific knowledge of keyword co-occurrence and terminology varies across domains, e.g. medical science, AI, etc.; (2) syntactic knowledge relates to syntax and lexical knowledge; (3) common sense or world knowledge derives from the collection of habitual behaviors and past experiences; and (4) language usage is significant in high-level NLP.
When applying probability to word prediction in an utterance, candidate words are often proposed by rank and frequency to provide a sequentially optimal estimation. For example:
[2.1] I notice three children standing on the ??? (ground, bench …)
[2.2] I just bought some oranges from the ??? (supermarket, shop …)
[2.3] She stopped the car and then opened the ??? (door, window, …)
The structure of [2.3] is perplexing: a word-counting method over a sizeable knowledge domain is adequate, but common sense, world knowledge, and specific domain knowledge are also among the required sources. It involves scenario-level syntactic knowledge, i.e. descriptive knowledge about the scene, to help the guesswork. Although studying preceding words and tracking word sequences may seem plain and mundane, it is one of the most useful techniques for word prediction. Let us begin with a simple word-counting method in NLP, the N-gram language model.
As we have seen, the motivations for word prediction apply to voice recognition, text generation, and Q&A chatbots. The N-gram language model, also called the N-gram model or N-gram (Sidorov 2019; Liu et al. 2020), is a fundamental method to formalize word prediction using probability calculation. An N-gram is a statistical model that consists of a word sequence of N words. Commonly used N-grams include:
• Unigram refers to a single word, i.e. N = 1. It is seldom used alone in practice because it contains only one word. However, it is important as the base for higher-order N-gram probability normalization.
• Bigram refers to a sequence of two words, i.e. N = 2. For example: I have, I do, he thinks, she knows, etc. It is used in many applications because its occurrence frequency is high and it is easy to count.
• Trigram refers to a sequence of three words, i.e. N = 3. For example: I noticed that, noticed three children, children standing on, standing on the. It is useful because it contains more meaning and is not lengthy; given the counts of three-word sequences, the next word can be guessed fairly easily. However, its occurrence frequency is low in a moderate corpus.
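A minimal sketch of extracting unigrams, bigrams, and trigrams with NLTK is shown below (the sample sentence is adapted from example [2.1]; simple whitespace tokenization is used here instead of nltk.word_tokenize to keep the sketch self-contained).

```python
from nltk import ngrams

text = "I noticed three children standing on the ground"
tokens = text.lower().split()  # simple whitespace tokenization for this sketch

for n, name in [(1, "unigrams"), (2, "bigrams"), (3, "trigrams")]:
    print(name, list(ngrams(tokens, n)))
```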
Here is a list of common terminologies in NLP (Jurafsky et al. 1999; Eisenstein 2019):
• Sentence is a unit of written language. It is a basic entity in a conversation or
utterance.
Fig. 2.4 Computerized axial tomography scanner (aka. CAT scan) (Tuchong 2022)
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \tag{2.1}$$

$$P(A \cap B) = P(A \mid B)\,P(B) \tag{2.3}$$
For a sequence of events A, B, C, and D, the Chain Rule formulation becomes

$$P(A \cap B \cap C \cap D) = P(A)\,P(B \mid A)\,P(C \mid A, B)\,P(D \mid A, B, C)$$

In general:

$$P(x_1, x_2, \ldots, x_n) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2) \cdots P(x_n \mid x_1, \ldots, x_{n-1})$$

If the word sequence from position 1 to n is denoted $w_1^n$, the Chain Rule applied to the word sequence becomes

$$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$$
Note: Normally, <s> and </s> are used to denote the start and end of sentence/
utterance for better formulation.
This method seems fair and easy to understand but poses two major problems. First, it is unlikely that we can gather the right statistics for the long prefixes, i.e. we do not know the starting point of the sentence. Second, the calculation of the word sequence probability is unwieldy: for a long sentence, the conditional probabilities at the end of this equation are complex to calculate.
Let us explore how the ingenious Markov Chain is applied to solve this problem.
In general:

$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1}) \tag{2.9}$$

For example, for the utterance I have no doubt that I, the bigram (Markov) approximation gives

$$P(\text{I have no doubt that I}) \approx P(\text{I}) \ast P(\text{have} \mid \text{I}) \ast P(\text{no} \mid \text{have}) \ast P(\text{doubt} \mid \text{no}) \ast P(\text{that} \mid \text{doubt}) \ast P(\text{I} \mid \text{that}) \tag{2.10}$$
Fig. 2.7 Unigram counts for words “I have no doubt that” from The Adventures of Sherlock Holmes
Fig. 2.8 Bigram grammar fragment from The Adventures of Sherlock Holmes
Fig. 2.9 Bigram grammar fragment related to utterance “I have no doubt that I” from The
Adventures of Sherlock Holmes
Counting all the conditional bigram probabilities based on the unigram counts in Fig. 2.7 showed that I have no doubt that for I is at 0.138, which is very high, but it is interesting to note that no doubt is even higher at 0.167. Since this is a detective story with a restricted domain, doubt that is very high at 0.202, because the character is always involved in guesswork and this grammar usage is frequent. Further, the probability of the bigram that I is much higher than that of other combinations like that he, that she, and that it. Its occurrence frequency in other literature is much lower, but because the character is a self-assured and intelligent expert, he says that I more often than that he or that she. That is the significance of a domain-specific KB/corpus for checking N-gram probabilities.
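A minimal sketch of how such bigram counts and conditional probabilities can be computed is shown below. It assumes a local plain-text copy of The Adventures of Sherlock Holmes saved as sherlock.txt (e.g. downloaded from Project Gutenberg); the file name and the simple regular-expression tokenizer are assumptions of this sketch, so the resulting numbers will only approximate the figures quoted in this chapter.

```python
import re
from collections import Counter

# Load and tokenize the corpus (file name and tokenizer are assumptions of this sketch)
with open("sherlock.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z']+", f.read().lower())

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = C(w1 w2) / C(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

for w1, w2 in [("i", "have"), ("have", "no"), ("no", "doubt"),
               ("doubt", "that"), ("that", "i")]:
    print(f"P({w2}|{w1}) = {bigram_prob(w1, w2):.3f}")
```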
So, let us look at some N-gram probability calculations, e.g. the probability P(I have no doubt that I) given by Eq. (2.10):
The test results of this example lead to several observations. All these probabilities are small in general. The conditional probability of a long sentence becomes very small, which is why the Markov chain is required; applying the traditional method of conditional probability with its complex calculation, the probability usually diminishes towards zero. Further, the probabilities seem to capture both syntactic facts and world knowledge. Although that I and that he are both commonly used in English grammar, in this literature that I is more frequent. Hence, the result relates to syntactic usage, common sense, and specific domain knowledge, and different KB domains lead to diverse probability calculation results.
It is also noted that most of the conditional probabilities are small; the multiplication of all the probabilities in a long sentence diminishes the result, so it is important to apply the Markov chain and convert complex conditional probabilities into simple bigram probabilities.
Fig. 2.10 Bigram counts for “I have no doubt that I” in The Adventures of Sherlock Holmes
Fig. 2.11 Bigram probability (normalized) for “I have no doubt that I ” in The Adventures of
Sherlock Holmes
2.5 Shannon’s Method in N-Gram Model 31
Fig. 2.13 Sentence generation using Shannon method from “The Complete Works of Shakespeare”
In summary:
• Unigram results showed that the four random sentences are almost meaningless, because a single word is used to calculate each probability and the words are mostly unrelated to one another.
• Bigram results showed that the four random sentences have a little meaning, because two words are used in each calculation. They reflect high-frequency word pairs but are not grammatically correct.
• Trigram results showed that word relations are coherent, because three words are used in each calculation. The conditional probability ranking improves grammar and meaning towards human-like language.
• Quadrigram results showed that the language of the sentences is almost identical to the original sentences, since four-word co-occurrences are used in the calculation; the word sequences with high conditional probability are encountered with few alternative options because there is so much information to search. This may not be beneficial for text generation.
Although quadrigrams can provide realistic language in a sentence/utterance, they lack the freedom to generate new sentences. Hence, the trigram is often a suitable option for language generation. Again, if the corpus is not sizeable enough in tokens and word volume, as with this literature, the trigram may be unable to provide frequent enough word sequences, and the N-gram may need to switch to the bigram in that case. The quadrigram is unsuitable for text generation because the output will be too close to the original words or sentences.
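A minimal sketch of Shannon-style bigram text generation is shown below. It reuses the tokens list from the earlier counting sketch and samples each next word in proportion to its bigram count; this is an illustrative approximation of the method, not the book's workshop code.

```python
import random
from collections import Counter, defaultdict

def build_bigram_model(tokens):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        model[w1][w2] += 1
    return model

def generate(model, start_word, max_len=15):
    """Generate a word sequence by sampling successors proportionally to bigram counts."""
    word, output = start_word, [start_word]
    for _ in range(max_len):
        followers = model.get(word)
        if not followers:
            break
        words, counts = zip(*followers.items())
        word = random.choices(words, weights=counts, k=1)[0]
        output.append(word)
    return " ".join(output)

# Example usage, assuming `tokens` from the earlier sketch:
# model = build_bigram_model(tokens)
# print(generate(model, "i"))
```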
The corpus used in this example is also domain specific: The Complete Works of Shakespeare. It consists of 884,647 tokens and 29,066 distinct words, approximately 10 times more than The Adventures of Sherlock Holmes. It has approximately 300,000 bigram types out of all these tokens, while the number of possible bigram combinations is about 844 million. In other words, only about 0.04% of the possible bigrams are used and the other 99.96% never appear. This makes sense because most random bigram combinations are grammatically, syntactically, or even pragmatically meaningless, but it poses a problem for N-gram calculations in text generation.
To illustrate how domain knowledge affects N-gram generation, Fig. 2.14 shows some sample sentences generated with Wall Street Journal (WSJ) articles as the corpus (Jurafsky et al. 1999). It shows that the trigram has the best performance in terms of sentence structure and meaningfulness in text generation.
Fig. 2.14 Sample sentence generation using Shannon method with Wall Street Journal articles
2.6.1 Perplexity
Perplexity (PP) is the inverse probability of the test set assigned by the language model, normalized by the number of words, as given by
$$PP(W) = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} \tag{2.11}$$
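A minimal sketch of computing bigram perplexity over a test sequence is shown below; the probability function p is assumed to return a smoothed bigram probability, since a single zero probability would make the perplexity infinite (smoothing is discussed later in this chapter).

```python
import math

def bigram_perplexity(test_tokens, p):
    """Perplexity of a token sequence under a bigram model.

    `p(w2, w1)` is assumed to return the (smoothed) probability P(w2 | w1).
    """
    log_prob, n = 0.0, 0
    for w1, w2 in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log(p(w2, w1))
        n += 1
    return math.exp(-log_prob / n)  # equivalent to the N-th root of 1 / P(w_1 ... w_N)
```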
The next step is to manage the zero-count problem. Returning to The Adventures of Sherlock Holmes example: as recalled, this literature produced 109,139 bigram types out of roughly 100 million possible bigrams, so approximately 99.89% of possible bigrams are never seen and have zero entries in the bigram table. In other words, most of the conditional probabilities are zeros for bigrams, which must be managed, especially in NLP applications like text generation and speech recognition.
Here is a brief synopsis of this zero-count dilemma. Some of these zeros are truly zeros, meaning the bigram cannot or should not occur because it would not make grammatical or syntactic sense; however, some are only rare events, meaning they occur infrequently and would appear given a more extensive training corpus.
Further, Zipf’s law (Saichev et al. 2010) stated that, a long tail phenomenon is
rare events occurred in a very high frequency, and large events numbers occurred in
a low frequency constantly. These are two extremes which means some popular
words always occurred in a high frequency, and most are bigrams in low frequency.
Hence, it is clear to collect statistics on high frequency events and may have to wait
for a long time until a rare event occurs, e.g. a bigram to take a count on this low
occurrence frequency event. In other words, high occurrence frequency events
always dominate the whole corpus. This phenomenon is essential because it always
occurs in website statistics or website counting. These high frequency websites and
N-grams are usually the top 100 and others with limited visit counts and occurrence,
so the estimate results are sparse as there are neither counts nor rare events that
required to estimate the likelihood of unseen or 0 count N-grams.
2.6.4 Smoothing Techniques
Every N-gram training matrix is sparse, even with large corpora, because of the Zipf's law phenomenon. The solution is to estimate the likelihood of unseen or zero-count N-grams and to adjust the rest of the corpus to accommodate these phantom/shadow N-grams; this will affect the rest of the corpus.
Let us assume that an N-gram model is used and all the words are known and seen beforehand. When assigning a probability to a sequence in which one of the components is 0, the basic idea is to back off to a lower N-gram order, e.g. from a bigram to a unigram, and to replace the 0 with something else, i.e. some other value. There are several methods to resolve the zero-count problem based on this concept; these collective methods are called smoothing techniques.
This section explores four commonly used smoothing techniques: (1) Laplace
(Add-one) Smoothing, (2) Add-k Smoothing, (3) Backoff and Interpolation
Smoothing, and (4) Good Turing Smoothing (Chen and Goodman 1999; Eisenstein
2019; Jurafsky et al. 1999).
The logic of Laplace (Add-one) Smoothing (Chen and Goodman 1999; Jurafsky et al. 1999) is to consider all zero counts as rare events that simply were not sampled during corpus training, and to add 1 to every count.
For unigrams:
1. Add 1 to every single word (type) count.
2. Normalize by N (tokens) + V (types).
$$c_i^* = (c_i + 1)\frac{N}{N+V} \tag{2.14}$$

$$p_i^* = \frac{c_i + 1}{N+V} \tag{2.15}$$
For bigrams:
1. Add 1 to every bigram count: $c(w_{n-1}w_n) + 1$.
2. Increase each unigram count by the vocabulary size: $c(w_{n-1}) + V$.
Figure 2.16 shows the bigram counts with and without the Laplace method for the previous example I have no doubt that I from The Adventures of Sherlock Holmes. It indicates that all 0s become 1, so that no I becomes 1, while others such as I have go from 288 to 289; the calculation is simple but effective.
The bigram probability is given by

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})} \tag{2.16}$$

and with Laplace smoothing it becomes

$$P_{\text{Lap}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{\sum_{w}\bigl(C(w_{n-1} w) + 1\bigr)} = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V} \tag{2.17}$$
Figure 2.17 shows the bigram probabilities with and without the Laplace method for the previous example I have no doubt that I from The Adventures of Sherlock Holmes.
Note: The bigram probability was originally calculated by dividing by the unigram count, but it is now divided by the unigram count plus the total number of word types (V), which equals 9886; e.g. P(have | I) = 288/2755 = 0.105 becomes 289/(2755 + 9886) = 0.023 after applying the Laplace method. All zero counts become 1, which is convenient for text generation, but the problem is that some probabilities
Fig. 2.16 Bigram counts for "I have no doubt that I": original vs. with the Laplace method
Fig. 2.17 Bigram probabilities for "I have no doubt that I": original vs. with the Laplace method
have changed notably such as I have from 0.105 to 0.023, and no doubt has the high-
est change from 0.1667 to only 0.00463.
Although it is adequate to assign a number to all zero events, the high-frequency events become insignificant because of the copious word types in the corpus, indicating that the performance of Laplace Add-one smoothing may not be effective in many cases and alternatives are required.
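A minimal sketch of Eq. (2.17) in Python is shown below; bigrams and unigrams are assumed to be the Counter objects built in the earlier corpus sketch, and V is the number of distinct word types.

```python
def laplace_bigram_prob(w1, w2, bigrams, unigrams, V):
    """Add-one (Laplace) smoothed bigram probability, Eq. (2.17)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

# Example usage with the counts quoted in the text (assumed values):
# C(I have) = 288, C(I) = 2755, V = 9886  ->  289 / (2755 + 9886) ≈ 0.023
```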
2.6.6 Add-k Smoothing
The logic of Add-k Smoothing (Chen and Goodman 1999; Jurafsky et al. 1999) is to assume that each unseen N-gram would be seen k times, but that its occurrence is too rare to be observed. These zeros are rare events whose effective counts are less than 1 and unnoticeable, i.e. somewhere between 0 and 1: it can be 0.1, 0.01, 0.2, or even smaller. A non-integer count is therefore added to each count instead of 1, e.g. 0.05, 0.1, 0.2, with typically 0 < k < 1; k must be a small number less than 1 in practical applications, because if k is too large it causes the same problem as the Laplace method. Using the same logic as the Add-1 method, Add-k Smoothing is given by
$$P^{*}_{\text{Add-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV} \tag{2.18}$$
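The corresponding one-line sketch, generalizing the Laplace function above (with the same assumed Counter objects), is:

```python
def add_k_bigram_prob(w1, w2, bigrams, unigrams, V, k=0.1):
    """Add-k smoothed bigram probability, Eq. (2.18); typically 0 < k < 1."""
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * V)
```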
The logic of Backoff and Interpolation (B&I) Smoothing (Chen and Goodman 1999; Suyanto 2020) is to look for a lower-order N-gram if there is no example of a particular N-gram. If the N−1 gram has an insufficient count (or does not exist), it switches to the N−2 gram, and so on. Although this is not a perfect option, it at least produces some viable counts for word prediction: estimate a probability with a bigram instead of a trigram if no trigram is found, and fall back to a unigram if there is no bigram either. This is the backoff method. With interpolation, the quadrigram, trigram, bigram, and unigram probability counts can always be weighted and combined, e.g. a trigram probability is calculated from the unigram, bigram, and trigram probabilities, each weighted by some λ value. Note that the sum of all λs must be 1, as given by:
$$P_{\text{B\&I}}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n \mid w_{n-2} w_{n-1}) \tag{2.19}$$
It is noted that, in comparison with Eq. (2.19), conditional interpolation also considers conditional probability at all N-gram levels. Both the simple interpolation and conditional interpolation weights are learnt from a held-out corpus. A held-out corpus is an additional training corpus used to set hyperparameters such as the λ values, by choosing the λ values that maximize the likelihood of the held-out corpus. Adjusting the N-gram probabilities and searching for the λ values gives the highest probability on the held-out set. In fact, there are numerous approaches to finding this optimal set of λs; a simple way is to apply the EM algorithm, an iterative learning algorithm that converges to a locally optimal λ.
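A minimal sketch of simple linear interpolation (Eq. 2.19) is shown below; p_uni, p_bi, and p_tri are assumed to be maximum-likelihood probability functions estimated from the training corpus, and the λ values are illustrative placeholders that would normally be tuned on a held-out corpus.

```python
def interpolated_trigram_prob(w1, w2, w3, p_uni, p_bi, p_tri,
                              lambdas=(0.2, 0.3, 0.5)):
    """Linear interpolation of unigram, bigram, and trigram estimates, Eq. (2.19).

    The lambdas must sum to 1; they are placeholders here and are normally
    tuned to maximize the likelihood of a held-out corpus.
    """
    l1, l2, l3 = lambdas
    return l1 * p_uni(w3) + l2 * p_bi(w3, w2) + l3 * p_tri(w3, w1, w2)
```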
The logic of Good Turing (GT) Smoothing (Chen and Goodman 1999; Gale and Sampson 1995) is to use the total frequency of events that occurred only once to estimate how much probability mass to shift to unseen events, e.g. using a bag of green beans to estimate the probability of an unseen red bean.
This technique uses the frequency of N-gram occurrences to reallocate the probability distribution. Consider, for example, the N-gram statistics of The Adventures of Sherlock Holmes in Fig. 2.17: the probability of have doubt is 0 without smoothing, so the frequency of bigrams that occurred only once, such as I doubt, is used to represent the total probability mass of unknown bigrams, given by
$$P_{\text{known}}(w_i \mid w_{i-1}) = \frac{c^*}{N}, \quad \text{where } c^* = (c+1)\frac{N_{c+1}}{N_c} \text{ and } c = \text{count of the input bigram.} \tag{2.22}$$
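A minimal sketch of the Good-Turing count adjustment in Eq. (2.22) is shown below; it takes a Counter of bigram counts (as built earlier), computes the frequency-of-frequency table N_c, and falls back to the raw count when N_{c+1} is zero, which is a simplification compared with full Good-Turing estimators that smooth the N_c values themselves.

```python
from collections import Counter

def good_turing_adjusted_counts(bigram_counts):
    """Good-Turing adjusted counts c* = (c + 1) * N_{c+1} / N_c (Eq. 2.22 sketch)."""
    freq_of_freq = Counter(bigram_counts.values())  # N_c: number of bigrams seen c times
    adjusted = {}
    for bigram, c in bigram_counts.items():
        n_c, n_c1 = freq_of_freq[c], freq_of_freq.get(c + 1, 0)
        adjusted[bigram] = (c + 1) * n_c1 / n_c if n_c1 else c  # fall back to raw count
    return adjusted
```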
Exercise: Try to calculate these probabilities from data provided by Fig. 2.17.
Exercises
2.1 What is Language Model (LM)? Discuss the roles and importance of language
model in NLP.
2.2 What is N-gram? Discuss and explain the importance of N-gram in NLP and
text analysis.
2.3 State the Chain Rule and explain how it works for the formulation of N-gram
probabilities. Use trigram as example to illustrate.
2.4 What is a Markov Chain? State and explain how it works for the formulation
of N-gram probabilities.
2.5 Use The Adventures of Sherlock Holmes as corpus, calculate N-gram proba-
bility for sentence “I don’t believe in that” with Markov Chain and evaluate all
related bigram probabilities.
2.6 Repeat Exercise 2.5 using another famous work of literature, Little Women by Louisa May Alcott (1832–1888) (Alcott 2017), to calculate the N-gram probability of the sentence "I don't believe in that" and compare with the results in 2.5. What is (are) the finding(s)?
2.7 Use Shannon’s text generation scheme on The Adventures of Sherlock Holmes
as corpus, generate sample sentences like Fig. 2.14 using unigram, bigram,
trigram, and quadrigram text generation methods.
2.8 Repeat Exercise 2.7 using Little Women (Alcott 2017) to generate corresponding sample sentences and compare with the results in 2.7. What is (are) the finding(s)?
2.9 What is Perplexity (PP) in N-gram model evaluation? Use The Adventures of
Sherlock Holmes as corpus with sample test set, evaluate PP values from uni-
gram to trigram and compare with Fig. 2.15. What is (are) the finding(s)?
2.10 Use Little Women (Alcott 2017) as corpus and some sample test set. Compare
the performance of Add-1 smoothing against Add-k (k = 0.5). Which one is
better? Why?
2.11 What is Backoff and Interpolation (B&I) method in N-gram smoothing?
Repeat 2.10 using B&I smoothing method with λ1 = 0.4, λ2 = 0.3 and λ3 = 0.3.
Compare the performance with results obtained in 2.10.
2.12 What is Good Turing (GT) Smoothing in N-gram smoothing? Repeat Exercise
2.10 using GT Smoothing and compare performance results obtained in 2.10
and 2.11. Which one is better? Why?
References
Liu, Z., Lin, Y. and Sun, M. (2020) Representation Learning for Natural Language Processing.
Springer.
Pustejovsky, J. and Stubbs, A. (2012) Natural Language Annotation for Machine Learning: A
Guide to Corpus-Building for Applications. O’Reilly Media.
Saichev, A. I., Malevergne, Y. and Sornette, D. (2010) Theory of Zipf’s Law and Beyond (Lecture
Notes in Economics and Mathematical Systems, 632). Springer.
Shakespeare, W. (2021) The Complete Works of Shakespeare (AmazonClassics Edition).
AmazonClassics.
Shannon, C. (1948). A Mathematical Theory of Communication. Bell System Technical Journal.
27 (3): 379–423.
Sidorov, G. (2019) Syntactic n-grams in Computational Linguistics. Springer.
Suyanto, S. (2020). Phonological similarity-based backoff smoothing to boost a bigram syllable
boundary detection. International Journal of Speech Technology, 23(1), 191-204.
Tuchong (2022) Computerized Axial Tomography Scanner (“Cat scan”). https://stock.tuchong.
com/image/detail?imageId=902001913134579722. Accessed 12 July 2022.
Chapter 3
Part-of-Speech (POS) Tagging
Every word in an English sentence falls into one of nine major POS types: (1) adjectives, (2) verbs, (3) pronouns, (4) conjunctions, (5) prepositions, (6) articles (determiners), (7) adverbs, (8) nouns, and (9) interjections, as shown in Fig. 3.1. Some linguists include only the first eight as major POS and treat interjections as an individual category.
3.2 POS Tagging
Part-of-Speech Tagging (Khanam 2022; Sree and Thottempudi 2011), also called POS tagging, POST, or grammatical tagging, is the operation of labelling each word in a text or corpus with a particular POS, based on its definition and context in linguistics. A simplified version is usually learnt by students to identify word types such as adjectives, adverbs, nouns, verbs, etc. Grammars vary across languages, leading to different POS tagging categorizations.
The PENN Treebank is a frequently used POS tag databank provided by the PENN Treebank corpus (Marcus et al. 1993). It is an English corpus marked with the TreeTagger tool developed by Prof. Helmut Schmid at the University of Stuttgart in Germany. It classifies the 9 major POS into subclasses, giving a total of 45 POS tags, with punctuation and examples as shown in Fig. 3.3. The English Penn Treebank (PTB) corpus has a comprehensive section of Wall Street Journal (WSJ) articles used for the evaluation of sequential labelling models as well as character- and word-level language modelling.
The POS tagging table for sentence [3.3] David has purchased a new laptop from Apple store in Fig. 3.4 shows that Apple is a proper noun; it can be distinguished by the capital letter A as a product brand name.
Fig. 3.2 POS example for utterance “She sells seashells on the seashore”
Fig. 3.4 Penn Treebank POS tags of sample sentence “David has purchased a new laptop from
Apple store”
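A minimal sketch of producing such Penn Treebank tags with NLTK's built-in tagger is shown below; the tokenizer and tagger models must be downloaded once, and the exact tags may vary slightly with the tagger version, so the expected output shown in the comment is indicative only.

```python
import nltk
# First run may require:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("David has purchased a new laptop from Apple store")
print(nltk.pos_tag(tokens))
# Indicative output (Penn Treebank tagset):
# [('David', 'NNP'), ('has', 'VBZ'), ('purchased', 'VBN'), ('a', 'DT'),
#  ('new', 'JJ'), ('laptop', 'NN'), ('from', 'IN'), ('Apple', 'NNP'), ('store', 'NN')]
```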
Proper POS tagging supports correct translation between languages. Further, it helps to place the correct stress accents and avoid confusion when the same word (word type) can take different POS in a sentence/utterance. There are three common types of confusion:
1. Noun vs. Verb confusion, e.g. ABstract (noun) vs. abstRACT (verb)
2. Adjective vs. Verb confusion, e.g. PERfect (adjective) vs. perFECT (verb)
3. Adjective vs. Noun confusion, e.g. miNUTE (adjective) vs. MInute (noun)
Figure 3.5 shows some common examples of English words from the CELEX online dictionary with different stress accents and meanings. Such confusion often occurs when dealing with noisy channels, where every word's role in a sentence/utterance must be differentiated. It can be resolved by applying statistical probabilistic N-gram methods or stochastic techniques with corpora for factual analysis. Nevertheless, POS tagging is the initial step towards resolution.
Fig. 3.5 Common example of same English word with different stress accents
Computational linguistics (CL) (Bender 2013; Clark et al. 2012; Mitkov 2005) can be considered as the understanding of written or spoken language from a computational and scientific perspective. It focuses on building artifacts to process and analyze language. Language is like a mirror of the mind, reflecting what humans think; a computational interpretation of language provides new insight into how human thinking and intelligence work.
As human language is natural and the most versatile means of communication, whether person-to-person or person-to-machine, linguistically enabled computer systems open a new era of NLP applications. There are two major issues to address in computational linguistics: (1) the linguistic issue, which concerns facts about language, and (2) the algorithmic issue, which concerns effective computational procedures for dealing with these facts.
Consider the following situations; what is the POS of the word purple in each?
[3.12] It’s so purple.
[3.13] Both purples should be okay for the room.
[3.14] The purple is a bit odd for the white carpet.
In [3.12] purple is an adjective. However, in [3.13] it is a noun in plural form. Likewise, purple in [3.14] is also a noun, here naming the color itself as an (uncountable) object.
There are nine key POS in English: (1) pronoun, (2) verb, (3) adjective, (4) interjec-
tion, (5) noun, (6) adverb, (7) conjunction, (8) preposition, and (9) article as shown
in Fig. 3.7. Some linguists consider interjections a separate POS category, used to express strong feeling or emotion in a single word or phrase, e.g. [3.15] Hooray! It's the last day of school. It is distinct compared with the other POS.
There are two types of English word classes: (1) closed-class and (2) open-class. Both classes are important for understanding proper sentences in different languages.
Closed-class words are also known as functional/grammar words. They are closed because new words are seldom created in the class. For example, conjunctions, determiners, pronouns, and prepositions are closed-class. Conversely, new items are added to open classes regularly. As closed-class words are usually used within a particular grammatical structure, they cannot be interpreted in isolation; e.g. in [3.16] the style of this painting, both the and this have no special meaning on their own, compared with painting, which has a specific meaning in general knowledge.
Open-class words are also known as lexical/content words. They are open because the meaning of an open-class word can be found in a dictionary, so it can be interpreted in isolation. For example, nouns, verbs, adjectives, and adverbs are open-class and make up the bulk of the vocabulary. Connective words, by contrast, are restrictive and used frequently to describe different scenarios or meanings, such as the spatial position between two object nouns, e.g. [3.17] The cat sits by/under/above the piano. Further, new open-class words are created from scratch or by combining existing words as times change, e.g. fax, telex, internet, iPhone, hub, bitcoin, metaverse, etc.
3.4.2 What Is a Preposition?
A preposition (PP) is a POS consisting of a word (or group of words) used before a noun, pronoun, or noun phrase to indicate direction, location, spatial relationship, or time; to describe an object; or to provide information to the recipient. There are approximately 80–100 prepositions in English used to form functional sentences/utterances.
This information can include where something takes place, e.g. [3.18] before dinner, or general descriptive information, e.g. [3.19] the girl with ponytail. The target of a preposition is the noun that follows it; it is also the ending point of each prepositional phrase. For instance, in [3.20] to the supermarket, the word to is the preposition and supermarket is its target, and in [3.21] over the rainbow, the word over is the preposition and rainbow is its target.
A list of the top 40 prepositions from the CELEX online dictionary (CELEX 2022) of the COBUILD 16-million-word corpus is shown in Fig. 3.8. It shows that of, in, for, to, and with are the top five prepositions used to relate ideas and additional information within a sentence/utterance.
Fig. 3.8 TOP 40 commonly used prepositions extracted from CELEX online dictionary
3.4.3 What Is a Conjunction?
A conjunction (CONJ or CNJ) is a POS that connects words, clauses, or phrases, which are known as conjuncts. This definition may sometimes overlap with other POS, so what constitutes a conjunction must be defined for each language. For instance, a word in English may have several senses and meanings and can be regarded as either a conjunction or a preposition, depending on the syntax of the sentence/utterance, e.g. after is a preposition in [3.22] Jane left after the show but a conjunction in [3.23] Jane left after she finished her homework.
A co-ordinating conjunction joins words, clauses, or phrases of equal grammatical rank in a sentence/utterance. Common coordinating conjunctions are and, but, for, nor, or, and yet, which at times also carry logical meaning.
3.4.4 What Is a Pronoun?
A pronoun (PRN or PN) is a POS that can be considered a word (or phrase) serving as a substitute for a noun or noun phrase; the noun (or noun phrase) being replaced is called the pronoun's antecedent. Pronouns usually appear as short words that replace a noun (noun phrase) in the construction of a sentence/utterance. Commonly used pronouns are I, he, she, you, me, we, us, this, them, that.
A pronoun can serve as a subject, a direct (or indirect) object, the object of a preposition, and more, to substitute for any person, location, animal, or thing. It can replace a person's name in a sentence/utterance, e.g. [3.26] Jack is sick today, he cannot attend the evening seminar. Pronouns are also a powerful tool to simplify the contents of a dialogue and conversation by replacing nouns with simple tokens. A list of the top 50 commonly used pronouns extracted from the CELEX online dictionary is shown in Fig. 3.10. It shows that it, I, he, you, and his are used most frequently.
The truth is, without pronouns, nouns become repetitive and cumbersome in speech and writing. However, pronouns may cause ambiguity, e.g. [3.27] Jack blamed Ivan for losing the car key, he felt sorry for that. He normally refers to the first person mentioned, which is Jack, but it also makes pragmatic sense for Ivan to feel sorry because Jack blamed him for the loss.
3.4.5 What Is a Verb?
A verb (VB) can be considered a word used syntactically to express an action, process, occurrence, or state of being. In general, verbs are inflected to encode tense, aspect, mood, and voice in many languages, although in some languages the same word form is interchangeable between noun and verb. In English, a verb may also agree with the gender, person, or number of its arguments, such as its subject or object.
Fig. 3.9 TOP 50 commonly used conjunctions extracted from CELEX online dictionary
English verbs take tense into consideration: (1) present tense, to indicate that an action is being carried out; (2) past tense, to indicate that an action has been completed; (3) future tense, to indicate that an action will happen in the future; and (4) future perfect tense, to indicate that an action will have been completed in the future.
A modal verb is a category of verb that contextually indicates a modality such as ability, advice, capacity, likelihood, order, obligation, permission, request, or suggestion.
Fig. 3.10 TOP 50 commonly used pronouns extracted from CELEX online dictionary
Fig. 3.11 TOP 25 commonly used modal verbs extracted from CELEX online dictionary
3.5.1 What Is Tagset?
There are nine POS in English learnt at school (pronoun, verb, adjective, interjection, noun, adverb, conjunction, preposition, and article), but there are clearly more subcategories into which they can be further divided. For example, nouns can be further classified into plural, possessive, and singular forms.
A tagset is a collection of POS tags (POST) used to indicate the part of speech, and sometimes other grammatical categories such as case and tense, for the classification of each word in a sentence/utterance.
Brown Corpus Tagset (Brown 2022), PENN Treebank Tagset (Treebank 2022),
and CLAWS (CLAWS7 2022) are commonly used. Brown Corpus was the first well-
organized corpus of English for NLP analysis developed by Profs Emeritus Henry
Kučera (1925–2010) and W. Nelson Francis (1910–2002) at Brown University,
USA in the mid-1960s. It consists of over 1 million English words extracted from over 500 samples of randomly chosen publications. Each sample consists of over 2000 words, and 87 tags are defined (Brown 2022).
The English PENN Treebank Tagset originated from the Penn Treebank project of annotated English corpora and is also used by annotation tools such as TreeTagger, developed by Prof. Helmut Schmid at the University of Stuttgart, Germany. It consists of 45 distinct tags (Abeillé 2003; Treebank 2022).
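For readers who want to inspect these tags hands-on, the short sketch below uses NLTK's built-in tagset documentation to print Penn Treebank tag definitions. It assumes NLTK is installed; the name of the data package holding the tag descriptions may vary between NLTK versions.

```python
import nltk
from nltk.help import upenn_tagset

# Tag descriptions ship as an NLTK data package (name may differ by version).
nltk.download("tagsets", quiet=True)

upenn_tagset("NN")     # noun, singular or mass
upenn_tagset("NNS")    # noun, plural
upenn_tagset("VB.*")   # regular-expression lookup: all verb tags VB, VBD, VBG, ...
```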
One may wonder why a tagset databank is necessary when a dictionary could be used to look up POS. A key reason is that many words are ambiguous in their POS tags:
1. Noun-verb ambiguity
For example: record: [3.28] records the lecture vs. [3.29] play CD records.
2. Adjective-verb ambiguity
For example: perfect: [3.30] a perfect plan vs. [3.31] Jack perfects the
invention.
3. Adjective-noun ambiguity
For example: complex: [3.32] a complex case vs. [3.33] a shopping complex.
Word types with 3 tags: 264; with 4 tags: 61; with 5 tags: 12; with 6 tags: 2; with 7 tags: 1. Ambiguous word types overall: 10.40%.
(Note: as an exercise, find out the other three POS tag usages for still.) Overall, about 10.4% of word types are ambiguous, and these ambiguous types occur frequently in language, although over 40% of the ambiguous words are easy to disambiguate.
There are four methods to acquire knowledge for POS tagging: (1) dictionary, (2) morphological rules, (3) N-gram frequencies, and (4) structural relationship combination.
A dictionary is the basic method to obtain tag usage, but it may not be fully reliable because of ambiguous words: the same word can have more than one POS tag in different scenarios.
Morphological rules identify well-known word shapes and patterns, e.g. the inflection -ed for the past tense, verb + -ing for the continuous form, -tion for nouns, -ly for adverbs, and capitalization such as New York for proper nouns.
N-gram frequency checking, also called next-word prediction, uses grammatical patterns such as to ___: when a to is present and the next word is a verb, it must be in the base (present) form and not the past tense, and if the word after to is instead a determiner, the next word must be a noun.
The structural relationship combination method combines several methods to acquire tag information, e.g. [3.41] She barely heard the foghorns knelling her demise vs. [3.42] The hunter's horn sounded the final knell. Even without knowing what knell means, the -ing pattern in [3.41] indicates a verb in continuous form, and in [3.42] the adjective final indicates that knell is likely a noun.
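To make the morphological-rules method concrete, the sketch below uses NLTK's RegexpTagger to guess tags purely from word shape. The suffix patterns and the sample tokens are illustrative choices, not rules taken from the book.

```python
from nltk.tag import RegexpTagger

# Each (regex, tag) pair encodes one morphological heuristic; patterns are tried in order.
patterns = [
    (r".*ing$", "VBG"),     # continuous form, e.g. "walking"
    (r".*ed$", "VBD"),      # past tense, e.g. "walked"
    (r".*tion$", "NN"),     # nominalization, e.g. "description"
    (r".*ly$", "RB"),       # -ly words are usually adverbs, e.g. "quickly"
    (r"^[A-Z].*$", "NNP"),  # capitalized word, e.g. "York"
    (r".*", "NN"),          # default fallback: tag everything else as a noun
]

tagger = RegexpTagger(patterns)
print(tagger.tag(["walking", "walked", "education", "quickly", "York", "piano"]))
# [('walking', 'VBG'), ('walked', 'VBD'), ('education', 'NN'),
#  ('quickly', 'RB'), ('York', 'NNP'), ('piano', 'NN')]
```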
There are three basic approaches to POS tagging: (1) rule-based, (2) stochastic-based, and (3) hybrid tagging.
The rule-based approach assigns a tag to each word according to its surrounding words. The obtained rule sets directly affect the accuracy of the tagging results. The lexicon is used initially for basic segmentation and tagging of the corpus, listing all possible lexical properties of each item, and the rule base is then combined with contextual information to disambiguate and retain only the suitable lexical properties.
Rules can be generated by (1) hand creation or (2) training from a corpus with machine learning. The advantage of hand creation is that the rules are more sensible and explainable to humans, but manual construction of rules is usually labor intensive. Also, if rules are described in too much detail, their coverage will be greatly reduced and they become difficult to adjust to the actual situation. Conversely, if rules are based not on context but only on the lexical nature of words, ambiguity may arise, e.g. the rule if the preceding word is an article, then the word must be a noun.
For example, consider [3.43] a book. The word a is an article, so its tag can be assigned directly, but book can be either a noun or a verb. Following the rule above, since a is an article and an article is usually followed by a noun, the tag noun is assigned to book. Word structures are often more complex, leading to more ambiguities, and more rules are required for differentiation.
Step 1: Assign each word a list of possible tags based on a dictionary.
Step 2: Work out unknown and ambiguous words with two kinds of rules: (1) rules that specify what to do; and (2) rules that specify what not to do.
Figure 3.13 shows a sample adverbial that rule (Jurafsky et al. 1999):
It shows that:
–– The first two clauses of this rule verify whether the word that directly precedes a sentence/utterance's final adjective, adverb, or quantifier.
–– For all other cases, the adverb reading is eliminated.
–– The last clause eliminates cases preceded by verbs like consider or believe which can take both a noun and an adjective.
–– The logic behind this is to avoid tagging the following instance of that as an adverb, as in [3.44] It isn't that odd.
–– Another rule is used to verify whether the previous word is a verb which expects a complement (like think or hope) and whether that is followed by the beginning of a noun phrase and a finite verb, as in [3.45] I consider that a win, or a more complex structure such as [3.46] I hope that she is confident.
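The following is a toy sketch of the two-step scheme above: a hand-made candidate lexicon plus one contextual rule. The lexicon entries and the rule are illustrative assumptions, not the rules from Fig. 3.13.

```python
# Step 1: a tiny hand-made lexicon of candidate tags per word.
LEXICON = {
    "a":    {"DT"},
    "book": {"NN", "VB"},   # "a book" (noun) vs. "book a flight" (verb)
    "read": {"VB", "VBD"},
}

def rule_based_tag(tokens):
    candidates = [LEXICON.get(w, {"NN"}) for w in tokens]   # Step 1: assign candidate tags
    tags = []
    for i, cands in enumerate(candidates):
        # Step 2: contextual rule - after a determiner, keep the noun reading.
        if i > 0 and "DT" in candidates[i - 1] and "NN" in cands:
            tags.append("NN")
        else:
            tags.append(sorted(cands)[0])                    # arbitrary fallback choice
    return list(zip(tokens, tags))

print(rule_based_tag(["a", "book"]))   # [('a', 'DT'), ('book', 'NN')]
```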
The stochastic-based approach (Dermatas and Kokkinakis 1995) differs from the rule-based approach in that it is a supervised model which uses the frequencies or probabilities of tags appearing in a training corpus to assign a tag to a new word. This tagging method depends on tag-occurrence statistics, i.e. the probabilities of the tags. Stochastic taggers are further categorized into two parts: (1) word frequency and (2) tag sequence frequency to determine a tag.
Word frequency identifies the tag with the most notable occurrence for the word. For example, based on counts from a corpus, if the word cloud occurs ten times, six times as a noun and four times as a verb, then cloud will always be assigned as a noun since that tag has the most notable occurrence in the training corpus. Hence, a word frequency approach is not very reliable in certain scenarios.
Tag sequence frequency, also called the N-gram approach, assigns the best tag to a word by evaluating the probability of the tags of the N previous words. Although it provides better outcomes than the word frequency approach, it may be unable to provide accurate outcomes for some rare words and phrases.
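Both ideas can be tried directly with NLTK's off-the-shelf taggers, as in the minimal sketch below: a unigram tagger captures the word-frequency idea and a bigram tagger conditions on the previous tag. It assumes NLTK and its sample Penn Treebank corpus are available; the example sentence is arbitrary and words unseen in training come back untagged (None).

```python
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, BigramTagger

nltk.download("treebank", quiet=True)

train = treebank.tagged_sents()[:3000]

# Word frequency: each word gets its most frequent tag from the training data.
uni = UnigramTagger(train)

# Tag sequence frequency: condition on the previous tag, backing off to the
# unigram tagger when a context was never seen in training.
bi = BigramTagger(train, backoff=uni)

sentence = "Jack records the lecture".split()
print(uni.tag(sentence))   # unseen words such as "Jack" are tagged None
print(bi.tag(sentence))
```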
A stochastic POS tagging model allows features to be non-independent and allows the addition of features at various granularities. The Hidden Markov Model (HMM) tagger is a common stochastic-based approach. The Maximum Entropy Markov Model (MEMM) (Huang and Zhang 2009) is a stochastic POS tagging model that determines an exponential model for each state as the conditional probability of the next state given the current state, which retains the advantages of a stochastic POS tagging model. However, it also suffers from the label bias problem. Unlike the MEMM model, the Conditional Random Field (CRF) model uses only one model for the joint probability of the entire label sequence given the observation sequence. Lafferty et al. (2001) verified that this model can effectively solve the tagging bias problem.
Let us use the HMM tagger as an example. The rationale of an HMM tagger is to apply N-gram frequencies to determine the best tag for a given word, following the same concept as investigating N-grams with Markov chains. Mathematically, all that is needed is to maximize the conditional probability that word w_i receives tag t_i in its context:
t_i = argmax_j P(t_j | t_{i−1}, w_i)   (3.3)
For a bigram-HMM tagger, select the tag t_i for w_i that is most probable given the previous tag t_{i−1} and the current word w_i:
t_i = argmax_j P(t_j | t_{i−1}) P(w_i | t_j)   (3.4)
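A minimal sketch of Eq. (3.4) in Python is shown below: it estimates transition and emission counts from NLTK's sample Penn Treebank corpus and then greedily picks the argmax tag for one word given the previous tag. This illustrates the bigram criterion only, not a full HMM tagger with Viterbi decoding.

```python
from collections import Counter, defaultdict

import nltk
from nltk.corpus import treebank

nltk.download("treebank", quiet=True)

# Count tables behind Eq. (3.4): P(t_j | t_{i-1}) and P(w_i | t_j).
trans = defaultdict(Counter)   # previous tag -> Counter of next tags
emit = defaultdict(Counter)    # tag -> Counter of emitted words

for sent in treebank.tagged_sents():
    prev = "<s>"
    for word, tag in sent:
        trans[prev][tag] += 1
        emit[tag][word.lower()] += 1
        prev = tag

def best_tag(prev_tag, word):
    """Return argmax_j P(t_j | prev_tag) * P(word | t_j) from relative frequencies."""
    scores = {}
    for tag, count in trans[prev_tag].items():
        p_trans = count / sum(trans[prev_tag].values())
        p_emit = emit[tag][word.lower()] / sum(emit[tag].values())
        scores[tag] = p_trans * p_emit
    return max(scores, key=scores.get)

print(best_tag("DT", "record"))   # after a determiner, "record" should receive a noun-like tag
```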
There are five steps in Transformation-Based Learning (TBL) (Brill 1995), by analogy to oil painting with a layering-and-refinement approach:
1. Start with background theme such as sky or household background.
2. Paint background first, e.g. if sky is the background scheme, paint clouds over it.
3. Paint the main theme or object over the background, e.g. landscape, birds.
4. Refine the main theme or object over background to make it more precise, e.g.
paint landscape, add trees and animals layer-by-layer.
5. Further refine objects or main theme until perfect, e.g. apply layering process or
refinement for every single tree and animal (Fig. 3.14).
Fig. 3.14 Oil painting analog to Brill Tagger transformation technique (Tuchong 2022)
The Brill Tagger is a type of hybrid TBL. Hybrid refers to integrating rule-based and stochastic-based methods in Brill's algorithm.
Rule 1: Label each word with the tag that is most likely given its contextual information, e.g. assign each word its most frequent tag in the training corpus.
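NLTK ships a transformation-based tagger trainer, and the sketch below shows one plausible way to combine a stochastic initial tagger with learnt transformation rules in that spirit. The template set, training slice, and sample sentence are assumptions for illustration rather than the book's own configuration.

```python
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, BrillTaggerTrainer
from nltk.tag.brill import brill24

nltk.download("treebank", quiet=True)

train = treebank.tagged_sents()[:3000]

# Stochastic initial annotation (most frequent tag per word), then learn
# transformation rules that patch its errors layer by layer.
initial = UnigramTagger(train)
trainer = BrillTaggerTrainer(initial, brill24(), trace=0)
brill_tagger = trainer.train(train, max_rules=10)

print(brill_tagger.tag("Jack records the lecture".split()))
print(brill_tagger.rules()[:3])   # the first few learnt transformation rules
```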
3.7 Taggers Evaluations
There are several considerations when POS taggers are implemented (Padro and Marquez 1998):
1. Evaluate the adequacy of the algorithm
2. Identify the origin of errors
3. Repair and solve them
A confusion matrix suggests that current taggers face major problems with:
1. Singular or mass noun vs. singular proper noun vs. adjective (NN vs. NNP vs. JJ). These are hard to distinguish; proper nouns are crucial for information extraction, retrieval, and machine translation, and different languages have diverse tagging algorithms or classification schemes.
Fig. 3.16 Confusion matrix from HMM of The Adventures of Sherlock Holmes
2. Particle vs. adverb vs. preposition/subordinating conjunction (RP vs. RB vs. IN). All of these can appear in satellite sequences immediately following a verb.
3. Verb-base form vs. verb-past participle vs. adjective (VB vs. VBN vs. JJ). They
are crucial to distinguish for partial parsing, i.e. participles to identify passives,
and to label the edges of noun phrases correctly.
The confusion matrix from the HMM error analysis of The Adventures of Sherlock Holmes (Doyle 2019) is shown in Fig. 3.16. For example, the rate of mis-tagging (1) NN as JJ is 7.56%, (2) NNP as NN is 5.23%, and (3) JJ as NN is 4.35%. Hence, mistaking NN for JJ occurs more often than JJ for NN in English texts, but this may vary in other languages.
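A confusion matrix of this kind can be produced in a few lines with NLTK, as in the hedged sketch below, which evaluates a simple unigram tagger on held-out sentences of NLTK's sample Penn Treebank corpus. The tagger, corpus split, and resulting numbers are illustrative and will not match Fig. 3.16.

```python
import nltk
from nltk.corpus import treebank
from nltk.metrics import ConfusionMatrix
from nltk.tag import UnigramTagger

nltk.download("treebank", quiet=True)

sents = treebank.tagged_sents()
train, test = sents[:3000], sents[3000:]
tagger = UnigramTagger(train)

# Flatten gold and predicted tag sequences, replacing untagged words with "UNK".
gold = [tag for sent in test for _, tag in sent]
pred = [tag if tag else "UNK"
        for sent in test
        for _, tag in tagger.tag([w for w, _ in sent])]

cm = ConfusionMatrix(gold, pred)
print(cm.pretty_format(sort_by_count=True, truncate=10))   # 10 most frequent tags only
```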
Exercises
3.1 What is Part-of-Speech (POS)? How is it critical to the implementation of NLP systems/applications?
3.2 State and explain NINE basic types of POS in English Language. For each
POS type, give an example for illustration.
3.3 What is POS Tagging in NLP? How is it important to NLP systems/applica-
tions implementation? Give two examples of NLP systems/applications for
illustration.
3.4 State and explain THREE types of POS Tagging methods in NLP.
3.5 What is PENN Treebank tagset? Perform POS Tagging for the following sen-
tences/utterance using PENN Treebank tagset.
[3.47] POS tagging is a very interesting topic.
[3.48] It is not difficult to learn PENN Treebank tagset provided that we
have sufficient examples.
3.6 What is Natural Language Understanding (NLU)? State and explain FIVE
major components of NLU in NLP.
3.7 Why is semantic meaning an important factor in POS tagging? Give two examples to support your answer.
3.8 What is ambiguity in POS tags? Give two example words, one with three and one with four possible POS tags.
3.9 What is Rule-based approach in POS Tagging? Give an example of POS tag-
ging rule to illustrate how it works.
3.10 What is Stochastic-based approach in POS Tagging? Give a live example to
explain how word frequency and tag sequences frequency are applied for POS
tagging.
3.11 State and explain Transformation-based Learning (TBL). Give a live example
to support your answer.
References
Abeillé, A. (ed) (2003) Treebanks: Building and Using Parsed Corpora (Text, Speech and Language
Technology Book 20). Springer.
Allen, J. (1994) Natural Language Understanding (2nd edition). Pearson
Bender, E. M. (2013) Linguistic Fundamentals for Natural Language Processing: 100 Essentials
from Morphology and Syntax (Synthesis Lectures on Human Language Technologies). Morgan
& Claypool Publishers
Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A
case study in part-of-speech tagging. Computational Linguistics, 21(4), 543–566.
Brown (2022) Brown corpus tagset. https://web.archive.org/web/20080706074336/http://www.
scs.leeds.ac.uk/ccalas/tagsets/brown.html. Accessed 15 July 2022.
CELEX (2022) CELEX corpus official site. https://catalog.ldc.upenn.edu/LDC96L14. Accessed
15 July 2022.
Clark, A., Fox, C. and Lappin, S. (2012) The Handbook of Computational Linguistics and Natural
Language Processing. Wiley-Blackwell.
CLAWS7 (2022) UCREL CLAWS7 Tagset. https://ucrel.lancs.ac.uk/claws7tags.html. Accessed
15 July 2022.
DeRose, S. J. (1988). Grammatical category disambiguation by statistical optimization.
Computational Linguistics, 14, 31–39.
Dermatas, E. and Kokkinakis, G. (1995). Automatic stochastic tagging of natural language texts.
Computational Linguistics, 21(2), 137–164.
Doyle, A. C. (2019) The Adventures of Sherlock Holmes (AmazonClassics Edition).
AmazonClassics.
Eisenstein, J. (2019) Introduction to Natural Language Processing (Adaptive Computation and
Machine Learning series). The MIT Press.
This chapter will explore syntax analysis and introduce different types of constitu-
ents in English language followed by the main concept of context-free grammar
(CFG) and CFG parsing. We will also study different major parsing techniques,
including lexical and probabilistic parsing with live examples for illustration.
Linguistic and grammatical aspects are addressed in NLP to identify the patterns that govern the creation of sentences in languages such as English. They include the investigation of Part-of-Speech (POS) covered in Chap. 3 and the grammatical rules used to create sentences/utterances with syntactic rules. These syntactic rules rely on effective computational procedures such as rule-based and stochastic-based techniques and machine learning to deal with language syntax (Bender 2013; Gorrell 2006).
Another motivation is to study syntax and parsing methods and algorithms so that they can be built into an automatic system, such as a parser, to understand syntactic structure during the construction process. Figure 4.1 illustrates the relationship between grammar, syntax, and the corresponding parse tree of a sentence/utterance with four tokens: Tom pushed the car. Syntax-level analysis examines the structure of and relationships between tokens to create a parse tree accordingly.
4.2 Syntax Analysis
4.2.1 What Is Syntax
Syntax can be considered as the rules in linguistics that govern how groups of words are combined to create phrases, clauses, and sentences/utterances (Bender 2013; Brown and Miller 2020). The term syntax comes from the Greek word σύνταξη, meaning to arrange words together. Syntax provides a proper and organized way to form meaningful phrases and sentences. It is a vital tool in technical writing and sentence construction.
The fact is that all native speakers learn the proper syntax of their mother language naturally. The complexity of the sentences produced by a writer or speaker sets a formal or informal level and shapes how phrases and clauses are presented to audiences.
From an NLP perspective, syntax can be considered as the proper ordering of word tokens in written/spoken sentences/utterances, so that computer systems can understand how to process these tokens without knowing their exact meaning.
4.2.2 Syntactic Rules
POS in English often follow ordering patterns in sentences and clauses (Khanam 2022; Jurafsky et al. 1999). For instance, compound sentences are combined by conjunctions like and or or, and multiple adjectives modifying the same noun are ordered according to their classes, e.g. [4.1] The big black dog.
Syntactic rules are also described to help the parts of a language make sense. For example, sentences/utterances in English usually begin with a subject followed by a predicate (i.e. a verb in the simplest form) and an object or a complement to show what is acted upon, e.g. [4.2] Jack chased the dog is a typical sentence with the subject-verb-object pattern of English syntax. However, [4.3] Jack quickly chased the dog at the lush green field contains an adverb and adjectives that take their places in front of the words they modify (quickly chased, lush green field) to give a more informative description.
Recall the five major components of NLU shown in Fig. 4.2; the syntax and parsing components play the central role of linking natural language to its syntactic structure prior to understanding its semantic or embedded (discourse and pragmatic) meanings in NLP (Allen 1994; Eisenstein 2019). It is the first tier of analysis to decide whether sentences/utterances make sense or not. In other words, if a sentence/utterance or dialog has a syntactic error, e.g. Jack buys (buys what?), it will not make sense, let alone allow the study of semantic meaning.
Syntax and parsing are the sole processes beneficial to:
1. Grammar checking in word-processing applications such as Microsoft Word.
2. Real-time syntactic analysis of human speech by speech recognizers in noisy environments.
They are also significant in high-level NLP applications such as machine translation and Q&A chatbot systems.
4.3.1 What Is Constituent?
4.3.2 Kinds of Constituents
Constituents exist in every sentence, phrase, and clause. Every sentence, in other words, is constructed by combining these components into a meaningful sentence/utterance (Bender 2013; Brown and Miller 2020). The commonly used constituent types include: (1) noun phrase, (2) verb phrase, and (3) preposition phrase. For instance:
[4.23] My cat Coco scratches the UPS courier on the table.
– Its constituents are a noun phrase (my cat Coco) and a predicate/verb phrase (scratches the UPS courier on the table).
4.3.3 Noun-Phrase (NP)
An NP consists of a noun and its modifiers. Modifiers may go before the noun, such as adjectives, articles, participles, possessive nouns, or possessive pronouns; or after the noun, such as adjective clauses, participle phrases, or prepositional phrases. For example, in [4.23] My cat Coco is an NP consisting of the determiner (DT) My + noun (NN) cat + proper noun (NNP) Coco.
Other NPs appear as objects of prepositions or objects of verbs:
[4.24] The milky cat with long tail is meowing.
[4.25] Very few cats wore a collar.
[4.26] The long tail is brought to room.
[4.27] Many places hear meowing.
[4.28] A cat with a long tail and a collar is meowing.
[4.29] Jane saw so many cats in the room.
4.3.4 Verb-Phrase (VP)
A verb phrase (VP) consists of a main verb followed by other linking verbs or modifiers that together act as a sentence's verb. Modifiers in a VP are words that can change, adapt, limit, expand, or help define a certain word in a sentence. They are usually auxiliary verbs such as is, has, am, and are that work with the main verb. The main verb in a VP holds information about the event or activity being referred to, and auxiliary verbs add meaning relating to the time or aspect of the phrase.
There are nine common VP types:
1. Singular main verb
[4.30] Jack catches a deer.
2. Auxiliary verb (to be) + main verb -ing form
When the main verb is used in its -ing form, e.g. walking, talking, it expresses a continuous aspect, whether in the past, present, or future.
[4.31] Jack is singing.
3. Auxiliary verb (have) + main verb (past participle form)
The verb to have (i.e. have, has, had) is combined with the main verb in its past participle form.
[4.32] Jack has broken the vase.
4. Modal verb + main verb
When a modal verb is used together with a main verb, it expresses notions such as possibility, probability, ability, permission, and obligation. Examples of modal words include must, shall, will, should, would, can, could, may, and might.
[4.33] Jack will leave.
5. Auxiliary verb (have + been) + main verb (-ing form)
When both continuous and perfect aspects are expressed, the continuous
aspect comes from -ing verb and the perfect aspect comes from auxiliary verb
have been.
[4.34] Jack has been washing the car.
6. Auxiliary verb (to be) + main verb (past participle form)
A verb to be is combined with the main verb in past participle form to express the passive voice. The passive voice indicates that an action is happening to the subject of the sentence rather than the subject performing the action.
[4.35] The lunch was served.
7. Negative and interrogative verb phrases
VP gets separated when sentences have negative or interrogative nature.
[4.36] Jack is not answering the exam questions.
8. Emphatic verb phrases
Auxiliary verbs, e.g. do, does, did, are used to emphasize a sentence.
[4.37] Jack did enjoy the vacation.
9. Composite VP
When a VP contains other VPs or NPs:
[4.38] My cat Coco scratches the UPS courier on the table.
– Scratches is the main verb in the VP, describing the action/event that happens to the object UPS courier; on the table is auxiliary information that further explains the event, and the sentence still makes sense with or without it. The composite VP thus includes the VP scratches + the NP the UPS courier + the PP on the table.
Single-word constituents are the POS studied in Chap. 3; the number of single-word constituent types depends on the tagset size. There are other, more complex design decisions:
[4.39] Jane bought the big red handbag ☑ vs.
[4.40] Jane bought the red big handbag ☒
Although both sentences contain POS-correct words, [4.40] has incorrect syntax, e.g. red placed before big. There are also incomplete simple constituent types.
[4.41] The cat with a long tail meowing a collar. ☒
– Does not make sense: although the NP is correct, collar is an incorrect description.
[4.42] Jane imagined a cat with a long tail. ☑
[4.43] Jane decided to go. ☑
– Both make sense without further description in the syntactic structure.
[4.44] Jane decided a cat with a long tail. ☒
– Does not make sense in terms of syntactic correctness.
[4.45] Jane decided a cat with a long tail should be her next pet. ☑
– Syntactically correct, although the sentence structure is slightly complex.
[4.46] Jane gave Lily some food. ☑
– Syntactically correct; give naturally takes this pattern to describe giving food.
[4.47] Jane decided Lily some food. ☒
– Although the surface structure looks the same as [4.46], decide does not allow this pattern; verbs differ in the argument patterns (subcategorization) they accept to describe things such as food types and purposes.
For adjectives:
[4.55] Jack is angry with Sophia vs. [4.56] Jack is angry at Sophia.
[4.57] Jack is mad at Sophia vs. [4.58] Jack is mad with Sophia. ☒
There are patterns and rules: both [4.55] and [4.56] are correct, but for mad, [4.57] is correct while [4.58] is not, which shows that subcategorization accepts some patterns and not others and cannot be decided on syntax alone.
For nouns: [4.59] Janet has a passion for classical music vs. [4.60] Janet has an interest in classical music.
The two nouns follow different patterns of syntactic rules.
English sentences can be very large and complex in structure. A concise sentence usually consists of a limited set of constituent types, i.e. NP, VP, and PP, which recurse to construct grammar rules as follows:
S → NP VP [4.61] My good friend Jack buys a flat.
VP → V NP [4.62] buys a flat.
NP → NP PP [4.63] My good friend.
NP → NP S [4.64] The boy who come early today won the game.
PP → prep NP [4.65] The cupcake with sprinkles is yours.
Context-free languages (CFLs) are closed under operations such as prefix, cycle, reversal, quotient, union, intersection and difference with a regular language (RL), and homomorphism. CFLs and CFGs are used in NLP and in computer-language design in computer science and linguistics.
A CFG describes a CFL as a set of recursive rules for generating string patterns; the application of production rules in a grammar is context-independent, meaning it does not depend on symbols outside the rules themselves (Bender 2013; Brown and Miller 2020).
CFG is commonly applied in linguistics and in compiler design to describe programming languages, and parsers can be created from it automatically.
CFG consists of four major components (Bender 2013; Jurafsky et al. 1999):
1. A set of nonterminal symbols N, which are placeholders for the patterns of terminal symbols that the nonterminal symbols can generate. These symbols usually appear on the LHS (left-hand side) of the production rules (P); the strings finally generated by the CFG consist of terminal symbols only.
2. A set of terminal symbols Σ (disjoint from N), which are the characters that appear in the strings generated by the grammar. Terminal symbols usually appear only on the RHS (right-hand side) of the production rules (P).
3. A set of production rules P of the form A → α, where A is a nonterminal symbol and α is a string of symbols drawn from (Σ ∪ N).
4. A designated start symbol S, which is the start symbol of the sentence/utterance.
Here Σ is the set of POS and N is the set of constituent types, i.e. NP, VP, and PP mentioned in Chap. 3 and the previous section, respectively.
L(G) = { w | w ∈ Σ* and S ⇒* w }   (4.1)
Let Σ be the set of POS, so the CFG in Eq. (4.1) can generate a string such as:
N V det N   (4.2)
S → NP VP   (4.4)
Equation (4.4) is the most basic grammar rule, where a sentence is generated from an NP and a VP, each of which can be further decomposed recursively, as shown in Fig. 4.6. The figure shows the CFG rules and the corresponding parse tree for sentence/utterance [4.66] Jane plays the piano. The four tokens in this sentence/utterance form a well-defined syntactic structure generated by NP and VP. The NP can be designated as a Name pointing to the token Jane, while the VP is decomposed into a verb and an NP, as shown in the four production rules in the top left corner of Fig. 4.6. In this case the verb points to plays, and the NP is decomposed into a determiner and a noun pointing to the and piano, respectively.
Fig. 4.6 CFG rules and corresponding parse tree for sentence [4.66] Jane plays the piano
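The CFG of Fig. 4.6 can be written down and parsed directly with NLTK, as in the sketch below. The rule names follow the text (Name, V, Det, N); the exact rules in the figure may differ slightly.

```python
import nltk

# A small CFG for [4.66] "Jane plays the piano", in the spirit of Fig. 4.6.
grammar = nltk.CFG.fromstring("""
  S    -> NP VP
  NP   -> Name | Det N
  VP   -> V NP
  Name -> 'Jane'
  V    -> 'plays'
  Det  -> 'the'
  N    -> 'piano'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(["Jane", "plays", "the", "piano"]):
    tree.pretty_print()   # draws the parse tree in the console
```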
4.5 CFG Parsing
There are three CFG parsing levels: (1) morphological, (2) phonological, and (3)
syntactic (Grune and Jacob 2007; Jurafsky et al. 1999).
4.5.1 Morphological Parsing
4.5.2 Phonological Parsing
Phonological parsing is the second level; it uses the sounds of a language, i.e. phonemes, to process sentences/utterances (Wagner and Torgesen 1987).
Phonological processing includes (1) phonological awareness, (2) phonological working memory, and (3) phonological retrieval. All three components are important for speech production and the development of written language skills. Hence, it is necessary to observe the spoken and written language development of children with phonological processing difficulties.
Phonological parsing interprets sounds into words and phrases from which a parse can be generated.
4.5.3 Syntactic Parsing
Syntactic parsing is the third level; it identifies the relevant components and the correct grammar of a sentence. An abstract representation is assigned to define the legal strings of a language, such as a CFG, without yet recognizing the structure.
Parsing algorithms are applied to analyze sentences/utterances within a language and assign appropriate syntactic structures to them. Parse trees are useful for studying grammar, semantic analysis, machine translation, speech recognition, and Q&A chatbots in NLP.
Syntactic parsing can be considered as a search within a set of parse trees; its main purpose is to identify the right path and search space through automation in an FSA-like system structure.
CFG parsing is the process of determining the right parse tree among all possible options. If there is more than one possible parse tree, a stochastic method (or another machine learning method) is applied to locate the most probable one. In other words, it is a process of identifying the search space defined by grammatical rules so that their constraints can become inputs to perform automatic parsing and to study grammars.
A simplified domain of English grammar and lexicon is used to present the CFG rules in an example about musical instruments, as shown in Fig. 4.7. It consists of production rules for S from several categories, S → NP VP, S → Aux NP VP, and S → VP, as well as production rules for NP, Nom, and VP with components Det, N, V, Prep, and PropN. A parse tree of sentence/utterance [4.67] play the piano is shown in Fig. 4.8. Its three tokens, play (Verb), the (Det), and piano (Noun), form a parse tree constructed from the top node S, which generates VP; VP generates Verb and NP; NP decomposes into Det and Nom; and Nom generates Noun.
Fig. 4.8 Parse tree for sentence/utterance [4.67] play the piano
4.5.7 Top-Down Parser
There are two approaches to construct a parse tree: (1) top-down and (2) bottom-up. A top-down parser constructs the tree from the root node S down to the leaf nodes (the words in the sentence/utterance). The first step is to identify all trees with root S; the next step is to expand all constituents in these trees based on the given production rules. The whole process operates level by level until the parse trees reach the leaves, i.e. the POS tokens of the sentence/utterance. Candidate parse trees that cannot match the leaf nodes, i.e. the POS tokens, are discarded and considered failed parse trees. Figure 4.9 shows the first three levels of constructing all possible parse trees using the top-down parser.
It shows that parse tree construction starts from the base level with the S tag (root node). The second level generates an additional layer with three possible production rules: S → NP VP, S → Aux NP VP, and S → VP. The third level is more complex because it decomposes into several variations. For S → NP VP, the first variation decomposes NP into Det and Nom, and the second variation decomposes NP into PropN. Note that only the LHS part is expanded here for demonstration purposes, but both LHS and RHS require expansion. For S → Aux NP VP there are likewise two variations, one decomposing NP into Det and Nom and the other decomposing NP into PropN. The VP decompositions of the first four parse trees are not shown, as they all fail to match the leaf nodes; only the fifth case is correct and forms a complete play the piano parse tree.
The top-down approach applied by the CFG to terminals and nonterminals is shown in Fig. 4.10. It shows that rule 3 is the first one applied; rule 2 decomposes VP into V NP, with V pointing to play; the NP then decomposes into Det and Nom; rules 4 and 5 point Det to the; and rules 6 and 7 decompose Nom into Noun and point Noun to piano.
Fig. 4.9 A three-level expansion of parse tree generation using top-down approach
Fig. 4.10 CFG rules and terminal/nonterminal nodes being used with top-down approach parsing
This completes the top-down parsing, with the fifth parse tree ending up as the valid solution. Readers can follow this seven-step process to complete the construction of the parse tree for the fifth case as an exercise.
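NLTK's RecursiveDescentParser is a ready-made top-down parser and can be used to replay this process, as in the sketch below. The grammar is a simplified stand-in for Fig. 4.7 (left-recursive rules are deliberately omitted because a recursive-descent parser cannot handle them), so it is an illustration rather than the book's exact grammar.

```python
import nltk

grammar = nltk.CFG.fromstring("""
  S     -> NP VP | Aux NP VP | VP
  NP    -> Det Nom | PropN
  Nom   -> Noun
  VP    -> Verb | Verb NP
  Det   -> 'the'
  Noun  -> 'piano'
  Verb  -> 'play'
  PropN -> 'Jane'
  Aux   -> 'can'
""")

# Top-down parsing: expand from S and discard expansions that cannot match the tokens.
parser = nltk.RecursiveDescentParser(grammar)
for tree in parser.parse(["play", "the", "piano"]):
    print(tree)   # (S (VP (Verb play) (NP (Det the) (Nom (Noun piano)))))
```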
4.5.8 Bottom-Up Parser
A bottom-up parser starts from the input words, e.g. tagging play the piano as N Det N or as V Det N. Since this approach cannot indicate which option is correct, the parsing operation continues to grow the trees until they reach the root node S, and trees that cannot match the root node are discarded.
Figure 4.11 shows the first three-level expansion of a parse tree using the bottom-up approach. In this case play the piano has two variations: play is either N or V, so there are two parts from the base level, one with play considered as N and the other as V. At the second level, the branch pointing to play is further expanded: in the first case the N above play is expanded as N, while in the second case the N above play is further expanded into Nom in the second layer. At the third level, the second case is further expanded into two options: one is Nom → V, and the other is VP → V NP with NP → Det Nom, continuing up to S → VP to complete the whole parse, while the other two parse tree options end up as invalid parses, as shown in Fig. 4.11.
Figure 4.12 shows the CFG rules for terminal and nonterminal nodes using the bottom-up approach. Again, it consists of seven steps. Rule 1 points V to play, rule 2 points Det to the, rule 3 points N to piano, rule 4 builds Nom from N, rule 5 builds NP from Det and Nom, rule 6 builds VP from V and NP, and rule 7 builds S from VP, completing the whole parse tree once it finally matches the root/source node S.
Fig. 4.11 A three-level expansion of parse tree generation using bottom-up approach
Fig. 4.12 CFG rules and terminal/nonterminal nodes being used with bottom-up approach parsing
4.5.9 Control of Parsing
Although both top-down and bottom-up parsing are straightforward, the control of
parsing is still needed to consider (1) which node that need to expand first and (2)
select grammatical rules sequence wisely to save time as most of the parse tree gen-
eration are dead-end and wastage of resources.
Pros (top-down approach)
Since it starts from the root/source node S, it always generates a correctly rooted parse tree unless the sentence has a syntactic error. In other words, it never explores a parse that does not lead back to the root/source node S, which means it will always find a solution if one exists.
Cons (top-down approach)
This approach does not consider the final word/token tags during parsing, so from the very beginning it can waste a lot of time generating trees that are totally unrelated to the correct result. For instance, play should be parsed as V instead of N; as shown in Fig. 4.9, the first four partial parse trees using play as N are invalid and a waste of parse tree generation time.
Pros (bottom-up approach)
Since it starts from the sentence tokens/POS, it always generates parse trees with all tokens/POS in the sentence considered and spends less time on rules unrelated to these tokens; this avoids the problem of the top-down approach exploring production rules that never reach the POS tags.
Cons (bottom-up approach)
This approach may often end up with broken trees that cannot reach the root node S to complete a parse tree, because it starts from the leaf nodes instead of the root/source node S. This makes sense because, although there are many ways to match production rules, most variations of the parse trees are syntactically incorrect and thus cannot reach the root/source node S. In Fig. 4.11, all parse trees except the last one (which is also the correct one) ended up broken and failed to match the root/source node S, again wasting parse tree generation time.
Let us look at lexicalized and probabilistic parsing as an alternative.
There are two reasons for using probabilities in parsing (Eisenstein 2019; Jurafsky et al. 1999): (1) to resolve ambiguity and (2) for word prediction in voice recognition. For instance:
[4.68] I saw Jane with the telescope. (Jane has the telescope, or I used the telescope to see Jane?)
[4.69] I saw the Great Pyramid flying over Giza plateau vs.
[4.70] I saw UFO flying over Giza plateau
Both situations have pragmatic problems, and [4.69] is incorrect because the Great Pyramid is an immovable structure. This can be solved by using probabilities in parsing: a large corpus and knowledge base (KB) can identify how frequently a particular term or constituent is used correctly, without pragmatic analysis.
For example, in voice recognition:
[4.71] Jack has to go vs.
[4.72] Jack half to go vs.
[4.73] If way thought Jack wood go
The following examples show how semantic meanings (Bunt et al. 2013; Goddard 1998) affect or determine the validity of a sentence/utterance in parsing:
[4.74] Jack drew one card from a desk [?] vs.
[4.75] Jack drew one card from a deck.
Note: drew → deck is clearly a semantic concern.
[4.76] I saw the Great Pyramid flying over Giza plateau. [?] vs.
[4.77] I saw a UFO flying over Giza plateau.
Note: movable vs. unmovable objects.
[4.78] The workers dumped sacks into a pin. [?] vs.
[4.79] The workers dumped sacks into a bin.
Note: dump looks for a locative complement.
[4.80] Tom hit the ball with the pen. [?] vs.
[4.81] Tom hit the ball with the bat.
Note: which object can use to hit the ball?
[4.82] Visiting relatives can be boring. [?] vs.
[4.83] Visiting museums can be boring.
Note: Visiting relatives is genuinely ambiguous, whereas Visiting museums is clear because only animate beings can visit. There is no need for abstraction when there is enough data; in other words, a sufficiently large corpus, databank, or dialogue databank can sort out ambiguity problems and work out the correct syntax with semantic meaning in many cases.
There are two classical approaches to adding semantics into parsing: (1) cascaded systems that construct all parses and use semantics for ranking, which is tedious and complex; and (2) doing semantics incrementally.
A modern approach is to forget about explicit meaning and rely only on a KB and corpus. If a corpus contains sufficient sentences and knowledge, facts about meaning emerge in the probabilities of the observed sentences themselves. It is modern because constructing world models is harder than early researchers realized, whereas huge text corpora are available from which to construct useful statistics. This leads to the lexical and probabilistic approach to parsing.
4.6.3 What Is PCFG?
A → β [p]   (4.5)
P(A → β | A)   (4.6)
This section uses the sentence/utterance [4.84] buy coffee from Starbucks as an example to illustrate how a PCFG works. It has simple CFG rules and probabilities drawn from a segment of an AI chatbot dialogue for food ordering on campus, as shown in Fig. 4.13.
One of the most important basic criteria of a PCFG is that the probabilities of all production rules of the same type must sum to 1, as shown. For instance, the three production rules for S, S → NP VP (0.82), S → Aux NP VP (0.12), and S → VP (0.06), must sum to 1. The same holds for the other production rules for NP, Nom, VP, Det, N, V, Aux, Proper-N, and Pronoun. Of course, if the corpus is very large, some of these probability values will be very small, just like the N-gram probability evaluation discussed in Chap. 2.
Either a top-down or a bottom-up parser can be applied to generate the parse tree, using the following PCFG probability evaluation scheme:
P(T) = ∏_{n∈T} p(r(n))   (4.7)
where p(r(n)) is the probability that rule r is applied to expand the nonterminal node n.
Fig. 4.13 Sample CFG rules and their probabilities in AI Chatbot dialogues (food ordering
at campus)
T̂(S) = argmax_{T∈τ(S)} P(T)   (4.8)
P(PT1) = 0.12 × 0.36 × 0.06 × 0.06 × 0.37 × 0.72 × 0.43 × 0.36 × 0.41 × 0.63 × 0.75 = 1.242 × 10^−6
P(PT2) = 0.12 × 0.36 × 0.36 × 0.06 × 0.05 × 0.72 × 0.43 × 0.36 × 0.41 × 0.63 × 0.75 = 1.007 × 10^−6
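These two products can be checked in a couple of lines of Python, as in the sketch below; the factor lists simply restate the rule probabilities multiplied above.

```python
from math import prod

# Rule probabilities multiplied for the two candidate parse trees PT1 and PT2.
pt1 = [0.12, 0.36, 0.06, 0.06, 0.37, 0.72, 0.43, 0.36, 0.41, 0.63, 0.75]
pt2 = [0.12, 0.36, 0.36, 0.06, 0.05, 0.72, 0.43, 0.36, 0.41, 0.63, 0.75]

print(f"P(PT1) = {prod(pt1):.3e}")   # ~1.242e-06
print(f"P(PT2) = {prod(pt2):.3e}")   # ~1.007e-06
```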
Fig. 4.14 Two possible parse trees for utterance “Can you buy Starbucks coffee”?
Fig. 4.15 CFG rules and associated probabilities for two possible parse trees PT1 vs. PT2
By the PCFG probability calculation, parse tree 1 has the higher probability. In other words, the intended meaning is more likely to be buying coffee rather than buying other things from Starbucks. This also provides an efficient solution for deciding which parse tree is more probable when two or more parse trees are ambiguous, provided there are sufficient lexical probabilities and a corpus from which to calculate them.
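NLTK can perform this comparison automatically with a PCFG and its Viterbi parser, as in the sketch below. The grammar and probabilities are illustrative assumptions (each left-hand side sums to 1) and are not the values of Fig. 4.13; the parser simply returns the most probable tree for the utterance.

```python
import nltk

pcfg = nltk.PCFG.fromstring("""
  S       -> Aux NP VP [0.2] | NP VP [0.8]
  NP      -> Pronoun [0.4] | PropN [0.3] | Nom [0.3]
  Nom     -> Noun [0.6] | PropN Noun [0.4]
  VP      -> V NP [0.7] | V NP NP [0.3]
  Aux     -> 'can' [1.0]
  Pronoun -> 'you' [1.0]
  PropN   -> 'Starbucks' [1.0]
  Noun    -> 'coffee' [1.0]
  V       -> 'buy' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("can you buy Starbucks coffee".split()):
    print(tree)          # most probable parse: buy [Starbucks coffee] as one NP
    print(tree.prob())   # probability of that parse
```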
P(S) = ∑_{T∈τ(S)} P(T)   (4.9)
In many situations, it is adequate to know that one rule is used more frequently than another, e.g.
[4.86] Can you buy Starbucks coffee? vs. [4.87] Can you buy KFC coffee?
But often the context matters.
For example:
S → NP VP
NP → Pronoun [0.80] (4.11)
NP → LexNP [0.20]
For example, when the NP is the subject, the probability of a pronoun may be higher, say 0.91; when the NP is the direct object, the probability of a pronoun may be lower, say 0.34. This means the probability depends on the position of the NP in the sentence/utterance. In other words, the probabilities also often depend on lexical choices, as shown in the following examples:
[4.88] I saw the Great Pyramid flying over Giza Plateau. vs.
[4.89] I saw a UFO flying over Giza Plateau.
[4.90] Farmer dumped sacks in the bin. vs.
[4.91] Farmer dumped sacks of apples.
[4.92] Jack hit the ball with the bag. vs.
[4.93] Jack hit the ball with the bat.
[4.94] Visiting relatives can be boring. vs.
PT1: [(boys in park) and girls]   vs.   PT2: [boys in (park and girls)]
Fig. 4.16 Two interpretations of the utterance “boys in park and girls”
Fig. 4.17 Lexical tree for the utterance “workers dumped sacks into a bin”
Given P(h(n) = word_i | n, h(m(n))):
VP(dumped) → PP(into), p = p1   (4.17)
VP(dumped) → PP(of), p = p2
NP(sacks) → PP(of), p = p3
P(T) = ∏_{n∈T} p(r(n) | n, h(n)) × P(h(n) | n, h(m(n)))   (4.18)
The contribution of this part of the parse to the total scores of the two candidates will therefore differ, so we should prefer attaching into to dumped rather than to sacks in this case.
Exercises
4.1 What are syntax and parsing in linguistics? Discuss why they are important in NLP.
4.2 What is syntactic rule? State and explain SEVEN commonly used syntactic
patterns in English language, with an example each to illustrate.
4.3 Answer (4.2) for another language such as Chinese, French, or Spanish. What are the differences in syntactic rules between the two languages? Give an example to illustrate.
4.4 What are constituents in English language? State and explain three commonly
used English constituents, with an example each to illustrate how it works.
4.5 What is context-free grammar (CFG)? State and explain the importance of
CFG in NLP.
4.6 State and explain FOUR major CFG components in NLP. Use an example
sentence/utterance to illustrate.
4.7 What are TWO major types of CFG parsing scheme? Use an example sen-
tence/utterance [4.102] Jack just brought an iPhone from Apple store to illus-
trate how these parsers work.
4.8 What is PCFG in NLP parsing? Use same example [4.102] Jack just brought
an iPhone from Apple store to illustrate how it works. Compare with parsers
used in (4.7), which one is better?
4.9 What are the advantages and limitations of PCFG in NLP parsing? Use some
sample sentences/utterances to support your answers.
4.10 What is lexical parsing in NLP parsing? Discuss and explain how it works by
using sample sentence [4.102] Jack just brought an iPhone from Apple store
for illustration.
References
5.1 Introduction
5.2 What Is Meaning?
5.3 Meaning Representations
This chapter adopts a similar approach to syntax and morphology analysis (Bender 2013) to create representations of linguistic inputs and capture their meanings. These linguistic representations characterize the meanings of sentences and states of affairs in real-world situations.
Unlike parse trees, these representations are not primarily descriptions of input structure; rather, they represent how humans understand and mean things such as actions, events, and objects, and try to make sense of them in our environment: the meaning of everything.
There are five types of meaning representation: (1) categories, (2) events, (3) time, (4) aspect, and (5) beliefs, desires, and intentions.
1. Categories refer to specific objects and entities, e.g. company names, locations,
objects.
2. Events refer to actions or phenomena experienced, e.g. eating lunch, watching a movie. They are relevant to verbs or verb phrases expressed in POS.
3. Time refers to exact or reference moment, e.g. 9:30 am, next week, 2023.
4. Aspects refer to:
(a) Stative—to state facts.
For example: [5.2] Jane knows how to run.
(b) Activity—to describe action.
For example: [5.3] Jane is running.
(c) Accomplishment—to describe completed action without ending terms.
For example: [5.4] Jane booked the room.
(d) Achievement—to describe terminated action.
For example: [5.5] Jane found the book.
5.4 Semantic Processing
Semantic processing (Bender and Lascarides 2019; Best et al. 2000; Goddard 1998) undertakes meaning representation to encode and interpret meanings. These representations allow us to:
1. Reason relations with the environment
For example: [5.9] Is Jack inside the classroom?
2. Answer questions based on contents
For example: [5.10] Who got the highest grade in the test?
3. Perform inference based on knowledge and determine the verity of unknown
fact(s), thing(s), or event(s),
For example: [5.11] If Jack is in the classroom, and Mary is sitting next to him,
then Mary is also in the classroom.
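Inference of the kind in example [5.11] can be sketched in plain Python as a tiny forward-chaining rule over a set of facts, as below. The relation names and the single rule are illustrative assumptions, not a general-purpose reasoner.

```python
# Facts for [5.11]: Jack is in the classroom, Mary is sitting next to Jack.
facts = {("in", "Jack", "classroom"), ("next_to", "Mary", "Jack")}

def infer(facts):
    """Rule: if x is in a place and y is next to x, then y is in that place too."""
    derived = set(facts)
    for rel1, x, place in facts:
        if rel1 != "in":
            continue
        for rel2, y, z in facts:
            if rel2 == "next_to" and z == x:
                derived.add(("in", y, place))
    return derived

print(("in", "Mary", "classroom") in infer(facts))   # True
```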
Semantic processing is applied in typical applications such as Q&A chatbot systems, where it is necessary to understand meanings, i.e. to have the ability to answer questions about context or discourse with knowledge and literal or even embedded meanings. The following are live examples from our AI Tutor chatbot (Cui et al. 2020) which involve different degrees of semantic processing:
[5.12] What is the meaning of NLP?
– Basic level of semantic processing for the meaning of certain concept.
[5.13] How does N-gram model work?
– Requires understandings on facts and meanings to respond.
[5.14] Is Turing Test still exist?
– Involves high-level query and inference from previous knowledge.
[5.15] Why do we need to study self-awareness in AI?
– Involves high-level information such as world knowledge or common
sense aside AI terminology knowledge base to respond.
[5.16] Should I study AI?
– Involves the highest-level information about user’s common sense and
world knowledge aside AI concepts learnt by the book.
There are four common meaning representation schemes: (1) First-Order Predicate Calculus (FOPC), (2) Semantic Networks (semantic nets), (3) Conceptual Dependency Diagrams (CDD), and (4) Frame-Based Representation. A sample sentence, [5.17] Jack drives a Mercedes, is used to illustrate how they perform.
First-Order Predicate Logic (FOPL) (Dijkstra and Scholten 1989; Goldrei 2005), also known as predicate logic or first-order predicate calculus, is a robust language representation scheme that expresses the relationships between information objects as predicates. For example, the FOPC meaning of [5.17] is given by
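The formula itself is not reproduced above, so the sketch below gives one common FOPC rendering of [5.17] using NLTK's logic module; the predicate names (Mercedes, Drives) are illustrative assumptions rather than the book's exact formula.

```python
from nltk.sem import Expression

read_expr = Expression.fromstring

# One possible FOPC reading of [5.17] "Jack drives a Mercedes":
# there exists an x such that x is a Mercedes and Jack drives x.
formula = read_expr(r"exists x.(Mercedes(x) & Drives(Jack, x))")

print(formula)          # exists x.(Mercedes(x) & Drives(Jack,x))
print(formula.free())   # empty set: the formula is closed (no free variables)
```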
5.5.2 Semantic Networks
Semantic networks (semantic nets) (Jackson 2019; Sowa 1991) are a knowledge representation technique used for propositional information. They convey the meanings of knowledge in a two-dimensional representation. A semantic net can be represented as a labelled directed graph. The logic behind it is that the meaning of a concept is connected to other concepts and can therefore be represented as a graph. The information in a semantic net is characterized as a set of concept nodes linked to each other by a set of labelled arcs which characterize the relationships, as illustrated in Fig. 5.1 for example sentence [5.17].
Fig. 5.1 Semantic network for sentence [5.17] Jack drives a Mercedes (Driving linked to Jack via Driver and to Mercedes via DriveThing)
Driving is the core concept, connected to two nodes (concepts), Driver and DriveThing, which link to Jack as the driver and Mercedes as the thing driven, respectively.
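Such a labelled graph can be held in a plain Python dictionary, as the minimal sketch below shows for sentence [5.17]; the node and arc names follow Fig. 5.1, while the data structure itself is just an illustrative choice.

```python
# Labelled arcs of the semantic net: (concept node, arc label) -> linked concept.
semantic_net = {
    ("Driving", "Driver"): "Jack",
    ("Driving", "DriveThing"): "Mercedes",
}

for (concept, relation), filler in semantic_net.items():
    print(f"{concept} --{relation}--> {filler}")
# Driving --Driver--> Jack
# Driving --DriveThing--> Mercedes
```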
5.5.4 Frame-Based Representation
There are three factors to fulfil in a meaning representation (Bunt 2013; Butler 2015; Potts 1994): (1) verifiability, (2) ambiguity, and (3) vagueness considerations.
5.6.1 Verifiability
5.6.2 Ambiguity
Ambiguity arises when a word, statement, or phrase has more than one meaning. Ambiguous words or phrases can cause confusion, misunderstanding, or even humorous situations.
For example: [5.19] Jack rode a horse in brown outfit.
This clause may lead readers to wonder whether the horse, rather than the rider, wore the brown outfit. Likewise, the same word with different meanings induces ambiguity, e.g. Jack took off his gun at the bank: it is easy to confuse whether bank refers to a building or to the land alongside a river or lake. Contextual meaning is important to resolve ambiguity.
5.6.3 Vagueness
Vagueness describes borderline cases, e.g. tall is a vague term in the sense that a person who is 1.6 m in height is neither clearly tall nor clearly short, since no amount of conceptual analysis or empirical investigation can settle whether a 1.6 m person is tall without a frame of reference. Here is another live example:
[5.20] He lives somewhere in the south of US.
– This is also vague as to the location meant.
Ambiguity and vagueness are two varieties of uncertainty which are often discussed together but are distinct in their essential features and significance in semantic theory. Ambiguity involves uncertainty about the mapping between representation levels, where an expression has more than a single meaning with different structural characteristics, while vagueness involves uncertainty about the actual meaning of terms. Hence, a good meaning representation system should resolve vagueness and avoid ambiguity.
5.6.4 Canonical Forms
Advantages
Disadvantages
5.7 Inference
5.7.1 What Is Inference?
Inference (Blackburn and Bos 2005) is divided into deduction and induction, with origins dating back to Ancient Greece and Aristotle in the 300s BCE. Deduction refers to using available information to reason toward or draw conclusions about facts, such as the legendary Sherlock Holmes' deductive reasoning method (Doyle 2019). Examples of inference by deductive reasoning:
Inferencing with FOPC aims to come up with valid conclusions based on the meaning representation of the inputs and a knowledge base. For example:
[5.31] Does Jack eat KitKat?
It consists of two FOPC statements:
Given that the above two FOPC statements are true, the answer to [5.31] can be inferred to be yes by using inductive or deductive reasoning.
Charles J. Fillmore proposed the Theory of Universal Cases (Fillmore 1968). He believed that there is only a restricted number of semantic roles, called case roles, appearing in every sentence/utterance constructed with a verb.
Fillmore's Theory of Universal Cases (Fillmore 2020; Mazarweh 2010) analyzes the fundamental syntactic structure of sentences/utterances by exploring the association of semantic roles, such as agent, benefactor, location, object, or instrument, which are required by the verb in the sentence/utterance. For instance, the verb pay involves semantic roles such as agent (A), beneficiary (B), and object (O) for sentence construction. For example:
[5.32] Jane (A) pays cash (O) to Jack (B).
According to Fillmore's Case Theory, each verb needs a certain number of case roles to form a case-frame. Thus, the case-frame determines the vital aspects of the semantic valency of verbs, adjectives, and nouns. Case-frames conform to certain limitations, such as that a particular case role can appear only once per sentence. There are mandatory and optional cases. Mandatory cases cannot be deleted; otherwise, ungrammatical sentences are produced. For example:
[5.33] This form is used to provide you.
This sentence/utterance makes no sense without an additional role explaining provide you to or with what matter or notion. One possible solution is:
[5.34] This form is used to provide you with the necessary information.
The association between nouns and their structures has both syntactic and semantic importance. The syntactic positional relationship between forms in a sentence varies from language to language, so grammarians can observe and examine the semantic values of these nouns and provide information for assigning case roles in a specific language.
One of the major tasks of semantic analysis in Fillmore’s Theory is to offer a
possible mapping between the syntactic constituents of a parsed clause and their semantic
roles associated with the verb. The term case role is widely used for purely
semantic relations, including theta and thematic roles. The theta role (θ-role) refers
to a formal device for representing the syntactic argument structure required
by a particular verb. For instance:
[5.35] Jack gives the toy to Ben.
Statement [5.35] shows that the verb give has three arguments, where Jack is
assigned the external theta role of agent, toy is assigned the theme role, and to
Ben is assigned the goal role.
Thematic role, also called semantic role, refers to the case role that a noun phrase
(NP) may play with respect to the action or state described by the main verb. For example:
[5.36] Jack gets a prize.
Statement [5.36] shows that Jack is the agent, as he is the doer of get, and the prize
is the object being received, so it is the patient.
Note that:
1. These are general rules; some verbs may have exceptions.
2. Every syntactic constituent can fill only one case at a time.
3. No case role can appear twice in the same rule.
4. Only NPs of the same case role can be conjoined in the rule.
5.8.3.1 Selectional Restrictions
Selectional restrictions are methods to restrict the types allowed for certain roles for
semantic consideration. For instance:
1. Agents must be animate, i.e. a living thing such as a person, Jack.
2. Instruments must be inanimate objects, i.e. non-living things such as a rock.
3. Themes are of types that may depend on the verb, e.g. window relates to the
verb break.
Such constraints can be applied to the following examples to check whether they
make sense or not:
[5.57] Someone assassinated the President vs.
[5.58] The spider assassinated the fly. ☒
Nevertheless, additional rules can be deployed to state that assassinate implies an
intentional or political killing, such that [5.58] is marked incorrect. In fact, this kind of
method is usually applied in semantic analysis, to be discussed in Chap. 6. A minimal
sketch of such a check is shown below.
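As a minimal illustration (not from the book's workshops), the sketch below hard-codes a tiny, hypothetical lexicon of human/animate nouns and a per-verb agent constraint to check statements such as [5.57] and [5.58]:

# A minimal sketch of a selectional-restriction check. The tiny lexicon and the
# per-verb constraints below are hypothetical illustrations, not a standard resource.
HUMAN = {"someone", "person", "jack", "jane", "president"}
ANIMATE = HUMAN | {"spider", "fly", "dog", "cat"}

# Semantic type required of the agent of each verb.
AGENT_CONSTRAINT = {
    "assassinated": "human",    # assassinate implies an intentional/political killing
    "broke": "animate",
}

def agent_ok(agent: str, verb: str) -> bool:
    """Check whether the agent satisfies the verb's selectional restriction."""
    required = AGENT_CONSTRAINT.get(verb)
    if required == "human":
        return agent.lower() in HUMAN
    if required == "animate":
        return agent.lower() in ANIMATE
    return True                  # no restriction recorded for this verb

print(agent_ok("Someone", "assassinated"))   # True  -> [5.57] is acceptable
print(agent_ok("spider", "assassinated"))    # False -> [5.58] violates the restriction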
5.9 First-Order Predicate Calculus
FOPC consists of four major elements: (1) terms, (2) predicates, (3) connectives,
and (4) quantifiers.
1. Terms
Terms are object names with three representations: (a) constants, (b) functions,
and (c) variables.
Constants refer to specific objects described in the sentence/utterance, e.g.
Jack, IBM.
Functions refer to concepts expressed as genitives such as a brand name or location,
e.g. Brandname(Mercedes), LocationOf(KFC), and can be regarded as single-argument
predicates.
Variables refer to objects without specifying which object is referred to, like the
variables x, y, and z used in a mathematical equation such as x + y = z.
They are frequently used in FOPC for query and inferencing operations.
2. Predicates
The notion of a predicate (Epstein 2012) in traditional grammar traces back
to Aristotelian logic (Parry and Hacker 1991). A predicate is regarded as the property
that a subject has or is characterized by. It can be considered the expression of a fact
about the relations that link up some fixed number of objects in a specific domain,
e.g. he talks, she cries, Jack plays football, etc. Predicates are often represented with
capitalized names like Buy or Play in FOPC and combine with object names to form a
proposition, e.g. Drive(Mercedes), Drive(Mercedes, Jack), Drive(Mercedes, x),
Drive(Mercedes, Jack, UIC, Starbucks), Drive(car, x, org, dest), etc.
3. Connectives
Connectives combine propositions: conjunction (and in English, written & or ∧),
disjunction (or in English, written ∨), and implication (if-then in English, written → or ⊃).
Negation (not in English, written ¬ or ~) is also regarded as a connective, even though it
operates on a single proposition.
4. Quantifiers
Quantifiers refer to generalizations. There are two major kinds of quantifiers:
universal (all in English, written ∀) and existential (some in English, written ∃).
The term first-order in FOPC means that this logic only uses quantifiers to generalize
over objects, never over predicates.
A FOPC Context-Free Grammar (CFG) specification is shown in Fig. 5.4.
Connective → ∧ | ∨ | →
Quantifier → ∀ | ∃
Constant → IBM | Tesla | USA | Jack | A | ...
Variable → x | y | z | ...
Predicate → Drive | Buy | Find | ...
Function → LocationOf | Brandname | ...
It can have other configurations to describe the same predicate logic for example:
Note that all these predicates should be treated individually as their arguments
have different overall meanings.
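Such FOPC expressions can also be written and inspected programmatically. The following is a minimal sketch using NLTK's logic parser (nltk.sem.Expression.fromstring), in which conjunction, disjunction, implication, and negation are written as &, |, -> and -, and the universal and existential quantifiers as all and exists; the predicate and constant names simply follow the book's running examples:

import nltk

read_expr = nltk.sem.Expression.fromstring

# Terms and predicates: constants (Jack, Mercedes) combined with a predicate name.
p1 = read_expr('Drive(Jack, Mercedes)')

# Connectives: conjunction (&) and negation (-).
p2 = read_expr('Buy(Jack, KitKat) & -Buy(Jack, Chocolate)')

# Quantifiers: universal (all) and existential (exists) over object variables.
p3 = read_expr('all x.(ElectricCar(x) -> Fuel(x, Electricity))')
p4 = read_expr('exists x.(Buy(Jack, x) & Brandname(x))')

for e in (p1, p2, p3, p4):
    print(e, '| free variables:', e.free())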
A predicate that represents a verb meaning, e.g. give, has the same number of arguments
as are present in the verb's syntactic subcategorization frame. It is still difficult to (1) determine
the correct number of roles for an event, (2) state facts about the case role(s) associated
with the event, and (3) ensure that correct inference(s) can be derived from the meaning
representation.
Given the above considerations, the FOPC formulation stated in Eq. (5.5) is
not as useful as it seems; it would be preferable if roles or cases were kept separate and
flexible when composing the whole FOPC statement, like this:
Note: Just as Isa() serves as the predicate is a, AKO() is a useful predicate meaning
a kind of. In fact, FOPC materializes events so that they can be quantified and related to
other events and objects through a defined set of relationships, with logical connections
between closely related instances and without additional meaning assumptions.
P → Q, P ⊢ Q    (5.14)
where P, Q, and P → Q are statements or propositions in a formal language and ⊢
is a metalogical symbol, meaning that Q is a syntactic consequence of P and P → Q
in a logical system. MP rule justification in a classical two-valued logic is given by
a truth table as shown in Fig. 5.5.
The following example uses a Tesla car to demonstrate how FOPC works in
logic inference. It has three statements to process:
ElectricCar(Tesla)
∀x ElectricCar(x) → Fuel(x, Electricity)    (5.15)
Fuel(Tesla, Electricity)
Note: The first statement says Tesla is an electric car; the second statement says that
for every x, if x is an electric car, then the fuel it uses must be electricity.
The predicate ElectricCar(Tesla) matches the antecedent of the rule, so simple
MP deduction concludes that Fuel(Tesla, Electricity) is a true
statement.
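The same deduction can be reproduced with NLTK's resolution theorem prover; the sketch below is only an illustration of Eq. (5.15) and the MP step, not the book's own workshop code:

import nltk
from nltk.inference import ResolutionProver

read_expr = nltk.sem.Expression.fromstring

# The two premises of Eq. (5.15): Tesla is an electric car, and every
# electric car uses electricity as its fuel.
premises = [
    read_expr('ElectricCar(Tesla)'),
    read_expr('all x.(ElectricCar(x) -> Fuel(x, Electricity))'),
]

# The conclusion to be derived by Modus Ponens.
goal = read_expr('Fuel(Tesla, Electricity)')

print(ResolutionProver().prove(goal, premises))   # True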
In fact, MP can be applied in Forward and Backward Reasoning modes.
Forward Reasoning (FR), also called normal mode, adds all facts into a knowledge base
(KB) and invokes all applicable implication rules to check clause correctness or to add
new knowledge.
Backward Reasoning (BR) is MP operating in reverse mode to prove a specific
proposition, called a query in computer science. That is, it examines whether the query
formula is true by its presence in the KB, or whether it can be derived from the facts
and implications the KB contains, and returns the query result accordingly.
Exercises
5.1 What is meaning representation? Explain why meaning representation is
important in NLP. Give one or two live examples to support your answer.
5.2 State and explain FIVE major categories of meaning representation. For each,
give one live example to support your answer.
5.3 State and explain FOUR common types of meaning representation in NLP. For
each type, use the following sample sentence/utterance [5.69] Jack buys
a new flat in London to illustrate how they work for meaning representation.
5.4 What are the THREE basic requirements for meaning representation? For each
requirement, give two live examples to support your answer.
5.5 What is Canonical Form? How is canonical form applied to meaning representation?
For the sample sentence/utterance [5.69] Jack buys a new flat in
London, give five variations of this sentence and work out the canonical form
in the forms of FOPC and Semantic Net.
5.6 What is Inference? Explain why inference is vital to NLP and the implementa-
tion of NLP applications such as Q&A chatbot.
5.7 What is Fillmore’s Theory of universal cases? State and explain SIX major
case roles of Fillmore’s Theory in meaning representation. Use a live example
for illustration.
5.8 What are the complications of Fillmore’s Theory in meaning representation? Using
several live examples, explain how they can be solved.
5.9 What are FOUR basic components of First-Order Predicate Calculus (FOPC)?
State and explain their roles and function in FOPC formulation.
5.10 What is Modus Ponens (MP) in inferencing? In addition to MP, state and
explain other possible inferencing methods that can be applied to FOPC in
meaning representation.
References
Bender, E. M. (2013) Linguistic Fundamentals for Natural Language Processing: 100 Essentials
from Morphology and Syntax (Synthesis Lectures on Human Language Technologies). Morgan
& Claypool Publishers
Bender, E. M. and Lascarides, A. (2019) Linguistic Fundamentals for Natural Language Processing
II: 100 Essentials from Semantics and Pragmatics (Synthesis Lectures on Human Language
Technologies). Springer.
Best, W., Bryan, K. and Maxim, J. (2000) Semantic Processing: Theory and Practice. Wiley.
Blackburn, P. and Bos, J. (2005) Representation and Inference for Natural Language: A First
Course in Computational Semantics (Studies in Computational Linguistics). Center for the
Study of Language and Information.
Bunt, H. (2013) Computing Meaning: Volume 4 (Text, Speech and Language Technology Book
47). Springer.
Butler, A. (2015) Linguistic Expressions and Semantic Processing: A Practical Approach. Springer.
Cui, Y., Huang, C., Lee, Raymond (2020). AI Tutor: A Computer Science Domain Knowledge
Graph-Based QA System on JADE platform. World Academy of Science, Engineering and
Technology, Open Science Index 168, International Journal of Industrial and Manufacturing
Engineering, 14(12), 543 - 553.
Dijkstra, E. W. and Scholten, C. S. (1989) Predicate Calculus and Program Semantics (Monographs
in Computer Science). Springer.
Doyle, A. C. (2019) The Adventures of Sherlock Holmes (AmazonClassics Edition).
AmazonClassics.
Epstein, R. (2012) Predicate Logic. Advanced Reasoning Forum.
Fillmore, C. J. (1968) The Case for Case. In Bach and Harms (Ed.): Universals in Linguistic
Theory. New York: Holt, Rinehart, and Winston, 1-88.
Fillmore, C. J. (2020) Form and Meaning in Language, Volume III: Papers on Linguistic Theory
and Constructions (Volume 3). Center for the Study of Language and Information.
Goddard, C. (1998) Semantic Analysis: A Practical Introduction (Oxford Textbooks in Linguistics).
Oxford University Press.
Goldrei, D. (2005) Propositional and Predicate Calculus: A Model of Argument. Springer.
Jackson, P. C. (2019) Toward Human-Level Artificial Intelligence: Representation and Computation
of Meaning in Natural Language (Dover Books on Mathematics). Dover Publications.
Mazarweh, S. (2010) Fillmore Case Grammar: Introduction to the Theory. GRIN Verlag.
Minsky, M. (1975). A framework for representing knowledge. In P. Winston, Ed., The Psychology
of Computer Vision. New York: McGraw-Hill, pp. 211-277.
Parry, W. T. and Hacker, E. A. (1991) Aristotelian logic. Suny Press.
Potts, T. C. (1994) Structures and Categories for the Representation of Meaning. Cambridge
University Press.
Schank, R. C. (1972). Conceptual dependency: A theory of natural language processing. Cognitive
Psychology, 3, 552–631.
Sowa, J. (1991) Principles of Semantic Networks: Explorations in the Representation of Knowledge
(Morgan Kaufmann Series in Representation and Reasoning). Morgan Kaufmann Publication.
Chapter 6
Semantic Analysis
6.1 Introduction
Semantic analysis (Cruse 2011; Goddard 1998; Kroeger 2019) can be considered as
the process of identifying meanings from texts and utterances by analyzing grammatical
structures and the relationships between words and tokens of written texts or verbal
communications in NLP.
Semantic analysis tools can assist organizations in extracting meaningful information
automatically from unstructured data such as emails, conversations, and customer
feedback. There are many approaches, ranging from completely ad-hoc,
domain-oriented techniques to theoretically sound but impractical methods. It is a
sophisticated task for a machine to perform interpretation because of the complexity and
subjectivity involved in human languages. Semantic analysis of natural language
captures text meaning using contexts, sentences, and the logical structures of grammar
(Bender and Lascarides 2019; Butler 2015).
Semantic analysis is a process that transforms linguistic inputs into meaning
representations and is the backbone of machine learning tools like text analysis, search engines,
and chatbots. From a computer science perspective, semantics can be considered as
the groups of words, phrases, or clauses that provide domain-specific context to language,
or clues to word meanings and relationships. For instance, a successful
semantic analysis will rely on quantitative cues such as word frequency and contextual
location to generate a cognitive connection between the clause giant panda is
a portly folivore found in China and its semantic meaning, instead of just the name
(panda) it stands for.
Humans extract abstract ideas and notions as effortlessly as breathing, without awareness.
Take the meaning of apple as an example: when the concept of apple is discussed, it
traditionally refers to a fruit consumed regularly, but nowadays a majority of references
point to the brand name Apple that dominates the mobile phone and computer industry.
In other words, humans are competent at extracting the context surrounding words,
phrases, objects, and scenarios, and at comparing this information with prior experience,
common sense, and world knowledge to construct the overall meaning of a text or
conversation. The outputs of these analyses are used to predict outcomes with incredible
accuracy. Advances in algorithms and computing capacity have modified habitual
practices to fit machine learning and NLP, allowing machine-driven semantic analysis to
become a reality. Such machine-learning-based semantic analysis schemes can help to
reveal the meanings in online messages and conversations and to determine answers to
questions without manually extracting relevant information from large volumes of
unstructured data. The truth is that semantic analysis aims to make sense of everything
from words to languages in daily life.
There are six types of commonly used lexical semantics: (1) homonymy, (2) poly-
semy, (3) metonymy, (4) synonyms, (5) antonyms, (6) hyponymy and hypernymy.
6.3.2.1 Homonymy
Homonyms are words that are spelled and pronounced the same but have different
meanings. The word homonym comes from the prefix homo-, which stands for same, and
the suffix -nym, which stands for name.
Example 1: bank1: financial institution vs bank2: slopping land:
[6.13] He went to the bank and withdrew some cash.
[6.14] He was standing at the bank of the lake in the forest.
Example 2: bat1: a sporting club for ball hitting vs bat2: a kind of flying mammal:
[6.15] He handles his bat skillfully during the game.
[6.16] Bats live the longest as compared with other species of similar size.
Example 3: play1: light-hearted recreational activity for amusement vs
play2: the activity of doing something in an agreed succession.
[6.17] This Shakespeare play is excellent.
[6.18] It is still my play.
There are two related concepts linked to homonymy: (1) homographs, usually
defined as words that have the same spelling but different pronunciations, and (2)
homophones, words that share the same pronunciation regardless of spelling, as in the
examples above. Further, homographs are words with the same spelling, whereas heterographs
are words that share the same pronunciation but have different spellings, e.g. chart vs
chat, peace vs piece, right vs write, etc.
Homonymy often causes problems in the following NLP applications:
1. information retrieval confusion e.g., cat scan.
2. machine translation confuses foreign languages’ meanings:
e.g. bank1—Financial institution, bank (English) → la banque (French)
[6.19] He goes to the bank and withdraws some cash. (English)
[6.20] Il va à la banque et retire de l’argent. (French)
e.g. bank2—sloping land, bank (English) → la rive (French)
6.3.2.2 Polysemy
Polysemy refers to words with the same spelling but different meanings depending on context.
The difference between homonymy and polysemy is delicate and subjective.
e.g., bank
[6.23] The bank was built in 1866. (financial building)
[6.24] He withdrew some money from the bank early this morning. (financial
organization)
In fact, many common words are polysemous, with multiple contexts and meanings
in different sentence situations.
e.g., get is a commonly used word that has at least three distinct meanings.
[6.25] I get an apple from the basket. (have something)
[6.26] I get it. (understand)
[6.27] She gets thinner. (reach or cause to a specified state or condition)
6.3.2.3 Metonymy
6.3.2.4 Zeugma Test
Zeugma is the usage of a word(s) that makes sense in one way but not the other.
Examples of zeugma that caused conflicts in semantics:
[6.29] Wage neither war nor peace.
–– There is a phrase to wage war, but it is literally incorrect to say to wage peace.
6.3.2.5 Synonyms
Synonyms are words with the same meaning in some or all contexts. They usually
appear in different registers of the language, such as formal and informal language,
daily conversations, and business correspondence. Although synonyms share the same
meaning, their usage carries modest differences, e.g. create/make, start/begin,
big/huge, attempt/try, house/mansion, pretty/beautiful, etc. Two lexemes are true synonyms
if they are interchangeable in all cases and retain the same meaning.
However, there is very little true synonymy in real-world situations, in the sense of
two words being truly synonymous. The logic behind this is that if they are different
words, then they must mean something slightly different or have some contextual
differences in usage, so they cannot be the same in all situations. In many cases, two
words are not exactly interchangeable where they appear, even though many aspects of
their meaning are the same. These words are used differently and carry different nuances
due to politeness, slang, register, genre, etc.
e.g., large vs big (are they exactly the same?)
[6.34] This building is very big vs
[6.35] This building is very large.
[6.36] Janet is her big sister vs
[6.37] Janet is her large sister.
Although both words have the same meaning when describing size, the word big
carries an additional notion of older when describing seniority.
6.3.2.6 Antonyms
Antonymy is the sense relation between words with opposite meanings in context. Unlike
other sense relations, it is not always easy to judge, despite the human tendency to
categorize experience in dichotomous contrasts.
Nevertheless, the notion of opposites is pervasive: humans understand the concept
of opposite from childhood, encounter antonyms in daily life, and even use them naturally in conversation.
A hyponym is a word sense that is more specific than another word sense, denoting
a subclass of the other sense in linguistics, e.g. truck is a hyponym of vehicle, mango
is a hyponym of fruit, and chair is a hyponym of furniture. Conversely, the more general
sense is the hypernym or superordinate (hyper means super), e.g. vehicle is a hypernym
of truck, fruit is a hypernym of mango, and furniture is a hypernym of chair.
It is interesting to note that hyponymy is not limited to nouns; it can
also be found in verbs, e.g. gaze, glimpse, and stare all refer to specific
manners of seeing.
From a computer science perspective, the hyponymy/hypernymy relationship between
word senses can be regarded as the relationship between class and subclass concepts in
object-oriented programming, e.g. the class vehicle has three subclasses car, lorry, and bus,
while the class fruit can have numerous subclasses such as apple, orange, and mango; or,
in the reverse direction, the concept vehicle is the superclass of car, and the concept fruit
is the superclass of mango.
Furthermore, words that are hyponyms of the same broader term (hypernym) are known
as co-hyponyms, e.g. daisy and rose are co-hyponyms of the broader term flower; this
relation is also called hyponymy or inclusion, and the analogous situation holds for the
word sense relation of co-hypernymy.
Hyponymy has (1) extensional, (2) entailment, (3) transitive, and (4) IS-A hierarchy
characteristics:
1. Extensional: the class represented by the parent (hypernym) extension includes the class
represented by the hyponym, e.g. the relation between vehicle and truck.
2. Entailment: sense A is a hyponym of sense B if being A entails being B.
3. Transitive: if A entails B and B entails C, then A entails C, e.g. truck,
vehicle, transport, where truck is a hyponym of vehicle and vehicle is a hyponym
of transport, so truck is a hyponym of transport.
4. IS-A hierarchy: A IS-A B (or A IsA B), where B subsumes A as in object-oriented
programming (OOP).
Hyponymy also has the notions of instance and class. In linguistics, an instance can be
considered a proper noun denoting a unique entity, e.g. New York is an instance of city; USA
is an instance of country. This mirrors the relationship between class and object
in object-oriented programming.
In short, a class is the notion of things and objects, whereas an object is an instance
of a class, e.g. person is a class concept describing an individual person, while Jack is
an object, i.e. an instance of that class concept.
A simple test: what is the relationship between car and Tesla, a class-object relationship
or a class-subclass relationship? The sketch below illustrates the distinction.
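A minimal, hypothetical sketch in Python makes the distinction concrete: Tesla as a make of car can be modeled as a subclass (IS-A), while an individual Tesla on the street is an instance (object) of that class:

# Illustrative only: the class names model the linguistic relations discussed above.
class Vehicle:                 # hypernym / superclass
    pass

class Car(Vehicle):            # hyponym / subclass: Car IS-A Vehicle
    pass

class Tesla(Car):              # a make of car, modeled as a further subclass
    pass

my_tesla = Tesla()             # a particular Tesla is an instance (object)

print(issubclass(Tesla, Car))        # True  -> class-subclass (hyponymy / IS-A)
print(isinstance(my_tesla, Tesla))   # True  -> class-object (instance)
print(isinstance(my_tesla, Vehicle)) # True  -> instances inherit the whole IS-A chain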
There are five major concerns in WSD: (1) difference meaning across dictionaries,
(2) POS tagging, (3) inter-judge variance, (4) pragmatic (discourse), and (5) senses
discreteness.
1. Meaning across dictionaries
A problem with WSD is deciding on the sense inventory, as dictionaries and thesauri offer
several different divisions of words into senses. Much WSD research commonly uses
WordNet (WordNet 2022a) as the reference word sense corpus for English.
Commonly used WSD methods include: (1) knowledge base, (2) supervised learn-
ing, (3) semi-supervised, and (4) unsupervised learning methods for WSD (Agirre
and Edmonds 2007; Preiss 2006).
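As a small illustration of the knowledge-based family, NLTK ships a simplified Lesk implementation that picks the sense whose dictionary gloss overlaps most with the context; this sketch assumes the WordNet and punkt resources have been downloaded, and the returned senses are only as good as the simplified algorithm allows:

# Requires: nltk.download('wordnet'), nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sent1 = word_tokenize("He went to the bank and withdrew some cash.")
sent2 = word_tokenize("He was standing at the bank of the lake in the forest.")

# Simplified Lesk chooses the WordNet sense of "bank" with the largest gloss overlap.
print(lesk(sent1, 'bank', 'n'))
print(lesk(sent2, 'bank', 'n'))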
6.5.1 What Is WordNet?
WordNet (WordNet 2022a) is a lexical database of words, now available for over 200
languages, in which adjectives, adverbs, nouns, and verbs are grouped into sets of synonyms,
where each set expresses a distinct concept. It is organized by concepts and meanings
rather than alphabetically like a dictionary. Since traditional dictionaries were created
for humans, a machine-readable lexical resource is required for computers, which is what
makes WordNet so applicable in NLP. It is available for public access and free download,
with statistical information as shown in Fig. 6.1.
WordNet’s structure is a vital tool for computational linguistics and NLP implementations.
It resembles a thesaurus in that it groups words by meaning. However, they
have basic differences: (a) WordNet indicates word senses in addition to word forms,
so words that are found near one another in the network are semantically
related or even synonymous with each other; (b) WordNet encodes semantic relations
among words, whereas words in a thesaurus do not follow a distinct pattern other
than similarity in surface meaning.
6.5.2 What Is Synsets?
A synset (synonym set) denoting a single concept is the basic WordNet term, and the
synonyms that are part of a synset are lexical variants of that concept. Figure 6.2 shows
a synset tree for the synset concept book and its relationships with all other related
synsets. Meaningfully related words and concepts in the generated network can be
browsed from the WordNet browser (WordNet 2022b).
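Synsets can also be explored directly from Python through NLTK's WordNet interface; a minimal sketch (assuming nltk.download('wordnet') has been run) for the concept book:

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

# List a few synsets (concepts) that contain the word form "book".
for s in wn.synsets('book')[:5]:
    print(s.name(), '->', s.definition())

# Inspect one synset: its lexical variants and neighbouring concepts.
book = wn.synset('book.n.01')
print(book.lemma_names())     # synonyms grouped in the same synset
print(book.hypernyms())       # more general concepts
print(book.hyponyms()[:5])    # more specific concepts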
6.6.1 What Is MeSH?
The MeSH glossary contains, in addition to a hierarchical set of canonical terms, several
entry terms intended to be synonyms for the canonical title terms.
6.8 Introduction
The difference between word similarity and word relatedness is that similar words
are near-synonyms, e.g. car and bicycle are similar in concept although not in an Is-A
relation, whereas related words can be related in any way, e.g. car and gasoline are
highly related but not similar in semantic meaning.
There are two types of similarity algorithms: (1) thesaurus-based algorithms
and (2) distributional algorithms. Thesaurus-based algorithms examine whether words
are adjacent in a hypernym hierarchy or have similar glosses or definitions.
Distributional algorithms examine whether words appear in similar distributional
contexts.
6.8.1 Path-based Similarity
Path-based similarity examines two concepts in general: two concepts are
similar if they are near each other in the thesaurus hierarchy. In a synset tree (graph), the
distance (path) between two synset nodes provides a good indication of the semantic
similarity between the two concepts. This evaluation method is known as path-based
similarity measurement. Figure 6.6 depicts an example of path-based similarity for the
concept car. Note that every concept has a path length of 1 to itself.
[Fig. 6.6: synset tree for car, with one branch object > artifact > instrumentality > transport > vehicle > wheeled vehicle > automotive > {car, truck, minibike} and a parallel branch artifact > article > ware > tableware > cutlery > fork; the numbers in the figure are path lengths from car.]
For example:
pathlen(car, car) = 1
pathlen(car, automotive) = 2
pathlen(car, truck) = 3
pathlen(car, minibike) = 5
pathlen(car, transport) = 5
pathlen(car, artifact) = 7
pathlen(car, tableware) = 10
pathlen(car, fork) = 12
In general:
sim_path(c1, c2) = 1 / pathlen(c1, c2)    (6.2)
Let us assume every link denotes a uniform distance. Intuitively, car should be closer
to minibike than to transport, because higher synsets in the tree are more abstract,
e.g. object is more abstract than artifact, and transport is more abstract than vehicle.
Despite sim_path(car, minibike) and sim_path(car, transport) having identical values,
their semantic relatedness is different; naturally, synsets in another branch of the
synset tree are even less related in concept, e.g. car vs tableware or
even fork.
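NLTK exposes a path-based measure directly on WordNet synsets; note that NLTK computes it as 1 / (number of edges on the shortest path + 1), so the absolute values differ slightly from the node-counting convention used above, although the relative ordering of concept pairs is comparable:

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

car = wn.synset('car.n.01')
truck = wn.synset('truck.n.01')
fork = wn.synset('fork.n.01')

print(car.path_similarity(truck))   # close concepts -> higher score
print(car.path_similarity(fork))    # distant branch -> much lower score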
Thus, it is suggested to use a metric that can represent the cost of each edge
independently, so that words associated only through abstract nodes receive lower
similarity scores.
P(c) = ( Σ_{w ∈ words(c)} count(w) ) / N    (6.4)
IC(c) = − log P(c)    (6.5)
The lowest common subsumer LCS(c1, c2), i.e. the lowest node in the hierarchy that
subsumes (is a hypernym of) both c1 and c2, can then be used to apply information
content as a similarity metric.
Fig. 6.7 Synset tree of “car” with associated P(c) (up to the transport level in the corpus), e.g. P(transport) = 0.415 and P(vehicle) = 0.225
The Resnik method (Resnik 1995, 1999) measures the similarity between two words by
the information they share. It is defined as the information content of their most
informative (lowest) common subsumer (MIS/LCS), i.e.
sim_Resnik(c1, c2) = − log P( LCS(c1, c2) ).
The Dekang Lin method was proposed by Prof. Dekang Lin in his work An Information-
Theoretic Definition of Similarity at ICML in 1998 (Lin 1998). It determines the
similarity between concepts A and B from not only what they have in common but also
the differences between them. It is concerned with (1) commonality and (2) difference.
Commonality, measured by IC(common(A, B)): the more A and B have in common,
the more similar they are. Difference, measured by IC(description(A, B)) −
IC(common(A, B)): the more differences between A and B, the less similar they are.
The similarity theorem states that the similarity between A and B is measured by the
ratio between the amount of information needed to state the commonality of A and B and
the information needed to describe what A and B are, given by
Sim_Lin(c1, c2) = 2 log P( LCS(c1, c2) ) / ( log P(c1) + log P(c2) )    (6.9)

Sim_Lin(car, truck) = 2 log P(automotive) / ( log P(car) + log P(truck) )
                    = 2 log(0.0172) / ( log(0.00872) + log(0.00117) ) = 0.707
This calculation shows that car is more closely related to truck than to minibike in the
hierarchy tree in Fig. 6.10.
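Both information-content measures are available in NLTK when an information-content file is supplied (here the Brown-corpus counts shipped with nltk.download('wordnet_ic')); the values naturally differ from the toy probabilities above because a different corpus is used:

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic      # requires nltk.download('wordnet_ic')

brown_ic = wordnet_ic.ic('ic-brown.dat')    # P(c) estimated from the Brown corpus

car = wn.synset('car.n.01')
truck = wn.synset('truck.n.01')
fork = wn.synset('fork.n.01')

print(car.res_similarity(truck, brown_ic))  # Resnik: IC of the lowest common subsumer
print(car.lin_similarity(truck, brown_ic))  # Lin: 2*IC(LCS) / (IC(c1) + IC(c2))
print(car.lin_similarity(fork, brown_ic))   # unrelated branch -> much lower score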
sim_eLesk(c1, c2) = Σ_{r, q ∈ RELS} overlap( gloss(r(c1)), gloss(q(c2)) )    (6.10)
6.9 Distributed Similarity
6.9.2 Word Vectors
A word vector is a vector of weights. In a simple 1-of-N (one-hot) encoding, every element
in the vector is associated with a word in the vocabulary; the encoding of a word is a
vector in which the corresponding element is set to one and all other elements are zero.
Given a target word w, assume there is a binary feature f_i for each of the N words v_i in
the lexicon; the word vector is then given by:
W = (f_1, f_2, f_3, ..., f_N)    (6.11)
w = (1, 1, 0, 1, ...)    (6.12)
6.9.3 Term-Document Matrix
Text data is represented as a matrix in this method: one dimension indexes the documents
to be analyzed and the other indexes the words of the vocabulary. Each cell holds the
count of term t in document d, tf_{t,d}, and each document is a count vector in ℕ^|V|.
Fig. 6.11 Sample of two similar word by vector comparison across six documents
A term-context matrix can be formed using smaller contexts, e.g. a window of 10
successive words from a paragraph or a search-engine result. A word is now defined by a
vector over the counts of its context words, where the context can be an entire document,
a work of literature, or a list of words from a search engine, etc.
There is an argument as to whether raw counts should be used. tf-idf (term frequency
and inverse document frequency) weights are commonly used in place of raw term counts
for the term-document matrix, whereas the Positive Pointwise Mutual Information (PPMI)
method is commonly used for the term-context matrix.
PMI(x, y) = log2 [ P(x, y) / ( P(x) P(y) ) ]    (6.13)
For word similarity measurement applications, Church and Hanks (1990) proposed the
PMI between two words, given by
PMI(word1, word2) = log2 [ P(word1, word2) / ( P(word1) P(word2) ) ]    (6.14)
Niwa and Nitta (1994) proposed Positive PMI (PPMI), which replaces all PMI values
less than zero with zero; it is now commonly used in PMI calculations for document
similarity comparison.
Given a matrix F with W rows (words) and C columns (contexts), where f_ij is the number of
times word w_i occurs in context c_j, the Positive PMI (PPMI) between word1 and word2 is
given by:
PPMI(word1, word2) = max( log2 [ P(word1, word2) / ( P(word1) P(word2) ) ], 0 )    (6.15)
where:
p(w_i, c_j) = f_ij / N,   p(w_i) = ( Σ_{j=1}^{C} f_ij ) / N,   p(c_j) = ( Σ_{i=1}^{W} f_ij ) / N,   with N = Σ_{i=1}^{W} Σ_{j=1}^{C} f_ij    (6.17)
in which p(w_i, c_j) is the probability of the target word w_i and the context word c_j
occurring together, p(w_i) and p(c_j) are the probabilities of the target word and the
context word occurring if they were independent, and f_ij is the number of times w_i occurs
in context c_j.
Let us use the previous term-document matrix of six English literary works as an example
to calculate the word and context total counts and probabilities, as shown in
Figs. 6.12 and 6.13.
Fig. 6.12 Term-context matrix of six contexts with word and context total counts
Fig. 6.13 Term-context matrix of six contexts with word and context total probabilities
Let us calculate the PMI score for the word fool co-occurring with context C1 =
As You Like It, based on the information in Fig. 6.13.
Using PMI(W, C) = log [ p(W, C) / ( p(W) p(C) ) ]:
PMI(fool, C1) = log [ 0.164 / (0.493 × 0.182) ] = 0.604    (6.18)
Similarly, the rest of PMI values for this term-context matrix are calculated as
follows in Fig. 6.14:
Note that, from (6.16): PPMI(W, C) = PMI(W, C) if PMI(W, C) > 0, and 0 otherwise.
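The whole PMI/PPMI computation can be vectorized with NumPy; the toy count matrix below is made up for illustration and does not reproduce the counts in Figs. 6.12–6.14:

import numpy as np

# Toy term-context count matrix F (rows = words, columns = contexts).
F = np.array([[20.,  5.,  1.],
              [ 2., 10.,  8.],
              [ 1.,  3., 15.]])

N = F.sum()
p_wc = F / N                                # joint probabilities p(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)       # marginal p(w)
p_c = p_wc.sum(axis=0, keepdims=True)       # marginal p(c)

with np.errstate(divide='ignore'):
    pmi = np.log2(p_wc / (p_w * p_c))       # Eqs. (6.13)-(6.14)
ppmi = np.maximum(pmi, 0)                   # Eq. (6.15): clip negative PMI to zero

print(np.round(ppmi, 3))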
It can be seen from the matrix above that PMI is biased toward infrequent events, i.e.
rare words tend to have high PMI values. There are two possible methods to improve the
PMI values: (1) apply add-k smoothing, e.g. Add-1 smoothing, and (2) assign rare context
words slightly higher probabilities.
Since PMI is usually biased toward infrequent events, the add-k smoothing method can be a
solution. For example, apply Add-2 smoothing (i.e. set k = 2) to every cell of the co-
occurrence matrix, as in Fig. 6.15, and see how it works.
The corresponding probabilities matrix after Add-2 Smoothing is shown in
Fig. 6.16.
The Term-context matrix with PPMI values after applying Add-2 Smoothing is
shown in Fig. 6.17.
Fig. 6.15 Term-context matrix of six contexts with word and context total count with Add-2
Smoothing
Fig. 6.16 Term-context matrix of six contexts with word and context total prob. with Add-2
Smoothing
Fig. 6.17 Term-context matrix of six contexts with PPMI values with Add-2 smoothing
Theoretically, this should give some improvement in the PPMI values by giving rare
context words more weight. However, there was not much improvement in this case.
Another method is to raise the context probabilities to a certain power
α, say 0.8.
PPMI_α(w, c) = max( log [ P(w, c) / ( P(w) P_α(c) ) ], 0 )    (6.19)
where: P_α(c) = count(c)^α / Σ_{c'} count(c')^α
For example, for two context events a and b with P(a) = 0.95 and P(b) = 0.05:
P_α(a) = 0.95^0.8 / (0.95^0.8 + 0.05^0.8) = 0.913,   P_α(b) = 0.05^0.8 / (0.95^0.8 + 0.05^0.8) = 0.087.    (6.20)
Results using α = 0.8 and 0.9 are shown in Figs. 6.18 and 6.19, respectively.
Fig. 6.18 Term-context matrix of six contexts with PPMI values with α = 0.80
Fig. 6.19 Term-context matrix of six contexts with PPMI values with α = 0.90
When applying context and word similarity measurements to context and word vectors,
remember that the cosine used for computing similarity is given by
cos(v, w) = (v · w) / ( |v| |w| ) = Σ_{i=1}^{N} v_i w_i / ( √( Σ_{i=1}^{N} v_i² ) √( Σ_{i=1}^{N} w_i² ) )    (6.21)
where v_i is the PPMI value for word v in context i, w_i is the PPMI value for word w in
context i, and cos(v, w) is the cosine similarity of v and w.
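Computed over PPMI vectors, Eq. (6.21) is a one-liner; the vectors below are invented for illustration:

import numpy as np

def cosine(v: np.ndarray, w: np.ndarray) -> float:
    """Cosine similarity of Eq. (6.21): dot product divided by the product of norms."""
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

v = np.array([0.0, 1.2, 0.4, 2.1])   # PPMI values of word v across four contexts
w = np.array([0.1, 0.9, 0.0, 1.8])   # PPMI values of word w across the same contexts

print(round(cosine(v, w), 3))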
Context and word similarity measurements for the six literary works are shown in Fig. 6.20.
For context comparison, cosine similarity is computed between C1 As You Like It and
the other five works: cosine(C1, C2) is the highest at 0.453, compared with the others
ranging from 0.044 (C3: Julius Caesar) to 0.157 (C6: Moby Dick). This makes sense, as
As You Like It is closer in theme to Twelfth Night than to the other works.
For word comparison, W4: trick is compared with three other words across the six
works: cosine(W4: trick, W3: fool) is the highest, ahead of the other two words
W1: battle and W2: soldier, and indeed trick and fool are related in meaning and
English usage.
Other possible similarity measures include the Jaccard, Dice, and Jensen-Shannon (JS)
methods, given by
sim_Jaccard(v, w) = Σ_{i=1}^{N} min(v_i, w_i) / Σ_{i=1}^{N} max(v_i, w_i)    (6.22)

sim_Dice(v, w) = 2 Σ_{i=1}^{N} min(v_i, w_i) / Σ_{i=1}^{N} (v_i + w_i)    (6.23)

sim_JS(v || w) = D( v || (v + w)/2 ) + D( w || (v + w)/2 )
Fig. 6.20 Context and Word Similarity from six sample literature
6.9.9 Evaluating Similarity
Like N-grams, similarity methods have (1) intrinsic and (2) extrinsic evaluation
schemes. Intrinsic evaluation refers to the correlation between the similarity scores
produced by an algorithm and human similarity judgements on word pairs. Extrinsic
evaluation, also called task-based or end-to-end evaluation, refers to using the similarity
scores in downstream tasks such as detecting misspellings, word sense disambiguation
(WSD), grading essays, or answering TOEFL multiple-choice vocabulary questions.
Exercises
6.1 What is semantic analysis? State and explain the importance of semantic anal-
ysis in NLP. Give 2 live examples for illustration.
6.2 State and explain how humans are good in semantic analysis. Give 2 daily life
examples to support your answers.
6.3 What is the difference between lexical vs compositional semantic analysis?
For each of them, give 2 examples to support your answers.
6.4 What is word sense in linguistics? State and explain any 5 basic types of lexical
semantics and their word senses. For each of them, give 2 examples for
illustration.
6.5 What is zeugma in linguistics and why is it important in NLP? Give 2 live examples
to illustrate how the Zeugma Test checks the semantic correctness of
sentences/utterances.
6.6 What are the major concerns and difficulties encountered in word sense disambiguation
(WSD)? Give one example for each concern to support your
answers.
6.7 State and explain FOUR major methods to tackle word sense disambiguation
(WSD). Which one(s) is(are) commonly used in NLP application nowadays to
tackle WSD? Why?
6.8 What is Synsets in WordNet framework? Give 2 examples on how it works to
support your answers.
6.9 What is Path-based Similarity in Semantic Analysis? Use book as the basic
synset to construct a synset tree like Fig. 6.9 and calculate all the related Path-
based Similarity between different concepts related to book.
6.10 Based on the synset tree created in question 6.9, calculate the similarity values
by using: (1) Resnik Method and (2) Dekang Lin Method and compare them
with the ones calculated in 6.9.
6.11 What is distributed similarity? State and explain methods used for distributed
similarity measurement.
6.12 Use four famous literary works: (1) Moby Dick (Melville 2012), (2) Little Women
by Louisa May Alcott (1832–1888) (Alcott 2017), (3) The Adventures of
Sherlock Holmes (Doyle 2019), and (4) War and Peace by Leo Tolstoy
(1828–1910) (Tolstoy 2019) as context documents, and select ANY 4 words
(wisely) to illustrate how the term-context matrix, PMI, and PPMI are used for
document and word similarity measurement in semantic analysis.
6.13 Repeat question 6.12 by using K-smoothing method for PMI/PPMI calcula-
tions (with k = 1 and 2) and different values of α and compare them with
results found in 6.12. Explain why it can/cannot be improved.
References
Agirre, E. and Edmonds, P. (Eds) (2007) Word Sense Disambiguation: Algorithms and Applications
(Text, Speech and Language Technology Book 33). Springer.
Alcott, L. M. (2017) Little Women. AmazonClassics.
Ayetiran, E. F., & Agbele, K. (2016). An optimized Lesk-based algorithm for word sense disam-
biguation. Open Computer Science, 8(1), 165-172.
BabelNet. 2022. BabelNet official site. https://babelnet.org/. Accessed 25 July 2022.
Bender, E. M. and Lascarides, A. (2019) Linguistic Fundamentals for Natural Language Processing
II: 100 Essentials from Semantics and Pragmatics (Synthesis Lectures on Human Language
Technologies). Springer.
Butler, A. (2015) Linguistic Expressions and Semantic Processing: A Practical Approach. Springer.
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicog-
raphy. Computational Linguistics - Association for Computational Linguistics, 16(1), 22-29.
Cruse, A. (2011) Meaning in Language: An Introduction to Semantics and Pragmatics (Oxford
Textbooks in Linguistics). Oxford University Press
Cruse, A. (1986) Lexical Semantics (Cambridge Textbooks in Linguistics). Cambridge
University Press.
Doyle, A. C. (2019) The Adventures of Sherlock Holmes (AmazonClassics Edition).
AmazonClassics.
Firth, J. R. (1957). Papers in Linguistics 1934-1951. Oxford University Press.
Goddard, C. (1998) Semantic Analysis: A Practical Introduction (Oxford Textbooks in Linguistics).
Oxford University Press.
Harris, Z. S. (1954). Distributional structure. Word (Worcester), 10(2-3), 146-162. https://doi.org/10.1080/00437956.1954.11659520.
Kilgarriff, A. and Rosenzweig, J. (2000). Framework and results for English
SENSEVAL. Computers and the Humanities, 34(1/2), 15-48.
Kroeger, P. (2019) Analyzing Meaning: An Introduction to Semantics and Pragmatics (Textbooks
in Language Sciences). Freie Universität
Lesk, M. 1986. Automatic Sense Disambiguation Using Machine Readable Dictionaries: How
to Tell a Pine Cone from an Ice Cream Cone. ACM Special Interest Group for Design of
Communication: Proceedings of the 5th Annual International Conference on Systems
Documentation. ACM; 24–26. https://doi.org/10.1145/318723.318728.
Lin, D. K. (1998) An Information-Theoretic Definition of Similarity. In Proceedings of the
Fifteenth International Conference on Machine Learning (ICML '98). Morgan Kaufmann
Publishers Inc., 296–304.
Melville, H. 2012. Moby-dick. Penguin English Library.
MESH. 2022. MeSH browser official site. https://www.nlm.nih.gov/mesh/meshome.html.
Accessed 25 July 2022.
Niwa, Y. and Nitta. Y. 1994. Co-occurrence Vectors from Corpora vs. Distance Vectors from
Dictionaries. In COLING 1994 Volume 1: The 15th International Conference on Computational
Linguistics, Kyoto, Japan. https://aclanthology.org/C94-1049.pdf.
Preiss, J. (2006). A detailed comparison of WSD systems: An analysis of the system answers for
the SENSEVAL-2 English all words task. Natural Language Engineering, 12(3), 209-228.
Chapter 7
Pragmatic Analysis and Discourse
7.1 Introduction
Pragmatics and discourse analysis (Bender and Lascarides 2019; Cruse 2011;
Goddard 1998; Kroeger 2019) refer to the study of language and the contextual meaning of
sentences/utterances, unlike the word-level, syntactic, grammatical-relation, and semantic
meaning representations learnt in previous chapters.
Pragmatic analysis focuses on contextual meaning. Discourse analysis studies the
social context of written and spoken language. Both deal with structured, coherent,
and cohesive sets of sentences/utterances, reflecting what distinguishes a discourse
from a set of unrelated sentences and how the parts of a text are related.
There are two types of discourse in daily life: (1) monologue and (2) dialogue. A
monologue is a one-way communication between a speaker (writer) and an audience
(reader), e.g. reading or writing a book, watching a TV show or a play, listening to a
speech, or attending a presentation or lecture, in contrast to the two-way disposition of
dialogue. Dialogue refers to turn-taking participation between speaker and hearer; it is a
two-way or multi-way form of communication.
There are also two types of dialogue: (1) human-to-human, e.g. daily conversa-
tions, group discussions, and (2) (a) human-to-computer interaction (HCI), e.g.
conversational agent, chatbot in NLP, and (b) computer-to-computer interaction
(CCI), e.g. cross machines verbal communication in smart city and intelligent trans-
portation system, multi-agent based bargain and negotiation systems, etc.
7.2 Discourse Phenomena
There are many discourse phenomena solved by humans naturally, but some, like
coreference resolution, require a lot of effort for a machine to solve.
7.2.1 Coreference Resolution
7.2.2 Why Is it Important?
[7.10] Watson was seized with a keen desire to see Holmes again, and to know how
Holmes was employing Holmes’ extraordinary powers (with coreference
resolution)
or more challenging sentences of famous discourse from A Scandal in Bohemia:
[7.11] To Sherlock Holmes, she is always “ the woman”. I have seldom heard him
mention her under any other name. (original sentence)
[7.12] To Sherlock Holmes, Irene Adler is always “the woman”. Watson has seldom
heard Holmes mention Irene Adler under any other name. (with coreference
resolution)
[7.11] is more challenging because the referent name Irene Adler for she does not
occur until later in the passage, where Doyle notes that Holmes felt no "emotion akin to
love" for Irene Adler. This phenomenon, in which an expression acquires its meaning
from a subsequent word or phrase, is called cataphora in linguistics.
The subsequent phrase (or word group) that supplies the meaning is called the antecedent
or referent. Cataphora contrasts with anaphora (also a rhetorical term for the repetition of
a phrase at the start of consecutive sentences/utterances), in which the reference term is
mentioned before the pronoun that replaces it, as in many English sentence constructions,
e.g. [7.5], [7.7], and [7.9], where the reference terms are mentioned repeatedly prior to
pronoun replacement.
Coreference resolution is a versatile tool applied in many NLP applications
including text understanding and analysis, information retrieval and extraction, text
summarization, machine translation, and even sentiment analysis. It is a great way
to obtain unambiguous sentences comprehensible by computers.
7.2.3.1 What Is Coherence?
7.2.3.2 What Is Coreference?
Coreference (co-reference) appears when two (or group of) terms refer to the same
person or thing with a unified reference to achieve linguistic coherence. For example:
[7.14] Jack said Helen would arrive soon, and she did.
– Helen and she refer to the same person.
Coreference is not always trivial to determine, e.g.
[7.15] Jack said he would join the team. vs
[7.16] Jack told Ian to come, he smiled.
When comparing [7.15] vs [7.16], [7.15] is trivial as there is only a subject
(noun) that he can refer to (i.e. Jack), while he in [7.16] can refer to either Jack or Ian.
Determining coreference expressions are important in many NLP applications
such as information retrieval and extraction, text summarization and dialogues
understanding in Q&A chatbot systems.
7.2.5 Entity-Based Coherence
7.3 Discourse Segmentation
[7.29] Yesterday was Jane’s birthday. Betty and Mary went to buy a present from the
gift shop. Mary intended to buy a purse. “Don’t do that.”, mentioned Betty. “Jane
already got one. She will ask you to return it.”
An example of a non-lexical cohesion approach using anaphora:
[7.30] Peel, core and slice peaches and pineapples, then place these fruits in the
skillet.
Unsupervised discourse segmentation was proposed by Prof. Marti Hearst in her
classic work on TextTiling in the early 1990s.
Fig. 7.1 Distribution of selected terms in Stargazer text (blanks mean zero frequency)
2. less common but evenly distributed terms, such as scientists and form, are generic
enough to suggest a subtopic title,
3. terms like space and star occur more frequently from sentences 5 to 20 and 60 to 90,
while terms from life to planet occur more frequently from sentences 58 to 78, which
may create two distinct clusters of subtopic discussion, and
4. terms like life to species show a similar pattern, creating a natural cluster
between sentences 35 and 55 that conforms with the human judgement of the subtopic
discussion How the moon helped life evolve on earth.
These results suggest that sentences or paragraphs within a subtopic are lexically
consistent with each other but not with paragraphs in adjacent topics.
7.3.4 TextTiling Algorithm
TextTiling algorithm (Hearst 1997) for discourse segmentation and subtopic struc-
ture characterization using term repetition consists of three processes: (1) tokeniza-
tion, (2) lexical score determination, and (3) boundary identification.
Tokenization includes converting words to lowercase, removing stop-words, reducing
words to their roots, and grouping the words into pseudo-sentences of equal length, such
as 15 words.
Lexical score determination involves calculating a lexical cohesion score for each
gap between pseudo-sentences. This lexical cohesion score represents word similarity:
for instance, take 10 pseudo-sentences on each side of the gap and compute the cosine
similarity between their word vectors, which is given by
sim_cosine(b, a) = (b · a) / ( |b| |a| ) = Σ_{i=1}^{N} b_i a_i / ( √( Σ_{i=1}^{N} b_i² ) √( Σ_{i=1}^{N} a_i² ) )    (7.1)
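NLTK bundles an implementation of Hearst's TextTiling that follows these three steps; the sketch below runs it on a raw Brown-corpus excerpt merely because it is a convenient long text with paragraph breaks (it assumes nltk.download('brown') and nltk.download('stopwords') have been run):

from nltk.corpus import brown
from nltk.tokenize import TextTilingTokenizer

text = brown.raw()[:10000]        # any long text with blank-line paragraph breaks
tt = TextTilingTokenizer()        # tokenization, lexical scoring, boundary detection
segments = tt.tokenize(text)      # list of multi-paragraph subtopic segments

print(len(segments), "segments found")
print(segments[0][:300])          # preview of the first segment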
It is relatively easy to collect labeled boundary data for supervised discourse
segmentation, e.g. story boundaries in broadcast TV news, or paragraph segmentation in
text or dialogue to recover paragraph breaks in speech recognition output.
Several classifiers can be used to achieve supervised segmentation; the feature set is
typically a superset of the one used for unsupervised segmentation, often extended with
domain-specific utterance tokens and keywords.
Supervised discourse segmentation can also be cast as a model. It is (1) a classification
task that uses a supervised classifier such as SVM, Naïve Bayes, or maximum entropy to
decide whether a sentence boundary is also a paragraph boundary, or (2) a sequence
labeling task that labels sentences as having or not having paragraph borders. It uses
cohesion features including word overlap, word cosine similarity, and anaphora, plus
additional features such as discourse markers or keywords.
Discourse tokens or keywords/phrases indicate discourse structure, e.g. good
evening, join our broadcast news now, or join the company at the beginning/end of
the segment, etc. They can be manual codes or automatically determined by feature
selection.
However, precision, recall, and F-measure are not always good evaluation metrics
here, as they are insensitive to near misses. Pevzner and Hearst (2002) proposed
an effective evaluation metric for text segmentation called the
WindowDiff method.
7.4 Discourse Coherence
Coherence relation refers to discourse properties that make each discourse mean-
ingful (or have appropriate meaning) in the context. It refers to common denomina-
tor to identify possible connections between utterances in a series of statements or
discourses about the same topic.
These sense relations in discourse analysis, named Coherence Relations by Prof.
Jerry R. Hobbs in his work Coherence and Coreference published in Cognitive
Science in 1979 (Hobbs 1979), were further developed into a well-defined theory by
other linguists, including Sanders et al. (1992) and Kehler (2002).
These meaning relationships, called propositional relations by Mann and Thompson
(1986), are encoded in the text and recognized by the reader who tries to understand
the text and its components and to see why the speaker or author added each sentence.
Coherence relations are sometimes referred to as types of thematic development, such as
the narrative of a movie or TV show built around a cause-and-effect storyline.
There are five major types of coherence relations (1) parallel, (2) elaboration, (3)
cause-and-effect, (4) contrast, and (5) occasion.
1. Parallel infers p(a1, a2, …) from the assertion of S0 and p(b1, b2…) from the
assertion of S1, where ai and bi are similar for all i.
[7.31] Rich man wants more power. Poor man wants more food.
They are frequently used in describing two sense relations with similar situa-
tion (meaning) but different in object, reference, and scenario.
2. Elaboration infers the same proposition P from the assertions of S0 and S1.
[7.32] Dorothy was from Kansas. She lived in the great Kansas prairies.
[7.33] Nikola Tesla was a genius. He invented hundreds of things in
his life.
They are frequently used in discourse construction, the successive sentences/
utterances are further elaboration of the previous one.
3. Cause-and-effect are S0 and S1 if S1 infers S0, i.e. S1 → S0
[7.34] Jack cannot afford to buy the car. He lost his job.
[7.35] Nikola Tesla invented hundreds of things in his life. He was
a genius.
A cause-and-effect discourse relation can refer to animate or inanimate subjects; [7.35]
is the reverse of the elaboration statement [7.33], though such reversals do not always occur.
4. Contrast in S0 and S1 if P0 and P1 infer from S0 and S1 with one pair of elements
that are contrast with each other, where other elements are similar in context.
[7.36] Hope for the best. Prepare for the worst.
[7.37] Jack is meticulous while Bob is sloppy.
Contrast coherence relations can exist within a sentence, or in successive sen-
tences/utterances. It often refers to two subjects, or events with contrast sense
relations.
5. Occasion is a change of state that can be inferred from the assertion of S0, whose
final state can be inferred from S1, or a change of state that can be inferred from the
assertion of S1, whose initial state can be inferred from S0.
[7.38] Jane put the books into a schoolbag, she left the classroom with Helen.
[7.39] Jack failed in the exam. He started to work hard.
State change invokes new action.
Discourse coherence can also be revealed by the hierarchy between coherent rela-
tions. For example:
[7.40] Jack went to town to buy a toy.
[7.41] He took a bus to the shopping mall.
[7.42] He needed to buy a toy for his child.
[7.43] It is Jane’s birthday.
[7.44] He also wanted to buy some books for weekend reading.
A hierarchical structure of discourse coherence is shown in Fig. 7.4. [7.40]–
[7.44] can be organized in a hierarchy tree structure, e.g. Occasion consists of two
expressions, one is expression e1 (statement [7.40]) and the other is an explanatory
clause which in turn consists of expression e2 (statement [7.41]) and a parallel clause
which consists of two entities, one is explanatory expression e3 and the other is
expression e5 (statement [7.44]), e3 is further divided into statements [7.42] and
[7.43], respectively.
A referring expression (RE) is any noun phrase, or surrogate for a noun phrase, whose
function in the utterance is to identify some discrete object. There are five frequently
used REs in discourse coherence: (1) indefinite noun phrases, (2) definite noun
phrases, (3) pronouns, (4) demonstratives, and (5) names.
1. Indefinite noun phrases introduce entities into context that are new to listener,
e.g., a policeman, some apples, a new iPad, etc.
[7.45] I go to the electronic store to buy a new notebook computer.
2. Definite noun phrases refer to entities identifiable by the listener, either because they
were mentioned previously or through shared beliefs about the world, e.g., a furry white
cat ... the cat, etc.
[7.46] Don't look at the sun directly with bare eyes, it will hurt your eyes.
3. Pronouns are another form of definite designation, usually with stronger restric-
tions than standard designation, e.g., s/he, it, they, etc.
[7.47] I go to the electronic store to buy a new notebook computer. This com-
puter is rather light and fast.
4. Demonstratives are pronouns that can act alone or as determiners, e.g. this, that.
[7.48] That book seems to be very interesting and worth buying.
5. Names are common methods to refer people, organizations, and locations.
[7.49] I bought lunch at KFC today.
There are four common features to filter potential references in discourse coher-
ence: (1) number agreement, (2) person agreement, (3) gender agreement, and (4)
binding theory constraints.
1. Number agreement: pronouns and their references must agree in number, e.g., singular
or plural.
[7.50] The children are playing in the park. They look happy.
2. Person agreement refers to the first, second, or third person.
[7.51] Jane and Helen got up early. They needed to take an exam this morning.
3. Gender agreement refers to male, female, or non-person, e.g. he, she, or it.
[7.52] Jack looked tired. He didn’t sleep last night.
4. Binding theory constraints refer to constraints imposed by syntactic relations
between denotative expressions and possible preceding noun phrases in the same
sentence.
[7.53] Jane purchased herself an iPad. (herself should be Jane)
[7.54] Jane purchased her an iPad. (her may not be Jane)
[7.55] She claimed that she purchased Mary an iPad. (She and she may not
be Mary)
There are six types of preferences in pronoun interpretation: (1) recency, (2) grammatical
role, (3) repeated mention, (4) parallelism, (5) verb semantics, and (6)
selectional restrictions.
Recency refers to entities from recent utterances:
[7.56] Tim went to see a doctor at the clinic. He felt sick. It might be influenza.
Grammatical role is to emphasize the hierarchy of entities according to gram-
matical position of the terms that represent them, e.g. subject, object, etc.
[7.57] Jane went to Starbucks to meet Jackie. She ordered a hot mocha. (She should
be Jane)
[7.58] Jane discussed with Jackie about her exam results. She felt so nervous about
it. (She should be Jackie instead of Jane)
[7.59] Jane discussed with Jackie about her exam results. She felt so sorry about it.
(She should be Jane instead of Jackie)
Repeated mention refers to a preference for entities that have been mentioned repeatedly.
[7.60] Jane went to the supermarket to buy some food. It turned out it was closed.
Parallelism refers to subject-to-subject or object-to-object kind of expression:
[7.61] Mary went with Jane to Starbucks. Ian went with her to the bookstore after-
wards. (her should probably be Jane instead of Mary)
Verb semantics are verbs that seem to emphasize one of their argument positions:
[7.62] Jane warned Mary. She might fail the test.
[7.63] Jane blamed Mary. She lost the watch.
In [7.62], She should be Mary, as Mary is the one being warned about failing the
test. In [7.63], She should be Jane, who suffered the loss. This is a pragmatic phenomenon
because it relies on common sense about the meaning of the word blamed to resolve the
coreference in the second statement correctly.
Selectional restrictions refer to another semantic knowledge playing a role:
[7.64] Mary lost her iPhone in the shopping mall after carrying it the whole
afternoon.
Note that [7.64] requires high-level semantic or common-sense understanding: it could in principle refer to the iPhone or the shopping mall, but since it has been carried for the whole afternoon, it cannot be an immovable object and must be the iPhone.
7.5.1 Introduction
Coreference resolution (CR) is the task of finding all linguistic expressions (called
mentions) in any text involving real-world entities. After finding these mentions and
grouping them, they can be resolved by replacing pronouns with noun phrases.
There are three fundamental algorithms for coreference resolution: (1) Hobbs algorithm, (2) Centering algorithm, and (3) Log-linear model.
7.5.2 Hobbs Algorithm
Hobbs algorithm was one of the early approaches to pronoun resolution proposed
by Prof. Jerry R. Hobbs in 1978 (Hobbs 1978) and further consolidated as well-
known algorithm for coreference resolution in his remarkable work Coherence and
Coreferences published in Cognitive Science 1979 (Hobbs 1979).
His original work proposed two CR algorithms: a simple algorithm based purely on grammar, and a complex algorithm that incorporated semantics into the parsing method (Hobbs 1978, 1979).
Unlike other algorithms, Hobbs' algorithm does not rely on a discourse model, because the parse tree and grammar rules are the only information used in pronoun resolution. Let us look at how it works.
7.5.2.2 Hobbs’ Algorithm
Hobbs' algorithm assumes a parse tree in which each NP node has an N-type node below it as the parent of a lexical object. It operates as follows (a simplified Python sketch follows the steps):
1. Start with the node of noun phrase (NP) that directly dominates the pronoun.
2. Go up tree to the first NP or sentence (S) node visited, denote this node as X, and
name the path being applied to reach it as p.
3. Visit all branches under node X to the left of path p, breadth-first, from left to right, proposing as an antecedent any NP node found, provided there is an NP or S-node between it and X.
4. If node X is the highest S-node in the sentence, visit the surface parse trees of previous sentences in the text, most recent first; each tree is visited in a left-to-right, breadth-first manner, and when an NP node is encountered, it is proposed as an antecedent. If X is not the highest S-node in the sentence, go to step 5.
5. Climb up from node X to the first NP or S-node encountered, denote this new
node as X and name the path as p.
6. If X is an NP vertex, and if the path p to X does not pass through a nominal ver-
tex immediately dominated by X, then denote X as an antecedent.
7. Visit all branches under node X to the left of path p, breadth-first manner, from
left to right, denoting each NP node encountered as an antecedent.
8. If X is an S-node, visit all branches of node X to the right of path p from left to
right and breadth-first manner, but do not visit below any NP or S being encoun-
tered as the antecedent.
9. Return to Step 4.
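The tree search above can be sketched, in a greatly simplified form, over an NLTK parse tree. The sketch below omits the intervening NP/S condition of step 3, the cross-sentence search of step 4, and the agreement filters discussed earlier, so it only lists raw NP candidates in roughly Hobbs order; the bracketed parse of [7.65] is hand-written for illustration (a simplifying assumption, not the book's Fig. 7.5).

# A greatly simplified Hobbs-style search over an nltk.Tree parse.
from collections import deque
from nltk import Tree

def hobbs_candidates(tree, leaf_pos):
    """Climb from the pronoun leaf to each dominating NP or S node and
    collect NP nodes to the left of the path, breadth-first, left to right."""
    candidates = []
    for depth in range(len(leaf_pos) - 1, -1, -1):
        node = tree[leaf_pos[:depth]] if depth else tree
        if not isinstance(node, Tree) or node.label() not in ("NP", "S"):
            continue
        queue = deque(node[i] for i in range(leaf_pos[depth]))  # branches left of path p
        while queue:
            sub = queue.popleft()
            if isinstance(sub, Tree):
                if sub.label() == "NP":
                    candidates.append(" ".join(sub.leaves()))
                queue.extend(sub)
    return candidates

# Hand-written, simplified parse of [7.65] (omitting "until 536").
sent = Tree.fromstring(
    "(S (NP (DT The) (NN castle) (PP (IN in) (NP (NNP Camelot)))) "
    "(VP (VBD remained) "
    "(NP (DT the) (NN residence) (PP (IN of) (NP (DT the) (NN king)))) "
    "(SBAR (WRB when) "
    "(S (NP (PRP he)) (VP (VBD moved) (NP (PRP it)) "
    "(PP (TO to) (NP (NNP London))))))))")
it_leaf = [p for p in sent.treepositions("leaves") if sent[p] == "it"][0]
print(hobbs_candidates(sent, it_leaf))
# ['he', 'The castle in Camelot', 'Camelot'] -- the step-3 condition of the
# full algorithm (and the agreement filters) would rule out 'he' for 'it'.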
Statement [7.65] is a classic example stated in Hobbs’ original paper (Hobbs 1978)
to demonstrate how Hobbs’ algorithm works as shown in Fig. 7.5.
[7.65] The castle in Camelot remained the residence of the king until 536 when he
moved it to London.
Example—What does it stand for?
1. Start with node NP1, step 2 climbs up to node S1.
2. Step 3 searches the left part of S1’s tree but fails to locate any eligible NP node.
3. Step 4 fails to apply.
[Parse tree of Fig. 7.5 (covering the king and he moved it to London) not reproduced in text.]
In the original work, Hobbs manually analyzed 100 consecutive examples from three different texts, assuming correct parses were available, and the algorithm was 72.7% correct (Hobbs 1978), which is quite impressive for such a simple algorithm. If the algorithm is integrated with syntactic constraints when resolving pronouns as shown in Fig. 7.5, the performance can be even higher.
7.5.3 Centering Algorithm
Centering Theory (CT) was proposed by Profs. Barbara J. Grosz and Candace L. Sidner in their distinguished work Attention, Intentions, and the Structure of Discourse as part of its main theory of discourse analysis (Grosz and Sidner 1986). It is a theory of discourse structure that models the interrelationship between foci, or centers, the choice of referring terms, and the perceived coherence of discourse.
The basic idea is:
1. a discourse has a focus, or center,
2. the center typically remains the same for a few sentences, then shifts to a
new object,
3. the center of a sentence is typically pronominalized,
4. once a center is determined, there is a strong inclination for subsequent pronouns
to continue referring to it.
In the Centering algorithm, each utterance in a discourse has a backward-looking center (Cb) and a set of forward-looking centers (Cf). The Cf set of an utterance is the set of discourse entities evoked by that utterance. The Cf set is ranked by discourse salience; the most accepted ranking is by grammatical role. The highest-ranked element in this list is called the preferred center (Cp). The backward-looking center Cb is the highest-ranked element of the previous utterance's Cf that also appears in the current utterance, and it serves as a link between the utterances. Any sudden shift in the topic of the utterances is reflected in a change of Cb between utterances.
Centering algorithm (Grosz and Sidner 1986; Tetreault 2001) consists of three parts:
(1) initial settings, (2) constraints, (3) rules and algorithm.
Clearly, centering algorithm implicitly accounts for grammatical roles, recency, and
repeated-mention preference in pronoun interpretation.
However, the grammatical role hierarchy affects the assignment only indirectly, because it is the transition type that ultimately determines the reference assignment. Problems can arise when a referent in a lower-ranked grammatical role yields a higher-ranked transition than a referent in a higher-ranked role. For instance:
U1: Jane opened a new music store in the city.
U2: Mary entered the store and looked at some CDs.
U3: She finally bought some.
In this example, common sense indicates that She in U3 should refer to Mary instead of Jane. However, the Centering algorithm will incorrectly assign She to Jane in this case, because keeping Cb(U2) = Jane yields a Continue transition, whereas choosing Mary yields only a Smooth-shift. By contrast, Hobbs' algorithm would still assign Mary as the referent.
Whether such situations arise usually depends on the context and thematic scenario. Prof. Marilyn A. Walker, in her study Evaluating discourse processing algorithms (Walker 1989), compared a version of Centering to Hobbs' algorithm on 281 examples from three genres of text, obtaining 77.6% and 81.8% accuracy, respectively.
Big data and AI have advanced machine-learning-based CR research, with current work focusing on Convolutional Neural Networks (CNN) (Auliarachman and Purwarianti 2019), Recurrent Neural Networks (RNN) (Afsharizadeh et al. 2021), Long Short-Term Memory networks (LSTM) (Li et al. 2021), and Transformer and BERT models (Joshi et al. 2019).
Fig 7.7 Table of feature vector values for sentence U2: He showed it to Jim
7.6 Evaluation
Exercises
7.1 What is pragmatic analysis and discourse in linguistics? Discuss their roles
and importance in NLP.
7.2 What is the difference between pragmatic analysis and semantic analysis in
terms of their functions and roles in NLU (Natural Language Understanding)?
7.3 What is coreference resolution in linguistics? Why is it important in NLP? Use two live examples as illustration to support your answer.
7.4 State and explain the differences between the concept of coherence vs corefer-
ence in pragmatic analysis. Give two live examples to support your answer.
7.5 What is discourse segmentation? State and explain why it is vital to pragmatic analysis and the implementation of NLP applications such as a Q&A chatbot. Give two examples to support your answer.
7.6 State and explain Hearst's TextTiling technique for discourse segmentation. How can it be further improved using today's AI and machine learning technology?
7.7 What is coherence relation? State and explain five basic types of coherence
relations. For each type, give a live example for illustration.
7.8 What is referencing expression in pragmatic analysis? State and explain five
basic types of referencing expressions. For each type, please provide a live
example for illustration.
7.9 State and explain Hobbs’ algorithm for coreference resolution. Use a sample
sentence/utterance (other than the one given in the book) to illustrate how
it works.
7.10 State and explain the pros and cons of Hobbs’ algorithms for coreference reso-
lution. Use live example(s) to support your answer.
7.11 State and explain Centering algorithm for coreference resolution. Use a sam-
ple sentence/utterance (other than the one given in the book) to illustrate how
it works.
7.12 Compare pros and cons between Hobbs’ algorithm vs Centering algorithm.
Use live example(s) to support your answer.
7.13 What is machine learning? State and explain how machine learning can be
used for coreference resolution. Use live example(s) to support your answer.
7.14 Name any three types of machine learning models for coreference resolution.
State and explain how they work.
7.15 Name any two types of evaluation method/metrics for coreference resolution
model in pragmatic analysis. State and explain how they work.
References
Afsharizadeh, M., Ebrahimpour-Komleh, H., and Bagheri, A. (2021). Automatic text summariza-
tion of COVID-19 research articles using recurrent neural networks and coreference resolution.
Frontiers in Biomedical Technologies. https://doi.org/10.18502/fbt.v7i4.5321
Auliarachman, T., & Purwarianti, A. (2019). Coreference resolution system for Indonesian text
with mention pair method and singleton exclusion using convolutional neural network. Paper
presented at the 1-5. https://doi.org/10.1109/ICAICTA.2019.8904261
Bender, E. M. and Lascarides, A. (2019) Linguistic Fundamentals for Natural Language Processing
II: 100 Essentials from Semantics and Pragmatics (Synthesis Lectures on Human Language
Technologies). Springer.
Cornish, F. (2009). Inter-sentential anaphora and coherence relations in discourse: A perfect
match. Language Sciences (Oxford), 31(5), 572-592.
Cruse, A. (2011) Meaning in Language: An Introduction to Semantics and Pragmatics (Oxford
Textbooks in Linguistics). Oxford University Press
Doyle, A. C. (2019) The Adventures of Sherlock Holmes (AmazonClassics Edition).
AmazonClassics.
Goddard, C. (1998) Semantic Analysis: A Practical Introduction (Oxford Textbooks in Linguistics).
Oxford University Press.
Grosz, B. J., Joshi, A. K., and Weinstein, S. (1995). Centering: A framework for modeling the
local coherence of discourse. Computational Linguistics - Association for Computational
Linguistics, 21(2), 203-225.
Grosz, B. J., and Sidner, C. L. (1986). Attention, intentions, and the structure of discourse.
Computational Linguistics - Association for Computational Linguistics, 12(3), 175-204.
Hearst, M. A. (1997). TextTiling: Segmenting text into multi-paragraph subtopic passages.
Computational Linguistics - Association for Computational Linguistics, 23(1), 33-64.
Hobbs, J. R. (1979) Coherence and Coreference. Cognitive Science 3, 67-90.
Hobbs, J. R. (1978) Resolving pronoun references. Lingua, 44:311–338.
Joshi, M., Levy, O., Weld, D.S., and Zettlemoyer, L. (2019) BERT for Coreference Resolution:
Baselines and Analysis. In Proc. of Empirical Methods in Natural Language Processing
(EMNLP) 2019. https://doi.org/10.48550/arXiv.1908.09091
Kehler, A. (2002) Coherence, Reference, and the Theory of Grammar. Stanford, Calif.: CSLI
Publishers.
Kehler, A., Kertz, L., Rohde, H., and Elman, J. L. (2008). Coherence and coreference revisited.
Journal of Semantics (Nijmegen), 25(1), 1-44.
Kroeger, P. (2019) Analyzing meaning: An introduction to semantics and pragmatics (Textbooks in
Language Sciences). Freie Universität Berlin.
Lata, K., Singh, P., & Dutta, K. (2022). Mention detection in coreference resolution: Survey.
Applied Intelligence (Dordrecht, Netherlands), 52(9), 9816-9860.
Li, Y., Ma, X., Zhou, X., Cheng, P., He, K. and Li, C. (2021). Knowledge enhanced LSTM for
coreference resolution on biomedical texts. Bioinformatics, 37(17), 2699-2705. https://doi.
org/10.1093/bioinformatics/btab153
Mann, W. C. and Thompson, S. A. (1988) Rhetorical Structure Theory: Toward a functional theory
of text organization. Text & Talk, 8, 243 - 281.
Mann, W. C. and Thompson S. A. (1986) Relational Propositions in Discourse. Discourse
Processes 9: 57-90.
Pevzner, L., and Hearst, M. A. (2002). A critique and improvement of an evaluation metric for
text segmentation. Computational Linguistics - Association for Computational Linguistics,
28(1), 19-36.
Sanders, T., Spooren, W. and Noordman, L.G. (1992). Toward a taxonomy of coherence relations.
Discourse Processes, 15, 1-35.
Tetreault, J. R. (2001). A corpus-based evaluation of centering and pronoun resolution.
Computational Linguistics. Association for Computational Linguistics, 27(4), 507-520.
Walker, Marilyn A. (1989). Evaluating discourse processing algorithms. In Proceedings of the 27th
Annual Meeting of the Association for Computational Linguistics, pp. 251-261.
Wolna, A., Durlik, J., and Wodniecka, Z. (2022). Pronominal anaphora resolution in pol-
ish: Investigating online sentence interpretation using eye-tracking. PloS One, 17(1),
e0262459-e0262459.
Chapter 8
Transfer Learning and Transformer
Technology
Transfer learning (TL) focuses on solving a problem by applying knowledge acquired from another related problem (Pan and Yang 2009; Weiss et al. 2016; Zhuang et al. 2020). It is like two students learning to play the guitar: one has musical knowledge and the other has not, and it is natural for the former to transfer that background knowledge into the learning process. In traditional ML, every task has its own isolated dataset and trained model, whereas in TL, learning a new task relies on previously learned tasks with larger datasets to acquire knowledge, as shown in Fig. 8.1.
Traditional ML datasets and trained model parameters cannot be reused. NLP and computer vision tasks therefore involve enormous, rare, inaccessible, time-consuming, and costly training processes. For example, if the task is predicting the sentiment of laptop reviews, there are large amounts of labeled data, target data, and training data from these reviews.
Traditional ML can work well on correlated domains, but when there is a large amount of target data from a different domain, such as food reviews, the inference results will be unsatisfactory due to domain differences. Nevertheless, these domains are correlated in the sense that their reviews share language characteristics and terminology expressions, which makes it possible to apply TL as a high-level approach to the prediction task. This approach relates the source domains to a target domain and determines the correlations of its sub-domains, as shown in Fig. 8.2.
TL has been applied to several machine learning applications such as image classification and text sentiment classification.
If a task is defined by a label space Y with a predictive function f(⋅), f(⋅) is repre-
sented by a conditional probability distribution given by (8.1)
f(xi) = P(yi | xi)    (8.1)
If a function f(⋅) and label space Y between two tasks are different, they are dif-
ferent tasks.
With the above definitions, TL can be formulated as follows: let Ds be the source domain and Ts the source learning task, and let Dt be the target domain and Tt the target learning task. Given that the two domains are not identical or the two tasks are different, the aim of TL is to improve the results P(Yt | Xt) in Dt using the knowledge obtained from Ts and Ds.
There are two types of TL (1) heterogeneous and (2) homogeneous as shown in
Fig. 8.3.
Heterogeneous Transfer Learning arises when the source and target feature spaces are different, which means that Yt ≠ Ys and/or Xt ≠ Xs. Under the condition of identical domain distributions, the resolution strategy is to adjust and transform the feature space so that the problem becomes homogeneous and the differences between the marginal or conditional distributions of the source and target domains are reduced.
There are four methods to solve problems produced by homogeneous and heteroge-
neous TL: (1) instance-based, (2) feature-based, (3) parameter-based, and (4)
relational-based.
1. Instance-based
This method reweights samples from source domains and uses them as target
domain data to bridge the gap of marginal distribution differences which works
best when conditional distributions of two tasks are equal.
2. Feature-based
This method works for both heterogeneous and homogeneous TL problems. For homogeneous problems, the aim is to bridge the gap between the conditional and marginal distributions of the target and source domains; for heterogeneous problems, it is to reduce the differences between the source and target feature spaces. It has two approaches: (a) asymmetric and (b) symmetric.
(a) Asymmetric feature transformation aims to modify the source domain and
reduce the gap between source and target instances by transforming one of
the source and target domains to the other as shown in Fig. 8.4. It can be
applied when Ys and Yt are identical.
(b) Symmetric feature transformation aims to transform the source and target domains into a shared feature space, starting from the idea of discovering meaningful structures between domains. The shared feature space is usually low-dimensional. The purpose of this approach is to reduce the marginal distribution distance between the target and source. The difference between symmetric and asymmetric feature transformation is shown in Fig. 8.5.
3. Parameter-based
This method transfers learnt knowledge by sharing parameters that are common to the models of the source and target learners. It builds on the idea that two related tasks have similar model structures. The trained model is transferred from the source domain to the target domain together with its parameters. This is a huge advantage, because parameters are otherwise trained from random initialization and the training process can be time-consuming for models trained from scratch. This approach can also train more than one model on the source data and combine the parameters learnt from all models to improve the results of the target learner. It is often used in deep learning applications as shown in Fig. 8.6 (a code sketch follows this list).
Fig. 8.5 Symmetric feature transformation (left) and asymmetric feature transformation (right)
Fig. 8.7 Relational-based approaches: an example of learning sentence structure of food reviews
to help with movie reviews’ sentiment analysis
4. Relational-based
This method transfers learnt knowledge by sharing the relations learnt between different sample parts of the source and target domains, as shown in Fig. 8.7. The food and movie domains are an example of related domains: although the review texts are different, the sentence structures are similar. The aim is to transfer the relations learnt between different parts of review sentences from one domain to the other to improve text sentiment analysis results.
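The parameter-based approach can be illustrated with a toy Keras example. The sketch below is an illustration under assumed data and layer sizes, not the book's workshop code: a model is trained on a source task, its layers (except the output head) are frozen, and only a new head is trained on the target task.

# A minimal Keras sketch of parameter-based transfer learning.
import numpy as np
import tensorflow as tf

# Source task: toy binary classification on 20-dimensional random features.
x_src, y_src = np.random.rand(512, 20), np.random.randint(0, 2, 512)
source_model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu", name="shared_repr"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
source_model.compile(optimizer="adam", loss="binary_crossentropy")
source_model.fit(x_src, y_src, epochs=2, verbose=0)

# Target task: reuse all but the output layer and freeze the transferred weights.
base_layers = source_model.layers[:-1]
for layer in base_layers:
    layer.trainable = False
target_model = tf.keras.Sequential(
    base_layers + [tf.keras.layers.Dense(3, activation="softmax")])  # new 3-class head
target_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

x_tgt, y_tgt = np.random.rand(128, 20), np.random.randint(0, 3, 128)
target_model.fit(x_tgt, y_tgt, epochs=2, verbose=0)   # only the new head is updated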
8.4.1 What Is RNN?
8.4.2 Motivation of RNN
Many learning tasks require sequential data processing, including speech recognition, image captioning, and synced-sequence video classification. The outputs of sentiment analysis and machine translation models are sequences, and their inputs are related in time or space, so they cannot be modeled by traditional neural networks, which assume that training and test samples are independent.
For example, a language translation task may need to translate the phrase feel under the weather, which means unwell. This phrase makes sense only when it is expressed in that specific order. Thus, the position of each word in the sentence must be considered when the model predicts the next word.
There are five major categories of RNN architecture corresponding to different
tasks: (1) simple one to one model for image classification task, (2) one to many for
image captioning tasks, (3) many to one model for sentiment analysis tasks, (4)
many to many models for machine translation, and (5) complex many to many mod-
els for video classification tasks as shown in Fig. 8.9.
8.4.3 RNN Architecture
An RNN is like a standard neural network in that it consists of input, hidden, and output layers, as shown in Fig. 8.10.
In an unfolded RNN architecture, xt is the input at time step t, st stores the values of the hidden units/states at time t, and ot is the output of the network at time step t. U denotes the weights of the inputs, W the weights of the hidden units, and V the bias, as shown in Fig. 8.11.
With the activation function f, the hidden state st is calculated as st = f(U xt + W st−1).
Fig. 8.12 A simple recurrent neural network (left) and fully connected recurrent neural network (right)
The hidden states st are considered the network's memory units, which accumulate hidden states from several former layers. Each layer's output is only related to the hidden states of the current layer. A significant difference between an RNN and traditional neural networks is that the weights and bias U, W, and V are shared among layers.
There may be an output at each step of the network, but it is not always necessary. For instance, if inference is applied to the sentiment expressed by a sentence, an output is only required when the last word is input, not after every word. The key to RNNs is the hidden layer capturing sequence information.
For RNN feedforward process, if the number of time steps is k, then hidden unit
values and output will be computed after k + 1 time steps. For backward process,
RNN applies an algorithm called backpropagation through time (BPTT).
RNN topologies range from partly to fully recurrent. Partly recurrent is a layered
network with distinct output and input layers where recurrence is limited to the hid-
den layer. Fully recurrent neural network (FRNN) connects all neurons’ outputs to
inputs as shown in Fig. 8.12.
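The recurrent update st = f(U xt + W st−1) can be sketched directly in NumPy. In the minimal sketch below, the outputs are produced by an output matrix V_out (an assumption of this sketch, since Fig. 8.11 is not reproduced here), and all dimensions are toy values.

# A minimal NumPy sketch of an RNN forward pass with shared U and W.
import numpy as np

def rnn_forward(inputs, U, W, V_out):
    """inputs: list of input vectors x_t; returns hidden states and outputs."""
    s = np.zeros(W.shape[0])          # s_0, the initial hidden state
    states, outputs = [], []
    for x_t in inputs:
        s = np.tanh(U @ x_t + W @ s)  # s_t = f(U x_t + W s_{t-1})
        o = np.exp(V_out @ s) / np.exp(V_out @ s).sum()   # softmax output at step t
        states.append(s)
        outputs.append(o)
    return states, outputs

rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 8, 16, 4, 5          # toy dimensions and sequence length
U = rng.normal(size=(d_hid, d_in)) * 0.1
W = rng.normal(size=(d_hid, d_hid)) * 0.1
V_out = rng.normal(size=(d_out, d_hid)) * 0.1
xs = [rng.normal(size=d_in) for _ in range(T)]
states, outputs = rnn_forward(xs, U, W, V_out)
print(len(states), outputs[-1].shape)        # 5 (4,)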
8.4.4.1 What Is LSTM?
8.4.4.2 LSTM Architecture
xt and ht−1, the state from the previous step, are concatenated as input and trained with activations for four states as shown in Fig. 8.14.
z is the input, calculated by multiplying the concatenated vector with the weights W and converted into values between −1 and 1 through the activation function tanh. zf, zi, zo are calculated by multiplying the concatenated vector with the corresponding weights and converting the results into values between 0 and 1 by a sigmoid function σ to generate the gate states. zf represents the forget gate, zi the input gate, and zo the output gate. The calculation of an LSTM memory cell is shown in Fig. 8.15.
The memory cell ct, hidden state ht, and output yt are calculated from the gate states as follows (⊕ is matrix addition, ⊙ is the Hadamard product):

ct = zf ⊙ ct−1 ⊕ zi ⊙ z
ht = zo ⊙ tanh(ct)    (8.4)
yt = σ(W′ ht)
LSTM has: (1) forget, (2) memory select, and (3) output stages.
1. Forget stage
This stage retains important information passed in by previous node ct‐1 (the
previous cell state) and discards unimportant ones. The calculated zf is used as a
forget gate to control what type of ct‐1 information should be retained or discarded.
2. Memory select stage
This stage remembers input xt selectively to record important information. z
refers to present input. zi is the input gate to control gating signals.
3. Output stage
8.4.5.1 What Is GRU?
The Gate Recurrent Unit (GRU) can be considered a kind of RNN that, like LSTM, is designed to manage backpropagation gradient problems (Chung et al. 2014; Dey and Salem 2017). GRU, proposed in 2014, and LSTM, proposed in 1997, have similar performance in many cases, but the former is often preferred because of its simpler computation with comparable results.
GRU’s input and output structures are like RNN. There are inputs xt and ht‐1 to
contain relevant information of the prior node. Current outputs yt and ht are calcu-
lated by combining xt and ht‐1. A GRU architecture is shown in Fig. 8.16.
r is the reset gate and z is the update gate. They are obtained by concatenating the input xt with the hidden state ht−1 from the prior node and multiplying the result with the corresponding weights, as shown in Fig. 8.17.
When the gate control signals are available, the reset gate r is applied to obtain the reset data h′t−1 = ht−1 ⊙ r; this is concatenated with xt and a tanh function is applied to generate data h′ that lies within the range (−1, 1), as shown in Fig. 8.18.
At this point, h′ contains the current input xt; this selective memory stage is like that of LSTM.
Finally, the update memory stage is the most critical step, where the forget and remember steps are performed simultaneously. The gate z obtained earlier is applied as:

ht = (1 − z) ⊙ ht−1 + z ⊙ h′    (8.5)

where z (the gate signal) lies within the range 0–1; the closer it is to 1 or 0, the more information is remembered or forgotten, respectively.
(1 − z) ⊙ ht−1 represents selectively forgetting the original hidden state: (1 − z) acts as a forget gate that discards the unimportant information in ht−1.
z ⊙ h′ represents selectively remembering the information h′ of the present node; like (1 − z), it drops the unimportant information in h′, i.e. it selects information from h′.
ht = (1 − z) ⊙ ht−1 + z ⊙ h′ therefore forgets part of the passed-down information ht−1 and adds information from the current node.
Note that the forget factor (1 − z) and the select factor z are linked: the more of the passed-in information is forgotten (weight 1 − z), the more of the current candidate h′ is selected (weight z), so the two weights always sum to a constant of 1.
The input and output structure of GRU is like that of an RNN, while its internal concept is like that of an LSTM. GRU has one less internal gate than LSTM and fewer parameters, but it can achieve comparably satisfactory results with reduced time and computational resources. A GRU computation module is shown in Fig. 8.19.
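The parameter saving can be seen directly in Keras. The sketch below is an illustration under assumed vocabulary and layer sizes, not the book's workshop code: the same many-to-one architecture (e.g. for sentiment analysis) is built once with an LSTM layer and once with a GRU layer, and their parameter counts are compared.

# A minimal Keras sketch comparing LSTM and GRU layers of the same width.
import tensorflow as tf

def many_to_one(cell):
    """Embedding -> recurrent layer -> single sigmoid output (e.g. sentiment)."""
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
        cell(64),                                   # LSTM or GRU layer
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

lstm_model = many_to_one(tf.keras.layers.LSTM)
gru_model = many_to_one(tf.keras.layers.GRU)
lstm_model.build(input_shape=(None, 100))           # batches of length-100 sequences
gru_model.build(input_shape=(None, 100))
print("LSTM parameters:", lstm_model.count_params())
print("GRU parameters: ", gru_model.count_params())  # fewer than the LSTM model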
The Bidirectional Recurrent Neural Network (BRNN) is a type of network with RNN layers running in two directions (Singh et al. 2016). Unlike RNN and LSTM, which only possess information from previous time steps, it links both previous and subsequent information to perform inference. For example, in text summarization it is insufficient to consider only the information from previous content; predicting the words of a sentence sometimes also requires subsequent text information. BRNN was proposed to deal with these circumstances (a Keras sketch follows the training steps below).
BRNN consists of two RNNs superimposed on top of each other. The output is
mutually generated by two RNNs states. A BRNN structure is shown in Fig. 8.20.
BRNN training process is as follows:
1. begin forward propagation from time step 1 to time step T to calculate hidden
layer’s output and save at each time step,
2. proceed backward from time step T to time step 1 to calculate the backward hidden layer output and save it at each time step,
3. obtain each moment final output according to forward and backward hidden lay-
ers after calculating all input moments from both forward and backward
directions.
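A BRNN can be built in Keras with the Bidirectional wrapper, which runs one recurrent layer forward and one backward over the sequence and merges their outputs. The sketch below is an illustration under assumed shapes, not the book's workshop code.

# A minimal Keras sketch of a bidirectional LSTM stack.
import tensorflow as tf

brnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(32, return_sequences=True)),  # forward + backward states
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
brnn.build(input_shape=(None, 100))
brnn.summary()   # each Bidirectional layer holds two 32-unit LSTMs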
8.5 Transformer Technology
8.5.1 What Is Transformer?
8.5.2 Transformer Architecture
A transformer model has two parts: (1) an encoder and (2) a decoder. A language sequence is taken as input, the encoder maps it into a hidden representation, and the decoder maps the hidden representation back to a sequence as output.
8.5.2.1 Encoder
There are six identical encoder layers in the transformer, each with two sublayers: (1) self-attention and (2) feedforward. The self-attention layer is the first sublayer, exercising the attention mechanism, and a simple fully connected feedforward network is the second sublayer. Each sublayer is followed by a residual connection and layer normalization. An encoder layer architecture is shown in Fig. 8.22.
8.5.2.2 Decoder
There are six identical decoder layers in the transformer. In addition to the two sublayers found in each encoder layer, a third sublayer is added to the decoder to perform multi-head attention over the output of the last encoder layer. Residual connections and layer normalization are used after all sublayers, the same as in the encoder. The decoder's self-attention is modified by a mask to ensure that inference at a position can only use information from known positions, in other words, its previous positions.
[Transformer architecture diagram not reproduced: input and output embeddings with positional encoding feed the encoder and decoder stacks (outputs shifted right); feedforward sublayers, a final linear layer, and a softmax produce the output probabilities.]
8.5.3.1 Positional Encoding
Since the transformer has no recurrent, iterative process, each word's position information must be provided so that the model can recognize positional relationships in language. A combination of sin and cos functions is applied to provide the model with position information, as in the equation:
PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i + 1) = cos(pos / 10000^(2i/dmodel))    (8.6)
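Equation (8.6) can be computed directly. The following is a minimal NumPy sketch (an illustration with assumed sequence length and model dimension) that builds the positional encoding matrix that is added to the word embeddings.

# A minimal NumPy sketch of the sinusoidal positional encoding of Eq. (8.6).
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix: even columns hold sin terms,
    odd columns hold cos terms, following Eq. (8.6)."""
    pos = np.arange(max_len)[:, None]                  # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]               # dimension index i
    angle = pos / np.power(10000.0, 2 * i / d_model)   # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                        # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512); this matrix is added to the word embeddings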
8.5.3.2 Self-Attention Mechanism
For an input sentence, the word vector of each word is obtained through word embedding, and the position vector of each word, with the same dimensions, is obtained through positional encoding; the two can be added directly to obtain the true vector representation. The ith word's vector is written as xi, and X is the input matrix formed by combining all word vectors, so that the ith row is the ith word vector.
WQ, WK, WV are matrices defined to perform three linear transformations on X, generating the three matrices Q (queries), K (keys), and V (values), respectively.
Q = X ⋅ WQ
K = X ⋅ WK
V = X ⋅ WV (8.7)
Attention(Q, K, V) = softmax(QK^T / √dk) V    (8.8)

The attention scores are calculated by multiplying the queries Q by the keys K, dividing the result by √dk, and applying a softmax function; the resulting weights are then applied to the values V.
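Equations (8.7) and (8.8) can be sketched in a few lines of NumPy. The example below is an illustration with toy dimensions and random weights, not an optimized implementation.

# A minimal NumPy sketch of single-head scaled dot-product self-attention.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # Eq. (8.7)
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))        # softmax(QK^T / sqrt(d_k))
    return weights @ V                               # weighted sum of values, Eq. (8.8)

rng = np.random.default_rng(0)
n_words, d_model, d_k = 6, 32, 8                     # toy sizes
X = rng.normal(size=(n_words, d_model))              # word + positional vectors
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)        # (6, 8): one vector per word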
8.5.3.3 Multi-Head Attention
The previously defined set of Q, K, V allows a word to use the information of related words. Defining multiple groups of Q, K, V enables a word to represent subspaces at different positions with an identical calculation process, except that the linear transformation matrices change from one group (WQ, WK, WV) to multiple groups (WQ0, WK0, WV0), (WQ1, WK1, WV1), ... as in the equation:
where
8.5.3.5 Feedforward Layer
Xhidden = Linear(ReLU(Linear(Xattention)))
followed by residual connection and layer normalization scheme:
8.6 BERT
8.6.1 What Is BERT?
8.6.2 Architecture of BERT
8.6.3 Training of BERT
BERT has two training process steps: (1) pre-training and (2) fine-tuning.
8.6.3.1 Pre-training BERT
Tasks such as question answering and natural language inference require understanding the relationship between two sentences. Sentence-level representations cannot be captured directly, as MLM tasks tend to extract token-level representations. BERT therefore applies the NSP pre-training task to let the model understand the relationships between sentences and predict whether they are connected.
For every training sample, sentences A and B are selected from the corpus to create a pair: 50% of the time B is the actual sentence that follows A (labeled "IsNext"), and 50% of the time B is a random sentence from the corpus (labeled "NotNext"). The training examples are then fed into the BERT model to generate binary classification predictions.
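The construction of such pairs can be sketched in plain Python. The snippet below is only an illustration of the 50/50 sampling idea and is not BERT's actual data pipeline.

# A minimal sketch of building Next Sentence Prediction training pairs.
import random

def make_nsp_pairs(sentences, n_pairs, seed=0):
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        i = rng.randrange(len(sentences) - 1)
        a = sentences[i]
        if rng.random() < 0.5:
            b, label = sentences[i + 1], "IsNext"        # the true next sentence
        else:
            b, label = rng.choice(sentences), "NotNext"  # a random sentence
            # (a real pipeline would also avoid sampling the true next sentence here)
        pairs.append((a, b, label))
    return pairs

corpus = ["Tim went to see a doctor.", "He felt sick.",
          "It might be influenza.", "Jane opened a new music store."]
for a, b, label in make_nsp_pairs(corpus, 3):
    print(label, "|", a, "->", b)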
8.6.3.3 Fine-tuning BERT
8.7.1 Transformer-XL
8.7.1.1 Motivation
Transformers are widely used as feature extractors in NLP but require a fixed-length input sequence, e.g. the default length for BERT is 512. If the text sequence is shorter than the fixed length, it is handled by padding; if it exceeds the fixed length, it is divided into multiple segments, and each segment is processed separately at training time, as shown in Fig. 8.24.
Nevertheless, there are two problems: (1) segments are trained independently, so the longest dependency between tokens is limited by the segment length; (2) segments are split at a fixed length without considering sentences' natural boundaries, producing semantically incomplete segments. Transformer-XL (Dai et al. 2019) was proposed to address these problems.
8.7.1.2 Transformer-XL Technology
8.7.2 ALBERT
The BERT model has many parameters, and as the model size increases it is limited by GPU/TPU memory. Google proposed A Lite BERT (ALBERT) to solve this problem (Lan et al. 2019). ALBERT applies the following techniques to reduce parameters and improve the NSP pre-training task:
1. parameter sharing—apply same weights to all 12-layers,
2. factorize embeddings—shorten initial embeddings to 128 features,
3. pretrain by LAMB optimizer—replace ADAM Optimizer,
4. Sentence Order Prediction (SOP)—replace BERT’s Next Sentence Prediction
(NSP) task,
5. N-gram masking—modify Masked Language Model (MLM) task to mask out
words’ N-grams instead of single words.
Exercises
8.1. What is Transfer Learning (TL)? Compare the major differences between
Transfer Learning (TL) and traditional Machine Learning (ML) in AI.
8.2. Describe and explain how Transfer Learning (TL) can be applied to NLP. Give
two NLP applications as examples to support your answer.
8.3. Compare the major differences between Heterogeneous vs. Homogeneous
Transfer Learning. Give two NLP applications/systems as examples for
illustration.
8.4. What is Recurrent Neural Network (RNN)? State and explain why RNN is
important for the building of NLP applications. Give 2 NLP applications as
example to support your answer.
8.5. State and explain FIVE major categories of Recurrent Neural Networks (RNN).
For each type, give a live example for illustration.
8.6. What is LSTM network? State and explain how it works by using NLP applica-
tion such as Text Summarization.
8.7. What is Gate Recurrent Unit (GRU)? Use an NLP application as example, state
and explain the major differences between GRU and standard RNN.
8.8. State and explain the key functions and architecture of Transformer technol-
ogy. Use NLP application as example, state briefly how it works.
8.9. What is BERT model? Use NLP application such as Q&A chatbot as example,
state and explain briefly how it works.
References
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio,
Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine
translation. arXiv preprint arXiv:1406.1078.
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent
neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-xl:
Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dey, R., & Salem, F. M. (2017, August). Gate-variants of gated recurrent unit (GRU) neu-
ral networks. In 2017 IEEE 60th international midwest symposium on circuits and systems
(MWSCAS) (pp. 1597-1600). IEEE.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8),
1735-1780.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). Albert: A lite bert
for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
Pan, S. J., & Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on knowledge and
data engineering, 22(10), 1345-1359.
Sherstinsky, A. (2020). Fundamentals of recurrent neural network (RNN) and long short-term
memory (LSTM) network. Physica D: Nonlinear Phenomena, 404, 132306.
Singh, B., Marks, T. K., Jones, M., Tuzel, O., & Shao, M. (2016). A multi-stream bi-directional
recurrent neural network for fine-grained action detection. In Proceedings of the IEEE confer-
ence on computer vision and pattern recognition (pp. 1961-1970).
Staudemeyer, R. C., & Morris, E. R. (2019). Understanding LSTM--a tutorial into long short-term
memory recurrent neural networks. arXiv preprint arXiv:1909.09586.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin,
I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Weiss, K., Khoshgoftaar, T. M., & Wang, D. (2016). A survey of transfer learning. Journal of Big
data, 3(1), 1-40.
Yin, W., Kann, K., Yu, M., & Schutze, H. (2017). Comparative study of CNN and RNN for natural
language processing. arXiv preprint arXiv:1702.01923.
Yu, Y., Si, X., Hu, C., & Zhang, J. (2019). A review of recurrent neural networks: LSTM cells and
network architectures. Neural computation, 31(7), 1235-1270.
Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., & He, Q. (2020). A comprehensive survey
on transfer learning. Proceedings of the IEEE, 109(1), 43-76.
Chapter 9
Major NLP Applications
9.1 Introduction
This chapter will study three major NLP applications: (1) Information Retrieval
Systems (IR), (2) Text Summarization Systems (TS), and (3) Question-&-Answering
Chatbot System (QA Chatbot).
Information retrieval is the process of obtaining required information from large-scale unstructured data such as texts, images, audio, and video, as opposed to traditional structured database records. Information retrieval systems are not only common search engines but also recommendation systems such as e-commerce sites, question-and-answer systems, and interactive systems.
Text summarization is the process of computationally reducing a set of data to create a subset or summary that represents the relevant information, for NLP tasks such as text classification, question answering, legal texts, news summarization, and headline generation.
A Question-Answering (QA) system is a human–machine interaction system in which human natural language is the communication medium. It is a task-oriented system that deals with objectives or answers specific questions through dialogues with sentiment analysis.
Corpora catering for IR in open, machine-readable standard formats have grown exponentially with the technological advancement of pre-trained models. IR models for generic language combine generic terms with domain-specific terms, e.g. lease can mean a place or a leasehold; their objectives can be organized by abstract, formal, or colloquial language in a large narrative component based on document type to improve retrieval results.
Text or document classification and clustering research in IR focuses on two aspects: (1) text representation and (2) clustering algorithms. Text representation converts unstructured text into a computer-processable data format; during this process, it is necessary to extract and mine textual information. Semantic similarity computation is the link between text modeling and representation, applied to the potential information layer of the text. Clustering algorithms extract semantic information to facilitate similarity calculation for effective text classification and clustering.
The Vector Space Model (Salton et al. 1975) was a leading IR method in the 1960s and 1970s. In this model, queries and retrieved documents are represented as vectors whose dimensionality is the size of the word list. A retrieved document Di can be represented as a vector of lexical items Di = (d1, d2, ..., dn), where di is the weight of the ith lexical item in Di. The query Q is expressed as a lexical item vector Q = (q1, q2, ..., qn), where qi is the weight of the ith lexical item in the query. Relevance is determined by computing the distance between the lexical item vectors of the retrieved document and the query based on this representation. Although it cannot be proven that cosine relevance is superior to other similarity measures, it achieved satisfactory performance according to search engine evaluation results. The cosine similarity of the angle between a retrieved document and a query is calculated as:
sim(Di, Q) = (Di · Q) / (|Di| × |Q|) = Σ(j=1..n) dij qj / √( Σ(j=1..n) dij² · Σ(j=1..n) qj² )    (9.1)
Equation (9.1) is the dot (inner) product of the weights of all word terms in the query and the matching document. There are many word-item weighting schemes for vector space models; most are variations of TF (Term Frequency). Inverse document frequency (IDF) (Aizawa 2003) reflects the number of retrieved documents in which a term occurs and reveals the lexical term's significance in the entire document dataset: a lexical item that occurs with high frequency across many retrieved documents is insignificant.
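The TF-IDF vector space and the cosine ranking of Eq. (9.1) can be sketched with scikit-learn. The snippet below is an illustration with made-up documents and query, not the book's workshop code.

# A minimal scikit-learn sketch of TF-IDF vector-space retrieval.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Sherlock Holmes examined the letter in Baker Street.",
    "The summarization system produces headlines from news articles.",
    "Information retrieval systems rank documents against a query.",
]
query = ["how do retrieval systems rank documents"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)         # one TF-IDF vector per document
query_vector = vectorizer.transform(query)
scores = cosine_similarity(query_vector, doc_vectors)[0]  # sim(Di, Q) for each document
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {documents[idx]}")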
There are other text representation methods in addition to the vector space model, e.g. phrase or concept representations. Although phrase representation can improve semantic content, the feature vector becomes sparse with reduced statistical quality, making it difficult to extract statistical properties with machine learning algorithms. Figures 9.1 and 9.2 show a text encoded by Sentence Transformers (Reimers and Gurevych 2019) to demonstrate computing cosine similarity between embeddings. It uses a pre-trained model to encode two sentences and outperforms other pre-trained models like BERT (Vaswani et al. 2017).
It is natural to identify the combination with the highest cosine similarity score. By doing so, an exhaustive ranking scheme, as shown in Fig. 9.3, is used to identify the highest-scoring pair with quadratic complexity. However, it may not work for long lists of sentences.
A chunking concept that divides the corpus into smaller parts is shown in Figs. 9.4 and 9.5. For example, 1000 sentences are parsed at a time and searched against the rest of the corpus (all other sentences); a list of 20k sentences can be divided into 20 × 1000 sentences. Each query is compared with sentences 0–10k first and then 10k–20k to reduce memory usage. Increasing these two values trades off speed against memory. The pair with the highest similarity is then identified by extracting the top-K scores for each query, as opposed to extracting and sorting the scores of all n² pairs.
Such a method is faster than brute-force methods because fewer samples are compared. In practical industrial scenarios, more attention is paid to the speed of pre-trained models, encoding methods, and data retrieval. For example, the two-tower model (Yang et al. 2020) and the Wide & Deep model (Cheng et al. 2016) are shown in Figs. 9.6 and 9.7.
Probabilistic Ranking Principle (PRP) models were first proposed by Croft and Harper in 1979 (Croft and Harper 1979) to compute query relevance degrees for retrieval. PRP regards IR as a process of statistical inference, in which an IR system predicts query relevance from retrieved documents and sorts them in descending order of predicted relevance score. This approach is similar to Bayesian machine learning models. A PRP model combines relevance feedback information with IDF and estimates each item's probabilities to optimize search engine retrieval performance. However, estimating each probability accurately is difficult in practical applications. The Okapi BM25 retrieval model (Whissell and Clarke 2011) addressed this problem:
sim(Q, D) = Σ(qi∈Q) log[ ((ri + 0.5) / (R − ri + 0.5)) / ((ni − ri + 0.5) / (N − ni − R + ri + 0.5)) ] · ((k1 + 1) fi) / (K + fi) · ((k2 + 1) qfi) / (k2 + qfi)    (9.2)
There are two approaches to consider which is the best BM25 method:
1. BM25 + Word2Vec embedding across all documents.
2. BM25 + BERT + Word2Vec embedding for each top-k documents, select the
most similar sentence embedding across top-k paragraphs.
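BM25 ranking itself can be sketched with the third-party rank_bm25 package (an assumption of this sketch, installable with pip install rank_bm25; it is not part of the book's toolchain). The corpus and query below are made up for illustration.

# A minimal sketch of BM25 document ranking with the rank_bm25 package.
from rank_bm25 import BM25Okapi

corpus = [
    "the castle in camelot remained the residence of the king",
    "bm25 is a probabilistic ranking function for information retrieval",
    "word2vec embeddings place similar words close together",
]
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "probabilistic ranking for retrieval".split()
print(bm25.get_scores(query))                 # one BM25 score per document
print(bm25.get_top_n(query, corpus, n=1))     # highest-scoring document text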
Word2vec (Church 2017) is based on the distributional hypothesis, i.e. on the proportions of word co-occurrence relations that hold in general over large text corpora, and places the vectors of similar words close together in a vector space. Word2vec embeddings are used to compare the query with sentence embeddings and select the one with the highest cosine similarity.
Transformer-based neural network models are a popular NLP research area owing to their enhanced parallel processing capabilities. BERT is one model that uses Transformer-based deep bidirectional encoders to learn contextual semantic relationships between lexical items, and it has performed satisfactorily in many NLP tasks.
Retrieval begins with the most relevant documents, followed by paragraphs, and sentences are then extracted from the selected paragraphs. BERT embeddings are used to compare the query with paragraphs and select the one with the highest cosine similarity. Once relevant paragraphs are available, the sentence containing the answer is selected by comparing sentence embeddings based on Word2Vec embeddings trained on the whole dataset; the word embeddings in the paragraph are averaged and combined with the BM25 score calculation, as shown in Fig. 9.8.
[Fig. 9.7 not reproduced: Wide & Deep models, in which sparse features feed dense embeddings and hidden layers up to the output units, compared with wide-only and deep-only models.]
Fig. 9.8 Sample code for Word2vec embeddings with BM25 score calculation
[Histogram panels not reproduced: number of documents versus scores for two score distributions.]
Queries containing general terms, e.g. age, human, climate, performed satisfactorily by retrieving the most relevant documents instead of comparing embeddings across all of them. Thus, it is reasonable to compare the results of the two approaches each time and select the appropriate one based on the word distribution of each query.
9.2.4.1 Query-Likelihood
P(q1 … qk | MD) = Π(i=1..k) P(qi | MD)    (9.4)
Search results are obtained by sorting all computed results. However, this method
calculates the probability for each Doc independently from other Docs, and the
relevant documents are not utilized.
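Equation (9.4) can be sketched with a smoothed unigram language model. The snippet below is an illustration that uses add-one (Laplace) smoothing, a simplifying assumption to avoid zero probabilities; real systems usually use more refined smoothing.

# A minimal sketch of query-likelihood scoring with a unigram document model.
from collections import Counter

def query_likelihood(query_tokens, doc_tokens, vocab_size):
    counts = Counter(doc_tokens)
    score = 1.0
    for q in query_tokens:
        # P(q | M_D) with add-one (Laplace) smoothing
        score *= (counts[q] + 1) / (len(doc_tokens) + vocab_size)
    return score

docs = ["the king moved the castle to london".split(),
        "the summarization system generates headlines".split()]
vocab = {w for d in docs for w in d}
query = "king castle".split()
scores = [query_likelihood(query, d, len(vocab)) for d in docs]
print(scores.index(max(scores)))   # index of the best-matching document (0)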
9.2.4.2 Document-Likelihood
Determine the query language model MQ corresponding to each query, then calculate the probability that any given document would be generated under the query's language model (Zhuang and Zuccon 2021):
P(D | MQ) = Π(ω∈D) P(ω | MQ)    (9.5)
9.2.5 Discourse Segmentation in IR
Document contents are composed of articulated parts such as paragraphs, which calls for automatically segmenting documents according to meaning. Machine learning methods compare the similarity of adjacent sentences in turn and place a segmentation point where similarity is lowest. This unsupervised method is called TextTiling (Hearst 1997), as shown in Fig. 9.14. Further, supervised learning methods can also
(Rothkopf 1971), Syntax features (Sadler and Spencer 2001), Lexical and distribu-
tional similarities (Weeds et al. 2004).
The discourse segmentation task is a significant indicator of NLP development directions. From the application perspective, discourse segmentation can help users improve productivity; its core technological value is converting semi-structured and unstructured data into specific structured descriptions that in turn support substantial downstream applications.
9.3.1.1 Motivation
There is an excess of information from copious sources for obtaining the latest information daily. Although automatic and accurate summarization systems can help users simplify, identify, and understand key information in the shortest possible time, they remain challenging because new words and documents with complex text structures appear constantly.
9.3.1.2 Task Definition
9.3.1.3 Basic Approach
Summarization approaches are mainly divided into extractive and abstractive (Chen and Zhuge 2018).
Extractive methods select important phrases from the input text and combine them to form a summary, like a copy-and-paste process. Many traditional text summarization methods use Extractive Text Summarization (ETS) because it is simple and generates sentences without grammatical errors, but it cannot reflect the exact meanings of sentences and is inflexible, unable to use novel expressions, words, or connectors outside the text.
9.3.1.4 Task Goals
Summarization task objectives are to assist users to understand raw text within a
short period as shown in Fig. 9.18.
9.3.1.5 Task Sub-processes
Summarization tasks are divided into the following modules, as shown in Fig. 9.19. The input document or documents are first combined and preprocessed from continuous text into split sentences. The sentences are then encoded into vector-form data and fitted into a matrix for similarity score calculation to obtain sentence rankings, followed by the summary with the highest possibility according to the ranking list.
Text summarization datasets commonly used include DUC (DUC 2022), New York
Times (NYT 2022), CNN/Daily Mail (CNN-DailyMail 2022), Gigaword (Gigaword
2022), and LCSTS datasets (LCSTS 2022).
DUC datasets (DUC 2022) are the most fundamental text summarization datas-
ets developed and used for testing purposes only. They consist of 500 news articles,
each with four human-written summaries.
NYT datasets (NYT 2022) contain articles published in the New York Times between 1996 and 2007 with abstracts compiled by experts. The abstracts are sometimes incomplete, sporadic short sentences averaging 40 words.
CNN/Daily Mail datasets (CNN-DailyMail 2022) are widely used multi-sentence
summary datasets often trained by generative summary system. They have (a) ano-
nymized version to include entity names and (b) non-anonymized version to replace
entities with specific indexes.
Gigaword datasets (Gigaword 2022) consist of abstracts comprising the first sentence and the article title, extracted with heuristic rules from approximately 4 million articles.
LCSTS datasets (LCSTS 2022) are Chinese short-text abstract datasets constructed from Weibo (2022).
Text summarization task for input documents can be divided into two types:
1. Single document summarization considers each input is one document.
2. Multiple document summarization considers input has several documents.
Text summarization task viewpoint can be divided into three classes:
1. Query-focused summarization adds viewpoint to query.
2. Generic summarization is generic.
3. Update summarization is a special type which sets difference (update) viewpoint.
Summarization systems based on contents can be divided into four types:
1. Indicative summarization describes contexts without revealing details especially
the endings, it contains partial information only.
2. Informative summarization contains all information in a document or documents.
A proper generic summarization should cover as many main topics as possible and minimize redundancy, which makes both system generation and evaluation difficult. Without query provisions and topics for the summary task, there is often no consensus on the summary output and performance judgments.
Typical generic summarization models rank and select sentences based on relevance similarity values and other semantic analyses (Gong and Liu 2001).
9.3.6.2 Graph-Based Method
Graph-based methods generate a graph from the input document and produce a summary by considering the relationships between nodes (units of text) (Chi and Hu 2021). TextRank (Mihalcea and Tarau 2004) is a typical graph-based approach from which many models have been developed. A TextRank summarization system extracting keywords from a sample text, together with its graph, is shown in Figs. 9.22 and 9.23.
This kind of system is based on the PageRank algorithm (Langville and Meyer 2006) applied by Google's search engine. Its principle is that linked pages are good, and even better if the links come from multiple linked pages. Links between pages are represented by matrices, like circular tables. This matrix can be converted into a transition probability matrix by dividing by the sum of links per page, and the page viewer moves between pages following the resulting matrix, as in Fig. 9.24.
TextRank processes words and sentences as pages in PageRank. Its algorithm defines text units and adds them as nodes in a graph; relations between text units are defined and added as edges. Generally, the weights of the edges are set by similarity or score values.
Then, the PageRank algorithm is used to solve the graph. There are other similar systems such as LexRank (Erkan and Radev 2004), which considers sentences as nodes and similarity as relations or weights, i.e. IDF-modified cosine similarity is used to calculate similarity.
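A simplified TextRank/LexRank-style extractive summarizer can be sketched with networkx and scikit-learn: sentences become nodes, cosine similarities become edge weights, and PageRank supplies the sentence scores. The snippet below is an illustration under these simplifying assumptions (TF-IDF cosine similarity instead of TextRank's original co-occurrence measure), with made-up sentences.

# A minimal graph-based extractive summarization sketch.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Jane opened a new music store in the city.",
    "The store sells CDs, records and guitars.",
    "Mary entered the store and looked at some CDs.",
    "The weather was sunny that afternoon.",
]
sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
graph = nx.Graph()
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if sim[i, j] > 0:                      # weighted edges between similar sentences
            graph.add_edge(i, j, weight=sim[i, j])
scores = nx.pagerank(graph, weight="weight")   # sentence importance scores
top = sorted(scores, key=scores.get, reverse=True)[:2]
print(" ".join(sentences[i] for i in sorted(top)))  # two top-ranked sentences, in order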
9.3.6.3 Feature-Based Method
[Figure not reproduced: a hierarchical extractive model with an input layer of sentences (Sentence 1–3), a word layer, a sentence layer producing a document representation (DOC REP), and a classification layer outputting 1 or 0 per sentence.]
Double arrows indicate a two-way RNN. The top layer
numbered with 1s and 0s is a classification layer based on sigmoid activation to
determine whether each sentence belongs to the summary. Each sentence decision depends on the sentence content, its relevance to the document, its originality relative to the cumulative summary representation, and other positional characteristics.
The topic-based model considers a document's topic features and scores input sentences according to the topic types they contain, so that sentences about a major topic obtain a higher rate when scoring.
Latent Semantic Analysis (LSA) is based on Singular Value Decomposition (SVD) to detect topics (Ozsoy et al. 2011). An LSA-based sentence selection process is shown in Fig. 9.26, with topics represented by eigenvectors or principal axes and their corresponding scores.
9.3.6.5 Grammar-Based Method
The grammar-based model parses the text, constructs a syntax structure, and selects or reorders substructures. A representation framework is shown in Fig. 9.27.
9.4 Question-and-Answering Systems
for response optimization in Natural Language Generation (NLG). Apart from the text aspect, ASR and TTS are the procedures by which a machine mimics human voice recognition and generation.
QA system research is divided into two categories: (1) rule-based pattern matching and (2) language generation based on information retrieval and neural networks. In practice, the backend is equipped with more than one method to generate meaningful communication and provide meaningful feedback. A QA chatbot system includes an open-domain focus on (1) common sense/world knowledge and (2) a task-oriented focus on special domain knowledge databases, resembling an expert system with an in-depth knowledge base to support appropriate responses.
The first rule-based human–computer interaction, a pattern recognition system as in Fig. 9.30, challenged the Turing test in the 1950s; the milestone is reached when humans cannot recognize whether the counterpart is a machine or a human. After a long period of data collection, the database used for dialogue pattern matching becomes large enough to rank appropriate feedback and give the highest-scoring answers, which is a process of selecting from a database of human answers regardless of the machine. After decades of development, search engines and data crawlers have supplied sources for building knowledge bases, including information retrieval, enabling search engines to retrieve relevant and up-to-date data for structured processing to form the answers of QA systems. The advent of the AI era enhanced mainstream QA systems so that they can focus on cognitive science as well as the big-data feeds of neural networks for system generation. Gradually, the traditional rule-based matching QA system is being replaced by AI machine communication, with recurrent neural network training realizing a large knowledge base to support the AI brain in imitating human reasoning, known as Natural Language Understanding (NLU).
The main sources of the knowledge base in a typical QA system are as follows. (1) Handcrafted human–human dialogue collections provide answers in human language with linguistic form and meaning, where the database consists of dialogue pairs. Without any imitation or learning ability, this first version of the rule-based QA system relies on pattern matching to measure the distance between the proposed question and the question–answer pattern pairs stored in the database. For example, Artificial Intelligence Markup Language (AIML) can answer most daily or even professional dialogues based on a large, classified, handcrafted database without intelligence. (2) Database building focused on search engines provides an Information-Retrieval-based knowledge base. The feature of an IR-based QA system is that knowledge is built from up-to-date knowledge bases. An IR-based QA system uses domain knowledge, as in an expert system, to extract and generate knowledge. The procedure of unstructured data extraction and reorganization depends on Natural Language Understanding (NLU) for reasoning, while Natural Language Generation (NLG) includes knowledge engineering analysis for reasoning and optimizing the re-ranking of candidate answers.
The latest knowledge bases use big data with data-driven models to realize machine intelligence. When a neural network is provided with sufficient data, sequence-to-sequence models such as the Recurrent Neural Network (RNN) and its related Long Short-Term Memory (LSTM), as in Fig. 9.31, are naturally skilled in sequential data processing (Cho et al. 2014). A neural network model is considered a black box that produces learning ability with accuracy but cannot be comprehended by humans. Before data are fed to the neural model, they must be transformed from natural words into vectors for training (Mikolov et al. 2013). Tokenization has three levels: (1) character, (2) word, and (3) sentence, and the input format decides the output outcomes in an encoder–decoder framework. Words generated by a character-level Recurrent Neural Network may not be meaningful English dictionary words when the corpus is too small for a well-trained model. Further, transfer learning with a Transformer model pre-trained on enormous data requires selecting the intended decoder for the training target. For example, Dialogue GPT from OpenAI focuses on formatted dialogue training to generate responses.
A neural network system transforms natural language into word vectors so that a response can be acquired by mathematical computation. Unlike traditional techniques, a neural network can generate its own natural language.
9.4.1.1 Rule-based QA Systems
Rule-based QA systems were proposed around the same time as the Turing test in the 1950s. However, the original QA systems only followed rules set by humans, without self-improvement capabilities such as machine learning; a large number of dialogue pairs had to be stored in the database before the system could provide a concrete answer. The simplest but most efficient way to measure the similarity of two texts is the cosine distance between their vectors. It is undeniable that rule-based systems have collected huge dialogue corpora over decades, giving the system confidence whenever a new question has a high vector similarity to a stored one. To date, mature rule-based components remain quintessential in commercial QA systems, as the accumulated corpora can avoid meaningless responses and compensate for insufficient domain knowledge with appropriate and specific human feedback.
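As an illustration of this cosine-distance matching (a sketch, not the book's implementation), the code below vectorizes a small hand-crafted question–answer database with TF-IDF and returns the stored answer whose question is closest to the user's query; the QA pairs and function names are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical hand-crafted question-answer pairs
qa_pairs = {
    "What time do you open?": "We open at 9 am every day.",
    "Where is your shop located?": "Our shop is at 12 Queen's Road.",
    "Do you accept credit cards?": "Yes, all major credit cards are accepted.",
}

questions = list(qa_pairs.keys())
vectorizer = TfidfVectorizer()
q_matrix = vectorizer.fit_transform(questions)

def answer(query):
    # Project the new query into the same vector space and pick the closest stored question
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(q_vec, q_matrix)[0]
    return qa_pairs[questions[scores.argmax()]]

print(answer("when do you open"))    # expected to match the opening-hours pair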
The knowledge base for IR is an unstructured data source obtained from websites and WordNet using data mining methods, which differs from paired dialogues. The Knowledge Base Question-Answer system (KBQA system) is a significant branch of IR-based QA systems; its usage depends on the size of the knowledge base of unstructured data in storage. This relates to knowledge base construction, i.e., extracting purposeful knowledge from mass data. There are two methods to process natural language: (1) properties and (2) relations. A property refers to the definition or concept of one thing, as in an English–English dictionary where one concept explains another. Relations refer to the relationship between two entities, where Named Entity Recognition (NER) and the idea of the Subject–Predicate–Object (SPO) triple from Ontology must be used to extract the relation. The KBQA extension in research is the ontology or knowledge graph (KG). When entities are linked, the knowledge for one entity can be extracted according to the question during Natural Language Understanding (NLU). A typical KBQA with domain knowledge about ontology is shown in Fig. 9.32; its fundamental questions are about who and what, corresponding to name and relation entities (Cui et al. 2020).
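A toy sketch of how SPO triples can back such who/what lookups is given below; the triples and matching logic are purely illustrative and far simpler than a real ontology or knowledge graph.

# Illustrative Subject-Predicate-Object (SPO) triples forming a tiny knowledge base
triples = [
    ("Arthur Conan Doyle", "wrote", "The Adventures of Sherlock Holmes"),
    ("Sherlock Holmes", "lives at", "221B Baker Street"),
]

def who(predicate, obj):
    # "Who" question: return subjects linked to the object by the predicate
    return [s for s, p, o in triples if p == predicate and o == obj]

def what(subject, predicate):
    # "What" question: return objects linked to the subject by the predicate
    return [o for s, p, o in triples if s == subject and p == predicate]

print(who("wrote", "The Adventures of Sherlock Holmes"))
print(what("Sherlock Holmes", "lives at"))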
proposed during the training period for language model performance (Chen et al. 2017), together with a system design sufficient for both language generations.
Since the encoder–decoder framework was proposed as an end-to-end system and a sequential language model, the RNN has become a popular generation-based model in both commercial and academic settings. However, its applications mainly focus on casual, open-domain scenarios without detailed questions. Thus, the response from a generation-based QA system is appropriate as a pair but lacks content, because the data-driven model considers only basic linguistics and excludes facts from the knowledge base, much like a traditional dialogue system with meaningless answers. A knowledge-grounded neural conversation model (Ghazvininejad et al. 2018) was proposed based on a sequence-to-sequence RNN model, combining dialogue history with facts related to the current context, as shown in Fig. 9.33.
In 2018 Microsoft extended its industrial conversation system to achieve useful conversational applications grounded in conversation history and external facts. It showed significant progress in real situations, generating responses from the conversation history in the Dialogue Encoder and from contextually relevant facts in the Facts Encoder, compared with the baseline seq2seq model.
The data-driven QA system divides its source data into conversational data and non-conversational text: the conversation pairs are used to train the system linguistically, whereas the non-conversational text fills the knowledge base with real-world information related to the system's target usage.
The performance in terms of versatility and scalability in the open domain, with external knowledge combining textual and structured data, is shown in Fig. 9.34. Datasets such as Wikipedia, IMDB, and TripAdvisor are used to generate conversations with real-world information and include a recommendation system function.
Fig. 9.34 Response from conversation model knowledge grounded (Ghazvininejad et al. 2018)
After the facts encoder is applied, the response from this system becomes more meaningful, with related information and logical content. Based on this model, 23 million open-domain Twitter conversations and 1.1 million Foursquare tips were used to achieve a significant improvement over the previous seq2seq model; this also differs from traditional content filling, which adds predefined content to fill spaces in sentences.
It is well known that industrial QA systems are not limited to one model; many models are assembled into a language model for end-to-end dialogue. In this architecture the dialogue encoder is independent of the fact encoder, but the two are complementary when applied, because facts require information from the dialogue history, especially to match context-dependent information, and intentional information forms part of the response. From an implementation perspective, multi-task learning is used to handle factual, non-factual, and autoencoder tasks depending on the intended work of the system. Multi-task learning can train the two encoders independently; after training on the dialogue dataset, the factual encoder uses information retrieval (IR) to expand the knowledge base for more meaningful answers. In a way, a fact encoder is like a memory network, which uses a store of facts relevant to a particular problem. Once the query contains a specific entity, the sentence is assigned a named entity recognized by NER, by matching keywords or linked entities, and its weight over the input and dialogue history is calculated to generate a response. The original memory network model uses a bag of words, but in this model the encoder directly converts the input set to a vector.
Since the system is a fully neural, data-driven model, the authors created an end-to-end RNN system using a traditional seq2seq model, including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models. For ensemble structures such as two-class RNNs, constructing a simple GRU is usually faster than an LSTM model. The use of GRU means that the system does not have the Transformer's attention mechanism or other variants for neural network computation.
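To show why a GRU is usually lighter than an LSTM, the sketch below (an illustration under arbitrary assumptions, not the systems described above) builds two tiny TensorFlow Keras encoders and compares their parameter counts; the vocabulary size, sequence length, and dimensions are invented for the example.

# Minimal GRU vs. LSTM comparison in TensorFlow Keras (illustrative sizes only)
from tensorflow.keras import layers, models

def build_encoder(cell):
    return models.Sequential([
        layers.Input(shape=(20,)),                         # toy sequences of 20 token ids
        layers.Embedding(input_dim=5000, output_dim=64),   # toy vocabulary of 5,000 tokens
        cell(128),                                         # recurrent encoder layer
        layers.Dense(5000, activation="softmax"),          # next-token prediction head
    ])

gru_model = build_encoder(layers.GRU)
lstm_model = build_encoder(layers.LSTM)

# A GRU keeps three sets of recurrent weights versus four for an LSTM, so it is smaller and faster
print("GRU parameters :", gru_model.count_params())
print("LSTM parameters:", lstm_model.count_params())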
9.4.2.1 AliMe QA System
Just as Microsoft used the GRU to reduce computational cost and response time, AliMe selected an RNN GRU to improve response efficiency. During optimization, beam search in the decoder helps identify the highest conditional probability so as to obtain the optimal response sentence within the given parameters. The performance showed that the IR + generation + rerank approach, using a seq2seq model with a mean-probability scoring function for evaluation, obtained the highest score compared with other methods.
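A compact sketch of beam-search decoding follows; the toy next_token_probs() function stands in for a trained decoder and is entirely hypothetical, so the code only illustrates how hypotheses are expanded and ranked by cumulative log-probability.

import math

# Hypothetical next-token distribution standing in for a trained seq2seq decoder
def next_token_probs(prefix):
    return {"good": 0.5, "fine": 0.3, "<eos>": 0.2}

def beam_search(beam_width=2, max_len=4):
    beams = [([], 0.0)]                        # each beam: (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<eos>":     # finished hypotheses are carried over unchanged
                candidates.append((seq, score))
                continue
            for tok, p in next_token_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        # keep only the beam_width highest-scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for seq, score in beam_search():
    print(" ".join(seq), round(score, 3))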
Xiao Ice (Zhou et al. 2020) is an AI companion chatbot with more than 660 million users worldwide, which considers both Intelligence Quotient (IQ) and Emotional Quotient (EQ) in its system design as shown in Fig. 9.37. Compared with other commonly used QA systems, it focuses on chitchat. According to the Conversation-turns Per Session (CPS) evaluation, its score of 23 is higher than that of most chatbots. Figure 9.36 shows the system architecture of Xiao Ice.
As an industrial application, Xiao Ice exists on 11 social media platforms including WeChat, Tencent QQ, Weibo, and Facebook. It is equipped with two-way text-to-speech voice and can process text, images, voice, and video clips for message-based conversations. Its core chat function can also distinguish between common and specific-domain topic chat types, so that it can change topics easily and automatically provide users with deeper domain knowledge. The dialogue manager is like a general NLP pipeline with dialogue management to track conversation states, so that core chat contents for open or special domains, processed from different sources, remain tractable. The Global State Tracker is a vector of Xiao Ice's responses used to analyze text strings for entities and empathy; it starts empty and is gradually filled over rounds of conversation. Dialogue strategies are primarily designed for long-term users, based on their feedback, to enhance interaction engagement and optimize personality over two or three achievement levels. A trigger mechanism changes the topic when the chatbot repeats itself or answers with information that is always valid, or when a user's feedback is mundane (within three words). Once the user's input has a predefined format, a skill selection part is activated to process the different input. For example, images can be categorized into different task-oriented scenarios: if an image is food related, the user will be taken to a restaurant display, much like a task completed by a personal assistant advising on weather information or making reservations.
Xiao Ice has a few knowledge graphs in the data layer, as its original datasets come from popular forums such as Instagram in English or Douban in Chinese. These datasets are categorized into multiple topics, each with a small knowledge base of possible answers. It also follows rules for updating the knowledge base through machine learning when new topics emerge. Note that not all new entities or topics are collected: an entity must be contextually relevant, or a topic must have high popularity or freshness in the news, to be ranked for inclusion. A user's personal interests can be adjusted individually.
Although Empathetic Computing can be included as a core add-on among its many features, it is not a mandatory part of a full chatbot but rather a functional and compelling feature for competing in industry. The core of Xiao Ice is an RNN language model that creates open- and special-domain knowledge. Figures 9.37 and 9.38 show an RNN-based neural response generator and examples of inconsistent responses generated by the seq2seq model in the Xiao Ice QA system, respectively.
Exercises
9.1. What is Information Retrieval (IR) in NLP? State and explain why IR is vital
for the implementation of NLP applications. Give two NLP applications to
illustrate.
9.2. In terms of the implementation technology of Information Retrieval (IR) systems, what are the major differences between traditional and the latest IR systems? Give one IR system implementation example to support your answer.
9.3. What is Discourse Segmentation? State and explain why Discourse
Segmentation is critical for the implementation of Information Retrieval (IR)
systems.
9.4. What is Text Summarization (TS) in NLP? State and explain the relationship
and differences between TS system and IR (Information Retrieval) systems.
9.5. What are two basic approaches of Text Summarization (TS)? Give live exam-
ples of TS systems to discuss how they work by using these two approaches.
9.6. What are the major differences between Single vs Multiple documentation
summarization systems? State and explain briefly the related technologies
being used in these TS systems.
9.7. What are the major characteristics of contemporary Text Summarization (TS)
systems as compared with traditional TS systems in the past century? Give
live example(s) to support your answer.
9.8. What is a QA system in NLP? State and explain why QA system is critical to
NLP. Give two live examples to support your answer.
9.9. Choose any two industrial used QA systems and compare their pros and cons
in terms of functionality and system performance.
9.10. What is Transformer technology? State and explain how it can be used for the
implementation of QA system. Use a live example to support your answer.
References
Agarwal, N., Kiran, G., Reddy, R. S. and Rosé, C. P. (2011) Towards Multi-Document
Summarization of Scientific Articles: Making Interesting Comparisons with SciSumm. In Proc.
of the Workshop on Automatic Summarization for Different Genres, Media, and Languages,
Portland, Oregon, pp. 8– 15.
Agrawal, K. (2020) Legal case summarization: An application for text summarization. In Proc. Int.
Conf. Comput. Commun. Informat. (ICCCI), pp. 1–6.
Aharon, M., Elad, M. and Bruckstein, A. (2006). K-SVD: An algorithm for designing overcom-
plete dictionaries for sparse representation. IEEE Transactions on signal processing, 54(11):
4311-4322.
Aizawa A. (2003). An information-theoretic perspective of tf–idf measures. Information Processing
& Management, 39(1): 45-65.
Alami, N., Meknassi, M and Rais, N. (2015). Automatic texts summarization: Current state of the
art. Journal of Asian Scientific Research, 5(1), 1-15.
Alomari, A., Idris, N., Sabri, A., and Alsmadi, I. (2022) Deep reinforcement and transfer learning
for abstractive text summarization: A review. Comput. Speech Lang. 71: 101276.
Banko, M., Mittal, V. O. and Witbrock, M. J. (2000) Headline Generation Based on Statistical
Translation. ACL 2000, pp. 318-325.
Baumel, T., Eyal, M. and Elhadad, M. (2018) Query Focused Abstractive Summarization:
Incorporating Query Relevance, Multi-Document Coverage, and Summary Length Constraints
into seq2seq Models. CoRR abs/1801.07704.
Barzilay, R., McKeown, K. and Elhadad, M. (1999) Information fusion in the context of multi-
document summarization. In Proceedings of ACL '99, pp. 550–557.
Baxendale, P. (1958) Machine-made index for technical literature - an experiment. IBM Journal of
Research Development, 2(4):354-361.
Brown TB, Mann B, Ryder N, et al (2020) Language models are few-shot learners. Adv Neural Inf
Process Syst 2020-Decem:pp.3–63
Carbonell, J. and Goldstein, J. (1998) The use of MMR, diversity-based reranking for reordering
documents and producing summaries. In Proceedings of SIGIR '98, pp. 335-336, NY, USA.
Chen, J. and Zhuge H. (2018) Abstractive text-image summarization using multi-modal atten-
tional hierarchical RNN. In Proc. Conf. Empirical Methods Natural Lang. Process., Brussels,
Belgium, pp. 4046–4056.
Chen H, Liu X, Yin D, Tang J (2017) A Survey on Dialogue Systems. ACM SIGKDD Explor
Newsl 19:25–35. https://doi.org/10.1145/3166054.3166058
Cheng, H. T. et al. (2016). Wide & deep learning for recommender systems. In Proceedings of the
1st workshop on deep learning for recommender systems, pp. 7-10.
Chi, L. and Hu, L. (2021) ISKE: An unsupervised automatic keyphrase extraction approach using
the iterated sentences based on graph method. Knowl. Based Syst. 223: 107014.
Chopra, S., Auli, M. and Rush, A. M. (2016) Abstractive Sentence Summarization with Attentive
Recurrent Neural Networks. HLT-NAACL 2016, pp. 93-98.
Cho K, Van Merriënboer B, Gulcehre C, et al (2014) Learning phrase representations using RNN
encoder-decoder for statistical machine translation. EMNLP 2014 - 2014 Conf Empir Methods
Nat Lang Process Proc Conf 1724–1734. https://doi.org/10.3115/v1/d14-1179
Church, K. W. (2017). Word2Vec. Natural Language Engineering, 23(1): 155-162.
CNN-DailyMail (2022) CNN/Daily-Mail Datasets. https://www.kaggle.com/datasets/gowrishan-
karp/newspaper-text-summarization-cnn-dailymail. Accessed 9 Aug 2022.
Columbia (2022). Columbia Newsblaster. http://newsblaster.cs.columbia.edu. Accessed 14
June 2022.
Croft, W. B. & Harper, D. J. (1979). Using probabilistic models of document retrieval without
relevance information. Journal of documentation, 35(4): 285-295.
Cui Y, Huang C, Lee R (2020) AI Tutor : A Computer Science Domain Knowledge Graph-Based
QA System on JADE platform. Int J Ind Manuf Eng 14:603–613
Luhn, H. P. (1958) The Automatic Creation of Literature Abstracts. IBM J. Res. Dev. 2(2): 159-165.
Mahalakshmi, P. and Fatima, N. S. (2022) Summarization of Text and Image Captioning in
Information Retrieval Using Deep Learning Techniques. IEEE Access 10: 18289-18297.
Malki, Z., Atlam, E., Dagnew, G., Alzighaibi, A., Ghada, E. and Gad I. (2020) Bidirectional resid-
ual LSTM-based human activity recognition, Comput. Inf. Sci., 13(3):1–40.
Mani I. and Bloedorn, E. (1997) Multi-document summarization by graph search and matching.
AAAI/IAAI, vol. cmplg/ 9712004, pp. 622-628, 1997.
Mihalcea, R. and Tarau, P. (2004) TextRank: Bringing Order into Text. EMNLP 2004: 404-411
Nallapati, R., Zhai, F. and Zhou. B. (2017) Summarunner: A recurrent neural network-based
sequence model for extractive summarization of documents. AAAI 2017: 3075-3081.
arXiv:1611.04230
NewsInEssence (2022). NewsInEssence News. http://NewsInEssence.com. Accessed 14 June 2022.
NYT (2022) NYT Dataset. https://www.kaggle.com/datasets/manueldesiretaira/dataset-for-text-
summarization. Accessed 9 Aug 2022.
Mikolov T, Sutskever I, Chen K, et al (2013) Distributed representations of words and phrases and
their compositionality. Adv Neural Inf Process Syst 1–9
Ozsoy, M. G., Alpaslan, F. N. and Cicekli, I. (2011) Text summarization using Latent Semantic
Analysis. J. Inf. Sci. 37(4): 405-417.
Pervin S. and Haque M. (2013) Literature Review of Automatic Multiple Documents Text
Summarization, International Journal of Innovation and Applied Studies, 3(1) 121-129.
Qiu M, Li F-L, Wang S, et al (2017) AliMe Chat: A Sequence to Sequence and Rerank based Chatbot
Engine. In: Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Stroudsburg,
PA, USA, pp 498–503
Rath, G. J., Resnick A. and Savage, T. R. (1961) Comparisons of four types of lexical indicators
of content. Journal of the American Society for Information Science and Technology, 12(2):
126-130.
Reimers, N., and Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-
networks, arXiv preprint arXiv:1908.10084.
Robertson, A. M. & Willett, P. (1998). Applications of n-grams in textual information systems.
Journal of Documentation, 54(1): 8-67.
Rothkopf, E. Z. (1971). Incidental memory for location of information in text. Journal of verbal
learning and verbal behavior, 10(6), 608-613.
Sadler, L. & Spencer, A. (2001). Syntax as an exponent of morphological features. In Yearbook of
morphology 2000, pp. 71-96. Springer.
Salton, G. (1989) Automatic Text Processing: the transformation, analysis, and retrieval of infor-
mation by computer. Addison- Wesley Publishing Company, USA.
Salton G, Wong A and Yang C S. (1975) A vector space model for automatic indexing.
Communications of the ACM, 18(11): 613-620.
See, A., Liu, P. J. and Manning, C. D. (2017) Get To The Point: Summarization with Pointer-
Generator Networks. ACL (1) 2017: 1073-1083.
Svore, K. M., Vanderwende L. and Burges, J.C. (2007) Enhancing Single document Summarization
by Combining RankNet and Third-party Sources. In Proc. of the Joint Conference on Empirical
Methods in Natural Language Processing and Computational Natural Language Learning,
pp. 448–457.
Taboada, M. & Mann, W. C. (2006). Applications of rhetorical structure theory. Discourse studies,
8(4): 567-588.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., Polosukhin,
I. (2017) Attention is All you Need. NIPS 2017: 5998-6008. arXiv:1706.03762.
Wan, X. (2008) An Exploration of Document Impact on Graph-Based Multi-Document
Summarization. Proc. of the Conference on Empirical Methods in Natural Language
Processing, Association for Computational Linguistics, pp. 755–762.
Weeds, J., Weir, D. and McCarthy, D. (2004). Characterising measures of lexical distributional sim-
ilarity. In COLING 2004: Proceedings of the 20th international conference on Computational
Linguistics, pp. 1015-1021.
Weibo (2022) Sina Weibo official site. https://weibo.com. Accessed 29 Sept 2022.
Whissell, J. S. & Clarke, C. L. (2011). Improving document clustering using Okapi BM25 feature
weighting. Information retrieval, 14(5): 466-487.
Wolf T, Sanh V, Chaumond J, Delangue C (2018) TransferTransfo: A Transfer Learning Approach
for Neural Network Based Conversational Agents
Yang, R., Bu, Z. and Xia, Z. (2012) Automatic Summarization for Chinese Text Using Affinity
Propagation Clustering and Latent Semantic Analysis. WISM 2012, pp. 543-550
Yang, J., Yi, X., Cheng, D. Z., Hong, L., Li, Y. and Wong, S. (2020). Mixed negative sampling
for learning two-tower neural networks in recommendations. In Proceedings of the Web
Conference 2020, pp. 441-447.
You O., Li W. and Lu, Q. (2009) An Integrated Multi-document Summarization Approach based
on Word Hierarchical Representation. In Proc. of the ACL-IJCNLP Conference, Singapore,
pp. 113–116.
Zhang, Y., Jin, R. and Zhou, Z. H. (2010). Understanding bag-of-words model: a statistical frame-
work. International Journal of Machine Learning and Cybernetics, 1(1): 43-52.
Zhou L, Gao J, Li D, Shum H-Y (2020) The Design and Implementation of XiaoIce, an Empathetic
Social Chatbot. Comput Linguist 46:53–93. https://doi.org/10.1162/coli_a_00368
Zhuang, S. & Zuccon, G. (2021). TILDE: Term independent likelihood moDEl for passage re-
ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 1483-1492.
Zhu T. and Zhao, X. (2012) An Improved Approach to Sentence Ordering For Multi-document
Summarization. IACSIT Hong Kong Conferences, IACSIT Press, Singapore, vol. 25, pp. 29-33.
Part II
Natural Language Processing Workshops
with Python Implementation in 14 Hours
Chapter 10
Workshop#1 Basics of Natural Language
Toolkit (Hour 1–2)
10.1 Introduction
Part 2 of this book will provide seven Python programming workshops on how each
NLP core component operates and integrates with Python-based NLP tools includ-
ing NLTK, spaCy, BERT, and Transformer Technology to construct a Q&A chatbot.
Workshop 1 will explore NLP basics including:
1. Concepts and installation procedures
2. Text processing function with examples using NLTK
3. Text analysis lexical dispersion plot in Python
4. Tokenization in text analysis
5. Statistical tools for text analysis
NLTK (Natural Language Toolkit 2022) is one of the earliest Python-based NLP development tools, created by Prof. Steven Bird and Dr. Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania, together with their classic book Natural Language Processing with Python published by O'Reilly Media Inc. in 2009 (Bird et al. 2009). To date, over 30 universities in the USA and in 25 countries have used NLTK for NLP-related courses. This book is considered the bible for anyone who wishes to learn and implement NLP applications using Python.
NLTK offers user-oriented interfaces with over 50 corpora and lexical resources such as WordNet, a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets); each expresses a distinct concept.
Let us look at NLTK text tokenization using Jupyter Notebook (Jupyter 2022;
Wintjen and Vlahutin 2020) as below:
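The input cell that produced Out[5] is not reproduced in this extract; a minimal sketch of what it might look like with NLTK's word_tokenize() is shown below (the sample utterance is inferred from the output).

import nltk
# nltk.download('punkt')           # tokenizer model, required once
from nltk.tokenize import word_tokenize

utterance = ("At every weekend, early in the morning. I drive my car to the "
             "car center for car washing. Like clockwork.")
word_tokenize(utterance)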
Out[5] ['At', 'every', 'weekend', ',', 'early', 'in', 'the', 'morning', '.', 'I', 'drive',
'my', 'car', 'to', 'the', 'car', 'center', 'for', 'car', 'washing', '.', 'Like', 'clock-
work', '.']
Python toolkits and packages have overtaken C, C++, and Java since 2000, especially in data science, AI, and NLP software development (Albrecht et al. 2020; Kedia and Rasu 2020). There are several reasons driving this change:
• It is a general-purpose language, unlike languages such as Java and JavaScript that are designed specifically for web application and website development.
• It is easier to learn and more user-friendly than C and C++, especially for non-computer-science students and scientists.
• Its lists and list-processing data types provide an ideal environment for NLP modelling and text analysis.
The Python code below counts the number of words in the literature Alice's Adventures in Wonderland by Lewis Carroll (1832–1898):
In[9] # Define method to count the number of word tokens in text file (cwords)
      def cwords(literature):
          try:
              with open(literature, encoding='utf-8') as f_lit:
                  c_lit = f_lit.read()
          except FileNotFoundError:
              err = "Sorry, the literature " + literature + " does not exist."
              print(err)
          else:
              w_lit = c_lit.split()
              nwords = len(w_lit)
              print("The literature " + literature + " contains " + str(nwords) + " words.")

      literature = 'alice.txt'
      cwords(literature)
This workshop has extracted four famous literatures from Project Gutenberg (2022):
1. Alice’s Adventures in Wonderland by Lewis Carroll (1832–1898) (alice.txt)
2. Little Women by Louisa May Alcott (1832–1888) (little_women.txt)
3. Moby Dick by Herman Melville (1819–1891) (moby_dick.txt)
4. The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (1859–1930)
(Adventures_Holmes.txt)
In[10] cwords('Adventures_Holmes.txt')
Out[10] The literature Adventures_Holmes.txt contains 107411 words.
NLTK provides Python tools and methods to learn and practice, starting from basic text processing in NLP. They include:
• Text processing as lists of words
• Statistics on text processing
• Simple text analysis
NLTK provides nine different types of text documents, drawn from classic literature, Bible texts, famous public speeches, news, and articles, together with a personal corpus for text processing. Let us start by loading these text documents.
In[11] # Let's load some sample books from the nltk databank
import nltk
from nltk.book import *
Out[11]
The above example shows all occurrences of Sherlock, indicating that Sherlock is a special word linked with the surname Holmes in this text document.
In[18] tholmes.similar("extreme")
Out[18] dense gathering
This shows that the usage of the word extreme varies by author and text type: it has a different style in The Adventures of Sherlock Holmes compared with its more vivid usage in Sense and Sensibility by Jane Austen (1775–1817), and a standard, fixed usage in the Inaugural Address Corpus. It also means that, after analyzing extreme and huge in The Adventures of Sherlock Holmes, no common context can be found.
Call concordance() function of these two words and check against the extracted
patterns as shown below:
Out[24]
Out[25]
In the previous workshop, text analysis was used to study word patterns and common contexts.
A Dispersion Plot in Python NLTK identifies the occurrence positions of keywords throughout the whole document.
In basic statistics, dispersion quantifies each point's deviation from the mean value.
The NLTK Dispersion Plot produces a plot showing the distribution of words throughout the text. Lexical dispersion indicates the homogeneity of word (token) occurrences in the corpus (text document) and is obtained with dispersion_plot() in NLTK.
To start, let us use an NLTK book object to call the function dispersion_plot().
Note: pylab must be installed before using this function.
The following example uses text1 to check basic information about dispersion_plot().
In[26] text1.dispersion_plot?
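As a quick illustration (not the book's own cell), the following call would plot the dispersion of a few keywords across Moby Dick; the keyword list is arbitrary.

# Lexical dispersion of selected keywords across Moby Dick (text1)
text1.dispersion_plot(["whale", "ship", "sea", "captain"])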
Are there any lexical patterns for positive words such as good, happy, and strong
versus negative words such as bad, sad, or weak in literature?
Workshop 1.2 Lexical Dispersion Plot over Context using Sense and
Sensibility
Use dispersion_plot to plot Lexical Dispersion Plot keywords: good, happy,
strong, bad, sad, and weak from Sense and Sensibility.
1. Study any lexical pattern between positive and negative keywords.
2. Check these patterns against Moby Dick to see if this pattern occurs and
explain.
3. Choose two other sentiment keywords to see if this pattern remains valid.
Lexical usage analysis studies how word patterns in written English change over time. The Inaugural Address Corpus, containing addresses by US presidents over the past 220 years, is a text document in the NLTK book library; this workshop uses it to study changes in lexical dispersion plot patterns for the keywords war, peace, freedom, and united.
Workshop 1.3 Lexical Dispersion Plot over Time using Inaugural Address
Corpus
1. Use dispersion_plot to invoke Lexical Dispersion Plot for Inaugural
Address Corpus.
2. Study and explain lexical pattern changes for keywords America, citizens,
democracy, freedom, war, peace, equal, united.
3. Choose any two meaningful keywords and check for lexical pattern
changes.
Tokenization
Fig. 10.5 Tokenization example of a sample utterance “Jane lent $100 to Peter early this morning”
NLTK provides flexibility to tokenize any string of text using tokenize() function
as shown below:
Out[29] ['Jane', 'lent', '$', '100', 'to', 'Peter', 'early', 'this', 'morning', '.']
Python provides the split() function to split a sentence of text into words, as recalled in Sect. 10.1. Let us see how it compares with NLTK tokenization, as sketched below.
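The comparison cells are not shown in this extract; a minimal sketch contrasting split() with word_tokenize() on the same utterance follows (the utterance is taken from the tokenization example above).

from nltk.tokenize import word_tokenize

utt = "Jane lent $100 to Peter early this morning."
print(utt.split())              # whitespace split keeps "$100" and "morning." as single items
print(word_tokenize(utt))       # NLTK separates '$', '100' and the final '.'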
NLTK provides a simple way to count total number of tokens in a Text Document
using len() in NLTK package.
Try len(tholmes) will notice:
In[32] len(tholmes)
Out[32] 128366
In[33] tholmes?
In[34] set(tholmes)
Out[34]
In[35] len(set(tholmes))
Out[35] 10048
This example shows that The Adventures of Sherlock Holmes contains 128,366 tokens, i.e. words and punctuation marks, of which 10,048 are distinct tokens, or types. Try other literatures and see what vocabulary can be learnt from these great works.
The following example shows how to sort distinct tokens using sorted() function.
In[36] sorted(set(tholmes))
Out[36]
Since books are tokenized in NLTK as a list book object, contents can be accessed
by using list indexing method as below:
10.9.4 Lexical Diversity
Token usage frequency, also called lexical diversity here, is obtained by dividing the total number of tokens by the total number of token types, as shown:
In[40] len(text1)/len(set(text1))
Out[40] 13.502044830977896
In[41] len(text2)/len(set(text2))
Out[41] 20.719449729255086
In[42] len(text3)/len(set(text3))
Out[42] 16.050197203298673
In[43] len(text4)/len(set(text4))
Out[43] 15.251970074812968
The Python code above analyzes the token usage frequency of four literatures: Moby Dick, Sense and Sensibility, the Book of Genesis, and the Inaugural Address Corpus. The usage frequency ranges from 13.5 to 20.7. What are the implications?
There are many commonly used words in English. The following example shows the pattern of usage frequency for the word the in the above literatures.
In[44] text1.count('the')
Out[44] 13721
In[45] text1.count('the')/len(text1)*100
Out[45] 5.260736372733581
In[46] text2.count('the')/len(text2)*100
Out[46] 2.7271571452788606
In[47] text3.count('the')/len(text3)*100
Out[47] 5.386024483960325
In[48] text4.count('the')/len(text4)*100
Out[48] 6.2491416014283745
Text analysis with NLTK can tokenize a string or a whole book of text. Frequency Distribution, FreqDist(), is a built-in method in NLTK to analyze the frequency distribution of every token type in a text document. The Inaugural Address Corpus is used as an example to show how it works.
In[49] text4
Out[49] <Text: Inaugural Address Corpus>
In[50] FreqDist?
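The cell that creates fd4 is not reproduced; it would presumably look like the sketch below.

fd4 = FreqDist(text4)     # frequency distribution over the Inaugural Address Corpus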
In[52] fd4
Out[52] FreqDist({'the': 9555, ',': 7275, 'of': 7169, 'and': 5226, '.': 5011, 'to': 4477, 'in':
2604, 'a': 2229, 'our': 2062, 'that': 1769, ...})
It is noted that FreqDist() returns key-value pairs as a Dictionary object, where each key stores a token type name and its value is the corresponding frequency of occurrence in the text. Since FreqDist() returns a Dictionary object, keys() can be used to return the list of all token types as shown below.
Use the dictionary access method to obtain the frequency distribution of any token type. The FD value for the token type the is shown below.
In[54] fd4['the']
Out[54] 9555
1. What are five common word types (token types without punctuations) in any
text document?
2. Use FreqDist() to verify.
NLTK is a useful tool to study the most frequent token types in any document using the plot() function of the FreqDist() method. FreqDist.plot() can plot the top N most frequently used token types in a text document.
1. Use fd3 to study FreqDist.plot() documentation using fd3.plot().
2. Plot top 30 frequently used token types from the Book of Genesis (Non-
Cumulative mode).
3. Do the same plot with Cumulative mode.
In[55] fd4.plot?
Hapaxes are words that occur only once in a body of work, whether a single publication or an entire language.
Ancient texts are full of hapaxes. For instance, Shakespeare's Love's Labour's Lost contains the hapax honorificabilitudinitatibus, which means able to achieve honours.
NLTK provides the method hapaxes() under the FreqDist object to list all word types that occur only once in a text document.
Try FreqDist() with The Adventures of Sherlock Holmes and see how useful it is.
In[58] tholmes
Out[58] <Text: The Adventures of Sherlock Holmes by Arthur Conan...>
In[59] fd = FreqDist(tholmes)
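The follow-up cell is not shown; a minimal sketch that lists a few hapaxes from this frequency distribution would be:

fd.hapaxes()[:20]     # first 20 word types that occur exactly once in the novel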
10.10.3 Collocations
There are many cases in English where strong collocations are word pairings that always appear together, such as make and do: you make a cup of coffee, but you do your work.
Collocations are frequently used in business settings where nouns are combined with verbs or adjectives, e.g. set up an appointment, conduct a meeting, set the price, etc.
10.10.3.2 Collocations in NLTK
In[62] text1.collocations()
Out[62]
In[63] text2.collocations()
Out[63]
In[64] text3.collocations()
Out[64]
In[65] text4.collocations()
Out[65]
References
Albrecht, J., Ramachandran, S. and Winkler, C. (2020) Blueprints for Text Analytics Using
Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications.
O’Reilly Media.
Antic, Z. (2021) Python Natural Language Processing Cookbook: Over 50 recipes to understand,
analyze, and generate text for implementing language processing tasks. Packt Publishing.
Arumugam, R. and Shanmugamani, R. (2018) Hands-On Natural Language Processing with
Python: A practical guide to applying deep learning architectures to your NLP applications.
Packt Publishing.
Bird, S., Klein, E., and Loper, E. (2009). Natural language processing with python. O'Reilly.
Doyle, A. C. (2019) The Adventures of Sherlock Holmes (AmazonClassics Edition).
AmazonClassics.
Gutenberg (2022) Project Gutenberg official site. https://www.gutenberg.org/ Accessed 16
June 2022.
Hardeniya, N., Perkins, J. and Chopra, D. (2016) Natural Language Processing: Python and
NLTK. Packt Publishing.
Jupyter (2022) Jupyter official site. https://jupyter.org/. Accessed 16 June 2022.
Kedia, A. and Rasu, M. (2020) Hands-On Python Natural Language Processing: Explore tools and
techniques to analyze and process text with a view to building real-world NLP applications.
Packt Publishing.
NLTK (2022) NLTK official site. https://www.nltk.org/. Accessed 16 June 2022.
Perkins, J. (2014). Python 3 text processing with NLTK 3 cookbook. Packt Publishing Ltd.
Wintjen, M. and Vlahutin, A. (2020) Practical Data Analysis Using Jupyter Notebook: Learn how
to speak the language of data by extracting useful and actionable insights using Python. Packt
Publishing.
WordNet (2022) WordNet official site. https://wordnet.princeton.edu/. Accessed 16 June 2022.
Chapter 11
Workshop#2 N-grams in NLTK
and Tokenization in SpaCy (Hour 3–4)
11.1 Introduction
11.2 What Is N-Gram?
N-gram models are widely used (Albrecht et al. 2020; Arumugam and Shanmugamani 2018; Hardeniya et al. 2016; Kedia and Rasu 2020) in:
• Speech recognition, where phonemes and sequences of phonemes are modeled using an N-gram distribution.
• Parsing, where words are modeled so that each N-gram is composed of N words; for language identification, sequences of characters/graphemes (e.g. letters of the alphabet) are modeled for different languages.
• Auto sentence completion
• Auto spell-check
• Semantic analysis
NLTK (NLTK 2022; Bird et al. 2009; Perkins 2014) offers useful tools for NLP processing.
The ngrams() function in NLTK facilitates N-gram operations.
The Python code below uses N-grams in NLTK to generate N-grams for any text string. Try it and study how it works.
The following example uses the first sentence of A Scandal in Bohemia from The Adventures of Sherlock Holmes (Doyle 2019), To Sherlock Holmes she is always "The Woman." I have seldom heard him mention her under any other name, to demonstrate how the N-gram generator works in NLTK.
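The generation cells themselves are not reproduced in this extract; a minimal sketch using nltk.ngrams() on that sentence is shown below (variable names are illustrative).

from nltk import ngrams
from nltk.tokenize import word_tokenize

sentence = ('To Sherlock Holmes she is always "The Woman". I have seldom heard '
            'him mention her under any other name.')
tokens = word_tokenize(sentence)

print(list(ngrams(tokens, 2)))    # bigrams
print(list(ngrams(tokens, 3)))    # trigrams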
Out[1]
Out[2]
Out[3]
NLTK offers an easy solution to generate N-grams of any N-number, which is useful for N-gram probability calculations and text analysis.
Once N-grams are generated, the next step is to calculate the term frequency (TF) of each N-gram in a document and list the top items.
The NLTK-based Python code extends the previous example to create N-gram statistics and list the top 10 N-grams.
Let us try the first two sentences of A Scandal in Bohemia from The Adventures of Sherlock Holmes.
In[4] sentence
Out[4] 'To Sherlock Holmes she is always "The Woman". I have seldom heard him
mention her under any other name.'
In[6] ngrams?
Out[7]
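The counting cell that produced Out[7] is not reproduced; assuming sentence holds the text shown in Out[4] and that ngrams and word_tokenize are imported as sketched earlier, a minimal sketch with collections.Counter is:

from collections import Counter

bigram_counts = Counter(ngrams(word_tokenize(sentence), 2))
bigram_counts.most_common(10)     # top 10 bigrams by term frequency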
In[8]
In[9] first_para
Use Python script to remove punctuation marks and tokenize the first_para object:
Out[11]
The results are satisfactory. It is noted that the bigram in a has the highest occurrence frequency, i.e. three times, while four other bigrams, Irene Adler, and that, for the, and his own, occur twice each within the paragraph. The bigrams in a, and that, and for the are frequently used English phrases that occur in almost every text document. How about To Sherlock and Irene Adler? There are two N-gram types frequently used in the N-gram language model studied in Chap. 2. One is the frequently used N-gram phrase in English, like in a, and that, and for the in our case; these bigrams are common phrases in other documents and literary writings. The other is the domain-specific N-gram, which is frequently used only in a specific domain, document, or genre of literature. Hence, To Sherlock and Irene Adler are frequent only in relation to this story and not in other situations.
Bigram analysis is required to examine which bigrams are commonly used not only in a single paragraph but also across the whole document or literature. Remember from Workshop 1 that NLTK has a built-in list of tokenized sample literatures in nltk.book. Let us refer to them first by using the nltk.book import statement.
In[12] # Let's load some sample books from the nltk databank
import nltk
from nltk.book import *
Out[12]
In[13] text1
Out[13] <Text: Moby Dick by Herman Melville 1851>
In[15] moby
Review the first 50 elements of Moby Dick text object to see whether they are
tokenized.
Out[16]
Use the Collections class and the ngrams() method for bigram statistics to identify the top 20 most frequent bigrams in the entire Moby Dick literature.
Out[17]
11.6 spaCy in NLP
11.6.1 What Is spaCy?
SpaCy (2022) is a free, open-source library for advanced NLP written in the Python and Cython programming languages.
The library is published under the MIT license and was developed by Dr. Matthew Honnibal and Dr. Ines Montani, founders of the software company Explosion.
SpaCy is designed specifically for production use, to build NLP applications that process large volumes of text (Altinok 2021; Srinivasa-Desikan 2018; Vasiliev 2020), unlike NLTK, which focuses on a teaching and learning perspective.
It also provides workflow pipelines for machine learning and deep learning tools that integrate with common platforms such as PyTorch, MXNet, and TensorFlow through its machine learning library, Thinc. By adopting Thinc, spaCy provides neural models such as convolutional neural networks (CNNs) for NLP tasks such as Dependency Parsing (DP), Named Entity Recognition (NER), POS Tagging, and Text Classification, and for advanced NLP applications such as Natural Language Understanding (NLU) systems, Information Retrieval (IR), Information Extraction (IE) systems, and Question-and-Answer chatbot systems.
A spaCy system architecture is shown in Fig. 11.1, its major features support:
• NLP-based statistical models for over 19 commonly used languages,
• tokenization tools implementation for over 60 international languages,
• NLP pipeline components include NER, POS Tagging, DP, Text Classification,
and Chatbot implementation,
• integration with common Python platforms such as TensorFlow, PyTorch and
other high-level frameworks,
• integration with the latest Transformer and BERT technologies,
• user-friendly modular system packaging, evaluation, and deployment tools.
Use the en_core_web_md-3.2.0 package, an English pipeline optimized for CPU on the current platform, with components including tok2vec, tagger, parser, senter, ner, attribute_ruler, and lemmatizer.
Note: since the text file already exists, skip the try-except module to save programming steps.
Use the read() method to read the whole text document as a single string object "holmes".
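The reading cell is not shown in this extract; it would presumably be along these lines.

fholmes = open('Adventures_Holmes.txt', encoding='utf-8')
holmes = fholmes.read()     # the whole novel as one string
len(holmes)                 # total number of characters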
Review total number of characters in The Adventures of Sherlock Holmes and exam-
ine the result document.
Out[23] 580632
In[24] holmes
Out[24]
The spaCy nlp() method is an important Text Processing Pipeline that initializes the nlp object (English in our case) for NLP processing such as tokenization. Calling it converts any text string object into a spaCy Doc object.
Study the nlp() docstring to see how it works.
In[25] nlp?
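The cell that builds the document object is not reproduced; presumably:

holmes_doc = nlp(holmes)     # run the full spaCy pipeline over the novel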
In[27] holmes_doc
Out[27]
SpaCy is practical for tokenizing a text document, converting the text document object into (1) sentence objects and (2) tokens.
This example uses a for-in statement to convert the whole Sherlock Holmes document into holmes_sentences.
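The corresponding cell is omitted in this extract; a minimal sketch is:

holmes_sentences = [sent for sent in holmes_doc.sents]    # spaCy sentence spans
len(holmes_sentences)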
Examine the structure of spaCy sentences and see what can be found.
In[29] holmes_sentences?
Out[30] 6830
In[31] holmes_sentences[50:60]
Out[31]
Tokenize the text document into word tokens by using the "token" objects in spaCy, instead of extracting the document object into a sentence list object. Study how it operates.
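Again the cell itself is omitted; a sketch:

holmes_words = [token.text for token in holmes_doc]    # word-level tokens from the Doc
len(holmes_words)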
In[33] holmes_words?
Out[37] 0.0s
References
Albrecht, J., Ramachandran, S. and Winkler, C. (2020) Blueprints for Text Analytics Using
Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications.
O’Reilly Media.
Altinok, D. (2021) Mastering spaCy: An end-to-end practical guide to implementing NLP applica-
tions using the Python ecosystem. Packt Publishing.
Arumugam, R., & Shanmugamani, R. (2018). Hands-on natural language processing with python.
Packt Publishing.
Bird, S., Klein, E., and Loper, E. (2009). Natural language processing with python. O'Reilly.
Doyle, A. C. (2019) The Adventures of Sherlock Holmes (AmazonClassics Edition).
AmazonClassics.
Gutenberg (2022) Project Gutenberg official site. https://www.gutenberg.org/ Accessed 16
June 2022.
Hardeniya, N., Perkins, J. and Chopra, D. (2016) Natural Language Processing: Python and
NLTK. Packt Publishing.
Melville, H. (2006) Moby Dick. Hard Press.
Kedia, A. and Rasu, M. (2020) Hands-On Python Natural Language Processing: Explore tools and
techniques to analyze and process text with a view to building real-world NLP applications.
Packt Publishing.
NLTK (2022) NLTK official site. https://www.nltk.org/. Accessed 16 June 2022.
Perkins, J. (2014). Python 3 text processing with NLTK 3 cookbook. Packt Publishing Ltd.
SpaCy (2022) spaCy official site. https://spacy.io/. Accessed 16 June 2022.
Sidorov, G. (2019) Syntactic n-grams in Computational Linguistics (SpringerBriefs in Computer
Science). Springer.
Srinivasa-Desikan, B. (2018). Natural language processing and computational linguistics: A prac-
tical guide to text analysis with python, gensim, SpaCy, and keras. Packt Publishing, Limited.
Vasiliev, Y. (2020) Natural Language Processing with Python and spaCy: A Practical Introduction.
No Starch Press.
Chapter 12
Workshop#3 POS Tagging Using NLTK
(Hour 5–6)
12.1 Introduction
In most NLP tasks, text sentences are first divided into subunits and mapped into vectors. These vectors are fed into a model for encoding, and the output is sent to a downstream task for results. NLTK (NLTK 2022) provides methods, called tokenizers, to divide text into subunits. The Twitter sample corpus from NLTK is used to perform tokenization (Hardeniya et al. 2016; Kedia and Rasu 2020; Perkins 2014) with the procedures below (Albrecht et al. 2020; Antic 2021; Bird et al. 2009):
1. Import NLTK package.
2. Import Twitter sample data.
3. List out fields.
4. Get Twitter string list.
5. List out first 15 tweets.
6. Tokenize tweets.
Let us start with the import of NLTK package and download Twitter samples
provided by NLTK platform.
# Download twitter_samples
# nltk.download('twitter_samples')
Import twitter samples dataset as twtr and check file id using fileids() method:
Out[7]
It can also tokenize words joined by hyphens and other punctuation. Further, NLTK's regular expression (RegEx) tokenizer can build custom tokenizers, as sketched below.
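The custom-tokenizer cell is not reproduced here; a minimal sketch with NLTK's RegexpTokenizer (the pattern and sample tweet are illustrative) is:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+|\$[\d\.]+|\S+")    # words, money amounts, other symbols
tokenizer.tokenize("A #NLProc tweet costs $0.00 :-)")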
12.3.1 What Is Stemming?
Stemming usually removes prefixes or suffixes such as -er, -ion, and -ization from words to extract the base or root form of a word, e.g. computers, computation, and computerization. Although these words are spelled differently, they share an identical concept related to compute, so compute is the stem of these words.
12.3.2 Why Stemming?
There is no need to extract every single word in a document, but only the concepts or notions they represent, as in information extraction and topic summarization in NLP applications. This saves computational capacity while preserving the overall meaning of the passage. The stemming technique extracts the overall meaning, or a word's base form, instead of treating each distinct word separately.
Let us look at how to perform stemming on text data.
12.3.4 Porter Stemmer
The Porter Stemmer is the earliest stemming technique, introduced in the 1980s. Its key procedure is to remove common word endings and parse words into generic forms. This method is simple and is used effectively in many NLP applications.
Import the Porter Stemmer from the NLTK library:
In[12] p_stem().stem("computer")
Out[12] 'comput'
In[13] p_stem().stem("dogs")
Out[13] 'dog'
In the above code, dogs is converted from plural to singular: the suffix -s is removed to give dog.
In[14] p_stem().stem("traditional")
Out[14] 'tradit'
A stemmer may output an invalid word when dealing with special words, e.g. tradit is obtained when the suffix -ional is removed. tradit is not an English word; it is only a root form.
Let us work on words in plural form. There are 26 words, from a to z, in plural form on which to perform Porter stemming, as sketched below:
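The original 26-word list and loop are not reproduced in this extract; a shortened, illustrative sketch (only a few of the plural words are shown, and the alias p_stem follows the earlier cells) is:

from nltk.stem import PorterStemmer as p_stem

w_plu = ["apes", "bags", "computers", "dogs", "egos", "jungles", "queries"]   # illustrative subset
print(" ".join(p_stem().stem(w) for w in w_plu))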
Out[16] ape bag comput dog ego fresco gener hat igloo jungl kite learner mice nativ
open photo queri rat scene tree utensil vein well xylophon yoyo zen
Porter stemming removes the suffixes -s or -es to extract the root form, which may simply give the singular form, as for apes, bags, dogs, etc. But in some cases it generates non-English words such as gener, jungl, and queri.
12.3.5 Snowball Stemmer
Use same list of plural words (w_plu) to check how it works in Snowball
Stemmer for comparison:
Out[21] ape bag comput dog ego fresco generous hat igloo jungl kite learner mice
nativ open photo queri rat scene tree utensil vein well xylophon yoyo zen
In NLP preprocessing, impractical stop-words such as a, is, the, and of are filtered out from the input words and utterances.
NLTK provides a built-in stop-words package for this function. Let us see how it works.
Import the stopwords module and call the stopwords.words() method to list all stop-words in English.
Out[22]
The above list shows all stop-words. Let us use a simple utterance:
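The utterance-filtering cells are not reproduced; a sketch with a hypothetical utterance is:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

utt = "I drive my car to the car center for car washing every weekend."   # hypothetical utterance
sw = set(stopwords.words('english'))
[w for w in word_tokenize(utt) if w.lower() not in sw]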
Review the results:
1. All commonly used stop-words such as to, for, the, and it are removed, as shown in the example.
2. This has little effect on the overall meaning of the utterance.
3. It reduces the computational time and effort required for subsequent processing.
The following example uses Hamlet from The Complete Works of Shakespeare to demonstrate how stop-words are removed in NLP text processing.
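The loading and cleaning cells are omitted in this extract; a minimal sketch using NLTK's Gutenberg corpus (the corpus file name is assumed to be 'shakespeare-hamlet.txt') is:

from nltk.corpus import gutenberg, stopwords

hamlet = gutenberg.words('shakespeare-hamlet.txt')
sw = set(stopwords.words('english'))
hamlet_clean = [w for w in hamlet if w.lower() not in sw]
len(hamlet_clean) * 100.0 / len(hamlet)     # percentage of tokens kept after removal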
In[27] len(hamlet_clean)*100.0/len(hamlet)
Out[27] 69.26124197002142
The stop-word corpus can be extracted as a list of strings, to which any stop-word can be added with the simple append() function; it is advisable to create a new stop-word list object first.
12.4.4.2 Step 2: Check Object Type and Will See It Has a Simple List
In[29] My_sws?
In[30] My_sws
Out[30]
In[31] My_sws.append('sampleSW')
My_sws[160:]
Out[31]
When text data has been processed and tokenized, basic analysis is required to calculate the words or tokens, their distribution, and their usage frequency for NLP tasks. This allows an accurate understanding of the main contents and topics of the document. Import a sample webtext (Firefox.txt) from the NLTK library.
It can also obtain vocabulary size by passing through a set as shown in the fol-
lowing code:
Out[37]
The above code generates the top 30 most frequently used words and punctuation marks in the whole text. in, to, and the are the top 3 on the list, as in other literatures, since the Firefox.txt text is a conversation-like collection of users' discussion messages about the Firefox browser.
To exclude stop-words such as the and and, use the following code to see the frequency distribution of words longer than three characters.
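The corresponding cell is not reproduced; a sketch (the variable name firefox for the webtext token list is an assumption) is:

sw = set(stopwords.words('english'))
long_words = [w.lower() for w in firefox if len(w) > 3 and w.lower() not in sw]
fd_long = nltk.FreqDist(long_words)
fd_long.plot(30)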
Out[38]
Exclude stop-words such as the, and, and is, and create a tuple dictionary to record word frequencies. Visualize them by transforming this dictionary into an NLTK frequency distribution graph as shown above.
12.6.1 What Is WordCloud?
A word cloud, also known as a tag cloud, is a data visualization method commonly used in many web statistics and data analysis scenarios. It is a graphical representation of all words and keywords in different sizes and colors. A word shown largest and in bold in a word cloud occurs most frequently in the text (dataset).
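A minimal sketch with the third-party wordcloud package (assumed to be installed, e.g. via pip install wordcloud, and reusing the long_words list sketched above) is:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(width=800, height=400, background_color="white").generate(" ".join(long_words))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()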
The earlier part of this workshop studied several NLP preprocessing tasks: tokenization, stemming, stop-word removal, word distribution in a text corpus, and data visualization using WordCloud. This section will explore POS tagging in NLTK.
The English Penn Treebank Tagset is used with English corpora developed by Prof. Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart (TreeBank 2022). Figure 12.4 shows the original set of 45 Penn Treebank tags.
A recent version of this English POS tagset can be found at Sketchengine.eu (Sketchengine 2022a), together with a Chinese POS tagset (Sketchengine 2022b).
NLTK provides a direct mapping from a tagged corpus such as the Brown Corpus (NLTK 2022) to universal tags, e.g. the tags VBD (past-tense verb) and VB (base-form verb) both map to VERB in the universal tagset.
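The cell that loads the Brown corpus as bwn is not shown; it presumably imports the corpus as below, and the universal mapping can be requested with the tagset argument (an illustrative call).

from nltk.corpus import brown as bwn
# nltk.download('brown')                         # corpus download, required once
bwn.tagged_words(tagset='universal')[0:10]       # the same words mapped to universal tags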
In[43] bwn.tagged_words()[0:40]
Out[43]
In the example code above, Fulton is tagged as NP-TL, a proper noun (NP) appearing in a title (TL) context in the Brown corpus, which maps to NOUN in the universal tagset. These subcategories should be considered instead of the generalized universal tags in NLP applications.
POS tagging is commonly used in many NLP applications, ranging from Information Extraction (IE) and Named Entity Recognition (NER) to Sentiment Analysis and Question-and-Answering systems.
Try the following and see how it works:
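The tokenization and tagging cells are not reproduced; a sketch consistent with Out[50] is:

utt = "Can you please buy me Haagen-Dazs Icecream? It's $30.8."
tokens = nltk.word_tokenize(utt)
print("Tokens are:", tokens)
# nltk.download('averaged_perceptron_tagger')   # tagger model, required once
nltk.pos_tag(tokens)                            # POS tags from NLTK's default tagger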
Out[50] Tokens are: ['Can', 'you', 'please', 'buy', 'me', 'Haagen-Dazs', 'Icecream', '?',
'It', "'s", '$', '30.8', '.']
1. The system treats ‘$’, ‘30.8’, and ‘.’ as separate tokens in this example. This is crucial because contractions have their own semantic meanings and their own POS, leading to the ensuing part on the NLTK library POS tagger.
2. The POS tagger in the NLTK library outputs specific tags for certain words.
3. However, it makes a mistake in this example. Where is it?
4. Compare the POS tagging for the following sentence to identify the problem. Explain.
This section will create our own POS tagger using NLTK's tagged corpora and the sklearn Random Forest machine learning model.
The following example demonstrates a classification task to predict the POS tag of a word in a sentence, using the NLTK treebank dataset for POS tagging and extracting word prefixes, suffixes, and previous and neighboring words as features for system training.
Import all necessary Python packages as below:
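The import cell is not reproduced in this extract; a plausible set of imports for this task (the aliases are assumptions based on the code that follows) is:

import nltk
import numpy as np
import matplotlib.pyplot as pyplt
from nltk.corpus import treebank
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score as a_score, confusion_matrix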
In[57] utt_tagged
Out[57]
for ut in utt_tag:
    for idx in range(len(ut)):
        utt.append(ufeatures(RUutterance(ut), idx))
        tag.append(ut[idx][1])
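The helper functions ufeatures() and RUutterance() are defined in cells not reproduced here; a simplified sketch of what they might compute (prefixes, suffixes, and neighbouring words, as described above) is:

def RUutterance(tagged_utt):
    # Recover the raw, untagged utterance: a list of words without their tags
    return [w for w, t in tagged_utt]

def ufeatures(words, idx):
    # Build a feature dictionary for the word at position idx
    w = words[idx]
    return {
        'word': w.lower(),
        'prefix1': w[:1], 'prefix2': w[:2],
        'suffix2': w[-2:], 'suffix3': w[-3:],
        'prev_word': '' if idx == 0 else words[idx - 1].lower(),
        'next_word': '' if idx == len(words) - 1 else words[idx + 1].lower(),
        'is_first': idx == 0,
        'is_capitalized': w[0].isupper(),
    }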
This example uses DVect to convert the feature-value dictionaries into training vectors. If the number of possible values for the suffix3 feature is 40, there will be 40 features in the output. Use the following code for DVect:
Xtran = dvect.fit_transform(X[0:nsize])
ysap = y[0:nsize]
This example has a sample size of 10,000 utterances, of which 80% of the dataset is used for training and the other 20% for testing. An RF (Random Forest) Classifier is used as the POS tagger model as shown:
Out[62] RandomForestClassifier(n_jobs=4)
After system training, POS tagger validation can be performed using some sample utterances. But before passing them to the ptag_predict() method, the features must be extracted with the ufeatures() method as shown:
In[66] a_score(ytest,predict)
Out[66] 0.9355
The overall a_score shows approximately 93.6% accuracy, which is satisfactory. Next, let us look at the confusion matrix (c_mat) to check how well the POS tagger performs.
In[68] pyplt.figure(figsize=(10,10))
pyplt.xticks(np.arange(len(rfclassifier.classes_)),rfclassifier.classes_)
pyplt.yticks(np.arange(len(rfclassifier.classes_)),rfclassifier.classes_)
pyplt.imshow(c_mat, cmap=pyplt.cm.Blues)
pyplt.colorbar()
Out[68]
Use the classes from the random forest classifier as the x- and y-axis tick labels when plotting the confusion matrix.
It looks like the tagger performs relatively well for nouns, verbs, and determiners in sentences, as reflected in the dark regions of the plot. Let's look at some of the top features of the model with the following code:
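The workshop listing for this step is not reproduced above; a minimal sketch of inspecting the most important features might look as follows (older scikit-learn versions use get_feature_names() instead of get_feature_names_out()):

import numpy as np

fnames = np.array(dvect.get_feature_names_out())
top10 = np.argsort(rfclassifier.feature_importances_)[::-1][:10]
for name, score in zip(fnames[top10], rfclassifier.feature_importances_[top10]):
    print(f"{name}: {score:.4f}")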
References
Albrecht, J., Ramachandran, S. and Winkler, C. (2020) Blueprints for Text Analytics Using
Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications.
O’Reilly Media.
Antic, Z. (2021) Python Natural Language Processing Cookbook: Over 50 recipes to understand,
analyze, and generate text for implementing language processing tasks. Packt Publishing.
Bird, S., Klein, E., and Loper, E. (2009). Natural language processing with python. O'Reilly.
Doyle, A. C. (2019) The Adventures of Sherlock Holmes (AmazonClassics Edition).
AmazonClassics.
Gutenberg (2022) Project Gutenberg official site. https://www.gutenberg.org/ Accessed 16
June 2022.
Hardeniya, N., Perkins, J. and Chopra, D. (2016) Natural Language Processing: Python and
NLTK. Packt Publishing.
Kedia, A. and Rasu, M. (2020) Hands-On Python Natural Language Processing: Explore tools and
techniques to analyze and process text with a view to building real-world NLP applications.
Packt Publishing.
NLTK (2022) NLTK official site. https://www.nltk.org/. Accessed 16 June 2022.
Perkins, J. (2014). Python 3 text processing with NLTK 3 cookbook. Packt Publishing Ltd.
Sketchengine (2022a) Recent version of English POS Tagset by Sketchengine. https://www.
sketchengine.eu/english-treetagger-pipeline-2/. Accessed 21 June 2022.
Sketchengine (2022b) Recent version of Chinese POS Tagset by Sketchengine. https://www.
sketchengine.eu/chinese-penn-treebank-part-of-speech-tagset/. Accessed 21 June 2022.
Treebank (2022) Penn TreeBank Release 2 official site. https://catalog.ldc.upenn.edu/docs/
LDC95T7/treebank2.index.html. Accessed 21 June 2022.
Chapter 13
Workshop#4 Semantic Analysis and Word
Vectors Using spaCy (Hour 7–8)
13.1 Introduction
In Chaps. 5 and 6, we studied the basic concepts and theories related to meaning representation and semantic analysis. This workshop explores how to use spaCy technology to perform semantic analysis, starting with a revisit of the word vector concept and its pre-trained implementations, followed by the study of the similarity method and other advanced semantic analyses.
Word vectors (Albrecht et al. 2020; Bird et al. 2009; Hardeniya et al. 2016; Kedia
and Rasu 2020; NLTK 2022) are practical tools in NLP.
A word vector is a dense representation of a word. Word vectors are important for semantic similarity applications such as similarity calculations between words, phrases, sentences, and documents, e.g. they provide information about synonymy and semantic analogies at the word level.
Word vectors are produced by algorithms that exploit the fact that similar words appear in similar contexts. This paradigm captures a target word's meaning by collecting information from its surrounding words, an approach called distributional semantics.
They are accompanied by associated semantic similarity methods, including word vector computations such as distance and analogy calculations and visualization, to solve NLP problems.
In this workshop, we are going to cover the following main topics (Altinok 2021;
Arumugam and Shanmugamani 2018; Perkins 2014; spaCy 2022; Srinivasa-
Desikan 2018; Vasiliev 2020):
• Understanding word vectors.
• Using spaCy’s pre-trained vectors.
• Advanced semantic similarity methods.
Since each row corresponds to one word, a sentence is represented as a matrix, e.g. I play tennis today is represented by the matrix in Fig. 13.3.
The vector length equals the number of words in the vocabulary, as shown above, and each dimension is assigned to exactly one word. When applying this one-hot vectorization to text, each word is replaced by its vector and the sentence is transformed into an (N, V) matrix, where N is the number of words in the sentence and V is the vocabulary size. This text representation is easy to compute, debug, and interpret. It looks good so far, but there are potential problems (a small sketch follows the list below):
• Vectors are sparse. Each vector contains many 0s and only a single 1. Words with similar meanings could share dimensions, so this representation wastes space; moreover, numerical algorithms generally do not cope well with high-dimensional, sparse vectors.
• A sizeable vocabulary implies high-dimensional vectors that are impractical for memory storage and computation.
• Similar words are not assigned similar vectors, so the resulting vectors are not meaningful, e.g. cheese, topping, salami, and pizza have related meanings but unrelated vectors. Each vector depends only on the corresponding word's index, which is assigned arbitrarily in the vocabulary, showing that one-hot encoded vectors are incapable of capturing semantic relationships, contrary to the purpose of word vectors outlined above.
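A small sketch of one-hot vectorization with a toy vocabulary, illustrating the sparse (N, V) matrix described above:

import numpy as np

vocab = ["I", "play", "tennis", "today", "cheese", "pizza"]
word2idx = {w: i for i, w in enumerate(vocab)}

sentence = "I play tennis today".split()
matrix = np.zeros((len(sentence), len(vocab)), dtype=int)
for row, word in enumerate(sentence):
    matrix[row, word2idx[word]] = 1
print(matrix)   # a 4 x 6 matrix with a single 1 per row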
This is a 50-dimensional vector for the word the; its dimensions hold floating-point values:
1. What do the dimensions represent?
2. Typically, individual dimensions have no inherent meaning; instead, they represent locations in the vector space, and the distance between vectors indicates the similarity of the corresponding words' meanings.
3. Hence, a word's meaning is distributed across dimensions.
4. This type of word-meaning representation is called distributional semantics.
Use Google's word vector visualizer for TensorFlow (TensorFlow 2022), which offers word vectors for 10,000 words. Each vector is 200-dimensional and is projected into three dimensions for visualization. Let us look at the representation of tennis in Fig. 13.4:
tennis is semantically grouped with other sports, i.e. hockey, basketball, chess, etc. Words in proximity are ranked by their cosine distances, as shown in Fig. 13.5.
Word vectors are trained on a large corpus such as Wikipedia, which also allows proper-noun representations to be learnt, e.g. Alice is a proper noun represented by the vector in Fig. 13.6:
It shows that all vocabulary input words are lower-cased to avoid multiple representations of the same word. Alice and Bob are listed as person names. In addition, lewis and carroll are relevant to Alice because of the famous literature Alice's Adventures in Wonderland written by Lewis Carroll. Further, it also shows that the syntactic category of all neighboring words is noun rather than verb.
Word vectors can capture synonyms, antonyms, and semantic categories such as animals, places, plants, names, and abstract concepts.
Word vectors capture semantics and support vector addition, subtraction, and analogies. A word analogy is a semantic relationship between a pair of words. There are many relationship types, such as synonymy, antonymy, and whole-part relations. Some example pairs are (King-man, Queen-woman), (airplane-air, ship-sea), (fish-sea, bird-air), (branch-tree, arm-human), (forward-backward, absent-present), etc.
For example, the gender mapping relates Queen and King as Queen - Woman + Man = King: if Woman is subtracted from Queen and Man is added instead, the result is close to King. This analogy reads as queen is to king as woman is to man. Embeddings can capture analogies such as gender, tense, and capital city, as shown in Fig. 13.7:
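A minimal sketch of checking the queen - woman + man ≈ king analogy with spaCy vectors (it assumes the en_core_web_md model is installed; the exact score depends on the model):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")
queen, woman, man, king = (nlp.vocab[w].vector for w in ("queen", "woman", "man", "king"))

candidate = queen - woman + man
cos = np.dot(candidate, king) / (np.linalg.norm(candidate) * np.linalg.norm(king))
print(f"similarity to 'king': {cos:.3f}")   # expected to be relatively high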
Word vectors are part of many spaCy language models. For instance, the en_core_web_md model ships with 300-dimensional vectors for 20,000 words, while the en_core_web_lg model ships with 300-dimensional vectors for a 685,000-word vocabulary.
Typically, small models (names ending with sm) do not include word vectors but only context-sensitive tensors. Semantic similarity calculations can still be performed, but the results will not be as accurate as word-vector computations.
A word's vector is accessed via the token.vector attribute. Let us look at this attribute by querying the word vector for banana:
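The cell that creates utt1 is not shown above; a minimal sketch, assuming the en_core_web_md model and a sentence whose fourth token is banana:

import spacy

nlp = spacy.load("en_core_web_md")
utt1 = nlp("I like a banana")   # utt1[3] is the token "banana"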
In[3] utt1[3].vector
Out[3]
In[4] type(utt1[3].vector)
Out[4] numpy.ndarray
In[5] utt1[3].vector.shape
Out[5] (300,)
Query the Python type of the word vector in this code segment, then inspect the shape attribute of the NumPy array holding the vector.
Doc and Span objects also have vectors. The vector of a sentence (or of a span) is the average of its words' vectors. Run the following code and view the results:
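A minimal sketch of such a check (using the nlp object loaded above; the sentence is illustrative):

utt2 = nlp("I visited the zoo with my family")
print(utt2.vector.shape)        # Doc vector: average of the token vectors, (300,)
print(utt2[1:4].vector.shape)   # Span vector, also (300,)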
Only words in the model's vocabulary have vectors; words that are not in the vocabulary are called out-of-vocabulary (OOV) words. token.is_oov and token.has_vector are two attributes to query whether a token is in the model's vocabulary and whether it has a word vector:
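A minimal sketch of the OOV check (the word afskfsd is a made-up example):

utt3 = nlp("I like afskfsd")
for token in utt3:
    print(token.text, token.has_vector, token.is_oov)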
This is basically how to use spaCy's pre-trained word vectors. Next, we discover how to invoke spaCy's semantic similarity method on Doc, Span, and Token objects.
In spaCy, every container-type object has a similarity method that calculates its semantic similarity to other container objects by comparing word vectors. Semantic similarity can also be calculated between two containers of different types, for instance a Token object against a Doc object, or a Doc object against a Span object.
The following example computes similarity between two tokens and between two Doc objects:
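The two Doc objects used below can be created as sketched here, based on the sentences quoted in the explanation after the output:

utt4 = nlp("I visited England")
utt5 = nlp("I went to London")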
In[10] utt4[2]
Out[10] England
In[11] utt4[2].similarity(utt5[3])
Out[11] 0.7389127612113953
In[12] utt4.similarity(utt5)
Out[12] 0.8771558796234277
1. The preceding code segment calculates the semantic similarity between the two sentences I visited England and I went to London.
2. The similarity score is high enough to consider both sentences similar (the similarity degree ranges from 0 to 1, where 0 means unrelated and 1 means identical).
In[13] utt4.similarity(utt4)
Out[13] 1.0
Judging distances from raw numbers is difficult, but plotting the vectors makes it easier to see how groups of vocabulary words are formed.
The code snippet below visualizes a vocabulary with two semantic classes: the first word class is animals and the second is food.
vocab = nlp( "cat dog tiger elephant bird monkey lion cheetah burger pizza
food cheese wine salad noodles macaroni fruit vegetable" )
Use PCA (Principal Component Analysis) to reduce the vectors to two dimensions and plot the result with the plt class (a sketch is shown below).
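A minimal sketch of this visualization, assuming scikit-learn's PCA and matplotlib (variable names other than vocab are assumptions):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

words = [token.text for token in vocab]
vectors = np.array([token.vector for token in vocab])

coords = PCA(n_components=2).fit_transform(vectors)
plt.figure(figsize=(8, 6))
plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()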
The plot shows that spaCy word vectors separate the two semantic classes into visible groups. The distances between the animal words are small and fairly uniform, while the food class forms sub-groups within the group.
We have learnt that spaCy's similarity method can calculate semantic similarity scores, but there are also more advanced semantic similarity methods for comparing words, phrases, and sentences.
13.9.2 Euclidean Distance
bigger due to geometry, although the semantics of the first piece of text (now dog canine terrier) remain the same.
This is the main drawback of using Euclidean distance for semantic similarity: the orientation of the two vectors in the space is not considered. Figure 13.9 illustrates the distance between dog and cat, and the distance between dog canine terrier and cat.
How can we fix this problem? There is another way of calculating similarity, called cosine similarity, that addresses it.
Contrary to Euclidean distance, cosine similarity is concerned with the orientation of the two vectors in the space: the cosine similarity of two vectors is simply the cosine of the angle formed by them. Figure 13.10 shows the angle between the dog and cat vectors:
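A minimal sketch contrasting the two measures, assuming en_core_web_md vectors; summing (rather than averaging) the three dog-related vectors mimics the longer text growing in magnitude while keeping the same meaning:

import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = nlp("cat").vector
dog = nlp("dog").vector
dog3 = nlp("dog").vector + nlp("canine").vector + nlp("terrier").vector

print(euclidean(dog, cat), euclidean(dog3, cat))   # the Euclidean distance grows
print(cosine(dog, cat), cosine(dog3, cat))         # the cosine similarity stays close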
If we are interested in finding the biggest mammals on the planet, the phrases biggest mammals and in the world will be the keywords. Comparing these phrases with the search phrases largest mammals and on the planet should give a high similarity score. But if we are interested in finding out about places in the world, California will be the keyword. California is semantically similar to the word geography and, more precisely, its entity type is a geographical noun.
Since we have learnt how to calculate similarity scores, the next section examines where to look for the meaning. It covers a case study on text categorization and then improves the task results via key phrase extraction combined with similarity score calculations.
Determining the semantic similarity of two sentences can be used to categorize texts into predefined categories or to spot only the relevant texts. This case study will filter users' comments on an e-commerce website related to the word perfume. Suppose we evaluate the following comments:
Here, it is noted that only the second sentence is related, because it contains the word fragrance and adjectives describing scents. To understand which sentences are related, we can try several comparison strategies.
To start, compare perfume to each sentence. Recall that spaCy generates a vector for a sentence by averaging the word vectors of its tokens. The following code snippet compares the preceding sentences to the perfume search key:
In[17] utt6 = nlp( "I purchased a science fiction book last week. I loved everything
related to this fragrance: light, floral and feminine... I purchased a bottle of
wine. " )
key = nlp( "perfume" )
for utt in utt6.sents:
print(utt.similarity(key))
Out[17]
spaCy extracts noun phrases by parsing the output of the dependency parser. The noun phrases of a sentence can be identified using the doc.noun_chunks property:
In[18] utt7 = nlp( "My beautiful and cute dog jumped over the fence" )
In[19] utt7.noun_chunks
Out[19] <generator at 0x1932f2de900>
In[20] list(utt7.noun_chunks)
Out[20] [My beautiful and cute dog, the fence]
Let us modify the preceding code snippet. Instead of comparing the search key perfume to the entire sentence, this time we will compare it only with each sentence's noun chunks:
Out[21] 0.27409999728254997
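The cell In[21] itself is not reproduced above; a minimal sketch of such a noun-chunk comparison might look like this (the scores quoted in the text are from the book's own run):

for utt in utt6.sents:
    chunk_scores = [key.similarity(chunk) for chunk in utt.noun_chunks]
    print(max(chunk_scores) if chunk_scores else 0.0)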
If these scores are compared with the previous scores, it is noted that the first sentence remains irrelevant, so its score decreased marginally, but the second sentence's score increased significantly. Also, the second and third sentences' scores are now further apart, reflecting that the second sentence is the most related one.
In some cases, it can be useful to focus on proper nouns instead of every noun, which means extracting named entities. Let us compare the following paragraphs:
The code should be able to recognize that the first two paragraphs are about large technology companies and their products, whereas the third paragraph is about a geographic location.
Comparing all noun phrases in these sentences may not be helpful because many of them, such as volume, are irrelevant to categorization. The topics of these paragraphs are determined by the phrases within them, that is, Google Search, Google, Microsoft Bing, Microsoft, Windows, Dead Sea, Jordan Valley, and Israel. spaCy can identify these entities:
In[22] utt8 = nlp( "Google Search, often referred as Google, is the most popular
search engine nowadays. It answers a huge volume of queries every day." )
utt9 = nlp( "Microsoft Bing is another popular search engine. Microsoft is
known by its star product Microsoft Windows, a popular operating system
sold over the world." )
utt10 = nlp( "The Dead Sea is the lowest lake in the world, located in the
Jordan Valley of Israel. It is also the saltiest lake in the world." )
In[23] utt8.ents
Out[23] (Google Search, Google, every day)
In[24] utt9.ents
Out[24] (Microsoft Bing, Microsoft, Microsoft Windows)
In[25] utt10.ents
Out[25] (the Jordan Valley, Israel)
Since the entities have been extracted for comparison, let's calculate the similarity scores:
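The step that creates ents1, ents2, and ents3 is not shown above; a minimal sketch, assuming each is a Doc built from the joined entity texts of the corresponding paragraph:

ents1 = nlp(" ".join(ent.text for ent in utt8.ents))
ents2 = nlp(" ".join(ent.text for ent in utt9.ents))
ents3 = nlp(" ".join(ent.text for ent in utt10.ents))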
In[27] ents1.similarity(ents2)
Out[27] 0.5394545341415748
In[28] ents1.similarity(ents3)
Out[28] 0.48605042335384385
In[29] ents2.similarity(ents3)
Out[29] 0.39674953175052086
These figures reveal that the highest level of similarity exists between the first and second paragraphs, which are both about large tech companies, while the third paragraph is unlike the others. How can this result be obtained using word vectors only? It is probably because the words Google and Microsoft often appear together in news and other social media text corpora, hence producing similar word vectors.
This concludes the advanced semantic similarity methods section, which covered different ways to combine word vectors with linguistic features such as key phrases and named entities.
References
Albrecht, J., Ramachandran, S. and Winkler, C. (2020) Blueprints for Text Analytics Using
Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications.
O’Reilly Media.
Altinok, D. (2021) Mastering spaCy: An end-to-end practical guide to implementing NLP applica-
tions using the Python ecosystem. Packt Publishing.
Arumugam, R., & Shanmugamani, R. (2018). Hands-on natural language processing with python.
Packt Publishing.
Bird, S., Klein, E., and Loper, E. (2009). Natural language processing with python. O'Reilly.
Doyle, A. C. (2019) The Adventures of Sherlock Holmes (AmazonClassics Edition).
AmazonClassics.
FastText (2022) FastText official site. https://fasttext.cc/. Accessed 22 June 2022.
Gutenberg (2022) Project Gutenberg official site. https://www.gutenberg.org/ Accessed 16
June 2022.
Hardeniya, N., Perkins, J. and Chopra, D. (2016) Natural Language Processing: Python and
NLTK. Packt Publishing.
Kedia, A. and Rasu, M. (2020) Hands-On Python Natural Language Processing: Explore tools and
techniques to analyze and process text with a view to building real-world NLP applications.
Packt Publishing.
NLTK (2022) NLTK official site. https://www.nltk.org/. Accessed 16 June 2022.
Perkins, J. (2014). Python 3 text processing with NLTK 3 cookbook. Packt Publishing Ltd.
SpaCy (2022) spaCy official site. https://spacy.io/. Accessed 16 June 2022.
Chapter 14
Workshop#5 Sentiment Analysis and Text Classification with LSTM Using spaCy
14.1 Introduction
NLTK and spaCy are the two major NLP Python implementation tools for basic text processing, N-gram modeling, POS tagging, and semantic analysis introduced in the last four workshops. Workshop 5 explores how to apply these NLP implementation techniques to two important NLP applications: text classification and sentiment analysis. TensorFlow and Keras are two vital components used to implement Long Short-Term Memory networks (LSTM networks), a commonly used type of Recurrent Neural Network (RNN) in machine learning, especially in NLP applications.
This workshop will:
1. study text classification concepts in NLP and how spaCy NLP pipeline works on
text classifier training.
2. use movie reviews as a problem domain to demonstrate how to implement senti-
ment analysis with spaCy.
3. introduce Artificial Neural Networks (ANN) concepts, TensorFlow, and Keras technologies.
4. introduce sequential modeling scheme with LSTM technology using movie
reviews domain as example to integrate these technologies for text classification
and movie sentiment analysis.
Sequential data modelling with LSTM technology is used to process text for machine learning tasks with Keras' text preprocessing module and to implement a neural network with tf.keras.
This workshop will cover the following key topics:
• Basic concept and knowledge of text classification.
• Model training of spaCy text classifier.
• Sentiment Analysis with spaCy.
• Sequential modeling with LSTM Technology.
14.3 Technical Requirements
Codes for training spaCy text classifier and sentiment analysis are spaCy v3.0 com-
patible. Text classification with spaCy and Keras requires Python libraries as
follows:
• TensorFlow (version 2.3 or above)
• NumPy
• pandas
• Matplotlib
If any of these packages are not installed on your PC/notebook, use the pip install command to install them.
Text Classification (Albrecht et al. 2020; Bird et al. 2009; George 2022; Sarkar
2019; Siahaan and Sianipar 2022; Srinivasa-Desikan 2018) is the task of assigning
a set of predefined labels to text.
Texts were traditionally classified by manual tagging, but machine learning techniques are increasingly applied to train classification systems on known examples (training samples) so that they can classify unseen cases. Text classification is a fundamental task of NLP (Perkins 2014; Sarkar 2019) performed with various machine learning methods such as LSTM technology (Arumugam and Shanmugamani 2018; Géron 2019; Kedia and Rasu 2020).
Common text classification types are (Agarwal 2020; George 2022; Pozzi et al. 2016):
• Language detection, the first step of many NLP systems, e.g. machine translation.
• Topic generation and detection, the process of summarizing or classifying a batch of sentences, paragraphs, or texts into certain Topics of Interest (TOI) or topic titles, e.g. customers' emails requesting refunds or complaining about products or services.
Fig. 14.1 Example of topic detection for a customer complaint in a CSAS (Customer Service Automation System)
Fig. 14.2 Sample input texts and their corresponding output class labels
Text classification can also be divided into (1) binary, (2) multi-class, and (3) multi-label categories:
1. Binary text classification categorizes texts into two classes.
2. Multi-class text classification categorizes texts into more than two classes. The classes are mutually exclusive, so each text is associated with a single class, e.g. customer reviews rated with 1–5 stars carry a single class label.
3. Multi-label text classification generalizes its multi-class counterpart: more than one label can be assigned to each example text, e.g. toxic, severe toxic, insult, threat, and obscenity levels of negative sentiment.
What are Labels in Text Classification?
Labels are class names for the output. A class label can be categorical (a string) or numerical (a number).
Text classification commonly uses the following class labels:
• Sentiment analysis has positive and negative class labels, abbreviated pos and neg; equivalently, 0 represents negative sentiment and 1 represents positive sentiment, and such binary class labels are popular as well.
• The same numeric representation applies to other binary classification problems, i.e. 0 and 1 are used as class labels.
• Classes are labelled with meaningful names for multi-class and multi-label problems, e.g. a movie genre classifier has labels such as action, scifi, weekend, Sunday movie, etc. Numbers can also serve as labels, e.g. 1–5 for a five-class classification problem.
14.5.1 TextCategorizer Class
There are two parameters required for a TextCategorizer component configuration: (1) a threshold value and (2) a model name (single- or multi-label, depending on the classification task).
TextCategorizer generates a probability for each class, and a class is assigned to the text if its probability is higher than the threshold value.
A traditional threshold value for text classification is 0.5; however, if predictions with higher confidence are required, the threshold can be raised to 0.6–0.8.
A single-label TextCategorizer (tCategorizer) component is added to the nlp pipeline as sketched below:
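A minimal sketch of this step in the spaCy v3 style (the base pipeline en_core_web_md is an assumption; any English pipeline can be used):

import spacy
from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL

nlp = spacy.load("en_core_web_md")
config = {"threshold": 0.5, "model": DEFAULT_SINGLE_TEXTCAT_MODEL}
tCategorizer = nlp.add_pipe("textcat", config=config)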
In[6] tCategorizer
Out[6] <spacy.pipeline.textcat.TextCategorizer at 0x1bf406cedc0>
In[9] tCategorizer
Out[9]
A TextCategorizer pipeline component is added to the nlp pipeline object in the last line of each of the preceding code blocks. The newly created TextCategorizer component is captured in the tCategorizer variable, ready for training.
In[10] movie_comment1 = [
("This movie is perfect and worth watching. ",
{"cats": {"Positive Sentiment": 1}}),
("This movie is great, the performance of Al Pacino is brilliant.",
{"cats": {"Positive Sentiment": 1}}),
("A very good and funny movie. It should be the best this year!",
{"cats": {"Positive Sentiment": 1}}),
("This movie is so bad that I really want to leave after the first hour
watching.", {"cats": {"Positive Sentiment": 0}}),
("Even free I won't see this movie again. Totally failure!",
{"cats": {"Positive Sentiment": 0}}),
("I think it is the worst movie I saw so far this year.",
{"cats": {"Positive Sentiment": 0}})
]
This code introduces the class category used by the TextCategorizer component.
In[14] nlp.pipe_names
Out[14] ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer',
'ner', 'textcat']
In[16] movie_comment_exp
Out[16]
14.5.3 System Training
In[17] movie_comment1
Out[17] [('This movie is perfect and worth watching. ',
{'cats': {'Positive Sentiment': 1}}),
('This movie is great, the performance of Al Pacino is brilliant.',
{'cats': {'Positive Sentiment': 1}}),
('A very good and funny movie. It should be the best this year!',
{'cats': {'Positive Sentiment': 1}}),
('This movie is so bad that I really want to leave after the first hour watch-
ing.',
{'cats': {'Positive Sentiment': 0}}),
("Even free I won't see this movie again. Totally failure!",
{'cats': {'Positive Sentiment': 0}}),
('I think it is the worst movie I saw so far this year.',
{'cats': {'Positive Sentiment': 0}})]
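The training loop itself is not reproduced here; a minimal sketch along the lines of spaCy v3's training API (the number of epochs is an assumption) is:

import random
from spacy.training import Example

tCategorizer.add_label("Positive Sentiment")
train_examples = [
    Example.from_dict(nlp.make_doc(text), annots) for text, annots in movie_comment1
]
tCategorizer.initialize(lambda: train_examples, nlp=nlp)

epochs = 20
with nlp.select_pipes(enable="textcat"):
    optimizer = nlp.resume_training()
    for i in range(epochs):
        random.shuffle(train_examples)
        losses = {}
        for example in train_examples:
            nlp.update([example], sgd=optimizer, losses=losses)
        print(losses)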
14.5.4 System Testing
Let us test the new TextCategorizer component; the doc.cats property holds the class labels:
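A minimal test sketch (the example sentence and the score are illustrative only):

test_utt = nlp("This movie was wonderful, I enjoyed every minute of it.")
print(test_utt.cats)   # e.g. {'Positive Sentiment': 0.95}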
The small dataset trained the spaCy text classifier successfully for a binary text classification problem, and it performs correct sentiment analysis. Now, let us perform multi-label classification.
Multi-label classification means the classifier can predict more than one label for an example text; naturally, the classes are not mutually exclusive.
Provide training samples covering at least two categories to train a multi-label classifier.
Construct a small training set to train spaCy's TextCategorizer for multi-label classification. This time we will form a set of movie reviews, where the categories are:
• ACTION.
• SCIFI.
• WEEKEND.
Here is a small sample dataset (movie_comment2):
In[21] movie_comment2 = [
("This movie is great for weekend watching.",
{"cats": {"WEEKEND": True}}),
("This a 100% action movie, I enjoy it.",
{"cats": {"ACTION": True}}),
("Avatar is the best Scifi movie I ever seen!" ,
{"cats": {"SCIFI": True}}),
("Such a good Scifi movie to watch during the weekend!",
{"cats": {"WEEKEND": True, "SCIFI": True}}),
("Matrix a great Scifi movie with a lot of action. Pure action, great!",
{"cats": {"SCIFI": True, "ACTION": True}})
]
In[22] movie_comment2
Out[22] [('This movie is great for weekend watching.', {'cats': {'WEEKEND':
True}}),
('This a 100% action movie, I enjoy it.', {'cats': {'ACTION': True}}),
('Avatar is the best Scifi movie I ever seen!', {'cats': {'SCIFI': True}}),
('Such a good Scifi movie to watch during the weekend!',
{'cats': {'WEEKEND': True, 'SCIFI': True}}),
('Matrix a great Scifi movie with a lot of action. Pure action, great!',
{'cats': {'SCIFI': True, 'ACTION': True}})]
In[23] movie_comment2[1]
Out[23] ('This a 100% action movie, I enjoy it.', {'cats': {'ACTION': True}})
Provide examples with a single label, such as the first example (the first sentence of movie_comment2, the second line of the preceding code block), and examples with more than one label, such as the fourth example of movie_comment2.
After the training set is formed, import the multi-label model instead of the single-label model; note that this is the only line that differs from the previous section.
Next, add the multi-label classifier component to the nlp pipeline. Also note that the pipeline component name is textcat_multilabel, compared with the previous section's textcat (a sketch is shown below):
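A minimal sketch of these two steps in the spaCy v3 style:

from spacy.pipeline.textcat_multilabel import DEFAULT_MULTI_TEXTCAT_MODEL

config = {"threshold": 0.5, "model": DEFAULT_MULTI_TEXTCAT_MODEL}
tCategorizer = nlp.add_pipe("textcat_multilabel", config=config)
for label in ("ACTION", "SCIFI", "WEEKEND"):
    tCategorizer.add_label(label)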
The output should look like the output of the previous section, but multiple categories are used for system training. Let's test the new multi-label classifier:
Although the sample size is small, the multi-label TextCategorizer can classify two IMDB user comments correctly into the three categories SCIFI, ACTION, and WEEKEND. Note that thousands of IMDB user comments are required to perform satisfactory sentiment analysis in real situations.
This section has shown how to train spaCy's TextCategorizer component for binary and multi-label text classification. Now, TextCategorizer will be trained on a real-world dataset for sentiment analysis using an IMDB user comments dataset.
This section will work on a real-world dataset using IMDB Large Movie Reviews
Dataset from Kaggle (2022).
The original imdb_sup.csv dataset has 50,000 rows. It is down-sized by selecting the first 5,000 records into the data file imdb_5000.csv to speed up training. This movie reviews dataset consists of movie reviews, review sizes, IMDB ratings (1–10), and sentiment ratings (0 or 1).
The dataset can be downloaded from the workshop directory as imdb_sup.csv (complete dataset) or imdb_5000.csv (5,000 records).
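Loading the down-sized file into the mcommentDF dataframe might look as follows (a sketch, assuming the CSV sits in the working directory):

import pandas as pd

mcommentDF = pd.read_csv("imdb_5000.csv")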
In[32] mcommentDF.shape
Out[32] (5000, 3)
Note: This IMDB movie reviews dataset contains 5,000 records; each record has 3 field attributes: Review, Rating, and Sentiment.
3. Examine the rows and columns of the dataset by printing the first few rows using the head() method:
In[33] mcommentDF.head()
Out[33]
4. Only the Review and Sentiment columns are used in this workshop. Hence, drop the other columns that won't be used, and call the dropna() method to drop the rows with missing values:
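One way the cleaned frame might be produced (a sketch; the column names follow the note above):

mcommentDF_clean = mcommentDF[["Review", "Sentiment"]].dropna()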
In[35] mcommentDF_clean.head()
Out[35]
In[36] axplot=mcommentDF.Rating.value_counts().plot
(kind='bar', colormap='Paired')
plt.show()
Out[36] [bar chart of the Rating value counts, with counts up to about 1,400 per rating and 10 the most frequent rating]
1. Users prefer to give high ratings, i.e. 8 or above, and 10 is the most frequent, as shown.
2. It is better to select a sample set with an even distribution to balance the rating of the sample data.
3. Check system performance first; if it is not as good as expected, a fine-tuned sampling method can be used to improve it.
Out[37] [bar chart of the binary Sentiment distribution (classes 0 and 1), with counts up to about 2,500 per class]
Note that this distribution is more balanced than the previous one: there is a higher number of positive reviews, but the number of negative reviews is also significant, as shown.
In[38] mcommentDF.head()
Out[38]
This completes the dataset exploration and the display of the review scores and class category distribution. The dataset is ready to be processed: drop the unused columns and convert the review scores to binary class labels. Let us begin with the training procedure.
2. Create a pipeline object nlp, define the classifier configuration, and add a TextCategorizer component to nlp with the following configuration:
In[42] movie_comment_exp[0]
Out[42]
5. Use the POS and NEG labels for positive and negative sentiment respectively. Introduce these labels to the new component and initialize it with examples.
In[44] tCategorizer
Out[44] <spacy.pipeline.textcat_multilabel.MultiLabel_TextCategorizer at 0x2626f-
9c4ee0>
6. Define the training loop by iterating over the training set for two epochs (more epochs can be used if necessary). The following code snippet trains the new text categorizer component (a sketch is given below):
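A minimal sketch of this training loop, converting each dataframe row into a spaCy Example with POS/NEG categories (two epochs, as described above; details may differ from the workshop listing):

import random
from spacy.training import Example

train_examples = []
for _, row in mcommentDF_clean.iterrows():
    cats = {"POS": float(row["Sentiment"] == 1), "NEG": float(row["Sentiment"] == 0)}
    train_examples.append(Example.from_dict(nlp.make_doc(row["Review"]), {"cats": cats}))

tCategorizer.initialize(lambda: train_examples, nlp=nlp)
with nlp.select_pipes(enable="textcat_multilabel"):
    optimizer = nlp.resume_training()
    for epoch in range(2):
        random.shuffle(train_examples)
        losses = {}
        for example in train_examples:
            nlp.update([example], sgd=optimizer, losses=losses)
        print(losses)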
7. Test how the text classifier component works on two example sentences:
In[46] test5 = nlp("This is the best movie that I have ever watched")
In[47] test5.cats
Out[47] {'POS': 0.9857660531997681, 'NEG': 0.018266398459672928}
In[49] test6.cats
Out[49] {'POS': 0.1364014744758606, 'NEG': 0.8908849358558655}
Note that both the NEG and POS labels appear in the prediction results because a multi-label classifier is used. The results are satisfactory, but they can improve if the number of training epochs is increased. The first sentence has a high positive probability, and the second sentence is predicted as negative with a high probability.
This workshop section shows how to incorporate spaCy technology with ANN (Artificial Neural Network) technology using TensorFlow and its Keras package (Géron 2019; Kedia and Rasu 2020; TensorFlow 2022).
A typical ANN has:
1. An input layer consisting of input neurons, or nodes.
2. A hidden layer consisting of hidden neurons, or nodes.
3. An output layer consisting of output neurons, or nodes.
An ANN learns by updating its network weights through training with sufficient pairs of sample inputs and target outputs. After sufficient training to a predefined accuracy, the network can predict or match unseen inputs to the corresponding outputs. A typical ANN architecture is shown in Fig. 14.4.
TensorFlow (Géron 2019; TensorFlow 2022) is a popular Python tool widely used
for machine learning. It has huge community support and great documentation
available at TensorFlow official site (TensorFlow 2022), while Keras is a Python
based deep learning tool that can be integrated with Python platforms such as
TensorFlow, Theano, and CNTK.
TensorFlow 1 relied on symbolic graph computations and other low-level operations, but TensorFlow 2 brought great changes to machine learning workflows, allowing developers to use Keras together with TensorFlow's low-level methods. Keras is popular in R&D because it supports rapid prototyping and offers a user-friendly API for neural network architectures (Kedia and Rasu 2020; Srinivasa-Desikan 2018).
Neural networks are commonly used for computer vision and NLP tasks includ-
ing object detection, image classification, scene understanding, text classification,
POS tagging, text summarization, and natural language generation.
TensorFlow 2 will be used to study the details of a neural network architecture
for text classification with tf.keras implementation throughout this section.
The Long Short-Term Memory network (LSTM network) is one of the significant recurrent networks used in various machine learning applications, including NLP applications, nowadays (Ekman 2021; Korstanje 2021).
RNNs are special neural networks that can process sequential data in steps. In ordinary neural networks all inputs and outputs are assumed independent, but this does not hold for text data: every word's presence depends on its neighboring words, e.g. in a machine translation task a word is predicted by considering all previously predicted words, with the past sequence of tokens stored within an LSTM cell. An LSTM is shown in Fig. 14.5.
An LSTM cell is moderately more complex than an RNN cell, but the computation logic is similar. A diagram of an LSTM cell is shown in Fig. 14.6. Note that the input and output steps are identical to their RNN counterparts.
Keras has extensive support for RNN variations such as GRU and LSTM, and a simple API for training RNNs. RNN variations are crucial for NLP tasks, as language data is sequential in nature, i.e. text is a sequence of words, speech is a sequence of sounds, and so on.
Since the type of statistical model has been identified in the design, we can transform a sequence of words into a sequence of word IDs and build the vocabulary with Keras' preprocessing module at the same time.
In[50] testD = [ "I am going to buy a gift for Christmas tomorrow morning.",
"Yesterday my mom cooked a wonderful meal.",
"Jack promised he would remember to turn off the lights." ]
In[51] testD
Out[51] ['I am going to buy a gift for Christmas tomorrow morning.',
'Yesterday my mom cooked a wonderful meal.',
'Jack promised he would remember to turn off the lights.']
All tokens of the Doc object generated by calling nlp(sentence) are iterated over in the preceding code. Note that punctuation marks have not been filtered out, as this filtering depends on the task; e.g. punctuation marks such as ‘!’ correlate with the result in sentiment analysis, so they are preserved in this example.
Build the vocabulary and convert the token sequences into token-ID sequences using the Tokenizer as shown (the Tokenizer object is created as sketched below):
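Creating the Tokenizer (a sketch, with default settings assumed):

from tensorflow.keras.preprocessing.text import Tokenizer

ktoken = Tokenizer()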
In[54] ktoken.fit_on_texts(testD)
ktoken
In[55] ktoken.word_index
Out[55]
In[56] ktoken.texts_to_sequences(["Christmas"])
Out[56] [[9]]
1. Note that token-IDs start from 1, not 0; 0 is a reserved value with a specific meaning, namely padding.
2. Keras cannot process utterances of different lengths, hence all utterances need to be padded.
3. Pad each sentence of the dataset to a maximum length by adding padding values either at the start or the end of the utterance.
4. Keras inserts 0 for the padding, which marks a padding position without a token.
Call pad_sequences on this list of sequences, and every sequence is padded with zeros until its length reaches MAX_LEN (the length of the longest sequence). Sequences can be padded from the right or the left with the post and pre options; the sentences in the preceding code were padded with the post option, hence from the right.
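A minimal sketch of this padding step (MAX_LEN is taken here as the longest sequence length):

from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = ktoken.texts_to_sequences(testD)
MAX_LEN = max(len(s) for s in seqs)
padded = pad_sequences(seqs, maxlen=MAX_LEN, padding="post")
print(padded)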
When these sequences are organized, the complete text preprocessing steps are
as follows:
Out[60]
14.10.1 Embedding Words
Tokens can be transformed into token vectors. Embedding tokens into vectors is done via a lookup embedding table, in which each row holds a token vector indexed by token-ID. The flow of obtaining a token vector is as follows:
1. token → token-ID: a token-ID is assigned to each token with Keras' Tokenizer as in the previous section. The Tokenizer holds the whole vocabulary and maps each vocabulary token to an ID.
2. token-ID → token vector: a token-ID is an integer that can be used as an index into the embedding table's rows. Each token-ID corresponds to one row; when a token vector is required, first obtain the token-ID and then look up the embedding table row with this token-ID.
[Diagram: token-to-vector lookup. The token food is mapped to token-ID 11, which indexes row 11 of the embedding table to retrieve the token vector for food.]
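A minimal sketch of this lookup with a Keras Embedding layer (the output dimension 8 is arbitrary, chosen only for illustration):

import tensorflow as tf

vocab_size = len(ktoken.word_index) + 1            # +1 for the reserved padding ID 0
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=8)
vec = embedding(tf.constant([[9]]))                # e.g. the token-ID of "Christmas" above
print(vec.shape)                                   # (1, 1, 8)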
This section demonstrates the design of an LSTM-based RNN text classifier for sentiment analysis with the steps below:
1. Retrieve and preprocess the data.
2. Tokenize the review utterances with padding.
3. Create the padded utterance sequences and feed them into the Input Layer.
4. Vectorize each token, indexed by token-ID, in the Embedding Layer.
5. Feed the token vectors into the LSTM.
6. Train the LSTM network.
Let us start by recalling the dataset again.
14.11.1 Step 1: Dataset
The same IMDB movie reviews dataset from the sentiment analysis with spaCy section will be used. It has already been processed with pandas and condensed into two columns with binary labels.
Reload the reviews table and perform the data preprocessing as done in the previous section to ensure the data is up to date:
Out[61] [bar chart of the binary Sentiment distribution (classes 0 and 1), as in the previous section]
In[62] mcommentDF.head()
Out[62]
Next, extract review text and review label from each dataset row and add them
into Python lists:
# Perform Tokenization
movie_comment_exp = []   # one token list per review
categories = []          # sentiment labels (0 or 1)
for idx, rw in mcommentDF.iterrows():
    comments = rw["Review"]
    rating = rw["Sentiment"]
    categories.append(rating)
    mtoks = [token.text for token in nlp(comments)]
    movie_comment_exp.append(mtoks)
In[65] movie_comment_exp[0]
Out[65] ['*', '*', 'Possible', 'Spoilers', '*', '*']
Note that a list of tokens is appended to movie_comment_exp for each review, hence each element of this list is a list of tokens. Next, invoke Keras' Tokenizer on this token list to build the vocabulary.
Since the dataset has already been processed, tokenize the dataset sentences and build a vocabulary.
2. Fit ktoken on the token lists and convert them into IDs by calling texts_to_sequences.
3. Pad short utterance sequences to a maximum length of 50; this also truncates long reviews to 50 words (a sketch of these steps is given below):
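A minimal sketch of steps 2 and 3 (the variable names utt_pad and catlist are assumptions):

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

ktoken = Tokenizer()
ktoken.fit_on_texts(movie_comment_exp)                    # build the vocabulary
seqs = ktoken.texts_to_sequences(movie_comment_exp)       # tokens -> token-IDs
utt_pad = pad_sequences(seqs, maxlen=50, padding="post")  # pad/truncate to 50
catlist = np.array(categories).reshape(-1, 1)             # labels as a (5000, 1) array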
In[71] catlist.shape
Out[71] (5000, 1)
All basic preparation work is now completed; we can create an LSTM network and feed it the input data.
Load the TensorFlow Keras related modules:
Do not be confused by None in the input shape: here, None means that the dimension can be of any length, and this expression is used so that Keras infers the input shape.
Create the LSTM layer:
Here, units = 256 is the dimension of the hidden state. The LSTM output shape and hidden state shape are identical due to the LSTM architecture.
Once the 256-dimensional vector from the LSTM layer has been obtained, it is condensed into a 1-dimensional output whose values correspond to the class labels 0 and 1.
After the model has been defined, it must be compiled with an optimizer, a loss function, and an evaluation metric (the full model is sketched below):
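A minimal sketch of the whole model described in this section (the embedding dimension 100 is an assumption; the 256-unit LSTM, the sigmoid output, ADAM, and binary cross-entropy follow the text):

import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

vocab_size = len(ktoken.word_index) + 1
inputs = Input(shape=(None,))                          # None: any sequence length
x = Embedding(input_dim=vocab_size, output_dim=100)(inputs)
x = LSTM(units=256)(x)                                 # 256-dimensional hidden state
outputs = Dense(1, activation="sigmoid")(x)            # binary class output

imdb_mdl = tf.keras.Model(inputs, outputs)
imdb_mdl.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Training would then be a call such as imdb_mdl.fit(utt_pad, catlist, epochs=5), which produces the History object shown at the end of this section.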
In[78] imdb_mdl.summary()
Out[78]
1. Use ADAM (adaptive moment estimation) as the optimizer for training the imdb_mdl LSTM model.
2. Use binary cross-entropy as the loss function.
3. A list of supported performance metrics can be found on the Keras official site (Keras 2022).
<keras.callbacks.History at 0x26289a53f10>
References
Agarwal, B. (2020) Deep Learning-Based Approaches for Sentiment Analysis (Algorithms for
Intelligent Systems). Springer.
Albrecht, J., Ramachandran, S. and Winkler, C. (2020) Blueprints for Text Analytics Using
Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications.
O’Reilly Media.
Altinok, D. (2021) Mastering spaCy: An end-to-end practical guide to implementing NLP applica-
tions using the Python ecosystem. Packt Publishing.
Arumugam, R., & Shanmugamani, R. (2018). Hands-on natural language processing with python.
Packt Publishing.
Bird, S., Klein, E., and Loper, E. (2009). Natural language processing with python. O'Reilly.
Ekman, M. (2021) Learning Deep Learning: Theory and Practice of Neural Networks, Computer
Vision, Natural Language Processing, and Transformers Using TensorFlow. Addison-Wesley
Professional.
George, A. (2022) Python Text Mining: Perform Text Processing, Word Embedding, Text
Classification and Machine Translation. BPB Publications.
Géron, A. (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media.
Kaggle (2022) IMDB Large Movie Review Dataset from Kaggle. https://www.kaggle.com/code/
nisargchodavadiya/movie-review-analytics-sentiment-ratings/data. Accessed 23 June 2022.
Kedia, A. and Rasu, M. (2020) Hands-On Python Natural Language Processing: Explore tools and
techniques to analyze and process text with a view to building real-world NLP applications.
Packt Publishing.
Keras (2022) Keras official site performance metrics. https://keras.io/api/metrics. Accessed 23
June 2022.
Korstanje, J. (2021) Advanced Forecasting with Python: With State-of-the-Art-Models Including
LSTMs, Facebook’s Prophet, and Amazon’s DeepAR. Apress.
Perkins, J. (2014). Python 3 text processing with NLTK 3 cookbook. Packt Publishing Ltd.
Pozzi, F., Fersini, E., Messina, E. and Liu, B. (2016) Sentiment Analysis in Social Networks.
Morgan Kaufmann.
SpaCy (2022) spaCy official site. https://spacy.io/. Accessed 16 June 2022.
Sarkar, D. (2019) Text Analytics with Python: A Practitioner’s Guide to Natural Language
Processing. Apress.
Siahaan, V. and Sianipar, R. H. (2022) Text Processing and Sentiment Analysis using Machine
Learning and Deep Learning with Python GUI. Balige Publishing.
Srinivasa-Desikan, B. (2018). Natural language processing and computational linguistics: A prac-
tical guide to text analysis with python, gensim, SpaCy, and keras. Packt Publishing Limited.
TensorFlow (2022) TensorFlow official site. https://tensorflow.org/. Accessed 22 June 2022.
Vasiliev, Y. (2020) Natural Language Processing with Python and spaCy: A Practical Introduction.
No Starch Press.
Chapter 15
Workshop#6 Transformers with spaCy
and TensorFlow (Hour 11–12)
15.1 Introduction
In Chap. 8, the basic concepts of Transfer Learning, its motivation, and related background knowledge such as Recurrent Neural Networks (RNN), Transformer technology, and the BERT model were introduced.
This workshop covers the latest topic in NLP, Transformers, and how to use them with TensorFlow and spaCy. First, we will learn about Transformers and transfer learning. Second, we will study a commonly used Transformer architecture, Bidirectional Encoder Representations from Transformers (BERT), as well as how the BERT Tokenizer and WordPiece algorithm work.
Further, we will learn how to start with the pre-trained transformer models of the HuggingFace library (HuggingFace 2022) and practice fine-tuning HuggingFace Transformers with TensorFlow and Keras (TensorFlow 2022; Keras 2022), followed by how spaCy v3.0 (spaCy 2022) integrates transformer models as pre-trained pipelines. These techniques and tools will be used in the last workshop to build a Q&A chatbot.
Hence, this workshop will cover the following topics:
• Transformers and Transfer Learning.
• Understanding BERT.
• Transformers and TensorFlow.
• Transformers and spaCy.
15.2 Technical Requirements
All source codes for this workshop can be downloaded from GitHub archive on
NLPWorkshop6 (NLPWorkshop6 2022).
Use pip install commands to install the following packages:
• pip install spacy.
• pip install TensorFlow (note: version 2.2 or above).
• pip install transformers.
Fig. 15.1 Sample Input Texts and their corresponding Output Class Labels
15.4 Why Transformers?
[Diagram: the Transformer encoder-decoder architecture. The inputs and outputs (shifted right) pass through input/output embeddings with positional encoding; the decoder output passes through a linear layer followed by a softmax.]
The encoder generates a vector representation of the input words and passes it to the decoder; this transfer of word vectors is represented by the arrow from the encoder block to the decoder block. The decoder takes the input word vectors, transforms the output words into word vectors, and generates the probability of each output word.
There are feedforward layers, i.e. dense layers, in the encoder and decoder blocks of the kind used for text classification with spaCy. The innovation of the Transformer lies in the Multi-Head Attention block, which creates a dense representation for each word with a self-attention mechanism. This mechanism relates each word in the input sentence to the other words of the sentence: a word's embedding is calculated as a weighted average of the other words' embeddings, and the significance of each word in the input sentence can be calculated, enabling the architecture to focus on each input word in turn.
An example of how the input words on the left-hand side attend to the input word it on the right-hand side is shown in Fig. 15.3. Dark colors represent relevance; the phrase the animal is more related to it than the other words in the sentence. This shows that Transformers can resolve many semantic dependencies in a sentence and can be used in different tasks such as text classification and machine translation, since several architectures exist depending on the task. BERT is a popular architecture to use.
15.5.1 What Is BERT?
Fig. 15.5 Two distinct word vectors generated by BERT for the same word bank in two different
contexts
It is noted that although the word bank has different meanings in these two sentences, its word vectors are the same because GloVe and FastText are static: each word has only one vector, and the vectors are saved to a file after training. These pre-trained vectors can then be downloaded into our application. BERT word vectors, on the contrary, are dynamic: BERT can generate different word vectors for the same word depending on the input sentence. The word vectors generated by BERT are shown in Fig. 15.5, against their static counterparts shown in Fig. 15.4:
15.5.2 BERT Architecture
BERT is a transformer encoder stack, which means several encoder layers are stacked on top of each other. The first layer initializes the word vectors randomly, and then each encoder layer transforms the output of the previous encoder layer. Figure 15.6 illustrates the two BERT model sizes, BERT Base and BERT Large.
BERT Base and BERT Large have 12 and 24 encoder layers and generate word vectors of size 768 and 1024, respectively.
BERT outputs a word vector for each input word. A high-level overview of BERT inputs and outputs is illustrated in Fig. 15.7, which shows that BERT input must follow a special format including special tokens such as [CLS].
Having learnt BERT's basic architecture, let us look at how to generate output vectors using BERT.
The BERT input format can represent a single sentence or a pair of sentences in a single sequence of tokens (for tasks such as question answering and semantic similarity, two sentences are input to the model).
Fig. 15.7 BERT model input word and output word vectors
BERT works with a special class of tokens and a special tokenization algorithm called WordPiece.
There are several types of special tokens: [CLS], [SEP], and [PAD]:
• [CLS] is the first token of every input sequence. This token acts as a summary of the input sentence for classification tasks and is disregarded for non-classification tasks.
• [SEP] is a sentence separator. If the input is a single sentence, this token is placed at the end of the sentence, i.e. [CLS] sentence [SEP]; for two sentences it separates them, i.e. [CLS] sentence1 [SEP] sentence2 [SEP].
• [PAD] is a special token for padding. Padding values are used to give sentences from the dataset equal length. BERT receives sentences of a fixed length only, hence short sentences must be padded prior to feeding them to BERT. The maximum number of tokens that can be fed to BERT is 512.
We have learnt that a sentence can be fed to a Keras model one word at a time and that input sentences can be tokenized into words using spaCy's tokenizer, but BERT works differently: it uses WordPiece tokenization. A word piece is literally a piece of a word.
The WordPiece algorithm breaks words down into several subwords; the logic behind it is to decompose complex/long tokens into simpler tokens, e.g. the word playing is tokenized as play + ##ing. The characters ## are placed before a word piece to indicate that this token is not a word from the language's vocabulary but a word piece.
Let us look at some examples:
BERT originators stated that “We then train a large model (12-layers to 24-layers
Transformer) on a large corpus (Wikipedia + BookCorpus) for a long time (1 M
update steps), and that is BERT” in Google Research’s BERT GitHub repository
(GoogleBert 2022).
BERT is trained with a masked language model (MLM) and next sentence prediction (NSP).
Language modelling is the task of predicting the next token given the sequence of previous tokens. For example, given the sequence of words Yesterday I visited, a language model can predict the next token as one of the tokens church, hospital, school, and so on.
MLM is different: a percentage of the tokens are masked randomly, i.e. replaced with a [MASK] token, and the MLM is asked to predict the masked words.
BERT's masked language model is implemented as follows:
1. 15% of the input tokens are selected randomly.
2. 80% of the selected tokens are replaced by [MASK].
3. 10% of the selected tokens are replaced by another token from the vocabulary.
4. 10% remain unchanged.
An example training sentence for the MLM is as follows:
NSP is the task of predicting whether one sentence follows another. Two sentences are fed to BERT, and BERT predicts the sentence order, i.e. whether the second sentence follows the first one.
An example NSP input of two sentences separated by a [SEP] token is as follows:
It shows that the second sentence can follow the first sentence, hence the predicted label is IsNext.
Here is another example:
This example shows a pair of sentences with no contextual or semantic relevance, generating a NotNext label.
15.6.1 HuggingFace Transformers
BERT uses the WordPiece algorithm for tokenization, so that input words not found in the vocabulary can be divided into subwords.
Fig. 15.10 Lists of the available models (left-hand side) and BERT model variations (right-
hand side)
1. Import BertTokenizer. Note that different models have different tokenizers, e.g. the XLNet model's tokenizer is called XLNetTokenizer.
2. Call the from_pretrained method on the tokenizer object and provide the model's name. There is no need to download the pre-trained bert-base-uncased model separately, as this method downloads the model by itself.
3. Call the tokenize method. It tokenizes the sentence by dividing words into subwords.
4. Print the tokens to examine the subwords. The words he, lived, and idle exist in the Tokenizer's vocabulary and remain intact. Characteristically is a rare word that does not exist in the Tokenizer's vocabulary, so the tokenizer splits it into the subwords characteristic and ##ally. Notice that ##ally starts with the characters ## to emphasize that it is a word piece.
5. Call convert_tokens_to_ids (a sketch of these steps follows).
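A minimal sketch of these five steps (the example sentence mirrors the one used with encode below):

from transformers import BertTokenizer

btokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
utt1 = "He lived characteristically idle and romantic."
tokens = btokenizer.tokenize(utt1)
print(tokens)   # e.g. ['he', 'lived', 'characteristic', '##ally', 'idle', 'and', 'romantic', '.']
ids = btokenizer.convert_tokens_to_ids(tokens)
print(ids)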
Since the [CLS] and [SEP] tokens must be added to the beginning and end of the input sentence, they have to be added manually in the preceding code; however, these preprocessing steps can be performed in a single step. BERT's tokenizer provides a method called encode that can:
• add the [CLS] and [SEP] tokens to the input sentence;
• tokenize the sentence by dividing tokens into subwords;
• convert tokens to their token-IDs.
Call encode method on input sentence directly as follows:
btokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
utt2 = "He lived characteristically idle and romantic."
id2 = btokenizer.encode(utt2)
print(id2)
Out[4] [101, 2002, 2973, 8281, 3973, 18373, 1998, 6298, 1012, 102]
This code segment outputs the token-IDs in a single step rather than step by step. The result is a Python list.
All input sentences in a dataset must have equal length because BERT cannot process variable-length sentences, so short sentences must be padded to the length of the longest sentence in the dataset using the parameter padding='longest'.
Conversion code is also required if a TensorFlow tensor is needed instead of a plain list. The HuggingFace library provides encode_plus to combine all these steps into a single method as follows:
btokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
utt3 = "He lived characteristically idle and romantic."
encoded = btokenizer.encode_plus(
text=utt3,
add_special_tokens=True,
padding='longest',
return_tensors="tf"
)
id3 = encoded["input_ids"]
print(id3)
Out[5]
In[6] btokenizer.encode_plus?
This section examines the BERT model output, which is a sequence of word vectors, one vector per input word. BERT has a special output format. Let us look at the code first.
from transformers import BertTokenizer, TFBertModel

btokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bmodel = TFBertModel.from_pretrained("bert-base-uncased")
# utt4 is an input sentence defined earlier (not shown here)
encoded = btokenizer.encode_plus(
    text=utt4,
    add_special_tokens=True,
    padding='longest',
    max_length=10,
    return_attention_mask=True,
    return_tensors="tf"
)
id4 = encoded["input_ids"]
outputs = bmodel(id4)
• Import TFBertModel.
• Initialize our BERT model from the bert-base-uncased pre-trained model.
• Transform the input sentence into BERT input format with encode_plus, and capture the resulting tf.Tensor in the id4 variable.
• Feed the sentence to the BERT model and capture the output in the outputs variable.
The BERT model output is a pair of two elements. Let us print the shapes of the output pair:
In[8] print(outputs[0].shape)
Out[8] (1, 6, 768)
In[9] print(outputs[1].shape)
Out[9] (1, 768)
1. The first element of the output has shape (batch size, sequence length, hidden size). The batch size is the number of sentences fed to the model at once; when one sentence is fed, the batch size is 1. The sequence length here is 6: although max_length = 10 was passed to the tokenizer, padding='longest' pads only to the longest sequence in the batch, so the sentence keeps its tokenized length of 6 (including [CLS] and [SEP]). Hidden_size is a BERT parameter: BERT Base has a hidden size of 768 and produces word vectors with 768 dimensions. Hence, the first output element contains one 768-dimensional vector per token, i.e. 6 × 768 values.
2. The second output is only one vector of 768-dimension. This vector is the word
embedding of [CLS] token. Since [CLS] token is an aggregate of the whole
sentence, this token embedding is regarded as embeddings pooled version of all
words in the sentence. The shape of output tuple is always the batch size,
hidden_size. It is to collect [CLS] token’s embedding per input sentence
basically
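As a small illustration, the two elements can be unpacked as follows (a sketch; the
variable names on the left are assumptions):
token_vectors = outputs[0]   # (batch_size, sequence_length, 768): one vector per word piece
pooled_cls = outputs[1]      # (batch_size, 768): pooled [CLS] embedding of the sentence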
Once BERT embeddings are extracted, they can be used to train a text classification
model with TensorFlow and tf.keras.
Some of the code from the previous workshop will be reused, but this time the code is
shorter because the embedding and LSTM layers are replaced by BERT to train a binary
text classifier with tf.keras.
This section uses an email log dataset, emails.csv, for spam mail classification,
which can be found in the NLP Workshop6 GitHub repository (NLPWorkshop6 2022).
15.7.1 Data Preparation
Before the text classification model using BERT is created, let us prepare the data
first, just as learnt in the previous workshop:
In[11] emails=pd.read_csv("emails.csv",encoding='ISO-8859-1')
emails.head()
Out[11]
In[12] emails=emails.dropna()
emails=emails.reset_index(drop=True)
emails.columns = ['text','label']
emails.head()
Out[12]
In[14] emails.head()
Out[14]
In[15] messages=emails['text']
labels=emails['label']
len(messages),len(labels)
Out[15] (5728, 5728)
In[16] input_ids=[]
attention_masks=[]
input_ids=np.asarray(input_ids)
attention_masks=np.array(attention_masks)
labels=np.array(labels)
This code segment generates token IDs for each input sentence in the dataset and
appends them to a list. The class labels form a list consisting of 0s and 1s. The
Python lists input_ids and labels are converted to NumPy arrays before they are fed
to the Keras model.
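A sketch of the tokenization loop summarized above is shown below; the max_length,
padding, and truncation settings are assumptions rather than the workshop's exact
parameters:
input_ids = []
attention_masks = []

# Tokenize every email message with the BERT tokenizer (sketch)
for msg in messages:
    enc = btokenizer.encode_plus(text=msg,
                                 add_special_tokens=True,
                                 max_length=50,
                                 padding='max_length',
                                 truncation=True,
                                 return_attention_mask=True)
    input_ids.append(enc['input_ids'])
    attention_masks.append(enc['attention_mask'])

# Convert the Python lists to NumPy arrays, as in In[16]
input_ids = np.asarray(input_ids)
attention_masks = np.array(attention_masks)
labels = np.array(labels)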
In[18] model.fit?
In[19] history=model.fit(input_ids,labels,batch_size=1,epochs=1)
Out[19] 5728/5728 [==============================] - 8675s
2s/step - loss: 0.0950 - accuracy: 0.9663
In[20] bmodel.summary()
Out[20]
The BERT model is added with only a single line of code, yet it transfers the
enormous knowledge of the Wikipedia corpus to the model. The model obtains an
accuracy of 0.96 at the end of training.
A single epoch is usually sufficient because BERT tends to overfit a moderate-sized
corpus.
The rest of the code handles compiling and fitting the Keras model. Note that BERT
has a huge memory requirement, as can be seen from the RAM requirements listed in
Google Research's GitHub archive (GoogleBert-Memory 2022).
The training code runs for about an hour on a local machine; bigger datasets require
more time even for a single epoch.
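A minimal sketch of how such a classifier might be assembled is given below; the
input length of 50, the layer structure, and the learning rate are assumptions, not
the workshop's exact model definition:
import tensorflow as tf
from transformers import TFBertModel

# Hypothetical BERT-based binary classifier: BERT replaces the embedding and
# LSTM layers used in the previous workshop. The input length of 50 must match
# the tokenizer's padding settings.
bmodel = TFBertModel.from_pretrained("bert-base-uncased")

token_ids = tf.keras.layers.Input(shape=(50,), dtype=tf.int32, name="input_ids")
pooled_cls = bmodel(token_ids)[1]                 # pooled [CLS] embedding, shape (batch, 768)
prediction = tf.keras.layers.Dense(1, activation="sigmoid")(pooled_cls)

model = tf.keras.Model(inputs=token_ids, outputs=prediction)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(input_ids, labels, batch_size=1, epochs=1) then trains it as in In[19]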
This section showed how to train a Keras model with BERT from scratch.
from transformers import pipeline   # added: the pipeline function used below

nlp = pipeline("sentiment-analysis")
result1 = nlp(utt5)   # utt5 and utt6 are the two input sentences being classified
result2 = nlp(utt6)
Check outputs:
In[22] result1
Out[22] [{'label': 'NEGATIVE', 'score': 0.9276903867721558}]
In[23] result2
Out[23] [{'label': 'POSITIVE', 'score': 0.9998767375946045}]
nlp = pipeline("question-answering")
res = nlp({
'question': 'What is the name of this book ?',
'context': "I'll publish my new book Natural Language Processing
soon."
})
print(res)
Out[24] {'score': 0.9857430458068848, 'start': 25, 'end': 52, 'answer': 'Natural
Language Processing'}
Again, import the pipeline function and create a pipeline object nlp. For
question-answering tasks, a context that provides the relevant background
information to the model is required.
• Ask the model for the name of this book after telling it that the new publication
  will be available soon
• The answer is Natural Language Processing, as expected
• Try your own examples as a simple exercise
spaCy v3.0 released new features and components. It integrates transformers into the
spaCy NLP pipeline through an additional pipeline component called Transformer. This
component allows users to use all HuggingFace models within spaCy pipelines. A spaCy
NLP pipeline without transformers is illustrated in Fig. 15.11.
With the release of v3.0, v2-style spaCy models are still supported and
transformer-based models are introduced. A transformer-based pipeline looks like the
following, as illustrated in Fig. 15.12:
Transformer-based models and v2-style models are listed on the Models page of the
documentation (spaCy-model 2022) for each supported language, including English.
Transformer-based models come in various sizes and with various pipeline components,
like v2-style models. Also, each model carries corpus and genre information, like
v2-style models. An example of an English transformer-based language model from the
Models page is shown in Fig. 15.13.
It shows that the first pipeline component is a transformer that generates word
representations and uses the WordPiece algorithm to tokenize words into subwords.
The word vectors are then fed to the rest of the pipeline.
After loading the model and initializing the pipeline, this model is used in the same
way as v2-style models (see the sketch below):
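A minimal loading sketch is shown below, assuming the en_core_web_trf pipeline is
installed; the example sentence is a stand-in for the workshop's own utt8 text, which
contains the word unwillingly:
import spacy

# Load the transformer-based English pipeline (installed separately with
# `python -m spacy download en_core_web_trf`)
nlp = spacy.load("en_core_web_trf")

# Hypothetical example sentence containing the word "unwillingly"
utt8 = nlp("He went home unwillingly.")

# The transformer data (word pieces, tensors, etc.) is exposed on the Doc object
print(utt8._.trf_data.wordpieces)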
In[29] utt8._.trf_data.wordpieces
Out[29]
There are five elements in the preceding output: word pieces, input IDs, attention
masks, lengths, and token type IDs.
Word pieces are the subwords generated by the WordPiece algorithm. The word pieces of
this sentence are as follows:
The first and last tokens are special tokens used at the beginning and end of the
sentence. The word unwillingly is divided into three subwords: unw, ill, and ingly.
A Ġ character is used to mark word boundaries; tokens without Ġ, such as ill and
ingly in the preceding word piece list, are subword continuations. The first word of
the sentence is the exception, as it is preceded by the special <s> token instead.
Input IDs have the same meaning as before: they are the subword IDs assigned by the
transformer's tokenizer.
The attention mask is a list of 0s and 1s that tells the transformer which tokens it
should attend to. A 0 corresponds to PAD tokens, while all other tokens have a
corresponding 1.
Lengths refer to the length of the sentence after it is divided into subwords. Here
it is 9, but notice that len(doc) outputs 5, because spaCy always operates on
linguistic words.
token_type_ids are used by transformer tokenizers to mark sentence boundaries in
two-sentence input tasks such as question answering. Since only one text is provided
here, this feature is not applicable.
Token vectors generated by the transformer are stored in doc._.trf_data.tensors,
which contains the transformer output: a sequence of word vectors, one per word
piece, and the pooled output vector. Please refer to the section on obtaining BERT
word vectors if necessary.
In[30] utt8._.trf_data.tensors[0].shape
Out[30] (1, 9, 768)
In[31] utt8._.trf_data.tensors[1].shape
Out[31] (1, 768)
The first element of the tuple contains the vectors for the tokens. Each vector is
768-dimensional; hence 9 word pieces produce a 9 × 768 matrix of vectors. The second
element of the tuple is the pooled output vector, an aggregate representation of the
input sentence, with shape 1 × 768.
spaCy provides a user-friendly API and packaging for complicated models such as
transformers. The transformer integration further validates spaCy as a practical
choice for NLP work.
References
Agarwal, B. (2020) Deep Learning-Based Approaches for Sentiment Analysis (Algorithms for
Intelligent Systems). Springer.
Albrecht, J., Ramachandran, S. and Winkler, C. (2020) Blueprints for Text Analytics Using
Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications.
O’Reilly Media.
Arumugam, R., & Shanmugamani, R. (2018). Hands-on natural language processing with python.
Packt Publishing.
Bansal, A. (2021) Advanced Natural Language Processing with TensorFlow 2: Build effective
real-world NLP applications using NER, RNNs, seq2seq models, Transformers, and more.
Packt Publishing.
Devlin, J., Chang, M. W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. Archive: https://arxiv.org/
pdf/1810.04805.pdf.
Ekman, M. (2021) Learning Deep Learning: Theory and Practice of Neural Networks, Computer
Vision, Natural Language Processing, and Transformers Using TensorFlow. Addison-Wesley
Professional.
Facebook-transformer (2022) Facebook Transformer Model archive. https://github.com/pytorch/
fairseq/blob/master/examples/language_model/README.md. Accessed 24 June 2022.
Géron, A. (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media.
GoogleBert (2022) Google Bert Model Github archive. https://github.com/google-research/bert.
Accessed 24 June 2022.
GoogleBert-Memory (2022) GoogleBert Memory Requirement. https://github.com/
google-research/bert#out-of-memory-issues.
HuggingFace (2022) Hugging Face official site. https://huggingface.co/. Accessed 24 June 2022.
HuggingFace_transformer (2022) HuggingFace Transformer Model archive. https://github.com/
huggingface/transformers. Accessed 24 June 2022.
Kedia, A. and Rasu, M. (2020) Hands-On Python Natural Language Processing: Explore tools and
techniques to analyze and process text with a view to building real-world NLP applications.
Packt Publishing.
Keras (2022) Keras official site. https://keras.io/. Accessed 24 June 2022.
Korstanje, J. (2021) Advanced Forecasting with Python: With State-of-the-Art-Models Including
LSTMs, Facebook’s Prophet, and Amazon’s DeepAR. Apress.
NLPWorkshop6 (2022) NLP Workshop 6 GitHub archive. https://github.com/raymondshtlee/
NLP/tree/main/NLPWorkshop6. Accessed 24 June 2022.
Rothman, D. (2022) Transformers for Natural Language Processing: Build, train, and fine-tune
deep neural network architectures for NLP with Python, PyTorch, TensorFlow, BERT, and
GPT-3. Packt Publishing.
SpaCy (2022) spaCy official site. https://spacy.io/. Accessed 24 June 2022.
SpaCy-model (2022) spaCy English Pipeline Model. https://spacy.io/models/en. Accessed 24
June 2022.
Siahaan, V. and Sianipar, R. H. (2022) Text Processing and Sentiment Analysis using Machine
Learning and Deep Learning with Python GUI. Balige Publishing.
TensorFlow (2022) TensorFlow official site. https://tensorflow.org/. Accessed 24 June 2022.
Tunstall, L., Werra, L. and Wolf, T. (2022) Natural Language Processing with Transformers:
Building Language Applications with Hugging Face. O’Reilly Media.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin,
I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
https://arxiv.org/abs/1706.03762.
Yıldırım, S, Asgari-Chenaghlu, M. (2021) Mastering Transformers: Build state-of-the-art models
from scratch with advanced natural language processing techniques. Packt Publishing.
Chapter 16
Workshop#7 Building Chatbot
with TensorFlow and Transformer
Technology (Hour 13–14)
16.1 Introduction
16.2 Technical Requirements
Transformers, TensorFlow, and spaCy, together with the Python modules numpy and
scikit-learn, need to be installed on the machine.
16.3.1 What Is a Chatbot?
A wake word is the gateway between a user and the user's digital assistant or
chatbot. Voice assistants such as Alexa and Siri are powered by AI with wake word
detection abilities to respond to queries and commands.
Common wake words include Hey Google, Alexa, and Hey Siri.
Today's wake word detection and speech recognition are driven by machine learning or
AI with cloud processing.
Sensory’s wake word and phrase recognition engines use deep neural networks
to provide an embedded or on-device wake word and phrase recognition engine
(Fig. 16.1).
Wake words like Alexa, Siri, and Google are associated with highly valued technical
product experiences. Other companies have created tailor-made wake words unique to
their products; for example, Hi Toyota opened a doorway to a voice user interface to
strengthen the relationship between customers and the brand.
Wake word technology has also been used beyond mobile applications, in battery
powered devices such as Bluetooth headphones, smart watches, cameras, and emergency
alert devices.
Chatbots allow users to utter commands naturally. Queries like what time is it? or
how many steps have I taken? are examples of phrases that a chatbot can process with
near-zero latency and high accuracy.
Wake word technology can be integrated with voice recognition applications such as
touch-screen food ordering, voice-controlled microwaves, or user identification
settings in televisions or vehicles.
This workshop integrates the technologies learnt so far, including TensorFlow (Bansal
2021; Ekman 2021; TensorFlow 2022), Keras (Géron 2019; Keras 2022a), and Transformer
technology with the Attention Learning scheme (Ekman 2021; Kedia and Rasu 2020;
Rothman 2022; Tunstall et al. 2022; Vaswani et al. 2017; Yıldırım and
Asgari-Chenaghlu 2021), to build a live domain-based chatbot system. The Cornell
Large Movie Dialog Corpus (Cornell 2022) is used as the conversation dataset for
system training. The movie dataset can be downloaded either from the Cornell
databank (2022) or from Kaggle's Cornell Movie Corpus archive (2022).
Use the pip install command to install the TensorFlow package and its datasets
module, then import the required modules:
import re
import matplotlib.pyplot as pyplt
import tensorflow as tflow   # alias used by the rest of the workshop code
The Cornell Movie Dialogs corpus is used in this project. The dataset file
movie_conversations.txt contains lists of conversation IDs, and movie_lines.txt
contains the dialog line associated with each line ID. The corpus contains 220,579
conversational exchanges between 10,292 pairs of movie characters.
def get_dialogs():
    # Create the dialog dictionary (id2dlogs): line ID -> utterance text
    id2dlogs = {}
    # Open the movie_lines text file
    with open('data/movie_lines.txt', encoding='utf-8', errors='ignore') as f_dlogs:
        dlogs = f_dlogs.readlines()
    for dlog in dlogs:
        sections = dlog.replace('\n', '').split(' +++$+++ ')
        id2dlogs[sections[0]] = sections[4]
    return id2dlogs   # return the ID-to-utterance mapping
In[5] len(queries)
Out[5] 50000
In[6] len(responses)
Out[6] 50000
Cap the maximum utterance length (MLEN) at 40, then perform filtering and padding:
MLEN = 40   # maximum utterance length, as stated above
m_token_aa = tflow.keras.preprocessing.sequence.pad_sequences(
    m_token_aa, maxlen=MLEN, padding='post')
Review the size of the movie vocabulary (SVCAB) and the total number of
conversations (conv):
1. Note that the total number of conversations after the filtering and padding
   process is 44,095, which is less than the previous maximum of 50,000 because
   some conversations are filtered out
2. The SVCAB size is around 8000, which makes sense: with around 44,000 lines of
   conversation, the vocabulary in use typically lies between 5000 and 10,000 words
In[12] tflow.data.Dataset.from_tensor_slices?
mDS = mDS.cache()
mDS = mDS.shuffle(sBuffer)
mDS = mDS.batch(sBatch)
mDS = mDS.prefetch(tflow.data.experimental.AUTOTUNE)
1. Create a TensorFlow Dataset object first and define the batch and buffer sizes
2. Define the three input/output layers of the Transformer model: (a) the input node
   layer (inNodes) for queries, (b) the decoder input node layer (decNodes) for
   responses, and (c) the output node layer (outNodes) for responses
3. Define the prefetch scheme, AUTOTUNE in our project (a sketch of the dataset
   construction follows this list)
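A sketch of the Dataset construction described in the list above is shown below;
m_token_bb (the tokenized responses) and the buffer and batch size values are
assumptions based on the surrounding code:
sBuffer = 20000   # shuffle buffer size (assumed value)
sBatch = 64       # batch size (assumed value)

# m_token_aa holds the tokenized, padded queries (see above); m_token_bb is
# assumed to hold the tokenized, padded responses
mDS = tflow.data.Dataset.from_tensor_slices((
    {'inNodes': m_token_aa,              # encoder input: queries
     'decNodes': m_token_bb[:, :-1]},    # decoder input: responses without the last token
    {'outNodes': m_token_bb[:, 1:]},     # target: responses shifted by one token
))
# mDS is then cached, shuffled, batched, and prefetched as shown above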
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V \tag{16.1}$$
In[14] # Calculate the Attention Weight, Query (q), Key (k), Value (v), Mask (m)
def calc_attention(q, k, v, m):
    qk = tflow.matmul(q, k, transpose_b=True)
    dep = tflow.cast(tflow.shape(k)[-1], tflow.float32)
    mlogs = qk / tflow.math.sqrt(dep)
    if m is not None:
        mlogs += (m * -1e9)                           # mask out padded/future positions
    att_wts = tflow.nn.softmax(mlogs, axis=-1)        # attention weights (Eq. 16.1)
    out_wts = tflow.matmul(att_wts, v)                # weighted sum of the values
    return out_wts
16.4.7 Multi-Head-Attention (MHAttention)
assert dm % self.nhd == 0
self.dep = dm // self.nhd
self.qdes = tflow.keras.layers.Dense(units=dm)
self.kdes = tflow.keras.layers.Dense(units=dm)
self.vdes = tflow.keras.layers.Dense(units=dm)
self.des = tflow.keras.layers.Dense(units=dm)
# 1. Construct Linear-layers
q = self.qdes(q)
k = self.kdes(k)
v = self.vdes(v)
# 2. Perform Head-splitting
q = self.sheads(q, bsize)
k = self.sheads(k, bsize)
v = self.sheads(v, bsize)
# 4. Head Combining
cattention = tflow.reshape(sattention,
(bsize, -1, self.dm))
# 5. Layer Condensation
outNodes = self.des(cattention)
return outNodes
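The fragments above can be assembled into a complete layer roughly as follows. This
is a sketch based on the standard Transformer multi-head attention; the attribute and
method names (qdes, kdes, vdes, des, sheads) follow the fragments, while the
remaining glue code is an assumption:
class MHAttention(tflow.keras.layers.Layer):
    def __init__(self, dm, nhd, name="MHAttention"):
        super().__init__(name=name)
        self.nhd = nhd                    # number of attention heads
        self.dm = dm                      # model (embedding) dimension
        assert dm % self.nhd == 0
        self.dep = dm // self.nhd         # depth of each head
        self.qdes = tflow.keras.layers.Dense(units=dm)
        self.kdes = tflow.keras.layers.Dense(units=dm)
        self.vdes = tflow.keras.layers.Dense(units=dm)
        self.des = tflow.keras.layers.Dense(units=dm)

    def sheads(self, x, bsize):
        # Split (batch, seq, dm) into (batch, nhd, seq, dep)
        x = tflow.reshape(x, (bsize, -1, self.nhd, self.dep))
        return tflow.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):
        q, k, v, m = inputs['q'], inputs['k'], inputs['v'], inputs['m']
        bsize = tflow.shape(q)[0]
        # 1. Construct Linear-layers
        q, k, v = self.qdes(q), self.kdes(k), self.vdes(v)
        # 2. Perform Head-splitting
        q, k, v = self.sheads(q, bsize), self.sheads(k, bsize), self.sheads(v, bsize)
        # 3. Scaled dot-product attention per head (calc_attention defined above)
        sattention = calc_attention(q, k, v, m)
        sattention = tflow.transpose(sattention, perm=[0, 2, 1, 3])
        # 4. Head Combining
        cattention = tflow.reshape(sattention, (bsize, -1, self.dm))
        # 5. Layer Condensation
        outNodes = self.des(cattention)
        return outNodes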
16.4.8 System Implementation
Implement (1) a padding mask and (2) a look-ahead mask to mask the token sequences
(a sketch of both helpers is given below).
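A minimal sketch of the two masking helpers is shown below; the function names
gen_pmask and gen_lamask follow the model-assembly code later in this section, and
the bodies follow the standard TensorFlow Transformer recipe:
def gen_pmask(x):
    # 1.0 where the token ID is 0 (padding), 0.0 elsewhere
    m = tflow.cast(tflow.math.equal(x, 0), tflow.float32)
    return m[:, tflow.newaxis, tflow.newaxis, :]          # (batch, 1, 1, seq_len)

def gen_lamask(x):
    # Upper-triangular mask that hides future tokens, combined with the padding mask
    slen = tflow.shape(x)[1]
    lam = 1 - tflow.linalg.band_part(tflow.ones((slen, slen)), -1, 0)
    pmask = gen_pmask(x)
    return tflow.maximum(lam, pmask)                      # (batch, 1, seq_len, seq_len)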
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{16.2}$$
In[19] # Implementation of Positional Encoding Class (PEncoding)
class PEncoding(tflow.keras.layers.Layer):
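The body of the PEncoding class is not reproduced above. A minimal sketch following
Eq. (16.2) and the standard TensorFlow recipe might look like this (the constructor
arguments position and dm are assumptions):
import numpy as np

class PEncoding(tflow.keras.layers.Layer):
    def __init__(self, position, dm):
        super().__init__()
        self.pencoding = self.calc_pencoding(position, dm)

    def calc_pencoding(self, position, dm):
        pos = np.arange(position)[:, np.newaxis].astype(np.float32)
        i = np.arange(dm)[np.newaxis, :].astype(np.float32)
        # Angle term pos / 10000^(2i/d_model) from Eq. (16.2)
        angles = pos / np.power(10000.0, (2 * (i // 2)) / np.float32(dm))
        angles[:, 0::2] = np.sin(angles[:, 0::2])   # sine on even indices
        angles[:, 1::2] = np.cos(angles[:, 1::2])   # cosine on odd indices
        return tflow.cast(angles[np.newaxis, ...], tflow.float32)

    def call(self, inputs):
        # Add the positional encoding to the token embeddings
        return inputs + self.pencoding[:, :tflow.shape(inputs)[1], :]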
Out[20]
(Heat map of the positional encoding values, with Depth (0–500) on the horizontal
axis and Position (0–50) on the vertical axis; values range from –1.00 to 1.00.)
att = MHAttention(
dm, nhd, name="att")({
'q': inNodes,
'k': inNodes,
'v': inNodes,
'm': pmask
})
att = tflow.keras.layers.Dropout(rate=drop)(att)
att = tflow.keras.layers.LayerNormalization(
epsilon=1e-6)(inNodes + att)
return tflow.keras.Model(
inputs=[inNodes, pmask], outputs=outNodes, name=name)
Out[22]
outNodes = tflow.keras.layers.Dropout(rate=drop)(embeddings)
for i in range(nlayers):
    outNodes = enclayer(
        i=x,
        dm=dm,
        nhd=nhd,
        drop=drop,
        name="enclayer_{}".format(i),
    )([outNodes, pmask])
return tflow.keras.Model(
    inputs=[inNodes, pmask], outputs=outNodes, name=name)
tflow.keras.utils.plot_model(encoder_sample,
                             to_file='encoder_sample.png', show_shapes=True)
Out[24]
att1 = tflow.keras.layers.LayerNormalization(epsilon=1e-6)(att1 + inNodes)
1. The Encoder Layer implements a single Attention Learning object, while the Decoder
   Layer implements two Attention Learning objects, att1 and att2, according to the
   Transformer learning model
2. Again, the relu function is used as the activation function. A different
   activation function can be adopted to improve network performance, as studied in
   Sect. 16.1
In[26] # Create a decoder layer sample and show object association diagram
declayer_sample = declayer(i = 512, dm = 128, nhd = 4, drop = 0.3,
name = "declayer_sample")
tflow.keras.utils.plot_model(declayer_sample,
                             to_file='declayer_sample.png', show_shapes=True)
Out[26]
outNodes = tflow.keras.layers.Dropout(rate=drop)(embeddings)
for i in range(nlayers):
    outNodes = declayer(i=x,
                        dm=dm,
                        nhd=nhd,
                        drop=drop,
                        name='declayer_{}'.format(i),
                        )(inputs=[outNodes, encouts, lamask, pmask])
Out[28]
The Transformer involves implementing the Encoder, the Decoder, and a final linear
layer. The Transformer decoder output is the input to the final linear layer, similar
to the output layer of a Recurrent Neural Network (RNN), and the output model is
returned.
enc_pmask = tflow.keras.layers.Lambda(
gen_pmask, output_shape=(1, 1, None),
name="enc_pmask")(queries)
# Perform Look Ahead Masking for Decoder Input for the Att1
lamask = tflow.keras.layers.Lambda(gen_lamask,
output_shape=(1, None, None),
name = "lamask")(dec_queries)
encouts = encoder(svcab=svcab,
nlayers = nlayers,
x = x,
dm = dm,
nhd = nhd,
drop = drop,)(inputs = [queries, enc_pmask])
decouts = decoder(svcab=svcab,
                  nlayers=nlayers,
                  x=x,
                  dm=dm,
                  nhd=nhd,
                  drop=drop,)(inputs=[dec_queries, encouts, lamask, dec_pmask])
responses = tflow.keras.layers.Dense(units=svcab, name="outNodes")(decouts)
tflow.keras.utils.plot_model(transformer_sample,
                             to_file="transformer_sample.png", show_shapes=True)
Out[30]
The parameters for nLayers, dm, and units (x) have been reduced to speed up the
training process.
1. The Movie Chatbot Transformer model consists of two layers with 512 units, a model
   dimension (dm) of 256, 8 attention heads, and a dropout rate of 0.1, following the
   Transformer model in Fig. 16.2
2. It is recommended to modify these parameter settings to improve network
   performance, as discussed in Sect. 16.1 (an instantiation sketch follows this list)
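As an illustration, the model might be instantiated with the hyperparameters listed
above as follows (a sketch; the builder function name transformer and its keyword
arguments are assumptions based on the encoder and decoder code in this section):
model = transformer(svcab=SVCAB,   # subword vocabulary size (around 8000, see above)
                    nlayers=2,     # number of encoder/decoder layers
                    x=512,         # units in the feed-forward sublayer
                    dm=256,        # model (embedding) dimension
                    nhd=8,         # number of attention heads
                    drop=0.1)      # dropout rate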
loss_val = tflow.keras.losses.SparseCategoricalCrossentropy(
from_logits= True, reduction='none')(xtrue, xpred)
return tflow.reduce_mean(loss_val)
self.warmup_steps = warmup_steps
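Only a fragment of the learning rate schedule class appears above. A sketch of such a
class is given below; the class name CLearning is inferred from CLearning_sample in
the plotting code, and the formula is the warm-up schedule from Vaswani et al. (2017):
class CLearning(tflow.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, dm, warmup_steps=4000):
        super().__init__()
        self.dm = tflow.cast(dm, tflow.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tflow.cast(step, tflow.float32)
        arg1 = tflow.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        # lrate = dm^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
        return tflow.math.rsqrt(self.dm) * tflow.math.minimum(arg1, arg2)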
pyplt.plot(CLearning_sample(tflow.range(200000, dtype=tflow.float32)))
pyplt.ylabel("Learning Rate")
pyplt.xlabel("Train Step")
Out[34]
(Plot of the learning rate against the training step: the learning rate varies
between 0 and about 0.0014 over 200,000 training steps.)
Train the chatbot transformer model by calling model.fit(); 20 epochs are used here
to save time.
In[36] EPOCHS = 20
Out[36]
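Although the compile step is not reproduced above, a minimal sketch of compiling and
training the model is given below. The optimizer settings follow Vaswani et al.
(2017); CLearning refers to the schedule class sketched earlier, and the plain
SparseCategoricalCrossentropy loss stands in for the workshop's own loss function,
which additionally masks padded positions:
optimizer = tflow.keras.optimizers.Adam(CLearning(dm=256),
                                        beta_1=0.9, beta_2=0.98, epsilon=1e-9)
model.compile(optimizer=optimizer,
              loss=tflow.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
history = model.fit(mDS, epochs=EPOCHS)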
for i in range(MLEN):
chatting = model(inputs = [utterance, response], training = False)
chatted_utterance = m_token.decode([i for i in mchatting if i < m_token.vocab_size])
print('Query: {}'.format(utterance))
print('Response: {}'.format(chatted_utterance))
return chatted_utterance
1. Training showed that epochs 1–20 are rather slow, but accuracy increases and loss
   decreases steadily
2. Two chatbot experiments were conducted: one trained for 2 epochs and the other for
   20 epochs. Results showed that the 20-epoch model performs noticeably better than
   the 2-epoch one
3. Increase the number of epochs, say up to 50, to check whether accuracy continues
   to improve. This naturally requires more time unless sufficient GPUs are available
16.5 Related Works
This workshop integrated all the NLP implementation techniques learnt so far,
including TensorFlow and Keras with Transformer technology, to design an AI-based
NLP application: a chatbot system. It is a step-by-step implementation consisting of
data preprocessing, model construction, system training, and testing and evaluation.
The Attention Learning and Transformer technology on the TensorFlow and Keras
implementation platform can easily be applied to other chatbot domains and
interactive QA systems; here the Cornell Large Movie Dialog dataset, with over
200,000 movie conversations involving 10,000+ movie characters, was used.
Nevertheless, this is only the dawn of the journey. New R&D results and applications
appear regularly in NLP. Below is a list of renowned domains and resources related
to chatbot systems for reference.
References
Index
F J
FastText, 320 Jupyter Notebook, 244
Feature-based model, 219
Feedforward neural network (FNN), 180
Fillmore’s Case Roles Theory, 153 K
Fillmore’s theory, 103–107, 112 Kaggle, 351
First-order predicate calculus (FOPC), 98, KBQA, 227, 228
100, 102, 103, 107–113 Keras, 336, 357–371, 401, 404, 415, 418, 420,
Frame-based representation, 98–100 422, 424, 430, 431
L N
Language detection, 336 Name entity recognition (NER), 227, 276
Language model, 21, 24, 26, 34, 40 Natural language generation (NLG), 11
Language model evaluation (LME), 34–40 Natural language processing, 3–16
Language modeling, 24–25 Natural language toolkit (NLTK), 243–264,
Laplace (Add-one) Smoothing, 36–38 267–311, 401
Large Movie Reviews Dataset, 351 Natural language understanding (NLU),
Latent Semantic Analysis (LSA), 221 11–13, 48, 65, 70, 276
Latent Semantic Indexing, 207–208 Next sentence prediction (NSP), 193, 195
Lemma, 23 N-gram, 21, 22, 26, 27, 29–33, 35, 36, 38–41,
Lesk Algorithm, 136 46, 47, 58, 60, 86, 87, 90, 92, 211,
Lexical ambiguity, 7, 123 267, 268, 270, 271, 273
Lexical analysis, 75 Noun-phrase (NP), 72
Lexical dispersion plot, 253–255 Nouns, 43, 44, 49, 51, 53, 56
Lexical diversity, 258–259
Lexicalized parsing, 91–93
Lexical probability, 92, 93 P
Lexical semantic analysis, 117 Parsing, 67–94, 268
Lexical semantics, 117, 119, 145 Part-of-Speech (POS), 43–65, 67
Lexicology, 6 Path-based similarity, 132–134, 145
Lexicon, 75, 78, 80, 91 Penn Treebank (PTB), 45, 46
Linguistic levels, 6, 7 Perplexity (PP), 34–35, 41
Log-linear model, 170 Phonetics, 6
Long short-term memory (LSTM), 10, 170, Phonological parsing, 79
183–186, 188, 196, 224, 226, 230, Pointer-Generator Networks, 224
335–371, 374, 375, 388 Point-wise Mutual Information (PMI), 139
Luhn’s Algorithm, 219 Polysemy, 120, 129
Porter Stemmer, 285, 289–292
Positional encoding, 189–190, 413, 416
M Positive Point-wise Mutual Information
Machine learning (ML) method, 169–170 (PPMI), 140–142, 144–146
Machine translation (MT), 8, 14, 19, 22 POS tagger, 285, 304–306, 308–310
Markov chain, 25–27, 40 POS tagging, 44, 45, 47–49, 58, 60, 64, 65,
Maximum entropy Markov model 124, 244, 285–311
(MEMM), 60 POS Tagset, 301, 302
Maximum likelihood estimates (MLE), 30 Pragmatic, 124
Meaning representations, 95–98, 100–103, Pragmatic ambiguity, 8
110, 112, 113 Pragmatic analysis, 12, 13, 16
MeSH, 130–131 Pragmatic meaning, 95
Metaphor, 120 Pragmatics analysis, 149
Metonymy, 120 Predicates, 108, 109
Minsky, M., 99 Prepositions, 51–52
Modal verb, 54 Probabilistic context-free grammar (PCFG),
Morphological parsing, 79 87, 88, 90–91, 94
Morphology, 6 Probabilistic Ranking Principle (PRP), 202–207
Morphology analysis, 96 Pronouns, 43, 51, 53, 55
Movie comments, 351 PTB Tagset, 285, 302
Q Stop-words, 292
Q&A chatbots, 19 Subject–Predicate–Object (SPO), 227
QA systems, 16, 224, 225, 227–229, 231, SummaRuNNer, 219, 221
233, 235 Supervised discourse segmentation, 158
Quadrigram, 22, 33, 269, 275 Supervised learning (SL), 125
Quantifiers, 108 SVD model, 208
Query-focused summarization (QFS) Symbolic representations, 118
systems, 215–216 Synonyms, 121
Query-likelihood, 207–208 Synsets, 126, 145
Syntactic ambiguity, 7
Syntactic levels, 6
R Syntactic parsing, 79, 80
Recurrent neural network (RNN), 170, Syntactic rules, 68
180–188, 196, 373 Syntax, 48, 67–94
Referring expression (RE), 159, 161 Syntax analysis, 13, 67
Regular language (RL), 76
Resnik method, 135, 145
Rhetorical structure theory (RST), 158, 209, 211 T
Rule-based POS tagging, 45 Taggers evaluations, 63–64
Rule-based QA systems, 227 Tagging, 43–65, 335, 336, 358
Tag sequence frequency, 60
Tagset, 56–58
S Taskmaster, 430
Selectional restrictions, 107 TensorFlow, 316, 335, 336, 357, 358, 367,
Self-attention, 190, 410 373–398, 401–431
Self-attention mechanism, 227 Term-context matrix, 139, 141, 145
Semantic ambiguity, 7 Term Distribution Models, 207
Semantic analysis, 12, 13, 244, 268, 313–333 Term-document matrix, 137–139
Semantic categorization, 330 Term-frequency (TF), 200
Semantic level, 6 Text analysis, 243, 249, 253, 257,
Semantic meaning, 95 259, 296–299
Semantic networks, 98–99 TextCategorizer, 335, 338–343, 345, 347–351,
Semantic processing, 97 355, 356
Semantics, 48, 95, 107, 108, 111 Text classification, 244, 335–371
Semantic similarity, 320, 323, 326–333 Text coherent, 158–159
Semi-supervised methods, 125 Text processing, 248–249, 252
Sentiment analysis, 15, 337, 338, 401 Text summarization (TS), 212–224, 235
Shakespeare, W., 31 TextTeaser, 219
Shannon’s method, 31, 32 TextTiling, 208
Sherlock Holmes, 27–31, 33, 35, 37, 38, 40, Thesaurus, 130–132
41, 247, 249, 251, 256, 257, 262, TL system, 235
263, 268, 270, 271, 275, 278–281 Tokenization, 243, 244, 255–259,
Single and multiple document 267–288, 301
summarization, 217 Tokens, 23
Smoothing, 141–143 Top-down parser, 80–82
Smoothing techniques, 36 Topic generation, 336
Snowball Stemmer, 285, 289, 291–292 Transfer learning (TL), 175–196
SpaCy, 243, 267, 276–284, 313–333, TransferTransfo Conversational
335–371, 373–398, 401 Agents, 234–235
Speech recognition, 10, 12, 13, 268 Transformation-based learning (TBL),
Stem, 23 61–62, 65
Stemming, 244, 285, 288, 289, 301 Transformers, 170, 175–196, 243,
Stochastic POS tagging, 45 373–398, 401–431