Deep Learning For Natural Language Processing
Chapter 1 Introduction .......................................................... 1
    Learning Outcomes ........................................................... 1
    1.1 Introduction ............................................................ 1
        1.1.1 Subsets of Artificial Intelligence ................................ 3
        1.1.2 Three Horizons of Deep Learning Applications ...................... 4
        1.1.3 Natural Language Processing ....................................... 5
        1.1.4 Speech Recognition ................................................ 7
        1.1.5 Computer Vision ................................................... 7
    1.2 Machine Learning Methods for NLP, Computer Vision (CV), and Speech ..... 10
        1.2.1 Support Vector Machine (SVM) ..................................... 10
        1.2.2 Bagging .......................................................... 12
        1.2.3 Gradient-boosted Decision Trees (GBDTs) .......................... 13
        1.2.4 Naïve Bayes ...................................................... 13
        1.2.5 Logistic Regression .............................................. 15
        1.2.6 Dimensionality Reduction Techniques .............................. 17
    1.3 Tools, Libraries, Datasets, and Resources for the Practitioners ........ 17
        1.3.1 TensorFlow ....................................................... 20
        1.3.2 Keras ............................................................ 21
        1.3.3 Deeplearning4j ................................................... 21
        1.3.4 Caffe ............................................................ 21
        1.3.5 ONNX ............................................................. 21
        1.3.6 PyTorch .......................................................... 21
        1.3.7 scikit-learn ..................................................... 22
        1.3.8 NumPy ............................................................ 22
        1.3.9 Pandas ........................................................... 22
        1.3.10 NLTK ............................................................ 23
        1.3.11 Gensim .......................................................... 23
        1.3.12 Datasets ........................................................ 24
    1.4 Summary ................................................................ 26
    Bibliography ............................................................... 26
    4.8 Summary ................................................................ 97
    Bibliography ............................................................... 97
Index .......................................................................... 221
About the Authors
L. Ashok Kumar was a Postdoctoral Research Fellow at San Diego State University,
California. He was selected among seven scientists in India for the BHAVAN
Fellowship from the Indo-US Science and Technology Forum, and he also received
the SYST Fellowship from DST, Government of India. He has 3 years of industrial
experience and 22 years of academic and research experience. He has published 173
technical papers in international and national journals and presented 167 papers at
national and international conferences. He has completed 26 Government-of-India-
funded projects worth about 15 crores, and currently 9 projects worth about 12 crores
are in progress. He has developed 27 products, of which 23 have been
technology-transferred to industries and government funding agencies. He has
created six Centres of Excellence at PSG College of Technology in collaboration
with government agencies and industries, namely, Centre for Audio Visual Speech
Recognition, Centre for Alternate Cooling Technologies, Centre for Industrial Cyber
Physical Systems Research Centre for Excellence in LV Switchgear, Centre for
Renewable Energy Systems, Centre for Excellence in Solar PV Systems, and Centre
for Excellence in Solar Thermal Systems. His PhD work on wearable electronics
earned him a national award from ISTE, and he has received 26 awards at national
and international levels. He has guided 92 graduate and postgraduate projects.
He has produced 6 PhD scholars, and 12 candidates are doing PhD under his super-
vision. He has visited many countries for institute industry collaboration and as a
keynote speaker. He has been an invited speaker in 345 programs. Also, he has orga-
nized 102 events, including conferences, workshops, and seminars. He completed
his graduate program in Electrical and Electronics Engineering from the University of
Madras, his postgraduate degree from PSG College of Technology, Coimbatore, India,
and his Master's in Business Administration from IGNOU, New Delhi. After the com-
pletion of his graduate degree, he joined as Project Engineer for Serval Paper Boards
Ltd. Coimbatore (now ITC Unit, Kova). Presently, he is working as Professor in the
Department of EEE, PSG College of Technology. He is also a Certified Chartered
Engineer and a BSI-Certified ISO 50001:2008 Lead Auditor. He has authored 19 books
in his areas of interest published by Springer, CRC Press, Elsevier, Nova Publishers,
Cambridge University Press, Wiley, Lambert Publishing, and IGI Global. He has
11 patents, one design patent, and two copyrights to his credit and also contributed
18 chapters in various books. He is also Chairman of the Indian Association of Energy
Management Professionals; an Executive Member of the Institution of Engineers,
Coimbatore; an Executive Council Member of the Institute of Smart Structure and Systems,
Bangalore; and an Associate Member of the Coimbatore District Small Industries
Association (CODISSIA). He also holds prestigious positions in various national
and international forums: he is a Fellow Member of IET (UK), IETE, and IE, and a
Senior Member of IEEE.
D. Karthika Renuka has been with PSG College of Technology since 2004. She is Associate Dean (Students Welfare)
and Convenor for the Students Welfare Committee in PSG College of Technology.
She is a recipient of Indo-U.S. Fellowship for Women in STEMM (WISTEMM)—
Women Overseas Fellowship program supported by the Department of Science
and Technology (DST), Government of India, and implemented by the Indo-U.S.
Science & Technology Forum (IUSSTF). She was a Postdoctoral Research Fellow
at Wright State University, Ohio, USA. Her areas of specialization include Data
Mining, Evolutionary Algorithms, Soft Computing, Machine Learning and Deep
Learning, Affective Computing, and Computer Vision. She has organized an inter-
national conference on Innovations in Computing Techniques on January 22–24,
2015 (ICICT2015), and national conference on “Information Processing and Remote
Computing" on February 27 and 28, 2014 (NCIPRC 2014). She is a reviewer for Computers
and Electrical Engineering (Elsevier) and for Wiley and Springer book chapters on
"Knowledge Computing and its Applications." She is currently guiding eight research
scholars for their PhD under Anna University, Chennai, Tamil Nadu.
She has published several papers in reputed national and international journals and
conferences.
Preface
In the early days, applications were developed to establish interaction between
humans and computers. Today, humans use advanced modes of communication such
as text, speech, and images/video to interact with computers. Voice-based assistants,
AI-based chatbots, and advanced driver assistance systems are examples of applications
that are becoming more common in daily life.
In particular, the profound success of deep learning in a wide variety of domains
has served as a benchmark for the many downstream applications in artificial intel-
ligence (AI). Application areas of AI include natural language processing (NLP),
speech, and computer vision. The cutting-edge deep learning models have predomi-
nantly changed the perspectives of varied fields in AI, including speech, vision, and
NLP. In this book, we have attempted to explore recent developments of deep learning
in the fields of NLP, speech, and computer vision. With the knowledge in this book,
the reader can understand the intuition behind the working of natural language, speech,
and computer vision applications. NLP is a branch of AI that enables computers to
interpret the meaning of human language. NLP utilizes machine learning and deep
learning algorithms to derive the context behind raw text. Computer vision
applications such as advanced driver assistance systems,
augmented reality, virtual reality, and biometrics have advanced significantly. With
the advances in deep learning and neural networks, the field of computer vision has
made great strides in the last decade and now outperforms humans in tasks such as
object detection and labeling. This book gives students, academic and industrial
researchers, and anyone interested in deep learning and NLP an accessible understanding
of the fundamental concepts underlying deep learning algorithms.
It serves as a source of motivation for those who want to create NLP, speech, and
computer vision applications.
Acknowledgments
The authors are thankful to Shri L. Gopalakrishnan, Managing Trustee, PSG
Institutions, and Dr. K. Prakasan, Principal, PSG College of Technology, Coimbatore,
for their wholehearted cooperation and constant encouragement in this successful
endeavor. The authors wish to acknowledge the Department of Science and Technology
(DST) for sponsoring their project under DST-ICPS scheme which sowed the seeds
among the authors in the area of Deep Learning Approach for Natural Language
Processing, Speech, and Computer Vision. The authors thank the editorial team and
the reviewers of CRC Press, Taylor & Francis Group for their relentless efforts in bring-
ing out this book.
Dr. L. Ashok Kumar would like to take this opportunity to acknowledge the peo-
ple who helped him in completing this book. He is thankful to his wife, Ms. Y. Uma
Maheswari, and also grateful to his daughter, Ms. A. K. Sangamithra, for their con-
stant support and care during writing.
Dr. D. Karthika Renuka would like to express gratitude to all her well-wishers and
friends. She would also like to express her gratitude to her parents, Mr. N. Dhanaraj
and Ms. D Anuradha, for their constant support. She gives heartfelt thanks to her
husband, Mr. R. Sathish Kumar, and her dear daughter, Ms. P. S. Preethi, for their
unconditional love which made her capable of achieving all her goals.
The authors are thankful to the almighty God for His immeasurable blessing upon
their lives.
1 Introduction
LEARNING OUTCOMES
After reading this chapter, you will be able to:
1.1 INTRODUCTION
The fourth industrial revolution, according to the World Economic Forum, is
about to begin. It will blend the physical and digital worlds in ways we could not
have imagined a few years ago. Advances in machine learning and AI will help usher in
these exciting changes. Machine learning is transformative, opening up new
scenarios that were simply impossible a few years ago. Deep learning represents
a significant paradigm shift from traditional software development models.
Instead of having to write explicit top-down instructions for how software
should behave, deep learning allows software to generalize rules of operation
from data. Deep learning models let engineers build systems whose behavior is
characterized by data rather than by hand-written rules. Deep learning models are
deployed at scale in production applications, for example in automotive, gaming,
healthcare, and autonomous vehicles. Deep learning models employ artificial neural
networks, which are computer architectures comprising multiple layers of inter-
connected components. By passing data through these connected
units, a neural network can learn how to approximate the computations required
to transform inputs into outputs. Deep learning models require high-quality data
to train a neural network to carry out a particular task. Depending
on your intended application, you might need thousands to millions
of samples.
This chapter takes you on a journey of AI from where it originated. The story involves
not just the evolution of computer science but also several other fields, such as biol-
ogy, statistics, and probability. Let us start with biological neurons: way back
in 1871, Joseph von Gerlach proposed the reticulum theory, which asserted that "the
nervous system is a single continuous network rather than a network of numerous
separate cells." According to him, the human nervous system is a single system and
not a network of discrete cells. Camillo Golgi was able to examine neural tissues in
greater detail than ever before, thanks to a chemical reaction he discovered. He con-
cluded that the human nervous system was composed of a single cell and reaffirmed
his support for the reticular theory. In 1888, Santiago Ramón y Cajal used Golgi's
method to examine the nervous system and concluded that it is a collection of distinct
cells rather than a single cell.
The first artificial neuron was proposed by McCulloch and Pitts in 1943.
Alan Turing proposed the theory of computation and also pioneered the idea of the
universal computer. Shannon came up with information theory, which is exten-
sively used in machine learning and signal processing. In 1948, Norbert Wiener
founded the field of cybernetics. In 1951, Minsky developed the neural net machine
called the Stochastic Neural Analog Reinforcement Calculator (SNARC). One of
the most significant moments in the history of AI was the development of the
first general-purpose electronic computer, the ENIAC (Electronic
Numerical Integrator and Computer), which could be reprogrammed to tackle a wide
range of numerical problems. Samuel created the first checkers-playing program
for the IBM 701 in 1952. It kept a memory of its previous games and
applied that early experience in the current game. In 1955, the Logic Theorist became the
first AI program, written by Allen Newell, Herbert A. Simon, and Cliff Shaw; it
proved theorems from Whitehead and Russell's Principia Mathematica. At the
Dartmouth Conference in 1956, the term AI was coined with the aim of building
thinking machines, and this event marked the formal birth of AI. AI has the ability
to change each and every individual in the world. Later, in 1957, Rosenblatt was
the first to come up with the idea of a two-layered artificial neural network called
the perceptron.
Arthur Samuel defined machine learning in 1959 as the field of study that gives
computers the ability to learn without being explicitly programmed. The golden years of AI were between
1960 and 1974, when expert systems grew: explicit, rule-based programs were built
to address problems like playing chess, understanding natural language, solving word
problems in algebra, identifying infections, and recommending medications. But
expert systems were not good enough, as they are basically rule-based models. In 1969,
the back-propagation algorithm was derived by Bryson and Ho.
"Within twenty years, machines will be capable of accomplishing whatever
work a man can do," predicted Nobel laureate H. A. Simon in 1965. In 1970,
Marvin Minsky predicted that "within three to eight years, we will have a
machine with the general intelligence of an average human being." But neither
prediction came true; 1974–80 is called the first AI winter, because AI results
addressed few real-world problems owing to the lack of computational power,
combinatorial explosion, and the loss of government funding for AI.
The period 1980–2000 is said to be the spring of AI, when the first driverless car
was invented. Minsky said in 1984 that "winter is coming," and the period 1987–93 became the
second AI winter due to a lack of spectacular results and funding cuts. In 1997, IBM's
Deep Blue beat world chess champion Kasparov in a chess match.
The current AI systems are inspired by the working of the brain in the same
way airplanes are inspired by birds. They create a lot of
excitement around how we can personalize applications to make smarter decisions
for people.
Some of the use cases of NLP, which can attract researchers, are depicted in Figure 1.9
and explained in Chapters 2 to 4.
• Acoustic Model: Identifies the most probable phonemes given the input audio.
• Pronunciation Model: Maps phonemes to words.
• Language Model: Identifies the likelihood of word sequences in a sentence.
The different use cases of speech are depicted in Figure 1.11 and explained later in
the book, in Chapters 5 to 7.
1.1.5 Computer Vision
Human vision is both complex and beautiful. The vision system is made up of eyes
that collect light, brain receptors that allow access to it, and a visual cortex that
processes it. Computer vision AI research enables machines to comprehend and
recognize the underlying structure of images, videos, and other visual inputs. The
process of making a machine understand and identify the underlying pattern
behind an image is called computer vision. To train an algorithm to figure out the
pattern behind an image and extract intelligent content in the way the human
brain does, we need to feed it extremely large datasets of millions of objects
across thousands of angles. Some of the extensively used image-processing
SVM handles the nonlinearly separable case using the kernel trick, which maps the
data into a higher-dimensional space.
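As a minimal illustration (not from the original text), the following sketch trains an RBF-kernel SVM from scikit-learn on a toy nonlinearly separable dataset; the dataset and parameter choices are assumptions made for demonstration.

# A minimal sketch (not from the original text): an RBF-kernel SVM on a
# nonlinearly separable toy dataset using scikit-learn.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The RBF kernel implicitly maps points into a higher-dimensional space
# where the two classes become linearly separable.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))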
Pros:
Cons:
1.2.2 Bagging
Bagging and boosting are the ensemble techniques which grow multiple decision trees
and combine their results to come out with single prediction as given in Figure 1.16.
Pros:
Cons:
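A minimal sketch (not from the original text), assuming scikit-learn's BaggingClassifier, whose default base estimator is a decision tree:

# A minimal sketch (not from the original text): bagging decision trees with
# scikit-learn; each tree sees a bootstrap sample and the predictions are
# aggregated by voting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
bagged_trees = BaggingClassifier(n_estimators=100, random_state=42)
print("CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())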
1.2.4 Naive Bayes
Naive Bayes is a basic probabilistic model of how the information in each class might
have been generated. It is called "naive" because it assumes that the features are conditionally
independent given the class. Figure 1.18 depicts several flavors of the Naive Bayes
classifier; they are defined next.
Pros:
Cons:
• The assumption that features are conditionally independent for the given
class is not realistic
• Poor generalization performance
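As a minimal illustration (not from the original text), a multinomial Naive Bayes spam classifier over bag-of-words counts; the tiny training set is invented for demonstration.

# A minimal sketch (not from the original text): multinomial Naive Bayes
# text classification on a toy spam/ham dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["free prize, claim now", "meeting at 10 am",
               "win cash prize now", "project status update"]
train_labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["claim your free cash"]))  # expected: ['spam']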
1.2.5 Logistic Regression
Logistic regression is extensively used for binary classification. Most real-world situa-
tions have two outcomes: alive or dead, winner or loser, successful or unsuccessful—
also known as binary, Bernoulli, or 0/1 outcomes.
Let us understand with a simple example: the main objective is to model the probability
that a player wins the match, as given in Eq. (1.1). The probability value
ranges between 0 and 1.
The odds value is defined as the probability over one minus the probability. The
value of odds ranges between 0 and ∞.
\[ \text{odds} = \frac{\Pr(\text{Win} \mid \text{Score}, b_0, b_1)}{1 - \Pr(\text{Win} \mid \text{Score}, b_0, b_1)} \tag{1.2} \]
The log of the odds, otherwise called the logit, ranges between −∞ and +∞.
\[ \text{logit} = \log\!\left( \frac{\Pr(\text{Win} \mid \text{Score}, b_0, b_1)}{1 - \Pr(\text{Win} \mid \text{Score}, b_0, b_1)} \right) \tag{1.3} \]
\[ \log\!\left( \frac{\Pr(\text{Win} \mid \text{Score}, b_0, b_1)}{1 - \Pr(\text{Win} \mid \text{Score}, b_0, b_1)} \right) = b_1 \cdot \text{Score} + b_0 \tag{1.5} \]
The generic forms of linear regression and of logistic regression with the sigmoid function
are depicted in Figures 1.19 and 1.20.
\[ y = \operatorname{logistic}(b + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n) \tag{1.6} \]
\[ = \frac{1}{1 + \exp\!\big(-(b + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n)\big)} \tag{1.7} \]
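A minimal sketch (not from the original text) of Eqs. (1.6)–(1.7): the logistic (sigmoid) function maps a linear score to a probability; the intercept, weight, and score values below are hypothetical.

# A minimal sketch (not from the original text): the logistic (sigmoid)
# function of Eqs. (1.6)-(1.7) applied to a linear score.
import numpy as np

def logistic(z):
    # Map a real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

b = -4.0                        # hypothetical intercept
w = np.array([0.08])            # hypothetical weight
score = np.array([60.0])        # hypothetical player score
print(logistic(b + w @ score))  # probability of winning, roughly 0.69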
Pros:
Cons:
Pros:
Cons:
• Information loss
1.3.1 TensorFlow
TensorFlow supports a wide range of solutions, including NLP, computer vision
(CV), predictive ML, and reinforcement learning. TensorFlow is an end-to-end
open-source deep learning framework from Google. You can use the following snippet
to import the TensorFlow library. Figures 1.23 and 1.24 depict extensively used
machine learning tools.
import tensorflow as tf
print("TensorFlow version:", tf.__version__)
To import the inbuilt dataset from TensorFlow, use the following snippets:
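The original snippet is not reproduced here; a minimal sketch, assuming the Keras datasets module that ships with TensorFlow, would be:

# A minimal sketch (assumed, not the book's original snippet): loading a
# built-in dataset via the Keras datasets module bundled with TensorFlow.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)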
• Enables the user to run machine learning and deep learning code on CPU and
GPU platforms.
• Simple and flexible for training and deploying models in the cloud.
• Provides eager execution, which enables immediate iteration and intuitive
debugging.
1.3.2 Keras
Keras is an open-source API for developing sophisticated deep learning models.
It is a high-level interface that commonly uses TensorFlow as its backend. It supports both CPU and
GPU run times and almost all neural network model types. Being
flexible, Keras is well suited for innovative research.
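A minimal sketch (not from the original text): a small Keras Sequential model for 10-class classification of 28 × 28 inputs; the layer sizes are arbitrary choices for illustration.

# A minimal sketch (not from the original text): a small Keras Sequential model.
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()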
1.3.3 Deeplearning4j
Deeplearning4j is a deep learning library for the Java virtual machine, usable from JVM
languages such as Scala and Kotlin. It can process massive amounts of data within a Java
ecosystem, offers both multi-threaded and single-threaded execution, and can be used in
conjunction with Spark and Hadoop.
1.3.4 Caffe
Caffe supports languages such as C, C++, Python, and MATLAB. It is known for its
speedy execution. Caffe’s model zoo consists of many pretrained models which can
be used for various image-processing applications.
1.3.5 ONNX
The Open Neural Network Exchange, or ONNX, was created as an open-source
deep learning ecosystem. ONNX, created by Microsoft and Facebook, is a deep
neural network learning framework that allows developers to easily switch between
platforms.
1.3.6 PyTorch
PyTorch is an open-source machine learning tool developed by Facebook, which
is extensively used in various applications like computer vision (CV) and NLP.
Some of the popular use cases developed using PyTorch are Hugging Face Trans-
formers, PyTorch Lightning, and Tesla Autopilot. Torchaudio, torchtext, and
torchvision are parts of PyTorch. Use the code snippet given here and start work-
ing in PyTorch.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
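A short usage sketch (assumed, not from the text) continuing from the imports above: loading FashionMNIST through torchvision and batching it with a DataLoader.

# A short usage sketch (assumed, not from the text): loading FashionMNIST
# with torchvision and batching it with a DataLoader.
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

train_data = datasets.FashionMNIST(root="data", train=True, download=True,
                                   transform=ToTensor())
loader = DataLoader(train_data, batch_size=64, shuffle=True)
images, labels = next(iter(loader))
print(images.shape)  # torch.Size([64, 1, 28, 28])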
PyTorch is a Python library with a C++ core that empowers data scientists and AI
engineers to build deep learning models. It removes the cognitive overhead
involved in building, training, and deploying neural networks. It is built around a
tensor class of multi-dimensional arrays, similar to NumPy. The following are the fea-
tures of PyTorch:
1.3.7 scikit-learn
scikit-learn includes inbuilt datasets like iris, digits for classification, and the diabe-
tes dataset for regression. The following is a simple code snippet for importing
datasets from scikit-learn.
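The snippet itself is not reproduced here; a minimal sketch, assuming scikit-learn's datasets module, would be:

# A minimal sketch (assumed, not the book's original snippet): importing
# built-in datasets from scikit-learn.
from sklearn.datasets import load_diabetes, load_digits, load_iris

iris = load_iris()
digits = load_digits()
diabetes = load_diabetes()
print(iris.data.shape, digits.data.shape, diabetes.data.shape)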
scikit-learn is a succinct and efficient tool for data mining and data analysis, built
on NumPy, SciPy, and Matplotlib. It is extensively used for use cases in classifica-
tion, regression, clustering, and dimensionality reduction.
1.3.8 NumPy
NumPy stands for numerical Python and supports N-dimensional array objects that can
be used for processing multidimensional data. NumPy supports vectorized operations.
NumPy is used in
• Mathematical and logical operations on arrays
• Fourier transforms
• Linear algebra operations
• Random number generation
numpy.array(object)
NumPy is used to create an array with start, stop, and step values:
numpy.arange(start=1, stop=10, step=2)
Using NumPy, we can generate array with ones() and zeros() using the following
snippets:
numpy.ones(shape,dtype)
numpy.zeros(shape,dtype)
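A minimal sketch (not from the original text) exercising the calls above together:

# A minimal sketch (not from the original text): the NumPy calls above in use.
import numpy as np

a = np.array([1, 2, 3])
b = np.arange(start=1, stop=10, step=2)      # array([1, 3, 5, 7, 9])
ones = np.ones((2, 3), dtype=np.float64)
zeros = np.zeros((2, 3), dtype=np.int32)
print(a + a)                                 # vectorized addition: [2 4 6]
print(b.mean(), ones.shape, zeros.dtype)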
1.3.9 Pandas
The pandas library provides high-performance, easy-to-use data structures and analytical
capabilities for the Python programming language. This open-source library delivers
powerful data manipulation and analysis tools through its robust data structures.
pandas works with data frames, which are two-dimensional, mutable structures that
store heterogeneous data. To import the necessary libraries, use the following snippet:
import os
import pandas as pd
The pandas library and Python use different names for data types, as shown in
Table 1.1.

TABLE 1.1
Python and pandas Data Types
Python    pandas Data Type    Description
int       int64               Numeric characters
float     float64             Numeric characters with decimals
1.3.10 NLTK
To work with human language data, Python scripts can be written using the open-
source NLTK framework. It consists of text-processing libraries for NLP applica-
tions such as categorization, tokenization, stemming, tagging, parsing, and semantic
reasoning. The following are the commands to use NLTK libraries:
import nltk
nltk.download()
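A short usage sketch (assumed, not from the text): tokenization and POS tagging with NLTK; note that resource names may differ slightly across NLTK versions.

# A short usage sketch (assumed, not from the text): tokenization and POS
# tagging with NLTK. Resource names may vary across NLTK versions.
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("Deep learning powers modern NLP applications.")
print(nltk.pos_tag(tokens))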
1.3.11 Gensim
Gensim is a widely used tool for NLP applications, in particular for Word2Vec
embedding. You can use the code given here to start with Gensim:
from gensim import corpora, models, similarities, downloader
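A short sketch (assumed, not from the text): training a tiny Word2Vec model on an invented toy corpus with Gensim.

# A short sketch (assumed, not from the text): a tiny Word2Vec model trained
# on a toy corpus with Gensim.
from gensim.models import Word2Vec

sentences = [["deep", "learning", "for", "nlp"],
             ["speech", "and", "computer", "vision"],
             ["deep", "learning", "for", "speech"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv.most_similar("deep", topn=3))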
Features of Gensim:
• Stanford CoreNLP
• Apache OpenNLP
• Textblob Library
• IntelNLP Architect
• Spacy
1.3.12 Datasets
Table 1.2 shows the widely used datasets in the cutting-edge research and industrial
applications.
TABLE 1.2
Datasets

Natural Language Processing
  The Blog Authorship Corpus (Sentiment Analysis): The corpus contains 681,288 posts and north of 140 million words, or roughly 35 posts and 7,250 words per individual.
  Amazon Product Dataset (Sentiment Analysis): Links (also viewed/also bought graphs), product metadata (descriptions, category information, price, brand, and image attributes), and product reviews (ratings, text, helpfulness votes) are all included in this dataset.
  Enron Email Dataset (Spam Email Classification): The corpus contains approximately 0.5 million messages in total.
  Multi-Domain Sentiment Dataset (Sentiment Analysis): Amazon.com product reviews are included in the Multi-Domain Sentiment Dataset for a variety of product types.
  Stanford Question Answering Dataset (SQuAD) (Question and Answering Analysis): A brand-new reading-comprehension dataset that includes 10,000 queries submitted by Wikipedia crowd workers.
  The WikiQA Corpus (Question and Answering): An open-source corpus with question and answer pairs, suitable for question answering systems.
  Yelp Reviews (Text Classification): Open-source dataset with 6,990,280 reviews.
  WordNet (Text Classification): Contains a large English lexical database. Cognitive synonyms (synsets) are groups of nouns, verbs, adjectives, and adverbs, each conveying a unique idea.
  Reuters Newswire Topic Classification (Text Classification): This dataset consists of 11,228 Reuters newswires organized into 46 themes.
  Project Gutenberg (Language Modelling): 28,752 English-language books for research in language modelling.

Speech Analytics
  LibriSpeech (Speech to Text): Widely used speech recognition corpus in research. It consists of 1,000 hours of read speech recorded in an acoustic environment.
  Acted Emotional Speech Dynamic Database (Emotion Recognition): Consists of speech samples for the Greek language and is suitable for emotion analysis.
  Audio MNIST (Speech to Text): The corpus consists of 30,000 audio files of spoken digits (0–9) from multiple speakers.
  VoxForge (Speech to Text): VoxForge is an open-source speech repository with varied dialects of the English language. Suitable for speech-to-text applications.
1.4 SUMMARY
It is fairly obvious that the advent of deep learning has triggered many practical
use cases in industrial applications. This chapter gives readers a high-level overview
of AI and its subsets, machine learning and deep learning. It also gives
a gentle introduction to machine learning algorithms like SVM, Random Forest, and
Naive Bayes, and to vector representations in NLP. The detailed discussion of tools,
libraries, datasets, and frameworks will enable readers to explore their own ideas.
This book is organized as follows: Chapters 2, 3, and 4 will describe the funda-
mental concepts and major architectures of deep learning utilized in NLP appli-
cations and aid readers in identifying the suitable NLP-related architectures for
solving many real-world use cases. Chapters 5, 6, and 7 will detail the concepts of
speech analytics and the idea behind sequence-to-sequence problems with various
pretrained architectures, and also discuss the cutting-edge research happening in
this domain. Chapters 8, 9, and 10 walk through the basic concepts behind CV with
a focus on real-world applications and a guide to the architecture of widely used
pretrained models.
BIBLIOGRAPHY
Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd
Edition, O’Reilly Media, Inc, 2019
Aurélien Géron, Neural Networks and Deep Learning, O’Reilly Media, Inc, 2018
Ben Lorica, Mike Loukides, What Is Artificial Intelligence?, O’Reilly Media, Inc, 2016
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
David Beyer, Artificial Intelligence and Machine Learning in Industry, O’Reilly Media, Inc,
2017
Ian Goodfellow and Yoshua Bengio and Aaron Courville, Deep Learning, MIT Press, 2016
S Lovelyn Rose, L Ashok Kumar, D Karthika Renuka, Deep Learning Using Python, Wiley
India, 2019
S Sumathi, Suresh Rajappa, L Ashok Kumar, Surekha Paneerselvam, Machine Learning for
Decision Sciences with Case Studies in Python, CRC Press, 2022
Tom Mitchell, Machine Learning, McGraw Hill, 1997
Valliappa Lakshmanan, Martin Görner, Ryan Gillard, Practical Machine Learning for Com-
puter Vision, O’Reilly Media, Inc, 2021
2 Natural Language Processing
LEARNING OUTCOMES
After reading this chapter, you will be able to:
To develop a program that can sum up the most significant news in 100 words or less,
data must be scraped from websites and webpages pertaining to current events.
Consider a Spanish sentence s1 as an example. After being translated into the target
language, in this case English, it becomes sentence s2. The translated sentence s2 is then
translated back into Spanish as s3. Because s1 and s3 turn out to have extremely similar
meanings with only a few minor differences, s3 is added to the dataset.
Entity names can be swapped out with those of other entities to create new data.
For instance, in "I would like to visit Australia," just rename Australia to something
else, like Antarctica.
Pick a few words at random inside a sentence that are not stop words and substitute
synonyms for these terms. Usually, stop words are eliminated from texts before they
are processed for analysis.
2.2.2 Text Cleaning
After data collection, it is important that the data be presented in a way that the com-
puter can understand. This should be taken into account when the model is being
trained, as the text may contain various symbols and words that have no apparent
significance. Eliminating such tokens from the text before feeding it into the model is
an effective text-cleaning step. This technique is also called data cleaning. Some text
cleaning procedures are given next.
When gathering the data, you may need to scrape through different web pages.
Beautiful Soup and Scrapy offer a variety of tools for parsing web pages, so that the
gathered content is free of HTML tags.
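A minimal sketch (not from the original text): stripping HTML tags from scraped content with Beautiful Soup; the HTML string is an invented example.

# A minimal sketch (not from the original text): stripping HTML tags from
# scraped content with Beautiful Soup.
from bs4 import BeautifulSoup

html = "<html><body><h1>Breaking news</h1><p>Markets rallied <b>today</b>.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text(separator=" ", strip=True))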
When cleaning the data, you may come across other Unicode characters, such as symbols,
emoji, and other graphic characters. Unicode normalization is used to parse such
non-textual symbols and special characters. In order to store the text in a computer, it
must be transformed into some binary representation; this procedure is called text
encoding. The code snippet given next shows how a simple word containing an emoji
is converted into its encoded format. Table 2.1 lists emojis with their respective
Unicode code points.
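A minimal sketch (assumed, not the book's original snippet): inspecting the Unicode escape of an emoji-containing word and normalizing it with the standard unicodedata module.

# A minimal sketch (assumed, not the book's original snippet): encoding and
# normalizing a word that contains an emoji.
import unicodedata

text = "great 😃"
print(text.encode("unicode_escape"))        # b'great \\U0001f603'
print(unicodedata.normalize("NFKC", text))  # NFKC-normalized form of the string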
The information may contain mistakes due to hurried typing, the usage of shorthand,
or slang seen on social media sites like Twitter. These data must be treated before
being fed to the model, because using them as-is will not improve the model's
ability to forecast. There is no completely reliable way to repair this, but we can still
make good efforts to lessen the problem. For spell checking in Python, Microsoft
has released a REST API.
TABLE 2.1
Unicode Normalization for Emojis
Emoji   Unicode
☺       U+263A
😃      U+1F603
♥       U+2665
♠       U+2660
✓       U+2713
Stop words must be removed during the pre-processing stage because they are
used in another activity and will likely cause the situation of missing certain com-
mon words. In NLP software, the following typical pre-processing steps are used:
2.3.1 Noise Removal
Unwanted information can occasionally be found in the source of text data. For
instance, text data that has been extracted from a webpage will have HTML tags.
Additionally, it must remove punctuation, spaces, numerals, special characters, etc.,
from the text data. Pre-processing includes the crucial and domain-specific step of
removing noise from text data. Depending on the particular NLP assignment you’re
working on, the notion of noise will vary. Punctuation might be considered noise in
some situations, but crucial information in others. For instance, you could want to
eliminate punctuation while estimating a Topic Model. However, punctuation must
always be left in the data while training a transformer.
2.3.2 Stemming
The technique of stemming reduces inflected words to their root form. In this instance,
the "root" can simply be a canonical version of the original word and not a true root
word. By chopping off the ends of words, this rudimentary heuristic technique attempts
to convert words to their root form. Since the endings are simply chopped off, the terms
"Connect", "Connected", and "Connects" are all reduced to "Connect"; refer to Table 2.2
for a better understanding. Various stemming algorithms exist. The Porter method is the
most often used algorithm and is well known to be empirically successful for English.
A Porter stemming demonstration is shown in Table 2.2. Stemming is helpful for
standardizing terminology and addressing concerns with sparsity, for example to reveal
documents that mention "Machine Learning classes" in addition to those that mention
"Machine Learning class." To find the most pertinent documents, every word variation
should be matched. In comparison to better constructed features and text enrichment
techniques like word embeddings, stemming provides only a slight improvement in
classification accuracy.

TABLE 2.2
Porter Stemming Demonstration
Original Word    Stemmed Word
connect          connect
connected        connect
connects         connect
connections      connect
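A short sketch (assumed, not from the text) reproducing the words of Table 2.2 with NLTK's Porter stemmer:

# A short sketch (assumed, not from the text): Porter stemming with NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connect", "connected", "connects", "connections"]:
    print(word, "->", stemmer.stem(word))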
2.3.3 Tokenization
Text must be converted into a collection of fundamental components, referred to as
Tokens, in order for a machine to simulate natural language. Tokenization is the pro-
cess of dividing text data into smaller or more fundamental components. Tokenizers
are programmers that separate text into tokens, which are small, distinct entities. There
are no formal guidelines regarding how to tokenize a string; however, it is quite typical
for a tokenizer to employ spaces to tokenize a string (e.g., separate it into individual
words). Some tokenizers divide words into their constituent parts rather than just con-
centrating on white space. After tokenizing a document or group of documents, you
can determine the set of distinctive tokens that frequently appear in your data. The
term “vocabulary” refers to this collection of distinctive tokens. As a result, the content
of your vocabulary is directly impacted by your tokenizer. Vocabulary will contain
individual words if tokens are single words; however, if tokens are sub-word strings or
characters, vocabulary will be made up of those elements rather than individual words.
Once vocabulary has been established, it can map each distinct token to a distinct inte-
ger value. In NLP, words are changed into numbers in this manner.
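A short sketch (assumed, not from the text): whitespace tokenization, building a vocabulary of unique tokens, and mapping each token to an integer, as described above.

# A short sketch (assumed, not from the text): whitespace tokenization,
# vocabulary building, and token-to-integer mapping.
corpus = ["deep learning for nlp", "nlp for speech and vision"]

tokens = [sentence.split() for sentence in corpus]          # split on white space
vocabulary = sorted({tok for sent in tokens for tok in sent})
token_to_id = {tok: idx for idx, tok in enumerate(vocabulary)}

print(token_to_id)
print([[token_to_id[tok] for tok in sent] for sent in tokens])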
2.3.4 Lemmatization
Similar to stemming, lemmatization aims to remove a word's inflections and map it to
its base form. The sole distinction is that lemmatization attempts to carry out the
process correctly: it actually changes words to their true roots rather than simply cutting
them off. It may make use of WordNet-style dictionaries or other specialized rule-based
approaches for the mappings; for example, "better" would be mapped to "good."
When it comes to search and text classification, lemmatization does not seem to
be any better than stemming. Depending on the technique used, it can be signifi-
cantly slower than a very simple stemmer and may need knowledge of the
word's part of speech in order to generate the proper lemma. The accuracy of text
categorization using neural architectures is unaffected by lemmatization, so the addi-
tional expense may or may not be appropriate. An example of lemmatized terms
formed from original words is shown in Table 2.3.
TABLE 2.3
Lemmatization of Original Words
Original Word    Lemmatized Word
trouble          trouble
troubling        trouble
troubled         trouble
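A short sketch (assumed, not from the text): WordNet lemmatization with NLTK, where the part-of-speech argument changes the result.

# A short sketch (assumed, not from the text): WordNet lemmatization with
# NLTK; the pos argument supplies the word's part of speech.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("troubling", pos="v"))  # trouble
print(lemmatizer.lemmatize("better", pos="a"))     # good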
Stop word lists can be made specifically for domains or drawn from pre-existing sets.
Some libraries (like Sklearn) allow you to delete words that appeared in X% of your
texts, which may also have the effect of reducing word usage.
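A short sketch (assumed, not from the text): removing English stop words with NLTK's pre-built list.

# A short sketch (assumed, not from the text): stop word removal with NLTK.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

tokens = "this is an example of stop word removal".split()
print([tok for tok in tokens if tok not in stop_words])
# ['example', 'stop', 'word', 'removal']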
Any ML pipeline must include the stage of feature engineering. The raw data is
transformed into a machine-consumable format by feature engineering processes.
In the traditional ML pipeline, these transformation functions are often created and
tailored to the task at hand. Consider the task of emotion classification for e-com-
merce product reviews. Counting the positive and negative words in each review is
one technique to translate the reviews into meaningful “numbers” that might assist
forecast their moods (positive or negative). If a feature is helpful for a task or not, it
can be determined statistically. One benefit of using handcrafted features is that the
model can still be understood because it is feasible to measure precisely how much
each feature affects the model prediction.
TABLE 2.4
Part of Speech with Tags
Part of Speech (POS) Tag
Noun n
Verb v
Adjective a
Adverb r
• DL pipeline
In this case, after pre-processing, the raw data is given straight to a model in the DL
pipeline. The model has the ability to “learn” characteristics from the data. As a
result, these features are better suited to the job at hand and generally produce better
performance. However, the model becomes less interpretable because all of these
features are learned through model parameters.
2.5 MODELING
The next step is to figure out how to develop a practical solution from the NLP pipeline.
The development of the model will depend on the chosen techniques and on heuristic
guidance, since little information is present in the text itself. To raise the performance of
the model as more data is added, the approaches listed next can be followed.
When there are numerous heuristics whose individual behavior is deterministic but whose
combined effect on prediction is unclear, it is best to utilize these heuristics as features to
train your ML model. In the email spam classification example, the ML model may be
enhanced by including features like the number of blacklisted words in a particular email
or the email bounce rate.
If a heuristic has a very high prediction rate for a specific class, it is advantageous to
apply it first, before feeding data into a machine learning model. For instance, if
particular terms in an email indicate a 99% likelihood of spam, it is preferable to
categorize the email as spam directly rather than sending it to an ML model.
• Transfer learning
A model that has been trained for one task is repurposed for a different, related task
using the machine learning technique known as transfer learning. As an illustration,
BERT can be fine-tuned on an email dataset for the classification of email spam.
• Reapplying heuristics
At the conclusion of the modeling pipeline, it is feasible to go back and look at these
examples again to identify any patterns in the defects and utilize heuristics to fix
them. The model predictions can be improved by using domain-specific knowledge
that is not automatically captured in the data.
2.6 EVALUATION
An NLP model must be evaluated for its performance after it is built. The evaluation
metrics depend on the NLP task and are used during both the model building and
deployment stages. These metrics quantify the model's
performance. Some of the evaluation metrics available for NLP tasks are listed here:
• BLEU
• NIST
NIST weights each matched n-gram in accordance with its information gain in addi-
tion to BLEU (Entropy or Gini Index). Over the set of references, the information
gain for an n-gram composed of the letters w1, . . ., wn is determined. It was built on
top of the BLEU metric with a few changes. While BLEU analyzes n-gram accuracy
by giving each one the same weight, NIST additionally assesses how informative a
certain n-gram is. It is intended to give more credit when a paired n-gram is unusual
and less credit when it is common in order to decrease the chance of manipulating
the measure by producing useless n-grams.
• METEOR
The drawbacks of BLEU include the fact that it only supports exact n-gram matching
and ignores recall. To fix these issues, METEOR (Metric for Evaluation of
Translation with Explicit Ordering) was developed. It uses relaxed matching criteria and is
based on the F-measure. METEOR counts a matched unigram even if it is only
equivalent to a unigram in the reference and does not have an exact surface-level match.
• ROUGE
• CIDEr
an image’s reference captions will frequently contain n-grams related to the image.
Each n-gram in a sentence is given a weight by CIDEr using TF-IDF on the basis
of how frequently it appears in the corpus and the reference set for that specific cir-
cumstance (term-frequency and inverse-document-frequency). But since they are less
likely to be instructive or pertinent, n-grams that often exist throughout the dataset
(e.g., in the reference captions of different photos) are given a lower weight using the
inverse-document-frequency (IDF) term.
• SPICE
• BERT
The use of BERT to obtain word embeddings demonstrates that contextual embeddings
combined with a simple averaged recall-based metric give comparable outcomes. The
greatest cosine similarity between the embeddings of any reference token and any
token in the hypothesis is used to calculate the BERT score.
• MOVERscore
2.7 DEPLOYMENT
Deployment is one of the stages in an NLP pipeline's post-modeling phase. Once
we are satisfied with the model's performance, it is ready to be deployed in production,
where the NLP module is connected to the incoming data stream and its output is made
usable by downstream applications. An NLP module is typically deployed as a web service,
and it is critical that the deployed module be scalable under high loads.
2.9.2 Word Embeddings
In NLP, text processing is a technique used to tidy up text and get it ready for model
creation. Text is flexible and contains noise in many different forms, including emoticons,
punctuation, and text written in specific character or number formats. Starting with text
processing, several Python modules make this process simpler and offer a lot of versa-
tility thanks to their clear, simple syntax. The first is NLTK, the Natural Language
Toolkit, which is helpful for a variety of tasks like lemmatizing, tokenizing, and POS
tagging. Hardly a sentence goes by without a contraction: we frequently write "didn't"
instead of "did not." When these words are tokenized, they take on the form "didn't,"
which has nothing to do with the original word. The contractions library deals with
such words.
BeautifulSoup is a package used for web scraping; scraped data often contains HTML
tags and URLs, which BeautifulSoup is used to handle. Additionally, we
are utilizing the inflect library to translate numbers into words.
Word embedding is a technique used in NLP to represent words for text analysis.
This representation frequently takes the form of a real-valued vector that encodes
the definition of the word, presuming that words that are close to one another in
the vector space would have related meanings. Using word embeddings, which are
a type of word representation, it is possible to show words with similar meanings.
They are a dispersed representations for the text and may be one of the major tech-
nological advancements that allows deep learning algorithms to excel at difficult
NLP challenges. Each word is represented as a real-valued vector in a predetermined
vector space during a process known as word embedding. The method is sometimes
interpreted as "deep learning" since each word is assigned a unique vector,
and the vector values are learned in the manner of a neural network. The distributed repre-
sentation is learned via word usage. This enables words that are frequently used to
naturally have representations that accurately convey their meaning.
Neural network embeddings serve three main objectives:
TABLE 2.5
Bag of Words Calculation
        amazing  an  anatomy  best  great  greys  is  series  so  the  TV
Doc 0      1      1     1      0      0      1     1     1     0    0    1
Doc 1      0      0     1      1      0      1     1     1     0    1    1
Doc 2      0      0     1      0      1      1     1     0     1    0    0
2.9.3 Bag of Words
Natural language processing employs the text modelling technique known as "bag
of words." Formally, it is a method of feature extraction from text data.
This approach makes it simple and flexible to extract features from documents.
A bag of words is a representation of word occurrence in a document; any information
about the placement or organization of the words within the document is discarded.
Table 2.5 illustrates the bag-of-words approach with an example calculation.
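A minimal sketch (not from the original text): building a bag-of-words matrix like Table 2.5 with scikit-learn's CountVectorizer; the three example sentences are assumptions reconstructed to be consistent with the table's columns.

# A minimal sketch (not from the original text): a bag-of-words matrix with
# scikit-learn's CountVectorizer. The sentences are invented examples.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Greys Anatomy is an amazing TV series",
        "Greys Anatomy is the best TV series",
        "Greys Anatomy is so great"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(counts.toarray())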
2.9.4 TF-IDF
Utilizing the statistical method TF-IDF (term frequency-inverse document fre-
quency), anybody can determine how relevant a word is to each document in a group
of documents. To accomplish this, the frequency of a word within a document and its
inverse document frequency across a collection of documents are multiplied. It has
a wide range of uses, with automated text analysis being the most important, includ-
ing word scoring in machine learning algorithms for NLP. TF-IDF was created for
document search and information retrieval. It works by escalating in accordance with
how frequently a term appears in a document but is balanced by how many papers
contain the word.
TF-IDF is calculated for each word in a document by multiplying two separate
metrics:
• The number of times a word appears in a text. The simplest way to calculate
this frequency is to simply count the instances of each word in the text.
Other ways to change the frequency include the document’s length or the
frequency of the term that appears most frequently.
• The inverse document frequency of the word across a group of documents.
This relates to how frequently or infrequently a word appears across all
documents. The closer the value is to 0, the more common the word is. This
metric can be calculated by dividing the total number of documents by the
number of documents containing the word and then computing the
logarithm.
• This value will therefore be close to 0 if the word is widely used and appears
in numerous papers. If not, it will go close to 1.
The result of multiplying these two figures is the word TF-IDF score in a document.
The more relevant a word is in a given document, the higher the score.
The following formula is used to determine the TF-IDF score for the word t in a
document d from a document set D:
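The formula itself does not appear in the source; the standard TF-IDF definition, which matches the preceding description, is:

\[ \operatorname{tfidf}(t, d, D) = \operatorname{tf}(t, d) \cdot \operatorname{idf}(t, D), \qquad \operatorname{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \]

where tf(t, d) is the number of times t appears in d and N is the total number of documents in D.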
2.9.5 N-gram
N-grams are continuous word, symbol, or token combinations in a document. They
are the adjacent groups of items in a document. N-grams are relevant when perform-
ing NLP activities on text data. A list of the different categories for n-grams is shown
in Table 2.6, where n is the positive integer value that includes the total number of
n-grams, and in the term it describes the different categories for n-grams.
Examples for n-gram representation in text:
• “Candy”—Unigram (1-gram)
• “Candy Crush”—Bigram (2-gram)
• “Candy Crush Saga”—Trigram (3-gram)
The examples given here demonstrate the different n-gram types seen in typical text.
The number of grams is determined by counting the sequences of words present, where
each contiguous run of n words is treated as a separate gram.
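A minimal sketch (not from the original text): extracting unigrams, bigrams, and trigrams from a sentence.

# A minimal sketch (not from the original text): extracting n-grams.
def ngrams(tokens, n):
    # Return the list of contiguous n-grams from a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Candy Crush Saga is fun".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams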
2.9.6 Word2Vec
The Word2Vec (W2V) technique extracts a vector representation of each word
from a text corpus. In order to represent the distribution of words in a corpus C,
the Word2Vec model mixes many models. The Word2Vec approach uses a neural
TABLE 2.6
N-gram Categories
n Term
1 Unigram
2 Bigram
3 Trigram
n n-gram
Natural Language Processing 43
TABLE 2.7
Word2Vec Representation
            I   like  enjoy  machine  learning  vector  studying  .
I           0    2     1       0         0        0        0      0
like        2    0     1       0         1        0        0      0
enjoy       1    0     0       0         0        0        1      0
machine     0    1     0       0         1        0        0      0
learning    0    0     0       1         0        0        0      1
vector      0    1     0       0         0        0        0      1
studying    0    0     1       0         0        0        0      1
.           0    0     0       0         1        1        1      0
network model to learn word associations from a vast corpus of text. Once trained,
a model of this kind can identify terms that are synonyms or can recommend new
words to complete a sentence. Word2Vec, as the name suggests, uses a vector of
specified numbers to represent each unique word. The vectors are selected with care
so that a straightforward mathematical function (cosine similarity between vectors)
may be used to determine how semantically similar the words represented by each
vector are to one another. Table 2.7 gives a straightforward illustration showing how
words are represented as vectors.
2.9.7 Glove
The use of embeddings rather than other text representation approaches like one-hot
vector encodings, TF-IDF, and bag-of-words has produced numerous impressive
outcomes on deep neural networks for problems like neural machine translation.
Additionally, some word-embedding techniques, such as GloVe and Word2Vec, may
eventually achieve performance levels comparable to those of neural networks. GloVe
is an abbreviation for Global Vectors in word representation. It is an unsupervised
learning system created by Stanford University that tries to create word embeddings
by combining global word co-occurrence matrices from a corpus. The main princi-
ple of GloVe word embeddings is to utilize statistics to determine the relationship
between the words. The co-occurrence matrix, as opposed to the occurrence matrix,
tells how frequently a particular word pair appears with another. In the co-occur-
rence matrix, each value reflects a pair of words that frequently appear together.
2.9.8 ELMo
The numerous aspects of word use, including as syntax and semantics, as well
as the ways in which these uses vary across linguistic contexts, are modeled
by the deep contextualized word representation known as ELMo (i.e., to model
polysemy). These word vectors are learned functions of the internal states of a deep
bidirectional language model that has been pre-trained on a substantial text corpus
(biLM). They greatly enhance the state-of-the-art for many challenging NLP prob-
lems, including sentiment analysis, question answering, and textual entailment,
and are easy to include into already-existing models. ELMo shows the context
in which each word is spoken throughout the entire dialog. Since ELMo is char-
acter-based, the network may infer the vocabulary tokens from training by using
morphological cues.
Examples for ELMo are as follows:
1. I like to watch movies in my free time.
2. My watch has stopped working.
Here, the word “watch” is used as a verb in the first phrase and as a noun in the
second. These words are referred to as polysemous words since their context varies
between sentences. This type of word nature can be handled by ELMo with greater
success than GLOVE or FastText.
• Based on its probability, select a random bigram (<s>, w). Then choose a random
bigram (w, x) based on its probability, and so on until we reach the end marker.
Then join the words together. Here, the markers <s> and </s>
denote the beginning and end of a sentence, respectively.
2.10.2 Smoothing
In order for all plausible word sequences to occur with some probability, a language
model's inferred probability distribution must be flattened (or smoothed). This fre-
quently entails broadening the distribution by shifting weight from regions of high
probability to regions of zero probability. Smoothing not only prevents zero probabilities
but also improves the model's overall accuracy. Maximum likelihood estimation (MLE)
is employed to estimate a language model's parameters from training data. Because
unseen test sets are likely to contain words and n-grams whose estimated probability is 0,
we cannot evaluate pure MLE models on unseen test data: relative frequency estimation
assigns all probability mass to events seen in the training corpus.
Examples for smoothing are:
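The source's worked examples are not reproduced here; a standard illustration (an assumption added for clarity) is add-one (Laplace) smoothing of bigram probabilities, where C denotes counts and V is the vocabulary size:

\[ P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V} \]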
\[ \text{Similarity} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \]
The orientation of two vectors is the same if the cosine similarity score is 1. The closer
the value is to 0, the less similar the two documents are. The cosine similarity metric is
often preferable to Euclidean distance because two texts that are far apart by Euclidean
distance can still be similar to one another in terms of context.
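A minimal sketch (not from the original text): cosine similarity between two term-count vectors, following the formula above; the vectors are invented examples.

# A minimal sketch (not from the original text): cosine similarity between
# two term-count vectors.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc_a = np.array([1, 1, 0, 2, 0])  # hypothetical term counts
doc_b = np.array([1, 0, 1, 1, 0])
print(round(cosine_similarity(doc_a, doc_b), 3))  # 0.707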
2.12 SUMMARY
The theory of NLP, the NLP pipeline, and various text pre-processing methods, includ-
ing noise removal, stemming, tokenization, lemmatization, stop word removal, and
parts-of-speech tagging, are all covered in detail with examples in this chapter. The
concepts of n-gram language models were also explained, and vector semantics, lexical
semantics, cosine similarity, and bias were covered in detail.
BIBLIOGRAPHY
Farha, I. A., & Magdy, W. (2021, April). Benchmarking transformer-based language models
for Arabic sentiment and sarcasm detection. In Proceedings of the sixth Arabic natural
language processing workshop (pp. 21–31). Toronto: Association for Computational
Linguistics.
Feldman, J., Lakoff, G., Bailey, D., Narayanan, S., Regier, T., & Stolcke, A. (1996). L 0—The
first five years of an automated language acquisition project. In Integration of natural
language and vision processing (pp. 205–231). Dordrecht: Springer.
Lewis, P., Ott, M., Du, J., & Stoyanov, V. (2020). Pretrained language models for biomedical
and clinical tasks: Understanding and extending the state-of-the-art. In Proceedings of
the 3rd clinical natural language processing workshop, Online (pp. 146–157). Toronto:
Association for Computational Linguistics, Anna Rumshisky, Kirk Roberts, Steven
Bethard, Tristan Naumann (Editors).
Liddy, E. D. (2001). Natural language processing. In Encyclopedia of library and information
science, 2nd ed. New York: Marcel Decker, Inc.
Maulud, D. H., Zeebaree, S. R., Jacksi, K., Sadeeq, M. A. M., & Sharif, K. H. (2021). State of
art for semantic analysis of natural language processing. Qubahan Academic Journal,
1(2), 21–28.
Raina, V., & Krishnamurthy, S. (2022). Natural language processing. In Building an effective
data science practice (pp. 63–73). Berkeley, CA: Apress.
Smelyakov, K., Karachevtsev, D., Kulemza, D., Samoilenko, Y., Patlan, O., & Chupryna, A.
(2020). Effectiveness of preprocessing algorithms for natural language processing
applications. In 2020 IEEE international conference on problems of infocommunications.
Science and technology (PIC S&T) (pp. 187–191). New York: IEEE.
Spilker, J., Klaner, M., & Görz, G. (2000). Processing self-corrections in a speech-to-speech
system, edited by Wolfgang Wahlster. Verbmobil: Foundations of Speech-to-Speech
Translation.
Sun, F., Belatreche, A., Coleman, S., McGinnity, T. M., & Li, Y. (2014, March). Pre-process-
ing online financial text for sentiment classification: A natural language processing
approach. In 2014 IEEE conference on computational intelligence for financial engi-
neering & economics (CIFEr) (pp. 122–129). New York: IEEE.
Wasim, M., Asim, M. N., Ghani, M. U., Rehman, Z. U., Rho, S., & Mehmood, I. (2019).
Lexical paraphrasing and pseudo relevance feedback for biomedical document retrieval.
Multimedia Tools and Applications, 78(21), 29681–29712.
3 State-of-the-Art Natural
Language Processing
LEARNING OUTCOMES
After reading this chapter, you will be able to:
3.1 INTRODUCTION
It is a huge challenge for the computers to understand information as the way we do,
but advances in technology are helping in bridging the gap. Technologies like ASR,
NLP, and CV are helpful in transforming in a more useful way than ever before. The
ability of computers to understand and interpret the human languages was envis-
aged by Alan Turing in 1950 as a hallmark of computational intelligence. Many
commercial applications of today make use of NLP models for their downstream
modules. A wide variety of applications such as search engine and chatbots utilize
AI-based NLP in their back end. If you observe, most of the NLP problems are
aligned toward sequential data. In general, human thoughts have persistence. For
example, you comprehend each word in this chapter on the basis of how you com-
prehended the words preceding it. A conventional convolution neural network is not
suitable for such complex time series data as it accepts predetermined input vector
such as image and produces predetermined output vector such as class labels. To
overcome the shortcomings of CNN, sequential models such as RNN comes into the
picture. A recurrent neural network (RNN) is a kind of brain network that is utilized
to tackle consecutive information like text, sound, and video. RNN is able to store
information about the previous data using memory, and they are great learners of
sequential data. One of the major disadvantages of RNN is that it is unable to capture
contextual information when the input sequence is larger. To tackle the long-term
sequential dependency, Long Short Term Memory (LSTM) is introduced, and it is
evident that it is suitable for many real-world NLP problems. Ian Goodfellow came
up with an idea of attention mechanism which is yet another cutting-edge model
which handles sequential NLP data in a more contextual way by applying its focus
to input sequential data.
This chapter explains the various sequential models such as RNN, LSTM, atten-
tion and transformer-based models for NLP applications. These models are widely
used for various NLP use cases as follows:
• Machine translation
Machine translation is used to automatically translate text in different languages
without the assistance of human linguists.
• Sentiment analysis
Sentiment analysis, often known as opinion mining, is a technique used in natural lan-
guage processing (NLP) to determine the emotional context of a document.
• Chatbot
Rather than offering direct contact with a genuine human specialist, a chatbot is
a product program that mechanizes communications and is utilized to lead online
discussions through text or text-to-discourse.
• Question-answering system
Building systems that respond automatically to questions asked by people in natural
language is the focus of question-answering system.
• Name-entity recognition
Recognizing and sorting named elements referred to in unstructured text into pre-
laid out classes like individual names, associations, areas, and so forth is the goal of
the name-entity recognition.
• Predictive text
Through word suggestions, predictive text is an input method that makes it easier for
users to type on mobile devices.
3.2.1 Sequence
The word sequence means a continuous flow of data with respect to time. In general, a sequence consists of multiple data points that depend on each other in a complicated way. For instance, a sequence could be a sentence, a medical EEG signal, or a speech waveform, as given in Figure 3.1.
A typical example is part-of-speech tagging: in Figure 3.2, the input text sequence is mapped to output part-of-speech tags using neural network models.
“I had a wonderful time in France and learned some of the _______ language.”
To model sequences, we need to handle variable-length inputs, track long-term dependencies, maintain information about order, and share parameters across the sequence. The RNN hidden state update captures these requirements:
$$h_t = f_W(h_{t-1}, x_t) \tag{3.1}$$
Here W represents the recurrent weight matrix applied to the previous hidden state, U the input weight matrix applied to x_t, h_t the hidden state at time step t, and h_{t-1} the hidden state at time step t-1. The activation function tanh introduces nonlinearity; a plain RNN, however, still suffers from the vanishing gradient problem on long sequences, which motivates the gated variants discussed later.
To store the information from previous time step, it has an in-built memory.
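As a minimal NumPy sketch (toy dimensions and random weights, not a trained model), the recurrence in Eq. (3.1) can be written as:

Python code (illustrative):
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    # h_t = tanh(W . h_{t-1} + U . x_t + b), as in Eq. (3.1)
    return np.tanh(W @ h_prev + U @ x_t + b)

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
W = rng.normal(size=(hidden_size, hidden_size))   # recurrent (hidden-to-hidden) weights
U = rng.normal(size=(hidden_size, input_size))    # input-to-hidden weights
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                         # initial hidden state
sequence = rng.normal(size=(5, input_size))       # five time steps of toy input
for x_t in sequence:
    h = rnn_step(x_t, h, W, U, b)                 # the hidden state carries past information
print(h)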
3.3.1 Unrolling RNN
A graphical representation of an unrolled RNN is shown in Figure 3.6. Initially, at time step 0, x0 is given as input to the RNN cell to produce output y0. As propagation proceeds, the next RNN cell receives the hidden state from time step 0. The RNN works in such a way that the output of the current time step depends on both the current input and the results of the previous time step. The parameters U and V are shared across the RNN layer.
The different topologies of RNN are shown in Figure 3.7.
• One to one: Similar to a CNN, it accepts a fixed-size input vector and converts it into one discrete output class label; an example is pattern recognition.
• One to many: One fixed-size input vector is converted into a sequence of outputs; an example is image captioning.
• Many to one: Sentiment analysis is an example of many to one, which takes sequential text as input and produces a single output label.
• Many to many: Machine translation is a good example of many-to-many mapping.
The two major variants of RNN are gated recurrent unit (GRU) and LSTM.
3.3.3 Challenges in RNN
An RNN struggles to handle sequences of varying length. For instance, in machine translation from English to Tamil, as shown in Figure 3.10, the length of the input sequence is three whereas that of the output sequence is two; a plain RNN is not well suited to such varied-length mappings.
The attention mechanism is inspired by the way humans focus on only one thought at a time and extract the important information from noisy data, giving stronger attention to some parts of the data than to others. The attention mechanism is widely accepted for machine translation applications, as given in Figure 3.11.
1. Tokenize the input words [w1, w2, …, wn] and generate their word-embedding vectors.
2. The input vectors are multiplied with weight matrices w_q, w_k, and w_v to produce query [q1, q2, …, qn], key [k1, k2, …, kn], and value [v1, v2, …, vn] vectors.
3. The inner product of the query and key vectors produces the attention scores, or ranks, [s1, s2, …, sn]; a higher score implies more similar words, a lower score less similar words.
4. The score values are then multiplied with the value vectors and summed to produce the context vector:

$$\widetilde{cv}_n = (s_1 \cdot v_1) + (s_2 \cdot v_2) + \dots + (s_n \cdot v_n)$$
Multiple sets of query, key, and value weight matrices are used to produce multiple representations of the query, key, and value vectors, as shown in Figure 3.16. The resulting heads are concatenated and projected:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^{O}, \quad \text{where } \text{head}_i = \text{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
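As a minimal NumPy sketch of the computation described above (toy embeddings and random weight matrices assumed; a real transformer runs several such heads in parallel and concatenates them):

Python code (illustrative):
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Project the input embeddings into query, key, and value vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query with every key
    weights = softmax(scores, axis=-1)   # attention scores, one row per token
    return weights @ V                   # weighted sum of values = context vectors

rng = np.random.default_rng(1)
n_tokens, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n_tokens, d_model))                  # toy word embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
context = attention(X, Wq, Wk, Wv)
print(context.shape)   # (4, 8): one context vector per input token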
3.4.3 Bahdanau Attention
The Bahdanau architecture uses an encoder-decoder structure with a bidirectional recurrent neural network (Bi-RNN), which reads the input sentence in the forward direction to produce the forward hidden state $\overrightarrow{h_i}$ and then moves in the backward direction to produce the backward hidden state $\overleftarrow{h_i}$. The two are concatenated:

$$h_i = \left[\overrightarrow{h_i}^{\,T} : \overleftarrow{h_i}^{\,T}\right]^{T}$$
The context vector is the output of the encoder and is fed into the decoder archi-
tecture. It makes use of additive attention as shown in Figure 3.17 to produce the
context vector.
FIGURE 3.14 Attention Computation with Key, Query, and Value Vectors.
3.4.4 Luong Attention
1. The encoder produces the hidden states H = {h_i}, i = 1, …, T, from the input sentence.
2. The current decoder hidden state is derived as s_t = RNN_decoder(s_{t-1}, y_{t-1}), where s_{t-1} is the previous decoder hidden state and y_{t-1} the previous decoder output.
Figures 3.19 and 3.20 show local attention working with window sizes 1 and 2, respectively. Figure 3.21 shows the working behind random attention.
3.4.6 Hierarchical Attention
Let us understand hierarchical attention using applications such as chatbots and document classification. Nowadays, one of the most widely used NLP-based applications is the chatbot. Figure 3.22 shows a typical chatbot dialog. The dialog is a sequence of sequences: it consists of utterances between the user and the bot, and each utterance is in turn a sequence of words. The hierarchical attention network (HAN) is a well-suited model for such sequence-of-sequences problems. HAN consists of two levels of attention layers, as shown in the architecture in Figure 3.23: first we attend to the most important and informative words in a sentence, and then to the most informative sentences in the document.
The components of HAN are
• Word Encoder
• Word Attention
• Sentence Encoder
• Sentence Attention
To understand what a language model is, let us consider the example shown in Figure 3.24. The probability of the phrase "TensorFlow is open source" is greater than the probability of the phrase "source TensorFlow is open," given some training corpus.
A masked language model is a little different: rather than estimating the probability of an entire phrase, the model is trained by filling in blanks. Masked language models are useful because they provide a way of producing contextual word embeddings. BERT is extremely large: the larger version has 340 million trainable parameters, compared to the word-embedding model ELMo, which has only 93 million.
The original transformer architecture is shown in Figure 3.25. The model uses a conventional sequence-to-sequence setup: an encoder takes the input and transforms it into embeddings, and a decoder takes those embeddings and transforms them into an output string.
Unlike models that read text in only one direction, BERT is bidirectional. Take a look at Figure 3.26 for an example of this: the term "bank" appears in both statements in this example.
• Bidirectional: This model reads text from both directions (left and right) to
gain a better understanding of text.
• Encoder: The encoder-decoder model is used in NLP, where the input is fed to the encoder and the output is taken from the decoder.
• Representation: Encoder Decoder architecture is represented using
transformers.
• Transformers: A key part of the transformer is the multi-head attention block. The transformer combines attention, normalization, and masked attention in the decoder stage.
The BERT architecture is a little different: it stacks multiple encoders on top of one another. BERT Base and BERT Large are the two variants of BERT, as shown in Figure 3.27. The [CLS] token at the front is used to represent the classification of a particular input, while [SEP] tokens are used at the end of each input sequence.
The BERT fine-tuning approach is to add an extra classifier on top of the original model and update the weights of the original BERT model, as given in Figure 3.28. There are several architectures proposed by extending the BERT model, such as:
• RoBERTa
• DistilBERT
• AlBERT
• CamemBERT (French)
• AraBERT (Arabic)
• Mbert (multilingual)
• As it is a pre-trained model, we can use it for smaller, task-specific downstream applications without having to retrain the extremely large and costly BERT model from scratch.
• With effective fine-tuning, we can achieve excellent accuracy; many cutting-edge frameworks incorporate BERT in some form (a usage sketch follows this list).
• Finally, pre-trained models are available in more than 100 languages.
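As a hedged illustration (not the authors' own recipe), assuming the Hugging Face transformers library and PyTorch are installed, a pre-trained BERT model can be loaded and given a fresh classification head for fine-tuning:

Python code (illustrative):
# Assumes: pip install transformers torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Adds a new classification head on top of the pre-trained encoder;
# only a small amount of task data is then needed to fine-tune it.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("TensorFlow is open source", return_tensors="pt")
outputs = model(**inputs)        # logits from the (not yet fine-tuned) classifier
print(outputs.logits.shape)      # torch.Size([1, 2])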
3.5.2 GPT3
OpenAI developed a very large neural-network-based language model called the generative pre-trained transformer (GPT-3) for NLP-related tasks, as shown in Figure 3.29. Based on the input text, it makes probabilistic predictions about the following tokens from a known vocabulary. In 2018, the original GPT was launched with 117 million parameters. Later, in 2019, GPT-2 was released with 1.5 billion parameters. GPT-3, with 175 billion parameters, is one of the world's biggest neural network models as of 2021. The GPT-3 model was trained on a very large dataset, as shown in Table 3.1.
It is capable of performing a wide range of downstream language tasks.
TABLE 3.1
Dataset Used in GPT-3 Training

Dataset | # Tokens | Content
Common Crawl | 410 billion | 8+ years of raw web page data, metadata extracts, and text extracts with light filtering
WebText2 | 19 billion | All incoming Reddit links from posts with three or more upvotes
Books1 | 12 billion |
Books2 | 55 billion |
Wikipedia | 3 billion | English-language Wikipedia pages
Table 3.2 explains the widely used language models with their number of trainable
parameters.
TABLE 3.2
List of Language Models

Language Model | Released By | Number of Trainable Parameters
BERT Large | Google | 340 million
GPT | OpenAI | 117 million
GPT-2 | OpenAI | 1.5 billion
GPT-3 | OpenAI | 175 billion
T5 | Google | 220 million
Turing NLG | Microsoft | 17 billion
3.6 SUMMARY
This chapter explored various models for sequential data analysis, such as RNNs and attention-based models, with detailed examples. The variants of RNN were discussed, and language modeling and transformer-based models used in NLP applications were introduced. Models like BERT and GPT-3 were also discussed in detail. The next chapter focuses on various applications of NLP that can be implemented using the models and techniques described in this chapter.
BIBLIOGRAPHY
Bahdanau, D., Cho, K. H., & Bengio, Y. (2015). Neural machine translation by jointly learn-
ing to align and translate. Paper presented at 3rd International Conference on Learning
Representations. San Diego, CA: ICLR.
Bengio, Y., Goodfellow, I., & Courville, A. (2017). Deep learning, vol. 1. Cambridge, MA:
The MIT press.
Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., & Bengio, Y. (2015). Advances
in neural information processing systems, vol. 28. New York: Curran Associates, Inc (A
Recurrent Latent Variable Model for Sequential Data).
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). Proceedings of the 2019 conference
of the North American chapter of the association for computational linguistics: Human
language technologies, vol. 1 (Long and Short Papers, pp. 4171–4186). Minneapolis:
Association for Computational Linguistics.
Dou, Y., Forbes, M., Koncel-Kedziorski, R., Smith, N., & Choi, Y. (2022). Is GPT-3 text
indistinguishable from human text? Scarecrow: A framework for scrutinizing machine
text. Proceedings of the 60th annual meeting of the association for computational lin-
guistics, vol. 1 (Long Papers, pp. 7250–7274). Dublin: Association for Computational
Linguistics.
Floridi, L., & Chiriatti, M. (2020). GPT-3: Its nature, scope, limits, and consequences. Minds &
Machines 30, 681–694.
Graves, A. (2012). Supervised sequence labelling. In: Supervised sequence labelling with
recurrent neural networks (Studies in Computational Intelligence), vol. 385. Berlin and
Heidelberg: Springer.
Graves, A., Mohamed, A. R., & Hinton, G. (2013). Speech recognition with deep recurrent
neural networks. In: IEEE international conference on acoustics. Speech and signal pro-
cessing (ICASSP) (pp. 6645–6649). Vancouver, Canada: IEEE.
Han, Z., Ian, G., Dimitris, M., & Augustus, O. (2019). Proceedings of the 36th international
conference on machine learning (pp. 7354–7363). New York: PMLR.
Sneha, C., Mithal, V., Polatkan, G., & Ramanath, R. (2021). An attentive survey of attention
models. ACM Transactions on Intelligent Systems and Technology (TIST) 12(5), 1–32.
TensorFlow. Recurrent neural networks. TensorFlow. www.tensorflow.org/tutorials/recurrent.
Vaswani, A., et al. (2017). Attention is all you need, advances in neural information processing
systems. Kolkata: NIPS.
Zichao, Y., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical attention
networks for document classification. In: Proceedings of the 2016 conference of the
North American chapter of the association for computational linguistics: Human lan-
guage technologies (pp. 1480–1489). San Diego, CA: Association for Computational
Linguistics.
4 Applications of Natural
Language Processing
LEARNING OUTCOMES
After reading this chapter, you will be able to:
4.1 INTRODUCTION
The applications of Natural Language Processing are extensively used in many fields in
day-to-day life. The search engines use NLP to actively understand the user requests and
provide the solutions in a faster manner. The smart assistants like Alexa and Siri rely on
NLP. NLP is used in business by providing facts and figures about year-on-year growth and production, and it evaluates text sources ranging from emails to social media and many more. NLP text analytics transforms unstructured text and communication into usable data for analysis, utilizing a variety of linguistic, statistical, and machine learning techniques. NLP and AI tools can automatically comprehend, interpret, and classify unstructured text using text classification. Data is arranged on the basis of corresponding tags and categories using Natural Language Processing algorithms. Text extraction, often implemented as named entity recognition, uses NLP to automatically detect particular named entities within text. Text summarization uses NLP to efficiently process the input and obtain the most important information. It is widely used in the
efficiently process the input and obtain very important information. It is widely used in
educational sector, research, or healthcare environments. NLP can provide a paraphrased
summary of a text by focusing on specific key phrases within the text or by determin-
ing meanings and conclusions. Market intelligence uses NLP for separating subjects,
sentiment, keywords, and intent in unstructured data from any type of text or consumer
communication. The feature of intent classification enables businesses to more precisely
determine the text’s intention through their emails, social media posts, and other commu-
nication, and it can help customer care teams and sales teams.
Financial firms can use sentiment analysis to examine more market research and
data and then use the knowledge gained to streamline risk management and make
better investment decisions. Banks and other security organizations can use NLP to
spot instances of money laundering or other frauds. It is possible to use natural lan-
guage processing to assist insurance companies in spotting fraudulent claims. AI can
detect signs of fraud and flag such claims for additional examination by analyzing customer communications and even social media profiles.
Insurance companies use natural language processing to monitor the highly com-
petitive insurance market environment. It can better understand what their rivals
are doing by utilizing text mining and market intelligence tools, and they can plan
what products to launch to keep up with or outpace their rivals. NLP analyzes shipment and manufacturing data and provides information about supply-chain shortages to improve automation in the manufacturing pipeline. With the help of this information, companies can improve specific steps in the procedure or adjust the logistics to increase efficiency. Sentiment analysis is especially valuable for retailers. Retail businesses can improve the success of their activities, from product release to marketing, by measuring customer sentiment regarding their
brands or items. NLP makes use of social media comments, customer reviews, and
other sources to transform this information into useful information that merchants
can utilize to develop their brand and address their problems. The potential uses of
natural language processing in the healthcare industry are enormous, and they are
only just beginning. It is now assisting scientists working to combat the COVID-19
pandemic in a number of ways, including by examining incoming emails and live
chat data from patient help lines to identify those who may be exhibiting COVID-19
symptoms. This has made it possible for doctors to proactively prioritize patients and
expedite patient admission to hospitals.
This chapter discusses the different types of applications involving Natural Language
Processing that is extensively used, and they are word sense disambiguation (WSD),
word sense induction, text classification, sentiment analysis, spam email classification,
question answering, information retrieval, entity linking, chatbots, and dialog system.
4.2.1 Word Senses
A sense, sometimes known as a word sense, is a distinct representation of one part
of a word’s meaning. When working on tasks that include meaning, understanding
the relationship between two senses might be crucial. For example, think about the
antonymy relationship. When two words, such as long and short or up and down, have opposing meanings, they are said to be antonyms. It is crucial to distinguish between them, since it would be bad if a person asked a dialog agent to start the music and it did the opposite. However, antonyms can actually be confused with each other in embedding models like Word2Vec, since an antonym is often among the words in the embedding space that are most similar to a given term.
WSD is used to identify the sense in which a word is being used in a given context. One of the first problems that natural language processing faces is ambiguity, which may be syntactic or semantic. Accurate part-of-speech tagging resolves a word's syntactic ambiguity; to resolve semantic ambiguity, word sense disambiguation is used. The dictionary and the test corpus are the two inputs used to evaluate WSD. In Figure 4.1, WSD is derived from non-predefined senses and predefined senses; the predefined senses are acquired from knowledge and corpus.
• Dictionary based
Dictionary-based methods use the dictionary as the primary knowledge source for WSD; no corpus is needed to resolve the disambiguation. The Lesk algorithm, developed by Michael Lesk in 1986, is a classic dictionary-based method for removing word ambiguity (see the sketch below).
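As a minimal, illustrative sketch (assuming NLTK with its WordNet and punkt resources is installed; the sentence is made up), the simplified Lesk implementation shipped with NLTK can be applied as follows:

Python code (illustrative):
# Assumes: pip install nltk; nltk.download('wordnet'); nltk.download('punkt')
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sentence = "I went to the bank to deposit my money"
sense = lesk(word_tokenize(sentence), "bank")
# Prints the WordNet synset chosen by Lesk and its dictionary gloss.
print(sense, "-", sense.definition())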
• Supervised methods
• Semisupervised methods
Many word sense disambiguation algorithms are semi-supervised, using both labeled and unlabeled data. This approach needs only a small amount of annotated text and a large amount of plain unannotated text.
• Unsupervised methods
In unsupervised methods, similar senses occur in similar contexts, so senses can be induced by clustering word occurrences using a measure of context similarity. Because they do not rely on manual annotation efforts, unsupervised approaches have great potential for overcoming the knowledge acquisition bottleneck.
• Text mining
Text mining is the process of analyzing large amounts of unstructured text data, using software to identify insights, patterns, keywords, and other properties in the data. An alternate name for text mining is text analytics. Text mining is frequently used by data scientists and others when developing big data and deep learning applications. Chatbots and virtual agents can be seen nowadays on most webpages; these applications acquire and analyze text. Word sense disambiguation is widely used within text mining to perform relevant analysis of the text by identifying the correct sense of each word.
Social media is a popular platform for consumers to share their thoughts and expe-
riences with goods and services. Text classification is frequently used to distinguish
between tweets that require a response from brands (those that are actionable) and
those that do not require any response.
• E-commerce
On e-commerce sites like Amazon and eBay, customers post reviews for a variety of
products. To comprehend and analyze customers’ perceptions of a product or service
on the basis of their remarks is an example of how text categorization is used in this
type of scenario. This practice is referred to as “sentiment analysis.” It is widely used
by brands all over the world to determine if they are drawing nearer to or further
away from their consumers. Over time, sentiment analysis has developed into a more
complex paradigm known as “aspect”-based sentiment analysis, which classifies
consumer input as “aspects” rather than merely positive, negative, or neutral.
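As a hedged, minimal sketch of such review classification (the tiny review set and labels below are made up for illustration; a real system would be trained on thousands of labelled reviews), a simple sentiment classifier can be built with scikit-learn:

Python code (illustrative):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["great product, works perfectly",
           "terrible quality, broke after a day",
           "absolutely love it",
           "waste of money, very disappointed"]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF features followed by a linear classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(reviews, labels)
print(classifier.predict(["the product is good value for money"]))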
4.3.3 Other Applications
• Text classification is used to determine the language of new tweets or postings. For instance, Google Translate includes a tool for automatically identifying languages.
• Another common application of text categorization is in the identification of
unknown authors of works from a pool of writers. It is utilized in a variety
of disciplines—from forensic analysis to literary studies.
• In the recent past, text classification was used to prioritize submissions in
an online discussion board for mental health services. Annual competitions
for the solution of such text categorization issues deriving from clinical
research are held in the NLP community (e.g., clpsych.org).
• Text categorization has recently been used to separate bogus news from
actual news.
Polarity plays an important role in graded sentiment analysis. The polarity can be expanded into different levels of positive and negative, with ratings given, for example, as 5 (very positive) down to 1 (very negative) on the basis of customer feedback.
• Emotion detection
This approach of reading a text’s emotions is trickier. Machine learning and lexicons
are used to determine the sentiment. Lexicons are collections of positive or negative
word lists. This makes it easier to classify the terms according to their usage. The
advantage of doing this is that a company can understand why a consumer feels the way
they do. This is more algorithm-based and may be challenging to comprehend at first.
In aspect-based sentiment analysis, the specific characteristics that people comment on are discovered and the sentiment toward each is labeled positive, negative, or neutral. This form of sentiment analysis often examines just one aspect of a service or product. To understand how customers feel about specific product attributes, a company that sells televisions, for example, may use this type of sentiment analysis for a specific feature of televisions, such as brightness or sound.
• Intent analysis
This analysis is purely based on the customer's intention. For example, the company can predict whether a particular customer will purchase a particular product by tracking the customer's intent, producing a pattern that is then used for marketing and advertising.
• Medical field
• Financial services
• Online courses
• Work from home jobs
• Online games
• Online gambling
• Cryptocurrencies
4.5.1 History of Spam
Although spam may be a contemporary issue, it has a lengthy history. Gary Thuerk,
a worker at the now-defunct Digital Equipment Corp. (DEC), wrote the first spam
email in 1978 to advertise a new product. There were 2,600 users with email accounts on the Advanced Research Projects Agency Network, and the unwanted email was sent to around 400 of them. According to some accounts, DEC's new sales increased by around $12 million as a result.
However, the term "spam" didn't come into use until 1993, on Usenet, a newsgroup system that combines elements of both email and a web forum. A bug in new moderation software caused more than 200 messages to be posted automatically to a discussion group, and the event was mockingly referred to as spam.
In 1994, Usenet was also a target of the first significant spam attack. Spam
accounted for 80–85% of all emails sent globally in 2003. The United States passed the Controlling the Assault of Non-Solicited Pornography and Marketing (CAN-SPAM) Act of 2003 as a result of the problem becoming so pervasive. The most
crucial law that legal email marketers must abide by to avoid being branded as spam-
mers is still CAN-SPAM.
The volume of spam sent on a daily average decreased from 316.39 billion to
roughly 122 billion between mid-2020 and early 2021. However, spam still makes up around 85% of all emails, costing reputable companies billions of dollars annually.
4.5.3 Types of Spams
• Malware messages: Some spam emails contain malware, which can fool users into disclosing personal information, making payments, or taking some other action.
• Frauds and scams: Users get emails with offers that promise rewards in
exchange for a small deposit or advance charge. Once they have paid, the
fraudsters will either create new charges or cease communication.
• Antivirus warnings: These notifications “warn” a user of a virus infection
and provide a “fix” for it. The hacker can access the user’s system if they fall
for the trick and click on a link in the email. A malicious file could also be
downloaded to the device via the email.
• Sweepstakes’ winners: Spammers send emails with the false claim that the
receiver has won a contest or prize. The email’s link must be clicked by the
receiver in order to claim the prize. The malicious link usually seeks to steal
the user’s personal data.
4.6.1.2 Retriever
The task of a retriever is to locate pertinent documents for the user’s inquiry. It tries
to extract the pertinent terms from the question first. Then it uses these to find perti-
nent materials. Several Natural Language Processing (NLP) approaches are utilized
to transform a user’s inquiry into a form that a retriever can understand. These con-
sist of:
• Removing punctuations
When finding pertinent materials, full stops, commas, and other punctuations are
redundant. As a result, they are eliminated from the user’s query.
• Removing stop words
Stop words are frequently used words that don't significantly change the meaning of the text; examples are articles such as the, a, and an. As a result, these words are eliminated.
• Tagging entities
Entities that are directly related to the query are typically things like products or
names. As a result, they are included in the query.
• Stemming
Words can take on several guises or conjugations (walk, walked, walking, etc.). Such terms are stripped down to their most basic form before being put into the query, because they may well appear in many forms within a document (a minimal sketch of these preprocessing steps follows this list).
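A minimal sketch of the retriever's query preprocessing described above, assuming NLTK with its punkt and stopwords resources (the example query is made up):

Python code (illustrative):
# Assumes: pip install nltk; nltk.download('punkt'); nltk.download('stopwords')
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess_query(query):
    tokens = word_tokenize(query.lower())
    # Remove punctuation and stop words, then stem each remaining term.
    tokens = [t for t in tokens if t not in string.punctuation]
    tokens = [t for t in tokens if t not in stopwords.words("english")]
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

print(preprocess_query("Which laptops were released by the company in 2023?"))
# e.g. ['laptop', 'releas', 'compani', '2023']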
4.6.1.3 Reader
The responsibility of the reader is to extract an answer from the documents they
receive. They make an effort to comprehend the question and the documents by using
a suitable language model, and then they mine the texts for the best possible response.
The reader typically uses transformers or attention-based models, which are also used to develop complete question-answering systems.
• Question processing
In IR-based factoid question answering, the first step is question processing. Initially, the keywords are extracted; next, the entity type or answer type is determined; then the focus (the word to be replaced by the answer), the question type, and the relations are extracted.
• Query formulation
Query formulation is the process of passing the question to a web search engine. For question answering over smaller collections, such as corporate information pages, additional processing such as query expansion and query reformulation is performed.
• Passage retrieval
One of the basic methods of passage retrieval is to send every paragraph to the answer extraction stage. A more advanced technique is to filter the passages after running named-entity or answer-type classification on them. The passages can then be ranked with the help of supervised learning, using features such as the number of named entities of the right type in the passage, the number of question keywords, and the n-gram overlap between the passage and the question.
• Answer processing
Answer processing is the final step in question answering: the relevant answer is extracted from the retrieved passage.
4.6.3 Entity Linking
Entity linking is the process of assigning a distinct identity to entities, such as locations, mentioned in text. Entity linking can support question answering and information integration. The steps involved in entity linking are as follows:
• Recognize
Recognize the items referenced in the text. In this module, the entity linking mechanism filters out irrelevant entities from the knowledge base: for every entity mention m, it returns a candidate entity set Em consisting of relevant entities.
• Rank
Assign a score to each candidate. The size of the candidate entity set Em is usually greater than 1. To rank the potential entities in Em, researchers use many types of evidence, trying to locate the entity in Em that is the most plausible link for the mention m.
• Link
In the knowledge graph, connect the recognized entities to the categorized entities.
• Parse the natural language question into an uninstantiated logic form (e.g.,
a SPARQL query template), which is a syntactic representation of the ques-
tion free of entities and relations.
• The logic form is subsequently instantiated and validated by using KB
grounding to perform various semantic alignments to structured KBs
(obtaining, e.g., an executable SPARQL query).
• To create expected replies, the parsed logic form is run against KBs.
• Information retrieval-based approaches.
The information retrieval-based approach works under the principle of retrieval and
rank. It follows the steps given here:
• The system first extracts a question-specific graph from KBs, ideally com-
prising all question-related entities and relations as nodes and edges, start-
ing with the topic entity.
• The system then converts the input questions into vectors that convey
reasoning instructions.
• Turns: Turns are separate contributions to the dialog. Turns will be in the
form of a single word (i.e., in a shorter sentence) or with multiple words (in
a longer sentence). It is very important to understand the structure of the turn
for a spoken dialog system. The system has to decide when to start talking
and stop talking. The system needs to detect as soon as the user finishes speaking, which is called endpoint detection.
• Speech acts: Speech acts refer to the actions performed by the speaker in uttering something. Speech acts are otherwise called dialog acts.
• Constatives: Constatives make a statement, for example, answering a question. Some constatives are answering, claiming, and confirming.
• Directives: A directive is an act in which you try to get your conversational
partner to do something. Directives include advising someone, asking them to do something or to provide some information, forbidding, inviting, ordering them to do some work, and politely requesting them to do something.
• Commissives: Commissives are where commitments are made—a kind
of a future action. Commissives can be making plans, explicit promising,
vowing to do something, betting or showing explicit opposition.
• Acknowledgments: Acknowledgments provide a useful function by
expressing a speaker’s attitude toward some sort of social action. Examples
of acknowledgments include apologizing, greeting, expression of gratitude,
and acceptance.
• Grounding: Grounding is establishing the common ground between two parties
in the conversation. It is to acknowledge that the speaker has been heard or
understood. This is usually done by saying okay at the beginning of the turn, repeating parts of what the other speaker said, or using other implicit signals.
• Initiative: The initiatives are conversational controls. The speaker asking
questions has a conversational initiative. In every dialog, most interactions
are mixed initiative. Initiative is a sense of control in the conversation. Even
though most human–human conversations are mixed initiative, it is very
difficult for the dialog systems to achieve mixed initiative conversations.
• Structure: Conversations have structure. Questions set up an expectation for
an answer, and proposals set up an expectation for an acceptance or rejection.
• Adjacency pairs: Paired dialog acts, such as a question followed by an answer, form adjacency pairs.
• Inference: Providing conclusions based on more information than is present
in the uttered word.
• Implicature: The act of implying meaning beyond what is directly
communicated.
4.7.2 Chatbots
NLP enables your chatbot to evaluate and generate text in human language; it is the AI component that aids your chatbot in analyzing and comprehending the natural human language exchanged with your clients. Chatbots may
grasp the conversation’s intent rather than just using the data to communicate and
reply to questions. In the area of automation and AI, there are various acronyms that
are important to know in order to understand how your virtual agent or NLP chatbot
operates. They are NLU, natural language generation (NLG), and natural language
interaction (NLI).
• Menu-button-based chatbots
The most fundamental form of chatbots now used on the market is menu/but-
ton-based ones. These chatbots are typically based on the decision tree hierarchies
that appear to the user as buttons. These chatbots require the user to make a number of selections in order to drill deeper and reach the final answer, much like automated phone menus.
• Linguistic-based chatbots
These chatbots are used to predict the type of questions the customer may ask. It cre-
ates the conversational automation logic. As the first step, language conditions need
to be defined clearly. The conditions assess the words, the order of the words, and the context of the words.
• Keyword-recognition-based chatbots
This chatbot listens to the customer input, which is typed by the customer; it recognizes keywords in the input and replies accordingly. This type of chatbot is developed using artificial intelligence concepts.
• Tokenizing: The chatbot begins by breaking up text into small chunks (also
known as “tokens”) and deleting punctuation marks.
• Normalizing: The bot then removes irrelevant information and changes
words to their “regular” form, such as by making everything lowercase.
• Recognizing entities: Now that all of the words have been normalized, the
chatbot tries to figure out what kind of thing is being discussed.
• Dependency: The bot then determines the function of each word in the
sentence, such as noun, verb, adjective, or object.
• Generation: Finally, the chatbot develops a number of responses on the basis of the data gathered in the previous steps and chooses the most appropriate one to send to the user (a minimal sketch of these steps follows).
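A minimal sketch of the tokenizing, normalizing, entity recognition, and dependency steps above (assuming spaCy and its small English model are installed; the example utterance is made up):

Python code (illustrative):
# Assumes: pip install spacy; python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Book me a flight from Chennai to Delhi tomorrow.")

# Tokenizing and normalizing: each token with its lowercase lemma.
print([(token.text, token.lemma_.lower()) for token in doc])
# Recognizing entities: spans the bot can treat as slot values.
print([(ent.text, ent.label_) for ent in doc.ents])
# Dependency: the grammatical function of each word in the sentence.
print([(token.text, token.pos_, token.dep_) for token in doc])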
An audio input can come from a phone or any other device, and the output of the
ASR will be a continuous string of words. The ASR component is likewise sensitive to the dialog state. For instance, suppose the computer has just asked the user which state they are leaving from: in that situation the ASR model should assign state names a high probability, and the language model is trained to accomplish this.
The speech synthesis component is the key component for producing spoken output. The dialog-state architecture also has a natural language understanding component for retrieving the slot fillers from the user's input, with the help of machine-learning or hand-written rules.
The conversation state tracker used in the architecture retrieves the user's present state of the frame and the user's most recent dialog act. The dialog state encompasses the current sentence's slot fillers and also the complete state of the frame; that is, it collects all of the user's constraints.
The dialog policy's purpose is to determine which action or dialog act to generate. In more technical terms, it predicts which action Ai should be taken at turn i of the conversation, based on the overall dialog state, where the state refers to the complete sequence of dialog acts between the user (U) and the system (A).
As soon as the dialog act has been decided, the response to the user's query needs to be generated in the form of text. In the information-state architecture, NLG is divided into two stages: content planning (what to say) and sentence realization (how to say it). It is assumed that the content planning is done by the dialog policy.
• Text to speech
After the Natural Language Generation step, the text is converted into speech.
4.8 SUMMARY
Natural Language Processing has been advancing steadily in recent years and is used in many organizations and industrial applications. This chapter gave an overview of recent and traditional applications of Natural Language Processing. It explained popular applications such as WSD, which is mainly used in text mining and information extraction tasks, and word sense induction, which is used in web search clustering. Text classification is used in social media, customer experience, marketing, and many more fields. Sentiment analysis is mainly used in customer service management and for analyzing customer feedback. Question answering systems are used, for example, in IBM Watson. Spam email classification is primarily used to classify email as spam or not spam. Information retrieval is used in digital libraries and blog search. Chatbots, dialog systems, and the properties of human conversation were also covered. These are current research trends that researchers can explore using state-of-the-art learning methods, providing good venues and problem statements for research in NLP.
BIBLIOGRAPHY
Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language
Processing, Cambridge: The MIT Press, 2018.
Daniel Jurafsky and James H Martin, Speech and Language Processing: An introduction to
Natural Language Processing, Computational Linguistics and Speech Recognition.
Hoboken, NJ: Prentice Hall, 2014.
Ela Kumar, Natural Language Processing. New Delhi: IK International Pvt Ltd, 2011.
James Allen, Natural Language Understanding. Benjamin: Cummings Publishing Company,
2003.
Li Deng and Yang Liu, Deep Learning in Natural Language Processing. Berlin: Springer,
2018.
Madeleine Bates and Ralph M. Weischedel. Challenges in Natural Language Processing.
Cambridge: Cambridge University Press, 2006.
Nitin Indurkhya and Fred J. Damerau, Handbook of Natural Language Processing Machine
Learning & Pattern Recognition Series. London: Chapman & Hall/CRC, Taylor and
Francis Group, 2010.
Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python—
Analyzing Text with the Natural Language Toolkit. London: O’Reilly, 2012.
Tanveer Siddiqui, U.S. Tiwary, Natural Language Processing and Information Retrieval.
Oxford: Oxford University Press, 2008.
Yoav Goldberg, Neural Network Methods for Natural Language Processing. London: Synthe-
sis Lectures on Human Language Technologies, 2017.
5 Fundamentals of
Speech Recognition
LEARNING OUTCOMES
After reading this chapter, you will be able to:
5.1 INTRODUCTION
In today’s computer and mobile era, speech has become a more important channel for
human–machine connection. Human–computer interaction has been evolving since
the dawn of computer engineering. In today’s world, it is not uncommon for this
engagement to take place through speech. Several software packages now incorpo-
rate cutting-edge speech technology to perform a variety of tasks. A detailed study
of human speech perception is required for these systems to be of practical utility,
i.e., to perform in a human-like manner. A compact and meaningful representation
of speech input, which eliminates the influence of inconsequential components such as background noise, is also a key factor in improving the system's performance. Speech input has recently started to change the way people interact with one another. This method of communication is very useful in several applications, such as assistive technology for disabled people, navigation in autonomous vehicles, and multimedia search. ASR aims at making the computer understand human speech and respond: given the speech signal, ASR technology derives the transcribed utterances. The fundamental differences between speech recognition performed by people and by computers using automatic speech recognition are given here:
5.3.1 Pitch
The vocal cords vibrate when a voiced sound is produced, which in turn generates glottal pulses. Pitch is the fundamental frequency of the glottal pulse; it identifies a specific tone and distinguishes between different sounds. Pitch can be analyzed in the time or frequency domain, and the zero-crossing method can be used to estimate it.
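As an illustrative sketch (assuming librosa is installed and 'demo.wav' is a placeholder audio file), the zero-crossing rate and a pitch estimate can be computed as follows:

Python code (illustrative):
import librosa

x, sr = librosa.load("demo.wav")
zcr = librosa.feature.zero_crossing_rate(y=x)   # frame-wise zero-crossing rate
print(zcr.mean())

# librosa also provides a probabilistic fundamental-frequency (pitch) estimator.
f0, voiced_flag, voiced_prob = librosa.pyin(y=x,
                                            fmin=librosa.note_to_hz('C2'),
                                            fmax=librosa.note_to_hz('C7'))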
5.3.2 Timbral Features
Timbral features differentiate speech signals that have the same pitch and loudness; they represent sound quality. The harmonic content of an audio signal determines its timbre, as do the vibrato and tremolo present in speech. Vibrato is a dynamic characteristic of speech that increases its richness.
5.3.3 Rhythmic Features
The rhythmic elements of an audio signal, which are of two types, rhythmical structure and beat strength, determine the regularity of the signal. A histogram is used to represent the beat strength.
5.3.4 MPEG-7 Features
The Moving Pictures Expert Group (MPEG) has established an international stan-
dard for the classification of audio/speech signals. It defines the following audio fea-
ture standards.
• Audio spectrum centroid (ASC): The ASC describes the center of gravity of the log-frequency power spectrum and is defined as

$$\mathrm{ASC}_r = \frac{\sum_{k=1}^{N/2} \log_2\!\left(\dfrac{f[k]}{1000}\right) S_r[k]}{\sum_{k=1}^{N/2} S_r[k]} \tag{5.1}$$
ASC can be used to determine the low or high frequencies of the power spectrum.
• Audio spectrum spread (ASS): ASS is a spectral distribution at the centroid
and is described by the formula as given here. It is also used to identify the
difference between noise and speech.
$$\mathrm{ASS}_r = \frac{\sum_{k=1}^{N/2} \left[\log_2\!\left(\dfrac{f[k]}{1000}\right) - \mathrm{ASC}_r\right]^{2} S_r[k]}{\sum_{k=1}^{N/2} S_r[k]} \tag{5.2}$$
• Audio spectrum flatness (ASF): ASF describes the deviation of the spectral form from a flat spectrum. A flat spectrum indicates noise-like or impulse-like signals.
• Harmonic ratio (HR): The harmonic ratio is the greatest value of autocorre-
lation inside the frame.
5.4.1 Pronunciations
Pronunciation drastically impacts the performance of an ASR system. A pronunciation is the sound of a word as it is fed into the speech engine.
5.4.2 Vocabulary
Vocabularies are dictionaries that hold the list of words/utterances used by the ASR system. Smaller vocabularies are quite simple for a system to recognize, whereas large vocabularies are difficult to manage. The size or volume of the vocabulary has a fundamental effect on the accuracy of the speech recognition system. Vocabularies are commonly classified by size based on the number of words they contain.
5.4.3 Grammars
The domain/context within which the ASR system works is defined by the grammar.
The speech recognition engine utilizes a set of predefined rules, say grammar, to
define the words or phrases.
5.5.1 Input Speech
The analog signal captured using a microphone is digitized using a sound card. A spectrogram is a time-based depiction of a voice signal: the horizontal axis of a spectrogram shows time, and the vertical axis depicts the frequency or power of the spoken input stream. It is a widely used representation of the voice signal when building speech-to-text models. Figure 5.3 depicts a time-frequency representation of a spoken signal.
5.5.3 Feature Extraction
Feature extraction is utilized to extract a set of acoustic features in speech signal.
Such features can be computed by processing the speech signal waveform. The
two variants of features are prosodic feature and spectral features which can be
extracted from an input speech signal. Prosodic features are aspects of the signal that deal with the auditory qualities of sound, while spectral features are frequency-based features. Commonly used spectral representations include:
• Spectrogram
• Mel Frequency Cepstral Coefficients (MFCC)
• Short-Time Fourier Transform
5.6.1 Spectrogram
A spectrogram is a graphical illustration of the amplitude of a sound that graphs the signal's constituent frequencies versus time or some other variable. Spectrograms are essentially two-dimensional graphs, with color representing a third dimension. In a spectrogram, time is displayed along the horizontal axis from left to right. The vertical axis represents frequency, which can also be thought of as pitch or tone, with the lowest frequency at the bottom and the highest frequency at the top. The workflow for obtaining a spectrogram is depicted in Figure 5.4, and Python code for plotting a spectrogram is given here, followed by the output in Figure 5.5.
Importing libraries:
import glob
import time

import librosa
import librosa.display
import matplotlib.pyplot as plot
import IPython.display as ipdis
%matplotlib inline
Loading an audio file:
for filename in glob.glob('/content/drive/MyDrive/LibriSpeech/dev-clean/1462/*/*.wav'):
    speech_path = filename
    x, sr = librosa.load(speech_path)
Plotting the audio waveform:
plot.figure()
# On librosa >= 0.10, use librosa.display.waveshow(x, sr=sr) instead of waveplot.
librosa.display.waveplot(x, sr=sr)
time.sleep(0.1)
plot.pause(0.0001)
Plotting the audio spectrogram:
X = librosa.stft(x)                      # short-time Fourier transform
Xdb = librosa.amplitude_to_db(abs(X))    # convert amplitude to decibels
print(Xdb)
plot.figure(figsize=(14, 5))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
plot.colorbar()
5.6.2 MFCC
The Mel Frequency Cepstral Coefficients (MFCCs) is a well-known feature extraction
technique in ASR. The functioning of MFCC is similar to the working of the human
ear. It records every tone with a real frequency f (Hz), which corresponds to mel
scale’s subjective pitch. The MFCC gives a discrete cosine change of the energy
sign’s logarithm on a mel recurrence scale. At first, the discourse signal is separated
into casings, and, afterward, Fast Fourier change (FFT) is used to each edge to secure
power range. The Mel scale then, at that point, applies the channel bank to the power
range. In the wake of switching the power range over completely to log space, the
discrete cosine transform is applied to the discourse sign to acquire the MFCC coef-
ficients. Eq. (5.3) is used to resolve the mel function for any recurrence.
$$\mathrm{mel}(f) = 2595 \times \log_{10}\!\left(1 + \frac{f}{700}\right) \tag{5.3}$$

$$C_n = \sum_{k=1}^{K} \left(\log \hat{S}_k\right) \cos\!\left[ n\left(k - \frac{1}{2}\right)\frac{\pi}{K} \right] \tag{5.4}$$
Figure 5.6 is the workflow diagram of MFCC; it shows all the steps used to obtain the MFCC coefficients. For audio signals with background noise, MFCC does not work effectively and so is less suited to robust speech recognition systems. Python code for extracting MFCCs is given here, followed by the output in Figure 5.7.
Loading an audio file (librosa, librosa.display, and glob are imported as in the previous example):
import time
import matplotlib.pyplot as plt

for filename in glob.glob('/content/drive/MyDrive/LibriSpeech/dev-clean/*/*/*.wav'):
    speech_path = filename
    x, sr = librosa.load(speech_path)
    print(type(x), type(sr))
Python code for plotting the audio waveform:
plt.figure()
librosa.display.waveplot(x, sr=sr)
time.sleep(0.1)
plt.pause(0.0001)
MFCC coefficients:
mfccs = librosa.feature.mfcc(x, sr=sr)
print(mfccs.shape)
print(mfccs)
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
Output:
Python code:
%matplotlib inline
import librosa
import librosa.display
import matplotlib.pyplot as plt
import IPython.display as ipd
import sklearn.preprocessing

speech_path = '/content/demo.wav'
x, sr = librosa.load(speech_path)        # load at librosa's default sampling rate
print(type(x), type(sr))
librosa.load(speech_path, sr=44100)      # or load at a specific sampling rate
ipd.Audio(speech_path)                   # listen to the audio in the notebook

# Waveform
plt.figure(figsize=(14, 5))
librosa.display.waveplot(x, sr=sr)

# Spectrogram
X = librosa.stft(x)
Xdb = librosa.amplitude_to_db(abs(X))
plt.figure(figsize=(14, 5))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar()

# Spectral centroid overlaid on the normalized waveform
spectral_centroids = librosa.feature.spectral_centroid(x, sr=sr)[0]
print(spectral_centroids.shape)
frames = range(len(spectral_centroids))
t = librosa.frames_to_time(frames)

def normalize(x, axis=0):
    return sklearn.preprocessing.minmax_scale(x, axis=axis)

librosa.display.waveplot(x, sr=sr, alpha=0.4)
plt.plot(t, normalize(spectral_centroids), color='r')
$$C_m = a_m + \sum_{k=1}^{m-1} \left(\frac{k}{m}\right) c_k\, a_{m-k} \tag{5.5}$$

Here C_m is the cepstral coefficient and a_m is the linear prediction coefficient. Because LPCC features are less susceptible to noise than MFCC features, they give a lower word error rate. Python code for LPCC is as follows:
Importing libraries:
import numpy as np
import scipy.linalg
from scipy.io.wavfile import read
import matplotlib.pyplot as plt
import glob
import soundfile as sf
Functions to find LPCC:
def autocorr(seq, order=None):
    if order is None:
        order = len(seq) - 1
    autocor = []
    for tau in range(order + 1):
        total = 0
        for n in range(len(seq) - tau):
            total += seq[n] * seq[n + tau]
        autocor.append(total)
    return autocor

def lpc(seq, order=None):
    # Slow autocorrelation approach (adequate for order <= 50).
    acseq = np.array(autocorr(seq, order))
    # Use the pseudoinverse to obtain a stable solution of the Toeplitz system.
    a_coef = np.dot(np.linalg.pinv(scipy.linalg.toeplitz(acseq[:-1])), -acseq[1:].T)
    # Squared prediction error: e[n] = a[n] + sum_k a_k * s_{n-k}
    err_term = acseq[0] + sum(a * c for a, c in zip(acseq[1:], a_coef))
    return a_coef.tolist(), np.sqrt(abs(err_term))

def lpcc(seq, err_term, order=None):
    if order is None:
        order = len(seq) - 1
    lpcc_coeffs = [np.log(err_term), -seq[0]]
    for n in range(2, order + 1):
        # Use order + 1 as the upper bound for the last iteration
        # (recursion reconstructed to match Eq. (5.5)).
        upbound = (order + 1 if n > order else n)
        lpcc_coef = -sum(i * lpcc_coeffs[i] * seq[n - i - 1]
                         for i in range(1, upbound)) / upbound
        lpcc_coef -= seq[n - 1] if n < len(seq) else 0
        lpcc_coeffs.append(lpcc_coef)
    return lpcc_coeffs
The discrete wavelet transform (DWT) decomposes a signal using low-pass and high-pass filters derived from a scaling function and a wavelet function:

$$\phi(t) = \sum_{n=0}^{N-1} h[n]\,\sqrt{2}\,\phi(2t - n) \tag{5.6}$$

$$\rho(t) = \sum_{n=0}^{N-1} g[n]\,\sqrt{2}\,\phi(2t - n) \tag{5.7}$$

where h[n] is the low-pass filter's impulse response, g[n] is the high-pass filter's impulse response, $\phi(t)$ is the scaling function, and $\rho(t)$ is the wavelet function, respectively.
The DWT of a continuous signal is given in Eq. (5.8):

$$\mathrm{DWT}(m, p) = \int_{-\infty}^{+\infty} x(t)\,\phi_{m,p}(t)\,dt \tag{5.8}$$

where $\phi_{m,p}$ is the wavelet basis function, m is the dilation parameter, and p is the translation parameter. $\phi_{m,p}$ in Eq. (5.9) is

$$\phi_{m,p}(t) = \frac{1}{\sqrt{a_0^m}}\,\phi\!\left(\frac{t - p b_0 a_0^m}{a_0^m}\right) \tag{5.9}$$

For a discretized signal, the DWT is

$$\mathrm{DWT}(m, k) = \frac{1}{\sqrt{a_0^m}} \sum_{n} x[n]\,g\!\left(\frac{n - k b_0 a_0^m}{a_0^m}\right) \tag{5.10}$$

where g(·) is the mother wavelet and x[n] is the discretized signal.
TABLE 5.1
Comparison of Six Feature Extraction Techniques

Technique | Type of Filter | Shape of Filter | Speed of Computation | Type of Coefficient | Noise Resistance | Reliability
MFCC | Mel | Triangular | High | Cepstral | Medium | High
LPC | Linear prediction | Linear | High | Autocorrelation | High | High
LPCC | Linear prediction | Linear | Medium | Cepstral | High | Medium
LSF | Linear prediction | Linear | Medium | Spectral | High | Medium
DWT | Low pass and high pass | – | High | Wavelets | Medium | Medium
PLP | Bark | Trapezoidal | Medium | Cepstral & autocorrelation | Medium | Medium
Certain features are representative of certain phonemes; modelling this relationship is the job of the acoustic model of the speech recognizer. The sequence of probable phonemes then needs to be transformed into the sequence of words, or even the sentence, that is to be recognized. This is done in the decoding step: in the decoder, the phoneme probabilities are organized in order to form words and sentences. The decoder therefore needs access to a vocabulary, and it needs to know which words can follow each other; a language model or grammar is employed in order to organize the words into a plausible sequence. To extract the features for the ASR process, we first try to separate the excitation signal and the vocal tract modulation from the speech, and then we make use of the hearing characteristics of the human ear in order to process the signal in a way similar to the human ear. The next step is to classify the features into phonemes and then to organize these phonemes into words or sequences of words. The two popular approaches
which are used for this task are Hidden Markov Model and Neural Network. Both
of them are statistical approaches that calculate the probability that a certain feature
vector is related to a certain phoneme. The model thus gives probabilities of phonemes, and these
probabilities of phonemes can then be organized into probabilities of words. The
input to an ASR system is the raw one-dimensional speech signal. The fundamental
unit of ASR is phonemes or phones. Word models can be built by concatenating
phone models. Let x represent an input audio sample and the function f(x) that maps
the sequence of words to the transcripts of the speech signal. A basic speech recog-
nition system includes following units: pre-processing, feature extraction, acoustic
modelling, and language modelling for ASR as shown in Figure 5.15. Acoustic model
converts the speech into their corresponding phonemes. The lexicon or pronuncia-
tion model converts the phones to corresponding words. Language model defines the
most likely sequence of words.
Different speakers pronounce the same word in different ways. The same speech can also sound different depending on the
background of the speaker due to elements like background sound and language.
To establish the relationship between audio frames and phonemes, acoustic models
employ deep-learning algorithms that have been trained on hours of various audio
recordings and relevant texts. Figure 5.16 depicts the working of acoustic model
with an example.
Assuming the acoustic feature vectors of a speech signal X = \{x_1, x_2, \ldots, x_n\}, the
idea of ASR is to identify the word sequence \hat{W} = \{w_1, w_2, \ldots, w_n\}. \hat{W} is defined as

\hat{W} = \arg\max_W P(W \mid X)    (5.11)
where P(W \mid X) is the probability of the word sequence given the acoustic features X.
According to Bayes’ rule, Eq. (5.11) can be rewritten as

\hat{W} = \arg\max_W \frac{P(X \mid W)\, P(W)}{P(X)}    (5.12)
Here P(X \mid W) denotes the acoustic model, and P(W) is the prior probability of the
text sequence, which in the context of an ASR system denotes the language model.
Applying this to ASR, the statistical ASR system is defined as in Eq. (5.13):

P(W \mid X) = \frac{P(X \mid W)\, P(W)}{P(X)}    (5.13)
The acoustic models identify the phoneme sequence, given the feature vector.
\hat{W} = \arg\max_W P(X \mid W)\, P(W)
       = \arg\max_{W \in V^*} \sum_S P(X, S \mid W)\, P(W)
       \approx \arg\max_{W, S} P(X \mid S)\, P(S \mid W)\, P(W)    (5.15)
Eq. (5.15) shows the acoustic and language models, where P(X \mid S) represents the
acoustic likelihood of each phone state and P(S \mid W) represents the prior likelihood of the phone sequence given a word.
5.7.2 Pronunciation Model
Pronunciation lexicon converts the phoneme sequence to its corresponding words as
shown in Table 5.2. By isolating the sound bite with a sliding window, a grouping of
sound edges is produced. Figure 5.17 shows the pronunciation model with an exam-
ple by converting phoneme sequence to words.
The probabilistic chain rule is used in lexicon model as shown in Eq. (5.16), where
W is the words and S is the phoneme sequence.
P(S \mid W) = \prod_{t=1}^{T} P(s_t \mid s_{t-1}, W)    (5.16)
TABLE 5.2
Lexicon or Dictionary

| Phone Sequence | Word |
| l-ay-k | like |
| g-uh-d | good |
| ih-z | is |
| f-ay-v | five |
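As a minimal sketch of the lexicon lookup described above, the Python dictionary below maps the phone sequences of Table 5.2 to words; the helper name phones_to_words is hypothetical and only for illustration.

# Toy pronunciation lexicon based on Table 5.2
lexicon = {
    ("l", "ay", "k"): "like",
    ("g", "uh", "d"): "good",
    ("ih", "z"): "is",
    ("f", "ay", "v"): "five",
}

def phones_to_words(phone_groups):
    # Map each recognized phone group to a word; unknown groups are marked <unk>
    return [lexicon.get(tuple(group), "<unk>") for group in phone_groups]

print(phones_to_words([["f", "ay", "v"], ["ih", "z"], ["g", "uh", "d"]]))
# ['five', 'is', 'good']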
5.7.3 Language Model
In language modelling, various statistical and probabilistic methods are used to
compute the probability that a given series of words will appear in an utterance,
which is useful because speech recognition is inherently noisy. The language model
predicts the likelihood of each word in a phrase on the basis of the output of the
pronunciation model and then converts the words into sentences, which helps to
improve the accuracy of the ASR system. It is represented as a probability distribution
P(W), which reflects how frequently a word string W occurs as a sequence. P(W) can be
represented as
P(W) = P(w_1, w_2, \ldots, w_n)
     = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \ldots w_{n-1})
     = \prod_{i=1}^{n} P(w_i \mid w_1 w_2 \ldots w_{i-1})    (5.17)
• N-gram: Unigram and bigram are the variants of n-gram. For instance, given
bigram of prior words, it will predict the next most likely word.
ASR uses n-gram language models to guide the speech for the correct word sequence.
It predicts the likelihood of the nth word using the previously occurring words.
Commonly used n-gram models are trigrams, where n is 3, represented as
P(w_3 \mid w_1, w_2); the bigram model is represented as P(w_2 \mid w_1). To estimate
P(w_i \mid w_{i-1}), i.e., the probability of the word w_i given w_{i-1}, simply count the
occurrences of the sequence (w_{i-1}, w_i) and then normalize the count by the number
of times w_{i-1} occurs. In the trigram model, the likelihood of a word is determined
by the two words preceding it. For example,
p(w) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2)    (5.22)
The probabilities for the trigram model are computed from the counts of the word
triple c(w_{i-2}, w_{i-1}, w_i) and the word pair c(w_{i-2}, w_{i-1}):

p(w_i \mid w_{i-2} w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}    (5.23)
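The count-and-normalize estimate of Eq. (5.23) can be sketched in a few lines of Python; the toy corpus below and the helper name trigram_prob are purely illustrative.

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()  # toy corpus for illustration

# Collect bigram and trigram counts
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def trigram_prob(w1, w2, w3):
    # Eq. (5.23): c(w1, w2, w3) / c(w1, w2)
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(trigram_prob("the", "cat", "sat"))  # 0.5 on this toy corpus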
The n-gram model makes use of several decoding and rescoring strategies to select the most
probable candidate, such as:
• Beam search
• Greedy search
• Neural rescoring: Neural-network-based language model which utilizes a
deep learning-based approach to identify the most likely sequence.
• Transformer-based BERT: The Hugging Face library provides transformer-based
BERT models, which are extensively used in ASR systems to rescore (grade) the
candidate sentences of the recognized speech.
5.8.1 In Banking
• In banking, voice-activated services could reduce the need for human
client assistance and reduce labor expenses.
5.8.2 In-Car Systems
• Simple voice commands can be used to answer calls, change radio
stations, and play music from an MP3 player, flash drive, or compatible
smartphone. The ability to recognize voice varies by car model and make.
5.8.3 Health Care
• Speech recognition software can be beneficial to people with disabilities.
It is used to automatically generate closed captions for conversations, for
example, meeting-room discussions, school lectures, and religious services,
for people who are deaf or hard of hearing.
• Medical documentation: Speech recognition can be integrated into the
front end or back end of the clinical documentation process in the medical
services industry.
Bing-Voice-Search (BMVS), the Bing mobile voice search application, was used to collect
data for the first successful DBN-DNN and DNN-HMM acoustic models for a large-
vocabulary speech recognition challenge.
The Google voice input speech recognition task transcribes mobile phone user
activities such as short messages, emails, and voice search requests. Given the size
of the vocabulary in question, a language model capable of handling both transcription
and search queries is used. The acoustic models are triphone systems built from
decision trees that use GMMs with varying numbers of Gaussians per acoustic state.
It uses a three-state, left-to-right GMM-HMM with context-dependent cross-word
triphone HMMs.
5.8.5 Measure of Performance
The effectiveness of ASR is evaluated from the perspective of accuracy and latency.
Accuracy is measured using the word error rate (WER), character error rate (CER), and
word recognition rate (WRR), which are calculated using Eqs. (5.19) and (5.20).
Latency is used to measure the performance of streaming ASR systems.
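WER can be computed as the Levenshtein (edit) distance between the reference and hypothesis word sequences divided by the number of reference words; the short Python sketch below is one possible implementation and is not taken from the text.

def wer(reference, hypothesis):
    # Word error rate = (substitutions + deletions + insertions) / reference length
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words = 0.33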
5.9.4 Phonetics
A phoneme, in linguistics, is the fundamental unit of speech that is combined with
other phonemes to form a meaningful word. Phonemes are language dependent,
and an ASR system should have proper phonetic knowledge about the particular
language.
5.10.1 Frameworks
• Sphinx: An ASR toolbox developed by Carnegie Mellon University.
• Kaldi: A free and open-source C++ ASR framework designed for both
academic and commercial speech processing.
• NVIDIA NeMo toolkit: Jasper and QuartzNet pre-trained ASR models are
available on the NVIDIA NGC portal.
5.11 SUMMARY
This chapter covers the speech recognition basics, characteristics of ASR, types of
ASR, and milestones of various developments in ASR. Classification of the audio
signals is based on some audio parameters which are the audio features. This chapter
also narrates the complete framework of speech recognition system which includes
speech capturing, pre-processing, different audio features, various feature-extraction
techniques, and models of speech recognition systems. It highlights the various
applications of speech recognition and the benchmark datasets for speech recognition,
together with exposure to various open-source toolkits for speech recognition.
BIBLIOGRAPHY
Afouras, T., Chung, J.S., Senior, A., Vinyals, O., and Zisserman, A. 2018. Deep audio-visual
speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
44, pp. 8717–8727.
Feng, W., Guan, N., Li, Y., Zhang, X., and Luo, Z. 2017, May. Audio visual speech recognition
with multimodal recurrent neural networks. In 2017 International Joint Conference on
Neural Networks (IJCNN) (pp. 681–688). New York: IEEE.
Kerkeni, L., Serrestou, Y., Mbarki, M., Raoof, K., Mahjoub, M.A., and Cleder, C. 2019. Automatic
speech emotion recognition using machine learning. In Social Media and Machine Learning,
Edited by Alberto Cano. IntechOpen. https://doi.org/10.5772/intechopen.84856.
Makino, T., Liao, H., Assael, Y., Shillingford, B., Garcia, B., Braga, O., and Siohan, O. 2019.
Recurrent neural network transducer for audio-visual speech recognition. In 2019 IEEE
Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 905–912).
New York: IEEE.
Miao, Y., and Metze, F. 2016. Open-domain audio-visual speech recognition: A deep learning
approach. In Interspeech (pp. 3414–3418). New York: IEEE.
Michelsanti, D., Tan, Z.-H., Zhang, S.-H., Xu, Y., Yu, M., Yu, D., and Jensen, J. 2021. An
overview of deep-learning-based audio-visual speech enhancement and separation. In
IEEE/ACM Transactions on Audio, Speech, and Language Processing. New York: IEEE.
Morrone, G., Michelsanti, D., Tan, Z.H., and Jensen, J. 2021. Audio-visual speech inpainting
with deep learning. In ICASSP 2021–2021 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP) (pp. 6653–6657). New York: IEEE.
Mudaliar, N.K., Hegde, K., Ramesh, A., and Patil, V. 2020. Visual speech recognition: A deep
learning approach. In 2020 5th International Conference on Communication and Elec-
tronics Systems (ICCES) (pp. 1218–1221). New York: IEEE.
Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H.G., and Ogata, T. 2015. Audio-visual speech
recognition using deep learning. Applied Intelligence, 42(4), pp. 722–737.
Oneaţă, D., Caranica, A., Stan, A., and Cucu, H. 2021. An evaluation of word-level confidence
estimation for end-to-end automatic speech recognition. In 2021 IEEE Spoken Language
Technology Workshop (SLT), pp. 258–265. New York: IEEE.
Petridis, S., Li, Z., and Pantic, M. 2017. End-to-end visual speech recognition with LSTMs.
In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP) (pp. 2592–2596). New York: IEEE.
Rahmani, M.H., Almasganj, F., and Seyyedsalehi, S.A. 2018. Audio-visual feature fusion via
deep neural networks for automatic speech recognition. Digital Signal Processing, 82,
pp. 54–63.
Sadeghi, M., Leglaive, S., Alameda-Pineda, X., Girin, L,. and Horaud, R. 2020. Audio-visual
speech enhancement using conditional variational auto-encoders. IEEE/ACM Transactions
on Audio, Speech, and Language Processing, vol. 28 (pp. 1788–1800). New York: IEEE.
Thanda, A., and Venkatesan, S.M. 2016. Audio visual speech recognition using deep recurrent
neural networks. In IAPR Workshop on Multimodal Pattern Recognition of Social Sig-
nals in Human–computer Interaction (pp. 98–109). Cham: Springer.
Yang, C.-H.H., Qi, J., Yen-Chi Chen, S., Chen, P.-Y., Marco Siniscalchi, S., Ma, X., and Lee,
C.-H. 2021. Decentralizing feature extraction with quantum convolutional neural network
for automatic speech recognition. In ICASSP 2021–2021 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. 6523–6527. New York: IEEE.
Yu, W., Zeiler, S., and Kolossa, D. 2021. January. Multimodal integration for large-vocabulary
audio-visual speech recognition. In 2020 28th European Signal Processing Conference
(EUSIPCO) (pp. 341–345). New York: IEEE.
Zimmermann, M., Ghazi, M.M., Ekenel, H.K., and Thiran, J.P. 2016. Visual speech recognition
using PCA networks and LSTMs in a tandem GMM-HMM system. In Asian Conference
on Computer Vision (pp. 264–276). Cham: Springer.
6 Deep Learning Models
for Speech Recognition
LEARNING OUTCOMES
After reading this chapter, you will be able to:
• Based on words
• Single Word Recognizer: Single word recognizers are used to transcribe
spoken isolated words. It is very simple, and HMM-based statistical mod-
els are widely accepted for single word recognizers.
• Continuous Word Recognizer: It has a complex structure.
• Based on speaker
The following section deals with the brief discussion of conventional speech recog-
nition models.
The main objective of ASR system is to identify the word sequence Y = y1 , y2 ,.. yn,
given the acoustic data X = x1 , x2 ,.. xn . According to the Bayes rule in Eq. (6.1),
P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}    (6.1)
where
The architecture of HMM for an ASR system is shown in Figure 6.2. HMMs identify
the most probable subword units. By concatenating the subword units using genera-
tive models, they find the actual spoken text.
The types of ASR model which can be trained using HMM are
• Phone-based model: The main idea is to identify the input phonemes (basic
sound units) using HMM as shown in Figure 6.3.
• Word-based model: The model, depicted in Figure 6.4, is built by concatenating
the sequence of phone models.
• Isolated word ASR model: To identify isolated words, several HMMs
are designed and the single best model M is chosen, as shown in Figure 6.5,
on the basis of Bayes’ rule:

P(M \mid X) \propto P(X \mid M)\, P(M)
The widely used HMM topologies are shown in Figure 6.7 with their corresponding
transition matrices:
• Left-to-right model
• Parallel path left-to-right model
• Ergodic model
• Evaluation
• Problem: To compute the probability of an observation sequence, given a model
• Solution: The forward algorithm
• Decoding and alignment
• Problem: To find the most likely state sequence for an observation sequence
• Solution: The Viterbi algorithm (a minimal sketch follows this list)
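The decoding problem above is typically solved with the Viterbi algorithm; the numpy sketch below is a generic illustration with made-up transition, emission, and initial probabilities, not a model taken from the text.

import numpy as np

def viterbi(obs, pi, A, B):
    # obs: observation indices; pi: initial probs; A: transition probs; B: emission probs
    n_states, T = A.shape[0], len(obs)
    delta = np.zeros((T, n_states))           # best path score ending in each state
    psi = np.zeros((T, n_states), dtype=int)  # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
    # Backtrack the most likely state sequence
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.insert(0, int(psi[t, path[0]]))
    return path

# Toy 2-state HMM with 2 observation symbols (all numbers illustrative)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 1, 1], pi, A, B))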
p(x \mid \lambda) = \sum_{i=1}^{M} w_i\, g(x \mid \mu_i, \Sigma_i)    (6.2)
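As an illustration of Eq. (6.2), the sketch below fits a Gaussian mixture to MFCC frames with scikit-learn; the number of mixture components, the diagonal covariance, and the use of a librosa example recording are illustrative assumptions, not the book's setup.

import librosa
from sklearn.mixture import GaussianMixture

# Load an example recording (replace with real training audio)
x, sr = librosa.load(librosa.example("trumpet"))

# MFCC frames: shape (n_frames, n_mfcc)
mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13).T

# Fit a GMM p(x | lambda) = sum_i w_i g(x | mu_i, Sigma_i) with M = 8 components
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(mfcc)

# Average log-likelihood of the frames under the fitted mixture
print(gmm.score(mfcc))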
FIGURE 6.11 Perceptron.
Speech is a continuous signal which can be recognized using neural networks. To
handle the time series and the temporal relationships of the acoustic signal, neural
networks such as recurrent neural networks (RNNs) and long short-term memory
(LSTM) networks have been widely used. Neural-network-based ASR systems give
promising results for continuous speech recognition.
A simple ANN for recognizing phonemes is depicted in Figure 6.13. The input
sequence is X = \{X_1, X_2, \ldots, X_t\}, whose MFCC features are extracted and fed into
the hidden layer of approximately 1,000 hidden units. There are approximately 61
phonemes as the output classes. Given the acoustic feature sequence, the ANN predicts
the probable phoneme sequence.
FIGURE 6.14 HMM and ANN Acoustic Modelling of Phone Recognition Task.
y_j = \text{logistic}(x_j) = \frac{1}{1 + e^{-x_j}}    (6.6)

and

x_j = b_j + \sum_i y_i w_{ij}    (6.7)
where bj is the bias and wij is the weight. For multiclass classification, output unit j
converts its total input xj into a class probability pj, using the softmax nonlinearity as
shown in Eq. (6.8).
p_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)}    (6.8)

The cross-entropy cost between the target values d_j and the outputs p_j is

C = -\sum_j d_j \log p_j    (6.9)
\Delta w_{ij}(t) = \alpha\, \Delta w_{ij}(t-1) - \epsilon\, \frac{\partial C}{\partial w_{ij}(t)}    (6.10)
y_t = f(x_t, h_{t-1})    (6.11)

where the function f maps the input x_t to the output y_t, and h_{t-1} is the memory from
the previous input.
• Gated recurrent unit (GRU): It is composed of two gates and the information
flows unidirectionally:
• Reset gate
• Update gate
• Bi(GRU): The information is carried out bidirectionally through the network
of GRU units.
• Long Short Term Memory (LSTM) unit: It is a widely used neural network
architecture which is composed of three gates as follows:
• Input gate
• Forget gate
• Output gate
• Bi-LSTM: Bidirectional LSTM unit where the information flows in both the
directions.
6.3 ENCODER
The input speech sequence consists of continuous speech frames \{x_1, x_2, \ldots, x_n\}.
The input sequences are then sent to a group of several RNN cells such as LSTM
or GRU units for processing the sequential data, each of which admits one input
sequence element, gathers data for that element, and transmits that data forward. The
hidden states hi where the processing occurs are calculated using Eq. (6.12).
h_t = f\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)    (6.12)
where ht is the hidden layer, W is the weights, and ht−1 is the previous hidden layer.
The values of the weights are updated appropriately by back propagation algo-
rithm to the previous hidden state ht−1 and the input vector xt. The context from the
encoder part which contextually analyzes the input sequence is fed into the decoder.
6.4 DECODER
The decoder is a group of several RNN units, each of which estimates an output y_t at time step t.
Each recurrent unit receives a hidden state from the prior unit and generates both an
output and a hidden state of its own. Each word is denoted by the symbol y_i, where i
denotes the word’s order. The hidden state h_i is computed using Eq. (6.13).
h_t = f\!\left(W^{(hh)} h_{t-1}\right)    (6.13)
Using the hidden state at the current time step and the appropriate weight W, the
output ( yt ) is calculated.
y_t = \text{softmax}\!\left(W^{S} h_t\right)    (6.14)
Using softmax, the probability vector that will enable us to predict the result is
calculated as shown in Eq. (6.14).
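The recurrences of Eqs. (6.12)–(6.14) can be sketched directly in numpy; the dimensions, random weights, and tanh nonlinearity below are illustrative assumptions rather than the book's trained model.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, vocab = 13, 32, 50          # feature, hidden, and output sizes (illustrative)
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_hx = rng.normal(scale=0.1, size=(d_hid, d_in))
W_s = rng.normal(scale=0.1, size=(vocab, d_hid))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Encoder: h_t = f(W_hh h_{t-1} + W_hx x_t), Eq. (6.12)
X = rng.normal(size=(20, d_in))           # 20 acoustic frames
h = np.zeros(d_hid)
for x_t in X:
    h = np.tanh(W_hh @ h + W_hx @ x_t)
context = h                               # context vector passed to the decoder

# Decoder: h_t = f(W_hh h_{t-1}), y_t = softmax(W_s h_t), Eqs. (6.13) and (6.14)
h = context
for _ in range(5):                        # emit five output steps
    h = np.tanh(W_hh @ h)
    y_t = softmax(W_s @ h)
    print(int(np.argmax(y_t)))            # index of the most probable output label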
The performance of the RNN-based encoder–decoder model is evaluated in terms of WER on
the LibriSpeech dataset, as shown in Table 6.1.
The RNN-based encoder model is suitable for clean speech, and it can be further
improved if it is combined with language models like n-gram and BERT. But
when the input speech is longer, the RNN model is not suitable. To overcome
these challenges, attention-based models are used, which are discussed in the following
section.
TABLE 6.1
Performance Analysis of RNN-based Encoder–Decoder Model
LibriSpeech Dataset WER (in %)
Dev-Clean 5.9
Test-Clean 8.5
Test-Other (Noisy) 13.1
samples. Attention units are used in the encoder–decoder modules of ASR system.
The various attention units used are as follows:
\text{Attention}(Q, K, V) = \text{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V    (6.15)

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{Head}_1, \ldots, \text{Head}_n), \quad
\text{Head}_i = \text{Attention}\!\left(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\right)    (6.16)
The output of the encoder, the context vector CV_a, is fed into the decoder block, which is
composed of multi-head attention layers with a feed-forward neural network. The decoder
stack identifies the output label at each time step o_t based on the prior prediction o_{t-1}
as well as the context vector, as given in Eq. (6.18):

o_t = \prod_t p(o_t \mid \text{fused context vector}, o_{t-1})    (6.18)
Finally, the CTC loss function is applied which makes use of a special character
called blank to remove the duplicate characters.
These are a few of the unresolved research problems that need to be solved. And
following this, many researchers and leading tech giants have developed cutting-edge
end-to-end pre-trained ASR architectures which are discussed in Chapter 7.
6.7 SUMMARY
The various traditional ASR models, including HMM, GMM, and hybrid HMM–
DNN, are explored in this chapter. The idea of an encoder–decoder model based on
RNN was also analyzed in order to handle the continuous speech signals. Chapter 7
examines the more sophisticated state-of-the-art deep-learning-based pre-trained
end-to-end ASR systems to address the drawbacks of these traditional models.
BIBLIOGRAPHY
J. Bilmes (2008). Gaussian models in automatic speech recognition. In: Havelock, D., Kuwano,
S., and Vorländer, M. (eds) Handbook of Signal Processing in Acoustics. Springer, New
York.
J. Bilmes (2003). Buried Markov models: A graphical modeling approach to automatic speech
recognition. Computer Speech & Language, 17: 213–231.
A. Dutta, G. Ashishkumar and C.V.R. Rao (2021). Performance analysis of ASR system in
hybrid DNN-HMM framework using a PWL Euclidean activation function. Frontiers of
Computer Science, 15: 154705.
J.P. Haton (1999). Neural networks for automatic speech recognition: A review. In: Chollet, G.,
Di Benedetto, M.G., Esposito, A., and Marinaro, M. (eds) Speech Processing, Recogni-
tion and Artificial Neural Networks. Springer, London.
L. Rabiner and B. Juang (1986). An introduction to hidden Markov models. IEEE ASSP Mag-
azine, 3(1): 4–16.
7 End-to-End Speech
Recognition Models
LEARNING OUTCOMES
After reading this chapter, you will be able to:
This chapter explores the well-known end-to-end ASR systems such as CTC, Listen
Attend and Spell, deep speech 1, and deep speech 2.
basis of dynamic programming and is widely accepted in ASR and NLP applications. The
main advantage of the CTC layer is that no prior alignment between the input and target
sequences is required. Alex Graves developed a standard RNN with CTC for end-to-end
ASR systems in which any given input speech x = (x_1, x_2, \ldots, x_T) is fed into a
bidirectional RNN layer, as shown in Figure 7.2, to produce the hidden vector
h = (h_1, h_2, \ldots, h_T) and output vector y = (y_1, y_2, \ldots, y_T).
Figure 7.2 shows the bi-RNN which processes both forward and backward hidden
vectors as shown in Eqs. (7.1) to (7.3),
\overrightarrow{h}_t = H\!\left(W_{x\overrightarrow{h}}\, x_t + W_{\overrightarrow{h}\overrightarrow{h}}\, \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}\right)    (7.1)

\overleftarrow{h}_t = H\!\left(W_{x\overleftarrow{h}}\, x_t + W_{\overleftarrow{h}\overleftarrow{h}}\, \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}\right)    (7.2)

y_t = W_{\overrightarrow{h} y}\, \overrightarrow{h}_t + W_{\overleftarrow{h} y}\, \overleftarrow{h}_t + b_o    (7.3)
Each output character from the stack of bi-RNN layers is fed into a final CTC output
layer. A special blank character (B) for null emission is introduced by CTC, as shown
in Figure 7.3; the blank character is responsible for removing duplicates. Finally, the
output from the CTC layer is fed into a softmax activation to identify the most likely
sequence of transcriptions.
The following are the pros and cons of CTC architecture:
FIGURE 7.2 Bi-RNN.
FIGURE 7.3 CTC.
h_t^{(l)} = g\!\left(W^{(l)} h_t^{(l-1)} + b^{(l)}\right)    (7.4)

where h_t^{(l)} is the current hidden layer, h_t^{(l-1)} is the previous hidden layer, W is the
weight matrix, and b is the bias.

h_t^{(f)} = g\!\left(W^{(4)} h_t^{(3)} + W_r^{(f)} h_{t-1}^{(f)} + b^{(4)}\right)    (7.5)

h_t^{(b)} = g\!\left(W^{(4)} h_t^{(3)} + W_r^{(b)} h_{t+1}^{(b)} + b^{(4)}\right)    (7.6)
The deep speech model achieves a better word error rate compared with other systems
for both clean and noisy speech, as shown in Table 7.1.
• Pre-processing: It converts the raw audio sample into log spectrogram and
produces normalized features.
• Model: Deep neural network with two to three convolution layers followed
by three to seven GRU/LSTM layers and one fully connected layer. The
output labels from the fully connected layer are passed into a CTC layer.
• The main enhancement in deep speech 2 is the utilization of decoder. In
deep speech 2, they have considered greedy and beam search decoding
mechanism.
• Hyperparameters used are as follows:
• Learning rate: 0.001
• Batch size per GPU: 16
• Stochastic gradient descent with momentum: 0.9
• Dropout to reduce generalization error
TABLE 7.1
Comparative Analysis of Deep Speech Model

| Model | Clean Speech (WER %) | Noisy Speech (WER %) |
| Apple Dictation | 14.24 | 43.76 |
| Google API | 6.64 | 30.47 |
| Deep Speech | 6.56 | 19.06 |
TABLE 7.2
Comparison of Deep Speech 2 Model in the Perspective of WER

| Dataset | WER (in %) |
| WSJ | 4.42 |
| LibriSpeech (Test Clean) | 5.15 |
| LibriSpeech (Test Other) | 12.73 |
| Voxforge American Canadian | 7.94 |
| Voxforge European | 18.44 |
layers of 512 pBLSTM are stacked on top of LSTM layers to extract the
relevant information.
h = \text{Listener}(x) \quad (\text{the encoder})    (5.7)

h_i^{j} = \text{BLSTM}\!\left(h_{i-1}^{j}, h_i^{j-1}\right)    (5.8)
• Attention (alignment model)
• The output vector is passed into an attention-based alignment model, which
identifies and aligns the encoded frames that are relevant to producing the
current output.
e_{u,t} = \text{score}\!\left(h_{u-1}^{att}, h_t^{enc}\right)    (5.9)

\alpha_{u,t} = \frac{\exp(e_{u,t})}{\sum_{t'=1}^{T} \exp(e_{u,t'})}    (5.10)

c_u = \sum_{t=1}^{T} \alpha_{u,t}\, h_t^{enc}    (5.11)
7.1.6 JASPER
Jasper (Just Another Speech Recognizer) was developed by NVIDIA. It makes use of Mel
filter bank features as the input speech representation. The window size of each frame is
20 ms with a 10 ms shift. The Jasper B×R architecture consists of B blocks and R subblocks,
as shown in Figure 7.8. Each subblock is composed of a 1D convolution, batch normalization,
a Rectified Linear Unit (ReLU) activation, and dropout as a regularization parameter. The
Jasper B×R architecture is designed for fast GPU inference. In addition, it possesses four
convolution layers, one at the initial stage for pre-processing and the other three at the end
for post-processing. The following are the variants of Jasper:
1. Three variants of normalization batch norm, weight norm, and layer norm
are used.
2. Three variants of rectified linear units ReLU, clipped ReLU (cReLU), and
leaky ReLU (lReLU) are used.
3. Uses two variants of gated units, namely gated linear units (GLUs) and
gated activation units (GAUs).
Jasper achieves a lower WER as shown in Table 7.3 with and without external lan-
guage model.
7.1.7 QuartzNet
The QuartzNet design is a variant of the Jasper architecture, with a convolutional model trained
using the Connectionist Temporal Classification (CTC) loss. QuartzNet’s architecture
TABLE 7.3
Comparison of Jasper B×R Model

| Model | Dataset | Language Model | WER (in %) |
| Jasper 10 × 3 | LibriSpeech Dev-Clean | – | 4.51 |
| | LibriSpeech Dev-Other | – | 4.15 |
| Jasper 10 × 3 | Wall Street Journal Validation | 4-gram | 9.9 |
| | Wall Street Journal Testing | 4-gram | 7.1 |
| | Wall Street Journal Validation | Transformer-XL | 9.3 |
| | Wall Street Journal Testing | Transformer-XL | 6.9 |
| Jasper 10 × 5 | Hub5'00 Switchboard (SWB) | 4-gram | 8.3 |
| | Hub5'00 CallHome (CHM) | 4-gram | 19.3 |
| | Hub5'00 Switchboard (SWB) | Transformer-XL | 7.8 |
| | Hub5'00 CallHome (CHM) | Transformer-XL | 16.2 |
• Optimizer: NovoGrad
• Learning rate: 0.05
• Two language models: 4 gram and transformer XL
FIGURE 7.10 QuartzNet.
The performance analysis of QuartzNet is compared with that of Jasper and LAS in
Table 7.4.
TABLE 7.4
Comparison of QuartzNet Model

| Model | Language Model | Test Clean (%) | Test Other (%) | Parameters (in Million) |
| Listen Attend and Spell (LAS) | RNN | 2.5 | 5.8 | 360 |
| JasperDR 10 × 5 | 6-gram | 3.24 | 8.76 | 333 |
| | T-XL | 2.84 | 7.84 | |
| QuartzNet 15 × 5 | 6-gram | 2.96 | 8.07 | 19 |
| | T-XL | 2.96 | 7.25 | |
7.2.1 Wav2Vec
Wav2Vec 2.0, developed by Meta AI, utilizes a more recent technique called
self-supervised learning, one of the newer deep-learning-based techniques for handling
unlabeled data. The model achieves strong performance across varied dialects and many
languages. The Wav2Vec architecture is composed of the following layers:
Traditional speech recognition models are generally trained using transcriptions of anno-
tated speech audio. Large amounts of annotated data, which is only available for a few
languages, are required for good systems. Self-supervision allows unannotated data to be
used to improve such systems. The model consists of a multilayer convolutional feature
encoder f : X → Z that receives raw audio X as input and produces latent speech
representations z_1, \ldots, z_T for T time steps. These are subsequently passed into a
transformer g : Z → C, which creates representations c_1, \ldots, c_T that capture
information from the full sequence. To provide targets for the self-supervised objective,
the feature encoder output is discretized to q_t with a quantization module Z → Q. The
encoder is made up of multiple blocks that include temporal convolution, layer
normalization, and GELU activation. The encoder’s raw waveform input is normalized to
zero mean and unit variance.
The feature encoder’s output is routed into a context network that uses the Transformer
Architecture. The model uses a multilayer convolutional neural network to analyze the
raw waveform of the speech audio to generate latent audio representations of 25 ms each.
The quantizer selects a speech unit from an inventory of learned units for the latent audio
representation. Before being supplied into the transformer, around half of the audio
representations are masked. The transformer incorporates information from the whole
audio track. Finally, the transformer’s output is employed to solve a contrastive task:
the model must identify the correct quantized speech units for the masked positions.
The architecture of Wav2Vec is depicted in Figure 7.12.
A mask is applied to a particular fraction of the input before the context network, as
shown in Figure 7.13.
7.2.2 Data2vec
Data2vec is a self-supervised learning approach suitable for multimodal architecture,
especially text, speech, and image modalities developed by Meta AI as shown in
Figure 7.14.
FIGURE 7.14 Data2vec.
TABLE 7.5
Comparison of Wav2Vec with Data2vec Model

| Model | Dataset | Unlabeled Data | Language Model | Labeled Data: 100 h | Labeled Data: 960 h |
| Wav2Vec 2.0 (base models) | LibriSpeech | LS-960 | 4-gram | 8.0 | 6.1 |
| Wav2Vec 2.0 (large models) | LibriSpeech | LS-960 | 4-gram | 4.6 | 3.6 |
| Data2vec | LibriSpeech | LS-960 | 4-gram | 4.6 | 3.7 |
7.2.3 HuBERT
Hidden unit BERT (HuBERT) is a self-supervised ASR model by Meta AI. HuBERT
architecture is composed of the following layers as shown in Figure 7.15.
• CNN encoder
• Transformer
• Projection layer
• Code-embedding layer
There are two phases in the modeling, which are generating hidden units and
masked prediction. Generating hidden units, that is, finding the hidden units, is the
initial phase in the training process, which starts with the extraction of Mel-frequency
cepstral coefficients (MFCCs) from the audio waveforms. These are basic auditory
characteristics that can be used to describe speech. The K-means clustering algorithm is then
used to assign each audio segment to one of K clusters. The hidden units are then
used to label every audio frame according to the cluster to which it belongs. These
units are then transformed into embedding vectors for use in training step B. The
output of an intermediary of the BERT encoder from the previous iteration is used by
the model to produce representations that are superior to the MFCCs after the first
training step.
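The first HuBERT training phase described above (MFCC extraction followed by k-means clustering into pseudo-label hidden units) can be sketched with librosa and scikit-learn; the number of clusters and the example recording are illustrative assumptions, not HuBERT's actual configuration.

import librosa
from sklearn.cluster import KMeans

# Load an example waveform (replace with real training audio)
x, sr = librosa.load(librosa.example("libri1"), sr=16000)

# Frame-level MFCC features: shape (n_frames, n_mfcc)
mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13).T

# Cluster frames into K discrete "hidden units" used as pseudo-labels
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(mfcc)
hidden_units = kmeans.labels_          # one pseudo-label per audio frame

print(mfcc.shape, hidden_units[:20])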
The second stage, masked prediction, uses masked language modeling to simulate
the training of the initial BERT model. Characteristics from the raw audio are created
by the CNN and supplied into the BERT encoder after being randomly masked. The
masked tokens are filled in by the BERT encoder, which outputs a feature sequence.
The cosine similarity between these outputs and each hidden and output embedding
created in step A is calculated when this output is projected into a low dimensional
space to match the labels. Next, the logits are subjected to the cross-entropy loss to
penalize incorrect predictions.
HuBERT variants are as follows.
• HuBERT base
• No. of CNN encoders is 512.
• Transformers: 12 layers; dropout probability: 0.05; no of attention heads: 8.
FIGURE 7.15 HuBERT.
TABLE 7.6
On-Device RNN-T-based Model
Model Dataset WER EOU Latency
On-Device RNN-T +VAD Voice search 7.4% 860 ms
On-Device RNN-T EP 6.8% 790 ms
7.3.3 Conformer Model
CNNs and transformers have both given great results in ASR. CNNs effectively use
local features, while transformer models capture the global context more efficiently.
Conformer models integrate both convolution and transformer modules to work in an
efficient manner, as shown in Figure 7.19.
The conformer block consists of two macaron-like feed-forward layers sandwiching the
multi-headed self-attention and convolution modules, with half-step residual connections,
followed by a post-layer norm.
The conformer encoder consists of multiple blocks to process the audio. Four
modules make up a conformer block, which are explained here.
TABLE 7.7
Comparison of Wav2Letter on the LibriSpeech Corpus

| Model | Dataset | WER %, Greedy Decoding | WER %, Beam Search |
| Wav2Letter | dev-clean | 6.67 | 4.75 |
| | test-clean | 6.58 | 4.94 |
| | dev-other | 18.67 | 13.87 |
| | test-other | 19.61 | 15.06 |
• Feed-forward module
• Multi-headed self-attention module
• Convolution module
• Second feed-forward module
The multi-headed self-attention module and the convolution module sit between the two
feed-forward modules of the conformer block: one half-step feed-forward module is
present before the attention and one after.
7.4 SUMMARY
This chapter explores various end-to-end pre-trained ASR models such as CTC, LAS,
deep speech, Jasper, QuartzNet, and Wav2Vec-based models in detail. Self-supervised
models and streaming ASR models were also discussed in detail.
BIBLIOGRAPHY
A. Baevski and Wei-Ning Hsu, Qiantong Xu, Thirunavukkarasu Arun Babu, Jiatao Gu and
Michael Auli (2022). data2vec: A general framework for self-supervised learning
in speech, vision and language. Proceedings of the 39th International Conference on
Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Editors: Kamalika
Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, Sivan Sabato.
Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. Wav2vec 2.0:
a framework for self-supervised learning of speech representations. In Proceedings of
the 34th International Conference on Neural Information Processing Systems (NIPS’20).
Curran Associates Inc., Red Hook, NY, USA, Article 1044, 12449–12460. Editors:
H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, H. Lin.
W. Chan, N. Jaitly, Q. Le and O. Vinyals. (2016). Listen, attend and spell: A neural network
for large vocabulary conversational speech recognition. 2016 IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4960–4964). New
York: IEEE. doi: 10.1109/ICASSP.2016.7472621.
R. Collobert, C. Puhrsch and G. Synnaeve. (2016). Wav2Letter: An end-to-end ConvNet-based
speech recognition system. ArXiv, abs/1609.03193.
A. Graves and N. Jaitly. (2014). Towards end-to-end speech recognition with recurrent neural
networks. Proceedings of the 31st International Conference on Machine Learning. Edi-
tors: Eric P. Xing, Tony Jebara 32(2): 1764–1772.
A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu and
R. Pang. (2020). Conformer: Convolution-augmented transformer for speech recognition.
ArXiv, abs/2005.08100.
A. Y. Hannun, C. Case, J. Casper, B. Catanzaro, G. F. Diamos, E. Elsen, R. J. Prenger,
S. Satheesh, S. Sengupta, A. Coates and A. Ng. (2014). Deep speech: Scaling up end-to-end
speech recognition. ArXiv, abs/1412.5567.
Y. He, T. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, et al. (2019).
Streaming end-to-end speech recognition for mobile devices (pp. 6381–6385). Brighton,
UK: ICASSP. doi: 10.1109/ICASSP.2019.8682336.
S. Kriman et al. (2020). Quartznet: Deep automatic speech recognition with 1D time-channel
separable convolutions. ICASSP 2020–2020 IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP) (pp. 6124–6128). New York: IEEE. doi:
10.1109/ICASSP40776.2020.9053889.
J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen and R. T.
Gadde. (2019). Jasper: An end-to-end convolutional neural acoustic model. Interspeech.
H. Wei-Ning, B. Benjamin, H.T. Yao-Hung, K. Lakhotia, R. Salakhutdinov and A. Mohamed
(2021). HuBERT: Self-supervised speech representation learning by masked prediction
of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing
29: 3451–3460. doi: 10.1109/TASLP.2021.3122291.
8 Computer Vision Basics
LEARNING OUTCOMES
After reading this chapter, you will be able to:
8.1 INTRODUCTION
Computer vision is a technology solution that can assist computers in seeing and
fully comprehending image content such as photographs and videos. It is primar-
ily an unsolved problem due to a lack of understanding of physiological vision as
well as the complexity of human vision in a dynamic and nearly infinitely varied
physical world. It is a broad field that can be classified as a subfield of machine
learning, and it can utilize both specialized and general learning techniques. As an
interdisciplinary field, it can appear unstructured, with methods borrowed and reused
from a variety of widely divergent computer science fields. A handcrafted,
problem-specific approach may be appropriate for one task, while another may require
a large and complicated combination of general machine learning algorithms. The aim
of computer vision is to understand the information in digital images. Generally, this
involves the interest and motivation to mimic human vision capability. Recognizing
the image content may
motivation to mimic human vision capability. Recognizing the image content may
entail first extracting a description from the picture, which could be an item, a word
document, a three-dimensional concept, or something entirely different.
CV is a subfield of computer science whose objective is to build machines that can
process and interpret images and videos just as the human visual system does.
In general, as is evident from Figure 8.1, the eye’s job is to transform light into
nerve impulses, which the brain then uses to create images of our environment.
As shown in Figure 8.2, CV utilizes machine learning approaches and algorithms in order to
recognize, differentiate, and classify objects according to their size or color, as well
as to find and decipher patterns in visual data such as images and videos.
• Pre-processing
• Segmentation
• Feature extraction
• Classification
In CV, the image analysis system first pre-processes the input image, after which
segmentation is performed. Segmentation means the partitioning of an image into
connected homogeneous regions. After this, features are extracted from the image to
form a feature vector. Finally, classification of the image, or image recognition, is
performed as shown in Figure 8.3.
8.1.2.8 Compression
Compression is concerned with ways to reduce the image size, and hence the bandwidth
required to transmit it, without degrading its quality. Data compression is critical,
especially when using the Internet.
8.2 Image Segmentation
Image segmentation is a technique used in digital image processing and analysis to divide
a picture into different parts or regions, usually based on the pixels in the picture. Common
types of image segmentation are thresholding techniques, split-and-merge techniques,
region-growing techniques, active contours, the watershed algorithm, and k-means clustering.
Image segmentation is a pre-processing step of CV systems; after image segmentation,
image features can be extracted for image classification and image recognition.
Mathematically, the segmentation problem partitions image I into regions
R_1, R_2, \ldots, R_N such that

1. I = R_1 \cup R_2 \cup \ldots \cup R_N
2. R_i \cap R_j = \emptyset,\; i \neq j
3. There exists a predicate P such that P(R_i) = \text{True} for every region, and
P(R_i \cup R_j) = \text{False} for adjacent regions (i, j).
8.2.1 Steps in Image Segmentation
The primary image segmentation techniques are
• Thresholding
• Region growing
• Region split and merge
• Edge based
• User-stored
• Active contour
• Topology-based
• Watershed
• K-means clustering
8.2.1.1 Thresholding
Thresholding is a kind of image segmentation in which we change an image’s pixel
values to simplify the analysis. Through thresholding, we transform a color or
grayscale image into a binary image.
In thresholding, the image histogram shown in Figure 8.4 is used: the y axis
represents frequency, and the x axis represents grey level. Based on this model, if
I(x, y) is less than the threshold, then the pixel belongs to the object (e.g., a character);
otherwise it belongs to the background (e.g., the paper). Assume that a picture f(x, y)
is made up of light objects on a dark background, and the accompanying figure is the
histogram of this picture. The object can then be extracted by comparing the pixel
value with a threshold T.
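A minimal OpenCV sketch of global thresholding with a fixed threshold T is given below; the file name and the threshold value of 127 are illustrative assumptions.

import cv2

# Read the image in grayscale (replace 'document.png' with a real file)
img = cv2.imread("document.png", cv2.IMREAD_GRAYSCALE)

# Pixels above T = 127 become 255 (background), the rest become 0 (object)
T = 127
_, binary = cv2.threshold(img, T, 255, cv2.THRESH_BINARY)

# Otsu's method can pick T automatically from the histogram
_, otsu = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite("binary.png", binary)
cv2.imwrite("otsu.png", otsu)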
Advantage is as follows:
Disadvantage is as follows:
For example, in the case of color images, these comparisons have to be made in
terms of RGB values. Suppose

I(x, y) = (R_1, G_1, B_1)

I(x_1, y_1) = (R_1', G_1', B_1')

\text{Compare} = (R_1 - R_1')^2 + (G_1 - G_1')^2 + (B_1 - B_1')^2
Using this measure of similarity between two pixels, the predicate is evaluated to
determine the homogeneity condition at the seed points; pixels for which the predicate
is true are added to the region, and this procedure is repeated until all the pixels are
segmented. This concept of region growing is shown in Figure 8.5.
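A small sketch of seeded region growing using the squared-RGB-distance predicate above is given below; the threshold value and the 4-connected neighbourhood are illustrative choices, not prescriptions from the text.

import numpy as np
from collections import deque

def region_grow(img, seed, thresh=900):
    # img: (H, W, 3) RGB array; seed: (row, col); thresh: max squared RGB distance
    h, w, _ = img.shape
    region = np.zeros((h, w), dtype=bool)
    seed_val = img[seed].astype(float)
    queue = deque([seed])
    region[seed] = True
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # 4-connected neighbours
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not region[nr, nc]:
                diff = img[nr, nc].astype(float) - seed_val
                if np.sum(diff ** 2) < thresh:  # homogeneity predicate
                    region[nr, nc] = True
                    queue.append((nr, nc))
    return region

# Toy image: a bright square on a dark background
img = np.zeros((20, 20, 3), dtype=np.uint8)
img[5:15, 5:15] = 200
print(region_grow(img, (10, 10)).sum())  # 100 pixels in the grown region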
Advantages are as follows:
• It is a simple technique.
• It is adaptive to gradual changes or sound noise.
FIGURE 8.5 Difference between Original Image, Region Growing, and Segmented Region.
• Merge two contiguous regions R_i and R_j if they are homogeneous, i.e.,
P(R_i \cup R_j) = True.
• Stop when no further split or merge is possible.
is generated by minimizing the snake energy: internal energy (curve bending and continuity),
external energy (image energy [gradient]), and constraint energy (a measure of outside
constraints, either from higher-level shape information or user-applied energy).
The following are the steps involved in modeling:
Advantages are as follows:
• Closed boundary
• Correct boundary achievable
• Convolutional layer
• Pooling or down-sampling layer
• Flattening layer
• Fully connected layer
8.4.2 Convolution Layer
A convolution layer contains several filters that can perform convolution operations.
The base units in this system are filters or kernels. These layers are made up of a
number of learnable filters. Convolution is accomplished by computing the dot prod-
uct of the input matrix and the filter or kernel.
The kernel iterates over the input vector, performing element-by-element matrix
multiplication in order to conduct the convolution. The feature map records the result
for each receptive field, i.e., the region where the convolution happens. The input image
is a 6 × 6 matrix convolved with a 3 × 3 filter, which produces a feature map. Since
the shape of the filter is 3 × 3, this convolution is called a 3 × 3 convolution, as shown
in Figure 8.11. Three parameters used to determine the size of the feature map are
as follows:
1. Depth: It is the number of filters used for convolution operation. If the con-
volution is performed on an original image using n filters, then it produces
n different feature maps. So the depth of the feature map will be n.
2. Stride: The stride is the number of pixels by which the filter moves over the
input matrix. We shift the filter by 1 pixel at a time when the stride is 1, by
2 pixels when the stride is 2, and so on. Strides are crucial because they
regulate how the filter convolves over the input.
3. Zero-padding: Zero-padding refers to the process of evenly adding zeros
around the input matrix. It is a commonly used adjustment that allows the
size of the input to be tailored to the requirements. Sometimes, the filter
will not fit the image perfectly; in that case, we pad the picture with zeros
so that it fits. Valid padding, by contrast, keeps only the valid part of the
picture. (A small convolution sketch follows this list.)
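The sketch below illustrates how stride and zero-padding determine the feature-map size, using the standard output-size formula (W − F + 2P)/S + 1; the 6 × 6 input and 3 × 3 filter mirror the example in the text, while the stride and padding values are illustrative.

import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    # Zero-padded 2D convolution (implemented as cross-correlation, as in CNNs)
    if padding:
        image = np.pad(image, padding)
    k = kernel.shape[0]
    out_size = (image.shape[0] - k) // stride + 1   # (W - F + 2P)/S + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(36).reshape(6, 6).astype(float)        # 6 x 6 input
kernel = np.ones((3, 3)) / 9.0                           # 3 x 3 averaging filter
print(conv2d(image, kernel).shape)                       # (4, 4): (6 - 3)/1 + 1
print(conv2d(image, kernel, stride=2, padding=1).shape)  # (3, 3)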
• Max pooling:
• Returns the maximum value from the portion of the image covered by the kernel.
• Average pooling:
• Returns the average of all the values from the portion of the image covered
by the kernel.
• Average pooling works well for straight lines and smaller curves but cannot
detect extreme features like sharp edges.
• Sum pooling:
• The total sum of the elements in the feature-map region is called sum pooling.
In Figure 8.13, the input image of size 6 × 6 is convolved with a filter of size 3 × 3,
producing a 4 × 4 feature map. It is then pooled with a filter size of 2 × 2 and stride 2.
Every max pooling operation in this scenario therefore considers at most four values.
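A short numpy sketch of 2 × 2 max pooling with stride 2, matching the example in the previous paragraph, is shown below.

import numpy as np

def max_pool2d(x, size=2, stride=2):
    # Slide a size x size window with the given stride and keep the maximum value
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [3, 4, 1, 8]], dtype=float)
print(max_pool2d(feature_map))  # [[6. 4.] [7. 9.]]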
8.4.4 Flattening Layer
It is time for classification now that the features from the convolution layer
have been extracted and the dimension has been reduced by the pooling layer.
Multidimensional data cannot be processed by fully connected layers. As a result,
before processing, the data should be reduced to a single dimension (flattened), as
shown in Figure 8.14.
connected layer’s purpose is to classify the input image using high-level features
from the convolution and pooling layers. Multidimensional data cannot be processed
by fully connected layers. As a result, before processing, the data should be reduced
to a single dimension (flattened).
F( x ) = max(0, x ) (6)
2. Softmax
• It is used to predict a multinomial probability for a neural network model.
\sigma(\vec{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
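The two activation functions above can be written directly in numpy, as in the short sketch below.

import numpy as np

def relu(x):
    # F(x) = max(0, x), applied element-wise
    return np.maximum(0, x)

def softmax(z):
    # sigma(z)_i = exp(z_i) / sum_j exp(z_j); subtract max for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])
print(relu(logits))      # [2.  0.  0.5]
print(softmax(logits))   # probabilities summing to 1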
8.5.2 MATLAB
Applications involving image, video, and digital signal processing and AI can all benefit
from the MATLAB programming environment. It includes a CV toolbox with numerous
features, applications, and algorithms to assist you in creating solutions for CV-related
problems.
Engineers and researchers can use the MATLAB programming environment to
analyze, design, and test systems and technologies that can influence the world.
MATLAB is a matrix-based language that enables the most natural expression of
computational mathematics, which is the core of MATLAB.
MATLAB is used for the following:
• Parallel computing:
Utilizes multicore desktops, GPUs, clusters, and clouds to carry out massive compu-
tations and parallelizes simulations.
Python, C/C++, Java, and more languages can all be used with MATLAB.
Jupyter Server for a single user, and many more Jupyter-based provisioning systems that are
running locally or in the cloud may all be integrated with MATLAB. By opening it
from the Jupyter interface, you can work directly in MATLAB without leaving your
web browser.
MATLAB can be accessed from another programming environment using
MATLAB Engine APIs. The APIs allow MATLAB commands to be executed from
inside your programming language even without opening a MATLAB desktop ses-
sion. There are MATLAB Engine APIs for:
• C/C++
• Fortran
• Java
• Python
Various applications and components, many of which are built in languages like
Visual C#.NET and Visual Basic.NET
8.6.2 Face Recognition
Facial recognition technology is used in the iPhone and in advanced security systems to
recognize a face. It must be capable of recognizing the distinctive attributes of your
face to prevent unauthorized access to the phone or computer.
In Figure 8.16, the image data is given as input. The data is pre-processed by using
various pre-processing techniques to prepare the data for face recognition. Then the
feature extraction takes place, and the process of extracting facial features involves
locating the most recognizable facial features such as the eyes, nose, and mouth in
photographs of people’s faces, and the final part includes training a part of data and
test with another set of data.
8.6.4 Image-based Search
Google and other image-based search engines use image segmentation systems to
determine the objects in the picture and match them with related photographs in order
to give users search results.
In Figure 8.18, the input image is processed with histogram features, and the
image features are computed for similarity score. The similarity score is computed
with the help of predicted class feature. The image features are extracted from the
input image and given to neural network model, which extracts the image features as
clusters, and the output similar image is got as the final output.
8.6.5 Medical Imaging
Image segmentation is used in the medical industry to locate and recognize tumor
cells, estimate tissue volumes, conduct virtual surgical simulations, and support
image-guided navigation. Image segmentation has various medical applications: it
supports the identification of affected regions and the planning of suitable care.
In Figure 8.19, the training image database is fed to a deep CNN model for training,
and the output is fed to a trained model which also receives a query image as input.
The trained model is further processed for feature extraction, which is then fed to a
feature database for similarity measurement. The retrieved image is obtained as one
output, and the predicted class label is obtained as another output from the trained model.
8.7 SUMMARY
In this chapter, the digital image processing and fundamental steps of image process-
ing and various image segmentation techniques are all explained briefly. Also, this
chapter has explained the feature extraction techniques and components of CNN.
The image classification model was also included. It highlights the tools and libraries
for components like OpenCV and MATLAB. Finally, the chapter concludes with a
brief explanation of real-time applications of CV. Further, these applications are
implemented as case studies in Chapter 10.
BIBLIOGRAPHY
Bayoudh, Khaled, Raja Knani, Fayçal Hamdaoui, and Abdellatif Mtibaa . A survey on deep
multimodal learning for computer vision: Advances, trends, applications, and data-
sets. The Visual Computer (2022): 38(8): 2939–2970.
Chowdhary, Chiranji Lal, G. Thippa Reddy, and B. D. Parameshachari. Computer Vision and
Recognition Systems: Research Innovations and Trends. London: CRC Press (2022).
Guo, Meng-Hao, Tian-Xing Xu, Jiang-Jiang Liu, Zheng-Ning Liu, Peng-Tao Jiang, Tai-Jiang
Mu, Song-Hai Zhang, Ralph R. Martin, Ming-Ming Cheng, and Shi-Min Hu. Attention
mechanisms in computer vision: A survey. Computational Visual Media (2022): 1–38.
Hammoudeh, Mohammad Ali A., Mohammad Alsaykhan, Ryan Alsalameh, and Nahs Althwaibi.
Computer vision: A review of detecting objects in videos--challenges and techniques.
International Journal of Online & Biomedical Engineering (2022): 18(1).
Rani, Shilpa, Kamlesh Lakhwani, and Sandeep Kumar. Three-dimensional objects recognition
& pattern recognition technique; related challenges: A review. Multimedia Tools
and Applications (2022): 1–44.
Tong, Kang, and Yiquan Wu. Deep learning-based detection from the perspective of small or
tiny objects: A survey. Image and Vision Computing (2022): 104471.
Yang, Xi, Jie Yan, Wen Wang, Shaoyi Li, Bo Hu, and Jian Lin. Brain-inspired models for visual
object recognition: An overview. Artificial Intelligence Review (2022): 1–49.
Zeng, Kai, Qian Ma, Jia Wen Wu, Zhe Chen, Tao Shen, and Chenggang Yan. FPGA-based
accelerator for object detection: A comprehensive survey. The Journal of Supercomput-
ing (2022): 1–41.
9 Deep Learning Models
for Computer Vision
LEARNING OUTCOMES
After reading this chapter, you will be able to:
layer with three fully connected layers and a convolutional encoder with convolution
layers. A summary of the architecture is shown in Figure 9.1. LeNet is made up of two
primary components: a dense block with three fully connected layers and a convolutional
encoder with convolution layers. The convolutional layers increase the number of channels
by mapping spatially organized inputs to multiple two-dimensional feature maps. The
output channels of the first convolutional layer are 6, and those of the second are 16.
Each 2 × 2 pooling operation (stride 2) reduces dimensionality by a factor of four through
spatial down-sampling. The batch size, the number of channels, the height, and the width
all determine the output shape of the convolutional block. The three fully connected
layers that make up LeNet’s dense block have 120, 84, and 10 outputs. The 10-dimensional
output corresponds to the number of possible output classes for the classification task at hand.
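A minimal Keras sketch of the LeNet layout described above (two convolutional layers with 6 and 16 channels, pooling, and dense layers of 120, 84, and 10 units) is given below; the 28 × 28 input shape and the activation choices are illustrative assumptions rather than the exact original configuration.

import tensorflow as tf

lenet = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, kernel_size=5, padding="same", activation="sigmoid",
                           input_shape=(28, 28, 1)),
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),
    tf.keras.layers.Conv2D(16, kernel_size=5, activation="sigmoid"),
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="sigmoid"),
    tf.keras.layers.Dense(84, activation="sigmoid"),
    tf.keras.layers.Dense(10),   # one output per class
])
lenet.summary()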
The structural information contained in images can be effectively exploited by a CNN.
The convolutional layer uses few parameters, which is a result of its primary features:
shared weights and local connections. However, this architecture does not scale well to
complex image classification tasks and suffers from overfitting problems. The AlexNet
architecture was later designed to address these problems and perform large-scale
classification tasks.
9.2.2 AlexNet
AlexNet is a deep-CNN model for image classification created by Alex Krizhevsky,
Ilya Sutskever, and Geoffrey E. Hinton and first introduced in the ILSVRC-2012
competition. Its top-5 test error rate was 15.3%, whereas the second-place entry’s was
26.2%. Although many network topologies with many more layers have subsequently
emerged, AlexNet consists of eight layers: the first five are convolutional, and the
final three are fully connected. As seen in Figure 9.2, there are also other “layers”
in between, called pooling and activation layers.
The order of the layers of AlexNet is shown in Figure 9.2. The network is split
into two sections, with one section running on GPU 1 and the other on GPU 2; the low
communication overhead makes it easier to attain good overall performance. The
AlexNet results demonstrate that a large, deep convolutional neural network can set
new records on a highly difficult benchmark. AlexNet is viewed as a turning point in
image classification, and taking the image directly as input to the classification model
is AlexNet’s distinctive benefit. As a supervised learning project, AlexNet produced
excellent results, and choosing techniques that improved the performance of the
network, such as dropout and data augmentation, was also crucial. ConvNet breakthroughs
introduced with AlexNet, like ReLU and dropout, are still in use today. Achieving low
classification errors without overfitting is difficult, and the architecture’s shortcoming
is its inability to handle complex applications with high-resolution images. In order to
address these shortcomings of AlexNet, the VGG (Visual Geometry Group) network was
developed, surpassing previous performance thresholds in image recognition tasks.
AlexNet also showed that deep CNNs can be trained significantly more quickly using the
ReLU nonlinearity than with saturating activation functions like tanh or sigmoid. When
trained on CIFAR-10 data, each example in the mini-batch is passed through the
convolutional block and the dense block to obtain the outputs.
9.2.3 VGG
The previously discussed 16-layer VGGNet-16 can classify images into 1,000 separate
object categories, such as keyboard, animals, pencil, mouse, and many more.
Correspondingly, the model supports images with a resolution of 224 × 224. The core
concept of the VGG19 model, also known as VGGNet-19, is similar to that of the
VGG16 model, with the exception that it has 19 layers. The name VGG16 contains
the number 16 because the deep neural network has 16 layers (VGGNet). The VGG16
network has more than 138 million parameters overall, which makes it quite large.
The pre-trained VGG16 architecture, trained on ImageNet, has a top-5 test accuracy
of around 92.7%. ImageNet contains over 14 million photo-
graphs in over 1,000 categories. Furthermore, it was placed highly among the models
that participated in ILSVRC-2014. The model outperforms AlexNet by replacing large
kernel-sized filters with multiple 3 × 3 kernel-sized filters. The VGG16 model was
trained over the course of several weeks on NVIDIA Titan Black GPUs. The term VGG,
which stands for Visual Geometry Group, refers to a deep-CNN design characterized by
its many layers; the number of layers is what “deep” refers to in VGG-16 or VGG-19.
Using VGGNet, novel object identification models are created.
Established as a deep network, VGGNet outperforms baselines on a variety of tasks and datasets beyond ImageNet, and it remains one of the most common image recognition architectures today. The VGG model, often known as VGGNet or VGG16 because it has 16 layers, is a convolutional neural network model created by K. Simonyan and A. Zisserman in 2014; the overall scenario is shown in Figure 9.3. Even by today's standards, the model presented by these researchers is a large network, and it is described in the publication "Very Deep Convolutional Networks for Large-Scale Image Recognition." VGGNet's design is appealing because it is straightforward.
The first convolutional stack uses 64 filters; this number doubles to 128, then 256, and finally 512 filters in the later stacks. For each step, or each stack, the number of filters used in the convolution layers doubles, and VGG16 was developed around this essential notion. The disadvantages of VGG16 are that it is huge and takes a long time to train. The model is larger than 533 MB due to its depth and the number of fully connected layers, which prolongs the process of deploying a VGG network. The model is used for many classification challenges, although smaller network topologies such as GoogLeNet and SqueezeNet are also widely used. Nonetheless, VGGNet is an excellent building block for instructional purposes because it is so easy to implement. Its significant disadvantages include the difficulty of scaling it to greater depths, its propensity for the vanishing-gradient problem, and the enormous computational time and cost caused by its large number of parameters. The Inception architecture was introduced to address these issues and to improve speed and accuracy while keeping performance high.
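The filter-doubling idea can be made concrete with a short sketch. The following is a minimal Keras outline of a VGG16-style network, with stacks of 3 × 3 convolutions whose filter counts double from 64 up to 512, each stack followed by 2 × 2 max pooling; it illustrates the structure rather than reproducing the original training setup.

from tensorflow.keras import layers, models

# Illustrative VGG16-style structure: five stacks of 3x3 convolutions with
# filter counts 64, 128, 256, 512, 512, each stack followed by max pooling.
def vgg16_like(num_classes=1000):
    model = models.Sequential()
    model.add(layers.Input(shape=(224, 224, 3)))
    for filters, repeats in [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]:
        for _ in range(repeats):
            model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(2, strides=2))
    model.add(layers.Flatten())
    model.add(layers.Dense(4096, activation="relu"))
    model.add(layers.Dense(4096, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

print(vgg16_like().count_params())   # on the order of 138 million parameters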
9.2.4 Inception
The improved Inception V3 replaced the basic Inception V1 model, which was published as GoogLeNet in 2014. As the name implies, it was developed by a team at Google. When a model employed numerous deep layers of convolutions, the data tended to overfit. Inception models therefore use parallel layers rather than ever deeper stacks, resulting in models that are wider rather than deeper. The Inception model consists of several Inception modules. Four parallel branches make up the basic module of the Inception V1 model: a 1 × 1 convolution, a 3 × 3 convolution, a 5 × 5 convolution, and a 3 × 3 max-pooling layer. By applying a filter to each pixel and its immediate neighbors across the whole image, convolution layers transform the image.
to reduce the dimensionality of the feature map. There are other forms of pooling,
but average and maximum pooling are the most popular. One of the main benefits
of the Inception model was the extensive dimension reduction. In order to improve
the model further, the larger convolutions were divided into smaller convolutions.
The basic Inception V1 module, illustrated in Figure 9.4, includes 5 × 5 convolutional layers, which, as already mentioned, are computationally expensive. Thus, to reduce computational cost, each 5 × 5 convolution was swapped out for two 3 × 3 convolutions, and the smaller number of parameters also lowered computing costs. The architecture is nevertheless constrained by a representational bottleneck that reduces the feature space of the following layer, which can result in the loss of a significant amount of detail. R-CNN, discussed next, takes a region-based approach to detecting objects in images.
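A minimal sketch of an Inception-style module, using the Keras functional API, may help make the parallel-branch idea concrete; the branch widths in the usage line follow the first GoogLeNet module, and later Inception versions factor the 5 × 5 convolution into two 3 × 3 convolutions as described above.

from tensorflow.keras import layers

# Illustrative Inception-style module: four parallel branches (1x1 conv,
# 3x3 conv, 5x5 conv, 3x3 max pooling), with 1x1 convolutions used for
# dimension reduction, concatenated along the channel axis.
def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b2)
    b3 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b3)
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(pool_proj, 1, padding="same", activation="relu")(b4)
    return layers.Concatenate()([b1, b2, b3, b4])

inp = layers.Input(shape=(28, 28, 192))
out = inception_module(inp, 64, 96, 128, 16, 32, 32)   # widths of the first GoogLeNet module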
9.2.5 R-CNN
The region-based convolution neural network, or R-CNN for short, was developed
in 2014 by a group of academics at UC Berkeley and is capable of identifying 80
different types of objects in pictures. R-CNN's primary contribution to object detection is the extraction of features with a CNN, rather than a redesign of the full object detection pipeline shown in the previous figure. Figure 9.5 (R-CNN architecture) depicts how the R-CNN model operates. The first module produces roughly 2,000 region proposals by employing the selective search methodology. The second module warps each region proposal to a predetermined, fixed size before extracting a feature vector of length 4,096. In the third module, a pre-trained SVM classifies each region as either background or one of the object classes.
The R-CNN design also has drawbacks. Training is a multi-stage pipeline, each of the roughly 2,000 region proposals is passed through the CNN separately, which makes both training and detection slow, and the extracted features have to be cached on disk, which consumes a large amount of storage. To get beyond these restrictions, the Fast R-CNN model was proposed as an enhancement to the R-CNN model.
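To make the pipeline concrete, the sketch below mimics the final R-CNN stage on synthetic data: it assumes that roughly 2,000 region proposals have already been warped and passed through a CNN to give 4,096-dimensional feature vectors, and then trains a linear SVM on those features (the full system uses one SVM per class). The proposal generation and CNN feature extractor are stubbed out with random data, so this is only an illustration of the idea, not the original implementation.

import numpy as np
from sklearn.svm import LinearSVC

# Stand-ins for the outputs of selective search + CNN feature extraction:
# one 4,096-dimensional feature vector per region proposal.
rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 4096))
labels = rng.integers(0, 2, size=2000)   # 1 = object class, 0 = background

# Linear SVM trained on the CNN features (module three of R-CNN).
svm = LinearSVC()
svm.fit(features, labels)
print(svm.decision_function(features[:5]))   # classification scores for 5 ROIs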
9.2.6 Fast R-CNN
The object detector Fast R-CNN was developed in 2015 by Ross Girshick, a Facebook AI researcher and former Microsoft researcher [7]. As its name implies, Fast R-CNN solves several R-CNN issues. A new layer, dubbed ROI pooling, extracts feature vectors of the same length from each region proposal (ROI). In contrast to R-CNN, which has three stages (region proposal generation, feature extraction, and classification using an SVM), Fast R-CNN has a single stage. Because the convolutional computations are performed only once per image and then shared among all of the proposals (ROIs), Fast R-CNN is quicker than R-CNN; this speed-up is achieved by incorporating the new ROI pooling layer. In contrast to R-CNN, which requires hundreds of gigabytes of disk space, Fast R-CNN does not need to cache the extracted features, and its accuracy is comparable to that of R-CNN.
Figure 9.6 depicts the overall Fast R-CNN design. R-CNN has three steps, whereas this model has one. It requires only an image as input and returns the bounding boxes and class labels of the detected items. The feature map from the preceding convolutional layer is fed into an ROI pooling layer, whose purpose is to retrieve a fixed-length feature vector for each region proposal. The ROI pooling layer operates by dividing each region proposal into a grid of cells and applying max pooling to each cell to return a single value. The values from these cells make up the extracted feature vector; if the grid size is 2 × 2, the feature vector length is 4. The feature vector extracted by ROI pooling is then passed to a few fully connected layers. In R-CNN, by contrast, each region proposal is fed through the model separately from the others. In Fast R-CNN, the selective search region proposal generation consumes the majority of the processing time during detection, so Faster R-CNN focused on this bottleneck in the architecture.
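The grid-and-max-pool idea behind ROI pooling can be shown in a few lines of NumPy. The sketch below divides a region of a feature map into a 2 × 2 grid and max-pools each cell, so every ROI yields a feature vector of the same length (here, 4); it is a simplified illustration that ignores sub-cell interpolation and channel dimensions.

import numpy as np

# Simplified ROI pooling: split one channel of a feature-map crop into a
# fixed grid and take the maximum of each cell.
def roi_pool(region, grid=(2, 2)):
    h, w = region.shape
    gh, gw = grid
    out = np.zeros(grid)
    for i in range(gh):
        for j in range(gw):
            cell = region[i * h // gh:(i + 1) * h // gh,
                          j * w // gw:(j + 1) * w // gw]
            out[i, j] = cell.max()
    return out.flatten()

region = np.arange(36).reshape(6, 6)   # a 6 x 6 crop of a feature map
print(roi_pool(region))                # fixed-length vector of 4 values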
9.2.7 Faster R-CNN
Faster R-CNN, developed in 2015 by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, is an improvement on Fast R-CNN. It outperforms Fast R-CNN thanks to the region proposal network (RPN), a fully convolutional network that generates proposals with varying scales and aspect ratios. Instead of image pyramids or filter pyramids, the RPN relies on anchor boxes. An anchor box is a reference box with a particular scale and aspect ratio. With many reference anchor boxes, the same region can be represented at multiple sizes and shapes, and mapping each region onto its matching reference anchor box allows objects with varying scales and aspect ratios to be recognized.
The RPN shares its convolutional computations with Fast R-CNN, which substantially reduces the overall computation time. Figure 9.7 depicts the layout of the Faster R-CNN architecture. The RPN module is in charge of creating region proposals; it acts as a kind of attention mechanism, telling the Fast R-CNN detection module where to look for objects within the given picture.
The region proposals created by the RPN are used, via the ROI pooling layer, to extract a fixed-length feature vector for each suggested area of the image. The output feature vectors are then classified by the Fast R-CNN detection head, and the class scores of the discovered objects are returned along with their bounding boxes. Whereas the R-CNN and Fast R-CNN algorithms use the selective search technique to produce region proposals, Faster R-CNN learns to generate them with the RPN.
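The anchor-box idea can be illustrated with a short sketch: at every feature-map location, reference boxes are generated for several scales and aspect ratios, so that objects of different sizes and shapes map onto different anchors. The scales and ratios below are illustrative choices, not values prescribed by the source.

import numpy as np

# Generate reference anchor boxes (x1, y1, x2, y2) centred at one location,
# one box per combination of scale and aspect ratio.
def make_anchors(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

print(make_anchors(100, 100).shape)   # (9, 4): 3 scales x 3 ratios per location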
9.2.8 Mask R-CNN
The Mask R-CNN model, developed by Kaiming He, Georgia Gkioxari, Piotr Dollar,
and Ross Girshick in 2017, extends Faster R-CNN by including an additional branch that
returns a mask for each recognized item. Faster R-CNN is a region-based convolutional
neural network that creates bounding boxes and a confidence score for the class label of
each object. Mask R-CNN, also written as Mask RCNN, is a cutting-edge instance segmentation model based on Faster R-CNN and is among the most advanced CNNs for image segmentation. Understanding how Mask R-CNN works requires an understanding of
image segmentation. Image segmentation is a CV technique for dividing a digital image
into distinct regions (sets of pixels, also known as image objects). This segmentation
includes both borders and objects (lines, curves, etc.). Mask R-CNN supports two basic
types of image segmentation: semantic and instance segmentation. Semantic segmen-
tation categorizes each pixel without distinguishing between various instances of the
same object. To put it another way, semantic segmentation attempts to recognize and
group similar items into a single category at the pixel level. Background segmentation,
also known as semantic segmentation, is the process of separating the subjects of an
image from its background. The method of accurately identifying and finely segment-
ing each object in an image is known as instance segmentation, also known as instance
recognition. As a result, it combines object detection, localization, and classification. In
other words, this type of segmentation goes a step further by separating each item that would otherwise be grouped as a similar instance. Even when all the objects in a scene belong to the same class, such as people, each individual instance receives its own attention throughout the procedure. Instance segmentation is therefore also known as foreground segmentation, because it highlights the subjects of the image rather than the background.
Mask R-CNN was built on Faster R-CNN, as shown in Figure 9.8. For each candidate object, Faster R-CNN outputs a class label and a bounding-box offset, whereas Mask R-CNN adds a third branch that outputs the object mask. Because the mask output differs from the class and box outputs, it requires the extraction of a much more precise spatial layout of the object. Mask R-CNN thus extends Faster R-CNN by introducing a branch for predicting an object mask on each region of interest, alongside the existing branch for bounding-box detection, and it adds an important element that Fast/Faster R-CNN lack: pixel-to-pixel alignment. Mask R-CNN uses the same two-stage procedure, beginning with the same first stage (the RPN). In the second stage, in parallel with the class and box-offset predictions, Mask R-CNN generates a binary mask for each ROI. Most other modern systems, in contrast, make classification depend on the mask predictions.
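As an illustration of the extra branch, the sketch below defines a small fully convolutional mask head in Keras that, for each aligned ROI feature map, predicts one binary mask per class, independently of the classification and box branches. The layer sizes follow the commonly described mask-head layout (a stack of 3 × 3 convolutions, an upsampling step from 14 × 14 to 28 × 28, and a per-class 1 × 1 convolution), but this is a hedged sketch rather than the authors' implementation.

from tensorflow.keras import layers, models

# Illustrative mask branch: per-ROI feature maps in, one 28x28 mask per class out.
def mask_head(num_classes, roi_size=14, channels=256):
    inp = layers.Input(shape=(roi_size, roi_size, channels))
    x = inp
    for _ in range(4):
        x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(256, 2, strides=2, activation="relu")(x)  # 14 -> 28
    masks = layers.Conv2D(num_classes, 1, activation="sigmoid")(x)       # binary mask per class
    return models.Model(inp, masks)

print(mask_head(num_classes=80).output_shape)   # (None, 28, 28, 80)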
9.2.9 YOLO
YOLO is a method that provides real-time object detection using neural networks
developed by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi
in 2016. The efficiency and speed of this algorithm account for its popularity. It
has been used in many different contexts to distinguish between animals, people,
parking meters, and traffic signals. YOLO is an abbreviation for You Only Look
Once. This algorithm recognizes and locates multiple objects in a photograph in real time. Object detection in YOLO is carried out as a regression problem that outputs the class probabilities of the detected objects. To recognize objects quickly, the YOLO approach employs CNNs. To detect
objects, as implied by the name and depicted in Figure 9.9, the method only requires
one forward propagation through a neural network. In other words, prediction over the entire image is performed in a single pass, with the CNN forecasting multiple bounding boxes and class probabilities at the same time. There are several versions
of the YOLO algorithm. Tiny YOLO and YOLOv3 are two of the more well-known
versions. The YOLO algorithm is essential for the reasons listed here:
• Speed: It can anticipate objects in real time, and this method speeds up detection.
• Accuracy: The YOLO prediction method yields precise results with few background errors. Because of its strong learning capabilities, the algorithm can learn object representations and apply them to object detection.
The YOLO architecture described above has the following characteristics. Compared with the other pre-trained models, it is noted for operating very quickly, and its algorithm and training properties give it a highly general network. Its processing rate ranges from about 45 frames per second (fps) for the full network to about 150 fps for the smaller variant, which makes the model well suited to real-time applications. Detecting small objects in a picture is quite challenging, however, and in some instances it fails because the detected objects fall too close to one another on the grid. To achieve higher accuracy and better performance within these limitations, it is recommended to upgrade to the most recent versions of YOLO.
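The single-pass, grid-based output can be illustrated by decoding a YOLO-style prediction tensor: the image is divided into an S × S grid, and each cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities. The sketch below uses random numbers in place of a real network output, so it only demonstrates the tensor layout and scoring, not a trained detector.

import numpy as np

S, B, C = 7, 2, 20                           # grid size, boxes per cell, classes
pred = np.random.rand(S, S, B * 5 + C)       # stand-in for one forward pass

boxes = pred[..., :B * 5].reshape(S, S, B, 5)            # (x, y, w, h, confidence) per box
class_probs = pred[..., B * 5:]                          # per-cell class probabilities
scores = boxes[..., 4:5] * class_probs[:, :, None, :]    # confidence x class probability
cell_row, cell_col, box, cls = np.unravel_index(scores.argmax(), scores.shape)
print("highest-scoring detection:", cell_row, cell_col, box, cls)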
9.3 SUMMARY
This chapter includes a detailed analysis of pre-trained models and covers every
aspect of CV in deep learning. The pre-trained model is used in CV to simplify
the work so that it may be more readily integrated with the application, to quickly
obtain a strong model performance, and to process data without the requirement for
well-maintained labeled data. The chapter also covers the use of pre-trained models
in well-known advancements. The pre-trained architectures are covered including
LeNet, AlexNet, R-CNN, Fast R-CNN, Faster R-CNN, Inception, Mask R-CNN, and
YOLO. Each pre-trained concept’s architecture was described, along with a specific
architectural example showing the working. The following chapter will concentrate
on current computer-vision-based applications.
BIBLIOGRAPHY
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate
object detection and semantic segmentation. In Proceedings of the IEEE conference on
computer vision and pattern recognition (pp. 580–587). New York: IEEE.
Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE international conference on
computer vision (pp. 1440–1448). New York: IEEE
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the
IEEE international conference on computer vision (pp. 2961–2969). New York: IEEE.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-
time object detection. In Proceedings of the IEEE conference on computer vision and
pattern recognition (pp. 779–788). New York: IEEE.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale
image recognition. CoRR, abs/1409.1556.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception
architecture for computer vision. In Proceedings of the IEEE conference on computer
vision and pattern recognition (pp. 2818–2826). New York: IEEE.
Voulodimos, A., Doulamis, N., Doulamis, A., & Protopapadakis, E. (2018). Deep learning for
computer vision: A brief review. Computational Intelligence and Neuroscience, 2018.
10 Applications of
Computer Vision
LEARNING OUTCOMES
After reading this chapter, you will be able to:
10.1 INTRODUCTION
Owing to CV, digital images, videos, and other visual inputs can all serve as sources of relevant information for machines and systems. AI shapes the way computers think, whereas CV shapes how they observe and comprehend their surroundings.
CV is being used in more areas than was expected. CV has become a part of our
daily lives, from identifying early cancer signals to enabling automated checkouts in
stores. Here are a few additional uses for CV; the sections below walk through several of them with code.
10.2 HANDWRITTEN DIGIT RECOGNITION WITH MNIST
Since its publication in 1999, this well-known library of handwritten digit images has served as a standard benchmark for classification algorithms, as shown in Figure 10.1. MNIST continues to be a trustworthy resource even as new machine learning methodologies are developed.
10.2.1 Code Snippets
import tensorflow as tf
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
Here, the training and testing datasets are extracted from the data. The dataset consists of 60,000 training images and 10,000 testing images. x_train and x_test hold the grayscale pixel values, while y_train and y_test hold the labels 0 through 9. Examining the shape of the dataset to determine whether it can be used with a CNN gives (60000, 28, 28), which indicates that the dataset has 60,000 images, each of which is 28 × 28 pixels in size. The Keras API requires a four-dimensional array, but the loaded data is only a 3D NumPy array.
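The reshape step itself is not shown in the original snippet; assuming the variables from the load_data() call above, a minimal sketch of what the text describes looks like this:

# Reshape the 3D arrays (samples, 28, 28) into the 4D shape (samples, 28, 28, 1)
# expected by Conv2D layers; the trailing 1 is the single grayscale channel.
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)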
In order to have floating-point values after the division, we set the type of the
four-dimensional NumPy array to float.
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
The next step is normalization, which we always perform in our neural network models; here it is done by dividing the pixel values by 255.
x_train = x_train / 255
x_test = x_test / 255
The model is built using the Keras API: after importing the Sequential model from Keras, Conv2D, MaxPooling, Flatten, Dropout, and Dense layers are added. During training, the dropout layers combat overfitting by ignoring some of the neurons, and the Flatten layer converts the 2D feature maps to a 1D array before the fully connected layers, as sketched below.
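The snippet jumps straight to compiling the model, so the construction itself is a sketch of what the text describes; the exact layer sizes here are illustrative choices, not necessarily the authors' configuration.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dropout, Dense

model = Sequential([
    Conv2D(28, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),                        # 2D feature maps -> 1D vector
    Dense(128, activation='relu'),
    Dropout(0.2),                     # ignore some neurons during training
    Dense(10, activation='softmax'),  # one output per digit class
])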
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=
['accuracy'])
model.fit(x=x_train,y=y_train, epochs=10)
model.evaluate(x_test, y_test)
import matplotlib.pyplot as plt
image_index = 2853
plt.imshow(x_test[image_index].reshape(28, 28), cmap='Greys')
pred = model.predict(x_test[image_index].reshape(1, 28, 28, 1))
print(pred.argmax())
Output
7
Here, we have accessed a test image of the digit 7, and the model has predicted it accurately.
Illumination also has a strong impact on the findings. If the lighting changes unexpectedly, the same subject captured with the same detector, in a virtually identical position and with the same facial expression, may produce significantly different results.
10.3.3.3 Occlusion
An obstruction known as occlusion occurs when one or more facial characteristics are hidden and the full face is not available as input. Occlusion is one of the main issues with facial recognition technology. It frequently occurs in real-world circumstances and is caused by accessories (such as goggles, hats, and masks), beards, and moustaches. These elements add variability to the subject, which makes automated biometric recognition a difficult problem to solve.
10.3.3.4 Expression
The face is one of the most essential biometrics since it is so closely tied to personality and feelings. Different situations lead to various moods, which in turn lead people to display various emotions and, eventually, alter their facial expressions. The many ways in which the same individual presents themselves are a crucial factor as well. Human emotions including joy, sorrow, fury, contempt, fear, and surprise are all examples of macro-expressions, while micro-expressions are brief, involuntary facial movements.
10.3.3.6 Ageing
The fact that the texture and look of the face vary over time and that ageing is reflected in them presents another challenge for facial recognition systems. As we age, our facial features, contours, and lines alter, among other things. For accuracy
testing, a dataset is created for a range of age groups throughout time. It is done for
long-term picture retrieval and visual inspection. The recognition procedure in this
instance depends heavily on image retrieval, which makes use of essential character-
istics like creases, marks, eyebrows, and hairstyles.
10.3.3.9 Dataset
A library of face images called the Labelled Faces in the Wild (LFW) dataset was
created to study the problem of unconstrained face recognition. The dataset is 173 MB in size and contains over 13,000 face images gathered from the Internet.
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic
cap = cv2.VideoCapture(0)

with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while True:
        ret, frame = cap.read()
        # Recolour the feed from BGR to RGB before processing
        image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = holistic.process(image)
        print(results)
10.3.4 Result Analysis
The face is detected and recognized. The facial key points correspond to particular face features, such as the nose tip and the centers of the eyes, as shown in Figure 10.2.
10.4.1 Framework Used
Language used: Python.
Frameworks used: MediaPipe for the backend, implemented in a Jupyter notebook with the Anaconda IDE.
10.4.2 Code Snippets
First, in the following code, we initialize a holistic model and the drawing utilities using MediaPipe.
The mediapipe_detection function runs the MediaPipe model on the image. Initially, the image is converted from BGR to RGB and its flags are made non-writable; after the model has processed the image, the flags are made writable again and the image is converted back from RGB to BGR.
The draw_landmarks function plots the landmarks on the given image, and a further function applies styles to the drawn landmarks; for example, the landmarks drawn on the hands and legs are given different colors to distinguish them from each other. A sketch of these helpers is shown below.
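The helper functions themselves are not reproduced in the text; the following is a minimal sketch of what they might look like, assuming the standard MediaPipe drawing utilities (the exact connection constants, such as FACEMESH_TESSELATION, can differ between MediaPipe versions, and per-group DrawingSpec styling is omitted for brevity).

import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic
mp_drawing = mp.solutions.drawing_utils

def mediapipe_detection(image, model):
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)   # BGR -> RGB
    image.flags.writeable = False                    # freeze the frame while processing
    results = model.process(image)                   # run the holistic model
    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)   # RGB -> BGR for OpenCV display
    return image, results

def draw_landmarks(image, results):
    # Plain landmark drawing without custom styling.
    mp_drawing.draw_landmarks(image, results.face_landmarks, mp_holistic.FACEMESH_TESSELATION)
    mp_drawing.draw_landmarks(image, results.pose_landmarks, mp_holistic.POSE_CONNECTIONS)
    mp_drawing.draw_landmarks(image, results.left_hand_landmarks, mp_holistic.HAND_CONNECTIONS)
    mp_drawing.draw_landmarks(image, results.right_hand_landmarks, mp_holistic.HAND_CONNECTIONS)

def draw_styled_landmarks(image, results):
    # Same idea as draw_landmarks; pass mp_drawing.DrawingSpec objects here to
    # give each landmark group (face, pose, hands) its own colour and thickness.
    draw_landmarks(image, results)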
cap = cv2.VideoCapture(0)
# Set mediapipe model
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        # Read feed
        ret, frame = cap.read()
        # Make detections
        image, results = mediapipe_detection(frame, holistic)
        print(results)
        # Draw landmarks
        draw_styled_landmarks(image, results)
        # Show to screen
        cv2.imshow('OpenCV Feed', image)
        # Break gracefully
        if cv2.waitKey(10) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()

draw_landmarks(frame, results)
plt.imshow(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
len(results.right_hand_landmarks.landmark)
pose = []
for res in results.pose_landmarks.landmark:
    test = np.array([res.x, res.y, res.z, res.visibility])
    pose.append(test)

pose = np.array([[res.x, res.y, res.z, res.visibility]
                 for res in results.pose_landmarks.landmark]).flatten()
pose = (np.array([[res.x, res.y, res.z, res.visibility]
                  for res in results.pose_landmarks.landmark]).flatten()
        if results.pose_landmarks else np.zeros(132))
face = (np.array([[res.x, res.y, res.z]
                  for res in results.face_landmarks.landmark]).flatten()
        if results.face_landmarks else np.zeros(1404))
lh = (np.array([[res.x, res.y, res.z]
                for res in results.left_hand_landmarks.landmark]).flatten()
      if results.left_hand_landmarks else np.zeros(21*3))
rh = (np.array([[res.x, res.y, res.z]
                for res in results.right_hand_landmarks.landmark]).flatten()
      if results.right_hand_landmarks else np.zeros(21*3))
rh

def extract_keypoints(results):
    pose = (np.array([[res.x, res.y, res.z, res.visibility]
                      for res in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33*4))
    face = (np.array([[res.x, res.y, res.z]
                      for res in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468*3))
    lh = (np.array([[res.x, res.y, res.z]
                    for res in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21*3))
    rh = (np.array([[res.x, res.y, res.z]
                    for res in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21*3))
    return np.concatenate([pose, face, lh, rh])

result_test = extract_keypoints(results)
result_test
np.save('0', result_test)
np.load('0.npy')
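The next snippet jumps straight to adding the final output layer and compiling the model, so the earlier model setup and training-data preparation are assumed. A minimal sketch of that missing setup is shown below; the action labels are hypothetical examples, the LSTM sizes are illustrative, and the input shape of (30, 1662) matches 30-frame sequences of the 1,662 keypoint values produced by extract_keypoints above. X_train and y_train are likewise assumed to have been built from saved keypoint sequences.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import TensorBoard

actions = np.array(['hello', 'thanks', 'iloveyou'])   # hypothetical gesture labels
tb_callback = TensorBoard(log_dir='Logs')             # training logs for TensorBoard

# Stacked LSTMs over 30-frame sequences of 1,662 keypoint values per frame.
model = Sequential([
    LSTM(64, return_sequences=True, activation='relu', input_shape=(30, 1662)),
    LSTM(128, return_sequences=True, activation='relu'),
    LSTM(64, return_sequences=False, activation='relu'),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
])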
model.add(Dense(actions.shape[0], activation='softmax'))
model.compile(optimizer='Adam', loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
model.fit(X_train, y_train, epochs=100, callbacks=[tb_callback])
model.summary()
res = model.predict(X_test)
actions[np.argmax(res[1])]
actions[np.argmax(y_test[1])]
model.save('action.h5')
from sklearn.metrics import multilabel_confusion_matrix, accuracy_score
yhat = model.predict(X_test)
ytrue = np.argmax(y_test, axis=1).tolist()
yhat = np.argmax(yhat, axis=1).tolist()
multilabel_confusion_matrix(ytrue, yhat)
accuracy_score(ytrue, yhat)
from scipy import stats
colors = [(245, 117, 16), (117, 245, 16), (16, 117, 245)]

def prob_viz(res, actions, input_frame, colors):
    output_frame = input_frame.copy()
    for num, prob in enumerate(res):
        cv2.rectangle(output_frame, (0, 60 + num * 40),
                      (int(prob * 100), 90 + num * 40), colors[num], -1)
        cv2.putText(output_frame, actions[num], (0, 85 + num * 40),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2, cv2.LINE_AA)
    return output_frame
plt.figure(figsize=(18,18))
#plt.imshow(prob_viz(res, actions, image, colors))
# 1. New detection variables
sequence = []
sentence = []
threshold = 0.5
cap = cv2.VideoCapture(0)
# Set mediapipe model
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        # Read feed
        ret, frame = cap.read()
        # Make detections
        image, results = mediapipe_detection(frame, holistic)
        print(results)
        # Draw landmarks
        draw_styled_landmarks(image, results)
        # 2. Prediction logic
        keypoints = extract_keypoints(results)
        sequence.insert(0, keypoints)
        sequence = sequence[:30]
        if len(sequence) == 30:
            res = model.predict(np.expand_dims(sequence, axis=0))[0]
            print(actions[np.argmax(res)])
            # 3. Viz logic
            if res[np.argmax(res)] > threshold:
                if len(sentence) > 0:
                    if actions[np.argmax(res)] != sentence[-1]:
                        sentence.append(actions[np.argmax(res)])
                else:
                    sentence.append(actions[np.argmax(res)])
            if len(sentence) > 5:
                sentence = sentence[-5:]
            # Viz probabilities
            image = prob_viz(res, actions, image, colors)
        cv2.rectangle(image, (0, 0), (640, 40), (245, 117, 16), -1)
        cv2.putText(image, ' '.join(sentence), (3, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2, cv2.LINE_AA)
        # Show to screen
        cv2.imshow('OpenCV Feed', image)
        # Break gracefully
        if cv2.waitKey(10) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()
10.4.3 Result Analysis
The hand gesture is recognized, and the correct output for the gesture is obtained.
10.4.4.2 Movement
Common sense dictates that a gesture corresponds more to a movement than to a static picture, so gesture recognition should detect patterns of motion. For example, to close the current application, we might recognize a waving pattern as a command rather than just a single image of an extended hand.
Posture detection can also enable the projection of digital data and content onto the physical environment in virtual worlds. A high-fidelity pose model called BlazePose was created primarily to support demanding areas such as yoga, fitness, and dance. It extends the 17-key-point topology of the earlier PoseNet model by identifying 33 key points. These extra key points offer crucial details about the position of the face, hands, and feet, along with scale and rotation.
10.5.1 Framework Used
Language used: Python.
Frameworks used: MediaPipe and BlazePose for the backend, implemented in a Jupyter notebook with the Anaconda IDE.
MediaPipe and the other necessary packages should be installed, the desired dimensions of the image should be provided, and each image should be read once to preview it.
10.5.2 Squats
The following code snippets can be used to detect the postures involved in the squat exercise:
!pip install mediapipe
import cv2
from google.colab.patches import cv2_imshow
import math
import numpy as np

DESIRED_HEIGHT = 480
DESIRED_WIDTH = 480

def resize_and_show(image):
    h, w = image.shape[:2]
    if h < w:
        img = cv2.resize(image, (DESIRED_WIDTH, math.floor(h / (w / DESIRED_WIDTH))))
    else:
        img = cv2.resize(image, (math.floor(w / (h / DESIRED_HEIGHT)), DESIRED_HEIGHT))
    cv2_imshow(img)

# Upload the input images in Colab; this provides the 'uploaded' dict used below.
from google.colab import files
uploaded = files.upload()

# Read images with OpenCV.
images = {name: cv2.imread(name) for name in uploaded.keys()}
# Preview the images.
for name, image in images.items():
    print(name)
    resize_and_show(image)
import mediapipe as mp
mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles
help(mp_pose.Pose)
with mp_pose.Pose(static_image_mode=True, min_detection_confidence=0.5,
                  model_complexity=2) as pose:
    for name, image in images.items():
        # Convert the BGR image to RGB and process it with MediaPipe Pose.
        results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        # Print nose landmark.
        image_height, image_width, _ = image.shape
        if not results.pose_landmarks:
            continue
        print(
            f'Nose coordinates: ('
            f'{results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE].x * image_width}, '
            f'{results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE].y * image_height})'
        )
        # Draw pose landmarks.
        print(f'Pose landmarks of {name}:')
        annotated_image = image.copy()
        mp_drawing.draw_landmarks(
            annotated_image,
            results.pose_landmarks,
            mp_pose.POSE_CONNECTIONS,
            landmark_drawing_spec=mp_drawing_styles.get_default_pose_landmarks_style())
        resize_and_show(annotated_image)
import math

def getAngle(firstPoint, midPoint, lastPoint):
    result = math.degrees(
        math.atan2(lastPoint.y - midPoint.y, lastPoint.x - midPoint.x)
        - math.atan2(firstPoint.y - midPoint.y, firstPoint.x - midPoint.x))
    result = abs(result)          # Angle should never be negative
    if result > 180:
        result = 360.0 - result   # Always take the smaller representation of the angle
    return result

def distance_between(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.linalg.norm(a - b)
with mp_pose.Pose(static_image_mode=True, min_detection_confidence=0.5,
                  model_complexity=2) as pose:
    for name, image in images.items():
        results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        # Print the real-world 3D coordinates of the nose in meters, with the
        # origin at the center between the hips.
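The snippet above is cut off before the squat logic itself. As a hedged illustration of how the getAngle helper defined earlier could be applied, the lines below (which would sit inside the loop above) compute the hip-knee-ankle angle for the left leg and use an assumed 90-degree threshold to distinguish a squat from standing; the threshold and landmark choice are illustrative, not taken from the source.

        # Hypothetical squat check: the knee angle (hip-knee-ankle) drops well
        # below 90 degrees at the bottom of a squat and approaches 180 degrees
        # when standing upright.
        lm = results.pose_landmarks.landmark
        knee_angle = getAngle(lm[mp_pose.PoseLandmark.LEFT_HIP],
                              lm[mp_pose.PoseLandmark.LEFT_KNEE],
                              lm[mp_pose.PoseLandmark.LEFT_ANKLE])
        print(f"{name}: knee angle = {knee_angle:.1f} degrees,",
              "squat" if knee_angle < 90 else "standing")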
FIGURE 10.4 (a) Pose Detection and (b) Pose Detection Plot.
10.5.3 Result Analysis
In the end, we used MediaPipe to detect and classify exercise poses. The pose is correctly detected, and the key points are clearly visible. The same approach can be applied to other yoga and exercise poses.
10.6 SUMMARY
This chapter explains some of the real-world applications of CV. We have used many
libraries and datasets for different applications. This helps the reader identify problems, solve them using CV, and build models with appropriate open-source frameworks. CV is being used in more and more applications every day. Numerous data-
sets are accessible, and these can train computers to recognize and understand items.
This technology also exemplifies a crucial development in our civilization’s quest to
develop AI which will match human intelligence.