Machine Learning
Machine Learning: Concepts, Techniques and Applications starts at the basic conceptual
level of explaining machine learning and goes on to explain the basis of machine learning
algorithms. The mathematical foundations required are outlined along with their associations
to machine learning. The book then describes important machine learning algorithms along
with appropriate use cases. This approach enables readers to explore the applicability of
each algorithm by understanding the differences between them. A comprehensive account of
various aspects of ethical machine learning is provided, and an outline of deep learning
models is also included. The use cases, self-assessments, exercises, activities, numerical
problems, and projects associated with each chapter aim to concretize the understanding.
Features
The book aims to enable graduate students, researchers, and professionals to think of
applications and problems in terms of machine learning, so that they can formulate
problems, prepare data, decide on features, select appropriate machine learning
algorithms, and carry out appropriate performance evaluation.
Machine Learning
Concepts, Techniques and Applications
T V Geetha
S Sendhilkumar
First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has not
been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted,
or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, includ-
ing photocopying, microfilming, and recording, or in any information storage or retrieval system, without writ-
ten permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact
the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works
that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
DOI: 10.1201/9781003290100
Typeset in Palatino
by SPi Technologies India Pvt Ltd (Straive)
Contents
Preface.............................................................................................................................................xix
Author Biography.........................................................................................................................xxi
1 Introduction.............................................................................................................................. 1
1.1 Introduction.................................................................................................................... 1
1.1.1 Intelligence......................................................................................................... 2
1.1.2 Learning............................................................................................................. 2
1.1.3 Informal Introduction to Machine Learning................................................. 3
1.1.4 Artificial Intelligence, Machine Learning, and Deep Learning..................3
1.2 Need for Machine Learning.......................................................................................... 4
1.2.1 Extracting Relevant Information.................................................................... 4
1.2.2 Why Study Machine Learning?...................................................................... 4
1.2.3 Importance of Machine Learning................................................................... 5
1.3 Machine Learning—Road Map.................................................................................... 5
1.3.1 Early History..................................................................................................... 5
1.3.2 Focus on Neural Networks.............................................................................. 7
1.3.3 Discovery of Foundational Ideas of Machine Learning.............................. 7
1.3.4 Machine Learning from Knowledge-Driven to Data-Driven..................... 7
1.3.5 Applied Machine Learning—Text and Vision and
Machine Learning Competitions.................................................................... 8
1.3.6 Deep Learning—Coming of Age of Neural Nets......................................... 8
1.3.7 Industrial Participation and Machine Learning........................................... 8
1.4 What Is Machine Learning?.......................................................................................... 9
1.5 Explaining Machine Learning.................................................................................... 11
1.6 Areas of Influence for Machine Learning................................................................. 14
1.7 Applications of Machine Learning............................................................................ 14
1.7.1 Applications of Machine Learning Across the Spectrum......................... 14
1.7.2 Machine Learning in the Big Data Era......................................................... 16
1.7.3 Interesting Applications................................................................................. 16
1.8 Identifying Problems Suitable for Machine Learning............................................ 17
1.9 Advantages of Machine Learning............................................................................. 17
1.10 Disadvantages of Machine Learning......................................................................... 18
1.11 Challenges of Machine Learning............................................................................... 19
1.12 Summary....................................................................................................................... 19
1.13 Points to Ponder........................................................................................................... 19
E.1 Exercises........................................................................................................................ 20
E.1.1 Suggested Activities....................................................................................... 20
Self-Assessment Questions.................................................................................................... 20
E.1.2 Multiple Choice Questions............................................................................ 20
E.1.3 Match the Columns........................................................................................ 21
E.1.4 Sequencing....................................................................................................... 21
References....................................................................................................................... 21
3.4.2 Self Information............................................................................................... 67
3.4.3 Entropy............................................................................................................. 67
3.4.4 Entropy for Memory and Markov Sources................................................. 68
3.4.4.1 The Source Coding Theorem......................................................... 69
3.4.5 Cross Entropy.................................................................................................. 69
3.4.6 Kullback–Leibler Divergence or Relative Entropy.................................... 69
3.5 Summary....................................................................................................................... 70
3.6 Points to Ponder........................................................................................................... 70
E.3 Exercises........................................................................................................................ 70
E.3.1 Suggested Activities....................................................................................... 70
Self-Assessment Questions.................................................................................................... 71
E.3.2 Multiple Choice Questions............................................................................ 71
E.3.3 Match the Columns........................................................................................ 74
E.3.4 Problems........................................................................................................... 74
E.3.5 Short Questions............................................................................................... 75
4.8.2 Unsupervised Learning................................................................................. 93
4.8.2.1 Workflow of Unsupervised Learning System............................. 94
4.8.2.2 Clustering, Association Rule Mining, and
Dimensionality Reduction............................................................. 94
4.8.3 Semi-supervised Learning............................................................................. 95
4.8.4 Reinforcement Learning................................................................................ 95
4.8.5 Neural Networks and Deep Learning......................................................... 96
4.9 Summary....................................................................................................................... 97
4.10 Points to Ponder........................................................................................................... 97
E.4 Exercises........................................................................................................................ 97
E.4.1 Suggested Activities....................................................................................... 97
Self-Assessment Questions.................................................................................................... 98
E.4.2 Multiple Choice Questions............................................................................ 98
E.4.3 Match the Columns...................................................................................... 100
E.4.4 Short Questions............................................................................................. 100
This book aims to explain machine learning from multiple perspectives such as conceptual,
mathematical, and algorithmic, as well as focusing on using tools and software. However,
the main focus of this book is to make readers think of applications in terms of machine
learning so that readers can define their applications in terms of machine learning tasks
through appropriate use cases that run throughout the chapters. Moreover, handling of
data, selection of algorithms, and evaluation needed for the applications are discussed. The
important issues of fairness-aware and interpretable machine learning are also discussed.
The book then goes on to discuss deep learning basics and presents similar applications
for both machine learning and deep learning, highlighting the differences between the two
approaches. In addition, the book includes a large number of exercises and activities
to enable the reader to understand concepts and apply them. This book therefore aims
to address both the understanding of machine learning and the building of applications
using machine learning.
Author Biography
1
Introduction
1.1 Introduction
The important question that can be asked is “Why is there a sudden interest in machine
learning?” For anyone in the computing field, if you know machine learning you have
a definite edge. Why is that? One important aspect is the huge availability of data from
multiple and varied sources, for example shopping data, social network data, and search
engine data becoming available online. The world today is evolving and so are the needs
and requirements of people. Furthermore, now there is the fourth industrial revolution
of data. In order to derive meaningful insights from this data and learn from the way in
which people and the system interface with the data, we need computational algorithms
that can churn the data and provide us with results that would benefit us in various ways.
Machine learning provides several methodologies to make sense of this data. In addition,
the time is now ripe to use machine learning because many effective and efficient basic
algorithms have become available. Finally, large amounts of computational resources are now
available, as computers keep becoming cheaper. In other words, higher-performance
processors, larger memory for handling data, and greater computational power have made
even online learning possible. These are the reasons why studying and applying machine
learning has gained prominence; machine learning is largely about analyzing large amounts
of data, and the more data available, the better the learning.
The challenges of the world today, which are vastly different from a few years ago, can
be outlined as a largely instrumented and interconnected world with complex organiza-
tions, demanding citizens, requirements of compliance with regulations, highly automated
adversaries, and diverse, evolving, and sophisticated threats. In addition, the top technol-
ogy trends of 2021, in the post-COVID era, are digital workplaces, online learning, tele-
health, contactless customer experience, and artificial intelligence (AI)-generated content.
These challenges and trends have made AI, machine learning, and deep learning the center
of the connected world of today.
Finally, we can judge the importance of a topic by the opinions expressed by eminent
industrialists and researchers across the spectrum.
As you can see, many industries and research organizations realize and predict that
machine learning is going to play a bigger and defining role in society as well as in industry.
Some of the important concepts we will be talking about are intelligence, learning, and
machine learning.
1.1.1 Intelligence
Before we start off, let us look at what intelligence is. The Merriam-Webster definition
of intelligence says that “intelligence is the ability to learn, understand or deal with new
or trying situations (it is not rote learning) and the ability to apply knowledge to manip-
ulate one’s environment or to think abstractly as measured by some objective criteria.”
(Merriam Webster Online Dictionary 2023). One important perspective is that we should
have improved some parameter that describes the learning objective. Another important
aspect of intelligence is abstraction, that is the ability to find common patterns—especially
true for research and even for many commercial applications. Finally, intelligence involves
adaptation or the ability to apply what we have learned to the changing environment of
tomorrow, or in other words intelligence requires learning to be dynamic. Some of the tasks
that require intelligence include reasoning for solving puzzles and making judgements,
planning action sequences, learning, natural language processing, integrating skills, and
the ability to sense and act.
1.1.2 Learning
Having looked at the definition of intelligence, let us understand learning. Herbert Simon
defines learning thus: “Learning is any process by which a system improves its perfor-
mance from experience” (Herbert Simon, 1983). Simon explains that learning also entails
the ability to perform a task in a situation which has never been encountered before. This
definition is important because one important definition of machine learning is based on
it. For a machine, experience comes in the form of data. One of the reasons machine learn-
ing has become important today is because of the huge amounts of data that have become
available. The task we want to do may be, for example, classification, problem solving,
planning, or control. The next component in the definition of learning is the improvement
in performance. This improvement in performance is measured in terms of an objective,
which can be expressed as a gain (for example, gain in profit) or a loss (for example,
loss in error).
Learning is the core of many activities including high-level cognition, performing
knowledge-intensive inferences, building adaptive intelligent systems, dealing with
messy real-world data, and analyzing data. The purpose of learning can be manifold,
including acquiring knowledge, adapting to humans as well as other systems, and making
decisions.
With these definitions of intelligence and learning, we can proceed to define machine
learning. We will talk about many definitions of machine learning, but before we do that,
let us look at why we need machine learning.
Artificial intelligence (AI) is the name of a whole knowledge field, where machine
learning is an important subfield. Artificial intelligence is defined in Collins
Dictionary as "the study of the modelling of human mental functions by computer
programs" (Collins Dictionary, 2023). AI is composed of two words: artificial
and intelligence. Anything which is not natural and created by humans is artificial.
Intelligence means the ability to understand, reason, plan, and so on. Therefore, any
code, technology, or algorithm that enables a machine to mimic human cognition
or behavior is called AI. AI uses logic, if–then rules, decision trees, and machine
learning, which includes deep learning.
Machine learning is a subset of AI that involves implementing algorithms that are able
to learn from the data or previous instances and are able to perform tasks with-
out explicit instructions or programming. The procedure for learning from data
involves statistical recognition of patterns and fitting the model so as to evaluate
the data more accurately and provide precise results.
Deep learning is a part of machine learning that involves the use of artificial neural
networks, where neural networks are one of the types of machine learning
approaches. Deep learning involves a modern method of building, training,
and using neural networks, performing tasks by exposing multilayered neural
networks to enormous amounts of data. Deep learning algorithms are the most popular
choice in many industries due to the ability of neural networks to learn from large
data sets more accurately and provide reliable results to the user (Microsoft 2022).
Figure 1.1 shows a comparison of artificial intelligence, machine learning, and deep
learning.
FIGURE 1.1
Comparison of AI, machine learning and deep learning.
1.3.1 Early History
Neural networks first came into being in 1943, when neurophysiologist Warren McCulloch
and mathematician Walter Pitts created a model of neurons and their working using an
electrical circuit. Meanwhile, Arthur Samuel of IBM developed a learning-based computer
program for playing checkers in the 1950s; Arthur Samuel first coined the phrase "machine learning."
FIGURE 1.2
Machine learning—a roadmap.
Gradually, the field shifted from symbolic approaches inherited from AI research to
methods and tactics drawn from probability theory and statistics. Programs were created
that could analyse large amounts of data and draw conclusions from the results. In 1990,
Robert Schapire and Yoav Freund introduced boosting for machine learning, enhancing
predictive power by combining many weak machine learning models through averaging or
voting. In 1995, Tin Kam Ho introduced random decision forests, which merged multiple
decision trees into a forest to improve accuracy. In general, however, the 1990s was
considered an "AI Winter" period, when funding for AI research was low. Yet this was also
a golden period for AI research, including work on Markov chain Monte Carlo, variational
inference, kernels and support vector machines, boosting, convolutional networks, and
reinforcement learning. In addition, there was the development of the IBM Deep Blue
computer, which won against the world chess champion Garry Kasparov in 1997.
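The idea behind boosting and random forests mentioned above — combining many weak models by voting — can be sketched in a few lines. The weak "models" here are trivial threshold rules invented purely for illustration; real boosting also reweights the training data, which this sketch omits.

```python
# Each weak learner is a simple threshold rule on one feature.
def make_stump(feature_index, threshold):
    def stump(x):
        return 1 if x[feature_index] > threshold else 0
    return stump

# Three weak learners, each looking at a different feature.
weak_learners = [make_stump(0, 5.0), make_stump(1, 2.0), make_stump(2, 1.0)]

def ensemble_predict(x):
    """Combine the weak predictions by majority vote."""
    votes = sum(learner(x) for learner in weak_learners)
    return 1 if votes > len(weak_learners) / 2 else 0

# A sample where two of the three learners vote 1.
print(ensemble_predict([6.0, 1.0, 2.0]))  # -> 1
```

Even when each individual rule is only slightly better than chance, the vote of many such rules can be substantially more accurate, which is what made these ensemble methods influential.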
FIGURE 1.3
Traditional programming versus machine learning.
Tom Mitchell defined machine learning as a computer program that improves its performance
at some task through experience (Mitchell, 1997). His definition reads,
A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.
Machine learning is thus the study of algorithms that improve their performance at some
task with experience. As an example, in the game playing scenario, the experience or data
could be games played by the program against itself, and the performance measure is the
winning rate. As another example, let us consider an email program that observes which
mails you mark as spam and which you do not and based on these observations learns to
filter spam. In this example, classifying emails as spam or not spam is the task T under
consideration, observing the labelling of emails as spam and not spam is the experience E,
and the fraction of emails correctly classified is the performance criterion P.
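Mitchell's T, E, and P can be made concrete with a small sketch of the spam example. The keyword rule and the messages below are invented for illustration; a real spam filter would learn its rule from the data rather than use a hand-picked word list.

```python
# Experience E: emails labelled spam (True) or not spam (False).
labelled_emails = [
    ("win a free prize now", True),
    ("meeting moved to 3pm", False),
    ("free vacation offer", True),
    ("lunch tomorrow?", False),
]

# A crude rule for task T (spam classification): flag emails
# containing words observed only in spam messages.
spam_words = {"free", "prize"}

def classify(text):
    return any(word in text.split() for word in spam_words)

# Performance measure P: fraction of emails classified correctly.
correct = sum(classify(text) == label for text, label in labelled_emails)
p = correct / len(labelled_emails)
print(p)  # -> 1.0 on this tiny labelled set
```

Here T is "classify emails as spam or not spam", E is the list of labelled emails, and P is the accuracy computed in the last step, matching the components of the definition above.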
In fact, Tom Mitchell was the first person to come out with a book on machine learning.
We will be discussing the components of Mitchell's definition later. Later on, Ethem
Alpaydin (Alpaydin, 2020), the author of a machine learning book, defined machine learning
as programming computers to optimize a performance criterion using example data or past
experience. Kevin P. Murphy gave a similar, pattern-centred definition:
The goal of machine learning is to develop methods that can automatically detect pat-
terns in data, and then to use the uncovered patterns to predict future data or other
outcomes of interest.
(Kevin P. Murphy, 2012)
The field of pattern recognition is concerned with the automatic discovery of regularities
in data through the use of computer algorithms and with the use of these regularities to
take actions.
(Christopher M. Bishop, 2013)
Another definition was given by Svensson and Söderberg (Svensson and Söderberg, 2008),
where the focus for the first time was on the methods used, namely computational and
statistical methods. This definition gives emphasis to data mining, because at that time
data mining was perhaps the only important aspect of machine learning:
Machine learning (ML) is concerned with the design and development of algorithms and
techniques that allow computers to “learn”. The major focus of ML research is to extract
information from data automatically, by computational and statistical methods. It is thus
closely related to data mining and statistics.
Therefore, in conclusion, we can describe machine learning as the study of systems that
learn models (programs) from input data and output, optimizing performance on a specified
class of tasks by using example data or past experience and by automatically detecting
patterns in the data with computational and statistical methods.
FIGURE 1.4
Training and prediction model of machine learning.
FIGURE 1.5
Data samples and target values.
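The train-then-predict structure described above can be sketched with a minimal nearest-neighbour learner (the data points and labels below are invented for illustration): training simply stores the labelled examples, and prediction copies the label of the closest stored example.

```python
import math

class NearestNeighbour:
    """A minimal learner: 'training' memorizes the data,
    'prediction' returns the label of the closest example."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, x):
        # Euclidean distance from x to every stored training point.
        distances = [math.dist(x, xi) for xi in self.X]
        return self.y[distances.index(min(distances))]

# Training data: points in 2D with class labels.
X = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
y = ["A", "A", "B", "B"]

model = NearestNeighbour().fit(X, y)
print(model.predict((2.0, 2.0)))  # closest to (1.5, 2.0) -> "A"
print(model.predict((8.5, 8.0)))  # -> "B"
```

The split between `fit` (learn from data samples and target values) and `predict` (handle a new, unseen sample) is exactly the training-and-prediction separation the figures illustrate.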
The next slightly different example is the game playing example. This is an interesting
example because the training data, new samples, and output can be specified in different
ways. For example, the input could be specified as a state of the chessboard and the best
move for that state, where the new sample would be just a state (hitherto unseen) and the
learning would output the best move. On the other hand, the input could be a sequence of
moves starting from the initial position leading to a final position with the final position
being marked as win or loss. The specification could also be altered where the training data
consists of board positions and how many times the particular move from the board posi-
tion resulted in a win. This explanation is to impress upon you that the same problem can
be specified and represented in different ways. The learning involved could be finding a
sequence of moves that would result in a win. In Figure 1.6b, the training data consists of
the sequence of moves of many games played and whether the game resulted in a win or
a loss. The learning model uses this training data to decide the strategy of searching and
evaluating whether a move from the current board is likely to end in a win. Based on this
evaluation, the model suggests the best move possible from the current board position.
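The last representation described above — board positions paired with counts of how often each move led to a win — can be sketched as a simple lookup table. The position names, moves, and counts below are invented for illustration; a real game player would search ahead rather than rely on a static table.

```python
# Training data: for each (position, move), how often the game
# was ultimately won or lost after playing that move.
stats = {
    ("pos1", "e2e4"): {"wins": 30, "losses": 10},
    ("pos1", "d2d4"): {"wins": 12, "losses": 18},
    ("pos1", "g1f3"): {"wins": 5,  "losses": 5},
}

def best_move(position):
    """Suggest the move with the highest observed win rate."""
    candidates = [(m, s) for (p, m), s in stats.items() if p == position]
    def win_rate(entry):
        s = entry[1]
        return s["wins"] / (s["wins"] + s["losses"])
    return max(candidates, key=win_rate)[0]

print(best_move("pos1"))  # e2e4 wins 75% of the time -> "e2e4"
```

This makes the point of the passage concrete: the same game-playing problem looks quite different depending on whether the training data is (state, best move) pairs, full move sequences, or per-move win statistics as here.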
The next example we will consider is the case of disease diagnosis. Here the input to the
training phase is a database of patient medical records, with the target value indicating
whether the disease is present or absent. In this case, the learned disease classifier,
when given a new patient's data, would indicate the absence or presence of the disease.
FIGURE 1.6
Examples of machine learning.
FIGURE 1.7
Application areas of machine learning.
Health care is another industry where machine learning is used in a variety of ways.
Due to the advent of wearable devices and sensors, patient data can be assessed in real
time. Machine learning provides methods, techniques, and tools that can help in solving
diagnostic and prognostic problems in a variety of medical domains. It is being used for
the analysis of the importance of clinical parameters and of their combinations for progno-
sis. Machine learning is also being used for data analysis, such as detection of regularities
in the data by appropriately dealing with imperfect data, interpretation of continuous data
used in the intensive care unit, and for intelligent alarms resulting in effective and efficient
monitoring. Machine learning methods can help the integration of computer-based sys-
tems in the healthcare environment providing opportunities to facilitate and enhance the
work of medical experts and ultimately to improve the efficiency and quality of medical
care. Machine learning is also used in biology and medicine for drug discovery and
computational genomics (analysis and design).
Retail websites use machine learning to recommend items based on analyzing buying
history and to personalize shopping experience, devise a marketing campaign, decide
price optimization, plan supply, and gain customer insights.
The oil and gas industry uses machine learning to discover new energy sources, ana-
lyze minerals in the ground, predict refinery sensor failure, and enable streamlining oil
distribution for efficiency and cost effectiveness.
Transportation uses machine learning to analyze data to identify patterns and trends in
order to make routes more efficient and predict potential problems to increase profitability.
Machine learning also improves the efficiency of delivery companies and public
transportation.
Web-based applications of machine learning are discussed next. The web contains a lot
of data. Tasks with very big datasets often use machine learning, especially if the data is
noisy or nonstationary. Web-based applications include spam filtering, fraud detection,
information retrieval (finding documents or images with similar content), and data visual-
ization where huge databases are displayed in a revealing way.
Social media is based on machine learning where it is used for sentiment analysis, spam
filtering, and social network analysis.
Virtual assistants: natural language processing and intelligent agents are areas where
machine learning plays an important part.
1.7.3 Interesting Applications
A number of interesting applications combine the effectiveness of both machine learning
and big data processing.
1.12 Summary
• Outlined the different definitions of machine learning
• Explained the need for machine learning and its importance
• Explored the history of machine learning from different perspectives
• Explained machine learning
• Discussed different applications of machine learning
• Outlined the advantages, disadvantages, and challenges of machine learning
1.13 Points to Ponder
• Can you give a flowchart on how human beings learn to read?
• Is machine learning needed to perform sorting of numbers?
• Why do you think machine learning is suitable for handwriting recognition?
• What is experience in machine learning terms?
• Can you list the important components of machine learning that evolved over the
years?
E.1 Exercises
E.1.1 Suggested Activities
E.1.1.1 Fill the following table for the human learning process (use your imagination).
Some of the problems are intuitive, and some can use different criteria; however,
you are required to outline as shown:
Problem | Data (Experience) Needed for Learning | Features Needed | Process Used | How Will You Evaluate Quantitatively?
E.1.1.2 Give and describe an example data set with values of data with 10 data samples
having five independent variables or features and one dependent variable or
target for classifying the students in a class.
E.1.1.3 Develop simple models (as shown in Figure 1.6) for the following problems:
a. Image classification
b. Finding the best route to travel
c. Weather prediction
Self-Assessment Questions
E.1.2 Multiple Choice Questions
E.1.2.4 Machine learning is the field of study that gives computers the ability to learn.
i By being explicitly programmed
ii By being given lots of data and training a model
iii By being given a model
E.1.4 Sequencing
E.1.4.1 A massive visual database of labelled images, ImageNet, was released by Fei-Fei Li.
E.1.4.2 Kaggle was launched originally as a platform for machine learning competitions.
E.1.4.3 An artificial neural network was built by Minsky and Edmonds consisting of 40 interconnected
neurons.
E.1.4.4 The nearest neighbour algorithm, a basic pattern recognition algorithm, was used for mapping
routes in finding a solution to the travelling salesperson’s problem of finding the most efficient
route.
E.1.4.5 Groundbreaking natural language processing algorithm GPT-3 was released.
E.1.4.6 The IBM Deep Blue computer won against the world’s chess champion Garry Kasparov.
E.1.4.7 “Eugene Goostman” was the first chatbot that was considered to pass the Turing test
E.1.4.8 An artificial neural network was used by the NetTalk program for a human-level cognitive task.
E.1.4.9 AlphaGo was developed for Chinese complex board game Go.
E.1.4.10 Facebook introduced DeepFace, a special software algorithm able to recognize and verify
individuals on photos at the same level as humans.
References
Alpaydin, E. (2020). Introduction to machine learning (4th Ed.). Adaptive Computation and Machine
Learning Series. MIT Press.
Bishop, C. M. (2013). Pattern recognition and machine learning (Corr. 2nd Ed.). Springer.
Collins Dictionary. (2023). Artificial intelligence. https://www.collinsdictionary.com/dictionary/
english/
• Data: The more diverse the data, the better the output of the machine learning sys-
tem. If we want to detect spam, we need to provide the system with a large
number of samples of both spam and non-spam messages. Similarly, if we want
to forecast stock prices, we need to provide price history; and if we want to find
user preferences, we need to provide a large history of user activities.
• Features: Features are also known as attributes or variables. Selecting the right set
of features is important for the success of machine learning. Features are factors
for the machine learning system to consider, such as car mileage, user’s gender,
stock price, or word frequency in a text. We have to remember that the selection of
features depends on the learning task at hand as well as the availability of infor-
mation. When data is stored in tables, then features are column names. However,
data can be in the form of pictures where each pixel can be considered as a feature.
• Algorithms: A given problem can typically be tackled with more than one of the
available machine learning algorithms. The method we choose affects the precision,
performance, and size of the final model. There is one important aspect: if the data
is bad, even the best algorithm won’t help.
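The three components above can be made concrete with a small sketch. The vocabulary, messages, and labels below are invented purely for illustration; a real spam detector would use thousands of messages and a much richer feature set.

```python
# Toy illustration of data, features, and labels for spam detection.
# Each message (the data) is turned into word-frequency counts (the features);
# the label says whether the message is spam.

def extract_features(message, vocabulary):
    """Represent a message as word counts over a fixed vocabulary."""
    words = message.lower().split()
    return [words.count(term) for term in vocabulary]

# Invented vocabulary and samples, purely for illustration.
vocabulary = ["free", "winner", "meeting", "report"]

samples = [
    ("free prize winner winner", 1),   # 1 = spam
    ("project meeting and report", 0), # 0 = not spam
]

for message, label in samples:
    features = extract_features(message, vocabulary)
    print(features, label)
```

Each message becomes a fixed-length numeric vector, which is the form in which a learning algorithm would consume it.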
When we apply machine learning in practice, we first need to understand the domain, the
prior knowledge available, and the goals of the problem to be tackled. In addition, even if
the data to be considered is available, it needs to be integrated, cleaned, preprocessed, and
selected. Feature selection is the next important step because this selection can make or
break the success of a machine learning system. Another issue is that some of the features
may be sensitive or protected. The next issue is the choice of the learning model or the type
of machine learning that is going to be used. Training the model on a data set and tuning
all relevant parameters for optimal performance is the next component. Once we build
the system, then evaluating and interpreting the results is important. Finally comes the
deployment in an actual scenario. After deployment, the system is looped to fine-tune the
variables of the machine learning system. These top-level understandings of the machine
learning process (Mohri et al. 2012) are given in Figure 2.1.
DOI: 10.1201/9781003290100-2
Consider the following two simple examples given in Table 2.1 to illustrate the proce-
dure shown in Figure 2.1.
Viewed from another perspective, specifically from the supervised learning viewpoint,
in machine learning the labelled input data (data with both independent and dependent
variables) needs to be divided into training and test sets. The training set is used to learn the
hypothesis or model. The learning model depends on the type of learning algorithm used.
Finally, an important component is the measurement of system performance in terms of
gain or loss, which in turn depends on the type of output or goal of the system. It is also
important that the system generalizes well to unknown samples.
FIGURE 2.1
Applying machine learning in practice.
TABLE 2.1
Examples of the Machine Learning Process
No  Step                 Example 1                                         Example 2
1   Set the goal         Predict heavy traffic on a given day              Predict stock price
2   Choose features      Weather forecast could be a good feature          Previous year’s price could be a good feature
3   Collect the data     Collect historical data and weather for each day  Collect the price for each year
4   Test the hypothesis  Train a model using this data                     Train a model using this data
5   Analyse results      Is this model performing satisfactorily?          Is this model performing satisfactorily?
6   Reach a conclusion   Reason why this model is not good                 Reason why this model is not good
7   Refine hypothesis    Time of year as an alternative feature            Average price of the previous week
Understanding Machine Learning 25
FIGURE 2.2
Machine learning terminology.
• Output: Prediction label obtained from input set of samples using a machine
learning algorithm.
• Training samples: Examples used to train a machine learning algorithm.
• Labelled samples: Examples that contain features and a label. In supervised train-
ing, models learn from labelled examples.
• Test samples: Examples used to evaluate the performance of a learning algorithm.
The test sample is separate from the training and validation data and is not made
available during the learning stage.
• Validation samples: Examples used to tune parameters of a learning algorithm.
• Loss function: This is a function that measures the difference/loss between a pre-
dicted label using the learnt model and the true label (actual label). The aim of the
learning algorithms is to minimize the error (cumulative loss across all training
examples).
• Holdout data set: Examples intentionally not used (“held out”) during training.
The validation data set and the test data set are examples of holdout data. Holdout
data helps evaluate a model’s ability to generalize to data other than the data it
was trained on. The loss on the holdout set provides a better estimate of the loss
on an unseen data set than does the loss on the training set.
• Hypothesis set: A set of mapping functions that maps features (feature vectors) to
the set of labels. The learning algorithm chooses one function among those in the
hypothesis set as a result of training. In general, we pick a class of functions (e.g.,
linear functions) parameterized by a set of free parameters (e.g., coefficients of the
linear function) and pinpoint the final hypothesis by identifying the parameters
that minimize the error.
• Model selection: Process for selecting the free parameters of the algorithm (actu-
ally of the function in the hypothesis set).
• Inference: Inference refers to the process of applying trained machine learning
models to hitherto unseen data to make predictions. In statistics, inference refers
to the process of drawing conclusions about some parameters of a distribution
obtained from observed data.
• Supervised learning model: Here the model trains from input data that is also
associated with the corresponding outputs. This learning is similar to a student
learning a subject by studying a set of questions and their corresponding answers.
After studying the mapping between questions and answers, the student can pos-
sibly provide answers to new, hitherto unseen questions on the same topic.
• Unsupervised learning model: Training a model to find patterns in a dataset, typi-
cally an unlabelled dataset.
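The sample types defined above can be sketched as a simple holdout split. The 60/20/20 proportions and the fixed random seed below are arbitrary choices for illustration, not recommendations from the text.

```python
import random

def holdout_split(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle samples and split them into training, validation, and test sets.
    The validation set tunes parameters; the test set stays held out until
    the final evaluation."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

data = list(range(10))  # stand-in for ten labelled examples
train, val, test = holdout_split(data)
print(len(train), len(val), len(test))  # 6 2 2
```

Every example lands in exactly one of the three sets, which is what allows the loss on the holdout data to estimate the loss on genuinely unseen data.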
FIGURE 2.3
Types of machine learning tasks.
• Classification: This is perhaps the most intuitive machine learning task. It
distinguishes among two or more discrete classes; that is,
it specifies which among a set of given categories the input belongs to. There are
basically two types of classification. Binary classification distinguishes between
two classes, for example whether an email is spam or not, whether a person is
diagnosed with a disease or not, or whether a loan can be sanctioned to a person
or not. On the other hand, multiclass classification distinguishes between more
than two classes. For example, an image processing classification could determine
whether an input image is a flower, a tree, or an automobile, or a student result
classification system could distinguish between performance grades as excellent,
very good, good, average, and poor. Note that this student result system could
also be considered as a binary classification task, with the student result being
pass or fail. Another variation in classification is multi-label classification, where
one sample can belong to more than one class. An example would be classifying
people according to their job where a cricketer, for example, can also be a busi-
nessman. Yet another variation is hierarchical classification, where a sport can be
classified as cricket, tennis, football, and so on, where again within the cricket class
we can have test cricket, one-day cricket, and T20 cricket. The performance of clas-
sification is often evaluated using accuracy, that is, how many unknown samples
were classified correctly among all the unknown samples.
• Regression: This task outputs continuous (typically, floating-point) values.
Examples include prediction of stock value, prediction of amount of rainfall, pre-
diction of cost of house, and prediction of marks of a student. Again, note here
that a student performance problem can also be considered as a regression task.
The performance of regression is often evaluated using squared loss, which will be
explained later.
• Clustering/segmentation: Clustering involves automatically discovering natural
groupings in data. This task can be described as grouping or partitioning items
into homogeneous groups by finding similarities between them. In other words,
clustering can be described as grouping a given set of samples based on the simi-
larity and dissimilarity between them. One of the most important real-world exam-
ples of clustering is customer segmentation, or understanding different customer
groups around which to build marketing or other business strategies. Another
example of clustering is identifying cancerous data sets from a mix of data con-
sisting of both cancerous and noncancerous data. The performance of clustering
is often evaluated using purity, which is a measure of the extent to which clusters
contain a single class based on labelled samples available. Since the evaluation
uses samples that are labelled, it is considered as the external evaluation criterion
of cluster quality.
• Predictive modelling: This is the task of indicating something in advance based
on previous data available. Predictive modelling is a form of artificial intelligence
that uses data mining and probability to forecast or estimate granular, specific out-
comes. For example, predictive modelling could help identify customers who are
likely to purchase a particular product over the next 90 days. Thus, given a desired
outcome (a purchase of the product), the system identifies traits in customer data
that have previously indicated they are ready to make a purchase soon. Predictive
modelling would run the data and establish which factors actually contributed to
the sale.
• Ranking: In the task of ranking, given a set of objects, the system orders them in
such a way that the most interesting one comes first and then the next most inter-
esting one, and so on; in other words, it orders examples by preferences which are
represented as scores. For example, a ranking system could rank a student’s per-
formance for admission. Another important example is the ranking of web pages
in the case of a search engine. The performance is measured based on the mini-
mum number of swapped pairs required to get the correct ranking.
• Problem solving: This task uses generic or ad hoc methods, in an orderly
manner, to find solutions to problems. Problem solving requires acquiring
multiple forms of knowledge, and then problem solving is viewed as a search
of a state-space formulation of the problem. With this formalism, operators are
applied to states to transit from the initial state to the goal state. The learning
task is to acquire knowledge of the state-space to guide search. Examples of prob-
lem-solving include mathematical problems, game playing, and solving puzzles.
Performance is measured in terms of correctness.
• Matching: This task is about finding if one entity is like another in one or more
specified qualities or characteristics. The main tasks in many applications can be
formalized as matching between heterogeneous objects, including search, recom-
mendation, question answering, paraphrasing, and image retrieval. For example,
search can be viewed as a problem of matching between a query and a document,
and image retrieval can be viewed as a problem of matching between a text query
and an image. Matching two potentially identical individuals is known as “entity
resolution.” The performance measure used here is again accuracy.
• Tagging/annotation: This task is a part of classification that may be defined as the
automatic assignment of labels or tokens for identification or analysis. Here the
descriptor is called a tag, which may represent the part of speech, semantic infor-
mation, image name, and so on. Text tagging adds tags or annotation to various
components of unstructured data. In Facebook and Instagram, tagging notifies the
recipient and hyperlinks to the tagged profile. A label is assigned for identification.
The performance measure used here is again accuracy.
• Recognition: This task refers to the identification of patterns in data. Pattern rec-
ognition indicates the use of powerful algorithms for identifying the regularities in
the given data. Pattern recognition is widely used in domains like computer vision,
speech recognition, face recognition, and handwriting recognition. Performance is
measured in terms of efficiency.
• Forecasting or time-based prediction: This task is often used when dealing with
time series data and refers to forecasting the likelihood of a particular outcome,
such as whether or not a customer will churn in 30 days, where the model learning
is based on past data and then applied to new data. It works by analysing current
and historical data and projecting what it learns on a model generated to forecast
likely outcomes. Examples where forecasting can be used include weather predic-
tion, a customer’s next purchase, credit risks, employee sentiment, and corporate
earnings.
• Structured prediction: This machine learning task deals with outputs that have
structure and are associated with complex labels. Although this task can be
divided into sequential steps, each decision depends on all the previous decisions.
Examples include parsing—mapping a sentence into a tree—and pixel-wise seg-
mentation, where every group of pixels is assigned a category.
• Sequence analysis: This task involves predicting the class of an unknown
sequence. An example would be, given a sequence of network packets, to label
the session as intrusion or normal. The task can also be defined as the process
of analysing sequences; in the context of bioinformatics, given protein or DNA
sequences, it seeks to understand their features, function, structure, and evolution.
The performance measure is generally accuracy.
• Anomaly detection/outlier detection: This task involves analysing a set of objects
or events to discover any of them as being unusual or atypical. Examples include
credit card fraud detection, intrusion detection, and medical diagnosis.
• Association rules: This task identifies hidden correlations in databases by applying
some measure of interestingness to generate an association rule for new searches.
This involves finding frequent or interesting patterns, connections, correlations, or
causal structures among sets of items or elements in databases or other informa-
tion repositories. Examples include store design and product pricing in business
and associations between diagnosis and patient characteristics or symptoms and
illnesses in health care. The performance measure is generally accuracy.
• Dimensionality reduction: This task involves transforming an input data set
(reducing the number of input variables in a data set) into a lower-dimensional
representation while preserving some important properties. Dimensionality reduc-
tion aims to map the data from the original dimension space to a lower dimension
space while minimizing information loss. Examples could include preprocessing
images, text, and genomic data.
Although this section outlines the basics of different machine learning tasks, any applica-
tion problem can be modelled based on more than one type. However, understanding the
type of machine learning task associated with the problem we want to solve is the first step
in the machine learning process.
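Two of the evaluation measures named in the task list above, classification accuracy and the squared loss used for regression, can be sketched directly. The labels and predictions below are invented for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples classified correctly."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def mean_squared_error(true_values, predicted_values):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2
               for t, p in zip(true_values, predicted_values)) / len(true_values)

# Classification: 3 of 4 predictions match the true labels.
print(accuracy(["spam", "ham", "ham", "spam"],
               ["spam", "ham", "spam", "spam"]))  # 0.75

# Regression: squared loss over invented predictions.
print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))  # (1 + 4) / 2 = 2.5
```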
• Choose the training experience (the set X), and how to represent it.
• Choose exactly what is to be learnt, i.e., the target function C.
• Choose how to represent the target function C.
• Choose a learning algorithm to infer the target function from the experience.
• Find an evaluation procedure and a metric to test the learned function.
The design cycle (Pykes, 2021) is shown in Figure 2.4. The first step is the collection of data.
The next step is the selection of features. This is an important step that can affect the over-
all learning effectiveness. In most cases, prior knowledge about the input data and what
is to be learned is used in selecting appropriate features. The third step is model selection,
which is essentially selection of a model that will fit the data. Again here prior knowledge
about the data can be used to select the model. Once the model is selected, the learning
step fine-tunes the model by selecting parameters to generalize it. Finally, the evaluation
and testing step evaluates the learning system using unseen data. The steps are explained
in detail as follows.
1 Collection of data: As already explained, the first step is the collection of the
data D = {d1, d2,…dm,…dn}, where each data point represents the input data and
corresponding output (Figure 2.5). Here for simplicity we assume input x is one
dimensional.
As a simple example, let us consider the case where we want to find the
expected weight of a person given the height. Here we need to collect the data of
each weight–height combination, where each point in the one-dimensional graph
represents a data point. However, in all machine learning examples, the number
of features or attributes considered and the number of data points are both large.
The weight–height example is purely for understanding. Note that each data
point di can be associated with many attributes or features, say m attributes, and
FIGURE 2.4
Design cycle.
FIGURE 2.5
The collected data.
FIGURE 2.6
Visualization of a higher dimensional example.
so each data point can be m-dimensional. When dimensions are higher, the data
set can be imagined as a point cloud in high-dimensional vector space. A higher
dimension example can be visualized as shown in Figure 2.6.
2 Feature selection: Feature selection is essentially the process of selecting relevant
features for use in model construction. The selection of features depends on the
learning problem as given in Example 2.1.
Feature selection could be carried out in two ways; one is by reducing the num-
ber of attributes considered for each data point. This type of feature selection is
called dimensionality reduction as shown in Figure 2.7. The second method is to
reduce the number of data points considered, where the original D = {d1, d2, …, dm,
…, dn} is reduced to D′ = {d1, …, dm}, where m < n (Figure 2.8).
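Both forms of reduction described above can be sketched on a toy data set. The data values and the choice of which columns and points to keep are arbitrary, purely for illustration.

```python
def reduce_attributes(data, keep_columns):
    """Dimensionality reduction: keep only selected attribute columns
    of each data point (Figure 2.7)."""
    return [[point[c] for c in keep_columns] for point in data]

def reduce_points(data, m):
    """Keep only the first m data points, m < n (Figure 2.8)."""
    return data[:m]

# n = 4 data points, each with 3 attributes.
D = [[1, 10, 100], [2, 20, 200], [3, 30, 300], [4, 40, 400]]

print(reduce_attributes(D, [0, 2]))  # drop the middle attribute
print(reduce_points(D, 2))           # keep only d1 and d2
```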
FIGURE 2.7
Reduction of attributes.
FIGURE 2.8
Reduction of data points.
3 Model selection: The model is learnt from the training data. The next step is the
model selection, where we select a model or set of models that would most likely
fit the data points. Note that the model is selected manually. Associated with the
selected model (or function), we need to select the parameters. We assume a very
simple linear model (Equation 2.1),
y = ax + b + ε, where ε ~ N(0, σ²), (2.1)
and learning is associated with fitting this model to the training data given,
which in this simple example involves learning the parameters slope a and inter-
cept b (Figure 2.9).
Here this line can help us to make prediction. Our main goal is to reduce the
distance between estimated value and actual value, i.e., the error. In order to
achieve this, we will draw a straight line which fits through all the points.
When the number of dimensions increases, the model can be represented as

y = w0 + w1x1 + w2x2 + … + wmxm + ε (2.2)

This equation (Equation 2.2) represents the linear regression model and will be
explained in detail in the chapter on linear regression. The role of the error ε will now
be explained. Here error is an important aspect and is defined as the difference
between the output obtained from the model and the expected outputs from the
training data. An error function ε needs to be optimized. A simple example of an
error function is the mean squared error given subsequently, which describes the
difference between the true and predicted values of output y (Equation 2.3):
FIGURE 2.9
Choosing a model.
FIGURE 2.10
The error function.
ε = (1/n) Σi=1..n (yi − f(xi))² (2.3)
In this function n is the number of data points, yi is the true value of y from the
training data, and f(xi) gives the predicted value of y obtained by applying the
function learnt (Figure 2.10).
4 Learning: The learning step involves finding the set of parameters that optimizes
the error function, that is, the model and parameters with the smallest error.
Our main goal is to minimize the errors and make them as small as possible.
Decreasing the error between actual and estimated value improves the perfor-
mance of the model, and in addition the more data points we collect, the better
our model will become. Therefore, when we feed new data, the system will pre-
dict the output value.
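Steps 3 and 4 can be sketched together: fit y = ax + b by choosing the a and b that minimize the mean squared error of Equation (2.3). The closed-form least-squares solution used below is the standard one; the height–weight numbers are invented and lie exactly on a line so the fit is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

def mse(xs, ys, a, b):
    """Mean squared error of the fitted line on the data (Equation 2.3)."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Invented height (cm) -> weight (kg) data: weight = 0.7 * height - 55.
heights = [150, 160, 170, 180]
weights = [50, 57, 64, 71]

a, b = fit_line(heights, weights)
print(round(a, 3), round(b, 3))              # 0.7 -55.0
print(round(mse(heights, weights, a, b), 6)) # 0.0
```

With real, noisy data the minimum error would be positive rather than zero; the learning step is exactly this search for the parameters with the smallest error.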
FIGURE 2.11
Testing the system.
5 Testing and evaluation: The final step is the application of the learnt model for
evaluation to predict outputs y for hitherto unseen inputs x using learned func-
tion f(x) (Figure 2.11).
Here we have a first look at evaluation. In the succeeding chapters we will discuss other
methods of evaluation. One method of evaluation is the simple holdout method where
the available experience (data) is divided into training and testing data. While we design
the model using only the training data, we test it using the test data. When we want to
compare the prediction performance of a regression or classification problem using two
different learning methods, we need to compare the error results on the test data set. In
the context of the discussion in this chapter, the learning method with smaller testing error
would work better.
We also need to understand the degree to which the learner controls the sequence
of training examples. In some cases, the teacher selects appropriate informative
boards and gives the correct move. In other cases the learner requests the teacher
to provide correct moves for some board states which they find confusing. Finally,
the learner controls the board states as well as the training classifications.
It is also important that the training experience represents the distribution of
examples over which the final system performance P will be measured. In the case
of training the checkers program, if we consider only experiences played against
itself, then crucial board states that are likely to be played by the human checkers
champion will never be encountered. Most machine learning systems assume that
the distribution of training examples is identical to the distribution of test examples.
• Choose the target function: The choice of the target function is possibly one of the
most important aspects of the design. The model we choose must be as simple as
possible. This is in keeping with the principle of Occam’s razor, which states that
we need to prefer a simpler hypothesis that best fits the data. However, we need
to ensure that all the prior information is considered and integrated, and we learn
only the aspect that is the goal of the learning task.
• Choose a representation: The representation of the input data should integrate
all the relevant features only and not include what is not affecting the learning. If
too many features are chosen, then we will encounter the curse of dimensional-
ity, which we will discuss later. The other aspect is the representation of the tar-
get function and model to be used. There are different types of function such as
classification, regression, density estimation, novelty detection, and visualization,
which to a large extent depends on how we look at the machine learning task.
Then we need to decide on the model that is to be used, whether symbolic, such as
logical rules, decision trees, and so on, or sub-symbolic, such as neural networks
or a number of other models which we will discuss in detail in other chapters.
• Choose the parameter fitting algorithm: This is the step where we estimate the
parameters to optimize some error or objective on the given data. There are many
such parameter estimation techniques such as linear/quadratic optimization and
gradient descent greedy algorithm, which also will be discussed later.
• Evaluation: Evaluation needs to determine whether the system behaves well in
practice and whether the system works for data that is not used for training; that
is, whether generalization occurs. In other words, the system finds the underlying
regularity and does not work only for the given data.
This example will now be used to explain the design steps. We use handwriting charac-
ter recognition (Figure 2.12) as an illustrative example to explain the design issues and
approaches.
We explain learning to perform a task from experience. Therefore, let us understand the
meaning of the task. The task can often be expressed through a mathematical function. In
this case the input can be x, the output y, and w the parameters that are learned. In the
case of classification, the output y will be discrete, such as class membership, posterior probability, and so
on. For regression, y will be continuous. For the character recognition, the output y will be
as shown in Figure 2.13.
FIGURE 2.12
Set of handwritten characters.
FIGURE 2.13
Character recognition task.
The following are the steps in the design process for character recognition.
Step 0: Let us treat the learning system as a black box (Figure 2.14).
FIGURE 2.14
Learning system as a black box.
FIGURE 2.15
Collection of training data.
FIGURE 2.16
Representation of data.
FIGURE 2.17
10-dimensional binary vector.
FIGURE 2.18
Choosing function F.
FIGURE 2.19
Adjusting weights.
FIGURE 2.20
Testing.
2.4.2.2 Checkers Learning
A checkers learning problem can be defined as: task T, playing checkers; performance
measure P, the percentage of games won; and training experience E, games played
against itself.
Let us assume that we can determine all legal moves. Then the system needs to learn the
best move from among legal moves; this is defined as a large search space known a priori.
Then we can define the target function as (Equation 2.4):

ChooseMove: B → M (2.4)

However, “choose move” is difficult to learn when we are given only indirect training.
An alternative target function can be defined. Let us assume that we have an evaluation
function that assigns a numerical score to any given board state (Equation 2.5):

V: B → ℝ (2.5)
Let V(b) be an arbitrary board state b in B. Let us give values for V(b) for games that are
won, lost, or drawn. If b is a final board state that is won, then V(b) = 100; if b is a final board
state that is lost, then V(b) = −100; if b is a final board state that is drawn, then V(b) = 0.
However, if b is not a final state, then V(b) = V(b′), where b′ is the best final board state that
can be achieved starting from b and playing optimally until the end of the game. V(b) gives
a recursive definition for board state b but is not usable because the function cannot be
determined efficiently except for the first three trivial cases discussed previously. Therefore,
the goal of learning is to discover an operational description of V, in other words learning
the target function as a function approximation problem.
Now we need to choose the representation of the target function. The choice of representa-
tions involves trade-offs, where the choice can be of a very expressive representation to allow
close approximation to the ideal target function V; however, the more expressive the repre-
sentation, the more training data is required to choose among alternative hypotheses. One
option is the use of a linear combination of the following six board features: x1 is the number
of black pieces on the board, x2 is the number of red pieces on the board, x3 is the number of
black kings on the board, x4 is the number of red kings on the board, x5 is the number of
black pieces threatened by red, and x6 is the number of red pieces threatened by black. The
target function representation for the target function V: Board → ℜ is given in Equation (2.6):
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6 (2.6)
Here the weights are to be determined to obtain the target function. Choose the weights
wi to best fit the set of training examples in such a way that error E between the training
values and the values predicted by the hypothesis is minimized. One algorithm that can be
used is the least mean squares (LMS) algorithm where the weights are refined as new train-
ing examples become available and will be robust to errors in these estimated training
values (Equation 2.7):
E ≡ Σ⟨b, Vtrain(b)⟩ ∈ training examples (Vtrain(b) − V̂(b))² (2.7)
As can be seen, the concept used is similar to the simple example we have discussed in
the previous section.
In order to learn the function, we require a set of training examples describing the board
b and the training value Vtrain(b); that is, we need the ordered pair
<b, Vtrain(b)>. For example, a winning state may be represented by the ordered pair given
in Equation (2.8):

⟨⟨x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0⟩, +100⟩ (2.8)
In the preceding example we need to estimate training values; that is, there is a need to
assign specific scores to intermediate board states. Here we approximate the training
value of an intermediate board state b using the estimated value of the next board state
following b, given as (Equation 2.9):

Vtrain(b) ← V̂(Successor(b)) (2.9)
The simple approach given here can be used in any game playing or problem solving
scenario. The final design of this game playing scenario is given in Figure 2.21.
FIGURE 2.21
Final design flow of playing checkers.
Therefore, in the example of playing checkers, the following choices were made during
the design. While determining type of training experience, what was considered was
games played against self rather than games against experts or a table of correct moves.
While choosing the target function, we chose Board → (Real value) rather than Board →
Move. Again, for the representation of the learnt function, we chose a linear function of six
features, and finally we chose gradient descent as the learning algorithm.
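The LMS idea named above can be sketched as follows. The update rule wi ← wi + η(Vtrain(b) − V̂(b))·xi and the learning rate η = 0.1 are a common form of LMS, supplied here as an assumption, since the text names LMS but does not spell out the update rule; the board features are invented.

```python
def v_hat(weights, features):
    """Evaluate V-hat(b) = w0 + w1*x1 + ... + w6*x6 (Equation 2.6)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train, eta=0.1):
    """One LMS step: nudge each weight to reduce (Vtrain - V_hat)^2.
    The rule and eta are common choices, not given explicitly in the text."""
    error = v_train - v_hat(weights, features)
    new_weights = [weights[0] + eta * error]  # w0 uses a constant feature of 1
    new_weights += [w + eta * error * x for w, x in zip(weights[1:], features)]
    return new_weights

# Invented winning board: 3 black pieces, 1 black king, nothing else.
features = [3, 0, 1, 0, 0, 0]
weights = [0.0] * 7

# Repeatedly present the training example; weights converge so that
# V_hat of this board approaches its training value of +100.
for _ in range(100):
    weights = lms_update(weights, features, v_train=100)

print(round(v_hat(weights, features), 2))  # close to 100
```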
2.5 Summary
In this chapter the following topics were discussed:
2.6 Points to Ponder
• Machine learning is not needed to find the speed of a vehicle when a physical
equation is available, yet the same problem can be modelled with machine
learning to predict a vehicle’s speed. Why?
• Many machine learning applications can be modelled using more than one type of
machine learning task.
E.2 Exercises
E.2.1 Suggested Activities
E.2.1.1 Can you think of experience as described in this chapter for predicting credit
risk?
E.2.1.2 Can you list some other terms commonly associated with machine learning?
E.2.1.3 By searching the web, find at least five common terms encountered but not
already mentioned in the chapter and give a suitable explanation of each term.
E.2.1.5 Take any three applications, and formulate each application using any three
machine learning tasks, with examples.
S.No   Application   Machine Learning Task Type—Examples
1.                   1.
                     2.
                     3.
2.                   1.
                     2.
                     3.
3.                   1.
                     2.
                     3.
E.2.1.6 Develop a simple design (as shown in illustrative examples) for the following
problems:
a. Weather prediction
b. Medical diagnosis
c. Sentiment classification
Self-Assessment Questions
E.2.2 Multiple Choice Questions
No Match
E.2.3.5 Predictive modelling E Finding if one entity is like another in one or more specified
qualities or characteristics
E.2.3.6 Anomaly detection F Grouping items into homogeneous groups by finding similarities
between them
E.2.3.7 Dimensionality reduction G Reducing number of input variables in a data set
E.2.3.8 Matching H Analysing a set of objects or events to discover any of them as
being unusual or atypical
E.2.3.9 Problem solving I Automatic assignment of labels or tokens for identification or
analysis
E.2.3.10 Clustering J Ordering of entities of a class from highest to lowest
E.2.4 Short Questions
E.2.4.1 What are the three important components of machine learning?
E.2.4.2 What is meant by the hypothesis set?
E.2.4.3 What is inference in the context of machine learning?
E.2.4.4 Why is loss function important in machine learning?
E.2.4.5 What is the difference between classification and regression?
E.2.4.6 What is the difference between classification and clustering?
E.2.4.7 Describe association rule mining as a machine learning task.
E.2.4.8 Give the steps in the design of a machine learning task assuming the input has
three features.
E.2.4.9 Why is feature selection important in machine learning?
E.2.4.10 Assume you have a student database. Give two machine learning tasks you
want to learn, outlining different features needed for each task.
References
Google. (2023). Machine learning glossary. https://developers.google.com/machine-learning/glossary
Huyen, C. (2022). Design a machine learning system. https://huyenchip.com/machine-learning-systems-design/design-a-machine-learning-system.html
Kumar, A. (2022, December 4). Most common machine learning tasks. Data Analytics. https://vitalflux.com/7-common-machine-learning-tasks-related-methods/
Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2012). Foundations of machine learning. MIT Press.
Nkengsa, B. (2018, October 26). Applying machine learning to recognize handwritten characters. Medium. https://medium.com/the-andela-way/applying-machine-learning-to-recognize-handwritten-characters-babcd4b8d705
Pykes, K. (2021, September 5). How to design a machine learning system. Medium. https://medium.com/geekculture/how-to-design-a-machine-learning-system-89d806ff3d3b
3
Mathematical Foundations and Machine Learning
3.1 Introduction
In Table 3.1 we list the context in which mathematical concepts are used.
The applications mentioned in the table will be discussed in the succeeding chapters.
TABLE 3.1
Mathematical Concepts Used in Machine Learning
Linear Algebra for Machine Learning
• Data representation and data preprocessing—attributes of input, image, audio, text
• Operations on or between vectors and matrices for finding patterns in data
• Linear equations to model the data
• Linear transformation to map input and output
• Cosine similarity between vectors
• Eigenvectors, principal components, and dimensionality reduction
Probability Theory for Machine Learning
• Classification models generally predict a probability of class membership.
• Many iterative machine learning techniques like maximum likelihood estimation (MLE) are based on
probability theory. MLE is used for training in models like linear regression, logistic regression, and
artificial neural networks.
• Many machine learning frameworks based on the Bayes theorem are called in general Bayesian learning
(e.g., naive Bayes, Bayesian networks).
• Learning algorithms make decisions using probability (e.g., information gain).
• Algorithms are trained under probability frameworks (e.g., maximum likelihood).
• Models are fitted using probabilistic loss functions (e.g., log loss and cross entropy).
• Model hyper-parameters are configured with probability (e.g., Bayesian optimization).
• Probabilistic measures are used to evaluate model skill (e.g., brier score, receiver operator characteristic
(ROC)).
Information Theory for Machine Learning
• Entropy is the basis for information gain and mutual information used in decision tree learning—ID3
algorithm.
• Cross-entropy is used as a loss function that measures the performance of a classification model.
• Kullback–Leibler (KL) divergence measures are used in deep learning.
• KL divergence is also used in dimensionality reduction technique—t-distributed stochastic neighbour
embedding (tSNE). Entropy can be used to calculate target class imbalances.
DOI: 10.1201/9781003290100-3
In vector notation this equation is represented as aᵀx = b and is called the linear transformation of x. Therefore, linear algebra is essential to understanding machine learning algorithms, where input vectors (x1, x2, …, xn) are often converted into outputs by a series of linear transformations. Linear regression is often used in machine learning to model numerical outputs as simple regression problems. Linear equations often represent the
FIGURE 3.1
Example data showing vector and matrix representation.
FIGURE 3.2
10-dimensional binary vector.
linear regression, which in linear algebra notation is written y = A · b, where y is the output variable, A is the data set, and b holds the model coefficients.
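As a sketch of this notation, the coefficient vector b can be estimated from data in the least-squares sense. The data matrix A and outputs y below are hypothetical values invented for illustration:

```python
import numpy as np

# Hypothetical data set: a bias column of ones plus one feature column.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
# Outputs generated by y = 1 + 2x, so the recovered coefficients should be (1, 2).
y = np.array([3.0, 5.0, 7.0, 9.0])

# Solve y = A . b for b in the least-squares sense.
b, *_ = np.linalg.lstsq(A, y, rcond=None)
```

Here the recovered b is (approximately) the intercept and slope that generated the data.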
Modelling data with many features is challenging, and models built from data that
include irrelevant features are often less effective than models trained from the more rele-
vant data. Methods for automatically reducing the number of columns of a data set are
called dimensionality reduction, and these methods are used in machine learning to create
projections of high-dimensional data for both visualizations and training models.
Linear algebra accomplishes tasks such as graphical transformations for applications
such as face morphing, object detection and tracking, audio and image compression, edge
detection, signal processing, and many others.
Therefore, concepts such as vectors and matrices, products, norms, vector spaces, and
linear transformations are used in machine learning.
FIGURE 3.3
Matrix operations with scalar.
FIGURE 3.4
Transpose of a matrix.
FIGURE 3.5
Product of a matrix.
FIGURE 3.6
Identity matrix and symmetric matrix.
A square matrix whose determinant is zero is singular. The identity matrix and symmetric matrices are important special cases of a matrix (Figure 3.6).
The inverse of a matrix is defined only for a square matrix. To calculate the inverse, one
has to find the determinant and adjoint of that given matrix. The adjoint is given by the
transpose of the cofactor of the particular matrix. The equation (Equation 3.2) to find out
the inverse of a matrix is given as,
FIGURE 3.7
Vector in 2D Euclidean space.
A⁻¹ = adj(A) / |A|;  |A| ≠ 0    (3.2)
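For a 2 × 2 matrix this computation can be sketched directly; the matrix below is an arbitrary nonsingular example, and the result is checked against NumPy's built-in inverse:

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]   # determinant; must be nonzero
adj = np.array([[ A[1, 1], -A[0, 1]],
                [-A[1, 0],  A[0, 0]]])        # adjoint: transpose of the cofactor matrix
A_inv = adj / det                             # Equation 3.2: inverse = adj(A) / |A|
```

Multiplying A by A_inv recovers the identity matrix, confirming the construction.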
Example 3.1
In general, the vector is defined as an n-tuple of values (usually real numbers) where n
is the dimension of the vector. The dimension n can be any positive integer from 1 to infinity.
The vector can be written in column form or row form; however, it is conventional to use
the column form. In the column form the vector elements are referenced by using
subscripts.
v = (x1, x2)ᵀ
We can think of a vector as a point in space or as a directed line segment with a magnitude
and direction as shown in Figure 3.8.
FIGURE 3.8
Vector as directed line segment in 2D and 3D space.
FIGURE 3.9
Transpose of a row vector.
z = x + y = (x1 + y1, …, xn + yn)ᵀ    (3.3)
Figure 3.10 shows vector representation and vector addition graphically.
The next operation is the scalar multiplication of a vector where each element of the vec-
tor is multiplied by a scalar (Equation 3.4).
y = ax = (ax1, …, axn)ᵀ    (3.4)
To obtain the dot product (scalar product or inner product) of two vectors, we multiply
the corresponding elements and add products, resulting in a scalar (Equation 3.5).
FIGURE 3.10
Graphical representation vector operations.
FIGURE 3.11
Graphical representation of dot product of two vectors.
The alternative form of the dot product is shown graphically in Figure 3.11. The dot prod-
uct and magnitude are defined on vectors only.
a = x · y = Σ_{i=1}^{n} x_i y_i    (3.5)

‖x‖ = √⟨x, x⟩ = √(Σ_{i=1}^{n} x_i²)    (3.6)
In Example 3.1 given in Figure 3.8, the Euclidean distance of a point x is given by
Equation (3.7):
‖x‖ = √(Σ_{i=0}^{d−1} x_i²) = √(1² + 3²) = √10    (3.7)
The Euclidean distance between vectors x and y is illustrated in Figure 3.12. For Example 3.1, the length of y − x = (2, 3) is √(2² + 3²) = √13.
FIGURE 3.12
Euclidean distance between vectors.
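The vector operations above can be reproduced numerically. The point x = (1, 3) is taken from Example 3.1; the second point y = (3, 6) is an assumed value chosen so that y − x = (2, 3):

```python
import numpy as np

x = np.array([1.0, 3.0])     # the point from Example 3.1
y = np.array([3.0, 6.0])     # assumed second point, so that y - x = (2, 3)

z = x + y                    # vector addition (Equation 3.3)
s = 2 * x                    # scalar multiplication (Equation 3.4)
dot = np.dot(x, y)           # dot product (Equation 3.5): 1*3 + 3*6 = 21
norm_x = np.linalg.norm(x)   # Euclidean length (Equations 3.6 and 3.7): sqrt(10)
dist = np.linalg.norm(y - x) # Euclidean distance: sqrt(2^2 + 3^2) = sqrt(13)
```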
FIGURE 3.13
Angle between two vectors.
Vectors x and y are orthonormal if they are orthogonal and ||x|| = ||y|| = 1.
The angle θ between vectors x and y is given by defining the cosine of the angle θ
(Equation 3.8) between them (Figure 3.13). This cosine of the angle is used extensively in
machine learning to find similarity between vectors. We will discuss this in detail in subse-
quent chapters.
cos θ = xᵀy / (‖x‖ ‖y‖)    (3.8)

‖u_i‖ = 1  ∀i    (3.9)

u_i ⊥ u_j  ∀ i ≠ j    (3.10)
3.2.2.3.1 Vector Projection
Orthogonal projection of y onto x can take place in any space of dimensionality ≥ 2. The unit vector in the direction of x is x/‖x‖, and the length of the projection of y in the direction of x is ‖y‖ cos(θ) (Figure 3.14). The orthogonal projection of y onto x is therefore the vector proj_x(y) = (‖y‖ cos θ) · x/‖x‖ or, using the dot product alternative form, (x · y / ‖x‖²) · x.
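A short sketch of both quantities, using two arbitrary example vectors; the projection is computed with the dot-product form, and the residual is checked to be orthogonal to x:

```python
import numpy as np

x = np.array([3.0, 0.0])
y = np.array([2.0, 2.0])

# Cosine of the angle between x and y (Equation 3.8).
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Orthogonal projection of y onto x, using the dot product form.
proj = (x @ y / (x @ x)) * x
```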
FIGURE 3.14
Orthogonal projection of y on x.
3.3 Probability Theory
As already discussed, data and analysis of data play a very important role in machine
learning, and here we discuss approaches based on probability and the Bayes theorem.
Probability theory is a mathematical framework for representing uncertain statements
and provides a means of quantifying uncertainty and axioms for deriving new uncertain
statements.
3.3.2 Basics of Probability
When we talk about probability, there are basically two aspects that we consider. One is the
classical interpretation, where we describe the frequency of outcomes in random experi-
ments. The other is the Bayesian viewpoint or subjective interpretation of probability,
where we describe the degree of belief about a particular event.
Before we go further, let us first discuss random variables. A random variable takes values subject to chance. In probability, the set of outcomes from an experiment is
known as an event. In other words, an event in probability is the subset of the respective
sample space. The entire possible set of outcomes of a random experiment is the sample
space or the individual space of that experiment. The likelihood of occurrence of an event
is known as probability. The probability of occurrence of any event lies between 0 and 1.
Probability quantifies the likelihood of an event. Specifically, it quantifies how likely a specific outcome is for a random variable, such as the flip of a coin, the roll of a die, or the draw of a playing card from a deck. For example, A is the result of an uncertain event such
as a coin toss, with nonnumeric values Heads and Tails. It can be denoted as a random
variable A, which has values 1 and 0, and each value of A is associated with a probability.
In informal terms, probability gives a measure of how likely it is for something to happen.
There are three approaches to assessing the probability of an uncertain event.
FIGURE 3.15
Axioms of probability.
Find the probability of selecting a male taking statistics from the population described in the following
table:
Simple events are events that consist of a single point in the sample space. For
example, if the sample space S = {Monday, Tuesday, Wednesday, Thursday, Friday,
Saturday, Sunday} and event E = {Wednesday}, then E is a simple event.
Joint events involve two or more characteristics simultaneously, such as drawing an
ace that is also red from a deck of cards. In other words, the simultaneous occur-
rence of two events is called a joint event. The probability of a joint event P(A and
B) is called a joint probability. Considering the same example if the sample space
S = {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday} and event
E = {Wednesday, Friday}, then E is a joint event.
Independent and dependent events: If the occurrence of any event is completely
unaffected by the occurrence of any other event, such events are known as inde-
pendent events in probability and the events which are affected by other events
are known as dependent events.
Mutually exclusive events: If the occurrence of one event excludes the occurrence
of another event, such events are mutually exclusive events, that is, two events
that don’t have any common point. For example, if sample space S = {Monday,
Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday} and E1, E2 are two
events such that E1 = {Monday, Tuesday} and E2 = {Saturday, Sunday}, then E1 and
E2 are mutually exclusive.
Exhaustive events: A set of events is called exhaustive if all the events together con-
sume the entire sample space.
Complementary events: For any event E1 there exists another event E1′ which rep-
resents the remaining elements of the sample space S. E1 = S − E1′. In the sample
space S = {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday}
if E1 = {Monday, Tuesday, Wednesday}, then E1′ = {Thursday, Friday, Saturday,
Sunday}.
Collectively exhaustive event: A set of events is said to be collectively exhaustive if it is mandatory that at least one of the events must occur.
FIGURE 3.16
(a) Contingency table, (b) tree diagram.
Suppose that out of a total of 500 customers, 200 are new customers; then the probability of new customers is given as (Equation 3.11):

P(new customer) = 200/500 = 0.4    (3.11)
The contingency table for joint probability is given in Figure 3.17. In this figure we have
taken the example of two sets of events A1, A2 and B1, B2. When events A and B occur together,
we have joint probability, but the total is the marginal probability of events A and B.
3.3.6 Marginalization
Let us now explain the concept of marginalization. Marginal probability is defined as the
probability of an event irrespective of the outcome of another variable. Consider the prob-
ability of X irrespective of Y (Equation 3.14):
P(X = x_j) = c_j / N    (3.14)
FIGURE 3.17
Contingency table for joint probability.
The number of instances in column j is the sum of instances in each cell of that column
and is as follows (Equation 3.15):
c_j = Σ_{i=1}^{L} n_{ij}    (3.15)

p(X = x_j) = Σ_{i=1}^{L} p(X = x_j, Y = y_i)    (3.16)
Sum rule:  p(x) = Σ_y p(x, y)    (3.17)
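The sum rule can be sketched on a small joint probability table; the numbers below are hypothetical:

```python
import numpy as np

# Hypothetical joint distribution P(X = x_j, Y = y_i): rows index Y, columns index X.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.05, 0.25, 0.30]])

p_x = joint.sum(axis=0)  # marginal of X (sum rule, Equation 3.17)
p_y = joint.sum(axis=1)  # marginal of Y
```

Each marginal sums the joint probabilities over the other variable, and both marginals sum to 1.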
P(A|B) = P(A and B) / P(B)  (the conditional probability of A given that B has occurred)    (3.19)

The conditional probability can also be with respect to B given A and is as follows (Equation 3.20):

P(B|A) = P(A and B) / P(A)  (the conditional probability of B given that A has occurred)    (3.20)
In these definitions, P(A and B) is the joint probability of A and B.
3.3.7.1 Bayes Theorem
Bayes theorem is used to revise previously calculated probabilities based on new informa-
tion. This theorem was developed by Thomas Bayes in the eighteenth century. This theo-
rem has become one of the most important underlying concepts of many machine learning
approaches. Let us first discuss the Bayes theorem for two variables A and B. It is an exten-
sion of conditional probability and is given by (Equation 3.21):
P(B|A) = P(A|B) · P(B) / [P(A|B) · P(B) + P(A|~B) · P(~B)]    (3.21)
Now we can extend this to the general case as follows (Equation 3.22):
P(B_i|A) = P(A|B_i) · P(B_i) / [P(A|B_1) · P(B_1) + P(A|B_2) · P(B_2) + … + P(A|B_k) · P(B_k)]    (3.22)
Where
Bi = ith event of k mutually exclusive and collectively exhaustive events
A = new event that might impact P(Bi).
P(S|D) = P(D|S) P(S) / [P(D|S) P(S) + P(D|U) P(U)]
       = (0.6)(0.4) / [(0.6)(0.4) + (0.2)(0.6)]
       = 0.24 / (0.24 + 0.12) = 0.667
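The computation above can be checked in a few lines (here S and U denote a successful and an unsuccessful well, and D a positive detailed-test result, as in the example):

```python
p_s, p_u = 0.4, 0.6        # prior probabilities of a successful / unsuccessful well
p_d_s, p_d_u = 0.6, 0.2    # probability of a positive detailed test in each case

# Bayes theorem (Equation 3.21).
p_s_d = (p_d_s * p_s) / (p_d_s * p_s + p_d_u * p_u)
```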
Given the detailed test, the revised probability of a successful well has risen to .667 from
the original estimate of .4. The given probabilities can be represented using a contingency
table (Figure 3.18):
(a) What is the probability that no fraudulent activity happened given that it was detected by the system?
(b) What is the probability that fraudulent activity actually happened given that a fraudulent activity was detected?
(c) What is the probability that fraudulent activity happened given that no fraudulent activity was detected?
(d) What is the probability that no fraudulent activity happened given that no fraudulent activity was detected?
First let us list all the probabilities available. Let the happening of the fraudulent activity
be denoted as event F and the complement of this event, that is, no fraudulent activity hap-
pening, as F^. Let the event of fraudulent activity being detected be denoted by D and it
not being detected as D^. Here we present probability as a percentage.
The probabilities of fraudulent activity happening and not happening are as follows:
FIGURE 3.18
Contingency table for Example 3.7.
Similarly, the probability that the system detects the fraudulent activity happening and
that the system does not detect the fraudulent activity happening are as follows:
Now, the conditional probability that a fraudulent activity is detected given that it has happened is P(D|F) = 98%, and therefore P(D^|F) = 2%. We take the false alarm rate to be P(D|F^) = 2%, so that P(D^|F^) = 98%.
1. With these probabilities we can use the Bayes theorem to calculate the probability
that no fraudulent activity happened even if the system detected that it had hap-
pened (Equation 3.22):
P(F^|D) = P(D|F^) P(F^) / [P(D|F^) P(F^) + P(D|F) P(F)]
        = (2% × 92%) / (2% × 92% + 98% × 8%)
        = 0.1901
2. Now we use the Bayes theorem to calculate the probability that a fraudulent activ-
ity actually happened when the system detects it has happened (Equation 3.23):
P(F|D) = P(D|F) P(F) / [P(D|F) P(F) + P(D|F^) P(F^)]
       = (98% × 8%) / (98% × 8% + 2% × 92%)
       = 0.8099
P(F|D^) = P(D^|F) P(F) / [P(D^|F) P(F) + P(D^|F^) P(F^)]
        = (2% × 8%) / (2% × 8% + 98% × 92%)
        = 0.0018
4. Now we again use the Bayes theorem to calculate the probability that a fraud-
ulent activity did not happen and it was correctly not detected by the system
(Equation 3.25).
P(F^|D^) = P(D^|F^) P(F^) / [P(D^|F^) P(F^) + P(D^|F) P(F)]
         = (98% × 92%) / (98% × 92% + 2% × 8%)
         = 0.9982
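All four posterior probabilities of this example can be reproduced with the Bayes theorem, using P(F) = 8%, P(D|F) = 98%, and a false alarm rate P(D|F^) = 2%, consistent with the calculations above:

```python
p_f = 0.08        # P(F): fraudulent activity happens
p_fc = 0.92       # P(F^): no fraudulent activity
p_d_f = 0.98      # P(D|F): fraud is detected when it happens
p_d_fc = 0.02     # P(D|F^): false alarm rate

def bayes(likelihood_a, prior_a, likelihood_b, prior_b):
    """Posterior of the first hypothesis (Equation 3.22 with two hypotheses)."""
    return likelihood_a * prior_a / (likelihood_a * prior_a + likelihood_b * prior_b)

p_fc_d = bayes(p_d_fc, p_fc, p_d_f, p_f)           # (1) false alarm
p_f_d = bayes(p_d_f, p_f, p_d_fc, p_fc)            # (2) true detection
p_f_dc = bayes(1 - p_d_f, p_f, 1 - p_d_fc, p_fc)   # (3) missed fraud
p_fc_dc = bayes(1 - p_d_fc, p_fc, 1 - p_d_f, p_f)  # (4) correct all-clear
```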
FIGURE 3.19
Normal or Gaussian distribution.
Now the question is, why do we use a Gaussian distribution? This is due to the fact that
it has convenient analytic properties; it is governed by the central limit theorem and works
reasonably well for most real data. Though it does not suit all types of data, it acts as a good
building block. The density at a data point x is given by the function (Equation 3.23):

f(x) = (1 / (√(2π) σ)) e^(−(x − µ)² / (2σ²))    (3.23)
• A random variable that has a normal distribution with a mean of zero and a stan-
dard deviation of one is said to have a standard normal probability distribution.
• The letter z is commonly used to designate this normal random variable.
• The following expression (Equation 3.24) converts any normal distribution into
the standard normal distribution:
z = (x − µ) / σ    (3.24)
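Both formulas can be written as small helper functions; the values passed in below are arbitrary illustrations:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x (Equation 3.23)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def z_score(x, mu, sigma):
    """Convert x into the standard normal variable z (Equation 3.24)."""
    return (x - mu) / sigma
```

For example, z_score(12, 10, 2) gives 1.0, and the standard normal density is symmetric about its mean.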
In this section we discussed the basics of probability theory, which is an important math-
ematical basis for many machine learning techniques.
Problem: What is the probability that the patient has COVID-19 given that the test comes back
positive?
Given values:
A patient can have COVID-19, but the test may not detect it. This capability of the test to
detect COVID-19 is referred to as the sensitivity, or the true positive rate. In this case, we
assume a sensitivity value of 85%. In other words
Now, we can assume that the probability of having COVID-19 is low because the infection rate is low: P(COVID-19 = True) = 0.0001; that is, 0.01% of people have COVID-19.
We also assume a value for the probability of a negative test result (Test = −ve) given that a person has no COVID-19 (COVID-19 = False). This is called the true negative rate, or the specificity.
P(A|B) = P(B|A) · P(A) / P(B)

P(B) = P(B|A) · P(A) + P(B|~A) · P(~A)
Now we can plug this false alarm rate into our calculation of P(Test = +ve)
In other words, irrespective of the person suffering from COVID-19 or not, the probabil-
ity of the test returning a positive result is about 2%.
Calculation After Plugging into Bayes Theorem
With information available, we can estimate the probability of a randomly selected per-
son having COVID-19 if they get a positive test result using the Bayes theorem.
This shows that even if a patient tests positive with this test, the probability that they
have COVID-19 is only 0.44%. In a diagnostic setup, considerable information such as sen-
sitivity, infection rate, and specificity is needed for the determination of conditional
probability.
3.4 Information Theory
Information theory is about the analysis of a communication system and is concerned with
data compression and transmission, and it builds upon probability and supports machine
learning. Information provides a way to quantify the amount of surprise for an event mea-
sured in bits. Quantifying information essentially means measuring how much surprise
there is in an event. Those events that are rare have low probability and are more surpris-
ing, and therefore have more information than those events that are common and have
high probability. Information theory quantifies how much information there is in a mes-
sage, and in general, it can be used to quantify the information in an event and a random
variable. This is called entropy and is calculated using probability.
Calculating information and entropy is used in machine learning and forms the basis for
techniques such as feature selection, building decision trees, and in general for fitting clas-
sification models.
Claude Shannon's mathematical theory of communication, published in the Bell System Technical Journal in 1948 (Shannon 1948), marked the beginning of the area of information theory. Shannon's theory remains the guiding foundation for modern, fast, energy-efficient, and robust communication systems.
3.4.1.1 Information Source
The characteristics of any information source can be specified in terms of the number of symbols n, say S1, S2, …, Sn; the probability of occurrence of each of these symbols, P(S1), P(S2), …, P(Sn); and the correlation between successive symbols in a stream (called a message).
3.4.1.2 Stochastic Sources
Assume that a source outputs symbols X1, X2, …, which take their values from an alphabet A = (a1, a2, …). The source model P(X1, …, XN) assigns a probability to every possible sequence of N symbols.
For such stochastic sources, there are two special cases:
• The memoryless source where the value of each symbol is independent of the
value of the previous symbols in the sequence.
P(S1, S2, …, Sn) = P(S1) . P(S2) . … .P(Sn)
• The Markov source where the value of each symbol depends only on the value of
the previous one in the sequence.
P(S1, S2, …, Sn) = P(S1) . P(S2|S1) . P(S3|S2) . … . P(Sn|Sn−1)
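The two factorizations above can be sketched with a toy two-symbol source; the symbol and transition probabilities below are invented for illustration:

```python
# Hypothetical symbol probabilities for a memoryless source.
p = {"a": 0.7, "b": 0.3}

# Hypothetical transition probabilities P(current | previous) for a Markov source.
trans = {("a", "a"): 0.9, ("a", "b"): 0.1,
         ("b", "a"): 0.4, ("b", "b"): 0.6}

def prob_memoryless(seq):
    """P(S1, ..., Sn) = P(S1) P(S2) ... P(Sn)."""
    prob = 1.0
    for sym in seq:
        prob *= p[sym]
    return prob

def prob_markov(seq):
    """P(S1, ..., Sn) = P(S1) P(S2|S1) ... P(Sn|Sn-1)."""
    prob = p[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= trans[(prev, cur)]
    return prob
```

The same sequence is generally assigned different probabilities by the two models, since the Markov source conditions each symbol on its predecessor.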
3.4.2 Self Information
Before we discuss entropy associated with information, let us understand how to measure
the amount of information contained in a message from the sender to the receiver. We can
calculate the amount of information there is in an event using the probability of the event.
This is called “Shannon information,” “self-information,” or simply the information.
Shannon defined self-information to measure the amount of information conveyed by a symbol ai of a memoryless source with alphabet A = (a1, …, an) and symbol probabilities (p1, …, pn), given that the next symbol is ai. The negative of the logarithm is taken so that the information conveyed increases as the probability decreases (Equation 3.25).

I(a_i) = log(1/p_i) = −log(p_i)    (3.25)
where log() is the base-2 logarithm and p_i is the probability of the symbol a_i.
The choice of the base-2 logarithm means that the unit of the information measure is in
bits (binary digits). This can be directly interpreted in the information processing sense as
the number of bits required to represent the event.
Let us find the amount of information obtained from the result of an examination where the student gets a pass. If the two outcomes are equally likely, i.e., P(pass) = P(fail) = 0.5, the amount of information obtained is 1 bit. However, if we already know that the result is a pass, i.e., P(pass) = 1, then the amount of information obtained is zero. If 0.5 < P(pass) < 1, the amount of information lies between 0 and 1 bit.

p_i = 0.5:  I(0.5) = log2(1/0.5) = 1 bit
p_i = 1:    I(1) = log2(1/1) = 0
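Self-information is a one-line function, and the examination example above can be verified directly:

```python
import math

def self_information(p):
    """Shannon self-information in bits (Equation 3.25): I(p) = -log2(p)."""
    return -math.log2(p)
```

A certain event carries no information (self_information(1.0) is 0), a fair outcome carries 1 bit, and the information of independent events adds, as in Equation 3.26.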
Assume two independent events A and B, with probabilities P(A) = p_A and P(B) = p_B. When both events happen, the probability is p_A · p_B. The amount of information is as follows (Equation 3.26):

I(p_A · p_B) = I(p_A) + I(p_B)    (3.26)
3.4.3 Entropy
Entropy provides a measure of the average amount of information needed to represent an
event drawn from a probability distribution of a random variable. Information about the
symbols in a sequence is given by averaging over the probability of all the symbols, given as
(Equation 3.27):
H = Σ_i p_i I(a_i) = Σ_i p_i log(1/p_i)    (3.27)
H(X) is the degree of uncertainty about the succeeding symbol and hence is called the
first-order entropy of the source.
Source entropy is defined as the minimum average number of binary digits needed to
specify a source output uniquely. Entropy is the average length of the message needed to
transmit an outcome using the optimal code.
For a memoryless source,
p = P(X_k = 1),  q = P(X_k = 0) = 1 − p

H = p log(1/p) + (1 − p) log(1/(1 − p))    (3.28)
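Entropy as the probability-weighted average of self-information can be sketched as follows; at p = 0.5 the binary entropy of Equation 3.28 reaches its maximum of 1 bit:

```python
import math

def entropy(probs):
    """First-order entropy in bits (Equation 3.27); zero-probability symbols contribute nothing."""
    return sum(pi * math.log2(1 / pi) for pi in probs if pi > 0)
```

A deterministic source has zero entropy, and a uniform source over 2^k symbols has entropy k bits.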
Assuming that the block length approaches infinity, we can divide by n to get the number of bits per symbol. Thus the entropy for a source with memory is defined as follows (Equation 3.30):
H_∞ = lim_{n→∞} (1/n) H(X1, …, Xn)    (3.30)
H(S_k) = Σ_l P_kl log(1/P_kl)    (3.31)
In this case Pkl is the transition probability from state k to state l in a sequence. Averaging
over all states, we get the entropy for the Markov source as follows (Equation 3.32):
H_M = Σ_{k=1}^{r} P(S_k) H(S_k)    (3.32)
Entropy can be used as a measure to determine the quality of our models and as a measure of the difference between two probability distributions. We will discuss these aspects later.
3.4.5 Cross Entropy
Cross-entropy is a measure from the field of information theory, building upon entropy
and generally calculating the difference between two probability distributions. Cross-
entropy can be used as a loss function when optimizing classification models like logistic
regression and artificial neural networks. Cross-entropy is a concept closely related to relative entropy. Relative entropy measures how the approximating distribution q diverges from the true distribution p at each sample point, whereas cross-entropy directly compares the true distribution p with the approximated distribution q. Cross-entropy is heavily used in the field of deep learning. It
is used as a loss function that measures the performance of a classification model whose
output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted
probability diverges from the actual label.
in Q (e.g., the terms in the fraction are flipped). This is the more common implementation used in practice (Equation 3.34):

KL(P ‖ Q) = Σ_x P(x) log(P(x) / Q(x))    (3.34)
The KL divergence between p and q can also be seen as the average number of bits that
are wasted by encoding events from a distribution p with a code based on a not-quite-right
distribution q. The intuition for the KL divergence score is that when the probability for an
event from P is large, but the probability for the same event in Q is small, there is a large
divergence. It can be used to measure the divergence between discrete and continuous
probability distributions, where in the latter case the integral of the events is calculated
instead of the sum of the probabilities of the discrete events.
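Both measures can be computed directly for two discrete distributions; p and q below are hypothetical, and the identity KL(P‖Q) = H(P, Q) − H(P) is checked:

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum over x of P(x) log2 Q(x), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) = sum over x of P(x) log2(P(x) / Q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]      # true distribution (hypothetical)
q = [0.25, 0.75]    # approximating distribution (hypothetical)

h_p = -sum(pi * math.log2(pi) for pi in p)  # entropy of p: 1 bit
```

The divergence is zero when the two distributions coincide and positive otherwise, reflecting the extra bits wasted by coding with the wrong distribution.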
In this section we discussed information theory, which is used in decision trees and in machine learning evaluation in general.
3.5 Summary
• Discussed the basics of linear algebra and the need for linear algebra in machine
learning.
• Outlined the definitions of basics of probability for machine learning and its role
in machine learning.
• Explained the Bayes theorem and the concepts of Bayesian learning, and described the normal probability distribution.
• Discussed the principles and practice of information theory.
• Outlined the role of the preceding mathematical concepts in machine learning.
3.6 Points to Ponder
• Most data is represented with vectors or matrices.
• Probability plays a major role in machine learning.
• Entropy for information gained in your professional degree is an interesting
concept.
E.3 Exercises
E.3.1 Suggested Activities
Case study
E.3.1.1 Take an example of university results and formulate it as a Bayes method.
Clearly state any assumptions made.
Self-Assessment Questions
E.3.2 Multiple Choice Questions
E.3.2.2 For a matrix A, the inverse of the matrix does exist if and only if the
i Matrix is singular and square
ii Matrix is nonsingular and square
iii Matrix is singular and rectangular
E.3.2.3 The rank of a matrix
i Is the number of linearly independent rows (or equivalently columns)
ii Is the number of rows
iii Is the product of rows and columns
1 1 0 0
2 2 0 0
0 0 3 0
0 0 5 5
E.3.2.10 There are only crayons in a pencil case, and the crayons are either orange, pur-
ple, or brown. The table shows the probability of taking at random a brown
crayon from the pencil case.
The number of orange crayons in the pencil case is the same as the number of
purple crayons. Complete the table.
i 0.2, 0.2
ii 0.3, 0.3
iii 0.4, 0.2
E.3.2.11 What will be the probability of getting odd numbers if a die is thrown?
i 2
ii ½
iii ¼
E.3.2.12 The probability of getting two tails when two coins are tossed is
i ¼
ii 1/3
iii 1/6
E.3.2.13 In a box, there are 8 orange, 7 white, and 6 blue balls. If a ball is picked up ran-
domly, what is the probability that it is neither orange nor blue?
i 1/21
ii 5/21
iii 1/3
E.3.2.14 In 30 balls, a batsman hits the boundaries 6 times. What will be the probability
that he did not hit the boundaries?
i 1/5
ii 4/5
iii 3/5
E.3.2.15 A card is drawn from a well shuffled pack of 52 cards. Find the probability that
the card is neither a heart nor a red king.
i 27/52
ii 25/52
iii 38/52
E.3.2.19 Suppose you could take all samples of size 64 from a population with a mean
of 12 and a standard deviation of 3.2. What would be the standard deviation of
the sample means?
i 3.2
ii 0.2
iii 0.4
No Match
E.3.4 Problems
E.3.4.1 A coin is thrown three times. What is the probability that at least one head is
obtained?
E.3.4.2 Find the probability of getting a numbered card when a card is drawn from the
pack of 52 cards.
E.3.4.3 A problem is given to three students, Ram, Tom, and Hari, whose chances of
solving it are 2/7, 4/7, 4/9 respectively. What is the probability that the problem
is solved?
E.3.4.4 A vessel contains 5 red balls, 5 green balls, 10 blue balls, and 10 white balls. If
four balls are drawn from the vessel at random, what is the probability that the
first ball is red, the second ball is green, the third ball is blue, and finally the
fourth ball is white?
E.3.4.5 Suppose you were interviewed for a technical role. 50% of the people who sat
for the first interview received the call for a second interview. 95% of the people
who got a call for the second interview felt good about their first interview. 75%
of people who did not receive a second call also felt good about their first inter-
view. If you felt good after your first interview, what is the probability that you
will receive a second interview call?
E.3.4.6 Suppose we send 40% of our dresses to shop A and 60% of our dresses to shop B.
Shop A reports that 4% of our dresses are defective, and shop B reports that 3%
of our dresses are defective. (Use a tree diagram)
a. Find the probability that a dress is sent to shop A and it is defective.
b. Find the probability that a dress is sent to shop A and it is not defective.
c. Find the probability that a dress is sent to shop B and it is defective.
d. Find the probability that a dress is sent to shop B and it is not defective.
Mathematical Foundations and Machine Learning 75
E.3.4.7 Assume you sell sandwiches. 60% of people choose vegetable, and the rest choose
cheese. What is the probability of selling two vegetable sandwiches to the next
three customers?
E.3.4.8 One of two boxes contains 5 red balls and 4 green balls, and the second box
contains 2 green and 3 red balls. By design, the probabilities of selecting box 1 or
box 2 at random are 1/4 for box 1 and 3/4 for box 2. A box is selected at random,
and a ball is selected at random from it.
a. Given that the ball selected is red, what is the probability it was selected from
the first box?
b. Given that the ball selected is red, what is the probability it was selected from
the second box?
E.3.4.9 0.5% of a population have a certain disease and the remaining 99.5% are free
from this disease. A test is used to detect this disease. This test is positive in 95%
of the people with the disease and is also (falsely) positive in 5% of the people
free from the disease. If a person, selected at random from this population, has
tested positive, what is the probability that they do not have the disease?
E.3.4.10 Three factories produce air conditioners to supply to the market. Factory A
produces 30%, factory B produces 50%, and factory C produces 20%. 2% of the
air conditioners produced in factory A, 2% produced in factory B, and 1% pro-
duced in factory C are defective. An air conditioner selected at random from
the market was found to be defective. What is the probability that this air con-
ditioner was produced by factory B?
E.3.5 Short Questions
E.3.5.1 What is the importance of vectors in machine learning? Explain.
E.3.5.2 Differentiate between linear dependence and linear independence.
E.3.5.3 Discuss covariance between two variables.
E.3.5.4 What are the three fundamental axioms of probability?
E.3.5.5 What are the three approaches to probability? Discuss.
E.3.5.6 Explain collectively exhaustive events with an example.
E.3.5.7 Differentiate between marginal, joint, and conditional probabilities.
E.3.5.8 What is the Bayes theorem? Explain with an example.
E.3.5.9 Describe the characteristics of a normal probability distribution.
E.3.5.10 Discuss self-information.
E.3.5.11 Differentiate between entropy, cross-entropy, and relative entropy.
E.3.5.12 Differentiate between the entropy of a memoryless source and a Markov source.
4 Foundations and Categories of Machine Learning Techniques
4.1 Introduction
Data refers to distinct pieces of information, usually formatted and stored in such a way
that it can be used for a specific purpose. The basic aspect of machine learning is data and
how it is collected and represented, how features specific to the objective of the learning
process and their representation are described, and the types of machine learning algo-
rithms used. In this chapter we will discuss these aspects. We will also discuss the underly-
ing concepts of machine learning algorithms.
4.1.2 Problem Dimensions
When we talk of a machine learning problem, we talk of its problem dimensions: the number
of samples n (often very large), the input dimensionality d, that is, the number of input
features or attributes characterizing each sample (often 100 to 1000 or even more), and the
target dimensionality m, for example the number of classes (often small). The data suitable for
machine learning will often be organized as a matrix having dimension n × (d + 1) if the
target dimension is 1 as is true in many applications (as in Figure 4.1) or n × (d + m) if the
target dimension is m. The data set shown in Figure 4.1 can be visualized as a point cloud
in a high-dimensional vector space as shown in Figure 4.2.
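As a small sketch (with hypothetical random values), such a data set with n samples, d features, and a single target can be stored as an n × (d + 1) NumPy array:

```python
import numpy as np

# Hypothetical data set: n = 4 samples, d = 3 features, 1 binary target.
n, d = 4, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))          # feature part, shape (n, d)
y = rng.integers(0, 2, size=(n, 1))  # target column, shape (n, 1)

data = np.hstack([X, y])             # full data matrix, shape n x (d + 1)
```

Each row of `data` is then one point of the point cloud in the d-dimensional feature space, with its label attached.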
DOI: 10.1201/9781003290100-4
FIGURE 4.1
Data representation for binary classification of images.
FIGURE 4.2
Data set as a point cloud in high-dimensional vector space.
FIGURE 4.3
Example of data set with two target variables.
As we will discuss later, in the case of classification the labels will be discrete; in the case
of regression they will be continuous; and in the case of structured prediction these labels may
be more complex. In the example in Figure 4.1, we assumed one target variable, but there
can be more than one target variable. Such an example is shown in Figure 4.3 where two
variables are predicted; one is binary classification and the other is a regression problem,
the details of which will be discussed later.
4.2.1 Types of Data
The following are the common types of data that can be used to represent features:
• Discrete data: A set of data having a finite number of values or data points is called
discrete data. In other words, discrete data only include integer values. “Count”
data, derived by counting the number of events or animals of interest, are types of
discrete data. Other examples of discrete data include marks, number of registered
cars, number of children in a family, and number of students in a class.
• Ordinal data: Ordinal data are inherently categorical in nature but have an intrin-
sic order to them. A set of data is said to be ordinal if the values or observations
belonging to it can be ranked or have an attached rating scale. Examples include
academic grades (i.e., O, A+, A, B+), clothing size, and positions in an organization.
• Continuous or real valued data: Continuous data can take any of a range of values,
and the possible number of different values which the data can take is infinite.
Examples of types of continuous data are weight, height, and the infectious
period of a pathogen. Age may be classified as either discrete (as it is commonly
recorded in whole years) or continuous (when measured exactly).
4.2.2 Data Dependencies
Another important aspect to be considered when talking about input data is the data
dependencies between the different features.
4.2.3 Data Representation
The way the input is represented by a model is also important for the machine learning
process. Vectors and matrices are the most common representations used in machine learn-
ing. Vectors are collections of features, examples being height, weight, blood pressure, age,
and so forth. Categorical variables can also be mapped to vectors. Matrices are often used
to represent documents, images, multispectral satellite data, and so on. In addition, graphs
provide a richer representation. Examples include recommendation systems with graph
databases which enable learning the paths customers take when buying a product, and
when another new customer is halfway through that path, we recommend where they may
want to go.
The representations may be the instances themselves but may also include decision
trees, graphical models, sets of rules or logic programs, neural networks, and so on. These
representations are usually based on some mathematical concepts. Decision trees are based
on propositional logic, Bayesian networks are based on probabilistic descriptions, and
neural networks are based on linear weighted polynomials.
4.2.4 Processing Data
The data given to the learning system may require a lot of cleaning. Cleaning involves get-
ting rid of errors and noise and removal of redundancies.
Data preprocessing: Real-world data generally contains noise and missing values; can
be incomplete, inconsistent, or inaccurate (containing errors or outliers); and often lacks
specific attribute values or trends, and hence cannot be directly used for machine learning
models. Data preprocessing helps to clean, format, and organize the raw data and make
the data ready to be used by machine learning models in order to achieve effective learn-
ing. The first step is the acquiring of the data set, which comprises data gathered from
multiple and disparate sources which are then combined in a proper format to form a data
set. Data set formats differ according to use cases. For instance, a business data set will
contain relevant industry and business data while a medical data set will include healthcare-
related data. Data cleaning is one of the most important steps of preprocessing and is the
process of adding missing data and correcting, repairing, or removing incorrect or irrele-
vant data from a data set.
Preprocessing techniques include renaming, rescaling, discretizing, abstracting, aggre-
gating, and introducing new attributes. Renaming or relabeling is the conversion of cate-
gorical values to numbers. However, this conversion may be inappropriate when used
with some learning methods. Such an example is shown in Figure 4.3, where numbers
impose an order on the values that is not warranted.
Some types of preprocessing are discussed subsequently. Rescaling, also called normal-
ization, is the transferring of continuous values to some range, typically [–1, 1] or [0, 1].
Discretization or binning involves the conversion of continuous values to a finite set of
discrete values. Another technique is abstraction, where categorical values are merged
together. In aggregation, actual values are replaced by values obtained with summary or
aggregation operations, such as minimum value, maximum value, average, and so on.
Finally, sometimes new attributes that define a relationship with existing attributes are
introduced. An example is replacing weight and height attributes by a new attribute, obe-
sity factor, which is calculated as weight/height. These preprocessing techniques are used
only when the learning is not affected due to such preprocessing.
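A minimal sketch of three of these techniques (rescaling, discretization, and aggregation) on a hypothetical weight attribute:

```python
import numpy as np

# Hypothetical continuous attribute: weight in kg.
weight = np.array([48.0, 60.0, 72.0, 96.0])

# Rescaling (normalization) to the range [0, 1].
w_scaled = (weight - weight.min()) / (weight.max() - weight.min())

# Discretization (binning) into 3 equal-width bins labelled 0, 1, 2.
edges = np.linspace(weight.min(), weight.max(), 4)
w_binned = np.clip(np.digitize(weight, edges) - 1, 0, 2)

# Aggregation: replace the raw values by a summary statistic.
w_mean = weight.mean()
```

The same pattern extends to the other techniques: relabeling maps category strings to integers, and abstraction merges bin labels further.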
4.2.5 Data Biases
It is important to watch out for data biases. For this we need to understand the data source.
It is very easy to derive unexpected results when data used for analysis and learning are
biased (preselected). The results or conclusions derived for preselected data do not hold
for general cases.
4.2.6.3 Feature Extraction
In some cases, instead of only reducing existing features, well-conceived new features can
capture the important information in a data set much more effectively than the existing
original features. In this context, three general methodologies are used, namely feature
extraction, which can be domain specific and typically results in significant reduction in
dimensionality; mapping existing features to a new space using an appropriate mapping
function; and feature construction, where existing features are combined. One method of
dimensionality reduction is to extract relevant inputs using a measure such as mutual
information.
FIGURE 4.4
Perspectives of feature processing.
4.3.2 Generalization
The central concept of machine learning is the generalization from data by finding pat-
terns in data. Machine learning attempts to generalize beyond the input examples given
at the time of learning the model by applying the model and making predictions about
hitherto unseen examples during test time. Generalization decides how well a model per-
forms on new data. Now the big question is how to get good generalization with a limited
number of examples. The intuitive idea is based on the concept of Occam’s razor, which
advises favoring simpler hypotheses when we select among a set of possible hypotheses.
For example, a simpler decision boundary may not fit ideally to the training data but tends
to generalize better to new data.
• Bias indicates the extent to which the average model over all training sets differs
from the true model. Bias is affected by the assumptions or simplifications made
by a model to make a function easier to learn or caused because the model cannot
represent the concept. High bias results in a weaker modelling process having a
simple, fixed-size model with a small feature set.
• Variance indicates the extent to which models estimated from different training
sets differ from each other. In other words, if the model is trained using one data set
and a very low error value is obtained, but when the data set is changed for the
same model a high error value is obtained, then high variance is said to occur. This
may be because the learning algorithm overreacts to small changes (noise) in the
training data. High variance results in a complex, scalable model of a higher-order
polynomial with a large feature set.
When we are finding a model to fit the data, there needs to be a trade-off between bias and
variance since they measure two different sources of error of the model. As discussed, bias
measures the expected deviation from the true value of the function or parameter of the
model while variance provides a measure of the expected deviation that any particular
sampling of the data is likely to cause.
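This trade-off can be illustrated with a small simulation (a sketch that assumes a sine as the true function and Gaussian noise): a degree-1 polynomial shows high bias and low variance, while a degree-9 polynomial shows the reverse.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
f_true = np.sin(2 * np.pi * x)          # assumed true underlying function

def fit_predict(degree, n_sets=200):
    """Fit a polynomial to many independent noisy training sets."""
    preds = []
    for _ in range(n_sets):
        y = f_true + rng.normal(0, 0.3, size=x.size)   # fresh noisy sample
        coef = np.polyfit(x, y, degree)
        preds.append(np.polyval(coef, x))
    return np.array(preds)

for deg in (1, 9):
    p = fit_predict(deg)
    bias2 = np.mean((p.mean(axis=0) - f_true) ** 2)   # squared bias
    var = np.mean(p.var(axis=0))                      # variance across sets
    print(f"degree {deg}: bias^2={bias2:.3f} variance={var:.3f}")
```

The simple model deviates from the true sine in the same way on every training set (bias), while the flexible model tracks the noise of each particular sample (variance).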
4.3.4.1 Overfitting
Overfitting occurs when the model fits the training data too closely, including its details,
and therefore does not generalize; it performs badly on the test data. This can happen if a model fits
more data than needed and starts fitting the noisy data and inaccurate values in the data.
FIGURE 4.5
Overfitting and underfitting.

Overfitting is often a result of an excessively complicated model which even fits irrelevant
characteristics such as noise and can lead to poor performance on unseen data. This means
that the noise or random fluctuations in the training data are picked up and learned as
concepts by the model. The problem is that these learnt concepts do not apply to new data,
hence impacting the generalization of the model. Thus overfitting occurs when there is
high variance and low bias.
Overfitting is more likely with flexible nonparametric and nonlinear models; many such
algorithms often constrain the amount of detail learnt. For example, decision trees are a
nonparametric machine learning algorithm that is very flexible and is subject to overfitting
training data. Methods to address these issues will be discussed in the respective chapters.
A simple solution to avoid overfitting is using a linear algorithm if we have linear data.
Overfitting becomes a problem because the evaluation of machine learning algorithms
carried out on training data differs from the evaluation that is actually required on
unseen data. Several techniques help to avoid it:
• Hold-out method is a technique where we hold some data out of the training set,
train or fit the model on the remaining training data, and finally use the held-out
data to evaluate and tune the model.
• Cross validation is one of the most powerful techniques to avoid or prevent over-
fitting. The initial training data is used to generate mini train–test splits, and then
these splits are used to tune the model. In a standard k-fold validation, the data is
partitioned into k subsets also known as folds. After this, the algorithm is trained
iteratively on k – 1 folds while using the remaining folds as the test set, also known
as the holdout fold. In cross-validation, only the original training set is used to
tune the hyper-parameters, that is, parameters whose values control the learning
process. Cross-validation basically keeps the test set separate as truly unseen data
for selecting the final model. This helps to reduce overfitting.
• Training with more data: One method to avoid overfitting is to ensure that there
are sufficient numbers of samples in the training set. However, in some cases the
increased data can also mean feeding more noise to the model. Therefore, when
we are training the model with more data, it is necessary to make sure the data is
clean and free from randomness and inconsistencies.
• Removing features: There are some machine learning algorithms which incorpo-
rate the automatic selection of features. Otherwise, a few irrelevant features can be
removed from the input features to improve the generalization. This can be carried
out by deriving how a feature fits into the model. A few feature selection heuristics
can be used as a good starting point.
• Early stopping: In the course of training the model, the performance of the model
can be measured after each iteration. Training continues only until further iterations
no longer improve the performance of the model. It can be assumed that after this
point the model overfits, as generalization weakens.
• Regularization: Another important mathematical technique is the use of the con-
cept of regularization, which is the process of introducing additional information in
order to prevent overfitting. This information is usually in the form of a penalty for
complexity. A model should be selected based on the Occam’s razor principle (proposed
by William of Ockham), which states that the explanation of any phenomenon should
make as few assumptions as possible, eliminating those that make no difference in the
observable predictions of the explanatory hypothesis or theory. In other words, the
simplest hypothesis (model) that fits almost all the data is better than more complex
ones; therefore, there is an explicit preference toward simple models.
Other methods to tackle overfitting are specific to the machine learning techniques used
and will be discussed in the corresponding chapters.
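The k-fold splitting step described above can be sketched in a few lines (fold count and sample count are hypothetical):

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Split sample indices into k roughly equal folds after shuffling."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, k)

# Hypothetical use: 10 samples, 5 folds.
folds = kfold_indices(10, 5)
for i, test_fold in enumerate(folds):
    # Train on the other k - 1 folds; hold this fold out for evaluation.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
```

Each sample appears in exactly one holdout fold, so every point is used for both training and evaluation across the k rounds.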
4.3.4.2 Underfitting
In order to avoid overfitting, training may stop at an earlier stage, which however may
lead to the model not being able to learn enough from training data and may not capture
the dominant trend. This is known as underfitting. The result is the same as overfitting:
inefficiency in predicting outcomes. In other words, underfitting occurs when the model
does not properly fit the training data. Underfitting is often a result of an excessively sim-
ple model that fails to represent all the relevant data characteristics. Thus underfitting
occurs when there is high bias and low variance.
Underfitting usually happens when we have less data to build an accurate model or
when we try to build a linear model with nonlinear data. In such cases the machine learn-
ing model will probably make a lot of wrong predictions. Underfitting can be essentially
avoided by using more data. Techniques to reduce underfitting include increasing model
complexity, performing feature engineering, removing noise from the data, and increasing
the number of epochs or increasing the duration of training to get better results.
Both overfitting and underfitting lead to poor predictions on new data sets; that is, they
result in poor generalization.
The first question to be answered is the list of algorithms available for learning, which in
turn depends on the availability of the data and its characteristics. The next issue is the
performance of these algorithms. The amount of training data that is sufficient to learn
with high confidence is the next question to be answered. Then comes the question of
whether prior knowledge will be useful in getting better performance of the algorithm.
Another interesting question is whether some training samples can be more useful than others.
Machine learning methods learn a mapping function from input variables x to an output
variable Y:

Y = f(x)

The method in which this mapping function is modelled differentiates parametric and
nonparametric methods.
TABLE 4.1
Parametric and Nonparametric Learning

Assumption
Parametric: Simplify the function that is to be learnt to a known form; the set of parameters of the learnt function is of fixed size irrespective of the number of training samples.
Nonparametric: No strong assumptions are made about the form of the mapping function, and hence any functional form can be learnt from the training data.

Learning
Parametric: Learns the coefficients of the selected function from the training data.
Nonparametric: Seeks the mapping function that best fits the training data from among a large number of functional forms.

Usage
Parametric: Prior information about the problem is available.
Nonparametric: A large amount of data is available and there is no prior knowledge.

Example methods
Parametric: Logistic regression, linear discriminant analysis, perceptron, and naïve Bayes.
Nonparametric: k-nearest neighbors (KNN), decision trees, support vector machines (SVMs).

Advantages
Parametric: Simplicity; learns faster and requires less training data; works even if the function does not perfectly fit the data.
Nonparametric: Flexibility due to fitting a large number of functional forms; powerful since no assumptions are made about the underlying function; better performing models for prediction.

Disadvantages
Parametric: Form of the function is predetermined; suited to simpler problems; the mapping function is unlikely to match the actual training data.
Nonparametric: Requires a large amount of training data to estimate the mapping function; comparatively slower to train as more parameters need to be estimated; risk of overfitting; harder to explain why specific predictions are made.

Probabilistic methods
Parametric: A general form (model) of the prior probabilities, likelihood, and evidence with several unknown parameters is selected, and these parameters are estimated for the given data.
Nonparametric: Assume an infinite-dimensional function can capture the distribution of the input data; flexible, since the function can improve as the amount of data increases.
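The contrast in Table 4.1 can be sketched on a hypothetical one-dimensional data set: the parametric linear model keeps only two numbers (slope and intercept) however much data arrives, while the nonparametric k-nearest-neighbours predictor keeps the entire training set.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(50, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, size=50)   # assumed true slope 3

# Parametric: linear regression has a fixed number of parameters (w, b).
A = np.hstack([X, np.ones((50, 1))])
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

# Nonparametric: k-nearest neighbours stores all training data and
# predicts from the k closest samples.
def knn_predict(x_new, k=3):
    dist = np.abs(X[:, 0] - x_new)
    return y[np.argsort(dist)[:k]].mean()
```

Adding more training samples refines w and b but never changes their number, whereas the k-NN predictor grows with the data.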
In logical models, logical expressions are used to divide the instance space into segments
so that examples belonging to the same class fall into the same segment. Rule-based and
tree-based methods fall into this category. These models are nonparametric since the
expressions are formed to fit the examples and are not fixed beforehand. In geometric
models, the features are considered as
points in the instance space, and similarity between examples is based on geometry. There
are two types of geometric models, namely linear models and distance-based models. In
linear models, a function defines a line or a plane to separate the instances. Linear models
are parametric since the function has a fixed form and only the parameters of the function
are learnt from the instance space. The linear models need to learn only a few parameters
and hence are less likely to overfit. In the distance-based geometric models, the concept
of distance between instances is used to represent similarity. The most common distance
metrics used are Euclidean, Minkowski, Manhattan, and Mahalanobis distances. These
distances will be described in detail in a succeeding chapter. Examples of distance-based
models include the nearest-neighbour models for classification, which use the training
data as exemplars, while the k-means clustering algorithm uses exemplars to create clus-
ters of similar instances.
The third class of machine learning algorithms is the probabilistic models. Probabilistic
models use the idea of probability and represent features and target variables as random
variables. The process of modelling represents and manipulates the level of uncertainty
with respect to these variables. The Bayes rule (inverse probability) allows us to infer
unknown quantities, adapt our models, make predictions, and learn from data (Figure 4.6).
The Bayes rule allows us to carry out inference about the hypothesis from the input data.
These models can further be classified as generative and discriminative methods, where
the difference lies in determining the posterior probability. In Bayes rule terms, in generative
methods, the likelihood, prior, and evidence probabilities are modelled, using which
the posterior probability is determined. In discriminative methods, the posterior probabil-
ity is determined directly, thus focusing the computational resources on the task at hand
directly.
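As a numeric sketch of the Bayes rule, reusing the disease-test numbers from Exercise E.3.4.9:

```python
# Bayes rule: P(H | E) = P(E | H) * P(H) / P(E).
p_disease = 0.005                      # prior P(D)
p_pos_given_disease = 0.95             # likelihood P(+ | D)
p_pos_given_healthy = 0.05             # false positive rate P(+ | not D)

# Evidence: total probability of a positive test.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: probability of disease given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)             # roughly 0.087
```

The complementary probability, about 0.91, is the chance that a positive-testing person is actually disease free, which is what Exercise E.3.4.9 asks for.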
This section discussed the basis of machine learning algorithms, and as can be seen from
Figure 4.7, many machine learning algorithms fall into more than one category.
Understanding this basis will help us when we discuss the types of machine learning algo-
rithms in the following section as well as when we outline the details of the various
machine learning algorithms.
FIGURE 4.6
Bayes rule.
FIGURE 4.7
Categorization of underlying basis of machine learning algorithms.
4.8.1 Supervised Learning
Supervised learning is the most well understood category of machine learning algorithms.
Here the training data given to the algorithm is composed of both the input and the cor-
responding desired outputs or targets. Supervised learning aims to learn a function from
examples and map its inputs to the outputs. Therefore, the algorithm learns to predict the
output when given a new input vector. Learning decision trees is a form of supervised
learning. The advantages of supervised learning are that it can learn complex patterns and
in general give good performance. However, these algorithms require a large amount of
labeled output data.
FIGURE 4.8
Categorization of classical machine learning.
Supervised learning is also called predictive modelling: once the model has been learnt,
it predicts a target y from a given new input x. In classification, y represents a
category or “class,” while in regression, y is a real valued number. Classification can be
described as grouping of entities or information based on similarities. Supervised methods
take in two kinds of training data: a set of objects for which we have input data (e.g.,
images, or features) and a set of associated labels for those objects. The labels could be
things like “dog,” “cat,” or “horse,” and these are considered to be ground truths, or feed-
back provided by an expert. The labels need not necessarily have discrete classifications;
they can be continuous variables, like the length of a dog or the amount of food the dog
eats per day. Now let us consider an example of classification (Figure 4.9). This example
has as input two words and output labels that classify the set of words as having positive
or negative sentiment. The machine is trained to associate input data with their labels.
FIGURE 4.9
Example of classification.
FIGURE 4.10
Workflow of a supervised learning system.
In classification a category is predicted, while in regression a numerical output value is
predicted based on the training data. Therefore, in
classification, the output type is discrete; we try to find a boundary between categories;
and the evaluation normally used is accuracy. In regression, the output type is continuous;
we try to best fit a line, that is, find a linear or polynomial function to fit the data; and the
evaluation is done using the sum of squared errors (which will be discussed later). It is to
be noted that the same problem can be modelled as either classification or regression. An
example would be prediction of temperature, where in classification we predict whether
the temperature is hot or cold tomorrow, while in regression we predict the actual tem-
perature tomorrow.
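The temperature example can be sketched as follows (hypothetical values): the same data yields a regression target directly, or a classification target after thresholding.

```python
import numpy as np

# Hypothetical daily temperatures (degrees C).
temps = np.array([31.0, 14.0, 28.0, 9.0, 35.0])

# Regression target: the real-valued temperature itself.
y_regression = temps

# Classification target: threshold into "hot" (1) / "cold" (0).
y_classification = (temps >= 25.0).astype(int)
```

The choice of modelling the problem one way or the other changes the output type, the evaluation metric, and the algorithms that apply.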
4.8.2 Unsupervised Learning
Unsupervised machine learning is described as learning without a teacher, where the
desired output is not given. In unsupervised learning, also called descriptive modelling, no
explicit output label is given. The learning task involves discovering underlying structures
in the data or modelling the probability distribution of the input data. Here the machine
is trained to understand which data points in the instance space are similar. When a new
unseen input is given, it determines which group this data point is most similar to. In other
words, the goal of unsupervised learning is to discover useful representations of the data,
such as finding clusters, obtaining a reduced space through dimensionality reduction, or
modelling the data density.
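Clustering, the most common of these tasks, can be sketched with a minimal k-means loop (pure NumPy, hypothetical two-cluster data):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Move each centroid to the mean of its assigned points.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated hypothetical blobs, with no labels given.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centers = kmeans(X, k=2)
```

No output labels are supplied; the algorithm discovers the two groups from the geometry of the data alone.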
TABLE 4.2
Examples of Supervised Learning

1. Face recognition. Instance space X: raw intensity of face images. h(x): name of the person. Type: classification.
2. Loan approval. X: properties of customers (age, income, liability, job, …). h(x): approval of loan or not. Type: classification.
3. Predicting student height. X: properties of students (age, weight, height). h(x): height. Type: regression.
4. Classifying cars. X: features of cars (engine power, price). h(x): family car or otherwise. Type: classification.
5. Predicting rainfall. X: features (year, month, amount of rainfall). h(x): amount of rainfall. Type: regression.
6. Classifying documents. X: words in documents. h(x): classify as sports, news, politics, etc. Type: classification.
7. Predicting share price. X: market information up to time t. h(x): share price. Type: regression.
8. Classification of cells. X: properties of cells. h(x): anemic cell or healthy cell. Type: classification.
FIGURE 4.11
Workflow of unsupervised learning system.
4.8.3 Semi-supervised Learning
In semi-supervised machine learning the training data includes some of the desired outputs
but not completely, as is the case with supervised learning. In many applications, while
unlabeled training samples are readily available, obtaining labelled samples is expensive,
and therefore semi-supervised learning assumes importance. In semi-supervised learning
we use the unlabeled data to augment a small set of labelled samples to improve learning.
In order to make any use of unlabeled data, semi-supervised learning must assume some
structure in the underlying distribution of the data. These algorithms make use of the
following three assumptions: the smoothness assumption, that is, if two points x1, x2 are
close, then so should be the corresponding outputs y1, y2; the cluster assumption, that is,
if points are in the same cluster, they are likely to be of the same class; and finally the
manifold assumption, that is, the (high-dimensional) data lie (roughly) on a low-dimensional
manifold. Some of the common types of semi-supervised learning techniques include
co-training, transductive SVM, and graph-based methods.
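Under the smoothness and cluster assumptions, a self-training loop can be sketched as follows (hypothetical one-dimensional data; the nearest-labeled-point rule stands in for a real base learner):

```python
import numpy as np

def self_train(X_lab, y_lab, X_unlab):
    """Self-training sketch: repeatedly label the unlabeled point closest
    to any labeled point and add it to the labeled set, relying on the
    smoothness assumption that nearby points share a label."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    X_unlab = X_unlab.copy()
    while len(X_unlab):
        d = ((X_unlab[:, None] - X_lab[None]) ** 2).sum(-1)
        i = np.argmin(d.min(axis=1))   # most confidently placed point
        j = np.argmin(d[i])            # its nearest labeled neighbour
        X_lab = np.vstack([X_lab, X_unlab[i]])
        y_lab = np.append(y_lab, y_lab[j])
        X_unlab = np.delete(X_unlab, i, axis=0)
    return X_lab, y_lab

# Two labeled seeds and four unlabeled points near them.
X_lab = np.array([[0.0], [10.0]]); y_lab = np.array([0, 1])
X_unlab = np.array([[1.0], [2.0], [8.0], [9.0]])
X_all, y_all = self_train(X_lab, y_lab, X_unlab)
```

Starting from only two labeled samples, the unlabeled points inherit the label of the cluster they sit in.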
4.8.4 Reinforcement Learning
Supervised (inductive) learning is the simplest and most studied type of machine learning.
But how can an agent learn behaviors when it doesn’t have a teacher to tell it how to
perform, as in supervised learning? In reinforcement learning, the agent takes actions to
perform a task and obtains feedback about its performance. The agent then repeats the
task, adjusting its behavior depending on the feedback: it gets positive reinforcement for
tasks done well and negative reinforcement for tasks done poorly. This
method is inspired by behavioral psychology. The goal is to get the agent to act in the
world so as to maximize its rewards. For example, consider teaching a monkey a new trick:
you cannot instruct the monkey what is to be done, but you can reward (with a banana) or
punish (scold) it depending on whether it does the right or wrong thing. Here the monkey
needs to learn the action that resulted in reward or punishment, which is known in rein-
forcement learning contexts as the credit assignment problem. Reinforcement learning
consists of the agent and the environment. The agent performs the action under the policy
being followed, and the environment is everything else other than the agent. Learning
takes place as a result of interaction between an agent and the world. In other words, the
percept received by an agent should be used not only for understanding, interpreting, or
prediction, as in the machine learning tasks we have discussed so far, but also for acting.
Reinforcement learning is more general than supervised and unsupervised learning and
involves learning from interaction with the environment to achieve a goal and getting an
agent to act in the world so as to maximize its rewards. Reinforcement learning approaches
can be used to train computers to do many tasks such as playing backgammon and chess
and controlling robot limbs.
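A minimal reinforcement-learning sketch (an epsilon-greedy agent on a hypothetical three-armed bandit): the agent is only ever given rewards, never told which action is correct.

```python
import numpy as np

rng = np.random.default_rng(4)
true_means = np.array([0.2, 0.5, 0.8])   # assumed hidden reward rates
q = np.zeros(3)                          # estimated value of each action
n = np.zeros(3)                          # pull count per action
eps = 0.1                                # exploration rate

for t in range(2000):
    # Explore with probability eps, otherwise act greedily.
    a = rng.integers(3) if rng.random() < eps else int(np.argmax(q))
    r = float(rng.random() < true_means[a])   # Bernoulli reward
    n[a] += 1
    q[a] += (r - q[a]) / n[a]                 # incremental mean update

best = int(np.argmax(q))
```

The reward signal alone is enough for the estimates q to converge on the most rewarding action, a tiny instance of learning a policy that maximizes long-term reward.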
The major difference between reinforcement learning and supervised and unsupervised
learning is shown in Table 4.3.
TABLE 4.3
Supervised, Unsupervised, and Reinforcement Learning
Type of Learning Difference
Supervised learning The training information consists of the input and desired output pairs of the
function to be learned. The learning system needs to predict the output by
minimizing some loss.
Unsupervised learning Training data contains the features only, and the learning system needs to find
“similar” points in high-dimensional space.
Reinforcement learning The agent receives some evaluation (“rewards” or “penalties”) of the
output, that is, the action of the agent, but is not told which action is the
correct one to achieve its goal. The learning system needs to develop an
optimal policy so as to maximize its long-term reward.
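The reward-driven loop summarized in Table 4.3 can be sketched as a tiny multi-armed bandit agent (a hypothetical illustration, not the book's code): the agent is never told which action is correct, only the reward the environment returns, yet it converges on the action with the highest long-term reward.

```python
import random

def run_bandit(true_rewards, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: estimates the value of each action (arm)
    purely from the rewards the environment returns."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_rewards)  # learned value of each arm
    counts = [0] * len(true_rewards)
    for _ in range(steps):
        if rng.random() < epsilon:                 # explore occasionally
            arm = rng.randrange(len(true_rewards))
        else:                                      # exploit the best estimate
            arm = max(range(len(true_rewards)), key=lambda a: estimates[a])
        reward = true_rewards[arm] + rng.gauss(0, 0.1)  # noisy reward signal
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
    return estimates

# The agent should learn that the third arm (average reward 0.9) is best.
values = run_bandit([0.1, 0.5, 0.9])
```

Note that no input–output pairs are supplied, as in supervised learning; the policy emerges only from the evaluative reward signal.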
Foundations and Categories of Machine Learning Techniques 97
4.9 Summary
• Explained the importance of data, types of data, representation, and processing of
data.
• Discussed features and their extraction and selection.
• Explained the different aspects of the basics of machine learning including induc-
tive learning, generalization, bias and variance, and overfitting and underfitting.
• Discussed the underlying concepts of machine learning algorithms.
• Explored the basic concepts of major types of machine learning algorithms.
4.10 Points to Ponder
• Problem dimensions depend on the number of samples, number of input features,
and target dimensionality.
• The same data (for example, images) can be represented in different ways.
• Data cleaning is the process of adding missing data and correcting, repairing, or
removing incorrect or irrelevant data from a data set.
• Generalization is needed in all application areas of machine learning.
• Concept learning can be defined as discovering rules by observing examples.
• Parametric learning need not produce a good prediction model.
• Supervised learning aims to learn a function from examples that map its inputs to
the outputs.
• In unsupervised learning, no explicit output label is given.
E.4 Exercises
E.4.1 Suggested Activities
Use Case
E.4.1.1 Take two examples from the business scenario and formulate the same examples
as a classification, clustering, associative rule mining, and reinforcement learning
problem. Clearly state any assumptions made.
Thinking Exercise
E.4.1.2 Give an everyday mathematical example of overfitting.
E.4.1.3 Can you give an example data set with three target variables?
E.4.1.4 Can you think of examples of generalization in education and industry?
Self-Assessment Questions
E.4.2 Multiple Choice Questions
Give answers with justification for correct and wrong choices.
E.4.4 Short Questions
E.4.4.1 Give a data representation (as in Figure 4.1) for multiple classifications (three
classes) of documents.
E.4.4.2 Assume that the number of samples is 1000, the number of input features is
30, and the number of target classes is 1. Find the problem dimension. If the
target class increases to 5, find the problem dimension.
E.4.4.3 What are data dependencies? Discuss.
E.4.4.4 What is data bias? Give an example from the education domain.
E.4.4.5 Why do we need feature selection?
E.4.4.6 What is inductive learning? Discuss in terms of classification, regression, and
probability estimation.
E.4.4.7 Why is generalization considered the core concept of machine learning?
E.4.4.8 Distinguish between bias and variance. Use an illustrative diagram.
E.4.4.9 Distinguish between overfitting and underfitting. Use an illustrative diagram.
E.4.4.10 Discuss any three methods that are used to handle overfitting.
E.4.4.11 Compare and contrast offline and online machine learning.
E.4.4.12 Compare and contrast parametric and nonparametric learning.
E.4.4.13 Why are lazy methods also called instance-based methods? Discuss.
E.4.4.14 What are the three basic approaches to machine learning? Discuss.
E.4.4.15 Distinguish between linear models and distance-based models.
E.4.4.16 Use another running example from the business domain for categorization of
classical machine learning (Figure 4.8).
E.4.4.17 Distinguish the components in the workflow of supervised and unsupervised
machine learning systems. Discuss.
E.4.4.18 Why are clustering, association rule mining, and dimensionality reduction
considered as unsupervised machine learning? Discuss.
E.4.4.19 What is the concept of reinforcement learning?
E.4.4.20 How does deep learning help neural network models?
5
Machine Learning: Tools and Software
5.1 Weka Tool
Weka is open-source software, developed at the University of Waikato, that provides a collection of machine learning algorithms. Weka stands for Waikato Environment for
Knowledge Analysis; the weka is also a bird native to New Zealand.
There are two ways in which these algorithms can be used: the data can be given directly
to the algorithms or they can be called as part of Java code. Weka is a collection of tools for
tasks needed for data mining such as data preprocessing, regression, association, classifica-
tion, and clustering.
5.1.1 Features of Weka
Weka includes all aspects needed to create and apply data mining models. It covers all
important data mining tasks including tools needed for preprocessing and visualization.
The features of Weka are listed in Figure 5.1 and include platform independence, free and
open-source software, ease of use, data-preprocessing facility, availability of different
machine learning algorithms, flexibility, and a good graphical interface.
5.1.2 Installation of Weka
To download Weka, go to the official website, http://www.cs.waikato.
ac.nz/ml/weka/. Choose a stable version to download, not the developer version, and be
sure to choose the Windows, Mac OS, or Linux version as appropriate for your machine.
Once the download is completed, run the installer; initially the default setup can be
chosen. If required, the Weka environment variable for Java can be set by executing
commands at the command prompt.
FIGURE 5.1
Features of Weka.
FIGURE 5.2
Application interfaces of Weka.
Weka can load data in the following formats:
• CSV
• ARFF
• Database using ODBC
An ARFF file consists of two sections:
1. The header section, which allows specification of features and defines the relation
(data set) name, the attribute names, and their types.
2. The data section, which lists the actual data instances.
An ARFF file requires the declaration of the relation, attributes, and data.
An example of an ARFF file is shown in Figure 5.3.
• @relation: Written in the header section, this is the first line in any ARFF file. This
is followed by the relation/data set name. The relation name must be a string, and
in case there are spaces, then the relation name must be enclosed within quotes.
• @attribute: This part is also included in the header section, and the attributes are
declared with their names and their type or range. The following attribute data
types are supported by Weka:
• Numeric
• <nominal-specification>
• String
FIGURE 5.3
Example of an ARFF file.
• date
• @data: Defined in the data section; the list of all data instances follows the @data declaration.
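Putting the declarations together, a minimal ARFF file for a weather data set would look as follows (the attribute names match the weather data used later in this chapter; the data values are illustrative):

```text
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
```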
5.1.5 Weka Explorer
The Weka Explorer is illustrated in Figure 5.4 and contains a total of six tabs: Preprocess,
Classify, Cluster, Associate, Select attributes, and Visualize.
The user cannot move between the different tabs until the initial preprocessing of the data
set has been completed.
5.1.6 Data Preprocessing
The data that is collected from the field contains many unwanted things that lead to wrong
analysis. For example, the data may contain null fields, it may contain columns that are
FIGURE 5.4
Weka Explorer.
irrelevant to the current analysis, and so on. Thus, the data must be preprocessed to meet
the requirements of the type of analysis you are seeking. This is done in the preprocessing
module.
To demonstrate the available features in preprocessing, we will use the Weather data-
base that is provided in the installation (Figure 5.5).
Using the Open file … option under the Preprocess tag, select the weather-nominal.
arff file.
When you open the file, your screen looks as shown here (Figure 5.6).
This screen tells us several things about the loaded data, which are discussed further in
this chapter.
5.1.7 Understanding Data
Let us first look at the highlighted Current relation sub-window. It shows the name of the
database that is currently loaded. You can infer two points from this sub-window:
On the left side, notice the Attributes sub-window that displays the various fields in the
database (Figure 5.7).
FIGURE 5.5
Demonstration with sample dataset.
FIGURE 5.6
Snapshot of loading the data.
FIGURE 5.7
Snapshot of understanding the data.
The weather database contains five fields: outlook, temperature, humidity, windy, and
play. When you select an attribute from this list by clicking on it, further details on the
attribute itself are displayed on the right-hand side.
Let us select the temperature attribute first. When you click on it, you would see the fol-
lowing screen (Figure 5.8).
5.1.7.1 Selecting Attributes
In the Selected Attribute sub-window, you can observe the following:
At the bottom of the window, you see the visual representation of the class values.
If you click on the Visualize All button, you will be able to see all features in one single
window as shown here (Figure 5.9).
FIGURE 5.8
Snapshot of viewing single attribute (“temperature”).
FIGURE 5.9
Visualization of all attributes.
5.1.7.2 Removing Attributes
Many times, the data that you want to use for model building comes with many irrelevant
fields. For example, the customer database may contain customers’ mobile number, which
is irrelevant in analyzing their credit rating (Figure 5.10).
To remove attributes, select them and click on the Remove button at the bottom. The
selected attributes are removed from the database. After you fully preprocess the data, you
can save it for model building.
Next, you will learn to preprocess the data by applying filters on this data.
5.1.7.3 Applying Filters
Some of the machine learning techniques such as association rule mining require cate-
gorical data. To illustrate the use of filters, we will use the weather-numeric.arff database
that contains two numeric attributes—temperature and humidity. We will convert these
to nominal by applying a filter on our raw data. Click on the Choose button in the Filter
sub-window and select the following filter (Figure 5.11):
weka→filters→supervised→attribute→Discretize
Click on the Apply button and examine the temperature and/or humidity attribute. You
will notice that these have changed from numeric to nominal types.
Let us look into another filter now (Figure 5.12). Suppose you want to select the best
attributes for deciding the play. Select and apply the following filter:
weka→filters→supervised→attribute→AttributeSelection
You will notice that it removes the temperature and humidity attributes from the database
(Figure 5.13).
After you are satisfied with the preprocessing of your data, save the data by clicking the
Save … button. You will use this saved file for model building.
FIGURE 5.10
Snapshot of removing attributes.
FIGURE 5.11
Sub-window of filter.
FIGURE 5.12
Snapshot of another filter example.
FIGURE 5.13
After removing attributes (temperature and humidity).
• Regression: The regression technique helps a machine learning model predict
continuous values, for example, the price of a house.
• Classification: The input is divided into one or more classes or categories, and the
learner produces a model that assigns unseen inputs to these classes. For example,
in the case of email fraud, we can divide the emails into two classes, namely
“spam” and “not spam.”
5.2.7 Running R Commands
Method 1: R commands can run from the console provided in RStudio (Figure 5.16). After
opening RStudio, simply type R commands to the console.
Method 2: R commands can be stored in a file and executed from an Anaconda prompt
(Figure 5.17) using the command:
Rscript <FILE_NAME>.R
FIGURE 5.15
Creating new environments.
FIGURE 5.16
Running R Command—Method 1.
FIGURE 5.17
Running R Command—Method 2.
1. Open RStudio and click the Install Packages option under Tools, which is present
in the menu bar (Figure 5.18).
2. Enter the names of all the packages you want to install, separated by spaces or
commas, and then click Install (Figure 5.19).
FIGURE 5.18
Step 1: Installing packages through RStudio.
FIGURE 5.19
Step 2: Installing the packages by giving name of the package.
While downloading the packages, you might be prompted to choose a CRAN mirror. It is
recommended to choose the location closest to you for a faster download.
FIGURE 5.20
Installing the packages through R console or terminal.
learning, and so on. Here let’s discuss some of the important machine learning packages
by demonstrating an example.
• Preparing the data set: Before using these packages, first of all import the data set
into RStudio, clean the data set, and split the data into train and test data sets.
• CARET: Caret stands for classification and regression training. The CARET pack-
age is used for performing classification and regression tasks. It consists of many
other built-in packages.
• ggplot2: R is most famous for its visualization library ggplot2. It provides an aes-
thetic set of graphics that are also interactive. The ggplot2 package is used for
creating plots and for visualizing data.
• randomForest: The randomForest package allows us to use the random forest
algorithm easily.
• nnet: The nnet package uses neural networks in deep learning to create layers
which help in training and predicting models. The loss (the difference between the
actual value and predicted value) decreases after every iteration of training.
• e1071: The e1071 package is used to implement the support vector machines, naive
Bayes algorithm, and many other algorithms.
• rpart: The rpart package is used to partition data. It is used for classification and
regression tasks. The resultant model is in the form of a binary tree.
• dplyr: Like rpart, the dplyr package is also a data manipulation package. It helps
manipulate data by using functions such as filter, select, and arrange.
Types
• Regression
• Logistic regression
• Classification
• Naïve Bayes classifiers
• Decision trees
• Support vector machines
Implementation in R
Let’s implement one of the very popular supervised learning algorithms, namely simple
linear regression, in R programming. Simple linear regression is a statistical method that
allows us to summarize and study relationships between two continuous (quantitative)
variables. One variable, denoted x, is regarded as an independent variable, and the other
one, denoted y, is regarded as a dependent variable. It is assumed that the two variables
are linearly related. Hence, we try to find a linear function that predicts the response value
(y) as accurately as possible as a function of the feature or independent variable (x). The
basic syntax for regression analysis in R is as follows:
Syntax:
lm(Y ~ model)
where
Y is the object containing the dependent variable to be predicted and model is the formula for the
chosen mathematical model.
The command lm() provides the model’s coefficients but no further statistical information.
In R, simple linear regression is implemented using the lm() function.
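What lm(y ~ x) estimates can be sketched in a few lines using the closed-form least-squares solution (a Python stand-in with made-up data, not the book's R listing):

```python
def simple_linear_regression(x, y):
    """Ordinary least squares for y = b0 + b1*x (what lm(y ~ x) fits)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x)
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x   # intercept
    return b0, b1

# Made-up example data: y is roughly 2*x - 10.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [-8.0, -6.1, -3.9, -2.0, 0.0]
b0, b1 = simple_linear_regression(x, y)
```

As with lm(), the result is just the coefficients; R additionally exposes diagnostics through summary().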
• Clustering: Clustering is the task of discovering the inherent groupings in the data
where members within the same group are similar to each other and dissimilar to
members of other groups. Examples include grouping students based on their
interests or grouping customers based on their purchasing behaviour.
• Association: An association rule learning problem is where you want to discover
interesting hidden relationships and associations from large databases. Examples
include rules such as “if a customer buys milk and ghee, they are likely to also buy sugar.”
Types
Clustering:
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Clustering Types:
1. Hierarchical clustering
2. K-means clustering
3. KNN (k nearest neighbors)
4. Principal component analysis
5. Singular value decomposition
6. Independent component analysis
In R, k-means clustering is performed with kmeans(x, centers, nstart), where
• x is the numeric data,
• centers is the predefined number of clusters, and
• nstart sets how many times the algorithm's random initialization is repeated to
improve the returned model.
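The idea behind repeated random starts can be sketched as follows (a minimal 1-D k-means in Python, not R's implementation; function names are my own):

```python
import random

def kmeans_1d(xs, k, nstart=5, iters=20, seed=0):
    """Plain k-means on 1-D data, restarted nstart times; returns the
    centers with the lowest within-cluster sum of squares (WCSS)."""
    rng = random.Random(seed)
    best_centers, best_wcss = None, float("inf")
    for _ in range(nstart):
        centers = rng.sample(xs, k)            # random initialization
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for x in xs:                       # assignment step
                clusters[min(range(k), key=lambda c: (x - centers[c]) ** 2)].append(x)
            centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]   # update step
        wcss = sum(min((x - c) ** 2 for c in centers) for x in xs)
        if wcss < best_wcss:                   # keep the best restart
            best_centers, best_wcss = sorted(centers), wcss
    return best_centers

# Two obvious groups, around 1 and around 10.
centers = kmeans_1d([0.9, 1.0, 1.1, 9.9, 10.0, 10.1], k=2)
```

Because each restart may converge to a different local optimum, keeping the lowest-WCSS result is exactly why R's nstart argument exists.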
The tools and software described in this chapter can be used for implementation of the
programming assignments given in subsequent chapters.
5.3 Summary
• Introduced the features of Weka.
• Described the application interfaces available in Weka.
• Outlined the installation of the Weka tool.
• Discussed the tabs of Weka Explorer used for data preprocessing, selection of attri-
butes, removing attributes, and visualization.
• Outlined the use of R programming for supervised as well as unsupervised
machine learning.
• Listed some advantages of using R programming.
5.4 Points to Ponder
• Weka is a free, open-source tool having a collection of machine learning algorithms.
• The Weka tool provides the facility to upload a data set, choose from among avail-
able machine learning algorithms, and analyze the results.
• The R language is a convenient method to carry out programming for machine
learning as it provides several powerful packages to carry out the task.
E.5 Exercises
E.5.1. Suggested Activities
E.5.1.1 Configure a machine learning experiment with the iris flower data set and three
algorithms ZeroR, OneR, and J48 available in Weka. Explain the data set and
algorithms used. You are required to analyze the results from an experiment and
understand the importance of statistical significance when interpreting results.
E.5.1.2 Use the air quality data set and carry out data visualization using a histogram,
box plot, and scatter plot with R programming.
Self-Assessment Questions
E.5.2 Multiple Choice Questions
E.5.2.1 Which one of the following options is used to start the Weka application from
the command line?
i java -jar weka.jar
ii java jar weka.jar
iii java weka.jar
E.5.2.2 Which one of the following tabs does not exist in the Weka Explorer options?
i Associate
ii Cleaning
iii Visualize
E.5.2.3 The data file format which is not supported by the Weka tool is
i .arff
ii .tsv
iii .txt
E.5.2.4 Which one of the following options contains the j48 algorithm?
i Cluster
ii Classify
iii Associate
E.5.2.6 How does one activate an Anaconda environment from the command line?
i anaconda activate <ENVIRONMENT_NAME>
ii activate conda <ENVIRONMENT_NAME>
iii conda activate <ENVIRONMENT_NAME>
E.5.2.10 Which of the following can be used to impute data sets based only on informa-
tion in the training set?
i Postprocess
ii Preprocess
iii Process
E.5.3 Match the Following
E.5.3.1 Weka A Graphical interface to perform the data mining tasks on raw data
E.5.3.2 Supervised learning B Used to see the possible visualization produced on the data as
scatter plot or bar graph output
E.5.3.3 Explorer C Building a decision model from labelled training data
E.5.3.4 .libpaths() D Used to view all installed packages
E.5.3.5 Visualize E Used for seeing currently active libraries
E.5.3.6 library F Has functions for examining and cleaning dirty data
E.5.3.7 getOption(defaultPackages) G “Recommended” package in R
E.5.3.8 janitor H Explorer with drag and drop functionality that supports
incremental learning
E.5.3.9 Spatial I Data mining tool
E.5.3.10 Knowledge flow J Shows the default packages in R
E.5.4 Short Questions
6
Classification Algorithms
6.1 Introduction
Classification is the process of assigning any unknown data X a known class label Ci
from the set C = {C1, C2, …, Cn}, where Ci ∈ C. Classification algorithms help you divide your
data into different classes, much as books in a library are sorted under various topics.
Classification is a machine learning technique which identifies any particular data to be
of one target label or another. Figure 6.1 shows two classes, class A and class B. The items
of the same class are similar to each other while items of different classes are dissimilar to
each other.
For any classification model, there are two phases: training and testing.
During the training phase, the model is built by learning from a supplied training data
set that has many instances with appropriate attributes and corresponding output labels.
Once trained, the model has to be tested to see how well it has learnt. For this we need
a testing data set which is similar to the training data; the model is evaluated by how
accurately it predicts a class label for each test instance, and hence by its accuracy.
The most basic example for classification can be how incoming emails are segregated in
our mail box as either “spam” or “not spam.” As another example, suppose we want to
FIGURE 6.1
Two class problem.
predict whether our customers would buy a particular product or not according to the
customer data we have. In this case, the target classes would be either “Yes” or “No.”
On the other hand, we might want to classify vegetables according to their weight, size,
or color. In this scenario, the available target classes might be spinach, tomato, onion,
potato, and cabbage. We might perform gender classification as well, where the target
classes would be female and male.
Classification can be of three types: binary classification, when we have only
two possible classes; multi-class classification, when we are dealing with more than two
classes where every sample is assigned only a single target label; and multi-label classifi-
cation, when each sample can have more than one target label assigned to it.
6.1.2 Binary Classification
A binary classification classifies examples into one of two classes. Examples include emails
classified as spam and not spam, customer retention where customers are classified as not
churned and churned, customers’ product purchase classified as bought an item and not
bought an item, and cancer detection classified as no cancer detected and cancer detected.
Generally, we use binary values where the normal state is assigned a value of 0 and the
abnormal state a value of 1. Some of the most popular algorithms used for binary classifi-
cation are as follows:
• k-nearest neighbors
• Logistic regression
• Support vector machine
• Decision trees
• Naive Bayes
6.1.3 Multi-Class Classification
Multi-class types of classification problems can deal with more than two classes. An exam-
ple is a face recognition system, which uses a huge number of labels for predicting a pic-
ture as to how closely it might belong to one of tens of thousands of faces. Some of the
common algorithms used for multi-class classification are as follows:
• k-nearest neighbours
• Naive Bayes
• Decision trees
• Gradient boosting
• Random forest
We can also use binary classification algorithms for multi-class classification, either
one-versus-rest, where we define a binary model for each class against all the others, or
one-versus-one, where we define a binary model for every pair of classes.
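The one-versus-rest reduction can be sketched as follows, using a deliberately simple "distance to rest-centroid minus distance to class-centroid" scorer as the underlying binary model (the scorer is a stand-in for illustration, not a method the book prescribes):

```python
def train_one_vs_rest(X, y):
    """For each class, build a binary 'this class vs. the rest' scorer:
    score(x) = distance to the rest-centroid minus distance to the
    class-centroid, so positive scores mean 'looks like this class'."""
    def centroid(points):
        return tuple(sum(col) / len(points) for col in zip(*points))
    models = {}
    for label in set(y):
        pos = [x for x, yi in zip(X, y) if yi == label]
        neg = [x for x, yi in zip(X, y) if yi != label]
        models[label] = (centroid(pos), centroid(neg))
    return models

def predict(models, x):
    def score(label):
        pos_c, neg_c = models[label]
        dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5
        return dist(neg_c) - dist(pos_c)   # binary confidence for 'label vs rest'
    return max(models, key=score)          # one-vs-rest: argmax over binary scores

# Three small made-up classes in 2-D.
X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (11, 0)]
y = ["a", "a", "b", "b", "c", "c"]
models = train_one_vs_rest(X, y)
```

Each binary scorer is trained independently; the multi-class prediction simply picks the class whose binary scorer is most confident.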
required to address this type of problem. We can simply determine the category or class of
a data set with the help of KNN.
Because it delivers highly precise predictions, the KNN algorithm can compete with the
most accurate models. As a result, the KNN algorithm can be used in applications that
require high accuracy but don’t require a human-readable model.
The distance measure affects the accuracy of the predictions. As a result, the KNN
method is appropriate for applications with significant domain knowledge. This under-
standing aids in the selection of an acceptable metric.
6.2.3 Working of KNN
The following example is explained using Figures 6.2–6.4. In Figure 6.2, the “initial data”
is a graph where data points are plotted and clustered into classes, and a new example to
classify is present. In Figure 6.3, the “calculate distance” graph, the distance from the new
example data point to the closest trained data points is calculated. However, this still does
not categorize the new example data point. Therefore, using a k-value essentially creates a
neighbourhood within which we can classify the new example data point.
FIGURE 6.2
Initial data.
FIGURE 6.3
Calculate distance.
FIGURE 6.4
Finding neighbors and voting for labels.
Figure 6.4 shows different values of k. If k = 3, the new data point will belong to class B,
as there are more trained class B data points with characteristics similar to the new data
point than class A data points. If we increase the k-value to 7, the new data point will
belong to class A, as there are now more trained class A data points with similar
characteristics than class B data points.
Euclidean Distance $= \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$   (6.1)

Manhattan Distance $= \sum_{i=1}^{k} |x_i - y_i|$   (6.2)

Minkowski Distance $= \left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}$   (6.3)
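The three distance measures translate directly into code (a small sketch; note that Minkowski reduces to Manhattan at q = 1 and to Euclidean at q = 2):

```python
def euclidean(x, y):
    # Equation 6.1: square root of the summed squared differences
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    # Equation 6.2: summed absolute differences
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, q):
    # Equation 6.3: generalizes both of the above via the exponent q
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)
```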
Figure 6.5 explains the difference between the three distance functions.
FIGURE 6.5
Distance functions.
FIGURE 6.6
Training error rate.
FIGURE 6.7
Validation error curve.
6.2.6 KNN Algorithm
• Store all input data in the training set.
• Initialize the value of k.
• For each sample in the test set,
◦ Search for the k nearest samples to the input sample using a Euclidean distance
measure. (Other distances such as Hamming or Manhattan, or metrics such as
Chebyshev or cosine, can also be used.)
• For classification, compute the confidence for each class as
◦ Ci/k (where Ci is the number of samples among the k nearest neighbors belong-
ing to class i).
◦ The classification for the input sample is the class with the highest confidence.
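The steps above can be sketched directly (Euclidean distance, majority vote with per-class confidence Ci/k; the data and names are illustrative):

```python
from collections import Counter

def knn_classify(train, test_point, k=3):
    """train: list of (features, label) pairs.
    Returns (predicted_label, confidence), confidence = Ci / k."""
    def euclidean(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
    # Find the k nearest training samples to the test point.
    neighbors = sorted(train, key=lambda s: euclidean(s[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]   # class with the highest confidence
    return label, count / k

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]
label, conf = knn_classify(train, (1.5, 1.5), k=3)
```

Note that all computation happens at prediction time, which is why KNN is called a lazy, instance-based method.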
The KNN approach also has some disadvantages:
1. High requirements on memory. We need to store all the data in memory in order
to run the model.
2. Computationally expensive. Recall that the model works in the way that it selects
the k nearest neighbors. This means that we need to compute the distance between
the new data point to all the existing data points, which is quite expensive in
computation.
3. Sensitive to noise. Imagine we pick a really small k; the prediction result will be
highly impacted by the noise if there is any.
1. Root node: This is the initial node of the decision tree and represents the entire
population. A decision tree is grown by starting at the root node and dividing or
splitting the sample space into homogeneous sets.
FIGURE 6.8
Components of decision trees.
2. Splitting: Starting from the root node and dividing a node into two or more sub-
nodes is called splitting. There exist several methods to split a decision tree, involv-
ing different metrics (e.g., information gain, Gini impurity).
3. Decision node: The internal node of a decision tree which can be split into sub-
nodes is called the decision node.
4. Leaf/terminal node: This represents a final node and cannot be further split.
5. Pruning: By removing sections that are not critical, in other words removing some
sub-nodes of a decision node, we reduce the size of the decision tree. This pruning
process enables better generalization and less overfitting, leading to better predictive
accuracy.
6. Branch: A subtree rooted at the root or any of the decision nodes is called a branch.
7. Parent and child node: A node that is divided into sub-nodes is called the parent
node, and the sub-nodes are its children.
Decision trees are created by the process of splitting the input space or the examples into
several distinct, nonoverlapping sub-spaces. Each internal node in the tree is tested for the
attribute, and based on all features and threshold values we find the optimal split. On find-
ing the best split, we continue to grow the decision tree recursively until stopping criteria
such as maximum depth are reached. This process is called recursive binary splitting.
• Initially, while starting the training, the complete training set is taken to be the root.
• Decision trees work based on categorical feature values. If we are dealing with
continuous values, they need to be discretized prior to building the model.
• The recursive distribution of the samples is based on attribute values.
• The placing of attributes as root or internal nodes is decided by using some infor-
mation theoretic measures.
The decision trees are a disjunctive normal form of the attribute set. Thus the form of a
decision rule is limited to conjunctions (and-ing) of terms, with disjunction (or-ing) allowed
over a set of rules. The attribute to be tested at each nonterminal node is identified by the
model. The attributes are split into mutually exclusive disjoint sets until a leaf node is
reached.
An important aspect of decision trees is the identification of the attribute to be consid-
ered as root as well as attribute at each level, or in other words, how do we decide the best
attribute? The best choice of attribute will result in the smallest decision tree. Now the
basic heuristic is to choose the attribute that produces the “purest” child nodes.
6.3.6.1 Entropy
Entropy is a measure of the purity or the degree of uncertainty of a random variable. The
distribution is called pure when all the items are of the same class (Equation 6.4).
Entropy $= -\sum_{i=1}^{n} p_i \log_2 p_i$   (6.4)
6.3.6.2 Gini Impurity
If all elements are accurately split into different classes, the division is called pure (an
ideal scenario). In order to predict the likelihood of a randomly chosen example being
incorrectly classified, we use the Gini impurity, wherein impurity indicates how far the model
departs from a pure division and ranges from 0 to 1. A Gini impurity of 1 suggests that all
items are scattered randomly across various classes, whereas a value of 0.5 shows that the
elements are distributed uniformly across some classes (Equation 6.5).
$\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$   (6.5)
The Gini coefficients are calculated for the sub-nodes, and then the impurity of each
node is calculated using a weighted Gini score.
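Equation 6.5 and the weighted score can be sketched numerically (the node counts below are made up for illustration):

```python
def gini(class_counts):
    """Gini impurity of one node, Equation 6.5: 1 - sum(p_i ** 2)."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

def weighted_gini(children):
    """Impurity of a split: Gini of each child node weighted by its size."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# A pure node has impurity 0; a perfectly mixed two-class node has 0.5.
split = weighted_gini([[4, 0], [1, 3]])  # two child nodes after a split
```

The split with the lowest weighted Gini score is preferred when growing the tree.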
6.3.6.3 Information Gain
When it comes to measuring information gain, the concept of entropy is key. Information
gain is based on information theory and identifies the most important attributes that con-
vey the most information about a class. Information gain is the difference in entropy before
and after splitting, which describes in turn the impurity of in-class items (Equation 6.6).
For the calculation of information gain, we first calculate for each split the entropy of
each child node independently. Then we calculate the entropy of each split using the
weighted average entropy of child nodes. Then we choose the split with the lowest entropy
or the greatest gain in information. We repeat these steps until we obtain homogeneous
split nodes.
The data set S:

a  b  class
0  0  positive
0  1  positive
1  0  negative
1  1  positive
0  0  negative
The information gain helps us to decide which feature we should test first when adding a
new node. It's the expected amount of information we get by inspecting the feature.
Intuitively, the feature with the largest expected amount is the best choice. That’s because
it will reduce our uncertainty the most on average.
$H(S) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.971$
That’s the measure of our uncertainty in a random object’s class (assuming the previous
checks got it to this point in the tree). If we choose a as the new node’s test feature, we’ll get
two sub-trees covering two subsets of S:
$H(S \mid a=0) = H\left(\frac{2}{3}, \frac{1}{3}\right) \approx 0.918$

$H(S \mid a=1) = H\left(\frac{1}{2}, \frac{1}{2}\right) = 1$
$IG(a) = P(a=0)\,[H(S) - H(S \mid a=0)] + P(a=1)\,[H(S) - H(S \mid a=1)]$
$= [P(a=0) + P(a=1)]\,H(S) - P(a=0)\,H(S \mid a=0) - P(a=1)\,H(S \mid a=1)$
$= H(S) - P(a=0)\,H(S \mid a=0) - P(a=1)\,H(S \mid a=1)$
$= 0.971 - 0.6 \times 0.918 - 0.4 \times 1 = 0.0202$

Similarly, splitting on b:

$IG(b) = H(S) - P(b=0)\,H(S \mid b=0) - P(b=1)\,H(S \mid b=1)$
$= 0.971 - 0.6 \times 0.918 - 0.4 \times 0 = 0.4202$
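The worked example can be checked with a short script (the helper name entropy is my own; the figures 0.0202 and 0.4202 reflect intermediate rounding of 0.918):

```python
from math import log2

def entropy(probs):
    """Shannon entropy of a class distribution, Equation 6.4."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Data set S: 3 positive and 2 negative examples (see the table above).
h_s = entropy([3/5, 2/5])                       # about 0.971

# Splitting on a: a=0 covers 3 rows (2 pos, 1 neg); a=1 covers 2 rows (1 pos, 1 neg).
ig_a = h_s - 3/5 * entropy([2/3, 1/3]) - 2/5 * entropy([1/2, 1/2])

# Splitting on b: b=0 covers 3 rows (1 pos, 2 neg); b=1 covers 2 rows (2 pos, 0 neg).
ig_b = h_s - 3/5 * entropy([1/3, 2/3]) - 2/5 * entropy([1.0])
```

Since ig_b is far larger than ig_a, feature b would be chosen for the first split.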
$IG(a) = H(S) - \sum_{i=1}^{n} p_i\, H(S \mid a = a_i)$   (6.7)
The gain cannot be negative even though individual pieces of information can have a
negative contribution.
6.4.1 Nonlinear Data
Now this example was easy, since clearly the data was linearly separable—we could draw a
straight line to separate circles and diamonds. Consider the example shown in Figure 6.11a.
FIGURE 6.9
Sample labelled data.
FIGURE 6.10
Not all hyperplanes are created equal.
Here there is no linear decision boundary that separates both classes. However, the vectors
are very clearly segregated. To tackle this example, we need to add a third dimension. We
create a new z dimension, calculated in a way that is convenient for us: z = x² + y² (the
equation of a circle). This gives us a three-dimensional space as shown in Figure 6.11b.
Note that since we are now in three dimensions, the hyperplane is a plane parallel to the
xy plane at a certain z (let's say z = 1) as shown in Figure 6.11c. Finally, we can visualize
the separation as in Figure 6.11d.
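The lifting step can be sketched in a few lines (a minimal example with points placed deterministically on two circles, mirroring the circles-and-diamonds picture):

```python
import math

# Two classes that are not linearly separable in 2D: points on an inner
# circle (radius 0.5) and points on an outer circle (radius 2).
def circle(radius, n=40):
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def lift(point):
    """Map (x, y) -> (x, y, z) with z = x**2 + y**2, as in the text."""
    x, y = point
    return (x, y, x ** 2 + y ** 2)

inner = [lift(p) for p in circle(0.5)]
outer = [lift(p) for p in circle(2.0)]

# In the lifted space the plane z = 1 separates the classes perfectly:
# every inner point has z = 0.25 and every outer point has z = 4.0.
print(all(z < 1 for _, _, z in inner))  # True
print(all(z > 1 for _, _, z in outer))  # True
```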
6.4.2 Working of SVM
Step 1: The SVM algorithm takes data points labelled as belonging to one of two
classes, 1 or –1, and discovers the hyperplane or decision boundary that best
separates the classes.
Step 2: Machine learning algorithms in general represent the problem to be solved as
a mathematical equation with unknown parameters, which are then found by
formalizing the problem as an optimization problem. In the case of the SVM
classifier, the optimization aims at adjusting the parameters. A hinge loss function
is used to find the maximum margin (Equation 6.8). This hinge loss not only
penalizes misclassified samples but also correctly classified ones that fall
within a defined margin from the decision boundary.
FIGURE 6.11
Nonlinear hyperplane.
Hinge Loss Function

c(x, y, f(x)) = 0             if y · f(x) ≥ 1
                1 − y · f(x)  otherwise      (6.8)
Step 3: The hinge loss function is a special type of cost function whose value is zero
when all classes are correctly predicted; otherwise the error/loss (Equation 6.9)
needs to be calculated. There is therefore a trade-off between maximizing the
margin and the loss generated, and in this context a regularization parameter is
introduced which decides the extent of misclassification allowed by the SVM
optimization. Large values of the regularization parameter make the optimization
choose a smaller-margin hyperplane if that hyperplane classifies all training
points correctly. Small values of the regularization parameter, on the other hand,
make the optimization choose a larger-margin separating hyperplane, even at the
cost of more misclassifications.
Loss Function for SVM

min_w  λ‖w‖² + Σᵢ₌₁ⁿ (1 − yᵢ⟨xᵢ, w⟩)₊      (6.9)

Taking partial derivatives with respect to the weights:

∂/∂wₖ (λ‖w‖²) = 2λwₖ      (6.10)

∂/∂wₖ (1 − yᵢ⟨xᵢ, w⟩)₊ = 0          if yᵢ⟨xᵢ, w⟩ ≥ 1
                         −yᵢ xᵢₖ    otherwise      (6.11)
Step 5: Only the regularization parameter is used to update the gradient when there is
no error in the classification (Equation 6.12), while the loss function is also
used when there is a misclassification (Equation 6.13).

w = w − α · (2λw)      (6.12)

w = w + α · (yᵢxᵢ − 2λw)      (6.13)
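A minimal sketch of this training loop, assuming illustrative values for the learning rate α and the regularization weight λ (neither is specified in the text) and a tiny hand-made data set:

```python
def train_svm(X, y, alpha=0.01, lam=0.01, epochs=1000):
    """Linear SVM trained with the gradient updates of Equations 6.12/6.13.

    X: list of feature vectors; y: labels in {-1, +1}.
    A constant 1 is appended to each vector so the bias is learned inside w.
    """
    X = [x + [1.0] for x in X]
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wk * xk for wk, xk in zip(w, xi))
            if margin >= 1:
                # Correct side, outside the margin: only the regularization
                # term contributes (Equation 6.12).
                w = [wk - alpha * 2 * lam * wk for wk in w]
            else:
                # Inside the margin or misclassified: the hinge-loss gradient
                # also contributes (Equation 6.13).
                w = [wk + alpha * (yi * xk - 2 * lam * wk)
                     for wk, xk in zip(w, xi)]
    return w

# A tiny linearly separable problem (hypothetical data for illustration).
X = [[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]]
y = [1, 1, -1, -1]
w = train_svm(X, y)
preds = [1 if sum(wk * xk for wk, xk in zip(w, x + [1.0])) >= 0 else -1
         for x in X]
print(preds)  # [1, 1, -1, -1]
```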
6.4.3.1 Support Vectors
Support vectors are those data points based on which the margins are calculated and maxi-
mized. One of the hyper-parameters to be tuned is the number of support vectors.
6.4.3.2 Hard Margin
A hard margin enforces the condition that all the data points be classified correctly and
there is no training error (Figure 6.13a). While this allows the SVM classifier to have no
error, the margins shrink, thus reducing generalization.
FIGURE 6.12
Concepts of SVM.
FIGURE 6.13
(a) Hard margin classification, (b) soft margin classification.
6.4.3.3 Soft Margin
Soft margins are used when we need to allow some misclassification to increase the
generalization of the classifier and avoid overfitting. They are also used when the
training set is not linearly separable or the data set is noisy. A soft margin allows a
certain number of errors while at the same time keeping the margin as large as possible,
so that other points can still be classified correctly. This is done simply by modifying
the objective of the SVM (Figure 6.13b): slack variables are introduced, which allow
misclassification of difficult or noisy examples. In other words, we allow some errors
and let some points fall on the wrong side, but at a cost.
6.4.3.4 Different Kernels
In general, working in a high-dimensional feature space is computationally expensive,
since while constructing the maximal margin hyperplane we need to evaluate high-
dimensional inner products. Moreover, there are problems which are nonlinear in nature.
Here the "kernel trick" comes to the rescue. The use of kernels makes the SVM algorithm
a powerful machine learning algorithm. For many mappings from a low-dimensional
space to a high-dimensional space, there is a simple operation on two vectors in the low-
dimensional space that can be used to compute the scalar product of their two images in
the high-dimensional space (Figure 6.14).
Depending on the problem, you can use different types of kernel functions: linear, polynomial,
radial basis function, Gaussian, Laplace, and so on. Choosing the right kernel function
is important for building the classifier, and another hyperparameter is used to select kernels.
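As an illustration of the "simple operation" mentioned above, consider the degree-2 polynomial kernel K(u, v) = (u · v)². The sketch below checks that squaring the dot product in 2D matches the explicit inner product in the 3D feature space of the standard feature map for this kernel (the example vectors are our own):

```python
import math

def phi(x):
    """Explicit degree-2 feature map for 2D input:
    phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(u, v):
    """K(u, v) = (u . v)^2, computed entirely in the low-dimensional space."""
    return (u[0] * v[0] + u[1] * v[1]) ** 2

u, v = (1.0, 2.0), (3.0, 0.5)

explicit = sum(a * b for a, b in zip(phi(u), phi(v)))  # inner product in 3D
trick = poly_kernel(u, v)                              # same value, cheaper
print(explicit, trick)  # both equal 16.0 (up to rounding)
```

The kernel evaluates one dot product and one squaring, yet agrees with the inner product of the lifted vectors; this is what lets SVMs work in very high-dimensional (even infinite-dimensional) feature spaces implicitly.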
6.4.4 Tuning Parameters
The most important parameters used for tuning the SVM classifier are as follows:
• Kernel: We have already discussed how important kernel functions are. Depending
on the nature of the problem, the right kernel function has to be chosen as the
kernel-function defines the hyperplane chosen for the problem.
FIGURE 6.14
SVM kernel functions.
Advantages
• SVM is a mathematically sound algorithm, making it one of the most accurate
machine learning algorithms.
• SVM can handle a range of problems, both linear and nonlinear, whether binary
or multi-class classification, or regression.
• SVM is based on the concept of maximizing margins between classes and hence
differentiates the classes well.
• The SVM model is designed to reduce overfitting, thus making the model highly
stable.
• Due to the use of kernels, SVM handles high-dimensional data.
• SVM is computationally fast and allows good memory management.
Disadvantages
• While SVM is fast and can work with high-dimensional data, it is less efficient
when compared with the naive Bayes method. Moreover, the training time is rela-
tively long.
• The performance of SVM is sensitive to the kernel chosen.
• SVM algorithms are not highly interpretable, especially when kernels are used to
handle nonlinearly separable data. In other words, it is not possible to understand
how the independent variables affect the target variable.
• Computational cost is high in tuning the hyper-parameters, especially when deal-
ing with a huge data set.
SVM is a widely used algorithm in areas such as handwriting recognition, image classifi-
cation, anomaly detection, intrusion detection, text classification, time series analysis, and
many other application areas where machine learning is used.
6.5 Use Cases
Decision trees are very easily interpretable and hence are used in a wide range of
industries and disciplines, as follows.
6.5.1 Healthcare Industries
In healthcare industries, a decision tree can be used to classify whether a patient is suffer-
ing from a disease or not based on attributes such as age, weight, sex, and other factors.
Decision trees can be used to predict the effect of a medicine on patients based on factors
such as composition, period of manufacture, and so on. In addition, decision trees can be
used in the diagnosis of medical reports.
FIGURE 6.15
Decision tree—disease recovery.
Figure 6.15 represents a decision tree predicting whether a patient will be able to recover
from a disease or not.
6.5.2 Banking Sectors
A decision tree can be used to decide the eligibility of a person for a loan based on attributes
such as financial status, family members, and salary. Credit card frauds, bank schemes and
offers, loan defaults, and so on can also be predicted using decision trees.
Figure 6.16 represents a decision tree about loan eligibility.
6.5.3 Educational Sectors
Decision trees can be used for short-listing students based on their merit scores, attendance,
overall score, and so on. A decision tree can also be used to decide the overall
promotional strategy for faculty in universities.
Figure 6.17 shows a decision tree showing whether a student will like the class or not
based on their prior programming interest.
FIGURE 6.16
Decision tree—loan eligibility.
FIGURE 6.17
Decision tree—student likes class or not.
6.6 Summary
• Classification is a machine learning technique which assigns a label value to
unknown data.
• When the classification is between two classes, it is termed as binary classification.
• Multi-class types of classification problems have more than two classes.
• Multi-label classification is where two or more specific class labels may be assigned
to each example.
• k-nearest neighbors (KNN) is a supervised machine learning algorithm and is
used for both regression and classification tasks.
• The most common methods used to calculate distance in KNN are the Euclidean,
Manhattan, and Minkowski distances.
• Decision tree algorithms can be used for both regression and classification.
• Support vector machines (SVMs) are mainly used for classification.
6.7 Points to Ponder
• Classification can be of three types, namely binary classification, multi-class clas-
sification, and multi-label classification.
• The “K” in KNN decides the number of nearest neighbors considered. k is a pos-
itive integer and is typically small in value and is recommended to be an odd
number.
• The two parameters associated with KNN are k (the number of neighbors that will
vote) and the distance metric.
• The Minkowski distance is the distance between two points in the normed vec-
tor space and is a generalization of the Euclidean distance and the Manhattan
distance.
• The internal node of a decision tree which can be split into sub-nodes is called the
decision node.
• Entropy is a measure of purity or the degree of uncertainty of a random variable.
• Support vectors are those data points based on which the margins are calculated
and maximized.
• The use of kernels makes the SVM algorithm a powerful machine learning
algorithm.
E.6 Exercises
E.6.1 Suggested Activities
E.6.1.1 The teacher can list various real-world tasks and ask the students to identify
whether each is a classification task or not.
E.6.1.2 Readers can identify various news events from social media, newspapers, and
so on pertaining to classification tasks and justify why each is classification.
E.6.2 Self-Assessment Questions
E.6.2.1 Projects:
Readers are requested to download data sets from the UCI repository and try
performing various types of classification tasks mentioned in this chapter.
E.6.3.5 The distance between two points that is the sum of the absolute difference of
their Cartesian coordinates is called the
i Manhattan distance
ii Euclidean distance
iii Minkowski distance
E.6.3.6 The KNN algorithm uses two parameters to build the model, k and the distance
metric, and hence KNN is a
i Parametric model
ii Probabilistic model
iii Nonparametric model
E.6.3.8 In KNN
i There are prior assumptions on the distribution of the data
ii There are no prior assumptions on the distribution of the data
iii The data samples are normally distributed
E.6.3.11 The decision trees are the _____ of the attribute set.
i Conjunctive normal form
ii Disjunctive normal form
iii Addition of values
E.6.3.13 In SVM, the hyperplane that best separates the classes is called the
i Hyper boundary
ii SVM boundary
iii Decision boundary
No                                    Match
E.6.4.1 Multi-label classification    A. Distance between two points using the length of a line between the two points
E.6.4.2 k-nearest neighbors           B. Starting from the root node and dividing a node into two or more sub-nodes
E.6.4.3 Lazy learner                  C. Distance between two points in the normed vector space
E.6.4.4 Euclidean distance            D. Decide the recursive distribution of the samples
E.6.4.5 Pure distribution             E. When a single example is assigned two or more classes
E.6.4.6 A Gini impurity of 0.5        F. When all the items are of the same class
E.6.4.7 Regularization parameter      G. Shows that the elements are distributed uniformly across some classes
E.6.4.8 Attribute values              H. Assuming that similar data items exist within close proximity
E.6.4.9 Minkowski distance            I. Decides the extent of misclassification allowed by optimization
E.6.4.10 Splitting                    J. Because it does not create a model of the data set beforehand
E.6.5 Short Questions
7.1 Probabilistic Methods
7.1.1 Introduction—Bayes Learning
Bayes learning needs to be understood in the context of fitting a model to data and then
predicting target values for hitherto unseen data, which we discussed in Chapters
2–4. While in Chapters 2 and 4 we discussed data and fitting models to the training data,
in Chapter 3 we discussed the fundamentals of probability, the Bayes theorem, and appli-
cations of the Bayes theorem. Some of the important criteria when we fit a model are the
choice of weights and thresholds. In some cases, we need to incorporate prior knowledge,
and in other cases we need to merge multiple sources of information. Another impor-
tant factor is the modelling of uncertainty. Bayesian reasoning provides solutions to these
above issues. Bayesian reasoning is associated with probability, statistics, and data fitting.
7.1.2 Bayesian Learning
The Bayesian framework allows us to combine observed data and prior knowledge and
provide practical learning algorithms. Given a model, we can derive any probability where
we describe a model of the world and then compute the probabilities of the unknowns
using Bayes’ rule. Bayesian learning uses probability to model data and quantifies uncer-
tainty of predictions. Moreover, Bayesian learning is a generative (model based) approach
where any kind of object (e.g., time series, trees, etc.) can be classified based on a
probabilistic model. One of the assumptions is that the quantities of interest are governed
by probability distributions. The optimal decisions are based on reasoning about probabilities
and observations. This provides a quantitative approach to weighing how evidence sup-
ports alternative hypotheses. The first step in Bayesian learning is parameter estimation.
Here we are given lots of data, and using this data we need to determine unknown param-
eters. The various estimators used to determine these parameters include maximum a pos-
teriori (MAP) and maximum likelihood estimators. We will discuss these estimators later.
FIGURE 7.1
Probabilistic model describing relationship between data and hypothesis.
A probabilistic model describes the relationship between the data and the
hypothesis (H) as shown in Figure 7.1. Probability is estimated for the different hypotheses
being true given the observed data.
Given some model space (a set of hypotheses hᵢ) and evidence (data d), the Bayesian
framework works with the following:

P(hᵢ | d) = P(d | hᵢ) · P(hᵢ) / P(d)
The framework favors parameter settings that make the data likely. Prior knowledge
can be combined with observed data to determine the final probability of a hypothesis.
We now have a probabilistic approach to inference where the basic assumptions are
that the quantities of interest are governed by probability distributions and the optimal
decisions can be made by reasoning about these probabilities together with observed
training data.
7.2.1 Choosing Hypotheses
The Bayes theorem allows us to find the most probable hypothesis h from a set of candi-
date hypotheses H given the observed data d, that is, P(h|d). We will discuss two common
methods, namely maximum a posteriori (MAP) and maximum likelihood (ML).
Maximum a posteriori (MAP) hypothesis: Choose the hypothesis with the highest a
posteriori probability, given the data. In the MAP method we need to find the hypothesis
h among the set of hypothesis H that maximizes the posterior probability P(h|D). Now by
applying the Bayes theorem, we need to find the hypothesis h that maximizes the
likelihood of the data D for a given h and the prior probability of h. The prior probability
of the data D does not affect the maximization for finding h, so it is not considered.
Hence, we have the MAP hypothesis as follows (Equation 7.1):

h_MAP = argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) · P(h)      (7.1)
Maximum likelihood estimate (ML): Assume that all hypotheses are equally likely a
priori; then the best hypothesis is just the one that maximizes the likelihood (i.e., the prob-
ability of the data given the hypothesis). If every hypothesis in H is equally probable a
priori, we only need to consider the likelihood of the data D given h, P(D|h). This gives
rise to h_ML, the maximum likelihood hypothesis, as follows (Equation 7.2):

h_ML = argmax_{h∈H} P(D | h)      (7.2)
We know the likelihood of + result given cancer and – result given cancer, and similarly we
know the likelihood of + result given ¬cancer and – result given ¬cancer:
We are given that a positive result is returned, that is, d = +. We need to find the hypothesis
that has the maximum a posteriori probability (MAP); given d is + we need to find h that gives
maximum value to P(+|h)
Now the hypothesis that gives the maximum value when the data (test result) is positive
is ¬cancer.
Therefore, the patient does not have cancer.
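The MAP comparison can be made concrete with illustrative numbers; the prior and test accuracies below are assumptions for this sketch, not values given in the text:

```python
# Illustrative (assumed) numbers for the cancer example: a rare disease
# and an imperfect test.
p_cancer = 0.008            # P(cancer), assumed prior
p_pos_given_cancer = 0.98   # P(+ | cancer), assumed sensitivity
p_pos_given_not = 0.03      # P(+ | not cancer), assumed false-positive rate

# MAP: compare P(+|h) * P(h) for each hypothesis; P(+) cancels out.
score_cancer = p_pos_given_cancer * p_cancer      # 0.98 * 0.008 = 0.00784
score_not = p_pos_given_not * (1 - p_cancer)      # 0.03 * 0.992 = 0.02976

h_map = "cancer" if score_cancer > score_not else "no cancer"
print(h_map)  # no cancer: the low prior outweighs the positive test
```

With these numbers, even a positive test leaves ¬cancer as the MAP hypothesis, which is the point the example makes.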
7.2.2 Bayesian Classification
Classification predicts the value of a class c given the value of an input feature vector
x. From a probabilistic perspective, the goal is to find the conditional distribution p(c|x)
using the Bayes theorem. Despite its conceptual simplicity, Bayes classification works well.
The Bayesian classification rule states that, given a set of classes and the respective
posterior probabilities, an unknown feature vector is assigned to the class whose posterior
probability is maximum. There are two types of probabilistic models of classification, the
discriminative model and the generative model.
7.2.2.1 Discriminative Model
The most common approach to probabilistic classification is to represent the conditional
distribution using a parametric model, and then to determine the parameters using a train-
ing set consisting of pairs <xn, cn> of input vectors along with their corresponding target
output vectors. In other words, in the discriminative approach, the conditional distribution
discriminates directly between the different values of c; that is, posterior p(c|x) is directly
used to make predictions of c for new values of x (Figure 7.2). Here for example the classes
can be C1 = benign mole or C2 = cancer, which can be modelled given the respective data
points.
7.2.2.2 Generative Model
Generative models are used for modelling data directly; that is, modelling observations
drawn from a probability density function. This approach to probabilistic classification
finds the joint distribution p(x, c), expressed for instance as a parametric model, and then
subsequently uses this joint distribution to evaluate the conditional p(c|x) in order to make
predictions of c for new values of x by application of the Bayes theorem. This is known as
FIGURE 7.2
Discriminative model.
FIGURE 7.3
Generative model.
a generative approach since by sampling from the joint distribution it is possible to gener-
ate synthetic examples of the feature vector x (Figure 7.3). Here for the vector of random
variables we learn the probability for each class c given the input. In practice, the general-
ization performance of generative models is often found to be poorer than that of discrimi-
native models due to differences between the model and the true distribution of the data.
Naïve Bayes is one of the most practical Bayes learning methods. The naïve Bayes
classifier applies to learning tasks where each instance x is described by a conjunction of
attribute values and where the target function f(x) can take on any value from some finite set.
Some successful applications of the naïve Bayes method are medical diagnosis and classi-
fying text documents.
FIGURE 7.4
Naïve Bayes assumption.
However, the likelihood or the joint probability of the features of the input data given
the class, P(x₁, x₂, …, xₙ | cⱼ), is of the order of O(|X|ⁿ · |C|) and can only be estimated
if a very, very large number of training examples is available. This is where the naïveté of the
naïve Bayes method comes into the picture, that is, the conditional independence assump-
tion where we assume that the probability of observing the conjunction of attributes is
equal to the product of the individual probabilities. In other words, we assume that the
joint probability can be found by finding the individual probability of each feature given
the class (Figure 7.4).
Let us first formulate Bayesian learning. Let each instance x of a training set D be a
conjunction of n attribute values <a₁, a₂, …, aₙ> and let f(x), the target function, be such
that f(x) ∈ V, a finite set. According to the Bayesian approach using MAP, we can specify
this as follows (Equation 7.4):

v_MAP = argmax_{vⱼ∈V} P(vⱼ | a₁, a₂, …, aₙ) = argmax_{vⱼ∈V} P(a₁, a₂, …, aₙ | vⱼ) · P(vⱼ)      (7.4)

Here we try to find the value vⱼ that maximizes the posterior probability given the attribute
values. Applying the Bayes theorem, this means finding the vⱼ that maximizes the product of
the likelihood P(a₁, a₂, …, aₙ | vⱼ) and the prior probability P(vⱼ).
Now according to the naïve Bayesian approach, we assume that the attribute values are
conditionally independent, so that

P(a₁, a₂, …, aₙ | vⱼ) = ∏ᵢ P(aᵢ | vⱼ)

Therefore, the naïve Bayes classifier is: v_NB = argmax_{vⱼ∈V} P(vⱼ) ∏ᵢ P(aᵢ | vⱼ)
Now if the i-th attribute is categorical, then P(xi|C) is estimated as the relative fre-
quency of samples having value xi. However, if the ith attribute is continuous, then P(xi|C)
is estimated through a Gaussian density function. It is computationally simple in both
cases.
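The categorical case can be sketched as follows; the miniature data set is hypothetical and far smaller than the one in Figure 7.5, and no smoothing is applied, for brevity:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(class) and P(attribute value | class) by relative frequency."""
    class_counts = Counter(labels)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    cond = defaultdict(lambda: defaultdict(Counter))  # cond[attr][class][value]
    for row, c in zip(rows, labels):
        for attr, value in enumerate(row):
            cond[attr][c][value] += 1
    return priors, cond, class_counts

def predict_nb(x, priors, cond, class_counts):
    """v_NB = argmax_c P(c) * prod_i P(x_i | c)."""
    best_c, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attr, value in enumerate(x):
            score *= cond[attr][c][value] / class_counts[c]
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# A tiny hypothetical data set: (symptom, vaccinated) -> go out?
rows = [("none", "yes"), ("none", "yes"), ("fever", "no"),
        ("fever", "yes"), ("none", "no"), ("fever", "no")]
labels = ["yes", "yes", "no", "no", "yes", "no"]

model = train_nb(rows, labels)
print(predict_nb(("none", "yes"), *model))  # yes
print(predict_nb(("fever", "no"), *model))  # no
```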
FIGURE 7.5
Corona training dataset.
Training Phase: First we compute the probabilities of going out (positive) and not going
out (negative) as P(p) and P(n), respectively (Figure 7.6).
Probabilities of each of the attributes are then calculated as follows.
We need to estimate P(xi|C), where xi is each value for each attribute given C, which is
either positive (p) or negative (n) class for going out. For example, consider the attribute
symptom has the value fever for 2 of the 8 positive samples (2/8) and for 2 of the 7 negative
samples (2/7). We calculate such probabilities for every other attribute value for attribute
symptom and similarly for each value of each of the other attributes body temperature,
vaccinated, and lockdown (Figure 7.7).
Test Phase: Given a new instance x′ of variable values, we need to calculate the proba-
bility of either going out or not going out. Now assume we are given the values of the four
attributes as follows:
x′ = (Symptom = Sore throat, Body Temperature = Normal, Vaccinated = False,
Lockdown = Partial)
Here we consider the probability of a symptom being sore throat and look up the prob-
ability of going out and not going out. Similarly, we look up the probabilities for body
temperature = normal, vaccinated = false, and lockdown = partial. Now we can calculate
the probability of going out and not going out given the new instance x′ by finding the
product of each of the probabilities obtained for the value of each variable. Now we find
that P(Yes | x′) = 0.00117 and P(No | x′) = 0.003. Since P(No | x′) is greater, we conclude
that the new instance x′ will be labelled as No, as shown in Figure 7.8.
FIGURE 7.6
Probability of going out and not going out.
FIGURE 7.7
Conditional probabilities of each attribute.
FIGURE 7.8
Calculating whether going out is true given the new instance.
7.4 Bayesian Networks
Bayesian networks (BNs) are different from other knowledge-based system tools because
uncertainty is handled in a mathematically rigorous yet efficient and simple way. BN is a
graphical model that efficiently encodes the joint probability distribution for a large set
of variables. While naïve Bayes is based on an assumption of conditional independence,
Bayesian networks provide a tractable method for specifying dependencies among vari-
ables. BNs have been used for intelligent decision aids, data fusion, intelligent diagnostic
aids, automated free text understanding, data mining, and so on, in areas such as medicine,
bio-informatics, business, trouble shooting, speech recognition, and text classification.
We need to understand BNs because they effectively combine domain expert knowl-
edge with data. The representation as well as the inference methodology is efficient. It is
possible to perform incremental learning using BNs. BNs are useful for handling missing
data and for learning causal relationships.
FIGURE 7.9
Basic Bayesian network.
• Diagnosis: P(cause|symptom) =?
Now a diagnostic problem is defined as the probability of the cause given the
symptom.
• Prediction: P(symptom|cause) =?
On the other hand, a prediction problem is defined as the probability of the symp-
tom given the cause.
• Classification: max P(class|data)
In this context, classification is decided by the class that results in maximum prob-
ability given the data.
• Decision-making (given a cost function)
Decision making on the other hand is finding any of the previous but with an asso-
ciated cost function.
The joint space is V(Y₁) × V(Y₂) × … × V(Yₙ)      (7.5)
Joint probability distribution: specifies the probabilities of the items in the joint space.
Due to the Markov condition, we can compute the joint probability distribution over all the
variables X1, …, Xn in the Bayesian net using the formula.
P(X₁ = x₁, …, Xₙ = xₙ) = ∏ᵢ₌₁ⁿ P(Xᵢ = xᵢ | Parents(Xᵢ))
In the Figure 7.10 we define the joint probability p(x1, x2, x3, x4, x5, x6) as the product of
probability of x1, probability of x2 given its parent, probability of x3 given its parent, prob-
ability of x4 given both its parents x2 and x3, and so on.
An important concept we need to define in the context of Bayesian networks is conditional
independence. Conditional independence means that if X and Y are conditionally
independent given Z, then once we know Z, knowing Y does not change predictions
of X. That is, P(X|Y,Z) = P(X|Z), and the usual notation used is Ind(X;Y|Z) or
(X⊥Y|Z). Another important aspect is the Markov condition, which states that given its
parents (P1, P2), a random variable X is conditionally independent of its non-descendants
(ND1, ND2) (Figure 7.11).
FIGURE 7.10
Example of joint probability.
FIGURE 7.11
Markov condition.
FIGURE 7.12
Example of a Bayesian network.
Now let us consider the simple Bayesian network given in Figure 7.13. The conditional
probability distribution of each node given its parents is specified in that node's CPT.
For a given combination of values of the parents (B in this example), the entries
P(C = true | B) and P(C = false | B) must add up to 1 (Figure 7.13). For example,
P(C = true | B = false) + P(C = false | B = false) = 1. If we have a Boolean variable with
k Boolean parents, this table has 2^(k+1) probabilities (but only 2^k need to be stored).
FIGURE 7.13
Example showing CPTs of each node.
P( X | E )
7.4.5.1 Independence
Let us explain the concept of independence using Figure 7.14. Difficulty and Intelligence are
two random variables that are not connected by an edge, so they are independent. This
means P(D, I) = P(D)P(I), written D ⊥ I: P(D | I) = P(D) and P(I | D) = P(I). Hence
P(D, I) = P(I | D)P(D) = P(I)P(D) and P(D, I) = P(D | I)P(I) = P(D)P(I).
Letter is independent of Difficulty and Intelligence given Grade, and this is represented
as P(L | D, I, G) = P(L | G), that is, L ⊥ D, I | G. However, Difficulty is dependent on
Intelligence, given Grade. Similarly, Grade and SAT are dependent; however, given
Intelligence, Grade is independent of SAT, that is, P(S | G, I) = P(S | I).
7.4.5.2 Putting It Together
Figure 7.15 emphasizes what we have been discussing so far.
The network topology and the conditional probability tables give a compact representa-
tion of joint distribution. These networks are generally easy for domain experts to
construct.
FIGURE 7.14
Explaining independence.
FIGURE 7.15
Putting all the components together.
The conditional probability table (CPT) of each of the three values of Obesity (namely
no, moderate, high) given the different values of eating is given (Figure 7.16). The product
rule for the example is shown in Figure 7.17. Here the conditional probability P(O|E) is
multiplied by probability of P(E) to obtain joint probability P(O,E). This table can be used
for marginalization as shown in Figure 7.18. The total of the columns gives us P(O)—the
marginalized value of O. Now using the values calculated we get P(E|O) = P(O|E).P(E)/P(O)
= P(O,E)/P(O) as shown in Figure 7.19. Therefore, we can infer the probability of the state
of Eating knowing the state of Obesity.
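The product rule, marginalization, and inversion steps can be sketched numerically; the prior and CPT values below are assumptions for illustration, not the values in Figure 7.16:

```python
# Hypothetical numbers for the Eating -> Obesity network.
p_e = {"healthy": 0.7, "unhealthy": 0.3}        # assumed P(E)
p_o_given_e = {                                  # assumed CPT P(O | E)
    "healthy":   {"no": 0.80, "moderate": 0.15, "high": 0.05},
    "unhealthy": {"no": 0.30, "moderate": 0.40, "high": 0.30},
}

# Product rule: P(O, E) = P(O | E) * P(E)   (Figure 7.17)
joint = {(o, e): p_o_given_e[e][o] * p_e[e]
         for e in p_e for o in p_o_given_e[e]}

# Marginalization: P(O) = sum over E of P(O, E)   (column totals, Figure 7.18)
p_o = {o: sum(joint[(o, e)] for e in p_e) for o in ("no", "moderate", "high")}

# Bayes inversion: P(E | O) = P(O, E) / P(O)   (Figure 7.19)
p_e_given_high = {e: joint[("high", e)] / p_o["high"] for e in p_e}
print(p_e_given_high)  # infer Eating from an observed Obesity state
```

With these assumed numbers, observing Obesity = high makes "unhealthy" the far more probable eating state, even though its prior was only 0.3.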
FIGURE 7.16
Example of Bayesian network.
FIGURE 7.17
Product rule.
FIGURE 7.18
Marginalization.
FIGURE 7.19
Inferred values using marginalization.
FIGURE 7.20
Causal and diagnostic inference.
Now using the Bayes theorem, we know that P(S|G) can be deduced from P(G|S) * P(S)
divided by P(G). Now probability of Good Grades P(G) is the summation of P(G|S)*P(S) +
P(G|~S)*P(~S), that is, the probability of Good Grades is based on the probabilities of the
states of its cause (Study well), which in this case has only two states, S and ~S.
• Serial case: For the serial case, given the condition that F is known, T and C are
conditionally independent. Conditional independence is due to the fact that the
intermediate cause F is known.
• Diverging case: For the diverging case, E, the common cause for two nodes O and
B, is known; that is, the two nodes are connected through the common cause E and
hence O and B are conditionally independent. Conditional independence is due to
the fact that the common cause is known.
• Converging case: For the converging case, we have two nodes O and B, both being
causes of the node H, which in turn is the cause of the node R. Given that H, the
common effect of O and B, is not known, and neither is R, the effect of that common
effect H, then O and B are conditionally independent. Conditional independence here
is due to the fact that the common effect, and in turn its effect, are not known.
FIGURE 7.21
Conditional independence.
Step 1: First we determine what the propositional (random) variables should be.
Then we determine causal (or another type of influence) relationships and
develop the topology of the network.
Variables are identified as Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
You have a new burglar alarm installed at home. It is fairly reliable at detecting a
burglary, but also responds on occasion to minor earthquakes.
You also have two neighbors, John and Mary, who have promised to call at work
when they hear the alarm.
John always calls when he hears the alarm, but sometimes confuses the telephone
ringing with the alarm and calls then, too.
Mary, on the other hand, likes rather loud music and sometimes misses the alarm
altogether.
Given the evidence of who has or has not called, we would like to estimate the
probability of a burglary.
FIGURE 7.22
Bayesian network—alarm example.
FIGURE 7.23
Topology of the network.
FIGURE 7.24
Bayesian network with CPTs of all nodes for alarm example.
P(x₁, …, xₙ) = ∏ᵢ₌₁ⁿ P(xᵢ | Parents(Xᵢ))      (7.6)
We can see that each entry in the joint probability is represented by the product of appro-
priate elements of the CPTs in the belief network.
For the alarm example we can calculate the probability of the event that the alarm has
sounded but neither a burglary nor an earthquake has occurred, and both John and Mary
call as follows:
P(J \wedge M \wedge A \wedge \neg B \wedge \neg E) = P(J \mid A)\, P(M \mid A)\, P(A \mid \neg B, \neg E)\, P(\neg B)\, P(\neg E)
= 0.90 \times 0.70 \times 0.001 \times 0.999 \times 0.998
\approx 0.00062
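This product-of-CPT-entries calculation can be sketched in Python. The CPT values below are the standard ones for this classic example (e.g., P(B) = 0.001, P(E) = 0.002); the variable and function names are our own:

```python
# Joint probability of one full assignment in the alarm network,
# computed as the product of CPT entries (Equation 7.6).
# CPT values are the standard ones for the classic alarm example.

P_B = 0.001          # P(Burglary)
P_E = 0.002          # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(Alarm | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                    # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}                    # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# Alarm sounded, no burglary, no earthquake, both John and Mary call:
print(round(joint(False, False, True, True, True), 6))   # → 0.000628
```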
The second view is that the Bayesian network is an encoding of a collection of conditional independence statements. For the alarm example, the topology asserts, for instance, that JohnCalls and MaryCalls are conditionally independent of Burglary and Earthquake given Alarm. This view is useful for understanding inference procedures for the networks.
Causal Inference
Causal inference is basically inference from cause to effect. In Figure 7.25, given
Burglary (that is, the probability of burglary (P(B) = 1)), we can find the probability of
John calling (P(J|B)). Note that this is indirect inference since Burglary causes Alarm,
which in turn causes John to call. Therefore, we first need the probability of Alarm
ringing given that Burglary has occurred (P(A|B)).
Diagnostic Inference
Diagnostic inference is basically inference from effect to cause. In Figure 7.26, given that John calls, we can find the probability of Burglary occurring (P(B|J)). Note that this is indirect inference, since John calling was caused by the Alarm ringing (P(J|A)), which in turn was caused by a Burglary (P(A|B)).
We first apply Bayes' theorem to find P(B|J) as given in Figure 7.26. This shows that we need to know the probability of John calling, that is, P(J). However, to find P(J) we need to know the probability of the Alarm ringing, that is, P(A).
FIGURE 7.25
Causal inference.
FIGURE 7.26
Diagnostic inference.
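Diagnostic inference can be sketched by enumeration: sum the joint distribution over the hidden variables Earthquake and Alarm. Again, the CPT values below are the standard ones for this example, not taken from the text:

```python
from itertools import product

# P(Burglary | JohnCalls) by enumeration over the hidden variables.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(Alarm | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                    # P(JohnCalls | Alarm)

def joint(b, e, a, j):
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    return p

# P(B | J) = P(B, J) / P(J), marginalizing over Earthquake and Alarm.
p_bj = sum(joint(True, e, a, True) for e, a in product([True, False], repeat=2))
p_j = sum(joint(b, e, a, True) for b, e, a in product([True, False], repeat=3))
print(round(p_bj / p_j, 4))   # → 0.0163
```

Even a single call raises the probability of burglary only from 0.001 to about 0.016, because the alarm's false-alarm rate and John's spurious calls dominate.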
7.5 Regression Methods
Regression is a type of supervised learning which predicts real valued labels or vectors
(Pardoe, 2022). For example, it can predict the delay in minutes of a flight route, the price
of real estate or stocks, and so on. As another example from health care, given features such as age, sex, body mass index, average blood pressure, and blood serum measurements, regression can predict the target as a quantitative measure of disease progression. Therefore, the goal of regression is to make quantitative, that is, real-valued, predictions on the basis of a vector of features or attributes. Here we need to specify the class
of functions, decide how to measure prediction loss, and solve the resulting minimization
problem. When constructing the predictor f: X→Y to minimize an error measure err(f), the
error measure err(f) is often the mean squared error err(f) = E[(f(X) – Y)2].
For linear regression, the prediction is a linear function of the features (Equation 7.8):

y = f(x) = \sum_{j} w_j x_j + b \quad (7.8)

where y is the prediction, w are the weights, and b is the bias or intercept. Here w and b are the parameters. We wish the prediction to be close to the target, that is, y ≅ t.
The squared error loss (Equation 7.9) measures how close the prediction y is to the target t, and averaging it over the N training examples gives the cost (Equation 7.10):

\mathcal{L}(y, t) = \frac{1}{2}(y - t)^2 \quad (7.9)

\mathcal{E}(w, b) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - t_i)^2 = \frac{1}{2N} \sum_{i=1}^{N} \left( w^T x_i + b - t_i \right)^2 \quad (7.10)
Therefore, given a data set D = {(x_1, t_1), \ldots, (x_n, t_n)}, we need to find the optimal weight vector (Equation 7.11):
w^{*} = \arg\min_{w} \sum_{j=1}^{n} \left( w^T x_j - t_j \right)^2 \quad (7.11)
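A minimal sketch of solving Equation (7.11) with NumPy (the data is synthetic and the names are our own); appending a column of ones to X absorbs the bias b into the weight vector:

```python
import numpy as np

# Least squares sketch: find w minimizing sum_j (w^T x_j - t_j)^2 (Eq. 7.11).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                  # 50 samples, 2 features
t = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0       # targets from w = (3, -2), b = 1

Xb = np.hstack([X, np.ones((50, 1))])         # bias column of ones
w, *_ = np.linalg.lstsq(Xb, t, rcond=None)    # closed-form least squares
print(np.round(w, 3))                         # recovers [ 3. -2.  1.]
```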
However, the least squares fit can have high variance and may result in overfitting, which in turn affects the prediction accuracy on unseen observations. When we have a large number of input variables x in the model, there will generally be many variables that have little or no effect on the target y. Having these variables in the model makes it harder to see the effect of the "important variables" that actually affect y, and hence reduces model interpretability. One method to handle these issues is subset selection, where a subset of all p predictors that we believe to be related to the response y is identified, and the model is then fitted using this subset.
Shrinkage or Regularization
The subset selection methods use least squares to fit a linear model that contains a subset of the predictors. Alternatively, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, shrinks the coefficient estimates towards zero. In regression analysis, a fitted relationship often performs less well on a new data set than on the data set used for fitting, and shrinkage is used to regularize such ill-posed inference problems. Regularization constrains the machine learning algorithm to improve out-of-sample error, especially when noise is present. The model also becomes easier to interpret when the unimportant variables are removed (i.e., their coefficients are set to zero).
Shrinkage or regularization methods apply a penalty term to the loss function used in the model. Minimizing the loss function corresponds to maximizing the accuracy. Recall that linear regression minimizes the squared difference between the actual and predicted values to draw the best possible regression curve for the best prediction accuracy. Shrinking the coefficient estimates significantly reduces their variance. The need for shrinkage methods arises due to the issues of underfitting or overfitting the data. When we want to minimize the mean error (the MSE in the case of linear regression), we need to optimize the bias–variance trade-off.
Ridge Regression
As we know, linear regression estimates the coefficients using the values that mini-
mize the following Equation (7.12):
\text{RSS} = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \quad (7.12)
In other words, the idea of regularization is to modify the loss function by adding a regularization term that penalizes some specified properties of the model parameters (Equation 7.13). Ridge regression adds a penalty term, weighted by lambda, that shrinks the coefficients towards 0:
\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 \quad (7.13)
Here λ is a scalar that gives the weight (or importance) of the regularization term.
Shrinkage penalty λ ≥ 0 is a tuning parameter.
Ridge regression’s advantage over linear regression is that it capitalizes on the bias–variance trade-off. As λ increases, the coefficients shrink more towards 0. Ridge regression (also called L2 regularization) has the effect of “shrinking” large values towards zero. Such a constraint should improve the fit, because shrinking the coefficients can significantly reduce their variance. Ridge regression has a major disadvantage: it includes all p predictors in the output model regardless of the value of their coefficients, which can be challenging for a model with a huge number of features.
Lasso Regression
The disadvantage of ridge regression is overcome by lasso regression, which performs variable selection. Lasso regression uses an L1 penalty instead of ridge regression’s L2 penalty; rather than squaring each coefficient, it takes its absolute value, as shown in Equation (7.14):
\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \text{RSS} + \lambda \sum_{j=1}^{p} |\beta_j| \quad (7.14)
Ridge regression brings the value of the coefficients close to 0, whereas lasso regression forces some of the coefficient values to be exactly 0. It is important to optimize the value of λ in lasso regression as well to reduce the MSE. The lasso has a major advantage over ridge regression in that it produces simpler and more interpretable models that involve only a subset of the predictors. Lasso leads to qualitatively similar behavior to ridge regression, in that as λ increases the variance decreases and the bias increases; it can also generate more accurate predictions than ridge regression. In order to avoid ad hoc choices, we should select λ using cross-validation.
For a simple linear model y = mx + c, the cost function is the mean squared error (MSE):

\text{Cost Function} = \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - y_{i,\text{pred}} \right)^2 \quad (7.15)

\text{Cost Function} = \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (m x_i + c) \right)^2 \quad (7.16)
FIGURE 7.27
Bowl shape of MSE (error).
Our goal is to minimize the cost as much as possible in order to find the best-fit line, and for this purpose we can use the gradient descent algorithm. Gradient descent is an iterative algorithm that finds the best-fit line for a given training data set. If we plot m and c against MSE, the error surface acquires a bowl shape (Figure 7.27). For some combination of m and c, we will get the least error (MSE), which will give us our best-fit line. The algorithm starts with some values of m and c (usually m = 0, c = 0) and calculates the MSE (cost). Let us assume that the MSE (cost) at m = 0, c = 0 is 100. We then adjust the values of m and c by some amount (the learning step) and notice a decrease in the MSE (cost). We continue doing the same until our loss function reaches a very small value or ideally 0 (which means 0 error or 100% accuracy).
D_m = \frac{\partial(\text{cost function})}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i \left( y_i - y_{i,\text{pred}} \right) \quad (7.17)

D_c = \frac{\partial(\text{cost function})}{\partial c} = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - y_{i,\text{pred}} \right) \quad (7.18)
Now update the current values of m and c using Equations (7.19) and (7.20), where L is the learning rate (learning step). We repeat this process until the cost function is very small (ideally 0).

m = m - L \cdot D_m \quad (7.19)

c = c - L \cdot D_c \quad (7.20)
The gradient descent algorithm gives the optimum values of m and c for the linear regression equation. With these values of m and c, we obtain the equation of the best-fit line and can then make predictions.
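The update rules (7.17)-(7.20) can be sketched as follows (synthetic data; the learning rate and iteration count are illustrative choices, not prescriptions):

```python
import numpy as np

# Gradient descent for the line y = m*x + c (Equations 7.17-7.20).
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=200)   # true m = 2.5, c = 1.0

m, c, L = 0.0, 0.0, 0.01       # start at m = 0, c = 0; L is the learning rate
n = len(x)
for _ in range(5000):
    y_pred = m * x + c
    D_m = -(2 / n) * np.sum(x * (y - y_pred))   # Equation (7.17)
    D_c = -(2 / n) * np.sum(y - y_pred)         # Equation (7.18)
    m -= L * D_m                                # Equation (7.19)
    c -= L * D_c                                # Equation (7.20)

print(round(m, 2), round(c, 2))   # close to the true slope 2.5 and intercept 1.0
```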
7.5.2 Logistic Regression
Let us assume that we want to predict which soft drink customers prefer to buy, Fanta or Sprite. The target variable y is categorical: 0 or 1; however, linear regression gives us an unbounded real value rather than a probability. Let us instead try to predict the probability that a customer buys Fanta (P(Y = 1)). Thus, we can model P(Y = 1) using a function that gives outputs between 0 and 1, based on the odds ratio P(Y = 1)/(1 − P(Y = 1)). This is exactly logistic regression.
Logistic regression, despite its name, is a simple and efficient classification model com-
pared to the linear regression model (Sharma, 2021). Logistic regression is a type of classi-
fication algorithm based on linear regression to evaluate output and to minimize error. It
uses the logit function to evaluate the outputs. Logistic regression models a relationship
between predictor variables and a categorical response variable. Logistic regression mod-
els are classification models; specifically, binary classification models, that is, they can be
used to distinguish between two different categories, such as if a person is obese or not
given their weight, or if a house is big or small given its size. This means that our data has
two kinds of observations (Category 1 and Category 2 observations). Considering another
example, we could use logistic regression to model the relationship between various mea-
surements of a manufactured specimen (such as dimensions and chemical composition) to
predict if a crack greater than 10 mils will occur (a binary variable: either yes or no). Logistic
regression helps us estimate a probability of falling into a certain level of the categorical
response given a set of predictors.
The basic idea is to work with a smooth differentiable approximation to the 0/1 loss function. In the logistic model we model the probability of the target label y given data points x ∈ Rⁿ. Given y_i ∈ {0, 1}, we want the output to also lie in the range [0, 1]. We therefore use the logistic (sigmoid) or S-shaped function (Equation 7.21; Figure 7.28):
\sigma(z) = \frac{1}{1 + e^{-z}} \quad (7.21)
FIGURE 7.28
The sigmoid function.
hence the name logistic regression. In logistic regression, we do not fit a straight line to our data as in linear regression; instead, we fit an S-shaped curve, called the sigmoid, to our observations. Used in this way, σ is called an activation function, and z is called the logit. This logistic function is a simple strategy to map the linear combination z = wᵀx + b, lying in the (−∞, ∞) range, to the probability interval [0, 1] (in the context of logistic regression, this z is called the log-odds or logit, i.e., log(p/(1 − p))). Consequently, logistic regression is a type of regression where the range of the mapping is confined to [0, 1], unlike simple linear regression models where the domain and range can take any real value.
The Y-axis goes from 0 to 1. This is because the sigmoid function always takes as maximum
and minimum these two values, and this fits very well our goal of classifying samples into
two different categories. By computing the sigmoid function of X (that is a weighted sum of
the input features, just like in linear regression), we get a probability (between 0 and 1) of
an observation belonging to one of the two categories.
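A minimal sketch of the sigmoid mapping (the weights here are hand-picked for illustration, not fitted to any data):

```python
import numpy as np

# Sigmoid maps the logit z = w^T x + b to a probability in (0, 1) (Eq. 7.21).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy one-feature classifier with hand-picked parameters:
w, b = 1.5, -3.0
for x in (0.0, 2.0, 4.0):
    p = sigmoid(w * x + b)
    print(x, round(float(p), 3))   # probabilities ≈ 0.047, 0.5, 0.953
```

A threshold (commonly 0.5) then converts the probability into one of the two class labels.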
7.6 Summary
7.7 Points to Ponder
• A part of speech tagging application can be defined using the Bayesian rule.
• How do you think the issues associated with Bayesian learning can be tackled?
E.7 Exercise
E.7.1 Suggested Activities
E.7.1.1 Use the Weka tool for spam filtering using the naïve Bayes classifier. Use the
spam base data set, which can be obtained from the UCI machine learning
repository.
E.7.1.2 Design a business application using the Bayesian framework. Outline the
assumptions made.
E.7.1.3 Take an example of university results and formulate at least five aspects using
the naïve classifier. Clearly state any assumptions made.
E.7.1.4 Design a business application using the Bayesian network having at least seven
nodes and seven edges. Outline the assumptions made.
E.7.1.5 Design a healthcare application using the Bayesian network having at least
seven nodes and seven edges. Outline the assumptions made (including CPTs)
and all inferences possible. Implement the application using Python.
Self-Assessment Questions
E.7.2 Multiple Choice Questions
E.7.2.2 _____ uses probability to model data and quantify uncertainty of predictions.
i Linear regression
ii Bayesian learning
iii SVM
E.7.2.3 _____ of a hypothesis reflects any background knowledge or evidence we have about the chance that h is a correct hypothesis.
i Prior probability
ii Post probability
iii Marginal probability
E.7.2.6 Finding the hypothesis h that maximizes the likelihood of the data d for a given
hypothesis h and the prior probability of h is called
i Maximum a posteriori (MAP)
ii Maximum likelihood (ML)
iii Maximum prior (MP)
E.7.2.7 The method in which the posterior p(class|data) is modelled directly and then used to make predictions of class for new values of x is called a
i Generative model
ii Probabilistic model
iii Discriminative model
E.7.2.8 An important assumption that the naïve Bayes method makes is that
i All input attributes are conditionally independent
ii All input attributes are conditionally independent
iii All input attributes are conditionally dependent
E.7.2.10 If the problem is defined as the probability of the cause given the symptom, it
is called
i Prediction
ii Classification
iii Diagnosis
i A is dependent on B given C
ii A is dependent on B given D
iii A is independent of B given C
No | Match
E.7.3.1 Posterior probability | A. States that given its parents, a random variable X is conditionally independent of its nondescendants
E.7.3.2 Maximum likelihood estimate | B. Choosing a higher degree polynomial as a model
E.7.3.3 Generative models are used for | C. Require initial knowledge of many probabilities
E.7.3.4 Naïve Bayes | D. Bayesian learning
E.7.3.5 Text classification based on naïve Bayes | E. The probability of the symptom given the cause
E.7.3.6 Prediction is defined as finding | F. Assume that all hypotheses are equally likely a priori
E.7.3.7 Bayesian network is associated with Markov condition | G. Probability of observing the conjunction of attributes is equal to the product of the individual probabilities
E.7.3.8 Bayesian networks | H. Vector space representation to represent documents
E.7.3.9 Lead to overfitting | I. Is a classification model
E.7.3.10 Logistic regression | J. Modelling observations drawn from a probability density function
E.7.4 Problems
E.7.4.1 Ram is a CS 6310 student; recently his mood has been highly influenced by three
factors: the weather (W), his study habits (S), and whether his friend talks to him
(F). Predict his happiness (H) based on these factors, given observations in the
table shown.
On a new day when weather is Good, S = Pass, and F = No talk, predict the stu-
dent’s happiness using a naïve Bayes classifier. Show all calculations.
E.7.4.2 Given the following table, find if the car will be stolen if it is yellow in colour, is
a sports car, and is of domestic origin. Show all calculations.
E.7.4.4 In a medical study, 100 patients all fell into one of three classes: Pneumonia, Flu,
or Healthy. The following database indicates how many patients in each class
had fever and headache. Consider a patient with a fever but no headache.
(a) What values would a Bayes optimal classifier assign to the three diagnoses?
(A Bayes optimal classifier does not make any independence assumptions
about the evidence variables.) Again, your answers for this question need
not sum to 1.
(b) What values would a naïve Bayes classifier assign to the three possible diag-
noses? Show your work. (For this question, the three values need not sum to
1. Recall that the naïve Bayes classifier drops the denominator because it is
the same for all three classes.)
(c) What probability would a Bayes optimal classifier assign to the proposition
that a patient has pneumonia? Show your work. (For this question, the three
values should sum to 1.)
(d) What probability would a naïve Bayes classifier assign to the proposition that
a patient has pneumonia? Show your work. (For this question, the three
values should sum to 1.)
E.7.4.5 A naïve Bayes text classifier has to decide whether the document “Chennai
Hyderabad” is about India (class India) or about England (class England).
(a) Estimate the probabilities that are needed for this decision from the
following document collection using maximum likelihood estimation (no
smoothing).
(b) Based on the estimated probabilities, which class does the classifier predict?
Explain. Show that you have understood the naïve Bayes classification rule.
(a) Given Wet Grass is True, what is probability that it was Cloudy?
(b) Given Sprinkler, what is the probability that it was Cloudy?
(a) Calculate the probability that in spite of the exam level being difficult, the
student having a low IQ level and a low aptitude score manages to pass the
exam and secure admission to the university.
(b) In another case, calculate the probability that the student has a high IQ level
and aptitude score, with the exam being easy, yet fails to pass and does not
secure admission to the university.
(c) Given the fact that a student gets admission, what is the probability that
they have a high IQ level?
(a) Find the probability of Positive X-Ray given the person has visited Asia and
is a smoker.
(b) Find the probability of D given the person has not visited Asia but is a
smoker.
(c) Given not (Positive X-Ray), find the probability that the person has visited
Asia.
(d) Given that the person does not have TB or lung cancer, find the probability
that they are a smoker.
E.7.4.10 The number of disk I/Os and processor times of seven programs are measured as
(a) Find the parameters of the Linear regression model for this data set.
(b) Explain the interpretation of the model.
(c) Find the mean squared error.
(d) Find the coefficient of determination.
E.7.5 Short Questions
References
Agrawal, S. K. (2022, December 2). Metrics to evaluate your classification model to take the right
decisions. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/07/metrics-to-
evaluate-your-classification-model-to-take-the-right-decisions/
Apple. (n.d.). Classification metrics. https://apple.github.io/turicreate/docs/userguide/evaluation/
classification.html
Beheshti, N. (2022, February 10). Guide to confusion matrices & classification performance metrics:
Accuracy, precision, recall, & F1 score. Towards Data Science. https://towardsdatascience.com/
guide-to-confusion-matrices-classification-performance-metrics-a0ebfc08408e
Dharmasaputro, A. A., Fauzan, N. M., et al. (2021). Handling missing and imbalanced data to
improve generalization performance of machine learning classifier. International Seminar on
Machine Learning, Optimization, and Data Science (ISMODE).
Engineering Education Program. (2020, December 11). Introduction to random forest in machine
learning. Section. https://www.section.io/engineering-education/introduction-to-random-
forest-in-machine-learning/
Geeks for Geeks. (2022). Bagging vs boosting in machine learning. https://www.geeksforgeeks.org/
bagging-vs-boosting-in-machine-learning/
Geeks for Geeks. (2023). ML: Handling imbalanced data with SMOTE and near miss algorithm in
Python. https://www.geeksforgeeks.org/ml-handling-imbalanced-data-with-smote-and-
near-miss-algorithm-in-python/
JavaTpoint. (2021). Cross-validation in machine learning. https://www.javatpoint.com/cross-
validation-in-machine-learning#:~:text=Cross%2Dvalidation%20is%20a%20technique,
generalizes%20to%20an%20independent%20dataset
Lutins, E. (2017, August 1). Ensemble methods in machine learning: What are they and why
use them? Towards Data Science. https://towardsdatascience.com/ensemble-methods-in-
machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f
Mitchell, T. (2017). Machine learning. McGraw Hill Education.
Narkhede, S. (2018, June 26). Understanding AUC-ROC curve. Towards Data Science. https://
towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
Pardoe, I. (2022). STAT 501: Regression methods. Pennsylvania State University, Eberly College of
Science. https://online.stat.psu.edu/stat501/lesson/15/15.1
Pearl, J., & Russell, S. (2000). Bayesian networks. University of California. https://www.cs.ubc.ca/
~murphyk/Teaching/CS532c_Fall04/Papers/hbtnn-bn.pdf
Russell, S., & Norvig, P. (2022, August 22). Bayesian networks. Chapter 14 in Artificial intelligence: A
modern approach. http://aima.eecs.berkeley.edu/slides-pdf/chapter14a.pdf
Sharma, A. (2021, March 31). Logistic regression explained from scratch (visually, mathematically and
programmatically). Towards Data Science. https://towardsdatascience.com/logistic-regression-
explained-from-scratch-visually-mathematically-and-programmatically-eb83520fdf9a
8
Performance Evaluation and Ensemble Methods
8.1 Introduction
After building a predictive classification model, you need to evaluate the performance of
the model, that is, how good the model is in predicting the outcome of new test data that
has not been used to train the model. In other words, we need to estimate the model predic-
tion accuracy and prediction errors using a new test data set. Because we know the actual
outcome of observations in the test data set, the performance of the predictive model can be
assessed by comparing the predicted outcome values against the known outcome values.
Evaluation metrics are used to measure the quality of the classification model. When we
build the model, it is crucial to measure how accurately the expected outcome is predicted.
There are different evaluation metrics for different sets of machine learning algorithms. For
evaluating classification models, we use classification metrics, and for regression models,
we use regression metrics. Evaluation metrics help to assess your model’s performance,
monitor your machine learning system in production, and control the model to fit the
business needs. The goal of any classification system is to create and select a model which
gives high accuracy on unseen data. It is important to use multiple evaluation metrics to
evaluate the model because a model may perform well using one measurement from one
evaluation metric while it may perform poorly using another measurement from another
evaluation metric (Agrawal, 2022).
8.2 Classification Metrics
8.2.1 Binary Classification
Classification is about predicting the class labels given input data. In binary classification,
there are only two possible output classes (i.e., dichotomy). In multi-class classification,
more than two possible classes can be present. There are many ways for measuring classi-
fication performance (Beheshti, 2022). Accuracy, confusion matrix, log-loss, and area under
curve/receiver operating characteristic (AUC-ROC) are some of the most popular metrics.
Precision and recall are widely used metrics for classification problems.
We will first describe the confusion matrix, which is an n × n matrix (where n is the
number of labels) used to describe the performance of a classification model on a set of test
data for which the true values are known. Each row in the confusion matrix represents an
actual class whereas each column represents a predicted class. A confusion matrix for two
classes (binary classification) C1 and C2 is shown in Figure 8.1.
Consider a machine learning model classifying patients as cancer positive or negative. When performing classification predictions, there are four types of outcomes that could occur (Figure 8.2):
True Positive (TP): When you predict an observation belongs to a class and it actually does belong to that class; in this case, a patient who is classified as cancer positive and is actually positive.
True Negative (TN): When you predict an observation does not belong to a class and it actually does not belong to that class; in this case, a patient who is classified as cancer negative and is actually negative.
False Positive (FP): When you predict an observation belongs to a class and it actually does not belong to that class; in this case, a patient who is classified as cancer positive but is actually negative.
False Negative (FN): When you predict an observation does not belong to a class and it actually does belong to that class; in this case, a patient who is classified as cancer negative but is actually positive.
Now, given the confusion matrix, we will discuss the various evaluation metrics.
FIGURE 8.1
Confusion matrix for binary classification.
FIGURE 8.2
Confusion matrix—Binary classification example.
8.2.1.1 Accuracy
The first metric is accuracy. Accuracy focuses on true positive and true negative. Accuracy
is one metric which gives the fraction of predictions that are correct. Formally, accuracy is
defined as given in Equation (8.1).
Example 8.1
Now, considering a patient data set of 100,000, let us assume that 10 of the patients actually have cancer, and that all patients in the data set are classified as negative for cancer. The confusion matrix values are then as shown in Figure 8.4.
The accuracy in the example given in Figure 8.4 is 99.99%. The objective here is to identify actual cancer patients, and the incorrect labelling of these positive patients as negative is not acceptable; hence in this scenario accuracy is not a good measure.
FIGURE 8.3
Accuracy.
FIGURE 8.4
Accuracy calculation.
FIGURE 8.5
Recall or sensitivity. Recall gives the fraction we correctly identified as positive out of all positives; in terms of the confusion matrix, Recall = TP/(TP + FN).
FIGURE 8.6
(a) Calculation of recall, (b) specificity.
We normally want to maximize the recall. Now let us assume that all patients in the data set are wrongly classified as positive for cancer. This labelling is bad since it leads to mental agony as well as the cost of investigation. As can be seen from Figure 8.6a, Recall = 1, and this is again not a good measure. Recall is not a good measure when considered independently; it needs to be measured in coordination with precision. We can also define specificity for the negative cases as in Figure 8.6b.
8.2.1.3 Precision
Precision gives the fraction we correctly identified as positive out of all positives that were
predicted (given in Figure 8.7) or in other words indicates how good we are at prediction.
Now let us assume again that in Example 8.1 all patients in the data set are wrongly
classified as positive for cancer. The calculation of precision in this case is shown in
Figure 8.8. For this case we have high recall value and low precision value.
In some cases, we deliberately want to maximize either recall or precision at the cost of the other.
FIGURE 8.7
Precision.
FIGURE 8.8
Calculation of precision.
8.2.1.4 Precision/Recall Trade-off
As we can see, increasing precision reduces recall and vice versa. This is called the pre-
cision/recall trade-off as shown in Figure 8.9. Classifier models perform differently for
FIGURE 8.9
Precision–recall curve.
different threshold values; that is, positive and negative predictions can be changed by set-
ting the threshold value. We can fix an output probability threshold value for us to label the
samples as positive. In order to classify a disease, for example, only if very confident, we
increase the threshold, leading to higher precision and lower recall. On the other hand, if
we are interested in avoiding missing too many cases of disease, for example, we decrease
the threshold, leading to higher recall but lower precision. Precision–recall curves sum-
marize the trade-off between the true positive rate and the positive predictive value for a
predictive model using different confidence probability thresholds.
The precision–recall curve shows the trade-off between precision and recall for different
thresholds. A high area under the curve represents both high recall and high precision,
where high precision relates to a low false positive rate, and high recall relates to a low
false negative rate. High scores for both show that the classifier is returning accurate results
(high precision), as well as returning a majority of all positive results (high recall).
8.2.1.5 F1 Score
When we want to compare different models with different precision–recall values, we
need to combine precision and recall into a single metric to compute the performance. The
F1 score is defined as the harmonic mean of precision and recall and is given in Figure 8.10.
Harmonic mean is used because it is not sensitive to extremely large values, unlike
simple averages. The F1 score is a better measure to use if we are seeking a balance between
precision and recall.
Example 8.2
Let us consider an example where mail can be spam or not spam and classification is used to
predict whether the incoming mail is spam or not spam. The corresponding confusion matrix
and the classification metrics are shown in Figure 8.11.
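All of the metrics discussed so far can be computed directly from the four confusion matrix counts. The counts below are illustrative only, not taken from Figure 8.11:

```python
# Binary classification metrics from confusion matrix counts (illustrative).
TP, FP, FN, TN = 45, 5, 10, 40

accuracy = (TP + TN) / (TP + TN + FP + FN)           # fraction correct
precision = TP / (TP + FP)                           # of predicted positives
recall = TP / (TP + FN)                              # sensitivity
specificity = TN / (TN + FP)                         # recall of the negatives
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(accuracy, precision, round(recall, 3), round(specificity, 3), round(f1, 3))
```

Note how the harmonic mean pulls the F1 score towards the smaller of precision and recall, which is why it rewards balanced classifiers.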
8.2.1.6 ROC/AUC Curve
The receiver operating characteristic (ROC) is another common method used for evaluation.
It plots out the sensitivity and specificity for every possible decision rule cutoff between
0 and 1 for a model. For classification problems with probability outputs, a threshold can
convert probability outputs to classifications. We get the ability to control the confusion
matrix, so by changing the threshold, some of the numbers in the confusion matrix can be changed. To find the right threshold we make use of the ROC curve. For each possible threshold, the ROC curve plots the false positive rate versus the true positive rate, where the false
positive rate is the fraction of negative instances that are incorrectly classified as positive
while true positive rate is the fraction of positive instances that are correctly predicted as
positive. Thus the ROC curve, which is a graphical summary of the overall performance of
the model, shows the proportion of true positives and false positives at all possible values
of probability cutoff.

FIGURE 8.10
F1 score.

Performance Evaluation and Ensemble Methods 197

FIGURE 8.11
Example showing evaluation of binary classification.

Figure 8.12a shows a low threshold of 0.1 and a high threshold of 0.9, while Figure 8.12b
shows the ROC curve for different values of the threshold.
The point where TPR = 1 and FPR = 0 corresponds to the ideal model, which does not exist
in practice, where positives are classified as positives with zero error. In the strict model,
where TPR = 0 and FPR = 0, very rarely is anything classified as positive. The liberal model,
where TPR = 1 and FPR = 1, classifies every positive as positive but at the cost of many
negatives also being classified as positive. For a good model, the ROC curve should rise steeply, indicating that
the true positive rate increases faster than the false positive rate as the probability thresh-
old decreases.
One way to compare classifiers is to measure the area under the curve (AUC) for ROC,
where AUC summarizes the overall performance of the classifier.
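The threshold sweep and the AUC can be sketched as follows. The AUC is computed here via the standard pairwise-ranking interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative, ties counting half), which is equivalent to the area under the curve; all names and data are illustrative:

```python
def roc_point(scores, labels, threshold):
    """(FPR, TPR) for one probability cutoff: predict positive when
    score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)

def auc(scores, labels):
    """AUC as the probability that a random positive instance is scored
    above a random negative instance (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.6, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
point = roc_point(scores, labels, 0.5)   # lowering the threshold raises both rates
area = auc(scores, labels)
```

Sweeping the threshold from 1 down to 0 traces the curve from (0, 0) to (1, 1); a steep initial rise corresponds to a high AUC.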
8.2.2 Multi-Class Classification
We have discussed evaluation of binary classifiers. When there are more than two labels
available for a classification problem, we have multi-class classification whose perfor-
mance can be measured in a similar manner as the binary case. Having m classes, the
confusion matrix is a table of size m × m, where the element at (i, j) indicates the number
of instances of class i but classified as class j. To have good accuracy for a classifier, ideally
most diagonal entries should have large values with the rest of entries being close to zero.
The confusion matrix may have additional rows or columns to provide total or recogni-
tion rates per class. In case of multi-class classification, sometimes one class is important
enough to be regarded as positive with all other classes combined together as negative.
Thus a large confusion matrix of m × m can be reduced to a 2 × 2 matrix.
Let us now discuss the calculation of TP, TN, FP, and FN values for a particular class in
a multi-class scenario. Here TP is the value in the cell of the matrix where the actual and
predicted classes are the same. FN is the sum of the values of the corresponding row of the
class except the TP value, while FP is the sum of the values of the corresponding column
except the TP value. TN is the sum of all the remaining values, that is, everything outside
the row and column of the class we are calculating the values for.
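Assuming the confusion matrix is stored as a list of rows, with cm[i][j] counting instances of class i predicted as class j, the per-class counts described above can be sketched as (function name illustrative):

```python
def per_class_counts(cm, k):
    """TP, FN, FP, TN for class k of an m x m confusion matrix, where
    cm[i][j] counts instances of class i predicted as class j."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                   # rest of row k
    fp = sum(row[k] for row in cm) - tp    # rest of column k
    total = sum(sum(row) for row in cm)
    tn = total - tp - fn - fp              # everything outside row and column k
    return tp, fn, fp, tn

# A hypothetical 3-class confusion matrix:
cm = [[5, 1, 0],
      [2, 6, 2],
      [0, 1, 8]]
counts = per_class_counts(cm, 0)   # (TP, FN, FP, TN) for the first class
```

This is exactly the reduction of the m x m matrix to a 2 x 2 matrix with one class regarded as positive and all others combined as negative.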
FIGURE 8.12
ROC curve.
Example 8.3
Let us consider an example where a test is carried out to predict the quality of manufactured
components. The quality can have four values—Very Good, Good, Fair, and Poor. The corre-
sponding confusion matrix and the classification metrics are shown in Figure 8.13.
FIGURE 8.13
Example showing evaluation of multi-class classification.
8.4 Ensemble Methods
Ensemble methods combine several base machine learning models in order to produce
one optimal predictive model. Ensemble learning is a general meta approach to machine
learning that seeks better predictive performance by combining the predictions from mul-
tiple models rather than from a single model. The basic idea is to learn a set of classifiers
(experts) and to allow them to vote. An ensemble of classifiers combines the individual
decisions of a set of classifiers in particular ways to predict the class of new samples
(Figure 8.14).
The simplest approach to ensemble classifiers is to generate multiple classifiers which
then vote on each test instance, and the class with the majority vote is considered as the
class of the new instance. Different classifiers can differ because of different algorithms,
different sampling of training data, and different weighting of the training samples or dif-
ferent choice of hyper-parameters of the classification algorithm.
Based on the training strategy and combination method, ensemble approaches can be
categorized as follows:
FIGURE 8.14
Concept of ensemble approach.
• Committee approaches use multiple base models where each base model covers the
complete input space but is trained on a slightly different training set. One method
is to manipulate training instances: multiple training sets are created by
resampling the original data using some sampling distribution, and a classifier
is built from each created training set. Examples of this method include bagging
and boosting. Another approach is to choose, either randomly or based on domain
expertise, subsets of input features to form different data sets and to build classi-
fiers from each training set. An example of such an approach is random forest. The
predictions of all models are then combined to obtain an output whose accuracy
would be better than that of the base models.
Ensemble methods basically work because they try to handle the variance–bias issues.
When the training sets used by the ensemble method are completely independent, vari-
ance is reduced without affecting bias (as in bagging) since sensitivity to individual points
is reduced. For simple models, averaging of models can reduce bias substantially (as in
boosting). We will now discuss three major types of ensemble methods, namely bagging,
boosting, and stacking.
FIGURE 8.15
Bagging (bootstrap aggregation) ensemble learning.
same error. This method is also called model averaging. Bagging consists of two steps,
namely bootstrapping and aggregation (Figure 8.15).
Bagging is advantageous since weak base learners are combined to form a single strong
learner that is more stable than the individual learners. Bagging is most effective with
unstable, nonlinear models, where a small change in the data set can cause a significant
change in the model. Bagging reduces variance by voting/averaging, thus reducing the
overall expected error. The reduction of variance increases accuracy, reducing the
overfitting issue associated with many predictive models. One limitation of bagging is
that it is computationally expensive.
Random forest is an ensemble machine learning algorithm specifically designed for
decision tree models and widely used for classification and regression. A random forest
algorithm consists of many decision trees. The “forest” generated by the random forest
algorithm is trained through bagging or bootstrap aggregating. The random forest classi-
fier is an extension to bagging which uses decorrelated trees. When using bagging for deci-
sion trees, the resampling of training data alone is not sufficient. In addition, the features
or attributes used in each split are restricted. In other words, we introduce two sources of
randomness, that is, bagging and random input vectors. In bagging, each tree is grown
using a bootstrap sample of training data, and in addition, in the random vector method,
at each node of the decision tree the best split is chosen from a random sample of m attri-
butes instead of choosing from the full set of p attributes (Figure 8.16).
If a very strong predictor exists in the data set along with a number of other moderately
strong predictors, then in the collection of bagged trees created using bootstrap data sets,
most or all of the trees will use the very strong predictor for the first split, and hence all
the bagged trees will look similar. All the predictions from the bagged trees will therefore
be highly correlated. Averaging these highly correlated quantities does not lead to a large
variance reduction, but a random forest, by using a random input vector, decorrelates the
bagged trees, leading to a further reduction in variance. A random forest overcomes the
limitations of a decision tree algorithm. It reduces the overfitting of data sets and increases
precision, resulting in a highly accurate classifier which runs efficiently on large data sets.
In addition, random forest provides an experimental method of detecting variable interactions.
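The core bagging mechanics (bootstrap resampling plus majority vote) can be sketched in a few lines. The base learner below is a deliberately trivial threshold stump on one-dimensional data; it and all names are illustrative stand-ins, not the decision trees a real random forest would use:

```python
import random

def fit_stump(sample):
    """Toy base learner: threshold halfway between the two class means;
    predict 1 above the threshold."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:                      # degenerate bootstrap sample
        return sum(x for x, _ in sample) / len(sample)
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def bagged_predict(data, x, n_models=25, seed=0):
    """Bagging: draw bootstrap samples (with replacement), fit one base
    learner per sample, and combine the votes by majority."""
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]   # bootstrap resampling
        votes += 1 if x > fit_stump(sample) else 0
    return 1 if votes > n_models / 2 else 0

data = [(0.1, 0), (0.2, 0), (0.3, 0), (1.1, 1), (1.2, 1), (1.3, 1)]
label = bagged_predict(data, 1.4)   # majority vote of 25 bootstrap stumps
```

A random forest adds the second source of randomness on top of this: at each split, only a random subset of the attributes is considered.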
Boosting is an ensemble technique that learns from mistakes of a previous predictor
to make better predictions in the future. It is an iterative procedure that generates a
series of base learners that complement each other. This method essentially tackles
FIGURE 8.16
Random forest.
FIGURE 8.17
AdaBoost procedure.
The misclassification rate of a trained weak model is computed as

    error = (N - correct) / N

where N is the total number of samples, and correct is the number of samples correctly
predicted. An iteration value is then calculated for the trained model which provides a
weighting for any predictions that the model makes. The iteration value for a trained
model is calculated as given in Equation (8.3):

    Iteration = ln((1 - error) / error)    (8.3)

The updating of the weight of one training instance (w) is carried out as given in
Equation (8.4):

    w = w × exp(Iteration × terror)    (8.4)

where w is the weight for a specific training instance, exp() is Euler's number raised to
a power, Iteration is the weighting calculated for the weak classifier in Equation (8.3),
and terror is the error the weak classifier made predicting the output variable for the
training instance, evaluated as given in Equation (8.5):

    terror = 0 if trueoutput = predicted, and terror = 1 otherwise    (8.5)

where trueoutput is the output variable for the training instance and predicted is the predic-
tion from the weak learner.
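A minimal sketch of Equations (8.3) through (8.5), using the book's naming (iteration value, terror); the numeric values are illustrative:

```python
import math

def iteration_value(error):
    """Equation (8.3): the weighting for a trained weak model,
    ln((1 - error) / error)."""
    return math.log((1 - error) / error)

def update_weight(w, iteration, terror):
    """Equation (8.4): w <- w * exp(iteration * terror), where terror is 1
    for a misclassified instance and 0 for a correct one (Equation (8.5))."""
    return w * math.exp(iteration * terror)

s = iteration_value(0.25)              # weak learner with 25% error
w_wrong = update_weight(0.1, s, 1)     # misclassified: weight grows
w_right = update_weight(0.1, s, 0)     # correctly classified: weight unchanged
```

With error = 0.25 the iteration value is ln(3), so a misclassified instance's weight is multiplied by 3, focusing the next learner on the hard cases.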
TABLE 8.1
Comparison between Bagging and Boosting

1. Bagging combines predictions from multiple similar models; boosting combines predictions from different types of models.
2. Bagging resamples data points; boosting changes the distribution of the data by reweighting data points.
3. The objective of bagging is to reduce variance, not bias; the objective of boosting is to reduce bias while keeping variance small.
4. In bagging, each model is built independently; in boosting, new models are influenced by the performance of previously built models.
5. In bagging, the weight of each learner model is the same; in boosting, learner models are weighted according to their accuracy.
6. Bagging trains base classifiers in parallel; boosting trains them sequentially.
7. Bagging example: the random forest model; boosting example: the AdaBoost algorithm.
8.4.1.3 Stacking
Stacking, also called stacked generalization, involves fitting many different models on the
same data and using another model to learn how to best combine the predictions. Stacking
has been successfully implemented in regression, density estimations, distance learning,
and classifications.
Stacking uses the concept that a learning problem can be attacked with different types
of models which are capable of learning some part of the problem, but not the whole space
of the problem. The multiple different learners are used to build different intermediate
predictions, and a new model learns from these intermediate predictions.
The training data is split into k-folds, and the base model is fitted on the k − 1 parts and
predictions are made for the kth part. We repeat the same for each part of the training data.
The base model is then fitted on the whole training data set to calculate its performance on
the test set. These steps are repeated for each of the base models. Predictions from the train-
ing set are used as features for the second level model, and this model is used to make a
prediction on the test set.
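The out-of-fold construction described above can be sketched as follows; the base learner and the data are toy stand-ins, and all names are illustrative. The key point is that every training instance receives a prediction from a model that never saw it, so the second-level model is not trained on leaked predictions:

```python
def out_of_fold_predictions(X, y, fit, predict, k=3):
    """Split the training data into k folds; fit the base model on k - 1
    folds and predict the held-out fold, so every instance gets a
    prediction from a model that did not train on it."""
    n = len(X)
    oof = [None] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            oof[i] = predict(model, X[i])
    return oof

# Toy base learner: threshold halfway between the class means of a 1-D feature.
def fit(X, y):
    m0 = [x for x, t in zip(X, y) if t == 0]
    m1 = [x for x, t in zip(X, y) if t == 1]
    return (sum(m0) / len(m0) + sum(m1) / len(m1)) / 2

def predict(model, x):
    return 1 if x > model else 0

X = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3]
y = [0, 0, 0, 1, 1, 1]
meta_feature = out_of_fold_predictions(X, y, fit, predict)
# meta_feature becomes one input column for the second-level model
```

Repeating this for each base model yields the full set of meta-features on which the second-level model is trained.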
SMOTE synthesizes new minority instances between existing minority instances by linear
interpolation. These synthetic training records are generated by randomly selecting one or
more of the k-nearest neighbors for each example in the minority class. After the oversam-
pling, the data set is reconstructed, and classification algorithms are applied using this
reconstructed balanced data set.
The other method to handle imbalanced class distribution is the NearMiss algorithm,
an undersampling technique which attempts to balance class distribution by eliminating
majority class samples. The method first finds the distances between all instances of the
majority class and the instances of the minority class. Then, n instances of the majority
class that have the smallest distances to those in the minority class are selected. If there
are k instances in the minority class, the NearMiss method will retain k × n instances of
the majority class. After this undersampling of the majority class, class balance is achieved.
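The linear-interpolation step at the heart of SMOTE can be sketched as follows. This is a simplified illustration, not the imbalanced-learn implementation; the function name, the neighbourhood size, and the data are all hypothetical:

```python
import random

def smote_like_sample(minority, k, rng):
    """Generate one synthetic minority instance: pick a minority point,
    pick one of its k nearest minority neighbours, and interpolate
    linearly between the two."""
    base = rng.choice(minority)
    others = [p for p in minority if p != base]
    # sort remaining minority points by squared distance to the base point
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    lam = rng.random()                  # random position along the segment
    return tuple(a + lam * (b - a) for a, b in zip(base, neighbour))

rng = random.Random(42)
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = [smote_like_sample(minority, k=2, rng=rng) for _ in range(3)]
```

Each synthetic point lies on the segment between two existing minority points, so oversampling stays inside the region the minority class already occupies.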
8.6 Summary
• Discussed the confusion matrix in detail.
• Explained the various evaluation metrics used for evaluating quality of the binary
classification model with illustrative examples.
• Discussed how a confusion matrix can be used for multi-class classification
evaluation.
• Discussed the basic concepts of ensemble learning.
• Explained the handling of missing and imbalanced data.
8.7 Points to Ponder
• Accuracy is not always a good evaluation measure for classification.
• It is important to understand the precision–recall trade-off to evaluate a classifica-
tion model.
• The ROC/AUC curve can be used to evaluate the overall performance of the clas-
sification model.
• Ensemble methods basically work because they try to handle the variance–bias
issues.
• Bagging works by resampling data points.
• The objective of boosting is to reduce bias while keeping variance small.
• Oversampling and undersampling are both used to handle imbalanced data.
E.8 Exercises
E.8.1 Suggested Activities
E.8.1.1 Implement any three classification algorithms using a standard data set from the
UCI machine learning repository. Now evaluate the performance of the three
models using different evaluation metrics. Give your comments.
E.8.1.2 Take any free imbalanced data set available and use SMOTE and the NearMiss
algorithm to tackle the same. Evaluate the performance of classification of the
data with and without balancing.
Self-Assessment Questions
E.8.2 Multiple Choice Questions
E.8.2.4 Bagging is
i Combining predictions from multiple different models weighted the same
ii Combining predictions from multiple similar models weighted differently
iii Combining predictions from multiple similar models weighted the same
E.8.2.10 If the columns of the data set have numeric continuous values, the imputation
method can be used where the missing values
i Can be replaced by the mean or median or mode of the remaining values in
the column
ii Can be replaced by duplicating the remaining values in the column
iii Can be replaced by removing the column
E.8.2.11 SMOTE is an
i Under sampling method which attempts to solve the imbalance problem by
randomly decreasing the majority class by removing samples at random
ii Oversampling method which attempts to solve the imbalance problem by
randomly increasing the minority class by synthesizing samples by linear
interpolation
iii Oversampling method which attempts to solve the imbalance problem by
randomly increasing the minority class by duplicating samples
No.    Match
E.8.3.1 The fraction correctly identified as positive out of all positives that were predicted    A True negative (TN)
E.8.3.2 Sequential training where the training samples are iteratively reweighted so that the current classifier focuses on hard-to-classify samples of the previous iteration    B ROC curve
E.8.3.3 Parallel training using different data sets    C F1 score
E.8.3.4 An extension to bagging which uses decorrelated trees    D SMOTE
E.8.3.5 When you predict an observation does not belong to a class and it actually does not belong to that class    E Precision
E.8.3.6 Randomly increasing minority class samples by replicating them    F Stacking
E.8.3.7 Plots the false positive rate versus the true positive rate    G NearMiss algorithm
E.8.3.8 Uses the concept that a learning problem can be attacked with different types of models which are capable of learning some part of the problem but not the whole space of the problem    H Random forest classifier
E.8.3.9 Eliminating majority class samples    I Bootstrap aggregation
E.8.3.10 Defined as the harmonic mean of precision and recall    J Boosting
E.8.4 Short Questions
E.8.4.1 Discuss the confusion matrix for binary classification in detail.
E.8.4.2 Differentiate between the recall and precision evaluation measures.
E.8.4.3 Discuss the ROC curve in detail.
E.8.4.4 Discuss the confusion matrix for four-class classification in detail.
E.8.4.5 What is k-fold cross validation? Discuss.
E.8.4.6 Explain the categorization of ensemble methods based on training strategy
and combination methods.
E.8.4.7 What are the questions that a machine learning system making decisions or
recommendations needs to give explanations for to the different stakeholders?
E.8.4.8 Discuss the bagging method in detail.
E.8.4.9 Differentiate between bagging and boosting ensemble methods.
E.8.4.10 Why is it important to handle missing data and imbalanced data sets?
How are these issues handled?
References

Agrawal, S. K. (2022, December 2). Metrics to evaluate your classification model to take the right decisions. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/07/metrics-to-evaluate-your-classification-model-to-take-the-right-decisions/
Beheshti, N. (2022, February 10). Guide to confusion matrices & classification performance metrics: Accuracy, precision, recall, & F1 score. Towards Data Science. https://towardsdatascience.com/guide-to-confusion-matrices-classification-performance-metrics-a0ebfc08408e
Dharmasaputro, A. A., Fauzan, N. M., et al. (2021). Handling missing and imbalanced data to improve generalization performance of machine learning classifier. International Seminar on Machine Learning, Optimization, and Data Science (ISMODE).
Re, M., & Valentini, G. (2012). Ensemble methods: A review. In Advances in machine learning and data mining for astronomy (pp. 563–594). Chapman & Hall.
9
Unsupervised Learning
TABLE 9.1
Supervised Learning Versus Unsupervised Learning

Supervised learning uses feature tagged and labelled data; unsupervised learning uses feature tagged data only.
Supervised learning finds a mapping between the input and associated target attributes; unsupervised learning finds the underlying structure of the data set by grouping data according to similarities.
Supervised learning is used when what is required is known; unsupervised learning is used when what is required is not known.
Supervised learning is applicable in classification and regression problems; unsupervised learning is applicable in clustering and association problems.
In supervised learning, human intervention to label the data appropriately results in accuracy; in unsupervised learning, accuracy cannot be measured in a similar way.
Supervised learning is generally objective, with a simple goal such as prediction of a response; unsupervised learning is more subjective, as there is no simple goal.
In supervised learning, labelled data sets reduce computational complexity as large training sets are not needed; unsupervised learning has no reliance on domain expertise for time-consuming and costly labelling of data.
9.5 Clustering
Clustering is the most important unsupervised learning approach associated with
machine learning. It can be viewed as a method for data exploration, which essentially
means looking for patterns or structures in the data space that may be of interest in a collec-
tion of unlabeled data. Now let us look at a simplistic definition of clustering. Clustering
is an unsupervised learning task where no classes are associated with data instances a pri-
ori as in the case of supervised learning. Clustering organizes data instances into groups
based on their similarity. In other words, the data is partitioned into groups (clusters) that
satisfy the constraints that a cluster has points that are similar and different clusters have
points that are dissimilar.
Clustering is often considered synonymously with unsupervised learning, but other
unsupervised methods include association rule mining and dimensionality reduction.
Clustering groups instances based on similarity, or similar interests or similarity of usage.
A set of data instances or samples can be grouped differently based on different criteria
or features; in other words, clustering is subjective. Figure 9.1 shows a set of seven people
who have been grouped into three clusters based on whether they are school employees,
whether they belong to a family, or their gender. Therefore, choosing the attributes
or features based on which clustering is to be carried out is an important aspect of cluster-
ing just as it was for classification. In general, clustering algorithms need to make the inter-
cluster distance maximum and intra-cluster distance minimum.
9.5.1 Clusters—Distance Viewpoint
When we are given a set of instances or examples represented as a set of points, we group
the points into some number of clusters based on a concept of distance such that members
FIGURE 9.1
Grouping of people based on different criteria.
FIGURE 9.2
Clusters and outlier.
of a cluster are close to each other when compared to members of other clusters. Figure 9.2
shows a data set that has three natural clusters where the data points group together based
on the distance. An outlier is a data point that is isolated from all other data points.
9.5.2 Applications of Clustering
In the context of machine learning, clustering is one of the functions that have many inter-
esting applications. Some of the applications of clustering are as follows:
• Group related documents for browsing or webpages based on their content and
links
• Group genes and proteins based on similarity of structure and/or functionality
• Group stocks with similar price fluctuations
• Cluster news articles to help in better presentation of news
• Group people with similar sizes to design small, medium, and large T-shirts
• Segment customers according to their similarities for targeted marketing
• Document clustering based on content similarities to create a topic hierarchy
• Cluster images based on their visual content
• Cluster groups of users based on their access patterns on webpages or searchers
based on their search behavior
9.6 Similarity Measures
As we have already discussed, clustering is the grouping together of “similar” data.
Choosing an appropriate (dis)similarity measure is a critical step in clustering. A similarity
measure is often described as the inverse of the distance function; that is, the less the dis-
tance, the more the similarity. There are many distance functions for the different types of
data such as numeric data, nominal data, and so on. Distance measures can also be defined
specifically for different applications.
FIGURE 9.3
Confusion matrix.
The confusion matrix can be used if both states (0 and 1) have equal importance and
carry the same weights. Then the distance function is the proportion of mismatches of their
values (Equation 9.1):

    dist(xi, xj) = (b + c) / (a + b + c + d)    (9.1)

However, sometimes the binary attributes are asymmetric, where one of the states is
more important than the other. We assume that state 1 represents the important state, in
which case the Jaccard measure using the confusion matrix can be defined as follows
(Equation 9.2):

    JDist(xi, xj) = (b + c) / (a + b + c)    (9.2)
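Assuming two binary vectors, the counts a (both 1), b (first 1, second 0), c (first 0, second 1), and d (both 0), and the two distances built from them, can be sketched as follows (function names illustrative):

```python
def binary_counts(x, y):
    """The four cells of the confusion matrix of Figure 9.3 for two
    binary vectors: a = both 1, b = x only, c = y only, d = both 0."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d

def simple_matching_distance(x, y):
    """Equation (9.1): proportion of mismatches over all attributes."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)

def jaccard_distance(x, y):
    """Equation (9.2): the 0/0 matches (d) are ignored."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c)
```

Dropping d in the Jaccard version is what makes it appropriate for asymmetric attributes, where shared absence of a property carries no information.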
For text documents, normally we use cosine similarity, which is a similarity measure,
not a distance measure. Cosine similarity is a measure of similarity between two vectors
obtained by measuring the cosine of the angle between them (Figure 9.4). The similarity
between any two given documents dj and dk, represented as vectors, is given as
(Equation 9.3):

    sim(dj, dk) = (dj · dk) / (|dj| |dk|)
                = Σ(i=1..n) wi,j wi,k / ( sqrt(Σ(i=1..n) wi,j²) × sqrt(Σ(i=1..n) wi,k²) )    (9.3)

In this case wi,j is the weight of term i in document j, typically based on the frequency of
words in the documents.
The result of the cosine function is equal to 1 when the angle is 0, and it is less than 1
when the angle is of any other value. As the angle between the vectors decreases, the
cosine value approaches 1, that is, the two vectors are closer, and the similarity between the
documents increases.
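Equation (9.3) can be sketched directly; the weight vectors here are hypothetical term weights:

```python
import math

def cosine_similarity(dj, dk):
    """Equation (9.3): dot product of the weight vectors divided by the
    product of their Euclidean norms."""
    dot = sum(wj * wk for wj, wk in zip(dj, dk))
    norm_j = math.sqrt(sum(w * w for w in dj))
    norm_k = math.sqrt(sum(w * w for w in dk))
    return dot / (norm_j * norm_k)

sim_same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
sim_orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])            # orthogonal vectors
```

Parallel vectors (one document is a scaled copy of the other) give a similarity of 1, while documents sharing no terms give 0, matching the angle interpretation in the text.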
FIGURE 9.4
The cosine angle between vectors.
9.7 Methods of Clustering
The basic method of clustering is the hierarchical method, which is of two types, agglom-
erative and divisive. Agglomerative clustering is a bottom-up method where initially we
assume that each data point is by itself a cluster; we then repeatedly combine the two
nearest clusters into one. On the other hand, divisive clustering is a top-down procedure
where we start with one cluster and recursively split the clusters until no more division is
possible. Another strategy is point assignment, where we maintain a set of clusters and
allocate each point to its nearest cluster.
9.7.1 Hierarchical Clustering
In the hierarchical clustering approach, we carry out partitioning of the data set in a
sequential manner. The approach creates partitions one layer at a time by grouping objects
into a tree of clusters (Figure 9.5). In this context there is no need to know the number of
clusters in advance. In general, the distance matrix is used as the clustering criteria.
FIGURE 9.5
Hierarchical clustering.
9.7.2.1 Agglomerative Clustering
In agglomerative clustering initially each data point forms its own (atomic) cluster and
hence follows the bottom-up strategy. We then merge these atomic clusters into larger and
larger clusters based on some distance metric. The algorithm terminates when all the data
points are in a single cluster, or merging is continued until certain termination conditions
are satisfied. Most hierarchical clustering methods belong to this category. Agglomerative
clustering on the data set {a, b, c, d, e} is shown in Figure 9.6.
9.7.2.2 Divisive Clustering
Divisive clustering is a top-down strategy which does the reverse of agglomerative hierar-
chical clustering, where initially all data points together form a single cluster. It subdivides
the clusters into smaller and smaller pieces until it satisfies certain termination conditions,
such as a desired number of clusters or the diameter of each cluster is within a certain
threshold (Figure 9.6).
FIGURE 9.6
Agglomerative and divisive hierarchical clustering.
Single link: The distance between two clusters is the distance between the closest points
in the clusters (Figure 9.7a). This method is also called neighbor joining. The cluster
distance d(Ci, Cj) between the clusters Ci and Cj in the case of single link is given as the
minimum distance between the data points xip and xjq in the two clusters (Equation 9.4):

    d(Ci, Cj) = min d(xip, xjq)    (9.4)

FIGURE 9.7a

Average link: The distance between two clusters is the distance between the cluster
centroids (Figure 9.7b). The cluster distance d(Ci, Cj) between the clusters Ci and Cj in
the case of average link is given as the average distance between the data points xip and
xjq in the two clusters (Equation 9.5):

    d(Ci, Cj) = avg d(xip, xjq)    (9.5)

FIGURE 9.7b

Complete link: The distance between two clusters is the distance between the farthest
pair of points (Figure 9.7c). The cluster distance d(Ci, Cj) between the clusters Ci and Cj
in the case of complete link is given as the maximum distance between the data points
xip and xjq in the two clusters (Equation 9.6):

    d(Ci, Cj) = max d(xip, xjq)    (9.6)

FIGURE 9.7c
Example 9.1
Hierarchical Clustering. Given a data set of five objects characterized by a single feature,
assume that there are two clusters: C1: {a, b} and C2: {c, d, e}. Assume that we are given the
distance matrix (Figure 9.8). Calculate three cluster distances between C1 and C2.
FIGURE 9.8
Distance matrix.
Single link:

    dist(C1, C2) = min{d(a, c), d(a, d), d(a, e), d(b, c), d(b, d), d(b, e)} = min{2, 3, 4, 3, 2, 5} = 2

Complete link:

    dist(C1, C2) = max{d(a, c), d(a, d), d(a, e), d(b, c), d(b, d), d(b, e)} = max{2, 3, 4, 3, 2, 5} = 5

Average link:

    dist(C1, C2) = (d(a, c) + d(a, d) + d(a, e) + d(b, c) + d(b, d) + d(b, e)) / 6
                 = (2 + 3 + 4 + 3 + 2 + 5) / 6 = 19/6 ≈ 3.2
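The three cluster distances of Example 9.1 can be sketched in Python; the pairwise distances below are the values used in the example, and the function names are illustrative:

```python
def single_link(d, c1, c2):
    """Minimum pairwise distance between the two clusters (Equation 9.4)."""
    return min(d[(p, q)] for p in c1 for q in c2)

def complete_link(d, c1, c2):
    """Maximum pairwise distance between the two clusters (Equation 9.6)."""
    return max(d[(p, q)] for p in c1 for q in c2)

def average_link(d, c1, c2):
    """Average pairwise distance between the two clusters (Equation 9.5)."""
    return sum(d[(p, q)] for p in c1 for q in c2) / (len(c1) * len(c2))

# Pairwise distances between C1 = {a, b} and C2 = {c, d, e} from the example
d = {('a', 'c'): 2, ('a', 'd'): 3, ('a', 'e'): 4,
     ('b', 'c'): 3, ('b', 'd'): 2, ('b', 'e'): 5}
C1, C2 = ['a', 'b'], ['c', 'd', 'e']
```

Running these on the example's distances reproduces the three results: 2 for single link, 5 for complete link, and 19/6 ≈ 3.2 for average link.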
9.8 Agglomerative Algorithm
The agglomerative algorithm is a type of hierarchical clustering algorithm that can be car-
ried out in three steps:
Example 9.2
Agglomerative Algorithm. We will now illustrate the agglomerative algorithm using an
example. Let us assume that we have six data points (A, B, C, D, E, and F). The data points,
the initial data matrix, and the distance matrix are shown in Figure 9.9a. In the first iteration,
from the distance matrix we find that the data points C and F are the closest with minimum
Euclidean distance (1), and hence we merge them to form the cluster (C, F) (Figure 9.9a) and
update the distance matrix accordingly (Figure 9.9b). In the second iteration, in the updated
matrix, we find that the data points D and E are the closest with minimum Euclidean distance
(1), and hence we merge them to form the cluster (D, E) (Figure 9.9c). Similarly, we continue
until we get the final result shown in Figure 9.9f.
FIGURE 9.9
Example of agglomerative clustering.
closeness can be defined in various ways, such as smallest maximum distance to other
points, smallest average distance, or smallest sum of squares of distances to other points.
The centroid is the average of all (data) points in the cluster. This means the centroid is an
“artificial” point. However, the clusteroid is an existing (data) point that is “closest” to all
other points in the cluster.
When we are dealing with Euclidean space, we generally measure cluster distances or
nearness of clusters by determining the distances between centroids of the clusters. In the
non-Euclidean case the defined clusteroids are treated as centroids to find inter-cluster
distances.
Another issue is the stopping criteria for combining clusters. Normally we can stop
when we have k clusters. Another approach is to stop when there is a sudden jump in the
cohesion value or the cohesion value falls below a threshold.
9.10 Partitional Algorithm
In the previous section we discussed hierarchical clustering where new clusters are found
iteratively using previously determined clusters. In this section, we will be discussing
another type of clustering called partitional clustering where we discover all the clusters
at the same time. The k-means algorithm, an example of partitional clustering, is a popu-
lar clustering algorithm. In partitional clustering, the data points are divided into a finite
number of partitions, which are disjoint subsets of the set of data points, that is, each data
point is assigned to exactly one subset. This type of clustering algorithm can be viewed as
a problem of iteratively relocating data points between clusters until an optimal partition
of the data points has been obtained.
In the basic algorithm the data points are partitioned into k clusters, and the partitioning
criterion is optimized using methods such as minimizing the squared error. In the case of
the basic iterative algorithm of k-means or k-medoids, both of which belong to the parti-
tional clustering category, the convergence is local and the globally optimal solution is not
always guaranteed. The number of data points in any data set is finite, and the number of
distinct partitions is also finite. It is possible to tackle the problem of local minima by using
exhaustive search methods.
9.11 k-Means Clustering
As the case with any partitional algorithm, basically the k-means algorithm is an iterative
algorithm which divides the given data set into k disjoint groups. As already discussed,
k-means is the most widely used clustering technique. This partitional method uses proto-
types for representing the cluster. The k-means algorithm is a heuristic method where the
centroid of the cluster represents the cluster and the algorithm converges when the cen-
troids of the clusters do not change. Some of the common applications of k-means cluster-
ing are data mining, optical character recognition, biometrics, diagnostic systems, military
applications, document clustering, and so on.
9.11.1 Steps of k-Means
Initialize k values of centroids.
The following two steps are repeated until the data points do not change partitions and
there is no change in the centroid.
• Partition the data points according to the current centroids. The similarity between
each data point and each centroid is determined, and each data point is moved to
the partition to which it is most similar.
• The new centroid of the data points in each partition is then calculated.
Initially the cluster centroids are chosen at random as we are talking about an unsuper-
vised learning technique where we are not provided with the labelled samples. Even after
the first round using these randomly chosen centroids, the complete set of data points is
partitioned all at once. After the next round, the centroids have moved since the data points
have been used to calculate the new centroids. The process is repeated, and the centroids
are relocated until convergence occurs when no data point changes partitions.
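These steps can be sketched in a few lines of Python (a minimal illustration with NumPy; the function and variable names are ours, and the initial centroids are passed in explicitly rather than chosen at random):

```python
import numpy as np

def k_means(points, init_centroids, max_iter=100):
    """Minimal k-means sketch: alternate (1) assigning each point to its
    nearest centroid and (2) recomputing centroids, until no point
    changes partition."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Step 1: partition the points according to the current centroids
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: no reassignments
        labels = new_labels
        # Step 2: new centroid = mean of the points in each partition
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two obvious groups; seed one centroid inside each group
pts = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
                [8.0, 8.0], [7.5, 9.0], [8.2, 7.9]])
cents, labs = k_means(pts, init_centroids=[[1.0, 1.0], [8.0, 8.0]])
```

With well-separated groups and sensible seeds, the algorithm converges in two iterations; with unlucky seeds it can get stuck in the local optima discussed above.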
Example 9.3
k-Means Clustering. The numerical example that follows can be used to understand the k-means algo-
rithm. Suppose we have four types of medicines, and each has two attributes, pH and weight index
(Figure 9.10). Our goal is to group these objects into k = 2 groups of medicines.
Figure 9.11a shows the random selection of the initial centroids as centroid 1 = (1,1) for
cluster 1 and centroid 2 = (2,1) for cluster 2. We illustrate with data point B: we calculate
the distance of B from each of the randomly selected centroids (also sometimes called
seed points), find that B is closer to centroid 2, and assign B to that cluster. Now
Figure 9.11b shows the assignment of each data point to one of the clusters. In our example
only one data point is assigned to cluster 1, so its centroid does not change. However, clus-
ter 2 has three data points associated with it, and the new centroid c2 now becomes (3, 3.33).
Now we compute the new assignments as shown in Figure 9.11c. Figure 9.11c shows the
new values of centroid 1 and centroid 2. Figure 9.11c shows the results after convergence.
FIGURE 9.10
Mapping of the data points into the data space.
Unsupervised Learning 225
FIGURE 9.11
The iterations for the example in Figure 9.10.
The initial centroids can be chosen at random, but this may not lead to fast convergence.
Therefore, this initialization problem needs to be addressed. One method is to use hierar-
chical clustering to determine the initial centroids. Another approach is to first select more
than k initial centroids and then retain the k that are most widely separated. Another
method is to carry out preprocessing, where we normalize the data and eliminate outliers.
The stopping criteria can be defined in any one of the following ways. We can stop the
iteration when there are no (or only a minimum number of) reassignments of data points
to different clusters. We can also define the stopping criterion as no (or minimum) change
of centroids, or, in terms of error, as a minimum decrease in the sum of squared error
(SSE), defined next.
SSE = Σ_{i=1}^{k} Σ_{x ∈ CL_i} dist(x, C_i)²   (9.7)
where CLi is the ith cluster, Ci is the centroid of cluster CLi (the mean vector of all the data
points in CLi), and dist(x, Ci) is the distance between data point x and centroid Ci. Given
two clusterings (clustering is the set of clusters formed), we can choose the one with the
smallest error. One straightforward way to reduce SSE is to increase k, the number of
clusters.
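Equation (9.7) is straightforward to compute; the sketch below (names and toy data are ours) sums the squared point-to-centroid distances over clusters:

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum of squared errors (Equation 9.7): for every cluster, add the
    squared distance of each member point to that cluster's centroid."""
    total = 0.0
    for i, c in enumerate(centroids):
        members = points[labels == i]
        total += np.sum((members - c) ** 2)
    return total

pts = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
cents = np.array([[1.0, 0.0], [10.0, 0.0]])
# cluster 0: (0,0) and (2,0) are each at distance 1 from (1,0) -> 1 + 1
# cluster 1: (10,0) coincides with its centroid -> 0
print(sse(pts, labels, cents))  # 2.0
```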
9.12 Cluster Validity
Cluster validation is important for comparing clustering algorithms, determining the
number of clusters, comparing two clusters, comparing two sets of clusters, and avoiding
finding patterns in noise. Good clustering is defined as a clustering where the intra-class
similarity is high and the inter-class similarity is low. The quality of clustering is depen-
dent on both the representation of the entities to be clustered and the similarity measure
used.
Cluster validation involves evaluation of the clustering using an external index by com-
paring the clustering results to ground truth (externally known results). Evaluation of the
FIGURE 9.12
Precision and recall of clustering.
quality of clusters without reference to external information using only the data is called
evaluation using internal index.
Another approach is discussed next, where we assume that the data set has N objects. Let
G be the set of “ground truth” clusters and C be the set of clusters obtained from the clus-
tering algorithm. We form two N × N matrices of objects, one corresponding to the G
clusters and one to the C clusters. An entry in the G matrix is 1 if the objects of the row and
column corresponding to that entry belong to the same cluster in G, and 0 otherwise; the
C matrix is defined analogously. Any pair of data objects then falls into one of the following
categories: SS, if the entries in both matrices agree and are 1; DD, if the entries in both
matrices agree and are 0; SD, if the entries disagree, with the entry in C being 1 and in G
being 0; and DS, if the entries disagree, with the entry in C being 0 and in G being 1. Two
evaluation measures are defined based on these counts, the Rand index and the Jaccard
coefficient (Equations 9.10 and 9.11).
Rand = Agree / (Agree + Disagree) = (SS + DD) / (SS + SD + DS + DD)   (9.10)
Jaccard coefficient = SS / (SS + SD + DS)   (9.11)
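Counting SS, SD, DS, and DD over all object pairs gives both indices directly; a small Python sketch (function names are ours):

```python
from itertools import combinations

def pair_counts(ground_truth, clustering):
    """Count SS, SD, DS, DD over all object pairs.
    SS: same cluster in both C and G; DD: different in both;
    SD: same in C but not in G; DS: different in C but same in G."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(ground_truth)), 2):
        same_g = ground_truth[i] == ground_truth[j]
        same_c = clustering[i] == clustering[j]
        if same_c and same_g:
            ss += 1
        elif same_c and not same_g:
            sd += 1
        elif not same_c and same_g:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def rand_index(g, c):
    ss, sd, ds, dd = pair_counts(g, c)
    return (ss + dd) / (ss + sd + ds + dd)   # Equation 9.10

def jaccard(g, c):
    ss, sd, ds, dd = pair_counts(g, c)
    return ss / (ss + sd + ds)               # Equation 9.11

g = [0, 0, 1, 1]   # ground-truth cluster labels
c = [0, 0, 1, 2]   # clustering to evaluate
```

Here pair_counts(g, c) gives SS = 1, SD = 0, DS = 1, DD = 4, so Rand = 5/6 and Jaccard = 1/2.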
Entropy of a cluster i: e_i = −Σ_j p_ij log p_ij   (9.12)
Entropy of a clustering is based on average entropy of each cluster as well as the total
number of objects in the data set n (Equation 9.13).
Entropy of a clustering: e = Σ_{i=1}^{K} (m_i / n) e_i   (9.13)

where m_i is the number of objects in cluster i and K is the number of clusters.
The purity of a cluster i is determined by the class j with the highest probability p_ij, the
probability that an element from cluster i belongs to class j (Equation 9.14):

Purity of a cluster i: p_i = max_j p_ij   (9.14)

The purity of a clustering is the weighted average over the clusters (Equation 9.15):

Purity of a clustering: p_C = Σ_{i=1}^{K} (m_i / n) p_i   (9.15)
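Equations (9.12)–(9.15) can be computed from the class labels of each cluster's members; a minimal sketch (function names and the toy labels are ours):

```python
import math
from collections import Counter

def cluster_entropy_purity(cluster_class_labels):
    """Entropy (Eq. 9.12) and purity (Eq. 9.14) of one cluster: p_ij is the
    fraction of cluster i's members that belong to class j."""
    n = len(cluster_class_labels)
    probs = [cnt / n for cnt in Counter(cluster_class_labels).values()]
    entropy = -sum(pj * math.log2(pj) for pj in probs)
    purity = max(probs)
    return entropy, purity

def clustering_entropy_purity(clusters):
    """Weighted averages over all clusters (Eqs. 9.13 and 9.15)."""
    n = sum(len(c) for c in clusters)
    e = sum(len(c) / n * cluster_entropy_purity(c)[0] for c in clusters)
    p = sum(len(c) / n * cluster_entropy_purity(c)[1] for c in clusters)
    return e, p

# Each inner list holds the true classes of one cluster's members
clusters = [["x", "x", "x", "y"], ["y", "y", "y", "y"]]
e, p = clustering_entropy_purity(clusters)
```

A perfectly pure cluster has entropy 0 and purity 1; the mixed first cluster above pulls the overall purity down to 0.875.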
Example 9.4
Evaluation of Cluster and Clustering. The preceding evaluation is illustrated using the exam-
ples given in Figure 9.14. In Figure 9.14a, the purity of cluster 1, cluster 2, and cluster 3 is calcu-
lated. The overall purity of the clustering is the average of the purity of the three clusters and
shows a good clustering. Figure 9.14b shows another example for bad clustering.
9.12.3 Internal Measure
As discussed previously, internal measures are used to evaluate the clustering without
comparing with external information. There are basically two aspects that are considered,
FIGURE 9.13
Confusion matrix.
FIGURE 9.14
Good and bad clustering.
namely cohesion, which measures the closeness of the data points within a cluster, and
separation, which measures how well separated the data points are across clusters.
Cluster cohesion is given by Equation (9.16):

WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²   (9.16)

Cluster separation is given by Equation (9.17):

BSS = Σ_i |C_i| (m − m_i)²   (9.17)

where |C_i| is the size of cluster i, m_i is the centroid of cluster i, and m is the centroid of
the whole data set.
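Equations (9.16) and (9.17) can be sketched as follows (function name and toy data are ours); note that WSS + BSS equals the total sum of squares about the overall centroid:

```python
import numpy as np

def cohesion_separation(points, labels):
    """WSS (Eq. 9.16): squared distance of points to their own centroid.
    BSS (Eq. 9.17): cluster size times squared distance of each cluster
    centroid m_i to the overall centroid m."""
    m = points.mean(axis=0)
    wss = bss = 0.0
    for i in np.unique(labels):
        members = points[labels == i]
        mi = members.mean(axis=0)
        wss += np.sum((members - mi) ** 2)
        bss += len(members) * np.sum((m - mi) ** 2)
    return wss, bss

pts = np.array([[0.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
wss, bss = cohesion_separation(pts, labels)
# Decomposition check: WSS + BSS equals the total sum of squares
tss = np.sum((pts - pts.mean(axis=0)) ** 2)
```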
9.13 Curse of Dimensionality
In any machine learning problem, if the number of observables or features is increased,
then it takes more time to compute, more memory to store inputs and intermediate results,
and more importantly much more data samples for learning. From a theoretical point of
view, increasing the number of features should lead to better performance. However, in
practice the opposite is true. This aspect is called the curse of dimensionality and is basi-
cally because the number of training examples required increases exponentially as dimen-
sionality increases. This results in the data becoming increasingly sparse in the space it
occupies, and this sparsity makes it difficult to achieve statistical significance for many
machine learning methods. A number of machine learning methods have at least O(nd²)
complexity, where n is the number of samples and d is the dimensionality. If the
dimension d is large, the number of samples n may be too small for accurate parameter
estimation.
9.14 Dimensionality Reduction
Some features (dimensions) bear little useful information, which essentially means that
we can drop some features. In dimensionality reduction, high-dimensional points are pro-
jected to a low-dimensional space while preserving the “essence” of the data and the dis-
tances as well as possible. After this projection the learning problems are solved in low
dimensions.
If we have d dimensions, we can reduce the dimensionality to k < d by discarding unim-
portant features or combining several features and then use the resulting k-dimensional
data set for the classification learning problem. In dimensionality reduction we strive to
find the set of features that are most effective. Dimensionality reduction also reduces time
and space complexity, since there is less computation and there are fewer parameters.
Most machine learning and data mining techniques may
not be effective for high-dimensional data since the intrinsic dimension (that is, the actual
features that decide the classification) may be small, for example the number of genes actu-
ally responsible for a disease may be small, but the data set may contain a large number of
other genes as features. The dimension-reduced data can be used for visualizing, explor-
ing, and understanding the data. In addition, cleaning the data will allow simpler models
to be built later.
FIGURE 9.15
Dependency of features.
FIGURE 9.16
Feature reduction.
G ∈ R^{p×d}: x ∈ R^p → y = G^T x ∈ R^d   (d ≪ p)   (9.18)
In the next section we will discuss the first approach, namely PCA.
FIGURE 9.17
Feature reduction—linear transformation.
FIGURE 9.18
Optimized function.
FIGURE 9.19
Representation of function.
9.16.1.1 PCA Methodology
PCA projects the data along the directions where the data varies the most. These direc-
tions are determined by the eigenvectors of the covariance matrix corresponding to the
largest eigenvalues. The magnitude of the eigenvalues corresponds to the variance of the
data along the eigenvector directions. Let us assume that the d observables are linear
combinations of k < d basis vectors. We would like to work with this basis as it has lower
dimension and
FIGURE 9.20
2D to 1D projection.
FIGURE 9.21
1D projection of the data.
FIGURE 9.22
Direction of principal components.
has (almost) all the required information. Using this basis we expect the data to be
uncorrelated; otherwise we could have reduced it further. We choose the projection that
shows the largest variance, since otherwise the features bear no information. We choose
directions such that the total variance of the data is maximum (Figure 9.22); that is, we
choose a projection that maximizes total variance. We choose directions that are
orthogonal and thereby minimize correlation. When we consider a d-dimensional feature
space, we need to choose k < d orthogonal directions which maximize the total variance.
We first calculate the d × d symmetric covariance matrix estimated from the samples.
Then we select the k largest eigenvalues of the covariance matrix and the associated k
eigenvectors. The larger the eigenvalue, the larger is the variance in the direction of the
corresponding eigenvector.
Thus PCA can be thought of as finding a new orthogonal basis by rotating the old axis
until the directions of maximum variance are found.
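The procedure above, centring the data, estimating the d × d covariance matrix, and keeping the eigenvectors with the k largest eigenvalues, can be sketched with NumPy (function name and toy data are ours):

```python
import numpy as np

def pca(X, k):
    """PCA sketch: centre the data, estimate the covariance matrix,
    then keep the k eigenvectors with the largest eigenvalues as the
    new (orthogonal) basis."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]    # pick the k largest
    components = eigvecs[:, order]
    return Xc @ components, components, eigvals[order]

# 2D points that vary mostly along the line y = x
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, t]) + 0.05 * rng.normal(size=(200, 2))
Z, comps, var = pca(X, k=1)
```

For this data the first principal component is close to the diagonal direction (1, 1)/√2, and the 2D points are compressed to a single coordinate each.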
PCA was designed for accurate data representation and not for data classification. The
primary job of PCA is to preserve as much variance in data as possible. Therefore, only if
the direction of maximum variance is important for classification will PCA give good
results for classification. Next we will discuss the Fisher linear discriminant approach.
FIGURE 9.23
Bad and good projections.
FIGURE 9.24
Means of projections.
The mean of the projected samples z_i is given by Equation (9.19):

μ̃ = (1/n) Σ_{i=1}^{n} z_i   (9.19)

and their scatter by Equation (9.20):

S = Σ_{i=1}^{n} (z_i − μ̃)²   (9.20)
Thus scatter is just sample variance multiplied by n. In other words, scatter measures
the same concept as variance, the spread of data around the mean, only that scatter is just
on a different scale than variance.
S_1² = Σ_{y_i ∈ Class 1} (y_i − μ̃_1)²   (9.21)

S_2² = Σ_{y_i ∈ Class 2} (y_i − μ̃_2)²   (9.22)
We need to normalize by both the scatter of class 1 and the scatter of class 2. Thus, the
Fisher linear discriminant projects onto a line in the direction v which maximizes
J(v) = (μ̃_1 − μ̃_2)² / (S_1² + S_2²) (Figure 9.25). J(v) is defined so that the projected means
are far from each other while the scatter of each class is as small as possible; that is, we
want the samples of the respective classes to cluster around their projected means. If we
find a v which makes J(v) large, we are guaranteed that the classes are well separated
(Figure 9.26).
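For two classes, the direction maximizing J(v) has the closed form v ∝ (S1 + S2)⁻¹(μ1 − μ2), where S1 and S2 are the within-class scatter matrices; a sketch (function name and toy data are ours):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher linear discriminant sketch: the direction v maximizing
    J(v) = (projected mean separation)^2 / (sum of projected scatters)
    is v proportional to (S1 + S2)^{-1} (mu1 - mu2)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)   # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)   # scatter matrix of class 2
    v = np.linalg.solve(S1 + S2, mu1 - mu2)
    return v / np.linalg.norm(v)

# Two classes separated along x, with much more spread along y
rng = np.random.default_rng(1)
X1 = rng.normal(size=(100, 2)) * [0.5, 3.0]
X2 = rng.normal(size=(100, 2)) * [0.5, 3.0] + [4.0, 0.0]
v = fisher_direction(X1, X2)
```

Because the within-class scatter is large along y, the Fisher direction is dominated by the x-axis, unlike pure PCA, which would favour the high-variance y direction.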
FIGURE 9.25
Definition of J(V).
FIGURE 9.26
Well separated projected samples.
FIGURE 9.27
SVD.
FIGURE 9.28
SVD illustration.
9.17.1 Market-Basket Analysis
Association rule mining was initially associated with market-basket analysis in the context
of shelf management of a supermarket. The objective is to identify items that are bought
together by sufficiently many customers, and for this purpose the transaction data is pro-
cessed to find the associations or dependencies among the items that the customers put in
shopping baskets to buy. The discovery of such association rules will enable all items that
are frequently purchased together to be placed together on the shelves in a supermarket.
This model can be described as a general many-to-many mapping (association) between
two kinds of things, but we are interested only in associations on the item side of the
mapping.
An example of a market-basket model is shown in Figure 9.29. Here we assume that a
large set of items are sold in the supermarket. There are also a large set of baskets where
each basket contains a small subset of items that are bought by one customer in a day.
This model can be used in a number of applications. Examples include the following:
• Items = products; Baskets = sets of products someone bought in one trip to the
store— Predict how typical customers navigate stores; hence items can be arranged
accordingly. Another example is Amazon’s suggestions where people who bought
item X also bought item Y.
• Baskets = sentences; Items = documents containing those sentences—Items that
appear together too often could represent plagiarism. Here the items are not in
baskets, but we use a similar concept.
• Baskets = patients; Items = drugs and side effects. This model has been used to
detect combinations of drugs that result in particular side effects. In this case the
absence of an item is as important as its presence.
FIGURE 9.29
Example of market-basket model.
FIGURE 9.30
Example showing association rules.
⚬ Support of an association rule: The rule holds with a particular support value
sup in the transaction data set and is the fraction of transactions that contain
both X and Y out of a total number of transactions n.
⚬ Confidence of an association rule: The rule holds in the transaction set with
confidence conf if conf% of transactions that contain X also contain Y, that is, conf
= Pr(Y | X). Confidence measures the frequency of Y occurring in the transac-
tions which contain X.
⚬ Interest of an association rule: The interest of X → Y is the difference between its
confidence and the fraction of baskets that contain Y.
Figure 9.30 shows the example for the preceding concepts where 1-itemsets and 2-itemsets
are shown and the corresponding frequent itemsets with minimum threshold of 50% are
shown. The figure also shows four sample association rules and their associated support
and confidence.
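Support and confidence as defined above can be computed directly from a list of baskets; a sketch (the basket contents are made up for illustration, not taken from the figure):

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, X, Y):
    """conf(X -> Y) = support(X union Y) / support(X) = Pr(Y | X)."""
    return support(transactions, set(X) | set(Y)) / support(transactions, X)

baskets = [{"bread", "milk"},
           {"bread", "butter"},
           {"bread", "milk", "butter"},
           {"milk"}]
```

Here support({bread}) = 3/4, support({bread, milk}) = 2/4, and therefore confidence(bread → milk) = 2/3.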
9.17.3 Apriori Algorithm
The Apriori algorithm was described by R. Agrawal and R. Srikant in 1994 for fast mining
of association rules and is now one of the basic cornerstones of data mining. The problem
tackled by the Apriori algorithm is finding association rules with support ≥ a given s and
confidence ≥ a given c. The hard part is finding the frequent itemsets; note that if
{i1, i2,…, ik} → j is a rule with high support and confidence, then both {i1, i2,…, ik} and
{i1, i2,…, ik, j} will be frequent.
9.17.3.1 Apriori Principle
The following are the principles on which the Apriori algorithm is based:
• All the subsets of an itemset must also be frequent if the itemset itself is frequent.
• If an itemset is not frequent, then none of its supersets can be frequent.
• The anti-monotone property of support states that the support of an itemset never
exceeds the support of its subsets (Equation 9.24), that is,
∀X, Y: X ⊆ Y ⇒ s(X) ≥ s(Y)   (9.24)
9.17.3.2 The Algorithm
The Apriori algorithm uses a level-based approach where the candidate itemsets and
frequent itemsets are generated from 1-itemset to k-itemset until the frequent itemset is
empty (Figure 9.31). Let us assume that Ck = candidate itemsets of size k and Fk = frequent
itemsets of size k. Initially we generate all candidate 1-itemsets C1 and then generate fre-
quent 1-itemsets where all candidate itemsets with support < Min_sup are filtered out. The
FIGURE 9.31
Apriori algorithm—finding frequent itemsets.
frequent itemset set F_k is then used to generate the next level's candidate (k + 1)-itemsets.
The process is then repeated until we get an empty frequent itemset.
All the frequent itemsets generated by the Apriori algorithm are then used to generate
association rules with given support and confidence.
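The level-wise procedure can be sketched compactly in Python (this is our own simplified implementation, not the optimized algorithm of Agrawal and Srikant; the transactions reuse the table from Exercise E.9.4):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori sketch: grow candidate (k+1)-itemsets from the
    frequent k-itemsets and prune using the anti-monotone property
    (every subset of a frequent itemset must itself be frequent)."""
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    frequent = {}
    Fk = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_sup]
    k = 1
    while Fk:
        frequent.update({s: sup(s) for s in Fk})
        # Candidate generation: unions of pairs of frequent k-itemsets
        candidates = {a | b for a in Fk for b in Fk if len(a | b) == k + 1}
        # Prune candidates with any infrequent k-subset, then count support
        Fk = [c for c in candidates
              if all(frozenset(s) in frequent for s in combinations(c, k))
              and sup(c) >= min_sup]
        k += 1
    return frequent

trans = [frozenset(t) for t in
         [{"cutlets", "buns", "sauce"}, {"cutlets", "buns"},
          {"cutlets", "samosa", "chips"}, {"chips", "samosa"},
          {"chips", "sauce"}, {"cutlets", "samosa", "chips"}]]
freq = apriori(trans, min_sup=0.5)
```

With a 50% minimum support, only {cutlets}, {samosa}, {chips}, and {samosa, chips} survive; {buns} and {cutlets, chips} fall below the threshold.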
• Hash-based itemset counting: Here we make use of the fact that if the k-itemset has
a hashing bucket count below the threshold, then the itemset cannot be frequent.
• Transaction reduction: A transaction that does not contain any frequent k-itemset
can be removed from consideration in subsequent scans.
• Partitioning: Any itemset that is frequent in the transaction database must be fre-
quent in at least one of the partitions of the database.
FIGURE 9.32
Example—Apriori algorithm.
9.18 Summary
• Explained the concepts of unsupervised learning and clustering.
• Discussed hierarchical clustering along with different distance measures.
• Explained agglomerative clustering with an illustrative example.
• Discussed the k-means algorithm in detail and illustrated with an example.
• Explained the different aspects of cluster validity and the cluster validation
process.
• Explained the curse of dimensionality and the concept of dimensionality reduction.
• Explained in detail the principal component analysis, Fisher discriminant, and
singular value decomposition methods of dimensionality reduction.
• Explained the concept of association rule mining and explained the Apriori
algorithm.
9.19 Points to Ponder
• Agglomerative clustering is considered to follow the bottom-up strategy.
• Euclidean distance and Manhattan distance are considered as special cases of
Minkowski distance.
• Centroid and clusteroid concepts are required.
• The k-means algorithm is called a partitional algorithm.
• It is important to select appropriate initial centroids in the case of k-means
algorithm.
• Purity based measures are used to evaluate clustering.
• Having a larger number of features does not necessarily result in good learning.
• The anti-monotone property is the basis of the Apriori algorithm.
E.9 Exercises
E.9.1 Suggested Activities
E.9.1.1 Design a business application where the aim is to target existing customers with
specialized advertisement and provide discounts for them. Design appropriate
features for the clustering.
E.9.1.2 Suggest a method to group learners for at least five educational purposes.
Explain the features used for the different purposes.
E.9.1.3 Give two applications of the market-basket model not given in this chapter.
Self-Assessment Questions
E.9.2 Multiple Choice Questions
E.9.2.4 In clustering, data is partitioned into groups that satisfy constraints that
i Points in the same cluster should be dissimilar and points in different clus-
ters should be similar
ii Points in the same cluster should be similar and points in different clusters
should be dissimilar.
iii Points in the same cluster should be similar
E.9.2.8 Hierarchical clustering takes as input a set of points and creates a tree called
_____ in which the points are leaves and the internal nodes reveal the similarity
structure of the points.
i Hierarchical tree
ii Dendrogram
iii General tree
E.9.4 Problems
E.9.4.1 For three objects, A: (1, 0, 1, 1), B: (2, 1, 0, 2), and C: (2, 2, 2, 1), store them in a data
matrix and use Manhattan, Euclidean, and cosine distances to generate distance
matrices, respectively.
E.9.4.2 Use the nearest neighbour clustering algorithm and Euclidean distance to clus-
ter the examples from Exercise E.9.4.1, assuming that the threshold is 4.
E.9.4.3 Use single and complete link agglomerative clustering to group the data in the
following distance matrix (Figure E.9.1). Show the dendrograms.
FIGURE E.9.1
Trans ID Items
T1 Cutlets, Buns, Sauce
T2 Cutlets, Buns
T3 Cutlets, Samosa, Chips
T4 Chips, Samosa
T5 Chips, Sauce
T6 Cutlets, Samosa, Chips
E.9.5 Short Questions
E.9.5.1 How do you think unsupervised learning works? What are the goals of unsu-
pervised learning?
E.9.5.2 Distinguish between supervised and unsupervised learning.
E.9.5.3 What are the main challenges of unsupervised learning?
E.9.5.4 What is Minkowski distance? What are other distances derived from it?
E.9.5.6 What is the difference between agglomerative clustering and divisive clustering?
E.9.5.7 Distinguish between single link, average link, and complete link used to define
cluster distance.
E.9.5.8 What is the difference between a centroid and clusteroid?
References
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases.
Proceedings of the 20th Int. Conference on Very Large Databases, Santiago de Chile, 487–499.
Chouinard, J.-C. (2022, May 5). What is dimension reduction in machine learning (and how it works).
https://www.jcchouinard.com/dimension-reduction-in-machine-learning/
Cline, A. K., & Dhillon, I. S. (2006, January). Computation of singular value decomposition. Handbook
of Linear Algebra, 45-1–45-13. https://www.cs.utexas.edu/users/inderjit/public_papers/HLA_SVD.pdf
Istrail, S. (n.d.). An overview of clustering methods: With applications to bioinformatics. SlideToDoc.
https://slidetodoc.com/an-overview-of-clustering-methods-with-applications-to/
Jaadi, Z., & Powers, J. (2022, September 26). A step-by-step explanation of principal component
analysis (PCA). BuiltIn. https://builtin.com/data-science/step-step-explanation-principal-component-analysis
Jin, R. (n.d.). Cluster validation. Kent State University. http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf
Li, S. (2017, September 24). A gentle introduction on market basket analysis—association rules.
Towards Data Science. https://towardsdatascience.com/a-gentle-introduction-on-market-basket-analysis-association-rules-fa4b986a40ce
Mathew, R. (2021, January 18). A deep dive into clustering. Insight. https://insightimi.wordpress.com/2021/01/18/a-deep-dive-into-clustering/
Shetty, B., & Powers, J. (2022, August 19). What is the curse of dimensionality? BuiltIn.
https://builtin.com/data-science/curse-dimensionality
Thalles, S. (2019, January 3). An illustrative introduction to Fisher's linear discriminant.
https://sthalles.github.io/fisher-linear-discriminant/
Wikipendium. (n.d.). TDT4300: Data warehousing and data mining. https://www.wikipendium.no/TDT4300_Data_Warehousing_and_Data_Mining
10
Sequence Models
10.1 Sequence Models
When the input or output to machine learning models is sequential in nature, these models
are called sequence models. Machine learning today needs to deal with text streams, audio
clips, video clips, and time-series data, which are all sequence data. In sequence models,
data do not have the independently and identically distributed (IID) characteristic since
the sequential order of the data imposes some dependency.
One example of a sequence model in deep learning is recurrent neural networks (RNNs),
which will be discussed later in Chapter 16. However, sequence models in machine learn-
ing and their applications will be discussed in this chapter.
Task 1
In speech recognition, the input is a sound spectrogram, which is sequence data: Figure 10.1
shows sound waves decomposed into frequency and amplitude using Fourier transforms.
Task 2
Consider an application of NLP called named entity recognition (NER), which uses the
following sequence input:
Task 3
Consider a machine translation example: Echte dicke kiste ➔ Awesome sauce, in which
the input and outputs are sequence of words.
Task 4
Consider the task of predicting sentiment from movie reviews, for example:
FIGURE 10.1
Sound waves decomposed into frequency and amplitude using Fourier transforms.
Task 5
In a speaker recognition task, the goal is to recognise the speaker, given a sound
spectrogram, which is again sequence data.
These examples show that there are different applications of sequence models. Sometimes
both the input and output are sequences; in some either the input or the output is a
sequence.
FIGURE 10.2
Speech recognition.
FIGURE 10.3
Sentiment classification.
FIGURE 10.4
Video activity recognition.
FIGURE 10.5
Auto-completion sequential process.
In other words, the network attempts to predict the next character, from the 26 letters of
the English alphabet, based on the letter "l" that has been entered. Given the previous
letters, the neural network would generate a softmax output of size 26 reflecting the
likelihood of each possible following letter. Since the inputs to this network are letters,
they must be converted to one-hot encoded vectors of length 26, with the element
corresponding to the index of the letter set to 1 and the rest set to 0.
5. Parts of speech tagging: In parts of speech tagging, we are given a sequence of
words and must anticipate the part of speech tag for each word (verb, noun, pro-
noun, etc.). Again, the outcome depends on both the current and past inputs. If the
previous word is an adjective, the likelihood of tagging “girl” as a noun increases
(Figure 10.6).
FIGURE 10.6
Parts of speech sequential process.
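The one-hot letter encoding described above for the auto-completion task can be sketched as follows (the function name is ours):

```python
import string

def one_hot(letter):
    """One-hot encode a lowercase letter as a length-26 vector: the
    position of the letter in the alphabet is set to 1, all others 0."""
    vec = [0] * 26
    vec[string.ascii_lowercase.index(letter)] = 1
    return vec

v = one_hot("l")   # "l" is the 12th letter, so index 11 is set
```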
y_t = f̂(x_1, x_2, …, x_n)   (10.1)
The important thing to keep in mind in this situation is that the task remains the same
regardless of whether we are predicting the next character or classifying the part of speech
of a word. Every time step requires a new input because the function needs to keep track
of a larger list of words for lengthier sentences.
In other words, we must define a function with the following properties:
i. Make sure that the output yt depends on the inputs before it.
ii. Make sure the function can handle a variety of inputs.
iii. The function that is performed at each time step must be the same.
10.4 Markov Models
Understanding the hidden Markov model requires knowledge about Markov models.
First let us discuss the Markov model before we explore hidden Markov models.
In probability theory, stochastic models that represent randomly changing systems may
have the Markov property, named after the Russian mathematician Andrey Markov. The
Markov property is memory-less: at any given time, only the current state determines the
next state, and the next state does not depend on any earlier states. In terms of probability,
when the conditional probability distribution of the future states of a process depends
only on the current state and is independent of past states, the stochastic process is said
to possess the Markov property. Markov models are based on this property.
In the simplest form, a Markov model assumes that observations are completely inde-
pendent, that is, like a graph without edges between them as shown in Figure 10.7.
The simplest form can be understood with the following example: predicting whether it
rains tomorrow based only on the relative frequency of rainy days, ignoring whether it
rained the previous day.
FIGURE 10.7
Markov model with independent observations.
The general Markov model for observations {x_n} can be expressed by the product rule,
which gives the joint distribution of the sequence of observations {x_1, x_2, …, x_N} as
given in Equation (10.2):

P(x_1, …, x_N) = ∏_{n=1}^{N} p(x_n | x_1, …, x_{n−1})   (10.2)

If each observation is assumed to depend only on its immediate predecessor, we obtain
the first-order Markov chain of Equation (10.3):

p(x_1, …, x_N) = p(x_1) ∏_{n=2}^{N} p(x_n | x_{n−1})   (10.3)
TABLE 10.1
Four Commonly Used Markov Models
System State Is Fully Observable System State Is Partially Observable
System is autonomous Markov chain—show all possible Hidden Markov model—has some
states, and the transition probability unobservable states. They show states
of moving from one state to another and transition rates and in addition
at a point in time. also represent observations and
observation likelihoods for each state.
System is controlled Markov decision process—shows all Partially observable Markov decision
possible states but is controlled by an process—similar to Markov decision
agent who makes decisions based on process; however, the agent does not
reliable information. always have reliable information.
FIGURE 10.8
Markov model with chain of observations.
p(x_n | x_1, …, x_{n−1}) = p(x_n | x_{n−1})   (10.4)
If the model is used to predict the next observation, the distribution of predictions will
only depend on preceding observation and be independent of earlier observations.
Stationarity implies conditional distributions p(xn|xn−1) are all equal.
Example: A Markov chain with three states representing the dinner of the day (sand-
wich, pizza, burger) is given. A transition table with transition probabilities representing
the dinner of the next day given the dinner of the current day is also given (Figure 10.9).
From the previous example, let us find the probability that the dinner for the next seven
days will be “B-B-S-S-B-P-B”, given that today's dinner is a burger. With S1 = sandwich,
S2 = pizza, and S3 = burger, the corresponding observation sequence is
O = {S3, S3, S3, S1, S1, S3, S2, S3}.
Then, the probability of the given model O is as follows:
P(O | Model) = P(S3, S3, S3, S1, S1, S3, S2, S3 | Model)
= P(S3)·P(S3|S3)·P(S3|S3)·P(S1|S3)·P(S1|S1)·P(S3|S1)·P(S2|S3)·P(S3|S2)
= π3·a33·a33·a31·a11·a13·a32·a23
= 1·(0.8)·(0.8)·(0.1)·(0.4)·(0.3)·(0.1)·(0.2)
= 1.536 × 10⁻⁴
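The computation above can be reproduced in code. The transition matrix below is hypothetical: only the six entries used in the worked example are taken from it (a33 = 0.8, a31 = 0.1, a11 = 0.4, a13 = 0.3, a32 = 0.1, a23 = 0.2); the remaining entries merely make each row sum to 1:

```python
import numpy as np

# States 0, 1, 2 stand for S1, S2, S3; rows are "from", columns "to"
A = np.array([[0.4, 0.3, 0.3],    # from S1
              [0.3, 0.5, 0.2],    # from S2
              [0.1, 0.1, 0.8]])   # from S3
pi = np.array([0.0, 0.0, 1.0])    # start in S3 with probability 1

def sequence_probability(states, A, pi):
    """P(O) = pi[s0] times the product of the transition probabilities
    along the observed state sequence O."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]
    return p

O = [2, 2, 2, 0, 0, 2, 1, 2]   # S3,S3,S3,S1,S1,S3,S2,S3 (0-indexed)
p = sequence_probability(O, A, pi)   # about 1.536e-04, as computed above
```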
FIGURE 10.9
Markov chain with three states.
Let us consider another example, where we apply a Markov model to spoken word
production. If the states /k/ /a/ /c/ /h/ represent phonemes, the Markov model for the
production of the word “catch” is as shown in Figure 10.10.
The transitions are from /k/ to /a/, /a/ to /c/, /c/ to /h/, and /h/ to a silent state.
Although only the correct “catch” sound is represented by the model, other transitions
can be introduced, for example /k/ followed by /h/.
In the second-order Markov model, each observation is influenced by the previous two
observations (Equation 10.5). The conditional distribution of observation xn depends on
the values of two previous observations, xn−1 and xn−2, as shown in Figure 10.11.
p(x_1, …, x_N) = p(x_1) p(x_2 | x_1) ∏_{n=3}^{N} p(x_n | x_{n−1}, x_{n−2})   (10.5)
Similarly, for an Mth-order Markov model, the conditional distribution for a particular
variable depends on the previous M variables. For a discrete variable with K states, the
first-order conditional p(x_n | x_{n−1}) needs K − 1 parameters for each of the K values of
x_{n−1}, giving K(K − 1) parameters in total. An Mth-order model correspondingly needs
K^M(K − 1) parameters, which grows exponentially with M.
Latent variables: While Markov models are tractable, they are severely limited.
Introducing latent variables provides a more general framework and leads to state-space
models. When the latent variables are discrete, we obtain hidden Markov models; if they
are continuous, we obtain linear dynamical systems. This yields models for sequences
that are not limited by Markov assumptions of any order and yet have a limited number
of parameters. For each observation x_n, we can introduce a latent variable z_n, which can
be of a different type or dimensionality than the observed variable. The latent variables
form a Markov chain, resulting in the state-space model.
FIGURE 10.10
Markov model for spoken word production.
FIGURE 10.11
Second-order Markov model.
Sequence Models 255
• Markov assumption: The state transition depends only on the origin and destina-
tion states—that is, it assumes that state at zt is dependent only on state zt−1 and
independent of all prior states (first order).
• Output-independent assumption: All observations are dependent only on the
state that generated them, not on neighbouring observations.
• Moreover, the observation xt at time t was generated by some process whose state
zt is hidden from the observer.
Consider an example (Figure 10.12), where z represents phoneme sequences, and x are
acoustic observations. Graphically HMM can be represented with a state-space model and
discrete latent variables as shown in the following Figure 10.12 and the joint distribution
has the form as shown in (Equation 10.6).
p(x1, …, xN, z1, …, zN) = p(z1) ∏_{n=2..N} p(zn | zn−1) ∏_{n=1..N} p(xn | zn)   (10.6)
A single time slice in the preceding graphical model corresponds to a mixture distribu-
tion with component densities p(x|z).
10.4.3.1 Parameters of an HMM
• In hidden Markov models, S = {s1,…, sn} is a set of states where n is the number of
possible states, while V = {v1,…, vm} is the set of symbols, where m indicates the
number of symbols observable in the states.
• Transition probabilities form a two-dimensional matrix A = {a1,1, a1,2, …, an,n}, where
each ai,j represents the probability of transitioning from state si to sj.
• Emission probabilities or the observation symbol distribution probability is a set
B of functions of the form bi (ot), which is the probability of observation ot being
emitted or observed by state si.
• The initial state distribution π gives, for each state si, the probability πi that si is the start state.
FIGURE 10.12
Graphical model of an HMM.
Therefore, we have two model parameters n, the number of states of the HMM, and m,
the number of distinct observation symbols per state. In addition, we have three prob-
ability measures: A, the transition probability; B, the emission probability; and π, the initial
probability.
Problem 1 Evaluation
Here, given the observation sequence O = o1,…,oT and an HMM model,
Q1: How do we efficiently compute the probability of the observation sequence given the
model?
A1: The forward algorithm
Problem 2 Decoding
Here, given the observation sequence O = o1,…,oT and an HMM model,
Q2: How do we compute the most probable hidden sequence of states, given a sequence of
observations?
A2: Viterbi’s dynamic programming algorithm
Problem 3 Learning
Here given an observation sequence and set of possible models,
Q3: Which model most closely fits the data? In other words, how are the model parameters—
that is, the transition probabilities, emission probabilities, and initial state distribution—
adjusted so that the model maximizes the probability of the observation sequence given the
model parameters?
A3: The expectation maximization (EM) heuristic finds which model most closely fits the
data.
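A minimal sketch of the forward algorithm, which solves the evaluation problem (computing the probability of an observation sequence given the model). The two-state model parameters below are assumptions for illustration only.

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward algorithm: returns P(O | model) for an HMM.
    pi: initial distribution over n states; A: n x n transition matrix;
    B: n x m emission matrix; obs: sequence of observation symbol indices."""
    alpha = pi * B[:, obs[0]]          # initialisation: alpha_1(i) = pi_i * b_i(o1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction over time steps
    return alpha.sum()                 # termination: sum over final states

# A hypothetical two-state HMM with two observable symbols (values assumed)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(forward([0, 1, 0], pi, A, B))
```

The forward recursion needs only O(n²T) operations, whereas summing over all state sequences explicitly costs O(nᵀ).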
A partially observable Markov decision process (POMDP) consists of partially observable
states and a decision-making agent that takes actions based on the current state, but
relying only on limited observations. POMDPs are used in a variety of applications,
including the operation of simple agents or robots.
FIGURE 10.13
General process of data stream mining.
When mining data streams, it is important that the data preprocessing step, which is an
essential and time-consuming part of the process of discovering new knowledge, be opti-
mized. It is of the utmost importance to develop lightweight preprocessing methods that
are capable of ensuring the quality of the mining results.
In order to arrive at reliable estimates when mining data streams, it is essential to take
into account the restricted resources that are available, such as memory space and comput-
ing power. The accuracy of the results produced by stream data mining algorithms would
significantly suffer if those algorithms do not consider these restrictions.
ε = √( R² ln(1/δ) / (2N) )   (10.7)
When choosing a splitting attribute at a node, the Hoeffding tree algorithm uses the
Hoeffding bound to find the minimum number of examples (N) required to select the
correct attribute with high probability. Unlike most other bounds, the Hoeffding bound
does not depend on the underlying probability distribution. The Hoeffding tree induction algorithm, also
called the very fast decision tree (VFDT) algorithm, is explained in the flowchart
(Figure 10.14). In this algorithm we start with a tree consisting of a single root node. To
select a split attribute for a node, only a sufficient number of samples is considered. Given
a stream of examples, we first select the split for the root, sort incoming examples to the
leaves, pick the best attribute to split on, and repeat the procedure till there are no more
examples.
The intended probability that the right attribute is selected at every node in the tree is
one minus the δ parameter used in the Hoeffding bound. Since this probability must be
close to one, δ is typically set to a small value.
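As a small illustration, the Hoeffding bound of Equation 10.7 can be computed directly; the example values for R, δ, and N below are assumptions.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound (Equation 10.7): with probability 1 - delta, the true
    mean of a random variable with range R differs from the sample mean of
    n observations by at most this epsilon."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2 * n))

# e.g. a statistic in [0, 1]: split when the difference in merit between the
# two best attributes exceeds epsilon
print(hoeffding_bound(value_range=1.0, delta=1e-7, n=1000))  # ≈ 0.09
```

Note that ε shrinks as more examples arrive, so a node eventually accumulates enough evidence to commit to a split.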
FIGURE 10.14
Hoeffding tree induction or very fast decision tree (VFDT) algorithm.
FIGURE 10.15
Adaptive learning using naïve Bayes.
FIGURE 10.16
Windowing strategy—external to the learning algorithm.
FIGURE 10.17
Windowing strategy—internal to the learning algorithm.
2. In this method, in order for the learning algorithm to maintain and continuously
update statistics, the window is embedded internally as shown in Figure 10.17.
However, it is the responsibility of the learning algorithm to understand the sta-
tistics and act accordingly.
Learning algorithms usually compare the statistics of two windows to detect change.
These methods can use less memory than expected; for example, they might maintain
window statistics without storing all of a window’s elements. Other window management
strategies are as follows:
• Equal and fixed size sub-windows: In this type, comparison is carried out between
a nonsliding window of older data and a sliding window of the same size over cur-
rent data.
• Equal size adjacent sub-windows: Here we compare between two adjacent sliding
windows of the same size of recent data.
• Total window against sub-window: In this strategy, the window that contains all
the data is compared with a sub-window of data from the beginning. The process
extends until the accuracy of the algorithm decreases by a threshold.
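A minimal sketch of the sub-window comparison idea, using a simple difference of window means as the change statistic. The statistic and the fixed threshold are illustrative assumptions; practical detectors use principled statistical tests.

```python
def drift_detected(older, recent, threshold=0.1):
    """Compare summary statistics of two equal-size windows and flag a
    change when the window means differ by more than `threshold`.
    (The mean statistic and threshold value are illustrative choices.)"""
    older_mean = sum(older) / len(older)
    recent_mean = sum(recent) / len(recent)
    return abs(older_mean - recent_mean) > threshold

# a stable stream segment vs. one whose distribution has shifted
print(drift_detected([0.50] * 100, [0.52] * 100))  # False: small difference
print(drift_detected([0.50] * 100, [0.80] * 100))  # True: mean shifted by 0.3
```

Only the running sums and counts of each window need to be kept, which is why such methods can use less memory than storing the windows themselves.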
1. For each i,
FIGURE 10.18
Small-space algorithm and its representation.
10.7 Applications
10.7.1 Applications of Markov Model
Application in retail domain: If you visit the grocery store once per week, it is quite simple
for a computer program to forecast when your shopping trip will take longer. A hidden
Markov model can determine on which days of the week visits take longer and then use
this knowledge to discover why some trips take longer than others for shoppers such as
yourself. The recommendation engine is another e-commerce application that employs
hidden Markov models, which attempt to forecast the next item you will purchase.
Application in travel industry: Using hidden Markov models, airlines are able to fore-
cast how long it will take a passenger to check out of an airport. This allows them to choose
when to begin boarding people!
Application in medical domain: Hidden Markov models are utilised in a variety of
medical applications that seek to uncover the concealed states of a human body system or
organ. For instance, cancer diagnosis can be accomplished by examining specific sequences
and calculating the risk they pose to the patient. Hidden Markov models are also used to
evaluate biological data such as RNA-Seq, ChIP-Seq, and so on, which aids researchers in
comprehending gene regulation. Based on a person’s age, weight, height, and body type,
doctors can calculate their life expectancy using the hidden Markov model.
Application in marketing domain: When marketers employ a hidden Markov model,
they are able to determine at which step of their marketing funnel customers are abandon-
ing their efforts and how to enhance user conversion rates.
Identifying the problems that need to be solved in order to enhance operations and
increase yields is one example of how IoT data analysis may be applied. With the help of
real-time stream processing, a manufacturer may be able to notice, while it is occurring,
that a manufacturing line is churning out too many abnormalities. This is in contrast to
the traditional method of discovering a whole
defective batch at the end of the day’s shift. By pausing the line for quick repairs, they will
be able to realize significant cost savings and avoid significant waste.
10.8 Summary
• Discussed learning models that deal with stream data like text streams, audio
clips, video clips, time-series data, and so on as input and output.
• Outlined various applications of sequence models such as speech recognition,
video activity recognition, auto-completion, and so on.
• Explained the Markov model including the Markov chain model, the hidden
Markov model, the Markov decision process, partially observable Markov deci-
sion process, and Markov random field.
• Outlined the process of data stream mining.
• Described the various learning methods for data stream mining, which are tree
based, adaptive, window based, and data clustering.
• Listed the various applications of Markov models and data stream mining.
10.9 Points to Ponder
• In sequence learning, the true output at time step t depends on every input that the
model has encountered up to that point, and hence there is a need to find a func-
tion that depends on all of the prior inputs.
• The Markov property is a memoryless characteristic.
• In the first-order Markov model the next state in a sequence depends only on one
state—the current state.
• The assumptions associated with hidden Markov models are the Markov assump-
tion and the output-independent assumption.
E.10 Exercises
E.10.1 Suggested Activities
E.10.1.1 Use a Markov chain to predict which products will be in a user’s next order.
Data set to be used: https://www.kaggle.com/competitions/instacart-market-
basket-analysis/data
E.10.1.2 Implement the hidden Markov model for named entity recognition using
the NER data set: https://www.kaggle.com/datasets/debasisdotcom/name-
entity-recognition-ner-dataset
Self-Assessment Questions
E.10.2 Multiple Choice Questions
E.10.2.2 When modelling sequence learning problems, we need to make sure that
i The function depends on all of the succeeding inputs
ii The function that is performed at each time step is different
iii The true output at each time step depends on every input encountered
up to that point
E.10.2.4 Predicting the price of stock based only on relative frequency of monthly price
is an example of a
i Markov chain
ii Markov decision process
iii Markov model with independence observations
E.10.2.5 All possible states are observed, but the process is controlled by an agent who
makes decisions based on reliable information. This is an example of a
i Markov decision process
ii Hidden Markov model
iii Markov chain
E.10.2.6 When discrete latent variables are introduced into Markov models, they are
called
i Latent Markov models
ii Hidden Markov models
iii Dynamical systems
E.10.2.7 The output-independent assumption of the hidden Markov model states that
i All observations are dependent only on the state that generated them
ii All observations are dependent only on neighbouring observations
iii The state transition depends only on the origin and destination states
E.10.2.9 In HMM, the _____ algorithm is used for computing the probability of a given
sequence of observations.
i Viterbi
ii Baum–Welch
iii Expectation maximization
E.10.2.10 A state that is influenced by its neighbours in various directions is a
concept of
i Markov random field
ii Hidden Markov model
iii Markov decision process
E.10.2.13 Adaptive learning using naïve Bayes for learning from data streams
i Employs naïve Bayes only when already proven to be more accurate
ii Always employs the majority class
iii Always employs naïve Bayes
E.10.3 Match the Following
No Match
E.10.3.1 Speech recognition A Given an observation sequence and set of possible models,
finding a model that most closely fits the data
E.10.3.2 Sequence learning B Memoryless characteristic
E.10.3.3 Markov model C Achieves a constant factor approximation for the k-median
problem in a single pass and using small space
E.10.3.4 Markov chain D Example of sequence–sequence task
E.10.3.5 Expectation maximization E Develop an estimate of the function that depends on all of the
prior inputs
E.10.3.6 A data stream F Probability of an observation being observed by a state
E.10.3.7 Partially observable G Shows all possible states but is controlled by a decision-
Markov decision process making agent who does not always have reliable information.
E.10.3.8 STREAM algorithm H Is an ordered sequence of instances that can be read only once
or a few times
E.10.3.9 Hoeffding trees I Selecting an ideal splitting characteristic may frequently be
done with a relatively small sample size
E.10.3.10 Emission probability J Show all possible states, and the transition probability of
moving from one state to another at a point in time
E.10.4 Problems
E.10.4.1 Consider the Markov chain with three states, S = {1,2,3}, that has the following
transition matrix P:
E.10.4.2 A Markov chain is used to check the status of a machine used in a manufactur-
ing process. Suppose that the possible states for the machine are as follows:
idle and awaiting work (I); working on a job/task (W); broken (B); and in
repair (R). The machine is monitored at regular intervals (every hour) to deter-
mine its status. The transition matrix is as follows:
Use the transition matrix to identify the following probabilities about the sta-
tus of the machine one hour from now:
a. If the machine is idle now, find the probability that the machine is working
on a job (i) one hour from now and (ii) three hours from now.
b. If the machine is working on a job now, find the probability that the
machine is idle (i) one hour from now and (ii) three hours from now.
c. If the machine is being repaired now, find the probability that the machine
is working on a job (i) one hour from now and (ii) three hours from now.
d. If the machine is broken now, find the probability that the machine is being
repaired (i) one hour from now and (ii) three hours from now.
E.10.4.3 Assume that a student can be in one of four states: rich (R), average (A), poor
(P), or in debt(D). Assume the following transition probabilities:
• If a student is rich, in the next time step the student will be
Average:0.75, Poor:0.2, In Debt:0.05
• If a student is average, in the next time step the student will be
Rich: 0.05, Average: 0.2, In Debt:0.45
• If a student is poor, in the next time step the student will be
Average:0.4, Poor:0.3, In Debt: 0.2
• If a student is in debt, in the next time step the student will be
Average: 0.15, Poor: 0.3, In Debt:0.55
a. Model the preceding information as a discrete Markov chain, draw the cor-
responding Markov chain, and obtain the corresponding stochastic matrix.
b. Let us assume that a student starts their studies as average. What will be
the probability of them being rich after one, two, and three time steps?
E.10.4.4 Compute the parameters of an HMM model, given the following sequences of
pairs (state, emission):
(D,the) (N,wine) (V,ages) (A,alone)
(D,the) (N,wine) (N,waits) (V,last) (N,ages)
(D,some) (N,flies) (V,dove) (P,into) (D,the) (N,wine)
(D,the) (N,dove) (V,flies) (P,for) (D,some) (N,flies)
(D,the) (A,last) (N,dove) (V,waits) (A,alone)
a. Draw the graph of the resulting bigram HMM, and list all nonzero model
parameters that we can obtain via MLE from this data.
b. Compute the probability of the following sequence according to the model:
E.10.5 Short Questions
E.10.5.1 Do sequence models have the independently and identically distributed
(IID) characteristic? Justify your answer.
E.10.5.2 Give an example of a sequence-to-symbol application.
E.10.5.3 List five examples of sequence models, justifying your choice.
E.10.5.4 Differentiate between the different types of Markov models.
E.10.5.5 Describe the second-order Markov model with an example.
E.10.5.6 List the various assumptions of HMM.
E.10.5.7 Explain the three problems associated with HMM and name the algorithms
that solve the problems.
E.10.5.8 What is data stream mining?
E.10.5.9 Explain in detail the general process of data stream mining.
E.10.5.10 What are the challenges of data stream mining?
E.10.5.11 Discuss in detail the tree based learning method used for data stream mining.
E.10.5.12 Outline two window based methods used for data stream mining.
E.10.5.13 What is data stream clustering?
E.10.5.14 Explain an algorithm used for data stream clustering.
E.10.5.15 Give some typical applications of Markov models.
E.10.5.16 Explain any two applications of real-time stream processing.
11
Reinforcement Learning
11.1 Introduction
The first thought that comes to mind when we think about the process of learning is how
we learn by way of interaction with our environment. This process of learning through
interaction results in knowledge about cause and effect, the consequences of actions, and
so on. Learning through interaction is a foundational idea underlying nearly all theories
of learning and intelligence. For example, when a child smiles, it has no explicit trainer,
but it does have an explicit connection with its environment. In this chapter we explore a
computational model for machines for learning from interaction. This approach is called
reinforcement learning (RL), which is more focused on goal-directed learning from interac-
tion than are other approaches to machine learning.
One of the most significant subfields of machine learning is reinforcement learning, in
which an agent learns how to behave in an environment by carrying out actions and seeing
the effects of those actions. The idea behind reinforcement learning is that an agent will
learn from the environment by interacting with it and receiving rewards for performing
actions. Humans learn from interaction with the environment using their natural experi-
ences. Reinforcement learning is just a computational approach to learning from action.
Reinforcement learning varies from supervised learning in that the answer key is
included with the training data in supervised learning, so the model is trained with the
correct answer, whereas in reinforcement learning, there is no answer and the reinforce-
ment agent decides what to do to complete the given task. Without any example data or
training data set, it is made to learn by experience.
Example: Consider a question-and-answer game as an example. At any point in the game,
there are only a finite number of answers (actions) that can be made; these are determined
by the environment (Figure 11.1).
As shown in the figure, the environment will receive the agent’s action as an input and
will output the ensuing state and a reward. In the chess example, the environment would
return the state of the chessboard after the agent’s move and the opponent’s move, and the
agent’s turn would then resume. The environment will also provide a reward, such as
capturing a piece. Table 11.1 lists the differences between reinforcement learning and
supervised learning.
FIGURE 11.1
Agent, environment, state, and reward in reinforcement learning.
TABLE 11.1
Difference Between Reinforcement Learning and Supervised Learning
Reinforcement Learning (RL):
• The essence of reinforcement learning is sequential decision making: the output
depends on the state of the current input, and the next input depends on the outcome
of the previous input.
• Decisions are dependent, hence sequences of dependent decisions are labelled.
• A chess game is an example of reinforcement learning.
Supervised Learning:
• The choice is determined based on the beginning or initial information.
• Decisions are independent of one another, hence each decision is labelled individually.
• Object recognition is an example of supervised learning.
Types of reinforcement learning: There are two types of reinforcement, positive reinforce-
ment and negative reinforcement.
• Positive reinforcement is described as the strengthening of behaviour by the addition
of a favourable stimulus. It maximizes performance and can sustain change for a long
period of time, but an excess of reinforcement can result in an excess of states, which
can lower the outcomes.
• Negative reinforcement is described as the strengthening of behaviour due to the
elimination or avoidance of a negative condition. The positive aspects of negative
reinforcement learning are that it increases conduct, provides resistance to a
minimum performance standard, and only gives the bare minimum for acceptable
behaviour.
• Minimize the worst unhappiness: When actions are not shared, this can be
approximated by W-learning (Equation 11.2).
• Minimize collective unhappiness: When actions are not shared, this can be
approximated by collective W-learning (Equation 11.3).
min_{a∈A} Σ_{i=1..n} [ Qi(x, ai) − Qi(x, a) ]   (11.3)
• Maximize collective happiness: This strategy can only be executed for shared
actions. Maximize collective happiness is essentially the same as minimize col-
lective unhappiness. When the actions are not shared it can be approximated by
collective W-learning (Equation 11.4).
max_{a∈A} Σ_{i=1..n} Qi(x, a)   (11.4)
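A sketch of the maximize-collective-happiness strategy for shared actions (Equation 11.4), assuming each agent i keeps its own Q-table; the example Q-values are made up for illustration.

```python
import numpy as np

def maximize_collective_happiness(Q_list, x):
    """Pick the shared action a maximizing the sum over agents of Qi(x, a)
    (the 'maximize collective happiness' strategy, Equation 11.4).
    Q_list: one Q-table per agent, each of shape (n_states, n_actions)."""
    total = sum(Q[x] for Q in Q_list)   # element-wise sum across agents
    return int(np.argmax(total))

# Two hypothetical agents in a 1-state, 3-action problem (values assumed)
Q1 = np.array([[1.0, 0.0, 2.0]])
Q2 = np.array([[0.5, 3.0, 0.5]])
print(maximize_collective_happiness([Q1, Q2], 0))  # 1: sums are 1.5, 3.0, 2.5
```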
where the function p completely characterizes the dynamics of the environment for all
s′, s ∈ S, r ∈ R, and a ∈ A(s). Here p: S × R × S × A → [0, 1] is an ordinary deterministic
function of four arguments (Equation 11.6).
Therefore only the immediately preceding state–action pair (St−1,At−1) decides the prob-
ability of each possible value for St and Rt. The MDP framework is so flexible and
generalized that it can be applied to a variety of problems in many different ways. Time
steps, apart from referring to real-time intervals, may also refer to arbitrary decision-
making and acting stages. Actions can be like controlling voltages, or high-level decisions.
States can be determined by low-level data such as direct sensor readings, or high-level
and abstract descriptions of room objects. Some of a state’s components may be mental or
subjective, depending on past sensations. In other words, actions are any decisions we
wish to learn how to make, and states may help us to make those decisions.
Example: A robot arm’s pick-and-place motion could be controlled using reinforcement
learning. To be able to learn quick and smooth motions, the learning agent needs low-
latency information regarding the locations and velocities of the mechanical components
in order to have direct control of the motors. In this case, actions may be voltages of the
motors at each joint of the robot and states may be the angles and velocities of the joints.
+1 may be the reward assigned for every item that is picked up and placed. The “jerkiness”
of the motion of the robot may be associated with a modest negative reward.
In a Markov decision process, the objective is to develop a good “policy,” which is a
function Π. The function Π specifies the action Π (s) which the decision maker will select
when in state s. Once a Markov decision process is linked with a policy in this manner, the
action for each state is fixed, and the resulting system behaves like a Markov chain. The
objective of the optimization is to choose a Π that will maximize some cumulative function
of the random rewards. MDPs can have several optimal policies. The Markov property
implies that an optimal policy is a function of the present state alone.
11.4.1 Dynamic Programming
Dynamic programming (DP) is an optimized recursion. In situations where the same inputs
are used multiple times in a recursive solution, DP can be used to improve performance.
Keeping track of past solutions to subproblems means we can avoid recomputation in the
future. By way of this easy optimization, we reduce the time complexity.
One method that can be used to deal with problems involving self-learning is dynamic
programming. It is used in many fields, including operations research, economics, and
automatic control systems. In artificial intelligence, its primary application is in machine
learning, particularly for self-learning in highly uncertain environments.
In their book titled “Reinforcement Learning: An Introduction,” eminent computer sci-
ence academics Richard Sutton and Andrew Barto provided a simple definition for
dynamic programming.
They have defined DP as follows: “Given a Markov decision process representing an
environment, dynamic programming refers to a group of methods that can be utilized to
derive optimal policies for that environment.”
DP assumes that the model of the environment is perfect, that is, all probability parameters
and reward functions are known, so that the problem can be specified as a Markov decision
process (MDP). On this basis, solutions are derived for situations in which the action
spaces and model states, which may also be continuous, are presented. DP identifies the
optimal policies within the correct model framework. The authors argue that while DP is
theoretically feasible, it can be computationally demanding.
DP is also incredibly useful for huge and continuous-space real-time situations. It deliv-
ers insights from complex policy representations via an approximation technique. This
necessitates providing an approximation of the vast number of attainable values in DP via
value or policy iterations. Sampling enormous amounts of data can have a significant
impact on the learning aspect of any algorithm.
The classification of DP algorithms into three subclasses is possible:
1. Value iteration method, which is an iterative method that in general only con-
verges asymptotically to the value function, even if the state space is finite
2. Policy iteration method, which provides a series of stationary policies that are
improved over time. The policy iteration method will have the following steps:
(a) Initialization: Commence with a steady policy.
(b) Policy evaluation: Given the stable policy, determine its cost by solving
the linear system of equations.
(c) Policy improvement: Acquire a new stationary policy that meets a mini-
mum constraint.
Stop if the policy does not change; otherwise, repeat steps 2(b) and 2(c).
3. Policy search methodologies.
minJumps(start, end) = 1 + min{ minJumps(k, end) }, for all k reachable from start
The problem can be seen as a bunch of smaller problems that run into each other. This
problem has both an optimal substructure and sub-problems it overlaps, which are both
features of dynamic programming. The idea is to just store the answers to sub-problems so
we don’t have to figure them out again when we need them later. The amount of time
needed to do this simple optimization goes from being exponential to being polynomial.
Consider another example: solving for the Fibonacci series. The basic recursion
approach and the dynamic programming approach to this problem are as follows
(Figure 11.2):
If we write a simple recursive solution for the Fibonacci numbers, it takes exponentially
long as shown previously. However, if we optimize it by storing the answers to sub-
problems, it takes linear time.
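The contrast can be made concrete in Python; the function names below are illustrative.

```python
def fib_naive(n):
    """Plain recursion: exponential time, since it recomputes the same
    sub-problems over and over."""
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_dp(n):
    """Dynamic programming (bottom-up): linear time, since each sub-problem
    is solved once and its answer carried forward."""
    if n < 2:
        return n
    prev, cur = 0, 1
    for _ in range(2, n + 1):
        prev, cur = cur, prev + cur
    return cur

print(fib_dp(30))  # 832040
```

Both functions return the same values; only the amount of repeated work differs.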
FIGURE 11.2
Recursive versus dynamic programming approach.
Storing the result of each sub-problem’s calculation without knowing whether that value
will ever be used can take up a lot of memory. During execution, an output value is often
saved but never needed by a later sub-problem.
V(St) ← V(St) + α [Gt − V(St)]   (11.7)
where,
• V(St) is the state value, which can be initialized randomly or with a certain strategy
• α is the step-size (learning-rate) parameter
• Gt is the final return, calculated as in Equation (11.8):
Method 1: The first time approach as shown subsequently is followed for each episode,
only the first time that the agent arrives at S counts with the policy.
Method 2: Each time the following approach (the only difference between method 1
and method 2 is step 3.i) is used with policy for each episode, every time that
the agent arrives at S counts.
Method 3: In an incremental approach, for each state St in the episode, there is a reward
Gt. The average value of the state V(St) is calculated by the following formula
for every time St appears (Equations 11.9 and 11.10):
N(St) ← N(St) + 1   (11.9)
V(St) ← V(St) + (1 / N(St)) [Gt − V(St)]   (11.10)
V(St) ← V(St) + α [Gt − V(St)]   (11.11)
By this approach the progress happening in each episode can be easily under-
stood by converting the mean return into an incremental update. Hence the
mean can be updated with each episode easily.
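The incremental update of Equations 11.9 and 11.10 can be sketched as follows; the dictionary-based representation is an illustrative choice.

```python
def incremental_value_update(V, N, state, G):
    """Incremental Monte Carlo update (Equations 11.9 and 11.10):
    N(St) <- N(St) + 1;  V(St) <- V(St) + (G - V(St)) / N(St).
    V maps states to value estimates, N maps states to visit counts."""
    N[state] = N.get(state, 0) + 1
    v = V.get(state, 0.0)
    V[state] = v + (G - v) / N[state]

V, N = {}, {}
for G in [4.0, 6.0, 8.0]:   # returns observed for state 's' across episodes
    incremental_value_update(V, N, 's', G)
print(V['s'])  # 6.0, the running mean of the observed returns
```

After each episode the estimate equals the plain average of the returns seen so far, but no list of past returns needs to be stored.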
π(s) = argmax_{a} Σ_{s′, r} p(s′, r | s, a) [ r + γ v(s′) ]   (11.12)
The policy is improved by making it greedy in relation to the current value function. A
greedy policy model is not required in this instance because of the presence of an action-
value function (Equation 11.13).
π(s) = argmax_{a} q(s, a)   (11.13)
A greedy policy, as given in the preceding equation, will always favour a particular action
even when the majority of actions have not been thoroughly explored. There are two
possible Monte Carlo learning approaches for the situation explained previously.
V(St) ← V(St) + α [Rt+1 + γ V(St+1) − V(St)]   (11.14)
where Rt+1 + γV(St+1) is called the TD target value and Rt+1 + γV(St+1) − V(St) is called
the TD error. While the MC strategy employs the exact return Gt for updating, the TD
approach estimates the value using the Bellman optimality equation and then updates the
estimated value towards the target value.
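A one-step sketch of the TD(0) update; a discount factor γ is included, as is standard.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One-step TD(0) update toward the TD target (a sketch of the update in
    Equation 11.14). V is a dict mapping states to value estimates."""
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]
    V[s] = V[s] + alpha * td_error
    return td_error

V = {'a': 0.0, 'b': 0.5}
td0_update(V, 'a', r=1.0, s_next='b')
print(V['a'])  # ≈ 0.145, i.e. 0.1 * (1.0 + 0.9*0.5 - 0.0)
```

Unlike the Monte Carlo update, this bootstraps from the current estimate V(St+1) and can therefore learn before an episode finishes.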
Different forms of temporal difference approaches like Q-learning, SARSA, and deep
Q-networks are discussed in the following subsections.
11.4.3.1 Q-learning
Q-learning is a type of reinforcement learning which uses action values, or expected
rewards for an action taken in a particular state, called Q-values, with which the behaviour
of the agent is improved iteratively (Figure 11.3). It is also called “model free” since it
handles problems with stochastic transitions and rewards without any adaptations; that is,
it does not require a model of the environment. Q-learning provides an optimal policy for
any finite MDP by maximizing the anticipated value of the total reward across all succes-
sive steps, beginning with the present state. Q(S, A) estimates how well an action A works
at the state S, and will be estimated using the TD update.
FIGURE 11.3
Components of Q-learning.
Q-learning aims to find the next best action given the current state. To maximize the
reward, it may choose the action randomly or apply its own rules, thus deviating from
the given policy; an explicit policy is therefore not mandatory.
There are three important parameters considered in Q-learning: transition function (T),
reward function (R), and value function (V). Q-learning is preferred to obtain an optimal
solution from a series of actions and states following value iteration, since the maximum
values for unknown parameters cannot be identified. In other words, q-values are com-
puted in place of iterations of maximum value. A typical formula is (Equation 11.15):
Qk+1(s, a) = Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ max_{a′} Qk(s′, a′) ]   (11.15)
where Qk+1(s, a) is the iterated function for state s and action a, obtained by summing over
the parameters T and R together with a discount factor γ. The primed symbols denote the
values for the next step, updated all along the process.
The preceding equation is the foundation for Q-learning: iterating the update yields an
optimal policy for the RL problem. For issues with a highly unpredictable environment,
the equation can be adjusted to incorporate RL notions such as exploration and
exploitation.
In a typical MDP the T and R parameters are unknown, and Q-learning is therefore best
suited to applications where T and R are not known.
Example: Advertisement recommendation systems use Q-learning. Normal advertise-
ment recommendations are based on past purchases or websites visited. You’ll get brand
recommendations if you buy a TV. The advertisement suggestion system is optimized by
using Q-learning to recommend frequently bought products. Reward is assigned depend-
ing on the number of user clicks on the product suggested.
Q-learning algorithm: The steps involved in Q-learning are shown in the flowchart in
Figure 11.4 and are explained as follows:
Step 1: Create an initial Q-table where the values of all states and rewards are set
to 0.
Step 2: Select an action and carry it out. Update the table’s values.
Step 3: Obtain the reward’s value and compute the Q-value using the Bellman
equation.
Step 4: Repeat until the table is full or an episode concludes.
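Steps 1–4 can be sketched as tabular Q-learning on a hypothetical five-state corridor (states 0–4, reward 1 for reaching state 4); the environment and all constants are made up for illustration. Because Q-learning is off-policy, the sketch simply behaves randomly in step 2 and still learns the greedy values:

```python
import random

random.seed(0)
N_STATES, GOAL = 5, 4            # states 0..4; state 4 is the terminal goal
alpha, gamma = 0.1, 0.9          # learning rate and discount factor

def env_step(s, a):
    """Move left (a=0) or right (a=1); reward 1 on reaching the goal."""
    s2 = max(0, min(GOAL, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

# Step 1: create a Q-table with all values set to 0.
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(500):
    s = 0
    for _ in range(1000):                 # Step 4: repeat until the episode concludes
        a = random.choice((0, 1))         # Step 2: select an action and carry it out
        s2, r, done = env_step(s, a)
        # Step 3: compute the Q-value using the Bellman (TD) update.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
        if done:
            break

print([round(max(q), 2) for q in Q])   # state values grow toward the goal
```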
FIGURE 11.4
Flow diagram of Q-learning.
FIGURE 11.5
Bellman equation.
The episode continues until the state st+1 is a terminal or final state. The algorithm
may also use nonepisodic tasks for further learning. If the discount factor is less
than 1, the action values remain finite even if the problem contains infinite loops.
For all final states sf, Q(sf, a) is never updated; it is always set to the reward value
r observed for state sf, and it can typically be assumed to be zero.
Figure 11.5 shows the components of the Bellman equation, namely current Q value, learn-
ing rate, reward, discount rate, and maximum expected future reward. The variables in the
Bellman equation that greatly influence the outcome are as follows:
1. Learning rate or step size (α): The factor that decides to what extent newly obtained
information overrides old information is the learning rate or step size. A factor of 0
indicates that the agent learns exclusively from prior knowledge, while a factor of
1 indicates that the agent considers only the most recently acquired information.
Normally, α is set to a value of 0.1.
2. Discount factor (γ): The factor that decides the importance of rewards in the distant
future relative to rewards in the immediate future is the discount factor γ. When
γ = 0, the learning agent will be concerned only about rewards in the immediate
future. In other words, this factor is a way to scale down the rewards as the learn-
ing proceeds.
3. Initial conditions (Q0): The behavior of an agent can be forecast well if the agent
is capable of resetting its initial conditions to known values rather than assuming
an arbitrary random initial condition (AIC).
However, the major drawback is that only discrete action and state spaces are supported
by the typical Q-learning technique (using a Q table). Due to the curse of dimensionality,
discretization of these values results in ineffective learning.
11.4.3.2 State–Action–Reward–State–Action (SARSA)
Another method used by reinforcement learning for learning a MDP policy is state–action–
reward–state–action (SARSA). The original name of this method was “modified connectionist
Q-learning” (MCQ-L); it was later renamed SARSA by Rich Sutton.
As the name indicates, the function for updating the Q-value is determined by the present
state S of the agent. The sequence that occurs is as follows: the agent chooses an action A
and subsequent reward R for choosing A; agent enters the state S′ after taking that action
A, and finally the next action A′ that the agent chooses in its new state (Figure 11.6). This
sequence is represented as a quintuple <st, at, rt, st+1, at+1> and hence the name SARSA.
Table 11.2 gives the differences between SARSA and Q-learning approaches. SARSA is
an on-policy learning algorithm since it updates the policy based on actions taken. The
SARSA algorithm is shown in Figure 11.7.
The policy is updated based on the action taken when the SARSA agent interacts with the
environment. As per the algorithm given previously, first all parameters are initialized to zero.
Then, for each state S in an episode, an action a is chosen using the policy derived from Q
(ɛ-greedy). Execute the chosen action a and observe the reward R and the corresponding new
FIGURE 11.6
SARSA representation.
TABLE 11.2
Differences Between SARSA and Q-Learning

SARSA: Chooses an action following the same current policy and updates its Q-values.
Q-Learning: Chooses the greedy action, that is, the action that gives the maximum Q-value for the state, and follows an optimal policy.

SARSA: Follows the on-policy approach.
Q-Learning: Follows the off-policy approach.

SARSA: The current state–action pair and the next state–action pair are used for calculating the time difference.
Q-Learning: When passing the reward from the next state to the current state, the maximum possible reward of the new state is taken, ignoring the policy used.
FIGURE 11.7
SARSA algorithm.
state s′. Then choose the subsequent action a′ for the new state s′. The Q value for a state–
action pair is updated by an error term, adjusted by the learning rate α, as given in
Equation (11.15).

Q^{new}(s_t, a_t) = Q(s_t, a_t) + \alpha \left[ r_t + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]   (11.15)
where Q values represent the potential reward obtained in the next time step for doing
action a in state s, in addition to the discounted future reward gained from the next state–
action observation.
The parameters α, γ and the initial conditions Q(S0, a0) greatly influence the Q value and
determine the outcomes. SARSA implicitly assumes an initial Q value before the first
update; high (even infinite) initial values, also known as “optimistic initial conditions,”
encourage exploration. The first time an action is taken, the reward is
used to set the value of Q. This allows immediate learning in the case of fixed deterministic
rewards.
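The contrast between the SARSA update above and the Q-learning update can be seen on a single transition; all numbers below are made up for illustration:

```python
# Side-by-side sketch of the two TD updates on one hypothetical transition
# <s_t, a_t, r_t, s_{t+1}, a_{t+1}>.
alpha, gamma = 0.1, 0.9
r_t = 1.0
Q_next = {'left': 0.5, 'right': 2.0}   # Q(s_{t+1}, .) for both actions
a_next = 'left'                        # action actually chosen by the policy
q_sa = 0.0                             # current Q(s_t, a_t)

# SARSA (on-policy): bootstrap from the action the policy actually took.
sarsa = q_sa + alpha * (r_t + gamma * Q_next[a_next] - q_sa)

# Q-learning (off-policy): bootstrap from the greedy (maximum) action.
qlearn = q_sa + alpha * (r_t + gamma * max(Q_next.values()) - q_sa)

print(round(sarsa, 3), round(qlearn, 3))  # → 0.145 0.28
```

Because Q-learning bootstraps from the maximum Q-value, its update here is larger than SARSA’s, which bootstraps from the action the policy actually took.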
Step 1: Initialize the networks. Deep Q-learning uses two neural networks, the
main network and target network, in the learning process. The two net-
works have the same architecture but use different weights. The weights
from the main network are copied to the target network in each of the N
steps. Usage of two networks results in improved stability and effective
learning. The design of two neural networks is as follows (Figure 11.9):
FIGURE 11.8
Q-learning Versus Deep Q-learning.
FIGURE 11.9
Steps in deep Q-learning.
network) calculates the Q-value for state St+1 (the next state). This is done to
stabilize training: abrupt jumps in the Q-value targets are avoided because the
values are copied from the Q-network only at intervals rather than on every
iteration. This process is shown in Figure 11.10.
Step 2: Select a course of action using the epsilon–greedy exploration strategy.
In the epsilon–greedy exploration strategy, the agent selects a random
action with probability epsilon and then exploits the best-known action
with probability 1 – epsilon. Both networks map input states to output
actions, and these outputs represent the model’s predicted Q-values. The
best-known action in a state is the one with the largest predicted Q-value.
Step 3: Using the Bellman equation, revise the network’s weights. The agent per-
forms the chosen action and as per the Bellman equation updates the main
and target networks. Deep Q-learning uses experience replay: transitions
such as the state, action, reward, and next state are saved and played back,
and the reinforcement learning algorithm then learns from these stored
experiences in small batches, which increases the performance level of the
agent.
This learning approach helps avoid a skewed distribution in the data that the neural
network sees. Importantly, the agent does not need to train after each step. The neural
network is updated with the new temporal difference target using the Bellman equation.
The main objective in deep Q-learning is to replicate the temporal difference target
operation using neural networks rather than a Q-table. The following
FIGURE 11.10
DQL process.
equation illustrates how the temporal difference target is calculated using the target net-
work rather than the main network (Equation 11.16).
Q_{target} = R_t + \gamma \max_{a} Q(S_{t+1}, a)   (11.16)
The target network uses experience replay for training, and the Q-network uses it for the
calculation of Q-values. The loss is calculated as the squared difference between the target
Q-value and the predicted Q-value, as per the preceding equation. This is performed only
for the training of the Q-network, just before the parameters are copied to the target
network. The expected DQN loss can be minimized using stochastic gradient descent. If
only the last transition is used, this reduces to standard Q-learning.
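A minimal sketch of this target-network computation, using two tiny linear “networks” held as NumPy weight matrices; the data, sizes, and learning rate are all invented for illustration and are not the book’s implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES, N_ACTIONS = 4, 2
gamma, lr = 0.9, 0.01

# Two networks with identical architecture (here: one linear layer each).
main_W = rng.normal(size=(N_FEATURES, N_ACTIONS))
target_W = main_W.copy()                    # weights copied every N steps

def q_values(W, state):
    return state @ W                        # Q(s, .) for all actions

# One replayed transition (state, action, reward, next_state) -- made-up data.
s = rng.normal(size=N_FEATURES)
a, r = 1, 1.0
s_next = rng.normal(size=N_FEATURES)

# Equation (11.16): the TD target comes from the *target* network.
td_target = r + gamma * q_values(target_W, s_next).max()

# Squared loss between target and the main network's prediction, then one
# gradient-descent step on the main network only.
pred = q_values(main_W, s)[a]
grad = 2 * (pred - td_target) * s           # d/d main_W[:, a] of (pred - target)^2
main_W[:, a] -= lr * grad

new_pred = q_values(main_W, s)[a]
print(abs(new_pred - td_target) < abs(pred - td_target))  # prediction moved toward target
```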
r + \gamma \max_{a'} Q(s', a'; \theta')   (11.17)
FIGURE 11.11
Asynchronous RL Procedure.
where a′ is the next action and s′ is the next state. Apart from this, it is the same
as the previous algorithm.
3. Asynchronous n-step Q-learning: This algorithm follows a “forward view” algo-
rithm by computing n-step return. It uses the exploration policy for each state–
action for a single update and then computes the gradients for n-step Q-learning
updates for each state–action.
4. Asynchronous advantage actor–critic (A3C): This algorithm follows the same
“forward view” approach as the previous approach; however, it differs in terms of
policy.
All these algorithms have been shown to require less training time and to be more stable
in terms of performance and learning rates.
11.6 Summary
• Outlined a brief introduction of reinforcement learning.
• Discussed the various action selection policies.
• Explained in detail the concepts of finite Markov decision processes.
• Described the various problem-solving methods such as dynamic programming,
Monte Carlo methods, and temporal difference learning.
• Outlined three types of temporal difference learning methods, namely Q-learning,
SARSA, and deep Q-networks.
• Gave a brief introduction to asynchronous reinforcement learning.
11.7 Points to Ponder
• Learning through interaction is a foundational idea underlying nearly all theories
of learning and intelligence.
• Reinforcement learning learns how to behave in an environment by carrying out
actions and seeing the effects of those actions.
• During reinforcement, the learning environment will receive the agent’s action as
an input and will output the ensuing state and a reward.
• Maximizing only the immediate reward is ineffective, and always choosing the
highest-valued action can cause the model to become stuck in a local optimum.
• Once a Markov decision process is linked with a policy, the action for each state is
fixed, and the resulting system behaves like a Markov chain.
• At an abstract level, Monte Carlo simulation is a method of estimating the value of
an unknown quantity with the help of inferential statistics.
• Q-learning uses action values or expected rewards for an action taken in a par-
ticular condition, called Q-values, with which the behavior of the agent will be
improved iteratively.
• The asynchronous reinforcement learning approach enables numerous agents to
act in parallel, which helps decorrelate the training data.
E.11 Exercises
E.11.1 Suggested Activities
E.11.1.1 Design three example tasks that fit into the reinforcement learning framework,
identifying for each its states, actions, and rewards. Make the three examples as
different from each other as possible.
E.11.1.2 Model human and animal behaviors as applications of reinforcement learning.
E.11.1.3 What policy would make on-policy and off-policy learning equivalent, specifi-
cally if we consider Q-learning and SARSA learning? In other words, what pol-
icy used by an agent will make the learning based on Q-learning and SARSA
learning the same?
Self-Assessment Questions
E.11.2 Multiple Choice Questions
E.11.2.5 The policy dictates the actions that the reinforcement agent takes as a function
of the agent’s
i State and the environment
ii Action
iii Environment
E.11.2.6 In a Markov decision process, the objective is to develop a good policy, which
is a function Π. The function Π specifies
i The action Π (s) which the decision maker will select when in state s.
ii The action Π (s) is not fixed
iii The reward which the decision maker will select when in state s
E.11.2.8 At its abstract level, _____ is a method of estimating the value of an unknown
quantity with the help of inferential statistics.
i Dynamic programming
ii Monte Carlo method
iii Temporal difference learning
E.11.2.13 Q-learning is a reinforcement learning policy that will find the next best
action, given
i A current state
ii A current state and the reward
iii The next state and reward
E.11.3 Match the Following

E.11.3.1 The variables in the Bellman equation that greatly influence the outcome are
E.11.3.2 Q-learning
E.11.3.3 Asynchronous one-step Q-learning
E.11.3.4 Deep Q-learning
E.11.3.5 SARSA
E.11.3.6 Discount factor
E.11.3.7 Q-values
E.11.3.8 Monte Carlo approach for reinforcement learning
E.11.3.9 Dynamic programming
E.11.3.10 Greedy policy

A Acquires knowledge directly from experience episodes
B A course of action is selected using the epsilon–greedy exploration strategy
C Initial conditions, learning rate, and discount factor
D In chess, choosing a piece for the next move that will result in a higher immediate reward
E Is the factor that decides the importance of rewards in the distant future relative to rewards in the immediate future
F Chooses the greedy action and follows an optimal policy
G Refers to a group of methods that can be utilized to derive optimal policies for an environment, given a Markov decision process representing that environment
H Expected rewards for an action taken in a particular condition
I Each thread analyses the gradient at each step of the Q-learning loss by referencing its copy in the computing environment
J The original name was modified connectionist Q-learning (MCQ-L)
E.11.4 Short Questions
E.11.4.10 Give the Bellman equation and discuss its use in reinforcement learning.
E.11.4.11 Outline the SARSA algorithm for estimating Q-values.
E.11.4.12 Give a detailed note on deep Q-networks.
E.11.4.13 Explain how the Bellman equation is used to update weights in deep
Q-networks.
E.11.4.14 Outline the four asynchronous reinforcement learning methods based on
actor–learner procedure.
12
Machine Learning Applications: Approaches
12.1 Introduction
The first step in building machine learning applications is to understand when to use
machine learning. Machine learning is not an answer to all types of problems. Generally,
machine learning is used when the problem cannot be solved by employing deterministic
rule-based solutions. This may be because it is difficult to identify the rules, because the
rules depend on too many factors, or because the rules overlap. Another situation where you
use machine learning is when handling large-scale problems. In essence, good problems
for machine learning are those that are difficult to solve using traditional programming.
Machine learning needs a different perspective to understanding and thinking about prob-
lems. Now rather than using a mathematical basis such as logic, machine learning makes
us think in terms of statistics and probability.
Machine learning allows us to identify trends and patterns effectively, permits a high
degree of automation once the model is built, enables continuous improvement in accuracy
and efficiency as the algorithms gain experience, and is effective in handling
multi-dimensional, multi-variety, and dynamic data. However, machine learning also
requires massive, unbiased, good-quality data sets for training, needs time and
computational resources, and requires the ability to interpret the results obtained from
the machine learning system.
FIGURE 12.1
Machine learning life cycle.
• Define and understand project objectives: The first step of the life cycle is to iden-
tify an opportunity to tangibly improve operations, increase customer satisfaction,
or otherwise create value. We need to frame a machine learning problem in terms
of what we want to predict and what kind of input data we have to make those pre-
dictions. We need to identify and define the success criteria for the project such as
acceptable parameters for accuracy, precision, and confusion matrix values, as well as
the importance of ethical issues such as transparency, explainability, or bias reduc-
tion while solving the problem. We also need to understand the constraints under
which we operate such as the data storage capacity, whether the prediction needs to
be fast as in the case of autonomous driving, or whether the learning needs to be fast.
• Build the machine learning model: The choice of the machine learning model
depends on the application. A variety of machine learning models are available,
such as supervised models (including classification and regression models),
unsupervised models (including clustering models), and reinforcement learning
models. There is no one solution or one approach that fits all. Some problems are
very specific and require a unique approach such as a recommendation system
while some other problems are very open and need a trial-and-error approach such
as classification and regression. Other factors that help us decide on the model are
the requirements of the application scenario, including the available computational
and training time; the size, quality, and nature of the data, such as linearity,
number of parameters, and number of features; and finally the required accuracy.
• Identify and understand data: Depending on the domain of application and the
goal of machine learning, data come from applications such as manufacturing com-
panies, financial firms, online services and applications, hospitals, social media,
and so on, either in the form of historical databases or as open data sets. The next
step is to collect and prepare all of the relevant data for use in machine learning.
The machine learning process requires large data during the learning stage since
the larger the size of data, the better the predictive power. We need to standardize
the formats across different data sources, normalize or standardize data into the
formatted range, and often enhance and augment data. Sometimes we need to con-
sider anonymizing data. Feature selection is dependent on the most discriminative
dimensions. Finally, we need to split data into training, test, and validation sets.
Sometimes raw data may not reveal all the facts about the targeted label. Feature
• Interpret and communicate: One of the most difficult tasks of machine learn-
ing projects is explaining a model’s outcomes to those without any data science
background, particularly in highly regulated industries such as health care.
Traditionally, machine learning has been thought of as a “black box” because it
is difficult to interpret insights and communicate the value of those insights to
stakeholders and regulatory bodies. The more interpretable your model, the easier
it will be to meet regulatory requirements and communicate its value to manage-
ment and other key stakeholders.
• Deploy the machine learning model: When we are confident that the machine
learning model can work in the real world, it’s time to see how it actually oper-
ates in the real world, also known as “operationalizing” the model. During the
deployment stage of the machine learning life cycle, the machine learning mod-
els are integrated for the application to test whether proper functionality of the
model is achieved after deployment. The models should be deployed in such a
way that they can be used for inference and can be regularly updated.
Model deployment often poses a problem because of the coding and data science
experience it requires and because the time-to-implementation from the beginning
of the cycle using traditional data science methods is prohibitively long.
• Monitor: Monitoring ensures proper operation of the model during the complete
lifespan and involves management, safety, and updating of the application using
the model.
the assumptions are that documents in the same class form a contiguous region of
space and documents from different classes do not overlap, and learning the clas-
sifier is the building of surfaces to delineate classes in the space.
The naïve Bayesian classification, the simplest classification method, treats each
document as a “bag of words.” The generative model makes the following further
assumptions: that words of a document are generated independently of context
given the class label and the naïve Bayes independence assumption that the prob-
ability of a word is independent of its position in the document. Other supervised
classification methods include k-nearest neighbours, decision trees, and support
vector machines, all of which require hand-classified training data. Many com-
mercial systems use a mixture of methods.
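A minimal sketch of such a bag-of-words naïve Bayes classifier, with a made-up toy corpus and add-one (Laplace) smoothing; the categories and vocabulary are purely illustrative:

```python
import math
from collections import Counter

# Hypothetical toy training set: (class label, document) pairs.
train = [
    ("sports", "great goal match win"),
    ("sports", "match team goal"),
    ("politics", "election vote campaign"),
    ("politics", "vote parliament election win"),
]

class_counts = Counter(c for c, _ in train)            # class priors
word_counts = {c: Counter() for c in class_counts}     # per-class word frequencies
for c, doc in train:
    word_counts[c].update(doc.split())
vocab = {w for counts in word_counts.values() for w in counts}

def predict(doc):
    """Score = log P(c) + sum of log P(w|c); word position is ignored
    (bag of words) and add-one smoothing handles unseen words."""
    scores = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(class_counts[c] / len(train))
        for w in doc.split():
            score += math.log((counts[w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("goal match"), predict("vote campaign"))  # → sports politics
```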
• POS tagging: POS tagging assigns each word in a text a morphosyntactic tag,
based on the lexical definition and form of the word and the context in which the
word occurs. POS tagging is an initial step in many NLP tasks including pars-
ing, text classification, information extraction and retrieval, text to speech systems,
and so on. The POS categories are based on morphological properties (affixes they
take) as well as distributional properties (the context in which the word occurs).
In this chapter we will discuss the probabilistic learning based sequential approach—
the hidden Markov model (HMM) approach to POS tagging. The HMM approach can be
trained on an available human annotated corpora like the Penn Treebank. Here we assume
an underlying set of hidden (unobserved, latent) states in which the model can be (e.g.,
parts of speech) and probabilistic transitions between states over time (e.g., transition from
POS to another POS as sequence is generated). We are given a sentence of n words w1…wn
(an “observation” or “sequence of observations”); we need to find the best sequence of n
tags t1…tn that corresponds to this sequence of observations such that P(t1…tn|w1…wn) is
highest. Here we use n-gram models to determine the context in which the word to be
tagged occurs. In POS tagging the preceding N−1 predicted tags along with the unigram
estimate for the current word are used (Figure 12.2).
In HMM tagging the tags correspond to HMM states and words correspond to the
HMM alphabet symbols, in order to find the most likely sequence of tags (states) given the
sequence of words (observations). However, we need tag (state) transition probabilities
p(ti|ti−1) and word likelihood probabilities (symbol emission probabilities) p(wi|ti), which
FIGURE 12.2
Sequences of words and tags.
we can determine using the hand tagged corpus, or if there is no tagged corpus then
through parameter estimation (Baum–Welch algorithm).
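Given the transition probabilities p(ti|ti−1) and emission probabilities p(wi|ti), the most likely tag sequence is found with the Viterbi algorithm. The sketch below uses made-up probabilities for a two-tag model in place of counts estimated from a hand-tagged corpus:

```python
# Minimal Viterbi sketch for a bigram HMM tagger; all probabilities below are
# invented stand-ins for estimates from a tagged corpus such as the Penn Treebank.
tags = ("NOUN", "VERB")
start_p = {"NOUN": 0.6, "VERB": 0.4}                       # p(t1)
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},             # p(t_i | t_{i-1})
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.4, "bark": 0.1, "run": 0.1},  # p(w_i | t_i)
          "VERB": {"dogs": 0.05, "bark": 0.5, "run": 0.4}}

def viterbi(words):
    # best[t] = probability of the best tag sequence ending in tag t
    best = {t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}
    back = []
    for w in words[1:]:
        prev, best, bp = best, {}, {}
        for t in tags:
            # choose the best previous tag for each current tag t
            p, t_prev = max((prev[tp] * trans_p[tp][t], tp) for tp in tags)
            best[t] = p * emit_p[t].get(w, 1e-6)
            bp[t] = t_prev
        back.append(bp)
    # follow backpointers from the best final tag
    seq = [max(best, key=best.get)]
    for bp in reversed(back):
        seq.append(bp[seq[-1]])
    return seq[::-1]

print(viterbi(["dogs", "bark"]))  # → ['NOUN', 'VERB']
```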
based on POS and co-occurrence where frequent nouns or noun phrases are found
and then the opinion words associated with them (obtained from a dictionary;
for example, for positive good, clear, interesting or for negative bad, dull, boring)
and nouns co-occurring with these opinion words, which can be infrequent. One
lexical resource developed for sentiment analysis is SentiWordNet, which attaches
sentiment-related information to WordNet synsets. Each term has a positive,
negative, and objective score summing up to one.
12.6 Recommendation Systems
Recommendation systems are a class of machine learning tasks that deal with ranking
or rating products, users, or entities in general; in other words, they predict the ratings
a user might give to a specific entity. The goal of recommendation systems is to create a
ranked list of items that caters to the user’s interest. Machine learning driven recommendation
systems today drive almost every aspect of our life: two-thirds of the movies watched
online are recommended by Netflix and 35% of items purchased online are recommended by
Amazon, while news recommendations by Google News generate 38% more click-through.
In other words, when the user does not find the product, the product finds the user, with
the appropriate search term inferred implicitly. The value of recommendation systems is that they
can help in upselling, cross selling, and even providing new products. The recommender
problem can be stated as the estimation of a short list of items that fits a user’s interests or
in other words the estimation of a utility function that automatically predicts how a user
will like an item. This estimation is generally based on features such as past behavior, rela-
tions to other users, item similarity, context, and so on.
important aspects are the characteristics of the item pool, such as its size (do we consider all
web pages or some n stories) and the quality or lifetime of the pool (mostly old items vs.
mostly new items). Properties of the context also matter: whether it is pull based, specified by
an explicit user-driven query (e.g., keywords, a form); push based, specified by implicit
context (e.g., page, user); or hybrid push–pull based. Properties of the feedback on the
matches made are also important. These include types and semantics of feedback (such as
click or vote), latency (such as available in 5 minutes versus 1 day), or the volume consid-
ered. Other factors include the constraints specifying legitimate matches such as business
rules, diversity rules, editorial voice, and so on and the available metadata about links
between users, items, and so on.
FIGURE 12.3
Utility matrix.
12.6.3.3 Data Sparsity
Similar to the cold start problem, data sparsity occurs when not enough historical data is
available. Unlike the cold start problem, this is true about the system as a whole and is not
specific to a particular user.
12.6.3.5 Other Issues
One issue is that a push attack can occur, where ratings are pushed up by creating fake
users, so that true recommendation is no longer possible. The more the recommendation
algorithm knows about the customer, the more accurate its recommendations will be, but
recommendation systems should not invade the privacy of users by employing users’ private
information to recommend to others. Another issue is that recommendation systems do
not provide any explanation of why a particular item is recommended.
FIGURE 12.4
Classifications of recommendation systems.
FIGURE 12.5
Collaborative Versus Content-based recommendation.
FIGURE 12.6
Collaborative filtering—memory based: (a) user based, (b) item based.
items (Figure 12.6a). In item based collaborative filtering, future items similar to those that
have received similar ratings previously from the user are likely to receive similar ratings
(Figure 12.6b).
Similarity ( a, i ) = w ( a, i ) , i ∈ K (12.1)
1. Next, we predict the rating that user a will give to specific items using the k neigh-
bors ratings for same or similar items.
2. Now we look for the item j with the best predicted rating.
\mathrm{sim}(U_i, U_j) = \cos(U_i, U_j) = \frac{\langle U_i, U_j \rangle}{\|U_i\| \, \|U_j\|} = \frac{\sum_k r_{i,k} \, r_{j,k}}{\sqrt{\sum_k r_{i,k}^2} \, \sqrt{\sum_k r_{j,k}^2}}   (12.2)
\mathrm{sim}(U_i, U_j) = \frac{\sum_k (r_{i,k} - \bar{r}_i)(r_{j,k} - \bar{r}_j)}{\sqrt{\sum_k (r_{i,k} - \bar{r}_i)^2} \, \sqrt{\sum_k (r_{j,k} - \bar{r}_j)^2}}   (12.3)
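Equations (12.2) and (12.3) can be computed directly on two rating vectors; the ratings below are made up for illustration:

```python
import math

# Two users' ratings over the same five items (hypothetical numbers).
ri = [5, 3, 4, 4, 2]
rj = [4, 2, 5, 3, 3]

def cosine_sim(a, b):
    """Equation (12.2): raw cosine of the two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def pearson_sim(a, b):
    """Equation (12.3): cosine of the mean-centered ratings."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cosine_sim([x - ma for x in a], [y - mb for y in b])

print(round(cosine_sim(ri, rj), 3), round(pearson_sim(ri, rj), 3))
```

Mean-centering makes the Pearson form insensitive to each user’s overall rating scale, which is why it often gives a lower (and more discriminative) score than the raw cosine.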
The next step is to find the predicted rating of user u for item i. Here we use the similar-
ity calculation to predict rating as shown in Figure 12.7.
FIGURE 12.7
Calculation of predicted rating—user based collaborative filtering.
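A common form of the user based prediction sketched in Figure 12.7 adjusts the active user’s mean rating by the similarity-weighted, mean-centered neighbor ratings; the exact formula and all numbers below are an illustrative assumption, not necessarily the precise form in the figure:

```python
# pred(a, i) = mean(a) + sum_u sim(a,u) * (r(u,i) - mean(u)) / sum_u |sim(a,u)|
active_mean = 3.5                        # active user's mean rating (made up)
neighbors = [                            # (similarity to active user, rating of
    (0.9, 4.0, 3.0),                     #  item i, neighbor's mean rating)
    (0.6, 5.0, 4.5),
    (0.3, 2.0, 3.0),
]

num = sum(sim * (r - mean) for sim, r, mean in neighbors)
den = sum(abs(sim) for sim, _, _ in neighbors)
pred = active_mean + num / den

print(round(pred, 3))  # → 4.0
```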
FIGURE 12.8
Example—prediction of rating of user for an item using user based collaborative filtering.
S(i, j) = \mathrm{corr}_{i,j} = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_i)(R_{u,j} - \bar{R}_j)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_i)^2} \, \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_j)^2}}   (12.4)
The next step is to find the predicted rating of user u for item i based on the past ratings
for similar items as shown in Figure 12.9.
FIGURE 12.9
Calculation of predicted rating—item based collaborative filtering.
FIGURE 12.10
Example—prediction of rating of user for an item using item based collaborative filtering.
filtering can help users discover new items by recommending items because similar users
recommended those items.
However, collaborative filtering also has some disadvantages. Collaborative filtering is
based on previous ratings and hence cannot recommend items that have not previously
been rated, or make recommendations when not enough users are present in the system,
which is known as the cold start problem. Sparsity is another issue: if the user–item matrix
is sparse, it is difficult to find users who have rated the same items. In user based
collaborative filtering the similarity between users is dynamic, so precomputing user
neighborhoods can lead to poor predictions; in item based collaborative filtering the
similarity between items is static, which enables precomputing of item–item similarity,
and prediction then requires only a table lookup. Further, these
systems cannot recommend items to users who have unique tastes since they tend to rec-
ommend popular items. Moreover, it is assumed that items are standardized and that prior
choices determine current choices without considering contextual knowledge.
u(x, i) = \cos(x, i) = \frac{x \cdot i}{\|x\| \, \|i\|}   (12.5)
FIGURE 12.11
Example of context in movie domain.
{spouse, sister, child}. While in traditional recommendation we want a list of movies to see,
in context aware recommendation we want a list of recommended movies based on time,
location, companion, and so on. In other words, the recommendation needs to adapt to
user preferences in dynamic contexts.
12.8 Summary
• Explained the approaches to building machine learning applications.
• Discussed the factors that decide the selection of the machine learning algorithm.
• Outlined the application of machine learning for text classification, POS tagging,
text summarization, and sentiment analysis tasks of natural language processing.
• Discussed the underlying concepts of recommendation systems.
• Differentiated between user based and item based collaborative filtering.
• Explained content based recommendation systems.
• Explored the basic concepts of context aware recommendations.
12.9 Points to Ponder
• Machine learning is not an answer to all types of problems.
• Building a machine learning application is an iterative process.
• No one machine learning algorithm fits all types of problems.
• Text classification has many applications.
• Online activity has made sentiment analysis one of the most important applications
of NLP.
• Recommendation systems can be explained using a utility matrix.
• Cold start is a disadvantage of collaborative filtering.
• In content based recommendation we are not able to exploit the judgements of
other users.
• Context aware recommendations add context as features to the user item matrix.
E.12 Exercises
E.12.1 Suggested Activities
Use case
Thinking Exercise
E.12.1.2 Give an everyday example of text classification.
E.12.1.3 Give the use of recommendation systems in the education sector in five differ-
ent scenarios.
E.12.1.4 Can you give an example user item matrix with five context variables in a
finance scenario?
Self-Assessment Questions
E.12.2 Multiple Choice Questions
E.12.2.8 An unsupervised method used to find top ranked sentences for text summari-
zation is
i Text rank
ii SVM
iii K-means
E.12.3 Match the Following

E.12.3.1 Machine learning process requires large data during the learning stage
E.12.3.2 Small number of features but large volume of data
E.12.3.3 Problems that require supervised learning, classification, and regression
E.12.3.4 The basis of logistic regression
E.12.3.5 Confusion matrix values
E.12.3.6 The basis of random forest
E.12.3.7 Finding whether an email is spam or not
E.12.3.8 In recommendation systems there is no user history
E.12.3.9 Uses similarity between the items and the current item to determine whether a user prefers the item
E.12.3.10 Hyper-parameters that affect the learning

A Classification problems
B Parametric
C Text classification
D Learning rate, regularization, epochs
E Larger the size of data, the better the predictive power
F Rule based
G Cold start problem
H Item based filtering
I Generally open approach
J Choose low bias/high variance algorithms
E.12.4 Short Questions
13.1 Introduction
Machine learning methodologies are of great practical value in a variety of application
domains, in realistic situations where it is impractical to manually extract information from
the huge amount of data available, and hence automatic or semi-automatic techniques are
being explored. Machine learning promises to transform our lives and lead us to a better
world, creating ever greater impact for business and society, augmenting human capability,
and helping us go exponentially farther and faster in understanding our world. Due to
improved algorithms, access to growing and massive data sets, ubiquitous network access,
near-infinite storage capacity, and exponential computing power, machine learning is at
the core of today's technical innovations.
There are many ways in which machine learning tasks can be classified; one such typol-
ogy of machine learning applications is discussed next (Figure 13.1):
• Predicting: Machine learning can forecast or model how trends are likely to
develop in the future, thereby enabling systems to predict, recommend, and per-
sonalize responses. One such example is Netflix’s recommendation algorithm,
which analyses users’ viewing histories, stated preferences, and other factors to
suggest new titles that they might like. Data-intensive applications, such as precision
medicine and weather forecasting, stand to benefit from predictive machine learning,
which is one of the most widely used types of machine learning in applications.
• Monitoring: Machine learning can rapidly and accurately analyse large amounts
of data and detect patterns and abnormalities; machine learning is very well
suited for monitoring applications, such as detecting credit card fraud, cyberse-
curity intrusions, early warning signs of illnesses, or important changes in the
environment.
• Discovering: Machine learning, specifically data mining, can extract valuable
insights from large data sets and discover new solutions through simulations. In
particular, because machine learning uses dynamic models that learn and adapt
from data, it is effective at uncovering abstract patterns and revealing novel insights.
• Interpreting: Machine learning can interpret unstructured data—information that
is not easily classifiable, such as images, video, audio, and text. For example, AI
helps smartphone apps interpret voice instructions to schedule meetings, diagnos-
tic software to analyse X-rays to identify aneurysms, and legal software to rapidly
analyse court decisions relevant to a particular case.
DOI: 10.1201/9781003290100-13 317
318 Machine Learning
FIGURE 13.1
Typology of machine learning applications.
This typology along with the array of machine learning tasks such as classification, clustering,
regression, predictive modelling, ranking, problem solving, recognition, forecasting,
sequence analysis, anomaly detection, and association discussed in Chapter 1 will
help us understand the role of machine learning associated with the applications. We will
be discussing various applications in domains such as everyday examples of our daily
life, healthcare, education, business, engineering applications, and smart city applications
(Figure 13.2).
FIGURE 13.2
Domain based machine learning applications.
make a personalized and user-friendly interaction with the phone. Now let us discuss
some everyday examples of machine learning which affect the very way we live.
ask natural language questions, order pizza, hail an Uber, and integrate with smart home
devices.
The important machine learning components needed by personal smart assistants
include the following:
• Speech recognition is needed to convert the voice to text to capture the instructions
given by the user. Interpretation of the text using natural language processing is
required, and response generation using natural language generation is required.
• Since the personal assistant and the user need to communicate, an important com-
ponent is the adaptive dialogue management system.
• To achieve personalization, knowledge acquisition using machine learning sys-
tems is important.
Two important components of a personal assistant are email management and calendar
management components. In email management, pragmatic analysis to determine type of
message and folder, email task processing, and response generation are required. In cal-
endar management, semantic analysis to handle domain specific vocabulary and actions,
pragmatic analysis for handling appointments and creating to-do lists, and task processing
and appropriate response generation are needed.
13.2.2 Product Recommendation
This is perhaps the most popular application of machine learning both for the retailers
and the consumers. Machine learning has become the latest trend in the retail industry,
and retailers have started implementing big data technologies to eliminate the problems
involved in data processing. Machine learning algorithms use these data sets to automate
the analysis process and help retailers achieve their desired growth by exposing customers
to similar products that match their taste while they use an online shopping portal. These
algorithms use collated information such as website search queries, purchasing behavior,
and so on to induce the customer to buy products and services. Product recommendations
will get the best product attributes
and dynamic information and will also improve customer experience through media and
marketing campaigns. Customers can get insights on products they purchased and find
similar ones that will make their experience better. Product recommendations can also
help in understanding which areas can be improved in terms of composition, product per-
formance, scope, function, and more. This will improve overall product features.
Machine learning is widely used by various e-commerce and entertainment companies
such as Amazon and Netflix for product recommendations to the user. Whenever we
search for a product on Amazon, we start getting advertisements for the same product
while surfing the internet in the same browser, and when we use Netflix, we find
recommendations for entertainment series, movies, and so on, all carried out with the
help of machine learning. Google understands user interest using various machine
learning algorithms and suggests products as per customer interest. These algorithms
automatically learn to combine multiple relevance features from past search patterns and
adapt to what is important to the customers. These recommendations for products are
displayed as “customers who viewed this item also viewed” and “customers who bought
this item also bought,” and through personalized recommendations on the home page,
bottom of item pages, and through email.
Domain Based Machine Learning Applications 321
Research has shown that a substantial proportion of sales come from recommendations.
The key to online shopping has been personalization; online retailers increase revenue by
helping you find and buy the products you are interested in. In the future it is likely that
retailers may design your entire online experience individually for you. Some companies
have started offering “personalization as a service” to online businesses while others allow
businesses to run extensive “A/B tests,” where businesses can run multiple versions of
their sites simultaneously to determine which results in the most engaged users.
This particular application has resulted in research in the area of recommendation sys-
tems where machine learning algorithms are designed to give recommendations based on
users and items and handle the cold start problem for first time users.
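As a sketch of how such item-based recommendations can work, the following computes item-to-item similarity over a tiny utility matrix. The users, items, and ratings are all invented for illustration; real systems operate on matrices with millions of users and items.

```python
from math import sqrt

# Hypothetical utility matrix: users are rows, items are columns
# (a missing entry means the user has not rated that item).
ratings = {
    "alice": {"laptop": 5, "mouse": 4, "desk": 1},
    "bob":   {"laptop": 4, "mouse": 5, "lamp": 2},
    "carol": {"desk": 5, "lamp": 4, "mouse": 1},
}
items = ["laptop", "mouse", "desk", "lamp"]

def item_vector(item):
    # The item's column of the utility matrix, in a fixed user order.
    return [ratings[user].get(item, 0) for user in sorted(ratings)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(item):
    # Item-based filtering: rank the other items by cosine similarity
    # of their rating columns and return the closest one.
    scores = [(cosine(item_vector(item), item_vector(other)), other)
              for other in items if other != item]
    return max(scores)[1]

print(most_similar("laptop"))  # → mouse (rated together with laptop by alice and bob)
```

An item similar to one the customer viewed can then be surfaced as a "customers who viewed this item also viewed" suggestion.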
13.2.3 Email Intelligence
Email intelligence concerns the smart sorting, prioritizing, replying, archiving, and
deleting of emails. This is important since in today’s world people spend a considerable
amount of time on emails. One important component of email intelligence is the spam fil-
ter. Simple rules-based filters (e.g., “filter out messages with the words ‘online pharmacy’
and ‘Nigerian prince’ that come from unknown addresses”) aren’t effective against spam,
because spammers can quickly update their messages to work around them. Instead, spam
filters must continuously learn from a variety of signals, such as the words in the message
and message metadata (where it’s sent from, who sent it, etc.). They must further personalize
their results based on your own definition of what constitutes spam—perhaps that daily
deals email that you consider spam is a welcome sight in the inboxes of others. Machine
learning is used to automatically filter mail as important, normal, and spam. Some of the
spam filters used by Gmail are content filter, header filter, general blacklist filter, rules-
based filters, and permission filters. Machine learning algorithms such as multi-layer per-
ceptron, decision tree, and naïve Bayes classifier are used for email spam filtering and
malware detection.
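A minimal sketch of the naïve Bayes idea mentioned above, written from scratch on an invented four-message corpus; real filters train on vastly more data and many more signals than word counts.

```python
from collections import Counter
from math import log

# Tiny invented corpus; real filters learn from millions of labelled messages.
train = [
    ("win cash prize now", "spam"),
    ("cheap pills online pharmacy", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the project team", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())
vocab = {word for counts in word_counts.values() for word in counts}

def classify(text):
    # Naive Bayes: argmax over classes of log P(class) + sum of log P(word|class),
    # with add-one (Laplace) smoothing so unseen words do not zero out the score.
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("win a cash prize"))       # → spam
print(classify("project meeting agenda")) # → ham
```

Because the classifier is retrained on new labelled messages, it adapts as spammers change their wording, which rule lists cannot do.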
Another important part of email intelligence is smart email categorization. Gmail cate-
gorizes your emails into primary, social, and promotion inboxes, as well as labeling emails
as important. Google outlines its machine learning approach and uses manual interven-
tion from users to tune their threshold. Every time you mark an email as important, Gmail
learns. Google designed a next-generation email interface called Smart Reply which uses
machine learning to automatically suggest three different brief (but customized) responses
to answer the email.
Machine learning in this application involves monitoring of emails, interpreting them,
and using classification and problem solving to handle them.
13.2.4 Social Networking
Mobile usage has been increasing day by day, and with it the usage of mobile applications.
Every application has to maintain some unique features to deliver the personalized content
its users like. We live in an information era where brands have access to a vast customer
base, but the challenge lies in customer segmentation, that is, targeting people with the
content they are actually interested in, based on their past activity and demographics.
Brands have started implementing big data and machine learning technologies to target
specific segments. Organizations can now convey that they understand their users and
push the ads, deals, and offers that appeal to them across different channels. Thus mobile
applications
have started implementing machine learning to tailor the application to the needs of every
single user. Adding machine learning to a mobile application helps acquire new customers
while keeping the existing base strong and stable. Some facilities include customized feeds
and content that meets users' requirements, faster and easier searches, and fast and
protected authentication processes.
This would result in improved sales and revenue, driving more customers to your site
based on searches made by general users, purchase patterns, site content, and so
on. For example, machine learning on Instagram considers your likes and the accounts
you follow to determine what posts you are shown on your explore tab. Facebook uses
machine learning to understand conversations better and translate posts from different
languages automatically. Twitter uses machine learning for fraud detection, removing pro-
paganda, and hateful content and to recommend tweets that users might enjoy, based on
what type of tweets they engage with. The machine learning components involved are the
same as for personal smart assistants but must now deal with the peculiarities of different
social networking sites.
13.2.5 Commuting Services
Machine learning has paved a new way in the world of transportation. One important
application is self-driving cars, which are capable of handling the driving, letting the
driver relax or have leisure time. Driverless cars can identify objects, interpret situations,
and make decisions when navigating the roads. Uber uses different machine learning algo-
rithms to make a ride more comfortable and convenient for the customer by deciding the
fare for a ride, enabling ridesharing by matching the destinations of different people, mini-
mizing the waiting time once a ride is booked, and optimizing these services by matching
you with other passengers to minimize detours. Uber has invented a pricing model called
surge pricing, using a machine learning model that recognizes traffic patterns to charge an
appropriate fare to a customer.
Another application is the platooning of autonomous trucks, which monitor each other's
speed, proximity, and the road so they can drive close together and improve efficiency;
platooning can reduce fuel consumption by up to 15% and can also reduce congestion.
GPS technology can provide users with accurate, timely, and detailed information to
improve safety. Here machine learning is used by logistics companies to improve opera-
tional efficiency, analyse road traffic, and optimize routes. If we want to visit a new place,
we take the help of Google Maps, which shows us the correct path with the shortest route
and predicts the traffic conditions, such as whether traffic is clear, slow-moving, or heavily
congested, with the help of the real-time location of vehicles from the Google Maps app
and sensors and the average time taken on past days at the same time. Moreover, it takes
information from users and sends it back to its database to improve performance.
Another interesting application is the use of machine learning to reduce commute times.
A single trip may involve multiple modes of transportation (i.e., driving to a train station,
riding the train to the optimal stop, and then walking or using a ride-share service from
that stop to the final destination), in addition to handling unexpected events such as con-
struction, accidents, road or track maintenance, and weather conditions which can con-
strict traffic flow with little to no prior warning.
Machine learning is used extensively in commuting service applications such as the
following:
• Self-driving cars use object detection and object classification algorithms for
detecting objects, classifying them, and interpreting what they are.
• Uber leverages predictive modelling in real time by taking into consideration traf-
fic patterns to estimate the supply and demand.
• Uber uses machine learning for automatically detecting the number of lanes and
road types behind obstructions on the roads.
• Machine learning also predicts long-term trends with dynamic conditions such
as changes in population count and demographics, local economics, and zoning
policies.
The applications mentioned previously illustrate some activities where machine learning
is used, but the range of applications as well as the sophistication of the services increase
every day.
on. In addition, President Obama’s initiative to create a one million person research cohort
consisting of basic health exam data, clinical data derived from EHRs, healthcare claims,
and so on has also helped research in the area. The diversity of digital health data is indeed
large and includes laboratory tests, imaging data, clinical notes as text, phone records, social
media data, genomics, vital signs, and data from devices, which makes health care a suit-
able candidate for machine learning. Another fact that has helped in integrating applica-
tions of machine learning to health care is the standardization adopted in diagnosis codes
(International Classification of Diseases—ICD-9 and ICD-10), laboratory tests (LOINC
codes), pharmacy (National Drug Codes—NDCs), and medical concepts (Unified Medical
Language System—UMLS), enabling integration of information from different sources.
Major advances in machine learning are another factor that has kindled its application to
health care. These include the ability to learn with high-dimensional features, and more
sophisticated semi-supervised learning, unsupervised learning, and deep learning algo-
rithms. In addition, the democratization of machine learning with the availability of high-
quality open source software such as TensorFlow, Theano, Torch, and Python’s scikit-learn
have all contributed to the widespread adoption of machine learning for health care. The
application of machine learning to health care has interested the industry and many machine-
learning based health care applications are becoming available such as disease identification
and diagnosis, medical imaging and diagnosis, smart health records, drug discovery and
manufacturing, personalized medicine and treatment, and disease outbreak prediction.
conditions, and so on, and genetic data such as all or key parts of the DNA sequence of
an individual. For example, we may have data about the characteristics of previous heart
attack patients including biometrics, clinical history, lab test results, comorbidities (dis-
eases occurring with heart disease), drug prescriptions, and so on. Based on these features,
it is possible to predict if the given patient can be susceptible to a heart attack and can be
advised to take necessary precautions to prevent it. Machine learning can improve the
accuracy of diagnosis and prognosis with earlier and more accurate prediction, and risk
prediction where the variables which are more associated with the risk of suffering a dis-
ease are discovered. New methods are now available for chronic disease risk prediction
and visualization that give clinicians a comprehensive view of their patient population,
risk levels, and risk factors, along with the estimated effects of potential interventions.
Disease diagnosis using machine learning techniques can speed up decision making and
reduce the rate of false positives.
For example, powered by machine learning, applications address the critical needs of
patients and practice administrators, enabling faster and more accurate diagnosis,
individualized treatment, and improved outcomes. Supervised learning is used to predict
cardiovascular diseases using data on characteristics of previous patients, including
biometrics, clinical
history, lab test results, comorbidities, and drug prescriptions obtained from electronic
health records. Clustering has been used to discover disease subtypes and stages. Risk strati-
fication has been used, for example, for the early detection of diabetes. Here machine learn-
ing uses readily available administrative and clinical data to find surrogates for risk factors
and then performs risk stratification at the population level with millions of patients.
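The supervised risk prediction described above can be sketched with a toy logistic regression trained by gradient descent. The two features, all patient values, and the model itself are simplifications invented for illustration; real clinical models draw on far richer EHR data.

```python
from math import exp

# Invented toy data: (systolic BP / 100, cholesterol / 100) per patient,
# label 1 = suffered a heart attack, 0 = did not. Real models use many
# more features (clinical history, lab tests, comorbidities, prescriptions).
X = [(1.1, 1.5), (1.2, 1.6), (1.6, 2.4), (1.7, 2.6), (1.0, 1.4), (1.8, 2.8)]
y = [0, 0, 1, 1, 0, 1]

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

# Fit logistic regression weights with plain stochastic gradient descent.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(2000):
    for (x1, x2), target in zip(X, y):
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)
        err = p - target  # gradient of the log loss with respect to the logit
        w[0] -= lr * err * x1
        w[1] -= lr * err * x2
        b -= lr * err

def risk(x1, x2):
    # Predicted probability that a new patient suffers a heart attack.
    return sigmoid(w[0] * x1 + w[1] * x2 + b)

print(round(risk(1.75, 2.7), 2))   # high-risk profile, probability near 1
print(round(risk(1.05, 1.45), 2))  # low-risk profile, probability near 0
```

The same fitted model can score every patient in a population, which is the essence of the risk stratification discussed above.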
used to enhance the management of health information and health information exchange.
The goal is to facilitate access to clinical data, modernize the workflow, and improve the
accuracy of health information. Smart charts were developed that are utilized to identify
and extract health data from various medical records to aggregate a patient’s medical his-
tory into one digital profile.
them to the most appropriate doctor in an available medical network. Another important
area is providing personalized medications and care, enabling the best treatment plans
according to patient data, reducing cost, and increasing effectiveness of care, where for
example machine learning is used to match the patients with the treatments that prove the
most effective for them. Another area is patient data analytics, which analyse patient and/
or third-party data to discover insights and suggest actions. This provides an opportunity
to reduce cost of care, use resources efficiently, and manage population health easily. An
example is where all the relevant health care data at a member level is displayed on a
dashboard to understand risk and cost, provide tailored programs, and improve patient
engagement.
In addition, machine learning can also be used for predicting the occurrence of a disease
such as a heart attack from data about the characteristics of previous patients, as described
earlier, so that a susceptible patient can be advised to take necessary precautions. By
comparing massive amounts of data, including individual patient health data, with
population health data, prescriptive analytics can determine which treatments will work
best for each patient and deliver on the promise of precision medicine. Machine learning
can analyse data from next-gen sequencing, genome editing, chemical genomics, and
combinational drug screening to find the most appropriate patients to treat with novel
therapeutics and to develop precision therapeutics for cancer and rare diseases.
13.4.2 Increasing Efficiency
Machine learning makes the work of teachers and students easier, making them more
comfortable with education. It also increases involvement and enthusiasm for participation
and learning, thus increasing the efficiency of education. Educators are then free to focus
on tasks that cannot be achieved by AI and that require a human touch. Machine learning
has the potential to make educators more efficient by completing tasks such as classroom
management, scheduling, and so on. The tasks delegated to machine learning will be
performed automatically and almost instantly. Machine learning also enables better
organization and management of content and curriculum. It helps to divide the work
appropriately and understand the potential of everyone, and to analyse what work is best
suited for the teacher and what works for the student.
Machine learning in education can track learner progress and adjust courses to respond
to students’ actual needs, thus increasing engagement and delivering high-quality train-
ing. Feedback from machine learning algorithms allows instructors to understand learn-
ers’ potential and interests, identify struggling students, spot skill gaps, and provide extra
support to help students overcome learning challenges. Machine learning can determine
whether learners are interacting with the course materials, how much time students spend
on each section, whether they get stuck or just skim the content, and how long they take to
complete a test.
The Georgia Institute of Technology has implemented an automated teaching assistant
named Jill, powered by IBM’s Watson cognitive computing platform, to help respond to
student inquiries for an online course. Jill can analyse and answer student questions, such
as where to find course materials, and could help improve retention rates for online
courses, which are generally low because students have trouble getting the information
they need from professors.
13.4.3 Learning Analytics
Machine learning in the form of learning analytics can help teachers gain insight into data
that cannot be gleaned otherwise. Machine learning can analyse content to be provided to
students for adaptive learning.
With learning analytics, the teacher can gain insight into data and can perform deep
dives into content, interpret it, and then make connections and conclusions. This can posi-
tively impact the teaching and learning process. Apart from this, the learning analytics
suggests paths the student should take. Students can gain benefits by receiving sugges-
tions concerning materials and other learning methodologies.
Use of machine learning can help to provide all kinds of reports (attendance, academic
performance, engagement, certification tracking, trainer/teacher approval, etc.),
measuring both quality and quantity of educational materials available, analyzing input
data (number of logins, time spent on the platform, students’ background, emails, requests,
etc.), and visualizing the information flow to determine existing issues and miscommuni-
cation sources.
13.4.4 Predictive Analytics
Machine learning in the form of predictive analytics can draw conclusions about things
that may happen in the future. For instance, using a data set of middle school students’
cumulative records, predictive analytics can tell us which ones are more likely to drop
out because of academic failure, or even their predicted scores on a standardized exam.
Predictive analytics in education is all about knowing the mindset and needs of the stu-
dents. It helps to make conclusions about things that might happen in the future. From
class tests and half-yearly results, it can be understood which students are going to perform
well in the exam and which will have a tough time. This helps faculty and parents get alerts
and take appropriate measures. Through this, students can be helped in a better way and
can work on their weak subjects.
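As a toy illustration of such dropout prediction, even a one-feature "decision stump" can learn a score cutoff from past records; the feature, the records, and the model are all hypothetical simplifications of what a real early-warning system would use.

```python
# Invented records: (average class-test score, dropped_out). A real system
# would use full cumulative records and a richer model than a single cutoff.
students = [(35, 1), (42, 1), (48, 1), (55, 0), (63, 0), (71, 0), (80, 0), (38, 1)]

def best_threshold(data):
    # Decision stump: pick the score cutoff that minimizes misclassifications
    # when we predict "will drop out" for every student below the cutoff.
    candidates = sorted(score for score, _ in data)
    best_t, best_errors = None, len(data) + 1
    for t in candidates:
        errors = sum((score < t) != bool(label) for score, label in data)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

t = best_threshold(students)
print(t, [score for score, _ in students if score < t])  # learned cutoff and flagged students
```

Students falling below the learned cutoff would be the ones for whom faculty and parents receive alerts.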
13.4.5 Evaluating Assessments
Students often complain about human biases in assessments. Educators, in turn, point to
the need for more precise and fairer grading systems. Automated test scoring has been
around for a while, but incorporating machine learning in education enables smart assess-
ments that can instantly evaluate multiple formats, including written assignments such as
papers, essays, and presentations. Innovative grading tools can evaluate style, structure,
and language fluency, analyze narrative depth, and detect plagiarism. Machine learning
turns assessment into a matter of a few seconds, ensures accurate measurement of students’
academic abilities, and eliminates the chance of human error. Machine learning can be used
to grade student assignments and exams more accurately than a human can, though some
human input is still required; machine grading tends to have higher validity and reliability
because there is less chance of error.
Education software company Turnitin has developed a tool called Revision Assistant that
uses machine learning to evaluate students’ writing while they draft essays to provide feed-
back. Revision Assistant evaluates four traits in student writing—language, focus, organiza-
tion, and evidence—and can detect the use of imprecise language or poor organizational
structure to provide both positive feedback and recommendations for improvement. Many
high school and college students are familiar with services like Turnitin’s popular machine
learning based tool used by instructors to analyze students’ writing for plagiarism. Machine
learning can help detect the plagiarizing of source code by analyzing a variety of stylistic
factors that could be unique to each programmer, such as average length of line of code,
how much each line was indented, how frequent code comments were, and so on.
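The stylistic factors just listed can be computed directly from source text. The following is a simplified sketch with an invented snippet; real stylometry systems combine many more features before comparing authors.

```python
def style_features(source):
    # Three of the stylistic signals mentioned above, computed per snippet:
    # average line length, average indentation, and comment frequency.
    lines = [line for line in source.splitlines() if line.strip()]
    avg_length = sum(len(line) for line in lines) / len(lines)
    avg_indent = sum(len(line) - len(line.lstrip()) for line in lines) / len(lines)
    comment_rate = sum(line.lstrip().startswith("#") for line in lines) / len(lines)
    return avg_length, avg_indent, comment_rate

snippet = """\
def add(a, b):
    # add two numbers
    return a + b
"""
print(style_features(snippet))  # (avg line length, avg indent, comment rate)
```

Feature vectors like this one, extracted from two submissions, can then be compared to flag suspiciously similar coding styles.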
the data iteratively and finds different types of hidden insights without being explicitly
programmed to do so. Machine learning in business helps in enhancing business scal-
ability and improving business operations for companies across the globe. Factors such
as growing volumes, easy availability of data, cheaper and faster computational process-
ing, and affordable data storage have contributed to the use of machine learning in busi-
ness. Organizations can now benefit by understanding how businesses can use machine
learning and implement the same in their own processes. Experts are of the opinion that
machine learning enables businesses to perform tasks on a scale and scope previously
impossible to achieve, speeding up the pace of work, reducing errors, and improving
accuracy, thereby aiding employees and customers alike. Moreover, innovation-oriented
organizations are finding ways to harness machine learning not just to drive efficiencies
and improvements but to fuel new business opportunities that can differentiate their com-
panies in the marketplace.
13.5.1 Customer Service
Using machine learning for gathering and analyzing social, historical, and behavioral data
enables brands to gain a much more accurate understanding of their customers. Machine
learning is used for continuously learning and improving from the data it analyses and is
able to anticipate customer behavior. This allows brands to provide highly relevant con-
tent, increase sales opportunities, and improve the customer journey.
Machine learning is often used to evaluate customer-service interactions and predict
how satisfied a customer is, prompting an intervention if a customer is at risk of leaving a
business. Machine learning analyses data such as customer-support ticket text, wait times,
and the number of replies it takes to resolve a ticket to estimate customer satisfaction, and
it adjusts its predictive models in real time as it learns from more data about tickets and
customer ratings. Machine learning also helps identify customer trends, improve customer
interactions, and extract valuable insights. Real-time data helps brands visually analyse
and engage with users on a personal level, deliver good support and service, and build
stronger customer relationships. Machine learning is able
to provide digitally connected customers 24/7 online support and help them with their
queries. With predictive analytics and accurate automated phone bots, customers can get
smart solutions.
Machine learning based approaches allow customers to have a personalized experience
by giving customers the right messages at the right time. Machine learning based chat bots
provide customers fast, efficient, and friendly service. Chat bots improve the scope for
customers to get the right information as they need it and get consistently better at analyzing
that information. Customer support can be a game-changer as it needs to be responsive,
consistent, and focused. Chat bots can solve basic queries, reduce touchpoints,
streamline interactions, and help with complex issues. Virtual assistants help customers
navigate the process and engage them in conversations. AI agents can reduce the hassle of
reaching customers online with natural language processing, machine learning, and voice
assistant help. Customer engagement often centers on digital content, but machine learn-
ing enables the combining of natural language processing to gain better insights into each
individual customer experience.
Machine learning based translation uses a combination of crowdsourcing and machine
learning to translate businesses’ customer-service operations such as web pages, customer
service emails and chats, and social-media posts into 14 different languages at a substan-
tially faster rate and reduced cost, making it easier for businesses to reach international
audiences. The translated material is then reviewed by human translators for better cus-
tomer experience.
Machine learning analyses purchase decisions of customers and makes appropriate rec-
ommendations to build targeted marketing campaigns that build customer interest.
Machine learning understands purchase patterns and performs predictive and prescrip-
tive analysis that will drive engagement and improve opportunities for upselling and
cross-selling. Machine learning based tools can make tasks like data cleaning, combining, and rearranging quicker and less expensive. Machine learning can thus influence customer experience by providing voice-enabled customer services like Amazon Echo and Alexa, useful customer insights, streamlined customer interactions, data-backed customer and marketing strategies, automated assistance, and personalized content.
13.5.2 Financial Management
Financial monitoring is a typical security use case for machine learning in finance. Machine
learning algorithms can be used to enhance network security significantly, to detect any
suspicious account behavior such as a large number of micropayments, and flag such
money laundering techniques. One of the most successful applications of machine learning
is credit card fraud detection. Banks are generally equipped with monitoring systems that
are trained on historical payments data. Algorithm training, validation, and back testing
are based on vast data sets of credit card transaction data. Machine learning classification
algorithms can easily label events as fraud versus non-fraud to stop fraudulent transactions
in real time. Machine learning reduces the number of false rejections, which in turn helps
improve the precision of real-time approvals. These models are generally built based on
customer behavior on the internet and transaction history.
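To make this concrete, a fraud detector of this kind is typically a supervised binary classifier trained on labelled historical transactions. The sketch below is a minimal, hypothetical illustration (the features, data, and threshold are invented, and it assumes scikit-learn is available), showing the real-time approve/block decision the passage describes:

```python
# Minimal sketch of fraud classification on labelled transactions.
# All feature names and values are hypothetical.
from sklearn.linear_model import LogisticRegression

# Each row: [amount, seconds_since_last_txn, foreign_merchant (0/1)]
X_train = [
    [25.0, 86400, 0],   # routine domestic purchases: non-fraud
    [12.5, 43200, 0],
    [40.0, 72000, 0],
    [900.0, 30, 1],     # large, rapid, foreign transactions: fraud
    [650.0, 10, 1],
    [700.0, 15, 1],
]
y_train = [0, 0, 0, 1, 1, 1]  # 1 = fraud, 0 = non-fraud

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Real-time scoring: approve or block an incoming transaction.
incoming = [[800.0, 20, 1]]
decision = "block" if model.predict(incoming)[0] == 1 else "approve"
```

In production such a model would be trained on millions of transactions and evaluated on its false-rejection rate as well, since blocking legitimate payments is itself costly, as the passage notes.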
Machine learning enables advanced market insights that allow the creation of automated investment advisors, which identify specific market changes much earlier than traditional investment models. These advisors can also apply traditional data
processing techniques to create financial portfolios and solutions such as trading, invest-
ments, retirement plans, and so on for their users.
Finance companies use machine learning to reduce repetitive tasks through intelligent
process automation. Chat bots, paperwork automation, and employee training gamifica-
tion are examples of machine learning based process automation in the finance industry,
which in turn improves customer experience, reduces costs, and scales up their services. In
addition, machine learning uses the vast amount of data available to interpret behaviors
and enhance customer support systems and solve the customers’ unique queries.
Machine learning is also used in budget management to offer customers highly specialized and targeted financial advice and guidance. Moreover, machine learning allows customers to track their daily spending and identify their spending patterns and areas where they can save. The massive volume and structural diversity of financial data, from mobile communications and social media activity to transactional details and market data, make it a big challenge even for financial specialists to process manually. Machine learning techniques such as data analytics, data mining, and natural language processing make this data manageable, bring in process efficiency, and extract real intelligence from it for better business productivity. The structured and unstructured data such as customer requests, social media
interactions, and various business processes internal to the company can be analyzed to
discover trends, assess risk, and help customers make informed decisions accurately.
Domain Based Machine Learning Applications 333
Machine learning can also improve the level of customer service in the finance industry, where, using intelligent chat bots, customers can get queries resolved about their monthly expenses, loan eligibility, affordable insurance plans, and much
more. Further, use of machine learning in a payment system can analyze accounts and let
customers save and grow their money, analyze user behavior, and develop customized
offers.
Using machine learning techniques, banks and financial institutions can significantly lower risk levels by analyzing a massive volume of data sources, including not just credit scores but also large amounts of personal information for risk assessment. Credit card companies can use machine learning to predict at-risk customers
and to specifically retain good ones. Based on user demographic data and credit card trans-
action activity, user behavior can be predicted and used to design offers specifically for
these customers. Banking and financial services organizations can be provided with action-
able intelligence by machine learning to help them make subsequent decisions. An exam-
ple of this could be machine learning programs tapping into different data sources for
customers applying for loans and assigning risk scores to them. Machine learning uses a
predictive, binary classification model to find out the customers at risk and a recommender
model to determine best-suited card offers that can help to retain these customers. By ana-
lyzing data such as the mobile app usage, web activity, and responses to previous ad cam-
paigns, machine learning algorithms can help to create a robust marketing strategy for
finance companies. Unlike traditional methods, which are usually limited to essential information such as credit scores, machine learning can analyze significant volumes of personal information to reduce risk.
Machine learning in trading is another example of an effective use case in the finance industry. Algorithmic trading (AT) has, in fact, become a dominant force in global financial markets. Machine learning models allow trading companies to make better trading decisions by closely monitoring trade results and news in real time to detect patterns that signal whether stock prices will go up or down. Algorithmic trading provides increased accuracy and reduced chances of mistakes, allows trades to be executed at the best possible prices, and enables the automatic and simultaneous checking of multiple market conditions.
13.5.3 Marketing
Machine learning enables real-time decision making, that is, the ability to make a decision
based on the most recent data that is available, such as data from the current interaction
with near-zero latency. Real-time decision making can be used for more effective marketing
to customers. One example of real-time decision making is to identify customers that are
using ad blockers and provide them with alternative UI components that can continue to
engage them. Another is personalized recommendations, which are used to present more
relevant content to the customer. By using machine learning and real-time decisioning to recognize and understand a customer's intent through the data that they produce in real time, brands are able to present hyper-personalized, relevant content and offers to customers.
Machine learning analyses large amounts of data in a very short amount of time and
uses predictive analytics to produce real-time, actionable insights that guide the next inter-
actions between a customer and a brand. This is often referred to as predictive engage-
ment, enabled by machine learning that provides knowledge of when and how to interact
with each customer. Machine learning, by providing insights into historical data, can pre-
scribe actions to be taken to facilitate a sale through suggestions for related products and
accessories, making the customer experience more relevant to generate a sale, as well as
providing the customer with a greater sense of emotional connection with a brand.
13.5.4 Consumer Convenience
Machine learning can be used to facilitate consumer convenience. Machine learning can
analyse user-submitted photos of restaurants to identify a restaurant’s characteristics, such
as the style of food and ambiance, and improve its search results and provide better rec-
ommendations. Machine learning can analyse image-based social media sites and identify
items in pictures that users like and find out where they can be bought online.
The Google Photos app uses machine learning to recognize the contents of users’ pho-
tographs and their metadata to automatically sort images by their contents and create
albums. The Google Photos algorithms can identify pictures taken in the same environ-
ment, recognize and tag landmarks, identify people who appear in multiple photo-
graphs, and generate maps of a trip based on the timestamps and geotags of a series of
photographs.
13.6.1 Manufacturing
Machine learning is the core of what are called smart factories. Machine learning helps
in harnessing useful data from previously unused data, unlocking insights that were too
time-consuming to analyse in the past. Machine learning enables efficient supply chain
communication, keeping delays to a minimum as real-time updates and requests are
instantly available.
Manufacturers have been successful in incorporating machine learning into three aspects of the business—operations, production, and post-production. Manufacturers are always
keen to adopt technology that improves product quality, reduces time-to-market, and is
scalable across their units. Machine learning is helping manufacturers fine-tune product
quality and optimize operation. Manufacturers aim to overcome inconsistency in equip-
ment performance and predict maintenance by applying machine learning to flag defects
and quality issues before products ship to customers, improve efficiency on the production
line, and increase yields by optimizing use of the manufacturing resources.
Machine learning can automatically generate industrial designs based on a designer’s
specific criteria, such as function, cost, and material, and generate multiple alternative
designs that meet the same criteria and provide designers with performance data for each
design, alter designs in real time based on designer feedback, and export finalized designs
into formats used for fabrication.
By continuously monitoring data (power plant, manufacturing unit operations) and
providing them to machine learning based decision support systems, manufacturers can
predict the probability of failure. Machine learning based predictive maintenance is an
emerging field in industrial applications that helps in determining the condition of in-
service equipment to estimate the optimum time of maintenance and saves cost and time
on routine or preventive maintenance. Apart from industrial applications, predicting
mechanical failure is also beneficial for industries like the airline industry. Airlines need to
be extremely efficient in operations, and delays of even a few minutes can result in heavy
penalties. Machine learning based analytics can help manufacturers with the prediction of
calibration and test results to reduce the testing time while in production. Early prediction
from process parameters, descriptive analytics for root-cause analysis, and component fail-
ures prediction can avoid unscheduled machine downtimes.
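A minimal sketch of the condition-monitoring idea behind predictive maintenance: watch a sensor stream, maintain a rolling health indicator, and raise a maintenance flag before failure rather than on a fixed service schedule. The readings, window size, and threshold below are invented for illustration:

```python
# Sketch of condition-based maintenance: flag equipment for service
# when a rolling average of vibration readings drifts above a threshold.
# The threshold and readings are hypothetical.
from collections import deque

def maintenance_alerts(readings, window=3, threshold=7.0):
    """Return the indices at which maintenance alerts fire."""
    recent = deque(maxlen=window)   # sliding window of latest readings
    alerts = []
    for i, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window and sum(recent) / window > threshold:
            alerts.append(i)
    return alerts

# Vibration amplitude drifting upward as a bearing wears out.
stream = [5.1, 5.0, 5.3, 5.9, 6.8, 7.6, 8.2, 8.9]
alerts = maintenance_alerts(stream)
# The first alert fires once the rolling mean exceeds 7.0,
# before the equipment actually fails.
```

A production system would replace the fixed threshold with a learned model of normal behaviour, but the shape of the decision, a continuously updated health indicator compared against a trigger condition, is the same.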
The biggest use case for machine learning in 3D printing and additive manufacturing lies in enhancing design and improving the efficiency of the processes. Key challenges
that machine learning will help overcome for additive manufacturing will revolve around
improving prefabrication printability checks, reducing the complexity involved in the process, and reducing the talent threshold for manufacturing industries. For defect detection, machine learning could be included in 3D modelling programs, providing tools that find defects that would make a model nonprintable. Real-time control using machine learning could considerably reduce time and
material waste. Machine learning could provide a solution for one of the biggest chal-
lenges in additive manufacturing, that is, the need for greater precision and reproducibility
in 3D-printed parts, by quickly correcting computer-aided design models and producing
parts with improved geometric accuracy, ensuring that the printed parts conform more
closely to the design and remain within necessary tolerances.
13.6.3 Environment Engineering
Weather forecasting applies predictive machine learning to meteorological and satellite data. Climate change vulnerability and risk assessment in agriculture, coastal, forest and
biodiversity, energy, and water sectors at a cadastral level is an important application of
machine learning. Aerosol emission inventory and forecasting is carried out by machine
learning using satellite and meteorological data. Machine learning based analytics is used
to determine sea level rise estimates in real time with high resolution imagery.
Machine learning was used to analyze satellite imagery of forests over time to detect
early warning signs of illegal logging. The system can flag changes such as new roads
indicating a new logging operation, as well as learn to identify changes that occur before
major cutting to improve its warning system.
Machine learning systems can analyze data about pollution levels to forecast changes to
air quality hours in advance. Machine learning is used in the development of an autono-
mous recycling system that uses a robotic arm and an array of sensors to identify recycla-
ble items in waste and separate them for recycling, removing the need for manual sorting.
Machine learning analyses trash on a conveyer belt using 3D scanning, spectrometer anal-
ysis, and other methods to determine what a piece of trash is made of, and if it is recyclable,
a robotic arm will pick it up and move it to a separate container.
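The air-quality forecasting mentioned above can be reduced to a minimal autoregressive sketch: predict the next reading from the previous one with a least-squares fit. The pollutant series and units below are invented for illustration, and real systems would use many more inputs (weather, traffic, seasonality):

```python
# Sketch: forecasting the next air-quality reading from the previous one
# with a hand-rolled AR(1) least-squares fit. The series is synthetic.

def fit_ar1(series):
    """Fit x[t] = a * x[t-1] + b by ordinary least squares."""
    xs = series[:-1]          # predictors: previous readings
    ys = series[1:]           # targets: next readings
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var             # slope of the autoregression
    b = mean_y - a * mean_x   # intercept
    return a, b

# Hourly PM2.5-style readings trending upward.
pm = [20.0, 22.0, 24.0, 26.0, 28.0, 30.0]
a, b = fit_ar1(pm)
next_hour = a * pm[-1] + b    # forecast for the next hour
```

For this perfectly linear series the fit recovers the trend exactly and forecasts 32.0; noisy real data would of course give an approximate fit.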
Networked acoustic sensors and machine learning can improve efforts to save threat-
ened species of birds. Networked acoustic sensors collect data about bird calls, and a
machine-learning algorithm parses through this audio to identify bird calls of endangered
birds and flag when and where they were recorded. This data can help better map the
bird’s habitat, estimate its population, and guide conservation efforts.
• Physical category data including location, presence, proximity, time, and date
• Environment category including temperature, noise level, humidity, pressure, and
air quality
• User centric category including identity, preferences, history, and social interactions
• Resource category including connectivity, bandwidth, nearby networks, and
device characteristics
Using and integrating such context aware data, machine learning can provide a wide range of smart city applications.
The shared characteristic of all these smart city applications is the availability of huge
amounts of different types of data from physical and nonphysical systems at different
granularity levels. Machine learning needs to integrate the right data from the right source, at the right level, and at the right time to monitor, analyse, classify, and predict, enabling decision making and, in some cases, the actuation of physical systems.
13.8 Summary
In this module, domain based applications of machine learning were discussed, spanning customer service, finance, marketing, consumer convenience, manufacturing, environment engineering, and smart cities.
13.9 Points to Ponder
• Why do you think machine learning applications are suddenly gaining momen-
tum? Discuss.
• Can you think of other areas in which machine learning would be useful such as
electoral campaigns, real estate, insurance, and so on?
E.13 Exercises
E.13.1 Suggested Activities
E.13.1.1 By searching the web, find at least five applications of machine learning each
in health care and business not mentioned in the chapter and give a suitable
objective, features to be used, machine learning task required, and evaluation
parameters.
Healthcare
S.No  Application  Objective  Features Used  Machine Learning Task  Evaluation Parameter
1.
2.
3.
4.
5.
Business
1.
2.
3.
4.
5.
E.13.1.2 Fill the following table with a statement (one or two sentences) on an example
to fit the particular machine learning task of the typology discussed in this
chapter (do not give examples already mentioned in the chapter):
1. Predicting
2. Monitoring
3. Discovering
4. Interpreting
5. Interacting with the physical environment
6. Interacting with people
7. Interacting with machines
E.13.1.3 Project
1. Design a hospital administration system that uses machine learning to
provide efficient and effective services to the management, doctors, and
patients. Clearly outline all stakeholders involved and services provided.
Implement the system clearly, showing the effect of using machine learning.
2. Design a smart city application with any 10 components that clearly show-
case the integration aspects between different sources and types of data.
Implement the system clearly, showing the effect of using machine learning.
Self-Assessment Questions
E.13.2 Multiple Choice Questions
E.13.2.5 Using machine learning to closely monitor trade results and news in real time to detect patterns that signal whether stock prices will go up or down is called
i Intelligent pricing
ii Algorithmic trading
iii Predictive trading
E.13.2.6 Predictive machine learning uses meteorological and satellite data for
i Water quality testing
ii Waste separation
iii Weather forecasting
E.13.2.7 Machine learning can improve efforts to save threatened species of birds using
i Bird calls
ii Networked acoustic sensors
iii Satellite data
E.13.3 Questions
E.13.3.1 Outline some more applications of machine learning in engineering applica-
tions not discussed in the chapter.
E.13.3.2 Assume you are an edutech company that wants to attract universities to use
your system. Outline how you would sell the product, indicating the intelli-
gent components you would provide including facilities to upload legacy data
seamlessly.
14
Ethical Aspects of Machine Learning
14.1 Introduction
AI and machine learning are permeating the world in applications such as the shared economy, driverless cars, personalized health, and improved robots, and in fact in all situations where we need to learn, predict, prescribe, and make truly data-driven decisions. Harnessed appropriately, AI and machine learning can deliver great benefits and
support decision making which is fairer, safer, and more inclusive and informed. However,
for machine learning to be effective and acceptable, great care and conscious effort are
needed, including confronting and minimizing the risks and misuse of machine learning.
Therefore, the amount of influence that AI and machine learning will eventually have is determined by the choices people make while embracing them, since their potentially disruptive impact has raised concerns on many fronts. These fears (Figure 14.1) include
negative economic and social impact like loss of jobs, loss of privacy and security, potential
biases in decision making, human uninterpretable solutions, and failures and lack of con-
trol over automated systems and robots.
While these fears are significant, they can be addressed with the right planning, design, and governance. Programmers and businesses need to choose which techniques to apply, and for what, within the boundaries set by governments and cultural acceptance; there must be an efficient process allowing anyone to challenge the use or output of an algorithm; and the entity responsible for designing machine learning algorithms should be identifiable and accountable for the impacts of those algorithms, even if the impacts are unintended.
FIGURE 14.1
Concerns related to AI and machine learning.
In today’s scenario some of the core principles of AI and machine learning include the
generation of benefits for people that are greater than the costs, implementation in ways
that minimize any negative outcomes, and compliance with all relevant national and inter-
national government obligations, regulations, and laws.
• AI should be socially beneficial: with the likely benefit to people and society sub-
stantially exceeding the foreseeable risks and downsides
• AI should not create or reinforce unfair bias: avoiding unjust impacts on people,
particularly those related to sensitive characteristics such as race, ethnicity, gender,
nationality, income, sexual orientation, ability, and political or religious belief
• AI should be built and tested for safety: designed to be appropriately cautious
and in accordance with best practices in AI safety research, including testing in
constrained environments and appropriate monitoring
• AI should be accountable to people: providing appropriate opportunities for
feedback, relevant explanations, and appeal, and subject to appropriate human
direction and control
• AI should incorporate privacy design principles: encouraging architectures with
privacy safeguards, and providing appropriate transparency and control over the
use of data
• AI development should uphold high standards of scientific excellence:
Technological innovation is rooted in the scientific method, and a commitment to
open inquiry, intellectual rigor, integrity, and collaboration should be encouraged.
• AI should be made available for uses that accord with these principles: We will
work to limit potentially harmful or abusive applications.
These principles specify that scientists need to guard against bias, and that systems should be transparent so that others can look for bias and undo any harm. From the
viewpoint of ethics, the principles specify that AI systems should embrace the concepts
of fairness, safety, accountability, and maintenance of privacy. In the next section we will
discuss the concepts of ethics.
All these examples show how digital information has been used in an unethical manner.
Machine learning based ethical cases in the near future could be associated with autono-
mous cars, autonomous weapons (will there be meaningful human control?), the Internet
of Things (IoT), personalized medicine (use of genomic information), social credit systems
(is it based on just credit score?), and so on. Moreover, learning from historical data used for human decision making may lead to traditional prejudices or biases being propagated and deeply hidden within the machine learning models.
In the following section we discuss some important opinions of ethics in AI and machine
learning.
“By 2018, 50% of business ethics violations will occur through improper use of
big data analytics.” —Gartner’s 2016 Hype Cycle for Emerging Technologies,
STAMFORD, Conn., August 16, 2016
“Nearly 50% of the surveyed developers believe that the humans creating AI should
be responsible for considering the ramifications of the technology. Not the bosses.
Not the middle managers. The coders.” —Mark Wilson, Fast Company on Stack
Overflow’s Developer Survey Results, 2018 (Fast Company, 2018).
“If machines engage in human communities as autonomous agents, then those agents
will be expected to follow the community’s social and moral norms. A necessary
step in enabling machines to do so is to identify these norms. But whose norms?”
—The IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems
“IBM supports transparency and data governance policies that will ensure people
understand how an AI system came to a given conclusion or recommendation.
Companies must be able to explain what went into their algorithm’s recommen-
dations. If they can’t, then their systems shouldn’t be on the market.” —Data
Responsibility at IBM, 2017 (IBM, 2017).
“By progressing new ethical frameworks for AI and thinking critically about the qual-
ity of our datasets and how humans perceive and work with AI, we can accelerate
the [AI] field in a way that will benefit everyone. IBM believes that AI actually
holds the keys to mitigating bias out of AI systems—and offers an unprecedented
opportunity to shed light on the existing biases we hold as humans.” —Bias in AI:
How We Build Fair AI Systems and Less-Biased Humans, 2018 (IBM, 2018).
FIGURE 14.2
Concepts of ethical machine learning.
Fairness in machine learning is about ensuring that biases in the data and learnt model do not result in systems that treat individuals unfavourably on the basis of characteristics such as race, gender, disabilities, and sexual or political orientation. Therefore, fairness in machine learning designs algorithms that make fair predictions regardless of such inherent or acquired characteristics.
Explainability is the ability to explain to all stakeholders the various aspects of the
machine learning system, including data, algorithm, and output of the machine learning
system. On the other hand, interpretability is the ability to give and present such explanations in terms that are understandable to humans. Transparency is another important aspect
of ethical machine learning where the reasoning behind the decision making process is
discoverable and understandable. Associated with transparency is accountability, which
ensures that responsibility is fixed for the decisions taken by machine learning systems
and such decisions are explainable and controllable. Further, in case of any harm due to the
decisions, there is a fixed legal responsibility. Privacy protection is also an important com-
ponent of ethical machine learning where private data is protected and kept confidential
and there is a system in place to prevent dangerous data breaches.
We need approaches to tackle the ethical issues associated with machine learning. One approach is strict national or international regulation. We also need to decide which ethical positions can be handled technically and legally, and what the ethical position of the organization is.
Biased machine learning based face recognition has led to calls for regulation of its use after consistently higher error rates were found for darker-skinned and female faces. Bias appears in face recognition systems because of old algorithms, reliance on facial features such as skin color, racially biased data sets, and the behaviour of deep learning classifiers. This bias results in inefficiency of video surveillance systems in public city areas, increased privacy concerns, lower accuracy for African American and Asian males and females, where innocent Black suspects come under police scrutiny, and finally a major lag in mass implementation and acceptance of the technology. This bias can be dealt with by making the training data sets more diverse, using additional operations for detection of faces, and setting more sensitive parameters of the classifier.
an option to change them; robustness of the machine learning systems against manipula-
tion; and timely reaction to ethically compromised input.
14.7.2 Types of Bias
As we have seen, bias is the root cause of not catering to fairness principles. The two major
categories of bias and their differences are shown in Table 14.1.
TABLE 14.1
Data Bias and Algorithmic Bias

Basis
• Data bias: based on the type of data that is used for building machine learning models.
• Algorithmic bias: associated with feature or model selection.

Definition
• Data bias: the systemic distortion in the data that compromises its representativeness.
• Algorithmic bias: computer systems that systematically and unfairly discriminate against certain individuals or groups of individuals in favor of others, which in turn leads to discrimination and unfairness.

Stage of the machine learning system
• Data bias: directly associated with data sampling and has to be dealt with by the designer of the machine learning system; partly dependent on the type of machine learning task undertaken. For example, gender discrimination in general is illegal, while gender specific medical diagnosis is desirable.
• Algorithmic bias: unintentionally introduced during the development and testing stage; results in a vicious cycle of bias. For example, consider a company with only 8% women employees. A hiring algorithm trained only on current data, based on current employee success, scores women candidates lower, and so the company again ends up hiring fewer women.

Causes
• Data bias: label bias, where the attribute that is observed sometimes wrongly becomes data (for instance, arrests, not crimes); subgroup validity, where the predictive power of features can vary across subgroups; representativeness, where the training data is not representative of the population due to skewed samples, sample size disparity, or limitation of selected features; and statistical patterns that apply to the majority but might be invalid within a minority group.
• Algorithmic bias: systems use data without our knowledge; can be based on incorrect or misleading knowledge about us; are not accountable to individual citizens; and are built by specialists who do what they are told without asking questions.

The distortion in data can be defined along five data properties, namely population biases, behavioral biases, content production biases, linking biases, and temporal biases, as shown in Figure 14.3. Population bias is due to differences in demographics or other user characteristics between a user population represented in a data set and the actual target population. There is a need to tackle population biases, address issues due to oversampling from minority groups, and perform data augmentation by synthesizing data for minority groups. Behavioral bias is due to differences in user behavior across platforms or contexts, or across users represented in different data sets. Lexical, syntactic, semantic, and structural differences in the contents generated by users can cause content production bias. Linking bias is due to differences in the attributes of networks obtained from user connections, interactions, or activity, while temporal bias is due to differences in populations and behaviors over time, where different demographics can exhibit different growth rates across and within social platforms.
FIGURE 14.3
Distortions of data based on data properties.
FIGURE 14.4
Data bias and the data analysis pipeline.
As Figure 14.4 illustrates, bias can be introduced during data collection due to API limits, query formulation, and removal of data that is wrongly considered irrelevant. Bias can also be introduced at the data
processing stage due to cleaning, enrichment, or aggregation. Data analysis itself can add
on to the bias due to lack of generalizability during qualitative analyses, confusing mea-
surements while carrying out descriptive analyses, and improper data representation
while predicting. Even evaluation can introduce bias due to improper metrics and inter-
pretation of results.
14.8 Fairness Testing
It is important to understand how one could go about determining the extent to which the
model is biased and hence unfair. One of the most common approaches is to determine
the relative significance or importance of input values (related to features) on the model’s
prediction or output. Determining the relative significance of input values would help
ascertain the fact that the models are not overly dependent on the protected attributes or
bias-related features such as race, gender, color, religion, national origin, marital status,
sexual orientation, education background, source of income, and so on. Other techniques
include auditing data analysis, ML modeling pipeline, and so on.
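The relative-significance check described above can be sketched as a simple probe: flip the protected attribute and measure how much the model's output moves. A large shift signals that the model leans on that attribute. The linear scorer, features, and weights below are hypothetical stand-ins for a trained model:

```python
# Sketch of a fairness probe: how much do a model's scores change when
# only the protected attribute is flipped? The linear "model" here is
# a hypothetical stand-in for any trained scorer.

def score(applicant, weights):
    return sum(w * x for w, x in zip(weights, applicant))

# Features: [income, years_employed, gender (0/1, protected attribute)]
applicants = [
    [50.0, 5.0, 0],
    [60.0, 3.0, 1],
    [45.0, 8.0, 0],
]

def attribute_sensitivity(applicants, weights, attr_index):
    """Mean absolute score change when the protected attribute is flipped."""
    total = 0.0
    for a in applicants:
        flipped = list(a)
        flipped[attr_index] = 1 - flipped[attr_index]
        total += abs(score(a, weights) - score(flipped, weights))
    return total / len(applicants)

fair_weights = [0.4, 1.0, 0.0]      # ignores gender entirely
biased_weights = [0.4, 1.0, -5.0]   # penalizes gender = 1

low = attribute_sensitivity(applicants, fair_weights, attr_index=2)
high = attribute_sensitivity(applicants, biased_weights, attr_index=2)
# low is 0.0, high is 5.0: the second model depends on the protected attribute.
```

Real fairness audits use more sophisticated versions of this idea, such as permutation importance or SHAP values, but the principle is the same: the model's output should be insensitive to protected attributes.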
A fairness-aware ranking approach should satisfy two requirements:

• Ensure that disproportionate advantage is not given to any specific attribute value, since this could cause disadvantage to other attribute values
• Guarantee that there is a minimum representation for each attribute value; this property is the more important one for ensuring fairness
The fairness-aware re-ranking algorithm for fair recruitment starts with partitioning the
set of potential candidates into different buckets for each attribute value. Then the candi-
dates in each bucket are ranked according to the scores assigned by the machine-learned
model used. Then the ranked lists are merged so that the representation requirements are
balanced and the highest scored candidates are selected. The algorithm was validated, and it was found that over 95% of all searches were gender representative compared to the qualified population of the search.
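A minimal sketch of such a fairness-aware re-ranking follows, under the assumption that the representation requirement is a desired proportion per attribute value. The greedy merging rule and all names are illustrative, not the validated algorithm described in the text:

```python
def fair_rerank(candidates, desired_proportion, k):
    """Fairness-aware re-ranking sketch.
    candidates: list of (id, attribute_value, score)
    desired_proportion: {attribute_value: target fraction of the top k}
    Greedily fills the top-k list so that each attribute value keeps
    pace with its desired proportion."""
    # Partition candidates into per-attribute buckets, each ranked by score
    buckets = {}
    for cid, attr, score in candidates:
        buckets.setdefault(attr, []).append((cid, attr, score))
    for attr in buckets:
        buckets[attr].sort(key=lambda c: c[2], reverse=True)

    ranked = []
    counts = {attr: 0 for attr in buckets}
    while len(ranked) < k and any(buckets.values()):
        # Pick the attribute value furthest behind its desired
        # representation; break ties by the best remaining score
        def deficit(attr):
            return desired_proportion.get(attr, 0) * (len(ranked) + 1) - counts[attr]
        attr = max((a for a in buckets if buckets[a]),
                   key=lambda a: (deficit(a), buckets[a][0][2]))
        ranked.append(buckets[attr].pop(0))
        counts[attr] += 1
    return ranked

# Illustrative run: two attribute values with equal desired representation.
# The merged list alternates groups: a1, b1, a2, b2
top4 = fair_rerank(
    [("a1", "F", 0.9), ("a2", "F", 0.8), ("a3", "F", 0.7),
     ("b1", "M", 0.85), ("b2", "M", 0.6)],
    desired_proportion={"F": 0.5, "M": 0.5}, k=4)
```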
14.10 Explainability
In general, people who cannot explain their reasoning cannot be trusted. The same applies
even more to machine learning systems, especially when crucial decision-making processes must be made explainable in terms that people can understand. This explainability is
key for users interacting with machine learning systems specifically to understand the
system’s conclusions and recommendations. As a starting point the users must always be
aware that they are interacting with the machine learning system.
Explainability is needed to build public confidence in disruptive technology, to promote
safer practices, and to facilitate broader societal adoption. Explanations given should be
actionable, and there should be a balance between explanations and model secrecy. The
explanations should be given for failure modes also. There need to be explanations across
the machine learning cycle, and they should be focused on both the model developer as
well as the end user. In the case of health care, they need to explain why x was classified at
risk for colon cancer, or in the case of finance, why y was denied a mortgage loan. There are
some application-specific challenges in giving explanations: for example, contextual explanations are necessary for conversational systems, and in some cases there needs to be a
gradation of explanations. Even when in certain cases the users cannot be provided with
the full decision-making process, the level of transparency should be clearly defined.
Explainable machine learning systems bring in transparency. This transparency results in
improvement of the machine learning system itself by discovering mismatched objectives,
multi-objective trade-offs, and learning about the causality associated with the system.
Definition: Explainable machine learning is defined as systems with the ability to
explain their rationale for decisions, characterize the strengths and weaknesses of their
decision-making process, and convey an understanding of how they will behave in the
future. An open issue is who the explanations are for: advanced mathematicians or engi-
neers, or employees and customers? Much of the machine learning employed today auto-
mates more traditional statistical methods, which are more easily explained than neural
net based decisions used for image recognition or self-driving cars. There are basically two
ways in which explanations can be generated, namely model specific techniques, which
essentially deal with the inner workings of the algorithm or the model to interpret the
results, and model agnostic techniques, which deal with analyzing the features and their
relationships with the output.
There are two notable efforts to create explanations, namely DARPA-XAI and Local
Interpretable Model- Agnostic Explanations (LIME). The Defense Advanced Research
Projects Agency (DARPA) launched the Explainable Artificial Intelligence (XAI) program
to identify approaches that will give the systems the ability to give explanations. LIME is a
technique developed at the University of Washington that helps explain predictions in an
interpretable and faithful manner.
Let us consider as an example the classification problem of machine learning. As shown in Figure 14.5, current systems use the features learnt by the machine learning model to classify a previously unseen image as an elephant. However, an explainable machine
FIGURE 14.5
Example—explainable machine learning.
354 Machine Learning
learning system needs to classify the image as an elephant and in addition indicate the
reasons for doing so, in this case that it has a long trunk, wide flat ears, and is generally
gray or grayish brown in color, and where possible also show the parts of the image based
on which the classification was carried out. However, explanations are not always easy,
because for examples in which the learnt features are not present, the system is unable to give
an explanation.
FIGURE 14.6
Questions to black box machine learning system.
FIGURE 14.7
The flow diagram of an explainable system.
driven streetscape, over which it then highlights areas to which the algorithm gave the
most weight during navigation.
• To incorporate explainability into the system without affecting the user experience
or sacrificing the effectiveness of the task at hand
• To decide the components that need to be hidden from the users due to security
concerns and the method to be used to explain this issue to the users
• To decide the components of the decision-making process that can be expressed to
the user in an easily palatable and explainable manner
There are basically two approaches to achieve explainability. The first one is to explain a
given machine learning model in a post hoc fashion. In this approach, individual predic-
tion explanation can be given in terms of input features, influential examples, concepts,
local decision rules, and so on. On the other hand, global prediction explanations can be
given for the entire model in terms of partial dependence plots, global feature importance,
global decision rules, and so on. In the second approach, the model used itself is interpre-
table. Such models include logistic regression, decision trees, decision lists and sets, and
generalized additive models (GAMs).
14.10.2.1 Attribution Methods
One post hoc individual method to give an explanation is the attribution approach. Here
we assign the model prediction on an input to the features of the input. Such examples
include attributing the prediction of an object recognition network to its pixels, a text
sentiment network to individual words, or a lending model to its features. This useful
approach is a minimal formulation of the reason for the prediction. Applications of attributions include debugging model predictions, like attributing an image misclassification to
the pixels responsible for it, analyzing model robustness by for example crafting adver-
sarial examples using weaknesses brought out by attributions, and so on.
There are different attribution methods such as ablations, gradient based methods, score
backpropagation based methods, and Shapley value based methods. Ablation methods
essentially drop each feature and attribute the corresponding change in prediction to that
feature. The disadvantages of ablation methods are that they give rise to unrealistic inputs, account improperly for interacting features, and are computationally expensive, especially if the number of features is large. Gradient based methods attribute to a feature the product of the feature value and the gradient, where the gradient captures the sensitivity of the output with respect to that feature. The score backpropagation based methods used for neural network models
redistribute the prediction score through the neurons in the network to explain the effect of
the weights.
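The ablation idea can be sketched as follows; the toy linear model and the zero baseline are illustrative assumptions, not from the text:

```python
def ablation_attributions(model, x, baseline):
    """Ablation sketch: attribute to each feature the drop in the model's
    prediction when that feature is replaced by a baseline value.
    `model` is any function mapping a feature list to a score."""
    base_pred = model(x)
    attributions = []
    for i in range(len(x)):
        ablated = list(x)
        ablated[i] = baseline[i]          # "drop" feature i
        attributions.append(base_pred - model(ablated))
    return attributions

# Toy linear model: prediction = 2*x0 + 3*x1 (weights are illustrative)
model = lambda x: 2 * x[0] + 3 * x[1]
attr = ablation_attributions(model, [1.0, 1.0], [0.0, 0.0])
```

For this additive toy model the attributions recover each feature's weighted contribution; for models with interacting features, ablating one feature at a time accounts for those interactions improperly, as noted above.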
The Shapley value is a concept in game theory used to fairly determine the contribution, in terms of both gains and costs, of each player in a coalition or cooperative game. The Shapley value applies primarily in situations where the contributions of the players are unequal but each player works in cooperation with the others to obtain the gain
or payoff. Similarly, a prediction can be explained by assuming that each feature value of
the instance is a “player” in a game where the prediction is the pay-out, where Shapley
values tell us how to fairly distribute the total gain among the features based on their con-
tribution. In other words, the Shapley value for a feature is a specific weighted aggregation
of its marginal over all possible subsets of other features. In the explainability context,
players are the features in the input, gain is the model prediction (output), and feature
attributions are the Shapley values of this game. However, Shapley values require the gain
to be defined for all subsets of features.
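An exact (exponential-time) computation of Shapley values over all feature subsets might look like the following sketch. The toy additive game is an illustrative assumption, chosen so that each feature's Shapley value equals its own contribution:

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n):
    """Exact Shapley values for an n-player game.
    `value(S)` returns the gain for a subset S (frozenset of players);
    in the explainability setting, players are features and the gain is
    the model prediction with only those features present."""
    players = range(n)
    phi = [0.0] * n
    for i in players:
        others = [p for p in players if p != i]
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                S = frozenset(subset)
                # Weight of this subset in the Shapley aggregation
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                # Marginal contribution of player i to subset S
                phi[i] += weight * (value(S | {i}) - value(S))
    return phi

# Toy game: the gain is the sum of the present features' values
feature_value = [2.0, 3.0]
v = lambda S: sum(feature_value[i] for i in S)
phi = shapley_values(v, 2)
```

Because the game must be evaluated on every subset of features, exact computation is feasible only for small n; practical tools approximate it by sampling.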
The attribution methods can be evaluated by having humans review attributions and/
or comparing them to (human provided) ground truth on “feature importance.” This eval-
uation helps in assessing if the attributions are human-intelligible and in addition increases
trust in the system; however, the attributions may appear incorrect because the model
reasons in a different manner. Another method of evaluation of attributions is the pertur-
bation method, where the perturbation of the top-k features by attribution is carried out
and the change in prediction is observed. If the change in prediction is higher, it indicates
that the method is good. In the axiomatic justification method, desirable criteria (axioms) for an attribution method are listed, and a uniqueness result must be established showing that this is the only method satisfying these criteria. However, attribution does not
explain feature interactions, the training examples that influence the prediction (training
agnostic), or the global properties of the model.
some data instances may not be well represented by the set of prototypes. Another method
is determining the influence functions by tracing a model’s prediction through the learning
algorithm and discovering the training points that are responsible for a given prediction.
Global prediction explanation methods include a partial dependence plot showing the
marginal effect that one or two features have on the predicted outcome of a machine learn-
ing model and the permutations method, where the importance of a feature is the increase
in the prediction error of the model after we permuted the feature’s values, which breaks
the relationship between the feature and the true outcome.
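The permutations method described above can be sketched as follows, with an illustrative toy model and data set (all names and values are assumptions). Because the toy model predicts perfectly from feature 0 and ignores feature 1, permuting the irrelevant feature yields zero importance:

```python
import random

def permutation_importance(model, X, y, feature, error_fn, seed=0):
    """Permutation-method sketch: the importance of a feature is the
    increase in prediction error after shuffling that feature's values,
    which breaks the relationship between the feature and the outcome."""
    base_error = error_fn([model(row) for row in X], y)
    column = [row[feature] for row in X]
    random.Random(seed).shuffle(column)
    X_perm = [list(row) for row in X]
    for row, value in zip(X_perm, column):
        row[feature] = value
    return error_fn([model(row) for row in X_perm], y) - base_error

# Mean squared error as the error function
mse = lambda preds, ys: sum((p - t) ** 2 for p, t in zip(preds, ys)) / len(ys)

# Toy model that depends only on feature 0 (data are illustrative)
model = lambda row: row[0]
X = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
y = [0.0, 1.0, 2.0, 3.0]
irrelevant = permutation_importance(model, X, y, feature=1, error_fn=mse)
relevant = permutation_importance(model, X, y, feature=0, error_fn=mse)
```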
In addition, machine learning models can provide responsible metadata about data sets,
fact sheets, and model cards. Datasheets for data sets (Gebru et al., 2018) proposes that every data set, model, or pretrained API be accompanied by a datasheet that documents its creation, intended uses, limitations, maintenance, legal and ethical considerations, and other features. Another important form of metadata is the factsheet, which focuses on the final service provided: the intended use of the service output, the algorithms or techniques the service implements, the data sets the service was tested on, the details of the testing methodology, and
the test results.
Responsible metadata also includes model cards which provide basic information about
the model including owner, date, version, type, algorithms, parameters, fairness con-
straints, resources used, license, and so on; the intended users of the system; and factors
such as demographic groups, environmental conditions, technical attributes, and so on.
Additional data include metrics such as model performance measures and decision thresh-
olds, and evaluation data such as the data sets used for quantitative analysis and ethical
considerations. These model cards help the developer to investigate the model and pro-
vide information for others to understand the system.
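The fields listed above can be collected as structured metadata. A minimal model-card sketch follows; the field names track the text, and every value is purely illustrative:

```python
# Minimal model-card sketch; all values are illustrative placeholders
model_card = {
    "model_details": {
        "owner": "example-team",
        "date": "2023-01-01",
        "version": "1.0",
        "type": "logistic regression",
        "license": "Apache-2.0",
    },
    "intended_users": ["loan officers"],
    "factors": {
        "demographic_groups": ["age band", "gender"],
        "environmental_conditions": [],
        "technical_attributes": [],
    },
    "metrics": {
        "performance_measures": {"accuracy": 0.91},
        "decision_thresholds": {"approve": 0.5},
    },
    "evaluation_data": {"datasets": ["holdout-2022"]},
    "ethical_considerations": ["protected attributes excluded from features"],
}
```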
14.11 Transparency
Let us first understand the subtle difference between two related terms explainability and
transparency in the context of machine learning. As already discussed, explainability is
associated with the concept of explanation, as being an interface between humans and
the machine learning based decision maker. On the other hand, a transparent machine
learning model is understandable on its own. Designers and implementers of machine
learning systems must make the affected stakeholders understand the cause and manner
in which a model performed in a particular way in a specific context. Increasing transpar-
ency can benefit different stakeholders: the end users can better understand the reason
why certain results are being generated; developers can more easily debug, tune, and opti-
mize machine learning models; and project managers can better comprehend the technical
details of the project.
Most machine learning systems are based on discovering patterns, and unusual changes
in the patterns make the system vulnerable; hence we need transparency. When the
machine learning systems make critical decisions, such as detecting cancer, allocating
loans, and so on, understanding the algorithmic reasoning becomes crucial. In some cases,
transparency is necessary for legal reasons.
14.12 Privacy
First of all, let us discuss the right to privacy, stated by the United Nations Universal
Declaration of Human Rights as “No one shall be subjected to arbitrary interference with
[their] privacy, family, home or correspondence, nor to attacks upon [their] honor and rep-
utation.” It is also described as “the right of a person to be free from intrusion or publicity
concerning matters of a personal nature” by Merriam-Webster and “The right not to have
one’s personal matters disclosed or publicized; the right to be left alone” by Nolo’s Plain-
English Law Dictionary.
Threats to privacy are posed by machine learning systems through design and develop-
ment processes and as a result of deployment. The structuring and processing of data is the
core of any machine learning system; such systems will frequently involve the utilization
of personal data. This data is sometimes captured and extracted without gaining the proper
consent of the data subject or is handled in a way that reveals personal information. On the
deployment side, most machine learning systems target, profile, or nudge data subjects
without their knowledge or consent, infringing upon their ability to lead a private life.
The question now arises that, while we get the value and convenience of machine learn-
ing systems that operate by collecting, linking, and analyzing data, can we at the same time
avoid the harms that can occur due to such data about us being collected, linked, analyzed,
and propagated? There is a need to define reasonable, agreeable rules that govern such trade-offs, knowing that people generally have different privacy boundaries.
14.12.1 Data Privacy
Data privacy is the right to have some control over how your personal information is
collected and used. Privacy is perhaps the most significant consumer protection issue or
even citizen protection issue in the global information economy. Data privacy is defined
as the use and governance of personal data and is different from data security, which is
defined as the protection of data from malicious attacks and the exploitation of stolen data
for profit. In other words, security is the confidentiality, integrity, and availability of data,
while privacy is the appropriate use of the information of the user. Although security is
necessary, it is not sufficient for addressing privacy.
14.12.2 Privacy Attacks
Attackers are able to infer whether a particular member performed an action, for example
clicking on an article or an ad, and are also able to use auxiliary knowledge such as knowl-
edge of attributes associated with the target member (say, obtained from this member’s
LinkedIn profile) or knowledge of all other members that performed similar actions (say,
by creating fake accounts). Therefore, rigorous techniques are needed to preserve member
privacy.
Some of the possible privacy attacks can be classified as follows:
Targeting: The attacker matches the target by targeting LinkedIn members over a minimum targeting threshold. For example, an attacker could carry out identification through search logs: in one case, AOL Research published anonymized search logs of 650,000 users, and an attacker was able to recover the identity of a particular person through the AOL records of that person’s web searches.
14.12.3 Privacy-Preserving Techniques
In order to avoid data privacy violation, given a data set with sensitive personal informa-
tion, we need to compute and release functions of the data set appropriately while protect-
ing individual privacy. Privacy can be maintained by privacy-preserving model training that is robust against adversarial membership inference attacks. In addition, for highly sensitive data, model training and analytics have to be carried out using secure enclaves, homomorphic encryption, federated or on-device learning, and privacy-preserving mechanisms designed for data marketplaces.
Anonymizing data is therefore important, where techniques are used to maintain the
statistical characteristics of the data while at the same time reducing the risk of revealing
personal data. Techniques used to protect highly sensitive data, such as an individual
user’s medical condition in a medical research data set, or being able to track an individual
user’s locations in an advertising data set, include data anonymization such as masking or
replacing sensitive customer information in a data set, generalizing data through bucket-
ing a distinguishing value such as age or satisfaction scores into less distinct ranges, and
perturbing data, where random noise is inserted into data such as dates to prevent joining
with another data set.
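The three techniques above — masking, generalization through bucketing, and perturbation — can be sketched as follows; the identifier format, bucket width, and noise scale are illustrative assumptions:

```python
import random

def mask(value, keep_last=2):
    """Masking: hide all but the last few characters of an identifier."""
    return "*" * (len(value) - keep_last) + value[-keep_last:]

def bucket_age(age, width=10):
    """Generalization: bucket a distinguishing value (e.g. age) into a
    less distinct range."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def perturb(value, scale=3, rng=random.Random(0)):
    """Perturbation: insert random noise into a value such as a date
    offset or count, to hinder joins with other data sets."""
    return value + rng.randint(-scale, scale)
```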
More advanced techniques include differential privacy, which offers formal guarantees
that no individual user’s data will affect the output of the overall model. Differential pri-
vacy involves publicly sharing information about a data set by describing the patterns of
groups within the data set while withholding information about individuals in the data
set. A combination of machine learning and differential privacy can be used to further
enhance individual user privacy. These methods also increase flexibility with the ability to
generate new data sets of any size, and they offer formal mathematical guarantees around
privacy. The objective is to create both synthetic and differentially private synthetic data
sets of the same size and distribution as the original data.
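A common differential-privacy building block is the Laplace mechanism. The following sketch — an illustrative primitive, not one described in the text — adds calibrated noise to a count query:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random.Random(0)):
    """Laplace-mechanism sketch: release a query answer with noise
    calibrated so that no single individual's presence or absence
    noticeably changes the output distribution. `sensitivity` is the
    most one individual can change the true answer; a smaller
    `epsilon` means stronger privacy and more noise."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5
    # Inverse-transform sample from Laplace(0, scale)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise

# Releasing a count query (sensitivity 1) under a privacy budget of 0.5
noisy_count = laplace_mechanism(42, sensitivity=1, epsilon=0.5)
```

Averaged over many releases the noise cancels out, which is why such mechanisms preserve group-level patterns while protecting individuals.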
Federated learning can also be used in a privacy-preserving environment. Federated
learning is decentralized learning that works with a network of devices capable of training
themselves without a centralized server. In a machine learning context, the machine learn-
ing algorithm is implemented in a decentralized collaborative learning manner wherein
the algorithm is executed on multiple local data sets stored at isolated data sources
(i.e., local nodes) such as smartphones, tablets, PCs, and wearable devices without the need
for collecting and processing the training data at a centralized data server. Only the results of the training (i.e., the parameters) are exchanged, at a certain frequency. The natural
advantage of federated learning is the ability to ensure data privacy because personal data
is stored and processed locally, and only model parameters are exchanged. In addition, the
processes of parameter updates and aggregation between local nodes and a central coordination server are strengthened by differential privacy-based and cryptographic techniques, which enhance data security and privacy.
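One round of such decentralized training can be sketched with federated averaging on a toy one-parameter linear model; the data, learning rate, and node setup are all illustrative assumptions:

```python
def local_update(weights, data, lr=0.1):
    """One local training pass: gradient steps for a one-parameter
    linear model y ≈ w*x on a node's private data (never shared)."""
    w = weights
    for x, y in data:
        grad = 2 * (w * x - y) * x   # gradient of squared error
        w -= lr * grad
    return w

def federated_round(global_w, node_datasets):
    """One round of federated averaging: each node trains locally and
    only the resulting parameters are sent back and averaged."""
    local_ws = [local_update(global_w, data) for data in node_datasets]
    return sum(local_ws) / len(local_ws)

# Three nodes, each holding private samples of y = 2x (illustrative)
nodes = [[(1.0, 2.0)], [(2.0, 4.0)], [(0.5, 1.0)]]
w = 0.0
for _ in range(50):
    w = federated_round(w, nodes)
# w converges toward the shared underlying parameter 2.0, even though
# no raw data ever left its node
```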
Another approach to privacy preservation is the use of trusted execution environments
(TEEs): CPU-encrypted, isolated private enclaves inside memory that essentially protect data in use at the hardware level. Hardware enclaves enable computa-
tion over confidential data, providing strong isolation from other applications, the operat-
ing system, as well as the host, and unauthorized entities cannot remove this confidential
data, modify it, or add more data to it. The contents of an enclave remain invisible and
inaccessible to external parties, protected against outsider and insider threats, thus ensur-
ing data integrity, code integrity, and data confidentiality.
Homomorphic encryption is another method used to transfer data in a secure and pri-
vate manner. Homomorphic encryption differs from typical encryption methods in that it
allows computation to be performed directly on encrypted data without requiring access
to a secret key. The result of such a computation remains in encrypted form and can at a
later point be revealed by the owner of the secret key. Hence homomorphic encryption can
be used for privacy-preserving outsourced storage and computation.
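As a toy illustration of the homomorphic property — not a scheme for real use — unpadded "textbook" RSA is multiplicatively homomorphic, so a product can be computed on ciphertexts alone. The tiny hand-picked key below is purely illustrative and offers no real security:

```python
# Textbook RSA is multiplicatively homomorphic: E(a) * E(b) ≡ E(a * b) mod n
p, q = 61, 53
n = p * q            # 3233
e = 17               # public exponent
d = 2753             # private exponent (e*d ≡ 1 mod 3120)

encrypt = lambda m: pow(m, e, n)
decrypt = lambda c: pow(c, d, n)

a, b = 7, 6
c = (encrypt(a) * encrypt(b)) % n   # computed on ciphertexts only
result = decrypt(c)                 # recovers a * b without ever
                                    # decrypting a or b individually
```

Fully homomorphic schemes extend this idea to both addition and multiplication, which is what allows arbitrary computation on encrypted data.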
14.13 Summary
• Explored ethical issues in general and ethical aspects of machine learning in
particular.
• Discussed fairness in machine learning and the role of bias in fair machine learning.
• Used case studies to illustrate fairness in machine learning.
• Discussed the basic concepts of explainability in machine learning.
• Explained transparency and privacy issues of machine learning.
14.14 Points to Ponder
• The influence of machine learning is determined by the ethical choices people
make while embracing it.
• Ethical predictive models are important in decision-making applications such as
short listing candidates for interviews or deciding insurance premiums.
• Personal and psychological profiling affects ethical principles of privacy, discrimi-
nation, and confidentiality.
• Interpretability is the ability to explain in terms understandable to a human being.
E.14 Exercises
E.14.1 Suggested Activities
E.14.1.1 If ethical aspects were not taken care of, how do you think it will impact you
when using some applications in everyday life?
E.14.1.2 Give two examples how lack of privacy can impact you based on two applica-
tions you are currently using.
E.14.1.3 Can you think of new examples of bias in machine learning based applications
in health care, banking, or customer care?
E.14.1.4 Taking a business scenario where machine learning based services are pro-
vided (at least three), discuss ethical issues of fairness, explainability, and pri-
vacy that will be concerns.
E.14.1.5 Assume you have to design a machine learning solution to select the best stu-
dent for merit cum means scholarship. Discuss how you will introduce expla-
nations at each and every step of the solution, including the algorithm used.
E.14.1.6 Case Study: University administration system. Design a fairness solution to
tackle three examples of bias that can affect a machine learning based univer-
sity administration system.
Self-Assessment Questions
E.14.2 Multiple Choice Questions
Give answers with justification for correct and wrong choices.
E.14.2.1 This is not an ethical concern due to usage of AI and machine learning:
i Loss of jobs
ii Potential bias in decision making
iii Ineffectiveness of machine learning solution.
E.14.2.6 A harm that is caused because a system does not work as well for one person
as it does for another, even if no opportunities, resources, or information are
extended or withheld, is called
i Ethical harm
ii Allocation harm
iii Quality-of-service harm
E.14.2.8 When the attribute that is observed sometimes wrongly becomes data, it is
called
i Label bias
ii Representativeness bias
iii Subgroup validity bias
E.14.2.9 The law that states that the more unusual or inconsistent the result, the more
likely it is an error is called
i Murphy’s law
ii Bernoulli’s law
iii Twyman’s law
E.14.2.10 Testing specifically carried out close to launching of the product to search for
rare but extreme harms is called
i Ecologically valid testing
ii Adversarial testing
iii Targeted testing
E.14.2.16 Publicly sharing information about a data set by describing the patterns of
groups within the data set while withholding information about individuals
is called
i Different privacy
ii Secret privacy
iii Differential privacy
E.14.3.6 Fairness, explainability, transparency, and privacy F Transparency, fairness
are the four cornerstones
E.14.3.7 Machine learning model issue caused due to lack G Microsoft
of sufficient features and related data sets used for
training the models
E.14.3.8 Systemic distortion in the data that compromises H Accountability
its representativeness
E.14.3.9 Fairlearn tool I Mark Wilson, Fast Company on
Stack Overflow’s Developer
Survey Results, 2018
E.14.3.10 _____ is defined as the ratio of the proportion of J Understand biases in model
candidates having a given attribute value among prediction
the top k ranked results to the corresponding
desired proportion
E.14.4 Short Questions
E.14.4.1 Compare and contrast the following in the context of ethical machine learning:
i Bias and fairness
ii Explainability and interpretability
iii Trust and transparency
iv Privacy and security
References
Fast Company. (2018, March 15). What developers really think about bias. https://www.
fastcompany.com/90164226/what-developers-really-think-about-ai-and-bias
Gebru, T., et al. (2018). Datasheets for datasets. Proceedings of the 5th Workshop on Fairness, Accountability,
and Transparency in Machine Learning, Stockholm, Sweden. PMLR, 80.
IBM. (2017, October 10). Data responsibility @IBM. https://www.ibm.com/policy/dataresponsibility-
at-ibm/
IBM. (2018, February 1). Bias in AI: How we build fair AI systems and less-biased humans. https://
www.ibm.com/policy/bias-in-ai/
Nadella, S. (2016, June 28). The partnership of the future. Slate. https://slate.com/technology/
2016/06/microsoft-ceo-satya-nadella-humans-and-a-i-can-work-together-to-solve-societys-
challenges.html
15
Introduction to Deep Learning and
Convolutional Neural Networks
15.1 Introduction
Deep learning is a subset of machine learning in which systems learn and improve on their own from data. Deep learning works based on artificial neural networks (ANNs), which are designed to imitate how humans think and learn. In the past, neural networks were limited by the available computing power and hence also in their complexity. In the recent past, however, advances in big data analytics have enabled larger, more sophisticated neural networks that allow computers to observe, learn, and react to complex situations faster than humans. Deep learning finds applications in image classification, language translation, and speech recognition, and is predominantly used to solve pattern recognition problems without human intervention.
Deep neural networks (DNNs) are driven by ANNs, comprising many layers, where
each layer can perform complex operations such as representation and abstraction that
make sense of images, sound, text, and so on. DNNs are now considered the fastest-growing field in machine learning; deep learning represents a truly disruptive digital technology and is being used by more and more companies to create new business models.
TABLE 15.1
Evolution of Deep Learning
1943 For the first time, a computational model based on the brain’s neural connections, using mathematics and threshold logic, was created by Walter Pitts and Warren McCulloch.
1950 Alan Turing proposed the “learning machine” which marked the beginning of machine learning.
1958 A pattern recognition algorithm based on a two-layer neural network using simple addition and
subtraction was proposed by Frank Rosenblatt.
1959 IBM implemented a computer learning program proposed by Arthur Samuel to play the game of checkers.
1960 The basics of a continuous backpropagation model were proposed by Henry J. Kelley.
1962 Stuart Dreyfus developed a simpler version of a backpropagation model based only on the chain
rule.
1965 Neural network models with polynomial equations as activation functions were proposed by
Alexey Grigoryevich Ivakhnenko and Valentin Grigor′evich Lapa.
1970 Research in deep learning and artificial intelligence was limited due to lack of funding.
1979 Neocognitron, a hierarchical, multilayered artificial neural network used for handwriting
recognition and other pattern recognition problems, was developed by Kunihiko Fukushima.
1985 The term “deep learning” gained popularity because of a many-layered neural network that could be pretrained one layer at a time, demonstrated by Rumelhart, Hinton, and Williams.
1989 Combining convolutional neural networks with backpropagation to read handwritten digits was
developed by Yann LeCun at Bell Labs.
1985–1990 Research in neural networks and deep learning was limited for the second time due to lack of
funding.
1999 The next significant advancement was the adoption of GPU processing for deep learning.
2000 The vanishing gradient problem appeared for gradient learning methods where “features”
learned at lower layers were not being learned by the upper layers, because no learning signal
reached these layers.
2009 Launch of ImageNet, a free database of more than 14 million labeled images assembled by Fei-Fei Li at Stanford.
2011 AlexNet, a convolutional neural network using rectified linear units, won several international
competitions during 2011 and 2012.
2012 In its “cat experiment,” Google Brain used unsupervised learning: a convolutional neural net given unlabeled data had to find recurring patterns on its own.
2014 Ian Goodfellow proposed a generative adversarial neural network (GAN) in which two
networks compete against each other in a game.
2016 AlphaGo, a neural network based computer program, mastered the complex game Go and
beat a professional Go player.
dog-recognition skills by repeatedly adjusting its weights. This training technique is called supervised learning, which occurs even when the neural networks
are not explicitly told what “makes” a dog. They must recognize patterns in data over time
and learn on their own.
15.4.1 Applications
1. Automatic text generation: A corpus of text is learned, and from this model new
text is generated, word by word or character by character. Then this model is capa-
ble of learning how to spell, punctuate, and form sentences, or it may even capture
the style.
2. Health care: Helps in diagnosing various diseases and treating them.
3. Automatic machine translation: Certain words, sentences, or phrases in one
language are transformed into another language (deep learning is achieving top
results in the areas of text and images).
4. Image recognition: Recognizes and identifies people and objects in images and helps to understand content and context. This area is already being used in gaming, retail, tourism, and so on.
5. Predicting earthquakes: Teaches a computer to perform viscoelastic computa-
tions, which are used in predicting earthquakes.
TABLE 15.2
Comparison Between Machine Learning and Deep Learning
Machine Learning | Deep Learning
Works on small data sets | Works on large data sets
Dependent on low-end machines | Heavily dependent on high-end machines
Divides the task into subtasks, solves them individually, and finally combines the results | Solves the problem end to end
Takes less time to train | Takes a longer time to train
Testing time may increase | Takes less time to test the data
15.5 Neural Networks
Neural networks are networks of interconnected neurons, such as the interconnected neurons in the human brain. In an artificial neural network, each neuron is highly connected to other neurons and performs computations by combining the signals it receives from them. The outputs of these computations may be transmitted to one or more other neurons. The neurons are connected together in a specific way to perform a particular task.
A neural network is a function. It consists basically of (1) neurons, which pass input
values through functions and output the result, and (2) weights, which carry values (real-
number) between neurons. Neurons can be categorized into layers: (1) input layer, (2) hid-
den layer, and (3) output layer.
a Receptors: Receptors convert stimuli from the external environment into electri-
cal impulses. The best examples are the rods and cones of the eyes. The pain, touch, hot, and cold receptors of the skin are other examples of receptors.
b Neural net: Neural nets receive information, process it, and make appropriate
decisions. The brain is a neural net.
c Effectors: Effectors convert electrical impulses generated by the neural net (brain)
into responses to the external environment. Examples of effectors are muscles and
glands and speech generators.
Introduction to Deep Learning and Convolutional Neural Networks 371
FIGURE 15.1
Stages of human nervous system.
FIGURE 15.2
Basic components of a biological neuron.
The basic components of a biological neuron as given in Figure 15.2 are (1) cell body (soma),
which processes the incoming activations and converts them into output activations; (2)
the neuron nucleus, which contains the genetic material (DNA); (3) dendrites, which form
a fine filamentary bush, each fiber thinner than an axon; (4) the axon, a long, thin cylinder
carrying impulses from the soma to other cells; and (5) synapses, the junctions that allow
signal transmission between the axons and dendrites.
Computation in biological neurons happens as follows: (1) Incoming signals from syn-
apses are summed up at the soma, and (2) on crossing a threshold, the cell fires, generating
an action potential in the axon hillock region.
15.5.1.1 Perceptrons
Perceptrons, invented by Frank Rosenblatt in 1958, are the simplest neural networks: they
consist of n inputs, a single neuron, and one output, where n is the number of features of
our data set, as shown in Figure 15.3. The process of passing the data through the neural
network, known as forward propagation, is explained in the following three steps.
FIGURE 15.3
Single layer perceptron.
Step 1: The input value xᵢ for each input is multiplied by the weight wᵢ, representing
the strength or influence of the connection between neurons, and the products
are summed as given in Equation (15.1). The higher the weight, the higher its
influence.

∑ = (x₁ ∗ w₁) + (x₂ ∗ w₂) + … + (xₙ ∗ wₙ)  (15.1)

The row vectors of the inputs and weights are (Equation 15.2)

x = [x₁, x₂, …, xₙ] and w = [w₁, w₂, …, wₙ]  (15.2)

and their dot product gives the summation (Equation 15.3).

∑ = x · w  (15.3)

Step 2: Bias b, also known as the offset, is added to shift the output function:

y = x · w + b

Thresholding this sum gives the perceptron output (Equation 15.4).

y = 1 if ∑ wᵢxᵢ > threshold, and y = 0 otherwise  (15.4)
Step 3: This value will be presented to the activation function where the type of acti-
vation function will depend on the need and has a significant impact on the
learning speed of the neural network. Here we use the sigmoid—also known
as a logistic function—as our activation function (Equation 15.5).
ŷ = σ(z) = 1 / (1 + e⁻ᶻ)  (15.5)
where σ denotes the sigmoid activation function, and the output we get after the forward
propagation is known as the predicted value ŷ.
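The three steps of forward propagation can be sketched in a few lines of Python; the input values, weights, and bias below are illustrative, not taken from the text:

```python
import math

def forward(x, w, b):
    """One forward pass of a single neuron.
    Step 1: weighted sum of the inputs; Step 2: add the bias;
    Step 3: sigmoid activation to produce the predicted value y_hat."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b   # z = x.w + b
    y_hat = 1.0 / (1.0 + math.exp(-z))             # sigma(z)
    return y_hat

# Example with two features: z = 1.0*0.4 + 0.5*(-0.2) + 0.1 = 0.4
print(forward([1.0, 0.5], [0.4, -0.2], 0.1))  # sigma(0.4), approximately 0.5987
```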
Example: A simple decision via a perceptron may be whether you should go to watch
a movie this weekend.
The decision variables are:
For b = –8 and b = –3, let’s observe the output y as shown in Figure 15.4.
15.5.1.2 Perceptron Training
A perceptron works by taking in some numerical inputs along with what is known as
weights and a bias. It then multiplies these inputs with the respective weights (this is
known as the weighted sum). These products are then added together along with the bias.
FIGURE 15.4
Sample example.
The activation function takes the weighted sum and the bias as inputs and returns a final
output.
A perceptron consists of four parts: input values, weights and a bias, a weighted sum,
and an activation function. Let us understand the training using the following algorithm.
15.5.2 Activation Functions
An activation function decides whether a neuron should be activated or not. It helps the
network to use the useful information and suppress the irrelevant information. An activa-
tion function is usually a nonlinear function. If we choose a linear function, then it would
be a linear classifier with limited capacity to solve complex problems. The step, sign, sigmoid,
tanh, and ReLU functions, shown in Figure 15.5, can be used as activation functions.
• Sigmoid
σ(z) = 1 / (1 + exp(−z))  (15.6)

The value of the sigmoid function lies between 0 and 1, and the function is continuously differentiable but not symmetric around the origin; the vanishing gradients problem does occur (Equation 15.6).
FIGURE 15.5
Activation functions.
• Tanh
tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))  (15.7)
This function is a scaled version of the sigmoid and is symmetric around the origin
(Equation 15.7), but the vanishing gradients problem does occur.
• ReLU
Also called a piecewise linear function because the rectified function is linear for
the positive half of the input domain and zero for the other half (Equation 15.8).

ReLU(z) = max(0, z)  (15.8)

This function is trivial to implement, has a sparse representation, and avoids the
problem of vanishing gradients.
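The activation functions above can be sketched as plain Python functions (a minimal illustration; deep learning libraries provide optimized, vectorized versions):

```python
import math

def step(z):
    """Step function: fires (1) when z is at or above zero."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Sigmoid / logistic function, Equation (15.6); output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent, Equation (15.7); a scaled sigmoid, symmetric about 0."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def relu(z):
    """Rectified linear unit: linear for z > 0, zero otherwise."""
    return max(0.0, z)

print(sigmoid(0.0), tanh(0.0), relu(-2.0), relu(2.0))  # 0.5 0.0 0.0 2.0
```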
15.6 Learning Algorithm
A neural network with at least one hidden layer can approximate any function, and the
representation power of the network increases with more hidden units and more hidden
layers. The learning algorithm consists of two parts: backpropagation and optimization.
15.6.1 Backpropagation
• A backpropagation algorithm is used to train artificial neural networks; it can
update the weights very efficiently.
• It is a computationally efficient approach to compute the derivatives of a complex
cost function.
• The goal is to use those derivatives to learn the weight coefficients for parameter-
izing a multilayer artificial neural network.
• It computes the gradient of a cost function with respect to all the weights in the
network, so that the gradient is fed to the gradient descent method, which in turn
uses it to update the weights in order to minimize the cost function.
15.6.2 Chain Rule
The simpler version of the backpropagation model is based on chain rules.
Single path: The following Equation (15.9) represents the chain rule of a single path.
∂z/∂x = (∂z/∂y)(∂y/∂x)  (15.9)
Figure 15.6 shows the diagrammatic representation of the single path chain rule.
Multiple path: The following Equation (15.10) represents the chain rule of multiple paths.
∂z/∂x = (∂z/∂y₁)(∂y₁/∂x) + (∂z/∂y₂)(∂y₂/∂x)
      = ∑ₜ₌₁ᵀ (∂z/∂yₜ)(∂yₜ/∂x)  (15.10)
Figure 15.7 shows the diagrammatic representation of the multiple path chain rule.
FIGURE 15.6
Diagrammatic representation of single path.
FIGURE 15.7
Diagrammatic representation of multiple paths.
The total error in the network for a single input is given by Equation (15.11).
E = (1/2) ∑ₖ₌₁ᴷ (aₖ − tₖ)²  (15.11)
To reduce the overall error, the network’s weights need to be adjusted as shown in
Equation (15.12).
∆W ∝ −∂E/∂W  (15.12)
FIGURE 15.8
Sample architecture diagram with weights.
FIGURE 15.9
Backpropagation for outermost layer.
∂E/∂wⱼₖ = (∂E/∂aₖ)(∂aₖ/∂zₖ)(∂zₖ/∂wⱼₖ)  (15.13)
Each component of the backpropagation of the outermost layer’s equation (Equation 15.13)
is derived in the following equations (Equations 15.14–15.17):
∂E/∂aₖ = ∂/∂aₖ [(1/2) ∑ₖ∈K (aₖ − tₖ)²] = (aₖ − tₖ)  (15.14)

∂aₖ/∂zₖ = ∂/∂zₖ fₖ(zₖ) = f′ₖ(zₖ)  (15.15)

∂zₖ/∂wⱼₖ = ∂/∂wⱼₖ ∑ⱼ aⱼwⱼₖ = aⱼ  (15.16)

Combining these, and using f′ₖ(zₖ) = aₖ(1 − aₖ) for a sigmoid activation,

∂E/∂wⱼₖ = (aₖ − tₖ) aₖ(1 − aₖ) aⱼ  (15.17)
For the weights wᵢⱼ of the hidden layer,

∂E/∂wᵢⱼ = ∂/∂wᵢⱼ [(1/2) ∑ₖ∈K (aₖ − tₖ)²]
        = ∑ₖ∈K (aₖ − tₖ) (∂aₖ/∂wᵢⱼ)
        = ∑ₖ∈K (aₖ − tₖ) ∂/∂wᵢⱼ (fₖ(zₖ))
        = ∑ₖ∈K (aₖ − tₖ) f′ₖ(zₖ) (∂zₖ/∂wᵢⱼ)  (15.18)
∂zₖ/∂wᵢⱼ = (∂zₖ/∂aⱼ)(∂aⱼ/∂wᵢⱼ)
         = (∂/∂aⱼ ∑ⱼ aⱼwⱼₖ)(∂aⱼ/∂wᵢⱼ)
         = wⱼₖ (∂aⱼ/∂wᵢⱼ)
         = wⱼₖ (∂fⱼ(zⱼ)/∂wᵢⱼ)
         = wⱼₖ f′ⱼ(zⱼ)(∂zⱼ/∂wᵢⱼ)  (15.19)
         = wⱼₖ f′ⱼ(zⱼ)(∂/∂wᵢⱼ ∑ᵢ aᵢwᵢⱼ)
         = wⱼₖ f′ⱼ(zⱼ) aᵢ  (15.20)
The backpropagation algorithm is used for calculating the gradient of the loss function,
which points us in the direction of the value that minimizes the loss function, and using
gradient descent iteratively in the direction given by the gradient, we move closer to the
minimum value of error. The perceptron uses backpropagation in the following manner.
Step 1: We need to estimate the distance between the predicted solution and
the desired solution, which is generally defined by a loss function such as
mean squared error. In the case of a regression problem using mean squared
error as the loss function, the square of the difference between the actual
value (yᵢ) and the predicted value (ŷᵢ) is as shown in Equation (15.21).
MSEᵢ = (yᵢ − ŷᵢ)²  (15.21)
The average loss function for the entire training data set: the cost function C
for the entire data set is the average loss function for all n datapoints and is
given in Equation (15.22) as follows.
C = MSE = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²  (15.22)
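Equation (15.22) translates directly into code; a minimal sketch with made-up values:

```python
def cost(y, y_hat):
    """Mean squared error over n data points, as in Equation (15.22)."""
    n = len(y)
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / n

# (0.2**2 + 0.1**2) / 2 = 0.025
print(cost([1.0, 0.0], [0.8, 0.1]))
```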
Step 2: In order to find the best weights and bias for our perceptron, we need to
know how the cost function changes in relation to weights and bias. This
is done with the help of the gradients (rate of change)—how one quantity
changes in relation to another quantity. In our case, we need to find the gradi-
ent of the cost function with respect to the weights and bias.
The gradient of cost function C with respect to the weight wᵢ can be calculated using partial
derivation. Since the cost function is not directly related to the weight wᵢ, the chain rule can
be used as given in Equation (15.23).
∂C/∂wᵢ = (∂C/∂ŷ) ∗ (∂ŷ/∂z) ∗ (∂z/∂wᵢ)  (15.23)

The three factors ∂C/∂ŷ, ∂ŷ/∂z, and ∂z/∂wᵢ are evaluated in turn below.
Let's start with the gradient of the cost function C with respect to the predicted value ŷ (Equation 15.24).

∂C/∂ŷ = ∂/∂ŷ [(1/n) ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²] = (2/n) ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)  (15.24)
Let y = [y₁, y₂, … yn] and ŷ = [ŷ₁, ŷ₂, … ŷn] be the row vectors of actual and predicted
values. Hence the previous equation is simplified as follows (Equation 15.25):
∂C/∂ŷ = (2/n) ∑ (y − ŷ)  (15.25)
Now let’s find the gradient of the predicted value with respect to z (Equation 15.26). This
will be a bit lengthy.
∂ŷ/∂z = ∂σ(z)/∂z
      = ∂/∂z [1/(1 + e⁻ᶻ)]
      = e⁻ᶻ/(1 + e⁻ᶻ)²
      = [1/(1 + e⁻ᶻ)] ∗ [e⁻ᶻ/(1 + e⁻ᶻ)]
      = [1/(1 + e⁻ᶻ)] ∗ [1 − 1/(1 + e⁻ᶻ)]
      = σ(z) ∗ (1 − σ(z))  (15.26)
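The closed form σ(z)(1 − σ(z)) can be checked numerically; a small finite-difference sketch (the test point z = 0.7 is arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Closed-form derivative from Equation (15.26)."""
    return sigmoid(z) * (1.0 - sigmoid(z))

# Central finite difference approximates the derivative at z
z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
print(numeric, sigmoid_prime(z))  # the two values agree to many decimal places
```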
Next, the gradient of z with respect to the weight wᵢ (Equation 15.27):

∂z/∂wᵢ = ∂/∂wᵢ (∑ᵢ₌₁ⁿ xᵢwᵢ + b) = xᵢ  (15.27)
Putting the three factors together (Equation 15.28):

∂C/∂wᵢ = (2/n) ∑ (y − ŷ) ∗ σ(z) ∗ (1 − σ(z)) ∗ xᵢ  (15.28)
What about the bias? The bias is theoretically considered to have a constant input of 1.
Hence (Equation 15.29),

∂C/∂b = (2/n) ∑ (y − ŷ) ∗ σ(z) ∗ (1 − σ(z))  (15.29)
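The gradients of Equations (15.28) and (15.29) are enough to train the perceptron by gradient descent. The sketch below is illustrative only: the AND data set, learning rate, and epoch count are choices of this example, not taken from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=1.0, epochs=5000):
    """Batch gradient descent for a single sigmoid neuron,
    using the gradients of Equations (15.28) and (15.29)."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, t in zip(X, y):
            a = sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b)
            # shared factor (2/n)(y - y_hat) * sigma(z) * (1 - sigma(z))
            common = (2.0 / n) * (t - a) * a * (1.0 - a)
            for i in range(n_features):
                grad_w[i] += common * x[i]
            grad_b += common
        # with the (y - y_hat) convention of (15.28)/(15.29), adding
        # these terms moves the weights downhill on the cost C
        for i in range(n_features):
            w[i] += lr * grad_w[i]
        b += lr * grad_b
    return w, b

# Learn the logical AND function (a linearly separable toy data set)
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
w, b = train(X, y)
preds = [round(sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b)) for x in X]
print(preds)  # [0, 0, 0, 1]
```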
15.7 Multilayered Perceptron
A multilayered perceptron (MLP) is one of the most common neural network models
used in the field of deep learning. Often referred to as a “vanilla” neural network, an MLP
is simpler than the complex models of today’s era. However, the techniques it introduced
have paved the way for further advanced neural networks.
The MLP is used for a variety of tasks, such as stock analysis, image identification, spam
detection, and election voting predictions.
• Input Layer: This is the initial layer of the network which takes in an input that
will be used to produce an output.
• Hidden Layer(s): The network needs to have at least one hidden layer. The hid-
den layer(s) perform computations and operations on the input data to produce
something meaningful.
• Output Layer: The neurons in this layer display a meaningful output.
Connections
The MLP is a feed forward neural network, which means that the data is transmit-
ted from the input layer to the output layer in the forward direction.
FIGURE 15.10
Multilayer perceptron.
The connections between the layers are assigned weights as shown in Figure
15.11. The weight of a connection specifies its importance. This concept is the back-
bone of an MLP’s learning process.
While the inputs take their values from the surroundings, the values of all
the other neurons are calculated through a mathematical function involving the
weights and values of the layer before it.
For example, the value of the h5 node in Figure 15.11 could be (Equation 15.30),
15.7.2 Backpropagation
Backpropagation is a technique used to optimize the weights of an MLP using the outputs
as inputs.
In a conventional MLP, random weights are assigned to all the connections as shown in
Figure 15.12. These random weights propagate values through the network to produce the
actual output (Figure 15.13). Naturally, this output would differ from the expected output.
The difference between the two values is called the error.
Backpropagation refers to the process of sending this error back through the network
(Figure 15.14, 15.15), readjusting the weights automatically so that eventually the error
between the actual and expected output is minimized (Figure 15.16).
In this way, the output of the current iteration becomes the input and affects the next
output. This is repeated until the correct output is produced. The weights at the end of the
process would be the ones on which the neural network works correctly (Figure 15.16, 15.17).
FIGURE 15.11
MLP architecture with weights.
FIGURE 15.12
Simple model for predicting the next binary number.
FIGURE 15.13
Iteration 1—training with first input of training set.
CNN classification takes any input image and finds a pattern in the image, processes it,
and classifies it in various categories such as car, animal, or bottle. CNN is also used in
unsupervised learning for clustering images by similarity. It is a very interesting and com-
plex algorithm, which is driving the future of technology.
FIGURE 15.14
Backpropagation—adjusting weights from output layer to hidden layer.
FIGURE 15.15
Back propagation—adjusting weights from hidden layer to input layer.
FIGURE 15.16
Training for particular input.
FIGURE 15.17
Generation of final model.
when exposed to vertical edges and some when shown horizontal or diagonal edges.
Hubel and Wiesel found out that all of these neurons were organized in a columnar archi-
tecture and that together they were able to produce visual perception. This idea of special-
ized components inside of a system having specific tasks (the neuronal cells in the visual
cortex looking for specific characteristics) is one that machines use as well and is the basis
behind CNNs.
1. Convolutional layer
2. Pooling layer (optional)
3. Output layers
The input layer is connected to the convolutional layers; the input layer of a CNN
reshapes the image data, represented as a three-dimensional matrix, into a single column.
FIGURE 15.18
Convolutional neural network.
• Pooling layer (optional layer): A pooling layer, optionally placed between two
layers of convolution, reduces the spatial volume of the input image after con-
volution while preserving the important characteristics. This layer improves the
efficiency and avoids overfitting.
• Output layer: The fully connected layer is the last layer of the CNN, as in any
neural network. This layer applies a linear combination and then possibly
an activation function to produce the output. It classifies the image and
returns an output vector of size N, where N is the number of classes of our task and
each element of the vector indicates the probability of the input belonging to that
particular class.
1. Image channels
2. Convolution
3. Pooling
4. Flattening
5. Full connection
15.8.4.1 Image Channels
The CNN needs the image to be represented in numerical format where each pixel is
mapped to a number between 0 and 255 to represent color. For a black and white image,
a 2-dimensional array of size m × n containing the corresponding pixel value is used. In
the case of a colored image, a 3-dimensional array representing the corresponding pixel is
used, where the dimensions correspond to the red, blue, and green channels (Figure 15.19).
The image is represented as a 3-dimensional array, with each channel representing red,
green, and blue values, respectively, as shown in Figure 15.20.
FIGURE 15.19
Representation of image 3D array.
FIGURE 15.20
Representation of 3D image with each channel (red, green, blue).
15.8.4.2 Convolution
The image is represented as a combination of numbers, from which the key features within
the image need to be identified; convolution is used for this. The convolution operation
modifies or convolutes one function to the shape of another; in images it is normally used
to sharpen, smooth, and intensify images for extraction of important features.
Feature Detection
A filter or a kernel is an array that represents the feature to be extracted. This filter
is strided over the input array, resulting in a 2-dimensional array that contains
the correlation of the image with respect to the applied filter. The output array is
referred to as the feature map.
The resulting image contains just the edges present in the original input. For
example, the filter used in the previous example is of size 3 ×3 and is applied to the
input image of size 5 × 5. The resulting feature map is of size 3 × 3. In summary,
for an input image of size n × n and a filter of size m × m, the resulting output is of
size (n – m + 1) × (n – m + 1).
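A minimal sketch of this "valid" convolution (the 5 × 5 image and 3 × 3 filter values below are made up for illustration; as in CNN practice, the kernel is not flipped, so this is strictly cross-correlation):

```python
def convolve2d(image, kernel):
    """Slide an m x m kernel over an n x n image one step at a time,
    producing an (n - m + 1) x (n - m + 1) feature map."""
    n, m = len(image), len(kernel)
    out_size = n - m + 1
    out = [[0.0] * out_size for _ in range(out_size)]
    for r in range(out_size):
        for c in range(out_size):
            out[r][c] = sum(image[r + i][c + j] * kernel[i][j]
                            for i in range(m) for j in range(m))
    return out

image = [[1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]
fmap = convolve2d(image, kernel)
print(len(fmap), len(fmap[0]))  # 3 3: a 5x5 input with a 3x3 filter gives a 3x3 map
```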
Strided Convolutions
During the process of convolution, you can see how the input array is transformed
into a smaller array while still maintaining the spatial correlation between the pixels
by applying filters. Here, we discuss how to compress the size of the input array to a
greater extent.
Striding
In the previous section, you saw how the filter is applied to each 3 × 3 section of
the input image. The window is slid by one column to the right each time, and at
the end of each row it is slid down by one row. In this case, sliding of the filter
over the input was done one step at a time. This is referred to as striding.
The following example shows the same convolution but strided with two steps.
The filter used in the previous example is of size 3 × 3 and is applied to the input
image of size 5 × 5 with a stride of 2; the resulting feature map is of size 2 × 2. In
summary, for an input image of size n × n and a filter of size m × m with stride = k,
the resulting output will be of size ((n − m)/k + 1) × ((n − m)/k + 1).
Padding
During convolution, the size of the feature map is reduced drastically compared
to the input. In addition, as the filter strides across the input, it covers the cells
in the corners just once but the cells in the center many times.
To ensure that the feature map retains the original input size and that all pixels
are assessed equally, one or more layers of padding are applied to the original
input array. Padding is the process of adding extra rows and columns of zeros
around the outer edge of the input array. In general, for an input image of size
n × n and a filter of size m × m with padding = p, the resulting output is of size
(n + 2p − m + 1) × (n + 2p − m + 1).
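The plain, strided, and padded size formulas above combine into one expression; a small helper, assuming the filter fits evenly for the given stride:

```python
def conv_output_size(n, m, stride=1, padding=0):
    """Spatial size of the feature map for an n x n input, an m x m filter,
    the given stride, and p layers of zero padding:
    ((n + 2p - m) // stride) + 1."""
    return (n + 2 * padding - m) // stride + 1

print(conv_output_size(5, 3))             # 3: plain convolution
print(conv_output_size(5, 3, stride=2))   # 2: strided convolution
print(conv_output_size(5, 3, padding=1))  # 5: padding preserves the input size
```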
15.8.4.3 Pooling
To further reduce the size of the feature map generated from convolution, pooling
(also called subsampling) is used to compress the dimensions of the feature map.
Pooling is the process of summarizing the features within a group of cells in the feature
map. This summary can be obtained by taking the maximum, minimum, or average
within a group of cells; these methods are referred to as max, min, and average
pooling, respectively. In a CNN, pooling is applied to the feature map produced by each
filter if more than one filter is used.
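A sketch of non-overlapping pooling over a single feature map (the 4 × 4 values are illustrative; the window size and stride both equal 2 here):

```python
def pool2d(fmap, size=2, mode="max"):
    """Summarize each size x size block of the feature map by its
    max, min, or average (non-overlapping windows, stride = size)."""
    agg = {"max": max, "min": min,
           "avg": lambda v: sum(v) / len(v)}[mode]
    out_size = len(fmap) // size
    out = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            window = [fmap[r * size + i][c * size + j]
                      for i in range(size) for j in range(size)]
            row.append(agg(window))
        out.append(row)
    return out

fmap = [[1, 3, 2, 1],
        [4, 2, 0, 1],
        [3, 1, 5, 2],
        [0, 1, 2, 2]]
print(pool2d(fmap))              # max pooling: [[4, 2], [3, 5]]
print(pool2d(fmap, mode="avg"))  # average pooling: [[2.5, 1.0], [1.25, 2.75]]
```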
FIGURE 15.21
Applying 3 × 3 filter.
FIGURE 15.22
Applying three filters of size 3 × 3.
FIGURE 15.23
Flattening.
15.8.4.4 Flattening
We can think of a CNN as a sequence of steps performed to capture the important
aspects of an image before applying an ANN to it. In the previous steps, we saw
the different transformations that are applied to the original image.
The final step in this process is to make the output of the CNN compatible with an
artificial neural network. The inputs to an ANN should be in the form of a vector. To
support that, flattening is applied (Figure 15.23): the step that converts the
multidimensional array into an n × 1 vector, as shown previously.
FIGURE 15.24
Overall flow of convolutional neural network.
Note that Figure 15.23 shows flattening applied to just one feature map. However, in
CNN, flattening is applied to feature maps that result from each filter.
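Flattening is a one-line transformation; a sketch covering several feature maps (the values are illustrative):

```python
def flatten(feature_maps):
    """Convert a list of 2-D feature maps into a single flat vector,
    the form expected by the fully connected (ANN) part of the network."""
    return [value
            for fmap in feature_maps
            for row in fmap
            for value in row]

# Two 2x2 feature maps flatten into one vector of length 8
fmaps = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(flatten(fmaps))  # [1, 2, 3, 4, 5, 6, 7, 8]
```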
15.9 Summary
• Introduced deep learning including its evolution, working, and applications.
• Outlined the basic components of neural networks.
• Discussed perceptrons in detail with an example, activation function used, and the
learning algorithm comprising backpropagation and optimization.
• Described briefly multilayered perceptron (MLP) and convolutional neural net-
works (CNNs) including their components and the use of CNN for image detec-
tion or classification.
• Listed some examples of convolutional neural networks.
15.10 Points to Ponder
• Convolutional neural networks are used for computer vision because of their abil-
ity to recognize features in raw data.
• Convolutional neural networks can learn features on their own, building up from
low-level (edges, circles) to high-level (faces, hands, cars) features.
• Often the features learnt by convolutional neural networks are not human
interpretable.
• When using convolutional neural networks, a wild pixel in the image can result in
surprisingly different outputs.
E.15 Exercises
E.15.1 Suggested Activities
E.15.1.1 Use CNN for automatic digit recognition with the MNIST DATABASE,
http://yann.lecun.com/exdb/mnist/.
E.15.1.2 In the traffic sign classification project, we need to identify traffic signs from the
image using CNN. You should use the GTSRB dataset: https://www.kaggle.
com/datasets/meowmeowmeowmeowmeow/gtsrb-german-traffic-sign.
E.15.1.3 Apply convolutional neural networks (CNNs) for facial expression recognition,
and correctly classify each facial image into one of the seven facial emotion cat-
egories: anger, disgust, fear, happiness, sadness, surprise, and neutral. Use the
Kaggle data set: https://www.kaggle.com/datasets/msambare/fer2013.
Self-Assessment Questions
E.15.2 Multiple Choice Questions
Give answers with justification for correct and wrong choices.
E.15.2.2 Google Brain conducted the “_____” experiment in 2012 for unsupervised
learning using a convolutional neural net.
i Dog
ii Cat
iii Lady
E.15.2.3 Effectors convert electrical impulses generated by the neural net (brain) into
responses to the external environment. Examples include
i Hands
ii Muscles
iii Skin
E.15.2.4 Synapses
i Form a fine filamentary bush, with each fiber thinner than an axon
ii Process the incoming activations and converts them into output activations
iii Are junctions that allow signal transmission between the axons and dendrites
E.15.2.6 The process of passing the data through the neural network is known as
i Inward propagation
ii Backward propagation
iii Forward propagation
E.15.2.8 The _____ method is used to update the weights in order to minimize the cost
function.
i Gradient
ii Gradient descent
iii Weight reduction
E.15.2.9 The _____ algorithm is used for calculating the gradient of the loss function.
i Forward propagation
ii Activation function
iii Backward propagation
E.15.2.10 The _____ performs computations and operations on the input data to produce something meaningful.
i Input layer
ii Hidden layer
iii Output layer
E.15.2.11 In the case of a CNN, the convolution is performed on the input data using the
i Kernel
ii Input layer
iii Pooling
E.15.2.13 The _____ performs many tasks such as padding, striding, and the functioning of kernels.
i Pooling layer
ii Convolutional layer
iii Fully connected layer
E.15.2.15 Sliding of the filter over the input done one step at a time is called
i Padding
ii Filtering
iii Striding
E.15.3 Match the Following
E.15.3.1 Ian Goodfellow A Is the process of passing the data through the neural
network
E.15.3.2 ImageNet B Is the process where a window representing the
feature of the image is dragged over the input and
the product between the feature and each portion
of the scanned image is calculated
E.15.3.3 Rods and cones of eyes, skin C The process of readjusting the weights automatically
so that eventually, the error between the actual
and expected output is minimized
E.15.3.4 Forward propagation D Examples of receptors
E.15.3.5 Activation function E Reduces the size of the feature map generated from
convolution
E.15.3.6 ReLU function F Developed for handwritten digit recognition and
was responsible for the term “convolutional
neural network.”
E.15.3.7 Backpropagation G Generative adversarial neural network (GAN)
E.15.3.8 Convolution H Has a sparse representation and avoids the problem
of vanishing gradients
E.15.3.9 Pooling I Free database assembled by Fei-Fei Li at Stanford
E.15.3.10 LeNet-5 J Helps the network to use the useful information and
suppress the irrelevant information
E.15.4 Short Questions
E.15.4.1 What is deep learning? Discuss.
E.15.4.2 Give some typical applications of deep learning.
E.15.4.3 Compare and contrast machine learning and deep learning.
E.15.4.4 Outline the stages of the human nervous system.
E.15.4.5 How is computation carried out in biological neurons?
E.15.4.6 Give a detailed description of the perceptron.
E.15.4.7 How is a perceptron trained?
E.15.4.8 Explain and give three examples of activation functions with illustrations.
E.15.4.9 What is backpropagation? Explain in the context of neural networks.
E.15.4.10 Outline single path and multiple path chain rules.
E.15.4.11 Give details of backpropagation for the hidden as well as output layer of a
perceptron.
E.15.4.12 Describe the basic structure of the multilayer perceptron (MLP).
E.15.4.13 Explain in detail how backpropagation readjusts the weights automati-
cally to minimize the error between the actual and expected output using
as an example a model to predict the next binary number with appropriate
diagrams.
E.15.4.14 Explain the biological connection to convolutional neural networks.
E.15.4.15 Describe the architecture of convolutional neural networks.
E.15.4.16 Explain the evolution of convolutional neural networks in the context of the
ImageNet Challenge.
E.15.4.17 Describe the Neocognitron, considered as the originator of convolutional
neural networks.
E.15.4.18 How does learning occur in convolutional neural networks? Explain.
E.15.4.19 Explain the terms padding, striding, and flattening in the context of convolu-
tional neural networks.
E.15.4.20 Explain in detail the convolution operation.
16 Other Models of Deep Learning and Applications of Deep Learning
16.1.1 Working of RNN
Let us understand the working of RNN using a sample scenario.
Consider a deeper network that has one input layer, one output layer, and three hidden
layers. Normally each hidden layer is associated with its own set of weights and biases,
namely (w1, b1), (w2, b2), (w3, b3) (Figure 16.2) for each of the three hidden layers. This in
essence means that each of these layers are independent of each other and do not retain
information associated with the earlier outputs.
RNN, however, brings in dependency by using the same weights and biases across all
layers, reducing the number of parameters and memorizing previous outputs by feeding
each output into the next hidden layer. These three layers can therefore be combined into
a single recurrent layer in which all the hidden layers have the same weights and biases
(Figure 16.3).
FIGURE 16.1
Recurrent neural network (RNN).
FIGURE 16.2
Deeper recurrent neural network.
FIGURE 16.3
Combination of three hidden layers.
hₜ = f(hₜ₋₁, xₜ)  (16.1)

where hₜ is the current state, hₜ₋₁ the previous state, and xₜ the input state.
After applying the activation function (tanh) in Equation (16.1), the current state becomes (Equation 16.2)

hₜ = tanh(w_hh hₜ₋₁ + w_xh xₜ)  (16.2)

where w_hh is the weight at the recurrent neuron and w_xh the weight at the input neuron. The output is then given by Equation (16.3).

yₜ = w_hy hₜ  (16.3)

where yₜ is the output and w_hy the weight at the output layer.
1. The network consists of input layers having the same weights and activation function,
and it is given a single time step of the input.
2. Then, using the current input and the previous state output, the current state is
calculated.
3. For the next time step, the current state ht becomes ht−1.
4. Depending on the problem, one can repeat as many time steps as necessary to join
information from all previous states.
5. The final current state is used to calculate the output after all time steps have been
completed.
6. Then error is generated which is the difference between the output obtained from
the RNN model and the actual target output.
7. This error is then backpropagated and used to update the weights and train the
network (RNN).
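The steps above can be sketched with scalar states and shared weights (a toy version: real RNNs use weight matrices and vector states, and the weight values here are arbitrary):

```python
import math

def rnn_forward(xs, w_xh, w_hh, w_hy, h0=0.0):
    """Unrolled RNN with scalar states: the same three weights are shared
    across every time step (toy version of Equations 16.1-16.3)."""
    h = h0
    outputs = []
    for x in xs:
        h = math.tanh(w_hh * h + w_xh * x)   # current state from previous state + input
        outputs.append(w_hy * h)             # output at this time step
    return outputs, h

# Three time steps of a scalar input sequence
ys, h_final = rnn_forward([1.0, 0.5, -0.3], w_xh=0.8, w_hh=0.4, w_hy=1.5)
print(len(ys))  # 3: one output per time step
```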
The main advantage of an RNN is that it remembers information over time, and this
ability to remember previous inputs makes it useful in time series prediction.
Architectures that extend this memory are referred to as long short-term memory (LSTM)
networks. Recurrent neural networks are also used in conjunction with convolutional
layers to increase the effective pixel neighborhood.
However, RNNs suffer from vanishing and exploding gradients, and they are extremely
difficult to train. Moreover, when tanh or ReLU is used as the activation function, an RNN
cannot process very long sequences.
In today's world, several machine learning methods are used to process the many
varieties of data that are available. Sequential data is one of the most challenging
types to work with and to forecast. It is distinct from other types of data: whereas the
features of a normal data set can be considered order-independent, this cannot be
assumed for a sequential data set. Recurrent neural networks arose from the need to
process this kind of data, and their structure is unique among artificial neural
networks. Whereas other networks "move" in a linear way through the feed-forward
and backpropagation processes, a recurrent network follows a recurrence relation and
employs backpropagation through time to learn.
A recurrent neural network is made up of multiple fixed activation function units, one
for each time step. Each unit has an internal state, known as the hidden state of the
unit. This hidden state represents the historical information that the network possesses
at a given time step, and it is updated at each time step to signal the change in the
network's knowledge about the past. The hidden state is kept updated by employing the
recurrence relation that follows (Equation 16.4).
hₜ = f_W(xₜ, hₜ₋₁)  (16.4)

where hₜ is the new hidden state, hₜ₋₁ the old hidden state, xₜ the current input, and f_W a fixed function with trainable weights.
Calculating the new hidden state at each time step requires using the recurrence relation
described above. This newly produced hidden state is put to use in the production of yet
another newly produced hidden state, and so on (Figure 16.4).
Normally the network starts off in its initial hidden state, denoted by h_0, which in most cases is a vector of zeros. However, it is possible to encode assumptions about the data into the initial hidden state of the network. For example, to find out the tenor of a speech delivered by a famous person, the tenor of that person’s previous speeches might be encoded into the initial hidden state. Another approach is to make the initial hidden state a trainable parameter. However, initializing the hidden state vector to zeros is in most cases the most practical and efficient option.
The following is how each recurrent unit works:
1. Both the current input vector and the previous hidden state vector are taken as input. Because the current input and the hidden state are treated as vectors, each element of a vector is associated with a dimension that is orthogonal to the other dimensions. Therefore, the product of one element with another element is nonzero only when both elements are nonzero and belong to the same dimension.
2. Perform element-wise multiplication of the current input vector and hidden state
vector by their corresponding weights, producing the associated parameterized
vectors. The trainable weight matrix contains the weights that correspond to the
various vectors.
FIGURE 16.4
Basic work-flow of a recurrent neural network.
Other Models of Deep Learning and Applications of Deep Learning 401
3. Then the vector addition of the two parameterized vectors is carried out, and in
order to construct the new hidden state vector the element-wise hyperbolic tan-
gent is calculated (Figure 16.5).
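The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the book's notation: the separate weight matrices `W_xh` and `W_hh` and all dimensions are illustrative assumptions, and the element-wise tanh plays the role of f_W.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh):
    """One recurrent unit step: h_t = tanh(W_xh @ x_t + W_hh @ h_prev)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)   # element-wise tanh (step 3)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input weights
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden weights

h = np.zeros(hidden_dim)                       # h0: initial hidden state
for x_t in rng.normal(size=(5, input_dim)):    # a sequence of 5 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh)           # steps 1-2: parameterized vectors
print(h.shape)                                 # (4,)
```

Each newly produced hidden state is fed back in to produce the next one, exactly as described above.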
During recurrent network training (Figure 16.6), the network creates an output at each
time step. The network is trained using gradient descent using this output.
The backpropagation that is being employed here is quite analogous to the one that is
utilized in a standard artificial neural network, with a few small modifications. These mod-
ifications can be identified as follows.
Let the predicted output of the network at any time step be y_t and the actual output be ȳ_t. Then the error at each time step is given by (Equation 16.5):

E_t = −ȳ_t log(y_t)  (16.5)

The total error is given by the summation of the errors at all the time steps (Equation 16.6):

E = Σ_t E_t = −Σ_t ȳ_t log(y_t)  (16.6)
FIGURE 16.5
Working of recurrent unit.
FIGURE 16.6
Recurrent network training.
Similarly, the value ∂E/∂W can be calculated as the summation of gradients at each time step (Equation 16.7):

∂E/∂W = Σ_t ∂E_t/∂W  (16.7)
Using the chain rule of calculus and the fact that the output at a time step is a function of the current hidden state of the recurrent unit, the following expression arises (Equation 16.8):

∂E_t/∂W = (∂E_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_{t−1}) ⋯ (∂h_0/∂W)  (16.8)

Take note that the single weight matrix W used in the preceding calculation actually stands for different weights for the input vector and the hidden state vector; the common symbol is used only for ease of notation and has nothing to do with the actual meaning of the matrix.
Therefore, the only difference between backpropagation through time and a standard
backpropagation is that the errors at each time step are totaled up to calculate the total
error. This is the only difference (Figure 16.7).
Although the fundamental recurrent neural network is reasonably efficient, it is susceptible to a severe difficulty: backpropagation in deep networks can give rise to a number of concerns, including the following:
1. Vanishing gradients occur when useful gradient information cannot be propagated from the output back to the earlier layers and the gradients take very small values tending towards zero. This results in unstable behaviour and may cause premature convergence to a poor solution.
FIGURE 16.7
Backpropagation in RNN.
2. Exploding gradients is a phenomenon that takes place when the gradients keep
getting larger during backpropagation, resulting in very large weight updates
causing the gradient descent to diverge.
The issue of vanishing gradients may be tackled by using rectified linear units (ReLU) as the activation function. The issue of exploding gradients can be mitigated by a workaround: imposing a threshold on the gradients that are passed back in time. However, this technique is not considered a true solution, and it can also reduce the effectiveness of the network. Long short-term memory networks and gated recurrent unit networks are the two primary varieties of recurrent neural networks that have been developed to address issues of this nature.
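A toy calculation makes the two failure modes concrete. In a simple RNN, each backward step multiplies the gradient by a factor of roughly tanh′(a)·w; the scalar values below are illustrative assumptions, not quantities from the text:

```python
import numpy as np

def gradient_magnitude(w, steps, a=0.5):
    """Magnitude after `steps` backward steps, each multiplying by tanh'(a)*w."""
    factor = (1 - np.tanh(a) ** 2) * w    # one step's contribution
    return abs(factor) ** steps

print(gradient_magnitude(w=0.9, steps=50))   # vanishing: practically zero
print(gradient_magnitude(w=2.5, steps=50))   # exploding: astronomically large
```

Because the per-step factor is roughly constant, fifty steps are enough to drive the gradient either toward zero or toward overflow.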
16.2 Auto-encoders
One example of how neural networks are typically put to use is in situations that call for
supervised learning. It requires training data to be provided, which should include a label
for the output. The neural network makes an effort to learn the mapping that exists between
the input label and the output label that is provided. What would happen, though, if the
input vector itself was used in place of the output label? After then, the network will make
an attempt to discover the mapping from the input to itself. This would be the identity
function, which is an extremely straightforward mapping. However, if the network is not
permitted to merely replicate the input, then the network will be compelled to only record
the most important characteristics. This limitation reveals an entirely new realm of appli-
cations for neural networks, which was previously unexplored. Dimensionality reduction
and particular data compression are among the most important uses. In the beginning, the
network is educated using the input that was provided. The network makes an attempt to
rebuild the input by using the features that it has collected, and it then provides an output
that is a close approximation of the input. During the training process, you will be comput-
ing the error and then backpropagating the error. The conventional architecture of an auto-
encoder is shaped very similarly to a bottleneck. An auto-encoder can be broken down into
its component parts as follows, according to its schematic (Figure 16.8).
• De-noising auto-encoder: This sort of auto-encoder works with an input that has
been partially corrupted and learns to recover the original, undistorted image
from the training data. As was just pointed out, utilising this technique is an effi-
cient approach to prevent the network from merely replicating the input.
• Sparse auto-encoder: This kind of auto-encoder often has more hidden units than
the input, but only a few of those hidden units are allowed to be active at the same
time. This characteristic of the network is referred to as its sparsity. Controlling the
network’s sparsity can be accomplished by either manually zeroing the required
hidden units, fine-tuning the activation functions, or adding a loss component
to the cost function. All three of these methods are described further later in the
chapter.
FIGURE 16.8
Components of an auto-encoder.
Let us understand the training of an auto-encoder for a data compression scenario using the following steps. In data compression, the most significant property is the reliability with which the compressed data can be reconstructed, and it is this requirement that motivates the bottleneck structure of the auto-encoder.
1. The first step is to encode the data that was input. The auto-encoder will first
attempt to encode the data by making use of the weights and biases that have been
initialised (Figure 16.9).
2. The next step is to decode the data that was input. In order to ensure that the
encoded data are accurate representations of the original input, the auto-encoder
will do a reconstruction of the data using the encoded version (Figure 16.10).
3. Backpropagating the error is the third step. When the reconstruction is complete, the loss function is computed so that the reliability of the encoding may be evaluated. The resulting error is then propagated backwards (Figure 16.11).
This training process is reiterated multiple times until an acceptable level of reconstruction
is reached (Figure 16.12).
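The three training steps can be sketched as follows with a purely linear encoder and decoder in NumPy. The data, sizes, learning rate, and linear activations are illustrative assumptions; a real auto-encoder would normally use nonlinear activations:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))                # toy data set: 100 samples, 8 features

W_enc = rng.normal(scale=0.1, size=(8, 3))   # bottleneck: 8 -> 3 -> 8
W_dec = rng.normal(scale=0.1, size=(3, 8))
lr, n = 0.1, len(X)

def reconstruction_loss():
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

before = reconstruction_loss()
for _ in range(500):
    code = X @ W_enc                         # step 1: encode the input
    X_hat = code @ W_dec                     # step 2: decode / reconstruct
    err = X_hat - X                          # step 3: compute the error...
    grad_dec = code.T @ err / n              # ...and backpropagate it
    grad_enc = X.T @ (err @ W_dec.T) / n
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
after = reconstruction_loss()
print(before > after)                        # reconstruction improves with training
```

After training, only `W_enc` (the encoder) would be kept, as the text notes below.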
After the training phase, the only component of the auto-encoder that is kept is the encoder,
which is used to encode data that is comparable to that which was utilised during the train-
ing process. The following are the various ways that the network can be constrained:
FIGURE 16.9
Step 1—encoding the data.
FIGURE 16.10
Step 2—decoding the data.
1. Keep hidden layers small: If the size of each hidden layer is kept as small as feasible, the network is forced to pick up only the representative aspects of the data, thereby encoding the data.
2. Regularisation: Adding a loss term to the cost function encourages the network to train itself in ways other than simply duplicating the input.
FIGURE 16.11
Step 3—backpropagation of the error.
FIGURE 16.12
Iterative training processes.
1. Forget gate (f): This gate determines the degree to which the preceding data is
forgotten.
2. Input gate (i): It is responsible for determining the amount of information that will
be written into the internal cell state.
3. Input modulation gate (g): It is frequently thought of as a component of the input gate, and much of the published literature on LSTMs either does not mention it or assumes that it is contained inside the input gate. It is used to modulate the information that the input gate will write onto the internal cell state by adding nonlinearity to the information and making it zero-mean, which shortens the time required for learning because zero-mean input converges faster. Although its actions are less significant than those of the other gates and it is often regarded as merely a concept that provides finesse, it is good practice to include this gate in the description of the LSTM unit.
4. Output gate (o): The function of the output gate (o) is to decide what output (next
hidden state) should be generated based on the present state of the internal cell.
The fundamental process flow of a long short-term memory network is very comparable
to the fundamental process flow of a recurrent neural network; the only difference is that
the internal cell state is also transmitted along with the hidden state in the long short-term
memory network (Figure 16.13).
FIGURE 16.13
Process flow of a long short-term memory network.
FIGURE 16.14
Working of an LSTM recurrent unit.
The steps involved in the operation of an LSTM unit are given subsequently and as
shown in Figure 16.14:
1. The inputs are the current input, the previous hidden state, and the prior internal
cell state.
2. Now the values of the four distinct gates are determined as follows:
• For each gate, element-wise multiplication of the current input vector and the previous hidden state vector with their corresponding weights is calculated to obtain the parameterized vectors for each gate.
• The corresponding activation function is applied element-wise to the param-
eterized vectors for each gate. The activation functions to be applied to each
gate are described as follows.
The current internal cell state is then calculated as follows (Equation 16.10):

c_t = i ∘ g + f ∘ c_{t−1}  (16.10)
• first calculating the element-wise multiplication vector of the input gate and
the input modulation gate,
• next calculating the element-wise multiplication vector of the forget gate and
the previous internal cell state, and
• finally summing the two vectors.
The circle notation in the figure represents element-wise multiplication. The weight matrix
W comprises distinct weights for each gate’s current input vector and preceding hidden
state.
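The steps above can be sketched as a single LSTM step in NumPy. The conventional activation choices are assumed here (sigmoid for the input, forget, and output gates; tanh for the input modulation gate), and the stacked weight matrix `W` and all sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_prev; x_t] to four stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = len(h_prev)
    i = sigmoid(z[0:H])           # input gate
    f = sigmoid(z[H:2*H])         # forget gate
    g = np.tanh(z[2*H:3*H])       # input modulation gate (zero-mean)
    o = sigmoid(z[3*H:4*H])       # output gate
    c_t = i * g + f * c_prev      # Equation (16.10), element-wise products
    h_t = o * np.tanh(c_t)        # new hidden state from the internal cell state
    return h_t, c_t

rng = np.random.default_rng(2)
H, D = 4, 3
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):   # both h and c are carried across steps
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)               # (4,) (4,)
```

Note how the internal cell state c is transmitted along with the hidden state h, unlike in a plain recurrent unit.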
The LSTM network also generates an output at each time step, which is then used for
gradient descent training (Figure 16.15).
Backpropagation methods used by recurrent neural networks and long short-term
memory networks are similar but differ in the way mathematical aspects are modelled.
Let y_t be the predicted output at each time step and ȳ_t be the actual output at each time step. Then the error at each time step is given by (Equation 16.11):

E_t = −ȳ_t log(y_t)  (16.11)

The total error is thus given by the summation of errors at all time steps (Equation 16.12):

E = Σ_t E_t = −Σ_t ȳ_t log(y_t)  (16.12)
FIGURE 16.15
LSTM network working.
Similarly, the value ∂E/∂W can be calculated as the summation of the gradients at each time step (Equation 16.13):

∂E/∂W = Σ_t ∂E_t/∂W  (16.13)
Using the chain rule and the fact that y_t is a function of h_t, which is in turn a function of c_t, the following expression arises (Equation 16.14):

∂E_t/∂W = (∂E_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂c_t)(∂c_t/∂c_{t−1}) ⋯ (∂c_0/∂W)  (16.14)

Thus the total error gradient is given by the following (Equation 16.15):

∂E/∂W = Σ_t (∂E_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂c_t)(∂c_t/∂c_{t−1}) ⋯ (∂c_0/∂W)  (16.15)

Note that the gradient equation involves a chain of ∂c_t terms for LSTM backpropagation, whereas the gradient equation involves a chain of ∂h_t terms for a basic recurrent neural network.
16.3.1 How Does LSTM Solve the Problem of Vanishing and Exploding Gradients?
Recall the expression for c_t:

c_t = i ∘ g + f ∘ c_{t−1}

The value of the gradients is controlled by the chain of derivatives starting from ∂c_t/∂c_{t−1}. Expanding this term using the expression for c_t (Equation 16.16) shows that the forget gate f appears as one of its additive components, along with terms involving the derivatives of the gates.

For a simple RNN, the term ∂h_t/∂h_{t−1} starts to consistently take values either larger than 1 or smaller than 1, always in the same range, after a certain duration; this is the underlying source of the vanishing and exploding gradient problems. The term ∂c_t/∂c_{t−1} in an LSTM, by contrast, can take any positive value at each time step, so convergence of the gradient toward zero is not forced. If the gradient begins to converge toward 0, the gate weights can be adjusted to bring it closer to 1. Since these weights are adjusted during the training phase, the network learns when to let the gradient converge to zero and when to retain it.
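A toy product of factors illustrates the argument numerically. The constant values below are illustrative assumptions: the RNN-like factors are stuck below 1, whereas the LSTM's forget-gate values, which dominate ∂c_t/∂c_{t−1}, can be pushed close to 1 by training:

```python
import numpy as np

# RNN-like chain: per-step factors stuck below 1, so the product vanishes
rnn_factors = np.full(50, 0.8)
# LSTM chain: dc_t/dc_{t-1} is dominated by the forget gate, which the
# network can learn to keep close to 1, so the product survives
lstm_forget = np.full(50, 0.99)

print(np.prod(rnn_factors))   # ~1.4e-5: the gradient vanishes
print(np.prod(lstm_forget))   # ~0.6: the gradient is retained over 50 steps
```

Because the forget-gate values are learned, the network itself decides how much gradient to carry across long time spans.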
GAN deploys a learning technique that produces new data which is statistically similar to
the input training data. For instance, a GAN that has been trained on images can produce
new photographs that, at least to human observers, appear to be superficially legitimate
and have many realistic properties. GANs were initially conceived of as a sort of genera-
tive model that could be used for unsupervised learning.
The general notion behind a GAN is predicated on the concept of “indirect” training
through the discriminator, which is another neural network that can determine how “real-
istic” the input seems and is also being dynamically updated. That is, during the training,
the generator does not minimise the distance with respect to an image; rather, it is trained
to deceive the discriminator into thinking that the image is closer than it actually is. Because
of this, the model is able to learn in a way that is not supervised. In evolutionary biology,
mimicry is analogous to GANs, and both types of networks engage in an evolutionary
arms race with one another.
FIGURE 16.16
GAN overview.
The task of the discriminator, when presented with either a synthetic image or a photographic image of a real human, is to identify those that are genuine photographs of a real human. The generator generates new synthetic pictures, which are then sent to the discriminator. The purpose of the generator is to produce convincing representations of humans: to deceive without being discovered. The main task of the discriminator is to identify and recognize any images produced by the generator as fraudulent. The steps followed by GAN for this example scenario are summarized in Figure 16.17.
The input to the generator network is a D-dimensional noise vector of random numbers, and the output is a synthetic image. This synthetic image, along with a set of photographs from the real-world data set, acts as the input to the discriminator. Upon receiving both authentic and counterfeit photographs, the discriminator outputs a number between 0 and 1. This is a probability value, where 1 indicates authentic and 0 indicates counterfeit.
FIGURE 16.17
Working of GAN.
Generative models can learn from unlabelled data; discriminative models, on the other hand, do not have this capability. Both types of models are useful, but the ability to learn without labels is quite desirable when working on real-world data modelling problems, as obtaining labelled data can be expensive at best and unfeasible at worst, whereas unlabelled data is abundant.
The objective of the generator is to minimize the objective function, while that of the discriminator is to maximize it. Ultimately the task of the generator is to achieve μG ≈ μref, that is, to match its output distribution to the reference distribution.
During discriminator training, the generator is kept constant, so the discriminator trains against a static opponent; this improves the gradient information available to the generator. Training the discriminator on ground-truth data plays a major role in producing a clear gradient for the generator. GAN training is time-consuming, and the utmost care in training is therefore essential.
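The alternating scheme can be sketched on a one-dimensional toy problem: the "images" are scalars drawn from N(μ_ref, 1), the generator is a single shift parameter θ, and the discriminator is a logistic regressor. These choices, and the use of the common non-saturating generator update (maximizing log D(G(z)) rather than minimizing log(1 − D(G(z)))), are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

mu_ref = 3.0                 # reference distribution: N(3, 1)
theta = 0.0                  # generator: g(z) = theta + z, z ~ N(0, 1)
a, b = 0.1, 0.0              # discriminator: D(x) = sigmoid(a*x + b)
lr = 0.05

for _ in range(2000):
    x_real = rng.normal(mu_ref, 1.0, size=64)
    x_fake = theta + rng.normal(size=64)
    # Discriminator step (generator held constant): ascend
    # E[log D(x_real)] + E[log(1 - D(x_fake))]
    d_real = sigmoid(a * x_real + b)
    d_fake = sigmoid(a * x_fake + b)
    a += lr * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake))
    b += lr * (np.mean(1 - d_real) - np.mean(d_fake))
    # Generator step (discriminator held constant): fool D by
    # ascending E[log D(g(z))] (non-saturating variant)
    x_fake = theta + rng.normal(size=64)
    d_fake = sigmoid(a * x_fake + b)
    theta += lr * np.mean((1 - d_fake) * a)

print(round(theta, 1))       # theta should end up near mu_ref
```

The generator never sees the real data directly; it moves only along the gradient the discriminator provides, which is the "indirect" training described above.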
16.5.1 Components of RBNN
The RBNN architecture has an input layer, a hidden layer, and an output layer. The RBNN
algorithm can contain no more than one hidden layer at any given time. This hidden layer
is referred to as a feature vector in RBNN. The dimension of the feature vector is increased
via RBNN (Figure 16.18).
Training the RBNN: The various steps involved in training an RBNN are shown in Figure 16.19.
FIGURE 16.18
Architecture of RBNN.
FIGURE 16.19
RBNN training.
2. In the second training phase, the weighting vectors between hidden layers and
output layers have to be updated.
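The two training phases can be sketched in NumPy for a toy one-dimensional regression. Placing Gaussian RBF centres on a fixed grid and solving the hidden-to-output weights by least squares are illustrative simplifications:

```python
import numpy as np

# Toy 1-D regression problem: approximate sin(x)
X = np.linspace(-3, 3, 60)[:, None]
y = np.sin(X).ravel()

# Phase 1: fix the RBF centres (here simply on a grid) and their width
centres = np.linspace(-3, 3, 10)[:, None]
sigma = 0.8
Phi = np.exp(-((X - centres.T) ** 2) / (2 * sigma**2))  # hidden-layer features

# Phase 2: update (here: solve directly) the hidden-to-output weights
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_hat = Phi @ w
print(np.mean((y_hat - y) ** 2) < 1e-2)   # True: a close fit
```

Each hidden node is easy to interpret: it responds to inputs near its centre, which is the interpretability advantage over MLP noted in the comparison below.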
Comparison of Radial Basis Neural Network (RBNN) and Multi-Layer Perceptron (MLP)
1. Training in RBNN is faster than in a multilayer perceptron (MLP); MLP requires many iterations.
2. We can easily interpret the meaning or function of each node in the hidden layer of the RBNN. This is difficult in MLP.
3. Parameterization (such as the number of nodes in the hidden layer and the number of hidden layers) is difficult in MLP, but this difficulty does not arise in RBNN.
4. Classification, however, takes more time in RBNN than in MLP.
16.6.1 Layers
The MLP is made up of three or more layers of nonlinearly activating nodes, including an
input layer, an output layer, and one or more hidden layers. Due to the fact that MLPs are
fully connected, each node in one layer connects with a specific weight wij to every node
in the layer that follows it. The pictorial representation of multilayer perceptron learning
is shown in Figure 16.20.
As per linear algebra, if a multilayer perceptron has a linear activation function that
maps the weighted inputs to the output of each neuron, then any number of layers may be
reduced to a two-layer input–output model. In MLPs some neurons use a nonlinear activa-
tion function that was intended to simulate the frequency of action potentials, or firing, of
real neurons.
The two historically common activation functions are both sigmoids and are described by Equations (16.17) and (16.18):

y(v_i) = tanh(v_i)  (16.17)
FIGURE 16.20
Multilayer perceptron learning.
and
y(v_i) = 1 / (1 + e^(−v_i))  (16.18)
The first is a hyperbolic tangent that ranges from −1 to 1, while the second is the logistic function, which is similar in shape but ranges from 0 to 1. Here y(v_i) is the output of the ith node (neuron) and v_i is the weighted sum of the input connections. Alternative activation functions have been proposed, including the rectifier and softplus functions; the rectified linear unit (ReLU) is now one of the most frequently used activation functions in MLPs, as it overcomes the numerical problems associated with the sigmoids. More specialized activation functions include radial basis functions (used in radial basis networks, another class of supervised neural network models).
16.6.2 Learning in MLP
The perceptron learns by adjusting the connection weights after each piece of data is processed, based on the amount of error in the output compared with the expected result. This is an example of supervised learning, and it is accomplished through backpropagation, a generalization of the least mean squares algorithm for the linear perceptron.
We can indicate the degree of error in an output node j for the nth data point (training example) by Equation (16.19):

e_j(n) = d_j(n) − y_j(n)  (16.19)

where d is the target value and y is the value that the perceptron produces. The node weights can then be adjusted so as to minimize the error in the overall output, which is given by Equation (16.20):

E(n) = (1/2) Σ_j e_j²(n)  (16.20)
Using gradient descent, the change in each weight is given by (Equation 16.21):

Δw_ji(n) = −η (∂E(n)/∂v_j(n)) y_i(n)  (16.21)
where yi is the output of the neuron that came before it and η is the learning rate. The learn-
ing rate is chosen to ensure that the weights quickly converge to a response and do not
oscillate. Calculating the derivative requires taking into account the variable induced local
field vj, which in turn fluctuates. It is not difficult to demonstrate that this derivative, when
applied to an output node, can be simplified to Equation (16.22):
−∂E(n)/∂v_j(n) = e_j(n) φ′(v_j(n))  (16.22)
where Φ′ is the derivative of the activation function that was stated earlier, which does not
vary. The analysis is made more challenging by the fact that the change in weights is being
applied to a hidden node, yet it is possible to demonstrate that the relevant derivative is
(Equation 16.23):
−∂E(n)/∂v_j(n) = φ′(v_j(n)) Σ_k (−∂E(n)/∂v_k(n)) w_kj(n)  (16.23)
This is dependent on the shift in weights of the kth nodes, which are the ones that con-
stitute the output layer. As a result, in order to modify the weights of the hidden layers, the
weights of the output layers must be modified in accordance with the derivative of the
activation function. Because of this, the algorithm in question reflects a backpropagation of
the activation function.
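Equations (16.19)–(16.23) can be sketched as a small NumPy MLP with one tanh hidden layer and a linear output node (so that φ′ = 1 at the output). The data, sizes, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
d = (X[:, 0] * X[:, 1])[:, None]          # target values d_j(n)

W1 = rng.normal(scale=0.5, size=(2, 8))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(8, 1))   # hidden -> output weights
eta = 0.05                                # learning rate

def forward(X):
    y1 = np.tanh(X @ W1)                  # hidden layer, Equation (16.17)
    return y1, y1 @ W2                    # linear output node

mse = lambda: np.mean((forward(X)[1] - d) ** 2)
before = mse()
for _ in range(2000):
    y1, y2 = forward(X)
    e = d - y2                            # Equation (16.19): e = d - y
    delta2 = e                            # output local gradient (phi' = 1)
    delta1 = (1 - y1 ** 2) * (delta2 @ W2.T)   # hidden local gradient, Eq. (16.23)
    W2 += eta * y1.T @ delta2 / len(X)    # weight change, Equation (16.21)
    W1 += eta * X.T @ delta1 / len(X)
after = mse()
print(before > after)                     # the error decreases with training
```

The hidden-layer update uses the output-layer deltas weighted by w_kj, which is exactly the backpropagation of Equation (16.23).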
16.6.3 Applications of MLPs
MLPs are helpful in research because of their capacity to tackle issues in a stochastic man-
ner, which frequently enables approximate solutions to exceedingly difficult problems
such as fitness approximation. Because classification is a subset of regression that occurs
when the response variable is categorical, multilayer perceptrons (MLPs) are effective clas-
sifier algorithms. MLPs are popularly used in diversified applications like speech recogni-
tion, image recognition, and software for machine translation.
FIGURE 16.21
Functioning—MLP Versus RBNN.
TABLE 16.1
RBNN Versus MLP
S.No RBNN MLP
16.7.1 Architecture of SOM
The input layer and the output layer are the two layers that make up SOM. The following
is a description of the self-organizing map’s architecture (Figure 16.22) with two clusters
and n input features for any sample.
16.7.2 Working of SOM
Self-organizing maps operate in two modes: training and mapping. First, training uses
input data to build a lower-dimensional representation (the “map space”). Second, map-
ping classifies input data using the map.
Training aims to represent a p-dimensional input space as a two-dimensional map space.
A p-variable input space is p-dimensional. A map space has “nodes” or “neurons” arranged
in a hexagonal or rectangular grid. The number of nodes and their placement are deter-
mined before data processing and exploration.
Each map node has a “weight” vector that represents its position in the input space.
While nodes in the map space stay fixed, training involves moving weight vectors toward
the input data (lowering a distance measure like Euclidean distance) without damaging
the map space’s topology. After training, the map may identify further input space obser-
vations by selecting the node with the closest weight vector (smallest distance metric).
Consider an input set with the dimensions (m, n), where m represents the number of
training examples and n represents the number of features present in each example. The
weights of size (n, C), where C is the number of clusters, are first initialized. The winning
vector (the weight vector with the shortest distance from the training example, for exam-
ple, the Euclidean distance) is then updated after iterating over the input data for each
training example. Weight update is decided by the following (Equation 16.25):

w_ij = w_ij + α(t) (x_i^k − w_ij)  (16.25)
where i stands for the ith feature of the training example, j is the winning vector, α is the
learning rate at time t, and k is the kth training example from the input data. The SOM
FIGURE 16.22
SOM architecture.
network is trained, and the trained weights are utilised to cluster new examples. A new example is assigned to the cluster of its winning vector.
Thus the various steps explained previously are summarized as follows:
1. Initialization of weight
2. For epochs ranging from 1 to N
3. Pick a training instance
4. Determine the winning vector
5. A winning vector update
6. For each training example, repeat steps 3, 4, and 5
7. Creating a test sample cluster
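The seven steps can be sketched in NumPy. This is a winner-take-all simplification that updates only the winning vector, omitting the neighbourhood update of a full SOM; the two-blob data and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy data: two well-separated blobs in 2-D (m examples, n = 2 features)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])

C = 2                                    # number of clusters
W = rng.normal(size=(2, C))              # step 1: initialize weights (n, C)
alpha = 0.5

for epoch in range(10):                  # step 2: for epochs 1..N
    for x in X:                          # step 3: pick a training instance
        j = np.argmin(np.linalg.norm(W - x[:, None], axis=0))  # step 4: winner
        W[:, j] += alpha * (x - W[:, j])                       # step 5: update
    alpha *= 0.9                         # decaying learning rate alpha(t)

# Step 7: cluster a test sample by its nearest weight vector
test = np.array([4.8, 5.1])
winner = np.argmin(np.linalg.norm(W - test[:, None], axis=0))
print(winner in (0, 1))                  # True: one of the C map nodes
```

After training, each weight vector has moved to the centre of one blob, so new samples are assigned to the nearest learned prototype.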
FIGURE 16.23
Illustration of SOM working.
A SOM can be interpreted in many ways. Similar items frequently activate nearby neurons because, in the training phase, the weights of an entire neighbourhood are moved in the same direction. SOM thus places similar samples together and dissimilar samples apart, creating a semantic map.
An alternative interpretation is to consider the neuronal weights as pointers into the input space. They represent a discrete approximation of the distribution of the training samples: more neurons point to regions with many training samples, and fewer point to regions with few.
In this context, SOM may be thought of as a nonlinear generalization of principal components analysis (PCA). SOM was not originally formulated as the solution to an optimization problem; however, there have been numerous attempts to alter the SOM description and create an optimization problem that yields comparable outcomes.
FIGURE 16.24
Restricted Boltzmann machine with three visible units and four hidden units (no bias units).
E(v, h) = −v^T W h = −Σ_i Σ_j w_ij v_i h_j  (16.26)
RBM is trained using Gibbs sampling and contrastive divergence. Gibbs sampling is a
Markov chain Monte Carlo approach for getting a sequence of observations approximating
a defined multivariate probability distribution in statistics when direct sampling is
impossible.
If the input is represented by v and the hidden value by h, then the prediction is p(h|v).
With knowledge of the hidden values, p(v|h) is utilised to anticipate the regenerated input
values. Suppose this procedure is performed k times; after k iterations, the initial input value v_0 is transformed into the reconstructed value v_k.
Contrastive divergence is an approximation maximum-likelihood learning approach
used to approximate the gradient, which is the graphical slope depicting the link between
a network’s weights and its error. In situations where we cannot directly evaluate a func-
tion or set of probabilities, an inference model is used to approximate the algorithm’s
learning gradient and determine the direction in which to move. In contrastive divergence,
updated weights are utilised. After calculating the gradient from the reconstructed input,
delta is added to the previous weights to create the new weights.
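Gibbs sampling and contrastive divergence (CD-1, i.e., k = 1) can be sketched in NumPy. Following Figure 16.24, bias units are omitted; the toy binary patterns, sizes, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy binary data: two complementary repeated patterns
V = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 20, dtype=float)
W = rng.normal(scale=0.1, size=(6, 3))    # 6 visible units, 3 hidden units
lr = 0.1

def recon_error():
    return np.mean((sigmoid(sigmoid(V @ W) @ W.T) - V) ** 2)

before = recon_error()
for _ in range(200):
    ph0 = sigmoid(V @ W)                  # positive phase: p(h|v) from the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # Gibbs-sample h
    pv1 = sigmoid(h0 @ W.T)               # reconstruct v (k = 1 step): p(v|h)
    ph1 = sigmoid(pv1 @ W)                # p(h|v_1) for the negative phase
    # contrastive divergence: the gradient estimate is added to the old weights
    W += lr * (V.T @ ph0 - pv1.T @ ph1) / len(V)
after = recon_error()
print(before > after)                     # reconstruction improves
```

The difference between the positive-phase and negative-phase statistics is the delta that is added to the previous weights, as described above.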
FIGURE 16.25
Schematic overview of a DBN.
16.9.2.2 Drawbacks
A DBN has certain hardware requirements and demands vast amounts of data to perform well. Because of its intricate data models, a DBN requires a significant investment to train, often involving hundreds of machines. Classifiers are also necessary to interpret a DBN's output.
16.9.2.3 Applications of DBN
DBNs are widely used for object detection, where they identify occurrences of specific classes of objects. Image creation and image classification are other common applications. DBNs are also applied to video, where they can capture motion, and to natural language and speech processing, where they can produce complete descriptions. DBNs can also estimate human poses. In short, DBNs are utilised extensively across many kinds of data sets.
16.10 Applications
16.10.1 Deep Learning for Vision
Computer vision is a subfield of machine learning concerned with the interpretation and comprehension of images and videos. It helps computers view and perform visual tasks in a manner similar to human beings. Computer vision models translate visual data into contextual features, which allow models to evaluate images and videos; these interpretations are then used for decision-making tasks. In computer vision, deep learning approaches are delivering on their promise: computer vision is not “solved,” but deep learning is required to tackle many of the field’s most difficult challenges at the cutting edge. Let us look at three instances that illustrate the results deep learning can achieve in the field of computer vision:
1. Automatic object detection: Object detection is the process of locating and clas-
sifying objects in images and video. In object detection, given a snapshot of a scene
to a system, it must locate, draw a bounding box around, and classify each object.
In this context we’ll use a vehicle detection example and understand how to use
deep learning to create an object detector. The same steps can be used to create any
object detector.
Step 1: For vehicle detection, first we need a set of labelled training data, which
is a collection of photos with the locations and labels of items of inter-
est. Specifically, someone must examine each image or video frame and
mark the positions of any things of interest. This method is referred to as
ground truth labelling. Labelling the ground truth is frequently the most
time-consuming aspect of developing an item detector.
Step 2: Training the vehicle detector using R-CNN network
(i) Define a network architecture using any tool/software platform that
supports deep learning.
(ii) Examine parts of an image using R-CNN algorithm.
Step 3: After training, test the vehicle detector using a few test images to see if the
detector is working properly.
Step 4: Once the vehicle detector is working properly, again test using large set of
validation images using various statistical metrics.
FIGURE 16.26
AlexNet architecture.
FIGURE 16.27
GoogleNet architecture.
Step 1: Extract features from each frame of a video by utilising a pretrained con-
volutional neural network, such as GoogleNet, and converting the video
into a sequence of feature vectors.
Step 2: Teach an LSTM network to predict the video labels using the sequences
as training data.
FIGURE 16.28
The structure of the LSTM network.
In the previous figure, a sequence input layer is used to feed image sequences into the network. Convolutional layers extract features, that is, apply the convolution operations to each frame of the video. A sequence folding layer placed before the convolutional layers allows the convolution operations to be applied to each frame independently. A sequence unfolding layer, in conjunction with a flatten layer, is then used to re-establish the sequence’s original structure and reformat the results as vector sequences. Finally, LSTM layers followed by output layers classify the resulting vector sequences.
level is learned during the training. Recently, Google researchers rebuilt multiple
existing graph convolution methods into a framework known as a message pass-
ing neural network (MPNN) and utilised MPNNs to predict quantum chemical
characteristics.
Deep learning methods based on other molecular representations have also been
explored in this field of research: SMILES strings have been used as input to LSTM
RNNs to build predictive models, and CNNs have been applied to images of 2D
drawings of molecules.
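Before a SMILES string can be fed to an LSTM RNN, it is typically tokenised into integer IDs and padded to a fixed length. A minimal character-level sketch (treating every character as one token is a simplification; practical SMILES tokenisers also handle multi-character atoms such as "Cl" and "Br"):

```python
def build_vocab(smiles_list):
    # assign each character a positive integer ID; 0 is reserved for padding
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(smiles, vocab, max_len):
    # map characters to IDs and right-pad to a fixed length for batching
    ids = [vocab[ch] for ch in smiles]
    return ids + [0] * (max_len - len(ids))
```

The resulting fixed-length integer sequences are what an embedding layer followed by an LSTM would consume.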
2. Generation of new chemical structures using deep learning: Another interesting
application of deep learning in chemoinformatics is the generation of new chemi-
cal structures through NNs. A variational auto-encoder (VAE) has been used
(i) to generate chemical structures, (ii) as a molecular descriptor generator coupled
with a GAN to generate new structures that were claimed to have promising
specific anticancer properties, and (iii) to generate novel structures with predicted
activity against dopamine receptor type 2.
RNNs have also been very successful at generating novel chemical structures.
After training an RNN on a large number of SMILES strings, the method worked
surprisingly well at generating new valid SMILES strings that were not included
in the training set. A reinforcement learning technique called deep Q-learning,
together with an RNN, has been used to generate SMILES with desirable molecu-
lar properties. A policy-based reinforcement learning approach that tunes pre-
trained RNNs to generate molecules with given user-defined properties has also
been attempted in this research.
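Generative RNNs of this kind emit a SMILES string one character at a time by sampling from the network's output distribution at each step. A sketch of the temperature-controlled sampling step (the logits here stand in for a trained network's outputs and are purely illustrative):

```python
import math
import random

def sample_next(logits, temperature=1.0, rng=random):
    # softmax over temperature-scaled logits, then draw one token index
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```

Lower temperatures concentrate probability on the most likely next character, producing conservative strings; higher temperatures increase diversity, which matters when exploring new regions of chemical space.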
3. Application of deep learning in biological imaging analysis: Biological imaging and
image analysis are used from preclinical R&D to clinical trials in drug discovery.
Imaging lets scientists see the phenotypes and behaviours of hosts (human or
animal), organs, tissues, cells, and subcellular components. Digital image analy-
sis reveals underlying biology, disease, and drug activity. Common imaging
modalities include fluorescently labelled or unlabelled microscopic images, CT,
MRI, PET, tissue pathology imaging, and mass-spectrometry imaging (MSI). DL
has also found success in biological image analysis, where it outperforms stan-
dard classifiers.
For microscopic images, CNNs have been used for segmenting and subtyping
individual fluorescently labelled cells, cell tracking, and colony counting. DL iden-
tifies tumour areas, leukocytes, and fat tissue. DL is also utilised for histopathol-
ogy diagnosis beyond image segmentation.
RNNs have been widely used to model user interactions in social networks by tak-
ing text from news and discussions and using temporal properties and influence
to model how that text evolves over time. Deep learning techniques are used in
a framework called DeepInf to find hidden features and analyse social influence.
CNNs analyse the structure of social networks such as Twitter, Open Academic
Graph, Digg, and Weibo, as well as the features that are unique to each user. In
social network research, ontology-based restricted Boltzmann machine (ORBM)
models have been used to predict how people will act on online social networks.
• Sentiment analysis: A sentiment is a feeling, thought, idea, or opinion related to
how a person feels. Social networks strongly affect people's opinions and, as a
result, how they feel about a thing or event. Both text and pictures can be used to
express how someone feels.
Deep feedforward neural networks, LSTM-RNNs, and CNNs are often used to
determine how people feel about trailer comments on YouTube. DL can also be
used to analyse investor sentiment about stock prices by building a predictive
model; stock market indicators can be analysed using CNNs and LSTMs to pre-
dict the sentiments of investors. Many works in social network analysis research
combine semantic and emotional features extracted from user-generated texts in
social media to detect personality using deep learning techniques. Other deep
learning models, such as the multiview deep network (MVDN), matrix-interactive
attention network (M-IAN), and deep interactive memory network (DIMN), are
used to find interactions between context and target opinion.
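The CNNs used for sentiment classification typically slide a filter over the sequence of word embeddings and keep the strongest activation (max-over-time pooling), so one filter learns to fire on one sentiment-bearing phrase pattern. A minimal sketch in plain Python (the toy embeddings and filter values are illustrative):

```python
def conv1d_maxpool(embeddings, kernel):
    # slide the filter over consecutive token embeddings, then
    # keep the strongest activation (max-over-time pooling)
    width = len(kernel)
    acts = []
    for t in range(len(embeddings) - width + 1):
        s = sum(ei * ki
                for e, k in zip(embeddings[t:t + width], kernel)
                for ei, ki in zip(e, k))
        acts.append(s)
    return max(acts)
```

In a full model, many such filters run in parallel and their pooled activations feed a final classification layer that outputs the sentiment label.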
• Text classification: A capsule network (CapsNet) built using an ANN and a gated
recurrent unit (GRU) are types of deep learning techniques used for modelling cat-
egorised relationships and classifying social text. Other DL-based architectures,
such as multitask CNN (mtCNN) and reliable social recommendation using
generative adversarial network (RSGAN) frameworks, are used for learning user
profiles from social media data. The RNN-based deep-learning sentiment analysis
(RDSA) algorithm is used to recommend nearby spots such as streets, shops, and
hospitals in a user's neighbourhood. A framework called Friendship using Deep
Pairwise Learning (FDPL), based on contextual information and deep pairwise
learning with a Bayesian ranking scheme, is used for recommending friends based
on the geospecific activities of users.
• Community detection: An application of social network analysis is community
detection. A deep integration representation (DIR) algorithm based on deep joint
reconstruction (DJR) has been used to detect events in network communities. A
deep multiple network fusion (DMNF) model is used to extract latent structures
such as social communities and group the online social network (OSN) data. User
roles based on interactions and influences in OSNs are detected using deep RNNs.
• Anomaly detection: Anomaly detection in social networks is identifying irregu-
lar/malicious behaviour of users/transactions/events/etc. in a social network.
Examples for anomalies in social networks can be spams, activities pertaining to
terrorist networks, cyber bullying, and so on.
DeepFriend, presented in Wanda and Jie (2021), uses dynamic deep learning to
classify malicious nodes in an OSN. A multitask learning (MTL) framework based
on deep neural networks is used for hate speech detection. A deep probabilistic
learning method is widely deployed to detect covert networks.
Other Models of Deep Learning and Applications of Deep Learning 431
16.11 Summary
• Recurrent neural networks (RNNs) are a type of neural network. They utilise a
hidden layer, which retains some information about a sequence. The hidden state
is the primary and most significant characteristic of RNNs.
• The long short-term memory network (LSTM) is a type of recurrent neural net-
work. LSTM tries to “forget” unnecessary material while attempting to “remem-
ber” all of the previous information that the network has encountered.
• The radial basis neural network (RBNN) architecture has an input, a hidden, and
an output layer. A minimum of three layers of nodes is required for the construc-
tion of an MLP. Training in an MLP is accomplished through the use of a super-
vised learning method known as backpropagation.
• A self-organizing map (SOM) is an unsupervised machine learning technique.
SOM was first proposed by Teuvo Kohonen in the 1980s. It is a special kind of arti-
ficial neural network that is trained using a process called competitive learning.
• Deep belief networks (DBNs) are deep layers of random and generative neural
networks. DBNs were first presented as probabilistic generative models in 2007 by
Larochelle, Erhan, Courville, Bergstra, and Bengio.
• Finally, applications involving various deep learning models have been discussed.
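The competitive learning that trains a SOM, noted in the summary above, can be sketched in a few lines: each input selects its best-matching unit, which is then pulled toward the input (neighbourhood updates and learning-rate decay are omitted here for brevity):

```python
def som_update(weights, x, lr=0.5):
    # competitive step: find the best-matching unit (closest weight vector)
    def dist2(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    bmu = min(range(len(weights)), key=lambda i: dist2(weights[i]))
    # adaptive step: pull the winning unit toward the input
    weights[bmu] = [wi + lr * (xi - wi) for wi, xi in zip(weights[bmu], x)]
    return bmu
```

Repeating this over many inputs, with the update also applied (more weakly) to the winner's grid neighbours, is what lets the map preserve the topological structure of the data.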
16.12 Points to Ponder
• The hidden state, which retains some information about a sequence, is the most
significant characteristic of recurrent neural networks (RNNs).
• Vanishing gradients result in unstable behaviour where the gradients may have
very small values tending towards zero.
• In auto encoders, a bottleneck is imposed, which forces a compressed knowledge
representation of the original input.
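The vanishing-gradient behaviour noted above can be seen by repeatedly multiplying a gradient by a fixed per-step factor, as happens across the time steps of backpropagation through time; the same loop with a factor above one shows the exploding case (the factors 0.5 and 1.5 below are illustrative):

```python
def repeated_gradient(factor, steps):
    # multiply a unit gradient by the same per-step factor, mimicking
    # backpropagation through time with identical recurrent Jacobians
    g = 1.0
    for _ in range(steps):
        g *= factor
    return g
```

With factor 0.5, ten steps shrink the gradient to about 0.001 (vanishing); with factor 1.5 it grows past 57 (exploding), which is why gated architectures such as the LSTM are needed for long sequences.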
E.16 Exercises
E.16.1 Suggested Activities
E.16.1.1 Design and implement a human activity recognition (HAR) using TensorFlow
on the Human Activity Recognition Using Smartphones Data Set (https://
archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smart
phones) and an LSTM RNN. Here, you need to classify the type of movement
among six activity categories, which are walking, walking upstairs, walking
downstairs, sitting, standing, and laying. The input data comes from a cell
phone attached at the waist; you will use an LSTM on this data to learn to recog-
nise the type of activity that the user is doing.
E.16.1.2 Implement an image colorization solution where you color black and white
images using GAN. Use the image colorization data set from Kaggle (https://
www.kaggle.com/code/shravankumar9892/colorization-dataset/data).
E.16.1.3 Using variational auto-encoders, generate fresh data similar to the training data.
The MNIST dataset is a good place to start generating numbers: (https://
www.kaggle.com/code/ngbolin/mnist-dataset-digit-recognizer/data).
Self-Assessment Questions
E.16.2 Multiple Choice Questions
E.16.2.3 The phenomenon that takes place when the gradients keep getting larger,
causing the gradient descent to diverge, is called
i Exploding gradients
ii Diverging gradients
iii Vanishing gradients
E.16.2.4 The sort of auto-encoder that works with an input that has been partially cor-
rupted and learns to recover the original, undistorted image from the train-
ing data is called a
i Sparse auto-encoder
ii De-noising auto-encoder
iii Variational auto-encoder
E.16.2.6 The gate that is responsible for determining the amount of information that
will be written into the internal cell state is
i Forget gate
ii Input gate
iii Input modulation gate
E.16.2.7 Deciding the number of nodes in the hidden layer and the number of hidden
layers is not required in
i Convolutional neural networks
ii Multilayer perceptron
iii Radial basis neural network
E.16.2.9 A neural network in which each neuron is linked to each and every other
neuron in the network is called a
i Radial basis neural network
ii Restricted Boltzmann machines
iii Recurrent neural network
E.16.2.11 _____ is a Markov chain Monte Carlo approach for getting a sequence of
observations approximating a defined multivariate probability distribution.
i Gaussian sampling
ii Gibbs sampling
iii Boltzmann sampling
E.16.3 Match the Columns
E.16.3.1 Recurrent neural networks | A Recurrent unit that works to forget unnecessary material while attempting to remember all of the previous information that the network has encountered
E.16.3.2 Multilayer perceptron | B Used to simulate data that exhibits any trend or function
E.16.3.3 Self-organizing map | C Find intrinsic data patterns by reconstructing input
E.16.3.4 De-noising auto-encoder | D Handle sequential data and memorize previous inputs using its memory
E.16.3.5 GoogleNet | E Produces new data which is statistically similar to the input training data
E.16.3.6 Generative adversarial networks | F Enables approximate solutions to exceedingly difficult problems such as fitness approximation
E.16.3.7 Deep belief networks | G Construct a low-dimensional representation of a higher dimensional data collection while maintaining the topological structure of the data
E.16.3.8 Long short-term memory network | H Works with an input that has been partially corrupted and learns to recover the original, undistorted image from the training data
E.16.3.9 Radial basis function neural network | I Overcome the slow learning, local minima pitfalls in typical neural networks
E.16.3.10 Restricted Boltzmann machine | J Image classification with much lower error rate
E.16.4 Short Questions
E.16.4.1 Explain how we say that RNNs have a memory.
E.16.4.2 Give how current state, activation function, and output are defined with
respect to RNNs.
E.16.4.3 Discuss some advantages and disadvantages of RNN.
E.16.4.4 Explain the working of each recurrent unit of RNN.
E.16.4.5 Describe the three types of auto-encoders.
E.16.4.6 List the steps of training of an auto-encoder for a data compression scenario.
References
Cireșan, D., Meier, U., & Schmidhuber, J. (2012, June). Multi-column deep neural networks for image
classification. 2012 IEEE Conference on Computer Vision and Pattern Recognition. New York, NY:
Institute of Electrical and Electronics Engineers (IEEE), 3642–3649.
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007, June). An empirical evalua-
tion of deep architectures on problems with many factors of variation. Proceedings of the 24th
International Conference on Machine Learning, 473–480. https://doi.org/10.1145/1273496.1273556
Wanda, P., & Jie, H. J. (2021). DeepFriend: Finding abnormal nodes in online social networks using
dynamic deep learning. Social Network Analysis and Mining, 11, 34. https://doi.org/10.1007/
s13278-021-00742-2
A1. Solutions
Chapter – 1
E.1.2
E.1.2.1 iii
E.1.2.2 iii
E.1.2.3 i
E.1.2.4 ii
E.1.3
E.1.3.1 B
E.1.3.2 A
E.1.3.3 D
E.1.3.4 C
E.1.4
Correct Order
E.1.4.5
E.1.4.9
E.1.4.7
E.1.4.10
E.1.4.1
E.1.4.2
E.1.4.6
E.1.4.8
E.1.4.4
E.1.4.3
Chapter – 2
E.2.2
E.2.2.1 i
E.2.2.2 iii
E.2.2.3 ii
E.2.3
E.2.3.1 D
E.2.3.2 J
E.2.3.3 A
E.2.3.4 I
E.2.3.5 B
E.2.3.6 H
E.2.3.7 G
E.2.3.8 E
E.2.3.9 C
E.2.3.10 F
Chapter – 3
E.3.2
E.3.2.1 ii
E.3.2.2 i
E.3.2.3 i
E.3.2.4 iii
E.3.2.5 ii
E.3.2.6 iii
E.3.2.7 iii
E.3.2.8 i
E.3.2.9 ii
E.3.2.10 i
E.3.2.11 ii
E.3.2.12 i
E.3.2.13 iii
E.3.2.14 ii
E.3.2.15 iii
E.3.2.16 ii
E.3.2.17 ii
E.3.2.18 i
E.3.2.19 iii
E.3.2.20 ii
E.3.2.21 iii
E.3.3
E.3.3.1 E
E.3.3.2 J
E.3.4
E.3.4.1 7/8
E.3.4.2 9/13
E.3.4.3 5/9
E.3.4.4 0.00379
E.3.4.5 0.56
E.3.4.6 a) 0.016
b) 0.384
c) 0.018
d) 0.582
E.3.4.7 0.144
E.3.4.8 a) 0.5263
b) 0.4735
E.3.4.9 91.28%
E.3.4.10 0.5556
Chapter – 4
E.4.2
E.4.2.1 ii
E.4.2.2 iii
E.4.2.3 iii
E.4.2.4 i
E.4.2.5 iii
E.4.2.6 i
E.4.2.7 ii
E.4.2.8 i
E.4.2.9 iii
E.4.2.10 i
E.4.2.11 iii
E.4.2.12 ii
E.4.2.13 iii
E.4.2.14 i
E.4.2.15 ii
E.4.2.16 i
E.4.3
E.4.3.1 D
E.4.3.2 J
E.4.3.3 F
E.4.3.4 G
E.4.3.5 A
E.4.3.6 H
E.4.3.7 C
E.4.3.8 I
E.4.3.9 E
E.4.3.10 B
Chapter – 5
E.5.2
E.5.2.1 i
E.5.2.2 ii
E.5.2.3 iii
E.5.2.4 ii
E.5.2.5 ii
E.5.2.6 iii
E.5.2.7 i
E.5.2.8 iii
E.5.2.9 i
E.5.2.10 ii
E.5.3
E.5.3.1 I
E.5.3.2 C
E.5.3.3 A
E.5.3.4 E
E.5.3.5 B
E.5.3.6 D
E.5.3.7 J
E.5.3.8 F
E.5.3.9 G
E.5.3.10 H
Chapter – 6
E.6.3
E.6.3.1 ii
E.6.3.2 i
E.6.3.3 iii
E.6.3.4 ii
E.6.3.5 i
E.6.3.6 iii
E.6.3.7 i
E.6.3.8 ii
E.6.3.9 iii
E.6.3.10 i
E.6.3.11 ii
E.6.3.12 i
E.6.3.13 iii
E.6.3.14 ii
E.6.4
E.6.4.1 E
E.6.4.2 H
E.6.4.3 J
E.6.4.4 A
E.6.4.5 F
E.6.4.6 G
E.6.4.7 I
E.6.4.8 D
E.6.4.9 C
E.6.4.10 B
Chapter – 7
E.7.2
E.7.2.1 iii
E.7.2.2 ii
E.7.2.3 i
E.7.2.4 ii
E.7.2.5 iii
E.7.2.6 i
E.7.2.7 iii
E.7.2.8 i
E.7.3
E.7.3.1 D
E.7.3.2 F
E.7.3.3 J
E.7.3.4 G
E.7.3.5 H
E.7.3.6 E
E.7.3.7 A
E.7.3.8 C
E.7.3.9 B
E.7.3.10 I
Chapter – 8
E.8.2
E.8.2.1 i
E.8.2.2 ii
E.8.2.3 iii
E.8.2.4 iii
E.8.2.5 i
E.8.2.6 ii
E.8.2.7 i
E.8.2.8 iii
E.8.2.9 ii
E.8.2.10 i
E.8.2.11 ii
E.8.2.12 iii
E.8.3
E.8.3.1 E
E.8.3.2 J
E.8.3.3 I
E.8.3.4 H
E.8.3.5 A
E.8.3.6 D
E.8.3.7 B
Chapter – 9
E.9.2
E.9.2.1 ii
E.9.2.2 i
E.9.2.3 iii
E.9.2.4 ii
E.9.2.5 i
E.9.2.6 ii
E.9.2.7 iii
E.9.2.8 ii
E.9.2.9 i
E.9.2.10 iii
E.9.2.11 i
E.9.2.12 ii
E.9.3
E.9.3.1 C
E.9.3.2 F
E.9.3.3 J
E.9.3.4 A
E.9.3.5 D
E.9.3.6 G
E.9.3.7 B
E.9.3.8 I
E.9.3.9 E
E.9.3.10 H
Chapter – 10
E.10.2
E.10.2.1 i
E.10.2.2 iii
E.10.2.3 ii
E.10.2.4 iii
E.10.2.5 i
E.10.3
E.10.3.1 D
E.10.3.2 E
E.10.3.3 B
E.10.3.4 J
E.10.3.5 A
E.10.3.6 H
E.10.3.7 G
E.10.3.8 C
E.10.3.9 I
E.10.3.10 F
Chapter – 11
E.11.2
E.11.2.1 i
E.11.2.2 iii
E.11.3
E.11.3.1 C
E.11.3.2 F
E.11.3.3 I
E.11.3.4 B
E.11.3.5 J
E.11.3.6 E
E.11.3.7 H
E.11.3.8 A
E.11.3.9 G
E.11.3.10 D
Chapter – 12
E.12.2
E.12.2.1 iii
E.12.2.2 i
E.12.2.3 ii
E.12.2.4 i
E.12.2.5 ii
E.12.2.6 iii
E.12.2.7 ii
E.12.2.8 i
E.12.2.9 ii
E.12.2.10 iii
E.12.2.11 ii
E.12.2.12 ii
Chapter – 13
E.13.2
E.13.2.1 iii
E.13.2.2 ii
E.13.2.3 i
E.13.2.4 i
E.13.2.5 ii
E.13.2.6 iii
E.13.2.7 ii
Chapter – 14
E.14.2
E.14.2.1 iii
E.14.2.2 i
E.14.2.3 ii
E.14.2.4 i
E.14.2.5 ii
E.14.2.6 iii
E.14.2.7 i
E.14.2.8 i
E.14.2.9 iii
E.14.2.10 ii
E.14.3
E.14.3.1 J
E.14.3.2 H
E.14.3.3 F
E.14.3.4 A
E.14.3.5 I
E.14.3.6 B
E.14.3.7 E
E.14.3.8 C
E.14.3.9 G
E.14.3.10 D
Chapter – 15
E.15.2
E.15.2.1 i
E.15.2.2 ii
E.15.2.3 ii
E.15.2.4 iii
E.15.2.5 i
E.15.2.6 iii
E.15.2.7 ii
E.15.2.8 ii
E.15.2.9 iii
E.15.2.10 ii
E.15.2.11 i
E.15.2.12 iii
E.15.2.13 ii
E.15.2.14 i
E.15.2.15 iii
E.15.3
E.15.3.1 G
E.15.3.2 I
E.15.3.3 D
E.15.3.4 A
Chapter – 16
E.16.2
E.16.2.1 ii
E.16.2.2 iii
E.16.2.3 i
E.16.2.4 ii
E.16.2.5 i
E.16.2.6 iii
E.16.2.7 iii
E.16.2.8 iii
E.16.2.9 ii
E.16.2.10 i
E.16.2.11 ii
E.16.2.12 i
E.16.3
E.16.3.1 D
E.16.3.2 F
E.16.3.3 G
E.16.3.4 H
E.16.3.5 J
E.16.3.6 E
E.16.3.7 I
E.16.3.8 A
E.16.3.9 B
E.16.3.10 C
Index
Page numbers in Bold indicate tables and page numbers in italics indicate figures.
conditional independence, 159, 162–164, 169–179
conditional probability, 58–59, 154, 165–166
conditional probability of observation, 154
conditional probability table (CPT), 165, 168
confidence, 133, 154, 238–241
confidentiality, 343, 359, 361
confusion matrix, 191–194, 196–197, 215–216, 229, 351
cosine similarity, 45, 216, 307
connected world, 1
consumer convenience, 334
content based recommendation, 305, 311
context aware recommendation, 311–312
contingency table, 56–57, 60
convolutional layer, 384, 428
convolutional neural network (CNN), 285, 382–391, 426, 431
correlation, 29, 66, 234
cost function, 141, 163, 177–178, 375, 378–379
covariance matrix, 233–234
credit score computation, 348
creditworthiness, 13
cross entropy, 69
cross-validation, 85, 199
current state, 251–252, 256–257

D
DARPA-XAI, 353
data stream, 257–259, 262
data analysis pipeline, 350
data biases, 81
data exploration, 213
data mining, 11, 28, 103
data privacy, 359–361
data representation, 46, 78–80
data sparsity, 305
data stream clustering, 259, 262
data stream mining, 257–258
data streams, 257–258
dataset, 26, 108, 327
decision boundary, 84, 139–141
decision node, 134–135
decision tree, 134–136, 145–146, 202–203
decision-making, 342, 344, 352–353, 355
decision rule, 134, 136, 196, 355
deep belief network, 423–424
deep learning, 3, 4, 96, 380–385, 397, 425–431
deep neural network, 7, 285, 367, 429–430
DeepFace, 8
dendrites, 371
deployment, 24, 199, 298, 351, 359
deterministic, 274, 285
diagnostic inference, 168–169, 172–173
dimensionality reduction, 82–83, 92–95, 213, 230–232, 236
discount factor, 281, 283
discovering, 17, 93, 115, 122, 258, 264, 317, 324
discrete data, 79
discrimination, 343–349
discriminative methods, 89
discriminative model, 157, 412–413
discriminator, 411–413
disease diagnosis, 12, 325
disease identification, 324
disease outbreak prediction, 324, 327
disruptive impact, 341
distance, 51, 89, 131–132, 215–220, 224, 419
distance matrix, 217, 219–220
distance metric, 89, 132, 218, 419
distance-based model, 89
divergence, 45, 69–70, 352, 423
diverse pool, 347
divisive clustering, 217–218
domain, 15, 82, 162, 296–297, 300, 323–324
Drive PX, 354
drug discovery, 324, 326, 428
dynamic programming, 275–279

E
each time approach, 278
education, 328–330
effectors, 370
eigen vectors, 45
email intelligence, 321
energy management, 336
engineering, 83, 86, 297, 335–336
ensemble methods, 191, 199–200, 204
entropy, 45, 66–69, 136–139, 228
environment engineering, 336
epoch, 86, 297, 420
epsilon probability, 280, 287
error, 26, 132–134, 142–142, 174–179, 254–260, 280, 376
error penalty, 144
ethical issues, 343, 348
ethics, 324–344
euclidean distance, 51, 131–133, 215, 419
euclidean space, 49, 51, 223
evaluating assessments, 330
evaluation, 34–35, 38, 191–193, 196–198
everyday life, 5
exhaustive events, 55–56, 59

F
F1 score, 196
face recognition, 93, 128, 342, 347–348
fairness, 342–348, 351–352
fairness testing, 351
false negative, 192, 351
false positive, 65, 192, 196, 351
feature detection, 388
feature extraction, 82–83, 385
feature reduction, 231–232
feature selection, 31, 82–83, 86
feature subset selection, 82
feature vector, 25, 91, 94, 157–158, 300, 413–414
features, 13, 23, 35, 82, 103–105, 230–232, 297, 302, 356, 334–345, 430
federated learning, 360–361
filters, 111, 388–390
financial management, 332
financial services, 14
first time approach, 278
Fisher linear discriminant, 232, 234–236
flattening, 387, 390–391
forecasting, 29, 336
forget gate, 407, 409
forward propagation, 371–372
fraud detection, 15, 29, 114–115, 205
frequent itemset, 238–241

G
game playing, 10, 12, 28, 40
gaussian distribution, 62–63
gender-biased issues, 348
General Data Protection Regulation (GDPR), 346
generalization, 83–86, 142
GAN, 411–413
Generative Adversarial Networks (GAN), 8, 368, 410–412
generative methods, 89
generative model, 157–158
generator, 411–413
geometric model, 88–89
GINI, 135–137, 139
good projection, 235
Google Assistant, 319
Google Brain, 8, 368
Google Map, 322

H
H2O tool, 351
hard margin, 142–143
healthcare, 114, 145, 323–324
hidden layer, 145, 376–377, 397–398, 413–415, 421–423
Hidden Markov Model (HMM), 254, 301
hierarchical clustering, 122, 217–220, 222–223
Hoeffding tree, 259–260
hold-out, 85
homomorphic encryption, 360–361
human resource management, 334
hyper-parameters, 45, 85, 145, 199, 297, 386
hyperplane, 139–141, 143–144
hypothesis, 11, 24, 26, 40, 91, 153–157
hypothesis set, 26

I
IBM Deep Blue computer, 8
IBM Open Scale tool, 351
ID3, 45, 136
ImageNet, 8, 368, 386
imbalanced data, 189, 205, 324
incremental approach, 278–279
independence, 52, 103, 166–167
independent and identically identified, 80
independent variables, 11, 25
inductive learning, 83
inference, 26, 168–169, 172–173, 423
information gain, 45, 135–139
information theory, 45, 66, 69, 137, 231
input gate, 407–408
input layer, 370, 380, 383–384, 399, 422
input modulation gate, 407–408
installation of Weka, 103, 107, 115
intelligence, 2, 166
interacting, 271, 318, 328, 352
interpretability, 83, 175–176, 298–299, 345–346
interpreting, 24, 317, 323
item-based collaborative filtering, 309
itemset, 238–241

J
Jaccard coefficient, 227–228
Jeopardy, 8
joint events, 55
joint probability, 55–57, 158–159, 162–164, 168, 172
joint probability distribution, 162–163, 172

K
K-fold cross-validation, 199, 297
Kaggle, 8, 328
kernel trick, 143
kernel, 8, 143–145, 384, 388, 415
k-means clustering, 89, 122, 211, 223–226, 262
k-median, 262
k-medoids, 223
knowledge engineering bottleneck, 4
Kohonen Maps, 418, 421
Kullback–Leibler Divergence, 69

L
label, 13, 26, 93, 127–129, 300
labelled samples, 95, 224
Lasso regression, 177
latent variable, 80, 254–255
leaf, 134–136, 259
learning, 2
learning analytics, 329
learning rate, 178, 277, 282–283, 419, 426
likelihood, 45, 53, 154–156
LIME, 353
linear algebra, 45–47
linear dependence, 52
linear model, 89, 139, 175–176
linear regression, 32, 121, 174–177, 179–180
linear transformation, 45–47, 232
LinkedIn, 352, 357, 359
linking biases, 349
logical model, 88
logistic regression, 45, 121, 128, 179–180
long short term memory, 399, 407
loss function, 26, 45, 141–142, 175–176, 378

M
machine learning, 1–19
machine learning model, 3, 82, 84, 86–87, 296, 351
machine translation, 247–248, 369
Manhattan distance, 131–132
manufacturing, 326, 335–336
marginal probability, 56–59, 154
marginalization, 57, 168–169
marketing, 215, 263–264, 332–333
Markov condition, 163–164
Markov decision process, 252, 256, 274–276
Markov model, 251–255, 263
Markov random field, 257
Markov source, 66, 68–69
matching, 28, 322
matrix, 46–48, 191–193, 216–217
matrix operations, 47–48
Maximum A Posteriori, 155–156
Maximum Likelihood, 45, 155–156
medical diagnosis, 63
medical imaging, 323–325
memory, 67–69, 257–258
memoryless source, 66–69
minima, 223, 273, 423
Minkowski distance, 131–132, 215
misclassification, 141–145, 203–205
missing data, 80–81
model complexity, 86, 175
model selection, 26, 30, 32, 349
model verification, 297
monitoring, 259, 298, 317, 327, 332–333, 337–338
Monte Carlo, 277–280
multilayer perceptron, 415–417
multi-class classification, 128–129, 197–198
mutually exclusive events, 55

N
naïve Bayes, 155, 158–160, 259–260, 301
naïve Bayes classifier, 121, 155, 158–160
nearest neighbour, 7, 21, 301
NearMiss algorithm, 206
negative reinforcement, 95, 272–273
neighbor based algorithm, 307
neocognitron, 7, 368, 385
Netflix, 8, 303, 317, 320
NetTalk, 7
neural network, 5, 6, 96, 285, 365–375, 380–381
neurons, 356, 370–372, 380–381, 421–422
new hidden state, 400–401
NLP, 248, 300–302
non-euclidean, 222–223
non-linear dependencies, 80
normal distribution, 62–63
numerical data, 25

O
object recognition, 8, 356, 426
observable variables, 80
observations, 86, 121–122, 153–154, 175–176, 179–180

user preference, 23, 305, 312
user profiling, 348
utility function, 303, 304
utility matrix, 304

V
validation, 26, 85, 132–133, 199
vanishing gradient, 368, 371, 374, 392
variance, 84–86, 97, 99, 101, 136, 175–177, 200–206, 233–235, 297, 315, 414
vector projection, 52–53
vector space, 31, 47, 52, 77, 78, 131, 150, 183, 300
visualization, 31, 35, 47, 95, 103, 105, 106, 110, 113, 119–120, 123, 125

W
weather prediction, 14, 20, 42
weights, 38, 40, 62, 96, 104, 174, 203, 216, 273, 285, 287, 294
weka, 103–112, 123–125, 180
weka explorer, 106–107, 123, 124
WSS, 229