Chatterjee I. Machine Learning and Its Application... Guide..2022
Authored by
First published in
BENTHAM SCIENCE PUBLISHERS LTD.
End User License Agreement (for non-institutional, personal use)
This is an agreement between you and Bentham Science Publishers Ltd. Please read this License Agreement
carefully before using the ebook/echapter/ejournal (“Work”). Your use of the Work constitutes your
agreement to the terms and conditions set forth in this License Agreement. If you do not agree to these terms
and conditions then you should not use the Work.
Bentham Science Publishers agrees to grant you a non-exclusive, non-transferable limited license to use the
Work subject to and in accordance with the following terms and conditions. This License Agreement is for
non-library, personal use only. For a library / institutional / multi user license in respect of the Work, please
contact: [email protected].
Usage Rules:
1. All rights reserved: The Work is the subject of copyright and Bentham Science Publishers either owns
the Work (and the copyright in it) or is licensed to distribute the Work. You shall not copy, reproduce,
modify, remove, delete, augment, add to, publish, transmit, sell, resell, create derivative works from, or in
any way exploit the Work or make the Work available for others to do any of the same, in any form or by
any means, in whole or in part, in each case without the prior written permission of Bentham Science
Publishers, unless stated otherwise in this License Agreement.
2. You may download a copy of the Work on one occasion to one personal computer (including tablet,
laptop, desktop, or other such devices). You may make one back-up copy of the Work to avoid losing it.
3. The unauthorised use or distribution of copyrighted or other proprietary content is illegal and could subject
you to liability for substantial money damages. You will be liable for any damage resulting from your
misuse of the Work or any violation of this License Agreement, including any infringement by you of
copyrights or proprietary rights.
Disclaimer:
Bentham Science Publishers does not guarantee that the information in the Work is error-free, or warrant that
it will meet your requirements or that access to the Work will be uninterrupted or error-free. The Work is
provided "as is" without warranty of any kind, either express or implied or statutory, including, without
limitation, implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the
results and performance of the Work is assumed by you. No responsibility is assumed by Bentham Science
Publishers, its staff, editors and/or authors for any injury and/or damage to persons or property as a matter of
products liability, negligence or otherwise, or from any use or operation of any methods, products instruction,
advertisements or ideas contained in the Work.
Limitation of Liability:
In no event will Bentham Science Publishers, its staff, editors and/or authors, be liable for any damages,
including, without limitation, special, incidental and/or consequential damages and/or damages for lost data
and/or profits arising out of (whether directly or indirectly) the use or inability to use the Work. The entire
liability of Bentham Science Publishers shall be limited to the amount actually paid by you for the Work.
General:
1. Any dispute or claim arising out of or in connection with this License Agreement or the Work (including
non-contractual disputes or claims) will be governed by and construed in accordance with the laws of the
U.A.E. as applied in the Emirate of Dubai. Each party agrees that the courts of the Emirate of Dubai shall
have exclusive jurisdiction to settle any dispute or claim arising out of or in connection with this License
Agreement or the Work (including non-contractual disputes or claims).
2. Your rights under this License Agreement will automatically terminate without notice and without the
need for a court order if at any point you breach any terms of this License Agreement. In no event will any
delay or failure by Bentham Science Publishers in enforcing your compliance with this License Agreement
constitute a waiver of any of its rights.
3. You acknowledge that you have read this License Agreement, and agree to be bound by its terms and
conditions. To the extent that any other terms and conditions presented on any website of Bentham Science
Publishers conflict with, or are inconsistent with, the terms and conditions set out in this License
Agreement, you acknowledge that the terms and conditions set out in this License Agreement shall prevail.
PREFACE
Over the past two decades, Machine Learning has evolved to a great extent. With the advent of Artificial Intelligence, everyday tasks became more comfortable and accessible. Early artificial intelligence required a system to be fed pre-defined conditional statements in order to perform tasks on behalf of human beings. Gradually, humans needed a more stable and autonomous system, one that can learn on its own. This gap gave rise to another technology that drastically changed the concept of artificial intelligence. With the invention of machine learning, advancements now appear every single day. With ever-increasing data sources and automated computation, new technologies based on machine learning come alive very often.
Among the many books available on this topic, this book aims to reach every corner of the reading community. Everyone from naïve learners to professional machine learning experts will find this book handy and helpful for everyday application. Most books are written from a challenging mathematical perspective, which makes them incomprehensible to many readers, especially students and industry engineers.
This book aims to cover most of the Machine Learning curriculum prescribed at the top universities. It also covers advanced topics like Deep Learning and Feature Engineering. An added feature of this book is an entire chapter on real-world machine learning applications using Python programming, which will be truly beneficial for researchers and engineers, offering open-ended ideas on new problems and their solutions in a Pythonic way.
This book is written in a simple and straightforward style, with enriched theory, fewer mathematical complications, and more easily comprehensible application aspects. In every chapter, topics are described with readers from all backgrounds in mind. Every topic and subtopic is explained with examples and Python code snippets for easier understanding. The chapters are presented with well-explained illustrations and flowcharts for a better understanding of each topic. Thus, this book on machine learning will surely catch the attention of beginners in the Machine Learning domain. The audience includes university students, young researchers, Ph.D. students, professors, and software engineers who want to learn Machine Learning from scratch. I believe this book will be in demand at most university libraries and bookstores.
Not applicable.
CONFLICT OF INTEREST
This book was written entirely by a single author, so no conflict of interest applies to this book.
ACKNOWLEDGEMENTS
Writing a book is more exciting than I anticipated and more rewarding than I
could have dreamed. Nothing would have been imaginable without the strength
and ability bestowed upon me by the Almighty God because I would not be able
to do anything without Him. I want to thank and express my gratitude to my
closest family and friends.
I want to express intense gratitude to my dearest people, Mr. Ajay Kumar and
Mrs. Rekha Devi, for their unceasing impetus and motivation. A huge cheer to
you!
A special thanks to my students, without whose contributions this book would not look as magnificent as it does now. I am grateful to my dearest student Ms. Videsha
Bansal for her continuous support and contribution in organizing the contents. I
am very thankful to my beloved student at Tongmyong University, Mr. Sunghyun
Kim, for his contribution and support in the programming section of the book. I
am also thankful to my beloved research student Ms. Lea Baumgärtner for her
encouragement and assistance in the programming section of the book. A special thanks to my student Mr. Sajal Jain for supporting me in giving a professional touch to the figures and diagrams.
DEDICATION
There is no knowledge in nature; all knowledge comes from the human soul. Man
manifests knowledge, discovers it within himself, which is pre-existing through
eternity.
Swami Vivekananda
Machine Learning and Its Application, 2021, 1-18 1
CHAPTER 1
When most people hear the words artificial intelligence, robots are usually the first thing that comes to mind, since big-budget films and novels have built stories around robots that destroy the earth and humanity. The reality, however, is far from that fiction.
Authors Stuart Russell and Peter Norvig discuss the topic in their pioneering
textbook Artificial Intelligence: A Modern Approach [1] by unifying their work
around the theme of intelligent agents in computers. The authors described AI as:
"the study of agents that receive percepts from the environment and perform
actions" (Russel and Norvig viii) [1].
Indranath Chatterjee
All rights reserved-© 2021 Bentham Science Publishers
Artificial intelligence originated from the idea that human intelligence can be described so precisely that a computer can effectively imitate it and perform tasks, from the easiest to those that are much more complicated. The purposes of artificial intelligence include knowledge, logic, and understanding.
1.1.1. Evolution of AI
Artificial intelligence is certainly not a new idea; its storytelling roots go as far back as Greek antiquity. Nevertheless, it was less than a century ago that the technological revolution took off and AI went from fiction to genuinely conceivable reality. In the first half of the twentieth century, science fiction acquainted the world with the idea of robots. By the 1950s, we had a generation of scientists, mathematicians, and philosophers who were culturally acclimatized to the idea of artificial intelligence. One such individual was Alan Turing, a British scientist who investigated the mathematical possibility of artificial intelligence. Turing suggested that since humans use available information to solve problems and make decisions, why can machines not do something very similar? This was the logical framework of his 1950 paper, "Computing Machinery and Intelligence," in which he discussed how to build intelligent machines and test their intelligence.
From 1957 to 1974, AI gained importance. Computers could store more data and became faster, less expensive, and more accessible. AI algorithms also improved, and people got better at knowing which algorithm to apply to their problem. Early demonstrations, such as Newell and Simon's General Problem Solver and Joseph Weizenbaum's ELIZA, showed promising results toward the goals of problem solving and the interpretation of spoken language, respectively. These successes, as well as the advocacy of leading researchers (particularly the attendees of the DSRPAI), persuaded government agencies such as the Defense Advanced Research Projects Agency (DARPA) to fund AI research at a few organizations. The government was especially interested in machines that could transcribe and translate spoken language, as well as in high-throughput data processing. Optimism was high, and expectations were considerably higher. In 1970, Marvin Minsky told Life Magazine, "from three to eight years we will have a machine with the general intelligence of an average human being." However, while the basic proof of principle was there, there was still far to go before the ultimate goals of natural language processing, abstract reasoning, and self-recognition could be achieved.
Fig. (1.1) shows the evolution of artificial intelligence from the year 1952 to 1970.
Herbert Simon stated in 1957: "It is not my aim to surprise or shock you, but the simplest way I can summarize is to say that there are now in the world machines that think, that learn, and that create. Moreover, their ability to do these things is going to increase rapidly until – in a visible future – the range of problems they can handle will be coextensive with the range to which the mind has been applied."
Marvin Minsky stated in 1970: "In three to eight years, we will get a machine with average human intelligence."
During the 1980s, the AI outlook shifted toward symbolic AI and the so-called "expert systems" or "knowledge-based systems." The fundamental idea was to capture human expert knowledge in a computer-readable form and distribute it as a program to numerous computers.
Many AI researchers in the 1990s deliberately called their work by different names, like informatics, knowledge-based systems, cognitive systems, optimization algorithms, or computational intelligence. The new names helped with obtaining funding.
Research shows that around 40% of organizations claim to use artificial intelligence (AI) but lack clear evidence of its application. Without proper knowledge and understanding, this increasingly developing field is often reduced to the scope of its popular expressions, such as data mining, machine learning, and deep learning.
When discussing the dimensions of AI, we mostly talk about six dimensions of AI, which are as follows:
a) Reactive AI: The most fundamental, yet still very valuable, type of AI is called a reactive machine since, as its name suggests, it reacts to existing conditions. Deep Blue, the chess-playing supercomputer made by IBM, is a remarkable illustration of reactive AI. A reactive machine responds to identical circumstances in precisely the same manner every time; there will never be a variation in its actions. Reactive machines cannot learn, nor can they conceive of the past or the future. Email spam filtering and Netflix suggestions are other examples of reactive AI.
With limited-memory AI, models are automatically trained and then updated based on the model's behavior. Complicated classification problems can be tackled with limited-memory AI. Self-driving vehicles use limited-memory AI, since the AI that powers these vehicles works from the information it was trained and programmed with to learn how to operate. It can also interpret the information it observes to read its current circumstances and adjust when needed.
Theory-of-mind AI would be better equipped to interact with people genuinely than other types of AI. Computer scientists are working passionately to gain insight and make progress, but there is much work to do before AI can adequately handle every human emotion.
Self-aware AI has not been developed yet, because we do not have the hardware or the algorithms to support it. Until then, computer scientists will keep refining limited-memory AI and developing theory-of-mind AI.
We have seen the different dimensions of AI and the four significant kinds of AI. Let us now look at the present-day usage of AI in research, in business applications, and in day-to-day life. As we have discussed, the learning phase of artificial intelligence is indeed essential. This gives rise to the need for a framework for learning the various aspects of the environment, human behavior, and the working principles of different tasks. Intelligent machines can then perform on their own, like humans, even on newly assigned tasks. Thus, AI gains another level of intelligence through the capacity to learn. With the invention of this learning capacity, AI gained popularity as machine learning, where the machine can learn based on the training given to it. Gradually, machine learning needed upgrading too, and a few other domains arose, namely deep learning and reinforcement learning. This book will cover each of these topics in detail in the upcoming chapters.
Artificial intelligence has vastly helped education become more modern today, allowing educators and educational organizations to evaluate children and customize instruction methodologies for every child. It can increase the productivity of administrative responsibilities, permitting instructors to devote more effort to comprehension and adaptability. By utilizing the best attributes of machines and educators, we can achieve optimal results from AI and ML in the education sector.
Since its commencement in the 1950s, AI research has focused on five fields of specialization:
3. Planning for Proper Execution: The capacity to set and accomplish objectives.
Machine learning is now able to improve diagnostic results, matching the precision of medical specialists. Currently, scientists feed the computer a vast amount of data in the form of images and text to improve the machine's ability to 'learn'. Organizations dealing with machine learning for medical services, such as Google and Xerox, use many clinical images labeled by specialists. Machine learning algorithms are run on these visual datasets, searching for statistical patterns to sort out which features of an image suggest that it merits a specific label.
The process of learning begins with some information, such as examples, direct experience, or instruction, in order to look for patterns in the data and make better decisions in the future based on the examples we provide. The vital point is to permit the machines to adapt naturally, without human intervention or help, and to adjust their actions accordingly.
"A computer program is said to learn from experience E with respect to some class
of tasks T and performance measure P if its performance at tasks in T, as measured
by P, improves with experience E."
Along these lines, the term "learning" only really has meaning with respect to some task.
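As a quick, purely illustrative Python sketch of this definition (a toy example, not from any standard library): the task T is classifying numbers as "small" or "large", the experience E is a set of labeled examples, and the performance measure P is accuracy on held-out data.

```python
def train_threshold(examples):
    """Learn a decision threshold from labeled (value, label) pairs (experience E)."""
    smalls = [v for v, lab in examples if lab == "small"]
    larges = [v for v, lab in examples if lab == "large"]
    return (max(smalls) + min(larges)) / 2   # midpoint between the classes

def accuracy(threshold, test_set):
    """Performance measure P: fraction of test items classified correctly."""
    correct = sum((("small" if v <= threshold else "large") == lab)
                  for v, lab in test_set)
    return correct / len(test_set)

train = [(1, "small"), (2, "small"), (9, "large"), (10, "large")]
test = [(3, "small"), (8, "large")]
t = train_threshold(train)      # experience E produces the model
print(t, accuracy(t, test))     # → 5.5 1.0
```

As more labeled examples (more E) are provided, the learned threshold becomes more reliable and the accuracy P improves, which is exactly what the definition requires.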
Walking: children do not inherently know how to walk; instead, they attempt to walk many times. They acquire experience, and some learning algorithm (in their brain) learns the patterns of neuron firings that lead to falling more rarely.
Speech: instances of human conversation surround us, and we pick it up early in childhood development.
Deep learning typically requires enormous amounts of data to guarantee that an architecture with many parameters does not overfit during training. Convolutional neural networks are designed to work on imaging datasets.
Machine learning is regularly utilized for projects that involve predicting a result or uncovering patterns. In these models, a restricted body of data is used to help the machine learn patterns that it can later apply to new data. Standard algorithms utilized in ML include linear regression, decision trees, support vector machines (SVM), naive Bayes, and linear discriminant analysis.
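As a brief, illustrative sketch (assuming the scikit-learn library is installed; the dataset and split here are illustrative choices, not prescriptions), two of these standard algorithms can be trained and scored on a built-in dataset in just a few lines:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (DecisionTreeClassifier(random_state=0), GaussianNB()):
    model.fit(X_tr, y_tr)                     # learn from the training split
    print(type(model).__name__, model.score(X_te, y_te))  # held-out accuracy
```

Both models follow the same fit/score pattern, which is why practitioners can easily compare several standard algorithms on the same problem.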
Deep learning models take some time to train. Several pre-trained networks and open-access datasets shorten the training process, yet still take considerable time to execute. Deep learning also requires an ample amount of computational capacity. Computer scientists applying deep learning should invest a dominant part of their energy in training models and tuning their neural network architectures.
ML algorithms may be more attractive when one needs faster results. They are quick to train and require less computational power. The number of features and observations are the key factors that influence training time. Computer scientists applying ML ought to invest a dominant part of their energy in creating and evaluating features to improve model accuracy.
Fig. (1.4). Overview of the division of AI into its sub-fields - ML, DL and RL.
Currently, there are basically three main types of learning: supervised, semi-supervised, and unsupervised. Recently, however, two more types of learning have become popular among computer scientists: reinforcement learning and self-supervised learning. This book will discuss each type of learning in detail in the following chapters. The outline of the entire book is given at the end of this chapter to convey the book's flow. Here, we quickly visit each of the learning processes to get an overview. As discussed, the five main learning paradigms are:
a) Supervised learning
b) Unsupervised learning
c) Semi-supervised learning
d) Reinforcement learning
e) Self-supervised learning
1.5.1. Supervised
Supervised machine learning algorithms use labeled examples to apply what was learned in the past to new data and predict future events. Starting from the analysis of a known training dataset, the learning algorithm produces an approximated function to predict the output values. After sufficient training, the system can provide outputs for new inputs. The learning algorithm can also compare its output with the correct, intended output and find errors in order to modify the model accordingly.
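A minimal supervised-learning sketch (assuming scikit-learn is installed; the data and label names are illustrative): labeled examples train a classifier, which then predicts labels for new inputs.

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0], [1], [8], [9]]              # inputs
y_train = ["low", "low", "high", "high"]    # labels that supervise learning

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(clf.predict([[2], [7]]))              # → ['low' 'high']
```

The key point is that the correct outputs are supplied during training, so the algorithm can measure and correct its own errors.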
1.5.2. Unsupervised
Unsupervised machine learning algorithms are primarily used for training, while
the information used is not classified or labeled. Research into how systems
determine the ability to explain hidden structures in unlabeled data arises from
unsupervised studies. The system doesn't search for the correct output, but you can
explore the data and search the dataset to explain the hidden structure in the data in
the label.
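A minimal unsupervised sketch (assuming scikit-learn; the points are illustrative): k-means finds two groups in unlabeled points without ever being told any class labels.

```python
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [9, 8]]        # no labels are provided
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)    # the two nearby pairs end up in the same cluster
```

The algorithm discovers the hidden two-group structure purely from the geometry of the data.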
1.5.3. Semi-supervised
1.5.4. Reinforcement
Reinforcement machine learning algorithms are learning methods that interact with the environment by producing actions and discovering errors or rewards. Trial-and-error search and delayed reward are the most relevant characteristics of reinforcement learning. Using this method, one can maximize the performance of machines and software agents by automatically determining the ideal behavior within a specific context. The agent needs simple reward feedback to learn which action is best. This is known as reinforcement learning.
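As a toy, purely illustrative pure-Python sketch of trial and error with a delayed reward (a hypothetical environment, not from the book): an agent on positions 0 to 4 learns, via tabular Q-learning, that moving right eventually reaches the reward at position 4.

```python
import random

random.seed(0)
# Q-values for each (state, action) pair; action 0 = left, action 1 = right
Q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}
alpha, gamma = 0.5, 0.9   # learning rate and discount factor

for _ in range(300):      # episodes of trial and error
    s = 0
    while s != 4:
        a = random.choice((0, 1))             # explore randomly
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == 4 else 0.0           # delayed reward at the goal
        best_next = max(Q[(s2, 0)], Q[(s2, 1)])
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# after training, moving right is valued higher than left in every state
print(all(Q[(s, 1)] > Q[(s, 0)] for s in range(4)))
```

No state is ever labeled with the "correct" action; the ordering of the Q-values emerges entirely from the reward signal propagating backward through the trials.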
1.5.5. Self-supervised
learning is greatly needed, as most people are unaware of the fundamental theories and applications of machine learning.
This book aims to cover most of the machine learning curriculum prescribed at the top universities. It will also cover advanced topics like deep learning and feature engineering. An added feature of this book is an entire chapter on real-world machine learning using Python programming, which will be truly beneficial for researchers and engineers, offering open-ended ideas on new problems and their solutions in a Pythonic way.
This book is written in a simple and straightforward style, with enriched theories and fewer mathematical complications but more easily comprehensible application aspects. In every chapter, topics are described with readers from all backgrounds in mind. Every topic and subtopic will be described with examples and Python code snippets for easier explanation. Chapters are presented with well-explained illustrations and flowcharts for a better understanding of the topic. Thus, this book on machine learning will surely catch the attention of beginners in the machine learning domain. The audience will include university students, young researchers, Ph.D. students, professors, and also software engineers who want to gain knowledge of machine learning from scratch.
This book dedicates all of its chapters to machine learning and deep learning architectures. As the name suggests, it deals mostly with real-life machine learning applications and deep learning algorithms. The overall outline of the book is given below:
Chapter 3 talks about unsupervised machine learning algorithms, describing the state-of-the-art clustering algorithms. This chapter gives an elaborate definition of k-means clustering, hierarchical clustering, and self-organizing maps. It also defines the algorithmic framework of each algorithm, with hands-on examples, detailed Python code, and outputs.
Chapter 4 describes the regression models. This chapter mainly focuses on two essential regression algorithms, namely linear regression and logistic regression. It describes each of them in detail, with supporting Python programs to better explain the regression process on a real dataset.
Chapter 6 starts Section II of the book, which describes the advanced machine learning part, i.e., the deep learning approaches. This chapter is entirely devoted to deep learning architectures, from artificial neural networks to convolutional neural networks to recurrent neural networks. It also introduces the reinforcement learning concept to the readers. All the algorithms are supported with hands-on examples and Python code for better understanding. Detailed mathematical explanations with easy descriptions are given for all the algorithms mentioned in this chapter.
Chapter 8 is entirely dedicated to the application section of the book. As the title of the book suggests, greater emphasis is placed on the applicability of machine learning. This chapter is written for all beginners in machine learning and deep learning. It presents Python programs for various state-of-the-art problem areas, chosen carefully to cater to the variety of learners' needs while covering the maximum range of domain-specific tasks. The chapter covers the domain of pattern recognition, including face recognition, object recognition, and optical character recognition; the areas of medical imaging and computational linguistics, in terms of natural language processing, are also presented carefully, with detailed Python implementations on real-life datasets. In the end, the chapter throws light on the most probable recent research problems, which need the immediate attention of computer scientists and AI specialists. This part will benefit readers with knowledge of current trends in AI research.
Chapter 9 finally concludes the book by summarizing all the chapters for a quick
overview and future work.
CONCLUDING REMARKS
CHAPTER 2
Supervised machine learning algorithms use labeled examples to apply what was learned in the past to new data and predict future events. Starting from the analysis of the available training data, the learning algorithm produces an approximated function to predict the output values. After sufficient training, the system can provide outputs for new inputs. The learning algorithm can also compare its output with the correct, intended output and find errors in order to modify the model accordingly.
The training dataset is used to teach the model to produce the desired output through supervised learning. This training dataset contains inputs and the corresponding correct outputs, which allow the model to learn over time. The algorithm measures its accuracy through a loss function, adjusting until the loss has been sufficiently minimized.
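The loss-minimization idea above can be sketched in plain Python (a hypothetical toy, not from the book): fitting y = w·x by gradient descent on the mean squared error.

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]              # the true relationship is y = 2x

w = 0.0                           # model parameter to learn
for _ in range(100):              # repeatedly reduce the loss
    # gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= 0.05 * grad              # step against the gradient

print(round(w, 3))                # → 2.0
```

Each iteration adjusts w in the direction that lowers the loss, so the parameter converges to the value that best explains the labeled training pairs.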
Supervised models can be used to build and enhance multiple business applications, including the following:
Deep data insights and improved automation can benefit a business through supervised learning, but building a sustainable supervised learning model presents some challenges. Some of these challenges are:
The dataset is likely to contain human errors, which result in the algorithm learning those errors.
Class labels are often string values, such as "Spam" and "No-Spam", and need to be mapped to numerical values before being provided to a modeling algorithm. This is often referred to as label encoding, where each class label is assigned a unique integer, e.g., "Spam" = 0, "No-Spam" = 1.
There are several types of classification algorithms for modeling classification prediction problems. There is no fixed rule for mapping an algorithm to a problem type. Instead, it is generally recommended that practitioners conduct controlled experiments to find the best algorithm and configuration for a given classification task.
Instead of class labels, some problems may require an estimate of the probability of class membership for each instance. This provides additional uncertainty in the prediction, which can then be interpreted by the application or by the user. A widely used diagnostic for assessing predicted probabilities is the receiver operating characteristic (ROC) curve.
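A minimal sketch of evaluating predicted probabilities (assuming scikit-learn; the labels and probabilities are illustrative): the area under the ROC curve summarizes how well the probabilities rank positives above negatives.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]                  # actual class labels
y_proba = [0.1, 0.4, 0.35, 0.8]        # predicted probabilities of class 1

print(roc_auc_score(y_true, y_proba))  # → 0.75
```

A score of 1.0 would mean every positive is ranked above every negative; 0.5 means the probabilities carry no ranking information.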
1. Binary classification
2. Multi-class classification
3. Multi-label classification
4. Imbalanced classification
3. Multi-label Classification: This refers to problems with more than two labels or classes where each sample may be assigned multiple labels. Consider an example of photo classification: a picture can contain various objects in the scene, and a model may predict multiple known entities in it, such as "Vehicle", "Fruit", and "Human".
2.2.1. Overview
In the preceding section, we learned about supervised machine learning and classification. In this section, we will learn about the decision tree and the different algorithms based on it, which generally fall under supervised learning. Under the umbrella of supervised machine learning, the decision tree is one of the major topics to be covered. The decision tree is very similar to a flowchart and is mainly used as a tool for classification and prediction [4]. As the name suggests, a decision tree has a "tree"-like structure. It has two primary components: nodes and leaves. A node denotes a test on an attribute, and every leaf holds a class label. The branches of the decision tree represent the outcomes of the tests. Fig. (2.1) illustrates a standard decision tree with its root node and branches.
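A minimal decision-tree sketch (assuming scikit-learn; the tiny dataset is illustrative): a tree is fit as a classifier, and its flowchart-like structure of node tests, branches, and leaf labels can be printed directly.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0], [1], [8], [9]]              # one attribute per sample
y = ["no", "no", "yes", "yes"]        # class labels held at the leaves

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree))              # the root test, branches, and leaves
print(tree.predict([[7]]))            # → ['yes']
```

The printed rules make visible exactly the structure described above: a test at each node and a class label at each leaf.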
Root Node: It is the starting point of any decision tree. According to the dataset,
the root node is further divided into two or more homogeneous sets. It is the parent
node.
Leaf Node: The final outputs are represented as leaf nodes. There is no further
division after a leaf node. Leaf nodes are also known as child nodes.
Splitting: The division of a root node or decision node into sub-nodes according to different conditions is called splitting.
Branch or Sub-Tree: A tree formed by splitting the tree.
Pruning: The removal of unwanted branches from the tree is known as pruning.
Along with some terminologies, there are some assumptions which we need to
remember before we draw any decision tree. Some of the significant assumptions
are mentioned below:
The decision tree is divided into various levels, and each level needs a different attribute. Identifying the right attribute for each level is the most challenging task in forming a decision tree. The process of selecting these attributes is known as attribute selection. There are two main methods for attribute selection: one is information gain, and the other is the Gini index.
2.2.2.2. Entropy
Number of instances of x = 4
Number of instances of y = 6
Entropy plays a significant role while we draw a decision tree. With the help of
entropy, we can decide the splitting branches and make the boundaries of a decision
tree. The value of entropy ranges between 0 and 1; the lower the entropy, the purer
the node and the more reliable the split.
There are chances that a decision tree formed could go under a transformation,
which means the entropy of the decision tree will change. The change in entropy is
calculated by information gain. Information gain calculates the difference between
the "before and after entropy" of a decision tree after the transformation. With the
help of information gain, we split the nodes and build the decision tree.
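The entropy and information-gain computations described above can be sketched in a few lines of Python (a minimal illustration; the helper names are our own, not from the book):

```python
import math

def entropy(counts):
    """Entropy of a node, given its class counts, e.g. [15, 10]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent_counts, child_counts):
    """Parent entropy minus the weighted entropy of the child subsets."""
    total = sum(parent_counts)
    weighted = sum(sum(ch) / total * entropy(ch) for ch in child_counts)
    return entropy(parent_counts) - weighted
```

A pure node has entropy 0, a 50/50 two-class node has entropy 1, and a split whose subsets are exactly as mixed as the parent yields an information gain of 0.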
26 Machine Learning and Its Application Indranath Chatterjee
Decision tree algorithms are always designed to maximize the value of information
gain. According to the rules of decision tree formation, the node or attribute with
the maximum information gain will be the first one to be split. The formula for
finding the information gain is as follows:

Gain(S, A) = E(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) E(S_v)

where S is the set of instances, A is the attribute being tested, and S_v is the
subset of S for which attribute A takes the value v.
Information gain also measures the change in the entropy observed in the decision
tree due to the formation of the sub-sets, according to the given conditions.
When we choose an element at random, there is a probability that it will be classified
incorrectly. To measure this probability, we have the Gini index. It is an impurity
metric: the lower the Gini index of a split, the lower the chance that a randomly
chosen element is misclassified, so splits with lower Gini values are preferred.
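As a small sketch (the function name is ours, not from the book), the Gini index of a node can be computed from its class counts:

```python
def gini(counts):
    """Gini impurity: the probability of misclassifying a randomly chosen element."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini([5, 5])   # 0.5, a maximally impure two-class node
gini([10, 0])  # 0.0, a pure node
```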
2.2.3.1.1. Requirements
A. Initially, we will focus on the training examples provided at the root node.
C. Create a subset of the training examples for each branch, containing the instances
that will be categorized along that path of the tree.
A. If all positive or negative training examples remain, label the nodes "TRUE" or
"FALSE" accordingly.
B. In the absence of an attribute, label the node with the most votes among the
remaining training events.
C. If no instances remain, label the node with the majority vote of the parent's
training instances.
The table given below (Table 2.1) shows sample data for a hypothetical basketball
match.
… … … … … … …
What we know:
We know that the game will be played 'away' at 9 pm, and 'Joe' will play center
on the offense side (Table 2.2).
It is a classification task, as seen in Fig. (2.2), which shows how the
observations are classified based on the decisions.
Generalizing the learned rule to new examples.
Fig. (2.2). Decision tree built on Table 2.1 data (intermediate level).
Suppose S has 25 instances, 15 positive and 10 negative [15 (true class), 10 (false
class)]. Then the entropy of S corresponding to this classification is:

E(S) = -(15/25) log2(15/25) - (10/25) log2(10/25)
It should be noted that if the outcome is certain, then the entropy is 0. If nothing
is known about the system (i.e., every outcome is equally likely), the entropy is
maximal.
Before partitioning, the 20 games comprise 10 wins and 10 losses, so the entropy is
E(S) = -(10/20) log2(10/20) - (10/20) log2(10/20) = 1.
When we use the ‘where’ attribute, it divides into 2 subsets, as seen in Fig. (2.3).
Entropy of the 1st set, H(home) = -(6/12) log2(6/12) - (6/12) log2(6/12) = 1
Entropy of the 2nd set, H(away) = -(4/8) log2(4/8) - (4/8) log2(4/8) = 1
Fig. (2.4). The intermediate step of building tree while using "When" clause.
When we use the 'when' attribute, it divides the tree into three subsets, as seen in
Fig. (2.4):
Entropy of the 1st set, H(5pm) = -(1/4) log2(1/4) - (3/4) log2(3/4) = 0.811
Entropy of the 2nd set, H(7pm) = -(9/12) log2(9/12) - (3/12) log2(3/12) = 0.811
Entropy of the 3rd set, H(9pm) = -(0/4) log2(0/4) - (4/4) log2(4/4) = 0
2.2.3.1.5. Decision
Here, we see that the information gain obtained using the 'when' attribute is greater
than that of the 'where' attribute. Thus, splitting on 'when' is better than splitting
on 'where'. In a similar way, we should calculate the information gain of the other
attributes. At each non-terminating node, we should choose the attribute with the
highest information gain.
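The comparison can be reproduced numerically. The class counts below are reconstructed from the worked example above (20 games in total, 10 wins and 10 losses), so treat this as an illustration of the calculation rather than the book's own code:

```python
import math

def entropy(p, n):
    """Entropy of a node with p positive and n negative instances."""
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c)

parent = entropy(10, 10)  # 1.0 before partitioning

# 'where' splits into home (6 wins, 6 losses) and away (4 wins, 4 losses)
ig_where = parent - (12/20) * entropy(6, 6) - (8/20) * entropy(4, 4)

# 'when' splits into 5pm (1, 3), 7pm (9, 3), and 9pm (0, 4)
ig_when = (parent - (4/20) * entropy(1, 3)
                  - (12/20) * entropy(9, 3)
                  - (4/20) * entropy(0, 4))

ig_when > ig_where  # True, so we split on 'when' first
```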
Before we build a decision tree, we should first discuss the decision tree algorithms
that are most commonly used.
2. C4.5: After ID3, C4.5 is the most used decision tree algorithm. This algorithm
can use information gain or gain ratio to classify the attributes and expand the
tree. It can handle both continuous and missing attribute values.
2.2.5.1. Advantages
2.2.5.2. Disadvantages
When training examples are few and many classes are present, decision trees are
prone to errors.
Expensive training.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target
# hold out 25% of the 150 samples (38) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
tree.plot_tree(decision_tree)
plt.show()
OUTPUT:
# Make predictions
y_predicted = decision_tree.predict(X_test)
print("Predicted values:")
print(y_predicted)
OUTPUT:
Predicted values:
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 1 1 0 0 1 0 0 1 1 0 2 1 0
1 2 1 0 2]
from sklearn.metrics import confusion_matrix, accuracy_score

print(confusion_matrix(y_test, y_predicted))
print("-------------------------------------")
print(accuracy_score(y_test, y_predicted)*100)
OUTPUT:
[[13 0 0]
[ 0 15 1]
[ 0 3 6]]
-------------------------------------
89.47368421052632
2.3.1. Overview
Regarding supervised machine learning techniques, random forest is one of the
most popular and widely used algorithms. Random forest is a classifier that builds
numerous decision trees on multiple subsets of a given dataset and averages their
results to improve the prediction accuracy on that dataset. Random forest is an
algorithm that can be used for both classification and regression problems in
machine learning. It improves the overall model because it can solve multiple
complex problems quickly and accurately.
The random forest algorithm is more accurate because the solution is not based on
just one tree but on many trees. It takes the predictions from all available trees
and makes the final prediction based on the majority vote. Having many trees in the
forest increases accuracy and prevents overfitting.
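The voting step can be sketched as follows (a minimal illustration with our own helper names; a real random forest also randomizes the features considered at each split):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample with replacement: each tree trains on its own random subset."""
    return [rng.choice(data) for _ in data]

def forest_predict(tree_predictions):
    """Majority vote across the individual trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

forest_predict(["win", "lose", "win", "win", "lose"])  # "win"
```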
There are several advantages to using random forest; below are some specific
points:
Read the below steps to understand the complete working process of a random
forest algorithm:
Step-1: From the available training set, random data sets are selected.
Step-2: From the chosen data points, we now build a decision tree.
Step 5: For new data points, collect the prediction of each decision tree and assign
the new data points to the category that wins the majority of votes. The working of
the algorithm can be better understood with the following example.
Example: Suppose you have a dataset containing multiple fruit images. So, this
dataset is fed into a random forest classifier. The data is divided into subsets, and
each subset is given to a decision tree. In the training phase, each decision tree
produces a prediction result. Based on the majority of these predictions, the random
forest classifier predicts the final decision when new data points arrive. Consider
the following image.
When compiling an output from N decision trees, different trees predict different
outputs. Some outputs will be correct and some incorrect, but collectively, the
random forest converges on the most accurate output. Therefore, two assumptions
should be satisfied for an accurate random forest prediction:
i. The feature variables in the dataset must carry real signal so that the classifier
can predict actual outcomes rather than guesses.
ii. The predictions of the individual trees should have very low correlations with
one another.
2.3.4.1. Advantages
Random forest algorithm may be used for both classification and regression tasks,
but it is less suitable for regression operations.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
random_forest_clf = RandomForestClassifier()
random_forest_clf.fit(X_train, y_train)
y_predicted = random_forest_clf.predict(X_test)
print("Predicted values:")
print(y_predicted)
OUTPUT:
Predicted values:
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0
2 2 1 0 2]
from sklearn.metrics import confusion_matrix, accuracy_score

print(confusion_matrix(y_test, y_predicted))
print("-------------------------------------")
print(accuracy_score(y_test, y_predicted)*100)
OUTPUT:
[[13 0 0]
[ 0 15 1]
[ 0 0 9]]
-------------------------------------
97.36842105263158
2.4.1. Overview
The KNN algorithm can be used for both classification and regression problems. In
the KNN algorithm, the input consists of the k nearest training examples, and the
output depends entirely on the problem type.
The output of a classification problem is always a class membership. An object is
classified by a plurality vote of its neighbors and assigned to the class most
common among its k nearest neighbors (k is a positive integer, usually small). If
k = 1, the object is simply assigned to the class of its single nearest neighbor.
If you have a regression problem, the output object has a property value. This value
is the average of the nearest neighbor values. KNN is designed to predict new data
using information from the nearest K neighbor from existing data.
KNN doesn't have a process that can be called learning, because when new data
comes in, the neighbors are selected only by measuring the distance to the existing
data. That is why some people call KNN a lazy model, meaning that it doesn't build
a model separately; more politely, it is called instance-based learning. This
contrasts with model-based learning, which creates a model from data and then
performs the task. Instance-based learning performs tasks such as classification or
regression using only the stored observations, without a separate model-creation
process [5].
However, it should be noted that the KNN algorithm does not explicitly construct
such boundaries; it simply determines which category new data is classified into
when it arrives.
We now know that KNN could be used for classification and regression problems.
Still, it is more likely to be used for classification problems on a larger scale. KNN
is evaluated on three main aspects:
Euclidean Distance
For example, in Fig. (2.6), given below, we see a straight line joining two points A
and B:
d(x, y) = √(Σi (xi − yi)²) = √((0 − 2)² + (3 − 0)² + (2 − 0)²) = √17
Manhattan Distance
Distance calculated when moving from A to B only in the direction of each axis.
To get from one building to another in Manhattan, New York, you must follow a
lattice-shaped path, which is easy to remember—also called Taxicab Distance.
d(x, y) = Σi |xi − yi|
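Both distances can be checked with a few lines of Python. The coordinates below are inferred from the Euclidean example above, i.e., A = (0, 3, 2) and B = (2, 0, 0):

```python
import math

a = (0.0, 3.0, 2.0)  # point A, inferred from the worked example
b = (2.0, 0.0, 0.0)  # point B

euclidean = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))  # sqrt(17) ~ 4.123
manhattan = sum(abs(ai - bi) for ai, bi in zip(a, b))               # 2 + 3 + 2 = 7
```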
Mahalanobis Distance
This method calculates the distance by reflecting both the variance within the
variable and the covariance between the variables. This is a distance indicator that
considers the correlation between variables.
If you set the Mahalanobis distance between x and y as c, with x = (x1, x2) and
y = (0, 0), and solve the above equation, it can be written as follows; it is the
equation of an ellipse:

x1² s1 + 2 x1 x2 s2 + x2² s4 = c²

where the inverse covariance matrix is Σ⁻¹ = [s1 s2; s3 s4].
Correlation Distance
other words, it can reflect the similarity of the two data patterns. The correlation
coefficient's value ranges from -1 to 1, so the correlation distance ranges from 0 to
2. If it is 0, you can interpret the pattern of the two data as very similar, and if it is
2, it is not.
𝑑𝑐𝑜𝑟𝑟(𝑥,𝑦) = 1 − 𝜌𝑥𝑦
Use the Spearman Rank correlation of the data directly as a measure of distance.
The remaining features are the same as the correlation distance.
𝑑𝑐𝑜𝑟𝑟(𝑥,𝑦) = 1 − 𝜌𝑥𝑦
ρxy = 1 − (6 / (n(n² − 1))) Σ_{i=1}^{n} (rank(xi) − rank(yi))²
Suppose, for example, that the seasonal temperature ranks for each region are given
as follows, as seen in Table 2.3:
Seoul 3 1 2 4
New York 3 1 2 4
Sydney 2 4 3 1
So, the rank correlation between Seoul and New York is:

ρ(Seoul, New York) = 1 − (6 / (4(4² − 1))) {(3 − 3)² + (1 − 1)² + (2 − 2)² + (4 − 4)²} = 1

so the rank correlation distance between them is 1 − 1 = 0.
The best K is data-specific, so you need to search for it empirically. The K value
in the KNN algorithm plays an important role; thus, the selection of the K value
must be made carefully.
Following are some pointers to be considered while we choose the K value for the
algorithm:
The first step for choosing a particular K value is trial and error. You must try
different values to find the most accurate one among them; 5 is a commonly
preferred K value.
Low values like 1 or 2 are preferred less because they create noise in the final
output and affect model accuracy.
Large values for K are found to be suitable, but some difficulties could be
observed in some cases.
In addition, to find the best K, we split the data into training and validation
sets and experiment while changing the value of K.
i. Measure the distance between each row of test data and training data. Here,
Euclidean distance is the most widely used method, so we use distance as the
metric. Other metrics you can use are Chebyshev, cosine, etc.
ii. Sort the calculated distance in ascending order based on the distance value to get
top K rows from the sorted array
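The two steps above, followed by a majority vote, amount to the following sketch (assuming Euclidean distance and a toy two-cluster dataset of our own):

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # i. measure the distance from x to every training row
    dists = [(math.dist(x, row), label) for row, label in zip(X_train, y_train)]
    # ii. sort ascending by distance and keep the top k rows
    top_k = sorted(dists, key=lambda t: t[0])[:k]
    # finally, take a majority vote among the k nearest labels
    return Counter(label for _, label in top_k).most_common(1)[0][0]

X_train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y_train = [0, 0, 0, 1, 1, 1]
knn_predict(X_train, y_train, (0.5, 0.5))  # 0
knn_predict(X_train, y_train, (5.5, 5.5))  # 1
```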
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predicted = knn_clf.predict(X_test)
print("Predicted values:")
print(y_predicted)
OUTPUT:
Predicted values:
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0
2 2 1 0 2]
from sklearn.metrics import confusion_matrix, accuracy_score

print(confusion_matrix(y_test, y_predicted))
print("-------------------------------------")
print(accuracy_score(y_test, y_predicted)*100)
OUTPUT:
[[13 0 0]
[ 0 15 1]
[ 0 0 9]]
-------------------------------------
97.36842105263158
2.5.1. Overview
For example, a ball could be a tennis ball only if it is green in color, about 6.7 cm,
and weighs around 55 grams. Together all these three properties make a ball a tennis
ball irrespective of other available features.
Building this classification model is not only easy but also works effectively on
enormous datasets. It is simple to use and often outperforms more sophisticated
classification models.
Bayes' theorem provides a way to compute the posterior probability P(c|x) from
P(c), P(x), and P(x|c). See the equation below:

P(c|x) = P(x|c) P(c) / P(x)
P(c|x) stands for the posterior probability of class (c), with predictor (x).
Have you ever wondered how flower-bearing plants are classified? Traditionally, it
is done by color, size, and the number of flowers, but in machine learning, it is
done through predictive modeling. We have read and understood different
classification algorithms, but the conditional probability model of classification
is a method that helps us predict the output from various observations. The
conditional probability model of classification is used for problems where we
don't need a numerical output, unlike regression problems. In such problems, the
input is denoted X, and the output is denoted Y.
For example, consider a classification problem with k class labels y1, y2, ..., yk
and n input variables x1, x2, ..., xn. One can compute the conditional probability
of a class label given an instance, i.e., a set of input values x1, ..., xn:
P(𝑦𝑖 |𝑥1 , 𝑥2 , . . . , 𝑥𝑛 )
You can then compute the conditional probability for each class label, and the
highest probability label is likely to be returned as a classification.
The prior P(yi) is easy to estimate from the dataset, but estimating the conditional
probability of the observations given the class, P(x1, x2, ..., xn|yi), is not
feasible unless the number of instances is extremely large, that is, large enough
to effectively estimate the probability distribution for all possible combinations
of values.
Similarly, applying the Bayes theorem directly also becomes impractical when the
number of variables or features increases.
Now we focus on how to calculate the elements of the equation used in the Naïve
Bayes algorithm. The prior P(yi) is estimated by dividing the number of training
observations with that class label by the total number of examples in the training
dataset.
The conditional probability for the value of a feature given a class label can also be
estimated from the data—precisely, one data distribution per variable and examples
of data belonging to a given class. If one has k classes and n variables, one must
create and maintain k × n different probability distributions.
Different types of data for each feature require different approaches. Specifically,
the data are used to estimate the parameters of one of the three standard probability
distributions.
For categorical variables like labels, one can use a multinomial distribution. If the
variable is binary, such as yes/no or true/false, you can use a binomial distribution.
Gaussian distribution is often used when the variable is numeric, such as a measurement.
These three distributions are so popular that Naïve Bayes implementations are often
named after them, such as:
Datasets with mixed data types for the input variables may need to select a different
kind of data distribution for each variable.
It is not necessary to use one of these three normal distributions. For example, if a
real-valued variable has another kind of distribution, for example, exponential, one
can use that specific distribution. In some instances, a real-valued variable doesn't
have a properly defined distribution, like bimodal or multimodal. It is suggested to
use a kernel density estimator instead to estimate the probability distribution.
Naive Bayes algorithms have proven to be effective and are therefore widely used
in text classification functions. Words in the document can be encoded as binary
(with terms), numeric (word events), or frequency (TF/IDF) input vectors and
binary, polynomial, or Gaussian probability distributions, respectively.
We have understood how the Naïve Bayes algorithm works with the concept of
probability and probability distributions. Let us predict an example dataset (as
observed in Table 2.4) for a football match with three attributes: Field as home
ground and away; Team as domestic and international; and Outcome or the target
class as Win and Defeat.
Let us convert the categorical values to numbers. Each input variable carries two
values, and the target class also has two values. So, converting each categorical
value into binary, we get:
1) Variable:
a) Field
i) Home = 1
ii) Away = 0
2) Variable:
a) Team
i) Domestic = 1
ii) International = 0
3) Variable:
a) Outcome
i) Win = 1
ii) Defeat = 0
According to Bayes' theorem, P(h|d) = P(d|h) P(h) / P(d), where:
P(h|d) = probability of hypothesis 'h' for the given data d, which is also known
as posterior probability.
P(d|h) = probability of data 'd' given the hypothesis 'h', when it was true.
P(h) = probability of hypothesis 'h', when it is true, which is also known as the
prior probability of h.
P(d) = probability of the data.
In fact, we don't need the full posterior probability to predict a class for new
data values. The only thing required is the numerator, and the class with the
largest response is predicted as the target class.
First, let's take the records from the dataset and use the trained model to tell which
class they belong to.
To score the model, we plug the feature values into both classes and compute the
responses. We begin with the "Win" class: multiply the conditional probabilities
together, then multiply by the prior probability of the class.
We see that 0.32 is greater than 0.04, so we can easily predict the output class as
'Win'.
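That computation can be sketched as follows. The priors and conditional probabilities here are illustrative values chosen so the numerators come out to 0.32 and 0.04 as above; they are not taken from Table 2.4:

```python
# Illustrative (assumed) priors and conditional probabilities
priors = {"Win": 0.5, "Defeat": 0.5}
conditionals = {
    "Win":    {"Field=Home": 0.8, "Team=Domestic": 0.8},
    "Defeat": {"Field=Home": 0.4, "Team=Domestic": 0.2},
}

def nb_numerator(cls, features):
    """Prior probability times the product of the conditional probabilities."""
    score = priors[cls]
    for f in features:
        score *= conditionals[cls][f]
    return score

features = ["Field=Home", "Team=Domestic"]
scores = {c: nb_numerator(c, features) for c in priors}
prediction = max(scores, key=scores.get)  # "Win" (0.32 beats 0.04)
```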
2.5.3.1. Advantages
2. The Naïve Bayes classifier compares well with other models such as logistic
regression, requiring less training data.
3. This algorithm performs better for categorical variables than for numeric
variables. For numerical variables, it assumes a normal distribution.
2.5.3.2. Disadvantages
1. If a categorical value was not seen in the training dataset, the model will
assign it a zero probability and will be unable to make a prediction. This is often
referred to as "zero frequency". To solve this, one can use a smoothing technique;
the simplest smoothing method is Laplace estimation.
2. On the other hand, Naïve Bayes is also known to be a bad estimator, so its
output probabilities should not be taken too seriously.
This section offers a few important practical tips for working with Naive Bayes
models:
Naive Bayes, by definition, assumes that the input variables are independent of each
other. In reality, some or most of the variables are dependent, but the method
often still works well. However, the algorithm's performance decreases as the
dependence among the input variables increases.
To compute the conditional probabilities for the class labels, you need to maintain
multiple probability distributions: one for each combination of class and input
variable.
For example:
When new data becomes accessible, it is relatively straightforward to use this
fresh data together with the old data to update the parameter estimates of each
probability distribution. This allows the model to easily adapt to new data or to
an ever-changing distribution of data over time.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
naive_bayes_clf = GaussianNB()
naive_bayes_clf.fit(X_train, y_train)
y_predicted = naive_bayes_clf.predict(X_test)
print("Predicted values:")
print(y_predicted)
OUTPUT:
Predicted values:
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0
2 2 1 0 1]
from sklearn.metrics import confusion_matrix, accuracy_score

print(confusion_matrix(y_test, y_predicted))
print("-------------------------------------")
print(accuracy_score(y_test, y_predicted)*100)
OUTPUT:
[[13 0 0]
[ 0 16 0]
[ 0 0 9]]
-------------------------------------
100.0
2.6.1. Overview
"Support Vector Machine" (SVM) is an algorithm that could be used for both types
of challenges, classification and regression. Still, it is preferred more for the
classification model. When using the SVM algorithm, each data point is plotted as
a point in N-dimensional space, where the value of each feature is the value of a
specific coordinate. The algorithm then classifies the data points by finding the
hyperplane that creates the widest separation between the two classes.
The support vectors are the data points closest to the hyperplane; they determine
its position. The SVM classifier tries to build the best boundary between the two
classes (a hyperplane or line). Fig.
(2.7) shows the basic principle of the support vector machine. The charm of SVM
is that it is a well-designed machine learning algorithm based on a stunning and
solid theoretical background. The magic of the algorithm is that it is practical
because it is easy to implement in several ways and has useful functions [6].
The problems that need to be solved in SVM are when one needs to divide the
dataspace using a decision boundary.
Let us take two sample inputs, '+' and '-', and ask how to divide them into two
classes (Fig. 2.8). The easiest way is to draw a line that divides the data points
equally, keeping an equal and wide distance between the two groups of data samples
and the line. This line should maintain the widest possible distance from the data
points. In Fig. (2.9), let us say that the drawn line is
the dotted line, so the distance between the dotted line and the line passing through
the closest data point should be broad.
Fig. (2.9). The hyperplane trying to separate the points based on support vectors.
Even though an intuitive solution makes this seem a trivial problem, generalizing
the answer into a solid and logical theory is not simple. SVM is a systematic
method for finding the solution to this problem.
First, let us understand the nature of the decision rule determining the boundary's
limits. Let's draw a vector w that is orthogonal to the centerline. For an unknown
sample u, the decision rule is:

if w · u + b ≥ 0, then the data point is '+'

So the formula mentioned above becomes the decision rule. It's also the first tool
we need to understand SVM. But there are still many shortcomings: we have no idea
yet how w should be determined, or how b should be determined in that formula. We
only know that w is orthogonal to the centerline.
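The decision rule translates directly into code. The weight vector and offset below are assumed values for illustration, not parameters learned by an SVM:

```python
def classify(u, w, b):
    """Decision rule: '+' if w . u + b >= 0, otherwise '-'."""
    score = sum(wi * ui for wi, ui in zip(w, u)) + b
    return "+" if score >= 0 else "-"

w = (1.0, 1.0)  # assumed weight vector orthogonal to the boundary
b = -3.0        # assumed offset
classify((2.0, 2.0), w, b)  # "+"
classify((0.0, 0.0), w, b)  # "-"
```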
Till now, we are aware that a hyper-plane segregates two different classes. Now we
will learn about how to find the correct hyper-plane.
The most crucial rule to identify the correct hyper-plane is: "Correct hyper-plane is
the one which segregates the two classes more accurately." Thus, in case-1, hyper-
plane B is the most accurate one.
In the given data, hyper-plane C has the maximum distance; thus, it is the correct
hyper-plane in this case. The distance between the hyper-plane and the nearest data
point is called margin.
Case-3: Note: Use the rules outlined in the previous section to identify the correct
hyperplane.
Case-4: Can we classify two classes? Below, you can't use a straight line to
separate the two classes, because one star lies as an outlier inside the other
(circle) class's area (Fig. 2.14).
Fig. (2.14). Two classes, one data point being missed with other.
As mentioned earlier, the star at the other end is an outlier for the star class.
The SVM algorithm can ignore the outlier and find the hyperplane with the maximum
margin (Fig. 2.15). So, it can be said that SVM classification is robust to
outliers.
SVM can solve this problem by introducing an additional feature. Here we add a new
feature z = x² + y². Now plot the data points on the z and x axes.
It's easy to have a linear hyperplane between these two classes in an SVM classifier.
However, there may be another problem: do we have to add this feature manually
every time? No, the SVM algorithm has a technique called the kernel trick.
The SVM kernel is a function that receives a low-dimensional input space and
converts it to a high-dimensional space, transforming an indivisible problem into a
partitioning problem. It is mostly useful in non-linear separation problems. In short, it
does a very complex data transformation and then explores the process of
separating the data based on labels or defined outputs. Looking at the hyperplane
in the original input space, it seems like a circle (Fig. 2.18).
There is a trick: SVM doesn't need the actual transformed vectors to do its magic;
it can work with dot products alone. This means we can skip the expensive explicit
computation. Instead, it does the following:
z = x² + y²
𝑎 · 𝑏 = 𝑥𝑎 · 𝑥𝑏 + 𝑦𝑎 · 𝑦𝑏 + 𝑧𝑎 · 𝑧𝑏
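We can verify numerically that this works: for the feature map φ(x, y) = (x, y, x² + y²), the 3-D dot product equals a·b + (a·a)(b·b), so it can be computed without ever leaving two dimensions. A small sketch:

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def phi(p):
    """Explicit lift: (x, y) -> (x, y, x^2 + y^2)."""
    x, y = p
    return (x, y, x ** 2 + y ** 2)

def kernel(a, b):
    """Same value as dot(phi(a), phi(b)), computed entirely in 2-D."""
    return dot(a, b) + dot(a, a) * dot(b, b)

a, b = (1.0, 2.0), (3.0, 1.0)
dot(phi(a), phi(b)) == kernel(a, b)  # True, both equal 55.0
```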
We let SVM do the job by giving it a kernel function as the new dot product. This
is the kernel trick, which allows one to bypass costly calculations. Usually,
the kernel is linear, and with it, linear classification is done. However, with a
non-linear kernel, one can get a non-linear classifier. The transformation moves
the dot product into the space you want, and the SVM performs accordingly.
The kernel trick isn't part of the SVM. It can be used with other linear classifications
such as logistic regression. The support vector machine only handles finding the
boundaries.
When we are ready with the feature vector, one needs to select a learning model's
kernel function. Each problem is different, and the kernel functions also depend on
data. The data consisted of concentric circles in this example, so we chose a kernel
that matches those data points. The three algorithms, namely, SVC, NuSVC, and
LinearSVC, are helpful for binary and multi-class classification on a single data set.
As seen in Fig. (2.19), there are four kernels working on the Iris dataset.
SVC and NuSVC are similar, but they accept slightly different sets of parameters
and have different mathematical formulations. On the other hand, LinearSVC
implements support vector classification much faster for a linear kernel. LinearSVC
assumes the kernel is linear, so it doesn't accept a kernel parameter, and some
attributes of SVC and NuSVC are missing, such as support_. Like other classifiers,
SVC, NuSVC, and LinearSVC take two arrays as input: an array X of size
[n_samples, n_features] holding the training samples, and an array y of size
[n_samples] holding the class labels.
If you use SVM in other areas, you may use hundreds of features. An NLP classifier,
meanwhile, can use thousands of features, up to one for each word appearing in the
training data. This changes the problem slightly: while a non-linear kernel may be
a good idea in other settings, having this many features makes non-linear kernels
prone to overfitting the data. So it is better to stick to a good old linear
kernel, which gives good performance in these cases.
2.6.4.1. Advantages
b) Memory efficient.
2.6.4.2. Disadvantages
a) If the number of features exceeds the number of samples, avoiding overfitting is
vital when choosing the kernel function and the regularization term.
b) SVM does not directly provide probability estimates; these are computed using an
expensive five-fold cross-validation.
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
svm_clf = svm.SVC()
svm_clf.fit(X_train, y_train)
y_predicted = svm_clf.predict(X_test)
print("Predicted values:")
print(y_predicted)
OUTPUT:
Predicted values:
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0
2 2 1 0 2]
from sklearn.metrics import confusion_matrix, accuracy_score

print(confusion_matrix(y_test, y_predicted))
print("-------------------------------------")
print(accuracy_score(y_test, y_predicted)*100)
OUTPUT:
[[13 0 0]
[ 0 15 1]
[ 0 0 9]]
-------------------------------------
97.36842105263158
CONCLUDING REMARKS
CHAPTER 3
Unsupervised machine learning algorithms are used when the training information is
neither classified nor labeled. Unsupervised learning studies how systems can infer
a function to describe hidden structure in unlabeled data. The system does not
figure out the correct output, but it explores the data and can draw inferences
from the dataset to describe hidden structures in the data and label them.
Back in the 1850s, John Snow, a London physician, developed a map where he
plotted cholera deaths. With the help of this map, he identified that the majority
of deaths were reported near a particular well. Thus, the map exposed the problem
and the solution together.
Clustering is a method that helps us identify groups among the available scattered
data. There is no single fixed pattern or set of steps to perform clustering on a
given dataset; users are free to use their preferred clustering method
Indranath Chatterjee
All rights reserved-© 2021 Bentham Science Publishers
Unsupervised Machine Learning Machine Learning and Its Application 73
1. Density-based Methods: These methods consider a cluster to be a dense region of
similar points, while dissimilar points are situated in low-density regions of the
same space. DBSCAN and OPTICS are some examples.
4. Grid-based Methods: These methods are fast and independent because the
entire data space is divided into a smaller finite number of cells like a grid. The
clustering operation becomes very easy in such a data space. STING and CLIQUE are
some examples of grid-based methods.
Natural Disasters: The study of disasters like earthquakes and tsunamis helps us
determine dangerous zones and take appropriate precautions.
Marketing: It helps in discovering the customer segments for better marketing
strategies.
3.2.1. Overview
In the K-means clustering algorithm, we first calculate the optimum centroids for the given data. This is done by iteration. The first assumption is that we know the K-value, i.e., the number of clusters.
The K-means clustering algorithm is designed around the distance between the data points and the centroids. A data point is placed in a cluster when the squared distance between the data point and that cluster's centroid is the minimum. In other words, the smaller the variation among the data points within a cluster, the more similar the data points are.
The following steps will help us understand the K-means clustering algorithm:
Step 4 – Keep iterating until the optimum centroids are found, i.e., until the distances between the data points and their centroids no longer change.
First, the sum of the squared distances between the data points and the centroids is calculated. Next, each data point is allocated to its closest centroid. Finally, all the data points in a cluster are averaged to recalculate that cluster's centroid.
There are some limitations when we use the K-means clustering algorithm:
Since K-means clustering uses distances to form clusters, the data should be standardized to obtain better and more relevant results.
Random initial centroids could prevent the algorithm from reaching the global optimum. Thus, it is recommended to run the algorithm with several different initializations of the centroids.
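The standardization step mentioned above can be sketched with a small z-score helper (a minimal illustration; the function name and sample values here are ours, not from the text):

```python
import numpy as np

def standardize(X):
    # z-score standardization: zero mean and unit variance per feature,
    # so no single feature dominates the distance computation
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])
Z = standardize(X)
print(Z.mean(axis=0))  # per-feature means are now ~0
print(Z.std(axis=0))   # per-feature standard deviations are now 1
```

After this transformation, both features contribute comparably to the Euclidean distances that K-means relies on.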
Clustering of documents
Market segmentation
Image segmentation
Image compression
Customer segmentation
Analyzing the trend on dynamic data
Let us take a simple example with a small dataset showing the implementation of the K-means algorithm, using K=2 (see Table 3.1).

Table 3.1. Sample dataset with seven items and two features.

Item    X      Y
1       1.0    1.0
2       1.5    2.0
3       3.0    4.0
4       5.0    7.0
5       3.5    5.0
6       4.5    5.0
7       3.5    4.5
Step 1: Initialization:
o Here, we randomly choose two centroids (k=2) for the two clusters. To start with, m1 = (1.0, 1.0) and m2 = (5.0, 7.0). We can see the data in Table 3.2.

Table 3.2. The seven data points, with initial centroids m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

Item    X      Y
1       1.0    1.0
2       1.5    2.0
3       3.0    4.0
4       5.0    7.0
5       3.5    5.0
6       4.5    5.0
7       3.5    4.5
Step 2:
o Using these centroids, we calculate the distance of each data point from m1 and m2, as shown in Table 3.3.

Table 3.3. Distances of each data point from the centroids m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

Item    Distance to m1    Distance to m2
1       0                 7.21
2       1.12              6.10
3       3.61              3.61
4       7.20              0
5       4.72              2.5
6       5.31              2.06
7       4.30              2.92

o Now, we obtain two different clusters having the data points {1,2,3} and {4,5,6,7}. After this, the new centroids are m1 = (1.83, 2.33) and m2 = (4.12, 5.38).
Step 3:
o Now, using these centroids, we again calculate the distances using the Euclidean measure, as shown in Table 3.4. Thus, the new clusters are {1,2} and {3,4,5,6,7}, and the next centroids are m1 = (1.25, 1.5) and m2 = (3.9, 5.1).

Table 3.4. Distances of each data point from the centroids m1 = (1.83, 2.33) and m2 = (4.12, 5.38).

Item    Distance to m1    Distance to m2
1       1.57              5.38
2       0.47              4.28
3       2.04              1.78
4       5.64              1.84
5       3.15              0.73
6       3.78              0.54
7       2.74              1.08
Step 4:
o Now, the new clusters are again {1,2} and {3,4,5,6,7}. Thus, we observe no change in the allocation of data points to clusters (see Table 3.5).
o Thus, the algorithm ends here for K=2, and we finally obtain the two clusters {1,2} and {3,4,5,6,7}, which remain unchanged (Fig. 3.1).
Fig. (3.1). Finally selected data points in two clusters after the k-means algorithm (k=2).
Table 3.5. Distances of each data point from the centroids m1 = (1.25, 1.5) and m2 = (3.9, 5.1).

Item    Distance to m1    Distance to m2
1       0.56              5.02
2       0.56              3.92
3       3.05              1.42
4       6.60              2.20
5       4.16              0.41
6       4.78              0.61
7       3.75              0.72
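The whole worked example can be reproduced with a short NumPy sketch of the K-means iteration. It follows the steps above and converges to the same clusters {1,2} and {3,4,5,6,7} (0-indexed below as {0,1} and {2,…,6}):

```python
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
                   [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])  # initial centroids m1, m2

for _ in range(10):
    # assign every point to its nearest centroid (Euclidean distance)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # recompute each centroid as the mean of its assigned points
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break  # centroids stopped moving: the algorithm has converged
    centroids = new_centroids

print(labels)              # → [0 0 1 1 1 1 1]
print(centroids.round(2))  # final centroids (1.25, 1.5) and (3.9, 5.1)
```

Note that point 3, which is equidistant from both initial centroids (3.61 each), is assigned to the first cluster by `argmin`, matching the book's choice in Step 2.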
1 https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_clustering_algorithms_k_means.htm
First, generate a sample dataset and fit the model:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, y_true = make_blobs(n_samples = 400, centers = 4,
cluster_std = 0.60, random_state = 0)
kmeans = KMeans(n_clusters = 4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
Now, to plot and visualize the dataset along with the cluster centroids selected by the k-means algorithm:
plt.scatter(X[:, 0], X[:, 1], c = y_kmeans, s = 20, cmap = 'summer')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c = 'blue', s = 100, alpha = 0.9)
plt.show()
Read Data
Read the data from an input text file ('data.txt'). In this file, each line represents one item, with its numerical feature values comma-separated. Any CSV data of this type from an open-source site will do.
After reading the data, convert it into a list. Each element is itself a nested list holding the item's values for each of the features.
def ReadData(fileName):
    f = open(fileName, 'r')
    lines = f.read().splitlines()
    f.close()

    items = []
    for i in range(1, len(lines)):
        line = lines[i].split(',')
        itemFeatures = []
        for j in range(len(line)-1):
            # convert each feature value to a float and collect it
            v = float(line[j])
            itemFeatures.append(v)
        items.append(itemFeatures)
    return items
def FindColMinMax(items):
    n = len(items[0])
    minima = [sys.maxsize for i in range(n)]
    maxima = [-sys.maxsize - 1 for i in range(n)]
Here, minima and maxima are lists holding the per-feature 'min' and 'max' values. The mean feature values are then randomly initialized in the range between the 'min' and 'max' values.
def InitializeMeans(items, k, cMin, cMax):
    # Initialize means to random numbers between the column minima and maxima
Euclidean Distance
In the next step, the Euclidean distance is used as the measure of similarity.
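The Euclidean distance helper itself is not shown in the text; a minimal version consistent with the surrounding code (the name EuclideanDistance is assumed) could be:

```python
import math

def EuclideanDistance(x, y):
    # square root of the sum of squared coordinate differences
    S = 0
    for i in range(len(x)):
        S += math.pow(x[i] - y[i], 2)
    return math.sqrt(S)

print(EuclideanDistance([0, 0], [3, 4]))  # → 5.0
```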
def UpdateMean(n, mean, item):
    # Incrementally update a cluster's mean after its n-th item is added
    for i in range(len(mean)):
        m = mean[i]
        m = (m*(n-1) + item[i]) / float(n)
        mean[i] = round(m, 3)
    return mean
def Classify(means, item):
    # Classify item to the mean with minimum distance
    minimum = sys.maxsize
    index = -1
    for i in range(len(means)):
        # distance of the item from each cluster mean
        dis = EuclideanDistance(item, means[i])
        if dis < minimum:
            minimum = dis
            index = i
    return index
# Calculate means
for e in range(maxIterations):
    for i in range(len(items)):
        item = items[i]
        # classify the item into the nearest cluster and update that cluster's mean
        index = Classify(means, item)
        clusterSizes[index] += 1
        cSize = clusterSizes[index]
        means[index] = UpdateMean(cSize, means[index], item)
        belongsTo[i] = index
return means
Cosine Distance: The cosine distance measure is based on the cosine of the angle between the vectors of two points in n-dimensional space.
Manhattan Distance: The Manhattan distance measure is the sum of the absolute differences between the two data points' coordinates.
Minkowski Distance: The Minkowski distance measure, also called the generalized distance metric, can be employed with both ordinal and quantitative variables.
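These alternative measures can be sketched as follows (minimal illustrative implementations, not taken from the text):

```python
import math

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # generalized distance metric: p=1 gives Manhattan, p=2 gives Euclidean
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def cosine_distance(x, y):
    # one minus the cosine of the angle between the two point vectors
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)

print(manhattan([0, 0], [3, 4]))        # → 7
print(minkowski([0, 0], [3, 4], 2))     # → 5.0 (Euclidean)
print(cosine_distance([1, 0], [0, 1]))  # → 1.0 (orthogonal vectors)
```

Swapping one of these in place of the Euclidean measure in the Classify function changes the shape of the clusters the algorithm finds.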
1) If the amount of data is small, the initial grouping largely determines the final clusters.
2) The number of clusters 'K' must be decided in advance. The downside is that each run may not give the same result, since the clusters depend on the initial random assignments.
3) We cannot identify the real clusters from the same data, because input supplied in a different order can form different clusters when the data size is small.
4) It is sensitive to initialization. Different starting positions can give different clustering results, and the algorithm can get stuck in a local optimum.
1) Relatively efficient and fast: the time complexity is O(tkn), where 'n' is the number of objects or points, 'k' is the number of clusters, and 't' is the number of iterations.
[[ 3.64089227  0.32484208]
 [-1.48626296  5.3677692 ]
 [ 4.66021043  0.4713859 ]
 [-3.461949    3.71875156]
 [-1.90725058  0.398738  ]
 [ 0.01951167  4.73374699]]
[1 0 1 0 0 0]
84.89302060953563
# The determined final centers of clusters
kmeans_clustering.cluster_centers_
array([[ 0.7085658 , 0.57518068],
[ 0.21810856, -1.15192112],
[-1.01524508, 0.50484285]])
# Optimize the cluster algorithm by finding the best number of clusters
kmeans_kwargs = {
    "init": "random",
    "n_init": 15,
    "max_iter": 200,
    "random_state": 0,
}
# compute the sum of squared errors (SSE) for each candidate K
sse_list = []
for k in range(1, 16):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(X)
    sse_list.append(kmeans.inertia_)

plt.style.use("fivethirtyeight")
plt.plot(range(1, 16), sse_list)
plt.xticks(range(1, 16))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()
3.3.1. Overview
In the Hierarchical Clustering method, the unlabeled data points of an unsupervised learning problem are clustered into a hierarchy with a predetermined ordering from top to bottom.
How to Perform?
4. Combine the two closest clusters from the previous step to form a new clustering with one cluster fewer, and repeat.
1. Divisive Clustering
Divisive clustering, also called the top-down clustering method, assigns all observations to a single cluster and then recursively divides that cluster into at least two smaller clusters.
2. Agglomerative Clustering
Agglomerative clustering, also called the bottom-up method, starts with each observation as its own cluster and repeatedly merges the two closest clusters.
The main differences between K means and Hierarchical Clustering are mentioned
in Table 3.6.
Table 3.6. K-means clustering vs. hierarchical clustering.

K-means Clustering:
- Uses the distance between the data points and centroids to form clusters; the clusters are spherical in shape.
- The value of K should be known in advance.
- Advantages: guaranteed convergence; can cluster different sizes and shapes.
- Disadvantage: predicting the K-value is difficult.

Hierarchical Clustering:
- Can be either divisive or agglomerative.
- Depending on the dendrogram, one can stop at any number of clusters.
- Advantages: can handle any similarity in distances; applicable to any attribute type.
- Disadvantage: with large data sets, it can be expensive and slow.
1. Single Linkage
Single linkage clustering adds data points to a cluster one by one. The distance between two clusters is defined as the distance between their two nearest data points.
2. Complete Linkage
Complete linkage is used when the clusters are clearly segregated and very compact. The inter-cluster distance is the longest distance between any two data points across the two clusters.
3. Simple Average
Simple average algorithms define the distance between two clusters as the average distance between their data points, so that both clusters are weighted equally in the final output.
Dendrogram
When we visualize the connections or relationships among all the clusters, we draw a tree-like structure known as a dendrogram. Fig. (3.2) shows the tree-like hierarchical clusters, and Fig. (3.3) shows the concentric clusters. The longer the line joining two clusters, the farther apart the clusters are. To interpret a dendrogram correctly, we concentrate on the distance between two objects [7].
Clades: The sections of the dendrogram are called clades. They are arranged according to similarity.
2. The desired cluster can be obtained by 'cutting' the dendrogram at the correct
level.
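Cutting the dendrogram at a chosen level is equivalent to merging clusters until a desired number remain. This can be sketched in plain Python with single-link distances; the data points below are reused from the K-means example earlier in the chapter:

```python
import math

def single_link(c1, c2, pts):
    # single-link distance: distance between the two nearest members
    return min(math.dist(pts[i], pts[j]) for i in c1 for j in c2)

def agglomerate(pts, k):
    # start with one cluster per point, then repeatedly merge the
    # closest pair of clusters until only k clusters remain
    clusters = [[i] for i in range(len(pts))]
    while len(clusters) > k:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda ab: single_link(clusters[ab[0]],
                                                     clusters[ab[1]], pts))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

pts = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
       (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
print(agglomerate(pts, 2))  # two clusters: {0,1} and {2,3,4,5,6}
```

Stopping at k=2 here yields the same grouping that K-means found on this data.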
The main task is to calculate the distance between two clusters. Different definitions
of distance measures between clusters result in different algorithms. Let us see a
hands-on example to perform agglomerative clustering.
1. Input Setting
First, let us start with clusters of single data points and a distance/proximity matrix.
Let us take some data points p1, p2, p3, and so on. The circles represent the data
points. The chart is the proximity matrix, as seen in Fig. (3.4).
Fig. (3.4). Initial data points, blank clusters and initial proximity matrix.
2. Intermediate State
After the initial merging steps, we get a few clusters, named C1, C2, C3, C4, and C5. The matrix now becomes the clusters' proximity matrix, and the data points p1, p2, …, p12 are each linked to a particular cluster (Fig. 3.5).
3. Intermediate State
In the next step, we merge the two closest clusters, for example, C2 and C5, and update the distance matrix accordingly. The points are now linked to another level of clustering (Fig. 3.6).
After incremental updating and merging, we will obtain the merged clusters and the
points linked to the root cluster (Fig. 3.7).
So ultimately, each cluster is a set of points. To find the distance between two
clusters, there are several alternative approaches. Here, we will see three
approaches, single-link, complete link, and group-average clustering.
The nested cluster structure and dendrograms for the Single-link distance metric
are given below (Figs. 3.8 and 3.9).
Fig. (3.8). Nested cluster showing the final set of data points.
The nested cluster structure and dendrograms for the complete-link distance metric
are given below.
As far as its strengths are concerned, complete linkage produces more balanced clusters (with equal diameters) and is less susceptible to noise. Its weaknesses are that it tends to break large clusters, and that all clusters tend to end up with the same diameter, i.e., small clusters are merged with larger ones.
(Dendrogram figure: leaves 3, 6, 4, 1, 2, 5; merge heights ranging from 0 to 0.25.)
The average-link distance is a compromise between the single- and complete-link distance clustering methods. As far as its strengths are concerned, it is less susceptible to noise and outliers; however, it is biased towards globular clusters.
[[-2.08121364 2.4472446 ]
[-1.89760802 1.97226647]
[ 2.11567076 3.06896151]
[ 1.40252881 4.98069536]
[-0.21258918 3.79697097]]
[2 2 0 0 0]
print(normalized_samples[:5])
0 1
0 -0.996463 -0.084027
1 -0.956947 -0.290264
2 0.969678 0.244386
3 0.383107 0.923704
4 -0.464855 0.885387
k = [2, 3, 6]
3.4.1. Overview
In the early 1980s, Teuvo Kohonen introduced the self-organizing map (SOM), also known as the Kohonen map, inspired by biological neural systems. It is an artificial neural network that follows an unsupervised learning approach, and its training is done through a competitive learning algorithm. SOM is one of the best clustering methods for data spread over a multidimensional space; it maps the data points onto a low-dimensional grid, reducing complex problems to simple, easy-to-interpret representations. SOM has two layers, the input layer and the output layer. The architecture of a self-organizing map (SOM) having two clusters and 'n' inputs may be seen in Fig. (3.14).
Suppose the input data has size (m, n), with 'm' training instances and 'n' features or variables.
At first, set the initial weights of size (n, C), where 'C' is the number of clusters. After initialization, iterate through each training instance and update the winning vector, i.e., the weight vector having the shortest distance from the current training instance. The weight-update formula is:

wᵢⱼ(new) = wᵢⱼ(old) + α(t) · (xᵢₖ − wᵢⱼ(old))

Here, alpha (α) is the learning rate at time 't'; 'j' represents the winning vector; 'i' represents the i-th feature of an instance; and 'k' represents the k-th instance.
After completing the training part of the self-organizing map, the new weights that
are trained are now used for clustering new data samples. When a new data sample
comes, it chooses the cluster having a winning vector (Fig. 3.15).
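A single training step can be traced numerically with NumPy, using the same initial weights as the hands-on example later in this section (the sample vector x is our own illustration):

```python
import numpy as np

# two cluster weight vectors (C = 2) over n = 4 features
w = np.array([[0.2, 0.6, 0.5, 0.9],
              [0.8, 0.4, 0.7, 0.3]])
x = np.array([1.0, 1.0, 0.0, 0.0])  # one training instance
alpha = 0.5                          # learning rate

# winning vector: the weight row closest to the instance
d = ((w - x) ** 2).sum(axis=1)
j = d.argmin()

# move the winner towards the instance: w_j <- w_j + alpha * (x - w_j)
w[j] = w[j] + alpha * (x - w[j])
print(j)     # → 1
print(w[j])  # the winner has moved halfway towards x
```

Here the squared distances are 1.86 and 0.98, so the second weight vector wins and is pulled towards the instance.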
Fig. (3.15). The Self-organizing map projecting the data points on the data plane.
3.4.3.1. Advantage
1. The nice thing about SOMs is that they are simple to comprehend. It is as easy as this: if two items are near each other on the map and there is a grey area between them, they are comparable; if there is a dark ravine between them, they are distinct. Users can rapidly learn how to utilize self-organizing maps, in contrast to techniques such as Multidimensional Scaling.
2. As demonstrated, SOMs classify data efficiently, and their quality can then be assessed, allowing you to compute how good a map is and how strong the similarities between items are.
3.4.3.2. Disadvantage
1. Getting the correct data is a big issue with SOMs. Unfortunately, to construct a
map, you'll require a value for each dimension of each sample member. This is a
limiting characteristic of the usage of SOMs, commonly referred to as missing data,
because it is not always possible and frequently quite challenging to obtain all of
this data.
2. SOMs arrange sample data so that comparable samples are generally surrounded
by similar samples in the result, although similar samples are not necessarily close
to each other. If you have many purple hues, you won't always get one giant cluster
with all the purples in it; occasionally, the clusters will divide, and you'll get two
purple groupings.
4. The more neighbors you utilize to determine the distance for that black and white
similarity map, the better the similarity map you'll receive, but the number of
distances the method needs to compute grows exponentially [7].
Among various perspectives, these four are popular for self-organizing maps:
3. The self-organizing map is one of the best tools for statistical analysis and data visualization:
For complicated data sets, the SOM is commonly utilized as a data mining and visualization approach. In business and medicine, applications include image processing and speech recognition, process management, economic analysis, and diagnostics.
import math

class SOM:

    # Compute the winning vector (the closer of the two weight
    # vectors) using Euclidean distance
    def winning(self, w, data):
        D0 = 0
        D1 = 0
        for i in range(len(data)):
            D0 = D0 + math.pow((data[i] - w[0][i]), 2)
            D1 = D1 + math.pow((data[i] - w[1][i]), 2)
        # the winner is the weight vector with the smaller distance
        if D0 < D1:
            return 0
        else:
            return 1

    # Move the winning vector's weights towards the training sample
    def update(self, w, data, J, alpha):
        for i in range(len(w[J])):
            w[J][i] = w[J][i] + alpha * (data[i] - w[J][i])
        return w

def main():
    # sample binary training vectors (illustrative; 4 features each)
    T = [[1, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1]]
    m, n = len(T), len(T[0])

    # weight initialization ( n, C )
    w = [[0.2, 0.6, 0.5, 0.9], [0.8, 0.4, 0.7, 0.3]]

    # training
    ob = SOM()
    epochs = 3
    alpha = 0.5
    for i in range(epochs):
        for j in range(m):
            # training sample
            data = T[j]
            J = ob.winning(w, data)
            w = ob.update(w, data, J, alpha)

if __name__ == "__main__":
    main()
As we have seen in the above Python example of SOM, we used the Euclidean distance as the similarity measure. However, the distance measure can be changed to alter the result, as described in the overview section. Now, let us see Python pseudo-code for SOM on a real dataset.
# Dataset
from sklearn.datasets import load_iris

# Helper functions such as load_data, closest_node, dist_man, and
# dist_euclidean, and the arrays map, u_matrix, and map1, are assumed
# to be defined elsewhere in the full program.

def common_most(lst):
    # return the most frequent label in a list (used to label map cells)
    return max(set(lst), key=lst.count)

np.random.seed(1)
dimensions = 4
rows = 30
columns = 30
map_range = rows + columns
learn = 0.5
steps = 5000

# 1. load data
Data_train, data_test, lab_train, lab_test = load_data()

for s in range(steps):
    if s % (steps / 10) == 0: print("stepsize = ", str(s))
    pct_left = 1.0 - ((s * 1.0) / steps)
    current_range = (int)(pct_left * map_range)
    current_rate = pct_left * learn

    t = np.random.randint(len(Data_train))
    (bmu_row, bmu_col) = closest_node(Data_train, t, map, rows, columns)
    for i in range(rows):
        for j in range(columns):
            if dist_man(bmu_row, bmu_col, i, j) < current_range:
                map[i][j] = map[i][j] + current_rate * (Data_train[t] - map[i][j])

# 2. build the U-matrix: each cell holds the average distance to its neighbours
for i in range(rows):
    for j in range(columns):
        v = map[i][j]
        sumDig, ct = 0, 0
        if i-1 >= 0:
            sumDig += dist_euclidean(v, map[i-1][j]); ct += 1
        if i+1 <= rows-1:
            sumDig += dist_euclidean(v, map[i+1][j]); ct += 1
        if j-1 >= 0:
            sumDig += dist_euclidean(v, map[i][j-1]); ct += 1
        if j+1 <= columns-1:
            sumDig += dist_euclidean(v, map[i][j+1]); ct += 1
        u_matrix[i][j] = sumDig / ct

plt.imshow(u_matrix, cmap='gray')
plt.show()

for t in range(len(Data_train)):
    (m_row, m_col) = closest_node(Data_train, t, map, rows, columns)
    map1[m_row][m_col].append(lab_train[t])
CONCLUDING REMARKS
CHAPTER 4
Regression: Prediction
Abstract: This chapter introduces the reader to in-depth knowledge of regression analysis. Regression is a concept used both in statistics and computer science, specifically in machine learning. The concept remains unaltered, but the applications differ. In this chapter, we will learn about the two primarily used regression analysis algorithms, linear regression and logistic regression. Each algorithm will be described in detail, with hands-on applications. We will also learn linear and logistic regression in a more elaborate way by demonstrating them through Python programs on real-world datasets.
Dependent Variable: It is the one that we are attempting to figure out or predict.
Independent Variables: These are variables that impact the analysis or target
variable and offer us information about the variables' relationships with the
target variable.
When it comes to regression analysis, what are the most common blunders? When dealing with regression analysis, it is critical to have a thorough understanding of the problem statement. If the problem statement mentions forecasting, we should presumably utilize linear regression. If the problem statement mentions binary classification, we should apply logistic regression. Similarly, we must assess all of our regression models against the problem statement.
Classification aims to predict a discrete class label. Regression, on the other hand,
is the challenge of predicting a continuous variable. There is some overlap between
the classification and regression techniques. A classification algorithm may predict
a continuous value in the form of a probability for a class label. In contrast, a
regression method may predict a discrete value in the form of an integer number.
Some algorithms, such as decision trees and artificial neural networks, may be utilized for both classification and regression with minor tweaks. Other methods, such as linear regression for predictive regression modeling and logistic regression for classification predictive modeling, cannot readily be utilized for both problem types.
1. Stock market analysis: understanding trends in stock prices, predicting prices, and assessing risks in the financial and insurance industries.
2. Business: Understand the efficacy of marketing efforts, foresee pricing, and
product sales.
3. Manufacturing: evaluating the relationships between the variables that affect engine performance.
4.2.1. Overview
Let us take an example. Consider 'Y' as the "dependent" variable whose values we have to predict, and let X1, …, Xn be the "independent" variables. The equation can be framed as:

Y = B0 + B1X1 + B2X2 + B3X3 + ⋯ + BnXn

According to this formula, the prediction for 'Y' is a 'straight-line' function of each of the 'X' variables, keeping the others constant, and the contributions of the various X variables to the prediction are additive. The constants B1, ..., Bn are the coefficients of the variables: the slopes of their individual 'straight-line' relationships with 'Y'. Other factors being fixed, Bi is the change in the expected value of 'Y' per unit change in Xi. The so-called intercept, or extra constant B0, is the prediction the model would make if all of the 'X' were zero. The coefficients and intercept are estimated by least squares, which means they are set to the unique values that minimize the sum of squared errors within the data sample to which the model is fitted [9]. The model's prediction errors are generally assumed to be independently and identically normally distributed.
Linear regression reveals a linear connection: it determines how the value of the dependent variable changes with the value of the independent variable. The relationship is represented by the straight-line equation:

y = mx + b
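The slope m and intercept b can be estimated from data with the closed-form least-squares formulas (a minimal sketch with made-up sample points):

```python
def fit_line(xs, ys):
    # least squares: m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    m = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - m * mx
    return m, b

# points lying exactly on y = 2x + 1
m, b = fit_line([1, 2, 3], [3, 5, 7])
print(m, b)  # → 2.0 1.0
```

For points that do not lie exactly on a line, the same formulas give the line minimizing the sum of squared vertical errors.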
EXAMPLE
Let us assume we have a dataset with the following columns (features): the amount of money a firm spends on television (TV) advertising each year, and its yearly sales in units sold. Here, we are trying to frame an equation that will let us estimate how many units a firm will sell based on its TV advertising spending. The rows (observations) represent companies (Table 4.1).
Our prediction method generates a sales forecast from a company's TV advertising budget and the current Weight and Bias settings.
4.2.1.2.1. Weight
The coefficient of the independent variable; in machine learning, coefficients are called weights.
4.2.1.2.2. Commercials on TV
This is the independent variable, the one the dependent variable is predicted from. In machine learning, such variables are referred to as features.
4.2.1.2.3. Bias
The point where our line crosses the y-axis is known as the intercept. In machine learning, the intercept is called the bias. Bias offsets all of our predictions. Our algorithm will determine the right settings for Weight and Bias, and after training, our equation will approximate the line of best fit (Fig. 4.2).
Fig. (4.2). Example showing trained regression model across TV advertisement and net sales.
𝑓(𝑥, 𝑦, 𝑧) = 𝑤1 𝑥 + 𝑤2 𝑦 + ⋯ . +𝑤𝑖 𝑧
EXAMPLE

Company      TV     Radio   Internet   Magazine
Company A    100    25      49         12
Company B    130    31      27         29
Company C    90     25      14         17
Company D    170    55      29         27
When the number of features starts increasing in any model or example, the
complexity of the model increases accordingly. Thus, if the complexity increases
in our model, it would be challenging to visualize or comprehend the data.
To avoid the difficulty due to the increasing complexity, we can break the data into
small parts and compare one or two features at a time.
Our forecast function generates a sales prediction based on our current weights
(coefficients) and a company's TV, radio, internet, and magazine
advertisement spending. Our model will look for weight values that will reduce our
cost function the most.
def predict(features, weights):
    predictions = np.dot(features, weights)
    return predictions
The linear regression line shows the relationship between a dependent and an independent variable. It can represent two types of relationship between them.
In a positive linear relationship, the value of the dependent variable rises on the Y-axis as the independent variable rises on the X-axis (Fig. 4.3).
A negative linear relationship occurs when the dependent variable falls on the Y-axis while the independent variable rises on the X-axis (Fig. 4.4).
2. The "actual" relationships between our variables are frequently somewhat linear throughout the range of values.
3. Even when they are not, we can frequently transform the variables to make the relationships linear. This is one of the strongest assumptions. Steps involved in the regression model are:
c. Once you've fitted the model, analyze the errors and search for any non-linear patterns in the scatterplots.
d. If you see any non-linear trends, you can remove them using variable transformations and derive a useful prediction with linear regression.
There are five regression algorithms. Let us understand each one of them.
1. Linear (Ordinary Least Squares)
We use the Ordinary Least Squares (OLS) technique to do linear regression. The squared difference between each observed value and the model's prediction is calculated, and the total loss function is minimized:

l = ∑ᵢ (yᵢ − ŷᵢ)²,  summing over i = 1, …, n
2. Polynomial
Polynomial regression extends the feature vector with polynomial terms of the original features:

x = (x₀, x₁)  →  x′ = (x₀, x₀², x₁, x₁², x₀x₁)
3. Lasso
Lasso regression reduces the model's complexity by adding an L1 regularization term, the sum of the absolute values of the weights:

l = ∑ᵢ (yᵢ − ŷᵢ)² + α ∑ⱼ |wⱼ|,  with i = 1, …, n and j = 1, …, p
4. Ridge
Ridge regression lowers the model's complexity in the same way that lasso regression does. The sole difference is that the regularization term is now L2 rather than L1:

l = ∑ᵢ (yᵢ − ŷᵢ)² + α ∑ⱼ wⱼ²,  with i = 1, …, n and j = 1, …, p
5. Stepwise
𝑦 = 𝑎𝑥 + 𝑏(𝑥 − 𝑥̅ )𝐻𝛼 + 𝑐
# Make predictions
y_pred = regressor.predict(X_test)
# The coefficients
print('Coefficients: \n', regressor.coef_)
# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(y_test, y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
% r2_score(y_test, y_pred))
OUTPUT:
Coefficients:
[938.23786125]
Mean squared error: 2548.07
Coefficient of determination: 0.47
# Plot outputs
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=2)
plt.xlabel("X-Axis")
plt.ylabel("Y-Axis")
plt.title("Linear Regression on Diabetes Dataset")
plt.show()
4.3.1. Overview
The logistic function, also known as the sigmoid function, was created by statisticians to characterize population growth in ecology: it rises quickly and eventually levels off at the environment's carrying capacity. It is an S-shaped curve that can map any real-valued number to a value between 0 and 1, though never exactly 0 or 1:

1 / (1 + e^(−value))

where 'e' is the base of the natural logarithm and 'value' is the number to be transformed. The values between -5 and 5 have been converted into the range 0 to 1 using the logistic function, as shown below.
σ(z) = 1 / (1 + e^(−z))
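This mapping can be checked directly with a small sketch (the sample values printed are our own choice):

```python
import math

def sigmoid(z):
    # logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5, -1, 0, 1, 5):
    print(z, round(sigmoid(z), 4))
# sigmoid(-5) ≈ 0.0067, sigmoid(0) = 0.5, sigmoid(5) ≈ 0.9933
```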
Binary classification techniques employ logistic regression: it assigns observations to a discrete set of classes. Unlike linear regression, which produces continuous numerical values, logistic regression produces a probability value that is translated into two or more discrete classes using the logistic sigmoid function.
To forecast an output value, input values (𝒙) are blended linearly using weights or
coefficient values (referred to as the Greek capital letter Beta (𝜷)). The output (𝒚)
value being modeled is a binary value (0 or 1) rather than a numeric number, which
is a significant distinction from linear regression [11].
y = e^(β₀ + β₁x) / (1 + e^(β₀ + β₁x))
Where 𝒚 is the anticipated output, 𝜷𝟎 is the bias or intercept term, and 𝜷𝟏 is the
single input value coefficient (𝒙). The 𝜷 coefficient for each column in your input
data must be learned from your training data.
Suppose you are given two different data sets and asked to perform linear and
logistic regression. One data set is time spent studying, and the second is about
exam scores of different students. Both the regression methods will predict different
things:
Linear Regression: Predictions from the linear regression are continuous. Thus,
they will help predict the score of students between the range of 0 to 100.
Logistic Regression: Predictions from the logistic regression are discrete; thus, it
will help predict whether the student passed the exam or failed.
When there are just two categories, binary classification is the simplest type of
classification issue. Binary logistic regression aims to train a classifier that can
determine a new input observation class using a binary choice.
Take a look at an example: whether a student passes or fails a test is determined by the number of hours spent sleeping and studying. Thus, the hours spent sleeping and studying are the features, while passing (1) and failing (0) are the labels. The problem's data set is listed in Table 4.3.
Table 4.3. Table showing result of students based on the hours spent in study and sleep.
In the linear regression section we performed regression, so we had a single output ŷ and attempted to make it as close to the real target y as feasible. Here, classification is used instead of regression, with each input X being assigned to one of L classes.
So far, we've looked at the case where each training sample x either belongs to a specific class (y = 1) or doesn't (y = 0). However, we can extend this concept to the scenario where x can be a member of many classes at the same time, i.e., y ∈ {0, 1}^C, where C is the number of classes. For example, if the inputs are images and we have three classes ("horse," "truck," and "plant"), we may say that a given image contains a truck and a plant.
4.3.1.3.1. Procedure
i. Predict the probability that the observations are in that single class.
ii. Prediction = finding the maximum value of the probability of the categories or
classes.
For each sub-problem, we select one class (𝒚 = 𝟏) and combine all the others into
a second class (𝒚 = 𝟎). After that, we choose the class with the greatest estimated
value.
Binary logistic regression with the Sigmoid function and multi-class logistic
regression with a SoftMax are the most used methods.
1. Sigmoid Function
The sigmoid function (also known as the standard logistic function) is defined for a scalar real number z as:

σ(z) = 1 / (1 + e^(−z))
It returns values in the open interval (0, 1). Because of this characteristic, it may be used to interpret a real-valued score 𝒛 as a probability.
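As a quick sketch (not code from the text), the sigmoid can be written directly from the definition above:

```python
import math

def sigmoid(z):
    """Standard logistic function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))  # 0.5: a score of zero maps to an even probability
```

Note the symmetry sigmoid(−z) = 1 − sigmoid(z), which is why the two class probabilities in binary logistic regression always sum to one.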
2. Softmax Function
𝑆𝑜𝑓𝑡𝑚𝑎𝑥(𝑥ᵢ) = 𝑒^(𝑥ᵢ) / Σⱼ₌₁ⁿ 𝑒^(𝑥ⱼ)
Each element of 𝑺𝒐𝒇𝒕𝒎𝒂𝒙(𝒙) is compressed into the range [𝟎, 𝟏], and the elements sum to 1. As a result, the softmax function converts an arbitrary vector of real values into a discrete probability distribution.
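A minimal sketch of the formula above; subtracting the maximum is a standard numerical-stability detail not discussed in the text:

```python
import math

def softmax(x):
    # Subtracting the maximum does not change the result but avoids overflow.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])  # each entry in (0, 1), entries sum to 1
```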
3. Decision Boundary
A threshold can be established to forecast which class a data point belongs to. The derived estimated probability is assigned to a class based on this threshold. For example, if the predicted probability that an email is spam is 0.5 or greater, categorize the email as spam; otherwise, label it as not spam.
There are two types of decision boundaries: linear and non-linear. To provide a
complicated decision boundary, the polynomial order can be raised.
4. Cost Function
The cost function for logistic regression differs from that of linear regression. Linear regression uses the "mean squared error (MSE)"; if MSE were used for logistic regression, the cost as a function of the parameters (theta) would be non-convex. Gradient descent is guaranteed to reach the global minimum only when the function is convex.
Logistic regression models the probability of the default class. Suppose we are modeling packages as cargo shipment or general shipment based on their weight. In that case, the first class might be cargo, and the logistic regression model could be expressed as the probability of cargo given a package's weight:
𝑃(𝑝𝑎𝑐𝑘𝑎𝑔𝑒 = 𝑐𝑎𝑟𝑔𝑜|𝑤𝑒𝑖𝑔ℎ𝑡)
To put it another way, we're modeling the probability that an input (𝑿) belongs to
the default class (𝒀 = 𝟏); we can write it down like this:
𝑦 = 𝑒^(𝛽₀ + 𝛽₁𝑥) / (1 + 𝑒^(𝛽₀ + 𝛽₁𝑥))
The above equation can be inverted as follows. By taking the natural logarithm (𝒍𝒏), we may eliminate the 𝒆 from one side:

ln(𝑃(𝑥) / (1 − 𝑃(𝑥))) = 𝛽₀ + 𝛽₁𝑥
This is important because we can see that the output on the right is calculated
linearly again, and the input on the left is a log of the default class's probability.
The ratio on the left is the odds of the default class. Odds are computed as the probability of an event occurring divided by the probability of the event not occurring. For example, a probability of 0.6 gives odds of 0.6/(1 − 0.6) = 1.5. Instead, we could write:
𝑙𝑛(𝑂𝑑𝑑𝑠) = 𝛽0 + 𝛽1 𝑥
We call the left-hand side the log-odds, or logit, since the odds are log-transformed. Although various types of functions can be used for the transform, the function that connects the linear regression equation to the probabilities is commonly referred to as the link function. Exponentiating both sides gives:
𝑂𝑑𝑑𝑠 = 𝑒 𝛽0 +𝛽1 𝑥
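The odds arithmetic above can be checked in a few lines (a sketch, not code from the text):

```python
import math

p = 0.6
odds = p / (1 - p)         # 1.5, the worked example above
log_odds = math.log(odds)  # the logit of p

# Applying the sigmoid to the log-odds recovers the original probability.
recovered = 1 / (1 + math.exp(-log_odds))
```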
All of this demonstrates that the model is still a linear combination of inputs but
that this linear combination is related to the default class's log-odds.
The optimal coefficients would result in a model that predicted a value very close
to 1 for the default class (i.e., cargo package) and a value extremely close to 0 for
the other class (i.e., general package). The idea behind maximum-likelihood
logistic regression is that a search method looks for coefficients that decrease the
difference between the model's predicted probabilities and the values in the data.
It's as simple as putting numbers into the logistic regression equation and
computing a result to make predictions with a logistic regression model.
Let's see an example. Assume we have a model that predicts whether a package will be transported by freight (cargo) train or by general transportation depending on its weight. If a shipment weighs 250 kilograms, will it be transported by freight rail or by an ordinary passenger train? The coefficients 𝜷𝟎 = −110 and 𝜷𝟏 = 0.5 have been learned. We can compute the probability of a package with a weight of 250 kg using the equation above, P(package = cargo | weight = 250):
𝑦 = 𝑒^(𝛽₀ + 𝛽₁𝑥) / (1 + 𝑒^(𝛽₀ + 𝛽₁𝑥))

𝑦 = 𝑒^(−110 + 0.5 × 250) / (1 + 𝑒^(−110 + 0.5 × 250))

𝑦 ≈ 0.999999694
Here, the probability is near 1, which means the package will be shipped via cargo
rail.
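The calculation above can be reproduced directly (a sketch using the text's coefficients):

```python
import math

beta0, beta1 = -110.0, 0.5
weight = 250.0

z = beta0 + beta1 * weight           # -110 + 0.5 * 250 = 15
prob_cargo = 1 / (1 + math.exp(-z))  # P(package = cargo | weight = 250)
print(round(prob_cargo, 9))          # 0.999999694, matching the text
```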
The assumptions that logistic regression makes about the distribution and relationships in your data are quite similar to those of linear regression. In the end, the goal of predictive modeling and machine learning projects is to make correct predictions rather than to interpret the model. As a result, as long as the model is robust and performs well, you can relax some of these assumptions.
4. Remove Correlated Data Values: If you have several strongly correlated inputs, the model can overfit, much like linear regression. Calculate the pairwise correlations between all inputs and exclude those that are highly correlated.
# Get dataset
from sklearn.datasets import load_iris
X_data, y_data = load_iris(return_X_y=True)
Predictions: [1 1 0 2 0 2 0 2 2 1 1 2 1 2 1 0 1 1 0 0 1 1 0 0 2 0 0 2 1 0 2 1 0 2 2 1 0 1]
Probabilities: [[8.78532904e-02 5.87258994e-01 3.24887716e-01]
[7.01767782e-03 6.65604088e-01 3.27378234e-01]
[9.57854812e-01 2.16742805e-02 2.04709071e-02]
[3.22859757e-04 1.38904034e-01 8.60773106e-01]
[9.11913061e-01 7.08308459e-02 1.72560932e-02]
…
…
…
[3.49142225e-01 5.41654473e-01 1.09203302e-01]
[7.58113591e-01 1.22837251e-01 1.19049158e-01]
[3.30693805e-02 5.67477098e-01 3.99453521e-01]]
Score: 0.7894736842105263
# The coefficients
print('Coefficients: \n', regression_classifier.coef_)
# 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, predictions))
Coefficients:
[[-2.5238339 2.04057617]
[ 0.48047069 -1.37876396]
[ 2.04336321 -0.66181221]]
Mean squared error: 0.21
Coefficient of determination: 0.63
# Plot outputs
# Plot the decision boundary.
x_min, x_max = X_data[:, 0].min() - .5, X_data[:, 0].max() + .5
y_min, y_max = X_data[:, 1].min() - .5, X_data[:, 1].max() + .5
h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = regression_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(6, 4))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
plt.show()
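The snippets above are fragments of a longer script. A self-contained sketch of the same workflow might look as follows; the variable names mirror the excerpts, but the train/test split, the restriction to the first two iris features, and `max_iter` are assumptions, not details from the text:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Get dataset; keep two features so the decision regions can be plotted.
X_data, y_data = load_iris(return_X_y=True)
X_data = X_data[:, :2]

X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.25, random_state=0)

regression_classifier = LogisticRegression(max_iter=1000)
regression_classifier.fit(X_train, y_train)

predictions = regression_classifier.predict(X_test)
probabilities = regression_classifier.predict_proba(X_test)
print('Score:', regression_classifier.score(X_test, y_test))
```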
CONCLUDING REMARKS
CHAPTER 5
Reinforcement Learning
Abstract: This chapter introduces the readers to a new concept of machine learning,
other than supervised and unsupervised learning. This concept is popularly known as
reinforcement learning. Reinforcement learning is a kind of machine learning algorithm in which the model learns on its own, based on the behavior of its surroundings and a technique of rewarding. This chapter will gradually teach the readers about each
concept for understanding reinforcement learning in-depth, alongside a basic
application with Python. This chapter will look at the concepts to understand why it
is getting so much attention these days. This serves as a beginner's guide to
reinforcement learning. Reinforcement learning is undoubtedly one of the most
visible study areas at the moment, with a promising future ahead of it, and its
popularity is growing by the day.
Its purpose is to teach the model how to conduct certain activities in a specific
environment, leading to the discovery of the best actions to take in various scenarios
to reach the ultimate objective. Using repeated trials to maximize a cumulative
reward, the agent learns to attain a goal in an unpredictable and complicated
environment. The agent uses a trial-and-error approach to find a solution to the problem, and the actions it takes earn it either rewards or penalties. Reinforcement
Learning is used to determine the optimum behavior or course to follow in a
particular circumstance.
Let us now understand the concept in a more elaborate way, rather than just mentioning the terminologies. Reinforcement learning is a relatively new technique of machine learning. It is therefore still in the research and development phase as far as utilizing its full capacity is concerned.
Indranath Chatterjee
All rights reserved-© 2021 Bentham Science Publishers
5.1.1. Overview
Machine learning is a section of computer science dealing with the creation and
development of algorithms that enable computers to learn from data sources, such
as sensor data or databases. Detecting complicated patterns and making intelligent
judgments based on data is a crucial focus of machine learning research.
RL is about agents who must notice and react to their surroundings. This method
combines traditional AI and machine learning methods. It is a complete problem-
solving environment.
It is neither a sort of neural network nor a substitute for neural networks; rather, it is an orthogonal technique for the learning machine. Reinforcement learning provides feedback that evaluates the learner's performance without establishing criteria of accuracy in the form of behavioral objectives. For example, consider learning to ride a bicycle.
The model is trained with a training dataset that includes a correct response key in
supervised learning. The decision is based on the initial input, which contains all
the data necessary to train the computer. Because the judgments are distinct from
one another, each one is represented by a label.
Before getting deeper into the working model of reinforcement learning, let us first
understand the key terminologies used in RL. Fig. (5.1) states the overall
architecture of reinforcement learning. We observe two significant elements: the agent and the environment. In this diagram, these two blocks are connected by three arrows. The first is the state, which the environment presents to the agent. The second is the action, which the agent applies to the environment, guided by a policy. The third is the reward, which the environment returns to the agent [12]. Fig. (5.1) illustrates an overview of the working of reinforcement learning.
Reward: In an RL problem, the reward function defines the aim. To attain this
purpose, the policy is changed.
Value Function: A reward function explains what is desirable in the short term,
but a value function describes what is beneficial over time. The value of a state is
the total amount of reward an agent might expect to get in the future, starting from
that state.
Environment Model: The process replicates the behavior of the environment using
a model of the environment. It is used for planning, and if one knows its present
state and action, one can forecast the next state and reward.
There are quite a few methods to solve the problem in reinforcement learning.
1. Positive Reinforcement
2. Negative Reinforcement
Reinforcement learning uses the evaluative feedback technique. There are two
kinds of evaluative feedback techniques available.
It indicates the proper course of action to take, regardless of the activity taken.
For example, supervised learning.
o Associative
It is situation-specific.
It helps in mapping from a circumstance to the optimum actions for that scenario.
o Non-associative
When the task is fixed, the learner tries to discover a single best action; when the task is not fixed, it tries to track the best action as it varies over time.
o Greediest action
The greediest action is the one with the greatest expected return.
Ɛ-greedy
Now and then, with a small probability (Ɛ), an action is chosen at random, uniformly and regardless of the action-value estimates; otherwise, the greedy action is chosen.
Ɛ-soft
The optimal action is chosen with probability (1–Ɛ), and a random action is
chosen uniformly the remainder of the time.
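The Ɛ-greedy rule can be sketched in a few lines (an illustration, not code from the text):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # greedy action

# With epsilon = 0 the choice is purely greedy: action 1 has the highest value.
print(epsilon_greedy([1.0, 5.0, 3.0], epsilon=0.0))  # 1
```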
𝑆𝑡𝑎𝑡𝑒 = 𝑆
𝐴𝑐𝑡𝑖𝑜𝑛 = 𝑎 ∈ 𝐴(𝑆)
𝑃𝑜𝑙𝑖𝑐𝑦 = 𝜋(𝑆) → 𝑎
𝑂𝑝𝑡𝑖𝑚𝑎𝑙 𝑝𝑜𝑙𝑖𝑐𝑦 = 𝜋*
Let us see an example to understand MDP. For example, let us consider a fair dice-rolling game. The die is fair and has six faces, numbered 1 to 6. Fig. (5.2) illustrates an example showing the dice game using the MDP model. The conditions of the game are:
2. If you resign, you will be awarded 10 points, and the game will be over.
3. If you keep going, you'll get 6 points and roll the six-sided die. The game is over if the die lands on a 1 or 2. If not, the game moves on to the following round.
4. There's an obvious trade-off to be made here: we can trade a guaranteed gain of 4 points for the chance to roll the die and advance to the next round.
To apply MDP in reinforcement learning for this application, let us define the
𝑺, 𝑨, 𝑷, 𝑹, according to the conditions of the game.
Agent: The agent might be in the game or out of the game in the dice game.
Reward: Depending on the action, rewards are offered. Continuing the game earns
you 6 points, while leaving earns you 10 points. It is necessary to maximize the
'overall' return.
Fig. (5.2). Example showing the dice game using the MDP model.
Bellman's Equation
To understand MDP, Bellman's equation plays a vital role. It is the core part of
Markov's decision process model. It lays forth a framework for calculating the
optimal projected reward in a given state by addressing the question, "What is the
maximum reward an agent may earn if they do the best action now and in the
future?"
Where,
𝑽(𝑺) stands for the expected return during the current state;
𝑹(𝑺, 𝒂) stands for the expected reward for action 'a' at state' S';
𝜸 𝑽(𝑺′) stands for discount factor (𝜸) times the return for next state;
The importance of 𝜸, which ranges from zero to one (𝟎 ≤ 𝜸 ≤ 𝟏), should be noted when calculating the best reward. The 𝑽(𝑺′) term is wiped out entirely when 𝜸 = 0, and the model cares only about the immediate reward. If 𝜸 is set to 1, on the other hand, the model weighs prospective future rewards equally with immediate rewards. Thus, the best 𝜸 value generally lies between 0 and 1, with the value of further-out rewards declining with the gap.
Let's utilize the Bellman formula to figure out how much we could win in the dice game. Because we have two options, our expanded equation will be max(the reward of choice 1, the reward of choice 2). Choice 1 is to withdraw from the game, giving a one-time reward of 10 points.
However, choice 2 results in a 6-point reward as well as a 2/3 probability of progressing to the following round, where the choice can be made again (calculated as an expected return). We add a discount factor 𝜸 in front of the terms to indicate the calculation of 𝑺′. Even with a maximum 𝜸 = 𝟏, the following equation is recursive, but it will finally converge to one value, since the value contributed by each following iteration drops by a factor of 2/3:

max(10, 6 + (2/3)𝛾 · max(10, 6 + (2/3)𝛾 · max(10, 6 + (2/3)𝛾 · … )))
At each stage, either we withdraw from the game and receive 10 points in expected value, or we stay and receive 6 points in expected value plus two-thirds of the following round's value, because there is a two-thirds chance of continuing when the agent decides to stay.
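The recursion above can be checked numerically. The sketch below (not code from the text) assumes 𝜸 = 1 and iterates V ← max(10, 6 + (2/3)·𝜸·V) until it stops changing:

```python
gamma = 1.0
V = 0.0
for _ in range(200):
    V_new = max(10.0, 6.0 + (2.0 / 3.0) * gamma * V)
    if abs(V_new - V) < 1e-9:
        break
    V = V_new

print(round(V, 6))  # 18.0: staying in the game beats the guaranteed 10
```

The fixed point solves V = 6 + (2/3)V, i.e., V = 18, so withdrawing for 10 points is never optimal under these rules.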
1. Value Functions
i. Under a policy, the value function, denoted 𝑽(𝑺), expresses how good a state is for an agent. In other words, beginning from the current state under policy 𝝅, what is the average return that the agent will receive?
ii. The expected return when starting in state 𝑺 and following the policy 𝝅.
iii. As previously stated, the policy maps each state to the probability of taking each conceivable action (𝝅(𝒂|𝒔)). When a policy states precisely what to do in each state and does not provide probabilities, it is said to be deterministic.
iv. It should now be self-evident that maximizing the value function for each state
would yield "the optimal policy (𝝅 ∗)". The following is the equation of optimal
policy:
ii. The expected return when starting in state 𝑺, performing action 𝑨, and following
policy 𝝅.
Dynamic Programming
Dynamic Programming was coined by Richard Bellman and is used to solve issues
that can be broken down into subproblems. With the crucial assumption that the
features of the environment are known, dynamic programming can assist in
identifying optimum solutions to industry planning challenges. It is a significant
step to begin learning about RL algorithms that can tackle more complicated tasks.
We can effectively identify an optimum policy for the agent to follow given the
complete model and specifications of the environment (MDP). It consists of two
significant steps:
2. Subproblem solutions are cached or saved for later use to obtain the best solution
to the problem.
There are two popular methods for Dynamic programming, which are as follows:
Bellman equations define a system of 𝒏 equations that could solve the problem. However, we will use the iterative version of the equation, as stated above: start with an arbitrary value function 𝑽₀ and continue to iterate until 𝑽ₖ converges.
As a result, the new policy will almost certainly be better than the old one, given
enough iterations. It will eventually return the best policy. This sounds excellent,
but there is a catch; each iteration of policy iteration includes another iteration of
policy review, which might necessitate repeated sweeps over all states.
5.2.4.2. Q-learning
Instead of identifying the potential worth of the state being moved to, Q-learning proposes evaluating the quality of the action taken to reach that state. In Q-learning, we do not know the transition probabilities, because they are not explicitly described in the model. Instead, the model must learn them on its own by interacting with the environment.
We know this equation returns a value of going to a specific state, considering the
environment's stochasticity. Suppose we add the concept of evaluating the quality
of activities taken to reach a specific state 𝑺′. To determine 𝑸(𝑺, 𝒂), i.e., the
cumulative quality of its possible actions, we must first deconstruct the formula.
When we remove the max() method, we obtain the following:
Generally, we include all potential actions and states in the equation that yields
𝑽(𝑺), and then we take the most prominent value induced by doing a given action.
Thus, the value footprint generated by the calculation above is for only one possible
action. In reality, we might consider it to be the action's quality.
We'll make a minor tweak to the previous equation now that we have an equation
to assess the quality of a specific action. Now, we may state that 𝑽(𝑺) is the highest
of all possible Q values (𝑺, 𝒂). Let's use this to our equation and replace 𝑽(𝑺′) with
a function of Q().
We just need to compute one function, 𝑸, and 𝑹(𝑺, 𝒂) is a quantified measure that
generates rewards for going to a specific state. Q-values are the quality of the
actions.
The temporal difference will assist the computer in computing Q-values in response to environmental changes over time. Consider the computer in the indicated state, wishing to move to a higher-value state. It's worth noting that the robot (automated system) already knows the Q-value of taking the action that leads to the higher state.
We know that the environment is stochastic. Therefore, the robot's reward after
shifting to the top state may differ from previously seen. We use the same procedure
to recalculate the new 𝑸(𝑺, 𝒂) and deduct the previously calculated 𝑸(𝑺, 𝒂) from
it.
In the above equation, if we replace 𝑇𝐷𝑡 (𝑎, 𝑆) with its complete mathematical
form, then the final equation would be:
𝑄𝑡 (𝑆, 𝑎) = 𝑄𝑡−1 (𝑆, 𝑎) + 𝛼(𝑅(𝑆, 𝑎) + 𝛾 𝑚𝑎𝑥 𝑄(𝑆′, 𝑎′) − 𝑄𝑡−1 (𝑆, 𝑎))
In the above equation, the 𝛼 is called the learning rate. It maintains the adaptability
of the computer to random environmental changes. 𝑄𝑡 (𝑆, 𝑎) is the Q-value at the
current state, and 𝑄𝑡−1 (𝑆, 𝑎) is the Q-value at an earlier state.
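A minimal sketch of this update rule on a toy table (the function name and the toy numbers are illustrative, not from the text):

```python
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One temporal-difference step of the Q-learning rule above."""
    td = reward + gamma * max(Q[next_state]) - Q[state][action]
    Q[state][action] += alpha * td
    return Q[state][action]

# Two states, two actions, all Q-values starting at zero.
Q = [[0.0, 0.0], [0.0, 0.0]]
q_update(Q, state=0, action=1, reward=5.0, next_state=1)  # Q[0][1] -> 0.5
```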
It is instructed to move in only four directions: up, down, left, and right. Corner
movements are not allowed.
Fig. (5.3). Example showing the goal and path of a reinforcement learning example.
It will get a reward (R) of +5 at the block position [4,3], and a reward of −5 if it reaches the block position [4,2]. For every step it moves, the robot is given a reward of −0.04; i.e., the robot's goal is to reach the target in as few steps as possible. The blue block at position [2,2] is a 'blocked' cell, which needs to be avoided. The start position is [1,1].
Our target is to find a suitable solution for the robot to reach the target, earn the
maximum reward, and maintain the states, actions, and rewards. As we know, in
reinforcement learning, the actions are stochastic, and the policy maps each state to an action.
The robot will take probabilistic steps to move through the room of blocks to earn the reward. However, the optimal steps may be similar to those mentioned below. The figure below shows the robot's movement across the room for a step-wise reward of −0.04 (Fig. 5.4).
As we know, the optimal solution is not easy to achieve. Now, let us see some
probable steps if we keep changing the 'Step-wise' reward (Fig. 5.5).
Till now, we learned the basics of reinforcement learning and its detailed
architecture. After understanding the working principle of RL, we are here to see
some real-life applications of RL, which will be mentioned one by one, in short.
The following are the applications that we can build, develop, and improve.
In this application, the robot can be taught to move more effectively by optimizing
its strategy based on its rewards.
Action: One of four movements: (1) moving forward, (2) moving backward, (3) turning left, and (4) turning right.
Reward: Reward value is added when the robot reaches its goal. If it moves in the wrong direction, halts, or falls, reward points are subtracted.
This application has the potential to increase the value of a customized class system.
For example, more effective learning can help the user, and more successful
advertising can benefit the system.
Agent: In an online learning catalog, the program determines what to display next.
Reward: If the user decides to click the class video, the reward is positive; if the
user decides to click the advertisement, the reward is more extensive; and if the
user decides to leave, the reward is negative.
In this application, the agent monitors the surroundings and determines the present state. The state can indicate how many adverts are currently on the page and whether or not additional adverts can be added. The agent then selects one of the three options at each phase. It can learn an effective policy if it is designed to receive positive incentives whenever revenue increases and negative incentives when revenue decreases.
Agent: The program that determines how many advertisements are suitable for a
particular page.
Action: One of the three options: (1) adding a new advertisement to the page, (2)
deleting an advertisement from the page, (3) neither adding nor deleting an
advertisement from the page.
Let us now dig into some other real-world applications of reinforcement learning,
which are given below:
4. Application in Healthcare
As users' tastes vary regularly, offering news to people based on ratings and likes can rapidly become outdated. A reinforcement learning system may instead track the reader's return behavior.
Obtaining news features, reader features, context features, and reader news features
would be required to build such a system. Content, headline, and publisher are just
a few examples of news features. The reader's interaction with the material, such as
clicks and shares, is referred to as reader features. News elements such as time and
freshness of the news are examples of context characteristics. Following that, a
reward is determined depending on the user's actions.
There are several factors to consider with self-driving automobiles, including speed
limitations in various locations, drivable zones, and avoiding crashes, to name a
few.
Forecasting future sales and stock prices may both be done with supervised time
series models. On the other hand, these models do not determine what to do at a
given stock price. A reinforcement learning agent can select whether to keep,
purchase, or sell a task. To guarantee that the RL model is working correctly, it is
assessed using market benchmark criteria.
Unlike prior techniques, which required analysts to make every choice, automation
ensures uniformity throughout the process. Every financial transaction's loss or
profit is used to calculate the reward function.
8. Application in Gaming
Let's take a look at a game program, specifically AlphaGo Zero. AlphaGo Zero was able to learn the game of Go from the ground up using reinforcement learning: it figured out how to play by competing against itself. After 40 days of self-training, AlphaGo Zero surpassed the version of AlphaGo known as Master, which had beaten world number one Ke Jie. It had a single neural network and utilized only the black and white stones
from the board as input characteristics. A basic tree search based on a single neural
network is utilized to evaluate location and sample movements without employing
Monte Carlo rollouts.
5.4.1. Advantages
5.4.2. Disadvantages
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edges_from(points_list)
pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos)
nx.draw_networkx_edges(G, pos)
nx.draw_networkx_labels(G, pos)
plt.show()
    if point[0] == goal:
        R[point[::-1]] = 100
    else:
        # reverse of point
        R[point[::-1]] = 0
(0, 1)
(1, 5)
(5, 6)
(5, 4)
(1, 2)
(2, 3)
(2, 7)
R
matrix([[ -1., 0., -1., -1., -1., -1., -1., -1.],
[ 0., -1., 0., -1., -1., 0., -1., -1.],
[ -1., 0., -1., 0., -1., -1., -1., 100.],
[ -1., -1., 0., -1., -1., -1., -1., -1.],
[ -1., -1., -1., -1., -1., 0., -1., -1.],
[ -1., 0., -1., -1., 0., -1., 0., -1.],
[ -1., -1., -1., -1., -1., 0., -1., -1.],
[ -1., -1., 0., -1., -1., -1., -1., 100.]])
Q = npy.matrix(npy.zeros([MATSIZE,MATSIZE]))
# learning parameter
gamma = 0.8
StateInit = 1
def actAvail(state):
    stateCurr_row = R[state,]
    av_act = npy.where(stateCurr_row >= 0)[1]
    return av_act

totalAvailAct = actAvail(StateInit)

def nextActSample(actAvail_range):
    nextAct = int(npy.random.choice(actAvail_range, 1))
    return nextAct
action = nextActSample(totalAvailAct)
if IndMax.shape[0] > 1:
    IndMax = int(npy.random.choice(IndMax, size=1))
else:
    IndMax = int(IndMax)
ValMax = Q[action, IndMax]
ValMax 0.0
0
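The `update` function called during training is not shown in the excerpt. Below is a plausible reconstruction, following the Q-learning rule discussed earlier, with the landed-on node doubling as the next state and the printed 'Score' taken to be the normalized sum of the Q matrix; the R and Q setup is restated from the text so the sketch runs on its own:

```python
import numpy as npy

# Reward matrix from the text: -1 marks absent edges, 0 valid moves, 100 the goal.
R = npy.matrix([[-1,  0, -1, -1, -1, -1, -1,  -1],
                [ 0, -1,  0, -1, -1,  0, -1,  -1],
                [-1,  0, -1,  0, -1, -1, -1, 100],
                [-1, -1,  0, -1, -1, -1, -1,  -1],
                [-1, -1, -1, -1, -1,  0, -1,  -1],
                [-1,  0, -1, -1,  0, -1,  0,  -1],
                [-1, -1, -1, -1, -1,  0, -1,  -1],
                [-1, -1,  0, -1, -1, -1, -1, 100]]).astype(float)
MATSIZE = 8
Q = npy.matrix(npy.zeros([MATSIZE, MATSIZE]))
gamma = 0.8

def update(stateCurr, action, gamma):
    # Best Q-value reachable from the node we move to (`action` is that node).
    IndMax = npy.where(Q[action,] == npy.max(Q[action,]))[1]
    if IndMax.shape[0] > 1:
        IndMax = int(npy.random.choice(IndMax))
    else:
        IndMax = int(IndMax[0])
    ValMax = Q[action, IndMax]
    Q[stateCurr, action] = R[stateCurr, action] + gamma * ValMax
    # The 'Score' printed during training: normalized sum of the Q matrix.
    if npy.max(Q) > 0:
        return npy.sum(Q / npy.max(Q) * 100)
    return 0

score = update(2, 7, gamma)  # moving 2 -> 7 hits the goal edge, R = 100
```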
# Training
scores = []
for i in range(700):
    stateCurr = npy.random.randint(0, int(Q.shape[0]))
    totalAvailAct = actAvail(stateCurr)
    action = nextActSample(totalAvailAct)
    score = update(stateCurr, action, gamma)
    scores.append(score)
    print('Score:', str(score))
# Testing
stateCurr = 0
steps = [stateCurr]
while stateCurr != 7:
    # Greedy step: pick the column with the highest Q-value in the current row
    IndNextStep = npy.where(Q[stateCurr,] == npy.max(Q[stateCurr,]))[1]
    if IndNextStep.shape[0] > 1:
        IndNextStep = int(npy.random.choice(IndNextStep, size=1))
    else:
        IndNextStep = int(IndNextStep)
    steps.append(IndNextStep)
    stateCurr = IndNextStep
plt.plot(scores)
plt.show()
ValMax 0.0
Score: 0
ValMax 0.0
Score: 0
ValMax 0.0
Score: 0
….
ValMax 0.0
Score: 100.0
ValMax 0.0
Score: 100.0
ValMax 0.0
Score: 100.0
….
ValMax 195.20000000000002
Score: 253.77049180327867
ValMax 195.20000000000002
Score: 333.7704918032787
ValMax 195.20000000000002
Score: 413.7704918032787
ValMax 0.0
Score: 413.7704918032787
ValMax 195.20000000000002
Score: 413.7704918032787
ValMax 195.20000000000002
Score: 413.7704918032787
….
ValMax 220.12313600000004
Score: 793.3629135606895
ValMax 156.12313600000004
Score: 793.3629135606895
ValMax 156.12313600000004
Score: 793.3629135606895
ValMax 320.09850880000005
Score: 804.5956028896065
ValMax 156.12313600000004
Score: 804.5956028896065
ValMax 320.09850880000005
Score: 815.8282922185235
….
ValMax 215.09043650560008
Score: 903.7658383345106
ValMax 215.09043650560008
Score: 903.7658383345106
ValMax 359.0904365056001
Score: 908.778277635306
ValMax 215.09043650560008
Score: 908.778277635306
ValMax 359.0904365056001
Score: 908.778277635306
….
ValMax 319.75851256065664
Score: 982.2907658174032
ValMax 319.75851256065664
Score: 982.2907658174032
ValMax 319.75851256065664
Score: 982.2907658174032
ValMax 399.95948513068856
Score: 982.2907658174032
ValMax 499.9493564133607
Score: 982.2932982532384
ValMax 319.75851256065664
Score: 982.2932982532384
Trained Q matrix:
[[ 0. 63.95818066 0. 0. 0.
0. 0. 0. ]
[ 51.16654452 0. 79.94772582 0. 0.
51.16654452 0. 0. ]
[ 0. 63.95818066 0. 63.99065531 0.
0. 0. 100. ]
[ 0. 0. 80. 0. 0.
0. 0. 0. ]
[ 0. 0. 0. 0. 0.
51.16654452 0. 0. ]
[ 0. 63.95818066 0. 0. 40.93323562
0. 40.93323562 0. ]
[ 0. 0. 0. 0. 0.
51.16654452 0. 0. ]
[ 0. 0. 79.94772582 0. 0.
0. 0. 100. ]]
Most efficient path:
[0, 1, 2, 7]
CONCLUDING REMARKS
CHAPTER 6
Deep learning is a subclass of machine learning that uses neural networks to learn from unstructured or unlabeled data, in a manner comparable to that of the human brain. Given vast amounts of data, the system will eventually learn to comprehend it and respond in meaningful ways.
Deep learning is a kind of machine learning that learns to represent the world as a layered hierarchy of concepts. Each concept is defined in terms of simpler concepts, and more abstract representations are computed from less abstract ones. Each concept is represented in connection with other concepts. This provides deep learning with an ample amount of power and flexibility.
o Deep learning can operate with structured and unstructured data, whereas
machine learning works best with vast quantities of structured data.
o Machine learning methods detect patterns from labeled sample data, whereas
deep learning algorithms receive a considerable volume of data as input and analyze
it to extract characteristics from an object.
A deep neural network (DNN) comprises several layers, transforming the input data
into more abstract representations (for example, edge - nose - face). To generate
predictions, the output layer integrates such features.
It is a sophisticated neural network, having multiple hidden layers between the input and output layers. Such networks can model and process non-linear relationships.
Deep-learning architectures like deep neural networks, deep belief networks, neural
graph networks, recurrent neural networks, and convolutional neural networks are
employed in areas like natural language processing (NLP), machine translation,
computer vision, bioinformatics, speech recognition, drug design, medical image
analysis, material inspection, and board game programs. In deep learning, the term "deep" refers to the use of several layers in the network. Deep learning is a more recent variation concerned with an unbounded number of layers of bounded size, which permits practical application and optimization while maintaining theoretical universality under mild conditions. In terms of efficiency, trainability, and understandability, deep learning layers are also allowed to be heterogeneous and to depart considerably from biologically informed connectionist models, hence the "structured" part.
Like machine learning, deep learning can be of two types, supervised and
unsupervised learning. The supervised category includes algorithms such as
convolutional neural networks and recurrent neural networks in their various
forms. The popular deep learning algorithms in the unsupervised category are the
neural-network-based self-organizing map, the restricted Boltzmann machine, and
autoencoders. The concepts of supervised and unsupervised learning remain the
same as in machine learning algorithms.
The working principle of deep learning algorithms lies in the architectural design
of the model. It also depends on the kind of deep learning model, such as supervised
and unsupervised. However, the basic working principle or algorithm can be framed
in the following way:
1. First, to acquire the appropriate answer, we must identify and fully comprehend
the actual problem. The practicality of applying deep learning to it should also be
evaluated.
2. Second, we must identify the relevant data pertaining to the actual problem and
prepare it adequately.
3. Third, select a suitable deep learning algorithm.
4. Fourth, apply the chosen algorithm to train a model on the dataset.
5. Fifth, subject the trained model to final testing on the dataset.
Step 1: Initialize the network's weights, typically with small random values.
Step 2: Apply the inputs to the network for each training pattern.
Step 3: Calculate the output for each neuron from the input layer to the output layer,
passing through the hidden layers.
Step 4: Calculate the error at the outputs by comparing the computed outputs with
the target outputs.
Step 5: To compute error signals for pre-output layers, use the output error.
Step 6: Use the error signals to update the weights, for example as
𝑊𝑒𝑖𝑔ℎ𝑡𝑎𝑏⁺ = 𝑊𝑒𝑖𝑔ℎ𝑡𝑎𝑏 + 𝜂 · 𝐸𝑟𝑟𝑜𝑟𝑏 · 𝑂𝑢𝑡𝑝𝑢𝑡𝑏 (1 − 𝑂𝑢𝑡𝑝𝑢𝑡𝑏 ) · 𝑂𝑢𝑡𝑝𝑢𝑡𝑎
where 𝑊𝑒𝑖𝑔ℎ𝑡𝑎𝑏⁺ represents the new weight, 𝑊𝑒𝑖𝑔ℎ𝑡𝑎𝑏 represents the initial
weight, 𝜂 is the learning rate, and 𝑂𝑢𝑡𝑝𝑢𝑡𝑏 (1 − 𝑂𝑢𝑡𝑝𝑢𝑡𝑏 ) represents the
derivative of the sigmoid function.
The deep learning architecture runs mainly in two parts, the training phase and the
testing phase. However, most of the operation is done in the training phase, which
is described below.
The network's weights are individually specified and serve as the foundation for
linking inputs to outputs. Training is the process of defining these weights and
tuning them to translate the input to output with an acceptable degree of error across
many iterations.
The error is the gap between the actual and intended output. This error is used to
modify the network's weights, starting with the output layer and working backward
to the input layer. This back-propagation of the error to change the weights in the
network reduces the error for the complete training set across many training
samples.
1. Feedforward Neural Network: This is the most popular form of neural network
in practical applications. The first layer is the input, and the last is the output. We
refer to neural networks with more than one hidden layer as "deep" neural networks.
They compute a series of transformations that change the similarities between
cases. The activity of the neurons in each layer is a non-linear function of the
activities in the layer below.
1. Perceptron Learning
Perceptrons are the earliest generation of neural networks, which are essentially
computational representations of neurons. Frank Rosenblatt made them famous in
the early 1960s. They have a seemingly powerful learning algorithm, and several
significant claims were made about what they could learn. The book "Perceptrons,"
written by Minsky and Papert in 1969, analyzed what they could achieve and
demonstrated their limitations. Many individuals came to think that all neural
network models share these drawbacks. On the other hand, perceptron learning is
still frequently employed today for problems involving large feature vectors with
millions of features.
Yann LeCun and his coworkers created LeNet, a very good recognizer for
handwritten digits, in 1998. Into a feedforward network they introduced several
hidden layers, maps of replicated units within each layer, pooling of the outputs of
nearby units, and a few other modifications. This new kind of network is capable
of doing many more tasks than just classification. Such networks were afterward
given the name convolutional neural networks.
pixels in an image or one patch of an image from the rest of the image appears much
more natural.
Recurrent neural networks are mighty since they combine two properties: a
distributed hidden state that allows them to store a large amount of knowledge
about the past effectively, and non-linear dynamics that permit them to update their
hidden state in complex ways. With enough neurons and time, recurrent neural
networks can compute anything that your computer can compute.
6. Hopfield Networks
7. Deep Autoencoders
While weighing the benefits and challenges of deep learning, the scale may lack
balance [17]: the benefit side will always be the heavier one. However, there are a
few challenges too, which should not be ignored.
6.1.4.1. Advantages
o Features are automatically inferred and adjusted to get the desired result; it is not
necessary to extract features ahead of time. This eliminates the time-consuming
feature-engineering step of classical machine learning approaches.
o Many different applications and data sets can benefit from the same neural
network-based technique.
o GPUs can conduct massive parallel computations and are scalable for enormous
amounts of data. Furthermore, it provides superior performance outcomes while
dealing with large amounts of data.
o The deep learning architecture is adaptable, which means it may solve new issues
in the future.
6.1.4.2. Disadvantages
o Because the data models are complicated, training is highly costly. Deep
learning also necessitates the use of expensive GPUs and, at scale, hundreds of
machines. The users' costs rise as a result.
6.2.1. Overview
Do you ever consider what it takes to construct something as complex as a brain,
how such devices function, or what they accomplish? Let's examine how artificial
nodes imitate communicating neurons and how artificial and biological neural
networks differ.
Synapse, dendrites, cell body, and axon are parts of a biological neural network
(BNN). Neurons do the processing in this neural network. Dendrites receive
messages from other neurons, the soma adds up all of the signals, and the axon
sends the signals to other cells. The organic nervous system heavily influences
artificial neural networks. It is, therefore, quite beneficial to have some
understanding of how this system works. Most living things that have the potential
to adapt to a changing environment require a learning control unit to execute tasks.
The control unit, or brain, is split into structural and functional subunits, each with
its own set of functions, such as vision, hearing, motor control, and sensor control.
Nerves link the brain to the sensors and actors throughout the body. The brain has
a vast number of neurons, around 10¹¹ on average. These are the fundamental
components of the central nervous system (CNS). Synapses are spots on the surface
of the neurons that link them. The brain's complexity stems from the vast number
of highly linked basic units that function in parallel, with each neuron getting
information from up to 10,000 other neurons.
The neuron is made up of all of the structures that make up an animal cell, and the
structure and operations of even a single cell are enormously complicated. Even the
most advanced neuron models in artificial neural networks appear to be toy-like in
comparison. The cell body (soma), dendrites, and axon are the three primary
structural components of the neuron. The neuron's organelles are housed in the cell
body, and the 'dendrites' originate there. These are slender, widely branched fibers
that link to many cells inside the cluster by branching out in diverse directions.
Input connections are formed from other cells' axons to the dendrites or directly to
the cell's body. Axodendritic and axosomatic synapses are two types of
synapses. Each neuron has only one axon. It's a single, long fiber that carries the
cell's output signal as electrical impulses along its length. The axon's end can split
into many branches, which are subsequently linked to other cells. The function of
the branches is to spread out the signal to a large number of different inputs. Fig.
(6.1) illustrates the similarity between the biological neural network and artificial
neural network.
The neurons essentially perform the following function: they sum up all of the cell's
inputs, which may differ depending on the intensity of the connection or the
frequency of the incoming signal.
A threshold function is used to process the input sum and generate an output signal.
Compared to a modern computer, the processing time of each cycle is slow, at
around one millisecond, while signals travel along the neurons at about 0.6 to 120
meters per second. In roughly 100 milliseconds, a human can identify the image of
another individual. Given an individual neuron's processing time of 1 millisecond,
this implies that only a small number of neurons, fewer than 100, are involved in
such serial processing.
Fig. (6.1). The similarity of biological neural network (BNN) and artificial neural
network (ANN).
Artificial Neural Network and Neural Computing phrases have a wide range of
definitions and interpretations in the literature. The definitions that follow are
geared toward computing, but they are still quite thorough, in my opinion, and they
cover a wide range of perspectives on what an ANN is.
Laurene Fausett's treatment of artificial neural networks uses the connectionist
research approach. She defined an artificial neural network as an information-
processing system that has certain performance characteristics in common with
biological neural networks. Artificial neural networks were developed as
generalizations of mathematical models of human cognition or neural biology,
based on the following assumptions:
1. Information processing occurs at many simple elements called neurons.
2. Signals are transferred from one neuron to the next via connection links.
3. Each connection link has an associated weight, which, in a typical neural
network, multiplies the signal transmitted.
4. To calculate its output signal, each neuron applies an activation function to its
net input.
A neural network is a computer model with layers of linked nodes that simulates
the networked organization of neurons in the brain. It can be trained to identify
patterns, categorize data, and predict future occurrences using data [17].
Artificial neural networks are systems that are based on biological neural networks.
Without any task-specific rules, these systems learn to execute tasks by being
exposed to various datasets and examples. Instead of being programmed with a pre-
coded understanding of these datasets, the system identifies features from the given
data.
So basically, we can jot down some points for an artificial neural network, which
are as follows:
To comprehend neural networks, we must first dissect and understand their most
fundamental unit, the perceptron [18].
It is made up of four key elements: (i) inputs, (ii) weights and bias, (iii) summation
function, and (iv) transformation or activation function.
Proposed by Frank Rosenblatt (1958), the perceptron, like a neuron, is a network
that receives some inputs, processes them, and provides an output.
Fig. (6.2). A perceptron with input x, weights w and bias b, giving output y.
While the dendrite of real neurons receives electrical impulses from the axons of
other neurons, these electrical signals are represented numerically in the perceptron.
Electrical impulses are modified in varying degrees at synapses between dendrites
and axons. In the perceptron, this is also replicated by multiplying each input value
by a weight value.
Only when the overall intensity of the input signals exceeds a particular threshold
does a real neuron fire an output signal. In a perceptron, we mimic this phenomenon
by computing the weighted sum of the inputs, which represents the overall intensity
of the input signals, and then applying a step function to the sum to produce the
output. This output is transmitted to other perceptrons, much like in biological
neural networks.
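This weighted-sum-and-threshold behavior can be sketched in a few lines (a minimal illustration with made-up weights, not code from the book; the bias here plays the role of a negative threshold):

```python
def perceptron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through a step function."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if net >= 0 else 0

# The unit fires only when the combined input intensity crosses the threshold.
print(perceptron([1, 1], [0.6, 0.6], -1.0))  # 0.6 + 0.6 - 1.0 >= 0  -> 1
print(perceptron([1, 0], [0.6, 0.6], -1.0))  # 0.6 - 1.0 < 0        -> 0
```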
Linear Separability
The notion of linear separability states that the input space is divided into areas
based on the condition of network response, be it positive or negative.
If two sets of points in 2-D space can be divided by a straight line, they are said to
be linearly separable. In general, if a hyperplane of (n-1)-dimensions separates two
sets of points in n-dimensional space, they are linearly separable.
Assume a network with a positive response in the first quadrant and a negative
response in all other quadrants (AND function) using binary or bipolar data; the
decision line between the positive and negative response regions is then
constructed. As a result, the AND gate is linearly separable, making it simple to
build using the perceptron learning technique.
The perceptron can represent only linearly separable functions: functions whose
values can be shown in a two-dimensional graph with a single straight line
separating the two classes. It cannot, therefore, model the XOR function, since
XOR is not linearly separable. When two sets of data points are not linearly
separable, a linear separator that minimizes the mean squared error may be desired
instead.
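A brute-force search makes the separability claim concrete: some weight pair plus bias realizes AND, but no setting realizes XOR (an illustrative check over a coarse grid, assuming a step-threshold unit; the grid suffices because XOR has no linear separator at all):

```python
import itertools

def step_unit(x1, x2, w1, w2, b):
    """A single threshold unit: fires when w1*x1 + w2*x2 + b >= 0."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

def realizable(truth_table):
    """Try every (w1, w2, b) on a coarse grid; return a setting that
    reproduces the truth table, or None if none is found."""
    grid = [i / 2 for i in range(-4, 5)]          # -2.0 .. 2.0 in 0.5 steps
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all(step_unit(x1, x2, w1, w2, b) == y
               for (x1, x2), y in truth_table.items()):
            return (w1, w2, b)
    return None

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

print(realizable(AND) is not None)  # True: AND is linearly separable
print(realizable(XOR) is not None)  # False: no single line separates XOR
```

The negative result for XOR holds on any grid, since no linear separator exists anywhere in the plane.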
An input and an output layer make up a Single Layer Perceptron. A hard-limiting
function is used as the activation function. The Single Layer Perceptron is defined
as a configuration of one input layer of neurons feeding forward to one output layer
of neurons. The earliest suggested neural model was the single-layer perceptron. A
vector of weights makes up the content of the neuron's local memory. A single-
layer perceptron computes the sum of the input vector's elements, each multiplied
by the corresponding element of the weight vector; this sum is then fed to the
activation function, whose value is presented at the output. Fig. (6.4) illustrates
a single-layer perceptron network.
𝑦𝑗 = 𝑓(𝑛𝑒𝑡𝑗 ) = { 1 if 𝑛𝑒𝑡𝑗 ≥ 0; 0 if 𝑛𝑒𝑡𝑗 < 0 }, where 𝑛𝑒𝑡𝑗 = ∑ᵢ₌₁ⁿ 𝑥𝑖 𝑤𝑖𝑗
Algorithm
Step 3: Using the weight set, iterate over the training set's input patterns 𝒙𝒋 ,
computing the weighted sum of inputs 𝒏𝒆𝒕𝒋 = ∑ 𝒙𝒊 𝒘𝒊𝒋 for 𝒊 = 𝟏 𝒕𝒐 𝒏 for each input
pattern 𝒋.
Step 4: Using the step function, as given below, compute the result 𝒚𝒋 .
𝑦𝑗 = 𝑓(𝑛𝑒𝑡𝑗 ) = { 1 if 𝑛𝑒𝑡𝑗 ≥ 0; 0 if 𝑛𝑒𝑡𝑗 < 0 }, where 𝑛𝑒𝑡𝑗 = ∑ᵢ₌₁ⁿ 𝑥𝑖 𝑤𝑖𝑗
Step 5: For each input pattern 𝒋, compare the computed output 𝒚𝒋 with the intended
target output 𝒕𝒋 . If all of the input patterns were successfully categorized, output
the weights and quit the program.
If a computed output 𝒚𝒋 is zero when it should be one, then 𝒘𝒊 = 𝒘𝒊 +
𝜶𝒙𝒊 , 𝒊 = 𝟎, 𝟏, 𝟐, . . . , 𝒏; conversely, if it is one when it should be zero, then
𝒘𝒊 = 𝒘𝒊 − 𝜶𝒙𝒊 , where 𝜶 is a constant value, the learning parameter.
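The steps above can be sketched directly in code (an illustrative sketch, not the book's listing; the text states the weight increase for outputs that should be one, and the symmetric decrease used here for the converse error is the standard complement). The input x[0] is fixed at 1 so that w[0] serves as the bias weight:

```python
def step(net):
    """Step activation: 1 if net >= 0, else 0."""
    return 1 if net >= 0 else 0

def train_perceptron(patterns, targets, alpha=1, max_epochs=100):
    """Perceptron learning rule; patterns are input vectors with x[0] = 1."""
    w = [0] * len(patterns[0])                      # initial weight set
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(patterns, targets):
            y = step(sum(wi * xi for wi, xi in zip(w, x)))
            if y != t:                              # misclassified pattern
                sign = 1 if t == 1 else -1          # raise or lower weights
                w = [wi + sign * alpha * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                             # all patterns classified
            break
    return w

# Learn the (linearly separable) AND function.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
T = [0, 0, 0, 1]
w = train_perceptron(X, T)
print([step(sum(wi * xi for wi, xi in zip(w, x))) for x in X])  # [0, 0, 0, 1]
```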
The most common form of neural network in use today is multilayer perceptrons
(MLP). They are a form of feedforward neural network, a fundamental type of
neural network capable of approximating generic classes of functions, such as
continuous and integrable functions.
Perceptrons are building blocks that only become effective in bigger functions like
multilayer perceptrons, much as Rosenblatt founded the perceptron on a
McCulloch-Pitts neuron, proposed in 1943.
The multilayer perceptron is deep learning's "hello world": a fantastic spot to start
learning about deep learning. A multilayer perceptron (MLP) is a type of artificial
neural network with several layers: one or more hidden layers, each with any
number of units, sit between the input and output layers. Linear combination
functions are used in the input layer, and sigmoid activation functions in the hidden
layers. The output layer may have any number of units, activated in any way. MLPs
with one hidden layer can approximate any continuous function.
MLP, like any other feed-forward neural network, is primarily concerned with two
motions: back and forward. Since each guess is a test of what we believe we know,
and each response is feedback letting us know how incorrect we are, you might
think of this ping pong of guesses and replies as a type of pendulum.
The signal flows from the input layer via the hidden layers to the output layer in
the forward pass, and the output layer's decision is compared to the ground truth
labels. In the backward pass, partial derivatives of the error function w.r.t. the
different weights and biases are back-propagated through the MLP using
backpropagation and the chain rule of calculus. The differentiation operation
creates a gradient, or error landscape, along which the parameters may be tweaked
to get the MLP closer to the error minimum. Any gradient-based optimization
technique, such as stochastic gradient descent, can be used to do this. Fig. (6.5)
shows a multi-layer perceptron network.
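The forward and backward passes just described can be traced concretely on a tiny fixed network (a 2-2-1 sketch with sigmoid units and squared error; the weights, input, and learning rate are illustrative values, not from the book):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_h, w_o):
    """Forward pass: input -> sigmoid hidden layer -> single sigmoid output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_h]
    y = sigmoid(sum(w * hi for w, hi in zip(w_o, h)))
    return h, y

def backward(x, t, w_h, w_o, lr=0.5):
    """One backprop step for squared error E = (y - t)^2 / 2."""
    h, y = forward(x, w_h, w_o)
    delta_o = (y - t) * y * (1 - y)                  # output error signal
    # Error signals for the hidden (pre-output) layer via the chain rule.
    delta_h = [w_o[j] * delta_o * h[j] * (1 - h[j]) for j in range(len(h))]
    new_w_o = [w - lr * delta_o * h[j] for j, w in enumerate(w_o)]
    new_w_h = [[w - lr * delta_h[j] * x[i] for i, w in enumerate(row)]
               for j, row in enumerate(w_h)]
    return new_w_h, new_w_o

w_h = [[0.5, -0.4], [0.3, 0.8]]                      # 2 inputs -> 2 hidden units
w_o = [0.7, -0.2]                                    # 2 hidden -> 1 output
x, t = [1.0, 0.0], 1.0

_, y_before = forward(x, w_h, w_o)
w_h, w_o = backward(x, t, w_h, w_o)
_, y_after = forward(x, w_h, w_o)
print(abs(t - y_after) < abs(t - y_before))          # True: one step shrinks the error
```

Repeating the `backward` call over many training samples is exactly the iterative weight-tuning described above.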
2. The artificial neural network gets data from the outside environment in the form
of patterns and vector images. The notation 𝒙(𝒏) for n number of inputs designates
these inputs.
3. Each input is multiplied by the weights assigned to it. The information that the
neural network uses to solve a problem is called weights. The strength of the
connectivity between neurons inside the neural network is often represented by
weight.
4. Inside the computing unit, all of the weighted inputs are added together. If the
weighted sum is zero, bias is applied to the output to make it non-zero or scale up
the system response. The weight and input of bias are always set to one.
5. The weighted sum may take any value between 0 and ∞. To limit the response
so that it reaches the desired value, a threshold is chosen, and the sum is then sent
forward through an activation function.
6. To obtain the desired result, the activation function is set to the transfer function.
There are two types of activation functions: linear and nonlinear.
Artificial neural networks are of many types. Among them, there are eight most
prominent kinds of artificial neural networks [19], which are as follows:
Many neural networks work together to produce the outcomes in this kind of neural
network. Each of these neural networks performs and constructs a variety of sub-
tasks. When compared to other neural networks, this gives a unique collection of
inputs. To complete any job, there is no signal exchange or contact between these
neural networks. When the number of connections is reduced, the calculation
performance improves, reducing the requirement for neural networks to
communicate [18]. The overall processing time will also be determined by the
number of neurons, which participate in the computation of outcomes and their
engagement in the process. Modular neural networks are one of AI's fastest-
growing fields.
Frank Rosenblatt introduced the single-layer perceptron as the first neural network
model in 1958. It is one of the earliest learning models. A linear decision function
is calculated from the weight vector and the bias parameter. An understanding of
artificial neural networks is required to understand the perceptron layer.
3. Multilayer Perceptron
The feedforward neural network is the purest form of an ANN since the information
only goes in one way. Data enters through input nodes and exits through output
nodes in this type of neural network, including hidden layers. This neural network
employs a classifying activation function. Backpropagation is not permitted, and
only forward propagation is permitted.
Vectors from any dimension are fed into a discrete map in this neural network. The
map is used to produce training data for an organization. The map might have one
or two dimensions. Depending on the value, the weight of the neurons may
fluctuate. Recognizing data patterns is one of the effective uses of the Kohonen
Neural Network (also known as SOM network). It is also utilized in medical
research to categorize illnesses more accurately. After evaluating the trends in the
data, the data is grouped into distinct groups.
7. Hopfield Network
Dr. John J. Hopfield created the Hopfield neural network in 1982. It is made up of
one or more completely linked recurrent neurons in a single layer. For auto-
association and optimization tasks, the Hopfield network is widely employed.
There are two basic types: discrete and continuous. The discrete type works in a
discrete-time fashion, which means that the input and output patterns are discrete
vectors, which can be binary (0, 1) or bipolar (+1, −1). The network has
symmetrical weights and no self-connections. In the continuous type of network, time is a
continuous variable. It is also utilized in auto association and optimization issues
like the traveling salesman.
These networks resemble the Hopfield network, except that some neurons are input
neurons while others are hidden. Here the weights are chosen randomly and then
learned using the backpropagation method.
Credit Rating
Voice recognition
Financial Forecasting
Fraud detection
# 1. Import libraries
import tensorflow as tf

# 2. Import dataset (Fashion-MNIST, inferred from the class labels printed below)
(trainX, trainY), (testX, testY) = tf.keras.datasets.fashion_mnist.load_data()
trainY.shape
(60000,)
# 3. Build model (layer sizes match the summary below; the activations and
#    optimizer are typical choices, as the original build step is not shown)
ANN = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# 4. Compile and inspect model
ANN.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
            metrics=['accuracy'])
ANN.summary()
Model: "sequential"
____________________________________________________________
Layer (type) Output Shape Param #
============================================================
flatten (Flatten) (None, 784) 0
____________________________________________________________
dense (Dense) (None, 256) 200960
____________________________________________________________
dense_1 (Dense) (None, 10) 2570
============================================================
Total params: 203,530
Trainable params: 203,530
Non-trainable params: 0
____________________________________________________________
# 5. Learning model
ANN.fit(trainX, trainY, epochs=5)
Epoch 1/5
1875/1875 [==============================] - 6s 3ms/step - loss:
3.6962 - accuracy: 0.7367
Epoch 2/5
1875/1875 [==============================] - 5s 3ms/step - loss:
0.6083 - accuracy: 0.7863
Epoch 3/5
1875/1875 [==============================] - 5s 3ms/step - loss:
0.5426 - accuracy: 0.8107
Epoch 4/5
1875/1875 [==============================] - 5s 3ms/step - loss:
0.5054 - accuracy: 0.8242
Epoch 5/5
1875/1875 [==============================] - 5s 3ms/step - loss:
0.4857 - accuracy: 0.8314
<tensorflow.python.keras.callbacks.History at 0x7f0b040ad290>
# 6. Evaluate model
loss, acc = ANN.evaluate(testX, testY, verbose=2) # Calculate
Accuracy
print('Test accuracy : ', acc) # Print result
# 7. Test (the printing loop matches the output shown below;
#    the class-name list is the standard Fashion-MNIST labeling)
result = ANN.predict(testX)
labels = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
          'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
for i in range(9000, 9005):
    print('number  :', i)
    print('actual  :', testY[i], labels[testY[i]])
    print('predict :', result[i].argmax(), labels[result[i].argmax()])
number : 9000
actual : 6 Shirt
predict : 6 Shirt
number : 9001
actual : 9 Ankle boot
predict : 9 Ankle boot
number : 9002
actual : 0 T-shirt/top
predict : 3 Dress
number : 9003
actual : 1 Trouser
predict : 1 Trouser
number : 9004
actual : 6 Shirt
predict : 6 Shirt
6.3.1. Layers
In the previous section, we have discussed artificial neural networks. This section deals
with all the components and functions of a neural network. A neural network
decomposes your information into layers of abstraction. An input layer, hidden
layers, and an output layer make up the structure. The layers are linked together via
nodes, or neurons, with each layer using the preceding layer's output as its input.
Receiving a set of inputs, performing computations, and then using the result to
solve the problem is the primary function.
The three main layers of a neural network are the input, output, and hidden layers.
Artificial input neurons make up the input layer of a neural network, which
delivers the initial data into the system for processing by succeeding layers of
artificial neurons. The input layer is the first step in the artificial neural network's
workflow. Since the input layer is the initial layer of the network, artificial neurons
in the input layer have a distinctive function to perform. Experts explain this by the
input layer being made up of passive (not performing any operations) neurons that
do not take in data from preceding levels.
In general, artificial neurons have a set of weighted inputs and compute a function
based on those weighted inputs. In theory, however, an input layer can be made up
of artificial neurons with no weighted inputs, or with inputs weighted differently,
for example randomly, because the information is entering the system for the first
time. The data is sent from the input layer to the succeeding layers in the neural
network model, where the neurons do have weighted inputs.
The input layer is in charge of receiving the data. These inputs can be retrieved
from a remote location. In a neural network, there must always be one input layer.
The input layer receives the data, conducts the computations using its neurons, and
then sends the results to the next layers.
The output layer of an artificial neural network is the final layer of neurons that
generates the program's outputs. Given that output layer neurons are the last "actor"
nodes on the network, they may be constructed or monitored differently from other
artificial neurons in the neural network. Fig. (6.7) shows the output layer.
The output layer is in charge of generating the final product. In a neural network,
there must always be one output layer. The output layer receives the inputs from
the layers above it, conducts the computations using its neurons, and then computes
the output.
The output layer, in a way, condenses and concretely generates the ultimate
product. However, to fully comprehend the neural network, it is necessary to
examine the input layer, hidden layers, and output layer as a whole.
Neural networks now outperform most machine learning methods because of the
addition of hidden layers. Hidden layers are located between the input and output
layers. They are not visible to external systems and are private to the neural
network, as the name suggests. In a neural network, there might be one or more
hidden layers. Fig. (6.8) shows the hidden layer of a neural network.
The more hidden layers in a neural network, the longer it will take for the network
to create an output and the more complicated issues it will handle. The neurons add
the bias and determine the weighted sum of inputs and weights before performing
an activation function. Hidden layers enable a neural network's function to be split
into specific transformations of the data. Each function in the hidden layers is
tailored to deliver a defined, intended result. For example, hidden-layer functions
that recognize human eyes and ears may be used with later layers to identify faces in
pictures. While the functions for identifying eyes on their own are insufficient for
independently recognizing objects, they can operate together in a neural network.
The design of hidden layers in the neural network is the subject of several machine
learning model evaluations. There are various ways to build up these hidden layers
for creating diverse outcomes, such as convolutional neural networks for image
processing, recurrent neural networks with memory, and basic feedforward neural
networks that straightforwardly function on training data sets.
This section attempts to give fundamental insight into bias and weights. The
essential notion in a neural network is its weights and bias. The weights are
multiplied with the inputs and fed into an activation function. The bias is added
when the inputs are transferred across neurons.
6.3.2.1. Weights
The coefficients of the equation are called weights. Negative weights lower the
output's value. When a neural network is trained on a training set, it is given a set
of weights, to begin with. The optimal weights are then created by optimizing these
weights during the training phase.
Weight represents the strength of the link between units. If the weight from node 1
to node 2 is large, neuron 1 has a more substantial impact on neuron 2. A weight
close to zero reduces the relevance of its input value: altering that input will have
little effect on the output. Negative weights indicate that increasing the input will
result in a decrease in output. In short, the weight determines how much of an
impact an input has on the output. A neuron first computes the weighted sum of its
inputs.
𝑦 = ∑ᵢ₌₁ⁿ 𝑤𝑖 𝑥𝑖 + 𝑏
𝑋 = (𝑥1 , 𝑥2 , 𝑥3 , … , 𝑥𝑛 )
𝑊 = (𝑤1 , 𝑤2 , 𝑤3 , … , 𝑤𝑛 )
𝑊𝑒𝑖𝑔ℎ𝑡𝑒𝑑_𝑆𝑢𝑚 = (𝑤1 𝑥1 + 𝑤2 𝑥2 + 𝑤3 𝑥3 + ⋯ + 𝑤𝑛 𝑥𝑛 )
Finally, the computed value is sent to the activation function, which generates a
result.
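Numerically, the computation looks like this (hypothetical inputs, weights, and bias):

```python
X = [0.5, 1.0, -1.0]           # inputs
W = [0.8, 0.2, 0.4]            # connection weights
b = 0.1                        # bias

# Weighted sum: each input scaled by its weight, plus the bias.
weighted_sum = sum(w * x for w, x in zip(W, X)) + b
print(round(weighted_sum, 6))  # 0.4 + 0.2 - 0.4 + 0.1 = 0.3
```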
6.3.2.2. Bias
Adding a constant value to the product of inputs and weights is known as bias. To
compensate for the outcome, bias is used. The bias is used to move the activation
function's outcome to the positive or negative side. It is an additional input to
neurons with a constant value of 1 and its connection weight. This ensures that the
neuron will be activated even if all of the inputs are zero.
Assume your neural network is supposed to output 10 when the input is 5 and the
weight is 1. How will you ensure that the neuron returns 10, since the product of
weight and input will equal only 5? The answer is that a bias of 5 can be added.
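In code, this correction is just the following (using the toy numbers above):

```python
x, w = 5, 1           # input and weight from the example
b = 5                 # bias chosen so the neuron can reach the target
output = w * x + b    # weighted input alone gives only 5; the bias adds the rest
print(output)         # 10
```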
Since bias is the inverse of the threshold, it determines when the activation function
is activated. The inclusion of bias decreases variance, which gives the neural
network more flexibility and generalization.
Inside the network, both weights and bias are learnable parameters. Before learning
begins, a teachable neural network will randomize both the weight and bias
parameters. Both parameters are changed as training progresses toward the desired
values and the correct output. The amount to which the two factors impact the input
data differs. Simply said, bias is the distance between the predicted value and the
desired value. Biases make up the discrepancy between the function's output and
its intended output.
A high bias value indicates that the network makes more assumptions about the
output's form. In contrast, a low bias value indicates that the network makes fewer
assumptions about the output's form. Weights, on the other hand, are a measure of
the connection's strength. The degree of effect a change in the input has on the
output is determined by its weight. A low weight value will have little impact on
the input, whereas a higher weight value will significantly impact the output.
these functions should be nonlinear. Multi-state activation, sigmoid, ReLU, etc.,
are the activation functions utilized in deep neural networks [19].
The activation function is also regarded as Transfer Function. It may also be used
to connect two neural networks. For an artificial neural network to learn and
interpret complicated patterns, activation functions are critical. Its primary purpose
is to bring non-linear characteristics into the network. It calculates the ‘weighted
sum’, adds a bias, and then chooses whether or not to fire a particular neuron. Their
primary function is to transform an ANN node’s input signal into an output signal.
That output signal is now used as an input in the network's following layers. The
non-linear activation function will aid the model in comprehending the complexity
and providing precise findings.
The output signal would be a simple linear function if we didn't use an activation
function. A linear function is just a one-degree polynomial. Although a linear
equation is simple to solve, it is restricted in complexity and has less capacity to
learn complicated functional mappings from data. A neural network without an
activation function is just a linear regression model, which has limited power and
frequently fails to perform well. The neural network would also be unable to learn
and represent other complex data types such as pictures, videos, audio, and speech
without the activation function.
The activation functions may be split into two categories – linear activation
function and non-linear activation functions.
A linear activation function is simply a straight line; as a result, the function's output is not confined to any range.
204 Machine Learning and Its Application Indranath Chatterjee
It doesn't assist with the complexities of varied characteristics of the typical data
supplied to neural networks. Fig. (6.9) depicts the linear activation function’s curve.
This section will explain various activation functions widely used in deep learning
applications, along with their working principle and nature of the output.
When we plot a non-linear function, we see that it has a degree more than one and
has a curvature. We now require a neural network model to learn and represent
practically any arbitrarily complicated function that links inputs to outputs. As a
result, we may construct non-linear mappings from inputs to outputs by employing
a non-linear activation function. Fig. (6.10) shows the non-linear activation
function.
It allows the model to generalize or adapt to a wide range of data while still
distinguishing between the outputs.
$$s(x) = \frac{1}{1 + e^{-x}}$$
When you combine several logistic functions, you obtain a multi-logistic function.
$$m(x) = \sum_{i=1}^{n} s\big(p_i (x - x_i)\big)$$
The identity activation function passes its input through unchanged, meaning that the inputs and outputs have a linear connection. The identity function returns the
same value as the input it was given. It's also known as an identity map, identity
connection, or identity transformation. If 𝒇 is a function, then the identity relation
for argument 𝒙 is 𝒇(𝒙) = 𝒙 for all x values. Fig. (6.12) shows the identity
activation function.
$$f(x) = x; \quad \forall\, x \in \mathbb{R}$$
The Heaviside step function, also known as the unit step function, is a step function named after Oliver Heaviside (1850–1925), whose value is zero for negative arguments and one for positive arguments. It is typically represented as H or θ (but sometimes u, 𝟙, or 1). It is an example of the broad category of step functions, which may all be written as linear combinations of translations of this one. It is a threshold (binary step) activation function (Fig. 6.13).
$$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \ge 0 \end{cases}$$
Sometimes, the logistic sigmoid function can cause a neural network to become stuck during training. However, the beauty of the exponential is that the value of the function below never reaches zero or exceeds one: large negative numbers are squashed toward 0, while large positive values are squashed toward 1. Fig. (6.14) is the sigmoid activation function’s curve.
$$f(x) = \frac{1}{1 + e^{-x}}$$
$$f'(x) = f(x)\,\big(1 - f(x)\big)$$

$$\frac{d}{dx}\,\sigma(x) = \frac{d}{dx}\left(\frac{1}{1 + e^{-x}}\right) = \frac{d}{dx}\,(1 + e^{-x})^{-1} = -(1 + e^{-x})^{-2}\,(-e^{-x}) = \frac{e^{-x}}{(1 + e^{-x})^{2}}$$
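The identity f'(x) = f(x)(1 − f(x)) can be checked numerically against a finite-difference estimate (a small self-contained sketch):

```python
import math

def sigmoid(x):
    # the logistic function s(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # closed form derived above: f'(x) = f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# compare with a central finite difference at a few points
h = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(numeric - sigmoid_prime(x)) < 1e-6
print("closed-form derivative matches the finite difference")
```

Note that the derivative peaks at x = 0 with value 0.25 and shrinks toward zero for large |x|, which is the root of the vanishing-gradient behavior discussed later.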
$$f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1$$
Because its range is between -1 and 1, its output is zero-centered.
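That the expression 2/(1 + e^{-2x}) − 1 agrees with the standard hyperbolic tangent, and that its outputs stay in the zero-centered range [−1, 1], can be verified directly:

```python
import math

def tanh_via_sigmoid(x):
    # tanh written as a rescaled, shifted logistic function
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(tanh_via_sigmoid(x) - math.tanh(x)) < 1e-12  # matches math.tanh
    assert -1.0 <= tanh_via_sigmoid(x) <= 1.0               # output is bounded
print("matches math.tanh; outputs are zero-centered")
```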
As a result, optimization is easy using this approach, and it is always favored over
the Sigmoid function in practice. However, it still has the Vanishing gradient issue.
One of the most commonly utilized activation functions is the Rectified Linear Unit (ReLU). ReLU should be used in the hidden layers whenever possible. The idea is straightforward, and it also gives the output a touch of nonlinearity. The output, however, ranges from 0 to infinity.
It was demonstrated that it improved convergence six-fold over the Tanh function. The ReLU activation function is defined as the positive part of its argument:

$$f(x) = \max(0, x)$$

It is also known as the ramp function (Fig. 6.16). With ReLU, the vanishing gradient problem is mitigated.
Dying ReLU Problem: ReLU neurons can at times be driven into states in which they are inactive for virtually all inputs. Because no gradients flow backward through the neuron in this state, the neuron remains permanently inactive and "dies." This is a variant of the vanishing gradient problem. In some situations, vast numbers of neurons in a network can become trapped in dead states, reducing model capacity. This issue typically occurs when the learning rate is set extremely high. It may be addressed by switching to leaky ReLUs, which assign a small positive slope for negative inputs, though performance can suffer as a result.
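The ramp function and the leaky variant just mentioned can be sketched in a few lines (the 0.01 slope is a common but arbitrary choice):

```python
def relu(x):
    # ramp function: passes positive inputs through, zeroes out negative ones
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # a small positive slope for x < 0 keeps gradients flowing for dead neurons
    return x if x > 0 else slope * x

print(relu(3.5), relu(-2.0))              # 3.5 0.0
print(leaky_relu(3.5), leaky_relu(-2.0))  # 3.5 -0.02
```

For negative inputs, plain ReLU contributes zero gradient, while the leaky variant contributes the small constant slope, which is precisely what prevents the dying-ReLU state.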
$$y_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
The softmax activation will output one value for each node in the output layer by
definition. The output numbers will be probabilities, and the values will total to 1.0.
The data must be prepared before modeling a multi-class classification problem. The class labels in the target variable are first label encoded, which means that each class label is assigned an integer from 0 to N-1, where N is the number of class labels.
The target variables that have been label encoded are then one-hot encoded. This,
like the softmax output, is a probabilistic representation of the class label. Each
class label and its location are given a position in a vector. All values are set to 0,
and a 1 is used to indicate the class label's location.
For example, three class labels, named 1, 2, and 3, would be one-hot encoded as the following vectors:
Class 1: [1, 0, 0]
Class 2: [0, 1, 0]
Class 3: [0, 0, 1]
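Both steps — the softmax output and the one-hot targets it is compared against — can be sketched as follows (a minimal illustration; the class scores are made-up numbers):

```python
import math

def softmax(scores):
    # exponentiate each score and normalize so the outputs sum to 1.0
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(label, num_classes):
    # all zeros except a 1 at the class label's position
    v = [0] * num_classes
    v[label] = 1
    return v

probs = softmax([2.0, 1.0, 0.1])   # made-up class scores
print(probs)                        # per-class probabilities
print(sum(probs))                   # totals 1.0 (up to rounding)
print(one_hot(0, 3), one_hot(1, 3), one_hot(2, 3))  # [1,0,0] [0,1,0] [0,0,1]
```

The one-hot vector is the "probabilistic" target the softmax output is trained to match.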
Let's pretend you're working on a neural network that will forecast the likelihood
of rain in the future. The softmax activation function, which can compute the
likelihood of an event occurring in the future, can be employed in the output layer.
Softplus is a more recent addition than the sigmoid and Tanh functions; it was first introduced in 2001. Softplus is differentiable everywhere, and its derivative is the logistic sigmoid function.
The Sigmoid and Tanh functions create outputs with upper and lower bounds, but
the Softplus function produces outputs on a scale of (𝟎, +∞). That is the key
distinction.
𝑓(𝑥) = log(1 + 𝑒 𝑥 )
Thus, the derivative of f(x) is:

$$f'(x) = \frac{1}{1 + e^{-x}}$$
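That the derivative of softplus is exactly the sigmoid can be confirmed with a finite-difference check (a small numeric sketch):

```python
import math

def softplus(x):
    # f(x) = log(1 + e^x); smooth, with outputs in (0, +inf)
    return math.log(1.0 + math.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

h = 1e-6
for x in (-2.0, 0.0, 1.0):
    numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)
    assert abs(numeric - sigmoid(x)) < 1e-4   # d/dx softplus(x) == sigmoid(x)
    assert softplus(x) > 0.0                   # output never reaches zero
print("d/dx softplus(x) matches sigmoid(x)")
```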
The ELU (Exponential Linear Unit) is a function that tends to converge to zero
faster and give more accurate results. Unlike other activation functions, ELU
contains an additional alpha constant that must be positive.
With the exception of negative inputs, ELU is quite similar to ReLU: for non-negative inputs, both take the form of the identity function. For negative inputs, however, ELU bends smoothly until its output approaches −α, whereas ReLU bends abruptly at zero.
The exponential linear unit (ELU) accelerates learning in deep neural networks and improves classification accuracy. Fig. (6.19) shows the curve for the exponential linear unit activation function.
Other than these essential activation functions, as listed above, a few more activation functions are available, as follows:
1. ArcTan.
2. Leaky Rectified Linear Unit (LReLU).
3. Parametric rectified linear unit (PReLU).
4. Randomized leaky rectified linear unit (RReLU).
5. S-shaped rectified linear activation unit (SReLU).
6. Adaptive piecewise linear (APL).
7. SoftExponential.
The computation and storing of intermediate variables for a neural network from
the input layer to the output layer is called forward propagation [20]. We'll now go
through the mechanics of a neural network with one hidden layer one by one. It is
the process of feeding input values to a neural network and receiving a predicted
value as an output. Forward propagation is also known as inference. When we feed
the input data to the first layer of the neural network, nothing happens. The second
layer receives the data from the first layer and multiplies, adds, and activates them
before passing them on to the next layer. The procedure is repeated for succeeding layers, and the final layer produces an output value.
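The layer-by-layer computation just described — multiply, add, activate, pass on — can be sketched for a one-hidden-layer network (the dimensions and random weights below are arbitrary illustrative choices):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward propagation through a network with one hidden layer."""
    h = np.maximum(0.0, W1 @ x + b1)   # hidden layer: weighted sum + bias, ReLU
    y = W2 @ h + b2                    # output layer: weighted sum + bias
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=3)                            # 3 input features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)     # 3 -> 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)     # 4 -> 2 outputs
print(forward(x, W1, b1, W2, b2))                 # predicted output vector
```

Nothing is learned here; this is only the inference pass that backpropagation (next section) later adjusts.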
6.3.5. Backpropagation
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}$$
This is simple for vectors: it is simply matrix multiplication. We utilize the proper
counterpart for higher-dimensional tensors.
We receive an output value, which is the expected value, after forward propagation.
We compare the predicted value to the actual output value to compute the error. To
compute the error value, we utilize a loss function. The derivative of the error value
is then calculated for each and every weight in the neural network. The chain rule
of differential calculus is used in backpropagation. The derivatives of the error
value with respect to the weight values of the final layer are calculated first in the
chain rule. These derivatives are referred to as gradients, and the gradient values
are used to compute the gradients of the second-to-last layer.
The learning rate is a hyperparameter that regulates how much the model changes each time the model weights are altered in response to the predicted error. Too small a value may result in a prolonged training process that becomes stuck, whereas too large a value may result in learning a sub-optimal set of weights too quickly or an unstable training process.
When designing your neural network, the learning rate may be the most important hyperparameter. As a result, it is critical to understand how to explore the impact of the learning rate on model performance and to understand its dynamics on model behavior.
Gradient descent is one of the most widely used optimization methods, and it is by
far the most frequent method for optimizing neural networks. At the same time,
every current deep learning package includes implementations of multiple gradient
descent optimization techniques.
Formally,
𝑦 = 𝑓(𝑥1 , 𝑥2 , … , 𝑥𝑛 )
Find 𝒙𝟏 , 𝒙𝟐 , … , 𝒙𝒏 that maximizes or minimizes the output ‘y’. Usually, a cost or
loss function or a profit/likelihood function is utilized for the task.
Fig. (6.20). Curve showing the concept of global/local value for maxima/minima.
6.3.7.1. Gradient
For a function of a single variable, the gradient is the derivative, i.e., the slope of the tangent line at a point x0 (Fig. 6.21). A gradient measures the change in all weights in relation to the change in error.
The steeper the gradient or slope, the faster a model can learn. If the slope becomes zero, the model will cease learning.
Formally saying:
$$\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \ldots, \theta_n) \quad \text{for } j = 0, 1, \ldots, n \text{ until convergence,}$$

where α is the learning rate.
To make the algorithm run fast, we may be tempted to select as high a value as feasible. While this logic appears correct, there is a risk of overshooting the minimum, causing the algorithm to converge slowly or, worse, diverge. The learning rate α is thus the step by which the gradient descent algorithm descends in a single iteration.
During a gradient descent, the learning rate 𝜶 determines the size of the step.
Choosing an appropriate α value takes some forethought. If it is too small, the algorithm will take numerous incremental steps to arrive at the minimum; as a result, gradient descent may be sluggish.
If α is set very high, the algorithm may overshoot the minimum point and fail to converge. Worse, it may diverge, moving ever further from the minimum. The slope of the tangent line gets smaller as θ approaches the minimum, until it hits zero; as a result, there is no need to change α, which may stay the same throughout the procedure.
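The effect of α can be seen by running gradient descent on the simple quadratic J(θ) = θ², whose gradient is 2θ (a toy illustration; the two learning rates are arbitrary examples of "small enough" and "too large"):

```python
def gradient_descent(theta, alpha, steps):
    # minimize J(theta) = theta**2, whose gradient is 2*theta
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta  # the update rule given above
    return theta

print(gradient_descent(5.0, alpha=0.1, steps=50))  # converges toward 0
print(gradient_descent(5.0, alpha=1.1, steps=10))  # too large: overshoots and diverges
```

With α = 0.1 each step multiplies θ by 0.8, so it shrinks toward the minimum; with α = 1.1 each step multiplies θ by −1.2, so the iterates grow without bound.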
There are three types of gradient descent, each with a different amount of data used
to compute the objective function's gradient. We create a trade-off between the
precision of the parameter update and the time it takes to execute an update,
depending on the amount of data.
The gradient of the cost function with respect to the parameters 𝜽 is computed for
the whole training dataset using vanilla gradient descent, also known as batch
gradient descent.
𝜃 = 𝜃 − 𝛼∇𝜃 𝐽(𝜃)
Batch gradient descent can be very slow and is intractable for datasets that do not fit in memory, as we need to compute the gradients for the whole dataset to conduct just one update. Batch gradient descent also prevents us from updating our model in
real-time. For convex error surfaces, batch gradient descent is guaranteed to
converge to the global minimum, and for non-convex surfaces, to a local minimum.
For each training example 𝒙𝒊 and label 𝒚𝒊 , stochastic gradient descent (SGD)
conducts a parameter update. For big datasets, batch gradient descent performs
unnecessary calculations by recalculating gradients for comparable samples before
each parameter change. By making one update at a time, SGD eliminates
redundancy. As a result, it is often faster, and it may also be used to learn online.
𝜃 = 𝜃 − 𝛼∇𝜃 𝐽(𝜃; 𝑥𝑖 ; 𝑦𝑖 )
The volatility of the SGD allows it to reach new and maybe better local minima.
This complicates the convergence to the precise minimum since SGD will continue
to overshoot. SGD, on the other hand, has been demonstrated to have the same
convergence behavior as batch gradient descent when the learning rate is gradually decreased.
Finally, mini-batch gradient descent combines the best features of both batch and
stochastic gradient descent, performing an update for each mini-batch of n training
samples.
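The three variants differ only in how many examples feed each update; here is a compact sketch of the mini-batch case for a least-squares objective (synthetic noiseless data; the batch size of 8 and the other hyperparameters are arbitrary choices):

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.1, batch_size=8, epochs=100, seed=0):
    """Minimize mean squared error of a linear model with mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)                   # shuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]      # one mini-batch of examples
            grad = 2 * X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta -= alpha * grad                  # one parameter update per mini-batch
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 2))
true_theta = np.array([3.0, -2.0])
y = X @ true_theta                 # noiseless synthetic targets
print(minibatch_sgd(X, y))         # approximately [3., -2.]
```

Setting batch_size to n recovers batch gradient descent, and batch_size of 1 recovers plain SGD, which is exactly the trade-off described above.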
6.4.1. Overview
Around the 1980s, CNNs were developed and utilized for the first time. At the time,
the best a CNN could do was detect handwritten digits. It was primarily used in the
postal industry to read zip codes, PINs, and other similar information. The essential
thing to understand about any deep learning model is that it takes enormous data
and computational resources to train. Alex Krizhevsky decided in 2012 that it was
time to revive the multi-layered neural network branch of deep learning.
Researchers were able to revive CNNs due to the availability of enormous
quantities of data, including ImageNet datasets with millions of tagged pictures and
an abundance of computer resources [22].
The design of modern CNNs, as they are known informally, was influenced by
biology, group theory, and much experimentation. When it comes to the task with
a one-dimensional sequence structure, such as audio, text, and time series analysis,
computer scientists frequently use CNNs. Since they are made up of neurons with
learnable weights and biases, convolutional neural networks are pretty similar to
conventional neural networks. Each neuron gets some inputs, does a dot product,
and then executes a non-linearity if desired. From the raw picture pixels on one end
to class scores on the other, the entire network still represents a single differentiable
scoring function. They still have a loss function on the last layer, and we can use
the same approaches we used to build ordinary neural networks.
First, we will go through the fundamental operations that form the foundation of all
convolutional networks. The convolutional layers themselves, padding and stride,
the pooling layers used to aggregate information across neighboring spatial areas,
and a thorough examination of the topology of contemporary architectures are all
included.
Convolutional neural networks take advantage of the fact that the input consists of pictures to constrain the architecture in a more sensible way. Unlike a conventional neural network, a CNN's layers have neurons
organized in three dimensions: width, height, and depth. Fig. (6.22) depicts an
overview of a convolutional neural network.
6.4.2.1. Layers
Different layers modify their inputs differently, and some layers are more suited for
specific tasks than others.
Fig. (6.23). Input image converts into the matrix before getting fed into CNN.
For example, a typical first-layer filter may be 10x10x3: 10 pixels of width and height, and a depth of 3, since pictures have three channels (RGB). During the forward pass, each filter is slid, or 'convolved', across the width and height of the input volume, computing dot products between the filter's entries and the input at every position (Fig. 6.24). As we move the filter over the input
volume's width and height, we'll get a 2-D activation map with the filter's responses
at every spatial point.
Fig. (6.24). Sample convolution operation on the input image matrix, using the filter to obtain the convolved image.
1. Filter's Dimensions
2. Stride
The stride S specifies the number of pixels by which the window advances after
each convolutional or pooling process.
3. Zero-padding
The act of adding P zeroes on each side of the input's bounds is known as zero-
padding. This value can be manually entered or calculated automatically.
As it scans the input I along its dimensions, the convolution layer employs filters that conduct convolution operations. The filter size F and stride S are two of its hyperparameters, and the size of the output is O. It is frequently helpful to calculate the number of parameters a model's architecture contains in order to assess its complexity.
$$\text{Input size} = I \times I \times C$$
$$\text{Output size} = O \times O \times K$$
$$\text{Number of parameters} = (F \times F \times C + 1) \cdot K$$

where K is the number of filters.
The output volume is controlled by three hyperparameters: depth, stride, and zero-
padding. We may calculate the output volume's spatial dimension as a function of
the input volume size (W), the convolution layer filter size (F), the stride (S), and
the amount of zero padding used on the border (P). So the formula for determining how many neurons "fit" is:

$$(W - F + 2P)/S + 1$$
Let’s take an instance: a 7x7 input with a 3x3 filter, stride 1, and pad 0 yields a 5x5 output, since (7 − 3 + 0)/1 + 1 = 5.
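The output-size formula and the worked example can be checked in a couple of lines (integer division here assumes the stride divides the span evenly):

```python
def conv_output_size(w, f, s=1, p=0):
    # number of neurons that "fit": (W - F + 2P) / S + 1
    return (w - f + 2 * p) // s + 1

print(conv_output_size(7, 3, s=1, p=0))   # 5, matching the 5x5 example
print(conv_output_size(32, 3, s=1, p=0))  # 30, as for a 3x3 filter on a 32x32 image
```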
The output of neurons linked to particular areas in the input will be computed by
the convolution layer, which will calculate a dot product between their weights and
a tiny region in the input volume that they are connected to.
The feature maps are fed into an activation function in the same way that they would be in a traditional artificial neural network. They are fed into a rectifier function, which returns zero if the input value is less than zero and returns the input value unchanged otherwise (Fig. 6.25).
Fig. (6.25). ReLU activation function being used after the convolution layer.
Fig. (6.26). Pooling layer with maxpool operation having 2x2 filter and stride size 2.
The pooling layer employs several filters to detect various aspects of the picture,
such as edges, corners, body, feathers, eyes, and beaks. Using the max pooling
operation, the pooling layer acts independently on each depth slice of the input and
resizes it spatially. In the picture above, we see a pooling layer with filters of size 2 x 2 and a stride of 2. In this scenario, the max-pooling operation takes the maximum of each 2 x 2 block of values.
Where,

$$W_2 = \frac{W_1 - F}{S} + 1$$
$$H_2 = \frac{H_1 - F}{S} + 1$$
$$D_2 = D_1$$
The maximum pooling operation assigns the maximum value as the output. In contrast, the average pooling operation assigns the average of the input components in the pooling window. One of the essential advantages of pooling is that it reduces the spatial size of the representation, and with it the amount of computation in the network.
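A 2x2, stride-2 max-pooling pass over a small feature map can be sketched directly (toy values chosen for illustration):

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling with an f x f window and stride s over a 2-D feature map."""
    h = (x.shape[0] - f) // s + 1        # output height: (H1 - F)/S + 1
    w = (x.shape[1] - f) // s + 1        # output width:  (W1 - F)/S + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max()  # max of each window
    return out

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 0],
                 [7, 2, 9, 8],
                 [1, 0, 3, 4]], dtype=float)
print(max_pool(fmap))   # maxima of the four 2x2 blocks: [[6, 5], [7, 9]]
```

A 4x4 slice shrinks to 2x2, consistent with the dimension formulas, while the depth of a full volume would be unchanged.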
After you've acquired the pooled feature map, you'll need to flatten it. The whole pooled feature map matrix is flattened into a single column, which is then given to the neural network for processing. Flattening is the process of combining all of the pooled feature map's 2-D arrays into a single long continuous linear vector.
Fig. (6.27). Flatten layer: Converting the 2-D pooled matrix into 1-D vector.
The pooled feature map is transformed into a one-dimensional vector since it will
be fed into an artificial neural network (Fig. 6.27). To put it another way, this vector
will now serve as the input layer for a neural network.
The flattened feature map is then sent through a neural network. This stage comprises the input layer, the fully connected layer, and the output layer. The fully connected layer is analogous to the hidden layer in ANNs, except that here it is fully connected. The predicted classes are found in the output layer. The data is sent through the network, and the prediction error is computed. To improve the forecast, the error is then backpropagated through the system. Fig. (6.28) depicts the fully connected layer in a CNN.
1. LeNet
Yann LeCun pioneered the first successful uses of convolutional networks in the
1990s. The LeNet design, which was used to read zip codes, numbers, and other
data, is the most well-known of these.
2. AlexNet
AlexNet, created by Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton, was the
first effort to popularize Convolutional Networks in Computer Vision. The
network's design was quite similar to LeNet's, but it was deeper and bigger, and it featured convolutional layers stacked directly on top of each other.
3. VGGNet
Its primary contribution was to demonstrate that network depth is an essential factor
in achieving high performance. Their best network has 16 convolution and fully
connected layers and, more importantly, a highly homogenous design that performs
just 3x3 convolutions and 2x2 pooling from start to finish. Their pre-trained model
is offered in Caffe as a plug-and-play solution.
4. ResNet
Kaiming He et al. first introduced the residual network or ResNet. It has unique
skip connections and makes extensive use of batch normalization. The design also lacks fully connected layers at the end of the network. ResNets are among the most advanced convolutional neural network models available today and are a default option when employing a convolutional neural network in practice.
5. GoogLeNet

Developed at Google by Szegedy et al., GoogLeNet introduced the Inception module, which dramatically reduced the number of parameters in the network compared to AlexNet.
# 1. Import libraries
# 2. Import dataset
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 30, 30, 64)        1792
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 28, 28, 64)        36928
_________________________________________________________________
dropout (Dropout)            (None, 28, 28, 64)        0
_________________________________________________________________
flatten (Flatten)            (None, 50176)             0
_________________________________________________________________
dense (Dense)                (None, 100)               5017700
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1010
=================================================================
Total params: 5,057,430
Trainable params: 5,057,430
Non-trainable params: 0
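The parameter counts in the summary above follow directly from the (F x F x C + 1) · K formula given earlier; they can be reproduced in plain Python (the helper names here are illustrative, not part of Keras):

```python
def conv_params(f, c, k):
    # (filter height * width * input channels + 1 bias) per filter, for K filters
    return (f * f * c + 1) * k

def dense_params(n_in, n_out):
    # one weight per input per unit, plus one bias per unit
    return (n_in + 1) * n_out

print(conv_params(3, 3, 64))        # 1792   (conv2d: 3x3x3 filters, 64 of them)
print(conv_params(3, 64, 64))       # 36928  (conv2d_1: 3x3x64 filters, 64 of them)
print(dense_params(28*28*64, 100))  # 5017700 (dense, after flattening to 50176)
print(dense_params(100, 10))        # 1010   (dense_1)
```

Summing the four layer counts gives the reported total of 5,057,430 trainable parameters.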
# 5. Learning model
hist = CNN.fit(trainX, trainY, validation_data=(testX, testY), epochs=50, batch_size=32, verbose=1)  # Learning
Epoch 1/50
1563/1563 [==============================] - 54s 7ms/step -
loss: 7.5301 - accuracy: 0.1213 - val_loss: 2.2952 -
val_accuracy: 0.1151
Epoch 2/50
1563/1563 [==============================] - 10s 6ms/step -
loss: 2.2528 - accuracy: 0.1498 - val_loss: 2.0047 -
val_accuracy: 0.2428
Epoch 3/50
1563/1563 [==============================] - 10s 6ms/step -
loss: 1.9544 - accuracy: 0.2672 - val_loss: 1.6887 -
val_accuracy: 0.3762
…
…
…
Epoch 49/50
1563/1563 [==============================] - 10s 6ms/step -
loss: 0.1764 - accuracy: 0.9599 - val_loss: 4.1499 -
val_accuracy: 0.5796
Epoch 50/50
# 6. Test
result = CNN.predict(testX[:25,])  # Doing test
result = np.round(result)  # Round result
plt.figure(figsize=(10,10))
for i in range(len(result)):
    print(i, ':')  # Print data number
    plt.imshow(testX[i], cmap=plt.cm.binary)  # Print image
    plt.show()
    print("real : ", data_labels[np.argmax(testY[i])])  # Print real label
    print("predict : ", data_labels[np.argmax(result[i])])  # Print predicted label
    print('\n\n')
1 :
real : Ship
predict : Ship
2 :
real : Ship
predict : Ship
3 :
real : Airplane
predict : Ship
4 :
real : Frog
predict : Frog
5 :
real : Frog
predict : Frog
6.5.1. Overview
We have only come across two sorts of data so far: tabular data and imaging data.
We implicitly assumed that our data were drawn independently from some fixed distribution, which is often not the case. If the words in this text were permuted randomly, it would be nearly impossible to understand their meaning. Similarly, optical frames in a movie, voice
signals in a conversation, and internet surfing activity follow a strict sequence.
Another difficulty emerges from the fact that we may be required to accept a
sequence as an input and continue it. This is commonly used in time series analysis
to anticipate the stock market, a patient's fever curve, or the required acceleration
for a race vehicle.
In summary, while CNNs are good at processing spatial data, recurrent neural networks (RNNs) are superior at handling sequential data. RNNs use state variables
to retain previous data and use it in conjunction with current inputs to determine
current outputs.
RNNs might learn to build a large number of tiny programs, each of which captures
a nugget of information and runs in parallel, interacting to achieve very complex
results. Despite some valiant attempts, such as Tony Robinson's voice recognizer,
we were unable to fully harness the computing potential of RNNs for many years.
The simplest type of RNN is an MLP with the previous set of hidden units feeding back into the network.
To comprehend recurrent nets, you must first learn the fundamentals of feedforward
nets. These networks get their names from how they transmit data via a sequence
of mathematical operations carried out at the network's nodes. One sends data
straight through, while the other loops it; the latter is referred to as recurrent.
Recurrent networks use the current input sample they view as input and what they
have seen in the past. Fig. (6.29) shows a typical working principle of an RNN unit.
A recurrent net's choice at time step (t-1) impacts the decision it will make at time
step t, one instant later (Fig. 6.30). So, recurrent networks have two sources of
input: the present and the recent past, which combine to decide how they respond
to input data.
In the classical architecture of the recurrent neural network, for each of the timestamps t, we can frame the equations of the activation a_t and the output y_t as:

$$a_t = g_1(W_{aa} a_{t-1} + W_{ax} x_t + b_a)$$
$$y_t = g_2(W_{ya} a_t + b_y)$$

where W_aa, W_ax, W_ya, b_a, b_y are the coefficients, with activation functions g_1 and g_2.
Recurrent networks differ from feedforward networks in that they have a feedback
loop related to previous judgments, absorbing their outputs as input moment after
moment. Recurrent networks are frequently described as having memory. The
sequence itself contains information, which recurrent networks employ to do jobs
that feedforward networks can't. The hidden state of the recurrent network
preserves that sequential information, which spans several time steps as it cascades
forward to impact the processing of each subsequent sample.
$$h_t = \phi(W x_t + U h_{t-1})$$
𝒉𝒕 is the hidden state at time step 𝒕. It is a function of the input at the same time
step 𝒙𝒕 , modified by a weight matrix 𝑾 added to the hidden state of the previous
time step 𝒉𝒕−𝟏 multiplied by its hidden-state-to-hidden-state matrix 𝑼, also known
as a transition matrix similar to a Markov chain. The produced error will be sent to
them via backpropagation and used to alter their weights until the error can no
longer be reduced. The function φ, applied to the sum of the weighted input and the hidden state, is either a logistic sigmoid or Tanh; this is a common technique for condensing extremely large or very small values into a logistic space and keeping gradients practical for backpropagation. Since this feedback loop happens at
each time step in the series, each hidden state contains evidence of not just the
preceding hidden state 𝒉𝒕−𝟏 , but also all those that came before it for as long as
memory allows. A recurrent network will utilize the first character in a series of
letters to assist in perceiving the second character.
Then, using the chain rule, we compute and store gradients via backpropagation.
Since sequences can be rather long, the dependence can be quite long as well. For
example, in a 5000-character sequence, the first token might substantially impact the token in the last position. Computing that gradient directly is impractical, since we would have to perform roughly 5000 matrix products to reach it.
Then the update of the gradient descent weights from each time step would be:

$$\Delta w_{ij} = -\alpha \, \frac{\partial E_{total}(t_0, t_l)}{\partial w_{ij}} = -\alpha \sum_{t = t_0}^{t_l} \frac{\partial E_{se}(t)}{\partial w_{ij}}$$
The partial derivative $\partial E_{se} / \partial w_{ij}$ has contributions from the various instances of each weight $w_{ij} \in \{w_{IH}, w_{HH}\}$ and relies on the input and hidden-layer activations at preceding time steps (where $w_{IH}$ and $w_{HH}$ are the weights between the input-to-hidden and hidden-to-hidden units, respectively). The error then backpropagates through time across the network. In real-life applications, this technique shows enormous improvement.
6.5.2.2. Need for More than RNN: Vanishing and Exploding Gradient Problems
Recurrent networks, like most neural networks, are rather old. The vanishing
gradient problem became a key roadblock for recurrent network performance. The
gradient communicates the change in all weights concerning the change in error. If
we don't know the gradient, we will not be able to modify the weights to reduce
error, and the network will stop learning.
Deep neural networks' layers and time steps are related to each other by multiplication; therefore, derivatives are prone to vanishing or exploding. Exploding gradients are relatively simple to fix, as they can be truncated or clipped. Vanishing gradients, which can become too tiny for computers or networks to use, are a more difficult challenge to tackle.
Natural language processing and speech recognition are two domains where RNN
models are commonly employed. Several forms of recurrent neural networks are
accessible, depending on the applications, sources of input, and necessary output.
Let's have a look at the different types:
1. One-to-one
One-to-one (Tx = Ty = 1) RNN is also known as a plain neural network (Fig. 6.31). It
is concerned with a fixed-size input to a fixed-size output, where the output is
independent of the preceding output—for instance, application in image
classification.
2. One-to-many
One to Many (Tx=1, Ty>1) is a kind of RNN algorithm used when a single input
yields numerous outputs (Fig. 6.32). Music creation is a simple illustration of how
it may be used. RNN models are used in music generation models to produce a
music piece from a single musical note.
3. Many-to-one
Many-to-One (Tx>1, Ty=1) RNNs read a sequence of inputs and produce a single output, as in sentiment classification, where a whole sentence is mapped to one label.
4. Many-to-many
Many-to-Many RNN (Tx>1, Ty>1) architecture accepts many inputs and produces
multiple outputs, as is obvious, but there are two types of Many-to-Many models.
Fig. (6.34) shows the basic many-to-many RNN architecture.
When the input and output layers are the same sizes, this type of many-to-many
(Tx=Ty) is used. This may also be seen as every input having an output, with
Named-entity Recognition being a famous example.
5. Many-to-many (Bidirectional)
Neural networks were created to resemble the brain. The brain's memory, however, is based on associations. For example, even in a new setting, we can recognize a familiar face within 100-200 milliseconds. When we hear only a few bars of music, we may recall a whole sensory experience, including sounds and images. In the brain, one item is frequently associated with another.
Pattern recognition issues are solved with multilayer neural networks trained using
the back-propagation technique. However, we need a different sort of network to
mimic the associative features of human memory: a recurrent neural network. From
its outputs to its inputs, a recurrent neural network has feedback loops. The presence
of such loops has a significant influence on the network's learning capabilities.
In the 1960s and 1970s, numerous academics were interested in the stability of
recurrent networks. However, no one could predict which network would be stable,
and several academics were skeptical that a solution could be found at all. Only in
1982, when John Hopfield established the physical idea of storing information in a
dynamically stable network, was the challenge overcome. Fig. (6.36) shows a single-layer n-neuron Hopfield network.
The Hopfield network employs McCulloch and Pitts neurons as its computational unit, with the sign activation function:
$$Y^{sign} = \begin{cases} +1, & \text{if } X > 0 \\ -1, & \text{if } X < 0 \\ \;\;Y, & \text{if } X = 0 \end{cases}$$
Synaptic weights between neurons in the Hopfield network are often expressed as
a matrix:
$$W = \sum_{m=1}^{M} Y_m Y_m^T - M\,I$$
where $M$ is the number of states that the network must memorize, $Y_m$ is the $n$-dimensional binary vector for state $m$, $I$ is the $n \times n$ identity matrix, and superscript $T$ indicates matrix transposition.
244 Machine Learning and Its Application Indranath Chatterjee
The weight matrix W, the current input vector X, and the threshold matrix θ determine the stable state-vertex. If the input vector is partially wrong or incomplete, the initial state will converge towards the stable state-vertex after a few cycles.
Assume that our network is expected to memorize two opposing states, (1, 1, 1)
and (-1, -1, -1). Fig. (6.37) shows the possible states for the three-neuron Hopfield
network. Thus,
$$Y_1 = \begin{bmatrix}1\\1\\1\end{bmatrix} \quad \text{and} \quad Y_2 = \begin{bmatrix}-1\\-1\\-1\end{bmatrix}, \qquad I = \begin{bmatrix}1&0&0\\0&1&0\\0&0&1\end{bmatrix}$$
Now, we determine the weight matrix as:
$$W = \begin{bmatrix}1\\1\\1\end{bmatrix}\begin{bmatrix}1&1&1\end{bmatrix} + \begin{bmatrix}-1\\-1\\-1\end{bmatrix}\begin{bmatrix}-1&-1&-1\end{bmatrix} - 2\begin{bmatrix}1&0&0\\0&1&0\\0&0&1\end{bmatrix} = \begin{bmatrix}0&2&2\\2&0&2\\2&2&0\end{bmatrix}$$
The network is then validated using the input sequences $X_1$ and $X_2$, which are equal to the output values $Y_1$ and $Y_2$.
The Hopfield network is first activated by applying the input vector X. The actual
output vector Y is then calculated, and the result is compared to the initial input
vector X.
$$Y_1 = \operatorname{sign}\left\{\begin{bmatrix}0&2&2\\2&0&2\\2&2&0\end{bmatrix}\begin{bmatrix}1\\1\\1\end{bmatrix} - \begin{bmatrix}0\\0\\0\end{bmatrix}\right\} = \begin{bmatrix}1\\1\\1\end{bmatrix}$$
$$Y_2 = \operatorname{sign}\left\{\begin{bmatrix}0&2&2\\2&0&2\\2&2&0\end{bmatrix}\begin{bmatrix}-1\\-1\\-1\end{bmatrix} - \begin{bmatrix}0\\0\\0\end{bmatrix}\right\} = \begin{bmatrix}-1\\-1\\-1\end{bmatrix}$$
All six of the remaining states are unstable. Stable states, on the other hand, can attract states that are close to them. The unstable states (-1, 1, 1), (1, -1, 1), and (1, 1, -1) are attracted to the fundamental memory (1, 1, 1); each of them differs from that memory by a single-bit error. Likewise, the unstable states (-1, -1, 1), (-1, 1, -1), and (1, -1, -1) are drawn to the fundamental memory (-1, -1, -1). As a result, the Hopfield network may be used to rectify errors.
The maximum number of fundamental memories that can be stored and retrieved accurately is referred to as the storage capacity. For an n-neuron recurrent network, the maximum number of fundamental memories $M_{max}$ can be estimated as $M_{max} = 0.15n$.
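The three-neuron example above can be checked numerically. A small NumPy sketch, following the book's convention that sign(0) keeps the neuron's previous state:

```python
import numpy as np

# Stored patterns and weight matrix W = sum_m Y_m Y_m^T - M*I (M = 2)
Y1 = np.array([1, 1, 1])
Y2 = np.array([-1, -1, -1])
W = np.outer(Y1, Y1) + np.outer(Y2, Y2) - 2 * np.eye(3)

def recall(x, steps=5):
    """Iterate the Hopfield update Y = sign(W X) until it settles."""
    y = x.astype(float)
    for _ in range(steps):
        s = W @ y
        # Sign activation; keep the old value where the net input is exactly 0
        y = np.where(s > 0, 1.0, np.where(s < 0, -1.0, y))
    return y

print(W.astype(int))                 # [[0 2 2], [2 0 2], [2 2 0]]
print(recall(np.array([-1, 1, 1])))  # [1. 1. 1.]
```

Starting from the unstable state (-1, 1, 1), the update converges to the fundamental memory (1, 1, 1), illustrating the error-correcting behavior described above.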
The German-speaking academics Sepp Hochreiter and Jürgen Schmidhuber proposed a variant of the recurrent network with so-called Long Short-Term Memory units, or LSTMs, in the mid-1990s to solve the vanishing gradient problem.
LSTMs aid in the preservation of error that can be transmitted backward in time
and layers. They allow recurrent networks to learn across multiple time steps (over
1000) by keeping a more consistent error, creating a route to connect causes and
effects remotely. Because algorithms are usually confronted by settings where
reward signals are sparse and delayed, such as life itself, this is one of the critical
difficulties facing machine learning and AI.
In a gated cell, LSTMs store data outside of the usual course of the RNN. Through
open and close gates, the cell decides what to store and when to allow readings,
writes, and erasures. Unlike digital storage on computers, these gates are analog,
with element-wise multiplication by sigmoids in the range (0-1). Analog offers the
benefit of being differentiable, and hence ideal for backpropagation, over digital.
Fig. (6.38) shows the basic structure of an LSTM algorithm [25].
Those gates operate on the signals they receive, blocking or passing information
according to its strength and import, which they filter with their own sets of
weights, much like the nodes in a neural network. The learning process for recurrent
networks adjusts such weights and the weights that modify input and hidden states.
That is, by an iterative process of guessing, backpropagating error, and changing
weights via gradient descent, the cells learn when to allow data to enter, leave, or
be eliminated.
It's worth noting that addition and multiplication play different roles in the input
transformation in LSTM memory cells. Instead of multiplying the current state by
the incoming input to get the next cell state, they add the two, and that is the
difference. For input, output, and forgetting, different sets of weights filter the data.
The forget gate can be represented as a linear identity function: if the forget gate is open, the current state of the memory cell is simply multiplied by one to propagate forward one more time step. Fig. (6.39) shows the different gates in an LSTM.
1. Cell State
The key idea of LSTMs is the cell state, which acts like the kind of conveyor belt used in factories (Fig. 6.40): it transfers information from one end to the other.
2. Forget Gate
LSTM may delete or add information to the cell state, which is carefully controlled
via structures known as gates (Fig. 6.41). A sigmoid layer called the forget gate
layer decides what information from the cell state should be discarded.
𝑓𝑡 = 𝜎(𝑊𝑓 . [ℎ𝑡−1 , 𝑥𝑡 ] + 𝑏𝑓 )
3. Input Gate
It determines what new data will be stored in the cell state. First, the input gate layer determines which values will be updated (Fig. 6.42). A tanh layer then generates a new vector of candidate values. Finally, the two are combined to generate an update to the state.
𝑖𝑡 = 𝜎(𝑊𝑖 . [ℎ𝑡−1 , 𝑥𝑡 ] + 𝑏𝑖 )
4. Output Gate
The output gate determines the value of the next hidden state (Fig. 6.43). This state
stores data from earlier inputs. The output is based on the cell state.
$$O_t = \sigma(W_O \cdot [h_{t-1}, x_t] + b_O)$$
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$
$$h_t = O_t * \tanh(C_t)$$
where $(f_t * C_{t-1})$ denotes forgetting old information and $(i_t * \tilde{C}_t)$ denotes adding new information.
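The gate equations above combine into a single forward step. A minimal NumPy sketch with randomly initialized, purely illustrative weights (a real LSTM would learn W and b by backpropagation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the f/i/o gate and cell-state equations."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W['f'] @ z + b['f'])        # forget gate f_t
    i = sigmoid(W['i'] @ z + b['i'])        # input gate i_t
    c_tilde = np.tanh(W['c'] @ z + b['c'])  # candidate values C~_t
    o = sigmoid(W['o'] @ z + b['o'])        # output gate O_t
    c = f * c_prev + i * c_tilde            # C_t = f_t*C_{t-1} + i_t*C~_t
    h = o * np.tanh(c)                      # h_t = O_t * tanh(C_t)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(size=(n_hid, n_hid + n_in)) for k in 'fico'}
b = {k: np.zeros(n_hid) for k in 'fico'}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
print(h.shape, c.shape)   # (4,) (4,)
```

Note how the new cell state is formed by addition (f*c_prev + i*c_tilde) rather than by multiplying the old state with the input, which is exactly the design difference discussed in the text.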
When manually tuning hyperparameters for RNNs, keep the following in mind:
1. Keep an eye out for overfitting, which occurs when a neural network
"memorizes" the training data.
2. Regularization can help: examples of regularization methods are L1, L2, and
Dropout.
3. Create a separate test set where the network will not be trained.
4. The larger the network, the more powerful it becomes; yet it is also easier to overfit.
5. To know when to stop, evaluate the test set's performance at each epoch.
6. Use the softsign activation function instead of tanh for LSTMs.
# 1. Import libraries
import numpy as np
import os
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dropout
from keras.layers import SimpleRNN
from keras.layers import Flatten
from keras.layers import Dense
import scipy.io as sio
import scipy.io.wavfile
import tensorflow as tf
!pip install natsort
import natsort
from keras.utils.np_utils import to_categorical
# 2. Kaggle install
!pip install kaggle  # Install kaggle
!pip install --upgrade --force-reinstall --no-deps kaggle  # Check newest version
# Fixing permission
!chmod 600 ~/.kaggle/kaggle.json
# 3. Download dataset
# https://www.kaggle.com/charlesaverill/imagenet-voice?select=inv_data
!kaggle datasets download -d charlesaverill/imagenet-voice
!unzip imagenet-voice.zip  # Unzip dataset
# 4. Make dataset
path = os.listdir('inv_data')   # Get file names
path = natsort.natsorted(path)  # Sort file names
# Split sizes (assumed example values; adjust to the dataset)
num_class = 5
num_train = 12
num_val = 2
num_test = 2
# Make arrays
trainX = np.empty((1,5000,1))
valX = np.empty((1,5000,1))
testX = np.empty((1,5000,1))
trainY = list()
valY = list()
testY = list()
# Split data
for i in range(num_class):
    filepath = '/content/inv_data/' + path[i+5] + '/'  # Set file folders
    filenames = list()
    for j in range(num_train+num_val+num_test):
        filenames += [path[i+5] + '_' + str(j) + '.wav']  # Set file names
    print('filenames', filenames)
    for t in range(num_train):
        sample_rate, dataframe = sio.wavfile.read(filepath + filenames[t], 'r')  # Read wav file
        dataframe = dataframe[:5000]  # Cut
        trainX = np.append(trainX, dataframe.reshape(1,-1,1), axis=0)  # Add to array
        trainY.append(i)  # Make Y data
    for v in range(num_val):
        sample_rate, dataframe = sio.wavfile.read(filepath + filenames[v+num_train], 'r')
        dataframe = dataframe[:5000]
        valX = np.append(valX, dataframe.reshape(1,-1,1), axis=0)
        valY.append(i)
    for te in range(num_test):
        sample_rate, dataframe = sio.wavfile.read(filepath + filenames[te+num_train+num_val], 'r')
        dataframe = dataframe[:5000]
        testX = np.append(testX, dataframe.reshape(1,-1,1), axis=0)
        testY.append(i)
plt.figure(2)
plt.plot(trainX[11,:,:])
plt.title(path[6])
plt.show()
Model: "sequential_1"
____________________________________________________________
Layer (type)                 Output Shape              Param #
============================================================
# 7. Learning model
hist = RNN.fit(trainX, trainY, validation_data=(valX, valY),
               epochs=20, batch_size=32, verbose=1)
Epoch 1/20
2/2 [==============================] - 22s 10s/step - loss:
3.3248 - accuracy: 0.4910 - val_loss: 28.7093 - val_accuracy:
0.7000
Epoch 2/20
2/2 [==============================] - 20s 10s/step - loss:
1.6483 - accuracy: 0.9451 - val_loss: 28.2392 - val_accuracy:
0.7000
Epoch 3/20
2/2 [==============================] - 21s 9s/step - loss:
0.2726 - accuracy: 0.9785 - val_loss: 39.3120 - val_accuracy:
0.4000
Epoch 4/20
2/2 [==============================] - 21s 10s/step - loss:
2.8248 - accuracy: 0.9667 - val_loss: 53.3486 - val_accuracy:
0.6000
Epoch 5/20
2/2 [==============================] - 20s 9s/step - loss:
5.5217e-08 - accuracy: 1.0000 - val_loss: 62.5805 -
val_accuracy: 0.7000
…
…
Epoch 20/20
2/2 [==============================] - 21s 10s/step - loss:
0.0000e+00 - accuracy: 1.0000 - val_loss: 137.2343 -
val_accuracy: 0.6000
# 8. Test
_, score = RNN.evaluate(testX, testY, batch_size=32, verbose=0)
score = score * 100.0
print('Test acc : %.3f' % (score))
CONCLUDING REMARKS
CHAPTER 7
Feature Engineering
Abstract: This chapter deals with the most state-of-the-art feature selection techniques used alongside machine learning algorithms. Feature selection is a crucial element in the better performance of any machine learning algorithm. The chapter covers two major types of feature selection algorithms, namely, filter-based and evolutionary-based. Among the filter-based algorithms, it covers two kinds of approaches: hypothesis testing, such as the t-test, z-test, ANOVA, and MANOVA, and correlation-based tests such as Pearson's correlation, the Chi-square test, and Spearman's rank correlation. The chapter also explains various evolutionary methods, such as genetic algorithms, particle swarm optimization, and ant colony optimization. For each algorithm, the chapter provides a detailed description along with the optimized algorithm for performing feature selection.
7.1. INTRODUCTION
We encounter a vast number of features while working with large datasets. The most difficult challenge is to choose the most specific and relevant features from among all the attributes in a dataset; feature selection techniques assist us in doing so.
In order to predict the target variable, feature selection approaches are used to reduce the number of input variables to those thought to be most beneficial to a model. The methods aid us in selecting the fewest number of features
Indranath Chatterjee
All rights reserved-© 2021 Bentham Science Publishers
1. Filter Techniques
2. Wrapper Techniques
In wrapper techniques, we aim to utilize a subset of features and train a model with them. We then decide whether to add or delete features from the subset based on the inferences drawn from the previous model. The problem thereby reduces to a simple search problem. These approaches are generally highly time-consuming to compute.
3. Embedded Techniques
Embedded methods combine the qualities of filter and wrapper methods. They are implemented with algorithms that have built-in feature selection mechanisms. LASSO and RIDGE regression are two common examples of these techniques, both of which have built-in penalization mechanisms to reduce overfitting.
Wrapper techniques are disciplined approaches to feature selection that tie the feature selection process to the kind of model being produced: they assess feature subsets to compare model performance between feature sets and then select the best-performing subset [27].
The three primary techniques used to get the most optimal results under the wrapper
method are:
1. Forward Selection
2. Backward Elimination
3. Bi-directional elimination
1. Forward Selection
Forward selection is an iterative method that begins with no features in the model. At each iteration, we add the feature that best improves the model, repeating until the addition of a new feature no longer improves performance.
2. Backward Elimination
Backward elimination begins with all of the features and eliminates the least significant feature at each iteration, thereby improving the model's performance. We repeat this process until no improvement is observed on removing a feature. It mirrors forward selection: it starts with the whole collection of features and works backward from there, eliminating features to find the best subset of a specific size.
3. Bi-directional Elimination
The forward selection and backward elimination procedures are helpful only until we need to delete previously added features. Bi-directional elimination uses both approaches to pick the most relevant characteristics from a group of features. It functions similarly to a forward selection technique, but it also verifies the relevance of previously added features: when the value of a previously introduced feature decreases, it is removed using the backward elimination approach. As a result, it is a hybrid of forward selection and backward elimination.
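The wrapper strategies above are easy to sketch. Here is a hypothetical greedy forward-selection loop that scores each candidate subset with the R² of an ordinary least-squares fit; the scoring function and the stopping rule are illustrative choices, not the only possibilities:

```python
import numpy as np

def r2_score_ols(X, y):
    """R^2 of a least-squares fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def forward_selection(X, y, max_feats=None):
    remaining, chosen, best = list(range(X.shape[1])), [], -np.inf
    while remaining and len(chosen) != max_feats:
        scores = [(r2_score_ols(X[:, chosen + [j]], y), j) for j in remaining]
        score, j = max(scores)
        if score <= best:        # stop when no candidate improves the model
            break
        best, chosen = score, chosen + [j]
        remaining.remove(j)
    return chosen

# Synthetic data: only features 2 and 4 actually drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 2] - 2 * X[:, 4] + 0.1 * rng.normal(size=200)
print(forward_selection(X, y, max_feats=2))   # [2, 4]
```

Backward elimination would be the mirror image of this loop: start from all five columns and drop the one whose removal hurts R² the least.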
Filter-based feature selection is the most common feature selection approach, and one should be familiar with it to build a decent model. With this approach, features are chosen using the statistical relationships taught in school [26].
This section deals with the statistical hypothesis testing approaches under the filter-
based methods. Hypothesis testing is a statistical procedure in which an analyst
verifies a hypothesis about a population parameter. The analyst's technique is
determined by the type of data and the purpose of the study. The use of sample data
to assess the plausibility of a hypothesis is known as hypothesis testing. Such
information might originate from a more significant population or a data-gathering
method.
7.2.1. T-test
A t-test is used to compare two big data sets, such as samples drawn from two large populations. It is a statistical test that compares two sets of data and determines their difference. When we use a t-test, we get a t-score and a p-value. These numbers tell us how different the two groups are and how likely it is that this variation is due to chance or sampling error.
There are three types of t-tests, each of which is employed in distinct scenarios:
1. Independent Samples
The Independent Samples t-test compares the means of two independent groups to
discover if there is statistical evidence that the population means are significantly
different. This t-test is a parametric test.
Formula
To determine if the means are different, the t-test statistic value is calculated as
follows:
$$t = \frac{m_x - m_y}{\sqrt{\dfrac{S^2}{n_x} + \dfrac{S^2}{n_y}}}$$
where $S^2$ is the common (pooled) variance estimator for the two samples. It may be computed with the following formula:
$$S^2 = \frac{\sum (x - m_x)^2 + \sum (y - m_y)^2}{n_x + n_y - 2}$$
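The pooled-variance formula above can be verified against SciPy's equal-variance t-test; the sample values here are made up for illustration:

```python
import numpy as np
from scipy import stats

x = np.array([20.5, 21.0, 19.8, 22.1, 20.9, 21.7])
y = np.array([18.2, 19.0, 17.5, 18.8, 19.4, 18.1])

# Manual computation with the pooled variance S^2
mx, my, nx, ny = x.mean(), y.mean(), len(x), len(y)
S2 = (((x - mx)**2).sum() + ((y - my)**2).sum()) / (nx + ny - 2)
t_manual = (mx - my) / np.sqrt(S2/nx + S2/ny)

t_scipy, p = stats.ttest_ind(x, y)   # equal-variance t-test by default
print(abs(t_manual - t_scipy) < 1e-9)   # True
```

The hand-computed statistic and `scipy.stats.ttest_ind` agree, and the returned p-value tells us how likely a mean difference this large is under the null hypothesis.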
2. One Sample
The One-Sample t-test is a parametric test that examines if the sample mean differs
from a known or hypothesized population mean statistically.
Formula
Let $X$ denote a set of $n$ values with a mean of $m$ and a standard deviation of $S$. The following formula compares the observed mean $m$ to a theoretical population mean $\mu$:
$$t = \frac{m - \mu}{S/\sqrt{n}}$$
Where degree of freedom (df) = 𝑛 − 1.
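A quick check of the one-sample formula against `scipy.stats.ttest_1samp`, using made-up sample values and a hypothesized mean of 5.0:

```python
import numpy as np
from scipy import stats

x = np.array([4.9, 5.1, 5.3, 4.8, 5.2, 5.0, 5.4])   # made-up sample
mu = 5.0                                             # hypothesized mean

# t = (m - mu) / (S / sqrt(n)), with S the sample standard deviation
t_manual = (x.mean() - mu) / (x.std(ddof=1) / np.sqrt(len(x)))
t_scipy, p = stats.ttest_1samp(x, mu)                # df = n - 1
print(abs(t_manual - t_scipy) < 1e-9)   # True
```

Note the `ddof=1` in the standard deviation, which matches the n − 1 degrees of freedom in the text.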
3. Paired Sample
The dependent sample t-test, also known as the paired sample t-test, is a statistical
technique for determining if the mean difference between two data sets is zero.
Each subject or object is measured twice in a paired sample t-test, resulting in pairs
of observations. Case-control studies and repeated-measures designs are common
uses for the paired sample t-test. Assume you want to assess the efficacy of a
company's training program. You might use a paired sample t-test to compare the
performance of a group of employees before and after they have completed the
program.
Formula
Let $d$ denote the set of $n$ pairwise differences, with a mean of $m$ and a standard deviation of $S$. The formula is:
$$t = \frac{m}{S/\sqrt{n}}$$
Where the degree of freedom (df) = 𝑛 − 1.
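The training-program example can be sketched with `scipy.stats.ttest_rel`; the before/after scores below are invented for illustration:

```python
import numpy as np
from scipy import stats

# Made-up employee scores before and after a training program
before = np.array([72.0, 68.5, 75.2, 70.1, 69.8, 74.0])
after  = np.array([75.1, 70.2, 76.0, 73.3, 70.5, 77.2])

d = after - before                      # pairwise differences
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
t_scipy, p = stats.ttest_rel(after, before)
print(abs(t_manual - t_scipy) < 1e-9)   # True
```

The paired test is simply a one-sample t-test applied to the differences, which is exactly what the formula above expresses.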
7.2.1.1. Hypothesis
For the Independent and One Sample tests, the null hypothesis is that both groups'
means are identical. The null hypothesis for the Paired Sample is that the pairs of
differences between both tests are equal.
The Independent Samples t-test's null hypothesis (H0) and alternative hypothesis (H1) may be stated as H0: $\mu_1 = \mu_2$ and H1: $\mu_1 \neq \mu_2$, where the population means for groups 1 and 2 are $\mu_1$ and $\mu_2$, respectively.
The core ideas of hypothesis testing are quite similar in every case. We assume the null hypothesis to be correct until we uncover convincing evidence to the contrary, at which point the alternative hypothesis is accepted. We also choose the significance level (α), the probability of rejecting the null hypothesis when it is actually true. As a result, the smaller α is, the more proof is required to reject the null hypothesis. Table 7.1 clearly depicts the occurrence of each kind of error based on the outcome for the null and alternative hypotheses.
Table 7.1. Table showing the occurrence of Type-1 or Type-2 based on the decision on the
hypothesis.
Decision Made        Null Hypothesis True    Null Hypothesis False
Reject Null          Type 1 Error            Correct Decision
Accept Null          Correct Decision        Type 2 Error
For example, consider a person on trial in court. There are two kinds of judgment errors: a Type 1 error occurs when judgment is given against the person although he is innocent, and a Type 2 error occurs when judgment is given in favor of the person although he is guilty.
There are two versions of the t-distribution table: one-tail and two-tail. The one-tail version is used to evaluate situations that have a defined direction and a fixed value or range (positive or negative).
7.2.1.3. T-score
The t-score is the ratio of the mean difference between two datasets and the
difference within the dataset generated by the T-test. While the numerator is simple
to compute, the denominator, depending on the data values involved, can be a little
more complicated. The ratio's denominator is a measurement of dispersion or
variability. As a result, the higher the score, the greater the differences between the
datasets, whereas the lower the score, the greater the similarities. T-Score is
sometimes called a t-value.
7.2.1.4. P-value
7.2.2. Z-test
When the variances are known and the sample size is large, a z-test is used to evaluate whether two population means are different. To execute a valid z-test, the test statistic is expected to have a normal distribution, and other factors, such as the standard deviation, should be known.
The z-test is a hypothesis test in which the z-statistic follows a normal distribution. The z-test is best utilized for samples larger than 30 because, by the central limit theorem, sample means from samples of more than 30 observations are assumed to be approximately normally distributed [28].
The null and alternative hypotheses and the alpha and z-score should all be given
when doing a z-test. The test statistic should next be computed, followed by the
findings and conclusion.
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
where $\bar{X}$ is the mean of the sample, $\mu$ is the mean of the population, $\sigma$ is the standard deviation of the population, and $n$ is the sample size.
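A small worked z-test following the formula above, with a made-up sample and assumed population parameters (μ = 100, σ = 15):

```python
import numpy as np
from scipy import stats

mu, sigma = 100.0, 15.0   # assumed known population mean and std
sample = np.array([104.2, 98.7, 110.3, 101.5, 97.9, 108.4, 103.1, 99.6])

z = (sample.mean() - mu) / (sigma / np.sqrt(len(sample)))
p = 2 * stats.norm.sf(abs(z))   # two-tailed p-value under N(0, 1)
print(round(z, 3))              # 0.559
```

Here z is small and p is large, so this sample gives no convincing evidence that its population mean differs from 100.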
A Z-score measures a data point's relationship to the mean of a collection of values. When a Z-score is zero, the data point's score is identical to the mean score. A value one standard deviation from the mean has a Z-score of 1.0. Z-scores can be positive or negative, with a positive value indicating that the score is above the mean and a negative value that it is below.
The central limit theorem (CLT) says that as the sample size grows bigger, the
sample distribution approaches a normal, provided that all samples are similar in
size and independent of the population distribution shape. For the CLT to correctly
forecast the features of a population, sample sizes of 30 or more are deemed
sufficient.
In statistics, several distinct types of tests are employed (i.e., f-test, chi-square test,
t-test). A Z test would be used if:
Data points should be unrelated to one another. In other words, one data point is
unrelated to or unaffected by another.
Your data should be approximately normally distributed. This isn't usually an issue with large sample sizes (above 30).
Your information should be chosen randomly from a population, with each item
having an equal chance of being chosen.
Z-tests are a statistical technique for evaluating a hypothesis when we either know the population variance or do not know it but have a large sample size (n ≥ 30). On this basis, the z-test is of two kinds: the one-sample z-test and the two-sample z-test.
1. One-Sample Z-test
When we wish to compare a sample mean to the population mean, we use the One-
Sample Z-test.
2. Two-Sample Z-test
When we wish to compare the mean of two samples, we use the Two-Sample Z-
test.
T: The t-test is a type of parametric test used to determine how the averages of two sets of data differ from one another when the standard deviation or variance is unknown.
3. Z: In the z-test, all data points are independent, and Z follows a normal distribution with a mean of zero and a variance of one [30].
T: In the t-test, all data points are independent, and the sample values must be correctly documented and recorded.
7.2.3. ANOVA
The analysis of variance (ANOVA) is a statistical method that divides a data set's
observed aggregate variability into two parts: systematic variables and random
factors. Random factors have no statistical impact on the provided data set, but
systematic variables do. In regression studies, analysts employ the ANOVA test to
assess the impact of independent factors on the dependent variable.
The ANOVA test allows you to compare more than two groups simultaneously to
see whether there's a link between them. The F statistic (also known as the F-ratio)
results from the ANOVA formula allow examining several sets of data to identify
the variability between and within samples.
$$F = \frac{MST}{MSE}$$
where MST is the mean square of treatments and MSE is the mean square of error:
$$MST = \frac{SST}{p - 1}, \qquad MSE = \frac{SSE}{N - p}, \qquad SSE = \sum (n - 1)\,\sigma^2$$
where SST is the sum of squares due to treatments, SSE is the sum of squares due to error (summed over the groups, each with sample size $n$ and variance $\sigma^2$), $p$ is the number of populations (groups), and $N$ is the total number of observations.
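The F = MST/MSE computation can be verified against `scipy.stats.f_oneway`, using three made-up treatment groups:

```python
import numpy as np
from scipy import stats

# Three made-up treatment groups
g1 = np.array([85.0, 86.2, 88.1, 84.5, 87.3])
g2 = np.array([80.1, 79.5, 82.3, 81.0, 78.8])
g3 = np.array([90.2, 91.5, 89.8, 92.0, 90.7])

F, p = stats.f_oneway(g1, g2, g3)

# Manual F = MST / MSE for comparison
groups = [g1, g2, g3]
N, n_groups = sum(len(g) for g in groups), len(groups)
grand = np.concatenate(groups).mean()
SST = sum(len(g) * (g.mean() - grand)**2 for g in groups)   # between-group
SSE = sum(((g - g.mean())**2).sum() for g in groups)        # within-group
F_manual = (SST / (n_groups - 1)) / (SSE / (N - n_groups))
print(abs(F - F_manual) < 1e-9)   # True
```

A large F (between-group variability dominating within-group variability) yields a small p, indicating that at least one group mean differs.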
7.2.4. MANOVA
The multivariate analysis of variance (MANOVA) extends ANOVA to several dependent variables and examines whether their linear combination varies across distinct groups or levels of the independent variable. The MANOVA simply evaluates whether the independent grouping variable concurrently explains a statistically significant amount of variance in the dependent variables [32]. In particular, ANOVA examines the differences in means between two or more groups, whereas MANOVA examines the differences between two or more vectors of means.
7.2.4.1. Assumptions
Unequal sample sizes - When cells in a factorial MANOVA have various sample
sizes, the sum of squares for effect plus error does not equal the entire sum of
squares, just like in ANOVA. As a result, the main effect and interaction tests are
linked. In MANOVA, SPSS corrects uneven sample sizes.
a) MANOVA allows you to test multiple dependent variables at the same time.
b) With the addition of each additional variable, one degree of freedom is lost.
Variables may be split further, such as into integer and floating-point for numerical variables and boolean, ordinal, or nominal for categorical variables.
The correlation test is used to determine the strength of a relationship between two
variables.
Always keep in mind that the most favored correlation approach is a parametric
correlation.
$$r = \frac{\sum (x - m_x)(y - m_y)}{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}$$
where $x$ and $y$ are two vectors (data) of size $n$, and $m_x$ and $m_y$ are the means of $x$ and $y$.
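Pearson's r from the formula above, checked against `scipy.stats.pearsonr` on made-up, roughly linear data:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 12.1])   # roughly y = 2x

r_manual = (((x - x.mean()) * (y - y.mean())).sum()
            / np.sqrt(((x - x.mean())**2).sum() * ((y - y.mean())**2).sum()))
r_scipy, p = stats.pearsonr(x, y)
print(abs(r_manual - r_scipy) < 1e-9, r_scipy > 0.99)   # True True
```

Because the data are nearly perfectly linear, r is close to +1; a strongly decreasing relationship would instead give a value near −1.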
A correlation may or may not imply a causal relationship; conversely, a causal relationship between two variables may or may not produce a correlation between them. Outliers can significantly influence correlations: a single unexpected observation can have a significant impact on a correlation, and a cursory study of a scatterplot can quickly reveal such outliers.
7.3.2.2. Assumptions
Two variables should have a linear connection with one another. A scatterplot
may be used to determine this: plot the values of variables on a scatter diagram and
see if the plot produces a straight line.
The Chi-square test is a hypothesis-testing approach; χ is the Greek letter known as chi. One of the two popular Chi-square tests checks whether observed frequencies in one or more categories match the expected frequencies. The Chi-square statistic is widely employed for evaluating relationships between categorical variables. The null hypothesis is that the categorical variables in the population have no relationship, i.e., they are independent [33].
$$\chi_c^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
where $c$ is the degrees of freedom, $O_i$ is the observed value, and $E_i$ is the expected value. The summation sign indicates that the term must be calculated for every item in your data collection.
Chi-square tests are divided into two categories. For distinct reasons, both employ
the chi-square statistic and distribution:
For example, it determines whether each type of candy has the same amount of
pieces in each bag.
A low chi-square score indicates that your observed data closely match the expected data. In principle, chi-square would be 0 if the observed and anticipated values were exactly equal, but this is unlikely to happen in practice. Determining whether a chi-square test result is large enough to suggest a statistically significant difference is not as straightforward as it seems.
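The candy example above can be run as a goodness-of-fit test with `scipy.stats.chisquare`; the counts are invented for illustration:

```python
import numpy as np
from scipy import stats

# Goodness of fit: does a bag of 4 candy types contain equal counts?
observed = np.array([28, 22, 26, 24])
expected = np.array([25, 25, 25, 25])   # equal split of 100 pieces

chi2_manual = ((observed - expected)**2 / expected).sum()
chi2, p = stats.chisquare(observed, expected)
print(abs(chi2 - chi2_manual) < 1e-9)   # True
```

Here chi-square is small (0.8 on 3 degrees of freedom) and p is large, so the observed counts are consistent with the "equal amounts" hypothesis.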
The alpha level (α): the most common α level is 0.05 (5%), though other values such as 0.01 or 0.10 are also possible. The researcher chooses this himself.
You conduct the same analysis procedures for the Chi-square goodness of fit test
and the Chi-square test of independence mentioned below.
1. Before you start collecting data, define your null and alternative hypotheses.
2. Decide on the alpha value. This entails determining how much of a chance you're
prepared to take on the possibility of coming to the erroneous decision. Let's say
you want to test for independence with alpha=0.05. You have opted to take a 5%
chance of concluding that the two variables are independent when they are not.
To put it another way, you may use Spearman's correlation to see if a non-
monotonic relationship has a monotonic component.
$$\rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$
2. Rank the two data sets. Ranking is achieved by giving the rating '1' to the largest number in a column, '2' to the second largest value, and so on; the smallest value in the column receives the lowest ranking. This should be done for both sets of measurements.
4. Determine the rank difference (d): this is the difference between the ranks of the two values on each row of the table. The first value's rank is subtracted from the second value's rank.
5. To eliminate negative values, square the differences and then add them together.
6. Using the equation stated above, calculate the coefficient. The result will always
be between 1.0 and -1.0, indicating a perfect positive correlation and a perfect
negative correlation, respectively.
Because they ideally make no assumptions about the underlying fitness landscape, evolutionary algorithms frequently perform well at approximating solutions to a wide range of problems. In biological evolution modeling, the most common applications of evolutionary algorithms are explorations of microevolutionary processes and planning models based on cellular processes.
There are various kinds of evolutionary algorithms. However, to satisfy the need
and requirements of the readers of this book, this section presents the three most
essential evolutionary algorithms, which are widely used alongside machine
learning algorithms, as a part of feature selection. Here, we will be learning genetic
algorithm, particle swarm optimization, and ant colony optimization in the next
section.
The foundations of genetic algorithms were laid in 1962. Later, in 1992, John Koza used genetic algorithms to evolve programs that fulfill specific tasks; he called his approach genetic programming.
In a population, there are diverse genetic structures and chromosomal behavior due
to different species. The genetic algorithm is based on them, and the following are
examples of them:
The search space is the set of all individuals in the environment. Each individual is a solution to a specific challenge and is represented by a finite-length vector of components. These components are comparable to genes; a chromosome is thus made up of several genes.
As previously stated, there is continual rivalry among individuals for resources and mates, and each individual is assigned a fitness score reflecting its level of fitness. Individuals (each considered a chromosome in the genetic algorithm) with the highest fitness scores are sought. The genetic algorithm keeps track of a population of $n$ individuals and their fitness scores. An individual with a strong fitness score has a better probability of mating, and individuals with higher fitness scores are selected to produce offspring. Since the population size is fixed, some solutions die and are replaced by new offspring. After the previous generation's mating is completed, a new generation begins. The hope is that, as generations pass, better solutions will emerge while the least fit perish.
Each new generation carries, on average, better genes than the previous one, so
each successive generation contains better partial solutions than the one before.
The population has converged when newly produced children show no
substantial differences from those generated by preceding populations; the
algorithm is then said to have converged on a set of problem solutions.
This process divides into two parts: crossover and mutation. After the top
chromosomes are selected, these members are used to produce the algorithm's
next generation. New offspring are generated by combining the traits of the
chosen parents. Depending on the type of data this can be challenging, but in
most combinatorial settings it is feasible to produce valid combinations from
these inputs. The probability that an offspring acquires a mutation, and the
severity of that mutation, are usually governed by a probability distribution.
The genetic algorithm is applied when a particular generation is generated. For the
same, there are various operators. Fig. (7.1) illustrates the lifecycle of a typical
genetic algorithm.
2) Selection Operator: This operator selects the fittest individuals so that their
genes are passed down over multiple generations.
3) Crossover Operator: Two individuals are chosen using the selection operator,
and the crossover operator then allows them to mate. Crossover locations are
chosen at random, and the genes at these locations are swapped, creating an
entirely new individual (offspring). Crossover happens at a specific point, known
as the crossover point.
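The selection, crossover, and mutation steps described above can be sketched in a few lines of Python. In this toy example (the one-max fitness function, population size, and mutation rate are our own illustrative choices, not from this chapter), bit-string chromosomes evolve toward all ones:

```python
import random

random.seed(0)

GENES = 20            # chromosome length (finite-length vector of genes)
POP_SIZE = 30         # fixed population size
MUTATION_RATE = 0.01  # per-gene mutation probability

def fitness(chrom):
    # Toy "one-max" fitness: number of 1-genes in the chromosome
    return sum(chrom)

def select(pop):
    # Tournament selection: return the fitter of two random individuals
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    # Single-point crossover: join the parents at a random crossover point
    point = random.randint(1, GENES - 1)
    return p1[:point] + p2[point:]

def mutate(chrom):
    # Flip each gene with a small probability
    return [g ^ 1 if random.random() < MUTATION_RATE else g for g in chrom]

# Random initial population
pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]
for generation in range(50):
    # Each new generation replaces the old one (population size stays fixed)
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP_SIZE)]

best = max(pop, key=fitness)
print(fitness(best))  # approaches GENES as the population converges
```

With a different encoding of genes and a problem-specific fitness function, the same loop applies to feature selection, where each gene marks a feature as kept or dropped.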
Multiple objective or fitness functions can be used with a genetic algorithm. When
utilizing several fitness functions, we end up with a group of optimum points rather
than a single ideal point, making the procedure a little more complicated. The
Pareto front is a collection of optimum solutions that comprises equally ideal
chromosomes because no solution in the front dominates any other solution. Based
on the context of the problem or some other parameter, a curve or line is then used
to limit the set down to a single solution.
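The Pareto front described above can be computed with a simple dominance check. This small sketch (our own illustration, assuming two objectives that are both maximized) keeps only the non-dominated points:

```python
def dominates(a, b):
    # a dominates b if a is at least as good in every objective
    # and strictly better in at least one (maximization)
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    # Keep the points not dominated by any other point
    return [p for p in points if not any(dominates(q, p) for q in points)]

points = [(1, 5), (2, 4), (3, 3), (2, 2), (4, 1), (1, 1)]
print(pareto_front(points))  # → [(1, 5), (2, 4), (3, 3), (4, 1)]
```

The surviving points are "equally ideal" in the sense of the text: none of them dominates any other, so a further criterion is needed to pick a single solution.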
7.4.1.5. Termination
The algorithm must eventually come to an end. This generally happens in two ways:
either the algorithm has achieved its maximum runtime, or the method has hit a
performance threshold. A final solution is chosen and returned at this stage.
Optimization is the process of determining the best values for a system's particular
characteristics to meet all design criteria while keeping costs as low as feasible.
Problems of optimization can be encountered in every branch of study.
Fig. (7.2). A representative model showing the working behavior of the PSO algorithm after n iterations.
The PSO algorithm was created to graphically simulate the beautiful but
unexpected dance of a flock of birds.
7.4.2.1. Particles
$P_i^t = [x_{0,i}^t, x_{1,i}^t, x_{2,i}^t, \dots, x_{n,i}^t]$
Let's take a minute to think about our particles before moving on to the PSO
method. Each of these particles represents a candidate solution to the function to
be minimized, and each is defined by its coordinates in the search space. The
particles can be initialized at random in the search space. Each particle also has a
velocity that allows it to update its position over time while seeking the global
minimum.
$V_i^t = [v_{0,i}^t, v_{1,i}^t, v_{2,i}^t, \dots, v_{n,i}^t]$
The particles have already been dispersed at random in the search space. After
that, their velocities must be set. The velocity vector, defined by a speed in each
direction, is also randomized; this is why we speak of stochastic algorithms.
7.4.2.2. Swarms
PSO has a lot in common with evolutionary computing approaches like genetic
algorithms. The system starts with a population of random solutions and then
updates generations to look for optima. In contrast to genetic algorithms, however,
PSO lacks evolution operators such as crossover and mutation. The distinction is in
the manner in which the generations are updated.
$c_1 r_1 (P_{best\,local(i)}^{t} - P_i^t)$ is called the cognitive component,
$c_2 r_2 (P_{best(global)}^{t} - P_i^t)$ is called the social (global) component.
Across iterations, each particle's speed is stochastically accelerated towards its
prior best position and the group's best solution in the search space.
Each particle's velocity is adjusted at the end of each iteration; it is determined
by the two best values found so far and is subject to inertia.
The first value is the particle's best personal solution to date. The second is the
best global solution the swarm of particles has discovered thus far. Each particle
therefore keeps both its best personal solution and the best global solution in its
memory.
7.4.2.3. Optimization
Random terms weigh the acceleration at each iteration. The weights 𝒓𝟏 and 𝒓𝟐 are
used to modify cognitive and social acceleration stochastically. For each particle
and iteration, these two weights, 𝒓𝟏 and 𝒓𝟐 are different.
The hyperparameter 𝝎 defines the swarm's capacity to shift direction. The inertia
of the particles is related to the coefficient 𝝎. The stronger the convergence, the
smaller the coefficient 𝝎. It's best to stay away from 𝝎 > 𝟏 since this might cause
our particles to diverge.
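Putting these pieces together, a minimal PSO sketch follows. The hyperparameter values and the test function (the sphere function, minimized at the origin) are illustrative assumptions, not values prescribed by this chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Toy objective to minimize: sphere function, sum of squares per particle
    return np.sum(x**2, axis=1)

n_particles, dim, iters = 30, 2, 100
w, c1, c2 = 0.7, 1.5, 1.5   # inertia, cognitive, and social weights

P = rng.uniform(-5, 5, (n_particles, dim))  # random initial positions
V = rng.uniform(-1, 1, (n_particles, dim))  # random initial velocities
pbest = P.copy()                            # best personal positions
pbest_val = f(P)
gbest = pbest[np.argmin(pbest_val)]         # best global position

for _ in range(iters):
    r1 = rng.random((n_particles, dim))     # fresh random weights each iteration
    r2 = rng.random((n_particles, dim))
    # velocity update: inertia + cognitive + social components
    V = w*V + c1*r1*(pbest - P) + c2*r2*(gbest - P)
    P = P + V
    vals = f(P)
    improved = vals < pbest_val             # update personal bests
    pbest[improved] = P[improved]
    pbest_val[improved] = vals[improved]
    gbest = pbest[np.argmin(pbest_val)]     # update the global best

print(np.min(pbest_val))  # close to 0 once the swarm has converged
```

Note how the gradient of f is never used: only function evaluations drive the search, which is why PSO does not require a differentiable problem.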
Compared to other approaches, PSO has been shown to produce superior
outcomes faster and at lower cost. It can also be parallelized. Moreover, it does
not take the gradient of the problem into consideration; in other words, unlike
classic optimization approaches, PSO does not require a differentiable problem.
The PSO algorithm comes in a variety of forms, often driven by two factors.
First, since PSO is close to an evolutionary algorithm, hybrid versions with
evolutionary capabilities have been developed. Second, adaptive PSO can
enhance performance by adjusting the hyperparameters.
Marco Dorigo created ant colony optimization (ACO) in the 1990s, inspired by
the foraging behavior of ant colonies. Ants are eusocial insects that strive to
survive and thrive as a colony rather than as individuals. They communicate with
one another using sound, touch, and pheromones. Pheromones are organic
chemical substances produced by ants that trigger a social response in members
of the same species. These substances can function like hormones outside the
body of the individual that secretes them, influencing the behavior of those that
receive them. As most ants dwell on the ground, they leave pheromone trails on
the soil surface that other ants may follow.
Ants reside in communal nests, and the basic idea of ACO is to watch the ants leave
their homes to get food in the quickest possible time. Initially, ants roam around
their nests at random in quest of food. This randomized search opens up multiple
pathways from the nest to the food supply. Ants carry a portion of the food back,
depositing pheromone along the return journey at a concentration that depends
on the quality and quantity of the food. The probability that subsequent ants
choose a specific path, guided by these pheromone trails, leads them toward the
food source. This likelihood depends on the pheromone concentration as well as
the rate of evaporation. It is also worth noting that, because the pheromone
evaporation rate is a determining factor, the length of each route can be simply
calculated [35].
To understand the scenario, let us consider an actual situation. For simplicity,
assume that only two potential pathways have been explored between the food
source and the ant nest. Fig. (7.3) illustrates the working principle of ants and
their locomotion with pheromones. The stages may be broken down as follows:
Stage 1: All ants are in their colony. There is no pheromone content in the
environment.
Stage 2: Ants begin their search along each trail with the same probability (0.5).
The curved path is longer, and hence the time it takes for ants to reach the food
source is longer.
Stage 3: The ants using the shorter path reach the food source faster. On the
return they face a similar selection issue, but this time the shorter path already
carries a pheromone trail, so its probability of being selected is higher.
Stage 4: As more ants return through the shorter path, its pheromone
concentration rises. Meanwhile, evaporation decreases the pheromone
concentration on the longer path, lowering the likelihood of that path being
chosen in subsequent phases. The colony as a whole therefore takes the shorter
path with increasing frequency, and route optimization is achieved.
$\tau_{xy} \leftarrow (1-\rho)\,\tau_{xy} + \sum_{k} \Delta\tau_{xy}^{k}$
where $\rho$ is the pheromone evaporation rate and $\Delta\tau_{xy}^{k}$ is the amount of pheromone
ant $k$ deposits on edge $(x, y)$.
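The two-path scenario of Fig. (7.3) can be simulated directly with the pheromone update rule. In this sketch the path lengths, the deposit amount (inversely proportional to path length), and the evaporation rate ρ are illustrative assumptions; ants choose a path in proportion to its pheromone level, and the shorter path ends up dominating:

```python
import random

random.seed(1)

lengths = {'short': 1.0, 'long': 2.0}  # two candidate paths, nest to food
tau = {'short': 1.0, 'long': 1.0}      # equal initial pheromone on both paths
rho = 0.1                               # evaporation rate

for _ in range(200):                    # 200 ants, one trip per iteration
    # choose a path with probability proportional to its pheromone level
    total = tau['short'] + tau['long']
    path = 'short' if random.random() < tau['short'] / total else 'long'
    # evaporation on both paths, then deposit 1/length on the chosen path
    for p in tau:
        tau[p] *= (1 - rho)
    tau[path] += 1.0 / lengths[path]

prob_short = tau['short'] / (tau['short'] + tau['long'])
print(prob_short)  # approaches 1 as the colony converges on the shorter path
```

The shorter path receives more pheromone per trip while evaporation erodes the longer path's trail, reproducing the positive feedback described in Stages 3 and 4.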
CONCLUDING REMARKS
This chapter covered the most current feature selection strategies used in
conjunction with machine learning algorithms. The selection of features is an
essential factor in improving the performance of any machine learning system.
Two families of feature selection algorithms were discussed: filter-based and
evolutionary. The filter-based methods include hypothesis tests, such as the
t-test, z-test, ANOVA, and MANOVA, and correlation-based measures, such as
Pearson's correlation, the Chi-square test, and Spearman's rank correlation. The
evolutionary methods include genetic algorithms, particle swarm optimization,
and ant colony optimization. Each method was covered in depth, along with the
optimized algorithm for conducting the feature selection technique.
290 Machine Learning and Its Application, 2021, 290-331
CHAPTER 8
8.1. INTRODUCTION
So far, we have looked at every aspect of machine learning and deep learning
models and gained a strong understanding of their design. This chapter presents
cutting-edge real-world applications of machine learning and deep learning
techniques, drawn from every corner of recent advancement, from everyday face
recognition to object detection, and discusses how these methods are applied in
real-life situations.
Until now, we have discussed many machine learning and deep learning
algorithms. Although each algorithm's explanation was supported by a Python
application on actual data, this chapter elaborates further on each one. The
availability of enormous datasets, along with improvements in algorithms and
exponential increases in computing power, has produced an unprecedented surge
of interest in the field of machine learning in recent years.
First, we will cover machine learning and deep learning applications in pattern
recognition, specifically the most common daily applications such as face
recognition, object detection, and optical character recognition [36]. After
finishing the image-related recognition applications, we will dig into video
processing for identifying objects from real-time or recorded video. An essential
application of deep learning at the present time is in medicine, specifically
medical imaging [37]. This chapter will show how to apply machine learning and
deep learning algorithms in neuroimaging to identify patterns in MRI and fMRI
data. Finally, we will visit another state-of-the-art application of deep learning:
sentiment analysis using natural language processing, a part of computational
linguistics.
Face recognition is one of the fields of machine learning that has been studied
for a long time. In recent years it has developed into a common and popular
technology that can easily recognize faces, even on the devices in our hands. In
particular, it is used in mobile phones as an easy way to maintain security. Based
on this reality, in this section we will see how to distinguish faces using a CNN,
which shows excellent ability to process image data, and the Avengers image set.
1. Technology used
CNN
Face recognition
2. Reference
https://medium.com/hyunjulie/%EC%BA%90%EA%B8%80%EA%B3%BC-
%EA%B5%AC%EA%B8%80-colab-
%EC%97%B0%EA%B2%B0%ED%95%B4%EC%A3%BC%EA%B8%B0-6a274f6de81d
https://www.kaggle.com/rawatjitesh/avengers-face-
recognition
https://www.kaggle.com/ruslankl/brain-tumor-detection-v1-0-
cnn-vgg-16/data
https://www.tensorflow.org/tutorials/keras/classification
3. Kaggle usage
kaggle main page -> (click top-right circle) -> Account ->
(down scroll) -> API -> Create New API Token
4. Description
a) Import libraries
b) Preparing Kaggle
c) Download dataset
d) Split data
e) Dataset import
f) Print data
g) Make CNN model
h) Start learning
i) Save trained model
j) Import pre-trained model
k) Print result
l) Evaluate model
m) Print tested data
# 1. Import libraries
import matplotlib.pyplot as plt
import numpy as np
import os
import shutil
from pathlib import Path
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import Input, Dense, Dropout, Flatten, Conv2D
from keras.models import Sequential, Model
# Kaggle install
!pip install kaggle # Install kaggle
!pip install --upgrade --force-reinstall --no-deps kaggle # Check newest version
# Fixing permission
!chmod 600 ~/.kaggle/kaggle.json
# 3. Download dataset
!kaggle datasets download -d rawatjitesh/avengers-face-recognition # https://www.kaggle.com/rawatjitesh/avengers-face-recognition
!unzip avengers-face-recognition # Unzip dataset
path = './cropped_images/'
# 5. Dataset import
shape = 48 # Define image size
train = ImageDataGenerator(rescale=1./255).flow_from_directory('./train', target_size=(shape, shape), batch_size=5, color_mode='rgb', class_mode='categorical') # Import train data from train folder
val = ImageDataGenerator(rescale=1./255).flow_from_directory('./validation', target_size=(shape, shape), color_mode='rgb', batch_size=5, shuffle=False, class_mode='categorical') # Import validation data from validation folder
test = ImageDataGenerator(rescale=1./255).flow_from_directory('./test', target_size=(shape, shape), color_mode='rgb', batch_size=5, shuffle=False, class_mode='categorical') # Import test data from test folder
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 46, 46, 64)        1792
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 44, 44, 64)        36928
_________________________________________________________________
dropout (Dropout)            (None, 44, 44, 64)        0
_________________________________________________________________
flatten (Flatten)            (None, 123904)            0
_________________________________________________________________
dense (Dense)                (None, 100)               12390500
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 505
=================================================================
Total params: 12,429,725
Trainable params: 12,429,725
Non-trainable params: 0
_________________________________________________________________
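The parameter counts in the model summary can be checked by hand: a Conv2D layer has (kernel_h × kernel_w × input_channels + 1) × filters parameters, and a Dense layer has (inputs + 1) × units. A quick arithmetic check in pure Python, independent of Keras:

```python
# conv2d: 3x3 kernel over 3 RGB channels, 64 filters (48x48 input -> 46x46)
conv2d = (3*3*3 + 1) * 64
# conv2d_1: 3x3 kernel over 64 channels, 64 filters (46x46 -> 44x44)
conv2d_1 = (3*3*64 + 1) * 64
# flatten output: 44 x 44 x 64 values, no parameters of its own
flat = 44 * 44 * 64
# dense: flattened input to 100 units
dense = (flat + 1) * 100
# dense_1: 100 units to 5 classes (the 5 Avengers characters)
dense_1 = (100 + 1) * 5
total = conv2d + conv2d_1 + dense + dense_1
print(conv2d, conv2d_1, flat, dense, dense_1, total)
# → 1792 36928 123904 12390500 505 12429725
```

The numbers match the summary exactly; note how the first Dense layer after Flatten dominates the total parameter count.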
Epoch 1/50
6/6 [==============================] - 16s 347ms/step - loss:
3.0887 - accuracy: 0.1219 - val_loss: 1.6764 - val_accuracy:
0.0000e+00
Epoch 2/50
6/6 [==============================] - 1s 244ms/step - loss:
1.6356 - accuracy: 0.3619 - val_loss: 1.6145 - val_accuracy:
0.1600
Epoch 3/50
6/6 [==============================] - 1s 243ms/step - loss:
1.5955 - accuracy: 0.1890 - val_loss: 1.6091 - val_accuracy:
0.2800
Epoch 4/50
accurately converted analog data into digital data. Therefore, this section aims to
recognize it using the MNIST dataset, a collection of digits from 0 to 9.
1. Technology used
SVM
MNIST dataset
2. Reference
https://www.tensorflow.org/tutorials/keras/classification
3. Description
a) Import libraries
b) Download dataset
c) Print data
d) Make 2D image to 1D for training (Flatten)
e) Make model
f) Training
g) Print result
h) Print tested data
# 1. Import libraries
from tensorflow import keras
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.metrics import accuracy_score
# 2. Download dataset
mnist = keras.datasets.mnist # Download mnist dataset
(trainX, trainY), (testX, testY) = mnist.load_data() # Split data into train and test
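Step (d), flattening, is not shown in the listing. The names train_X and test_X used below presumably come from a reshape like the following sketch (the dummy arrays here stand in for the MNIST images, which are 28x28 grayscale):

```python
import numpy as np

# Dummy arrays standing in for the MNIST images:
# 60000 training and 10000 test images of 28x28 pixels
trainX = np.zeros((60000, 28, 28), dtype=np.uint8)
testX = np.zeros((10000, 28, 28), dtype=np.uint8)

# 4. Make 2D images into 1D vectors for training (flatten)
train_X = trainX.reshape(len(trainX), -1)  # shape (60000, 784)
test_X = testX.reshape(len(testX), -1)     # shape (10000, 784)
print(train_X.shape, test_X.shape)  # → (60000, 784) (10000, 784)
```

SVMs operate on feature vectors, not 2D grids, so each 28x28 image must become a 784-element vector before model.fit can be called.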
# 5. Make model
model = svm.SVC()
# 6. Training
model.fit(train_X, trainY)
# 7. Print result
result = model.predict(test_X)
accuracy_score(testY, result)
0.9792
Object detection and object recognition are both methods for recognizing
objects, but their implementations differ. Object detection is the technique of
detecting instances of things in pictures. In deep learning, object detection is a
subset of object recognition in which the object is both identified and located in
an image; as a result, multiple items can be detected and located within the same
picture. For object identification, machine learning lets you pick the optimum
mix of features and classifiers for learning, and it can produce accurate findings
with only a few pieces of information.
Object recognition identifies a specific object seen by the camera. This is also
one of the fields where machine learning is frequently used. Machine
learning-based object recognition systems, such as YOLO, are a primary driving
force behind unmanned systems in modern society. However, building such a
vision recognition system requires a lot of data, which makes it a costly and
laborious endeavor. Transfer learning sheds light on this situation: it is an
efficient machine learning technique that can achieve great results by training a
small amount of data on an existing model. Therefore, this section aims to learn
the two classes cat and dog through transfer learning and save the result as a
model.
1. Technology used
Transfer learning
Cat dog dataset usage
2. Reference
https://www.tensorflow.org/tutorials/keras/classification
https://cafe.daum.net/hyunsikahn/FPA7/12
3. Dataset
https://www.kaggle.com/tongpython/cat-and-dog
Download the dataset, split it, and upload the archive folder
to the main screen of Google Drive.
4. Description
a) Import libraries
b) Import google drive
c) Dataset import
d) Print data
e) Import pre-trained model (VGG16)
f) Make additional learning course
g) Summation pre-trained and additional
h) Transfer learning
i) Save trained model
j) Import test data
# 1. Import libraries
import matplotlib.pyplot as plt
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
from keras.applications.vgg16 import VGG16
from keras.layers import Dense, Dropout, Flatten
from keras.models import Sequential, Model
# 3. Dataset import
shape = 48
train = ImageDataGenerator(rescale=1./255).flow_from_directory('/content/drive/MyDrive/archive/training_set', target_size=(shape, shape), batch_size=5, color_mode='rgb', class_mode='categorical') # Import train data from train folder
test = ImageDataGenerator(rescale=1./255).flow_from_directory('/content/drive/MyDrive/archive/test_set', target_size=(shape, shape), color_mode='rgb', batch_size=5, shuffle=False, class_mode='categorical') # Import test data from test folder
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 3, 3, 512)         2359808
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 3, 3, 512)         2359808
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 3, 3, 512)         2359808
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 1, 1, 512)         0
_________________________________________________________________
flatten (Flatten)            (None, 512)               0
_________________________________________________________________
dense (Dense)                (None, 4096)              2101248
_________________________________________________________________
dense_1 (Dense)              (None, 4096)              16781312
_________________________________________________________________
dropout (Dropout)            (None, 4096)              0
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 8194
=================================================================
Total params: 33,605,442
Trainable params: 18,890,754
Non-trainable params: 14,714,688
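Here too, the parameter counts can be verified arithmetically. The frozen VGG16 convolutional base accounts for the 14,714,688 non-trainable parameters, while the newly added Dense head is trainable:

```python
# dense: 512 flattened features from the VGG16 base to 4096 units
dense = (512 + 1) * 4096
# dense_1: 4096 to 4096 units
dense_1 = (4096 + 1) * 4096
# dense_2: 4096 units to 2 classes (cat, dog)
dense_2 = (4096 + 1) * 2
trainable = dense + dense_1 + dense_2
total = trainable + 14714688  # plus the frozen VGG16 base
print(trainable, total)
# → 18890754 33605442
```

This is the essence of transfer learning here: only the head's roughly 18.9 million parameters are trained on the cat/dog data, while the base keeps the features it learned on a much larger dataset.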
Epoch 1/50
{'cats': 0, 'dogs': 1}
[[1. 0.]
[1. 0.]
[1. 0.]
…
…
[1. 0.]
[0. 1.]]
Accuracy : 65.461
Video classification is more than simple image classification: we generally
assume that successive frames in a video are connected with respect to their
semantic contents.
1. Technology used
Transfer learning
Pre-trained model
2. Reference
https://whiteduck.tistory.com/160
3. Description
a) Download video
b) Video view
c) Import libraries
d) Import google drive
e) Import pre-trained model from the main folder of Google Drive
f) Preparing dataset and output
g) Predict
h) Download result
Collecting youtube-dl
Downloading
https://files.pythonhosted.org/packages/a4/43/1f586e49e68f8b41
c4be416302bf96ddd5040b0e744b5902d51063795eb9/youtube_dl-
2021.6.6-py2.py3-none-any.whl (1.9MB)
|████████████████████████████████| 1.9MB 7.5MB/s
Installing collected packages: youtube-dl
Successfully installed youtube-dl-2021.6.6
[youtube] WDlu1OhvYBM: Downloading webpage
[download] Destination: download.mp4
[download] 100% of 55.78MiB in 00:01
# 2. Video view
from IPython.display import HTML
from base64 import b64encode
mp4 = open('data.mp4','rb').read()
video_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=500 controls>
<source src="%s" type="video/mp4">
</video>
""" % video_url)
# 3. Import libraries
from keras.models import load_model
import cv2
from google.colab.patches import cv2_imshow
# 7. Predict
while cap.isOpened():
    success, img = cap.read() # Get a frame from the captured video
    if success == False: break # If there is no data left (end of the video), break
    img_test = cv2.resize(img, dsize=(48, 48)) # Resize frame to 48x48
    img_test = img_test.reshape(1, 48, 48, -1) # Make shape for 2D CNN network
    result = model_import.predict(img_test) # Predict the frame's class
    print(result)
    if round(float(result[0, 0])) == 1: # If cat
        cv2.putText(img, "CAT", (10, 50), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0,0,0), 1, cv2.LINE_AA) # Print "CAT" at (10, 50) of video
    elif round(float(result[0, 1])) == 1: # If dog
        cv2.putText(img, "DOG", (10, 50), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0,0,0), 1, cv2.LINE_AA) # Print "DOG" at (10, 50) of video
# 8. Download result
from google.colab import files
files.download('result.mp4')
Medical imaging refers to the techniques and procedures used to produce
pictures of various areas of the human body for diagnostic and therapeutic
purposes in digital health. It covers a variety of radiological imaging methods:
all diagnostic and therapeutic investigations and interventions performed in a
conventional radiology department are classified as medical imaging. These
techniques and procedures scan the human body for diagnostic, therapeutic, and
follow-up purposes, and they play a significant part in public health programs
aimed at all demographic groups.
Imaging is an excellent resource for many diseases and a valuable tool for physical
therapists when appropriately utilized. It is critical to understand when imaging is
essential, as superfluous imaging wastes money and increases the risk of early
surgery. Medical imaging encompasses a wide range of radiological imaging
methods, including:
1. X-ray
2. Fluoroscopy
3. Magnetic resonance imaging (MRI)
4. Functional magnetic resonance imaging (fMRI)
5. Ultrasonography (often known as ultrasound)
6. Endoscopy
7. Positron emission tomography (PET)
Machine learning-based vision recognition has come a long way, as we have
seen. Its performance already exceeds human cognitive capacity, so scientists are
considering how to put this impressive capability to use. This has brought the bio
and medical fields into the limelight. In this field, even major illnesses
sometimes manifest only as minor alterations; if humans label them, there is a
chance of misdiagnosis. Machine learning, on the other hand, is expected to
lower the likelihood of misdiagnosis and lessen the burden on patients and
society by responding sensitively to relatively tiny changes and classifying them
accurately. As a result, this section aims to classify fMRI brain scans into three
groups.
The tiny variations in blood flow that occur with brain activity are measured using
functional magnetic resonance imaging (fMRI). It can be used to look at the brain's
functional architecture, assess the consequences of a stroke or other disease, or help
with brain therapy. Other imaging techniques may not be able to identify problems
in the brain that fMRI can.
For learning how a normal, sick, or damaged brain works and determining the
possible dangers of surgery or other invasive brain therapies, fMRI is becoming the
diagnostic tool of choice. During an fMRI examination, you will be asked to do
things like tapping your fingers or toes, pursing your lips, moving your tongue,
reading, seeing images, listening to speech, and/or playing simple word games.
These tasks increase metabolic activity in the brain regions responsible for them.
The resulting activity, which includes the dilation of blood vessels, chemical
changes, and increased oxygen transport, may then be seen on MRI scans.
1. Technology used
SVM
fMRI
2. Reference
https://github.com/poldrack/fmri-classification-example
3. Description
a) Download dataset from github
b) Install nibabel and nilearn
c) Install keras-utils
d) Import libraries
e) Load dataset
f) Make labels
g) Print data
h) Learning
i) Shuffle labels and test again
# 3. Install keras-utils
!pip install keras-utils
# 4. Import libraries
import os
import nibabel,numpy
import sklearn.svm
import nilearn.input_data
import nilearn.plotting
from tensorflow.keras.utils import to_categorical
%matplotlib inline
# 5. Load dataset
a = nilearn.input_data.NiftiMasker(mask_img='fmri-classification-example/nback_mask.nii.gz')
dataset = a.fit_transform('fmri-classification-example/nback_zstats1-11-21_all.nii.gz')
# 6. Make labels
label = numpy.zeros(dataset.shape[0]) # Make numpy array of all 0
label[15:30] = 1 # From 15 to 30, change 0 to 1
label[30:] = 2 # From 30, change 0 to 2
indicator = numpy.kron(numpy.ones(3), numpy.arange(15)) # Make session indicator
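The numpy.kron call builds the session indicator by tiling arange(15) three times, so each of the 45 scans (15 per class) is tagged with its within-class session index. A smaller sketch of the same idea, with 3 "classes" of 5 scans each:

```python
import numpy as np

# Tile the indices 0..4 three times: kron(ones(3), b) concatenates b with itself
indicator = np.kron(np.ones(3), np.arange(5))
print(indicator)
# → [0. 1. 2. 3. 4. 0. 1. 2. 3. 4. 0. 1. 2. 3. 4.]
```

Masking with indicator == i then picks out the i-th scan of every class at once, which is exactly what the cross-validation loop below relies on.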
# 7. Print data
thresh = 0.8
coords = (0, -80, -10) # Set coords of line
background = nibabel.load('fmri-classification-example/TRIO_Y_NDC_333_fsl.nii.gz') # Import background image
face = nibabel.load('fmri-classification-example/nback_zstat1_mean.nii.gz') # Import faces image
nilearn.plotting.plot_stat_map(face, bg_img=background, threshold=thresh, title='faces', cut_coords=coords) # Print faces image
scene = nibabel.load('fmri-classification-example/nback_zstat11_mean.nii.gz') # Import scenes image
nilearn.plotting.plot_stat_map(scene, bg_img=background, threshold=thresh, title='scenes', cut_coords=coords) # Print scenes image
chars = nibabel.load('fmri-classification-example/nback_zstat21_mean.nii.gz') # Import characters image
nilearn.plotting.plot_stat_map(chars, bg_img=background, threshold=thresh, title='characters', cut_coords=coords) # Print characters image
# 8. Learning
def run_classifier(shuffle_labels=False, verbose=True): # Make run_classifier function
    accuracy = numpy.zeros((15, 3)) # Make acc array for results
    for i in range(15):
        obs_tr = indicator != i # Define train set
        obs_te = indicator == i # Define test set
        trainX = dataset[obs_tr, :] # Make train data
        testX = dataset[obs_te, :] # Make test data
        trainY = label[obs_tr] # Make train labels
        if shuffle_labels:
            numpy.random.shuffle(trainY) # Shuffle train labels
        testY = label[obs_te] # Make test labels
        model = sklearn.svm.SVC(kernel='linear') # Make SVM model
        model.fit(trainX, trainY) # Training
        p = model.predict(testX) # Test
        accuracy[i, :] = p == testY # Calculate test accuracy
        acc_tr = model.predict(trainX) == trainY # Calculate train accuracy
        if verbose:
            print('Session number %d' % i) # Print session number
            print('Training acc:', acc_tr.mean()) # Print training accuracy
            print('Test acc:', accuracy[i, :].mean()) # Print test accuracy
    return accuracy
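The cross-validation scheme here is leave-one-session-out: each of the 15 sessions in turn serves as the test set. The same masking logic can be sketched on synthetic data (the shapes, feature values, and variable names below are illustrative, not the book's fMRI data):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(45, 10))                  # 45 "scans", 10 features each
y = np.repeat([0, 1, 2], 15)                   # 3 classes, 15 scans per class
sessions = np.kron(np.ones(3), np.arange(15))  # within-class session index

acc = np.zeros(15)
for i in range(15):
    train, test = sessions != i, sessions == i  # hold out session i
    model = SVC(kernel='linear')
    model.fit(X[train], y[train])
    acc[i] = (model.predict(X[test]) == y[test]).mean()
print(acc.mean())  # mean leave-one-session-out accuracy
```

On purely random features the mean accuracy hovers around chance, which is exactly what the label-shuffling control in step (i) is meant to demonstrate on the real data.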
Session number 0
Training acc: 1.0
Test acc: 0.3333333333333333
Session number 1
Training acc: 1.0
Test acc: 1.0
Session number 2
Training acc: 1.0
Test acc: 0.6666666666666666
Session number 3
Training acc: 1.0
Test acc: 0.6666666666666666
…
…
Session number 14
Training acc: 1.0
Test acc: 0.6666666666666666
Shuffled
Mean acc: 0.356
Faces acc: 0.333
Scenes acc: 0.467
Characters acc: 0.267
The body's hydrogen atoms are mapped using magnetic resonance imaging
(MRI). Because hydrogen atoms have a single proton and a high magnetic
moment, they are excellent for MRI. In its most basic form, an MRI scanner is a
giant, strong magnet in which the patient is placed. The magnetic field generated
by the magnets causes the proton of each hydrogen atom to resonate, allowing
the machine to determine the proton's location. Because water molecules make
up around 75 percent of the human body, MR imaging can acquire accurate and
detailed pictures of the examined body area.
Recent machine learning models show great discriminating capacity even with
minor changes, as mentioned in the previous section. Cancer, one of humanity's
most feared diseases, is no exception. Early cancer is frequently misdiagnosed as
a stomach ulcer or one of a variety of tumors, making it a complex condition to
identify; machine learning is therefore gaining traction in this area. As a result,
this section aims to develop a CNN model to diagnose the presence or absence of
cancer.
1. Technology used
CNN
Medical imaging
2. Reference
https://medium.com/hyunjulie/%EC%BA%90%EA%B8%80%EA%B3%BC-
%EA%B5%AC%EA%B8%80-colab-
%EC%97%B0%EA%B2%B0%ED%95%B4%EC%A3%BC%EA%B8%B0-6a274f6de81d
https://www.kaggle.com/navoneel/brain-mri-images-for-brain-
tumor-detection
https://www.kaggle.com/ruslankl/brain-tumor-detection-v1-0-
cnn-vgg-16/data
https://www.tensorflow.org/tutorials/keras/classification
3. Description
a) Preparing Kaggle
b) Dataset download from Kaggle
c) Import libraries
d) Split data
e) Import dataset
f) Print data
g) Make CNN model
h) Learning
i) Test
j) Print result
k) Evaluate model
l) Print tested data
# Kaggle install
!pip install kaggle
# 3. Import libraries
import os
import shutil
from keras.preprocessing.image import ImageDataGenerator
from keras.layers.convolutional import Conv2D, MaxPooling2D
from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
path = './brain_tumor_dataset/'
# 4. Split data
# https://www.kaggle.com/ruslankl/brain-tumor-detection-v1-0-cnn-vgg-16/data
for data in os.listdir(path):                  # For each class folder in path
    if not data.startswith('.'):               # Skip hidden entries
        num = len(os.listdir(path + data))     # num = file count of folder (path + data)
        Path("./test/" + data.upper()).mkdir(parents=True, exist_ok=True)        # Make "test" directory
        Path("./train/" + data.upper()).mkdir(parents=True, exist_ok=True)       # Make "train" directory
        Path("./validation/" + data.upper()).mkdir(parents=True, exist_ok=True)  # Make "validation" directory
        for (n, name) in enumerate(os.listdir(path + data)):  # For each image in folder (path + data)
            image = path + data + '/' + name   # Path of the image file
            if n < 5:                          # First 5 images
                shutil.copy(image, 'test/' + data.upper() + '/' + name)          # Copy files to "test" directory
            elif n < 0.8 * num:                # Up to 80%
                shutil.copy(image, 'train/' + data.upper() + '/' + name)         # Copy files to "train" directory
            else:                              # Remaining 20%
                shutil.copy(image, 'validation/' + data.upper() + '/' + name)    # Copy files to "validation" directory
# 5. Import dataset
shape = 48
train = ImageDataGenerator(rescale=1./255).flow_from_directory(
    './train', target_size=(shape, shape), batch_size=5,
    color_mode='rgb', class_mode='categorical')      # Import train data from train folder
val = ImageDataGenerator(rescale=1./255).flow_from_directory(
    './validation', target_size=(shape, shape), batch_size=5,
    color_mode='rgb', shuffle=False, class_mode='categorical')  # Import validation data from validation folder
test = ImageDataGenerator(rescale=1./255).flow_from_directory(
    './test', target_size=(shape, shape), batch_size=5,
    color_mode='rgb', shuffle=False, class_mode='categorical')  # Import test data from test folder
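The model-definition code (steps g, "Make CNN model") is not reproduced on this page. A sketch consistent with the summary that follows — two 3x3 Conv2D layers of 64 filters on 48x48x3 input, a dropout layer, then Dense(100) and Dense(2) — could look like this; the activation functions and the dropout rate are assumptions, only the layer sizes are taken from the summary:

```python
from keras.models import Sequential
from keras.layers import Input, Conv2D, Dropout, Flatten, Dense

# Make CNN model (layer sizes chosen to match the summary below)
model = Sequential([
    Input(shape=(48, 48, 3)),               # 48x48 RGB images (shape = 48 above)
    Conv2D(64, (3, 3), activation='relu'),  # -> (46, 46, 64), 1,792 params
    Conv2D(64, (3, 3), activation='relu'),  # -> (44, 44, 64), 36,928 params
    Dropout(0.25),                          # assumed rate
    Flatten(),                              # -> 123,904 features
    Dense(100, activation='relu'),          # 12,390,500 params
    Dense(2, activation='softmax'),         # NO / YES, 202 params
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```

The parameter counts per layer (e.g. 3*3*3*64 + 64 = 1,792 for the first convolution) add up to the 12,429,422 total shown below.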
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 46, 46, 64)        1792
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 44, 44, 64)        36928
_________________________________________________________________
dropout (Dropout)            (None, 44, 44, 64)        0
_________________________________________________________________
flatten (Flatten)            (None, 123904)            0
_________________________________________________________________
dense (Dense)                (None, 100)               12390500
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 202
=================================================================
Total params: 12,429,422
Trainable params: 12,429,422
Non-trainable params: 0
# 8. Learning
hist = model.fit_generator(train, steps_per_epoch=train.n//32,
    epochs=50, validation_data=val, validation_steps=5)  # Learning
Epoch 1/50
6/6 [==============================] - 18s 404ms/step - loss: 1.9524 - accuracy: 0.5476 - val_loss: 0.9589 - val_accuracy: 0.2400
Epoch 2/50
6/6 [==============================] - 2s 286ms/step - loss: 0.7243 - accuracy: 0.5662 - val_loss: 0.5409 - val_accuracy: 0.7200
Epoch 3/50
6/6 [==============================] - 2s 286ms/step - loss: 0.5018 - accuracy: 0.7933 - val_loss: 1.6130 - val_accuracy: 0.3200
Epoch 4/50
6/6 [==============================] - 2s 282ms/step - loss: 0.5439 - accuracy: 0.7452 - val_loss: 0.5569 - val_accuracy: 0.6400
…
…
Epoch 50/50
6/6 [==============================] - 2s 287ms/step - loss: 5.7654e-04 - accuracy: 1.0000 - val_loss: 1.3477 - val_accuracy: 0.8000
# 9. Test
result = model.predict_generator(test, steps=10)  # Doing test
result = np.around(result)                        # Round result
{'NO': 0, 'YES': 1}
[[0. 1.]
[1. 0.]
[1. 0.]
[1. 0.]
[1. 0.]
[0. 1.]
[0. 1.]
[0. 1.]
[0. 1.]
[0. 1.]]
Accuracy : 90.000
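The code behind the accuracy figure (steps j-k) is not shown on this page. A sketch of how it could be computed follows; the true labels here are hypothetical, chosen only so the toy example reproduces the 90% figure printed above:

```python
import numpy as np

def evaluate(pred_onehot, true_labels):
    """Fraction of predictions whose argmax matches the true class label."""
    return np.mean(np.argmax(pred_onehot, axis=1) == true_labels)

# The ten rounded predictions printed above ({'NO': 0, 'YES': 1})
result = np.array([[0, 1], [1, 0], [1, 0], [1, 0], [1, 0],
                   [0, 1], [0, 1], [0, 1], [0, 1], [0, 1]], dtype=float)
true = np.array([1, 0, 0, 0, 0, 1, 1, 1, 0, 1])  # hypothetical test labels

print('Accuracy : %.3f' % (evaluate(result, true) * 100))  # Accuracy : 90.000
```

Taking the argmax of each one-hot row recovers the predicted class index, so the comparison works even if the model's raw outputs are probabilities rather than rounded values.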
and software of all kinds, putting the enormous textual and other information at our
fingertips in ways that fit our requirements.
The field's practical objectives are numerous and diverse. Some of the most notable
are: efficient text retrieval on a specific topic; effective machine translation;
question answering, ranging from simple factual questions to those requiring
inference and descriptive or discursive answers; text summarization; analysis of
text or spoken language for topic, sentiment, or other psychological attributes;
and dialogue.
NLP integrates statistical, machine learning, and deep learning models with
rule-based modeling of human language from computational linguistics. Together,
these technologies allow computers to interpret human language in the form of
text or speech data and 'understand' its whole meaning, including the speaker's
or writer's purpose and mood.
Sentiment analysis is a sophisticated text analysis tool that uses machine learning
and deep learning algorithms to mine unstructured data for opinion and emotion
automatically.
Deep learning is seen as the next step in the evolution of machine learning. It
is based on artificial neural networks, in which connected algorithms replicate
how the human brain functions. It has enabled numerous practical applications of
machine learning, such as customer support automation and self-driving
automobiles. Let us take a deeper look at how deep learning may help with
sentiment analysis.
1. Technology used
LSTM
NLP
2. Reference
https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-
keras?select=database.sqlite
https://wikidocs.net/24586
3. Description
a) Import libraries
b) Download dataset
c) Truncate each review to 300 tokens
d) Pad reviews to length 300
e) Print data
f) Make LSTM model
g) Learning
h) Test
i) Define test accuracy
# 1. Import libraries
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.datasets import imdb
# 2. Download dataset
# 5. Print data
index = imdb.get_word_index()           # Import the word index set
index2 = {}
for key, value in index.items():
    index2[value+3] = key               # Map index numbers back to words
for index, token in enumerate(("<pad>", "<sos>", "<unk>")):
    index2[index] = token
print(' '.join([index2[index] for index in trainX[0]]))  # Print data
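The model-definition code (step f, "Make LSTM model") is not reproduced on this page. A sketch matching the summary that follows — a 128-dimensional embedding over a 5,000-word vocabulary, an LSTM with 196 units, and a two-class softmax output — could be; the activation and optimizer choices are assumptions:

```python
from keras.models import Sequential
from keras.layers import Input, Embedding, LSTM, Dense

# Make LSTM model (sizes chosen to match the summary below)
model = Sequential([
    Input(shape=(300,)),             # padded reviews of 300 tokens
    Embedding(5000, 128),            # 5,000-word vocab -> 640,000 params
    LSTM(196),                       # 4*((128+196)*196 + 196) = 254,800 params
    Dense(2, activation='softmax'),  # Negative / Positive, 394 params
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```

The per-layer parameter counts add up to the 895,194 total shown below.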
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 300, 128)          640000
_________________________________________________________________
lstm (LSTM)                  (None, 196)               254800
_________________________________________________________________
dense (Dense)                (None, 2)                 394
=================================================================
Total params: 895,194
Trainable params: 895,194
Non-trainable params: 0
# 7. Learning
batch_size = 32
model.fit(trainX, trainY, epochs=50, batch_size=batch_size, verbose=1)  # Training
Epoch 1/50
10/10 [==============================] - 34s 1s/step - loss: 0.6945 - accuracy: 0.4781
Epoch 2/50
10/10 [==============================] - 11s 1s/step - loss: 0.6820 - accuracy: 0.5059
Epoch 3/50
10/10 [==============================] - 11s 1s/step - loss: 0.6163 - accuracy: 0.6216
…
…
Epoch 50/50
10/10 [==============================] - 11s 1s/step - loss: 5.7063e-05 - accuracy: 1.0000
# 8. Test
result = model.predict(testX)  # Doing test (predict, not predict_generator, since testX is an array)
number : 0
real : 0 Negative
predict : 1 Positive
number : 1
real : 1 Positive
predict : 1 Positive
number : 2
real : 1 Positive
predict : 1 Positive
number : 3
real : 0 Negative
predict : 0 Negative
…
…
number : 298
real : 1 Positive
predict : 0 Negative
number : 299
real : 0 Negative
predict : 0 Negative
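The per-review listing above (step i) can be generated by comparing the argmax of each prediction with the true label. A minimal sketch, using a tiny hypothetical prediction array rather than the book's test set, would be:

```python
import numpy as np

labels = {0: 'Negative', 1: 'Positive'}

def report(pred_probs, true_labels):
    """Print real vs. predicted sentiment and return the overall accuracy."""
    pred = np.argmax(pred_probs, axis=1)
    for i, (t, p) in enumerate(zip(true_labels, pred)):
        print('number :', i)
        print('real :', t, labels[t])
        print('predict :', p, labels[p])
    return np.mean(pred == true_labels)

# Hypothetical two-review example: the second review is misclassified
probs = np.array([[0.3, 0.7], [0.8, 0.2]])
acc = report(probs, np.array([1, 1]))
```

Returning the mean of the match vector gives the test accuracy in one line once the full test set is passed in.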
Our lives are now more pleasant and convenient than they were previously,
thanks to the mysterious touch of science. Science's significance in our
everyday lives is evident, and its impact cannot be overlooked or ignored. If
we attempt to comprehend the precise impact of science on our lives, we will
discover that much of it results from employing artificial intelligence and
machine learning applications. In this part, we attempt to capture the
outstanding real-time machine learning applications that will transform our
view of life.
In the real world, image recognition is a well-known and often used example of
machine learning. An object in a digital image can be recognized from the
intensities of its pixels, whether in black-and-white or color pictures.
Speech-to-text conversion is possible using machine learning. Both live voice
and recorded speech may be converted to text files using specific software
programs.
Machine learning may divide accessible data into categories, which are
subsequently defined by analyst-specified rules. The analysts can determine the
likelihood of a defect after the categorization is complete.
Text documents and other media files, such as music and photos, contain less
information than a tiny video file. As a result, collecting valuable information from
video, i.e., an automated video surveillance system, has become a hot topic of
research.
Machine learning can parse unstructured data and extract structured information.
Customers provide massive amounts of data to businesses. The process of
annotating datasets for predictive analytics tools is automated using a machine
learning algorithm.
Social media uses machine learning to produce appealing and beautiful services for
its users, such as individuals you may know, recommendations, and response
choices.
CONCLUDING REMARKS
We have gone through every facet of machine learning and deep learning models
so far, and we have a good understanding of their underlying architecture. The
readers were introduced to state-of-the-art, real-world applications of machine
learning and deep learning algorithms in this chapter. The chapter has explored
real-world applications from every corner of recent developments, from the
everyday use of facial recognition to object detection. Not only that, but this chapter
also covered how machine learning and deep learning are used in everyday life.
This chapter has addressed the most critical applications in the domains of pattern
recognition, video processing, medical imaging, and computational linguistics,
among others. The Python implementation of all of the applications was provided
in this chapter. This chapter also discussed several additional essential real-world
applications that we utilize regularly.
332 Machine Learning and Its Application, 2021, 332-334
CHAPTER 9
Conclusions
All of the chapters in this book are devoted to machine learning and deep learning
architecture. As the name implies, it focuses on real-world machine learning
applications and deep learning techniques. The following is a general outline of the
book:
The introductory chapter presented the fundamental concepts of artificial
intelligence and its gradual growth, giving rise to machine learning and deep
learning. It provided deep insights into the need for learning and the concept
of learning in the machine learning framework. This chapter also stated the
importance of AI and its various dimensions. In the end, it introduced machine
learning and its variants and briefly described each type of machine learning
algorithm.
The second chapter introduced supervised machine learning algorithms. This
chapter explained algorithms such as Decision Trees, Random Forests, K-Nearest
Neighbors, Naive Bayes Classifiers, and Support Vector Machines. Each algorithm
begins with an overview, then is described with an algorithmic framework and
hands-on examples. A detailed Python program is provided at the end of each
algorithm to give a practical understanding of the functional behavior of the
classifier. The Python code runs on real datasets and ultimately gives the
reader in-depth knowledge of the algorithm's applications.
The chapter on clustering algorithms was introduced to the readers as a part of
unsupervised machine learning. This chapter described the state-of-the-art
clustering algorithms, presenting an elaborative definition of k-means
clustering, hierarchical clustering, and the self-organizing map. It also
defined the algorithmic framework of each algorithm, with hands-on examples,
detailed Python codes, and outputs.
Next, the readers were given a thorough understanding of regression analysis.
The notion of regression is used in both statistics and computer science,
particularly in machine learning; the principle hasn't changed, although the
applications have. The two most used regression analysis techniques, linear and
logistic regression, were covered in this chapter. Each algorithm is discussed
in full, along with examples of how to use it.
Indranath Chatterjee
All rights reserved-© 2021 Bentham Science Publishers